Data Structure – Trie and Radix Tree

Trie

Tries are an extremely special and useful data-structure that are based on the prefix of a string. They are used to represent the “Retrieval” of data and thus the name Trie.

A Trie is a special data structure used to store strings that can be visualized like a graph. It consists of nodes and edges. Each node consists of at max 26 children and edges connect each parent node to its children. These 26 pointers are nothing but pointers for each of the 26 letters of the English alphabet A separate edge is maintained for every edge. Strings are stored in a top to bottom manner on the basis of their prefix in a trie. All prefixes of length 1 are stored at until level 1, all prefixes of length 2 are sorted at until level 2 and so on.

Here is a visual representation of the set of words [‘CAB’, ‘CAR’, ‘CAN’] stored in a Trie data structure. You traverse from root towards it’s child nodes until you reach the EOW (End of Word) marker that denotes that a word from the set.

But in real, the node that has multiple characters at a level has multiple characters within itself. ‘B’, ‘R’, ‘N’ have a common prefix ‘CA’. The characters ‘B’, ‘R’, ‘N’ are all children of ‘A’ and are stored in a single node. Each character points to a new node that can contain other children or EOW.

Strengths

  1. Sometimes Space-Efficient: If you’re storing lots of words that start with similar patterns, tries may reduce the overall storage cost by storing shared prefixes once.
  2. Efficient Prefix Queries: Tries can quickly answer queries about words with shared prefixes, like:
    • How many words start with “choco”?
    • What’s the most likely next letter in a word that starts with “strawber”?

Weaknesses

  1. Usually Space-Inefficient: Tries rarely save space when compared to storing strings in a set. ASCII characters in a string are one byte each. Each link between trie nodes is a pointer to an address—eight bytes on a 64-bit system. So, the overhead of linking nodes together often outweighs the savings from storing fewer characters.
  2. Not Standard: Most languages don’t come with a built-in trie implementation. You’ll need to implement one yourself.

Tries v/s Sets

Say you were implementing a spell checker. You’ll look for each word to see if it appears in Merriam-Webster’s dictionary.

You could put all the dictionary words in a trie. Or, you could put them in a set.

Both options have the same average-case lookup complexity: O(k), where kk is the number of characters in the lookup string:

  • For the trie, you’d have to walk from the root of the trie through kk nodes, one character at a time.
  • For the hash set, you have to compute a hash value from all kk characters of the string in order to index into the underlying array.
    So, if they have the same complexity, which one should you use?

Use a trie if you want to quickly find words starting with the same prefix. In our spell checker, this might be useful for suggesting corrections (i.e.: fixing “chocolatr” to “chocolate”). The only way to do this with a hash set would be to iterate through all the words, in O(n) time.

Use a hash set if you just need to check if a string is present or you’re optimizing for space. In most cases, a hash set will take up fewer bytes than a trie. And, hash set lookups will probably be faster than trie lookups—trie nodes can be scattered throughout memory, which isn’t cache friendly.

Radix Trees

A radix tree is like a trie, but it saves space by combining nodes together if they only have one child. Here’s what a radix tree with “Maria”, “Mariana”, and “David” looks like.

Radix trees are more cache friendly than tries, since the characters within a single node are usually stored in an array of characters, adjacent to each other in memory.

Implementing Trie data structure using Python

Before we begin with the code, here’s a visual representation of how the nodes for trie are joined together. Each letter in a node points to the memory location of the child node. The last nodes in the trie do not have any characters and the EOW is flagged as True.

Real time execution

class Node:
    def __init__(self):
        self.children = {}
        self.endOfWord = False

def insertWord(root, word):
  '''
  Loop through characters in a word and keep adding them at a new node, linking them together
  If char already in node, pass
  Increment the current to the child with the character
  After the characters in word are over, mark current as EOW
  '''
  current = root
  for char in word:
    if char in current.children.keys():
      pass
    else:
      current.children[char] = Node()
    current = current.children[char]
  current.endOfWord = True

def allWords(prefix, node, results):
  '''
  Recursively call the loop
  Prefix will be prefix + current character
  Node will be position of char's child
  results are passed by reference to keep storing result

  Eventually, when we reach EOW, the prefix will have all the chars from starting and will be the word that we need. We add this word to the result
  '''
  if node.endOfWord:
    results.append(prefix)
  for char in node.children.keys():
    #print char, node, node.children
    allWords(prefix + char, node.children[char], results)
  
def searchWord(root, word):
  '''
  Loop through chars of the word in the trie
    If char in word is not in trie.children(), return
    If char found, keep iterating
  After iteration for word is done, we should be at the end of word. If not, then word doesn't exist and we return false.
  '''
  current = root
  search_result = True
  for char in word:
    if char in current.children.keys():
      pass
    else:
      search_result = False
      break
    current = current.children[char]
  if not current.endOfWord:
    search_result = False
  return search_result

def getWordsWithPrefix(prefix, node, prefix_result):
  '''
  We loop through charcters in the prefix along with trie
  If mismatch, return
  If no mismatch during iteration, we have reached the end of prefix. Now we need to get words from current to end with the previx that we passed. So call allWords with prefix
  '''
  current = node
  for char in prefix:
    if char in current.children.keys():
      pass
    else:
      return
    current = current.children[char]
  allWords(prefix, current, prefix_result)

root = Node()
words = ['bed', 'ben']
words = ['bed', 'bedlam', 'bond', 'bomber', 'bombay']
for word in words:
  insertWord(root, word)

results = []
prefix = ''
allWords(prefix, root, results)
# prefix will be added to every word found in the result, so we start with ''
# results is empty, passed as reference so all results are stored in this list
print 'All words in trie: {}\n\n'.format(results)

search_word = 'bomb'
search_result = searchWord(root, search_word)
print 'Search {} in {}: {}'.format(search_word, words, search_result)

search_word = 'bomber'
search_result = searchWord(root, search_word)
print 'Search {} in {}: {}'.format(search_word, words, search_result)

prefix_result = []
prefix = 'be'
getWordsWithPrefix(prefix, root, prefix_result)
print '\n\nWords starting with {}: {}'.format(prefix, prefix_result)
Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.