Burst Trie - Eishaaya/GMRDataStructuresTest GitHub Wiki

Remember Trie? The prefix tree you made back before graphs? If so, you probably also remember how the bottom of the tree would often be formed from pretty chains of nodes, practically linked-lists. However, if you remember, each trie-node was pretty large, memory-wise, with a collection of references in each. What if we could do something about that? Well, we can.

A burst-trie is just like a standard trie, with one pretty big exception: there are two types of nodes.


Implementation Guide


To begin with, let's introduce the newest node type, the container node:

This node-type stores all char data present within the tree. The boolean used in normal trie isn't present here though, so how do we store said data? Well, each container node has a...container, some sort of simple data structure, like a linked-list, or a BST. The containers will hold a limited amount of values matching a certain set of prefixes. When the container gets to large/inefficient, we burst it to maintain search-times. (we'll get to bursting later)

For this implementation, we'll be using a BST, as it has the best general performance in this case.
Before revealing the dropdowns, try to answer those questions yourself. If need-be, trie reading through a bit more of the wiki, then come back to this question

Why not an AVL?
An AVL, or some other balanced data-structure will be slower at this size, since balancing only really becomes powerful when scale comes into play.
Why not a linked list?
The cost increase of a BST over a linked-list is near-negligible, and since entry to the burst-trie should be near-random, traversal times should be radically improved.
Why not an array/list?
Keeping an array-backed data structure consistently sorted would involve a lot of expensive shifting

One last thing to note: Make sure to exclude duplicates, imagine if we tried to get all matching prefixes from a normal trie and we got duplicate words!


The second type is an internal node:

Just like a normal trie node, it holds a collection of references to other nodes. However, instead of a dictionary, we're going to be using an array, since array accesses are faster, and, more importantly for Burst-Sort, the array is always in order like radix sort. As mentioned, all word data is stored in container nodes, not internal nodes themselves. The point of the NIL value at 0 is to reference a container which holds the singular word reached when we land on this internal node.

Before continuing, here's a good template for your abstract BurstNode. Remember, both your internal and container nodes should inherit from this!

The way we can call the insert, or any other function from a node, and while not knowing what that node is, still getting it to preform the correct functionality.

    // Polymorphic base for the two types of nodes in the Trie
    public abstract class BurstNode
    {
        // The Trie this node belongs to
        internal BurstTrie ParentTrie;
        // The amount of values contained in this node
        public abstract int Count { get; }
        // Creates the Node referencing its parent-trie
        protected BurstNode(BurstTrie parent) => ParentTrie = parent;

        // Abstract recursive insertion function, returns replacement value for back-propagation 
        public abstract BurstNode Insert(string value, int index);
        // Abstract recursive deletion function, returns replacement value for back-propagation    
        public abstract BurstNode? Remove(string value, int index, out bool success);
        // Get a Node containing a defined prefix
        public abstract BurstNode? Search(string prefix, int index);        
        // Gets all items in order recursively
        internal abstract void GetAll();
    }

If you're confused at all about inheritance, polymorphism, or their usage in this data structure, please ask your instructor


Insertion


Note that your insert (as well as your other functions) should be made recursively!

Why should we recurse here?
Firstly, every traversal of the tree is very depth-first, which lends itself really well to recursion. Secondly, by calling the insertions/deletions/searches recursively through the nodes, we allow the nodes to polymorphically call the correct recursive functions. Lastly, since our data will always land in our buckets, and said buckets are pretty large, even if the user plops some nearly identical **essays** in the tree, they'd need so many hundreds of thousands of said essays that they'd run out of heap space before making a path so deep it would overwhelm our stack!

Our root begins as an empty container node, and as we add to it, that data will just go to its internal BST:

However, once we fill it to much, let's say five values for this example, the node will burst, replacing itself with an internal node, running through its values, and inserting them into said internal node.

And then, after adding some more values, they'd land in one of the new containers. Now, if we reached a null, we would generate another new container, but in this example, that just doesn't happen. However, if we were to add one more, we'll need to burst again.

You should see mirrors the burst we did above. Although, take note that the "a" value has landed in the NIL bucket.


From here, searching and deleting should be pretty simple, since it's pretty much the same logic. When deleting, just make sure to delete nodes as you empty them.


Assignments


  • Place a large word-bank into the burst-trie, and use it for auto-complete If you did this previously with your trie, benchmark & compare the two against each-other

  • Required: Make a BurstSort Just like heap tree way back in the day, this is as simple as putting your data into the tree, then removing it. Interestingly, this is a form of Radix Sort, specifically an MSD(Most Significant Digit) Radix Sort! With how we go down a set of arrays, whose radix/length is our inputted alphabet, do you see the similarities?

    Now, while the BSTs containers are really nice at keeping the tree fully sorted and accessible, in the context of a sorting algorithm instead of a sorted data structure, they're less than optimal. To speed things up, we can replace the BSTs with standard unsorted lists, and only sort them once they're fully filled, just like bucket sort! This means we can't really effectively block duplicates, but in the context of a sort, it would be weird to anyways.

  • Benchmark your BurstSort against some other sorts, and try out some different bucket sizes



Closing Notes


⚠️ **GitHub.com Fallback** ⚠️