Suffix Tree Using Ukkonen's Algorithm - TheEvergreenStateCollege/bioinformatics GitHub Wiki
Suffix Tree Using Ukkonen's Algorithm
What is Suffix Tree?
A suffix tree is a tree data structure usually used to store a list of strings. It is also known as the compressed version of a trie. Unlike a trie, each unique suffix in the list is compressed together and represented by a single node or branch in a suffix tree. It is a strong data structure used in different string processing applications, specifically in bioinformatics; for tasks such as searching for patterns in DNA sequences.
The key features are:
- The concatenation of the edge labels on the path from the root to a leaf (i) exactly spells out the suffix of (S) starting at position (i).
- No two edges starting from a node can have edge labels beginning with the same character.
Example
For the string (S = "abcabxabcde"), the suffix tree would contain all suffixes of (S):
- "abcabxabcde"
- "bcabxabcde"
- "cabxabcde"
- "abxabcde"
- "bxabcde"
- "xabcde"
- "abcde"
- "bcde"
- "cde"
- "de"
- βeβ
Applications of Substring
1. Search for substring: Check if a substring exists within the main string. 2. Pattern Matching: locate all occurrences of a pattern within a string. 3. Genomic Analysis: Analyze DNA sequence for common patterns.
Suffix tree built for 2 DNA sequences
Suffix tree configuration
Suffix trees can be created using algorithms such as Ukkonen's method, which builds the tree in linear time in relation to the string length.
Ukkonen's Algorithm: What Is It?
A suffix tree can be efficiently constructed in O(n) time using Ukkonen's algorithm. One character at a time, it adds to the suffix tree slowly, adjusting the tree structure as it goes.
Key features of Ukkonen's Algorithm
Active point: This consists of active_node, active_length and active_edge. They are used to keep track of the current position in the tree. Remainder: The number of suffixes yet to be added to the tree. Suffix Links: Pointers/links used to jump from one node to another.
abcabxabcde
Using Ukkonen's Algorithm
Constructing Suffix Tree for Key Elements in the Images:
Active Node: The active node is the current node from which the algorithm is trying to extend the tree. It is the node where the active point resides.
Active Edge: The active edge is the edge labeled with the character from the string that the active point is currently traversing.
Active Length: The active length is the number of characters matched along the active edge. It indicates how far along the active edge the algorithm is.
Remainder: The remainder is the number of suffixes yet to be added to the tree. It keeps track of how many suffixes are left to be processed in the current phase.
Steps and Operations:
1. Initial Setup:
- String:
abcabxabcde
- The suffix tree is initially empty.
- No active length, node or remainder
-
Step-by-Step Addition:
Iteration 1: Add 'a':
-
The tree after adding 'a':
-
The Suffix tree has no active node, active length, active edge or remainder
Iteration 2: Add 'b':
-
The tree after adding 'b':
-
The Suffix tree has no active node, active length, active edge or remainder
Iteration 3: Add 'c':
- The tree after adding 'c':
- The Suffix tree has no active node, active length, active edge or remainder
Iteration 4: Add 'a':
- The tree after adding 'a':
- The suffix tree has 0 active node, 1 active length, active edge: a i.e traversing from point βaβ
- The remainder is 1, meaning there are still 1 suffix to be added to the tree.
Iteration 5: Add 'b':
- The tree after adding 'b':
- The suffix tree has 0 active node, 2 active length, active edge: a i.e traversing from point βaβ
- The remainder is 2, meaning there are still 2 suffixes to be added to the tree.
Note: Adding New Character:
- When a new character is added, the algorithm checks if it can extend the current path or needs to create a new branch.
- The suffix tree grows by extending existing edges or adding new edges based on the characters in the string and the current active point.
Iteration 6: 1st remaining Suffix;
Current Suffix Tree:
- Nodes: 0, 1, 2, 3, 4, 5
- Edges:
- 0 -> 2 (bcabx)
- 0 -> 3 (cabx)
- 0 -> 4 (ab)
- 4 -> 1 (cabx)
- 4 -> 5 (x)
Explanation:
- The algorithm is currently at the root node (0), on the edge labeled 'b', one character into the edge.
- There is one suffix left to add.
- The suffix tree at this point includes the suffixes: "bcabx", "cabx", "ab", "cabx", "x".
Iteration 7: 2nd remaining Suffix;
Current Suffix Tree:
- Nodes: 0, 1, 2, 3, 4, 5, 6, 7
- Edges:
- 0 -> 2 (bcabx)
- 0 -> 3 (cabx)
- 0 -> 4 (ab)
- 4 -> 1 (cabx)
- 4 -> 5 (x)
- 6 -> 2 (x)
- 6 -> 7 (x)
Explanation:
- The algorithm has completed processing all suffixes for the current character 'b'.
- The active point is reset to the root node (0) with no active edge or length.
- The remainder is 0, indicating all suffixes for the current character have been added to the tree.
Iteration 8: Add 'x':
- The tree after adding 'x':
- The Suffix tree has no active node, active length, active edge or remainder
Iteration 9: Add 'a':
- The tree after adding 'a':
- The suffix tree has 0 active node, 1 active length, active edge: a i.e traversing from point βaβ
- The remainder is 1, meaning there are still 1 suffix to be added to the tree.
Iteration 10: Add 'b':
- The tree after adding 'b':
- The suffix tree has 4 active nodes, 0 active length, 0 active edge
- The remainder is 2, meaning there are still 2 suffixes to be added to the tree.
Iteration 11: Add 'c':
- The tree after adding 'c':
- The suffix tree has 4 active nodes, 1 active length, active edge: βcβ i.e traversing from point βcβ
- The remainder is 3, meaning there are still 3 suffixes to be added to the tree.
Iteration 12: 1st remaining Suffix:
Current Suffix Tree:
- Nodes: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
- Edges:
- 0 -> 3 (cabxabcd)
- 0 -> 8 (xbcd)
- 0 -> 4 (ab)
- 0 -> 6 (b)
- 4 -> 9 (c)
- 4 -> 5 (xabcd)
- 6 -> 2 (clabxabcd)
- 6 -> 7 (xabcd)
- 9 -> 1 (abxabcd)
- 9 -> 10 (d)
Explanation:
- The algorithm has matched 1 character 'c' on the active edge.
- The active point is at node 6, edge 'c', with an active length of 1.
- The remainder is 2, indicating there are 2 suffixes left to process for the current character.
Iteration 13: 2nd remaining Suffix:
Current Suffix Tree:
- Nodes: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
- Edges:
- 0 -> 3 (cabxabcd)
- 0 -> 8 (xbcd)
- 0 -> 4 (ab)
- 0 -> 6 (b)
- 4 -> 9 (c)
- 4 -> 5 (xabcd)
- 6 -> 2 (clabxabcd)
- 6 -> 7 (xabcd)
- 9 -> 1 (abxabcd)
- 9 -> 10 (d)
- 0 -> 11 (c)
- 11 -> 12 (d)
Explanation:
- The algorithm has matched 1 character ('c') on the active edge.
- The active point is at node 0, edge 'c', with an active length of 1.
- The remainder is 1, indicating there is 1 more suffix left to process for the current character.
Iteration 14: 3rd remaining Suffix:
Current Suffix Tree:
- Nodes: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14
- Edges:
- 0 -> 3 (cabxabcd)
- 0 -> 8 (xbcd)
- 0 -> 4 (ab)
- 0 -> 6 (b)
- 0 -> 13 (c)
- 13 -> 14 (d)
- 4 -> 9 (c)
- 4 -> 5 (xabcd)
- 6 -> 2 (clabxabcd)
- 6 -> 7 (xabcd)
- 9 -> 1 (abxabcd)
- 9 -> 10 (d)
- 11 -> 12 (d)
Explanation:
-
The current phase has completed, with all suffixes for the new character processed.
-
The active point is at the root node, with no active edge or length.
-
The remainder is 0, indicating readiness for the next phase.
Iteration 15: Add 'c':
- The tree after adding 'd':
- The Suffix tree has no active node, active length, active edge or remainder
Final Iteration: Add 'e':
- The tree after adding 'e':
- The Suffix tree has no active node, active length, active edge or remainder
Note:
The β$β symbol represents the end of the string. It is a unique termination character for the original string ensuring all the suffix and distinct and they each terminate at a βleaf nodeβ. Basically, it marks the end of a string.