iFR Clustering and Hierarchies Construction Algorithm - abi-pangenomics/PanFR GitHub Wiki

This algorithm takes an input of iFRs and their corresponding leaf de Bruijn nodes, clusters them, and generates hierarchies according to leaf sizes. Let’s divide the algorithm into four parts

  1. readFRs()
  2. initDS()
  3. getClusters()
  4. constructHierarchies()

Now let’s observe each in detail

1. Input

Input to the algorithm is a list of iFRs with their corresponding leaf de Bruijn nodes. We define the following terminology:
F – number of iFRs in the input
fi – the ith FR
deg(fi) – number of leaf De Bruijn Nodes corresponding to the ith FR.
L – Σ deg⁡(fi)
V – Number of unique de Bruijn Nodes

2. Initialize the Disjoint Set

In this step, we first initialize the Union-Find Data structure with F. The id[] and size[] arrays are initialized iteratively, making this step O(F)
Id[i] = i, size[i] = 1 ∀i … all sets are of size=1,and root is the element itself

Next for every unique leaf de Bruijn Node, union all the iFRs having the leaf in its leaf set. Consider every iFR-leaf relationship with an arrow. Number of arrows = L. Now reverse all those arrows, and we account for the reverse mapping of leaf-iFR. Thus Number of unions = L (Set) Universe size = F

We know that for a universe of size n, and m union-find operations using quick union with union-by-rank and path compression,
Time complexity = O(n + mα(n+m)) … where alpha is the inverse of Ackermann’s function.

Thus, complexity of this operation = O(F + Lα(F+L)) ≅ O(L)
Note – In the (best) case where all iFRs have disjoint leaf sets, complexity = O(L)

3. Cluster the hierarchies

Up till step 2, we have the hierarchies as array based trees, with each set member having a reference to its parent, ultimately up to the root of the set which has a reference to itself. Now we cluster these together and insert them into tree sets, one per hierarchy. We make the iFRs comparable using number of leafs such that: fi > fj if and only if deg(fi) > deg(fj), fi = fj if and only if deg(fi) = deg(fj) … ∀i,j

In the worst case, all iFRs belong to the same hierarchy, and thus: Time complexity = O(F log(F))

4. Construct the Compressed Hierarchies

Now we actually construct the hierarchies. We iterate over all the unique tree sets, and construct a hierarchy per set. The higher the number of leaves represented by the iFR, the higher it is in the hierarchy. Thus, we iteratively remove the last element from the tree set, and insert it into the hierarchy.

Now let’s examine the actual hierarchy insertion. Initially, the hierarchy starts off with only the last element of the in the tree set, which becomes the root. For every node polled from the set, we check if there is any overlap with the leaf dbg nodes of the root’s children and the polled node.
Case a: Overlap exists The child with which the overlap exists becomes the new root against which the polled node is rechecked via the same process recursively. Case b: No Overlap The node becomes a new direct child of the root

Note A – no partial overlap can exist, either an iFR’s de Bruijn Graph leaves may completely encompass those of another or the sets must be completely disjoint otherwise, due to the original binary tree structure.
Note B – current implementation validates the data by checking that Note A is always true. However, this can be avoided by proceeding if even a partial overlap/non-overlap is found.

Time complexity = O((F^2)*max(deg(fi))). A tighter bound may possibly be found.

Thus the overall time worst case complexity of construction turns out to be

O((F^2)*max(deg(fi)) + F + Lα(F+L))

which seems to be a slightly extreme upper bound, and thus this bound comes with the possibility of being tightened.