Workflow - NoamKaplan/dna-triangulation GitHub Wiki
Note that this software was developed mainly as a proof-of-concept, not as a polished tool/pipeline. While this is expected to change in the future, currently it means some handholding is required.
There are four main programs. augmentation_chr_pred.py and augmentation_locus_pred.py predict the approximate genomic location of a contig in a pre-existing scaffold (we call this "genome augmentation"). karyotype.py and scaffold_chr.py construct approximate scaffolds from a set of contigs (we call this "de novo scaffolding"). Note that we use the term "contig" loosely, in the sense that it also applies to scaffolds.
De novo scaffolding
- The first step is to assemble an initial set of contigs or scaffolds using any available assembly software.
- Next, one must map the Hi-C data to the contigs using any available Hi-C mapping/correction pipeline.
- Most Hi-C pipelines bin the data to create an interaction matrix at a given resolution, e.g. 40 kb. The bin size should be determined by the Hi-C resolution and the genomic coverage. Binning is also required by the DNA Triangulation software, since contigs are assumed to be of equal size (this may change in the future). There are in fact a number of advantages to binning contigs, as explained later.
- Given the Hi-C interaction matrix, we now run karyotype.py to partition contigs into putative chromosomes. If the number of chromosomes is known, it can be given as input. Otherwise, the number of chromosomes can be estimated automatically by a form of bootstrapped clustering. It is important not to run this step blindly, especially if the number of chromosomes is not known. We recommend looking at the average clustering step length plot to better judge what a reasonable cluster number could be. For example, there may be a strong peak at 2-3 clusters, but often we know that the number of chromosomes is larger and can rule out this possibility. The height and shape of the peak can also be used as a measure of confidence. In general, we suggest it is better to overestimate the number of clusters in order to avoid false-positive merging of chromosomes. Another important point is to avoid a large leave-out (the default is 20%) if there are likely to be small chromosomes.
- After the karyotyping step, each bin will be assigned to a cluster/chromosome number. Now we can use the fact that contigs can contain multiple bins in order to evaluate both the method and the quality of the contigs: we expect bins from the same contig to be assigned to the same cluster/chromosome. We can use this approach to choose an optimal bin size, as well as weed out problematic contigs and bins.
- Next, when we are happy with a set of contigs that we think belong to the same chromosome, we use these as input for scaffold_chr.py, which yields an approximate position for each bin. Note that these positions are given in arbitrary units. Once again, we can take advantage of the fact that we have multiple bins per contig in order to assess quality, determine contig orientation, and convert relative positions to absolute positions.
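The bin-level checks described in the last two steps can be sketched in Python as follows. This is a minimal illustration and not part of the DNA Triangulation programs. It assumes you have already parsed the program outputs into parallel lists with one entry per bin; the names `bin_contigs`, `bin_offsets`, `bin_clusters` and `bin_positions`, and both helper functions, are hypothetical. The actual output formats of karyotype.py and scaffold_chr.py are not described here, so loading the data is left out.

```python
from collections import Counter, defaultdict


def contig_cluster_consistency(bin_contigs, bin_clusters):
    """Return {contig: (majority_cluster, fraction_of_bins_agreeing)}."""
    by_contig = defaultdict(list)
    for contig, cluster in zip(bin_contigs, bin_clusters):
        by_contig[contig].append(cluster)
    result = {}
    for contig, clusters in by_contig.items():
        majority, count = Counter(clusters).most_common(1)[0]
        result[contig] = (majority, count / len(clusters))
    return result


def contig_orientation(bin_contigs, bin_offsets, bin_positions):
    """Guess each contig's orientation from its predicted bin positions.

    '+' if positions mostly increase along the contig, '-' if they mostly
    decrease, '?' if there is no majority (or only one bin).
    """
    by_contig = defaultdict(list)
    for contig, offset, pos in zip(bin_contigs, bin_offsets, bin_positions):
        by_contig[contig].append((offset, pos))
    result = {}
    for contig, pairs in by_contig.items():
        pairs.sort()  # order bins by their offset within the contig
        steps = [b[1] - a[1] for a, b in zip(pairs, pairs[1:])]
        up = sum(1 for s in steps if s > 0)
        down = sum(1 for s in steps if s < 0)
        if up > down:
            result[contig] = "+"
        elif down > up:
            result[contig] = "-"
        else:
            result[contig] = "?"
    return result


# Made-up example data: two contigs, one entry per bin.
bin_contigs = ["ctgA", "ctgA", "ctgA", "ctgB", "ctgB", "ctgB", "ctgB"]
bin_offsets = [0, 1, 2, 0, 1, 2, 3]           # bin index within its contig
bin_clusters = [1, 1, 1, 2, 2, 1, 2]          # assumed karyotyping result
bin_positions = [0.10, 0.14, 0.19, 0.80, 0.71, 0.65, 0.60]  # arbitrary units

consistency = contig_cluster_consistency(bin_contigs, bin_clusters)
orientation = contig_orientation(bin_contigs, bin_offsets, bin_positions)

for contig in sorted(consistency):
    cluster, agreement = consistency[contig]
    print(contig, cluster, f"{agreement:.2f}", orientation[contig])
```

In this toy data, one bin of `ctgB` disagrees with the rest of its contig, which is the kind of signal that could flag a problematic bin or contig. Repeating the consistency check at several bin sizes is one possible way to compare them; any cutoff on the agreement fraction is a judgment call for your data.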
Once again, this software is under development and we hope to provide a more streamlined experience in the future.