Tips - PacificBiosciences/FALCON GitHub Wiki

Tips for Restarting Jobs with Falcon

Background

Falcon uses a workflow engine (called pypeFLOW) to track dependencies. That engine will track the progress of all tasks defined in a dependency graph, similar to make. It will only submit jobs where the dependencies are not satisfied.

For small workflows, one can track all files. In the context of genome assembly, because of the design nature of Gene Myers' daligner code, it may not be a good idea to track all output. Therefore, some tasks will generate sentinel files (e.g. job_done) to track the progress.

Restarting the workflow

If you want to re-run the workflow after some jobs fail or with different parameters, you can restart some tasks by deleting the corresponding sub-directories and running fc_run.py again.

Already running jobs

For now, whenever Falcon exits for any reason, it will automatically kill (e.g. qdel) any running jobs. If you encounter a problem, you might need to kill those jobs manually. Then, file a bug here.

The "mypwatcher" directory

Pypeflow distinguishes "tasks" from "jobs". Tasks are done only if their dependencies are satisfied. If not done, a task will be staged for job-submission. The mypwatcher/ directory tracks "jobs" and knows nothing about "tasks". If pwatcher has seen a job finish, then a sentinel will exist somewhere under mypwatcher/exits. In that case, restarting will not cause the job to be re-submitted.

(Someday, this will allow us to re-acquire already running jobs, instead of killing them all on exit.)

The solution is simple: Also rm -rf mypwatcher/.

Examples

Here are some receipts I typically use for my own work:

Regenerate everything, including the error corrected reads

$ rm -rf 0-rawreads/preads/
$ rm -rf 1-preads_ovl/
$ rm -rf 2-asm-falcon
$ rm -rf pwatcher/
$ fc_run.py fc_run.cfg

Redo pread overlaps

$ rm -rf 1-preads_ovl/
$ rm -rf 2-asm-falcon
$ rm -rf pwatcher/
$ fc_run.py fc_run.cfg

Redo overlap to graph to contig

$ rm -rf 2-asm-falcon
$ rm -rf pwatcher/
$ fc_run.py fc_run.cfg      

For this, I typically modify the script run_falcon_asm.*.sh inside 2-asm-falcon instead of deleting the directory. It is useful for testing out different overlap filtering parameters of fc_ovlp_filter.py by changing the bash script.

Other Tips

To get p-read 5'- and 3'- overlap count:

$ fc_ovlp_stats.py --n_core 20 --fofn las.fofn #dump overlap count for las files inside las.fofn using 20 cores

000000000 13329 8 8    
000000002 10096 2 0    
000000003 11647 5 7    
000000004 14689 2 1    
000000005 13854 0 1    

The columns are (1) read_identifier (2) length (3) 5'-overlap_count and (4) 3'-overlap count.

To get a coverage histogram with one line:

$ cat ovlp.stats | awk '{print $3}'  | sort -g | uniq -c 

Check overlap-filtering to understand how it impacts the assembly and get ideas about how to set the parameters for overlap_filtering_setting and fc_ovlp_filter.py.

To get some ideas about how many overlapping jobs are finished:

$ cd 0-rawreads
$ ls -d job_* | wc
     59      59     767
$ find . -name "job*done"  | wc
     59      59    1947

59 of 59 overlap jobs are finished in this example.

Memory usage control

You need to check how much memory and number of core in your cluster. With -t 16 for ovlp_HPCdaligner_option and -s 400 for pa_DBsplit_option, then it takes about 20Gb to 25Gb per daligner job. (But prefer -M to -t.) The daligner is hard-coded to use 4 threads for now. If you have a 48 core blade server, you will need 48/4 * 25Gb = 300Gb RAM to utilize all cores for computation. If you don't have that much RAM, you can reduce the chunk size by reducing the number used for DBsplit -s. The tradeoff is that you will have more tasks, jobs, and files to track.

I also use small -s number to test the job scheduler sometimes. For example, you can create a lot of small jobs for E. coli assembly with -s 50 to test out the configuration.

Local mode

One can try non-distributed mode for a small assembly via job_type=local. Unless you have one big RAM and high core number machine, it is not recommended for larger genome (>100Mb).

Primary and Associated Contigs

  1. What are the differences between a_ctg.fasta and a_ctg_base.fasta?

The a_ctg_base.fasta contains the sequences in the primary contigs that are corresponding to the associated contigs inside a_ctg.fasta. Namely, each sequence of a_ctg_base.fasta is a contiguous sub-sequence of a primary contig. For each sequence inside ``a_ctg_base.fasta, there are some associated contigs in a_ctg.fasta`.

I think it is useful to review how the associate contigs are generated.

The basic ideas can be found https://speakerdeck.com/jchin/string-graph-assembly-for-diploid-genomes-with-long-reads page 14, and https://speakerdeck.com/jchin/learning-genome-structrues-from-de-novo-assembly-and-long-read-mapping page 14, 15. https://speakerdeck.com/jchin/learning-genome-structrues-from-de-novo-assembly-and-long-read-mapping Conceptually, if a genome is haploid, then all contigs should be primary contigs. However, there will be still some associated contigs. In such case, the associated contigs are likely generated by (1) some residue sequencing errors and (2) segmental duplications. For the case (1), I expect Quvier can filter out some low quality associated contigs. Since there are more sequences in the primary for blasr to anchor reads and there is no true unique region in the erroneous associated contigs, the coverage on them will be low. We can filter low quality associated contig consensus as there is not much raw reads to support them. For the case (2), potentially, one can group the reads into different haplotype groups and construct assembly graph for each haplotype and generate contigs accordingly.

If a genome is a diploid genome, then most of the associated contigs will be locally alternative alleles. Typically, when there are big structure variations between homologous choromsomes, there will be alternative paths in the assembly graph and the alternative path is corresponded to the associated contigs. In such case, the primary contigs are “fused contigs” from both haplotypes. Currently, I am developing “Falcon unzip” to resolve the haplotypes so we can generate "haplotigs” which are contigs from definite haplotypes.

I made too videos to explain the concept, https://youtu.be/yC1ujdLUT7Q https://youtu.be/vwSyD31eahI

This slide shows how I apply the new method on a synthetic diploid genome to demonstrate how it works. https://twitter.com/infoecho/status/604070162656985088

  1. For a given contig in the alternate fasta, how can I find where it would map to in the primary contig file?

The 2nd field and the 3rd field of the sequence header inside a_ctg.fa indicate the begin node and the end node of the contig. For example, if we have a header like

>000000F-001-01 000941458:E 000486369:E 15593 47559 5 0.9969 0.8447

It means the associated contig 000000F-001-01 starts from node 000941458:E and ends at 000486369:E. Thsee two nodes should be also in the path of the corresponding primary contig. The path of the primary contig is fully specified in the file p_ctg_tiling_path, you can find exact beginning and ending points where the associated contig are attached to the primary contigs. However, the coordinates are not conserved after the Quiver consensus step, it might be necessary to do some quite alignment to recalibrate the attaching points after quiver consensus. In some case, you can even just do quick sequence alignment to find the homologous region in the primary contig of an associated contigs.

  1. Finally, let's say there was a large, repeat rich region on a given chromosome, and a homologous region on a different chromosome. How does FALCON avoid creating chimeric assemblies of these two chromosomes by mistakenly joining the two very similar regions?

Such repeats are typically called as “segmental duplications”. Yes, Falcon will collapse these regions if the overlapper can not distinguish the repeats. As I discuss in 1), (see slide page 14 too), in some case, it is just like the case of a diploid genome, we can potential find the two separate haplotypes. In other case, the repeat is more complicated, there are more than 2 copies, e.g. the middle part of contigs 4006 in page 21 of https://speakerdeck.com/jchin/learning-genome-structrues-from-de-novo-assembly-and-long-read-mapping, then we will need some new algorithms which I am thinking about now to separate the reads into more than two groups to resolve them.

Does falcon have the function of assembling tetraloid genome?

Falcon, as its current form, is a “diploid or polyploid aware assembler”. I believe there is no fully specific definition what a “diploid or polyploid assembler” should deliver yet at the moment of this writing. From the point of the genome assembly research field, it is still quite new. There were a couple of paper published before for diploid assemblies. However, the general strategy is the phasing adding reads on top on earlier assembly step.

To some degree, the current Falcon assembler provides a better way to build that foundation for a full diploid / polyploid assembler. Please refer to this slide deck https://speakerdeck.com/jchin/string-graph-assembly-for-diploid-genomes-with-long-reads for some detail. Some technical details of the deck are already obsoleted for a little bit, but the general concept is still applied to most recent code in Falcon.

For a tetraploid genome, depending on the genome structure, I would argue one will get better continuity from the primary contigs if you use Falcon for assembling the genome. However, you will need to do good analysis on both primary and associated contigs (or better, the assembly graph directly) after running Falcon to interpret the results correctly. The primary contigs will be “fused” contigs from all haplotypes unless the differences between haplotypes are big such that the assembler’s overlap segregate them apart already.

There are some prototype work to fully segregate the “fused primary contigs” for diploid case. I just presented the ideas in #SFAF2015 conference. For tetraploid case, it will need some hard-code non-trivial mathematics research work to get it work right.