Links to relevant software, papers, and resources - Green-Biome-Institute/AWS GitHub Wiki


Here is a list of resources that I've found useful or relevant for genome assembly tools and plant genetics:

This site has a good collection of free and open-source software related to genome assembly: https://bioinformaticshome.com/tools/wga/wga.html

Wikipedia is also always useful for a basic search: https://en.wikipedia.org/wiki/De_novo_sequence_assemblers

For dealing with the SRA command fastq-dump: https://edwards.sdsu.edu/research/fastq-dump/

Related to Arabidopsis thaliana:

http://arabidopsisresearch.org/index.php/en/

https://www.arabidopsisinformatics.org

1001 Genomes project for A. thaliana: http://1001genomes.org/data-center.html

Arabidopsis bioinformatics resources: The current state, challenges, and priorities for the future

European Nucleotide Archive data for A. thaliana: https://www.ebi.ac.uk/ena/browser/text-search?query=arabidopsis%20thaliana

Current A. thaliana genome annotations: https://gbrowse.arabidopsis.org/cgi-bin/gb2/gbrowse/arabidopsis/

Screen vs Tmux

https://linuxhint.com/tmux_vs_screen/

Some websites that give information on assemblers I've found interesting or have implemented:

Both short and long read:

- Flye
- SPAdes (I have only found this useful for shorter genomes)
- MaSuRCA

Short-read (Illumina, 454, IonTorrent, etc.):

Long-read (PacBio, Oxford Nanopore):

- Canu, which is a newer version of the Celera assembler
- Falcon (only for PacBio data)
- Miniasm, which is usually paired with Minimap and Racon (below)
- Raven (invokes Racon as a polishing step)
- MECAT & NECAT (it looks like the same group built both: one for PacBio, one for nanopore)

Sequence Alignment:

- Minimap (maps reads to a reference genome; it is also used to find read-to-read overlaps, which is how it pairs with Miniasm)

Consensus Sequence:

- Racon, an assembly polisher
- Medaka

Assembly QC:

- BUSCO Git and BUSCO docs
- QUAST

Websites with helpful information:

Context for the N50 metric: http://www.acgt.me/blog/2013/7/8/why-is-n50-used-as-an-assembly-metric.html
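Since N50 comes up in almost every assembly QC report, here is a minimal sketch of how it is commonly computed: sort the contig lengths from longest to shortest, add them up, and report the length of the contig at which the running sum first reaches half of the total assembly length. The function name `n50` and the example lengths are made up for illustration; tools such as QUAST report this value for you, and their handling of edge cases may differ.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (in bases)."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        # Stop at the first contig that brings the running sum to
        # at least half of the total assembly length.
        if 2 * running >= total:
            return length


if __name__ == "__main__":
    # Hypothetical contig lengths, for illustration only.
    example = [80, 70, 50, 40, 30, 20]
    print(n50(example))  # prints 70 (80 + 70 = 150, and the total is 290)
```

A larger N50 means more of the assembly sits in long contigs, but it says nothing about correctness, which is why it is usually read alongside BUSCO and QUAST results.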

Assembler Comparison Papers

Some of these papers describe a novel assembler; others are studies comparing assemblers across different data sets or types of data (such as bacterial vs. plant):

Wu, X., Heffelfinger, C., Zhao, H., & Dellaporta, S. L. (2019). Benchmarking variant identification tools for plant diversity discovery. BMC Genomics, 20(1), 1–15. https://doi.org/10.1186/s12864-019-6057-7

Khan, A. R., Pervez, M. T., Babar, M. E., Naveed, N., & Shoaib, M. (2018). A Comprehensive Study of De Novo Genome Assemblers: Current Challenges and Future Prospective. Evolutionary Bioinformatics, 14. https://doi.org/10.1177/1176934318758650

Huang, Y. T., & Liao, C. F. (2016). Integration of string and de Bruijn graphs for genome assembly. Bioinformatics, 32(9), 1301–1307. https://doi.org/10.1093/bioinformatics/btw011

Chen, Y., Nie, F., Xie, S. Q., Zheng, Y. F., Dai, Q., Bray, T., Wang, Y. X., Xing, J. F., Huang, Z. J., Wang, D. P., He, L. J., Luo, F., Wang, J. X., Liu, Y. Z., & Xiao, C. L. (2021). Efficient assembly of nanopore reads via highly accurate and intact error correction. Nature Communications, 12(1), 1–10. https://doi.org/10.1038/s41467-020-20236-7

Jung, H., Winefield, C., Bombarely, A., Prentis, P., & Waterhouse, P. (2019). Tools and Strategies for Long-Read Sequencing and De Novo Assembly of Plant Genomes. Trends in Plant Science, 24(8), 700–724. https://doi.org/10.1016/j.tplants.2019.05.003

Shasta: Shafin, K., Pesout, T., Lorig-Roach, R., Haukness, M., Olsen, H. E., Bosworth, C., Armstrong, J., Tigyi, K., Maurer, N., Koren, S., Sedlazeck, F. J., Marschall, T., Mayes, S., Costa, V., Zook, J. M., Liu, K. J., Kilburn, D., Sorensen, M., Munson, K. M., … Paten, B. (2019). Efficient de novo assembly of eleven human genomes using PromethION sequencing and a novel nanopore toolkit. bioRxiv. https://doi.org/10.1101/715722

Vaser, R., Sović, I., Nagarajan, N., & Šikić, M. (2017). Fast and accurate de novo genome assembly from long uncorrected reads. Genome Research, 27(5), 737–746. https://doi.org/10.1101/gr.214270.116

Ye, C., Ma, Z. S., Cannon, C. H., Pop, M., & Yu, D. W. (2012). Exploiting sparseness in de novo genome assembly. BMC Bioinformatics, 13(Suppl 6), 1–8. From the Second Annual RECOMB Satellite Workshop on Massively Parallel Sequencing, Barcelona. http://www.biomedcentral.com/1471-2105/13/S6/S1

Xu, M., Guo, L., Gu, S., Wang, O., Zhang, R., Peters, B. A., Fan, G., Liu, X., Xu, X., Deng, L., & Zhang, Y. (2020). TGS-GapCloser: A fast and accurate gap closer for large genomes with low coverage of error-prone long reads. GigaScience, 9(9), 1–11. https://doi.org/10.1093/gigascience/giaa094

Bradnam, K. R., Fass, J. N., Alexandrov, A., Baranay, P., Bechner, M., Birol, I., Boisvert, S., Chapman, J. A., Chapuis, G., Chikhi, R., Chitsaz, H., Chou, W. C., Corbeil, J., Del Fabbro, C., Docking, R. R., Durbin, R., Earl, D., Emrich, S., Fedotov, P., … Korf, I. F. (2013). Assemblathon 2: Evaluating de novo methods of genome assembly in three vertebrate species. GigaScience, 2(1), 1–31. https://doi.org/10.1186/2047-217X-2-10

Salzberg, S. L., Phillippy, A. M., Zimin, A., Puiu, D., Magoc, T., Koren, S., Treangen, T. J., Schatz, M. C., Delcher, A. L., Roberts, M., Marçais, G., Pop, M., & Yorke, J. A. (2012). GAGE: A critical evaluation of genome assemblies and assembly algorithms (Genome Research (2012) 22 (557-567)). Genome Research, 22(6), 1196. https://doi.org/10.1101/gr.131383.111.22

Wick, R. R., & Holt, K. E. (2019). Benchmarking of long-read assemblers for prokaryote whole genome sequencing. F1000Research, 8, 1–22. https://doi.org/10.12688/f1000research.21782.1

Xia, E., Li, F., Tong, W., Yang, H., Wang, S., Zhao, J., Liu, C., Gao, L., Tai, Y., She, G., Sun, J., Cao, H., Gao, Q., Li, Y., Deng, W., Jiang, X., Wang, W., Chen, Q., Zhang, S., … Wan, X. (2019). The tea plant reference genome and improved gene annotation using long-read and paired-end sequencing data. Scientific Data, 6(1), 1–9. https://doi.org/10.1038/s41597-019-0127-1

Latorre-Pérez, A., Villalba-Bermell, P., Pascual, J., & Vilanova, C. (2020). Assembly methods for nanopore-based metagenomic sequencing: a comparative study. Scientific Reports, 10(1), 1–14. https://doi.org/10.1038/s41598-020-70491-3

Michael, T. P., & VanBuren, R. (2020). Building near-complete plant genomes. Current Opinion in Plant Biology, 54, 26–33. https://doi.org/10.1016/j.pbi.2019.12.009

Murigneux, V., Rai, S. K., Furtado, A., Bruxner, T. J. C., Tian, W., Harliwong, I., Wei, H., Yang, B., Ye, Q., Anderson, E., Mao, Q., Drmanac, R., Wang, O., Peters, B. A., Xu, M., Wu, P., Topp, B., Coin, L. J. M., & Henry, R. J. (2021). Comparison of long-read methods for sequencing and assembly of a plant genome. GigaScience, 9(12), 1–11. https://doi.org/10.1093/gigascience/giaa146

Thudi, M., Li, Y., Jackson, S. A., May, G. D., & Varshney, R. K. (2012). Current state-of-art of sequencing technologies for plant genomics research. Briefings in Functional Genomics, 11(1), 3–11. https://doi.org/10.1093/bfgp/elr045

Jung, H., Jeon, M. S., Hodgett, M., Waterhouse, P., & Eyun, S. I. (2020). Comparative Evaluation of Genome Assemblers from Long-Read Sequencing for Plants and Crops. Journal of Agricultural and Food Chemistry, 68(29), 7670–7677. https://doi.org/10.1021/acs.jafc.0c01647

Koren, S., Walenz, B. P., Berlin, K., Miller, J. R., Bergman, N. H., & Phillippy, A. M. (2017). Canu: Scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Research, 27(5), 722–736. https://doi.org/10.1101/gr.215087.116

Papers related to large genome assemblies or plant assemblies:

Soltis, P. S., & Soltis, D. E. (2021). Plant genomes: Markers of evolutionary history and drivers of evolutionary change. Plants, People, Planet, 3(1), 74–82. https://doi.org/10.1002/ppp3.10159

Zhang, X., Chen, X., Liang, P., & Tang, H. (2018). Cataloging Plant Genome Structural Variations. Current Issues in Molecular Biology, 27, 181–194. https://doi.org/10.21775/cimb.027.181

Wicker, T., Schulman, A. H., Tanskanen, J., Spannagl, M., Twardziok, S., Mascher, M., Springer, N. M., Li, Q., Waugh, R., Li, C., Zhang, G., Stein, N., Mayer, K. F. X., & Gundlach, H. (2017). The repetitive landscape of the 5100 Mbp barley genome. Mobile DNA, 8(1), 1–16. https://doi.org/10.1186/s13100-017-0102-3

De La Torre, A. R., Birol, I., Bousquet, J., Ingvarsson, P. K., Jansson, S., Jones, S. J. M., Keeling, C. I., MacKay, J., Nilsson, O., Ritland, K., Street, N., Yanchuk, A., Zerbe, P., & Bohlmann, J. (2014). Insights into conifer giga-genomes. Plant Physiology, 166(4), 1724–1732. https://doi.org/10.1104/pp.114.248708

Hamilton, J. P., & Buell, C. R. (2012). Advances in plant genome sequencing. Plant Journal, 70(1), 177–190. https://doi.org/10.1111/j.1365-313X.2012.04894.x

Li, F. W., & Harkess, A. (2018). A guide to sequence your favorite plant genomes. Applications in Plant Sciences, 6(3), 1–7. https://doi.org/10.1002/aps3.1030

Good list of plant genome databases: Ong, Q., Nguyen, P., Phuong Thao, N., & Le, L. (2016). Bioinformatics Approach in Plant Genomic Research. Current Genomics, 17(4), 368–378. https://doi.org/10.2174/1389202917666160331202956

Largest sequenced/assembled genome: Meyer, A., Schloissnig, S., Franchini, P., Du, K., Woltering, J., Irisarri, I., Wong, W. Y., Nowoshilow, S., Kneitz, S., Kawaguchi, A., Fabrizius, A., Xiong, P., Dechaud, C., Spaink, H., Volff, J.-N., Simakov, O., Burmester, T., Tanaka, E. M., & Schartl, M. (2021). Giant lungfish genome elucidates the conquest of land by vertebrates. Nature, 590, 284–289. https://doi.org/10.1038/s41586-021-03198-8

Orozco-Arias, S., Isaza, G., & Guyot, R. (2019). Retrotransposons in plant genomes: Structure, identification, and classification through bioinformatics and machine learning. International Journal of Molecular Sciences, 20(15). https://doi.org/10.3390/ijms20153837
