Links to relevant software, papers, and resources - Green-Biome-Institute/AWS GitHub Wiki
Here is a list of resources that I've found useful or relevant for genome assembly tools and plant genetics:
A good collection of free and open-source software related to genome assembly: https://bioinformaticshome.com/tools/wga/wga.html
Wikipedia is also useful for a basic search: https://en.wikipedia.org/wiki/De_novo_sequence_assemblers
For dealing with the SRA command fastq-dump: https://edwards.sdsu.edu/research/fastq-dump/
Related to Arabidopsis thaliana:
http://arabidopsisresearch.org/index.php/en/
https://www.arabidopsisinformatics.org
1001 Genomes Project for A. thaliana: http://1001genomes.org/data-center.html
Arabidopsis bioinformatics resources: The current state, challenges, and priorities for the future
European Nucleotide Archive data for A. thaliana: https://www.ebi.ac.uk/ena/browser/text-search?query=arabidopsis%20thaliana
Current A. thaliana genome annotations: https://gbrowse.arabidopsis.org/cgi-bin/gb2/gbrowse/arabidopsis/
Screen vs Tmux
https://linuxhint.com/tmux_vs_screen/
Some websites that give information on assemblers I've found interesting or have implemented:
Both short and long read:
Short-read (Illumina, 454, IonTorrent, etc.):
Long-read (PacBio, Oxford Nanopore):
- Canu, which is a newer version of the Celera assembler
- Falcon (only for PacBio data)
- Miniasm which is usually paired with Minimap and Racon (below)
- Raven (invokes Racon as a polishing step)
- MECAT & NECAT (looks like same group built both - one for PacBio, one for nanopore)
Sequence Alignment:
- Minimap (aligns reads to a reference genome; also computes the read-to-read overlaps used by Miniasm)
Consensus Sequence:
- Racon, an assembly polisher
- Medaka
Assembly QC:
- BUSCO Git and BUSCO docs
- QUAST
Websites with helpful information:
Context for the N50 metric: http://www.acgt.me/blog/2013/7/8/why-is-n50-used-as-an-assembly-metric.html
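As a rough illustration of the N50 metric, here is a minimal Python sketch that computes it from a list of contig lengths. It assumes the common definition: sort contigs longest first, and N50 is the length of the contig at which the running total reaches at least half of the total assembly length. The function name `n50` and the example lengths are made up for illustration; tools such as QUAST report this value directly.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    # Sort longest first so the running total grows from the largest contigs.
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2

    running = 0
    for length in lengths:
        running += length
        # The first contig that pushes the running total to at least
        # half of the assembly length defines N50.
        if running >= half_total:
            return length
    return 0


# Hypothetical contig lengths (total 300, so the halfway point is 150).
example = [80, 70, 50, 40, 30, 20, 10]
print(n50(example))  # 80 + 70 = 150 reaches the halfway point, so this prints 70
```

Because N50 depends only on the length distribution, it says nothing about whether the contigs are correct, which is why it is usually read alongside checks such as BUSCO.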
Assembler Comparison Papers
Some of these are the papers that describe a novel assembler itself, some are just studies comparing them with different data sets or types of data (like bacterial vs plant):
Wu, X., Heffelfinger, C., Zhao, H., & Dellaporta, S. L. (2019). Benchmarking variant identification tools for plant diversity discovery. BMC Genomics, 20(1), 1–15. https://doi.org/10.1186/s12864-019-6057-7
Khan, A. R., Pervez, M. T., Babar, M. E., Naveed, N., & Shoaib, M. (2018). A Comprehensive Study of De Novo Genome Assemblers: Current Challenges and Future Prospective. Evolutionary Bioinformatics, 14. https://doi.org/10.1177/1176934318758650
Huang, Y. T., & Liao, C. F. (2016). Integration of string and de Bruijn graphs for genome assembly. Bioinformatics, 32(9), 1301–1307. https://doi.org/10.1093/bioinformatics/btw011
Chen, Y., Nie, F., Xie, S. Q., Zheng, Y. F., Dai, Q., Bray, T., Wang, Y. X., Xing, J. F., Huang, Z. J., Wang, D. P., He, L. J., Luo, F., Wang, J. X., Liu, Y. Z., & Xiao, C. Le. (2021). Efficient assembly of nanopore reads via highly accurate and intact error correction. Nature Communications, 12(1), 1–10. https://doi.org/10.1038/s41467-020-20236-7
Jung, H., Winefield, C., Bombarely, A., Prentis, P., & Waterhouse, P. (2019). Tools and Strategies for Long-Read Sequencing and De Novo Assembly of Plant Genomes. Trends in Plant Science, 24(8), 700–724. https://doi.org/10.1016/j.tplants.2019.05.003
Shasta: Shafin, K., Pesout, T., Lorig-Roach, R., Haukness, M., Olsen, H. E., Bosworth, C., Armstrong, J., Tigyi, K., Maurer, N., Koren, S., Sedlazeck, F. J., Marschall, T., Mayes, S., Costa, V., Zook, J. M., Liu, K. J., Kilburn, D., Sorensen, M., Munson, K. M., … Paten, B. (2019). Efficient de novo assembly of eleven human genomes using PromethION sequencing and a novel nanopore toolkit. BioRxiv. https://doi.org/10.1101/715722
Vaser, R., Sović, I., Nagarajan, N., & Šikić, M. (2017). Fast and accurate de novo genome assembly from long uncorrected reads. Genome Research, 27(5), 737–746. https://doi.org/10.1101/gr.214270.116
Ye, C., Ma, Z. S., Cannon, C. H., Pop, M., & Yu, D. W. (2012). Exploiting sparseness in de novo genome assembly. From the Second Annual RECOMB Satellite Workshop on Massively Parallel Sequencing, Barcelona. BMC Bioinformatics, 13(Suppl 6), 1–8. http://www.biomedcentral.com/1471-2105/13/S6/S1
Xu, M., Guo, L., Gu, S., Wang, O., Zhang, R., Peters, B. A., Fan, G., Liu, X., Xu, X., Deng, L., & Zhang, Y. (2020). TGS-GapCloser: A fast and accurate gap closer for large genomes with low coverage of error-prone long reads. GigaScience, 9(9), 1–11. https://doi.org/10.1093/gigascience/giaa094
Bradnam, K. R., Fass, J. N., Alexandrov, A., Baranay, P., Bechner, M., Birol, I., Boisvert, S., Chapman, J. A., Chapuis, G., Chikhi, R., Chitsaz, H., Chou, W. C., Corbeil, J., Fabbro, C. Del, Docking, R. R., Durbin, R., Earl, D., Emrich, S., Fedotov, P., … Korf, I. F. (2013). Assemblathon 2: Evaluating de novo methods of genome assembly in three vertebrate species. GigaScience, 2(1), 1–31. https://doi.org/10.1186/2047-217X-2-10
Salzberg, S. L., Phillippy, A. M., Zimin, A., Puiu, D., Magoc, T., Koren, S., Treangen, T. J., Schatz, M. C., Delcher, A. L., Roberts, M., Marçais, G., Pop, M., & Yorke, J. A. (2012). GAGE: A critical evaluation of genome assemblies and assembly algorithms (Genome Research (2012) 22 (557-567)). Genome Research, 22(6), 1196. https://doi.org/10.1101/gr.131383.111.22
Wick, R. R., & Holt, K. E. (2019). Benchmarking of long-read assemblers for prokaryote whole genome sequencing. F1000Research, 8, 1–22. https://doi.org/10.12688/f1000research.21782.1
Xia, E., Li, F., Tong, W., Yang, H., Wang, S., Zhao, J., Liu, C., Gao, L., Tai, Y., She, G., Sun, J., Cao, H., Gao, Q., Li, Y., Deng, W., Jiang, X., Wang, W., Chen, Q., Zhang, S., … Wan, X. (2019). The tea plant reference genome and improved gene annotation using long-read and paired-end sequencing data. Scientific Data, 6(1), 1–9. https://doi.org/10.1038/s41597-019-0127-1
Latorre-Pérez, A., Villalba-Bermell, P., Pascual, J., & Vilanova, C. (2020). Assembly methods for nanopore-based metagenomic sequencing: a comparative study. Scientific Reports, 10(1), 1–14. https://doi.org/10.1038/s41598-020-70491-3
Michael, T. P., & VanBuren, R. (2020). Building near-complete plant genomes. Current Opinion in Plant Biology, 54, 26–33. https://doi.org/10.1016/j.pbi.2019.12.009
Murigneux, V., Rai, S. K., Furtado, A., Bruxner, T. J. C., Tian, W., Harliwong, I., Wei, H., Yang, B., Ye, Q., Anderson, E., Mao, Q., Drmanac, R., Wang, O., Peters, B. A., Xu, M., Wu, P., Topp, B., Coin, L. J. M., & Henry, R. J. (2021). Comparison of long-read methods for sequencing and assembly of a plant genome. GigaScience, 9(12), 1–11. https://doi.org/10.1093/gigascience/giaa146
Thudi, M., Li, Y., Jackson, S. A., May, G. D., & Varshney, R. K. (2012). Current state-of-art of sequencing technologies for plant genomics research. Briefings in Functional Genomics, 11(1), 3–11. https://doi.org/10.1093/bfgp/elr045
Jung, H., Jeon, M. S., Hodgett, M., Waterhouse, P., & Eyun, S. Il. (2020). Comparative Evaluation of Genome Assemblers from Long-Read Sequencing for Plants and Crops. Journal of Agricultural and Food Chemistry, 68(29), 7670–7677. https://doi.org/10.1021/acs.jafc.0c01647
Koren, S., Walenz, B. P., Berlin, K., Miller, J. R., Bergman, N. H., & Phillippy, A. M. (2017). Canu: Scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Research, 27(5), 722–736. https://doi.org/10.1101/gr.215087.116
Papers related to long assemblies or plant assemblies:
Soltis, P. S., & Soltis, D. E. (2021). Plant genomes: Markers of evolutionary history and drivers of evolutionary change. Plants, People, Planet, 3(1), 74–82. https://doi.org/10.1002/ppp3.10159
Zhang, X., Chen, X., Liang, P., & Tang, H. (2018). Cataloging Plant Genome Structural Variations. Current Issues in Molecular Biology, 27, 181–194. https://doi.org/10.21775/cimb.027.181
Wicker, T., Schulman, A. H., Tanskanen, J., Spannagl, M., Twardziok, S., Mascher, M., Springer, N. M., Li, Q., Waugh, R., Li, C., Zhang, G., Stein, N., Mayer, K. F. X., & Gundlach, H. (2017). The repetitive landscape of the 5100 Mbp barley genome. Mobile DNA, 8(1), 1–16. https://doi.org/10.1186/s13100-017-0102-3
De La Torre, A. R., Birol, I., Bousquet, J., Ingvarsson, P. K., Jansson, S., Jones, S. J. M., Keeling, C. I., MacKay, J., Nilsson, O., Ritland, K., Street, N., Yanchuk, A., Zerbe, P., & Bohlmann, J. (2014). Insights into conifer giga-genomes. Plant Physiology, 166(4), 1724–1732. https://doi.org/10.1104/pp.114.248708
Hamilton, J. P., & Robin Buell, C. (2012). Advances in plant genome sequencing. Plant Journal, 70(1), 177–190. https://doi.org/10.1111/j.1365-313X.2012.04894.x
Li, F. W., & Harkess, A. (2018). A guide to sequence your favorite plant genomes. Applications in Plant Sciences, 6(3), 1–7. https://doi.org/10.1002/aps3.1030
Good list of plant genome databases: Ong, Q., Nguyen, P., Phuong Thao, N., & Le, L. (2016). Bioinformatics Approach in Plant Genomic Research. Current Genomics, 17(4), 368–378. https://doi.org/10.2174/1389202917666160331202956
Longest sequenced/assembled genome: Meyer, A., Schloissnig, S., Franchini, P., Du, K., Woltering, J., Irisarri, I., Wong, W. Y., Nowoshilow, S., Kneitz, S., Kawaguchi, A., Fabrizius, A., Xiong, P., Dechaud, C., Spaink, H., Volff, J.-N., Simakov, O., Burmester, T., Tanaka, E. M., & Schartl, M. (2021). Giant lungfish genome elucidates the conquest of land by vertebrates. Nature, 590(July 2020). https://doi.org/10.1038/s41586-021-03198-8
Orozco-Arias, S., Isaza, G., & Guyot, R. (2019). Retrotransposons in plant genomes: Structure, identification, and classification through bioinformatics and machine learning. International Journal of Molecular Sciences, 20(15). https://doi.org/10.3390/ijms20153837