System requirements - nterhoeven/reper GitHub Wiki

System requirements

computational resources

reper was developed and tested on Ubuntu 16.04 (64-bit). With Docker or Singularity, it should also be possible to run reper on other host systems.

The intended use case of reper is the analysis of large and complex (plant) genomes. Because these data sets are large, we recommend using an HPC environment or another machine with many CPU cores and plenty of memory. For orientation: the Beta vulgaris data set used in the Tutorial consists of 23 million read pairs (6x coverage). The whole pipeline takes about 5-6 hours with 24 threads and 100 GB of memory.
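To get a quick impression of how a machine compares to the reference run above, a small check like the following can help. This is only a sketch and not part of reper: the function names are made up for illustration, the thresholds are simply the 24 threads and 100 GB from the Beta vulgaris run, and the memory query assumes a Linux host.

```python
import os

# Figures from the Beta vulgaris run described above. They are a rough
# guide for orientation, not limits enforced by reper.
REFERENCE_THREADS = 24
REFERENCE_MEMORY_GB = 100


def available_resources():
    """Return (cpu_count, total_memory_gb) for the current host.

    Assumes Linux: SC_PHYS_PAGES and SC_PAGE_SIZE may be missing elsewhere.
    """
    cpus = os.cpu_count() or 1
    total_bytes = os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")
    return cpus, total_bytes / 1024**3


def compare_to_reference(cpus, memory_gb):
    """Return a list of warnings; the list is empty if the host is at
    least as large as the reference machine."""
    warnings = []
    if cpus < REFERENCE_THREADS:
        warnings.append(
            f"{cpus} CPUs available, reference run used {REFERENCE_THREADS} threads"
        )
    if memory_gb < REFERENCE_MEMORY_GB:
        warnings.append(
            f"{memory_gb:.0f} GB memory available, "
            f"reference run used {REFERENCE_MEMORY_GB} GB"
        )
    return warnings


if __name__ == "__main__":
    cpus, memory_gb = available_resources()
    for line in compare_to_reference(cpus, memory_gb) or ["host matches the reference run"]:
        print(line)
```

A smaller machine is not necessarily unusable; the run time and memory needs depend on the size of the data set.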

dependencies

reper has a few dependencies, listed below. Most of them are common bioinformatics tools and are probably already installed on your system. If you do not want to set up the dependencies yourself, you can use Docker or Singularity.

Requirements for running reper:

Tool	version	license	citation
Jellyfish	2.2.6	GPL-3.0	[1]
Trinity	2.4.0	BSD 3-clause	[2]
cd-hit	4.6.7	GPL-2.0	[3], [4]
Bowtie2	2.3.2	Artistic	[5]
samtools	1.4.1	MIT	[6]
blast+	2.2.28	Public Domain	[7], [8]
kmer-filter	0.03	MIT
perl5lib-sam	1.3.5	MIT
perl	5.22.1	GPL
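Before a first run, it can be useful to check which of the tools from the table are already on the PATH. The following is a minimal sketch, not something shipped with reper. The executable names are assumptions based on how these tools are usually packaged and may differ on your system. kmer-filter and perl5lib-sam are left out because their installed names are not given here, and the script does not check versions.

```python
import shutil

# Tool name from the table -> executable name (assumed, see the note above).
EXECUTABLES = {
    "Jellyfish": "jellyfish",
    "Trinity": "Trinity",
    "cd-hit": "cd-hit",
    "Bowtie2": "bowtie2",
    "samtools": "samtools",
    "blast+": "blastn",
    "perl": "perl",
}


def find_missing(executables, which=shutil.which):
    """Return the sorted names of tools whose executable is not on the PATH.

    The `which` argument only exists so the lookup can be replaced in tests.
    """
    return sorted(tool for tool, exe in executables.items() if which(exe) is None)


if __name__ == "__main__":
    missing = find_missing(EXECUTABLES)
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All checked tools were found on PATH.")
```

If several tools are missing, using the Docker or Singularity setup mentioned above is probably the easier route.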

Additional dependencies used in the tutorial:

Tool	version	license	citation
fastq-dump	2.8.2-1	Public domain
R	3.4.2	GPL-3	[10]
ggplot2	2.2.1	GPL-2

References

[1] Guillaume Marcais and Carl Kingsford, A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics (2011) 27(6): 764-770 (first published online January 7, 2011) doi:10.1093/bioinformatics/btr011

[2] Grabherr MG, Haas BJ, Yassour M, et al. Trinity: reconstructing a full-length transcriptome without a genome from RNA-Seq data. Nature biotechnology. 2011;29(7):644-652. doi:10.1038/nbt.1883.

[3] "Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences", Weizhong Li & Adam Godzik Bioinformatics, (2006) 22:1658-9

[4] Limin Fu, Beifang Niu, Zhengwei Zhu, Sitao Wu and Weizhong Li, CD-HIT: accelerated for clustering the next generation sequencing data. Bioinformatics, (2012), 28 (23): 3150-3152. doi: 10.1093/bioinformatics/bts565

[5] Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10:R25.

[6] Li H, Mathematical Notes on SAMtools Algorithms

[7] Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. (1990) "Basic local alignment search tool." J. Mol. Biol. 215:403-410.

[8] Camacho C., Coulouris G., Avagyan V., Ma N., Papadopoulos J., Bealer K., & Madden T.L. (2008) "BLAST+: architecture and applications." BMC Bioinformatics 10:421.

[9] Nussbaumer T, Martis MM, Roessner SK, Pfeifer M, Bader KC, Sharma S, Gundlach H, Spannagl M. MIPS PlantsDB: a database framework for comparative plant genome research. Nucleic Acids Res. 2013 Jan;41(Database issue):D1144-51.

[10] R Core Team (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.