How does reper work? - nterhoeven/reper GitHub Wiki
How does reper work?
The reper pipeline searches for repeats in NGS datasets. This is done in seven steps which are explained below.
counting kmers
This is the first step of reper. Here we use the jellyfish software to count 31-mers across the whole input data set. The resulting jellyfish hash is stored in reads.jf
filter the reads
Using the kmer-filter-script, the reads are now filtered according to these criteria:
- at least 50 % of the read must consist of kmers with counts above 5 times the genomic coverage
- a read pair is kept only if both reads match the filter
The filtered read pairs are stored in
reads.filtered_1.fq
andreads.filtered_2.fq
.
repeat assembly
In this step the filtered reads are assembled using the transcriptome assembler Trinity. In contrast to a genome assembler, Trinity results in different possible variants of a sequence (different Isoforms of a Unigene). This fits well to the behavior of repeats, which are also present in multiple variants. However, the resulting sequences (Trinity.fasta
) are not the "correct" set of repeats in the genome. They rather show a set of possible variants that may be present in the genome.
clustering
Since the relationship between sequences determined by Trinity (isoforms and unigenes) does not fit perfectly to repeats, we cluster the sequences additionally using cd-hit-est. The clustering parameters are chosen as follows:
- The alignment of two sequences must cover at least 90 % of the shorter sequence
- The alignment must have an identity of at least 80 %
For each cluster, the longest sequence is chosen as representative. They are stored in the file Trinity.fasta.exemplars
.
classification
The exemplar sequences are then classified by aligning them to a reference database. This is done using blast. The resulting alignments are analyzed and the class is chosen in the following way:
- all alignments of an exemplar are categorized in three confidence groups: high (e-value <= 1e-3), medium (e-value <=1) and low.
- the confidence groups are analysed from high to low until a decision is reached.
- If there are sequences in the high confidence group, a majority vote decides the class
- If there are no sequences in the high confidence group, a majority vote in the medium group decides the class. There must be at least three alignments voting for the same class here. Otherwise the class is "Unknown".
- If there are also no alignments in the medium confidence group, the same as above is done for the low confidence group, except that there are five agreed votes necessary.
The results are store in the file Trinity.fasta.exemplars.classified
quantification
The next step is the quantification of the repeats. This is achieved by mapping the original reads to the assembled repeat regions. This is done using bowtie2 and samtools. The mapping results are then used to build the repeat landscape in the next step.
build landscape
This is the last step of the reper pipeline. It combines the results of the previous steps to create an overview of the repeat landscape. This includes various information about the detected repeats like the assigned class and the amount of bases it consists of in the genome.
The output is split into three levels of detail (sequence, cluster and class). Each is stored in a tab-separated file that look like this:
repeat-landscape_by-class.tab: class name, read count, number of bp, number of sequences, genomic part in Mbp
class count numBp numSeqs genomPart[Mbp]
Retroelement 4090 5358 8 0.07
LINE 2259 1744 3 0.04
DNA 423424 189228 236 7.06
repeat-landscape_by-cluster.tab: cluster ID, read count, cluster size in bp, cluster size in num sequences, genomic part in Mbp, assigned class
clstrID count size[bp] size[numSeqs] genomPart[Mbp] class
1989 308 339 1 0.01 Unknown
138 2775 4835 2 0.05 DNA/En-Spm
2493 1724 494 2 0.03 LTR/Gypsy
3041 671 239 1 0.01 LTR/Gypsy
396 5000 4266 10 0.08 LTR
repeat-landscape_by-sequence.tab: sequence ID, read count, sequence length in bp, genomic Part in Kbp, ID of assigned cluster, assigned class
seqID count length[bp] genomPart[Kbp] clstrID class
TRINITY_DN1739_c0_g5_i1 308 339 5.13 1989 Unknown
TRINITY_DN1739_c0_g3_i1 1679 2625 27.98 138 DNA/En-Spm
TRINITY_DN1739_c0_g3_i2 1096 2210 18.27 138 DNA/En-Spm
TRINITY_DN1478_c2_g2_i1 922 215 15.37 2493 LTR/Gypsy
TRINITY_DN1478_c2_g1_i3 802 279 13.37 2493 LTR/Gypsy
downstream analysis
The three landscape files can now be used for a variety of downstream analyses. For example comparisons of class distribution across different species, searching for expanded classes/clusters, etc.
To get a quick graphical overview, you can use the R script plot-landscape.R
(requires ggplot2) in the scripts/
subdirectory.