How does reper work? - nterhoeven/reper GitHub Wiki

How does reper work?

The reper pipeline searches for repeats in NGS datasets. This is done in seven steps which are explained below.

counting kmers

This is the first step of reper. Here we use the jellyfish software to count 31-mers across the whole input data set. The resulting jellyfish hash is stored in reads.jf

filter the reads

Using the kmer-filter-script, the reads are now filtered according to these criteria:

  • at least 50 % of the read must consist of kmers with counts above 5 times the genomic coverage
  • a read pair is kept only if both reads match the filter The filtered read pairs are stored in reads.filtered_1.fq and reads.filtered_2.fq.

repeat assembly

In this step the filtered reads are assembled using the transcriptome assembler Trinity. In contrast to a genome assembler, Trinity results in different possible variants of a sequence (different Isoforms of a Unigene). This fits well to the behavior of repeats, which are also present in multiple variants. However, the resulting sequences (Trinity.fasta) are not the "correct" set of repeats in the genome. They rather show a set of possible variants that may be present in the genome.

clustering

Since the relationship between sequences determined by Trinity (isoforms and unigenes) does not fit perfectly to repeats, we cluster the sequences additionally using cd-hit-est. The clustering parameters are chosen as follows:

  • The alignment of two sequences must cover at least 90 % of the shorter sequence
  • The alignment must have an identity of at least 80 %

For each cluster, the longest sequence is chosen as representative. They are stored in the file Trinity.fasta.exemplars.

classification

The exemplar sequences are then classified by aligning them to a reference database. This is done using blast. The resulting alignments are analyzed and the class is chosen in the following way:

  • all alignments of an exemplar are categorized in three confidence groups: high (e-value <= 1e-3), medium (e-value <=1) and low.
  • the confidence groups are analysed from high to low until a decision is reached.
  • If there are sequences in the high confidence group, a majority vote decides the class
  • If there are no sequences in the high confidence group, a majority vote in the medium group decides the class. There must be at least three alignments voting for the same class here. Otherwise the class is "Unknown".
  • If there are also no alignments in the medium confidence group, the same as above is done for the low confidence group, except that there are five agreed votes necessary.

The results are store in the file Trinity.fasta.exemplars.classified

quantification

The next step is the quantification of the repeats. This is achieved by mapping the original reads to the assembled repeat regions. This is done using bowtie2 and samtools. The mapping results are then used to build the repeat landscape in the next step.

build landscape

This is the last step of the reper pipeline. It combines the results of the previous steps to create an overview of the repeat landscape. This includes various information about the detected repeats like the assigned class and the amount of bases it consists of in the genome.

The output is split into three levels of detail (sequence, cluster and class). Each is stored in a tab-separated file that look like this:

repeat-landscape_by-class.tab: class name, read count, number of bp, number of sequences, genomic part in Mbp

class	count	numBp	numSeqs	genomPart[Mbp]
Retroelement	4090	5358	8	0.07
LINE	2259	1744	3	0.04
DNA	423424	189228	236	7.06

repeat-landscape_by-cluster.tab: cluster ID, read count, cluster size in bp, cluster size in num sequences, genomic part in Mbp, assigned class

clstrID	count	size[bp]	size[numSeqs]	genomPart[Mbp]	class
1989	308	339	1	0.01	Unknown
138	2775	4835	2	0.05	DNA/En-Spm
2493	1724	494	2	0.03	LTR/Gypsy
3041	671	239	1	0.01	LTR/Gypsy
396	5000	4266	10	0.08	LTR

repeat-landscape_by-sequence.tab: sequence ID, read count, sequence length in bp, genomic Part in Kbp, ID of assigned cluster, assigned class

seqID	count	length[bp]	genomPart[Kbp]	clstrID	class
TRINITY_DN1739_c0_g5_i1	308	339	5.13	1989	Unknown
TRINITY_DN1739_c0_g3_i1	1679	2625	27.98	138	DNA/En-Spm
TRINITY_DN1739_c0_g3_i2	1096	2210	18.27	138	DNA/En-Spm
TRINITY_DN1478_c2_g2_i1	922	215	15.37	2493	LTR/Gypsy
TRINITY_DN1478_c2_g1_i3	802	279	13.37	2493	LTR/Gypsy

downstream analysis

The three landscape files can now be used for a variety of downstream analyses. For example comparisons of class distribution across different species, searching for expanded classes/clusters, etc. To get a quick graphical overview, you can use the R script plot-landscape.R (requires ggplot2) in the scripts/ subdirectory.