dnaPipeTE pipeline - DR-genomics/Genomics-pipelines GitHub Wiki

From dnaPipeTE github webpage

"dnaPipeTE (for de-novo assembly & annotation Pipeline for Transposable Elements), is a pipeline designed to find, annotate and quantify Transposable Elements in small samples of NGS datasets. It is very useful to quantify the proportion of TEs in newly sequenced genomes since it does not require genome assembly and works on small datasets (< 1X per run)."

Data analysis - thorny flat cluster

Installed dnaPipeTE via singularity container in my scratch space 1- First create a Singularity image from the Docker container

mkdir dnaPipeTE
cd dnaPipeTE
singularity pull --name dnapipete.img docker://clemgoub/dnapipete:latest

This step requires approximately 20 minutes to complete. However, it is only required once for installation.

2- Inside dnapipete directory, create a new file to enter commands for the run. For example:

touch dnaPipeTE_cmds.sh

With the text editor of your choice, edit dnaPipeTE_cmd.sh with the commands for dnaPipeTE. For example: Ran the analysis for 5 samples with 2 iterations

cd /opt/dnaPipeTE
python3 dnaPipeTE.py -input /mnt/sample-1.R1.fastq -output /mnt/dnapipeTE_JSPoadb_sample-1 -genome_size 1200000000 -genome_coverage 0.1 -sample_number 2 -RM_lib /scratch/dramacha/Jstilt_Poaceae_TEdb_Rbase_format.fa

python3 dnaPipeTE.py -input /mnt/sample-3.R1.fastq -output /mnt/dnapipeTE_JSPoadb_sample-3 -genome_size 1200000000 -genome_coverage 0.1 -sample_number 2 -RM_lib /scratch/dramacha/Jstilt_Poaceae_TEdb_Rbase_format.fa

python3 dnaPipeTE.py -input /mnt/sample-5.R1.fastq -output /mnt/dnapipeTE_JSPoadb_sample-5 -genome_size 1200000000 -genome_coverage 0.1 -sample_number 2 -RM_lib /scratch/dramacha/Jstilt_Poaceae_TEdb_Rbase_format.fa

python3 dnaPipeTE.py -input /mnt/sample-11.R1.fastq -output /mnt/dnapipeTE_JSPoadb_sample-11 -genome_size 1200000000 -genome_coverage 0.1 -sample_number 2 -RM_lib /scratch/dramacha/Jstilt_Poaceae_TEdb_Rbase_format.fa

python3 dnaPipeTE.py -input /mnt/sample-16.R1.fastq -output /mnt/dnapipeTE_JSPoadb_sample-16 -genome_size 1200000000 -genome_coverage 0.1 -sample_number 2 -RM_lib /scratch/dramacha/Jstilt_Poaceae_TEdb_Rbase_format.fa

The first line is required to execute the scripts in the right directory of the container The second line is a standard dnaPipeTE command /mnt is the default directory in the Singularity container where a user directory can be mounted to access and write data outside the container. In this example, /mnt in the container will points towards ~/data in your machine. It will be specified in the next command, that actually starts the container, mount the user data and run the program.

3- Start a run. Created a bash file and entered below commands in it.

#!/bin/sh
#PBS -N sing_dnaPipeTE
#PBS -l nodes=1:ppn=12,walltime=96:00:00
#PBS -q cfb0001
module load lang/gcc/11.2.0 lang/r/4.1.2_gcc112
module load singularity/3.7.0
singularity exec --bind /scratch/dramacha/dnaPipeTE:/mnt /scratch/dramacha/dnaPipeTE/dnapipete.img bash /mnt/dnapipete_cmds.sh

Note that --bind is the command that indicates where the data are located outside the container. In this example, /scratch/dramacha/dnaPipeTE . This directory will also be where the output folder will be created.

Add sample name suffix to fasta headers in Trinity.fasta contigs for each sample and concatenate all into one large fasta file.

perl -p -e 's/^(>.*)$/$1-s1/g' dnapipeTE_JSPoadb_sample-1/Trinity.fasta > dnapipeTE_JSPoadb_sample-1/Trinity_renamed.fasta
perl -p -e 's/^(>.*)$/$1-s3/g' dnapipeTE_JSPoadb_sample-3/Trinity.fasta > dnapipeTE_JSPoadb_sample-3/Trinity_renamed.fasta
perl -p -e 's/^(>.*)$/$1-s5/g' dnapipeTE_JSPoadb_sample-5/Trinity.fasta > dnapipeTE_JSPoadb_sample-5/Trinity_renamed.fasta
....

cat dnapipeTE_JSPoadb_sample-*/Trinity_renamed.fasta > s1_s3_s5_s11_s16_Trinity.fasta

Run cd-hit on combined contig file to cluster sequences into groups and extract 1 representative for each group to estimate TE copy number

/scratch/dramacha/cd-hit-v4.8.1-2019-0228/cd-hit-est -i /scratch/dramacha/dnaPipeTE/s1_s3_s5_s11_s16_Trinity.fasta -o s1_s3_s5_s11_s16_Trinity_cdhit.fa -aS 0.8 -c 0.8 -G 0 -g 1 -d 0

Cd-Hit produces two files - fasta file with a representative sequence for each cluster, and a .clstr file with information of all sequence IDs present in each cluster. To estimate TE copy number, we need to convert the .clstr file to a column separated file using the CD-hit utility script "clstr2txt.pl"

perl clstr2txt.pl s1_s3_s5_s11_s16_Trinity_cdhit.fa.clstr > s1_s3_s5_s11_s16_Trinity_cdhit.fa.clstr.list

Reformat "reads_per_component_and_annotation" file for each sample as follows to extract the read count info for each contig/sequence to estimate TE copy number. First, convert space separated "reads_per_component_and_annotation" file to tab delimited file

sed -i 's/ /\t/g' dnapipeTE_JSPoadb_sample-1/reads_per_component_and_annotation
sed -i 's/ /\t/g' dnapipeTE_JSPoadb_sample-3/reads_per_component_and_annotation
....
sed -i 's/ /\t/g' dnapipeTE_JSPoadb_sample-16/reads_per_component_and_annotation

Now, add a suffix "s1" or "s3" or "s5" string at the end of contig names in the second column of each sample'sreads_per_component_and_annotation file respectively.

awk 'BEGIN{FS=OFS=" "}{$3=$3"-s1"}1' ../dnaPipeTE/dnapipeTE_sample-1/reads_per_component_and_annotation > ../dnaPipeTE/dnapipeTE_sample-1/reads_per_component_and_annotation-s1
awk 'BEGIN{FS=OFS=" "}{$3=$3"-s3"}1' ../dnaPipeTE/dnapipeTE_sample-1/reads_per_component_and_annotation > ../dnaPipeTE/dnapipeTE_sample-3/reads_per_component_and_annotation-s3
...
awk 'BEGIN{FS=OFS=" "}{$3=$3"-s16"}1' ../dnaPipeTE/dnapipeTE_sample-1/reads_per_component_and_annotation > ../dnaPipeTE/dnapipeTE_sample-16/reads_per_component_and_annotation-s16

Concatenate all reformated reads_per_component_and_annotation file into 1 file

cat dnapipeTE_sample-*/reads_per_component_and_annotation-s* > s1_s3_s5_s11_s16_reads_per_annotation

Compare CD-hit clstr.list file to the above concatenated reads_per_component_and_annotation file, and add read count info as a last column in .clstr.list file by matching the contig names in both the files

awk 'NR==FNR{a[$3]=$1;next}a[$1]{print $0"\t"a[$1]}' s1_s3_s5_s11_s16_reads_per_annotation ../cd-hit-v4.8.1-2019-0228/s1_s3_s5_s11_s16_Trinity_cdhit.fa.clstr.list > readcounts
awk 'NR==FNR{a[$3]=$5;next}a[$1]{print $0"\t"a[$1]}' s1_s3_s5_reads_per_annotation ../cd-hit-v4.8.1-2019-0228/s1_s3_s5_Trinity_cdhit.fa.clstr.list > superfamily
awk 'NR==FNR{a[$3]=$6;next}a[$1]{print $0"\t"a[$1]}' s1_s3_s5_reads_per_annotation ../cd-hit-v4.8.1-2019-0228/s1_s3_s5_Trinity_cdhit.fa.clstr.list > subfamily
paste <(awk '{print $0}' readcounts) <(awk '{print $8}' superfamily) <(awk '{print $8}' subfamily)  readcounts superfamily superfamily > s1_s3_s5_s11_s16_Trinity_cdhit.fa.clstr.list.readcounts.family

Remote copy the above file to local desktop, to estiamte TE copy number using excel
scp s1_s3_s5_s11_s16_Trinity_cdhit.fa.clstr.list.readcounts.family [email protected]:~/
In local desktop, download the file from remote server
scp [email protected]:~/s1_s3_s5_s11_s16_Trinity_cdhit.fa.clstr.list.readcounts.family /path/to/Desktop/test/

grep -c ">" dnapipeTE_JSPoadb_sample-1/renamed.blasting_reads.fasta #Make note of total number of reads for each sample
...
grep -c ">" dnapipeTE_JSPoadb_sample-16/renamed.blasting_reads.fasta