RepeatExplorer2 - DR-genomics/Genomics-pipelines GitHub Wiki

RE2 Analysis

Initial run with 111,867 PE reads (1% of the genome) randomly subsampled for 32 invasive and 16 native samples.

Ran with default RE2 REXdb database and Viridaeplantae taxonomy specification for annotation Run completed ~ 1 week or so for all samples Almost all samples were occupied predominanlty with SIRE and Ogre LTR-RTs. Variation with satellite repeats were also seen. However, most of the repeat types were unclassified.

3/29/22 Currently running with more PE reads, sampling 10% of the genome = little over 1 million PE reads for 32 invasive and 16 native samples and annotating using JS+Poaceae database, which consists of ~35000 elements

#!/bin/bash

#PBS -N Run_RE2_invasive
#PBS -l nodes=1:ppn=8,walltime=168:00:00
#PBS -q comm_small_week #standby
##PBS -q cfb0001

cd /scratch/dramacha/RepeatExplorer2_Analysis/invasive

##################### Read pre-processing ############################
#Remember to quality-trim reads first (remove reads with Phred score <10) and remove adapters. From that, we can move right on to our pre-processing.
#Remove reads below 100bp and trim the remainder to 100bp
#for f in *R1_001_Tpaired.fastq;
#do (trimmomatic-0.36.jar PE -phred33 $f ${f/R1_001_Tpaired.fastq}R2_001_Tpaired.fastq ${f/R1_001_Tpaired.fastq}R1_001_TFpaired.fastq ${f/R1_001_Tpaired.fastq}R1_001_TFunpaired.fastq ${f/R1_001_Tpaired.fastq}R2_001_TFpaired.fastq ${f/R1_001_Tpaired.fastq}R2_001_TFunpaired.fastq MINLEN:100 CROP:100 -threads 12);
#done

#Count number of reads in fastq - simply count the number of lines and divide by four
#Genome Skimming data
for f in *trimmed_*.fastq;
do (echo $f >> readcounts_GS.txt | awk '{s++}END{print s/4}' $f >> readcounts_GS.txt);
done
##################### READ SAMPLING ############################
 for filename in *_R1.fastq; do   dir="/scratch/dramacha/RepeatExplorer2_Analysis/";   base=$(basename $filename "_R1.fastq");   echo ${base};    seqtk sample -s100 ${base}_R1.fastq 1111867 > ${base}_sub_R1.fastq; seqtk sample -s100 ${base}_R2.fastq 1111867 > ${base}_sub_R2.fastq;  done ##No. of reads need to sample 1% of the Mv.genome=({1118666032*0.01}/100)

#Interleave R1 and R2 pairs
for f in *sub_R1.fastq;
do (seqtk mergepe $f ${f/sub_R1.fastq}sub_R2.fastq > ${f/sub_R1.fastq}sub_paired.fastq);
done
                ############ REPEATEXPLORER 2 ############

### REMEMBER - BEFORE RUNNING ACTIVATE THE ENVIRONMENT WITH conda activate repeatexplorer
##Run Repeatexplorer2
#Turn fastq to fasta
for f in *sub_paired.fastq;
do (seqtk seq -a $f > ${f/sub_paired.fastq}sub_paired.fasta);
done
#Run RepeatExplorer2 (no read sampling as it's already done)
module load singularity
for f in *sub_paired.fasta;
do
singularity exec /shared/containers/repex_tarean.sif seqclust -p $f -l $f.log -m 0.05 -o 55 -v ./${f/sub_paired.fasta}_RE2 -tax VIRIDIPLANTAE3.0 -d /scratch/dramacha/JS_repeatdb/Jstilt_Poaceae_TEdb_Rbase_format_SL.fa JS_poaceae_TEdb
done

For 1 sample, it is taking approx 7 hours to complete the run. I see there is improvement in annotation for the sample when compared to previous run with 100,000 PE reads and default database. So, clearly using spp. specific database and adding more reads improves the run.

Based on initial run, large variation in repeat content is observed between CHN-Hubei and CHN-Jiangxi.

NOTE: Individual runs crashed (32 Inv + 16 native) - due to storage issues.

###Comparative analysis 4/4/22 Running comparative clustering analysis (32 invasive + 16 US samples - 500K PE reads against JS_Poaceae repeat db) => currently in queue mode in our research queue. Hopefully it kick start at least after 4 hours.

singularity exec /shared/containers/repex_tarean.sif seqclust -p JS_native_invasive_PE.fasta -l JS_native_invasive_PE.log -m 0.05 -o 55 -v JS_native_invasive_PE__RE2 -tax VIRIDIPLANTAE3.0 --prefix_length 10 -d /scratch/dramacha/JS_repeatdb/Jstilt_Poaceae_TEdb_Rbase_format_SL.fa JS_poaceae_TEdb

Note: Refer the full script for the prep work for comparative analysis: /scratch/dramacha/RepeatExplorer2_Analysis/RE2_comparative_clustering_JS.sh 4/4/22 The job was in research queue for more than 8 hours. So killed it and running it one of the community nodes. Status - Running 4/5/22 Status - running. clustering is in progress. Noticed that one of the sample (Wisconsin) didn't have the read prefix, usually added to identify samples when running on comparative mode.