Transposomator - DR-genomics/Genomics-pipelines GitHub Wiki

Date: May 1 2022 Create input file with 3 columns of reads details Edit Trasnsposomator.sh with ref rep db and plastome run transposomator_pipe.sh pipeline

Server myco Folder /data/dhanu/transposome_May1_2022 Input /data/dhanu/transposome_May1_2022/input.tab (see next sheet) TE db /data/dhanu/transposome_May1_2022/JS_TEdb_Rbase_format.ML.fasta organelle db /data/dhanu/Mvim_genome_skim/trial/M_vimineum_plastome.fasta

Command

nohup /usr/local/src/Transposomator/TransposomatorV02_1.sh -i input.tab -T 24 -I 90 -c 55 -k 5000000 -e 0.00001 -r 18 -f /data/dhanu/Mvim_genome_skim/trial/M_vimineum_plastome.fasta & /data/dhanu/Mvim_genome_skim/trial/M_vimineum_plastome.fasta &			

Job ID 22472 killed job as it took more space and time for one replicate run for 1 sample (> 4.5 terabytes for an incomplete rep 1 mgblast analysis) Notes on Transposomator.sh -i [tab delimited, three column file: output_prefix /t forward_reads /t reverse_reads] -T [threads] -I [minimum percent identity] -c [minimum read overlap] -k [number of reads to analyze] -e [ex. 0.00001] -r [number of replicate runs of transposome analysis to be conducted] -f [fasta file containing organellar or other sequences to filter against] Date: May 3 2022 Rerun with 1M reads and 3 reps nohup /usr/local/src/Transposomator/TransposomatorV02_1.sh -i input.tab -T 16 -I 90 -c 55 -k 1000000 -e 0.00001 -r 3 -f /data/dhanu/Mvim_genome_skim/trial/M_vimineum_plastome.fasta &

Job ID 21268 Run speed: mgblast for 1 rep for 1M reads from 1 sample - ~1 hour and it might take forever to finish the next steps for 1 rep. So killed it

Rerun with 100K reads and 3 reps nohup /usr/local/src/Transposomator/TransposomatorV02_1.sh -i input.tab -T 16 -I 90 -c 55 -k 100000 -e 0.00001 -r 3 -f /data/dhanu/Mvim_genome_skim/trial/M_vimineum_plastome.fasta &

Job ID 25533 Killed it again! As the clustering results for the 1st sample = error prone. Each rep. run didn't complete successfully. Had errors, saying too many files open, which is a system error according to the author. So rerunning it again.

Date: May 4 2022 Previously, when used transposome for trezalka data, I sampled 100K reads and used /data/dhanu/Mvim_genome_skim/trial/Jstilt_RepbaseGrass.ref (~19000 seq's with Mvim and poaceae TEs), and it produced sample with TE annotations. So gonna repeat the same with our data but only with Mvim repeats (~16000 seqs): /data/dhanu/Mvim_genome_skim/trial/Jstilt.ref

Date: May 5, 2022 Transposome - running

Date: May 9, 2022 Running transposome for 18 more samples, as Craig was doing mapping those to whole genome. 12 samples from his list were already on my previous sample list. Running as root, as softlinked to those read files from craig's data folder.

###Post run - Stacked bar chart To each accession, modify cluster_log1._annotations_summary file.csv using below command: This cmd add line # synonymous to cluster# by counting rows from 2nd line and add accesion name to each row and remove unwanted space in the file esp. in front of col 1. cat cluster_log_1._annotations_summary.tsv | tail -n +2 | nl | awk '{print $0"\t""AL-MAD-1-1"}' | awk 'gsub(/ +/,"")' > AL-MAD-1-1cluster_log_1._annotations_summary.tsv Add header line: Cluster ReadNum Order Superfamily Family GenomeFraction Species

To do the same for multiple samples at the same time first rename each sample's transposome output dir's by adding a common suffix find . -maxdepth 1 -type d -exec mv {} {}_te \;

Then to each cluster annotation summary file, add a col with cluster number by counting row# and remove unwanted space. for f in *_te; do (cat $f/1/cluster_log_1._annotations_summary.tsv | tail -n +2 | nl | awk 'gsub(/ +/,"")' > $f/1/${f}.cluster_log_1._annotations_summary.tsv); done

Move all edited files to a folder called data analysis mv *_te/1/*_te.cluster_log_1._annotations_summary.tsv data_analysis/

Include sample name as a new col in each file and concatenate to one file. awk '{print FILENAME, $0}' *.tsv >a

Then edit the filename column by keeping only the sample name; make sure tab separated and rearrange colmns by keeping file name col as last col instead of first. awk '{ gsub("_te.*", "", $1); print }' a | sed 's/ / /g' | awk '{print $2"\t"$3"\t"$4"\t"$5"\t"$6"\t"$7"\t"$1}' > transposome_cluster_may1.tsv

Add header: ClusterNo ReadNum Order Superfamily Family GenomeFraction Sample_name

Date May 15 2022 annotated unknown hits from Jstilt_Poaceae_TEdb_Rbase_format_SL.fa using blast - Jstilt.ref (EDTA annotated) as database Why? The jstilt poa db used for RE2 analysis and hence trying to use the same for transposom Filtered hits >=200 bp with 80%id with evalue 1e-10

awk '{print $1}' Jstilt.ref.unknown.blasthits > unknown.hits```
extracted unknown hits from stilt_Poaceae_TEdb_Rbase_format_SL.fa  which had a blast match. Purpose to rename fasta headers with new annotaton				
```filterbyname.sh in=../repeat_db/Jstilt_Poaceae_TEdb_Rbase_format_SL.fa out=unknown.fa include=t names=unknown.hits
				
awk '{print $0"@""Microstegium_vimineum"}' unknown.hits.blastinfo > tmp && mv tmp unknown.hits.blastinfo				
sed -i 's/.*\///g' unknown.hits.blastinfo

				
seqkit replace -p "(.+)" -r '{kv}' -k unknown.key.value.tsv unknown.fa > Jstilt.ref.unknown.renamed.fa				
paste <(seqkit fx2tab Jstilt.ref.unknown.renamed.fa | cut -f 1,2 ) unknown.hits.blastinfo  |   awk '{print $1"@"$3"\t"$2}'  |  seqkit tab2fx -w 0 | sed ' s/@/    /g; s/_vimineum/ vimineum/g'> tmp				
mv tmp Jstilt.ref.unknown.renamed.fa```

remove unknown hits that had a match from ../repeat_db/Jstilt_Poaceae_TEdb_Rbase_format_SL.fa, to replace them with new headers				
```/usr/local/src/bbmap/filterbyname.sh in=../repeat_db/Jstilt_Poaceae_TEdb_Rbase_format_SL.fa out=out.fa include=f names=unknown.hits	```			
Now reformat headers of those sequences in the out.fa to repbase format and then concatenate annotated unknown fasta to this new repbase formated fasta				
```grep ">rnd" out.fa |sed 's/>//g' > out.Mvim.id
grep -v ">rnd" out.id > out.Poa.id				
filterbyname.sh in=out.fa out=out.Poa.fa include=t names=out.Poa.id				
cat ../repeat_db/Poaceae_TEdb.headers | sed 's/.*\///g' | awk '{print $1"@"$2}' > ../repeat_db/Poaceae.info				
paste <(seqkit fx2tab out.Poa.fa | cut -f 1,2 ) ../repeat_db/Poaceae.info  |   awk '{print $1"@"$3"\t"$2}'  |  seqkit tab2fx -w 0 | sed 's/@/    /g;  s/.*#//g' > tmp && mv tmp out.Poa.fa				
 /usr/local/src/bbmap/filterbyname.sh in=out.fa out=out.Mvim.fa include=t names=out.Mvim.id				
awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}'  out.Mvim.fa > tmp &&mv tmp out.Mvim.fa				
 cat out.Mvim.id | sed 's/.*\///g; s/.*#//g' | awk '{print $1"@Micorstegium_vimineum"}' > out.Mvim.info				
 paste <(seqkit fx2tab out.Mvim.fa | cut -f 1,2 ) out.Mvim.info  |   awk '{print $1"@"$3"\t"$2}'  |  seqkit tab2fx -w 0 | sed 's/@/    /g' > tmp && mv tmp out.Mvim.fa				
 cat Jstilt.ref.unknown.renamed.fa out.Mvim.fa out.Poa.fa > ../repeat_db/Jstilt.ref.unknown.reannotated.Poa_Rbase_format.fa

Date: May 16/2022 Moved all intermediate files used for creating the ./repeat_db/Jstilt.ref.unknown.reannotated.Poa_Rbase_format.fa file to ../repeat_db/Jstilt_Poa_tedb_transposomeformat_creation Still didn't work with transposome. So used transposome inhouse perl script format_database.pl to create one. It needed a fasta file with only seq.name. So first created that by deleting other info from prev. created db. And it req genus and spp name. Since the prev db contained poaceae seq's as well, I had to format the db separately for JS, and other poaceae sequences. Oh.. man! this was boring to repeat the same stuff again n agaun for all spp in the db. Any how, finally I concatenated all into one. Intermediate files are in ../repeat_db/Jstilt_Poa_tedb_transposomeformat_creation and ../repeat_db/Jstilt_Poa_tedb_transposomeformat_creation/working_dir/

Wow… after several attempts, figured out why my new TE db (even after spending so much time formatting specific to transpsosome) isn't working to transposome analysis! It’s the freaking forward slash in fasta headers!!!! Just changed / to underscore and now it works. My earlier db contained / and not sure why it worked at that time, but not now. Anyhow, transposome running with my new db.

Running for 30 samples.