EG Workshop LRT 3 - GenomeRIK/workshop_tutorials GitHub Wiki
Running TAMA Merge and downstream processes
You can download and install IGV from here if you need to: http://software.broadinstitute.org/software/igv/
Comparing the Iso-Seq annotation to the Ensemble annotation
Start from the folder where your Iso-Seq TAMA Collapse was run.
Remember to deactivate the conda environment:
conda deactivate
Check that the deactivation worked with this command:
python --version
It should say Python 2.7 or something like that.
Convert the Ensembl GTF file to a BED12 file: https://genome.ucsc.edu/FAQ/FAQformat.html
spath='/home/training/bin/tama/tama_go/format_converter/' pscript='tama_format_gtf_to_bed12_ensembl.py' gtf='/home/training/long_read_transcriptomics/rkuo/0_ref_files/chrom28_Gallus_gallus.GRCg6a.101.gtf' bed='chrom28_Gallus_gallus.GRCg6a.101.bed' python ${spath}${pscript} ${gtf} ${bed}
Make the filelist.txt file for input into TAMA Merge (make sure it is tab separated):
tc_mm2_chrom28_flnc_gal6.bed capped 1,1,1 tama chrom28_Gallus_gallus.GRCg6a.101.bed capped 1,1,1 ensembl
Make a bash script for running TAMA Merge and run:
spath='/home/training/bin/tama/' pscript='tama_merge.py' filelist='filelist.txt' prefix='tm_tc_flnc_ensembl' fivethresh='100' threethresh='100' source='ensembl' python ${spath}${pscript} -f ${filelist} -p ${prefix} -a ${fivethresh} -z ${threethresh} -d merge_dup -s ${source}
You should see this if it runs successfully:
TAMA Merge has completed successfully!
Outputs should be:
tm_tc_flnc_ensembl.bed tm_tc_flnc_ensembl_gene_report.txt tm_tc_flnc_ensembl_merge.txt tm_tc_flnc_ensembl_trans_report.txt
You did it!
Now let's take a look at the results using IGV.
Start another terminal and enter the command:
igv
This should open an IGV window.
Load the genome file:
Gallus_gallus.GRCg6a.dna_sm.chromosome.28.fa
Load the annotation file (this is the file you just made using TAMA Merge):
tm_tc_flnc_ensembl.bed
Note that you may have used a different name for this file.
Ok let's trying merging the Iso-Seq annotation with the Nanopore annotation and the Ensembl annotation.
Make a filelist_trio.txt file:
tc_mm2_chrom28_flnc_gal6.bed capped 1,1,1 isoseq tc_mm2_sample_ont_chrom28.bed capped 1,1,1 nanopore chrom28_Gallus_gallus.GRCg6a.101.bed capped 1,1,1 ensembl
Make a bash script to run TAMA Merge:
spath='/home/training/bin/tama/' pscript='tama_merge.py' filelist='filelist_trio.txt' prefix='tm_pacbio_nanopore_ensembl' fivethresh='100' threethresh='100' source='ensembl' python ${spath}${pscript} -f ${filelist} -p ${prefix} -a ${fivethresh} -z ${threethresh} -d merge_dup -s ${source}
Run the script.
You should get these output files:
tm_pacbio_nanopore_ensembl.bed tm_pacbio_nanopore_ensembl_gene_report.txt tm_pacbio_nanopore_ensembl_merge.txt tm_pacbio_nanopore_ensembl_trans_report.txt
Now let's view the resulting annotation file in IGV:
tm_pacbio_nanopore_ensembl.bed
TAMA ORF/NMD Pipeline
Ok great! Now let's run the ORF/NMD pipeline to get coding region predictions and protein matching.
First we need to convert the Iso-Seq TAMA Collapse bed file into a fasta file.
Use this script to do so:
fasta='/home/training/long_read_transcriptomics/rkuo/0_ref_files/Gallus_gallus.GRCg6a.dna_sm.chromosome.28.fa' bed='tc_mm2_chrom28_flnc_gal6.bed' # may need to change this naming for what you named it outfile='tc_mm2_chrom28_flnc_gal6.fa' bedtools getfasta -name -split -s -fi ${fasta} -bed ${bed} -fo ${outfile}
Run the script. The output is a fasta file called: tc_mm2_chrom28_flnc_gal6.fa
Now we will run "tama_orf_seeker.py" on the fasta file to find all the open reading frames.
Make a script for running tama_orf_seeker.py:
spath='/home/training/bin/tama/tama_go/orf_nmd_predictions/' pscript='tama_orf_seeker.py' fasta='tc_mm2_chrom28_flnc_gal6.fa' outfile='orf_pred_tc_mm2_chrom28_flnc_gal6.fa' python ${spath}${pscript} -f ${fasta} -o ${outfile}
Run the script.The output is a single file: orf_pred_tc_mm2_chrom28_flnc_gal6.fa
Now we need to run BlastP to match these ORF sequences to a protein database.
Make this script:
queryfa='orf_pred_tc_mm2_chrom28_flnc_gal6.fa' m_blast_db='/home/training/long_read_transcriptomics/rkuo/0_ref_files/Gallus_gallus.GRCg6a.pep.all.fa' blast_opts='' blast_opts="$blast_opts -evalue 1e-10"; # as a quick setting blast_opts="$blast_opts -ungapped"; # un-gapped blast blast_opts="$blast_opts -num_threads 4" ; # 2 CPUs, boost the performance blast_opts="$blast_opts -comp_based_stats F"; blast_opts="$blast_opts -max_target_seqs 3"; blast_opts="$blast_opts -db $m_blast_db" # database settings blast_opts="$blast_opts -query ${queryfa}"; echo $blast_opts outfile=`echo ${queryfa} | awk -F ".fa" '{print "blastp_"$1".bls"}'` blastp $blast_opts > ${outfile}
The output is a single file: blastp_orf_pred_tc_mm2_chrom28_flnc_gal6.bls
Now we need to parse the blastp results using tama_orf_blastp_parser.py:
spath='/home/training/bin/tama/tama_go/orf_nmd_predictions/' pscript='tama_orf_blastp_parser.py' bpath='./' blastp='blastp_orf_pred_tc_mm2_chrom28_flnc_gal6.bls' outfile='parse_blastp_orf_pred_tc_mm2_chrom28_flnc_gal6_prot_onlytophits.txt' python ${spath}${pscript} -b ${bpath}${blastp} -o ${outfile}
The output is a single file: parse_blastp_orf_pred_tc_mm2_chrom28_flnc_gal6_prot_onlytophits.txt
Now we can generate a new annotation bed file with coding predictions using tama_cds_regions_bed_add.py:
Make this script:
spath='/home/training/bin/tama/tama_go/orf_nmd_predictions/' pscript='tama_cds_regions_bed_add.py' orf='parse_blastp_orf_pred_tc_mm2_chrom28_flnc_gal6_prot_onlytophits.txt' bed='tc_mm2_chrom28_flnc_gal6.bed' fasta='tc_mm2_chrom28_flnc_gal6.fa' outfile='cds_tc_mm2_chrom28_flnc_gal6.bed' python ${spath}${pscript} -p ${orf} -a ${bed} -f ${fasta} -o ${outfile}
Run the script.
The output is a single file: cds_tc_mm2_chrom28_flnc_gal6.bed
Let's look at this file in IGV!
Read support and transcript filtering
We will now cover generating read support files which can be used for transcript and gene level quantification.
Go to the folder where your Iso-Seq TAMA Collapse was run.
Make a filelist_readsupport.txt file for input:
flnc tc_mm2_chrom28_flnc_gal6_trans_read.bed trans_read
You may need to adjust the naming of the trans_read file above. This will be used by the tama_read_support_levels.py tool.
Make a bash script to run the tama_read_support_levels.py tool:
spath='/home/training/bin/tama/tama_go/read_support/' pscript='tama_read_support_levels.py' filelist='filelist_readsupport.txt' prefix='rs_tc_mm2_chrom28_flnc_gal6' merge='no_merge' python ${spath}${pscript} -f ${filelist} -o ${prefix} -m ${merge}
Run the script.
You will get a single output:
rs_tc_mm2_chrom28_flnc_gal6_read_support.txt
Let's have a look.
The file should look somegthing like this:
merge_gene_id merge_trans_id gene_read_count trans_read_count source_line support_line G1 G1.1 2 2 flnc flnc:m54041_161210_051053/68420467/ccs,m54041_161210_051053/45220300/ccs G2 G2.1 1 1 flnc flnc:m54041_161210_051053/50725832/ccs G3 G3.1 15 1 flnc flnc:m54041_161210_051053/45809784/ccs
Now we can use this file to run transcript filtering
Make a bash script:
spath='/home/training/bin/tama/tama_go/filter_transcript_models/' pscript='tama_remove_single_read_models_levels.py' bed='tc_mm2_chrom28_flnc_gal6.bed' readsupport='rs_tc_mm2_chrom28_flnc_gal6_read_support.txt' prefix='fsm_tc_mm2_chrom28_flnc_gal6' level='transcript' multi='keep_multi' numsources='1' python ${spath}${pscript} -b ${bed} -r ${readsupport} -o ${prefix} -l ${level} -k ${multi} -s ${numsources}
Run the bash script.
You should get these outputs files:
fsm_tc_mm2_chrom28_flnc_gal6.bed fsm_tc_mm2_chrom28_flnc_gal6_singleton.bed fsm_tc_mm2_chrom28_flnc_gal6_singleton_report.txt
The fsm_tc_mm2_chrom28_flnc_gal6.bed file is your main annotation file.
Let's view that on IGV!
Other tools/papers to look into!
SQANTI Annotation and QC https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5848618/
TAPPAS Differential isoform expression https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02028-w