EG Workshop LRT 3 - GenomeRIK/workshop

Running TAMA Merge and downstream processes

You can download and install IGV from here if you need to: http://software.broadinstitute.org/software/igv/

Comparing the Iso-Seq annotation to the Ensemble annotation

Start from the folder where your Iso-Seq TAMA Collapse was run.

Remember to deactivate the conda environment:

  conda deactivate

Check that the deactivation worked with this command:

  python --version

It should say Python 2.7 or something like that.

Convert the Ensembl GTF file to a BED12 file: https://genome.ucsc.edu/FAQ/FAQformat.html

  spath='/home/training/bin/tama/tama_go/format_converter/'
  pscript='tama_format_gtf_to_bed12_ensembl.py'
  gtf='/home/training/long_read_transcriptomics/rkuo/0_ref_files/chrom28_Gallus_gallus.GRCg6a.101.gtf'
  bed='chrom28_Gallus_gallus.GRCg6a.101.bed'
  python ${spath}${pscript} ${gtf} ${bed}

Make the filelist.txt file for input into TAMA Merge (make sure it is tab separated):

  tc_mm2_chrom28_flnc_gal6.bed	capped	1,1,1	tama
  chrom28_Gallus_gallus.GRCg6a.101.bed	capped	1,1,1	ensembl

Make a bash script for running TAMA Merge and run:

  spath='/home/training/bin/tama/'
  pscript='tama_merge.py'
  filelist='filelist.txt'
  prefix='tm_tc_flnc_ensembl'
  fivethresh='100'
  threethresh='100'
  source='ensembl'
  python ${spath}${pscript} -f ${filelist} -p ${prefix} -a ${fivethresh} -z ${threethresh} -d merge_dup  -s ${source}

You should see this if it runs successfully:

  TAMA Merge has completed successfully!

Outputs should be:

  tm_tc_flnc_ensembl.bed
  tm_tc_flnc_ensembl_gene_report.txt
  tm_tc_flnc_ensembl_merge.txt
  tm_tc_flnc_ensembl_trans_report.txt

You did it!

Now let's take a look at the results using IGV.

Start another terminal and enter the command:

igv

This should open an IGV window.

Load the genome file:

  Gallus_gallus.GRCg6a.dna_sm.chromosome.28.fa

Load the annotation file (this is the file you just made using TAMA Merge):

  tm_tc_flnc_ensembl.bed

Note that you may have used a different name for this file.

Ok let's trying merging the Iso-Seq annotation with the Nanopore annotation and the Ensembl annotation.

Make a filelist_trio.txt file:

  tc_mm2_chrom28_flnc_gal6.bed	capped	1,1,1	isoseq
  tc_mm2_sample_ont_chrom28.bed	capped	1,1,1	nanopore
  chrom28_Gallus_gallus.GRCg6a.101.bed	capped	1,1,1	ensembl

Make a bash script to run TAMA Merge:

  spath='/home/training/bin/tama/'
  pscript='tama_merge.py'
  filelist='filelist_trio.txt'
  prefix='tm_pacbio_nanopore_ensembl'
  fivethresh='100'
  threethresh='100'
  source='ensembl'
  python ${spath}${pscript} -f ${filelist} -p ${prefix} -a ${fivethresh} -z ${threethresh} -d merge_dup  -s ${source}

Run the script.

You should get these output files:

  tm_pacbio_nanopore_ensembl.bed
  tm_pacbio_nanopore_ensembl_gene_report.txt
  tm_pacbio_nanopore_ensembl_merge.txt
  tm_pacbio_nanopore_ensembl_trans_report.txt

Now let's view the resulting annotation file in IGV:

  tm_pacbio_nanopore_ensembl.bed

TAMA ORF/NMD Pipeline

Ok great! Now let's run the ORF/NMD pipeline to get coding region predictions and protein matching.

First we need to convert the Iso-Seq TAMA Collapse bed file into a fasta file.

Use this script to do so:

  fasta='/home/training/long_read_transcriptomics/rkuo/0_ref_files/Gallus_gallus.GRCg6a.dna_sm.chromosome.28.fa'
  bed='tc_mm2_chrom28_flnc_gal6.bed' # may need to change this naming for what you named it
  outfile='tc_mm2_chrom28_flnc_gal6.fa'
  bedtools getfasta -name -split -s -fi ${fasta} -bed ${bed} -fo ${outfile}

Run the script. The output is a fasta file called: tc_mm2_chrom28_flnc_gal6.fa

Now we will run "tama_orf_seeker.py" on the fasta file to find all the open reading frames.

Make a script for running tama_orf_seeker.py:

  spath='/home/training/bin/tama/tama_go/orf_nmd_predictions/'
  pscript='tama_orf_seeker.py'
  fasta='tc_mm2_chrom28_flnc_gal6.fa'
  outfile='orf_pred_tc_mm2_chrom28_flnc_gal6.fa'
  python ${spath}${pscript} -f ${fasta} -o ${outfile}

Run the script.The output is a single file: orf_pred_tc_mm2_chrom28_flnc_gal6.fa

Now we need to run BlastP to match these ORF sequences to a protein database.

Make this script:

  queryfa='orf_pred_tc_mm2_chrom28_flnc_gal6.fa'
  m_blast_db='/home/training/long_read_transcriptomics/rkuo/0_ref_files/Gallus_gallus.GRCg6a.pep.all.fa'
  blast_opts=''
  blast_opts="$blast_opts -evalue 1e-10"; # as a quick setting
  blast_opts="$blast_opts -ungapped";  # un-gapped blast
  blast_opts="$blast_opts -num_threads 4" ;  # 2 CPUs, boost the performance
  blast_opts="$blast_opts -comp_based_stats F";
  blast_opts="$blast_opts -max_target_seqs 3";
  blast_opts="$blast_opts -db $m_blast_db"        # database settings
  blast_opts="$blast_opts -query ${queryfa}";
  echo $blast_opts
  outfile=`echo ${queryfa} | awk -F ".fa" '{print "blastp_"$1".bls"}'`
  blastp $blast_opts > ${outfile}

The output is a single file: blastp_orf_pred_tc_mm2_chrom28_flnc_gal6.bls

Now we need to parse the blastp results using tama_orf_blastp_parser.py:

  spath='/home/training/bin/tama/tama_go/orf_nmd_predictions/'
  pscript='tama_orf_blastp_parser.py'
  bpath='./'
  blastp='blastp_orf_pred_tc_mm2_chrom28_flnc_gal6.bls'
  outfile='parse_blastp_orf_pred_tc_mm2_chrom28_flnc_gal6_prot_onlytophits.txt'
  python ${spath}${pscript} -b ${bpath}${blastp} -o ${outfile}

The output is a single file: parse_blastp_orf_pred_tc_mm2_chrom28_flnc_gal6_prot_onlytophits.txt

Now we can generate a new annotation bed file with coding predictions using tama_cds_regions_bed_add.py:

Make this script:

  spath='/home/training/bin/tama/tama_go/orf_nmd_predictions/'
  pscript='tama_cds_regions_bed_add.py'
  orf='parse_blastp_orf_pred_tc_mm2_chrom28_flnc_gal6_prot_onlytophits.txt'
  bed='tc_mm2_chrom28_flnc_gal6.bed'
  fasta='tc_mm2_chrom28_flnc_gal6.fa'
  outfile='cds_tc_mm2_chrom28_flnc_gal6.bed'
  python ${spath}${pscript} -p ${orf} -a ${bed} -f ${fasta} -o ${outfile}

Run the script.

The output is a single file: cds_tc_mm2_chrom28_flnc_gal6.bed

Let's look at this file in IGV!

Read support and transcript filtering

We will now cover generating read support files which can be used for transcript and gene level quantification.

Go to the folder where your Iso-Seq TAMA Collapse was run.

Make a filelist_readsupport.txt file for input:

  flnc	tc_mm2_chrom28_flnc_gal6_trans_read.bed	trans_read

You may need to adjust the naming of the trans_read file above. This will be used by the tama_read_support_levels.py tool.

Make a bash script to run the tama_read_support_levels.py tool:

  spath='/home/training/bin/tama/tama_go/read_support/'
  pscript='tama_read_support_levels.py'
  filelist='filelist_readsupport.txt'
  prefix='rs_tc_mm2_chrom28_flnc_gal6'
  merge='no_merge'
  python ${spath}${pscript} -f ${filelist} -o ${prefix} -m ${merge}

Run the script.

You will get a single output:

  rs_tc_mm2_chrom28_flnc_gal6_read_support.txt

Let's have a look.

The file should look somegthing like this:

  merge_gene_id	merge_trans_id	gene_read_count	trans_read_count	source_line	support_line
  G1	G1.1	2	2	flnc	flnc:m54041_161210_051053/68420467/ccs,m54041_161210_051053/45220300/ccs
  G2	G2.1	1	1	flnc	flnc:m54041_161210_051053/50725832/ccs
  G3	G3.1	15	1	flnc	flnc:m54041_161210_051053/45809784/ccs

Now we can use this file to run transcript filtering

Make a bash script:

  spath='/home/training/bin/tama/tama_go/filter_transcript_models/'
  pscript='tama_remove_single_read_models_levels.py'
  bed='tc_mm2_chrom28_flnc_gal6.bed'
  readsupport='rs_tc_mm2_chrom28_flnc_gal6_read_support.txt'
  prefix='fsm_tc_mm2_chrom28_flnc_gal6'
  level='transcript'
  multi='keep_multi'
  numsources='1'
  python ${spath}${pscript} -b ${bed}  -r ${readsupport} -o ${prefix} -l ${level} -k ${multi} -s ${numsources}

Run the bash script.

You should get these outputs files:

  fsm_tc_mm2_chrom28_flnc_gal6.bed
  fsm_tc_mm2_chrom28_flnc_gal6_singleton.bed
  fsm_tc_mm2_chrom28_flnc_gal6_singleton_report.txt

The fsm_tc_mm2_chrom28_flnc_gal6.bed file is your main annotation file.

Let's view that on IGV!

Other tools/papers to look into!

SQANTI Annotation and QC https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5848618/

TAPPAS Differential isoform expression https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02028-w

EG Workshop LRT 3 - GenomeRIK/workshop_tutorials GitHub Wiki

⚠️ GitHub.com Fallback ⚠️

EG Workshop LRT 3 - GenomeRIK/workshop_tutorials GitHub Wiki

⚠️ **GitHub.com Fallback** ⚠️

⚠️ GitHub.com Fallback ⚠️