EG Workshop LRT 1 - GenomeRIK/workshop_tutorials GitHub Wiki

Processing Iso-Seq data

See TAMA Wiki for manual:https://github.com/GenomeRIK/tama/wiki

See paper for more in depth explanation: https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-020-07123-7

Path to TAMA: /home/training/bin/tama

Path to reference files: /home/training/long_read_transcriptomics/rkuo/0_ref_files

PacBio website for Iso-Seq3 pipeline: https://github.com/PacificBiosciences/IsoSeq/blob/master/isoseq-clustering.md

Running LIMA:

Make a folder for Iso-Seq processing:

  mkdir 1_isoseq

Go into that folder and make a folder for running LIMA:

  mkdir 1_lima

Go into that folder.

Running LIMA:

Path to CCS bam file:

  /home/training/long_read_transcriptomics/rkuo/1_isoseq/0_data/chrom28_m54041_161210_051053.ccs.bam

Path to primer fasta file:

  /home/training/long_read_transcriptomics/rkuo/1_isoseq/2_lima/primers_teloprime.fa

LIMA command structure:

  lima --isoseq --dump-clips ${bam} ${primers} ${output}
  lima --isoseq --dump-clips /home/training/long_read_transcriptomics/rkuo/1_isoseq/0_data/chrom28_m54041_161210_051053.ccs.bam /home/training/long_read_transcriptomics/rkuo/1_isoseq/2_lima/primers_teloprime.fa  chrom28_m54041_161210_051053.demux.bam

Running REFINE:

REFINE command structure:

  isoseq3 refine  ${bam} ${primers} ${flnc}
  isoseq3 refine  chrom28_m54041_161210_051053.demux.primer_5p--primer_3p.bam /home/training/long_read_transcriptomics/rkuo/1_isoseq/2_lima/primers_teloprime.fa  chrom28_m54041_161210_051053.flnc.bam 

Converting from BAM to Fasta:

  samtools fasta ${bam} > ${fasta}
  samtools fasta chrom28_m54041_161210_051053.flnc.primer_5p--primer_3p.bam > chrom28_m54041_161210_051053.flnc.primer_5p--primer_3p.fa

Running Poly-A cleanup:

See explanation for this here:

https://twitter.com/GenomeRIK/status/1179788262187110401

Remember to turn off conda environment for this part:

  conda deactivate

Bash scripts as follows:

  spath='/home/training/bin/tama/tama_go/sequence_cleanup/'
  pscript='tama_flnc_polya_cleanup.py'
  flnc='chrom28_m54041_161210_051053.flnc.primer_5p--primer_3p.fa'
  prefix='chrom28_flnc'
  python ${spath}${pscript}  -f ${flnc}  -p ${prefix}

Outputs:

  chrom28_flnc.fa
  chrom28_flnc_polya_flnc_report.txt
  chrom28_flnc_tails.fa

Mapping reads to the genome

Make a directory for mapping in your Iso-Seq folder (and go into folder):

  mkdir 2_mapping

Mapping reads with Minimap2:

  ref='/home/training/long_read_transcriptomics/rkuo/0_ref_files/Gallus_gallus.GRCg6a.dna_sm.chromosome.28.fa'
  query='/home/training/long_read_transcriptomics/rkuo/1_isoseq/2_lima/chrom28_flnc.fa' # need to change this for your paths
  outfile='mm2_chrom28_flnc_gal6.sam'
  minimap2 --secondary=no -ax splice -uf -C5  ${ref} ${query} > ${outfile}

Output is a SAM file.

https://samtools.github.io/hts-specs/SAMv1.pdf

https://broadinstitute.github.io/picard/explain-flags.html

Convert from SAM to BAM:

  sam='mm2_chrom28_flnc_gal6.sam'
  bam='mm2_chrom28_flnc_gal6.bam'
  samtools view -bS  ${sam} > ${bam}

Output is a BAM file.

Sort BAM file::

  bam='mm2_chrom28_flnc_gal6.bam'
  outfile='mm2_chrom28_flnc_gal6_sort.bam'
  samtools sort ${bam} -o ${outfile}

Output is a BAM file.

Running TAMA Collapse

Note that you made need to deactivate the conda environment to use Python 2.7 i.e. :

  conda deactivate

Make folder for running TAMA Collapse in your Iso-Seq folder (and got to folder):

  mkdir 3_collapse

Running TAMA Collapse:

  spath='/home/training/bin/tama/'
  pscript='tama_collapse.py'
  fpath='/home/training/long_read_transcriptomics/rkuo/1_isoseq/3_map/' # need to adjust this for your folder structure
  sam='mm2_chrom28_flnc_gal6_sort.bam' # need to adjust this for your file naming
  prefix='tc_mm2_chrom28_flnc_gal6'
  capflag='capped'
  awobble='100'
  zwobble='100'
  sjt='10'
  fasta='/home/training/long_read_transcriptomics/rkuo/0_ref_files/Gallus_gallus.GRCg6a.dna_sm.chromosome.28.fa'
  python ${spath}${pscript} -s ${fpath}${sam} -f ${fasta} -p ${prefix} -d merge_dup -x ${capflag} -a ${awobble} -z ${zwobble} -sj sj_priority -sjt ${sjt} -log log_off -b BAM

Outputs:

  tc_mm2_chrom28_flnc_gal6.bed
  tc_mm2_chrom28_flnc_gal6_local_density_error.txt
  tc_mm2_chrom28_flnc_gal6_polya.txt
  tc_mm2_chrom28_flnc_gal6_read.txt
  tc_mm2_chrom28_flnc_gal6_strand_check.txt
  tc_mm2_chrom28_flnc_gal6_trans_read.bed
  tc_mm2_chrom28_flnc_gal6_trans_report.txt
  tc_mm2_chrom28_flnc_gal6_varcov.txt
  tc_mm2_chrom28_flnc_gal6_variants.txt

Here is a nifty little BASH script for giving a brief summary of TAMA Collapse output:

  file=$1
  echo "Genes"
  cat ${file} | awk -F "\t" '{print $4}' | awk -F ";" '{print $1}' | sort | uniq | wc -l
  echo "Transcripts"
  cat ${file} | awk -F "\t" '{print $4}' | awk -F ";" '{print $2}' | sort | uniq | wc -l
  echo "Multi-exon Transcripts"
  cat ${file} | awk -F "\t" '{if($10>1)print $4}' | awk -F ";" '{print $2}' | sort | uniq | wc -l
⚠️ **GitHub.com Fallback** ⚠️