EG Workshop LRT 1 - GenomeRIK/workshop_tutorials GitHub Wiki
Processing Iso-Seq data
See TAMA Wiki for manual:https://github.com/GenomeRIK/tama/wiki
See paper for more in depth explanation: https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-020-07123-7
Path to TAMA: /home/training/bin/tama
Path to reference files: /home/training/long_read_transcriptomics/rkuo/0_ref_files
PacBio website for Iso-Seq3 pipeline: https://github.com/PacificBiosciences/IsoSeq/blob/master/isoseq-clustering.md
Running LIMA:
Make a folder for Iso-Seq processing:
mkdir 1_isoseq
Go into that folder and make a folder for running LIMA:
mkdir 1_lima
Go into that folder.
Running LIMA:
Path to CCS bam file:
/home/training/long_read_transcriptomics/rkuo/1_isoseq/0_data/chrom28_m54041_161210_051053.ccs.bam
Path to primer fasta file:
/home/training/long_read_transcriptomics/rkuo/1_isoseq/2_lima/primers_teloprime.fa
LIMA command structure:
lima --isoseq --dump-clips ${bam} ${primers} ${output}
lima --isoseq --dump-clips /home/training/long_read_transcriptomics/rkuo/1_isoseq/0_data/chrom28_m54041_161210_051053.ccs.bam /home/training/long_read_transcriptomics/rkuo/1_isoseq/2_lima/primers_teloprime.fa chrom28_m54041_161210_051053.demux.bam
Running REFINE:
REFINE command structure:
isoseq3 refine ${bam} ${primers} ${flnc}
isoseq3 refine chrom28_m54041_161210_051053.demux.primer_5p--primer_3p.bam /home/training/long_read_transcriptomics/rkuo/1_isoseq/2_lima/primers_teloprime.fa chrom28_m54041_161210_051053.flnc.bam
Converting from BAM to Fasta:
samtools fasta ${bam} > ${fasta}
samtools fasta chrom28_m54041_161210_051053.flnc.primer_5p--primer_3p.bam > chrom28_m54041_161210_051053.flnc.primer_5p--primer_3p.fa
Running Poly-A cleanup:
See explanation for this here:
https://twitter.com/GenomeRIK/status/1179788262187110401
Remember to turn off conda environment for this part:
conda deactivate
Bash scripts as follows:
spath='/home/training/bin/tama/tama_go/sequence_cleanup/' pscript='tama_flnc_polya_cleanup.py' flnc='chrom28_m54041_161210_051053.flnc.primer_5p--primer_3p.fa' prefix='chrom28_flnc' python ${spath}${pscript} -f ${flnc} -p ${prefix}
Outputs:
chrom28_flnc.fa chrom28_flnc_polya_flnc_report.txt chrom28_flnc_tails.fa
Mapping reads to the genome
Make a directory for mapping in your Iso-Seq folder (and go into folder):
mkdir 2_mapping
Mapping reads with Minimap2:
ref='/home/training/long_read_transcriptomics/rkuo/0_ref_files/Gallus_gallus.GRCg6a.dna_sm.chromosome.28.fa' query='/home/training/long_read_transcriptomics/rkuo/1_isoseq/2_lima/chrom28_flnc.fa' # need to change this for your paths outfile='mm2_chrom28_flnc_gal6.sam' minimap2 --secondary=no -ax splice -uf -C5 ${ref} ${query} > ${outfile}
Output is a SAM file.
https://samtools.github.io/hts-specs/SAMv1.pdf
https://broadinstitute.github.io/picard/explain-flags.html
Convert from SAM to BAM:
sam='mm2_chrom28_flnc_gal6.sam' bam='mm2_chrom28_flnc_gal6.bam' samtools view -bS ${sam} > ${bam}
Output is a BAM file.
Sort BAM file::
bam='mm2_chrom28_flnc_gal6.bam' outfile='mm2_chrom28_flnc_gal6_sort.bam' samtools sort ${bam} -o ${outfile}
Output is a BAM file.
Running TAMA Collapse
Note that you made need to deactivate the conda environment to use Python 2.7 i.e. :
conda deactivate
Make folder for running TAMA Collapse in your Iso-Seq folder (and got to folder):
mkdir 3_collapse
Running TAMA Collapse:
spath='/home/training/bin/tama/' pscript='tama_collapse.py' fpath='/home/training/long_read_transcriptomics/rkuo/1_isoseq/3_map/' # need to adjust this for your folder structure sam='mm2_chrom28_flnc_gal6_sort.bam' # need to adjust this for your file naming prefix='tc_mm2_chrom28_flnc_gal6' capflag='capped' awobble='100' zwobble='100' sjt='10' fasta='/home/training/long_read_transcriptomics/rkuo/0_ref_files/Gallus_gallus.GRCg6a.dna_sm.chromosome.28.fa' python ${spath}${pscript} -s ${fpath}${sam} -f ${fasta} -p ${prefix} -d merge_dup -x ${capflag} -a ${awobble} -z ${zwobble} -sj sj_priority -sjt ${sjt} -log log_off -b BAM
Outputs:
tc_mm2_chrom28_flnc_gal6.bed tc_mm2_chrom28_flnc_gal6_local_density_error.txt tc_mm2_chrom28_flnc_gal6_polya.txt tc_mm2_chrom28_flnc_gal6_read.txt tc_mm2_chrom28_flnc_gal6_strand_check.txt tc_mm2_chrom28_flnc_gal6_trans_read.bed tc_mm2_chrom28_flnc_gal6_trans_report.txt tc_mm2_chrom28_flnc_gal6_varcov.txt tc_mm2_chrom28_flnc_gal6_variants.txt
Here is a nifty little BASH script for giving a brief summary of TAMA Collapse output:
file=$1 echo "Genes" cat ${file} | awk -F "\t" '{print $4}' | awk -F ";" '{print $1}' | sort | uniq | wc -l echo "Transcripts" cat ${file} | awk -F "\t" '{print $4}' | awk -F ";" '{print $2}' | sort | uniq | wc -l echo "Multi-exon Transcripts" cat ${file} | awk -F "\t" '{if($10>1)print $4}' | awk -F ";" '{print $2}' | sort | uniq | wc -l