EG Workshop LRT 2 - GenomeRIK/workshop_tutorials GitHub Wiki
Processing Nanopore cDNA sequencing data
See TAMA Wiki for TAMA Manual: https://github.com/GenomeRIK/tama/wiki
Path to TAMA:
/home/training/bin/tama
Path to reference files:
/home/training/long_read_transcriptomics/rkuo/0_ref_files
Basecalled Fastq reads (using Guppy):
/home/training/long_read_transcriptomics/rkuo/0_ref_files/fastq_1
Run SeqKit
#!/bin/bash
hpath='/home/training/long_read_transcriptomics/rkuo/0_ref_files/fastq_1/' # path to fastq files from Guppy
opath=`pwd` # define output path
cd ${hpath} # go to fastq folder
files=(`ls | grep "fastq.gz$" `) # get all fastq file names
numfiles=`ls | grep "fastq.gz$" | wc -l` # count number of files
let "numfiles=$numfiles-1"
for j in $(seq 0 $numfiles) # loop through processing each file
do
echo ${j}
input="${files[$j]}"
outfile=`echo ${files[$j]} | awk -F ".gz" '{print "seqkit_"$1}'`
seqkit seq -Q 7 ${input} > ${opath}/${outfile} # run seqkit on each file
done
Example to run: bash script.sh
The output is a new fastq file for each original fastq file, with low-quality reads removed.
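A quick sanity check on the filtering is to compare read counts before and after. Each fastq record is exactly four lines, so dividing the line count by four gives the number of reads. A minimal sketch on a tiny hand-made fastq (`demo.fastq` is a stand-in; point the awk line at your own Guppy and seqkit files):

```shell
# Build a tiny demo fastq with two reads (stand-in for a real fastq file)
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGG\n+\n!!!!\n' > demo.fastq

# A fastq record is exactly 4 lines, so reads = lines / 4
awk 'END{print NR/4}' demo.fastq
```

Running the count on both the input and the seqkit output shows how many reads the quality filter removed.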
Combine all the fastq outputs from seqkit
files=(`ls | grep "^seqkit" |grep "fastq$"`)
numfiles=`ls |grep "^seqkit" |grep "fastq$" | wc -l`
let "numfiles=$numfiles-1"
outfile="all_seqkit.fastq"
rm -f ${outfile} # remove any previous combined file (-f avoids an error if none exists)
for j in $(seq 0 $numfiles)
do
echo ${j}
cat ${files[$j]} >> ${outfile}
done
The output is a single file containing all the sequences from all the input fastq files.
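Since cat accepts multiple files, the loop above can also be collapsed into a single command. A sketch with two tiny stand-in files (the `seqkit_part*.fastq` names are illustrative; your real files follow the `seqkit_*.fastq` pattern from the previous step):

```shell
# Two tiny stand-in seqkit outputs, one read each (names are illustrative)
printf '@a\nAC\n+\nII\n' > seqkit_part1.fastq
printf '@b\nGT\n+\nII\n' > seqkit_part2.fastq

# cat expands the glob in sorted order and concatenates all matches in one go
cat seqkit_part*.fastq > all_seqkit_demo.fastq

# 2 reads in, 2 reads out
awk 'END{print NR/4}' all_seqkit_demo.fastq
```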
Run Pychopper
Remember to activate the base conda environment: conda activate base
pychopper='/home/training/miniconda3/bin/cdna_classifier.py'
fastq='/home/training/long_read_transcriptomics/rkuo/2_nanopore/2_seqkit/all_seqkit.fastq' # replace this with the path to your seqkit fastq file
suffix='sample_ont'
ppath='/home/training/long_read_transcriptomics/rkuo/2_nanopore/3_pychopper/'
pfile='primers_teloprime_ont.fa'
python ${pychopper} -b ${ppath}${pfile} -m edlib -r report_${suffix}.pdf -u unclassified_${suffix}.fq -w rescued_${suffix}.fq ${fastq} full_length_output_${suffix}.fq
Outputs:
cdna_classifier_report.tsv full_length_output_sample_ont.fq report_sample_ont.pdf rescued_sample_ont.fq unclassified_sample_ont.fq
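The rescued file also contains usable reads (reads recovered from concatenated or fused records). This tutorial carries only the full-length output forward, but if you want the rescued reads in your downstream analysis too, you can optionally concatenate them onto the full-length output before the fastq-to-fasta step. A sketch with stand-in files (the `*_demo.fq` names are illustrative):

```shell
# Stand-ins for pychopper's full-length and rescued outputs (one read each)
printf '@fl1\nACGT\n+\nIIII\n' > full_length_output_demo.fq
printf '@r1\nTTAA\n+\nIIII\n' > rescued_demo.fq

# Append the rescued reads to the full-length set
cat full_length_output_demo.fq rescued_demo.fq > combined_demo.fq

# 2 reads x 4 lines = 8 lines
wc -l < combined_demo.fq
```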
Convert fastq to fasta
full_length_output_sample_ont.fq -> full_length_output_sample_ont.fa
spath='/home/training/bin/tama/tama_go/format_converter/'
pscript='tama_convert_nanopore_fastq_fasta.py'
fastq='full_length_output_sample_ont.fq'
outfile='full_length_output_sample_ont.fa'
python ${spath}${pscript} ${fastq} ${outfile}
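A quick way to confirm the conversion kept every read is to compare the fastq read count with the number of fasta headers (each fasta record starts with `>`). A sketch on hand-made stand-in files; run the same two commands on your real `.fq` and `.fa` pair:

```shell
# Stand-in fastq with two reads and its converted fasta counterpart
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGG\n+\nJJJJ\n' > demo.fq
printf '>read1\nACGT\n>read2\nTTGG\n' > demo.fa

echo "$(awk 'END{print NR/4}' demo.fq) fastq reads"
echo "$(grep -c '^>' demo.fa) fasta headers"
```

The two numbers should match.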
Map reads to the genome
Use the fasta file we just created: full_length_output_sample_ont.fa
ref='/home/training/long_read_transcriptomics/rkuo/0_ref_files/Gallus_gallus.GRCg6a.dna_sm.chromosome.28.fa'
query='full_length_output_sample_ont.fa' # make sure to adjust path for your folder structure
outfile='mm2_sample_ont_chrom28.sam'
minimap2 --secondary=no -ax splice -t 2 ${ref} ${query} > ${outfile}
Output is a SAM file.
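Before converting, it is worth checking how many records actually mapped. Bit 4 (0x4) of the SAM FLAG field (column 2) marks an unmapped record, so a small awk filter over the non-header lines counts the mapped ones. A sketch over a hand-made two-record SAM (point it at your minimap2 output instead of `demo.sam`):

```shell
# Hand-made SAM fragment: one mapped record (flag 0) and one unmapped (flag 4)
printf '@SQ\tSN:28\tLN:4747\nr1\t0\t28\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\nr2\t4\t*\t0\t0\t*\t*\t0\t0\tTTGG\tIIII\n' > demo.sam

# Skip header lines (^@); a record is mapped when FLAG bit 4 is unset.
# int($2/4)%2 extracts that bit portably (works without gawk's and()).
awk '!/^@/ && int($2/4)%2==0' demo.sam | wc -l
```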
Convert SAM to BAM
SAM is a text format. BAM is a compressed binary version of SAM.
sam='mm2_sample_ont_chrom28.sam'
bam=`echo ${sam} | sed 's/sam$/bam/'`
samtools view -bS ${sam} > ${bam}
Sort BAM file
bam='mm2_sample_ont_chrom28.bam'
outfile=`echo ${bam} | sed 's/\.bam$/_sort.bam/'` # just an automatic way to make the new name
samtools sort ${bam} -o ${outfile}
Check BAM file:
samtools view mm2_sample_ont_chrom28.bam | awk '{print $2}' | sort | uniq -c
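In that output, the FLAG values in column 2 are bit fields: 0 is a read mapped to the forward strand, 16 (0x10) the reverse strand, and 4 (0x4) an unmapped read. With `--secondary=no` you should see no secondary alignments (flag bit 256), though supplementary records (bit 2048) can still appear. A small bash sketch decoding one flag value:

```shell
# Decode the mapped/strand bits of a SAM FLAG value (here 16, i.e. reverse strand)
flag=16
if (( flag & 4 )); then
  echo "unmapped"
elif (( flag & 16 )); then
  echo "mapped, reverse strand"
else
  echo "mapped, forward strand"
fi
```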
Remember to change conda environment:
conda deactivate
Run TAMA Collapse
spath='/home/training/bin/tama/'
pscript='tama_collapse.py'
fpath='/home/training/long_read_transcriptomics/rkuo/2_nanopore/4_mapping/' ## adjust this path to wherever your BAM file is
sam='mm2_sample_ont_chrom28_sort.bam' # the sorted BAM file we just made; adjust the name if you used a different one
prefix='tc_mm2_sample_ont_chrom28'
capflag='capped'
awobble='100'
zwobble='100'
sjt='10'
fasta='/home/training/long_read_transcriptomics/rkuo/0_ref_files/Gallus_gallus.GRCg6a.dna_sm.chromosome.28.fa' # check that your fasta file is there and looks ok
python ${spath}${pscript} -s ${fpath}${sam} -f ${fasta} -p ${prefix} -d merge_dup -x ${capflag} -a ${awobble} -z ${zwobble} -sj sj_priority -sjt ${sjt} -log log_off -b BAM
Outputs:
tc_mm2_sample_ont_chrom28.bed tc_mm2_sample_ont_chrom28_local_density_error.txt tc_mm2_sample_ont_chrom28_polya.txt tc_mm2_sample_ont_chrom28_read.txt tc_mm2_sample_ont_chrom28_strand_check.txt tc_mm2_sample_ont_chrom28_trans_read.bed tc_mm2_sample_ont_chrom28_trans_report.txt tc_mm2_sample_ont_chrom28_varcov.txt tc_mm2_sample_ont_chrom28_variants.txt
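To get a feel for the result, you can count the collapsed transcript and gene models in the bed file. In TAMA's bed12 output, column 4 holds the gene and transcript IDs joined by a semicolon (e.g. G1;G1.1) - check your own file's column 4 to confirm. A sketch on a hand-made two-line bed (replace `demo.bed` with `tc_mm2_sample_ont_chrom28.bed`):

```shell
# Stand-in two-line TAMA bed: column 4 is "gene_id;transcript_id"
printf '28\t100\t900\tG1;G1.1\t40\t+\n28\t100\t950\tG1;G1.2\t40\t+\n' > demo.bed

# One line per transcript; genes = unique IDs before the ';'
echo "$(wc -l < demo.bed) transcripts"
echo "$(cut -f4 demo.bed | cut -d';' -f1 | sort -u | wc -l) genes"
```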