EG Workshop LRT 2
Processing Nanopore cDNA sequencing data
See the TAMA Wiki for the TAMA manual: https://github.com/GenomeRIK/tama/wiki
Path to TAMA:
/home/training/bin/tama
Path to reference files:
/home/training/long_read_transcriptomics/rkuo/0_ref_files
Basecalled Fastq reads (using Guppy):
/home/training/long_read_transcriptomics/rkuo/0_ref_files/fastq_1
Run SeqKit
```
#!/bin/bash
hpath='/home/training/long_read_transcriptomics/rkuo/0_ref_files/fastq_1/' # path to fastq files from Guppy
opath=`pwd` # define output path
cd ${hpath} # go to fastq folder
files=(`ls | grep "fastq.gz$"`) # get all fastq file names
numfiles=`ls | grep "fastq.gz$" | wc -l` # count number of files
let "numfiles=$numfiles-1" # bash arrays are zero-indexed
for j in $(seq 0 $numfiles) # loop through, processing each file
do
  echo ${j}
  input="${files[$j]}"
  outfile=`echo ${files[$j]} | awk -F ".gz" '{print "seqkit_"$1}'` # e.g. x.fastq.gz -> seqkit_x.fastq
  seqkit seq -Q 7 ${input} > ${opath}/${outfile} # run seqkit on each file
done
```
To run the script (saved, for example, as script.sh): bash script.sh
The output is a new fastq file for each original fastq file with low quality reads removed.
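To sanity-check the filtering, you can compare read counts before and after with seqkit's stats subcommand. A minimal sketch, run from the output folder (the seqkit_*.fastq names come from the script above):

```
# Summarize the raw reads and the quality-filtered reads
seqkit stats /home/training/long_read_transcriptomics/rkuo/0_ref_files/fastq_1/*.fastq.gz
seqkit stats seqkit_*.fastq
```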
Combine all the fastq outputs from seqkit
```
#!/bin/bash
files=(`ls | grep "^seqkit" | grep "fastq$"`) # all seqkit output files
numfiles=`ls | grep "^seqkit" | grep "fastq$" | wc -l`
let "numfiles=$numfiles-1"
outfile="all_seqkit.fastq"
rm ${outfile} # remove any previous combined file so we do not append to it
for j in $(seq 0 $numfiles)
do
  echo ${j}
  cat ${files[$j]} >> ${outfile}
done
```
The output is a single file containing all the sequences from all the input fastq files.
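To confirm the concatenation worked, you can count the reads in the combined file; since each fastq record spans four lines, dividing the line count by four gives the read count. A quick sketch:

```
# Number of reads in the combined fastq (4 lines per record)
echo $(( $(wc -l < all_seqkit.fastq) / 4 ))
```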
Run Pychopper
Remember to activate the base conda environment: conda activate base
```
pychopper='/home/training/miniconda3/bin/cdna_classifier.py'
fastq='/home/training/long_read_transcriptomics/rkuo/2_nanopore/2_seqkit/all_seqkit.fastq' # replace this with the path to your seqkit fastq file
suffix='sample_ont'
ppath='/home/training/long_read_transcriptomics/rkuo/2_nanopore/3_pychopper/'
pfile='primers_teloprime_ont.fa'
python ${pychopper} -b ${ppath}${pfile} -m edlib -r report_${suffix}.pdf -u unclassified_${suffix}.fq -w rescued_${suffix}.fq ${fastq} full_length_output_${suffix}.fq
```
Outputs:
cdna_classifier_report.tsv
full_length_output_sample_ont.fq
report_sample_ont.pdf
rescued_sample_ont.fq
unclassified_sample_ont.fq
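It can be useful to see how the reads split across the full-length, rescued, and unclassified outputs. A minimal sketch (file names as produced above):

```
# Count reads in each Pychopper output (4 lines per fastq record)
for f in full_length_output_sample_ont.fq rescued_sample_ont.fq unclassified_sample_ont.fq
do
  echo "${f}: $(( $(wc -l < ${f}) / 4 )) reads"
done
```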
Convert fastq to fasta
full_length_output_sample_ont.fq -> full_length_output_sample_ont.fa
```
spath='/home/training/bin/tama/tama_go/format_converter/'
pscript='tama_convert_nanopore_fastq_fasta.py'
fastq='full_length_output_sample_ont.fq'
outfile='full_length_output_sample_ont.fa'
python ${spath}${pscript} ${fastq} ${outfile}
```
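To check the conversion, the fasta should contain one header per read in the input fastq; a quick sketch:

```
# Count sequences in the converted fasta (one ">" header per read)
grep -c "^>" full_length_output_sample_ont.fa
```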
Map reads to the genome
Use the fasta file we just created: full_length_output_sample_ont.fa
```
ref='/home/training/long_read_transcriptomics/rkuo/0_ref_files/Gallus_gallus.GRCg6a.dna_sm.chromosome.28.fa'
query='full_length_output_sample_ont.fa' # make sure to adjust the path for your folder structure
outfile='mm2_sample_ont_chrom28.sam'
minimap2 --secondary=no -ax splice -t 2 ${ref} ${query} > ${outfile}
```
Output is a SAM file.
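For a quick mapping summary at this point, samtools flagstat reports the totals of mapped, unmapped, and secondary alignments (recent samtools versions read SAM directly). A sketch:

```
samtools flagstat mm2_sample_ont_chrom28.sam
```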
Convert SAM to BAM
SAM is a plain-text format; BAM is a compressed binary version of SAM.
```
sam='mm2_sample_ont_chrom28.sam'
bam=`echo ${sam} | sed 's/sam$/bam/'` # swap the .sam extension for .bam
samtools view -bS ${sam} > ${bam}
```
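To verify the conversion, both files should hold the same number of alignment records; a quick sketch using samtools view -c:

```
# Record counts should match between the SAM and the BAM
samtools view -c mm2_sample_ont_chrom28.sam
samtools view -c mm2_sample_ont_chrom28.bam
```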
Sort BAM file
```
bam='mm2_sample_ont_chrom28.bam'
outfile=`echo ${bam} | sed 's/.bam/_sort.bam/'` # just an automatic way to make the new name
samtools sort ${bam} -o ${outfile}
```
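If you later want to browse the alignments in a genome browser such as IGV, the sorted BAM also needs an index; a minimal sketch:

```
# Creates mm2_sample_ont_chrom28_sort.bam.bai alongside the BAM
samtools index mm2_sample_ont_chrom28_sort.bam
```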
Check the BAM file by counting the values of the SAM FLAG field (column 2):

```
samtools view mm2_sample_ont_chrom28.bam | awk '{print $2}' | sort | uniq -c
```
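Typical FLAG values here are 0 (mapped to the forward strand), 16 (mapped to the reverse strand), and 4 (unmapped). To decode any other value, samtools includes a small helper; for example:

```
# Translate a numeric SAM FLAG into its named bits
samtools flags 16
```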
Remember to deactivate the conda environment before running TAMA:
conda deactivate
Run TAMA Collapse
```
spath='/home/training/bin/tama/'
pscript='tama_collapse.py'
fpath='/home/training/long_read_transcriptomics/rkuo/2_nanopore/4_mapping/' ## adjust this path to where your BAM file is
sam='mm2_sample_ont_chrom28_sort.bam' # the BAM file we just made; adjust the name if you used a different one
prefix='tc_mm2_sample_ont_chrom28'
capflag='capped'
awobble='100'
zwobble='100'
sjt='10'
fasta='/home/training/long_read_transcriptomics/rkuo/0_ref_files/Gallus_gallus.GRCg6a.dna_sm.chromosome.28.fa' # check that your fasta file is there and looks ok
python ${spath}${pscript} -s ${fpath}${sam} -f ${fasta} -p ${prefix} -d merge_dup -x ${capflag} -a ${awobble} -z ${zwobble} -sj sj_priority -sjt ${sjt} -log log_off -b BAM
```
Outputs:
tc_mm2_sample_ont_chrom28.bed
tc_mm2_sample_ont_chrom28_local_density_error.txt
tc_mm2_sample_ont_chrom28_polya.txt
tc_mm2_sample_ont_chrom28_read.txt
tc_mm2_sample_ont_chrom28_strand_check.txt
tc_mm2_sample_ont_chrom28_trans_read.bed
tc_mm2_sample_ont_chrom28_trans_report.txt
tc_mm2_sample_ont_chrom28_varcov.txt
tc_mm2_sample_ont_chrom28_variants.txt
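A quick way to summarize the collapsed annotation is to count transcript and gene models in the bed file. Assuming the TAMA bed12 convention where column 4 holds "gene_id;transcript_id", a sketch:

```
# One bed line per transcript model
wc -l < tc_mm2_sample_ont_chrom28.bed
# Unique gene IDs from column 4 ("gene_id;transcript_id")
cut -f4 tc_mm2_sample_ont_chrom28.bed | cut -d';' -f1 | sort -u | wc -l
```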