Suggested Best Practices for Long Reads - McIntyre-Lab/TranD GitHub Wiki
Table of Contents
- Overview of Best Practices
- Prepare baseline reads - Pre-processing/QC of raw PacBio *.subreads.bam or Oxford Nanopore (ONT) FASTQ files
- Align reads to genome
- Convert SAM alignment files to GTF annotation files
- SQANTI3 QC of long reads
- Get associated gene_id values
- Read selection
- Transcript model estimation
- Validation of transcript model estimation
- Quantification of reads
Overview of Best Practices
Prepare baseline reads
For PacBio
Starting input is a *.subreads.bam file - sequences include subreads separated by primer and adapter sequences.
PacBio library processing includes 3 main steps:
- Circular consensus sequence (CCS)
- Primer removal and demultiplexing (lima)
- Trim poly-A tails (refine)
1) Circular consensus sequence (CCS)
Consensus calling outputs consensus sequence from each ZMW with at least one full pass in ccs.bam
The argument --minPasses is the minimum number of full passes per read (each primer is seen at least once).
ccs --min-passes 1 \
--num-threads 12 \
--log-file ${LOG}/${SAMPLEID}.log \
--report-file ${OUTD}/${SAMPLEID}_report.txt \
${SUBREADS_BAM} \
${OUTD}/${SAMPLEID}.ccs.bam
CCS documentation can be found here.
2) Primer removal and demultiplexing (lima)
Trim primers (and barcodes if present).
A primer file should be in FASTA format. For example:
>primer_5p
AAGCAGTGGTATCAACGCAGAGTACATGGG
>primer_3p
GTACTCTGCGTTGATACCACTGCTT
If reads are multiplexed, then a barcode FASTA file will be used instead:
>primer_5p
AAGCAGTGGTATCAACGCAGAGTACATGGGG
>sample_brain_3p
AAGCAGTGGTATCAACGCAGAGTACCACATATCAGAGTGCG
>sample_liver_3p
AAGCAGTGGTATCAACGCAGAGTACACACACAGACTGTGAG
For primer trimming without demultiplexing, we have previously used the following:
lima --num-threads 12 \
--isoseq \
${OUTD}/${SAMPLEID}.ccs.bam \
primers.fasta \
${OUTD}/${SAMPLEID}_ccs_trimmed.bam
However, we are currently testing preset options. For primer trimming and demultiplexing we are currently testing:
lima --num-threads 12 \
--split-named \
--hifi-present SYMMETRIC-ADAPTERS \
--peek-guess \
$OUT/${NAME}.subreads.ccs.bam \
barcodes.fasta \
$OUTD/${NAME}_demuxed.bam
NOTE: To store unbarcoded reads in a separate file, use the --store-unbarcoded argument as well.
lima documentation can be found here.
3) Trim poly-A tails (refine)
Remove poly-A tails and concatemers
isoseq3 refine \
--num-threads 12 \
--require-polya \
${OUTD}/${SAMPLEID}_ccs_trimmed.primer_5p--primer_3p.bam \
${PRIMERS} \
${OUTD}/${SAMPLEID}_ccs_trimmed.flnc.bam
Convert BAM files after refining to FASTQ files using bedtools bamtofastq:
bedtools bamtofastq \
-i ${IND}/${SAMPLEID}_ccs_trimmed.flnc.bam \
-fq ${OUTD}/${SAMPLEID}_ccs_trimmed.flnc.fq
For ONT
For ONT data we recommend:
- pycoQC - to evaluate quality of ONT run and distribution of barcodes if present
- pychopper - to classify and trim reads based on primers
1) pycoQC
PycoQC generates an interactive HTML report of ONT run quality metrics using the run summary file (sequencing_summary_*.txt)
pycoQC -f sequencing_summary.txt -o pycoQC_output.html
PycoQC documentation can be found here.
2) pychopper
Pychopper identifies full-length (fl) reads, “rescued” reads (fused reads that could be separated), and unclass reads (or reads without proper primers).
NOTE: Only fl reads have primers trimmed.
pychopper \
-t 24 \
-r report_sample.pdf \
-u unclass_sample.fq \
-w rescue_sample.fq \
sample.fastq \
fl_sample.fq
NOTE: Here we are using pHMM default.
Pychopper documentation can be found here.
Align reads to genome
Align to reference genome with minimap2.
To use the --junc-bed option of minimap2 to prioritize reference junction mappings, paftools.js gff2bed is used to convert the reference GTF to a formatted BED file (see minimap2 documentation for instructions).
paftools.js gff2bed reference_annotation.gtf \
> annotation.bed
Aligning stranded PacBio and ONT
For stranded libraries (or for ONT fl and rescue out of pychopper) use the following minimap2 arguments:
minimap2 \
-t 12 \
-a \
-x splice \
--junc-bed annotation.bed \
-u f \
--secondary=no \
-C 5 \
genome.fasta \
sample.fq \
> sample_minimap2.sam
** Aligning unstranded**
NOTE: All arguments are identical to stranded alignment, other than the -u f option. This can be used for unstraded PacBio or the unclassified ONT reads out of pychopper.
minimap2 \
-t 12 \
-a \
-x splice \
--junc-bed annotation.bed \
--secondary=no \
-C 5 \
genome.fasta \
sample.fq \
> sample_minimap2.sam
NOTE: The authors of minimap2 have recommended options for different types of reads on the minimap2 github.
Convert SAM alignment files to GTF annotation files
SAM files are converted to BAM files with samtools view.
First, unmapped and ambiguous alignments are removed:
samtools view \
-h \
-F 2052 \
sample_minimap2.sam \
> sample_minimap2_mapped.sam
Then the SAM is converted to a BAM:
samtools view -b sample_minimap2_mapped.sam \
> sample_minimap2_mapped.bam
BAM files are converted to BED12 files with bedtools bamtobed
bedtools bamtobed \
-split \
-i sample_minimap2_mapped.bam \
> sample_minimap2_mapped.bed
BED12 files are converted to GTF files using information from the BED12:
awk -v source=${SOURCE} -F "\t" '{print $1"\t"source"\texon\t"$2+1"\t"$3"\t.\t"$6"\t.\tgene_id \""$4"\"; transcript_id \""$4"\";"}' \
sample_minimap2_mapped.bed \
> sample_minimap2_mapped.gtf
SQANTI3 QC of long reads
SQANTI3 QC can be used to evaluate the mapped long reads against the reference annoation.
sqanti3_qc.py \
--skipORF \
--report pdf \
--force_id_ignore \
--dir /path/to/outdir \
-o sample \
sample_minimap2_mapped.gtf \
reference_annotation.gtf \
reference_genome.fasta
## --skipORF option to skip ORF prediction (runs much faster)
## --report pdf option to make a SQANTI QC report in PDF format
## --force_id_ignore option to allow transcript ID names that are not like PB.X.Y
## Order of files is (1) isoforms (FA/FQ/GTF), (2) annotation (ref GTF/GFF), (3) genome (ref FA)
# --dir is path to output directory
# -o prefix name to use
A high quality PacBio dataset compared to a reasonable reference annotation should have around 80% full-splice matches (FSM) to reference transcripts. If you find proportions substantially lower than this, we recommend re-evaluating previous steps and run quality to identify potential read losses.
ONT datasets are expected to have lower proportions of FSM.
Get associated gene_id values
SQANTI3 QC output can be used to get associated gene_id values, however reads in novel gene loci will each get their own arbitrary gene_id value regardless of any overlap between them. We have therefore used GFFcompare to find overlapping gene loci across reads.
Concatenate all samples mapped to the given genome so that common gene_id values can be associated across samples:
cat *_minimap2_mapped.gtf \
> combined_mapped.gtf
To compare splice junction chains across transcripts with TranD or other tools, genes with overlapping exons (on the same strand) are given a common gene_id GFFcompare is used to find overlapping exons:
gffcompare -r reference_annotation.gtf -o /path/to/output_prefix combined_mapped.gtf
To update the GTF files with the corrected associated gene_id values, use the TranD utility correct_gffcompare_GTF_gene_id.py:
python correct_gffcompare_GTF_gene_id.py \
-a /path/to/output_prefix.annotated.gtf \
-g combined_mapped.gtf \
-o combined_corrected_associated_gene.gtf
Split back out the combined samples by species using the TranD utility subset_gtf.py:
First make a list of the transcript_id (or read_id) values for each sample:
cat sample_minimap2_mapped.gtf | awk '{print $10}' | sed s/'\"'/''/g \
> sample_read_id.txt
Then subset:
python subset_gtf.py \
-g combined_corrected_associated_gene.gtf \
-t transcript_id \
-i sample_read_id.txt \
-o sample_corrected_associated_gene.gtf
Read selection
Identify unspliced fragment reads (monoexon read in multiexon reference gene).
Get GTF of isoforms that pass selection parameters.
Subset other outputs for selected reads.
NOTE: Scripts used in this method can be found here(LOCATION BEING UPDATED).
## Flag monoexon reads and references
python ${SCRIPTS}/flag_monoexon_03avn.py \
-r reference_annotation.gtf \
-i sample_corrected_associated_gene.gtf \
-o sample_corrected_associated_gene_monoexon_flag.csv
## Get reads that are not unspliced fragments
flagCOL=$(head -1 sample_corrected_associated_gene_monoexon_flag.csv \
| sed s/','/'\n'/g | cat -n | awk '$2=="flag_unspliced_fragment"{print $1}')
xcrptCOL=$(head -1 sample_corrected_associated_gene_monoexon_flag.csv \
| sed s/','/'\n'/g | cat -n | awk '$2=="transcript_id"{print $1}')
awk -F "," -v flagcol=${flagCOL} -v xcrptcol=${xcrptCOL} 'NR!=1 && $(flagcol)==0 {print $(xcrptcol)}' \
sample_corrected_associated_gene_monoexon_flag.csv \
> sample_baseline_selection_keep.txt
## Get GTF of read that pass selection parameters
python ${SCRIPTS}/subset_gtf.py \
-g sample_corrected_associated_gene.gtf \
-t transcript_id \
-i sample_baseline_selection_keep.txt \
-o sample_corrected_associated_gene_baseline_selection.gtf
## Subset other output files for selected reads
python $PROJ/scripts/Common_extracFeatureFromList.py \
--list sample_baseline_selection_keep.txt \
--type sam \
--input sample_minimap2_mapped.sam \
--out sample_baseline_selec.sam
samtools view -bS \
sample_baseline_selec.sam \
> sample_baseline_selec.bam
Transcript model estimation
There are many tools that can be used to estimate transcript models from long reads. We provide an example of a reference-free method (IsoSeq3 cluster) and a method that utilizes a reference annotation to guide estimates (FLAIR). IsoSeq3 cluster has been developed for PacBio and FLAIR for ONT reads, however they can be used on either data type.
NOTE: It is important to consider the types of samples you have and whether it makes sense to combine some samples prior to transcript model estimation.
IsoSeq3 cluster
Run IsoSeq3 cluster on baseline selection output across samples::
isoseq3 cluster \
--num-threads 12 \
--log-file sample_isoseq3Cluster.log \
--verbose \
--use-qvs \
sample_baseline_selec.bam \
sample_isoseq3Cluster.bam
Re-align the resulting IsoSeq3 consensus sequences (hq AND lq files) to the genome and merge the hq and lq alignments:
minimap2 \
-t 12 \
-a \
-x splice \
--junc-bed ${ROZ}/${MAP}_anno.bed \
-u f \
--secondary=no \
-C 5 \
${REF} \
${OUTD}/${SPECIES}_2_${MAP}_isoseq3_clustered.lq.fasta.gz
minimap2 \
-t 12 \
-a \
-x splice \
--junc-bed ${ROZ}/${MAP}_anno.bed \
-u f \
--secondary=no \
-C 5 \
${REF} \
${OUTD}/${SPECIES}_2_${MAP}_isoseq3_clustered.hq.fasta.gz
samtools merge \
${sam2gtf_OUTD}/${SPECIES}_2_${MAP}_realigned.sam \
${ALIGN_OUTD}/${SPECIES}_2_${MAP}_realigned.hq.sam \
${ALIGN_OUTD}/${SPECIES}_2_${MAP}_realigned.lq.sam
Convert SAM-BAM-BED12-GTF and remove secondary alignment results by samtools view -F 2052:
samtools view \
-b \
-h \
-F 2052 \
${sam2gtf_OUTD}/${SPECIES}_2_${MAP}_realigned.sam \
> ${sam2gtf_OUTD}/${SPECIES}_2_${MAP}_realigned.bam
bedtools bamtobed \
-split \
-i ${sam2gtf_OUTD}/${SPECIES}_2_${MAP}_realigned.bam \
| awk -F "\t" -v source="${SPECIES}_2_${MAP}" \
'{print $1"\t"source"\t""exon""\t"$2+1"\t"$3"\t"".""\t"$6"\t"".""\t""gene_id " "\"" $4 "\"" "; " "transcript_id " "\"" $4 "\"" ";"}' \
> ${sam2gtf_OUTD}/${SPECIES}_2_${MAP}_realign.gtf
Validation of transcript model estimation
In progress...
Quantification of reads
In progress...