Suggested Best Practices for Long Reads - McIntyre-Lab/TranD GitHub Wiki

Table of Contents

Overview of Best Practices

Long Read Flowchart

Prepare baseline reads

For PacBio

Starting input is a *.subreads.bam file - sequences include subreads separated by primer and adapter sequences.

PacBio library processing includes 3 main steps:

  1. Circular consensus sequence (CCS)
  2. Primer removal and demultiplexing (lima)
  3. Trim poly-A tails (refine)

1) Circular consensus sequence (CCS)

Consensus calling outputs consensus sequence from each ZMW with at least one full pass in ccs.bam

The argument --minPasses is the minimum number of full passes per read (each primer is seen at least once).

ccs --min-passes 1 \
    --num-threads 12 \
    --log-file ${LOG}/${SAMPLEID}.log \
    --report-file ${OUTD}/${SAMPLEID}_report.txt \
    ${SUBREADS_BAM} \
    ${OUTD}/${SAMPLEID}.ccs.bam

CCS documentation can be found here.

2) Primer removal and demultiplexing (lima)

Trim primers (and barcodes if present).

A primer file should be in FASTA format. For example:

>primer_5p
AAGCAGTGGTATCAACGCAGAGTACATGGG
>primer_3p
GTACTCTGCGTTGATACCACTGCTT

If reads are multiplexed, then a barcode FASTA file will be used instead:

>primer_5p
AAGCAGTGGTATCAACGCAGAGTACATGGGG
>sample_brain_3p
AAGCAGTGGTATCAACGCAGAGTACCACATATCAGAGTGCG
>sample_liver_3p
AAGCAGTGGTATCAACGCAGAGTACACACACAGACTGTGAG

For primer trimming without demultiplexing, we have previously used the following:

lima --num-threads 12 \
    --isoseq \
    ${OUTD}/${SAMPLEID}.ccs.bam \
    primers.fasta \
    ${OUTD}/${SAMPLEID}_ccs_trimmed.bam

However, we are currently testing preset options. For primer trimming and demultiplexing we are currently testing:

lima --num-threads 12 \
    --split-named \
    --hifi-present SYMMETRIC-ADAPTERS \
    --peek-guess \
    $OUT/${NAME}.subreads.ccs.bam \
    barcodes.fasta \
    $OUTD/${NAME}_demuxed.bam

NOTE: To store unbarcoded reads in a separate file, use the --store-unbarcoded argument as well.

lima documentation can be found here.

3) Trim poly-A tails (refine)

Remove poly-A tails and concatemers

isoseq3 refine \
    --num-threads 12 \
    --require-polya \
    ${OUTD}/${SAMPLEID}_ccs_trimmed.primer_5p--primer_3p.bam \
    ${PRIMERS} \
    ${OUTD}/${SAMPLEID}_ccs_trimmed.flnc.bam

Convert BAM files after refining to FASTQ files using bedtools bamtofastq:

bedtools bamtofastq \
    -i ${IND}/${SAMPLEID}_ccs_trimmed.flnc.bam \
    -fq ${OUTD}/${SAMPLEID}_ccs_trimmed.flnc.fq

For ONT

For ONT data we recommend:

  1. pycoQC - to evaluate quality of ONT run and distribution of barcodes if present
  2. pychopper - to classify and trim reads based on primers

1) pycoQC

PycoQC generates an interactive HTML report of ONT run quality metrics using the run summary file (sequencing_summary_*.txt)

pycoQC -f sequencing_summary.txt -o pycoQC_output.html

PycoQC documentation can be found here.

2) pychopper

Pychopper identifies full-length (fl) reads, “rescued” reads (fused reads that could be separated), and unclass reads (or reads without proper primers).

NOTE: Only fl reads have primers trimmed.

pychopper \
    -t 24 \
    -r report_sample.pdf \
    -u unclass_sample.fq \
    -w rescue_sample.fq \
    sample.fastq \
    fl_sample.fq

NOTE: Here we are using pHMM default.

Pychopper documentation can be found here.

Align reads to genome

Align to reference genome with minimap2.

To use the --junc-bed option of minimap2 to prioritize reference junction mappings, paftools.js gff2bed is used to convert the reference GTF to a formatted BED file (see minimap2 documentation for instructions).

paftools.js gff2bed reference_annotation.gtf \
    > annotation.bed

Aligning stranded PacBio and ONT

For stranded libraries (or for ONT fl and rescue out of pychopper) use the following minimap2 arguments:

minimap2 \
    -t 12 \
    -a \
    -x splice \
    --junc-bed annotation.bed \
    -u f \
    --secondary=no \
    -C 5 \
    genome.fasta \
    sample.fq \
    > sample_minimap2.sam

** Aligning unstranded**

NOTE: All arguments are identical to stranded alignment, other than the -u f option. This can be used for unstraded PacBio or the unclassified ONT reads out of pychopper.

minimap2 \
    -t 12 \
    -a \
    -x splice \
    --junc-bed annotation.bed \
    --secondary=no \
    -C 5 \
    genome.fasta \
    sample.fq \
    > sample_minimap2.sam

NOTE: The authors of minimap2 have recommended options for different types of reads on the minimap2 github.

Convert SAM alignment files to GTF annotation files

SAM files are converted to BAM files with samtools view.

First, unmapped and ambiguous alignments are removed:

samtools view \
    -h \
    -F 2052 \
    sample_minimap2.sam \
    > sample_minimap2_mapped.sam

Then the SAM is converted to a BAM:

samtools view -b sample_minimap2_mapped.sam \
    > sample_minimap2_mapped.bam

BAM files are converted to BED12 files with bedtools bamtobed

bedtools bamtobed \
    -split \
    -i sample_minimap2_mapped.bam \
    > sample_minimap2_mapped.bed

BED12 files are converted to GTF files using information from the BED12:

awk -v source=${SOURCE} -F "\t" '{print $1"\t"source"\texon\t"$2+1"\t"$3"\t.\t"$6"\t.\tgene_id \""$4"\"; transcript_id \""$4"\";"}' \
    sample_minimap2_mapped.bed \
    > sample_minimap2_mapped.gtf

SQANTI3 QC of long reads

SQANTI3 QC can be used to evaluate the mapped long reads against the reference annoation.

sqanti3_qc.py \
    --skipORF \
    --report pdf \
    --force_id_ignore \
    --dir /path/to/outdir \
    -o sample \
    sample_minimap2_mapped.gtf \
    reference_annotation.gtf \
    reference_genome.fasta

##  --skipORF option to skip ORF prediction (runs much faster)
##  --report pdf option to make a SQANTI QC report in PDF format
##  --force_id_ignore option to allow transcript ID names that are not like PB.X.Y
##  Order of files is (1) isoforms (FA/FQ/GTF), (2) annotation (ref GTF/GFF), (3) genome (ref FA)
    #   --dir is path to output directory
    #   -o prefix name to use

A high quality PacBio dataset compared to a reasonable reference annotation should have around 80% full-splice matches (FSM) to reference transcripts. If you find proportions substantially lower than this, we recommend re-evaluating previous steps and run quality to identify potential read losses.

ONT datasets are expected to have lower proportions of FSM.

Get associated gene_id values

SQANTI3 QC output can be used to get associated gene_id values, however reads in novel gene loci will each get their own arbitrary gene_id value regardless of any overlap between them. We have therefore used GFFcompare to find overlapping gene loci across reads.

Concatenate all samples mapped to the given genome so that common gene_id values can be associated across samples:

cat *_minimap2_mapped.gtf \
    > combined_mapped.gtf

To compare splice junction chains across transcripts with TranD or other tools, genes with overlapping exons (on the same strand) are given a common gene_id GFFcompare is used to find overlapping exons:

gffcompare -r reference_annotation.gtf -o /path/to/output_prefix combined_mapped.gtf

To update the GTF files with the corrected associated gene_id values, use the TranD utility correct_gffcompare_GTF_gene_id.py:

python correct_gffcompare_GTF_gene_id.py \
    -a /path/to/output_prefix.annotated.gtf \
    -g combined_mapped.gtf \
    -o combined_corrected_associated_gene.gtf

Split back out the combined samples by species using the TranD utility subset_gtf.py:

First make a list of the transcript_id (or read_id) values for each sample:

cat sample_minimap2_mapped.gtf | awk '{print $10}' | sed s/'\"'/''/g \
    > sample_read_id.txt

Then subset:

python subset_gtf.py \
    -g combined_corrected_associated_gene.gtf \
    -t transcript_id \
    -i sample_read_id.txt \
    -o sample_corrected_associated_gene.gtf

Read selection

Identify unspliced fragment reads (monoexon read in multiexon reference gene).

Get GTF of isoforms that pass selection parameters.

Subset other outputs for selected reads.

NOTE: Scripts used in this method can be found here(LOCATION BEING UPDATED).

## Flag monoexon reads and references
python ${SCRIPTS}/flag_monoexon_03avn.py \
    -r reference_annotation.gtf \
    -i sample_corrected_associated_gene.gtf \
    -o sample_corrected_associated_gene_monoexon_flag.csv

## Get reads that are not unspliced fragments
flagCOL=$(head -1 sample_corrected_associated_gene_monoexon_flag.csv \
    | sed s/','/'\n'/g | cat -n | awk '$2=="flag_unspliced_fragment"{print $1}')
xcrptCOL=$(head -1 sample_corrected_associated_gene_monoexon_flag.csv \
    | sed s/','/'\n'/g | cat -n | awk '$2=="transcript_id"{print $1}')
awk -F "," -v flagcol=${flagCOL} -v xcrptcol=${xcrptCOL} 'NR!=1 && $(flagcol)==0 {print $(xcrptcol)}' \
    sample_corrected_associated_gene_monoexon_flag.csv \
    > sample_baseline_selection_keep.txt

## Get GTF of read that pass selection parameters
python ${SCRIPTS}/subset_gtf.py \
    -g sample_corrected_associated_gene.gtf \
    -t transcript_id \
    -i sample_baseline_selection_keep.txt \
    -o sample_corrected_associated_gene_baseline_selection.gtf

## Subset other output files for selected reads
python $PROJ/scripts/Common_extracFeatureFromList.py \
    --list sample_baseline_selection_keep.txt \
    --type sam \
    --input sample_minimap2_mapped.sam \
    --out sample_baseline_selec.sam

samtools view -bS \
    sample_baseline_selec.sam \
    > sample_baseline_selec.bam

Transcript model estimation

There are many tools that can be used to estimate transcript models from long reads. We provide an example of a reference-free method (IsoSeq3 cluster) and a method that utilizes a reference annotation to guide estimates (FLAIR). IsoSeq3 cluster has been developed for PacBio and FLAIR for ONT reads, however they can be used on either data type.

NOTE: It is important to consider the types of samples you have and whether it makes sense to combine some samples prior to transcript model estimation.

IsoSeq3 cluster

Run IsoSeq3 cluster on baseline selection output across samples::

isoseq3 cluster \
    --num-threads 12 \
    --log-file sample_isoseq3Cluster.log \
    --verbose \
    --use-qvs \
    sample_baseline_selec.bam \
    sample_isoseq3Cluster.bam

Re-align the resulting IsoSeq3 consensus sequences (hq AND lq files) to the genome and merge the hq and lq alignments:

minimap2 \
    -t 12 \
    -a \
    -x splice \
    --junc-bed ${ROZ}/${MAP}_anno.bed \
    -u f \
    --secondary=no \
    -C 5 \
    ${REF} \
    ${OUTD}/${SPECIES}_2_${MAP}_isoseq3_clustered.lq.fasta.gz

minimap2 \
    -t 12 \
    -a \
    -x splice \
    --junc-bed ${ROZ}/${MAP}_anno.bed \
    -u f \
    --secondary=no \
    -C 5 \
    ${REF} \
    ${OUTD}/${SPECIES}_2_${MAP}_isoseq3_clustered.hq.fasta.gz

samtools merge \
    ${sam2gtf_OUTD}/${SPECIES}_2_${MAP}_realigned.sam \
    ${ALIGN_OUTD}/${SPECIES}_2_${MAP}_realigned.hq.sam \
    ${ALIGN_OUTD}/${SPECIES}_2_${MAP}_realigned.lq.sam

Convert SAM-BAM-BED12-GTF and remove secondary alignment results by samtools view -F 2052:

samtools view \
    -b \
    -h \
    -F 2052 \
    ${sam2gtf_OUTD}/${SPECIES}_2_${MAP}_realigned.sam \
    > ${sam2gtf_OUTD}/${SPECIES}_2_${MAP}_realigned.bam

bedtools bamtobed \
    -split \
    -i ${sam2gtf_OUTD}/${SPECIES}_2_${MAP}_realigned.bam \
    | awk -F "\t" -v source="${SPECIES}_2_${MAP}" \
    '{print $1"\t"source"\t""exon""\t"$2+1"\t"$3"\t"".""\t"$6"\t"".""\t""gene_id " "\"" $4 "\"" "; " "transcript_id " "\"" $4 "\"" ";"}' \
    > ${sam2gtf_OUTD}/${SPECIES}_2_${MAP}_realign.gtf

Validation of transcript model estimation

In progress...

Quantification of reads

In progress...