Long read Method Comparison (Drosophila PacBio) - McIntyre-Lab/TranD GitHub Wiki

Table of Contents


Below is a description of the method used to compare the long read clustering methods FLAIR and IsoSeq3 cluster using Drosophila PacBio data (D. melanogaster and D. simulans head data from males and females).

More detailed documentation and scripts are in the links provided.

For more information on how to run TranD, see the User Guide.

Prepare baseline reads

Below are the steps that outline a general process for taking raw sequencing data from and processing it into a mapped annotation file (GTF) that can be analyzed via TranD.

Library processing

PacBio library processing for D. melanogaster and D. simulans head data.

There are 8 samples: 2 species, 1 genotype, 2 treatments (ethanol/no ethanol) and replicate of D. simulans female treated (used in calibration).

Starting input is a *.subreads.bam file - sequences include subreads separated by primer and adapter sequences.

Circular consensus sequence (CCS)

CCS Documentation

Consensus calling outputs consensus sequence from each ZMW with at least one full pass in ccs.bam

The argument --minPasses is the minimum number of full passes per read (each primer is seen at least once).

ccs --min-passes 1 \
    --num-threads 12 \
    --log-file ${LOG}/${SAMPLEID}.log \
    --report-file ${OUTD}/${SAMPLEID}_report.txt \
    ${SUBREADS_BAM} \
    ${OUTD}/${SAMPLEID}.ccs.bam
  • ${SAMPLEID} is a variable used to represent the sample to run through all of them in a bash array
  • ${OUTD} is an output directory

Primer removal via lima

Trim primers (and barcodes if present) via lima.

NOTE: lima can also be used for demultiplexing (in this example, the data is not barcoded).

lima --num-threads 12 \
    --isoseq \
    ${OUTD}/${SAMPLEID}.ccs.bam \
    ${PRIMERS} \
    ${OUTD}/${SAMPLEID}_ccs_trimmed.bam

Trim poly-A tails (refine)

Remove poly-A tails and concatemers using IsoSeq3 Refine

isoseq3 refine \
    --num-threads 12 \
    --require-polya \
    ${OUTD}/${SAMPLEID}_ccs_trimmed.primer_5p--primer_3p.bam \
    ${PRIMERS} \
    ${OUTD}/${SAMPLEID}_ccs_trimmed.flnc.bam

Align sample reads to reference genome

Convert BAM files after refining to FASTQ files using bedtools bamtofastq:

bedtools bamtofastq \
    -i ${IND}/${SAMPLEID}_ccs_trimmed.flnc.bam \
    -fq ${OUTD}/${SAMPLEID}_ccs_trimmed.flnc.fq

Align to reference genome with minimap2. To use the --junc-bed option of minimap2 to prioritize reference junction mappings, paftools.js gff2bed is used to convert the reference GTF to a formatted BED file (see minimap2 documentation for instructions).

paftools.js gff2bed ${GTF} \
    > ${ROZ}/${MAP}_anno.bed

minimap2 \
    -t 12 \
    -a \
    -x splice \
    --junc-bed ${ROZ}/${MAP}_anno.bed \
    -u f \
    --secondary=no \
    -C 5 \
    ${REF} \
    ${IND}/${SAMPLEID}_ccs_trimmed.flnc.fq \
    > ${OUTD}/${SAMPLEID}_2_${MAP}_unfiltered_minimap2.sam
  • ${REF} refers to the reference FASTA file for the genome of the species mapped to
  • ${MAP} is the species being mapped to ("mel" for melanogaster and "sim" for simulans)

Convert SAM alignment files to GTF annotation files

SAM files are converted to BAM files with samtools view

samtools view \
    -h \
    -F 2052 \
    ${OUTD}/${SAMPLEID}_2_${MAP}_unfiltered_minimap2.sam \
    > ${OUTD}/${SAMPLEID}_2_${MAP}_minimap2.sam

samtools view -b ${IND}/${SAMPLEID}_2_${MAP}_minimap2.sam \
    > ${IND}/${SAMPLEID}_2_${MAP}_minimap2.bam

BAM files are converted to BED12 files with bedtools bamtobed

bedtools bamtobed \
    -split \
    -i ${IND}/${SAMPLEID}_2_${MAP}_minimap2.bam \
    > ${OUTD}/${SAMPLEID}_2_${MAP}_minimap2.bed

BED12 files are converted to GTF files using information from the BED12:

awk -v source=${SOURCE} -F "\t" '{print $1"\t"source"\texon\t"$2+1"\t"$3"\t.\t"$6"\t.\tgene_id \""$4"\"; transcript_id \""$4"\";"}' \
    ${OUTD}/${SAMPLEID}_2_${MAP}_minimap2.bed \
    > ${OUTD}/${SAMPLEID}_2_${MAP}_minimap2.gtf

Get associated gene_id values

Concatenate all samples mapped to the given genome (D. melanogaster or D. simulans) so that common gene_id values can be associated across samples:

cat ${IND}/*_2_${MAP}_minimap2.gtf \
    > ${ROZ}/combined_2_${MAP}_minimap2.gtf

To compare junction chains across transcripts with TranD, genes with overlapping exons (on the same strand) are given a common gene_id.GFFcompare is used to find overlapping exons:

gffcompare -r ${GTF} -o ${OUTD}/combined_2_${MAP} ${ROZ}/combined_2_${MAP}_minimap2.gtf

To update the GTF files with the corrected associated gene_id values, use the TranD utility correct_gffcompare_GTF_gene_id.py:

python correct_gffcompare_GTF_gene_id.py \
    -a ${OUTD}/combined_2_${MAP}.annotated.gtf \
    -g ${ROZ}/combined_2_${MAP}_minimap2.gtf \
    -o ${OUTD}/combined_2_${MAP}_corrected_associated_gene.gtf

Split back out the combined samples by species using the TranD utility subset_gtf.py:

for SPECIES in mel sim; do
    cat ${IND}/${SPECIES}_*_2_${MAP}_minimap2.gtf \
        | awk '{print $10}' | sed s/'\"'/''/g | sed s/';'/''/g \
        > ${ROZ}/${SPECIES}_2_${MAP}_reads.txt
    python subset_gtf.py \
        -g ${OUTD}/combined_2_${MAP}_corrected_associated_gene.gtf \
        -t transcript_id \
        -i ${ROZ}/${SPECIES}_2_${MAP}_reads.txt \
        -o ${OUTD}/${SPECIES}_2_${MAP}_corrected_associated_gene.gtf
done

Read selection

Identify and remove unspliced fragments (NOTE: Keeping all novel/fusion loci and other classifications).

Flag unspliced fragment reads (monoexon read in multiexon reference gene).

Get GTF of isoforms that pass selection parameters.

Subset other outputs for selected reads.

Script:

## Flag monoexon reads and references
python ${SCRIPTS}/flag_monoexon_03avn.py \
    -r ${GTF} \
    -i ${IND}/${SPECIES}_2_${MAP}_corrected_associated_gene.gtf \
    -o ${OUTD}/${SPECIES}_2_${MAP}_corrected_associated_gene_monoexon_flag.csv

## Get reads that are not unspliced fragments
flagCOL=$(head -1 ${OUTD}/${SPECIES}_2_${MAP}_corrected_associated_gene_monoexon_flag.csv \
    | sed s/','/'\n'/g | cat -n | awk '$2=="flag_unspliced_fragment"{print $1}')
xcrptCOL=$(head -1 ${OUTD}/${SPECIES}_2_${MAP}_corrected_associated_gene_monoexon_flag.csv \
    | sed s/','/'\n'/g | cat -n | awk '$2=="transcript_id"{print $1}')
awk -F "," -v flagcol=${flagCOL} -v xcrptcol=${xcrptCOL} 'NR!=1 && $(flagcol)==0 {print $(xcrptcol)}' \
    ${OUTD}/${SPECIES}_2_${MAP}_corrected_associated_gene_monoexon_flag.csv \
    > ${OUTD}/${SPECIES}_2_${MAP}_baseline_selection_keep.txt

## Get GTF of read that pass selection parameters
python ${SCRIPTS}/subset_gtf.py \
    -g ${IND}/${SPECIES}_2_${MAP}_corrected_associated_gene.gtf \
    -t transcript_id \
    -i ${OUTD}/${SPECIES}_2_${MAP}_baseline_selection_keep.txt \
    -o ${OUTD}/${SPECIES}_2_${MAP}_corrected_associated_gene_baseline_selection.gtf

## Subset other output files for selected reads
samtools merge ${INDSAM}/${SPECIES}_2_${MAP}_minimap2.sam \
    ${INDSAM}/${SPECIES}_*_2_${MAP}_minimap2.sam

python $PROJ/scripts/Common_extracFeatureFromList.py \
    --list ${OUTD}/${SPECIES}_2_${MAP}_baseline_selection_keep.txt \
    --type sam \
    --input ${INDSAM}/${SPECIES}_2_${MAP}_minimap2.sam \
    --out ${OUTD}/${SPECIES}_2_${MAP}_baseline_selec.sam

samtools view -bS \
    ${OUTD}/${SPECIES}_2_${MAP}_baseline_selec.sam \
    > ${OUTD}/${SPECIES}_2_${MAP}_baseline_selec.bam

Run IsoSeq3 cluster

Run IsoSeq3 cluster on baseline selection output from above across samples for each species mapped to each reference (aka, mel_2_mel, mel_2_sim, etc.):

isoseq3 cluster \
    --num-threads 12 \
    --log-file ${LOG}/${SPECIES}_2_${MAP}.log \
    --verbose \
    --use-qvs \
    ${IND}/${SPECIES}_2_${MAP}_baseline_selec.bam \
    ${OUTD}/${SPECIES}_2_${MAP}_isoseq3_clustered.bam

Re-align the resulting IsoSeq3 consensus sequences (hq AND lq files) to the genome and merge the hq and lq alignments:

minimap2 \
    -t 12 \
    -a \
    -x splice \
    --junc-bed ${ROZ}/${MAP}_anno.bed \
    -u f \
    --secondary=no \
    -C 5 \
    ${REF} \
    ${OUTD}/${SPECIES}_2_${MAP}_isoseq3_clustered.lq.fasta.gz

minimap2 \
    -t 12 \
    -a \
    -x splice \
    --junc-bed ${ROZ}/${MAP}_anno.bed \
    -u f \
    --secondary=no \
    -C 5 \
    ${REF} \
    ${OUTD}/${SPECIES}_2_${MAP}_isoseq3_clustered.hq.fasta.gz

samtools merge \
    ${sam2gtf_OUTD}/${SPECIES}_2_${MAP}_realigned.sam \
    ${ALIGN_OUTD}/${SPECIES}_2_${MAP}_realigned.hq.sam \
    ${ALIGN_OUTD}/${SPECIES}_2_${MAP}_realigned.lq.sam

Convert SAM -> BAM -> BED12 -> GTF and remove secondary alignment results by samtools view -F 2052:

samtools view \
    -b \
    -h \
    -F 2052 \
    ${sam2gtf_OUTD}/${SPECIES}_2_${MAP}_realigned.sam \
    > ${sam2gtf_OUTD}/${SPECIES}_2_${MAP}_realigned.bam

bedtools bamtobed \
    -split \
    -i ${sam2gtf_OUTD}/${SPECIES}_2_${MAP}_realigned.bam \
    | awk -F "\t" -v source="${SPECIES}_2_${MAP}" \
    '{print $1"\t"source"\t""exon""\t"$2+1"\t"$3"\t"".""\t"$6"\t"".""\t""gene_id " "\"" $4 "\"" "; " "transcript_id " "\"" $4 "\"" ";"}' \
    > ${sam2gtf_OUTD}/${SPECIES}_2_${MAP}_realign.gtf

Run FLAIR

Run FLAIR on baseline selection output from above across samples for each species mapped to each reference (aka, mel_2_mel, mel_2_sim, etc.):

Sort and index the BAM file via samtools:

samtools sort \
    ${IND}/${SPECIES}_2_${MAP}_baseline_selec.bam \
    > ${IND}/${SPECIES}_2_${MAP}_baseline_selec.sorted.bam
samtools index ${IND}/${SPECIES}_2_${MAP}_baseline_selec.sorted.bam

Concatenate the CCS FASTQ files for each reference by species:

cat $PROJ/dros_analysis/bam2fq/${SPECIES}_*_ccs_trimmed.flnc.fq \
    > $PROJ/dros_analysis/bam2fq/${SPECIES}_ccs_trimmed.flnc.fq

FQ=$PROJ/dros_analysis/bam2fq/${SPECIES}_ccs_trimmed.flnc.fq

Use FLAIR tools to convert BAM files to BED12. Run FLAIR using correct and collapse mode (23).

python ${HPC_FLAIR_DIR}/flair/bin/bam2Bed12.py \
    -i ${IND}/${SPECIES}_2_${MAP}_baseline_selec.sorted.bam \
    > ${BED12}/${SPECIES}_2_${MAP}.bed12

python ${HPC_FLAIR_DIR}/flair/flair.py 23 \
    -q ${BED12}/${SPECIES}_2_${MAP}.bed12 \
    -g ${GENOME} \
    -f ${ANNO} \
    -r ${FQ} \
    -o ${OUTD}/${SPECIES}_2_${MAP}_flair \
    --keep_intermediate \
    --generate_map
  • ${GENOME} refers to the reference FASTA for the species mapped to
  • ${ANNO} refers to the reference GTF for the species mapped to

Comparing Long Read Methods

At this point, there are four annotations available per species (ex: mel2mel_isoseq, sim2mel_isoseq, mel2mel_flair, sim2mel_flair) that allows for comparison of the long read methods used above.

Our method of comparison (that results in the files in the Precomputed Files) is done as follows:

The two methods can be compared by creating a TMM using the mel2mel (or sim2sim) GTF created from both Isoseq3 Cluster and FLAIR. This comparison method is the exact same as the method used in the [Drosophila Species Comparison](https://github.com/McIntyre-Lab/TranD/wiki/Drosophila-Species-Comparison-(D.-melanogaster-vs.-D.-simulans), starting from getting the corrected associated gene_ids.

Other methods may involve creating a union reference within species and per method (ex: mel2mel and sim2mel from both methods-> mel_union_isoseq, mel_union_FLAIR) and running TranD 2GTF on those two union references.