Long read Method Comparison (Drosophila PacBio) - McIntyre-Lab/TranD GitHub Wiki
Table of Contents
Below is a description of the method used to compare the long read clustering methods FLAIR and IsoSeq3 cluster using Drosophila PacBio data (D. melanogaster and D. simulans head data from males and females).
More detailed documentation and scripts are in the links provided.
For more information on how to run TranD, see the User Guide.
Prepare baseline reads
Below are the steps that outline a general process for taking raw sequencing data from and processing it into a mapped annotation file (GTF) that can be analyzed via TranD.
Library processing
PacBio library processing for D. melanogaster and D. simulans head data.
There are 8 samples: 2 species, 1 genotype, 2 treatments (ethanol/no ethanol) and replicate of D. simulans female treated (used in calibration).
Starting input is a *.subreads.bam file - sequences include subreads separated by primer and adapter sequences.
Circular consensus sequence (CCS)
Consensus calling outputs consensus sequence from each ZMW with at least one full pass in ccs.bam
The argument --minPasses is the minimum number of full passes per read (each primer is seen at least once).
ccs --min-passes 1 \
--num-threads 12 \
--log-file ${LOG}/${SAMPLEID}.log \
--report-file ${OUTD}/${SAMPLEID}_report.txt \
${SUBREADS_BAM} \
${OUTD}/${SAMPLEID}.ccs.bam
- ${SAMPLEID} is a variable used to represent the sample to run through all of them in a bash array
- ${OUTD} is an output directory
Primer removal via lima
Trim primers (and barcodes if present) via lima.
NOTE: lima can also be used for demultiplexing (in this example, the data is not barcoded).
lima --num-threads 12 \
--isoseq \
${OUTD}/${SAMPLEID}.ccs.bam \
${PRIMERS} \
${OUTD}/${SAMPLEID}_ccs_trimmed.bam
Trim poly-A tails (refine)
Remove poly-A tails and concatemers using IsoSeq3 Refine
isoseq3 refine \
--num-threads 12 \
--require-polya \
${OUTD}/${SAMPLEID}_ccs_trimmed.primer_5p--primer_3p.bam \
${PRIMERS} \
${OUTD}/${SAMPLEID}_ccs_trimmed.flnc.bam
Align sample reads to reference genome
Convert BAM files after refining to FASTQ files using bedtools bamtofastq:
bedtools bamtofastq \
-i ${IND}/${SAMPLEID}_ccs_trimmed.flnc.bam \
-fq ${OUTD}/${SAMPLEID}_ccs_trimmed.flnc.fq
Align to reference genome with minimap2. To use the --junc-bed option of minimap2 to prioritize reference junction mappings, paftools.js gff2bed is used to convert the reference GTF to a formatted BED file (see minimap2 documentation for instructions).
paftools.js gff2bed ${GTF} \
> ${ROZ}/${MAP}_anno.bed
minimap2 \
-t 12 \
-a \
-x splice \
--junc-bed ${ROZ}/${MAP}_anno.bed \
-u f \
--secondary=no \
-C 5 \
${REF} \
${IND}/${SAMPLEID}_ccs_trimmed.flnc.fq \
> ${OUTD}/${SAMPLEID}_2_${MAP}_unfiltered_minimap2.sam
- ${REF} refers to the reference FASTA file for the genome of the species mapped to
- ${MAP} is the species being mapped to ("mel" for melanogaster and "sim" for simulans)
Convert SAM alignment files to GTF annotation files
SAM files are converted to BAM files with samtools view
samtools view \
-h \
-F 2052 \
${OUTD}/${SAMPLEID}_2_${MAP}_unfiltered_minimap2.sam \
> ${OUTD}/${SAMPLEID}_2_${MAP}_minimap2.sam
samtools view -b ${IND}/${SAMPLEID}_2_${MAP}_minimap2.sam \
> ${IND}/${SAMPLEID}_2_${MAP}_minimap2.bam
BAM files are converted to BED12 files with bedtools bamtobed
bedtools bamtobed \
-split \
-i ${IND}/${SAMPLEID}_2_${MAP}_minimap2.bam \
> ${OUTD}/${SAMPLEID}_2_${MAP}_minimap2.bed
BED12 files are converted to GTF files using information from the BED12:
awk -v source=${SOURCE} -F "\t" '{print $1"\t"source"\texon\t"$2+1"\t"$3"\t.\t"$6"\t.\tgene_id \""$4"\"; transcript_id \""$4"\";"}' \
${OUTD}/${SAMPLEID}_2_${MAP}_minimap2.bed \
> ${OUTD}/${SAMPLEID}_2_${MAP}_minimap2.gtf
Get associated gene_id values
Concatenate all samples mapped to the given genome (D. melanogaster or D. simulans) so that common gene_id values can be associated across samples:
cat ${IND}/*_2_${MAP}_minimap2.gtf \
> ${ROZ}/combined_2_${MAP}_minimap2.gtf
To compare junction chains across transcripts with TranD, genes with overlapping exons (on the same strand) are given a common gene_id.GFFcompare is used to find overlapping exons:
gffcompare -r ${GTF} -o ${OUTD}/combined_2_${MAP} ${ROZ}/combined_2_${MAP}_minimap2.gtf
To update the GTF files with the corrected associated gene_id values, use the TranD utility correct_gffcompare_GTF_gene_id.py:
python correct_gffcompare_GTF_gene_id.py \
-a ${OUTD}/combined_2_${MAP}.annotated.gtf \
-g ${ROZ}/combined_2_${MAP}_minimap2.gtf \
-o ${OUTD}/combined_2_${MAP}_corrected_associated_gene.gtf
Split back out the combined samples by species using the TranD utility subset_gtf.py:
for SPECIES in mel sim; do
cat ${IND}/${SPECIES}_*_2_${MAP}_minimap2.gtf \
| awk '{print $10}' | sed s/'\"'/''/g | sed s/';'/''/g \
> ${ROZ}/${SPECIES}_2_${MAP}_reads.txt
python subset_gtf.py \
-g ${OUTD}/combined_2_${MAP}_corrected_associated_gene.gtf \
-t transcript_id \
-i ${ROZ}/${SPECIES}_2_${MAP}_reads.txt \
-o ${OUTD}/${SPECIES}_2_${MAP}_corrected_associated_gene.gtf
done
Read selection
Identify and remove unspliced fragments (NOTE: Keeping all novel/fusion loci and other classifications).
Flag unspliced fragment reads (monoexon read in multiexon reference gene).
Get GTF of isoforms that pass selection parameters.
Subset other outputs for selected reads.
Script:
## Flag monoexon reads and references
python ${SCRIPTS}/flag_monoexon_03avn.py \
-r ${GTF} \
-i ${IND}/${SPECIES}_2_${MAP}_corrected_associated_gene.gtf \
-o ${OUTD}/${SPECIES}_2_${MAP}_corrected_associated_gene_monoexon_flag.csv
## Get reads that are not unspliced fragments
flagCOL=$(head -1 ${OUTD}/${SPECIES}_2_${MAP}_corrected_associated_gene_monoexon_flag.csv \
| sed s/','/'\n'/g | cat -n | awk '$2=="flag_unspliced_fragment"{print $1}')
xcrptCOL=$(head -1 ${OUTD}/${SPECIES}_2_${MAP}_corrected_associated_gene_monoexon_flag.csv \
| sed s/','/'\n'/g | cat -n | awk '$2=="transcript_id"{print $1}')
awk -F "," -v flagcol=${flagCOL} -v xcrptcol=${xcrptCOL} 'NR!=1 && $(flagcol)==0 {print $(xcrptcol)}' \
${OUTD}/${SPECIES}_2_${MAP}_corrected_associated_gene_monoexon_flag.csv \
> ${OUTD}/${SPECIES}_2_${MAP}_baseline_selection_keep.txt
## Get GTF of read that pass selection parameters
python ${SCRIPTS}/subset_gtf.py \
-g ${IND}/${SPECIES}_2_${MAP}_corrected_associated_gene.gtf \
-t transcript_id \
-i ${OUTD}/${SPECIES}_2_${MAP}_baseline_selection_keep.txt \
-o ${OUTD}/${SPECIES}_2_${MAP}_corrected_associated_gene_baseline_selection.gtf
## Subset other output files for selected reads
samtools merge ${INDSAM}/${SPECIES}_2_${MAP}_minimap2.sam \
${INDSAM}/${SPECIES}_*_2_${MAP}_minimap2.sam
python $PROJ/scripts/Common_extracFeatureFromList.py \
--list ${OUTD}/${SPECIES}_2_${MAP}_baseline_selection_keep.txt \
--type sam \
--input ${INDSAM}/${SPECIES}_2_${MAP}_minimap2.sam \
--out ${OUTD}/${SPECIES}_2_${MAP}_baseline_selec.sam
samtools view -bS \
${OUTD}/${SPECIES}_2_${MAP}_baseline_selec.sam \
> ${OUTD}/${SPECIES}_2_${MAP}_baseline_selec.bam
Run IsoSeq3 cluster
Run IsoSeq3 cluster on baseline selection output from above across samples for each species mapped to each reference (aka, mel_2_mel, mel_2_sim, etc.):
isoseq3 cluster \
--num-threads 12 \
--log-file ${LOG}/${SPECIES}_2_${MAP}.log \
--verbose \
--use-qvs \
${IND}/${SPECIES}_2_${MAP}_baseline_selec.bam \
${OUTD}/${SPECIES}_2_${MAP}_isoseq3_clustered.bam
Re-align the resulting IsoSeq3 consensus sequences (hq AND lq files) to the genome and merge the hq and lq alignments:
minimap2 \
-t 12 \
-a \
-x splice \
--junc-bed ${ROZ}/${MAP}_anno.bed \
-u f \
--secondary=no \
-C 5 \
${REF} \
${OUTD}/${SPECIES}_2_${MAP}_isoseq3_clustered.lq.fasta.gz
minimap2 \
-t 12 \
-a \
-x splice \
--junc-bed ${ROZ}/${MAP}_anno.bed \
-u f \
--secondary=no \
-C 5 \
${REF} \
${OUTD}/${SPECIES}_2_${MAP}_isoseq3_clustered.hq.fasta.gz
samtools merge \
${sam2gtf_OUTD}/${SPECIES}_2_${MAP}_realigned.sam \
${ALIGN_OUTD}/${SPECIES}_2_${MAP}_realigned.hq.sam \
${ALIGN_OUTD}/${SPECIES}_2_${MAP}_realigned.lq.sam
Convert SAM -> BAM -> BED12 -> GTF and remove secondary alignment results by samtools view -F 2052:
samtools view \
-b \
-h \
-F 2052 \
${sam2gtf_OUTD}/${SPECIES}_2_${MAP}_realigned.sam \
> ${sam2gtf_OUTD}/${SPECIES}_2_${MAP}_realigned.bam
bedtools bamtobed \
-split \
-i ${sam2gtf_OUTD}/${SPECIES}_2_${MAP}_realigned.bam \
| awk -F "\t" -v source="${SPECIES}_2_${MAP}" \
'{print $1"\t"source"\t""exon""\t"$2+1"\t"$3"\t"".""\t"$6"\t"".""\t""gene_id " "\"" $4 "\"" "; " "transcript_id " "\"" $4 "\"" ";"}' \
> ${sam2gtf_OUTD}/${SPECIES}_2_${MAP}_realign.gtf
Run FLAIR
Run FLAIR on baseline selection output from above across samples for each species mapped to each reference (aka, mel_2_mel, mel_2_sim, etc.):
Sort and index the BAM file via samtools:
samtools sort \
${IND}/${SPECIES}_2_${MAP}_baseline_selec.bam \
> ${IND}/${SPECIES}_2_${MAP}_baseline_selec.sorted.bam
samtools index ${IND}/${SPECIES}_2_${MAP}_baseline_selec.sorted.bam
Concatenate the CCS FASTQ files for each reference by species:
cat $PROJ/dros_analysis/bam2fq/${SPECIES}_*_ccs_trimmed.flnc.fq \
> $PROJ/dros_analysis/bam2fq/${SPECIES}_ccs_trimmed.flnc.fq
FQ=$PROJ/dros_analysis/bam2fq/${SPECIES}_ccs_trimmed.flnc.fq
Use FLAIR tools to convert BAM files to BED12. Run FLAIR using correct and collapse mode (23).
python ${HPC_FLAIR_DIR}/flair/bin/bam2Bed12.py \
-i ${IND}/${SPECIES}_2_${MAP}_baseline_selec.sorted.bam \
> ${BED12}/${SPECIES}_2_${MAP}.bed12
python ${HPC_FLAIR_DIR}/flair/flair.py 23 \
-q ${BED12}/${SPECIES}_2_${MAP}.bed12 \
-g ${GENOME} \
-f ${ANNO} \
-r ${FQ} \
-o ${OUTD}/${SPECIES}_2_${MAP}_flair \
--keep_intermediate \
--generate_map
- ${GENOME} refers to the reference FASTA for the species mapped to
- ${ANNO} refers to the reference GTF for the species mapped to
Comparing Long Read Methods
At this point, there are four annotations available per species (ex: mel2mel_isoseq, sim2mel_isoseq, mel2mel_flair, sim2mel_flair) that allows for comparison of the long read methods used above.
Our method of comparison (that results in the files in the Precomputed Files) is done as follows:
The two methods can be compared by creating a TMM using the mel2mel (or sim2sim) GTF created from both Isoseq3 Cluster and FLAIR. This comparison method is the exact same as the method used in the [Drosophila Species Comparison](https://github.com/McIntyre-Lab/TranD/wiki/Drosophila-Species-Comparison-(D.-melanogaster-vs.-D.-simulans), starting from getting the corrected associated gene_ids.
Other methods may involve creating a union reference within species and per method (ex: mel2mel and sim2mel from both methods-> mel_union_isoseq, mel_union_FLAIR) and running TranD 2GTF on those two union references.