Best practice for fusion transcript finding - Magdoll/cDNA

Last Updated: 08/23/2022

This tutorial contains guidelines for finding fusion transcripts. The content below are recommendations and is subject to change at any time. Feedbacks welcome!

Mapping Transcript to Genome

Different aligners have very different parameters and performance for detecting chimeric mappings. I recommend running multiple aligners (potentially with different parameters) and compare results!

Tutorial on using GMAP, Minimap2, and deSALT

For example, here are parameters I tested using GMAP and Minimap2:

gmap -D [dir] -d hg38 -f samse -n 0 input.fasta > input.fasta.gmap.sam

minimap2 -ax splice -uf --secondary=no hg38.fa input.fasta > input.fasta.minimap2.sam

Run Fusion Finding script

Cupcake tutorial on using fusion_finder.py

Once again, I recommend playing with different parameters on fusion_finder.py. Here are some examples:

sort -k 3,3 -k 4,4n input.fasta.minimap2.sam > input.fasta.minimap2.sorted.sam

fusion_finder.py --input input.fasta -s input.fasta.minimap2.sorted.sam \
    --cluster_report cluster_report.csv \
     -o output.fusion \
     --min_locus_coverage_bp 500 -d 1000000

The output from fusion_finder.py will be:

output.fusion.gff
output.fusion.rep.fq
output.fusion.abundance.txt

Classify Fusion Genes using SQANTI3

SQANTI3 now supports classifying fusion transcripts. You will need to use the --is_fusion parameter.

python <path_to>/sqanti3_qc.py --gtf output.fusion.gff gencode.v33.annotation.gtf hg38.fa \
  --is_fusion

The output from SQANTI3 includes the .classification.txt which will tell you which annotated (or novel) genes each component of a fusion maps to.

(note: pigeon, the PacBio implementation of SQANTI3, currently does not support fusion isoforms. If you are working with fusion transcripts, please continue to use SQANTI3 in the meantime)

Collate Fusion Information and Filtering Candidates

Now that we have SQANTI3 classification we can filter for fusion candidates based on:

Both genes must be from different genes
Both genes must be annotated (based on SQANTI3 classification)

The script usage is below:

usage: fusion_collate_info.py [-h] [--total_fl_count TOTAL_FL_COUNT]
                              [--config CONFIG] [--genome GENOME]
                              [--cds_gff CDS_GFF] [--orf_faa ORF_FAA]
                              [--min_fl_count MIN_FL_COUNT]
                              fusion_prefix class_filename gtf_filename

The fusion_prefix is the output prefix from running fusion_finder.py, so it would be output.fusion for this example. The classification file is from SQANTI3 output. You should also provide the GTF file that was used as input to SQANTI3 (ex: Gencode GTF).

For example, you will run:

fusion_collate_info.py output.fusion output.fusion_classification.txt \
      gencode.v36.annotation.gtf \
      --genome hg38.fa

And this will generate an output table that looks like this: (only showing the first few columns)

UniqueID	FusionName	LeftGeneName	LeftGeneID	RightGeneName	LeftBreakpoint	RightGeneID	RightGeneName	RightBreakpoint
PBfusion.142	CAMTA1--CAMTA1	ACOT7	CAMTA1	ENSGxxxxx	chr1:6825211:+	ACOT7	ENSGxxxxx	chr1:6264697:-

Best practice for fusion transcript finding - Magdoll/cDNA_Cupcake GitHub Wiki

Mapping Transcript to Genome

Run Fusion Finding script

Classify Fusion Genes using SQANTI3

Collate Fusion Information and Filtering Candidates

⚠️ GitHub.com Fallback ⚠️

Best practice for fusion transcript finding - Magdoll/cDNA_Cupcake GitHub Wiki

Mapping Transcript to Genome

Run Fusion Finding script

Classify Fusion Genes using SQANTI3

Collate Fusion Information and Filtering Candidates

⚠️ **GitHub.com Fallback** ⚠️

⚠️ GitHub.com Fallback ⚠️