Best practice for fusion transcript finding - Magdoll/cDNA_Cupcake GitHub Wiki
Last Updated: 08/23/2022
This tutorial contains guidelines for finding fusion transcripts. The content below are recommendations and is subject to change at any time. Feedbacks welcome!
Different aligners have very different parameters and performance for detecting chimeric mappings. I recommend running multiple aligners (potentially with different parameters) and compare results!
Tutorial on using GMAP, Minimap2, and deSALT
For example, here are parameters I tested using GMAP and Minimap2:
gmap -D [dir] -d hg38 -f samse -n 0 input.fasta > input.fasta.gmap.sam
minimap2 -ax splice -uf --secondary=no hg38.fa input.fasta > input.fasta.minimap2.sam
Cupcake tutorial on using fusion_finder.py
Once again, I recommend playing with different parameters on fusion_finder.py
. Here are some examples:
sort -k 3,3 -k 4,4n input.fasta.minimap2.sam > input.fasta.minimap2.sorted.sam
fusion_finder.py --input input.fasta -s input.fasta.minimap2.sorted.sam \
--cluster_report cluster_report.csv \
-o output.fusion \
--min_locus_coverage_bp 500 -d 1000000
The output from fusion_finder.py
will be:
output.fusion.gff
output.fusion.rep.fq
output.fusion.abundance.txt
SQANTI3 now supports classifying fusion transcripts. You will need to use the --is_fusion
parameter.
python <path_to>/sqanti3_qc.py --gtf output.fusion.gff gencode.v33.annotation.gtf hg38.fa \
--is_fusion
The output from SQANTI3 includes the .classification.txt
which will tell you which annotated (or novel) genes each component of a fusion maps to.
(note: pigeon, the PacBio implementation of SQANTI3, currently does not support fusion isoforms. If you are working with fusion transcripts, please continue to use SQANTI3 in the meantime)
Now that we have SQANTI3 classification we can filter for fusion candidates based on:
- Both genes must be from different genes
- Both genes must be annotated (based on SQANTI3 classification)
The script usage is below:
usage: fusion_collate_info.py [-h] [--total_fl_count TOTAL_FL_COUNT]
[--config CONFIG] [--genome GENOME]
[--cds_gff CDS_GFF] [--orf_faa ORF_FAA]
[--min_fl_count MIN_FL_COUNT]
fusion_prefix class_filename gtf_filename
The fusion_prefix
is the output prefix from running fusion_finder.py
, so it would be output.fusion
for this example. The classification file is from SQANTI3 output. You should also provide the GTF file that was used as input to SQANTI3 (ex: Gencode GTF).
For example, you will run:
fusion_collate_info.py output.fusion output.fusion_classification.txt \
gencode.v36.annotation.gtf \
--genome hg38.fa
And this will generate an output table that looks like this: (only showing the first few columns)
UniqueID | FusionName | LeftGeneName | LeftGeneID | RightGeneName | LeftBreakpoint | RightGeneID | RightGeneName | RightBreakpoint |
---|---|---|---|---|---|---|---|---|
PBfusion.142 | CAMTA1--CAMTA1 | ACOT7 | CAMTA1 | ENSGxxxxx | chr1:6825211:+ | ACOT7 | ENSGxxxxx | chr1:6264697:- |