OutputDescription - Oshlack/JAFFA GitHub Wiki

JAFFA will output two files, named jaffa_results.csv and jaffa_results.fasta (by default).

jaffa_results.csv

This is an Excel-readable table that summarises the fusions found. It has the following fields:

  • sample - This is the sample name. JAFFA takes the sample names from the input file names.
  • fusion genes - The gene symbols for the genes involved in the fusion event. When the fusion is in frame, JAFFA infers the transcriptional direction and orders the names accordingly.
  • chrom1/chrom2/base1/base2 - The positions of the breakpoints in the genome, where 1 and 2 follow the same order as the gene names above.
  • gap (kb) - How far apart are the breakpoints in the genome? This is only really relevant for intrachromosomal events.
  • spanning pairs - The number of read pairs in which each read aligns entirely on one side of the breakpoint, with the two reads on opposite sides. For fusions with multiple breakpoints, the same spanning pairs are reported for all breakpoints, i.e. they are counted multiple times, so they are likely to be overestimated for minor isoforms. For some modes, you might see a "-". This indicates that no spanning pairs were found, but that the contig had only a small amount of flanking sequence to align reads to, i.e. the spanning pairs result may not be indicative of the true support for the fusion event.
  • spanning reads - The number of reads which cover the breakpoint.
  • inframe - Do the fusion genes share the same frame? Note that this is only calculated if "aligns" is true. Otherwise "NA" is given.
  • aligns - This indicates whether both breakpoints lie on intron-exon boundaries. This would be consistent with a genomic breakpoint in an intron and splice sites being preserved.
  • rearrangement - This is true if the genes are on different chromosomes, if there was an inversion, or if there was any other rearrangement, such as a change in direction, i.e. anything inconsistent with the structure of the human reference genome.
  • contig - Either the read ID or the contig ID from the assembly.
  • contig break - The position of the breakpoint within the read or contig.
  • classification - This is the prioritisation of the fusions. It is decided in the following way:
    • HighConfidence - aligns to exons and has at least one spanning read and one spanning pair (paired-end data) or multiple spanning reads (single-end data).
    • MediumConfidence - aligns to exons and has at least two spanning reads.
    • LowConfidence - does not align to exons but has at least one spanning read and one spanning pair (paired-end data) or multiple spanning reads (single-end data).
    • PotentialTransSplicing - aligns to exons, has one spanning read and no spanning pairs. These are often seen in healthy samples.
  • known - Is the fusion reported in the Mitelman database? Fusions seen in the Mitelman database get bumped up a classification group.

In our validation tests, almost all true positives were classified as HighConfidence or MediumConfidence, and all false positives were classified as LowConfidence. Therefore we recommend focusing on the High and Medium candidates.
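As a rough illustration of that recommendation, the following sketch keeps only the HighConfidence and MediumConfidence rows of the table. It assumes the fields are comma-separated, optionally double-quoted, and contain no embedded commas. The function name filter_confident is hypothetical and not part of JAFFA.

```shell
# Print the header plus every row whose "classification" column is
# HighConfidence or MediumConfidence.
# Assumption: comma-separated fields, optionally double-quoted, with no
# commas inside a field. filter_confident is a hypothetical helper name.
filter_confident() {
    awk -F',' '
        NR == 1 {
            # Find the classification column by name rather than by position.
            for (i = 1; i <= NF; i++) {
                name = $i
                gsub(/["\r]/, "", name)
                if (name == "classification") col = i
            }
            print
            next
        }
        col {
            value = $col
            gsub(/["\r]/, "", value)
            if (value == "HighConfidence" || value == "MediumConfidence") print
        }
    ' "$1"
}
```

For example, `filter_confident jaffa_results.csv > confident.csv` would write the reduced table to a new file (confident.csv is an arbitrary name chosen here).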

jaffa_results.fasta

This file contains one sequence for each breakpoint identified. The ID of each sequence is in the format:

<sample>---<fusion genes>---<contig>

The two bases either side of the breakpoint are in lower case.
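The ID format and the lower-case marking can both be read back with a few lines of awk. This is a minimal sketch, assuming that none of the three ID parts contains "---" itself and that each sequence sits on a single line. The function names list_fasta_ids and breakpoint_offsets are hypothetical helpers, not part of JAFFA.

```shell
# Split each header into sample, fusion genes and contig (tab-separated).
# Assumption: the three parts never contain the separator "---" themselves.
list_fasta_ids() {
    awk '
        BEGIN { FS = "---" }
        /^>/ {
            sub(/^>/, "", $1)      # drop the leading ">" of the header
            printf "%s\t%s\t%s\n", $1, $2, $3
        }
    ' "$1"
}

# Report, for each record, the 1-based position of the first lower-case base.
# Assumption: every sequence is on one line directly after its header.
breakpoint_offsets() {
    awk '
        /^>/ { id = substr($0, 2); next }
        match($0, /[acgtn]+/) { printf "%s\t%d\n", id, RSTART }
    ' "$1"
}
```

Both helpers take the FASTA file as their only argument, e.g. `list_fasta_ids jaffa_results.fasta`.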

You will often find that the sequences are not full-length transcripts. This is because the de novo assembly is not always able to assemble the full transcript. The start and end of the assembled sequence do not indicate the actual start and end of the real transcript. If you used reads, the sequence will just be the read sequence.

Often there will be more than one contig for each breakpoint. We only provide the sequence of one of these: a representative, chosen as the contig with the most supporting reads. If you want to see all contigs that span a particular breakpoint, you can search in the intermediate JAFFA files like so:

grep <fusion genes> <sample>/<sample>.txt

The first column contains the contig IDs. Get the sequences like so:

grep -A1 "^><contig>" <sample>/<sample>.fusions.fa
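The two steps can be chained so that every contig sequence for one fusion is printed in a single call. This is only a sketch: it assumes the contig ID is the first whitespace-delimited column of <sample>/<sample>.txt and that each sequence in <sample>.fusions.fa sits on a single line after its header, as the use of grep -A1 implies. The function name contigs_for_fusion is hypothetical.

```shell
# Print every contig sequence that supports one fusion in one sample.
# Arguments: $1 = sample name, $2 = fusion genes as written in the results table.
# Assumptions: contig IDs are the first whitespace-delimited column of
# <sample>/<sample>.txt, and each sequence occupies one line after its header.
contigs_for_fusion() {
    sample=$1
    genes=$2
    # -F treats the gene names as a fixed string rather than a pattern.
    grep -F -- "$genes" "$sample/$sample.txt" | awk '{ print $1 }' | sort -u |
    while IFS= read -r contig; do
        grep -A1 -- "^>$contig" "$sample/$sample.fusions.fa"
    done
}
```

Run it from the directory that holds the sample subdirectories, e.g. `contigs_for_fusion <sample> <fusion genes>`. As with the plain grep, a contig ID that is a prefix of another ID will match both records.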

Other files

Intermediate files are stored in each of the sample subdirectories. These are not intended for the user, but are useful for diagnosing issues, and for rerunning the pipeline without repeating steps.
