Tama Collapse - GenomeRIK/tama GitHub Wiki

Tama Collapse

TAMA Collapse is a tool that allows you to collapse redundant transcript models in your Iso-Seq data.

Detailed explanation of TAMA Collapse:

After you map your Iso-Seq data onto a genome assembly you will notice that within some gene models there are multiple transcripts which appear to be very similar. Collapsing is the process of figuring out which transcripts are similar enough to be collapsed into a single transcript model. While this may sound trivial, it is actually a complex problem. For example, let's take a look at the Iso-Seq Tofu algorithm for collapsing. While not specifically termed on the Tofu site there are 2 modes of collapsing allowed for Tofu Collapse which I have named "Transcription Start Site Collapse" (TSSC) and "Exon Cascade Collapse" (ECC).

TSSC will collapse all transcripts with matching 3' end, matching spice junctions, and matching number of exons. The 5' end (or transcription start site) can differ. ECC will collapse all transcripts with matching 3' end and matching splice junctions. However, the transcripts to be collapsed do not need to have the same number of exons. See below for a diagram of this behaviour:

You might be wondering why the 5' end is allowed to differ in both modes. This was done because the official Iso-Seq library preparation protocol does not include a 5' cap selection step. This means that transcripts which have been degraded from the 5' end are sequenced as well. Thus with the standard protocol it is impossible to distinguish transcripts based on their 5' end alone.

A more detailed explanation of this can be found in this paper:

https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-017-3691-9

If you have performed 5' cap selection on your Iso-Seq library then none of these algorithms will be appropriate for your data. However, TAMA Collapse is designed to work with both 5' cap selected and standard libraries.

One question you might have regarding these algorithms is "how much difference between exon start and end sites is allowed?" Indeed, if you look closely at your Iso-Seq data, you will see that very similar looking transcripts may have slight differences in their individual exon start and end sites. I call this phenomenon "wobble". One explanation for wobble is that a sequence error occured near the splice junction which caused the mapper to shift the exon start/end site slightly. In TAMA Collapse you can define the amount of difference you want to allow between exon start/end sites. TAMA Collapse also tracks the amount of wobble that occurs for each exon start/end site so you can see if there was a significant amount of wobble for a collapsed transcript model. Large amounts of wobble are caused by something I call "wobble walking". You can see what this looks like below:

As you can see in the above diagram, the first two exon start sites match given a 10 bp threshold. The second and third exon start sites also match given the same 10 bp threshold. Thus all three exon start sites "match". You can see that this can continue on which would cause a large amount of wobble for that feature.

With TAMA Collapse you can choose to either select the feature which is most common or the feature that would make the longest exon. The default is the most common feature. Basically what this does is look at all the transcripts which match and use the most common exon start/end site (the feature with the most support).

Other collapsing algortihms usually decide on the "correct" exon start site based on either reference information (i.e. short read data or public annotation) or based on canonical intronic splice donor-acceptor pairs (GT-AG). While this may seem like the right idea, it is masking over the information in your Iso-Seq data. Thus it projects or biases the processing toward an assumption which may not be true. However, if you do want to match the splice junctions to another transcriptome you can use TAMA Merge.

TAMA Collapse creates output files that contain all the information pertaining to wobble and sourcing so that you can see where there might be an anomaly with the collapsing. This ties into one of the core tenets of TAMA which is to "flag not hide" any unusual results.

In order to run TAMA Collapse with the ECC mode, you simply have to use the cap flag parameter "-x no_cap". In order to emulate the TSSC mode with TAMA you can just give a very large threshold for the 5' end. Thus TAMA Collapse can peform the same type of collapsing as TOFU Collapse.

NOTE: As of 2017-08-02, we noticed an issue with the strand information provided by TOFU Collapse when running it on bam files generated by GMAP. This was caused by TOFU Collapse using the XS:A? field to define the strand which is designated by GMAP. GMAP had a bug which sometimes gave the wrong strand information in that field. Newer versions of GMAP should have fixed this bug. However, to address this TAMA Collapse has a file output which shows when the XS:A? field disagrees with the strand information from the SAM flag.

GMAP actually uses the splice junction intronic donor-accepter sequences (canonical GT-AG) to "correct" the strand information. So if your mapped transcript has splice junctions which show the reverse complement of the canonical donor-accepter sequences, GMAP will flip the strand in the XS:A? field. The "strand_check" output file from TAMA Collapse contains these instances.

Manual

usage: tama_collapse.py [-h] [-s] [-f] [-p] [-x] [-e] [-c] [-i] [-a] [-m] [-z]

This script collapses mapped transcript models

arguments:

  -h, --help  show this help message and exit
  -s S        Sorted sam/bam file (required)(if using BAM file please use -b BAM flag as well)
  -f F        Fasta file (required)
  -p P        Output prefix (required)
  -x X        Capped flag: capped or no_cap
  -e E        Collapse exon ends flag: common_ends or longest_ends (default
              common_ends)
  -c C        Coverage (default 99)
  -i I        Identity (default 85)
  -icm ICM    Identity calculation method (default ident_cov for including coverage) (alternate is ident_map for excluding hard and soft clipping)
  -a A        5 prime threshold (default 10)
  -m M        Exon/Splice junction threshold (default 10)
  -z Z        3 prime threshold (default 10)
  -d D        Flag for merging duplicate transcript groups (default is merge_dup will merge duplicates ,no_merge quits when duplicates are found)
  -sj SJ      Use error threshold to prioritize the use of splice junction information from collapsing transcripts(default no_priority, activate with sj_priority)
  -sjt SJT    Threshold for detecting errors near splice junctions (default is 10bp)
  -lde LDE    Threshold for amount of local density error near splice junctions that is allowed (default is 1000 errors which practically means no threshold is applied)
  -ses SES    Simple error symbol. Use this to pick the symbol used to represent matches in the simple error string for LDE output.
  -b   BAM    Use BAM instead of SAM
  -log LOG    Turns off long log output of collapsing process. (default on, use log_off to turn off)
  -v   V      Prints out version date and exits.
  -rm  RM     Run mode allows you to use original or low_mem mode, default is original
  -vc  VC     Variation coverage threshold: Default 5 reads

Default command would look like this:

python tama_collapse.py -s mapped_reads.sam -f genome.fa -p prefix -x capped

Note on running TAMA Collapse:

The SAM file should be sorted. If it is compressed into a BAM file please use -b BAM to indicate this in the command. The fasta file input should be for the genome assembly that was used to create the SAM file. This tool is very memory greedy. If you are running a TAMA Collapse run on more than 500k reads it is probably better to split the sam file using tama_mapped_sam_splitter.py. See this page for directions: https://github.com/GenomeRIK/tama/wiki/TAMA-GO:-Split-Files

You can merge the resulting bed12 files using TAMA Merge (with no wobble) and the transcript and gene models will be identical to running TAMA Collapse on the whole SAM file.

Detail explanation of optional arguments:

-x X Capped flag: capped or no_cap

If you performed 5' cap selection during youe Iso-Seq library preparation you should use the "-x capped" option. This will not allow transcripts to be collapses if they have a different number of exons.

If you did not do 5' cap selection (currently official Iso-Seq library preparation protocol), choose "-x no_cap". This will collapse transcripts if they have the same 3' (exons and splice junctions) end but differing number of exons on the 5' end.

Basically using "no_cap" will make TAMA Collapse behave in almost the same way as ECC Tofu Collapse. The only difference is that Tofu Collapse behaves in a more "stochastic" way. By that I mean that instead of adding each shorter model to all matching longer models, Tofu Collapse will only match the first instance it sees. This saves compute time but it means that you are not able to use all the information available to predict the final features for the final collapsed transcript models.

-e E Collapse exon ends flag: common_ends or longest_ends (default common_ends)

When collapsing, this option determines if the selected exon start/end site is defined by the most common feature or the longest feature. Longest feature means the feature which makes the exon longer.

-c C Coverage (default 99)

Coverage is defined by the percentage of the original read that is mapped to the genome. In some cases regions at the beginnning or end of the read will be clipped from the mapping representation. The amount of this clipping is what defines the coverage value.

-i I Identity (default 85)

Identity is defined as the number of bases of the read that match the bases of the genome as per the genomic mapping. Basically it is how close the read sequence is to the genomic sequence.

-icm ICM Identity calculation method (default ident_cov for including coverage) (alternate is ident_map for excluding hard and soft clipping)

There are two Identity calculation methods. The default (ident_cov) is to include the clipped sequence ends as mismatches. This means that if the coverage for a mapped read is 95%, then the identity cannot be higher than 95%. The second method (ident_map) is to only assess the non-clipped mapped region. This means that even with a coverage of 95%, the identity can still be 100% if the non-clipped mapped region has all base pairs matching the reference genome assembly. The second method (ident_map) is useful in cases where the adapter sequences were not trimmed (ie Nanopore reads without adapter trimming).

-a A 5 prime threshold (default 10)

The 5 prime threshold is the amount of tolerance at the 5' end of the transcript for grouping reads to be collapsed.

-m M Exon/Splice junction threshold (default 10)

The Exon/Splice junction threshold is the amount of tolerance for the splice junctions of the transcript for grouping reads to be collapsed.

-z Z 3 prime threshold (default 10)

The 3 prime threshold is the amount of tolerance for the 3' end of the transcript for grouping reads to be collapsed.

-d D Flag for merging duplicate transcript groups (default is merge_dup will merge duplicates ,no_merge quits when duplicates are found)

When collapsing transcripts, there can be events where different groups of transcript collapse into the same model. I call these duplicate transcript groups. By default TAMA Collapse will merge these groups during collapsing (merge_dup). Alternatively you can use "no_merge" so that when TAMA Collapse encounters these events it will exit the run prematurely with an error output. Note that prior to the introduction of LDE (2018/11/28) the default was no_merge. In summary: To merge these groups, use the merge_dup flag. To not merge these groups, use the no_merge flag.

-sj SJ Use error threshold to prioritize the use of splice junction information from collapsing transcripts(default no_priority, activate with sj_priority)

I call this option splice junction priority. Basically if you turn this on, TAMA Collapse will search for mismatches in the regions on each side of each splice junction. The length of the regions are defined by the SJT flag. TAMA Collapse will then rank the splice junction evidence from each transcript/read as either 0, 1, 2, or 3 with 0 being highest priroity. These rankings are defined as follows:

0 - No mismatches on either side of the splice junction

1 - One mismatch on the other side of the splice junction (if we are determining the priority of an exon start, a priority of 1 would mean there is a mismatch in the sequence within the region of the previous exon's end).

2 - One mismatch on the same side of the splice junction (if we are determining the priority of an exon start, a priority of 2 would mean there is a mismatch in the sequence within the region of this exon start).

3 - There are mismatches on both sides of the splice junction

TAMA Collapse will use these rankings to decide which splice junction (exon start and exon end) to use for the final model.

-sjt SJT Threshold for detecting errors near splice junctions (default is 10bp)

This is the length threshold for the regions to observe for splice junction priority. The default is 10bp which means TAMA collapse will look at the 10bp region upstream and downstream of the splice junction.

-lde LDE Threshold for amount of local density error near splice junctions that is allowed (default is 1000 errors which practically means no threshold is applied)

This is the Threshold for the number of errors that is allowed to occur on each side of the splice junction within the range specified by the -sjt argument. So for example, if you use the default SJT of 10bp and "-lde 5", then TAMA Collapse would remove all reads with at least one splice junction that has more than 5 errors withing the 10bp region downstream of the a splice junction or upstream of a splice junction. This feature is used to remove mapped reads where the splice junction mapping is likely to be erroneous due to a high local density of errors near the splice junction.

-ses SES Simple error symbol. Use this to pick the symbol used to represent matches in the simple error string for LDE output.

-b BAM Use sorted BAM input instead of SAM. To enable this use this parameter "-b BAM". Default is SAM input. If you use a BAM input you must have pysam installed. SAM input runs do not require pysam.

-log LOG Turns off long log output of collapsing process. (default on, use log_off to turn off)

This argument refers to the LDE output file. It allows you to change the symbol used to represent matches in the sj_error_simple field of the prefix_local_density_error.txt file.

-rm RM Run mode allows you to use original or low_mem mode, default is original

This argument is used to change the run mode. Using "original" will give you the original run algorithm. Using "low_mem" will use the new low memory run algorithm. The results will be the same except low_mem mode does not do variant calling and does not handle multi-mapped reads.

-vc VC Variation covwerage threshold: Default 5 reads

This argument is used to define the read coverage threshold for variant calling.

Outputs:

  prefix.bed
  prefix_read.txt
  prefix_polya.txt
  prefix_strand_check.txt
  prefix_trans_read.bed
  prefix_trans_report.txt
  prefix_varcov.txt
  prefix_variants.txt
  prefix_local_density_error.txt

Detail explanation:

prefix.bed

This is a bed12 format file containing the final collapsed version of your transcriptome.

prefix_read.txt

This file contains information for all mapped reads from the input SAM/BAM file. It shows both accepted and discarded reads and should match the number of mapped reads in your SAM/BAM file. Format as follows:

  read_id mapped_flag     accept_flag     percent_coverage        percent_identity        error_line<h;s;i;d;m>   length  cigar
  15_25_c195936/1/485     forward_strand  discarded       50.52   50.1    240;0;1;0;1     485     240H42M12896N65M1I137M

prefix_polya.txt

This file contains the reads with potential poly A truncation. The format is as follows:

  cluster_id      trans_id        strand  a_percent       a_count sequence
  15_25_c10327/1/1176     G60.1   -       85.0    17      TAAAAAAAATCAAAAAAAAA

a_percent is the percent of A's in the 20 nucleotide stretch after the 3' end in the genome. a_count is the number of A's in the 20 nucleotide stretch. The sequence shows the 20 nucleotide stretch. A threshold of 70% A's within the 20bp region is used for determining possible poly-A truncation sites.

prefix_strand_check.txt

This file shows instances where the sam flag strand information contrasted the GMAP strand information. Format as follows:

  read_id scaff_name      start_pos       cigar   strands
  15_25_c122605/5/833     1       81276   8S825M  +-

prefix_trans_read.bed

This file uses bed12 format to show the transcript model for each read based on the mapping prior to collapsing. This only contains the reads which were accepted according to the defined thresholds. You can use this file to see if there were any strange occurrences during collapsing. It also contains the relationships between reads and collapsed transcript models. The 1st subfield in the 4th column shows the final transcript ID and the 2nd subfield in the 4th column shows the read ID. If you used no_cap mode for collapsing there may be multiple lines for a single read. This happens when a 5' degraded read can match to multiple 5' longer transcript models.

  1       2350    3116    G1.1;m54041_170609_082817/47972644/30_581_CCS   40      +       2350    3116    255,0,0 3       108,377,65      0,197,701

prefix_trans_report.txt

This file contains collapsing information for each transcript. Format as follows:

  transcript_id  num_clusters    high_coverage   low_coverage    high_quality_percent    low_quality_percent     start_wobble_list       end_wobble_list collapse_sj_start_err   collapse_sj_end_err     collapse_error_nuc
  G14.2          1               100.0           100.0           98.81                   98.81                   0,0,0,0,0               0,0,0,0,0       0,1,2,0,0               2,1,0,0,0               6.T.A>0;0>5M_1D;0>0;0>0

num_clusters - Number of pre-collapse support reads/FLNC/clusters

high_coverage - Highest coverage from pre-collapse support

low_coverage - Lowest coverage from pre-collapse support

high_quality_percent - Highest identity from pre-collapse support

low_quality_percent - Lowestidentity from pre-collapse support

start_wobble_list - Range of exon starts from reads supporting this exon start.

end_wobble_list - Range of exon ends from reads supporting this exon end.

collapse_sj_start_err - Splice junction priority for exon starts for highest priority support.

collapse_sj_end_err - Splice junction priority for exon ends for highest priority support.

collapse_error_nuc - Mapping mismatches occurring within threshold of splice junctions. The format for this line is as follows:

Splice junctions are delimited by ";".

Upstream and downstream of splice junction is delimited by ">".

Different mismatches from different reads for a given splice junction side are delimited with "-".

Mismatch types from the SAM CIGAR string are delimited by "_".

Single nucleotide mismatch (not indel) will display the position in relation to the splice junction and the read base and the genome base as such:

position.read base.genome base

Example:

6.T.A

Example of single splice junction error profile:

6.T.A>5M_1D

This means that 6 base pairs upstream of the splice junction there is a mismatch where T is the nucleotide found in the read for that position and A is the nucleotide found in the reference genome for that position.

For downstream of the splice junction, there are 5 matching nucleotides and then a 1 nucleotide deletion.

The symbols for this error representations are as such:

M - Matching nucleotides

I - Insertion

D- Deletion

S - Soft clipping

H - Hard clipping

This is a modified CIGAR format. For more information on CIGAR string format please see the SAM format guidelines:

https://samtools.github.io/hts-specs/SAMv1.pdf

prefix_varcov.txt

This file contains the coverage information for each variant detected. Format as follows:

  positions       overlap_clusters
  1_117276,1_117295       15_25_c205247/3/1197,1_14_c115955/1/1379,1_14_c126599/1/1419,1_14_c16580/3/1270,1_14_c80976/2/2080

prefix_variants.txt

This file contains the variants called. Variants are only called if 5 or more reads show the variant at a specific locus. If you would like to change the threshold, please make an issue about this in the Github repo. Format as follows:

  scaffold        position        type    ref_allele      alt_allele      count   cov_count       cluster_list
  chr1    185033  M       A       G       5       7       m64014_190506_005857/12584810/ccs,m64014_190506_005857/178390623/ccs,m64014_190506_005857/172231190/ccs,m64014_190506_005857/162989844/ccs,m64014_190506_005857/130744625/ccs

prefix_local_density_error.txt

This file contains the log of filtering for local density error around the splice junctions ("-lde") Format as follows:

  cluster_id      lde_flag        scaff_name      start_pos       end_pos strand  num_exons       bad_sj_num_line bad_sj_error_count_line sj_error_profile_idmsh  sj_error_nuc    sj_error_simple cigar
  m54053_170316_122436/64881153/30_577_CCS        lde_fail        Chr1    45296   45933   -       2       1       1>0     0,1,0,0,0>0,0,0,0,0     1D_2M>0 _______D__>__________   229M1D5M1I26M1D2M86N99M1D36M1I56M1D24M1D26M1D42M

lde_flag - Can be lde_pass (passed quality requirments) or lde_fail (failed threshold and removed from further analysis).

bad_sj_num_line - This is the splice junction number or numbers that did not pass the LDE threshold.

bad_sj_error_count_line - This shows the number of errors on each side of each splice junction using the same format as the collapse_error_nuc line in the prefix_trans_report.txt file.

sj_error_profile_idmsh - This shows the breakdown of type and number of errors on each side of each splice junction. Same formatting conventions of bad_sj_error_count_line except that there are 5 subfields delimited by "," representing insertion,deletions,mismatch,soft_clipping,hard_clipping errors.

sj_error_nuc - This shows a precise view of the errors surrounding each splice junction. Same format conventions as the collapse_error_nuc line in the prefix_trans_report.txt file.

sj_error_simple - This shows a simplified view of the errors on each side of each splice junction. The "-ses" argument determines what character is used to represent matches. The default character for "-ses" is "_".

cigar - This is the cigar string from the SAM file.

Tama Collapse - GenomeRIK/tama GitHub Wiki

⚠️ **GitHub.com Fallback** ⚠️

⚠️ GitHub.com Fallback ⚠️