Tama Merge - GenomeRIK/tama GitHub Wiki

TAMA Merge

TAMA Merge is a tool that allows you to merge multiple transcriptomes while maintaining source information.

Detailed explanation of TAMA Merge:

Are you interested in:

1. Combining your Iso-Seq data from different tissue types/library preps into a single transcriptome.

2. Comparing your Iso-Seq data to the reference annotation (or short read RNAseq annotation).

3. Combining your Iso-Seq data with a short read RNAseq annotation and with the reference annotation.

4. Doing any of the above while still maintaining source information.

5. Doing any of the above with the power to define merging parameters.

6. Comparing pipelines (use TAMA Merge on annotations made from the same dataset but using different pipelines).

If so, TAMA Merge is probably what you are looking for.

TAMA Merge takes as input multiple transcriptomes in bed12 format. It then compares the transcript models from each transcriptome and merges models based on the similarity of features (transcription start/end sites and exon start/end sites). The output is a merged transcriptome in bed12 format along with other files containing source information.

Note that the input bed12 files must have the gene IDs and transcript IDs formatted as "gene_id;transcript_id" in the 4th column. The gene ID must be the first subfield and the subfields must be delimited with a semicolon (;).
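
As a minimal sketch of the ID requirement above, the following Python function checks one bed12 line and returns the two subfields of the 4th column. The function name parse_bed12_name is hypothetical and is not part of TAMA; it only illustrates the "gene_id;transcript_id" layout.

```python
def parse_bed12_name(line):
    """Return (gene_id, transcript_id) from one tab-separated bed12 line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 12:
        raise ValueError("expected 12 tab-separated columns, found %d" % len(fields))
    subfields = fields[3].split(";")
    # The gene ID must be the first subfield, followed by the transcript ID.
    if len(subfields) < 2 or not subfields[0] or not subfields[1]:
        raise ValueError("4th column is not 'gene_id;transcript_id': %r" % fields[3])
    return subfields[0], subfields[1]
```

Running this over every line of an input file before merging is one way to catch ID formatting problems early.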

You can define the threshold for transcription start/end sites (TSS/TES) and exon start/end sites (ESS/EES). You can also give priority to features from specific transcriptomes. For instance, you may want to give priority to Iso-Seq data for transcription start/end sites and priority to your short read RNAseq transcriptome for splice junctions. This means that when you are merging models between these two transcriptomes the final merged model will use the TSS/TES from the Iso-Seq data and the ESS/EES from the short read RNAseq data. The source for each feature prediction is included in the output files so you can see exactly what happened with each merging event.

Manual

usage: tama_merge.py [-h] [-f] [-p] [-e] [-a] [-m] [-z] [-d] [-s] [-cds]

This script merges transcriptomes.

optional arguments:

  -h, --help  show this help message and exit
  -f F        File list
  -p P        Output prefix
  -e E        Collapse exon ends flag: common_ends or longest_ends (Default is common_ends)
  -a A        5 prime threshold (Default is 10)
  -m M        Exon ends threshold/ splice junction threshold (Default is 10)
  -z Z        3 prime threshold (Default is 10)
  -d D        Flag for merging duplicate transcript groups (default no_merge quits when duplicates are found, merge_dup will merge duplicates)
  -s S        Use gene and transcript ID from a merge source. Specify source name from filelist file here.
  -cds CDS    Use CDS from a merge source. Specify source name from filelist file here.

Default command would look like this:

python tama_merge.py -f filelist.txt -p merged_annos

NOTE: If you do not see "TAMA Merge has completed successfully!" as the last line of the terminal output, then TAMA Merge has not completed and there are likely issues with the input files. The last lines should show what error occurred. For help interpreting the error, please log an issue on the issues page of this GitHub repository.
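
If you run TAMA Merge from a script, the completion check described in the note can be automated. The sketch below is not part of TAMA: the helper names are hypothetical, it assumes tama_merge.py is in the working directory, and it assumes the success message is written to the terminal output stream (stdout and stderr are captured together here).

```python
import subprocess

SUCCESS_LINE = "TAMA Merge has completed successfully!"

def merge_completed(terminal_output):
    """True if the last non-empty line of the output is the success message."""
    lines = [ln.strip() for ln in terminal_output.splitlines() if ln.strip()]
    return bool(lines) and SUCCESS_LINE in lines[-1]

def run_tama_merge(filelist, prefix, interpreter="python"):
    """Run the default command and return (completed, captured_output)."""
    result = subprocess.run(
        [interpreter, "tama_merge.py", "-f", filelist, "-p", prefix],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # keep error messages in the same stream
        universal_newlines=True,
    )
    return merge_completed(result.stdout), result.stdout
```

For example, run_tama_merge("filelist.txt", "merged_annos") mirrors the default command; when the first returned value is False, the end of the captured output should show the error.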

Detailed explanation of arguments:

-f filelist.txt

The filelist file contains the name of the files you want to merge as well as some additional information. The format for the file should be like this (tab separated, do not include header):

  file_name    cap_flag    merge_priority(start,junctions,end)    source_name
  annotation_capped.bed        capped  1,1,1   caplib
  annotation_nocap.bed        no_cap  2,1,1   nocaplib
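
A filelist in this layout can also be generated from a script. The following is a minimal sketch, not a TAMA utility: format_filelist is a hypothetical helper that writes the four tab-separated columns without a header and enforces the rules this page gives for each column (a cap_flag of "capped" or "no_cap", three priority ranks, unique source names without underscores).

```python
VALID_CAP_FLAGS = {"capped", "no_cap"}

def format_filelist(entries):
    """Build filelist text from (file_name, cap_flag, priority, source_name) tuples.

    priority is a (start, junctions, end) tuple of ranks, 1 being the highest.
    """
    seen = set()
    rows = []
    for file_name, cap_flag, priority, source_name in entries:
        if cap_flag not in VALID_CAP_FLAGS:
            raise ValueError("cap_flag must be 'capped' or 'no_cap': %r" % cap_flag)
        if len(priority) != 3:
            raise ValueError("merge_priority needs three ranks: start,junctions,end")
        if "_" in source_name:
            raise ValueError("source names must not contain underscores: %r" % source_name)
        if source_name in seen:
            raise ValueError("source names must be unique: %r" % source_name)
        seen.add(source_name)
        ranks = ",".join(str(rank) for rank in priority)
        rows.append("\t".join([file_name, cap_flag, ranks, source_name]))
    return "\n".join(rows) + "\n"
```

Writing the returned text to filelist.txt gives a file that can be passed to -f.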

"cap_flag" can be one of two options "capped" or "no_cap". This represents whether the transcriptome start sites should be trusted or if transcripts should be merged into longer matching transcripts. If "no_cap" is selected for a dataset, the start priority will be placed at last regardless of what is set in the filelist file.

"merge_priority" designates the rank of the information from each source with respect to start site, splice junctions, and end sites. "1" is the highest rank. So in the example above the "capped" transcriptome will have a start site priority over the "no_cap" transcriptome.

"source_name" is used for the source information files to show where each prediction comes from. It will be added as a prefix onto the gene and transcript names when showing the mapping between before merge IDs and after merge IDs. The source names must therefore be unique. Also do not use underscores in the source names as TAMA uses underscores as name delimiters.

-p P Output prefix

The output prefix is the prefix that will be used to name the output files.

-e E Collapse exon ends flag: common_ends or longest_ends

The collapse exon ends flag is used to determine whether an exon end feature should be chosen based on how common it is (common_ends) or if it makes the longest exon (longest_ends). Default is common_ends. Common ends means the most abundant exon end as voted by all reads being collapsed into a single transcript model.

-a A 5 prime threshold

The 5 prime threshold is the amount of tolerance at the 5' end of the transcript for grouping reads to be collapsed.

-m M Exon ends threshold/ splice junction threshold

The Exon/Splice junction threshold is the amount of tolerance for the splice junctions of the transcript for grouping reads to be collapsed.

-z Z 3 prime threshold

The 3 prime threshold is the amount of tolerance for the 3' end of the transcript for grouping reads to be collapsed.
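
To make the idea of tolerance concrete, here is a simplified sketch of how the three thresholds (-a, -m, -z) could be applied when comparing two models. This is an illustration only, not TAMA's implementation: it assumes plus-strand models given as lists of (start, end) exon coordinates, treats a difference equal to the threshold as a match, and ignores cap flags and merge priorities. The function name same_model is hypothetical.

```python
def same_model(exons_a, exons_b, five=10, junction=10, three=10):
    """Compare two plus-strand exon lists using per-feature tolerances."""
    if len(exons_a) != len(exons_b):
        return False
    last = len(exons_a) - 1
    for i, ((start_a, end_a), (start_b, end_b)) in enumerate(zip(exons_a, exons_b)):
        # The first exon start is the 5' end and the last exon end is the 3' end;
        # every other exon boundary is a splice junction.
        start_tol = five if i == 0 else junction
        end_tol = three if i == last else junction
        if abs(start_a - start_b) > start_tol or abs(end_a - end_b) > end_tol:
            return False
    return True
```

With the default values, two models whose 5' ends differ by 5 bases are treated as the same, while a 20 base difference keeps them apart unless the 5 prime threshold is raised.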

-d D Flag for merging duplicate transcript groups

Either no_merge (default) or merge_dup. This gives you the choice to merge duplicate groups where different transcripts in different groups happen to collapse to the same model. If no_merge is used and there is a duplicate, the program will exit early and not complete the run. You can also adjust the thresholds (increase allowances) to avoid duplicates.

-s S Use gene and transcript ID from a merge source. Specify source name from filelist file here

Use this parameter if you want to carry over the gene and transcript IDs from one or more of the merge sources into the merged annotation. For instance, if you are merging an Iso-Seq annotation with a public annotation, you can pull the gene and transcript IDs from the public annotation and have them represented in the merged annotation. The source IDs are appended as extensions in the ID field (e.g. G1;G1.1;ENSG00000139618;ENST00000380152). Example usage is "-s ensembl". To use multiple sources, separate them with commas: "-s ensembl,refseq".
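
When reading the merged annotation downstream, the extended ID field can be split back into its parts. The sketch below uses a hypothetical helper, parse_merged_id, and assumes only what the example ID shows: the merged gene and transcript IDs come first and any carried-over source IDs follow, all separated by semicolons. The exact layout when several sources are named is not described here, so the extra subfields are returned as a plain list.

```python
def parse_merged_id(name_field):
    """Split a merged ID field into merged IDs and carried-over source IDs."""
    subfields = name_field.split(";")
    if len(subfields) < 2:
        raise ValueError("expected at least 'gene_id;transcript_id': %r" % name_field)
    return {
        "gene_id": subfields[0],
        "transcript_id": subfields[1],
        # Anything after the first two subfields came from the -s source(s).
        "source_ids": subfields[2:],
    }
```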

-cds CDS Use CDS from a merge source. Specify source name from filelist file here.

Use this parameter if you want to carry over the CDS predictions from one or more of the merge sources into the merged annotation. Example usage is "-cds ensembl". To use multiple sources, separate them with commas: "-cds ensembl,refseq".

Outputs:

  prefix.bed
  prefix_gene_report.txt
  prefix_merge.txt
  prefix_trans_report.txt

Detailed explanation:

prefix.bed

This is the main merged annotation file. Transcripts are coloured according to the source support for each model. Sources are numbered based on the order supplied in the input filelist file. For example the first file named in the filelist file would have its transcripts coloured in red. If a transcript has multiple sources the colour is shown as magenta.

  1 = red
  2 = orange
  3 = yellow
  4 = lime
  5 = light turquoise 
  6 = light blue
  7 = royal blue
  8 = dark blue
  9 = dark purple
  Multiple Sources = magenta

prefix_gene_report.txt

This contains a report of the genes from the merged file. "num_clusters" refers to the number of source transcripts that were used to make this gene model. "num_final_trans" refers to the number of transcripts in the final gene model. The format is as follows:

  gene_id num_clusters    num_final_trans sources chrom   start   end
  G1      2       2       tissue1,tissue2        1       225     3214

prefix_merge.txt

This contains a bed12 format file which shows the coordinates of each input transcript matched to the merged transcript ID. I used the "txt" extension even though it is a bed file just to avoid confusion with the main bed file. You can use this file to map the final merged transcript models to their pre-merged supporting transcripts. The 1st subfield in the 4th column shows the final merged transcript ID while the 2nd subfield shows the pre-merged transcript ID with source prefix.

  1       219     3261    G1.2;spleen_G1.1        40      +       219     3261    255,0,0 5       98,93,181,107,714       0,1457,1757,2132,2328
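
The mapping from merged transcripts back to their supporting transcripts can be collected with a few lines of Python. This is a sketch with a hypothetical helper name. It assumes the file is tab-separated and relies on the rule that source names contain no underscores, so the first underscore in the 2nd subfield separates the source name from the pre-merge transcript ID.

```python
from collections import defaultdict

def map_merged_to_sources(lines):
    """Map each merged transcript ID to a list of (source_name, original_id)."""
    mapping = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip blank or truncated lines
        merged_id, source_field = fields[3].split(";", 1)
        # Source names have no underscores, so split at the first one.
        source_name, _, original_id = source_field.partition("_")
        mapping[merged_id].append((source_name, original_id))
    return dict(mapping)
```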

prefix_trans_report.txt

This contains the source information for each merged transcript. The format is as follows:

  transcript_id   num_clusters    sources start_wobble_list       end_wobble_list exon_start_support      exon_end_support
  G2.1    1       newnormbrain    0       0       newnormbrain_G2.1       newnormbrain_G2.1