Home - GenomeRIK/tama GitHub Wiki

Welcome to the tama wiki!

This page serves as an introduction to the TAMA package.

Installation

To use the tools in this repository you should create a git clone on your system. To do this, go to the folder where you would like the git clone to be placed and enter this in the command line:

git clone https://github.com/GenomeRIK/tama

When running the scripts, either supply their full path or add the repository directory to your PATH in your bash profile.

You will need:

  1. Python 2
  2. Biopython module
  3. Pysam if you want to use BAM instead of SAM as the input to TAMA Collapse
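The requirements above can be checked before running any TAMA script. The following is a minimal sketch, not part of TAMA itself; the helper name check_modules is hypothetical. It relies only on the fact that Biopython is imported as "Bio" and Pysam as "pysam", and it is written to run on both Python 2 and Python 3.

```python
# Minimal dependency check (illustrative helper, not shipped with TAMA).
def check_modules(names):
    """Return a dict mapping each module name to True if it can be imported."""
    status = {}
    for name in names:
        try:
            __import__(name)
            status[name] = True
        except ImportError:
            status[name] = False
    return status


if __name__ == "__main__":
    # "Bio" is the import name of Biopython; "pysam" is only needed for BAM input.
    for module, found in sorted(check_modules(["Bio", "pysam"]).items()):
        print("%s: %s" % (module, "found" if found else "missing"))
```

If Pysam is reported as missing, TAMA Collapse can still be run with SAM input.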

See here for links to manuals for specific tools:

Tama Collapse

Tama Merge

Introduction

TAMA is a bioinformatics package intended to be used for constructing transcriptome/genome annotations. TAMA is ideal for working with Iso-Seq (long read RNA sequencing) data. However, due to its modular nature, it can be used for other data types as well.

The core philosophy behind TAMA is to provide a modular set of tools with full algorithm transparency. This is intended to make the package extremely flexible with respect to analysis goals and to give users more control over how the data is processed.

Iso-Seq Processing

Processing Iso-Seq data has three main goals:

  1. Error correction
  2. Mapping
  3. Removing redundancy

These goals are achieved through the various steps of Iso-Seq processing:

  1. Circular Consensus Sequence (CCS) calling
  2. Full Length Non-Chimeric (FLNC) read generation
  3. Cluster/Polish (optional)
  4. Minimap2/GMAP mapping to the reference genome assembly
  5. Collapse redundant transcript models
  6. Merge transcriptome annotations

You can see a visual representation of the pipeline below:

Iso-Seq Pipeline

The tools for getting CCS, FLNC, and Polished reads are available from the PacBio Iso-Seq github: Iso-Seq Github

Detailed explanations of the Iso-Seq pipeline steps:

1. Circular Consensus Sequence (CCS) calling

This is the very first step in processing raw Iso-Seq data. Currently, there is no alternative software for performing this step; you must use the official Iso-Seq software. In this step, the subreads of each read are aligned to one another (a "multiple" sequence alignment) so that they can be used for error correction. The format of the resulting file has been changing, so please check the PacBio Iso-Seq GitHub repo for more information.

2. FLNC

This step uses the fasta file from the first step to do several things: remove adapters, remove poly-A tails, and remove artificial concatemers. The end result is a fasta file with each transcript sequence represented from 5' to 3' with no adapters or poly-A tails. It is important to note that this step requires the adapter sequences used during library preparation. In the SMRT Portal (GUI version) software you currently cannot choose the adapter sequences, and the official adapter sequences are used by default. However, in the command line version (Tofu), you can supply your own fasta file with adapter sequences. Thus, if you used different adapter sequences (e.g. from 5' cap selection kits), you will need to run the command line version.
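To make the poly-A removal concrete, here is a toy sketch of trimming a trailing run of A bases from a transcript sequence. This is not the algorithm used by the official Iso-Seq software; the function name trim_poly_a and the min_tail threshold of 8 are assumptions chosen for illustration only.

```python
def trim_poly_a(seq, min_tail=8):
    """Remove a trailing run of A bases if it is at least min_tail long.

    Illustrative only: real tools also handle adapters, sequencing errors
    inside the tail, and reads in reverse orientation.
    """
    stripped = seq.rstrip("Aa")
    tail_len = len(seq) - len(stripped)
    return stripped if tail_len >= min_tail else seq


# Hypothetical read: 8 bases of transcript followed by a 12 base poly-A tail.
read = "ACGTACGT" + "A" * 12
trimmed = trim_poly_a(read)
```

Short runs of A at the 3' end are left alone in this sketch, since they may be genuine transcript sequence rather than a tail.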

3. Cluster/Polish

This step takes the FLNC reads and clusters the transcript sequences by similarity. It then makes a multiple alignment of each cluster and performs error correction using this alignment. This step is optional. If you are interested in allele specific expression/transcript phasing, you should not run this step, as it can remove allele specific sequence variation.
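The loss of allele specific variation can be illustrated with a toy consensus. The sketch below uses a simple column-wise majority vote over reads that are assumed to be already aligned to the same length. It is not the Cluster/Polish algorithm itself, and the name majority_consensus and the example reads are hypothetical.

```python
from collections import Counter


def majority_consensus(aligned_reads):
    """Column-wise majority vote over equal-length, pre-aligned reads."""
    if not aligned_reads:
        return ""
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("reads must be aligned to the same length")
    consensus = []
    for column in zip(*aligned_reads):
        # Keep the most frequent base in this alignment column.
        base, _count = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)


# Three reads from one allele and one read from another (A instead of T at index 3).
cluster = ["ACGTTGCA", "ACGTTGCA", "ACGTTGCA", "ACGATGCA"]
polished = majority_consensus(cluster)
```

In this example the minority base at index 3 does not appear in the consensus, which is the kind of variation a phasing analysis would need to keep.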

4. Minimap2/GMAP

This step can take either the FLNC reads or the Polished reads as input. Minimap2/GMAP maps the transcript sequences onto a genome assembly. You can also use any other mapper/aligner that supports splice-aware mapping of long reads. Minimap2 is the newer tool; it is faster than GMAP and has similar mapping accuracy.
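As a sketch of how the mapping step might be scripted, the helper below builds a Minimap2 command using the spliced, high-quality read preset that the Minimap2 documentation suggests for Iso-Seq data. The function names and the file names genome.fa, flnc.fasta, and mapped.sam are placeholders, and the settings are a starting point under those assumptions rather than a TAMA recommendation.

```python
import subprocess


def build_minimap2_command(reference_fasta, reads_fasta):
    """Build an argv list for spliced alignment of long cDNA reads to a genome."""
    return [
        "minimap2",
        "-ax", "splice:hq",  # spliced alignment preset for high-quality reads, SAM output
        "-uf",               # look for splice sites on the forward transcript strand only
        reference_fasta,
        reads_fasta,
    ]


def run_minimap2(reference_fasta, reads_fasta, out_sam):
    """Run Minimap2 and write its SAM output to out_sam (requires minimap2 on PATH)."""
    with open(out_sam, "w") as sam_out:
        subprocess.check_call(build_minimap2_command(reference_fasta, reads_fasta),
                              stdout=sam_out)


# Placeholder file names; nothing is executed until run_minimap2 is called.
command = build_minimap2_command("genome.fa", "flnc.fasta")
```

Calling run_minimap2("genome.fa", "flnc.fasta", "mapped.sam") would then produce a SAM file that can be passed to the next step.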

5. TAMA Collapse

This step takes a BAM/SAM file from the transcript mapping and collapses redundant transcripts based on genomic location.
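The idea of collapsing by genomic location can be sketched as follows. This toy version groups mapped reads that share chromosome, strand, and an identical intron chain, and it extends the collapsed model to the outermost start and end. It is an illustration only: TAMA Collapse has its own matching criteria and user-adjustable settings, and the function name, the data layout, and the exact-match rule here are all assumptions.

```python
from collections import OrderedDict


def collapse_models(models):
    """Group transcript models that share chromosome, strand, and intron chain.

    Each model is a dict with keys "id", "chrom", "strand", and "exons",
    where "exons" is a sorted list of (start, end) tuples.
    """
    groups = OrderedDict()
    for model in models:
        exons = model["exons"]
        # Intron chain: the end of each exon paired with the start of the next.
        introns = tuple((exons[i][1], exons[i + 1][0]) for i in range(len(exons) - 1))
        # Single-exon models have no introns, so key them on exact exon coordinates.
        key = (model["chrom"], model["strand"], introns or tuple(exons))
        if key not in groups:
            groups[key] = {
                "chrom": model["chrom"],
                "strand": model["strand"],
                "start": exons[0][0],
                "end": exons[-1][1],
                "sources": [],
            }
        group = groups[key]
        group["start"] = min(group["start"], exons[0][0])
        group["end"] = max(group["end"], exons[-1][1])
        group["sources"].append(model["id"])  # remember which reads support the model
    return list(groups.values())


# Hypothetical mapped reads.
reads = [
    {"id": "read1", "chrom": "chr1", "strand": "+", "exons": [(100, 200), (300, 400)]},
    {"id": "read2", "chrom": "chr1", "strand": "+", "exons": [(90, 200), (300, 420)]},
    {"id": "read3", "chrom": "chr1", "strand": "+", "exons": [(100, 200), (310, 400)]},
]
collapsed = collapse_models(reads)
```

Here read1 and read2 share an intron chain and collapse into one model spanning 90 to 420, while read3 has a different splice junction and stays separate.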

TAMA Collapse can handle all Iso-Seq libraries and gives the user much more control over how the collapsing is done. TAMA Collapse can also handle Nanopore transcript sequencing data.

TAMA Collapse also provides the following information:

  1. Source information for each predicted feature
  2. Variation calling
  3. Genomic poly-A detection
  4. Strand ambiguity

You can find out more about using TAMA Collapse here: TAMA Collapse

6. TAMA Merge

Merging is the process of combining multiple transcriptomes. For instance, if you have Iso-Seq data for different tissue types, you might want to process them separately and then combine them at the end to use as a transcriptome for downstream analysis. However, merging transcriptomes is non-trivial with respect to choosing which criteria to use to merge transcript models. You would probably also like to keep a record of which models in your merged transcriptome came from which source. TAMA Merge was developed to deal with this.
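The record-keeping aspect of merging can be sketched with a small example. The code below merges annotations from several sources and records, for each merged model, which source transcripts it came from. The merge criterion used here (an identical chromosome, strand, and intron chain) is an assumption made for illustration and is not a description of TAMA Merge's criteria; the function name and the tissue and transcript names are hypothetical.

```python
def merge_annotations(annotations):
    """Merge transcript models from several sources, tracking their origin.

    annotations maps a source name to a dict of transcript id -> structure,
    where a structure is a hashable tuple (chrom, strand, intron_chain).
    Returns a dict of structure -> list of "source:transcript_id" labels.
    """
    merged = {}
    for source in sorted(annotations):
        for transcript_id, structure in sorted(annotations[source].items()):
            merged.setdefault(structure, []).append("%s:%s" % (source, transcript_id))
    return merged


# Hypothetical per-tissue annotations.
annotations = {
    "brain": {
        "B1": ("chr1", "+", ((200, 300),)),
        "B2": ("chr2", "-", ((500, 650),)),
    },
    "liver": {
        "L1": ("chr1", "+", ((200, 300),)),
    },
}
merged = merge_annotations(annotations)
```

In this example the chr1 model is supported by both brain:B1 and liver:L1, while the chr2 model is found only in brain.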

You can find more information here: TAMA Merge