TSSpredator and its prediction - Integrative-Transcriptomics/tss-prediction-comparison GitHub Wiki

What is TSSpredator?

TSSpredator is a tool for predicting transcription start sites (TSS) from high-resolution transcriptome data, in particular data from differential RNA sequencing (dRNA-seq). It is especially useful for comparative transcriptomics across multiple strains or conditions. Although it was initially designed for prokaryotes, the method can be adapted to analyze eukaryotic transcriptomes.

Input for TSSpredator

  • Genome FASTA file -> Contains the genomic sequence of the organism under study.

    • Format: FASTA
  • Genome annotation file -> Contains the annotations of the genes in the genome.

    • Format: GFF or GTF
  • RNA-seq files (wiggle files) -> Contain the RNA-seq data, which represent the expression height of the transcripts along the genome.

    • Format: WIG or GR
  • Multiple alignment file (optional) -> Contains the multiple alignment of the genomes when several strains/species are compared.

    • Format: XMFA (progressiveMauve)

How does TSSpredator predict TSS positions?

The first step in TSS detection is to examine the RNA-seq expression graphs for positions where a significant number of reads begin. For each position (i) in the expression graph, the algorithm calculates the difference in expression between position (i) and position (i-1): $$e(i)-e(i-1).$$ In addition, it computes the fold change in expression: $$e(i)/e(i-1).$$ Both values capture abrupt increases in read density, which indicate a potential TSS.
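The two quantities above can be illustrated with a minimal Python sketch. It is not TSSpredator's own code: the function name `step_metrics`, the list-based expression graph, and the handling of a zero value at position i-1 are assumptions made only for this example.

```python
def step_metrics(expression):
    """Compute step height e(i) - e(i-1) and step factor e(i) / e(i-1).

    `expression` is a list of expression heights indexed by position.
    Returns a dict mapping each position i >= 1 to (height, factor).
    """
    metrics = {}
    for i in range(1, len(expression)):
        prev, cur = expression[i - 1], expression[i]
        height = cur - prev
        if prev > 0:
            factor = cur / prev
        else:
            # Assumption: a rise from zero coverage counts as an unbounded
            # factor, and 0/0 counts as "no change" (factor 1.0).
            factor = float("inf") if cur > 0 else 1.0
        metrics[i] = (height, factor)
    return metrics


coverage = [2.0, 2.0, 20.0, 18.0, 0.0]
metrics = step_metrics(coverage)
print(metrics[2])  # (18.0, 10.0)
```

Position 2 stands out because the expression jumps from 2 to 20, giving a step height of 18 and a step factor of 10.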

To distinguish primary transcripts from processed RNA, an enrichment factor is calculated by comparing the expression in the treated and untreated samples at the same position: $$e_{\text{treated}}(i) / e_{\text{untreated}}(i).$$ Positions whose values exceed predefined thresholds for all of these metrics are marked as TSS candidates. If a TSS candidate is detected in at least one condition, the thresholds are relaxed for the other conditions to increase sensitivity.
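The thresholding and the relaxed second pass could look roughly like the following sketch. The threshold names and values, the helper functions, and the choice to compute step height and step factor on the treated library are assumptions for illustration, not TSSpredator's actual parameters.

```python
def ratio(numerator, denominator):
    """Ratio without ZeroDivisionError (assumption: x/0 is infinite for x > 0, 0/0 is 1.0)."""
    if denominator > 0:
        return numerator / denominator
    return float("inf") if numerator > 0 else 1.0


def is_candidate(treated, untreated, i, min_height, min_factor, min_enrichment):
    """Check position i (i >= 1) against all three thresholds."""
    height = treated[i] - treated[i - 1]
    factor = ratio(treated[i], treated[i - 1])
    enrichment = ratio(treated[i], untreated[i])
    return (height >= min_height
            and factor >= min_factor
            and enrichment >= min_enrichment)


def detect_across_conditions(conditions, i, strict, relaxed):
    """Return the conditions in which position i is called a TSS candidate.

    A strict hit in at least one condition is required; once it exists,
    the other conditions are re-checked with the relaxed thresholds.
    """
    strict_hits = {name for name, (t, u) in conditions.items()
                   if is_candidate(t, u, i, **strict)}
    if not strict_hits:
        return set()
    relaxed_hits = {name for name, (t, u) in conditions.items()
                    if is_candidate(t, u, i, **relaxed)}
    return strict_hits | relaxed_hits


# Hypothetical (treated, untreated) expression graphs for two conditions.
conditions = {
    "A": ([1.0, 1.0, 30.0, 28.0], [1.0, 1.0, 10.0, 27.0]),
    "B": ([1.0, 1.0, 6.0, 5.0], [1.0, 1.0, 4.0, 5.0]),
}
strict = dict(min_height=5.0, min_factor=2.0, min_enrichment=2.0)
relaxed = dict(min_height=2.0, min_factor=1.5, min_enrichment=1.2)
print(sorted(detect_across_conditions(conditions, 2, strict, relaxed)))  # ['A', 'B']
```

Condition B fails the strict enrichment threshold at position 2 (6/4 = 1.5), but because condition A passes the strict check, B is accepted under the relaxed thresholds.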

Once potential TSS are identified, TSSpredator applies several filtering and clustering steps. TSS candidates not found in both replicates of a condition within one nucleotide position are discarded. Remaining candidates are clustered if they are in close proximity, keeping only the candidate with the highest expression within each cluster.
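A minimal sketch of the replicate filter and the clustering step is shown below. The one-nucleotide tolerance comes from the text; the cluster distance of 3 nt, the single-linkage grouping, and the function names are hypothetical choices for this example.

```python
def replicate_supported(candidates_a, candidates_b, tolerance=1):
    """Keep positions from replicate A that have a partner in replicate B
    at most `tolerance` nucleotides away."""
    return sorted(p for p in candidates_a
                  if any(abs(p - q) <= tolerance for q in candidates_b))


def cluster_keep_highest(positions, expression, max_distance=3):
    """Group positions whose neighbours are at most `max_distance` nt apart
    and keep the position with the highest expression in each group.

    `expression` maps a position to its expression height.
    """
    clusters = []
    for pos in sorted(positions):
        if clusters and pos - clusters[-1][-1] <= max_distance:
            clusters[-1].append(pos)   # close to the previous candidate
        else:
            clusters.append([pos])     # start a new cluster
    return [max(cluster, key=lambda p: expression[p]) for cluster in clusters]


replicate_1 = [100, 102, 250, 400]
replicate_2 = [101, 251, 500]
supported = replicate_supported(replicate_1, replicate_2)
print(supported)  # [100, 102, 250]

heights = {100: 12.0, 102: 40.0, 250: 8.0}
print(cluster_keep_highest(supported, heights))  # [102, 250]
```

Position 400 is dropped because the second replicate has no candidate within one nucleotide, and positions 100 and 102 collapse into a single cluster represented by the more highly expressed position 102.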

The identified TSS are then mapped to a "SuperGenome," a composite coordinate system created from a whole-genome alignment of the different strains. This mapping makes it possible to compare TSS across strains or conditions. Each TSS is characterized by its genomic context: it can be a primary or secondary TSS of a gene, an internal TSS, an antisense TSS, or an orphan TSS (these classes are explained in the "Result" paragraph of this wiki entry).

The final output includes a comprehensive annotation of TSS, detailing their classification and enrichment across different conditions.

Output of TSSpredator

Master Table (MasterTable.tsv)

This table contains information on positions and class assignments of all automatically annotated TSS. The table consists of the following columns:

  • SuperPos: The position of the TSS in the SuperGenome.

  • SuperStrand: The strand of the TSS in the SuperGenome.

  • MapCount: The number of strains to which the TSS can be mapped. A separate entry line exists for each strain to which the TSS can be mapped, regardless of whether the TSS was detected in that strain.

  • detCount: The number of strains/conditions in which this TSS was detected in the RNA-seq data.

  • Condition: The identifier of the strain/condition to which the rest of the line relates.

  • detected: Contains a ‘1’ if the TSS was detected in this strain/condition.

  • enriched: Contains a ‘1’ if the TSS is enriched in this strain/condition.

  • stepHeight: The expression height change at the position of the TSS. This relates to the number of reads starting at this position. (e(i) - e(i-1); e(i): expression height at position i)

  • stepFactor: The factor of height change at the position of the TSS. (e(i)/e(i-1); e(i): expression height at position i)

  • enrichmentFactor: The enrichment factor at the position of the TSS.

  • classCount: The number of classes to which this TSS was assigned.

  • Pos: Position of the TSS in that genome.

  • Strand: Strand of the TSS in that genome.

  • Locus tag: The locus tag of the gene to which the classification relates.

  • Product: The product description of this gene.

  • UTRlength: The length of the untranslated region between the TSS and the respective gene (nt). (Only applies to ‘primary’ and ‘secondary’ TSS.)

  • GeneLength: The length of the gene (nt).

  • Primary: Contains a ‘1’ if the TSS was classified as ‘primary’ with respect to the gene stated in ‘locusTag’.

  • Secondary: Contains a ‘1’ if the TSS was classified as ‘secondary’ with respect to the gene stated in ‘locusTag’.

  • Internal: Contains a ‘1’ if the TSS was classified as ‘internal’ with respect to the gene stated in ‘locusTag’.

  • Antisense: Contains a ‘1’ if the TSS was classified as ‘antisense’ with respect to the gene stated in ‘locusTag’.

  • Automated: Contains a ‘1’ if the TSS was detected automatically.

  • Manual: Contains a ‘1’ if the TSS was annotated manually.

  • Putative sRNA: Contains a ‘1’ if the TSS might be related to a novel sRNA. (Not evaluated automatically)

  • Putative asRNA: Contains a ‘1’ if the TSS might be related to an asRNA.

  • Sequence -50 nt upstream + TSS (51nt): Contains the base of the TSS and the 50 nucleotides upstream of the TSS.
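As a usage illustration, the table can be read with Python's standard `csv` module. The sketch assumes that the header names in the file match the column names listed above and that a detected TSS without any class flag is an orphan; both should be checked against an actual MasterTable.tsv. The in-memory sample stands in for the real file and contains only the columns used here.

```python
import csv
import io

CLASS_COLUMNS = ["Primary", "Secondary", "Internal", "Antisense"]


def read_detected_tss(handle):
    """Yield (condition, position, strand, classes) for every detected TSS."""
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        if row["detected"] != "1":
            continue
        classes = [name for name in CLASS_COLUMNS if row[name] == "1"]
        # Assumption: no class flag set means the TSS is an orphan.
        yield row["Condition"], int(row["Pos"]), row["Strand"], classes or ["Orphan"]


# Hypothetical sample rows; a real run would pass open("MasterTable.tsv", newline="").
rows = [
    ["Condition", "detected", "Pos", "Strand", "Primary", "Secondary", "Internal", "Antisense"],
    ["A", "1", "1200", "+", "1", "0", "0", "0"],
    ["B", "0", "1200", "+", "1", "0", "0", "0"],
    ["A", "1", "5300", "-", "0", "0", "0", "0"],
]
sample = "\n".join("\t".join(row) for row in rows)

result = list(read_detected_tss(io.StringIO(sample)))
for entry in result:
    print(entry)
# ('A', 1200, '+', ['Primary'])
# ('A', 5300, '-', ['Orphan'])
```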

Supplemental Files

  • strain_super.fa: Contains the genome sequence of each strain mapped to the coordinate system of the SuperGenome. Taken together, the files of all strains contain the whole-genome alignment. These files can be used in genome browsers that allow the user to load several sequences simultaneously.

  • strain_super.gff: Contains the gene annotations of each strain mapped to the coordinate system of the SuperGenome.

  • strain_superTypeStrand.gr: Contains the xy-graphs of each strain mapped to the coordinate system of the SuperGenome. Type is either ‘FivePrime’ (treated) or ‘Normal’ (untreated). Strand is either ‘Plus’ or ‘Minus’. Note that these files contain the value 0.0001 instead of 0, because a value of 0 (i.e. no entry line) indicates a gap. This is necessary for the thresholding feature of IGB (Integrated Genome Browser).

  • superTSS.gff: Contains all TSS predicted in any of the strains, in the coordinate system of the SuperGenome. TSS that were predicted in only one strain are listed as well. The number of strains (and which ones) in which a TSS was detected is given in superClasses.tsv. The header line reports the names and values of all parameters used for the run.

  • TSSstatistics.tsv: Contains some general statistics about the TSS prediction results.

Result of TSSpredator

MasterTable.tsv lists the position of each TSS in the genome (SuperPos / Pos), its strand (SuperStrand / Strand), and its classification into the following five classes (a ‘1’ in the corresponding column marks an assignment):

Primary TSS (pTSS):

  • Description: This is the main promoter of a gene. The pTSS is the primary starting site where transcription initiates.
  • Function: The pTSS ensures the initiation of transcription of the majority of mRNA for a specific gene. It is typically the most highly expressed start site and provides the major portion of mRNA for the respective gene.
  • Location: pTSS is usually located immediately upstream of the coding region of a gene.

Secondary TSS (sTSS):

  • Description: A secondary promoter also used for initiating transcription of the same gene, but less frequently than the pTSS.
  • Function: sTSS may play a role in fine-tuning gene expression by providing alternative transcription start sites utilized under specific conditions or in different cell types.
  • Location: sTSS lies in the same upstream region of the gene as the pTSS. It is distinguished from the pTSS by its lower expression rather than by its position.

Internal TSS (iTSS):

  • Description: A promoter located within a gene that initiates transcription from internal gene segments.
  • Function: iTSS can lead to the production of shorter mRNA variants encoding only specific sections of the gene. These shorter transcripts may have different functional properties.
  • Location: iTSS is situated within the coding region of a gene.

Antisense TSS (asTSS):

  • Description: A promoter located on the strand opposite to a gene, either inside the gene or within 100 nucleotides of its coding sequence.
  • Function: asTSS results in the production of antisense RNA complementary to a gene's mRNA. This antisense RNA can regulate gene expression, often through mechanisms such as RNA interference or by affecting mRNA stability.
  • Location: asTSS is found on the opposite strand of DNA compared to the coding strand of the gene.

Orphan TSS:

  • Description: A TSS not associated with any known annotation or gene.
  • Function: The function of orphan TSS is often unclear, but they could represent new, undiscovered genes or regulatory RNA molecules.
  • Location: Orphan TSS are located in intergenic regions and are not directly associated with known genes.
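
The location rules above can be summarised in a simplified classifier for a single gene. This is a sketch under stated assumptions, not TSSpredator's implementation: the 300 nt upstream window (`max_utr`) is a hypothetical value, the 100 nt antisense window follows the description above, and the TSS with the highest expression in the upstream window is taken as primary while the others there are secondary.

```python
def relation(pos, strand, gene, max_utr=300, antisense_window=100):
    """Return 'upstream', 'internal', 'antisense' or None for one TSS.

    `gene` is (start, end, strand) with start <= end (1-based, inclusive).
    """
    start, end, gene_strand = gene
    if strand == gene_strand:
        # Distance from the TSS to the first base of the gene,
        # non-negative when the TSS lies at or upstream of the gene start.
        upstream_dist = start - pos if strand == "+" else pos - end
        if 0 <= upstream_dist <= max_utr:
            return "upstream"
        if start <= pos <= end:
            return "internal"
        return None
    if start - antisense_window <= pos <= end + antisense_window:
        return "antisense"
    return None


def classify(tss_list, gene):
    """Classify TSS given as (pos, strand, expression) tuples against one gene."""
    relations = {(p, s): relation(p, s, gene) for p, s, _ in tss_list}
    upstream = [t for t in tss_list if relations[(t[0], t[1])] == "upstream"]
    strongest = max(upstream, key=lambda t: t[2])[:2] if upstream else None
    classes = {}
    for pos, strand, _ in tss_list:
        rel = relations[(pos, strand)]
        if rel == "upstream":
            classes[(pos, strand)] = "primary" if (pos, strand) == strongest else "secondary"
        else:
            classes[(pos, strand)] = rel or "orphan"
    return classes


gene = (1000, 2000, "+")  # hypothetical gene on the plus strand
tss = [(950, "+", 50.0), (900, "+", 10.0), (1500, "+", 5.0),
       (2050, "-", 7.0), (5000, "+", 3.0)]
classes = classify(tss, gene)
for key in sorted(classes):
    print(key, classes[key])
# (900, '+') secondary
# (950, '+') primary
# (1500, '+') internal
# (2050, '-') antisense
# (5000, '+') orphan
```

In a real annotation, each TSS would be compared against every gene, and a TSS would only be called orphan if it has no relation to any of them.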