GRAND SLAM - erhard-lab/gedi GitHub Wiki

Globally refined analysis of newly transcribed RNA and decay rates using SLAM-seq (GRAND-SLAM) is a computational approach to infer the proportion and the corresponding posterior distribution of new and old RNA for each gene from SLAM-seq experiments (or similar approaches based on metabolic labeling and chemical nucleoside conversion,e.g. this or this ).

Overview

To use GRAND-SLAM, you must do the following

  1. Install Gedi (see here)
  2. Prepare a genome (once per genome)
  3. Run it (once per data set)

Prepare a genome

Assuming you are able to call the gedi script:

gedi -e IndexGenome -organism mus_musculus -version 90 -p -nobowtie -nostar -nokallisto

This will download the genomic sequence (fasta file) as well as the corresponding annotation (gtf file) from ensembl and do some steps to prepare it for usage in GRAND-SLAM. Its name for referencing it will be mus_musculus.90. For details about preparing genomes, see here

Run GRAND-SLAM

Assuming you are able to call the gedi script and have mapped reads from a SLAM-seq experiment (in 24h.cit, which you can download from here):

gedi -e Slam -genomic mus_musculus.90 -prefix test/24h -progress -plot -D -reads 24h.cit

Always make sure, that the given genome is the same as used to map reads.

Here, read mappings are in CIT format. You can also use a bam file or a bamlist file instead. For more documentation on how mapped reads can be specified, see here.

Important: CIT is generally much more efficient than several bam files. We strongly recommend to convert your bam files to CIT first! This can be done by using the bash script bamlist2cit (bamlist2cit [-p] xyz.bamlist) that is included in the GRAND-SLAM distribution.

For more options (e.g. discard 5' end of reads), type:

gedi -e Slam -hh

You can find more example data set and detailed instruction how they were processed here:

Output table

In ${prefix}.tsv you will find the main output table, with all gene-wise information extracted from the data. The two central parameters GRAND-SLAM provides are the read count and the new-to-total RNA ratio (NTR; we introduced this abbreviation in the scSLAM-seq Nature paper, it was called πg in the original GRAND-SLAM Bioinformatics paper). Both are based on reads that are compatible with some transcript of the gene (i.e. the read could have been generated from an mRNA). Specifically, this means that intronic reads are not considered (as they would bias the NTR).

Column Description
Gene The Ensembl gene id
Symbol The gene symbol
XXX Readcount The total number of reads mapped to this gene in condition XXX
XXX 0.05 quantile The lower bound of the posterior distribution for the NTR
XXX Mean The mean of the posterior distribution for the NTR
XXX MAP The mode of the posterior distribution for the NTR (this should usually be used as the NTR)
XXX 0.95 quantile The upper bound of the posterior distribution for the NTR
XXX alpha The alpha parameter of the beta approximation of the posterior distribution for the NTR
XXX beta The beta parameter of the beta approximation of the posterior distribution for the NTR
XXX Conversions * The total number of conversions in this gene
XXX Coverage * The total number of U covered by any reads (if all U were converted, the Conversions=Coverage)
XXX Double-Hits * The same as Conversions, but only counting in doubly sequenced bases (paired-end)
XXX Double-Hit Coverage * The same as Coverage, but only counting in doubly sequenced bases (paired-end)
XXX min2 * The number of reads with at least two characteristic mismatches
Length The length of the longest transcript for this gene
  • this is only available if gedi -e Slam has been called with the -full parameter!

Other output files

We recommend reading the section More explanation on GRAND-SLAM!

  1. ${prefix}.mismatches.pdf: The pages entitled Exonic first and Exonic second contains mismatch statistics for all first and second reads mapped to exons, respectively (whether they may be sense or antisense reads). The next pages ("ExonicSense First"/"ExonicSense Second"/"ExonicAntisense First"/"ExonicAntisense Second") contain the same statistics distinguishing sense and antisense reads. The next few pages show the same info for intronic reads. The remaining pages show the same information but now ordered by mismatch type. Depending on the library preparation protocol / sequencing mode used, some pages may either be omitted or show no data.
  2. ${prefix}.double.pdf: The same info but here only for doubly sequenced base pairs. The mismatch types w.r.t to the sense strand sequence are shown.
  3. ${prefix}.mismatchpos.pdf: Here the mismatches are shown for each position in a read or read pair. For 2x 100bp read pairs, the x axis spans 0 through 200, the first 100 correspond to the first read, 100-200 to the second read (where 200 is the 5' end). The bases indicated on top of the plots is the genomic nucleotide, the color encodes the sequenced nucleotide. Opposite=0 means sense, opposite=1 means antisense. Overlap=0 means outside of any doubly sequenced part, overlap=1 means within the doubly sequenced part. This file in particular is useful to identify artifacts at the beginning or end of the reads (to be excluded from analysis by the -trim5p and -trim3p parameters).
  4. ${prefix}.mismatchposzoomed.pdf: The same as ${prefix}.mismatchpos.pdf but the y axis is scaled to more relevant values.
  5. ${prefix}.binomRates.png: The probabilities of observing T-C mismatches in old and new RNA for each sample. This is the direct result of the EM algorithm for estimating pc described in the paper. The status shown indicates whether the rates could be reliably estimated.
  6. ${prefix}.binomNewProportion.png: The estimated total percentage of new RNA in each sample. Note that this is fundamentally different from the data shown in binomRates.png: The binomRates only depend on the efficiency of 4sU incorporation and conversion (i.e. it is a technical parameter). The new proportion has nothing to do with 4sU incorporation but only indicates how much of the RNA is new (i.e. it is a biological parameter).
  7. ${prefix}.single_new.png and ${prefix}.single_old.png: The probabilities of T-C mismatches in new and old RNA, respectively (is the same as .binomRates.png if not overridden by command line parameter).
  8. ${prefix}.double_new.png and ${prefix}.single_old.png: The probabilities of doubly sequenced T-C mismatches in new and old RNA, respectively.

Important issues

How to map reads

The most important thing for the mapped reads is that the MD attribute is reported (e.g. for STAR: use --outSAMattributes MD), as this is how mismatches in read mapping are encoded in SAM. We also highly recommend to include the NH attribute --outSAMattributes MD NH, which tells GRAND-SLAM about the number of mapping positions for each read. In addition, we prefer reads mapped without softclipping (e.g. for STAR: --alignEndsType EndToEnd) and using the -trim5p and -trim3p parameters to generally remove artifactual patterns of mismatches at the ends of the reads.

Read trimming

Especially if reads have been mapped in end to end mode, there may be artifactual patterns of mismatches at the very 5' end. It is wise to inspect the .mismatchpos.pdf files generated by GRAND-SLAM, and to use the -trim5p parameter to remove the first bases, if necessary!

SNP calling

As T>C SNPs would influence on the estimated new/total ratios, they are first filtered out. Thus, it makes perfect sense to run GRAND-SLAM on all samples from the experiment (on the same cells). This can be done by either putting them into a CIT file, or by creating a bamlist file (see here] for more information).

Regression model for the T>C rate in old RNA

Make sure that you either have samples without 4sU to let GRAND-SLAM estimate the regression model from these data, or to have the regression model for you protocol/sequencer. For the former, make sure that the names of the conditions (i.e. given in the metadata of the CIT file, or by the name of the BAM file) include "no4sU" (or, alternatively, use the -no4sUpattern parameter to tell GRAND-SLAM how to identify them). For the latter, specify it via the -errlm parameter.

More explanations on GRAND-SLAM

Reads

For experiments based on metabolic labeling and chemical nucleoside conversion, several considerations about the observed reads are important:

  1. A (single-end) read can be sense or antisense. It is sense, if it has the same orientation as the mRNA, and it is antisense, if it is the reverse-complement of the mRNA sequence. A sequencing protocol may preserve strand specificity and produce (more or less) only sense reads (e.g. Quant-seq) or antisense reads (current TruSeq protocols). It may also lose strand-specificity (e.g. SMART-seq), producing both sense and antisense reads.
  2. In paired-end sequencing, there is a first and a second read. The first read is the one where the sequencing reaction is initiated at the Read1 primer, the second read the one initiated at the Read2 primer. In general, one read is sense, the other one is antisense. Here, we define a read pair to be sense, if the first read is sense, and antisense if the first read is antisense.
  3. Libraries from any protocol can be sequencing either in single-end mode, or in paired-end mode. Thus, there are four kinds of reads produced by a strand-unspecific protocol in paired-end mode: a) First read, sense; b) Second read, antisense; c) First read, antisense; d) Second read, sense. Normally, a and b occur in a read pair, and c and d occur in a read pair.
  4. If the insert (the part of the DNA library corresponding to RNA fragments) is shorter than 2x the read length, there are doubly sequenced base pairs. As sequencing errors are random events with probabilities in the order of 10-4, the probability of observing the corresponding mismatch twice due to sequencing errors at the same position is negligibly small. This helps GRAND-SLAM to more precisely distinguish old and new RNA (but be aware that sequencing errors and 4sU conversions are not the only cause of mismatches!).

Mismatches and "new" RNA

After mapping reads to the genome (or transcriptome), mismatches can be observed: For instance, if a read containing a C at some position maps to a location on the genome, where there is a T instead, we speak of a T-C mismatch (this could correspond to a 4sU conversion, or a sequencing error, or an error during reverse transcription or PCR and a few other causes).

The goal of experiments based on metabolic labeling and chemical nucleoside conversion is to distinguish RNA that has been synthesized during the time of labeling (new RNA) or before that (old RNA). We also speak of new reads (reads originating from new RNA) and old reads (reads from old RNA). We advice against calling it labeled and unlabeled, because, technically, an RNA (or read) can be new without being labeled (if there are only few T in the corresponding genomic locus, and even if 4sU was present, no incorporation took place).

Labeling efficiency and substitutions rates

The first thing you may want to know about your SLAM-seq (Timelapse-Seq, TUC-seq,...) experiments is the labeling efficiency: How much RNA has been labeled? This is what could be determined with sufficiently long reads without sequencing errors (PCR errors, transcription errors, RNA editing, ...): If there is no other cause for T-C mismatches, and each fragment of new RNA that is sequenced indeed contained a 4sU, one could easily determine for each read, whether is is from new or old RNA, and, by counting and dividing them, the labeling efficiency.

However, due to short reads, and non-negligible other sources for mismatches, this is not possible. Indeed, the labeling efficiency is in fact not even the most interesting parameter of the experiment, as it depends on two things: The percentage of new RNA in the sample and the rate of 4sU incorporation and conversion (refered to as substitution rate). These two parameters constitute the most fundamental principle that GRAND-SLAM is based on: The probability referred to as pc in the paper basically is the substitution rate (to be more precise, it is the probability of observing a T-C mismatch in a new read, and therefore also includes sequencing errors etc.), and the new-to-total ratio (πg) is the percentage of new RNA for a gene g.

Imagine two experiments, both with labeling for 2 hours in the same cell line. In experiment A a low dosage of 4sU was used (e.g. 50μM), in experiment B a high dosage (e.g. 500μM). For both, the amount of new RNA for each gene is the same, but the substitutions rate will be significantly lower in experiment A. Conversely, if the same 4sU dosage was used in both experiments, but A was done on slowly dividing cells, and B on quickly dividing cells, the subsitution rate will be the same (assuming a similar 4sU uptake in both cell types), but B will have much more new RNA. In both cases, the labeling efficiency will be higher in B than in A, but for very different reasons.

Troubleshooting

If you report any errors, please always use the -D parameter to run GRAND-SLAM (gedi -e Slam -D ...). This provides a much more verbose error message!

I am getting SEVERE Could not load startup FxStartup.class error messages in the log

For GRAND-SLAM you can savely ignore these. The reason for getting these is that you don't have the JavaFX (or openfx) package installed on your system. GRAND-SLAM does not use it.

Genes are missing in output / no genes in output file

By default, GRAND-SLAM will only process genes that have biotype protein_coding. Use the parameter -allGenes to circumvent that!

⚠️ **GitHub.com Fallback** ⚠️