TSSHub constructors’ handbook
You can access this spreadsheet to check which species are available and which species you are interested in. If you do not have access permission, please let us know.
Let's go step by step:
# For example
# Download genome
wget -c https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/708/835/GCA_000708835.1_ASM70883v1/GCA_000708835.1_ASM70883v1_genomic.fna.gz
# Download annotation in gtf format
wget -c https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/708/835/GCA_000708835.1_ASM70883v1/GCA_000708835.1_ASM70883v1_genomic.gtf.gz
Set up the conda environment and clone autoBSgenome first:
# If you don't have a conda env yet, create one first
conda create -n tsshub
conda activate tsshub
conda install -c conda-forge -c bioconda prompt_toolkit rich r-base bioconductor-bsgenome
git clone https://github.com/JohnnyChen1113/autoBSgenome.git
Run autoBSgenome:
python autoBSgenome.py
After successfully generating the BSgenome package, please fill in the absolute path of the BSgenome package in the BSgenome location column of the spreadsheet.
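If you want to sanity-check the package before recording its path, you can install it from the generated source tarball and load it. A minimal sketch in R, using the S. pombe example package that appears later in this handbook (replace the tarball and package names with the ones autoBSgenome actually produced):
# Install the freshly built BSgenome package from its source tarball
install.packages("BSgenome.Spombe.NCBI.ASM294v231_1.0.0.tar.gz", repos = NULL, type = "source")
library(BSgenome.Spombe.NCBI.ASM294v231)
# Quick sanity check: list the sequence names stored in the package
seqnames(BSgenome.Spombe.NCBI.ASM294v231)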
You can download the CAGE (or CAGE-like) data via fastq-dl, using the SRA accession listed in the SRA accession column:
conda install -c bioconda fastq-dl
fastq-dl -a <SRA accession> --prefix <SRA accession> --cpus 8
Of course, you can also use any other tool you like to download the data. 😃
Rename the fastq.gz files according to the RENAME SAMPLE AS THIS COLUMN column:
mv <fastq file> <name in RENAME SAMPLE AS THIS COLUMN>
Because we will provide the files to TSSHub users, renaming them in a unified way helps users get comprehensive information.
Use the script below to find a good value for the --alignIntronMax option. You can get the maximum intron length with this code (thanks, Catherine Bailey! 🎊):
# If needed, install BiocManager first (used below to install Bioconductor packages)
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
# Check and install 'GenomicFeatures' package
if (!require("GenomicFeatures", character.only = TRUE)) {
  BiocManager::install("GenomicFeatures")
}
# Check and install 'txdbmaker' package
if (!require("txdbmaker", character.only = TRUE)) {
  BiocManager::install("txdbmaker")
}
library(GenomicFeatures)
library(rtracklayer)
# Build a TxDb object from your annotation file (replace the file name with your own)
annotation <- makeTxDbFromGFF(file = "aluch_annotations.gff", format = "gff")
# Extract all introns and inspect the length distribution and its 95% quantile
introns <- unlist(intronsByTranscript(annotation))
summary(width(introns))
quantile(width(introns), 0.95)
If the maximum is less than 2000, you can just set --alignIntronMax 2000. If the maximum is an unreasonably large number (e.g. a maximum length of 10 kb 😅), set it to the value of the 95% quantile instead.
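Continuing the R session above, this rule can be encoded like so (a sketch only; the 10 kb cutoff for "unreasonably large" is an assumption, adjust it for your genome):
intron_widths <- width(introns)
# At least 2000; fall back to the 95% quantile if the maximum looks unreasonably large
align_intron_max <- if (max(intron_widths) <= 2000) {
  2000
} else if (max(intron_widths) > 10000) {
  unname(ceiling(quantile(intron_widths, 0.95)))
} else {
  max(intron_widths)
}
align_intron_max  # value to pass to --alignIntronMax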
Mapping CAGE data with STAR
Build STAR index
STAR --runMode genomeGenerate \
--runThreadN 8 \
--sjdbGTFfile <.gtf file> \
--genomeDir <genome index output folder> \
--genomeFastaFiles <genome file>
Mapping
STAR --runMode alignReads --runThreadN 24 \
--outSAMtype BAM SortedByCoordinate \
--outSAMorder Paired --outFileNamePrefix <output prefix> \
--genomeDir <STAR index folder> \
--readFilesIn <CAGE data> \
--readFilesCommand zcat \
--limitBAMsortRAM 1317991407 \
--alignIntronMax <max intron length> \
--alignEndsType Extend5pOfRead1
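The sorted BAM files are the input for TSSr below. Optionally, you can also index them, e.g. for inspection in a genome browser; a minimal sketch with Rsamtools (the file names match the renamed examples used in the TSSr script below):
library(Rsamtools)
# Create .bai indexes for the sorted BAM files
indexBam(c("SRR8360238_Aligned.sortedByCoord.out.bam",
           "SRR8360239_Aligned.sortedByCoord.out.bam"))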
##########Load packages################
library(TSSr)
library(Rsamtools)
library(BSgenome.Spombe.NCBI.ASM294v231) # Here change to the BSgenome name you generated for your species
##########Change the lines below################
genomeName <- "BSgenome.Spombe.NCBI.ASM294v231" # Here change to the BSgenome name you generated for your species
mergeIndex <- c(1,1) # grouping index used if mergeSamples() is later called to combine replicates
organismName <- "Schizosaccharomyces pombe"
inputFiles <- c(
"SRR8360238_Aligned.sortedByCoord.out.bam", # set to your renamed bam files
"SRR8360239_Aligned.sortedByCoord.out.bam"
)
inputFilesType <- "bam"
sampleLabels <- c(
"SRR8360238" # set to your renamed file name
,"SRR8360239"
)
output_file <- "pombe.star.RData" # set according to your species
###########################
myTSSr <- new("TSSr", genomeName = genomeName
,inputFiles = inputFiles
,inputFilesType = inputFilesType
,sampleLabels = sampleLabels
,sampleLabelsMerged = sampleLabels
,mergeIndex = mergeIndex
,organismName = organismName)
getTSS(myTSSr)
exportTSStable(myTSSr, data = "raw", merged = "FALSE")
plotCorrelation(myTSSr, samples = "all")
normalizeTSS(myTSSr)
exportTSStoBedgraph(myTSSr, data = "processed", format = "bedGraph")
exportTSStoBedgraph(myTSSr, data = "processed", format = "BigWig")
save(myTSSr, file = output_file)
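In a later R session, the saved object can be reloaded to regenerate tables or tracks without re-running getTSS(), for example:
load("pombe.star.RData")  # restores the myTSSr object saved above
myTSSr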
The only difference between STRIPE-seq and CAGE data is the presence of 8 random bases and a linker sequence in read 1 of the paired-end reads.
- 8 random bases: the first 8 nucleotides of the read are completely random.
- Linker sequence: TATAGGG.
To process STRIPE-seq data, the first step is to trim the random primers and linker sequence from the reads.
A script is available to handle this:
This script is a multithreaded FASTQ trimming tool designed to process single-end or paired-end sequencing data. It reads compressed (gzip) FASTQ files, scans for a specific sequence motif ("TATAGGG"), trims sequences starting after the motif, and ensures the trimmed sequences in paired-end files maintain synchronized lengths.
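For reference only, the core trimming rule can be sketched in R as below (the function name is hypothetical and this is just an illustration; the actual Go tool additionally handles gzip I/O, multithreading, and keeping read pairs synchronized):
# Keep only the part of each read that follows the first "TATAGGG" motif
trim_after_motif <- function(seqs, motif = "TATAGGG") {
  hit <- regexpr(motif, seqs, fixed = TRUE)
  end <- hit + attr(hit, "match.length")       # first position after the motif
  ifelse(hit > 0, substring(seqs, end), seqs)  # reads without the motif are returned unchanged here
}
trim_after_motif("NNNNNNNNTATAGGGACGTACGTACGT")  # -> "ACGTACGTACGT"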
- Download the script:
wget -c https://github.com/JohnnyChen1113/tsstk/raw/refs/heads/main/fastq_trimmer
- Give the script execute permission:
chmod +x ./fastq_trimmer
- Run the script with the following command (this is a Go program, so check that Go is installed before you run it!):
./fastq_trimmer -inputR1 xxx_1.fastq.gz -inputR2 xxx_2.fastq.gz \
-outputR1 xxx_1.trimmed.fastq.gz -outputR2 xxx_2.trimmed.fastq.gz \
-threads 8
After trimming, proceed with the standard pipeline for CAGE data.