TSSHub constructors’ handbook
You can access this spreadsheet to check which species are available and which species you are interested in. If you do not have access permission, please let us know.
Let's go step by step:
# For example
# Download genome
wget -c https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/708/835/GCA_000708835.1_ASM70883v1/GCA_000708835.1_ASM70883v1_genomic.fna.gz
# Download annotation in gtf format
wget -c https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/708/835/GCA_000708835.1_ASM70883v1/GCA_000708835.1_ASM70883v1_genomic.gtf.gz
Set up the conda environment and clone autoBSgenome first:
# If you don't have a conda env yet, create one first
conda create -n tsshub
conda activate tsshub
conda install -c conda-forge -c bioconda prompt_toolkit rich r-base bioconductor-bsgenome
git clone https://github.com/JohnnyChen1113/autoBSgenome.git
Run autoBSgenome:
python autoBSgenome.py
After successfully generating the BSgenome package, please fill in the absolute path of the BSgenome package in the BSgenome location column of the spreadsheet.
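If you want to sanity-check the package before recording its path, you can install it from the generated source tarball and load it. A minimal sketch in R, using the S. pombe example package that appears later in this handbook (replace the tarball and package names with the ones autoBSgenome actually produced):
# Install the freshly built BSgenome package from its source tarball
install.packages("BSgenome.Spombe.NCBI.ASM294v231_1.0.0.tar.gz", repos = NULL, type = "source")
library(BSgenome.Spombe.NCBI.ASM294v231)
# Quick sanity check: list the sequence names stored in the package
seqnames(BSgenome.Spombe.NCBI.ASM294v231)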
You can download the CAGE (or CAGE-like) data via fastq-dl, using the SRA accession listed in the SRA accession column:
conda install -c bioconda fastq-dl
fastq-dl -a <SRA accession> --prefix <SRA accession> --cpus 8
Of course, you can also use any other tool you like to download the data. 😃
Rename the fastq.gz files according to the RENAME SAMPLE AS THIS COLUMN column:
mv <fastq file> <name in RENAME SAMPLE AS THIS COLUMN>
Because we will provide the files to TSSHub users, renaming them in a unified way helps users get comprehensive information.
Use the script below to find a good value for the --alignIntronMax option. You can get the maximum intron length with this code (thanks, Catherine Bailey! 🎊):
# If needed, install BiocManager first (used below to install Bioconductor packages)
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
# Check and install 'GenomicFeatures' package
if (!require("GenomicFeatures", character.only = TRUE)) {
  BiocManager::install("GenomicFeatures")
}
# Check and install 'txdbmaker' package
if (!require("txdbmaker", character.only = TRUE)) {
  BiocManager::install("txdbmaker")
}
library(GenomicFeatures)
library(rtracklayer)
# Build a TxDb object from your annotation file (replace the file name with your own)
annotation <- makeTxDbFromGFF(file = "aluch_annotations.gff", format = "gff")
# Extract all introns and inspect the length distribution and its 95% quantile
introns <- unlist(intronsByTranscript(annotation))
summary(width(introns))
quantile(width(introns), 0.95)
If the maximum is less than 2000, you can just set --alignIntronMax 2000. If the maximum is an unreasonably large number (e.g. a maximum length of 10 kb 😅), set it to the value of the 95% quantile instead.
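Continuing the R session above, this rule can be encoded like so (a sketch only; the 10 kb cutoff for "unreasonably large" is an assumption, adjust it for your genome):
intron_widths <- width(introns)
# At least 2000; fall back to the 95% quantile if the maximum looks unreasonably large
align_intron_max <- if (max(intron_widths) <= 2000) {
  2000
} else if (max(intron_widths) > 10000) {
  unname(ceiling(quantile(intron_widths, 0.95)))
} else {
  max(intron_widths)
}
align_intron_max  # value to pass to --alignIntronMax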
Mapping CAGE data with STAR
Build STAR index
STAR --runMode genomeGenerate \
--runThreadN 8 \
--sjdbGTFfile <.gtf file> \
--genomeDir <genome index output folder> \
--genomeFastaFiles <genome file>
Mapping
STAR --runMode alignReads --runThreadN 24 \
--outSAMtype BAM SortedByCoordinate \
--outSAMorder Paired --outFileNamePrefix <output prefix> \
--genomeDir <STAR index folder> \
--readFilesIn <CAGE data> \
--readFilesCommand zcat \
--limitBAMsortRAM 1317991407 \
--alignIntronMax <max intron length> \
--alignEndsType Extend5pOfRead1
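The sorted BAM files are the input for TSSr below. Optionally, you can also index them, e.g. for inspection in a genome browser; a minimal sketch with Rsamtools (the file names match the renamed examples used in the TSSr script below):
library(Rsamtools)
# Create .bai indexes for the sorted BAM files
indexBam(c("SRR8360238_Aligned.sortedByCoord.out.bam",
           "SRR8360239_Aligned.sortedByCoord.out.bam"))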
##########Load packages################
library(TSSr)
library(Rsamtools)
library(BSgenome.Spombe.NCBI.ASM294v231) # Here change to the BSgenome name you generated for your species
##########Change the lines below################
genomeName <- "BSgenome.Spombe.NCBI.ASM294v231" # Here change to the BSgenome name you generated for your species
mergeIndex <- c(1,1) # grouping index used if mergeSamples() is later called to combine replicates
organismName <- "Schizosaccharomyces pombe"
inputFiles <- c(
"SRR8360238_Aligned.sortedByCoord.out.bam", # set to your renamed bam files
"SRR8360239_Aligned.sortedByCoord.out.bam"
)
inputFilesType <- "bam"
sampleLabels <- c(
"SRR8360238" # set to your renamed file name
,"SRR8360239"
)
output_file <- "pombe.star.RData" # set according to your species
###########################
myTSSr <- new("TSSr", genomeName = genomeName
,inputFiles = inputFiles
,inputFilesType = inputFilesType
,sampleLabels = sampleLabels
,sampleLabelsMerged = sampleLabels
,mergeIndex = mergeIndex
,organismName = organismName)
getTSS(myTSSr)
exportTSStable(myTSSr, data = "raw", merged = "FALSE")
plotCorrelation(myTSSr, samples = "all")
normalizeTSS(myTSSr)
exportTSStoBedgraph(myTSSr, data = "processed", format = "bedGraph")
exportTSStoBedgraph(myTSSr, data = "processed", format = "BigWig")
save(myTSSr, file = output_file)
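In a later R session, the saved object can be reloaded to regenerate tables or tracks without re-running getTSS(), for example:
load("pombe.star.RData")  # restores the myTSSr object saved above
myTSSr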
The only difference between STRIPE-seq and CAGE data is the presence of 8 random bases and a linker sequence in read 1 of the paired-end reads.
- 8 random bases: the first 8 nucleotides of the read are completely random.
- Linker sequence: TATAGGG.
To process STRIPE-seq data, the first step is to trim the random primers and linker sequence from the reads.
A script is available to handle this:
This script is a multithreaded FASTQ trimming tool designed to process single-end or paired-end sequencing data. It reads compressed (gzip) FASTQ files, scans for a specific sequence motif ("TATAGGG"), trims sequences starting after the motif, and ensures the trimmed sequences in paired-end files maintain synchronized lengths.
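For reference only, the core trimming rule can be sketched in R as below (the function name is hypothetical and this is just an illustration; the actual Go tool additionally handles gzip I/O, multithreading, and keeping read pairs synchronized):
# Keep only the part of each read that follows the first "TATAGGG" motif
trim_after_motif <- function(seqs, motif = "TATAGGG") {
  hit <- regexpr(motif, seqs, fixed = TRUE)
  end <- hit + attr(hit, "match.length")       # first position after the motif
  ifelse(hit > 0, substring(seqs, end), seqs)  # reads without the motif are returned unchanged here
}
trim_after_motif("NNNNNNNNTATAGGGACGTACGTACGT")  # -> "ACGTACGTACGT"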
- Download the script:
wget -c https://github.com/JohnnyChen1113/tsstk/raw/refs/heads/main/fastq_trimmer
- Give the script execute permission:
chmod +x ./fastq_trimmer
- Run the script with the following command (this is a Go program, so check that Go is installed before you run it!):
./fastq_trimmer -inputR1 xxx_1.fastq.gz -inputR2 xxx_2.fastq.gz \
-outputR1 xxx_1.trimmed.fastq.gz -outputR2 xxx_2.trimmed.fastq.gz \
-threads 8
After trimming, proceed with the standard pipeline for CAGE data.