TSSHub constructors’ handbook - Linlab-slu/TSSr GitHub Wiki
You can access this spreadsheet to check what species are available and what species you are interested in. If you have not access permission, please let us know.
Let start step by step:
# For example
# Download genome
wget -c https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/708/835/GCA_000708835.1_ASM70883v1/GCA_000708835.1_ASM70883v1_genomic.fna.gz
# Download annotation in gtf format
wget -c https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/708/835/GCA_000708835.1_ASM70883v1/GCA_000708835.1_ASM70883v1_genomic.gtf.gz
clone autoBSgenome first:
# If you haven't a conda env, create it first
conda create -n tsshub
conda activate tsshub
conda install -c conda-forge -c bioconda prompt_toolkit rich r-base bioconductor-bsgenome
git clone https://github.com/JohnnyChen1113/autoBSgenome.git
run autoBSgenome
python autoBSgenome.py
After successfully generating the BSgenome package, please fill in the absolute path of BSgenome packages in the spreadsheet in the BSgenome location
column.
You can download CAGE (CAGE-like) data with SRA accession via fastq-dl in the SRA accession
column
conda install -c bioconda fastq-dl
fastq-dl -a <SRA accession> --prefix <SRA accession> --cpus 8
Of course, you can also use other tools as you like to download the data. 😃
rename the fastq.gz file with the RENAME SAMPLE AS THIS COLUMN
column
mv <fastq file> <name in RENAME SAMPLE AS THIS COLUMN>
Because we will provide the files to TSSHub user, rename with a unified name can help them to have comprehensive information.
Using this script to get the best value for the option --alignIntronMax
:
You can get the max intron length with this code (Thanks Catherine Bailey! 🎊 ):
# Check and install 'GenomicFeatures' package
if (!require("GenomicFeatures", character.only = TRUE)) {
BiocManager::install("GenomicFeatures")
}
# Check and install 'txdbmaker' package
if (!require("txdbmaker", character.only = TRUE)) {
BiocManager::install("txdbmaker")
}
library(GenomicFeatures)
library(rtracklayer)
annotation <- makeTxDbFromGFF(file = "aluch_annotations.gff", format = "gff")
introns <- intronsByTranscript(annotation )
introns <- unlist(introns)
summary(width(introns))
quantile(width(introns), 0.95)
If the result less than 2000, you can just set the --alignIntronMax 2000
, if the max is an unreasonable large number (like max length is 10k bp 😅 ) you can set the length to the value of quantile 95%
Mapping CAGE data with STAR
Build STAR index
STAR --runMode genomeGenerate \
--runThreadN 8 \
--sjdbGTFfile <.gtf file> \
--genomeDir <genome index output folder> \
--genomeFastaFiles <genome file>
mapping
STAR --runMode alignReads --runThreadN 24 \
--outSAMtype BAM SortedByCoordinate
--outSAMorder Paired --outFileNamePrefix <output prefix> \
--genomeDir <STAR index folder> \
--readFilesIn <CAGE data> \
--readFilesCommand zcat \
--limitBAMsortRAM 1317991407 \
--alignIntronMax <max intron length> \
--alignEndsType Extend5pOfRead1
##########Load packages################
library(TSSr)
library(Rsamtools)
library(BSgenome.Spombe.NCBI.ASM294v231) # Here change to the BSgenome name you generated for your species
##########Change below lines################
genomeName <- "BSgenome.Spombe.NCBI.ASM294v231" # Here change to the BSgenome name you generated for your species
mergeIndex <- c(1,1)
organismName <- "Schizosaccharomyces pombe"
inputFiles <- c(
"SRR8360238_Aligned.sortedByCoord.out.bam", # set to your renamed bam files
"SRR8360239_Aligned.sortedByCoord.out.bam"
)
inputFilesType <- "bam"
sampleLabels <- c(
"SRR8360238" # set to your renamed file name
,"SRR8360239"
)
output_file <- "pombe.star.RData" # set according to your species
###########################
myTSSr <- new("TSSr", genomeName = genomeName
,inputFiles = inputFiles
,inputFilesType = inputFilesType
,sampleLabels = sampleLabels
,sampleLabelsMerged = sampleLabels
,mergeIndex = mergeIndex
,organismName = organismName)
getTSS(myTSSr)
exportTSStable(myTSSr, data = "raw", merged = "FALSE")
plotCorrelation(myTSSr, samples = "all")
normalizeTSS(myTSSr)
exportTSStoBedgraph(myTSSr, data = "processed", format = "bedGraph")
exportTSStoBedgraph(myTSSr, data = "processed", format = "BigWig")
save(myTSSr, file = output_file)