Calculating the RNA expression for each sample - labbces/sugarcane_RNAome GitHub Wiki

RNA quantification against the pan-transcriptome reference

I developed a Snakemake pipeline to automate the generation of an RNA-Seq expression matrix from raw RNA-Seq datasets. The pipeline was executed with this bash script.

This pipeline was executed to generate quantification files (quant.sf) for samples from 54 genotypes with contrasting fiber and sugar content (as present in the three selected papers mentioned previously). The reference for quantification was the sugarcane pan-transcriptome clustered with CD-HIT using -c 1 (a sequence identity threshold of 100%): CD-HIT_48_genotypes_transcriptome_salmonInx.

The following directed acyclic graph (DAG) represents the workflow to calculate the RNA expression matrix for each genotype (e.g. SP80-3280).

Figure: runSalmon_SnakefileDAG (DAG of the Snakefile rules)

download_fastq: Downloads the raw RNA-Seq data (fastq.gz) for read 1 and read 2 of each sample.

bbduk: Removes adapters and ribosomal RNA (rRNA) sequences, and filters reads by quality.

count_raw_sequences: Counts the number of sequences in raw files, then removes these files.

salmon_index: Generates a Salmon index for quantification.

salmon_quant: Performs quantification of reads against the Salmon index.

count_trimmed_sequences: Counts the number of sequences in trimmed files, then removes these files.

filter_stranded: Filters stranded and paired reads after quantification.

filter_low_mapping_reads: Filters reads by mapping rate, discarding those with a low percentage of mapped reads.

preliminar_report: Generates a preliminary report with statistics of quantified reads.

merge_quantification_results: Generates the expression matrix using Salmon quantmerge.
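To make the final step more concrete, the sketch below shows roughly what merging per-sample quant.sf files into an expression matrix involves. It is not the pipeline's code (the pipeline uses Salmon quantmerge); it is a minimal standard-library illustration that relies only on the standard quant.sf columns (Name, Length, EffectiveLength, TPM, NumReads). The function names `read_quant`, `merge_quants` and `write_matrix` are hypothetical, and it assumes every sample was quantified against the same index.

```python
import csv


def read_quant(path, column="TPM"):
    """Return {transcript_name: value} for one Salmon quant.sf file."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return {row["Name"]: float(row[column]) for row in reader}


def merge_quants(quant_files, column="TPM"):
    """Combine several quant.sf files into a header and one row per transcript.

    quant_files maps a sample label to the path of its quant.sf file.
    Assumption: all samples were quantified against the same index, so
    the transcript order of the first sample is used for every row.
    """
    samples = list(quant_files)
    per_sample = {s: read_quant(quant_files[s], column) for s in samples}
    names = list(per_sample[samples[0]])
    header = ["Name"] + samples
    rows = [
        [name] + [per_sample[s].get(name, 0.0) for s in samples]
        for name in names
    ]
    return header, rows


def write_matrix(header, rows, out_path):
    """Write the merged matrix as a tab-separated file."""
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(header)
        writer.writerows(rows)
```

Passing `column="NumReads"` instead of the default would produce a matrix of estimated read counts rather than TPM values.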

Important: This description provides an overview of the pipeline. Please refer to the Snakefile and the config.yaml configuration file for complete details on the implementation and the parameters used in each rule. To run the pipeline, make sure you have a properly configured configuration file. To use my bash script to execute the pipeline, ensure that config.yaml is present in your directory, along with the $genotype_samples.csv file (e.g. Q200_samples.csv). The latter contains the SRA/ERR accession identifiers for the raw data associated with each genotype. Notably, the Snakefile identifies the genotype's name by extracting it from the $genotype_samples.csv file.
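As an illustration of how a genotype name and its accession list could be derived from such a file, here is a small sketch. It is not taken from the Snakefile. It assumes that the genotype is the part of the file name before `_samples.csv`, and that the CSV holds one accession per row in its first column with no header line; both are assumptions, so check the Snakefile for the actual logic. The function names are hypothetical.

```python
import csv
from pathlib import Path

SUFFIX = "_samples.csv"  # assumed naming convention: <genotype>_samples.csv


def genotype_from_filename(path):
    """Extract the genotype name from a <genotype>_samples.csv path."""
    name = Path(path).name
    if not name.endswith(SUFFIX):
        raise ValueError(f"expected a file named <genotype>{SUFFIX}, got {name!r}")
    return name[: -len(SUFFIX)]


def read_accessions(path):
    """Read accession identifiers from the samples file.

    Assumption: one accession per row, in the first column, no header.
    Empty rows are skipped.
    """
    with open(path, newline="") as handle:
        return [
            row[0].strip()
            for row in csv.reader(handle)
            if row and row[0].strip()
        ]
```

With this convention, `genotype_from_filename("Q200_samples.csv")` returns `"Q200"`, and a file name that does not follow the pattern raises a `ValueError` instead of silently producing a wrong genotype label.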