Running rnaseq pipeline - YounisLab/docs GitHub Wiki
rnaseq-pipeline is a nextflow-based pipeline that we use to process RNA-Seq data. The pipeline takes as input a folder of .fastq
files and uses various off-the-shelf bioinformatics tools to process them into the desired output. More details can be found in the repo.
Running it requires Nextflow and Docker to be installed.
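A quick way to confirm that both prerequisites are on the PATH before launching is sketched below. The helper name `require_tool` is purely illustrative and is not part of the pipeline.

```shell
# Hypothetical helper (not part of rnaseq-pipeline): report whether a
# command is available on PATH. Returns non-zero if it is missing.
require_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1" >&2
        return 1
    fi
}

# Check the two prerequisites; report each one without aborting.
for tool in nextflow docker; do
    require_tool "$tool" || true
done
```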
The pipeline is run with the following args:
nextflow run rnaseq-pipeline.nf [OPTIONS] --ref_dir <REF_DIR> --fastq_dir <FASTQ_DIR> \
--star_index <STAR_INDEX_DIR> --genome <GENOME_VERSION> \
--cores <NUM_CORES> --output_dir <OUTPUT_DIR>
Most of the mandatory arguments are self-explanatory (--cores, --output_dir) and are documented in the repo itself. Here is a more cluster-specific description of the others:
- --ref_dir: This is the path to the folder containing the reference files required for the pipeline to work. These reference files were generated by a script called making_all_reference_files.sh (Ihab should be familiar with this script). As of 12/03/2021, this directory is located at /home/data/hg38_ref on the bio-crs clusters.
- --fastq_dir: This is the path to the input folder containing the .fastq files to be processed. The filenames in this directory have to follow a particular format, which changes depending on whether the --single_end switch is used. If the --no_replicates switch is used, the filenames do not matter. The format itself is documented in the repo.
- --star_index: This is the path to the folder containing the indices that STAR needs. As of 12/03/2021, this is /home/data/STAR_indexes on the clusters.
- --genome: This is the version of the reference genome in the --ref_dir folder. As of 12/03/2021, we use the hg38 version, so this should always be set to hg38.
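Putting the cluster-specific values above together, an invocation might be assembled as in the sketch below. The fastq directory, output directory and core count are placeholders to replace, and `build_rnaseq_cmd` is a hypothetical helper, not something the repo provides. It only prints the command so that it can be inspected before running.

```shell
# Cluster-specific values as of 12/03/2021, taken from the descriptions above.
REF_DIR=/home/data/hg38_ref
STAR_INDEX=/home/data/STAR_indexes
GENOME=hg38

# Hypothetical helper: print the nextflow command for a given input folder,
# output folder and core count. It does not run anything.
build_rnaseq_cmd() {
    fastq_dir=$1
    output_dir=$2
    cores=$3
    printf 'nextflow run rnaseq-pipeline.nf --ref_dir %s --fastq_dir %s --star_index %s --genome %s --cores %s --output_dir %s\n' \
        "$REF_DIR" "$fastq_dir" "$STAR_INDEX" "$GENOME" "$cores" "$output_dir"
}

# Placeholder paths and core count; replace with real values.
build_rnaseq_cmd /path/to/fastq /path/to/output 8
```

Once the printed command looks right, it can be pasted into the shell, with optional switches such as --single_end or --no_replicates added where they apply.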