Running rnaseq pipeline - YounisLab/docs GitHub Wiki

Introduction

rnaseq-pipeline is a Nextflow-based pipeline that we use to process RNA-Seq data. It takes a folder of .fastq files as input and runs them through various off-the-shelf bioinformatics tools to produce the desired output. More details can be found in the repo.

Setup

The pipeline requires Nextflow and Docker to be installed.
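A quick way to confirm both tools are available before a run is to check that they are on the PATH. This is only a minimal sketch: check_tool is a hypothetical helper name, not part of the pipeline, and it does not verify versions or whether the Docker daemon is actually running.

```shell
# Report whether a required tool is on PATH (check_tool is a hypothetical helper).
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# The two tools named in the Setup section. "|| true" keeps the script going
# so that both results are printed even if the first tool is missing.
check_tool nextflow || true
check_tool docker || true
```

If either line reports "missing", install that tool before attempting to run the pipeline.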

Running

The pipeline is run with the following arguments:

nextflow run rnaseq-pipeline.nf [OPTIONS] --ref_dir <REF_DIR> --fastq_dir <FASTQ_DIR> \
--star_index <STAR_INDEX_DIR> --genome <GENOME_VERSION> \
--cores <NUM_CORES> --output_dir <OUTPUT_DIR>

Some of the mandatory arguments are self-explanatory (--cores, --output_dir) and are documented in the repo itself. Here are some cluster-specific notes on the others:

  • --ref_dir: This is the path to the folder containing the reference files required for the pipeline to work. These reference files were generated by a script called making_all_reference_files.sh (Ihab should be familiar with this script). As of 12/03/2021, this directory is located at /home/data/hg38_ref on the bio-crs clusters.
  • --fastq_dir: This is the path to the input folder containing .fastq files to be processed. The filenames in this directory have to be in a particular format, and this format changes depending on whether the --single_end switch is used or not. If the --no_replicates switch is used, the filename does not matter. The format itself is documented in the repo.
  • --star_index: This is the path to the folder containing the indices that STAR needs to run. As of 12/03/2021, this is in /home/data/STAR_indexes on the clusters.
  • --genome: This is the version of the reference genome in the --ref_dir folder. As of 12/03/2021, we use the hg38 version, so this should always be set to hg38.
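Putting the cluster-specific values above together, a concrete invocation might look like the sketch below. The reference directory, STAR index path, and genome version come from the notes above. The input folder, output folder, and core count are hypothetical placeholders, and build_cmd is a hypothetical helper that only prints the command for review. It also assumes that /home/data/STAR_indexes is passed directly as the index folder; if the index lives in a subfolder, adjust STAR_INDEX accordingly.

```shell
# Cluster paths taken from the notes above (as of 12/03/2021).
REF_DIR=/home/data/hg38_ref
STAR_INDEX=/home/data/STAR_indexes
GENOME=hg38

# Hypothetical values: replace with your own input folder, output folder, and core count.
FASTQ_DIR="$HOME/my_experiment/fastq"
OUTPUT_DIR="$HOME/my_experiment/output"
CORES=8

# Print the full command on one line so it can be reviewed before launching.
build_cmd() {
    echo "nextflow run rnaseq-pipeline.nf" \
        "--ref_dir $REF_DIR --fastq_dir $FASTQ_DIR" \
        "--star_index $STAR_INDEX --genome $GENOME" \
        "--cores $CORES --output_dir $OUTPUT_DIR"
}

build_cmd
```

Once the printed command looks right, run it from the directory that contains rnaseq-pipeline.nf, adding switches such as --single_end or --no_replicates where they apply.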