Usage scripts - MeryemAk/dRNASeq GitHub Wiki

Pipeline Overview

The dRNASeq pipeline is designed for the analysis of Nanopore RNA sequencing data, with the goal of characterizing both host and microbial gene expression for diagnostic markers. The pipeline includes the following steps:

  1. Configuration file - Define settings and parameters for the scripts within this file.
  2. Raw Data Processing – Merging FASTQ files and filtering rRNA
  3. Trimming - identify and orient reads
  4. Quality control - Quality control of individual samples and a comparative report of all samples within the same run
  5. Alignment – Mapping reads to a reference genome (Homo sapiens, Candida albicans, and selected bacteria) and counting unmapped reads
  6. Counting – Gene-level quantification
  7. Taxonomic Classification – Assigning taxonomic labels to unmapped reads
  8. Differential Analysis – Identifying differentially expressed genes (not yet implemented)

Note: This usage guide explains how the scripts look. For a detailed step-by-step guide, please refer to the Standard Operating Procedure for scripts


Getting Started

1. Install and activate the Conda Environment

Before running the pipeline, install and activate the Conda environment:

./1.linuxsetup
conda activate dRNAseq  # To deactivate the environment later, use `conda deactivate`

This script automates the installation of the miniconda3 tool and the environment necessary to run the pipeline in.

2. Check whether the parameters in the config file are correct.

Open the configuration file with notepad (or any other text editor). Change the location of folders and files to their correct location before running the scripts.

Running the Pipeline

Step 1. Merging

Merge all FASTQ files from a single barcode using the 2.merge.sh script:

./2.merge.sh

Step 2. Filter rRNA

rRNA is filtered from the samples with minimap2 using the silva database. After mapping the unmapped reads are taken with samtools and are used for downsteam analysis.

./3.filter_rRNA.sh

Step 3. Trimming

Pychopper is a tool developed by Oxford Nanopore Technologies (ONT) to process cDNA reads from ONT-sequencing . It is used to identify full-length cDNA reads by detecting primer sequences which were added during library prep. This helps to determine whether a read represents a full-length cDNA molecule. Pychopper also orients reads so that they β€˜re all in the correct direction (5’ --> 3’). This way, all sequences are consistently aligned for downstream analysis.

./4.pychoper.sh

Step 4. Quality control (QC)

Perform QC on the trimmed FASTQ files. This script uses the Nanopack tools NanoPlot and NanoComp. NanoPlot makes reports for each individual sample, while NanoComp compares all samples within the same run with eachother.

./5.qc.sh

(Optional: Re-run 5.qc.sh after trimming to assess post-trimming quality. Back up the original QC reports before rerunning the script, as they will be overwritten!)

Step 5. Reference mapping library

A bacterial reference library was established using a list of organisms (see table), ensuring representation of relevant vaginal bacterial species. Reference genomes and annotation files for these organisms were retrieved from NCBI genome, and bacterial sequences were clustered with MMSeqs2 (sequence identity 95%) to optimize alignment efficiency. Indexed databases were generated using the default parameters in minimap2 to facilitate mapping. To create this bacterial reference library run the create_bacteria_index.sh script within the reference_genomes/ directory.

Step 6. Mapping

The mapping process follows a multi-stage mapping process with Samtools and minimap2 using distinct alignment settings for eukaryotic and prokaryotic sequences. First, reads are aligned to the Homo sapiens reference genome with parameters optimized for splicing. Unmapped reads are then filtered and mapped against Candida albicans using the same approach. For bacterial sequences, where splicing is not relevant, remaining unmapped reads are mapped without splicing.

./5.mapping.sh

Step 7. Quantification

Counting the genes is performed with the bambu tool. First the Docker image of bambu needs to be built. Run the 6.counting.sh script after. The output is given within count tables which can be used for differential gene expression analysis in R.

./6.counting.sh

Step 8. (optional): Taxonomic classification.

The workflow concludes with taxonomic classification performed by Kraken2, where reads that remained unmapped after the host and pathogen alignment steps are assigned to taxa against a Human Vaginal Microbiome Collection database. This final classification step could be used for identifying additional significant bacterial species. This allows for iterative improvements to the reference mapping library, enabling re-execution of the pipeline if new insights emerge.

./7.kraken.sh

Step 9. Differential gene expression analysis in R (not yet implemented).

https://rnnh.github.io/bioinfo-notebook/docs/DE_analysis_edgeR_script.html