Preprocessing Pipelines

Author(s): Bryan Hawickhorst


Overview

DataSeq provides automated preprocessing pipelines for three types of genomic data: RNA-seq, ATAC-seq, and WGBS (whole-genome bisulfite sequencing).

Each pipeline cleans raw sequencing data and prepares it for downstream analysis by following standardized steps.

Available Pipelines

RNA-seq Pipeline

  1. Quality control (QC) on raw reads
  2. Adapter trimming based on QC results
  3. Alignment to a reference genome
  4. Generation of read count tables

ATAC-seq Pipeline

  1. Quality control (QC) on raw reads
  2. Adapter trimming
  3. Alignment to a reference genome
  4. Peak calling to identify transposase-accessible chromatin regions

WGBS Pipeline

  1. Quality control (QC) on bisulfite-treated reads
  2. Adapter trimming
  3. Alignment to a bisulfite-converted reference genome
  4. Methylation calling
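
The three pipelines share the same overall shape: QC, trimming, alignment, then one final step specific to the data type. A minimal Python sketch of that structure follows. The `PIPELINES` dictionary, its keys, and the step labels are illustrative names chosen here, not DataSeq identifiers.

```python
from typing import Dict, List

# Illustrative only: step labels mirror the lists above, not DataSeq internals.
PIPELINES: Dict[str, List[str]] = {
    "rna-seq": ["qc", "adapter_trimming", "alignment", "read_counting"],
    "atac-seq": ["qc", "adapter_trimming", "alignment", "peak_calling"],
    "wgbs": ["qc", "adapter_trimming", "bisulfite_alignment", "methylation_calling"],
}


def steps_for(data_type: str) -> List[str]:
    """Return the ordered steps for a data type (case-insensitive)."""
    key = data_type.strip().lower()
    if key not in PIPELINES:
        raise ValueError(f"unknown data type: {data_type!r}")
    return list(PIPELINES[key])  # copy so callers cannot mutate the table


def final_step(data_type: str) -> str:
    """Return the last, data-type-specific step of a pipeline."""
    return steps_for(data_type)[-1]
```

For example, `final_step("ATAC-seq")` returns the peak-calling label, while the first two entries are the same for every data type.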

User Parameters

When configuring a preprocessing job, users provide values for parameters that describe their dataset and control how it is processed.
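
This page does not list the parameters themselves. As a rough sketch of what such a job configuration might look like, the Python below uses hypothetical field names (`data_type`, `reference_genome`, `paired_end`, `min_read_length`). None of them are confirmed DataSeq parameters.

```python
from dataclasses import dataclass

SUPPORTED_DATA_TYPES = ("rna-seq", "atac-seq", "wgbs")


@dataclass(frozen=True)
class JobParameters:
    """Hypothetical job settings; field names are assumptions, not DataSeq's."""

    data_type: str
    reference_genome: str
    paired_end: bool = True
    min_read_length: int = 20  # assumed trimming threshold, for illustration

    def validate(self) -> None:
        """Raise ValueError if any value is clearly unusable."""
        if self.data_type not in SUPPORTED_DATA_TYPES:
            raise ValueError(f"unsupported data type: {self.data_type!r}")
        if not self.reference_genome:
            raise ValueError("reference_genome must not be empty")
        if self.min_read_length < 1:
            raise ValueError("min_read_length must be at least 1")
```

Validating the values before submission, as in `JobParameters("wgbs", "hg38").validate()`, catches obvious mistakes before any cluster time is spent.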

Quality Control (QC)

Quality control is performed as a separate first step for all uploaded samples.

  • After QC, a MultiQC report is generated summarizing data quality.
  • Users can review the report to decide whether to adjust parameters for subsequent steps, especially trimming.
  • QC processing does not require manual intervention after submission.
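
As a rough illustration of the QC step, the sketch below builds command lines for FastQC on each sample, followed by a single MultiQC summary over the results. Whether DataSeq invokes the tools this way is an assumption. The function names and directory layout are hypothetical, and the commands are only constructed, not executed.

```python
from pathlib import Path
from typing import List


def fastqc_command(fastq: str, out_dir: str) -> List[str]:
    """Build a FastQC call that writes its report into out_dir."""
    return ["fastqc", "--outdir", out_dir, fastq]


def multiqc_command(qc_dir: str, report_dir: str) -> List[str]:
    """Build a MultiQC call that scans qc_dir and writes a summary report."""
    return ["multiqc", qc_dir, "--outdir", report_dir]


def qc_commands(fastqs: List[str], work_dir: str) -> List[List[str]]:
    """One FastQC command per sample, then one MultiQC command over all of them."""
    qc_dir = str(Path(work_dir) / "fastqc")        # assumed layout
    report_dir = str(Path(work_dir) / "multiqc")   # assumed layout
    commands = [fastqc_command(f, qc_dir) for f in fastqs]
    commands.append(multiqc_command(qc_dir, report_dir))
    return commands
```

Each returned list could be passed to `subprocess.run` on a machine where FastQC and MultiQC are installed.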

How Preprocessing Works

  • After uploading files, choosing a data type, and setting parameters on the Pre-Processing Page, DataSeq automatically runs the pipeline.
  • Processing is managed by a secure high-performance computing cluster.
  • Users can monitor job progress through the Logging page.
  • Processed files are available for download once jobs complete.
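
The job lifecycle described above (run each step in order, record progress, make results available on completion) can be sketched as a small runner. This is a simplified model under stated assumptions, not DataSeq's scheduler: `run_job`, the status strings, and the log format are all hypothetical.

```python
from typing import Callable, List, Tuple

# A step is a label plus a zero-argument callable that does the work.
Step = Tuple[str, Callable[[], None]]


def run_job(steps: List[Step]) -> Tuple[str, List[str]]:
    """Run steps in order; return (status, log lines), stopping at the first failure."""
    log: List[str] = []
    for name, action in steps:
        log.append(f"{name}: started")
        try:
            action()
        except Exception as exc:  # keep the error text for troubleshooting
            log.append(f"{name}: failed ({exc})")
            return "failed", log
        log.append(f"{name}: finished")
    return "completed", log
```

Stopping at the first failed step means later steps never run on incomplete inputs, and the log shows exactly where the job stopped.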

Outputs

Each pipeline produces:

  • Cleaned sequencing reads
  • Quality control reports (e.g., FastQC, MultiQC)
  • Processed data files (e.g., read count matrices, peak files, methylation calls)

These outputs can be downloaded through the Downloads page.

Notes

  • QC uses the same scripts across all data types.
  • Preprocessing steps (e.g., trimming, alignment) are customized based on data type and user-provided parameters.
  • All backend workflow management is handled automatically and does not require user action.
  • Detailed logs are available for troubleshooting if a job fails.