Alignment Pipeline - a-lud/nf-pipelines GitHub Wiki

Introduction

This sub-workflow aligns short-read DNA sequence data to reference genomes. It aims to be a general-purpose pipeline for quick and easy alignment with either of two current aligners, BWA-MEM2 or Minimap2.

The pipeline aligns each sample to its specified genome, filters out alignments below a user-specified mapping quality, and removes unmapped reads to reduce the size of the output files. Duplicate alignments are then marked with Sambamba. After alignment, alignment statistics are generated with mosdepth and flagstat and presented in a MultiQC report.

Arguments

The current version of the alignment pipeline has the following arguments:

--seqdir string              Directory path to paired-end reads.
--sheet string               CSV file of two columns '<sample.basename>,<reference>'.
--platform string            Specify the sequencing platform. Options: illumina, mgi.
--aligner string             Aligner to use for short-read mapping. Options: bwa2, minimap2.
--mapq integer               Minimum mapping quality threshold. Default 10.

Arguments overview

Seqdir

Path to the directory containing the FASTQ files listed in the sample sheet (see Sheet below).

Sheet

Provide the file path to a CSV file that contains two columns (without column names).

  1. Basename of the FASTQ files in the seqdir (i.e. whatever comes before _R?.fastq.gz)
  2. The file path to the reference genome you want to align the sample to

The pipeline searches the seqdir for files that match each basename you provide and creates a data channel from them.

An example of the CSV file is shown below:

sample-AA,/home/a1645424/al/hydrophis-major/hydmaj-chromosome/reference-1.fa
sample-BB,/home/a1645424/al/hydrophis-major/hydmaj-chromosome/reference-2.fa
sample-CC,/home/a1645424/al/hydrophis-major/hydmaj-chromosome/reference-3.fa

Here, sample-AA matches files named sample-AA_R?.fastq.gz, where ? stands for any single character (for example, sample-AA_R1.fastq.gz and sample-AA_R2.fastq.gz).
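
The lookup described above can be sketched in Python. This is only an illustration of the matching logic, not the pipeline's own Nextflow code; the helper name `match_fastqs` and the use of `pathlib` globbing are assumptions made for the example.

```python
import csv
from pathlib import Path


def match_fastqs(sheet, seqdir):
    """Map each sample basename to (reference, sorted FASTQ paths).

    Hypothetical helper: mirrors the documented '<basename>_R?.fastq.gz'
    lookup, not the pipeline's actual implementation.
    """
    matches = {}
    with open(sheet, newline="") as handle:
        for row in csv.reader(handle):
            if len(row) < 2:
                continue  # skip blank or incomplete lines
            basename, reference = row[0].strip(), row[1].strip()
            # '?' matches exactly one character, e.g. the 1 or 2 in _R1/_R2
            fastqs = sorted(Path(seqdir).glob(f"{basename}_R?.fastq.gz"))
            matches[basename] = (reference, fastqs)
    return matches
```

Because the pattern ends the basename with `_R`, a sample called sample-AA would not pick up files belonging to a sample called sample-AAA.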

Platform

Specify the platform the sequence data was generated on (illumina or mgi) so it can be added to the BAM read group.
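
For readers unfamiliar with read groups, the sketch below shows roughly what a SAM/BAM @RG header line carrying the platform looks like. The tags ID, SM and PL come from the SAM format; everything else is an assumption for illustration: the helper name `read_group` is hypothetical, reusing the sample basename for both ID and SM is a guess, and the value the pipeline actually writes to PL for each platform is not documented here.

```python
def read_group(sample, platform):
    """Build a SAM @RG header line for one sample (hypothetical helper)."""
    allowed = {"illumina", "mgi"}  # options documented for --platform
    if platform not in allowed:
        raise ValueError(f"unsupported platform: {platform}")
    # Assumption: ID and SM both reuse the sample basename, and PL carries
    # the --platform value unchanged; the pipeline may format these differently.
    return "\t".join(["@RG", f"ID:{sample}", f"SM:{sample}", f"PL:{platform}"])
```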

Aligner

Choose which alignment tool to use. The current options are BWA-MEM2 (bwa2) and Minimap2 (minimap2). Both are fast, proven aligners that are suitable for most data types.

Mapq

Specify the minimum mapping quality that alignments must meet; alignments below this threshold are filtered out. The default is 10.
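
The effect of the threshold, combined with the removal of unmapped reads, can be sketched as a filter over SAM records. This is a minimal illustration of the logic only, not the tool the pipeline uses for filtering; the function name `keep_alignment` and the toy records are hypothetical. In the SAM format the FLAG is the second column, MAPQ is the fifth, and flag bit 0x4 marks an unmapped read.

```python
def keep_alignment(sam_line, min_mapq=10):
    """Return True if a SAM record is mapped and meets the MAPQ threshold."""
    fields = sam_line.rstrip("\n").split("\t")
    flag, mapq = int(fields[1]), int(fields[4])
    is_unmapped = bool(flag & 0x4)  # SAM flag bit 0x4: segment unmapped
    return not is_unmapped and mapq >= min_mapq


# Toy records (made up for the example)
records = [
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tFFFF",  # mapped, MAPQ 60
    "r2\t0\tchr1\t200\t5\t4M\t*\t0\t0\tACGT\tFFFF",   # mapped, MAPQ 5
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tFFFF",         # unmapped
]
kept = [line.split("\t")[0] for line in records if keep_alignment(line)]
print(kept)  # ['r1']
```

With the default of 10, only r1 survives: r2 falls below the threshold and r3 is unmapped.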

Pipeline schematic
