FastQC - MattHuff/SingleCellDocumentation_112023 GitHub Wiki

FastQC is a package used to obtain quality control statistics on your input samples. It can identify samples with low quality scores across all reads, fewer reads than other samples, or significant adapter contamination. In practice, running it is always a good idea.

This analysis uses 10x scRNASeq data, which has already been trimmed of adapter content, so our main concern here is confirming that the data looks right.

Installing FastQC

I chose to install FastQC in a mamba environment. As the mamba documentation suggests, you will need to install Miniforge before you can use it. Once mamba was installed and usable, I created my environment with the following command:

mamba create -n fastqc -c bioconda fastqc multiqc

This creates a new environment named fastqc that contains both FastQC and MultiQC. I specified the bioconda channel because neither program is available in the default channel.
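Before moving on, it can help to confirm that both programs are visible once the environment is active. The following is a minimal sketch, assuming you have already run mamba activate fastqc in the current shell; have_tool is a hypothetical helper name used only for illustration, not part of either program.

```shell
# Hypothetical helper: succeeds if the named program is on the PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# Report whether each program installed into the environment can be found.
for tool in fastqc multiqc; do
    if have_tool "$tool"; then
        echo "$tool found at $(command -v "$tool")"
    else
        echo "$tool not found; is the fastqc environment active?"
    fi
done
```

If either program is reported as not found, activating the environment again is the first thing to try.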

Overall directory set-up

Whenever I start a new project, I create a directory with a descriptive name. I created this directory within my working directory on the Palmetto Cluster.

mkdir scRNAseq_p0mice
cd scRNAseq_p0mice

Within this directory, I create two sub-directories. One will contain the raw data, and the other will contain all of my analyses.

mkdir raw_data
mkdir analysis

Get Raw Data

The raw data already exists on the Palmetto Cluster, so all I will do is create symbolic links to it. I do this, rather than copy the files, to save space.

cd raw_data
ln -s ../../../norris/fastq/*_R*_001.fastq.gz .

## return to main project directory
cd ..
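Because the links use a relative path, a mistake in that path produces links that point nowhere, and FastQC would later fail on them. The sketch below is one way to check for that from the main project directory; list_broken_links is a hypothetical helper name, and the check assumes the raw_data layout described above.

```shell
# Hypothetical helper: print every symbolic link in a directory whose target is missing.
list_broken_links() {
    dir="${1:-.}"
    for entry in "$dir"/*; do
        # -L is true for a symlink; -e follows the link and fails if the target is absent.
        if [ -L "$entry" ] && [ ! -e "$entry" ]; then
            echo "$entry"
        fi
    done
}

# Check raw_data from the project directory; no output means every link resolves.
list_broken_links raw_data
```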

Sub-analyses directories

Within my analysis directory, I like to create a sub-directory for each step of the analysis. I number these in the order I run them, so that the order of the analysis will always be clear.

cd analysis
mkdir 1_fastqc
cd 1_fastqc

Running FastQC

I typically use a bash for loop to create one command that runs my analysis on all files, one at a time. If you want the individual jobs to run in parallel, you can end your command with a &, but I would be careful about running them on the Palmetto Cluster in this manner.

for f in /zfs/musc3/huffmat/scRNAseq_p0mice/raw_data/*.fastq.gz
do
	filename=$(basename "$f")
	base="${filename%%.fastq*}"
	echo "filename $filename base $base"
	mkdir -p "$base.fastQC"

	fastqc -o "$base.fastQC" --threads 10 "$f" > "$base.fastQC.out" 2>&1
done

The first few lines in this for loop isolate the filename from the full path and then strip the extension to get a base name. The echo command is then used to confirm that you are getting the expected filename and base name. I recommend running these first commands on their own, before adding the mkdir or fastqc commands, to confirm that you are getting sensible output; the last thing you want is for this analysis to fail because the loop used the wrong extension. The mkdir command makes the directory that will contain your FastQC results, and the -p option keeps it from giving you an error if that directory already exists. Note that FastQC's --threads option sets how many files are processed at the same time, so with one file per call it does not speed up an individual file.
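To see what those first lines produce without touching any data, here is a small sketch that runs them on a single made-up path. The path and sample name are hypothetical and stand in for one of your own files.

```shell
# Hypothetical path, used only to illustrate the expansions; substitute one of your own files.
f="/path/to/raw_data/SampleA_S1_L001_R1_001.fastq.gz"

# basename strips the leading directories, leaving only the file name.
filename=$(basename "$f")

# %%.fastq* removes the longest suffix that starts with ".fastq".
base="${filename%%.fastq*}"

echo "filename $filename base $base"
# prints: filename SampleA_S1_L001_R1_001.fastq.gz base SampleA_S1_L001_R1_001
```

If the base still ends in .fastq.gz or comes out empty, the pattern after %% does not match your file extension and should be adjusted before running the full loop.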

Submitting on Palmetto Cluster

If you are running this on the Palmetto Cluster, save your commands as a .qsh script, and include a header giving your job's name, walltime, and any other options you want to set:

#!/bin/bash

#PBS -N 1_Fastqc
#PBS -l walltime=04:00:00
#PBS -j oe

source ~/.bashrc
mamba activate fastqc
cd $PBS_O_WORKDIR

for f in /zfs/musc3/huffmat/scRNAseq_p0mice/raw_data/*.fastq.gz
do
	filename=$(basename "$f")
	base="${filename%%.fastq*}"
	echo "filename $filename base $base"
	mkdir -p "$base.fastQC"

	fastqc -o "$base.fastQC" --threads 10 "$f" > "$base.fastQC.out" 2>&1
done
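Once the job finishes, it is worth confirming that every input file actually produced a report. The sketch below assumes the directory layout used in the loop (one base.fastQC directory per input) and relies on FastQC naming its report after the input file with _fastqc.html appended; list_missing_reports is a hypothetical helper name.

```shell
# Hypothetical helper: list FASTQ files in raw_dir that have no FastQC HTML report
# under results_dir, following the <base>.fastQC layout used in the loop.
list_missing_reports() {
    raw_dir="$1"
    results_dir="${2:-.}"
    for f in "$raw_dir"/*.fastq.gz; do
        [ -e "$f" ] || continue    # skip the unexpanded pattern when nothing matches
        filename=$(basename "$f")
        base="${filename%%.fastq*}"
        # FastQC names its report after the input file, with _fastqc.html appended.
        if [ ! -f "$results_dir/$base.fastQC/${base}_fastqc.html" ]; then
            echo "$filename"
        fi
    done
}

# Run from analysis/1_fastqc; no output means every input has a report.
list_missing_reports ../../raw_data .
```

Any file names it prints are the ones to look up in the matching .fastQC.out log.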

MultiQC - Optional

FastQC produces individual HTML files for each sample. MultiQC is a tool that collects the results of FastQC for each sample and visualizes them in a single HTML report file. The command I provided to create the mamba environment installed MultiQC as well, so as long as that environment is active, you can run MultiQC. The command to run it is simple:

multiqc .

This command searches the current directory (.) for FastQC results, which are used as input to produce the combined HTML file. By default, this file is saved as multiqc_report.html, so I recommend renaming it to something specific; that way you will know which project it came from and won't end up with multiple files with the same name.
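One way to handle the rename is a small helper that prefixes the report with the project name. This is only a sketch: rename_multiqc_report is a hypothetical name, and it assumes it is run from the directory where multiqc wrote its report.

```shell
# Hypothetical helper: rename the default MultiQC report after the project.
rename_multiqc_report() {
    project="$1"
    if [ -f multiqc_report.html ]; then
        mv multiqc_report.html "${project}_multiqc_report.html"
    else
        # Nothing to rename; report the problem and signal failure to the caller.
        echo "multiqc_report.html not found in $(pwd)" >&2
        return 1
    fi
}
```

For this project, the call would be rename_multiqc_report scRNAseq_p0mice. MultiQC also accepts a --filename option (-n) that sets the report name when it runs, which avoids the separate rename step.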