Bowtie2 - MattHuff/ATACSeqDocumentation

After trimming your input reads, the next step of the pipeline aligns the trimmed reads to a reference genome. There are multiple pipelines that allow for alignment of reads to the genome, and we will be using Bowtie2 for the current analysis. You will also need to install Samtools in the same environment to make adjustments to your output files.

Obtain Reference Genome

I downloaded the human genome GRCh38.p14 from NCBI. Specifically, I downloaded the FASTA assembly and GFF annotation files and saved them to a an archive called human_assembly.zip. From there, I sent the ZIP file to the server:

rsync -avzh --remove-source-files -e ssh Downloads/human_assembly.zip <user_name>@hpcdtn01.rcd.clemson.edu:/zfs/musc3/huffmat/ATAC-Seq/analysis/3_bowtie/

Inside the server, go to the directory containing the reference files. Unzip the files with unzip human_assembly.zip.

Get Index File

Bowtie requires an index file to allow for random access during the alignment process. Consider it like an index to a textbook - you could read the book from the beginning each time you had to find a specific word, or you could use the index to find specific mentions of that word. As genomes can be large, you need an index file to properly run the alignment step.

Run this step in either an interactive session or as a PBS submission on the Cluster. This process uses enough memory and time that it would be a problem for the default, login node, so do not try to run it on its own. The general command to run looks like this:

bowtie2-build -f GCF_000001405.40_GRCh38.p14_genomic.fna GRCh38_human_genome

I recommend a job that lasts for at least two hours and has enough memory to allow it to run to completion. The final location of your index files should be in /zfs/musc3/huffmat/ATAC-Seq/analysis/3_bowtie2/reference_genome/.

Run Bowtie

#!/bin/bash

#PBS -N 3_bowtie
#PBS -l select=1:ncpus=16:mem=200gb
#PBS -l walltime=08:00:00
#PBS -j oe

source ~/.bashrc
mamba activate bowtie
cd $PBS_O_WORKDIR

for R1 in /zfs/musc3/huffmat/ATAC-Seq/analysis/2_trimmomatic/*_1.trimmed.paired.fastq
do
	R2=`sed 's/_1.trimmed/_2.trimmed/' <(echo $R1)`
	BASE=$( basename $R1 | sed 's/_1.trimmed.paired.fastq//')
	echo "R1 $R1"
	echo "R2 $R2"
	echo "BASE $BASE"

	bowtie2 \
		--very-sensitive \
		-X 1000 \
		-x /zfs/musc3/huffmat/ATAC-Seq/analysis/3_bowtie2/reference_genome/GRCh38_human_genome \
		-1 $R1 \
		-2 $R2 |
		samtools view -bSh |
		samtools sort -o ${BASE}_bowtie_sorted.bam
done

Index Files

Just as the index file for the genome is used to help Bowtie2 align reads to the genome, BAM index (.bai) files are needed for downstream programs to read the files efficiently. This is created using samtools index, and I recommend running it after you produce a new .bam file.

mamba activate bowtie

for b in /zfs/musc3/huffmat/ATAC-Seq/analysis/3_bowtie2/*.bam
do
	samtools index $b
done

Bowtie2 - MattHuff/ATACSeqDocumentation_010224 GitHub Wiki

Obtain Reference Genome

Get Index File

Run Bowtie

Index Files

⚠️ GitHub.com Fallback ⚠️

Bowtie2 - MattHuff/ATACSeqDocumentation_010224 GitHub Wiki

Obtain Reference Genome

Get Index File

Run Bowtie

Index Files

⚠️ **GitHub.com Fallback** ⚠️

⚠️ GitHub.com Fallback ⚠️