How To: Cut Adapt for 16S Analysis
Removing Illumina sequencing adapters and primers using cutadapt
Sequencing data is delivered from ICBR to the lab HiPerGator account and is backed up before you are given access to it. Before you analyze it, you must remove any Illumina adapters and primers that may be in your sequencing reads. You will run the program "cutadapt" from the command line on sequence data stored on HiPerGator. This requires a terminal (iTerm for Mac) and an SFTP (secure file transfer protocol) program such as Cyberduck. See the HiPerGator cheat sheet for more details.
To access files via Cyberduck and HiPerGator (hpg):
- Open up both your terminal and your SFTP and log into HiPerGator using your gatorlink credentials. Look here if you don't know how. You will use the terminal to run commands on HiPerGator and you will use the SFTP to visualize contents of directories on HiPerGator and to move files between your desktop and HiPerGator.
- From your SFTP, navigate to the lab's shared directory (/blue/juliemeyer/share). Your 16S reads will be in one of the folders in the shared folder. Find your reads.
- The forward and reverse reads for each sample will be organized into a folder dedicated to that sample. In order to work with the reads, you will need to move them out of their subfolders into one single folder (aka directory). First, navigate to the shared directory in your terminal.
cd /blue/juliemeyer/share
You should see the directory that contains your sequencing results, named something like "NS3211". The names from the sequencing center will be longer; NS3211 is used here as an example. From the terminal, use the following commands to create a new directory and move all of the sequencing reads into it. The example below assumes you are starting at /blue/juliemeyer/share and that the directory "NS3211" containing your sequencing reads is listed there. Replace "newdirectory" with a name that makes sense for you. Use the Linux move command (mv) to move all files ending in .fastq.gz (*.fastq.gz) to the new directory. In the example, the directory "NS3211" contains a set of subdirectories, one per sample; the /*/ in the path tells the shell to look in every subdirectory of NS3211. The asterisk is a wildcard that matches any name.
mkdir /blue/juliemeyer/share/newdirectory
mv NS3211/*/*.fastq.gz /blue/juliemeyer/share/newdirectory
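If you want to confirm that the wildcard picked up every read file, you can count the matches before the move and count the files in the destination afterwards. The sketch below is only a toy demonstration: it builds a throwaway directory with made-up sample names (sampleA, sampleB) so it can be run anywhere. On HiPerGator you would run only the two counting commands around the real mv.

```shell
# Toy setup in a scratch directory; sample names are invented for illustration.
demo=$(mktemp -d)
mkdir -p "$demo/NS3211/sampleA" "$demo/NS3211/sampleB" "$demo/newdirectory"
touch "$demo/NS3211/sampleA/sampleA_S1_L001_R1_001.fastq.gz" \
      "$demo/NS3211/sampleA/sampleA_S1_L001_R2_001.fastq.gz" \
      "$demo/NS3211/sampleB/sampleB_S2_L001_R1_001.fastq.gz" \
      "$demo/NS3211/sampleB/sampleB_S2_L001_R2_001.fastq.gz"
cd "$demo"

# Count what the wildcard matches before moving anything.
n_before=$(ls NS3211/*/*.fastq.gz | wc -l)

# Same move as in the step above, using relative paths inside the scratch directory.
mv NS3211/*/*.fastq.gz newdirectory/

# Count the files that arrived; the two numbers should agree.
n_after=$(ls newdirectory/*.fastq.gz | wc -l)
echo "matched: $n_before, moved: $n_after"
```

If the two counts differ, check whether some reads sit at a different folder depth than NS3211/*/ before going further.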
- Once reads are moved from their individual folders into one folder, transfer them to your own personal folder (don't forget the sequenced extraction blanks!). The script below will move the whole directory if the sequencing run has only your samples on it. This will be your working directory.
mv /blue/juliemeyer/share/newdirectory /blue/juliemeyer/r.howard/newdirectory
- At this point, if you have any capital “R”s in your sample names, they need to be replaced with lowercase letters. (Next time, do not submit sample names to ICBR with capital R). This does not apply to "R1" and "R2" near the end of the file name. To do this, pick the “phrase” with the uppercase R to work with. For example, my samples had “STRI” in the filename and I changed that phrase to "stri" with the following command in the terminal:
#generic usage: replace OLDPHRASE and newphrase with your own text
for file in *.fastq.gz ; do mv "$file" "${file//OLDPHRASE/newphrase}" ; done
#specific example
for file in *.fastq.gz ; do mv "$file" "${file//STRI/stri}" ; done
- Check your files to make sure the file names have changed. From your terminal, run
ls
to see the files or refresh your SFTP to see that the file names have been fixed.
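Before renaming for real, it can help to preview what the substitution will produce. The sketch below is a dry run that only prints the planned mv commands. The file names are made up for illustration, and the ${file//STRI/stri} expansion is bash syntax, so this assumes your terminal is running bash.

```shell
# Dry run: print the planned renames instead of performing them.
# These file names are invented examples containing the phrase "STRI".
for file in "STRI-coral1_S1_L001_R1_001.fastq.gz" "STRI-coral1_S1_L001_R2_001.fastq.gz"; do
    new_name="${file//STRI/stri}"   # replace every occurrence of STRI with stri
    echo "mv $file $new_name"
done
```

Note that the "R1" and "R2" near the end of each name are untouched, because only the exact phrase STRI is replaced.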
Using the terminal to run cutadapt:
- From your SFTP, find the cutadapt script in the shared folder (/blue/juliemeyer/share) named: preprocessV4Amplicons.txt. Download it to your desktop and change the email address to your own. Then upload the updated preprocessV4Amplicons.txt to your working directory that contains the sequencing reads (/blue/juliemeyer/r.howard/newdirectory). The preprocessV4Amplicons.txt script is shown below.
#!/bin/bash
#SBATCH --job-name=cutadapt_%j
#SBATCH --output=cutadapt_%j.log
#SBATCH --error=cutadapt_%j.err
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=<your email address>
#SBATCH --nodes=1
#SBATCH --mem-per-cpu=60gb
#SBATCH --time=72:00:00
####### DO NOT USE CAPITAL Rs IN SAMPLE NAMES----RENAME WITH r
########### Cutadapt: remove illumina adapters and primers
# see manual for options: http://cutadapt.readthedocs.io/en/stable/guide.html
# illumina adapters:
# -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT
# V4 primers:
# -a GGACTACNVGGGTWTCTAAT -g GTGYCAGCMGCCGCGGTAA
# make directory for cutadapt output files
mkdir cutadapt
module load cutadapt
# create file names for cutadapt
files_cut=`ls | grep "R1_001.fastq.gz"`
# use loop to run cutadapt function on all original pairs; send output to cutadapt folder
for R1 in $files_cut
do
R1_cut=`echo $R1 | cut -d R -f1`R1_cut.fastq.gz
R2=`echo $R1 | cut -d R -f1`R2_001.fastq.gz
R2_cut=`echo $R1 | cut -d R -f1`R2_cut.fastq.gz
cutadapt -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT -a GGACTACNVGGGTWTCTAAT -g GTGYCAGCMGCCGCGGTAA -n 4 -o cutadapt/$R1_cut -p cutadapt/$R2_cut $R1 $R2
done
#download *cut.fastq.gz files for input into dada2----it will quality filter
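The reason capital Rs in sample names cause trouble is visible in the script itself: it builds each output name by cutting the input name at the first capital R and keeping what comes before it. The sketch below applies that same cut expression to two made-up file names to show the difference. The names are hypothetical, and this is only an illustration of the splitting step, not part of the lab script.

```shell
# Made-up file names for illustration.
good="stri-coral1_S1_L001_R1_001.fastq.gz"
bad="STRI-coral1_S1_L001_R1_001.fastq.gz"

# Same expression the script uses: keep everything before the first capital R.
good_prefix=$(echo "$good" | cut -d R -f1)
bad_prefix=$(echo "$bad" | cut -d R -f1)

# Rebuild the reverse-read name the way the script does.
echo "${good_prefix}R2_001.fastq.gz"   # stri-coral1_S1_L001_R2_001.fastq.gz
echo "${bad_prefix}R2_001.fastq.gz"    # STR2_001.fastq.gz
```

With the capital R in "STRI", the prefix is cut short to "ST", so the script would look for a reverse-read file that does not exist.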
- You are now ready to run cutadapt. From the terminal, navigate to your working directory that contains the script and the sequencing reads, if you aren't already there. Submit the job to hipergator from the working directory with the following command.
sbatch preprocessV4Amplicons.txt
- You will get an email with a job number saying that your job has begun. When the job has ended, you will get a second email saying that the job is complete.
- Refresh your working directory folder on Cyberduck. You should see a cutadapt folder here now. Go into the folder to make sure the process has worked. If you see your cut.fastq.gz files, you have successfully run cutadapt.
- Your samples are ready to use for downstream analysis.
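As an optional check from the terminal, you can confirm that every forward file in the cutadapt folder has a matching reverse file. The sketch below is a toy demonstration in a throwaway directory with invented sample names, where one reverse file is deliberately left out. On HiPerGator you would skip the setup lines and run the loop from your working directory.

```shell
# Toy setup in a scratch directory; sampleB is deliberately missing its R2 file.
demo=$(mktemp -d)
mkdir "$demo/cutadapt"
touch "$demo/cutadapt/sampleA_S1_L001_R1_cut.fastq.gz" \
      "$demo/cutadapt/sampleA_S1_L001_R2_cut.fastq.gz" \
      "$demo/cutadapt/sampleB_S2_L001_R1_cut.fastq.gz"
cd "$demo"

# For each forward output, derive the expected reverse name and test that it exists.
missing=0
for r1 in cutadapt/*R1_cut.fastq.gz; do
    r2="${r1%R1_cut.fastq.gz}R2_cut.fastq.gz"
    if [ ! -e "$r2" ]; then
        echo "missing mate for $r1"
        missing=$((missing + 1))
    fi
done
echo "forward files without a reverse file: $missing"
```

A count of zero on your real data suggests every sample was written as a pair. Anything else is worth checking against the cutadapt .log and .err files in your working directory.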