Big ITS DADA2 DATA
By Angus Ball
Introduction
By now you've gone through the HPC how-to guide and run the DADA2 pipeline on your home computer. This guide aims to be a complete walkthrough of the process on the HPC. There will be a lot of overlap with those steps and explanations, so I'll only explain the new HPC/full-dataset code.
A quick reference for how I'll format this document:
- This is the step
This is the code you'll run; note the copy button --->
This is the output
- Step 2
- These are bonus facts that I want to say,
- or sub steps
The Protocol
- Log in to UNBC's VPN with the GlobalProtect portal
- Move your data onto the HPC
- On Windows, go to "This PC"
- Click the three dots
- Click map network drive
- the folder is \\research-files.unbc.ca\researchHome\'username'
- Move your folder containing your demultiplexed sample data here
- NO spaces or dashes in your folder names
- Log in with PuTTY; the host name is 'username'@klinaklini.unbc.ca
- Let's check that your data transferred correctly
cd /data/researchHome/'username'
ls
LisaProject
- Reminder: copy/paste shortcuts don't work in the command line; copy like normal, but paste is a right click!
- Therefore our data is in /data/researchHome/aball/LisaProject
- We want to remove the blank PCR samples because they aren't part of our sequences; then we can continue the analysis
- With our samples on the computer, let's get onto a compute node and start running code!
- Find which compute nodes are available
sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
defq* up infinite 16 idle compute[1-16]
- If some compute nodes are busy (not "idle"), pick one of the idle ones instead
- Compute node 1 is free for me, so
- To move onto the compute node
ssh compute1
- Before starting in R we need to install cutadapt
cd /data/researchHome/'username'
wget https://bootstrap.pypa.io/get-pip.py
python3 get-pip.py --user
python3 -m pip install --user cutadapt
- In theory this installs cutadapt, but always check
python3 -m cutadapt --version
3.7
- The file path for cutadapt is then /home/aball/.local/bin/cutadapt
The following information is incorrect, but I've left it because it's a good command to know
- you can check by running
python3 -m pip list -v
Make sure you bring in all your data and know where it's stored. Mine is in /home/aball/lisadata/lisafastq
- Let's load up R
module load gnu8
module load R
R
What happens next depends on how the HPC is set up, so we'll assume you have all the packages you need, etc.
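If it turns out you're missing packages, here's roughly what installing them looks like; a minimal sketch, assuming the HPC lets you install into a personal R library:
#One-time setup, run inside R; user-level installs usually land in ~/R/ on a cluster
install.packages("BiocManager") #installer for Bioconductor packages
BiocManager::install(c("dada2", "ShortRead", "Biostrings")) #the three packages loaded below
install.packages("dplyr") #used later for the track table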
In R... This is pretty much the same as in the other R notebook (DADA2 ITS test run), so I'll only explain the new, weird stuff needed to work within the HPC. By the way, if you are working with multiple primer sets it's a pain in the ass: you just have to string multiple cutadapt primer arguments together; it's the same code with FWD1, FWD2, etc. Here's an example: Multiple primers, cutadapt, and dada2.
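For the multi-primer case, here's a sketch of what the flag-building step (shown further down for one primer pair) might look like with two hypothetical pairs, FWD1/REV1 and FWD2/REV2. Cutadapt accepts repeated -g/-a (and -G/-A) options and uses the best-matching primer per read:
#Hypothetical second primer pair; FWD1, FWD2, REV1, REV2 are stand-ins for your own sequences
FWD1.RC <- dada2:::rc(FWD1); FWD2.RC <- dada2:::rc(FWD2)
REV1.RC <- dada2:::rc(REV1); REV2.RC <- dada2:::rc(REV2)
#Repeat -g/-a (forward reads) and -G/-A (reverse reads) once per primer
R1.flags <- paste("-g", FWD1, "-g", FWD2, "-a", REV1.RC, "-a", REV2.RC)
R2.flags <- paste("-G", REV1, "-G", REV2, "-A", FWD1.RC, "-A", FWD2.RC)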
library(dada2)
library(ShortRead)
library(Biostrings)
path <- "/home/aball/lisadata/lisafastq" #location of data
list.files(path) #Double check all your files are here
#organize files
fnFs <- sort(list.files(path, pattern = "R1.fastq.gz", full.names = TRUE)) # Forward
fnRs <- sort(list.files(path, pattern = "R2.fastq.gz", full.names = TRUE)) # Reverse
allOrients <- function(primer) {
# Create all orientations of the input sequence
require(Biostrings)
dna <- DNAString(primer) # Biostrings works w/ DNAString objects rather than character vectors
orients <- c(Forward = dna, Complement = Biostrings::complement(dna), Reverse = Biostrings::reverse(dna),
RevComp = Biostrings::reverseComplement(dna))
return(sapply(orients, toString)) # Convert back to character vector
}
FWD <- "ACACTCTTTCCCTACACGACGCTCTTCCGATCTGTGAATCATCGAATCTTTGAA"
REV <- "GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTTCCTCCGCTTATTGATATGC"
FWD.orients <- allOrients(FWD)
REV.orients <- allOrients(REV)
FWD.orients
fnFs.filtN <- file.path(path, "filtN", basename(fnFs)) # Put N-filtered files in filtN/ subdirectory
fnRs.filtN <- file.path(path, "filtN", basename(fnRs))
filterAndTrim(fnFs, fnFs.filtN, fnRs, fnRs.filtN, maxN = 0, multithread = TRUE)
primerHits <- function(primer, fn) {
# Counts number of reads in which the primer is found
nhits <- vcountPattern(primer, sread(readFastq(fn)), fixed = FALSE)
return(sum(nhits > 0))
}
rbind(FWD.ForwardReads = sapply(FWD.orients, primerHits, fn = fnFs.filtN[[1]]), FWD.ReverseReads = sapply(FWD.orients,
    primerHits, fn = fnRs.filtN[[1]]), REV.ForwardReads = sapply(REV.orients, primerHits,
    fn = fnFs.filtN[[1]]), REV.ReverseReads = sapply(REV.orients, primerHits, fn = fnRs.filtN[[1]]))
This still only prints the read counts for the first sample, but since all the samples came from the same sequencing run they should theoretically behave the same, and we only need to spot-check one or two samples.
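If you want to spot-check a second sample, just change the index; a minimal sketch:
#Same primer check on the second sample; [[2]] picks the second file
rbind(FWD.ForwardReads = sapply(FWD.orients, primerHits, fn = fnFs.filtN[[2]]),
    REV.ReverseReads = sapply(REV.orients, primerHits, fn = fnRs.filtN[[2]]))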
cutadapt <- "/home/aball/.local/bin/cutadapt"
path.cut <- file.path(path, "cutadapt")
if(!dir.exists(path.cut)) dir.create(path.cut) #this creates a cutadapt file
fnFs.cut <- file.path(path.cut, basename(fnFs))
fnRs.cut <- file.path(path.cut, basename(fnRs))
#Hey, this whole thing breaks if you have spaces in your path names, so uhhh don't do that
FWD.RC <- dada2:::rc(FWD)
REV.RC <- dada2:::rc(REV)
# Trim FWD and the reverse-complement of REV off of R1 (forward reads)
R1.flags <- paste("-g", FWD, "-a", REV.RC)
# Trim REV and the reverse-complement of FWD off of R2 (reverse reads)
R2.flags <- paste("-G", REV, "-A", FWD.RC)
# Run Cutadapt
for(i in seq_along(fnFs)) {
system2(cutadapt, args = c(R1.flags, R2.flags, "-n", 2, # -n 2 required to remove FWD and REV from reads
"-o", fnFs.cut[i], "-p", fnRs.cut[i], # output files
fnFs.filtN[i], fnRs.filtN[i])) # input files
}
#Checking if it worked
rbind(FWD.ForwardReads = sapply(FWD.orients, primerHits, fn = fnFs.cut[[1]]), FWD.ReverseReads = sapply(FWD.orients,
    primerHits, fn = fnRs.cut[[1]]), REV.ForwardReads = sapply(REV.orients, primerHits,
    fn = fnFs.cut[[1]]), REV.ReverseReads = sapply(REV.orients, primerHits, fn = fnRs.cut[[1]]))
cutFs <- sort(list.files(path.cut, pattern = "R1.fastq.gz", full.names = TRUE))
cutRs <- sort(list.files(path.cut, pattern = "R2.fastq.gz", full.names = TRUE))
This is different from the test data! Your samples will have different naming schemes, so you'll have to change this code to fit your data. This uses regular expressions, which are hard and complicated :(.
Here is an unadulterated sample name: MI.M03992_0831.001.FLD_ill_028_i7---IDT_i5_8.LW43_R1.fastq.gz
The first function...
get.sample.name <- function(fname) strsplit(basename(fname), "_")[[1]][7] #"_"<-split value, [7]<-save this one
sample.names <- unname(sapply(cutFs, get.sample.name))
can be read as: split the string (the file name) at every _ and keep the 7th piece. I.e., you run the command and
MI.M03992_0831.001.FLD_ill_028_i7---IDT_i5_8.LW43_R1.fastq.gz
becomes
"MI.M03992" "0831.001.FLD" "ill" "028" "i7---IDT" "i5" "8.LW43" "R1.fastq.gz"
And since only the 7th is kept it becomes
"8.LW43"
Then we repeat, just to save the LW43 part of the name, because that's what we care about
get.sample.name.2 <- function(fname) strsplit(basename(fname), "[.]")[[1]][2]
sample.names <- unname(sapply(sample.names, get.sample.name.2))
head(sample.names)
This takes
"8.LW43"
splits it by the period
Note: the period is special in regular expressions, so I have to enclose it in square brackets
"8" "LW43"
Then we save the 2nd instance and the sample name becomes
"LW43"
Huzzah! Actual names that are good.
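As an aside, the two splits can be collapsed into one substitution if you're comfortable with regular expressions; a sketch, assuming every file name contains a chunk like .LW##_R1 or .LW##_R2 (get.sample.name.onestep is my hypothetical name, not part of the pipeline):
#Capture the "LW" + digits chunk sitting between a period and _R1/_R2, and replace the whole name with it
get.sample.name.onestep <- function(fname) sub(".*\\.(LW[0-9]+)_R[12].*", "\\1", basename(fname))
sample.names <- unname(sapply(cutFs, get.sample.name.onestep))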
This is also different. Because we aren't in RStudio, graphs are hard to view, so I'm getting around this by creating objects that contain the graphs and plotting them in RStudio later, just to check they're fine.
It might take a hot minute to plot everything, so you can spot-check a couple of samples.
PQBF<-plotQualityProfile(cutFs[1:2])
saveRDS(PQBF, file = "/home/aball/lisadata/PQBF.rds")
You can then move this RDS object from the HPC to your home computer and view it there (see the sketch after the reverse-read block below).
The same goes for the reverse reads (cutRs):
PQBR<-plotQualityProfile(cutRs[1:2])
saveRDS(PQBR, file = "/home/aball/lisadata/PQBR.rds")
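On your home computer, viewing the saved plots is just a readRDS() away; a sketch, assuming you've copied the .rds files into your working directory:
#plotQualityProfile returns a ggplot object, so printing it draws the figure
PQBF <- readRDS("PQBF.rds") #forward-read quality profiles
PQBR <- readRDS("PQBR.rds") #reverse-read quality profiles
PQBF
PQBR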
filtFs <- file.path(path.cut, "filtered", basename(cutFs))
filtRs <- file.path(path.cut, "filtered", basename(cutRs))
out <- filterAndTrim(
cutFs, #location of cut forward reads
filtFs, #location of where filtered forward reads will go
cutRs, #location of cut reverse reads
filtRs, #location of where filtered reverse reads will go
maxN = 0, #max number of ambiguous bases allowed in a read (dada2 REQUIRES 0 ambiguous bases)
maxEE = c(2, 2), #maxEE uses expected error rates to decide whether a read should be discarded (rather than using Q scores absolutely); 2 expected errors is the recommended value. Why a pair? One value for the forward reads, one for the reverse
truncQ = 2, #truncates each read at the first base with a quality score of 2 or below; why 2? Illumina reports a 2 on really bad base calls
minLen = 50, #minimum length of read, note for ITS region there is no max length
rm.phix = TRUE, #removes sequences of the bacteriophage PhiX which commonly contaminates NGS data
compress = TRUE, #compresses your files when its done
multithread = TRUE,
verbose=TRUE)
Then let's give everything a look to make sure we didn't destroy any samples
out
                                                              reads.in reads.out
MI.M03992_0831.001.FLD_ill_001_i7---IDT_i5_5.LW01_R1.fastq.gz    61606     38621
MI.M03992_0831.001.FLD_ill_002_i7---IDT_i5_5.LW14_R1.fastq.gz    67617     38388
MI.M03992_0831.001.FLD_ill_003_i7---IDT_i5_5.LW27_R1.fastq.gz    56732     41751
MI.M03992_0831.001.FLD_ill_004_i7---IDT_i5_5.LW39_R1.fastq.gz    59140     42798
MI.M03992_0831.001.FLD_ill_005_i7---IDT_i5_5.LW52_R1.fastq.gz    58358     36734
MI.M03992_0831.001.FLD_ill_006_i7---IDT_i5_5.LW65_R1.fastq.gz    51118     34769
Then we can plot the quality profiles again
filterQPF<-plotQualityProfile(filtFs[1:2])
saveRDS(filterQPF, file = "/home/aball/lisadata/filterQPF.rds")
and reverse reads
filterQPR<-plotQualityProfile(filtRs[1:2])
saveRDS(filterQPR, file = "/home/aball/lisadata/filterQPR.rds")
Then it's time to learn error rates. Remember, if you're using NextSeq data, error learning looks different; this tutorial is for MiSeq data.
errF <- learnErrors(filtFs, multithread=TRUE)
errR <- learnErrors(filtRs, multithread=TRUE)
Different again
ploterrF<-plotErrors(errF, nominalQ=TRUE)
ploterrR<-plotErrors(errR, nominalQ=TRUE)
saveRDS(ploterrF, file = "/home/aball/lisadata/ploterrF.rds")
saveRDS(ploterrR, file = "/home/aball/lisadata/ploterrR.rds")
Still different
dadaFs <- dada(filtFs, err=errF, multithread=TRUE, pool = TRUE)
dadaRs <- dada(filtRs, err=errR, multithread=TRUE, pool = TRUE)
This takes literally 5 years to run, so I'm gonna save these as RDS objects just in case something happens
saveRDS(dadaFs, file = "/home/aball/lisadata/dadaFs.rds")
saveRDS(dadaRs, file = "/home/aball/lisadata/dadaRs.rds")
mergers <- mergePairs(dadaFs, filtFs, dadaRs, filtRs, verbose=TRUE)
seqtab <- makeSequenceTable(mergers)
dim(seqtab)
[1]   95 5536
95 samples and 5536 unique ASVs between them!
table(nchar(getSequences(seqtab)))
seqtab.nochim <- removeBimeraDenovo(seqtab, method="consensus", multithread=TRUE, verbose=TRUE)
dim(seqtab.nochim)
[1]   95 4053
Only removed about 1500 ASVs; not too shabby!
sum(seqtab.nochim)/sum(seqtab)
[1] 0.9530778
But that's only a ~5% loss of reads; fair enough.
getN <- function(x) sum(getUniques(x))
track <- cbind(out, sapply(dadaFs, getN), sapply(dadaRs, getN), sapply(mergers, getN),
rowSums(seqtab.nochim))
colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim")
rownames(track) <- sample.names
head(track)
We can eyeball the numbers, but it'd be better to quickly compute the percent loss at each step.
library(dplyr)
track <- as.data.frame(track)
track <- mutate(track, LossFiltering = 1 - filtered / input) #fraction of reads lost at the filtering step, per sample
track <- mutate(track, LossMerge = 1 - merged / filtered) #fraction of reads lost at the merge step, per sample
A loss of ~50% of reads during filtering is completely normal and chill. A loss of 50-75% of reads after merging is not "completely normal and chill"... I'll have to look into this on a sample-by-sample basis.
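Here's a sketch of that per-sample check: list the samples that lost more than half their reads at the merge step (the 0.5 cutoff is my arbitrary flag; adjust to taste):
#Show only the samples whose merge loss exceeds 50%
track[track$LossMerge > 0.5, ]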
Okay! Almost done! Go download the latest release of the UNITE (general FASTA) database and upload it to the HPC.
unite.ref <- "/home/aball/lisadata/sh_general_release_dynamic_s_25.07.2023.fasta" #this is where the database is
taxa <- assignTaxonomy(seqtab.nochim,
unite.ref,
multithread = TRUE,
tryRC = TRUE) #then assign taxonomy
taxa.print <- taxa # Removing sequence rownames for display only #creates a nice taxa file, we won't really use this but alas.
rownames(taxa.print) <- NULL
head(taxa.print)
Okay! Save your taxa, track, seqtab.nochim, and taxa.print objects
saveRDS(obj, file = "path")
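Spelled out, that's one call per object; the paths below just follow my /home/aball/lisadata example, so swap in your own:
saveRDS(taxa, file = "/home/aball/lisadata/taxa.rds")
saveRDS(track, file = "/home/aball/lisadata/track.rds")
saveRDS(seqtab.nochim, file = "/home/aball/lisadata/seqtab.nochim.rds")
saveRDS(taxa.print, file = "/home/aball/lisadata/taxa.print.rds")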
Bring these files off of the HPC, onto your home computer and then create a phyloseq object!
Unfortunately, seqtab.nochim still has the stupid long names. Here's the same code as earlier that created the nice names, modified to rename the nochim object.
#gets names
seqtab.nochim.df <-as.data.frame(seqtab.nochim)
namesobj<- rownames(seqtab.nochim.df)
#splits names up as above
get.sample.name <- function(fname) strsplit(basename(fname), "_")[[1]][7] #"_"<-split value, [7]<-save this one
sample.names <- unname(sapply(namesobj, get.sample.name))
get.sample.name.2 <- function(fname) strsplit(basename(fname), "[.]")[[1]][2]
sample.names <- unname(sapply(sample.names, get.sample.name.2))
head(sample.names)
#find and replaces names
seqtab.nochim <-as.data.frame(seqtab.nochim)
seqtab.nochim$names <- sample.names
rownames(seqtab.nochim) <- seqtab.nochim$names
seqtab.nochim <- select(seqtab.nochim, -"names")
seqtab.nochim <- as.matrix(seqtab.nochim) #converts it back to a matrix bc of the otu_table command below
#don't forget to load your sample data!
library(readr) #read_csv() comes from readr
Key <- read_csv("C:\\Users\\angus\\OneDrive - UNBC\\Angus Ball\\Lab work\\ULTRA\\They call me... data\\Angus-16S\\ultra to categories key.csv")
library(phyloseq)
ps <- phyloseq(otu_table(seqtab.nochim, taxa_are_rows = FALSE), tax_table(taxa))
saveRDS(ps, file = "C:\\Users\\angus\\OneDrive\\Documents\\Story\\lisatestdataset\\ps.rds" )#add your file location here
And you're done!! Unless you've got to figure out why a bunch of samples failed to merge, you can move on to data analysis.