Big ITS DADA2 DATA

By Angus Ball

Introduction

By now you've gone through the HPC how-to guide and the DADA2 pipeline on your home computer. This guide aims to be a complete walkthrough of the process on the HPC. There will be a lot of overlap with those steps and explanations, so I'll only be explaining the new HPC/full-dataset code.

Quick reference for how I'll format this document:

  • This is the step
This is the code you'll run, note the copy button --->

This is the output

  • Step 2
    • These are bonus facts that I want to say,
    • or sub steps

The Protocol

  • Log in to UNBC's VPN with the GlobalProtect portal
  • Move your data onto the HPC
    • On windows go to "This PC"
    • Click the three dots
    • Click map network drive
    • the folder is \\research-files.unbc.ca\researchHome\'username'
    • Move your folder containing your demultiplexed sample data here
    • NO spaces or dashes in your folder or file names
  • Log in with PuTTY; the username is 'username'@klinaklini.unbc.ca
  • Let's check if your data was transferred correctly
 cd /data/researchHome/'username'
 ls

LisaProject

  • Reminder: copy/paste doesn't work like normal in the command line. Copy like normal, but paste is right click!
  • Therefore our data is in /data/researchHome/aball/LisaProject
  • We want to remove the blank PCR samples, because they aren't part of our sequences; then we can continue the analysis
  • With our samples on the computer, let's get onto a compute node and start running code!
  • Find which compute nodes are available
sinfo

PARTITION AVAIL TIMELIMIT NODES STATE NODELIST

defq* up infinite 16 idle compute[1-16]

  • If some compute nodes are busy (and not "idle"), then move to one that is idle
  • Compute node 1 is free for me, so
  • To move to the compute node
ssh compute1
  • Before starting in R we need to install cutadapt
 cd ~
wget https://bootstrap.pypa.io/get-pip.py
python3 get-pip.py --user
python3 -m pip install --user cutadapt
  • This in theory installs cutadapt, but always check
python3 -m cutadapt --version

3.7

  • The file path for cutadapt is then /home/aball/.local/bin/cutadapt

The following information is incorrect, but I've left it because it's a good command to know

  • you can check by running
python3 -m pip list -v

Make sure you bring in all your data and know where it's stored. Mine is in /home/aball/lisadata/lisafastq

  • Lets load up R
module load gnu8
module load R
R

What happens next depends on how the HPC is set up, so we'll assume you have all the packages you need, etc.

In R.... This is pretty much the same as in the other R notebook (DADA2 ITS test run), so I'll only explain the new weird stuff needed to work within the HPC. Hey, by the way, if you are working with multiple primer sets it's a pain in the ass; you just have to pipe multiple cutadapt sequences together. It's the same code with FWD1, FWD2, etc. (there's a sketch after the cutadapt loop below). Here's an example: Multiple primers, cutadapt, and dada2.

library(dada2)
library(ShortRead)
library(Biostrings)

path <- "/home/aball/lisadata/lisafastq" #location of data
list.files(path) #Double check all your files are here

#organize files
fnFs <- sort(list.files(path, pattern = "R1.fastq.gz", full.names = TRUE)) #Forward
fnRs <- sort(list.files(path, pattern = "R2.fastq.gz", full.names = TRUE)) #Reverse

allOrients <- function(primer) {
    # Create all orientations of the input sequence
    require(Biostrings)
    dna <- DNAString(primer)  # The Biostrings works w/ DNAString objects rather than character vectors
    orients <- c(Forward = dna, Complement = Biostrings::complement(dna), Reverse = Biostrings::reverse(dna),
        RevComp = Biostrings::reverseComplement(dna))
    return(sapply(orients, toString))  # Convert back to character vector
}

FWD <- "ACACTCTTTCCCTACACGACGCTCTTCCGATCTGTGAATCATCGAATCTTTGAA"
REV <- "GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTTCCTCCGCTTATTGATATGC"
FWD.orients <- allOrients(FWD)
REV.orients <- allOrients(REV)
FWD.orients

fnFs.filtN <- file.path(path, "filtN", basename(fnFs)) # Put N-filtered files in filtN/ subdirectory
fnRs.filtN <- file.path(path, "filtN", basename(fnRs))
filterAndTrim(fnFs, fnFs.filtN, fnRs, fnRs.filtN, maxN = 0, multithread = TRUE)
primerHits <- function(primer, fn) {
    # Counts number of reads in which the primer is found
    nhits <- vcountPattern(primer, sread(readFastq(fn)), fixed = FALSE)
    return(sum(nhits > 0))
}
rbind(FWD.ForwardReads = sapply(FWD.orients, primerHits, fn = fnFs.filtN[[1]]),
    FWD.ReverseReads = sapply(FWD.orients, primerHits, fn = fnRs.filtN[[1]]),
    REV.ForwardReads = sapply(REV.orients, primerHits, fn = fnFs.filtN[[1]]),
    REV.ReverseReads = sapply(REV.orients, primerHits, fn = fnRs.filtN[[1]]))

This still only prints out the reads for the first sample, but since all the samples came from the same sequencing run, they should theoretically behave the same, and we only need to spot-check one or two samples.
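If you do want to spot-check a second sample, the same primerHits helper from above works with index 2:

sapply(FWD.orients, primerHits, fn = fnFs.filtN[[2]]) #primer hits in the forward reads of sample 2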

cutadapt <- "/home/aball/.local/bin/cutadapt"
path.cut <- file.path(path, "cutadapt")
if(!dir.exists(path.cut)) dir.create(path.cut) #this creates a cutadapt subdirectory
fnFs.cut <- file.path(path.cut, basename(fnFs))
fnRs.cut <- file.path(path.cut, basename(fnRs))

#Hey, this whole thing breaks if you have spaces in your path names, so uhhh don't do that

FWD.RC <- dada2:::rc(FWD)
REV.RC <- dada2:::rc(REV)
# Trim FWD and the reverse-complement of REV off of R1 (forward reads)
R1.flags <- paste("-g", FWD, "-a", REV.RC) 
# Trim REV and the reverse-complement of FWD off of R2 (reverse reads)
R2.flags <- paste("-G", REV, "-A", FWD.RC) 
# Run Cutadapt
for(i in seq_along(fnFs)) {
  system2(cutadapt, args = c(R1.flags, R2.flags, "-n", 2, # -n 2 required to remove FWD and REV from reads
                             "-o", fnFs.cut[i], "-p", fnRs.cut[i], # output files
                             fnFs.filtN[i], fnRs.filtN[i])) # input files
}
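As promised, here's roughly what the flags look like with multiple primer sets. This is a hypothetical sketch (FWD2 and REV2 are placeholders, not primers from this dataset): cutadapt accepts repeated -g/-a (and -G/-A) options, so both pairs can be stacked into one call and the for loop above stays the same.

FWD2 <- "NNNNNNNNNN" #placeholder second forward primer
REV2 <- "NNNNNNNNNN" #placeholder second reverse primer
#Stack both primer pairs into the same flag strings
R1.flags <- paste("-g", FWD, "-a", REV.RC, "-g", FWD2, "-a", dada2:::rc(REV2))
R2.flags <- paste("-G", REV, "-A", FWD.RC, "-G", REV2, "-A", dada2:::rc(FWD2))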

#Checking if it worked
rbind(FWD.ForwardReads = sapply(FWD.orients, primerHits, fn = fnFs.cut[[1]]),
    FWD.ReverseReads = sapply(FWD.orients, primerHits, fn = fnRs.cut[[1]]),
    REV.ForwardReads = sapply(REV.orients, primerHits, fn = fnFs.cut[[1]]),
    REV.ReverseReads = sapply(REV.orients, primerHits, fn = fnRs.cut[[1]]))

cutFs <- sort(list.files(path.cut, pattern = "R1.fastq.gz", full.names = TRUE))
cutRs <- sort(list.files(path.cut, pattern = "R2.fastq.gz", full.names = TRUE))

This is different from the test data! Your samples will have different naming schemes, so you'll have to change this code to fit your data. This uses regular expressions, which are hard and complicated :(.

here is an unadulterated sample name: MI.M03992_0831.001.FLD_ill_028_i7---IDT_i5_8.LW43_R1.fastq.gz

the first function...

get.sample.name <- function(fname) strsplit(basename(fname), "_")[[1]][7] #"_"<-split value, [7]<-save this one
sample.names <- unname(sapply(cutFs, get.sample.name))

can be read as: create a function that splits the string (the file name) at every _ and keeps the 7th piece, i.e. you run the command and

MI.M03992_0831.001.FLD_ill_028_i7---IDT_i5_8.LW43_R1.fastq.gz

becomes

"MI.M03992" "0831.001.FLD" "ill" "028" "i7---IDT" "i5" "8.LW43" "R1.fastq.gz"

And since only the 7th is kept it becomes

"8.LW43"

Then we repeat, just to save the LW43 part of the name, because that's what we care about

get.sample.name.2 <- function(fname) strsplit(basename(fname), "[.]")[[1]][2]
sample.names <- unname(sapply(sample.names, get.sample.name.2))
head(sample.names)

This takes

"8.LW43"

splits it by the period

Note: the period is special in regular expressions, so I have to enclose it in square brackets

"8" "LW43"

Then we save the 2nd instance and the sample name becomes

"LW43"

Huzzah! actual names that are good

This is also different. Because we aren't in RStudio, graphs are hard. I'm getting around this by creating objects that contain the graphs and plotting them in RStudio later, just to check they're fine.

Might take a hot minute to plot everything, so you can just spot-check a couple of samples.

PQBF<-plotQualityProfile(cutFs[1:2])
saveRDS(PQBF, file = "/home/aball/lisadata/PQBF.rds")

You can then move this RDS object from the HPC to your home computer and then view it there
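For example, once the .rds file is on your home computer, viewing it in RStudio is just this (the path is hypothetical; use wherever you saved the file):

PQBF <- readRDS("C:\\Users\\you\\Downloads\\PQBF.rds") #hypothetical location
PQBF #printing the saved ggplot object draws the quality profile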

The same goes for the reverse reads: cutRs instead of cutFs

PQBR<-plotQualityProfile(cutRs[1:2])
saveRDS(PQBR, file = "/home/aball/lisadata/PQBR.rds")
filtFs <- file.path(path.cut, "filtered", basename(cutFs))
filtRs <- file.path(path.cut, "filtered", basename(cutRs))

out <- filterAndTrim(
  cutFs, #location of cut forward reads
  filtFs, #location of where filtered forward reads will go
  cutRs, #location of cut reverse reads
  filtRs, #location of where filtered reverse reads will go
  maxN = 0, #max number of ambiguous bases allowed in a read (dada2 REQUIRES 0 ambiguous bases)
  maxEE = c(2, 2), #maxEE discards reads whose expected errors (computed from the quality scores) exceed this threshold, rather than filtering on Q scores absolutely; 2 is the recommended value. Why a pair? One value for the forward reads, one for the reverse
  truncQ = 2, #truncates each read at the first base with a quality score of 2 or below; why 2? Illumina reports a Q of 2 on really bad base calls
  minLen = 50, #minimum length of read, note for ITS region there is no max length
  rm.phix = TRUE, #removes sequences of the bacteriophage PhiX which commonly contaminates NGS data
  compress = TRUE, #compresses your files when its done
  multithread = TRUE,
  verbose=TRUE)  
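If you want intuition for what maxEE is doing: the expected errors for a read are computed from its quality scores as EE = sum(10^(-Q/10)). A toy example with made-up Q scores (not real data):

q <- c(35, 35, 30, 2) #hypothetical per-base quality scores for one very short "read"
sum(10^(-q / 10)) #~0.63 expected errors, so this read would pass maxEE = 2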

Then let's give everything a look to make sure that we didn't destroy any samples

out
                                                              reads.in reads.out
MI.M03992_0831.001.FLD_ill_001_i7---IDT_i5_5.LW01_R1.fastq.gz    61606     38621
MI.M03992_0831.001.FLD_ill_002_i7---IDT_i5_5.LW14_R1.fastq.gz    67617     38388
MI.M03992_0831.001.FLD_ill_003_i7---IDT_i5_5.LW27_R1.fastq.gz    56732     41751
MI.M03992_0831.001.FLD_ill_004_i7---IDT_i5_5.LW39_R1.fastq.gz    59140     42798
MI.M03992_0831.001.FLD_ill_005_i7---IDT_i5_5.LW52_R1.fastq.gz    58358     36734
MI.M03992_0831.001.FLD_ill_006_i7---IDT_i5_5.LW65_R1.fastq.gz    51118     34769

Then we can plot the quality profiles again

filterQPF<-plotQualityProfile(filtFs[1:2])
saveRDS(filterQPF, file = "/home/aball/lisadata/filterQPF.rds")

and reverse reads

filterQPR<-plotQualityProfile(filtRs[1:2])
saveRDS(filterQPR, file = "/home/aball/lisadata/filterQPR.rds")

Then it's time to learn the error rates. Remember, if you're using NextSeq data, error learning looks different; this tutorial is for MiSeq data.

errF <- learnErrors(filtFs, multithread=TRUE)
errR <- learnErrors(filtRs, multithread=TRUE)

Different again

ploterrF<-plotErrors(errF, nominalQ=TRUE)
ploterrR<-plotErrors(errR, nominalQ=TRUE)
saveRDS(ploterrF, file = "/home/aball/lisadata/ploterrF.rds")
saveRDS(ploterrR, file = "/home/aball/lisadata/ploterrR.rds")

Still different

dadaFs <- dada(filtFs, err=errF, multithread=TRUE, pool = TRUE)
dadaRs <- dada(filtRs, err=errR, multithread=TRUE, pool = TRUE)

This takes literally 5 years to run, so I'm gonna save these as RDS objects just in case something happens

saveRDS(dadaFs, file = "/home/aball/lisadata/dadaFs.rds")
saveRDS(dadaRs, file = "/home/aball/lisadata/dadaRs.rds")
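# If something does happen, you can reload these later instead of re-running dada():
# dadaFs <- readRDS("/home/aball/lisadata/dadaFs.rds")
# dadaRs <- readRDS("/home/aball/lisadata/dadaRs.rds")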
mergers <- mergePairs(dadaFs, filtFs, dadaRs, filtRs, verbose=TRUE)
seqtab <- makeSequenceTable(mergers)
dim(seqtab)

[1]   95 5536

95 samples and 5536 unique ASVs between them!

table(nchar(getSequences(seqtab)))
seqtab.nochim <- removeBimeraDenovo(seqtab, method="consensus", multithread=TRUE, verbose=TRUE)
dim(seqtab.nochim)

[1]   95 4053

Only removed ~1500 ASVs, not too shabby!

sum(seqtab.nochim)/sum(seqtab)

[1] 0.9530778

but that's only a ~5% loss of reads, fair enough

getN <- function(x) sum(getUniques(x))
track <- cbind(out, sapply(dadaFs, getN), sapply(dadaRs, getN), sapply(mergers, getN),
    rowSums(seqtab.nochim))
colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim")
rownames(track) <- sample.names
head(track)

We can look at the numbers, but it'd be better if we could, idk, quickly compute the percent loss at each step.

library(dplyr)
track <- as.data.frame(track)
track <- mutate(track, LossFiltering = 1 - filtered / input) #fraction of reads lost at the filtering step
track <- mutate(track, LossMerge = 1 - merged / filtered) #fraction of reads lost at the merge step
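Then eyeball the new columns:

head(track) #now shows per-sample loss fractions for the filtering and merge steps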

A loss of 50% of reads during filtering is completely normal and chill. A loss of 50-75% of reads after merging is not "completely normal and chill"... I'll have to look into this on a sample-by-sample basis.

Okay! Almost done! Go download the latest release of the UNITE (general FASTA) database and upload it to the HPC.

unite.ref <- "/home/aball/lisadata/sh_general_release_dynamic_s_25.07.2023.fasta" #this is where the database is 

taxa <- assignTaxonomy(seqtab.nochim,
                       unite.ref,
                       multithread = TRUE, 
                       tryRC = TRUE) #then assign taxonomy

taxa.print <- taxa  # Removing sequence rownames for display only; this creates a nice taxa object, we won't really use it but alas.
rownames(taxa.print) <- NULL
head(taxa.print)

Okay! Save your taxa, track, seqtab.nochim, and taxa.print:

saveRDS(obj, "file")
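Concretely, that might look like this (using the lisadata directory from earlier; swap in your own paths):

saveRDS(taxa, file = "/home/aball/lisadata/taxa.rds")
saveRDS(track, file = "/home/aball/lisadata/track.rds")
saveRDS(seqtab.nochim, file = "/home/aball/lisadata/seqtab.nochim.rds")
saveRDS(taxa.print, file = "/home/aball/lisadata/taxa.print.rds")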

Bring these files off of the HPC onto your home computer, and then create a phyloseq object!

Unfortunately, seqtab.nochim has the stupid long names. Here's the same code as earlier that created the nice names, modified to change the names of the nochim object:

#gets names
seqtab.nochim.df <-as.data.frame(seqtab.nochim)
namesobj<- rownames(seqtab.nochim.df)

#splits names up as above
get.sample.name <- function(fname) strsplit(basename(fname), "_")[[1]][7] #"_"<-split value, [7]<-save this one
sample.names <- unname(sapply(namesobj, get.sample.name))
get.sample.name.2 <- function(fname) strsplit(basename(fname), "[.]")[[1]][2]
sample.names <- unname(sapply(sample.names, get.sample.name.2))
head(sample.names)

#find and replaces names
seqtab.nochim <-as.data.frame(seqtab.nochim)
seqtab.nochim$names <- sample.names
rownames(seqtab.nochim) <- seqtab.nochim$names
seqtab.nochim <- select(seqtab.nochim, -"names")
seqtab.nochim <- as.matrix(seqtab.nochim) #converts it back to a matrix because of the otu_table command below


#don't forget to load your sample data!
library(readr) #read_csv lives here
Key <- read_csv("C:\\Users\\angus\\OneDrive - UNBC\\Angus Ball\\Lab work\\ULTRA\\They call me... data\\Angus-16S\\ultra to categories key.csv")

library(phyloseq)
ps <- phyloseq(otu_table(seqtab.nochim, taxa_are_rows = FALSE), tax_table(taxa))
saveRDS(ps, file = "C:\\Users\\angus\\OneDrive\\Documents\\Story\\lisatestdataset\\ps.rds" )#add your file location here
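Side note on that Key object: as written above, it gets loaded but never attached to ps. Here's a sketch of how you could attach it, assuming Key contains a column of sample IDs that match sample.names (the column name SampleID is an assumption; check your own CSV):

Key <- as.data.frame(Key)
rownames(Key) <- Key$SampleID #assumed column name; rownames must match the OTU table's sample names
ps <- phyloseq(otu_table(seqtab.nochim, taxa_are_rows = FALSE),
               sample_data(Key),
               tax_table(taxa))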

And you're done!! Unless you've gotta figure out why a bunch of samples failed to merge, you can move on to data analysis.