ISSRseq and downstream commands for River on myco server

Reads are in /data/rpasquale/reads

cd /data/rpasquale/reads
cd .. ## to go up one level
cd reads ## to go back into reads dir

Remove the '_001' suffix from the ends of the fastq file names, then remove '_L001' from the middle

cd reads ## to go back into reads dir
rename 's/_001.fastq.gz/.fastq.gz/' *
rename 's/_L001//' *
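
A quick sanity check that the renames worked (run from inside 'reads'; file names should now look like SAMPLE_R1.fastq.gz and SAMPLE_R2.fastq.gz):

```bash
ls | head
```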

Create a samples list and save it in the directory above 'reads'

ls *_R1.fastq.gz > ../samples.txt

Remove the '_R1.fastq.gz' suffix from the names in the samples list. Run this from the directory above 'reads', where samples.txt was saved; the dots are escaped so sed treats them literally

cd ..
sed 's/_R1\.fastq\.gz//g' samples.txt > samples2.txt

It helps to check that there is no trailing return after the last sample name! You can open the file in a text editor, delete the final return if present, and save
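
One way to check and fix this from the shell (a sketch; 'cat -A' marks each line end with '$', and the sed call simply drops empty lines):

```bash
cat -A samples2.txt          # a lone '$' on its own line indicates a trailing blank line
sed -i '/^$/d' samples2.txt  # remove any empty lines in place (GNU sed)
```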

Run step 1: read trimming and pseudoreference assembly

nohup ISSRseq_AssembleReference.sh -O test_river_issr -I reads -S samples2.txt -R WV13_S12 -N A_nidus_plastome_NC_045119.fasta -T 30 -M 55 -H 0 -P primers.fasta -K 51 -L 150 -X 18 &

## for reference
 -O [desired prefix of output directory]
 -I [path to directory containing sequences]
 -S [path to the samples file]
 -R [verbatim name of sample used to create the reference assembly]
 -T [number of parallel processing threads -- I recommend not exceeding number of virtualized cores]
 -M [minimum post-trim read length]
 -H [number of bases to hard trim from the end of reads]
 -P [fasta file of ISSR motifs used. Must end in .fa or .fasta]
 -K [kmer choice for ABYSS assembly]
 -L [minimum assembled contig length]
 -N [negative reference in fasta format to filter reads against, ex. sequenced plastome or specific contaminant]
 -X [bbduk trimming kmer, equal to or longer than shortest primer used]
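
Because the job is backgrounded with nohup, the terminal returns immediately; by default nohup appends the job's output to nohup.out in the working directory, so progress can be followed with:

```bash
tail -f nohup.out
```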

To check the pseudoreference assembly stats (note the path is relative to the working directory, not the filesystem root):

bbstats test_river_issr_2021_11_11_10_57/final_reference_assembly.fa

## Output:

```
Minimum         Number          Number          Total           Total           Scaffold
Scaffold        of              of              Scaffold        Contig          Contig
Length          Scaffolds       Contigs         Length          Length          Coverage
--------        --------------  --------------  --------------  --------------  --------
    All                 18,488          18,488       4,538,864       4,538,864   100.00%
    100                 18,488          18,488       4,538,864       4,538,864   100.00%
    250                  5,578           5,578       2,174,251       2,174,251   100.00%
    500                    947             947         639,520         639,520   100.00%
   1 KB                     53              53          62,985          62,985   100.00%
```
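
The longest contig length is useful later for choosing the vcftools '--thin' distance (step 3b). One way to compute it with awk (a sketch; it sums sequence lines per record, so multi-line fasta is handled):

```bash
awk '/^>/ {if (len > max) max = len; len = 0; next} {len += length($0)} END {if (len > max) max = len; print max}' \
    test_river_issr_2021_11_11_10_57/final_reference_assembly.fa
```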

Run step 2: create bam files, sort them, and remove duplicate reads

Use the directory name created by 'AssembleReference' for -O

nohup ISSRseq_CreateBAMs.sh -O test_river_issr_2021_11_11_10_57 -T 30 &

-O [name of the output directory created by AssembleReference in step 1]
-T [number of parallel processing threads -- I recommend not exceeding number of virtualized cores]
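
To spot-check the alignments, samtools flagstat prints quick mapping statistics. The exact BAM locations depend on the ISSRseq output layout, so list them first (the flagstat path below is a placeholder):

```bash
find test_river_issr_2021_11_11_10_57 -name '*.bam'   # list the BAMs this step produced
samtools flagstat path/to/one_sample.bam              # placeholder; substitute a real BAM from the list
```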

Run step 3: Use GATK HaplotypeCaller best practices to call SNPs and create a filtered .vcf file

nohup ISSRseq_AnalyzeBAMs.sh -O test_river_issr_2021_11_11_10_57 -T 30 -P 2 &

-O [name of the output directory created in step 1]
-T [number of parallel processing threads -- I recommend not exceeding number of virtualized cores]
-P [ploidy of organism]

Optional step 3b: "thin" the vcf file to one SNP per locus. This avoids analyzing linked SNPs and is required for analyses that assume unlinked markers, e.g. STRUCTURE or PCA

vcftools --vcf filtered_SNPs.vcf --out thinned_filtered_SNPs --thin 1800 --recode

## Here, we set '--thin' to 1800 bp, which is longer than the longest contig in the pseudoreference (Lmax = 1634), so at most one SNP per contig is kept. Note that vcftools treats '--out' as a prefix, so the output file is thinned_filtered_SNPs.recode.vcf
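
A quick check that thinning worked, counting variant records (non-header lines) before and after:

```bash
grep -vc '^#' filtered_SNPs.vcf                  # SNP count before thinning
grep -vc '^#' thinned_filtered_SNPs.recode.vcf   # SNP count after thinning
```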

Run step 4: create matrices for phylogenetic analysis (fasta, nexus, phylip)

ISSRseq_CreateMatrices.sh -O test_river_issr_2021_11_11_10_57 -T 4 -S 5 

-O [name of the output directory created in step 1]
-T [number of parallel processing threads -- I recommend not exceeding number of virtualized cores]
-S [minimum number of samples a SNP must be identified in to be included in output matrices]
-D [minimum physical distance [integer] allowed between variants -- set this value to higher than the longest de novo contig to obtain thinned matrices with one variant per locus]
-M [maximum percent missing data [floating point value between 0 & 1] allowed per variant, variants are removed for which missing data is larger than this value]
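
If thinned, missing-data-filtered matrices are wanted directly from this step rather than via vcftools, -D and -M can be added; the values below are illustrative (1800 exceeds the longest contig of 1634 bp, and 0.5 drops variants missing in more than half the samples):

```bash
ISSRseq_CreateMatrices.sh -O test_river_issr_2021_11_11_10_57 -T 4 -S 5 -D 1800 -M 0.5
```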

Step 5: Coverage analysis in R using vcfR

See: vcfR tutorial

set working directory

setwd("E:/unix_practice/2021_River_ISSRseq")

install packages if not already installed

install.packages(c("vcfR","ggplot2","reshape2","scales"))

load vcfR library

library(vcfR)

read in data with vcfR

vcf <- read.vcfR("filtered_SNPs.vcf", verbose = FALSE)
vcf

Make a nice violin plot per sample of coverage across all loci

```r
dp <- extract.gt(vcf, element = 'DP', as.numeric = TRUE)
if (require(reshape2) & require(ggplot2)) {
  dpf <- melt(dp, varnames = c('Index', 'Sample'), value.name = 'Depth', na.rm = TRUE)
  dpf <- dpf[dpf$Depth > 0, ]
  p <- ggplot(dpf, aes(x = Sample, y = Depth)) +
    geom_violin(fill = "#C0C0C0", adjust = 1.0, scale = "count", trim = TRUE)
  p <- p + theme_bw()
  p <- p + theme(axis.title.x = element_blank(),
                 axis.text.x = element_text(angle = 60, hjust = 1, size = 12))
  # p <- p + stat_summary(fun.data = mean_sdl, mult = 1, geom = "pointrange", color = "black")
  p <- p + scale_y_continuous(trans = scales::log2_trans(),
                              breaks = c(1, 10, 100, 800),
                              minor_breaks = c(1:10, 2:10 * 10, 2:8 * 100))
  p <- p + theme(axis.title.y = element_text(size = 12))
  p <- p + theme(panel.grid.major.y = element_line(color = "#A9A9A9", size = 0.6))
  p <- p + theme(panel.grid.minor.y = element_line(color = "#C0C0C0", size = 0.2))
  p <- p + stat_summary(fun.y = median, geom = "point", shape = 23, size = 2)
  p
} else {
  message("The packages reshape2 and ggplot2 are required for this example but do not appear
to be installed. Please use install.packages(c('reshape2', 'ggplot2', 'scales')) if you would
like to install them.")
}
```

Coverage boxplot

par(mar=c(8,4,1,1))
#boxplot(dp, las=3, col=c("#C0C0C0", "#808080"), ylab="Depth", log='y', las=2)
boxplot(dp, las=3, col=c("#C0C0C0", "#808080"), ylab="Depth", las=2)
abline(h=seq(0,1e4, by=100), col="#C0C0C088")

View summary of vcf object

vcf

Flag genotypes with missing depth as NA, then view the total percentage of missing data

is.na( vcf@gt[,-1][ is.na(dp) ] ) <- TRUE
vcf
## Before flagging missing data

```
***** Object of Class vcfR *****
18 samples
4281 CHROMs
10,083 variants
Object size: 9 Mb
0 percent missing data
*****        *****         *****
```

## After flagging missing data

```
***** Object of Class vcfR *****
18 samples
4281 CHROMs
10,083 variants
Object size: 8.6 Mb
90.92 percent missing data
*****        *****         *****
```
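
To see where that missing data is concentrated, a per-sample breakdown is a natural follow-up (a minimal sketch using the dp matrix extracted above; the 'miss' variable name is ours):

```r
# Proportion of missing (NA) depth calls per sample
miss <- apply(dp, MARGIN = 2, function(x) sum(is.na(x))) / nrow(dp)
par(mar = c(8, 4, 1, 1))
barplot(sort(miss), las = 2, ylab = "Proportion missing")
```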