Running the ISSRseq pipeline for ISSR data or plastomes - barrettlab/2021-Genomics-bootcamp GitHub Wiki

  • your data for this project are in (for example) /data/zbender

  • your raw reads are in /data/zbender/reads

  • After running the first steps in the pipeline:

  • the trimmed data (bbduk) will be in /data/zbender/Test_JS_plast_2021_07_20_16_52/trimmed_reads/

  • the mappings (.bam files) will be in /data/zbender/Test_JS_plast_2021_07_20_16_52/bams/

1. Unzip the reads (fastq.gz) files with 'pigz'

pigz -d *.fastq.gz

# to re-zip after, just remove the '-d' from the command, and the '.gz' from the filename
pigz *.fastq

2. [OPTIONAL] Rename the files to get rid of "_001" before the .fastq

ls -lh

-rw-rw-r-- 1 cbarrett cbarrett  475M Aug  1 16:55 MD-GRSF_2-10_S85_L001_R1_001.fastq.gz
-rw-rw-r-- 1 cbarrett cbarrett  482M Aug  1 16:55 MD-GRSF_2-10_S85_L001_R2_001.fastq.gz

# We want:
MD-GRSF_2-10_S85_L001_R1_001.fastq.gz

# to be:
MD-GRSF_2-10_S85_L001_R1.fastq.gz

#Use a bash 'for' loop and the mv command to rename the fastq files:

for f in *.fastq; do 
  mv "$f" "${f/_001/}"
done

3. Make a 'samples' text file. The pipeline will use this to loop over all samples.

ls *R1* | sed 's/.fastq//g' > samples.txt

cat samples.txt
#or
less samples.txt

4a. Plastomes: Assemble to the reference plastome: this is from the ISSR-seq pipeline

  • trim reads with bbduk (adapters, quality)

  • You need to include ALL of the options below the command, in the commandline:

  • These are: -O -I -S -R -T -M -H -P -X

ISSRseq_ReferenceBased.sh -O Test_JS_plast -I reads -S samples.txt -R Microstegium_vimineum_plastome_noIR.fasta -T 24 -M 55 -H 0 -P /usr/local/src/bbmap/resources/adapters.fa -X 30 

#  -O [desired prefix of output directory]
#  
#  -I [path to directory containing sequences]
#  
#  -S [path to the samples file]
#  
#  -R [path to reference assembly]
#  
#  -T [number of parallel processing threads -- I recommend not exceeding number of virtualized cores]
#  
#  -M [minimum post-trim read length]
#  
#  -H [number of bases to hard trim from the end of reads]
#  
#  -P [fasta file of ISSR motifs used]
#  
#  -X [bbduk trimming kmer, equal to or longer than shortest primer used]

4b. Nuclear SNPs: assemble to the stiltgrass genome

  • trim reads with bbduk (adapters, quality)

  • You can also use a 'pseudo-reference' if a genome is not available by conducting a de novo assembly of one or more of your read pools

  • ISSRseq_Assemble_Reference

  • You need to include ALL of the options below the command, in the commandline:

ISSRseq_ReferenceBased.sh -O Test_JS_nuclear -I zoe_reads -S samples.txt -R Microstegium_vimineum_nuclear_genome.fasta -T 24 -M 55 -H 0 -P /usr/local/src/bbmap/resources/adapters.fa -X 30 

#  -O [desired prefix of output directory]
#  
#  -I [path to directory containing sequences]
#  
#  -S [path to the samples file]
#  
#  -R [path to reference assembly]
#  
#  -T [number of parallel processing threads -- I recommend not exceeding number of virtualized cores]
#  
#  -M [minimum post-trim read length]
#  
#  -H [number of bases to hard trim from the end of reads]
#  
#  -P [fasta file of ISSR motifs used]
#  
#  -X [bbduk trimming kmer, equal to or longer than shortest primer used]

5. Map reads to reference with bbmap

  • You need: -O and -T
nohup ISSRseq_CreateBAMs.sh -O Test_JS_plast_2021_07_20_16_52 -T 24 &


#   -O [desired prefix of output directory]

#   -T [number of parallel processing threads -- I recommend not exceeding number of virtualized cores]

6. Call SNPs with GATK


# Need to include the three options: -T -P and -O

# For plastomes or mitogenomes (haploid):

nohup ISSRseq_AnalyzeBAMs.sh -T 24 -P 1 -O Test_JS_plast_2021_07_20_16_52 &

# For diploids:

nohup ISSRseq_AnalyzeBAMs.sh -T 24 -P 2 -O Test_JS_plast_2021_07_20_16_52 &

# For tetraploids:

nohup ISSRseq_AnalyzeBAMs.sh -T 24 -P 4 -O Test_JS_plast_2021_07_20_16_52 &

7. Create SNP matrices with vcf2phylip


# Need three options: -T -S and -O
# '-S' is the important thing here, i.e. the minimum # individuals in which a SNP is present!
# It is help ful to try multiple values of S from 1-N samples, to assess the level of missing data

ISSRseq_CreateMatrices.sh -T 10 -S 8 -O Test_JS_plast_2021_07_20_16_52

8. Create SNP matrices with vcf2phylip

ISSRseq_CreateMatrices.sh -T 10 -S 8 -O test_Cmac_plast_XX_XX_XX_XX
# This will take a nanosecond

# In your test_Cmac_plast_XX_XX_XX_XX directory, you will now see a 'matrices' directory

9. Remove uninformative/invariant SNPs with the R package 'phrynomics'

R
devtools::install_github("bbanbury/phrynomics")
library(phrynomics) https://github.com/bbanbury/phrynomics

snps <- readSNP("filtered_snps.vcf")
snps2 <- RemoveInvariantSites(snps)
WriteSNP(snps2, file="C_mac_plastid_snps.nex", format="nexus")

Now, edit the nexus file -- change TYPE=standard to TYPE=DNA

You can also get rid of the '_L001' from the taxon names:

sed 's/_L001//g' C_mac_plastid_snps.nex > C_mac_plastid_snps_new.nexus

10. ML analysis with IQtree

Bootstrapping (ultrafast) with ascertainment bias model (+ASC) using modelfinder

iqtree  -s C_mac_plastid_snps_new.nex -bb 1000 -m MFP+ASC
  • The two files you want are .iqtree (stats, etc.) and .contree.
  • Record the important stats (# variable sites, # p-informative sites, best-fit model and BIC score...)
  • Open/edit .contree file in FigTree or Geneious