Pipeline steps

Clone pipeline into working repo

  1. git clone https://github.com/jcmcnch/eASV-pipeline-for-515Y-926R.git
  2. If using for the first time, install all envs and prereqs:
cd eASV-pipeline-for-515Y-926R/qiime2-2022.2-DADA2-SILVA138.1-PR2_4.14.0/
source eASV-pipeline-for-515Y-926R/qiime2-2022.2-DADA2-SILVA138.1-PR2_4.14.0/00-trimming-sorting-scripts/00-run-cutadapt.sh
source eASV-pipeline-for-515Y-926R/qiime2-2022.2-DADA2-SILVA138.1-PR2_4.14.0/00-trimming-sorting-scripts/01-sort-16S-18S-bbsplit.sh

** SPLIT PIPELINE **

  1. Log in remotely to Compute Canada
  2. bbmap-env is already installed on the server

ssh -Y username@computecanada
  1. From local, make a folder for the analysis
mkdir -p ch2/split_pipeline
  1. Follow guidelines in README file to split 16S/18S

  2. Update mamba according to https://github.com/conda-forge/miniforge

  3. Add bioconda and pytorch channels to successfully install conda envs

  4. Edit setup-scripts/02 and 03 to point to the correct path to the databases (based them in ~/)

  5. change the path to the databases folder to the correct updated location; a quick check that nothing was missed is sketched after the command

find . -type f | xargs perl -pi -e "s|/home/$USER|~|g" #double-quoted so the shell expands $USER (in single quotes perl would treat $USER as an empty Perl variable)
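A quick check that no absolute home paths slipped through (a hedged sketch; adjust the directory if the scripts live elsewhere):

grep -rn "/home/$USER" setup-scripts/ || echo "no absolute paths left"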
  1. Copy all data separated by year, and move each year's raw files into /year/00-raw/
for file in *.fastq.gz; do mv "$file" "${file/_001.fastq.gz/.fastq.gz}"; done #rename so the script finds the expected R1.fastq.gz/R2.fastq.gz endings
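A dry run of the rename (same pattern, just echoed) can help catch surprises before anything is moved:

for file in *.fastq.gz; do echo mv "$file" "${file/_001.fastq.gz/.fastq.gz}"; done #preview only; drop echo to apply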
  1. Run step 01. libgcc-ng and libstdcxx-ng only exist for Linux; on macOS use:
brew install gcc
  1. Step 01 still fails, so install cutadapt independently:
conda create -n cutadaptenv cutadapt
  1. remove the '.' from .R1 and .R2 in 515FY926R.cfg manually (or with the one-liner below)
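If editing by hand gets tedious, a sed one-liner should do the same job (a sketch; -i.bak keeps a backup of the original):

sed -i.bak 's/\.R1/R1/g; s/\.R2/R2/g' 515FY926R.cfg #removes the leading '.' from .R1/.R2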

SSH into Compute Canada

  1. log in to Compute Canada and cd to split_pipeline/; there will be a folder for each year. Then clone the pipeline:
git clone https://github.com/jcmcnch/eASV-pipeline-for-515Y-926R.git

  2. Transfer all files from local to username@computecanada for processing of the data. It's too slow to use scp for the transfer to Compute Canada, so use rsync:

rsync -avP _yearfolder_ username@computecanada:destination_folder
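For example, with hypothetical names (one year folder at a time; -avP preserves attributes, shows progress, and lets interrupted transfers resume):

rsync -avP 2014/ username@computecanada:~/ch2/split_pipeline/2014/ #paths are placeholders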
  1. Create an env and install cutadapt in it
  2. Update mamba according to https://github.com/conda-forge/miniforge
  3. Also add the bioconda and pytorch channels to successfully install the conda envs
virtualenv --no-download cutadapt-env
source cutadapt-env/bin/activate
pip install cutadapt
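A quick sanity check that the env is active and cutadapt is on the PATH:

cutadapt --version #prints the installed version if the virtualenv is set up correctly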
  1. edited setup-scripts/02 and 03 to change the path to the databases (based them in ~/)
  2. change the path to the databases folder to the correct updated location
find . -type f | xargs perl -pi -e "s|/home/$USER|~|g" #same quoting fix as above
  1. copy all data separated by year, and move each year's raw files into /year/00-raw/
for file in *.fastq.gz; do mv "$file" "${file/_001.fastq.gz/.fastq.gz}"; done #rename so the script finds the expected endings
  1. run step 01. libgcc-ng and libstdcxx-ng only exist for Linux; on macOS:
brew install gcc
  1. step 01 still fails, so install cutadapt independently:
conda create -n cutadaptenv cutadapt

remove the '.' from .R1 and .R2 in 515FY926R.cfg

  1. edit the script to add the variables and to specify the raw-file endings, rawFileEndingR1=R1.fastq.gz and rawFileEndingR2=R2.fastq.gz (see the sketch after the nano command)
nano split_pipeline/eASV-pipeline-for-515Y-926R/qiime2-2022.2-DADA2-SILVA138.1-PR2_4.14.0/00-trimming-sorting-scripts/00-run-cutadapt.sh
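Inside 00-run-cutadapt.sh, the edited lines should end up looking like this (variable names from the note above; values match this dataset's suffixes after the earlier rename):

rawFileEndingR1=R1.fastq.gz
rawFileEndingR2=R2.fastq.gz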
sbatch split_pipeline/eASV-pipeline-for-515Y-926R/qiime2-2022.2-DADA2-SILVA138.1-PR2_4.14.0/00-trimming-sorting-scripts/01-sort-16S-18S-bbsplit.sh
  1. for the years I ran bbsplit without correcting the tsv outputs:
source split_pipeline/eASV-pipeline-for-515Y-926R/qiime2-2022.2-DADA2-SILVA138.1-PR2_4.14.0/02-utility-scripts/calc-EUK-fraction.sh > 230903-1126.EUKfrac-whole-dataset-after-bbsplit.tsv

Analysis of prokaryotes and eukaryotes separately

  1. After bbsplit, cd into PROKS and create the manifest file:
source split_pipeline/eASV-pipeline-for-515Y-926R/qiime2-2022.2-DADA2-SILVA138.1-PR2_4.14.0/01-prok-scripts/P00-create-manifest.sh
  1. import to qiime; first, build the qiime image:
module load apptainer
apptainer build qiime2-2023.5.sif docker://quay.io/qiime2/core:2023.5

then write import.sh; apptainer has a hard time with symlinks, so $PWD has to be the physical path (pwd -P). A sketch of import.sh follows the sbatch command below.

  1. then run the import for each year's 02-PROKs
sbatch import.sh
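A minimal sketch of what import.sh could look like, assuming a CSV manifest from P00-create-manifest.sh and the qiime2-2023.5.sif image built above; the SLURM resources and manifest format are guesses to adapt:

#!/bin/bash
#SBATCH --time=03:00:00
#SBATCH --mem=16G
module load apptainer
workdir=$(pwd -P) #physical path, since apptainer struggles with symlinks
apptainer exec -B "$workdir":/data qiime2-2023.5.sif \
qiime tools import \
--type 'SampleData[PairedEndSequencesWithQuality]' \
--input-path /data/manifest.csv \
--output-path /data/reads.qza \
--input-format PairedEndFastqManifestPhred33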
  1. from Julie's server to local, then from local to Compute Canada:
rsync -av larochelab:~/Diana/BB2014-2021a/raw_data/ rawdata #from the lab server to local
rsync -av rawdata/ dhaider@cc:split_pipeline/rawdata #from local to Compute Canada

  2. Edit E03-bbduk-cut-reads.sh to include a line that loads the bbmap module to run bbduk, as sketched below
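The added line is presumably just a module load (assuming the cluster ships bbmap as a module; the exact name may differ):

module load bbmap #makes bbduk.sh available to E03-bbduk-cut-reads.sh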

  3. Inspect read quality

mkdir fastqc_out
fastqc -t 1 raw_data/*.fastq.gz -o fastqc_out

cd fastqc_out
multiqc .
cd ..

#2.0 Download the qiime environment and activate it; follow the install instructions at https://docs.qiime2.org/2023.5/install/native/

source activate qiime2-2023.2
mkdir reads_qza
qiime tools import \
--type SampleData[PairedEndSequencesWithQuality] \
--input-path raw_data/ \
--output-path reads_qza/reads.qza \
--input-format CasavaOneEightSingleLanePerSampleDirFmt
qiime cutadapt trim-paired \
--i-demultiplexed-sequences reads_qza/reads.qza \
--p-cores 1 \
--p-front-f GTGYCAGCMGCCGCGGTAA \
--p-front-r CCGYCAATTYMTTTRAGTTT \
--p-discard-untrimmed \
--p-no-indels \
--o-trimmed-sequences reads_qza/reads_trimmed.qza
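To check how many reads survived primer trimming, a demux summary can be generated (standard QIIME 2 command; the output name is arbitrary):

qiime demux summarize \
--i-data reads_qza/reads_trimmed.qza \
--o-visualization reads_qza/reads_trimmed_summary.qzv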
wget https://data.qiime2.org/2023.5/common/silva-138-99-nb-classifier.qza

apptainer exec -B 2014/dada2_output:/inputs -B 2014:/class -B 2014:/outputs qiime2-2023.2.sif \
qiime feature-classifier classify-sklearn --i-reads /inputs/representative_sequences.qza --i-classifier /class/silva-138-99-nb-classifier.qza --output-dir /outputs/taxa --verbose #the /outputs bind is assumed so --output-dir has somewhere to land
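Once classification finishes, the assignments can be inspected as a table (a standard follow-up; the path assumes the binds above, i.e. the results land in 2014/taxa on the host):

qiime metadata tabulate --m-input-file 2014/taxa/classification.qza --o-visualization 2014/taxa/classification.qzv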

#from local
qiime feature-table summarize --i-table dada2_output_270210/table.qza --o-visualization dada2_output_270210/dd2270210_table_summary.qzv

qiime feature-table filter-features --i-table dada2_output_270210/table.qza --p-min-frequency 34 --p-min-samples 1 --o-filtered-table dada2_output_270210/table_filt.qza

qiime taxa filter-table --i-table dada2_output_270210/table_filt.qza --i-taxonomy taxa_270210/classification.qza --p-exclude mitochondria,chloroplast --o-filtered-table dada2_output_270210/table_filt_contam.qza

qiime feature-table summarize --i-table dada2_output_270210/table_filt_contam.qza --o-visualization dada2_output_270210/table_filt_contam_summary.qzv

qiime diversity alpha-rarefaction --i-table dada2_output_270210/table_filt_contam.qza --p-max-depth 34 --p-steps 20 --p-metrics 'observed_features' --o-visualization rarefaction_curves_test_270210.qzv

qiime feature-table filter-samples --i-table dada2_output_270210/table_filt_contam.qza --p-min-frequency 7500 --o-filtered-table dada2_output_270210/table_filt_min.qza

qiime feature-table filter-seqs --i-data dada2_output_270210/representative_sequences.qza --i-table dada2_output_270210/table_filt_contam.qza --o-filtered-data dada2_output_270210/rep_seqs_filt_contam_final.qza

qiime fragment-insertion sepp --i-representative-sequences dada2_output_270210/rep_seqs_filt_contam_final.qza --i-reference-database sepp-refs-gg-13-8.qza --o-tree asvs-tree.qza --o-placements insertion-placements.qza

qiime diversity core-metrics-phylogenetic --i-table dada2_output_270210/table_filt_contam.qza --i-phylogeny asvs-tree.qza --p-sampling-depth 8000 --m-metadata-file METADATA_2.tsv --p-n-jobs-or-threads 1 --output-dir diversity --verbose
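The pseudocount table fed to ancom below doesn't exist yet at this point; the standard step that creates it is add-pseudocount (a sketch matching the file names used below):

qiime composition add-pseudocount --i-table dada2_output_270210/table_filt_contam.qza --o-composition-table dada2_output_270210/table_filt_contam_pseudocount.qza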

qiime composition ancom --i-table dada2_output_270210/table_filt_contam_pseudocount.qza --m-metadata-file METADATA_2.tsv --m-metadata-column 'Depth code' --output-dir ancom_output

#export biom table
qiime tools export --input-path dada2_output/table_filt_contam.qza --output-path dada2_output_exported

biom convert -i feature-table.biom -o feature-table.tsv --to-tsv

#clone repo with pipeline from the Fuhrman lab
git clone https://github.com/jcmcnch/eASV-pipeline-for-515Y-926R.git

#make files executable
chmod a+x ./*

#change qiime version used in code to 2023.2
find . -type f | xargs perl -pi -e 's/qiime2-2019.4/qiime2-2023.2/g'

#setup from the pipeline README
cd eASV-pipeline-for-515Y-926R/
./setup-scripts/00-install-qiime2-2022.2.sh
./setup-scripts/01-install-conda-envs.sh
./setup-scripts/02-download-qiime2-classifiers-qiime2-2022.2.sh
./setup-scripts/03-make-bbsplit-db.sh

#run first few commands
source eASV-pipeline-for-515Y-926R/qiime2-2022.2-DADA2-SILVA138.1-PR2_4.14.0/00-trimming-sorting-scripts/00-run-cutadapt.sh

qiime feature-table summarize --i-table dada2_output/table.qza --o-visualization dada2_output/table_summary.qzv

qiime feature-table filter-features --i-table dada2_output/table.qza --p-min-frequency 19 --p-min-samples 1 --o-filtered-table dada2_output/table_filt.qza

Per-year --p-min-frequency cutoffs: 2014: 13, 2015: 20, 2016: 30, 2017: 18, 2018: 31, 2019: 34, 2020: 19, 2021: 28 (a loop applying these is sketched below).
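A hedged sketch of applying the per-year cutoffs in one pass, assuming bash 4+ and that each year's outputs live under year/dada2_output/:

declare -A cutoff=( [2014]=13 [2015]=20 [2016]=30 [2017]=18 [2018]=31 [2019]=34 [2020]=19 [2021]=28 )
for year in "${!cutoff[@]}"; do
qiime feature-table filter-features \
--i-table "$year"/dada2_output/table.qza \
--p-min-frequency "${cutoff[$year]}" \
--p-min-samples 1 \
--o-filtered-table "$year"/dada2_output/table_filt.qza
done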

qiime taxa filter-table --i-table dada2_output/table_filt.qza --i-taxonomy taxa/classification.qza --p-exclude mitochondria,chloroplast --o-filtered-table dada2_output/table_filt_contam.qza

qiime feature-table summarize --i-table dada2_output/table_filt_contam.qza --o-visualization dada2_output/table_filt_contam_summary.qzv

qiime diversity alpha-rarefaction --i-table dada2_output/table_filt_contam.qza --p-max-depth 70000 --p-steps 20 --p-metrics 'observed_features' --o-visualization rarefaction_curves_test.qzv

qiime feature-table filter-samples --i-table dada2_output/table_filt_contam.qza --p-min-frequency 7500 --o-filtered-table dada2_output/table_filt_min.qza

qiime feature-table filter-seqs --i-data dada2_output/representative_sequences.qza --i-table dada2_output/table_filt_contam.qza --o-filtered-data dada2_output/rep_seqs_filt_contam_final.qza

qiime fragment-insertion sepp --i-representative-sequences dada2_output/rep_seqs_filt_contam_final.qza --i-reference-database ~/Documents/escuela/phd/bb_data/sepp-refs-gg-13-8.qza --o-tree asvs-tree.qza --o-placements insertion-placements.qza

qiime diversity core-metrics-phylogenetic --i-table dada2_output/table_filt_contam.qza --i-phylogeny asvs-tree.qza --p-sampling-depth 17000 --m-metadata-file METADATA.txt --p-n-jobs-or-threads 1 --output-dir diversity --verbose

Per-year --p-sampling-depth values: 2014: 17000, 2015: 20000, 2016: 35000.

qiime composition ancom --i-table dada2_output/table_filt_contam_pseudocount.qza --m-metadata-file METADATA_2.tsv --m-metadata-column 'Depth code' --output-dir ancom_output