Standard Operating Procedure for scripts - MeryemAk/dRNASeq GitHub Wiki

1. Start up the computer in the lab and log in.

2. Open Command Prompt, type ubuntu and press Enter. A welcome message appears and the Ubuntu shell is active, showing the LBR account prompt as follows: been850@LBR

3. Move to the home directory with:

cd
pwd  # Show the current location (=home directory)


4. Clone the GitHub repository:

git clone https://github.com/MeryemAk/dRNASeq

Only execute this command if you have not cloned the repository before.

5. Navigate to the scripts directory and make sure all scripts are executable.

cd $HOME/dRNASeq/scripts      # Move to the scripts directory
chmod +x *                    # Make scripts executable
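A narrower alternative (a judgment call, not from the repository): chmod +x * also marks non-scripts such as config.conf executable, while restricting the glob to *.sh touches only the shell scripts. Demonstrated here on throwaway files in a temporary directory:

```shell
# Demonstration on dummy files; the file names are only illustrative.
tmp=$(mktemp -d)
touch "$tmp/2.merge.sh" "$tmp/config.conf"
chmod +x "$tmp"/*.sh          # only the .sh scripts become executable
ls -l "$tmp"                  # config.conf keeps its original mode
```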

6. Download the Kraken database first.

Execute this step every time the GitHub repository is freshly cloned, as the database is not stored in the repository.

7. Install the Conda environment.

Repeat this step whenever the tools within the Conda environment change.

8. Navigate into the dRNASeq folder with:

cd $HOME/dRNASeq   # Navigate to the dRNASeq directory

9. Activate the Conda environment by typing:

conda activate dRNAseq

When activated, (dRNAseq) should appear before the account name.

10. Remove the test data from the 1.data folder and move the actual data into it:

rm -rf 1.data/                            # Remove test data (run from $HOME/dRNASeq)
mv /path/to/data $HOME/dRNASeq/1.data/    # Move actual data into the 1.data folder

Note: this SOP continues with the test data.
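An optional sanity check before continuing (not part of the SOP): a fastq record is always exactly 4 lines, so every complete file should have a line count divisible by 4. Shown here on a dummy file in a temporary directory:

```shell
# Check fastq completeness: line count must be a multiple of 4.
tmp=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n' > "$tmp/sample.fastq"   # one dummy read
for f in "$tmp"/*.fastq; do
  awk 'END { exit NR % 4 }' "$f" && echo "$f looks complete"
done
```

Running the same loop over 1.data/*/*.fastq would flag truncated files before they reach the pipeline.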

11. Navigate to the scripts folder and run the merge script:

cd $HOME/dRNASeq/scripts
./2.merge.sh


Error

config.conf: line 7: $'\r': command not found

Solve by converting to Unix format with:

sudo apt install dos2unix  # Install it first if it is not available yet
dos2unix config.conf
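If dos2unix is unavailable, GNU sed (the default on Ubuntu) can strip the Windows carriage returns as well. Demonstrated on a throwaway file with CRLF line endings:

```shell
# Simulate a config file saved on Windows, then strip the trailing \r in place.
demo=$(mktemp)
printf 'threads=4\r\n' > "$demo"   # CRLF line ending
sed -i 's/\r$//' "$demo"           # remove the carriage return at end of line
```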

A new folder called merged/ will be created under 1.data/. It contains the merged fastq files for every barcode.
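Conceptually, the merge step concatenates each barcode's fastq chunks into one file per barcode. A minimal sketch of that idea on dummy data (file names are illustrative, not the script's actual code):

```shell
# Merge all fastq chunks of one barcode into a single file.
tmp=$(mktemp -d)
mkdir -p "$tmp/barcode01" "$tmp/merged"
printf '@r1\nACGT\n+\nIIII\n' > "$tmp/barcode01/sample_1_1.fastq"
printf '@r2\nTTTT\n+\nIIII\n' > "$tmp/barcode01/sample_1_2.fastq"
cat "$tmp"/barcode01/*.fastq > "$tmp/merged/barcode01_merged.fastq"
wc -l < "$tmp/merged/barcode01_merged.fastq"   # 8 lines = 2 reads
```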

12. Download the SILVA database.

Execute this step every time the GitHub repository is cloned.

13. Filter rRNA from the samples

Run the 3.filter_rRNA.sh script. This maps each sample to the SILVA database downloaded in the previous step. Afterwards, all unmapped (non-rRNA) reads are used for downstream analysis.

cd $HOME/dRNASeq/scripts    # Navigate to the scripts directory
./3.filter_rRNA.sh          # Execute script


14. Trimming with Pychopper

./4.pychopper.sh

The filtered reads are trimmed and the output is written to the 3.trimming directory. A short overview appears in the terminal output, stating how many reads were rescued and how many are unusable. A report file presenting the same information in graphs is also created.

15. Run QC with NanoPack

./5.qc.sh

A new folder 4.qc is created, which holds the NanoPlot reports per sample as well as the NanoComp report. The NanoPlot reports are named [samplename]_NanoPlot-report.html and the NanoComp report is named NanoComp-report.html.

16. Mapping with Minimap2

Before proceeding with mapping, ensure that all reference genomes are downloaded. Detailed instructions can be found in Reference Genomes; these steps only need to be performed once. Once a reference genome is indexed, it can be reused for all future analyses. The mapping step also counts the unmapped reads to give a better idea of which sequences are still present in the data.

Navigate to the scripts folder and execute the mapping script:

cd $HOME/dRNASeq/scripts
./6.mapping.sh


The output is written to one subdirectory per sample. Each directory contains the following files:

  • (sample-name)_filtered_trimmed_bacteria.sam: mapped reads to bacterial genomes
  • (sample-name)_filtered_trimmed_bacteria_unmapped.fastq: unmapped reads to bacterial genomes, used for Kraken2 taxonomic classification
  • (sample-name)_filtered_trimmed_mapped_bacteria_sorted.bam: sorted mapped reads to bacterial genomes
  • (sample-name)_filtered_trimmed_mapped_bacteria_sorted.bam.bai: index for the sorted BAM of reads mapped to bacterial genomes
  • (sample-name)_filtered_trimmed_bacteria_unmapped_counts.txt: text file showing counts of which sequences are unmapped
  • (sample-name)_filtered_trimmed_candida.sam: mapped reads to Candida albicans
  • (sample-name)_filtered_trimmed_candida_unmapped.fastq: unmapped reads to Candida albicans, not used for downstream analysis
  • (sample-name)_filtered_trimmed_mapped_candida_sorted.bam: sorted mapped reads to Candida albicans
  • (sample-name)_filtered_trimmed_mapped_candida_sorted.bam.bai: index for the sorted BAM of reads mapped to Candida albicans
  • (sample-name)_filtered_trimmed_human.sam: mapped reads to Homo sapiens
  • (sample-name)_filtered_trimmed_human_unmapped.fastq: unmapped reads to Homo sapiens, not used for downstream analysis
  • (sample-name)_filtered_trimmed_mapped_human_sorted.bam: sorted mapped reads to Homo sapiens
  • (sample-name)_filtered_trimmed_mapped_human_sorted.bam.bai: index for the sorted BAM of reads mapped to Homo sapiens
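The exact layout of the *_unmapped_counts.txt file is not shown here, but one plausible way such per-sequence counts are produced is tallying identical read sequences in the unmapped fastq (a sketch on fabricated data, not the script's actual code):

```shell
# Count how often each unmapped read sequence occurs, most frequent first.
tmp=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nIIII\n@r3\nTTTT\n+\nIIII\n' \
  > "$tmp/unmapped.fastq"
awk 'NR % 4 == 2' "$tmp/unmapped.fastq" \
  | sort | uniq -c | sort -rn > "$tmp/unmapped_counts.txt"
cat "$tmp/unmapped_counts.txt"   # count, then sequence, per line
```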

17. Quantification with bambu

The final step within the pipeline is creating the count tables. These are generated by executing the 7.counting.sh script. Before executing the script, make sure Docker is installed on your system by following Docker installation step 1.
Pull the bambu Docker image by typing:

docker pull mathiasverbeke/bambu_runner:latest

./7.counting.sh

18. (Optional) Taxonomic classification

To find out which reads were left unmapped after the mapping step, an optional taxonomic classification can be performed. It uses the VMGC Kraken database and loops over the unmapped reads to identify species that were not included in the reference mapping library.

./8.kraken.sh


The following output is generated after executing the script:

  • (sample-name)_filtered_trimmed_classified.fastq: Contains reads that Kraken successfully classified
  • (sample-name)_filtered_trimmed_output.txt: Summary of classification results, including taxonomic assignments for each read
  • (sample-name)_filtered_trimmed_report.txt: A structured report detailing the taxonomic breakdown of the sample, including read counts and abundance statistics per taxon
  • (sample-name)_filtered_trimmed_unclassified.fastq: Contains reads that Kraken couldn't assign to a known taxonomy

19. The terminal can now be closed by clicking the X button in the top-right corner of the window.


All scripts can also be run automatically with the run_all.sh script. Simply navigate to the scripts folder and execute it (not tested yet!).

./run_all.sh

Output

dRNASeq
├── 1.data/
│   ├── barcode01/
│   │   ├── sample_1_1.fastq
│   │   ├── sample_1_2.fastq
│   │   └── sample_1_3.fastq
│   ├── barcode02/
│   │   ├── sample_2_1.fastq
│   │   └── sample_2_2.fastq
│   ├── barcode03/
│   │   ├── sample_3_1.fastq
│   │   ├── sample_3_2.fastq
│   │   └── sample_3_3.fastq
│   └── merged/
│       ├── (sample-name)_merged.fastq
│       ├── (sample-name)_merged.fastq
│       └── (sample-name)_merged.fastq
├── 2.filter_rRNA/
│   ├── (sample-name)_filtered.fastq
│   ├── (sample-name)_filtered.fastq
│   └── (sample-name)_filtered.fastq
├── 3.trimming/
│   ├── pychopper.tsv
│   ├── pychopper_report.pdf
│   ├── (sample-name)_filtered_trimmed.fastq 
│   ├── (sample-name)_filtered_trimmed.fastq 
│   └── (sample-name)_filtered_trimmed.fastq 
├── 4.qc/
│   ├── NanoComp-report.html
│   ├── (sample-name)_filtered_trimmed_NanoPlot-report.html
│   ├── (sample-name)_filtered_trimmed_NanoPlot-report.html
│   └── (sample-name)_filtered_trimmed_NanoPlot-report.html
├── 5.mapping/
│   ├── (sample-name)_filtered_trimmed/
│   │   ├── (sample-name)_filtered_trimmed_bacteria.sam
│   │   ├── (sample-name)_filtered_trimmed_bacteria_unmapped.fastq
│   │   ├── (sample-name)_filtered_trimmed_mapped_bacteria_sorted.bam
│   │   ├── (sample-name)_filtered_trimmed_mapped_bacteria_sorted.bam.bai
│   │   ├── (sample-name)_filtered_trimmed_bacteria_unmapped_counts.txt
│   │   ├── (sample-name)_filtered_trimmed_candida.sam
│   │   ├── (sample-name)_filtered_trimmed_candida_unmapped.fastq
│   │   ├── (sample-name)_filtered_trimmed_mapped_candida_sorted.bam
│   │   ├── (sample-name)_filtered_trimmed_mapped_candida_sorted.bam.bai
│   │   ├── (sample-name)_filtered_trimmed_human.sam
│   │   ├── (sample-name)_filtered_trimmed_human_unmapped.fastq
│   │   ├── (sample-name)_filtered_trimmed_mapped_human_sorted.bam
│   │   └── (sample-name)_filtered_trimmed_mapped_human_sorted.bam.bai
│   ├── (sample-name)_filtered_trimmed/
│   │   ├── (sample-name)_filtered_trimmed_bacteria.sam
│   │   ├── (sample-name)_filtered_trimmed_bacteria_unmapped.fastq
│   │   ├── (sample-name)_filtered_trimmed_mapped_bacteria_sorted.bam
│   │   ├── (sample-name)_filtered_trimmed_mapped_bacteria_sorted.bam.bai
│   │   ├── (sample-name)_filtered_trimmed_bacteria_unmapped_counts.txt
│   │   ├── (sample-name)_filtered_trimmed_candida.sam
│   │   ├── (sample-name)_filtered_trimmed_candida_unmapped.fastq
│   │   ├── (sample-name)_filtered_trimmed_mapped_candida_sorted.bam
│   │   ├── (sample-name)_filtered_trimmed_mapped_candida_sorted.bam.bai
│   │   ├── (sample-name)_filtered_trimmed_human.sam
│   │   ├── (sample-name)_filtered_trimmed_human_unmapped.fastq
│   │   ├── (sample-name)_filtered_trimmed_mapped_human_sorted.bam
│   │   └── (sample-name)_filtered_trimmed_mapped_human_sorted.bam.bai
│   └── (sample-name)_filtered_trimmed/
│       ├── (sample-name)_filtered_trimmed_bacteria.sam
│       ├── (sample-name)_filtered_trimmed_bacteria_unmapped.fastq
│       ├── (sample-name)_filtered_trimmed_mapped_bacteria_sorted.bam
│       ├── (sample-name)_filtered_trimmed_mapped_bacteria_sorted.bam.bai
│       ├── (sample-name)_filtered_trimmed_bacteria_unmapped_counts.txt
│       ├── (sample-name)_filtered_trimmed_candida.sam
│       ├── (sample-name)_filtered_trimmed_candida_unmapped.fastq
│       ├── (sample-name)_filtered_trimmed_mapped_candida_sorted.bam
│       ├── (sample-name)_filtered_trimmed_mapped_candida_sorted.bam.bai
│       ├── (sample-name)_filtered_trimmed_human.sam
│       ├── (sample-name)_filtered_trimmed_human_unmapped.fastq
│       ├── (sample-name)_filtered_trimmed_mapped_human_sorted.bam
│       └── (sample-name)_filtered_trimmed_mapped_human_sorted.bam.bai
├── 6.counting/
│   ├── (sample-name)_filtered_trimmed/
│   ├── (sample-name)_filtered_trimmed/
│   └── (sample-name)_filtered_trimmed/
├── 7.kraken/
│   ├── VMGC_prokaryote_SGB_KrakenDB/
│   ├── VMGC_prokaryote_SGB_KrakenDB.tar.gz
│   ├── (sample-name)_filtered_trimmed/
│   │   ├── (sample-name)_filtered_trimmed_classified.fastq 
│   │   ├── (sample-name)_filtered_trimmed_output.txt  
│   │   ├── (sample-name)_filtered_trimmed_report.txt  
│   │   └── (sample-name)_filtered_trimmed_unclassified.fastq
│   ├── (sample-name)_filtered_trimmed/
│   │   ├── (sample-name)_filtered_trimmed_classified.fastq 
│   │   ├── (sample-name)_filtered_trimmed_output.txt  
│   │   ├── (sample-name)_filtered_trimmed_report.txt  
│   │   └── (sample-name)_filtered_trimmed_unclassified.fastq
│   └── (sample-name)_filtered_trimmed/
│       ├── (sample-name)_filtered_trimmed_classified.fastq
│       ├── (sample-name)_filtered_trimmed_output.txt
│       ├── (sample-name)_filtered_trimmed_report.txt
│       └── (sample-name)_filtered_trimmed_unclassified.fastq
├── reference_genomes/
│   ├── SILVA_138.2_LSURef_NR99_tax_silva.fasta
│   ├── SILVA_138.2_SSURef_NR99_tax_silva.fasta
│   ├── bacteria_annotation.gtf
│   ├── bacteria_index.mmi
│   ├── bacteria_seq.fna
│   ├── bacteria_seq_mmseqs_all_seqs.fasta
│   ├── bacteria_seq_mmseqs_cluster.tsv
│   ├── bacteria_seq_mmseqs_rep_seq.fasta
│   ├── candida_annotation.gtf
│   ├── candida_ref.fna
│   ├── candida_ref.mmi
│   ├── create_bacteria_index.sh
│   ├── human_annotation.gtf
│   ├── human_ref.fna
│   ├── human_ref.mmi
│   ├── rRNA_database.fasta
│   ├── rRNA_database.mmi
│   ├── fna_files/
│   ├── gff_files/
│   └── gtf_files/  
├── scripts/
│   ├── 1.linuxsetup.sh
│   ├── 2.merge.sh
│   ├── 3.filter_rRNA.sh
│   ├── 4.pychopper.sh
│   ├── 5.qc.sh
│   ├── 6.mapping.sh
│   ├── 7.counting.sh
│   ├── 8.kraken.sh
│   ├── config.conf
│   └── run_all.sh