Install Conda and Long Read QC - mestato/EPP622_2024 GitHub Wiki

Background

A cultivar of Cornus florida, 'Cherokee Brave' (a flowering dogwood), was sequenced with PacBio HiFi and Hi-C methods with the goal of assembling a high-quality, haplotype-resolved genome assembly.

Installing miniconda/mamba

Before we can check out our data, we need to install and set up conda in order to run the needed software. Conda is an open source package and environment management system that can run on most operating systems (including our linux servers). It was originally designed for python, but can package and distribute software for any language. It can quickly install, run, and update packages and their dependencies. If a different version of python or a specific piece of software is needed, conda can handle this as an environment management system as well. For more information, please check out conda's website or this software carpentry lesson.

We're going to install Miniforge, a minimal conda installer that also includes mamba.

Navigate back to your home directory. This can be done with just cd and you know you're in your home directory if pwd returns: /nfs/home/<your username>

Once you're sure you're in your home directory, find the link for the latest miniforge version from the miniforge install page and grab it using wget:

wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh

Note: This will set up conda for you on isaac, but not on sphinx/centaur. You can follow the same instructions to set it up on sphinx/centaur.

Then, run the script:

bash Miniforge3-Linux-x86_64.sh

Miniforge will be installed. You will have to scroll through the user agreement, then type "yes" to accept. Then accept the default location. Important: when it asks if you want to initialize conda by default, type "yes" (no is the default).

Now log out and then log back in.
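After logging back in, one way to confirm that the installer put both tools on your PATH is a small check like the sketch below. Note that check_tool is a hypothetical helper name introduced here for illustration, not part of conda or mamba.

```shell
# check_tool is a hypothetical helper: it reports whether a command is on PATH
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1 found at $(command -v "$1")"
    else
        echo "$1 not found on PATH"
        return 1
    fi
}

# "|| true" keeps the script going even if one of the tools is missing
check_tool conda || true
check_tool mamba || true
```

If either tool is reported as missing, the most likely cause is answering "no" to the initialization question or not logging out and back in.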

QC with fastqc

We have a new analysis directory for our dogwood genome. Find it and create a directory for yourself.

cd /lustre/isaac/proj/UTK0318/dogwood_genome/
cd analysis
mkdir <yourusername>
cd <yourusername>

Now we can make a fastqc directory for our first step.

mkdir 1_fastqc
cd 1_fastqc

Install fastqc

Now that we have conda installed and configured, we can think about looking at our raw data. You should always look at your raw data prior to doing anything with it! PacBio and Nanopore often come with their own quality check files, which are useful, but we'll still run fastqc as a check.

Setting up a conda environment with one software installation is a great, reproducible way to get the dependent packages we need and run the software again in the future. This allows you to just activate the environment and be able to use that software again. Creating individual environments for packages/software also ensures that there are no conflicting versions of dependencies. For more information on conda environments please see this or the working with environments portion of the software carpentry lesson.

Now let's create an environment and simultaneously install our software package into it. For this example we will name the environment the same thing as the software package, fastqc, just so it's easy to remember.

mamba create -n fastqc bioconda::fastqc

This is going to take a minute.

And now we need to activate that environment in order to use the software.

mamba activate fastqc

You will need to activate the environment whenever you log in again (similar to module load).

You can tell that it was installed properly by just calling the software:

fastqc --help
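If fastqc is not found, the environment may simply not be active in the current shell. Activation sets the CONDA_DEFAULT_ENV variable, so a quick check could look like the sketch below. Note that active_env is a hypothetical helper name used only for this example.

```shell
# active_env is a hypothetical helper: prints the active environment name,
# or "none" when no environment has been activated in this shell
active_env() {
    echo "${CONDA_DEFAULT_ENV:-none}"
}

if [ "$(active_env)" = "fastqc" ]; then
    echo "fastqc environment is active"
else
    echo "active environment is '$(active_env)'; run: mamba activate fastqc"
fi
```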

Run fastqc

Now we are ready to actually assess the quality of our data.

Let's symbolically link to the fastq.gz files.

ln -s /lustre/isaac/proj/UTK0318/dogwood_genome/raw_data/CherBrave_Run* .
ls
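To see what the wildcard link does without touching the real data, the same pattern can be tried in a scratch directory. This is only a sketch: the directory layout mimics the one above, and the file name is a hypothetical placeholder, not one of the actual read files.

```shell
# Build a throwaway copy of the layout so nothing real is touched
demo=$(mktemp -d)
mkdir "$demo/raw_data" "$demo/1_fastqc"
touch "$demo/raw_data/CherBrave_Run1.fastq.gz"   # hypothetical placeholder file

cd "$demo/1_fastqc"
# One symbolic link is created per file matched by the wildcard
ln -s "$demo"/raw_data/CherBrave_Run* .

# -e follows the link, so it is false for a link whose target is missing
for f in CherBrave_Run*; do
    if [ -e "$f" ]; then
        echo "ok: $f -> $(readlink "$f")"
    else
        echo "broken link: $f"
    fi
done
```

The same for loop can be run in your real 1_fastqc directory to confirm that none of the links are broken.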

Let's create a job array for isaac. First, let's get our list of filenames into a file:

ls *fastq.gz > files.txt

Check that it worked with nano, cat or head.
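The number of lines in files.txt matters, because a job array needs one task per line. The sketch below counts the lines in a scratch directory; the three file names are hypothetical placeholders standing in for the real reads.

```shell
# Work in a throwaway directory with empty placeholder files
demo=$(mktemp -d)
cd "$demo"
touch CherBrave_Run1.fastq.gz CherBrave_Run2.fastq.gz CherBrave_Run3.fastq.gz

ls *fastq.gz > files.txt

# tr strips the padding that some versions of wc add around the count
n=$(wc -l < files.txt | tr -d ' ')
echo "files.txt lists $n files, so the array range would be 1-$n"
```

In your real directory, only the last two commands are needed.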

Now, let's work on our submission script, run_fastqc.qsh:

#!/bin/bash
#SBATCH -J fastqc
#SBATCH --nodes=1
#SBATCH --cpus-per-task=3
#SBATCH -A ISAAC-UTK0318
#SBATCH -p condo-epp622
#SBATCH -q condo
#SBATCH -t 00:30:00
#SBATCH --array=1-3

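# Make "conda activate" available in this non-interactive batch shell
eval "$(conda shell.bash hook)"
conda activate fastqc
# Each array task reads the line of files.txt that matches its task ID
infile=$(sed -n -e "${SLURM_ARRAY_TASK_ID} p" files.txt)
# Thread count matches --cpus-per-task above
fastqc -t 3 "$infile"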

The eval statement is required for conda to work. Slurm does not execute your .bashrc file when running each job, so the conda setup that normally loads at login is missing inside the job; the eval line loads it so that conda activate works. I haven't been successful getting mamba to work with slurm, but I do still prefer it for centaur/sphinx.
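The sed line in the script is what maps each array task to one input file. It can be tried outside of Slurm by setting SLURM_ARRAY_TASK_ID by hand, as in this sketch. The file names are hypothetical placeholders, and the loop only imitates what Slurm does when it starts tasks 1 to 3.

```shell
# Build a small stand-in for files.txt in a scratch directory
demo=$(mktemp -d)
printf '%s\n' a.fastq.gz b.fastq.gz c.fastq.gz > "$demo/files.txt"

# Slurm would set SLURM_ARRAY_TASK_ID itself; here we loop over the IDs by hand
for SLURM_ARRAY_TASK_ID in 1 2 3; do
    # "N p" with -n prints only line N of the file
    infile=$(sed -n -e "${SLURM_ARRAY_TASK_ID} p" "$demo/files.txt")
    echo "task ${SLURM_ARRAY_TASK_ID} -> ${infile}"
done
```

Each task ID selects exactly one line, which is why the --array range has to match the number of lines in files.txt.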

Run it

sbatch run_fastqc.qsh

Check on it with squeue (for example, squeue -u <your username>).
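Once the jobs have finished, each input named like X.fastq.gz should have a matching X_fastqc.html report. The sketch below checks for missing reports in a scratch directory. The file names are hypothetical placeholders, and the touch line only pretends that one task finished.

```shell
# Scratch directory with a stand-in files.txt
demo=$(mktemp -d)
cd "$demo"
printf '%s\n' CherBrave_Run1.fastq.gz CherBrave_Run2.fastq.gz > files.txt
touch CherBrave_Run1_fastqc.html   # pretend only the first task finished

missing=0
while read -r f; do
    # Swap the .fastq.gz suffix for the report suffix
    report="${f%.fastq.gz}_fastqc.html"
    if [ -e "$report" ]; then
        echo "done:    $report"
    else
        echo "missing: $report"
        missing=$((missing + 1))
    fi
done < files.txt
echo "$missing report(s) missing"
```

In your real 1_fastqc directory, only the part from missing=0 onward is needed.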

Copying QC to Desktop

Let's copy the output web summaries of fastqc to our laptops to look at them. Remember, when copying data to/from isaac, we should use the data transfer nodes.

Navigate to where you want to save the QC reports on your laptop, then run:

scp <your username>@dtn1.isaac.utk.edu:/lustre/isaac/proj/UTK0318/dogwood_genome/analysis/<yourusername>/1_fastqc/*html .

If you are using a Mac with the zsh shell, this will probably throw an error about no matches being found. If that's the case, you can escape the wildcard like this:

scp <your username>@dtn1.isaac.utk.edu:/lustre/isaac/proj/UTK0318/dogwood_genome/analysis/<yourusername>/1_fastqc/\*html .
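Quoting the remote path is an equivalent way to stop the local shell from expanding the wildcard. The sketch below shows that both forms hand the same literal text to the command; user@host and /some/dir are placeholders, not the real isaac path.

```shell
# Backslash-escaped wildcard: the local shell passes the * through untouched
escaped=$(printf '%s\n' user@host:/some/dir/\*html)
# Single-quoted path: same effect, and easier to read for long paths
quoted=$(printf '%s\n' 'user@host:/some/dir/*html')

echo "$escaped"
echo "$quoted"
```

Both lines print user@host:/some/dir/*html, so the pattern is left for the remote side to match against the report files.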