Deprecated 02. Using Conda - davidaray/Genomes-and-Genome-Evolution GitHub Wiki

#------------------------------------------------------------------------------#

INSTALLING A LOCAL COPY OF PYTHON

#------------------------------------------------------------------------------#

Python is a useful programming language that I use a lot and that we will take advantage of in this course. You will need to have Python installed to accomplish many of the tasks in this class. You may already have python3 installed if you’ve used HPCC previously, but if not, we’ll take care of that here. If you already have a Python installation, you can skip directly to the last step below, getting Python into your working environment (if there are issues, contact me or the TA on Slack). The instructions for installing conda are derived from http://www.depts.ttu.edu/hpcc/userguides/application_guides/python.local_installation.php. You should read this guide sometime, but for now:

interactive -p nocona -c 1

Install Miniconda (a minimal Python distribution that comes with conda and can be expanded as needed)

/lustre/work/examples/InstallPython.sh

Get Python into your working environment

. /home/[eraider]/conda/etc/profile.d/conda.sh

conda activate
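If you want to confirm that the commands above worked, a quick check along these lines can help. This is only a minimal sketch; the function name check_conda is made up for illustration and is not part of conda or the HPCC guide.

```shell
# check_conda: hypothetical helper (not part of conda or the HPCC guide).
# Reports whether the conda command is available in the current shell.
check_conda() {
    if command -v conda >/dev/null 2>&1; then
        echo "conda found: $(conda --version)"
        # Show which python the shell will run from now on.
        echo "python is: $(command -v python)"
    else
        echo "conda not found; source conda.sh first (see the step above)"
        return 1
    fi
}
```

After pasting the function into your terminal, run check_conda. If conda is available, the python path it reports should normally point inside your conda directory rather than at the system-wide Python.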

Conda is an exceptionally useful package and environment manager. But it can be a little tricky if you want to use it for multiple tasks. Sometimes when you try to install multiple packages into a single conda environment, they conflict. Resolving those conflicts can be difficult, and it's best to avoid them altogether by using multiple conda environments. While you don't have to do this, my experience has been that it's worth the trouble.

Thus, for each of the tasks you'll accomplish in this class, I will recommend creating a separate conda environment. Complete instructions for working with conda environments can be found at this helpful site.

Over the years of teaching this class, I've noticed that some people have a bit of trouble understanding what's happening with these environments. Here is an analogy that some have found helpful. With conda, you are building a separate workshop for whatever you intend to do. In any workshop you need tools. In a woodworking shop, you need a saw, sandpaper, a drill, etc. In a kitchen (a type of workshop), you need an oven, utensils, a sink, etc. For each task we will use conda create -n <name of environment> to build our workshop. We then enter the workshop using conda activate <name of environment>. At that point we've entered the workshop to do the work, but the workshop is empty. No tools have been brought in yet. We bring in the necessary tools by using conda install <name of the software being installed>. Sometimes you need to go to a specialty store (Lowe's, Home Depot, Kitchen Suppliers Inc.) to get the right tools. In conda, those specialty stores are called 'channels'. To specify a particular channel, you use the -c option. Two of the channels we will use are bioconda and conda-forge.

So, the general pattern will be as follows:

Build your workshop/conda environment - conda create
Enter your workshop/environment - conda activate
Install all of your tools/software - conda install
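To make the pattern concrete, here is a minimal sketch that strings the three steps together. The function name build_workshop and the example names used with it are hypothetical; typing the three conda commands by hand does exactly the same thing. It assumes conda.sh has already been sourced so that conda activate works in your shell.

```shell
# build_workshop: hypothetical helper that runs the three steps in order.
# Usage: build_workshop <name of environment> <name of software>
# Assumes conda.sh has already been sourced in this shell.
build_workshop() {
    env_name="$1"
    package="$2"

    # 1. Build the workshop (an empty environment); -y skips the yes/no prompt.
    conda create -y -n "$env_name" || return 1

    # 2. Enter the workshop.
    conda activate "$env_name" || return 1

    # 3. Bring in the tools, shopping at the conda-forge and bioconda channels.
    conda install -y -c conda-forge -c bioconda "$package"
}
```

For example, build_workshop myworkshop sometool would create an environment called myworkshop, activate it, and install sometool into it, where both names are placeholders.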

You're then ready to use your new workshop. From then on, there is no need to rebuild your environment or reinstall the tools; you can just activate your environment and start work. You wouldn't build a new workshop every time you need to drill a hole, would you? No, you just go back to your workshop and use the drill. So, DO NOT create the environment or install software every time you need to use it. You'll just be wasting your time, and doing that will tell me that you don't actually understand what conda is all about.
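In other words, every session after the first one needs only the activate step. A minimal sketch of what coming back to an existing workshop might look like is below. The function name enter_workshop is hypothetical, and the location stored in CONDA_SH is an assumption based on the install path shown earlier; adjust it if your conda directory is somewhere else.

```shell
# enter_workshop: hypothetical helper for every session AFTER the first one.
# It only activates an existing environment; it never creates or installs.
# CONDA_SH is an assumed location; adjust it to match your own installation.
CONDA_SH="${CONDA_SH:-$HOME/conda/etc/profile.d/conda.sh}"

enter_workshop() {
    # Load conda into this shell if it is not available yet.
    if ! command -v conda >/dev/null 2>&1; then
        if [ -f "$CONDA_SH" ]; then
            . "$CONDA_SH"
        else
            echo "cannot find $CONDA_SH" >&2
            return 1
        fi
    fi
    # Walk back into the workshop that was built earlier.
    conda activate "$1"
}
```

Running enter_workshop myworkshop (a placeholder name) puts you back in the environment with all of its tools still installed.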

For each activity in the course, we will create a separate working environment.

#------------------------------------------------------------------------------#

FOR YOU TO DO

#------------------------------------------------------------------------------#

I'm going to get you to set up a simple conda environment and install a couple of packages just to get you familiar with the process.

Perform the following tasks using the cheat sheet linked above.

Using the information above, create a new environment called 'muscle' and activate it.

In that environment, install 'muscle', a DNA sequence alignment package. To do that, enter

conda install -c bioconda muscle

You may be prompted to update conda to get this to work. If so, just enter

conda update -n base -c conda-forge conda

Once muscle is installed, just type muscle at the command prompt. If the program is correctly installed, you will see some helpful information about the program pop up.

$ muscle

muscle 5.1.linux64 []  264Gb RAM, 64 cores
Built Feb 24 2022 03:16:15
(C) Copyright 2004-2021 Robert C. Edgar.
https://drive5.com

Align FASTA input, write aligned FASTA (AFA) output:
    muscle -align input.fa -output aln.afa

Align large input using Super5 algorithm if -align is too expensive,
typically needed with more than a few hundred sequences:
    muscle -super5 input.fa -output aln.afa

Single replicate alignment:
    muscle -align input.fa -perm PERM -perturb SEED -output aln.afa
    muscle -super5 input.fa -perm PERM -perturb SEED -output aln.afa
        PERM is guide tree permutation none, abc, acb, bca (default none).
        SEED is perturbation seed 0, 1, 2... (default 0 = don't perturb).

Ensemble of replicate alignments, output in Ensemble FASTA (EFA) format,
EFA has one aligned FASTA for each replicate with header line "<PERM.SEED":
    muscle -align input.fa -stratified -output stratified_ensemble.efa
    muscle -align input.fa -diversified -output diversified_ensemble.afa

    -replicates N
        Number of replicates, defaults 4, 100, 100 for stratified,
          diversified, resampled. With -stratified there is one
          replicate per guide tree permutation, total is 4 x N.

Generate resampled ensemble from existing ensemble by sampling columns
with replacement:
    muscle -resample ensemble.efa -output resampled.efa

    -maxgapfract F
       Maximum fraction of gaps in a column (F=0..1, default 0.5).

    -minconf CC
       Minimum column confidence (CC=0..1, default 0.5).

If ensemble output filename has @, then one FASTA file is generated
for each replicate where @ is replaced by perm.s, otherwise all replicates
are written to one EFA file.

Calculate disperson of an ensemble:
    muscle -disperse ensemble.efa

Extract replicate with highest total CC (diversified input recommended):
    muscle -maxcc ensemble.efa -output maxcc.afa

Extract aligned FASTA files from EFA file:
    muscle -efa_explode ensemble.efa

Convert FASTA to EFA, input has one filename per line:
    muscle -fa2efa filenames.txt -output ensemble.efa

Update ensemble by adding two sequences of digits to each replicate, digits
are column confidence (CC) values, e.g. "73" means CC=0.73, "++" is CC=1.0:
    muscle -addconfseqs ensemble.efa -output ensemble_cc.efa

Calculate letter confidence (LC) values, -ref specifies the alignment to
compare against the ensemble (e.g. from -maxcc), output is in aligned
FASTA format with LC values 0, 1 ... 9 instead of letters:
    muscle -letterconf ensemble.efa -ref aln.afa -output letterconf.afa

    -html aln.html
        Alignment colored by LC in HTML format.

    -jalview aln.features
        Jalview feature file with LC values and colors.

More documentation at:
    https://drive5.com/muscle
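If you'd like to see muscle do some actual work, you can optionally try the first usage line from the help text on a tiny input. The sketch below is only an illustration: the file names toy.fa and toy.afa and the three short sequences are made up, and the alignment step is skipped if muscle is not on your path.

```shell
# Write a tiny FASTA file with three made-up sequences (toy data only).
cat > toy.fa <<'EOF'
>seq1
ACGTACGTACGTTAGC
>seq2
ACGTACGAACGTTAGC
>seq3
ACGTACGTACGTAGC
EOF

# Align only if muscle is available in the active environment.
if command -v muscle >/dev/null 2>&1; then
    muscle -align toy.fa -output toy.afa
else
    echo "muscle not found; activate the environment where you installed it"
fi
```

If the alignment runs, toy.afa should contain the same three sequences in aligned FASTA format.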

Now, do two more things to complete this exercise.

Generate the last few lines of your history that include these steps and copy the output to a Word document.
List your conda environments using the appropriate terminal command. You can learn how to do that using the Conda Cheat Sheet. Copy that list to the same Word document.

Upload your Word document to Assignment 2 - Conda.

NOTE - In the 2023 iteration of this class, we had significant difficulties with conda environments conflicting with one another and with variation in whether everyone could install the same packages. Thus, I'm trying a new system this year. If you continue working in bioinformatics, though, conda will be a useful skill in your own work, and I think it's an important one to learn.