Assembling with Flye on EC2 - Green-Biome-Institute/AWS GitHub Wiki
This page explains how to run Flye to assemble a genome on AWS.
Flye on Ubuntu
Note: Flye runs using Anaconda. On a Linux-based OS, we need to use an "x86" system, since Anaconda is not supported on ARM processors (at least not for our purposes).
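To confirm that an instance meets the x86 requirement in the note above, the machine type can be checked with `uname -m`. The following is a minimal sketch; the helper name `is_x86_64` is made up for illustration and is not part of Flye or conda.

```shell
# Return success (0) when the given machine string names a 64-bit x86 CPU.
# With no argument, the local machine type reported by `uname -m` is checked.
# `is_x86_64` is a hypothetical helper name used only for this example.
is_x86_64() {
    arch="${1:-$(uname -m)}"
    case "$arch" in
        x86_64|amd64) return 0 ;;
        *) return 1 ;;
    esac
}

if is_x86_64; then
    echo "x86_64 instance: the Miniconda x86_64 installer applies"
else
    echo "$(uname -m) instance: choose an x86 instance type instead"
fi
```

An ARM instance typically reports `aarch64`, which this check treats as unsupported.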
If you are using an instance that is already set up to run Flye, start at step 7.
(04/06/21) The current custom EC2 Flye AMI for GBI has the ID ami-0aac70556c389dab1 and the name GBI_FlyeAssembler_Ubuntu_x86_r5.xlarge. To create an instance from this AMI, follow the instructions on the EC2 page.
If you are starting a brand new instance without Flye and its dependencies already installed, go to the EC2 page to build and launch the new instance, then follow all of these steps:
- Start an Ubuntu instance with a 64-bit (x86) processor
- Log in through terminal:
$ ssh -i /path/to/keypairs/keypair.pem [email protected]
example:
ssh -i /Users/flintmitchell/AWS_keypairs/flints-keypair-1.pem [email protected]
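If ssh refuses the keypair with an "unprotected private key file" warning, the key file's permissions are usually too open. A minimal sketch of the usual fix follows; `restrict_key` and `key_mode` are hypothetical helper names, and the path you pass in is your own keypair file.

```shell
# Make a private key readable only by its owner (mode 400).
# `restrict_key` is a hypothetical helper name used only for this example.
restrict_key() {
    chmod 400 "$1"
}

# Print the permission string of a file, e.g. -r-------- for mode 400.
key_mode() {
    ls -l "$1" | cut -c1-10
}
```

For example, `restrict_key /path/to/keypairs/keypair.pem` followed by `key_mode /path/to/keypairs/keypair.pem` lets you confirm the change before retrying the ssh command.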
- Set up the basics and dependencies:
- $ sudo apt update && sudo apt-get upgrade
- $ sudo apt install build-essential g++ make zlib1g zlib1g-dev
- Download Miniconda (Anaconda Python without the unnecessary packages, since this instance only needs certain packages).
- $ cd
- $ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
- $ bash Miniconda3-latest-Linux-x86_64.sh
Press enter to scroll down the license agreement
Enter "yes" for the default settings
Next
- $ rm Miniconda3-latest-Linux-x86_64.sh
If you accidentally did not enter yes when the Miniconda installer asked to initialize conda (this sets the correct PATH for your conda environment), use the following:
- $ conda init bash
- $ source ~/.bashrc
Lastly
- $ conda update --name base conda --yes
- Install Flye
- $ cd
- $ git clone https://github.com/fenderglass/Flye
- $ cd Flye
- $ python setup.py install
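Before moving on, it can help to confirm that both conda and Flye ended up on your PATH. The sketch below assumes the install steps above put commands named `conda` and `flye` on the PATH; `check_tools` is a hypothetical helper name.

```shell
# Print "ok: NAME" or "missing: NAME" for each command name given.
# Returns non-zero if any of them cannot be found on the PATH.
# `check_tools` is a hypothetical helper name used only for this example.
check_tools() {
    any_missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "ok: $tool"
        else
            echo "missing: $tool"
            any_missing=1
        fi
    done
    return "$any_missing"
}

check_tools conda flye || echo "Re-run the install steps for anything listed as missing."
```

If `conda` is reported missing right after installation, running `source ~/.bashrc` or opening a new terminal session is often enough.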
- Make a folder to organize your data and the results that will come from the assembly:
mkdir data_folder_name
example:
mkdir my_genome_assembly
- Copy the data to this new folder
- From your local computer using scp:
scp -i /path/to/keypairs/keypair.pem local/path/to/data/filename.fastq [email protected]:~/data_folder_name
example:
scp -i /Users/flintmitchell/AWS_keypairs/flints-keypair-1.pem local_path_to_data_files/sequencing_files.fastq [email protected]:~/my_genome_assembly
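Large sequencing files occasionally arrive incomplete if a transfer is interrupted. One way to check, sketched here under the assumption that `cksum` is available on both machines, is to compare the checksum and byte count of the local file with those of the copy on the instance. `file_checksum` is a hypothetical helper name.

```shell
# Print the CRC checksum and byte count of a file.
# Reading from stdin keeps the file name out of the output, so the
# result can be compared directly between two machines.
# `file_checksum` is a hypothetical helper name used only for this example.
file_checksum() {
    cksum < "$1"
}
```

Run `file_checksum local/path/to/data/filename.fastq` on your computer and `file_checksum ~/data_folder_name/filename.fastq` on the instance. If the two lines of output match, the copy is very likely intact.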
- Assemble your genome with Flye! Here's an example for Oxford Nanopore data from lambda phage:
flye --nano-raw /home/ubuntu/lambda-ont-data/fastq_runid_5dd3f31631aaf8b094e6dfd522b916c92d81e5a/c_0.fastq --genome-size 48502 --out-dir ~/lambda-ont-flye-results/ --threads 4
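The `--threads` value in the command above can be matched to the number of cores on the instance. The sketch below only prints a Flye command line so you can inspect it before running it. `cpu_count` and `flye_command` are hypothetical helper names, the file and folder names are placeholders, and paths containing spaces are not handled.

```shell
# Number of CPU cores visible to this instance; falls back to 1 if
# the nproc command is unavailable.
cpu_count() {
    nproc 2>/dev/null || echo 1
}

# Print (do not run) a flye command line built from the arguments:
#   $1 reads file, $2 genome size, $3 output folder, $4 threads (optional)
# Only flags that appear in the example above are used.
flye_command() {
    reads="$1"
    genome_size="$2"
    out_dir="$3"
    threads="${4:-$(cpu_count)}"
    echo "flye --nano-raw $reads --genome-size $genome_size --out-dir $out_dir --threads $threads"
}

# Placeholder arguments; substitute your own reads file and output folder.
flye_command reads.fastq 48502 flye-results
```

Leaving out the fourth argument uses every core that `nproc` reports, so the printed command changes with the instance type.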
- Downloading your results.
- Results from Flye are saved in the folder that you specify with the flag --out-dir [folder-name]. We can once again use the scp command from step 8 (with a slight change) to copy the results to our local storage. We will also use the flag -r, which copies all the files in a given folder recursively. Two flags can be combined, so -r and -i become -ri (note: not -ir; order matters, because -i must be followed directly by the keypair path). The order of the arguments is: scp, -ri, keypair, results on the EC2 instance, local folder:
scp -ri /path/to/keypairs/keypair.pem [email protected]:~/data_folder_name/results_folder_name local/path/to/results_folder
example:
scp -ri /Users/flintmitchell/Desktop/GBI/AWS_keypairs/flints-keypair-1.pem [email protected]:~/lambda-phage-data/lambda-phage-ont /Users/flintmitchell/Desktop/GBI/Results
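Once the results folder is on your computer, a quick sanity check is to count the contigs in the assembly. This sketch assumes the results folder contains Flye's `assembly.fasta` and counts FASTA header lines, which begin with `>`. `count_contigs` is a hypothetical helper name.

```shell
# Count the sequences in a FASTA file by counting its header lines.
# `count_contigs` is a hypothetical helper name used only for this example.
count_contigs() {
    grep -c '^>' "$1"
}
```

For example, `count_contigs local/path/to/results_folder/assembly.fasta` prints the number of contigs in the copied assembly.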
If you would like more information, check out the Flye GitHub: https://github.com/fenderglass/Flye/blob/flye/docs/USAGE.md#examples
Resources for the above steps that may help you:
Flye GitHub/documentation: https://github.com/fenderglass/Flye/blob/flye/docs/USAGE.md#examples
Install gcc and g++: https://linuxize.com/post/how-to-install-gcc-on-ubuntu-20-04/ (commands: sudo apt install gcc, sudo apt install g++, sudo apt install make)
Download miniconda (Anaconda Python but without unnecessary packages since this instance will only need very specific requirements). Follow these instructions: https://towardsdatascience.com/managing-project-specific-environments-with-conda-b8b50aa8be0e
Number of cores = number of threads https://forums.aws.amazon.com/thread.jspa?threadID=25011