Assembling with MaSuRCA on EC2 - Green-Biome-Institute/AWS GitHub Wiki
This page explains how to run MaSuRCA on AWS to do a hybrid assembly of a genome using both short-read Illumina data and long-read PacBio or ONT data.
MaSuRCA on Ubuntu
The MaSuRCA assembler combines the benefits of de Bruijn graph and Overlap-Layout-Consensus assembly approaches (MaSuRCA GitHub).
If you are using an instance that is already set up to run MaSuRCA, start at step . The current custom EC2 MaSuRCA AMI for GBI has the ID ________ and name ________. To create an instance from this AMI, follow the instructions on the EC2 page.
If starting from a brand new instance with no previously installed software, follow all of these steps (further help can be found on the EC2 page):
- Start Ubuntu Instance with a 64-bit (ARM) processor
- Log in through terminal:
- $ ssh -i /path/to/keypairs/keypair.pem ubuntu@instance-public-dns-name
- example:
- $ ssh -i /Users/flintmitchell/AWS_keypairs/flints-keypair-1.pem ubuntu@instance-public-dns-name
- Set up the basics and dependencies:
- Update and upgrade the apt packages, then install build-essential (for building and installing new software), clang, and the remaining libraries:
- $ sudo apt update && sudo apt-get upgrade
- $ sudo apt install build-essential
- $ sudo apt install clang libboost-all-dev libopenmpi-dev
- $ sudo apt install libbz2-1.0 libbz2-dev libbz2-ocaml libbz2-ocaml-dev
- $ sudo apt install python2.7
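Before moving on, it can help to confirm that the build tools actually landed on the PATH. The following is a minimal sketch, not part of MaSuRCA: `missing_tools` is a hypothetical helper name, and the list of command names (gcc, g++, make, clang, python2.7) is an assumption about what the packages above provide.

```shell
# Print each of the given command names that cannot be found on PATH.
# missing_tools is a hypothetical helper, not part of MaSuRCA.
missing_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
        fi
    done
}

# Commands assumed to come from build-essential, clang and python2.7.
missing="$(missing_tools gcc g++ make clang python2.7)"
if [ -n "$missing" ]; then
    # $missing is left unquoted so the names print on one line.
    echo "Still missing:" $missing
else
    echo "All build tools found."
fi
```

If anything is reported as missing, re-run the matching apt install command before building MaSuRCA.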
- Install MaSuRCA https://github.com/alekseyzimin/masurca/releases
On your personal or lab computer, download the most recent distribution from the link above and then copy it to the EC2 instance:
- $ scp -i /path/to/keypairs/keypair.pem local/path/to/MaSuRCA-release.tar.gz ubuntu@instance-public-dns-name:~
Then back on the EC2 instance, unzip and install it using the following commands:
- $ tar -zxvf MaSuRCA-4.0.3.tar.gz
- $ cd MaSuRCA-4.0.3/
- $ BOOST_ROOT=install ./install.sh
- Make a folder to organize your data and the results that will come from the assembly:
- $ mkdir data_folder_name
- example:
- $ mkdir my_genome_assembly
- Copy the data to this new folder
- From your local computer using scp:
- $ scp -i /path/to/keypairs/keypair.pem local/path/to/data/filename.fastq ubuntu@instance-public-dns-name:~/data_folder_name
- example:
- $ scp -i /Users/flintmitchell/AWS_keypairs/flints-keypair-1.pem local_path_to_data_files/sequencing_files.fastq ubuntu@instance-public-dns-name:~/my_genome_assembly
- Assembling with MaSuRCA
- MaSuRCA is run by creating a config file and then using the command masurca [config-file]. After running this command, masurca builds a bash script called assemble.sh, which you run from the directory where you want the assembly to take place. In the following examples I use an older version of MaSuRCA (v3.4.2) because I was having issues with the newest release. In theory the newer release does not require a config file for simple assemblies, but it is probably best practice to use one anyway, because writing it gives you a better understanding of which parameters can be changed.
An example of a masurca config file can be found here: https://github.com/Green-Biome-Institute/AWS/blob/master/masurca_config_ex
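For orientation, here is a rough sketch of what a hybrid-assembly config might look like, written out with a shell heredoc. It is not the GBI config linked above and it is not complete. The section and parameter names are as I recall them from the sample config that ships with MaSuRCA, so check them against the linked example. All paths, the insert-size numbers, the thread count, the genome size, and the "genome size times 20" rule for JF_SIZE are assumptions for illustration.

```shell
# Hypothetical inputs: adjust the paths, insert-size statistics and thread count to your data.
assembly_dir="${ASSEMBLY_DIR:-$PWD}"
config_file="$assembly_dir/example-config.txt"
genome_size=135000000            # assumed genome size in bp (roughly A. thaliana)
jf_size=$((genome_size * 20))    # assumed rule of thumb for the Jellyfish hash size

# Write the config: a DATA section listing the reads, a PARAMETERS section for the run.
cat > "$config_file" <<EOF
DATA
# PE= <two-letter prefix> <mean fragment length> <std dev> <forward reads> <reverse reads>
PE= pe 500 50 /full/path/to/short_reads_1.fastq /full/path/to/short_reads_2.fastq
# Long reads: use PACBIO= or NANOPORE=, whichever matches your data
PACBIO= /full/path/to/long_reads.fastq
END

PARAMETERS
GRAPH_KMER_SIZE = auto
USE_LINKING_MATES = 0
NUM_THREADS = 16
JF_SIZE = $jf_size
END
EOF

echo "Wrote $config_file"
```

The resulting file would then be passed to masurca in place of athaliana-config.txt in the example that follows.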
Running MaSuRCA with a config file example:
# Navigate to your assembly directory
- $ cd
- $ cd athaliana-assembly
# List what's in the directory athaliana-assembly
- athaliana-assembly$ ls
athaliana-assembly
# run the masurca command
- athaliana-assembly$ ../MaSuRCA-3.4.2/bin/masurca athaliana-config.txt
# check to see if the assemble.sh bash script was created
- athaliana-assembly$ ls
athaliana-assembly assemble.sh
# run the MaSuRCA assembler using the assemble.sh script
- athaliana-assembly$ ./assemble.sh
Example command without a config file:
- $ MaSuRCA-4.0.4/bin/masurca -t 16 -i athal-data/short-read/SRR1946554_1.fastq.gz,athal-data/short-read/SRR1946554_2.fastq.gz -r athal-data/long-read/SRR11968809.fastq.gz
# Because you aren't using a configuration file, which would carry the details of the run,
# you must pass that information with flags:
-t = number of threads
-i = paths to the paired-end Illumina read files, separated by a comma
-r = path to the long-read (PacBio or ONT) file
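Because a mistyped path only shows up once the assembler starts, one option is to check the input files before launching. The sketch below is an illustration only: `build_masurca_cmd` is a hypothetical helper that checks that the three read files exist and then prints (without running) a command using the -t, -i and -r flags described above.

```shell
# build_masurca_cmd is a hypothetical helper, not part of MaSuRCA.
# Arguments: masurca binary, threads, forward reads, reverse reads, long reads.
# It prints the command to run, or an error if an input file is missing.
build_masurca_cmd() {
    masurca_bin="$1"
    threads="$2"
    r1="$3"
    r2="$4"
    long_reads="$5"
    for f in "$r1" "$r2" "$long_reads"; do
        if [ ! -f "$f" ]; then
            echo "missing input file: $f" >&2
            return 1
        fi
    done
    # The two Illumina files are joined with a comma after -i.
    echo "$masurca_bin -t $threads -i $r1,$r2 -r $long_reads"
}
```

For example, calling build_masurca_cmd MaSuRCA-4.0.4/bin/masurca 16 followed by the three read files from the command above would print that same command if all three files exist.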
- Downloading your results. The results and all other files produced by MaSuRCA (logs, documentation, etc.) are written to the directory your command line is in when you run the assembly. So if you are in the directory /home/ubuntu/a-thaliana/, the output will be stored there as well. We can once again use the scp command from step 6 (with a slight change) to copy the results to our local storage. We will also use the flag -r, which copies all the files in a given folder recursively. Two flags can be sent together, so -r and -i become -ri (note: not -ir; order matters, because -i must be followed directly by the keypair path). The general form is scp -ri keypair results-on-ec2-instance local-destination:
- $ scp -ri /path/to/keypairs/keypair.pem ubuntu@instance-public-dns-name:~/data_folder_name/results_folder_name local/path/to/results_folder
- Example:
- $ scp -ri /Users/flintmitchell/Desktop/GBI/AWS_keypairs/flints-keypair-1.pem ubuntu@instance-public-dns-name:/home/ubuntu/a-thaliana/a-thaliana/ /Users/flintmitchell/Desktop/GBI/Results
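If the combined flag order is easy to forget, the two flags can also be written separately (-r -i keypair), which scp treats the same way. The sketch below only prints the resulting command so it can be inspected before running. `print_download_cmd` is a hypothetical helper, and the host name and paths are placeholders rather than real values.

```shell
# print_download_cmd is a hypothetical helper; it prints the scp command instead of running it.
# Arguments: keypair path, user@host, remote results directory, local destination.
print_download_cmd() {
    keypair="$1"
    remote_host="$2"
    remote_dir="$3"
    local_dir="$4"
    # -r and -i are written separately, so their relative order cannot go wrong.
    echo "scp -r -i $keypair $remote_host:$remote_dir $local_dir"
}

# Placeholder values; substitute your own keypair, instance address and folders.
print_download_cmd /path/to/keypairs/keypair.pem ubuntu@instance-public-dns-name /home/ubuntu/a-thaliana /local/path/to/results_folder
```

Once the printed command looks right, it can be copied and run from your local computer.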
Resources