Assembling with MaSuRCA on EC2 - Green-Biome-Institute/AWS GitHub Wiki


This page will help you if you would like to run MaSuRCA to do a hybrid assembly of a genome using both short-read Illumina data and long-read PacBio or ONT data in AWS.

MaSuRCA on Ubuntu

The MaSuRCA assembler combines the benefits of the de Bruijn graph and Overlap-Layout-Consensus assembly approaches (MaSuRCA github).

If you are using an instance that is already set up to run MaSuRCA, start at step . (//) The current custom EC2 MaSuRCA AMI for GBI has the ID ________ and name ________. To create an instance from this AMI, follow the instructions on the EC2 page.

If you are starting from a brand new instance with no previously installed software, follow all of these steps (further help can be found on the EC2 page):

Start Ubuntu Instance with a 64-bit (ARM) processor
Log in through terminal:

  - $ ssh -i /path/to/keypairs/keypair.pem ubuntu@<instance-public-dns>

example:

  - $ ssh -i /Users/flintmitchell/AWS_keypairs/flints-keypair-1.pem [email protected]
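
To avoid retyping the key path and address on every login, the connection details can be stored in an ssh config entry. The sketch below only prints such an entry; the alias `masurca-ec2` and the function name are hypothetical, and `ubuntu` is assumed to be the login user because that is the default on Ubuntu AMIs. Note that ssh refuses a private key that other users can read, so `chmod 400` on the key file is usually needed first.

```shell
# Print an ssh config entry for an EC2 instance.
# Arguments: alias (hypothetical name of your choice), public DNS name, key path.
# Assumption: the login user is "ubuntu", the default on Ubuntu AMIs.
ssh_config_entry() {
    printf 'Host %s\n' "$1"
    printf '    HostName %s\n' "$2"
    printf '    User ubuntu\n'
    printf '    IdentityFile %s\n' "$3"
}

# Usage (run on your own computer, not on the instance):
#   chmod 400 /path/to/keypairs/keypair.pem
#   ssh_config_entry masurca-ec2 your-instance-public-dns /path/to/keypairs/keypair.pem >> ~/.ssh/config
#   ssh masurca-ec2
```

After the entry is appended, `scp` accepts the same alias in place of the full address.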

Set up the basics and dependencies:

Update and upgrade the installed packages, then install build-essential (the compilers and tools needed to build software), clang, and the libraries MaSuRCA depends on:

  - $ sudo apt update && sudo apt upgrade
  - $ sudo apt install build-essential
  - $ sudo apt install clang libboost-all-dev libopenmpi-dev
  - $ sudo apt install libbz2-1.0 libbz2-dev libbz2-ocaml libbz2-ocaml-dev
  - $ sudo apt install python2.7
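
Before moving on to the install, it can help to confirm that the packages above actually put their tools on the PATH. This is a minimal sketch; the function name `missing_tools` is hypothetical, and the list of commands in the usage line is an assumption about what the packages provide (gcc, g++ and make from build-essential, plus clang and python2.7).

```shell
# Print every command from the argument list that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Usage: no output means every listed tool was found.
#   missing_tools gcc g++ make clang python2.7
```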

Install MaSuRCA https://github.com/alekseyzimin/masurca/releases

On your personal or lab computer, download the most recent distribution from the link above and then copy it to the home directory of the EC2 instance:
    - $ scp -i /path/to/keypairs/keypair.pem local/path/to/MaSuRCA-release.tar.gz ubuntu@<instance-public-dns>:~/
Then, back on the EC2 instance, unpack and install it using the following commands (replace 4.0.3 with the version you downloaded):
    - $ tar -zxvf MaSuRCA-4.0.3.tar.gz 
    - $ cd MaSuRCA-4.0.3/
    - $ BOOST_ROOT=install ./install.sh
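
The build can take a while and prints a lot of output, so a quick check that the `masurca` executable exists afterwards is useful. A minimal sketch, assuming the executable ends up in the `bin` folder of the unpacked directory (the location used by the commands later on this page); the function name is hypothetical.

```shell
# Succeed if <install-dir>/bin/masurca exists and is executable.
masurca_installed() {
    [ -x "$1/bin/masurca" ]
}

# Usage, assuming the archive was unpacked in the home directory:
#   masurca_installed ~/MaSuRCA-4.0.3 && echo "MaSuRCA is ready" || echo "bin/masurca not found"
```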

Make a folder to organize your data and the results that will come from the assembly: mkdir data_folder_name

example:

mkdir my_genome_assembly

Copy the data to this new folder

From your local computer using scp:

    - $ scp -i /path/to/keypairs/keypair.pem local/path/to/data/filename.fastq ubuntu@<instance-public-dns>:~/data_folder_name

example:

    - $ scp -i /Users/flintmitchell/AWS_keypairs/flints-keypair-1.pem local_path_to_data_files/sequencing_files.fastq [email protected]:~/my_genome_assembly
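
Sequencing files are large, so it is worth checking that they arrived intact. One way, sketched below, is to print a checksum and size for each file on both machines and compare the two lists. The function name `file_fingerprints` is hypothetical; it relies only on the standard `cksum` utility, which is available on both Linux and macOS.

```shell
# Print "<file name> <CRC> <size in bytes>" for each file given.
file_fingerprints() {
    for f in "$@"; do
        # Reading from stdin makes cksum print only "<CRC> <byte count>",
        # so the result does not depend on the directory the file is in.
        printf '%s %s\n' "$(basename "$f")" "$(cksum < "$f")"
    done
}

# Usage: run once on each machine and compare the output line by line.
#   file_fingerprints local/path/to/data/*.fastq     # on your computer
#   file_fingerprints ~/data_folder_name/*.fastq     # on the EC2 instance
```

If a line differs between the two lists, copy that file again before starting the assembly.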

Assembling with MaSuRCA

MaSuRCA is run by creating a config file and then running the command masurca [config-file]. This command builds a bash script called assemble.sh in the current directory, so run it from the directory where you want the assembly to take place, then run the script to start the assembly. The following examples use an older version of MaSuRCA (v3.4.2) because I was having issues with the newest release. In theory the newer release does not require a config file for simple assemblies, but using one is probably best practice because it gives the user a better understanding of which parameters can be changed.

An example of a masurca config file can be found here: https://github.com/Green-Biome-Institute/AWS/blob/master/masurca_config_ex
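
For orientation, here is a rough sketch of what a small hybrid-assembly config contains, written as a shell function that generates the file. It follows the layout of the sample config shipped with MaSuRCA (a DATA block and a PARAMETERS block, each closed by END) as far as I recall it, and it leaves out most optional parameters, so treat the linked example as the reference. The function name is hypothetical, and the insert size of 500 with standard deviation 50 is a placeholder assumption, not a recommendation.

```shell
# Write a minimal MaSuRCA config file (a sketch, not the full sample config).
# Arguments: output file, forward reads, reverse reads, long reads, threads, JF_SIZE.
# Assumptions: "pe" is an arbitrary two-letter library prefix; 500 and 50 are
# placeholder values for the mean insert size and its standard deviation.
# For ONT long reads, the keyword NANOPORE= is used in place of PACBIO=.
write_masurca_config() {
    cat > "$1" <<EOF
DATA
PE= pe 500 50 $2 $3
PACBIO=$4
END

PARAMETERS
GRAPH_KMER_SIZE = auto
NUM_THREADS = $5
JF_SIZE = $6
END
EOF
}

# Usage (the paths are illustrative; the sample config uses full paths):
#   write_masurca_config athaliana-config.txt \
#       /home/ubuntu/athal-data/short-read/SRR1946554_1.fastq.gz \
#       /home/ubuntu/athal-data/short-read/SRR1946554_2.fastq.gz \
#       /home/ubuntu/athal-data/long-read/SRR11968809.fastq.gz 16 2000000000
```

JF_SIZE is the jellyfish hash size; the comments in the sample config describe how to choose it from the estimated genome size.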

Running MaSuRCA with a config file example:

# Navigate to your assembly directory
    - $ cd
    - $ cd athaliana-assembly

# List what's in the directory athaliana-assembly
    - athaliana-assembly$ ls
athaliana-assembly

# run the masurca command
    - athaliana-assembly$ ../MaSuRCA-3.4.2/bin/masurca athaliana-config.txt

# check to see if the assemble.sh script was created
    - athaliana-assembly$ ls
 athaliana-assembly    assemble.sh

# run the MaSuRCA assembler using the assemble.sh script
    - athaliana-assembly$ ./assemble.sh
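
An assembly can run for a long time, and a job started in the foreground stops if the ssh connection drops. One common way around this, sketched below with the standard `nohup` utility, is to start the script in the background and send its output to a log file. The function name `run_detached` and the log file name are hypothetical.

```shell
# Start a script so it keeps running after the ssh session closes.
# Arguments: script to run, log file that receives its output.
run_detached() {
    nohup "$1" > "$2" 2>&1 &
    echo "$!"    # print the process ID of the background job
}

# Usage, from inside the assembly directory:
#   run_detached ./assemble.sh assemble.log
#   tail -f assemble.log      # follow progress; Ctrl-C stops tail, not the assembly
```

Terminal multiplexers such as `screen` or `tmux` are an alternative if you prefer to reattach to a live session.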

Example command without a config file:

    - $ MaSuRCA-4.0.4/bin/masurca -t 16 -i athal-data/short-read/SRR1946554_1.fastq.gz,athal-data/short-read/SRR1946554_2.fastq.gz -r athal-data/long-read/SRR11968809.fastq.gz

# Because you aren't using a configuration file, which would hold the details of the run,
# you must pass that information with flags:
-t = number of threads to use
-i = paths to the paired-end Illumina read files (forward and reverse), separated by a comma with no space
-r = path to the long-read (PacBio or ONT) data file
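
Because the command line is long, a mistyped path is easy to miss. The sketch below assembles the same command from its four inputs and refuses to print it if any read file is missing. The function name `masurca_command` is hypothetical, and it prints `masurca` without a directory, so prefix the path to your install (for example MaSuRCA-4.0.4/bin/) when running the result.

```shell
# Print the masurca command line for a run without a config file.
# Arguments: threads, forward Illumina reads, reverse Illumina reads, long reads.
# Returns non-zero and prints no command if any input file is missing.
masurca_command() {
    for f in "$2" "$3" "$4"; do
        [ -f "$f" ] || { echo "missing input file: $f" >&2; return 1; }
    done
    printf 'masurca -t %s -i %s,%s -r %s\n' "$1" "$2" "$3" "$4"
}

# Usage:
#   masurca_command 16 athal-data/short-read/SRR1946554_1.fastq.gz \
#       athal-data/short-read/SRR1946554_2.fastq.gz \
#       athal-data/long-read/SRR11968809.fastq.gz
```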

Downloading your results. The results and all other files produced by MaSuRCA (logs, documentation, etc.) are written to the directory you are in when you run it. So if you are in the directory /home/ubuntu/a-thaliana/, the output will be stored there as well. We can once again use the scp command from the data-copying step above, with the source and destination swapped, to copy the results to local storage. We also add the flag -r, which copies all the files in a folder recursively. The two flags can be combined, so -r and -i become -ri (not -ir: order matters, because -i must be followed directly by the keypair path). The general form is scp -ri keypair results-on-ec2-instance local-folder:

scp -ri /path/to/keypairs/keypair.pem ubuntu@<instance-public-dns>:~/data_folder_name/results_folder_name local/path/to/results_folder

Example:

scp -ri /Users/flintmitchell/Desktop/GBI/AWS_keypairs/flints-keypair-1.pem [email protected] 2.compute.amazonaws.com:~/a-thaliana/a-thaliana/ /Users/flintmitchell/Desktop/GBI/Results
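
If the combined flag is confusing, the two flags can also be written separately, which removes the ordering problem. The sketch below only prints the download command so it can be checked before running; the function name `download_command` is hypothetical.

```shell
# Print an scp command that downloads a results folder from the instance.
# Arguments: key file, instance address (user@host), remote folder, local destination.
# Writing -r and -i separately keeps the key path directly after -i.
download_command() {
    printf 'scp -r -i %s %s:%s %s\n' "$1" "$2" "$3" "$4"
}

# Usage (run on your own computer; fill in the address of your instance):
#   download_command /path/to/keypairs/keypair.pem ubuntu@your-instance-public-dns \
#       '~/data_folder_name/results_folder_name' local/path/to/results_folder
```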

Resources

https://github.com/alekseyzimin/masurca
