Assembling with ABySS on EC2 - Green-Biome-Institute/AWS GitHub Wiki
This page will help you if you would like to run ABySS to assemble a genome using short-read Illumina data in AWS.
ABySS on Ubuntu
De Bruijn Graph large genome assembly using short reads
If you are using an instance that is already set up to run ABySS, start at step 7. As of 06/12/2021, the current custom EC2 ABySS AMI for GBI has the ID ami-06ea06aaeef961320 and the name GBI-ABySS. To create an instance from this AMI, follow the instructions on the EC2 page.
If you are starting from a brand new instance with no previously installed software, follow all of these steps (further help can be found on the EC2 page):
- Start Ubuntu Instance with a 64-bit (ARM) processor
- Log in through terminal:
$ ssh -i /path/to/keypairs/keypair.pem [email protected]
- example:
ssh -i /Users/flintmitchell/AWS_keypairs/flints-keypair-1.pem [email protected]
- Set up the basics and dependencies:
- Update/upgrade apt-get, then install build-essential (needed for building and installing new software) along with clang, Boost, and OpenMPI:
- $ sudo apt-get update && sudo apt-get upgrade
- $ sudo apt-get install build-essential
- $ sudo apt-get install clang libboost-all-dev libopenmpi-dev
- Download and set up Google SparseHash Library
- $ git clone https://github.com/sparsehash/sparsehash.git
- $ cd sparsehash/
- /sparsehash$ ./configure
- /sparsehash$ make
- /sparsehash$ make check
- /sparsehash$ sudo make install
- /sparsehash$ sudo make install check
- /sparsehash$ cd
# add Sparsehash to the PATH
- $ cd ../../usr/local/include
- /usr/local/include$ export PATH=$PATH:$(pwd)
- Install ABySS
- $ git clone https://github.com/bcgsc/abyss.git
- $ cd abyss/
- /abyss$ ./autogen.sh
- /abyss$ ./configure
- /abyss$ make
- /abyss$ sudo make install
- /abyss$ ./configure --with-boost=/usr/local/include
- /abyss$ cd bin
- /abyss/bin$ export PATH=$PATH:$(pwd)
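The export PATH lines above only affect the current shell session, so after logging out and back in it is worth confirming that the ABySS programs can still be found. The following is a minimal sketch of such a check. The function name check_tool is made up for this example, and the list of tools is an assumption you may want to adjust.

```shell
#!/bin/sh
# Report whether a program can be found on the current PATH.
# check_tool is a hypothetical helper name, not part of ABySS.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "MISSING: $1"
        return 1
    fi
}

# abyss-pe is the main program; abyss-fac reports assembly statistics.
for tool in abyss-pe abyss-fac; do
    check_tool "$tool" || true
done
```

If a tool is reported as MISSING in a new session, one option is to add the export line to ~/.bashrc so that it is applied at every login.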
- Make a folder to organize your data and the results that will come from the assembly:
mkdir data_folder_name
- example:
mkdir my_genome_assembly
- Copy the data to this new folder
- From your local computer using scp:
scp -i /path/to/keypairs/keypair.pem local/path/to/data/filename.fastq [email protected]:~/data_folder_name
- example:
scp -i /Users/flintmitchell/AWS_keypairs/flints-keypair-1.pem local_path_to_data_files/sequencing_files.fastq [email protected]:~/my_genome_assembly
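After copying, it can be worth confirming that both paired-end files arrived intact. A FASTQ record conventionally takes four lines, so the two files of a pair should have the same line count, and that count should be divisible by four. Here is a minimal sketch under that assumption. It would not hold for FASTQ files with wrapped sequence lines, and count_reads is a name invented for this example.

```shell
#!/bin/sh
# count_reads: print the number of reads in an uncompressed FASTQ file,
# assuming four lines per record. Hypothetical helper, not an ABySS tool.
count_reads() {
    lines=$(wc -l < "$1")
    if [ $((lines % 4)) -ne 0 ]; then
        echo "warning: $1 has $lines lines, not a multiple of 4" >&2
    fi
    echo $((lines / 4))
}

# Example file names (the ones used in the assembly example on this page);
# the comparison is skipped if the files are not in the current directory.
r1=SRR1946554_1.fastq
r2=SRR1946554_2.fastq
if [ -f "$r1" ] && [ -f "$r2" ]; then
    n1=$(count_reads "$r1")
    n2=$(count_reads "$r2")
    if [ "$n1" -eq "$n2" ]; then
        echo "OK: both files contain $n1 reads"
    else
        echo "MISMATCH: $r1 has $n1 reads, $r2 has $n2 reads"
    fi
fi
```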
- Using ABySS!
- ABySS is very well documented. For a full explanation of all the customizable flags you can use, check out the documentation. With that said, I will show a basic command and explain a couple of the flags I've used.
- A basic command for ABySS looks like:
abyss-pe name=[assembly-name] j=[num-threads] v=-v k=[k-mer-value] in="filename1.filetype filename2.filetype" | tee filename.log
- First is the command itself. If you type abyss- and then press tab, you will see a variety of options. These mostly seem to be individual modules of the ABySS program, some of which can be used on their own for tasks like calculating statistics from the assembly. The main program, however, as shown above, is abyss-pe.
- name=[assembly-name] sets the output name of the assembly. You might put some information about the assembly itself here to create a descriptive output filename, like 'apallida-k53' to signify the plant being assembled and the k-mer value. A longer filename can mean a bit more typing, but the more descriptive you make it, the easier it is when looking back on your data months later!
- j=[num-threads] sets the number of threads from the EC2 instance that you want to allocate to the assembly. Some of the processes are multi-threaded, so set this to the maximum number of threads you have available to speed up the assembly.
- v=-v makes the assembly give verbose output. This can be helpful afterwards for understanding what went wrong (or right!) during the assembly.
- k=[k-mer-value] sets the k-mer value for the assembler.
- in="filename1.filetype filename2.filetype" gives the assembler the location and name of the input sequencing read data. If you are in the folder where the data is located, you don't need to add the path to the data beforehand. If you are outside of the data folder, you need to include the path to the data.
- | tee filename.log creates a file where all the assembly output information is stored. By using the v=-v flag above, we can increase the amount of information saved into this log file.
- Putting this all together, an example assembly using the plant Arabidopsis thaliana, 16 threads, a k-mer value of 53, and input files named SRR1946554_1.fastq and SRR1946554_2.fastq might look like:
abyss-pe name=athaliana-k53-FM060821 j=16 v=-v k=53 in="SRR1946554_1.fastq SRR1946554_2.fastq" | tee athaliana-k53-FM060821-stdout.log
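To keep the assembly name, k-mer value, and log file name consistent, the command can be assembled from a few shell variables. This is only a sketch: the variable names are arbitrary, nproc (from GNU coreutils) is used on the assumption that you want every core the instance has, and the script prints the command instead of running it so you can inspect it first.

```shell
#!/bin/sh
# Values taken from the example above; adjust for your own data.
species=athaliana
k=53
tag=FM060821
reads="SRR1946554_1.fastq SRR1946554_2.fastq"

# Use all available cores; fall back to 1 if nproc is unavailable.
threads=$(nproc 2>/dev/null || echo 1)

# Descriptive name reused for both the assembly output and the log file.
name="${species}-k${k}-${tag}"
cmd="abyss-pe name=${name} j=${threads} v=-v k=${k} in=\"${reads}\""

# Print the command for review rather than executing it.
echo "${cmd} | tee ${name}-stdout.log"
```

Once the printed line looks right, paste it into the terminal. Note that in a pipeline tee records only standard output; writing 2>&1 before the | also sends error messages to the log.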
- Downloading your results. The results and all other information produced by ABySS (logs, documentation, etc.) are written to the directory your command line is in when you run the command. So if you are in the directory /home/ubuntu/a-thaliana/, the output will be stored there as well. We can once again use the scp command from step 6 (with a slight change) to copy the results to our local storage. We will also use the -r flag, which copies all the files in a given folder recursively. Two flags can be combined, so -r and -i become -ri (note: not -ir; order matters, because -i must be followed immediately by the path to the key pair). The general form is scp -ri keypair results-on-ec2-instance local-destination:
scp -ri /path/to/keypairs/keypair.pem [email protected]:~/data_folder_name/results_folder_name local/path/to/results_folder
- Example:
scp -ri /Users/flintmitchell/Desktop/GBI/AWS_keypairs/flints-keypair-1.pem [email protected] 2.compute.amazonaws.com:/home/ubuntu/a-thaliana/a-thaliana/ /Users/flintmitchell/Desktop/GBI/Results
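The flag order matters because -i takes an argument: whatever follows it is read as the key file. Writing the two flags separately avoids the pitfall altogether. Here is a minimal sketch that only prints the command. download_cmd and INSTANCE_ADDRESS are placeholders invented for this example, and the ubuntu user name is assumed from the /home/ubuntu home directory mentioned above.

```shell
#!/bin/sh
# download_cmd KEY REMOTE LOCAL: print an scp command that recursively
# copies REMOTE to LOCAL using the key pair KEY. Hypothetical helper.
download_cmd() {
    # -r and -i are kept separate so -i is always followed by the key path.
    printf 'scp -r -i %s %s %s\n' "$1" "$2" "$3"
}

# INSTANCE_ADDRESS stands in for your instance's public address.
download_cmd /path/to/keypairs/keypair.pem \
    "ubuntu@INSTANCE_ADDRESS:~/data_folder_name/results_folder_name" \
    local/path/to/results_folder
```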
Resources
https://www.bcgsc.ca/resources/software/abyss
Parameter Optimization https://groups.google.com/g/abyss-users/c/uAX9MwJMTcc?pli=1