Section 6: Pre-assembly software
Go back to Section 5: Setting up EC2 instance and Uploading Data
Learning Points for Section 6: Pre-assembly software
- You'll be able to take just a subset of a data file
- You'll be able to use Jellyfish for k-mer counting
- You'll be able to use the BLAST tool in the terminal
- You'll be able to use MEGAN
- You'll be able to use FastQC for checking read quality
- You'll be able to trim sequencing data with specific quality parameters
Subsampling our Data
In order to save time during this workshop, we will not use the entire data file. To take just the first 50,000 reads from the original file without unzipping it, we'll combine three commands: `zcat`, `head`, and the redirect operator `>`. Before doing this, it's important to remember the format of a fastq file. Each entry in a fastq file looks like this:
@SEQ_ID
GATCGATCGATCGATGATCGATCGATCGATCGATCGATCATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATC
+
!!''**(())++%%!!''**(())++%%!!''**(()++FCC556%%!!''**(())++%!!''**(()++FCC556%%!!'
These four lines represent:
- A sequence identifier preceded by `@`
- The raw sequence
- A `+` separator line, optionally followed by further notes / a description
- Quality values for the raw sequencing data
Since there are 4 lines for each of the sequence entries, in order to just subsample 50,000 reads, we need to take 50,000 x 4 = 200,000 lines.
This looks like:
zcat SRR1946554_1_100k.fastq.gz | head -n 200000 | gzip > SRR1946554_1_50k.fastq.gz
`zcat` outputs the contents of the compressed file to stdout (the command line). We pipe that output into `head -n 200000`, which prints only the first 200,000 lines of its input (in this case, the output of `zcat`). This is then piped into `gzip` to compress those 200,000 lines. Lastly, we use the redirect operator `>` to save the compressed output into a new file that we name ourselves. Here, we add `_50k` to the filename to make it obvious which file contains the subsampled reads.
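As a quick sanity check (using the same file name as above), we can confirm that the new file really contains 50,000 reads by counting its lines:
$ zcat SRR1946554_1_50k.fastq.gz | wc -l
If the subsampling worked, this prints 200000 (50,000 reads × 4 lines each).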
Jellyfish k-mer counting
“Decomposing a sequence into its k-mers for analysis allows this set of fixed-size chunks to be analysed rather than the sequence, and this can be more efficient. K-mers are very useful in sequence matching (string matching with n-grams has a rich history), and set operations are faster, easier, and there are a lot of readily-available algorithms and techniques to work with them.
A simple example: to check if a sequence S comes from organism A or from organism B, assuming the genomes of A and B are known and sufficiently different, we can check if S contains more k-mers present in A or in B.” (1)
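For a toy illustration of what "decomposing a sequence into its k-mers" means, here is a plain-shell sketch with k = 3 on a 6-base sequence (the sequence and loop bounds are made up for the example):
$ s=GATCGA; for i in 0 1 2 3; do echo ${s:$i:3}; done
GAT
ATC
TCG
CGA
A sequence of length L contains L − k + 1 overlapping k-mers; here, 6 − 3 + 1 = 4.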
Using the software Jellyfish, let’s count the k-mer content of our sequencing data, create a histogram of the result, and create some statistics about it.
$ jellyfish count -m 21 -s 100M -C SRR1946537_50K_1.fastq
$ jellyfish histo -o athaltest.histo mer_counts.jf
$ jellyfish stats mer_counts.jf
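In the `count` command, `-m 21` sets the k-mer length, `-s 100M` sets the initial hash size, and `-C` counts canonical k-mers (a k-mer and its reverse complement are treated as the same k-mer). The histogram written by `jellyfish histo` is plain text, one line per k-mer multiplicity, so we can peek at it directly:
$ head athaltest.histo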
The Jellyfish manual can be found here: https://www.cbcb.umd.edu/software/jellyfish/jellyfish-manual-1.1.pdf
FastQC analysis
Now, using FastQC, let’s get a better perspective on the quality of our sequencing reads. FastQC, written by Simon Andrews of Babraham Bioinformatics, is a very popular tool that provides an overview of basic quality-control metrics for raw next-generation sequencing data (2). The results from this command will be output as .html files that we can `scp` off the instance and view on our personal computers.
First, let’s use the fastqc command and input both of our fastq files:
$ fastqc SRR1946554_1.fastq SRR1946554_2.fastq
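As an optional variation (not required for the workshop), FastQC can collect its reports into a directory of your choice with the `-o`/`--outdir` flag; note that the directory must already exist:
$ mkdir -p fastqc_results
$ fastqc -o fastqc_results SRR1946554_1.fastq SRR1946554_2.fastq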
This will output several files, two of which are .html (one for each of the .fastq files). Let’s copy those from our EC2 instance to our computer and look at them. We can use the wildcard `*` to `scp` all of the .html files at once.
$ scp -i /PATH/to/keypair.pem ubuntu@IP-address:/PATH/to/results/*.html .
Then, we can just click on these files in the directory we copied them to on our personal computers, or we can use the `open` command from within the terminal:
$ open *.html
As you can see, there are a fair number of reads with low quality scores; let’s fix that with trimming.
Trimming the reads
This data appears to contain some low-quality reads. To get rid of those, and of any Illumina adapter sequences still present in the reads, we will use the software Trim Galore to help clean up our sequencing data.
Here, we will trim the files at quality cutoffs of both 21 and 25 to see the difference:
Use:
$ trim_galore --quality 21 --basename SRR1946537_50K_1_phred21 --cores 2 SRR1946537_50K_1.fastq
$ trim_galore --quality 21 --basename SRR1946537_50K_2_phred21 --cores 2 SRR1946537_50K_2.fastq
And:
$ trim_galore --quality 25 --basename SRR1946537_50K_1_phred25 --cores 2 SRR1946537_50K_1.fastq
$ trim_galore --quality 25 --basename SRR1946537_50K_2_phred25 --cores 2 SRR1946537_50K_2.fastq
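As an aside: here we trim each file independently to keep the workshop simple, but for real paired-end data you would typically pass both files together with Trim Galore’s `--paired` flag, so that read pairs stay in sync when reads are discarded. A minimal sketch (with `--paired`, Trim Galore names the outputs with `_val_1`/`_val_2` suffixes instead of `_trimmed`):
$ trim_galore --quality 21 --paired --basename SRR1946537_50K_phred21 --cores 2 SRR1946537_50K_1.fastq SRR1946537_50K_2.fastq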
Re-analyzing the trimmed data with FastQC
Since we’ve just modified our sequencing reads, it’s best to give the data another look and make sure we’ve done what we wanted before moving on to the assembly step.
For the reads trimmed with a quality score of 21
$ fastqc SRR1946537_50K_1_phred21_trimmed.fq SRR1946537_50K_2_phred21_trimmed.fq
And for the reads trimmed with a quality score of 25
$ fastqc SRR1946537_50K_1_phred25_trimmed.fq SRR1946537_50K_2_phred25_trimmed.fq
Once again, we need to copy the results to our personal computer to look at them.
$ scp -i /PATH/to/keypair.pem ubuntu@IP-address:/PATH/to/results/*.html .
Then, we can just click on these files in the directory we copied them to on our personal computers, or we can use the `open` command from within the terminal:
$ open *.html
Clearly in this example, the reads have been trimmed and have a higher overall quality score now. This is still just test data, however, so bear in mind that it will not be perfect.
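If you’d like a quick numeric comparison alongside the FastQC reports, you can count how many reads survived each quality cutoff (assuming the trimmed files are uncompressed, as above):
$ echo $(( $(wc -l < SRR1946537_50K_1_phred21_trimmed.fq) / 4 )) reads at Q21
$ echo $(( $(wc -l < SRR1946537_50K_1_phred25_trimmed.fq) / 4 )) reads at Q25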
BLAST
In order to use BLAST, we first need to convert our fastq data file into fasta format. To do this we will use a command from the `seqtk` toolkit. From its GitHub page, we see that the syntax for this command looks like:
seqtk seq -a in.fq.gz > out.fa
So for us, this will be:
seqtk seq -a SRR1946554_1_50k.fastq.gz > SRR1946554_1_50k.fa
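As a quick check that the conversion worked, each read should now be a fasta entry with one `>` header line, so counting headers should give 50,000:
$ grep -c '^>' SRR1946554_1_50k.fa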
Now that we have the reads in fasta format we can use them with BLAST.
The BLAST command syntax for a nucleotide search that we will use is:
blastn -task blastn -num_descriptions 1 -num_alignments 1 -num_threads 2 -query SRR1946554_1_50k.fa -db /home/ubuntu/GBI-Bioinformatics-Softwares/BLAST/db/nt -out SRR1946554_1_50k_blast.out
Here, `-task blastn` selects the classic blastn algorithm rather than the default megablast, `-num_descriptions 1` and `-num_alignments 1` limit the report to the single best database hit per query, and `-num_threads 2` runs the search on two CPU threads.
Since this can take a while, the output file is already provided on S3. We will download it to our EC2 instance to simulate BLAST creating it for us. To download it, use the following command:
aws s3 cp s3://gbi-batch1-data/athal_workshop_data/SRR1946554_1_4k_blast.out .
Now, we can `scp` this BLAST result file to our own computer. To do this, open up a second terminal window on your personal computer and, similar to the SSH command used to connect to the EC2 instance, use the `scp` command to download the results off of it (a more detailed description of this process is here: https://github.com/Green-Biome-Institute/AWS/wiki/Section-6:-Interacting-with-Other-Computers-&-Virtual-Terminal-Sessions). This looks like:
scp -i [path/to/keypair.pem] ubuntu@ec2instance-ip-address:[path/to/blast/resultfile_blast.out] .
MEGAN
Now that you have the blast result file on your computer, open up MEGAN (if you need to download it, you can do so here: https://uni-tuebingen.de/fakultaeten/mathematisch-naturwissenschaftliche-fakultaet/fachbereiche/informatik/lehrstuehle/algorithms-in-bioinformatics/software/megan6/).
Under the `File` dropdown menu, select `Import from BLAST...`, then find and open the BLAST result file you just downloaded from the EC2 instance. Click `Apply` and wait for the program to finish. When it is done, it will output a chart with a taxonomic analysis of the BLAST result.
For further exploration of the MEGAN tool, check out the tutorial here: https://software-ab.informatik.uni-tuebingen.de/download/public/tutorial-aug2021/DIAMOND+MEGAN-Tutorial-August-2021.pdf
Review Questions:
Which commands would you combine to take the first 50,000 reads of a compressed sequencing file and save them into a new fastq file?
- `zcat`, `head`, and `>`.
What do you use Jellyfish for?
- Fast, memory-efficient counting of k-mers in DNA
What is the BLAST command and what do you use BLAST for?
- `blastn`; it is used to compare nucleotide or protein sequences to a sequence database.
What is MEGAN used for?
- MEGAN is used to analyze the taxonomic and functional content of large microbiome datasets.
What is FastQC used for?
- FastQC is used to look at NGS data and get quality control metrics.
What does it mean to "trim" sequencing data?
- Sequence trimming removes low-quality bases and contaminating sequences from your data, for example Illumina adapter sequences at the ends of NGS reads.
Resources:
(1) https://bioinfologics.github.io/post/2018/09/17/k-mer-counting-part-i-introduction/
(2) https://rtsf.natsci.msu.edu/genomics/tech-notes/fastqc-tutorial-and-faq/
Good basic information on BLAST: https://open.oregonstate.education/computationalbiology/chapter/command-line-blast/
Move on to Section 7: Genome assembly with ABySS and SOAPdenovo2