Data extraction and blast based gene annotation - jfgout/AppliedGenomics GitHub Wiki
In this lab, we will have a closer look at the genome assembly that you obtained in the previous lab. To ensure that everyone works on the exact same file, copy the file /mnt/scratch/goutfs/CMB8013/GenomeAssembly/assembly-with-pacbio.fna into your home directory.
Question 1: What is the file extension ".fna" used for? How about the file extension ".faa"?
Hint: do a google search of "fna vs faa".
Identifying the organism that was sequenced.
Our first objective is to identify the mystery organism which was sequenced in the previous lab. While the assembly size might already give you an idea of what type of organism we are dealing with, you will now use the sequence itself to obtain a precise identification.
Use the command "head" to extract a small portion of the assembly and search for organisms with a similar sequence on the NCBI BLAST: https://blast.ncbi.nlm.nih.gov/
Question 2: Which flavor of BLAST will you be using to perform this task?
Question 3: Based on the result of the BLAST query, can you make any inference regarding the organism that was sequenced?
Identifying genes in the assembly.
Local BLAST database building
We will now identify the location of a few genes in the assembly. The first step consists in "transforming" the assembly into a database from which any sequence can be extracted (for example if you want to extract sequence covering the positions 8,000 to 9,000 of scaffold "NODE_1"). The blast queries will be run against this "transformed" (blastable) database.
This is done with the command "makeblastdb" . You can create a custom database from a FASTA file of sequences with this minimal command:
makeblastdb –in mydb.fsa –dbtype nucl –parse_seqids
"makeblastdb" is available with: module load blast
You can read a more detailed documentation at https://www.ncbi.nlm.nih.gov/books/NBK279688/.
After you have created the local database, list the files present in the folder containing the assembly. Question 4: What do you notice?
Extracting a specific region of the genome.
Annotating all the genes present in a genome assembly requires the use of complex pipelines. Here, we will focus on small region or specific genes to illustrate the important principles behind gene annotation. Let's start by extracting the region from 38710 to 40200 of the second scaffold.
Use a combination of "grep" and "head" to find the exact name of the second scaffold from the assembly fasta file (see here: https://www.geeksforgeeks.org/piping-in-unix-or-linux/ for an explanation of chaining commands with a 'pipe'). Then, use the blastdbcmd command to extract the relevant region into a file that you will name "sequence-01.fna". See help on using blastdbcmd here: https://www.ncbi.nlm.nih.gov/books/NBK279684/table/appendices.T.blastdbcmd_application_opti/
Searching for an open reading frame.
Use R to search for ORFs in this sequence and use the corresponding protein sequence to search for similar proteins in the NCBI BLAST database (https://blast.ncbi.nlm.nih.gov/.
Question 5: Based on the BLAST result, what is the function of the gene that you have identified?
Searching for a specific gene.
The file /mnt/scratch/goutfs/CMB8013/GenomeAssembly/yfjD.faa contains the sequence of a protein. Use blast locally to find the location of the gene that codes for this protein.
Question 6: Which flavor of BLAST will you be using?
Question 7: Where is the gene yfjD located in your genome assembly?
Question 8: Here is a link to the RNA polymerase alpha subunit in the E. coli genome: https://bacteria.ensembl.org/Escherichia_coli_str_k_12_substr_mg1655_gca_000005845/Gene/Summary?g=b3295;r=Chromosome:3440040-3441029;t=AAC76320;db=core
Find the location in your assembly of the gene encoding the alpha subunit of the RNA polymerase (this is a highly conserved gene, and every species will have at least one copy of this gene). Note that the different flavors of BLAST are available on the server simply by typing the name of the corresponding blast command. You can obtain the basic help with:
blastn -h
tblastn -h
blastp -h
blastx -h
Provide some details about how you proceeded to identify this gene.