TP07 - MetabolicEngineeringGroupCBMA/MetabolicEngineeringGroupCBMA.github.io GitHub Wiki
This document will guide you through some commonly used databases and how to interpret the GenBank file format for different types of sequences.
The easiest way to find GenBank on the net is to search for “ncbi” (short for National Center for Biotechnology Information). The NCBI is the organization that provides GenBank. Its main page is located at http://www.ncbi.nlm.nih.gov. Sequences available from GenBank are organized by accession numbers, which uniquely identify sequences in the database.
Search for the accession number X01714, which represents one of the sequences in the database. You can do this by entering the accession number in the search window.
The results page (below) shows hits from the available databases. As you can see, there is one hit each for PubMed (abstracts of scientific publications), PubMed Central (full-text publications) and Nucleotide (sequence files).
Click on the “Nucleotide” link to go to the sequence associated with X01714. You should see a screen similar to Fig 3.
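The same record can also be requested without the web form. The sketch below only builds the request URL; it assumes the NCBI E-utilities `efetch` service, which is not part of this exercise, and the function name `genbank_url` is made up for illustration.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities "efetch" service.
# Using this service is an assumption of this sketch; the exercise itself
# only uses the search window on the NCBI main page.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def genbank_url(accession: str) -> str:
    """Return a URL asking for one nucleotide record as GenBank flat-file text."""
    query = {
        "db": "nuccore",      # the Nucleotide database
        "id": accession,      # for example "X01714"
        "rettype": "gb",      # GenBank flat-file format
        "retmode": "text",    # plain text instead of XML
    }
    return f"{EFETCH}?{urlencode(query)}"


url = genbank_url("X01714")
print(url)
```

Opening the printed URL in a browser, or passing it to `urllib.request.urlopen`, should return the flat file as plain text, provided a network connection is available.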
A GenBank file has three parts, Header, Feature table and Sequence:
The header has at least the items in the list below in the order indicated.
Start of row | Description |
---|---|
LOCUS | A short name for the file. This line also contains the length, type (DNA or RNA), and a date. |
DEFINITION | A short description of the sequence. |
ACCESSION | The accession number is a unique, unchanging code assigned to each file. |
VERSION | The accession number and a version number. |
SOURCE | Common name of the organism. |
ORGANISM | Scientific name of the organism. |
COMMENT | Any comment that does not fit anywhere else. |
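As a rough illustration of how the LOCUS line packs several items into one row, the sketch below splits such a line on whitespace. The helper name `parse_locus` is hypothetical, and the input is the LOCUS line of the small TP7example1 record shown further down in this document. Records downloaded from NCBI may carry extra fields, so the date is simply taken as the last field.

```python
def parse_locus(line: str) -> dict:
    """Split a LOCUS line into its whitespace-separated fields."""
    fields = line.split()
    if not fields or fields[0] != "LOCUS":
        raise ValueError("not a LOCUS line")
    return {
        "name": fields[1],
        "length": int(fields[2]),  # fields[3] is the unit, e.g. "bp"
        "molecule": fields[4],     # DNA or RNA
        "date": fields[-1],        # assumed to be the last field
    }


locus = parse_locus("LOCUS       TP7example1   37 bp    DNA   linear   06-MAY-2020")
print(locus["name"], locus["length"], locus["molecule"], locus["date"])
```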
The feature table contains a Key, a Location and Qualifiers for sections of the sequence. The Key describes the class of sequence feature. This can, for example, be CDS for a coding sequence, which is translated into a protein. The Location is most often a range Begin..End. Finally, Qualifiers provide more information about the feature.
Key Location/Qualifiers
gene 687..3158
/gene="AXL2"
CDS 687..3158
/gene="AXL2"
/note="plasma membrane glycoprotein"
/codon_start=1
/function="required for axial budding pattern of S. cerevisiae"
/product="Axl2p"
/protein_id="AAA98666.1"
/translation="MTQLQISLLLTATISLLHLVVATPYEAYPIGKQYPPVARVNESF"
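The Location and Qualifier syntax in the example above can be taken apart with a few string operations. This is a minimal sketch that handles only the two simple location forms used in this document, `Begin..End` and `complement(Begin..End)`; compound locations such as `join(...)` are not covered. Both function names are made up for illustration.

```python
def parse_location(location: str):
    """Return (begin, end, on_complement) for 'B..E' or 'complement(B..E)'."""
    on_complement = location.startswith("complement(") and location.endswith(")")
    if on_complement:
        # Strip the wrapper to leave just 'B..E'.
        location = location[len("complement("):-1]
    begin, end = location.split("..")
    return int(begin), int(end), on_complement


def parse_qualifier(line: str):
    """Return (name, value) for a qualifier line such as /gene="AXL2"."""
    name, _, value = line.strip().lstrip("/").partition("=")
    return name, value.strip('"')


begin, end, on_complement = parse_location("687..3158")
feature_length = end - begin + 1  # positions are 1-based and inclusive
print(begin, end, on_complement, feature_length)
print(parse_qualifier('/gene="AXL2"'))
```

Because both ends of a range are included, the length of a feature is End - Begin + 1, not End - Begin.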
The sequence of a GenBank file starts with the keyword "ORIGIN" and ends with "//" which is also the end of the file.
ORIGIN
1 cagagaaaat caaaaagcag gccacgcagg gtgatgaatt aacaataaaa atggttaaaa
...
1561 ggccttaacc gccgccagat gttccgccat ttccggcttc tcttccagg
//
Important
A complete GenBank file always ends with “//” to signal the end of the file.
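The ORIGIN block mixes position numbers, spaces and sequence letters. The sketch below shows one way to recover the raw sequence, assuming the block has the layout described above; the function name `origin_to_sequence` is hypothetical, and the input is the ORIGIN block of the TP7example1 record used later in this document.

```python
def origin_to_sequence(record_text: str) -> str:
    """Return the raw sequence found between 'ORIGIN' and '//'."""
    pieces = []
    in_origin = False
    for line in record_text.splitlines():
        if line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            break  # end of the record
        elif in_origin:
            # Keep letters only: this drops position numbers and spaces.
            pieces.append("".join(ch for ch in line if ch.isalpha()))
    return "".join(pieces)


example = """ORIGIN
        1 GCCCTAACTG ACAAACTGAT CGACCACAAG CCAAGCC
//"""

raw = origin_to_sequence(example)
print(raw, len(raw))
```

The length of the recovered sequence should agree with the length given on the LOCUS line, which is a quick way to check that nothing was lost.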
Find out the following information by studying the GenBank file for X01714:
Question 1: What is the size of the sequence for this file? (look at the first line of the file)
Question 2: From which organism is it? (look at DEFINITION)
Question 3: What kind of protein does it encode? (look at DEFINITION)
Question 4: How many proteins are encoded by the file? (Count how many “CDS” keys there are in the feature table.)
Question 5: What are the first and last positions of the promoter element called the “-10 region”?
Question 6: What are the first and last positions of the promoter element called the “-35 region”?
Question 7: What is the position in the DNA sequence of the translation start of the FIRST encoded protein?
ApE is a very useful tool for working with features. Features are displayed automatically if ApE recognizes the sequence as a GenBank file. You can visualize, add, remove and hide features using ApE.
In the example below, a local file containing the X01714 sequence from Question 1 was opened in ApE using the file dialog. You can also drag and drop the file into the main window of ApE.
The features are shown as colored highlighting on the sequence and an arrow indicating direction.
We can copy the sequence for a feature by right-clicking to bring up the context menu. Choose Copy as shown below:
The pasted sequence will be ttgagc, as expected.
Warning
ApE always copies the feature sequence as it is visible on the screen.
ApE does not care if the feature is on the upper Watson strand or on the lower Crick strand. For example, consider the sequence below:
LOCUS TP7example1 37 bp DNA linear 06-MAY-2020
DEFINITION .
SOURCE .
ORGANISM .
FEATURES Location/Qualifiers
misc_feature 4..16
/locus_tag="myfeature1"
/label="myfeature1"
misc_feature complement(21..35)
/locus_tag="myfeature2"
/label="myfeature2"
ORIGIN
1 GCCCTAACTG ACAAACTGAT CGACCACAAG CCAAGCC
//
If we copy the sequence of myfeature2, we get CGACCACAAGCCAAG, which is the reverse complement of the sequence we want (CTTGGCTTGTGGTCG). This means that the user has to remember the direction of the feature.
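The same extraction can be written out in a few lines, which makes the strand handling explicit. This is only a sketch for plain A, C, G and T sequences; the names `reverse_complement` and `extract` are chosen for illustration and are not ApE functions.

```python
# Translation table for complementing plain DNA letters (no ambiguity codes).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the result."""
    return seq.translate(COMPLEMENT)[::-1]


def extract(seq: str, begin: int, end: int, on_complement: bool = False) -> str:
    """Return the feature at 1-based, inclusive positions begin..end."""
    part = seq[begin - 1:end]  # convert to Python's 0-based slicing
    return reverse_complement(part) if on_complement else part


sequence = "GCCCTAACTGACAAACTGATCGACCACAAGCCAAGCC"  # TP7example1

myfeature1 = extract(sequence, 4, 16)
as_copied = extract(sequence, 21, 35)                        # what ApE copies
myfeature2 = extract(sequence, 21, 35, on_complement=True)   # the feature itself

print(myfeature1)
print(as_copied)
print(myfeature2)
```

For a feature whose location is wrapped in `complement(...)`, the text copied from the screen has to be reverse complemented once to obtain the feature in its own 5' to 3' direction.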
Find the GenBank file identified by the accession number “AF018429”. This file contains the human homolog of the gene in the previously studied file. There are some differences in the structure of the file, mainly because the transcript is spliced.
Question 8: How many different exons are present in the file? These are given as keys in the feature table.
Question 9: The exons are combined to form two different proteins that are transported to different subcellular compartments. Which compartments are these?
There is more information and examples regarding the GenBank format from NCBI here.
Find the GenBank file “AF298787”. This is a Saccharomyces cerevisiae/E. coli shuttle vector (able to replicate in both S. cerevisiae and E. coli). The URA3 gene is used for selection in an S. cerevisiae ura3 mutant.
Question 10:
Display the URA3 gene CDS in pUG35. Verify that the size is 804 nt. The partial seguid for this sequence is lsseguid=XFhNZn. What is the complete checksum?
Question 11:
In the same way as before, find the GenBank file AJ001614.1. This file describes a plasmid vector for E. coli called pCAPs.
The DNA sequence is read clockwise from the origin. The sequence contains the resistance gene “bla”, which encodes β-lactamase. Use the functions that you already know to display ONLY the CDS of the bla gene in the correct orientation (ATG...TAA).
The partial seguid for this sequence (size = 861 bp) is lsseguid=riT98j. What is the complete checksum?
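A partial checksum like the ones in Questions 10 and 11 can be used to test whether the right sequence was extracted. The sketch below assumes that an lsseguid is the SHA-1 digest of the sequence, encoded as URL-safe Base64 with the padding removed and prefixed with `lsseguid=`. Converting the sequence to upper case first is a further assumption of this sketch. Both function names are hypothetical, and the result should be compared with the checksum tool used in the course.

```python
import base64
import hashlib


def lsseguid_sketch(seq: str) -> str:
    """Checksum sketch: SHA-1, URL-safe Base64, padding stripped (assumed definition)."""
    # Upper-casing is an assumption made here so that case does not matter.
    digest = hashlib.sha1(seq.upper().encode("ascii")).digest()
    encoded = base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")
    return "lsseguid=" + encoded


def matches_partial(seq: str, partial: str) -> bool:
    """True if the checksum of seq starts with the given partial checksum."""
    return lsseguid_sketch(seq).startswith(partial)


checksum = lsseguid_sketch("CTTGGCTTGTGGTCG")  # myfeature2 from TP7example1
print(checksum)
print(matches_partial("CTTGGCTTGTGGTCG", checksum[:15]))
```

A sequence that was copied in the wrong orientation gives a completely different checksum, so a mismatch with the partial value is a hint to check the strand first.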
Question 12:
This is an individual question for each student. At the top of the Google spreadsheet for TP07, there is a link to a Google doc containing one GenBank file per student.
Search within the document (CTRL-F) to find your file using your name or mec number.
Your sequence contains two features, myfeature1 and myfeature2. Your task is to extract the sequence for each of the two features. Paste the results into the Google spreadsheet for TP07. You should find your name in the leftmost column. Put your answers in the “myfeature1” and “myfeature2” columns. Please answer with a raw DNA sequence as indicated for the first example student “Max Maximus”.
© Björn Johansson 2013 - 2025