Lab 1: Databases - ricket-sjtu/bioinformatics GitHub Wiki
- Knowing where to obtain the data (resources) for Literatures, Sequences, Annotations, etc.
- Understanding the mechanism of the Web API for the above resources.
- Obtaining the batch of data from web resources using web scrawling technologies.
- Python programming for integrating data.
- Protocol: TCP/IP + HTTP/HTTPS/FTP, Request (client -> server), Response (server -> client)
- Web page development: HTML + CSS + Javascript
- Web Database (Browser Query):
- SQL query:
UCSC Genome Browser
- Web scrawler: using the provided API (application programming interface, 应用编程接口)
- ftp (direct download): ftp://ftp.ncbi.nlm.nih.gov
- HTTP REQUEST and RESPONSE:
urllib
,urllib2
,requests
- Web browser emulation (模拟浏览器)
- How to parse the RESPONSE:
re
,xpath
,BeautifulSoup (bs4)
,scrapy
,jsonpath
,pyquery
,csv
,lxml
- Dynamic page loading (
Ajax
):Selenium
,PhantomJS
(无界面浏览器) - CAPTCHA:
Tesseract
- Raw text:
open()
+read()
+split()
+re
+ ... -
HTML:
BeautifulSoup
,lxml
-
XML:
xml
,lxml
-
JSON (Javascript Object Notation):
json
-
YAML (yet another markup lanuage):
PyYAML
- Delimited file:
csv
- ...
- DTD (Document type definition,文档类型定义):可以看做XML文件的模板,定义XMl文件的元素、元素属性、元素排列方式、元素包含的内容
- XSD (XML schema defintion,XML模式定义):XML文档元素的定义与描述方式。
- XML (Extensible markup language, 扩展标记语言):网络数据传输的一种标记语言。
- HTML (Hypertext markup lanuage, 超文本标记语言):一种用以创建网页的标准标记语言。
- CSS (Cascading stylesheet, 级联样式表):为结构化文档添加样式的W3C标准语言。
- RDF (Resource description framework, 资源描述框架): 描述网络资源的W3C标准。
2. NCBI - National Center for Biotechnology Information
- Open NCBI;
- You will find a Drop-down box, where will list all the databases in NCBI.
- What does ClinVar contain?
- What's the role for the database GEO?
- Search for some reviews on sequence alignment algorithms for next generation sequencing data.
- Read the review, and answer the following questions:
- List the categories of the algorithms.
- List 2-3 alignment tools for each category.
- Write down the details for one of the algorithms of your interest.
- Since you have learn Smith-Waterman algorithm for standard local alignment, can you understand why this algorithm cannot be applied to short read alignment?
- Search for the articles on tool Basic Local Search Alignment Tools (BLAST), what's the difference between this tool and the Smith-Waterman algorithm?
- Try to understand the statistical details behind the BLAST.
- Open the database page nucleotide;
- Click the advanced option beneath the search box;
- In Builder section,
- Enter presenilin 1 in the field of Title;
- Enter human in the field of Organism;
- Press Search.
- How many entries are in the results?
- Choose the Customize... beneath the filter Molecule types within the left panel, and select mRNA from the popup checkbox;
- Choose the Customize... beneath the filter Source databases within the left panel, and select the RefSeq from the popup checkbox.
- Now how many entries are left?
- Now take a look at all the results, how many of the results satisfy your initial search?
- Check the resulting entries of your interest, and save them into a FASTA file.
- Have a look at the results in genbank and ASN.1 formats, try your best to interpret the results.
After finishing the search, you can find a textarea occurs under the term search details within the right panel of the result page. It should be like
presenilin 1[title] AND "Homo sapiens"[Organism] AND (biomol_mrna[PROP] AND refseq[filter])
Now can you figure out how to formulate a complex search using either AND, OR, or NOT?
Fore more information, refer to E-Utilities Quick Start.
- Base URL:
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/
- Basic Searching:
esearch.fcgi?db=<database>&term=<query>
- Example: Get the PubMed IDs for articles about breast cancer published in Science 2008
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=science[journal]+AND+breast+cancer+AND+2008[pdat]
- Basic Downloading:
efetch.fcgi?db=<database>&id=<uid_list>&rettype=<retrieval_type>
&retmode=<retrieval_mode>
- Input: List of UIDs (&id); Entrez database (&db); Retrieval type (&rettype); Retrieval mode (&retmode)
- Output: Formatted data records as specified
- Example: Download nuccore GIs 34577062 and 24475906 in FASTA format
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=34577062,24475906&rettype=fasta&retmode=text
For more information, refer to Entrez Direct
Fore more information, read the tutorial Bio.Entrez
.
3. Uniprot - Universal Protein Resource
- What is the gene name of P00470?
- How does the gene name differ from the protein name?
- How many sequences are reported for this entry and what are their lengths?
- Which sequence are all the annotations based upon?
- Where is this protein found in the cell?
- What type of annotation is this from?
- How many X-Ray structures are reported from PDBe?
- How many diseases are annoated in P00470 and what is/are their name(s)?
- What annotations can you find at or overlapping residue 283?
- Are any of the annotations associated to a disease(s)?
- Using the other annotations can you conclude what has happened to the protein to cause the disease?
- Are there any protein structures that cover residue 283 and does it help in your interpretation?
- Is there a way to look for related (orthologs) proteins of P00470?
- Which is the more consistent search?
- Why do you think this is the case?
- An ontology is a glossary of keywords arranged in a structured order or a network based on the biological concepts.
- Challenges:
- Vast amount of biological data (海量生物数据)
- Different ID/names/terms for the same concepts/entities (同一概念,多种名字)
- Cross-species comparison is difficult (难以跨物种比较)
- The knowledge of phenotypes and the associated genes and their roles in cells is accumulating and changing (认知在不断增加和改变)
- Resolution: Ontology
- using a controlled vocabulary (标准化的语汇)
- can be applied to either all organisms or at least within a kindom/sub-class/family
- NOT a system of nomenclature or a list of gene products/phenotypes.
- NOT attempt to cover all aspects of biology or evolutionary relationships.
- NOT a dictated standard.
- NOT a way to unify databases.
- REMEMBER: An ontology allows the users to query the different databases using the same keywords and query strings provided those different databases have implemented the commonly adopted ontologies.
- Molecular Function (MF): transporter, enzyme, ...
- Role in a Biological Process (BP): photosynthesis, immunological response, ...
- Localization in a Cellular Component (CC): Glogi apparatus, ER, ...
- Ontology terms are composed of
- Term name
- Unique ID
- Definition (more than 75% of terms defined)
- Synonym (optional)
- Database reference (optional)
- Relationship to other terms in the same ontology
- 17000+ Gene Ontology terms
- OBO (open biological and biomedical ontology): an ontology representation language.
- OWL (ontology web language):
- RDF (resource definition format):
- RDF-XML format
- describe some stuff, rather than just name it.
- Class (
BlueThing
) does not meaning anything - Class (
BlueThing
) has a complete definition:
- Class (
owl:Thing
restriction:hasColour someValuesFrom (blue)
- OWL is a syntax-independent language that has several common representations
- Abstract Syntax
- N3
- RDF/XML
-
is-a
(i) -
part-of
(P) -
has-part
(hP) -
regulate
(R) -
positively-regulates
(R+) -
negatively-regulates
(R-)
-
http://www.candidagenome.org/cgi-bin/GO/goEvidence.pl
- EXP - Experiment
- IC - Inferred by Curator
- IDA - Inferred from Direct Assay
- IEA - Inferred from Electronic Annotation
- IEP - Inferred from Expression Pattern
- IGC - Inferred from Genomic Context
- IGI - Inferred from Genetic Interaction
- IMP - Inferred from Mutant Phenotype
- IPI - Inferred from Physical Interaction
- ISA - Inferred from Sequence Alignment
- ISM - Inferred from Sequence Model
- ISO - Inferred from Sequence Orthology
- ISS - Inferred from Sequence or Structural Similarity
- NAS - Non-traced Author Statement
- ND - No biological Data available
- RCA - Reviewed Computational Analysis
- TAS - Traceable Author Statement
- Go to http://www.geneontology.org/amigo/
- Search for human Cytochrome c in genes and gene products
- How many GO terms are associated with this protein? How many different evidences?
- Which association, in your opinion, has highest confidence? Why?
Here we will use the Python package - GOATOOLS to query the GO. This package can read the GO structure stored in OBO format, which is available from the GO website. After loading this file, it is convenient to traverse the GO hierarchy, search for particular GO terms, and find out which other terms they are related to and how.
You can install the goatools
through pip
tool.
The GOATOOLS
package contains the function obo_parser.GODag()
to
load the GO file. Each GO term in the resulting object is an instance of
the GOTerm class, which contains many useful attributes, including:
-
GOTerm.name
: textual definition; -
GOTerm.namespace
: the ontology the term belongs to (i.e., MF, BP, CC); -
GOTerm.parents
: list of parent terms; -
GOTerm.children
: list of children terms; -
GOTerm.level
: shortest distance to the root node.
- Download the GO basic file in OBO format (
go-basic.obo
), and load it using the functionobo_parser.GODag()
from the package GOATOOLS. - Answer the following questions:
- What is the name of the GO term
GO:0048527
? - What are the immediate parent(s) of the term
GO:0048527
? - What are the immediate children of the term
GO:0048527
? - Recursively find all the parent and child terms of the term
GO:0048527
. Hint: use your solutions to the previous two questions, with a recursive loop. - How many GO terms have the word "growth" in their name? Hint:
Use
re.match
to find the terms. - What is the lowest common ancestor term of
GO:0048527
andGO:0097178
? Hint: Find the common parents (ancestors) with the deepest level. - Which GO terms regulate
GO:0007124
(pseudohyphal growth)? Hint: load the relationship tags and look for terms which define regulation.
- What is the name of the GO term
The GOATOOLS package also includes functions to visualize the GO graphs. For
instance, it is possible to depict the location of a particular GO term in
the ontology using the method GOTerm.draw_lineage()
.
- Visualize the GO term
GO:0097190
. From the figure, what is the name of this term? - Using this figure, what is the most specific term that is in the parent
terms of both
GO:0097191
(extrinsic apoptotic signaling pathway) andGO:0038034
(signal transductional in absence of ligand)? This is referred to as the lowest common ancestor.
An alternative to GOATOOLS and OBO files is to retrieve information relating to a specific term from a web service. One such service is the EMBL-EBI QuickGO, which can provide descriptive information about GO terms in OBO-XML format via the following URL:
http://www.ebi.ac.uk/QuickGO/GTerm?id=<GO_ID>&format=oboxml
- Write a Python function
get_oboxml(go_id)
to- Use
urllib
package to request the OBO-XML; - Parse the XML result using the
xmltodict
package to convert it into easy-to-usedict
;
- Use
- Use the function to find the name and description of the GO term
GO:0048527
(lateral root development). Hint: print out the dictionary returned by the function or create a visualization using the Python packagevisualisedictionary
to study the structure.
In this exercise, we will learn how to parse a GAF file (GO Annotation File)
downloaded from the UniProt-GOA database using an iterator from the BioPython package (Bio.UniProt.GOA.gafiterator
):
from Bio.UniProt.GOA import gafiterator
import gzip
fname = "gene_association.goa_arabidopsis.gz"
with gzip.open(fname, "rt") as fp:
for annotation in gafiterator(fp):
print(annotation['DB_Object_ID'])
A GAF file is a tab-delimited file containing 17 fields including:
-
DB
: the protein database; -
DB_Object_ID
: protein ID; -
Qualifier
: annotation qualifier (such as NOT); -
GO_ID
: GO term; -
Evidence
: evidence code.
- Find the total number of annotations for Arabidopsis thaliana with NOT qualifiers. What is this as a percentage of the total number of annotations for this species?
- How many genes (of Arabidopsis thaliana) have the annotation
GO:0048527
(lateral root development)? - Generate a list of annotated proteins which have the word "growth" in their name.
- There are 21 evidence codes used in the Gene Ontology project. Many of these are inferred, either by curator or automatically. Find the counts of each evidence code in the Arabidopsis thaliana annotation file.
This exercise will focus upon computing the semantic similarity between GO terms, based on the either graph-based or information-theoretic metrics. The former relies on the structure of the Gene Ontology graph, while the latter also accounts for the information content of the terms.
Write a Python function to compute the corresponding similarity metrics:
- Since the Gene Ontology graph is a directed acyclic graph (有向无环图),
we can use the inverse of the the number of edges separting two terms
$(t_1, t_2)$ . - Another example of an information-theoretic metric is Resnik similarity measure - the information content of the most informative common ancestor of the two terms in question.
- KEGG
- Integrated database of biological systems, genetic building blocks and chemical building blocks
- Four major components of KEGG
The KEGG API is a REST-style Application Programming Interface to the KEGG resources using the URL form:
http://rest.kegg.jp/<operation>/<argument>[/<argument2>[/<argument3> ...]]
info - display database release information and linked db information
list - list the entry identifiers and associated definition
find - find entries with matching query keyword or other query data
get - retrieve given database entries
conv - convert KEGG identifiers to/from outside identifiers
link - find related entries by using database cross-reference
ddi - find adverse drug-drug interactions
pathway - KEGG PATHWAY maps (path:map0XXXX)
brite - KEGG BRITE functional hierarchies (br:0XXXX)
module - KEGG modules (md:M00XXX)
ko - KO functional orthologs (ko:KXXXXX)
genome - KEGG organism code (gn:TXXXXX)
<org> - GENES in KEGG organisms
vg - Genes in viruses category (vg)
ag - Genes in addendum category (ag)
compound - Small molecules (cpd)
glycan - Glycans (gl)
reaction - Biochemical reactions (rn)
rclass - Reaction class (rc)
enzyme - Enzyme nomenclature (ec)
network - Network elements (ne)
variant - Human gene variants (hsa_var)
disease - Human diseases (ds)
drug - Drugs (dr)
dgroup - Drug groups (dg)
environ - Health-related substances (ev)
genes - vg + ag
ligand - compound, glycan, reaction, enzyme
kegg - all
<medicus>- translational bioinformatics resource
<outside>- outside databases
hsa - human
eco - E.coli
...
disease_ja - Disease (ds_ja)
drug_ja - Drug (dr_ja)
dgroup_ja - Drug group (dg_ja)
environ_ja - Environment factor (ev_ja)
compound_ja - Compound (cpd_ja)
brite_ja - BRITE (br)
atc - 7-letter ATC code (atc)
jtc - Therapeutic category code (jtc)
ndc - National Drug Code (ndc)
yj - YJ code (yj)
pubmed - NCBI Pubmed database ID (pmid)
ncbi-geneid - NCbI Gene ID
ncbi-proteinid- NCBI Protein ID
uniprot - Uniprot ID (up)
pubchem - Pubchem ID
chebi - Chebi ID
- tab-delimited text from
list/find/conv/link
- flat-format from
get
- text-message from
info
KEGG PATHWAY is presented in a set of wiring diagrams of molecular interactions, reactions and relations, which is the "heart" of KEGG. It enables the mapping of individual elements of a genome (or several genomes) in the context of large-scale dynamic systems, such as metabolism and other cellular processes. It is hugely valuable resource for interpreting genes and gene products in a cellular, systems-level context.
-
Enter terpenoid in the search field and click Go. By default, the organism is
map
, which stands for generic reference pathway. -
QUESTION: How many pathways are returned for this search?
-
Click on map00900, which will show a reference map, interactive:
- clicking on a rectangular node will take you to an entry in the KEGG ORTHOLOGY (KO) database
- clicking on a circular node will take you to an entry in the KEGG COMPOUND database
- clicking on a red node will take you to the connected entry in the KEGG PATHWAY database.
-
Return to the KEGG PATHWAY search page, and enter
ksk
in the Organism field, then search again for terpenoid. -
QUESTION: How many pathways are returned for this search?
-
http://rest.kegg.jp/list/gn will list
the genomes/organisms within the KEGG database. What does the
triple-letter code
bha
stand for? And what aboutath
? - How many terms are contained in the database for the species
ath
? Are they all proteins? Hint: http://rest.kegg.jp/list/ath - How many pathways are stored in the PATHWAY database?
- What is the role for the pathway map00010? Hint: https://www.genome.jp/kegg-bin/show_pathway?map=map00010&show_description=show
- What is the role for the pathway map00030? Hint: https://www.genome.jp/kegg-bin/show_pathway?map=map00030&show_description=show
- List the connecting reactions and compounds for the above two pathways.
Write a python program to extract all the reactions and compounds to connect all-pairs of pathways.

SELECT <Attributes> FROM <Dataset> WHERE <Filter>;
-
Mart
: The host for querying -
Database
: Which database (genes/variants, ...) -
Dataset
: Which dataset (organism) would you like to extract information from. -
Filter
: Which genes you want to retrieve information for. -
Attributes
: What information you want to have for the genes of interest.
Let's start with a simple question: How many proteins in human have the retinol binding domains (IPR002449). Retrieve the list of gene names and descriptions.
- The file
data/golub_243_genes.txt
hosts the 243 differentially expressed genes between AML and ALL patients and their corresponding gene expression levels inlog2
transformation measured by Affymetrix Hgu6800 chips. - The first column is the AFFY HuGeneFL probe IDs. Could you please convert these IDs to the NCBI Entrez Gene ID.
- Then find which KEGG pathway has the highest number of associations with these differentially expressed genes.
- Which of the above genes have no orthologs in the species Saccharomyces cerevisiae?
- (Optional)You could find the Perl scripts for the querying. Can you rewrite it into Python?