Lab 1: Databases - ricket-sjtu/bioinformatics GitHub Wiki

Learning Goals

Knowing where to obtain the data (resources) for Literatures, Sequences, Annotations, etc.
Understanding the mechanism of the Web API for the above resources.
Obtaining the batch of data from web resources using web scrawling technologies.
Python programming for integrating data.

1. Background Knowledges

1.1 How to obtain the data　(获取数据)

Protocol: TCP/IP + HTTP/HTTPS/FTP, Request (client -> server), Response (server -> client)
Web page development: HTML + CSS + Javascript
Web Database (Browser Query):
- NCBI
- UniProt
- PDB
- GeneOntology
- KEGG
- BioMart
SQL query: UCSC Genome Browser
Web scrawler: using the provided API (application programming interface, 应用编程接口)
ftp (direct download): ftp://ftp.ncbi.nlm.nih.gov

1.2 Scrawler (爬虫)

HTTP REQUEST and RESPONSE: urllib, urllib2, requests
Web browser emulation (模拟浏览器)
How to parse the RESPONSE: re, xpath, BeautifulSoup (bs4), scrapy, jsonpath, pyquery, csv, lxml
Dynamic page loading (Ajax): Selenium, PhantomJS (无界面浏览器)
CAPTCHA: Tesseract

1.3 Output Data Formats and Parsing　(解析结果)

Raw text: open()+read()+split() + re + ...
HTML: BeautifulSoup,　lxml
XML: xml, lxml
JSON (Javascript Object Notation): json
YAML (yet another markup lanuage): PyYAML
Delimited file: csv
...

HTTP: Request and Response

1.4 Format definition　(内容格式的定义)

DTD (Document type definition，文档类型定义)：可以看做XML文件的模板，定义XMl文件的元素、元素属性、元素排列方式、元素包含的内容
XSD (XML schema defintion，XML模式定义)：XML文档元素的定义与描述方式。
XML (Extensible markup language, 扩展标记语言)：网络数据传输的一种标记语言。
HTML (Hypertext markup lanuage, 超文本标记语言)：一种用以创建网页的标准标记语言。
CSS (Cascading stylesheet, 级联样式表)：为结构化文档添加样式的W3C标准语言。
RDF (Resource description framework, 资源描述框架)：描述网络资源的W3C标准。

2. NCBI - National Center for Biotechnology Information

$\S$2.1 Basic NCBI Exercise

(1) How many databases are in the NCBI?

Open NCBI;
You will find a Drop-down box, where will list all the databases in NCBI.
- What does ClinVar contain?
- What's the role for the database GEO?

(2) Pubmed search

Search for some reviews on sequence alignment algorithms for next generation sequencing data.
Read the review, and answer the following questions:
- List the categories of the algorithms.
- List 2-3 alignment tools for each category.
- Write down the details for one of the algorithms of your interest.
- Since you have learn Smith-Waterman algorithm for standard local alignment, can you understand why this algorithm cannot be applied to short read alignment?
Search for the articles on tool Basic Local Search Alignment Tools (BLAST), what's the difference between this tool and the Smith-Waterman algorithm?
Try to understand the statistical details behind the BLAST.

(3) Search for human presenilin-1 mRNA sequence in refseq nucleotide database

Open the database page nucleotide;
Click the advanced option beneath the search box;
In Builder section,
- Enter presenilin 1 in the field of Title;
- Enter human in the field of Organism;
Press Search.
How many entries are in the results?
Choose the Customize... beneath the filter Molecule types within the left panel, and select mRNA from the popup checkbox;
Choose the Customize... beneath the filter Source databases within the left panel, and select the RefSeq from the popup checkbox.
Now how many entries are left?
Now take a look at all the results, how many of the results satisfy your initial search?
Check the resulting entries of your interest, and save them into a FASTA file.
Have a look at the results in genbank and ASN.1 formats, try your best to interpret the results.

After finishing the search, you can find a textarea occurs under the term search details within the right panel of the result page. It should be like

presenilin 1[title] AND "Homo sapiens"[Organism] AND (biomol_mrna[PROP] AND refseq[filter])

Now can you figure out how to formulate a complex search using either AND, OR, or NOT?

2.2 Entrez Programming Utitilites (可编程应用接口)

Fore more information, refer to E-Utilities Quick Start.

Base URL:

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/

Basic Searching:

esearch.fcgi?db=<database>&term=<query>

Example: Get the PubMed IDs for articles about breast cancer published in Science 2008

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=science[journal]+AND+breast+cancer+AND+2008[pdat]

Basic Downloading:

efetch.fcgi?db=<database>&id=<uid_list>&rettype=<retrieval_type>
&retmode=<retrieval_mode>

Input: List of UIDs (&id); Entrez database (&db); Retrieval type (&rettype); Retrieval mode (&retmode)
Output: Formatted data records as specified
Example: Download nuccore GIs 34577062 and 24475906 in FASTA format

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=34577062,24475906&rettype=fasta&retmode=text

2.3 Standalone E-Utilities　(命令行程序)

For more information, refer to Entrez Direct

2.4 Biopython

Fore more information, read the tutorial Bio.Entrez.

3. Uniprot - Universal Protein Resource

$\S$3.1 Basic Uniprot Navigation

What is the gene name of P00470?
How does the gene name differ from the protein name?
How many sequences are reported for this entry and what are their lengths?
Which sequence are all the annotations based upon?
Where is this protein found in the cell?
What type of annotation is this from?
How many X-Ray structures are reported from PDBe?

$\S$3.2 Functional annotation (Features)

How many diseases are annoated in P00470 and what is/are their name(s)?
What annotations can you find at or overlapping residue 283?
Are any of the annotations associated to a disease(s)?
Using the other annotations can you conclude what has happened to the protein to cause the disease?
Are there any protein structures that cover residue 283 and does it help in your interpretation?

3.3 Uniprot Proteins API

Proteins API

$\S$3.4 Advanced searching

Is there a way to look for related (orthologs) proteins of P00470?
Which is the more consistent search?
Why do you think this is the case?

4. Gene Ontology

4.1 Ontology (什么是本体)

An ontology is a glossary of keywords arranged in a structured order or a network based on the biological concepts.
Challenges:
- Vast amount of biological data (海量生物数据)
- Different ID/names/terms for the same concepts/entities (同一概念，多种名字)
- Cross-species comparison is difficult (难以跨物种比较)
- The knowledge of phenotypes and the associated genes and their roles in cells is accumulating and changing (认知在不断增加和改变)
Resolution: Ontology
- using a controlled vocabulary (标准化的语汇)
- can be applied to either all organisms or at least within a kindom/sub-class/family

What an ontology does NOT?

NOT a system of nomenclature or a list of gene products/phenotypes.
NOT attempt to cover all aspects of biology or evolutionary relationships.
NOT a dictated standard.
NOT a way to unify databases.
REMEMBER: An ontology allows the users to query the different databases using the same keywords and query strings provided those different databases have implemented the commonly adopted ontologies.

Gene ontology describes a protein/gene's biochemical property

Molecular Function (MF): transporter, enzyme, ...
Role in a Biological Process (BP): photosynthesis, immunological response, ...
Localization in a Cellular Component (CC): Glogi apparatus, ER, ...

Anotomy of an ontology

Ontology terms are composed of
- Term name
- Unique ID
- Definition (more than 75% of terms defined)
- Synonym (optional)
- Database reference (optional)
- Relationship to other terms in the same ontology
17000+ Gene Ontology terms

4.2 Data format (数据格式)

OBO (open biological and biomedical ontology): an ontology representation language.
OWL (ontology web language):
RDF (resource definition format):
RDF-XML format

OWL is used to

describe some stuff, rather than just name it.
- Class (BlueThing) does not meaning anything
- Class (BlueThing) has a complete definition:

owl:Thing
restriction:hasColour someValuesFrom (blue)

OWL is a syntax-independent language that has several common representations
- Abstract Syntax
- N3
- RDF/XML

GO-term relationships

is-a (i)
part-of (P)
has-part (hP)
regulate (R)
positively-regulates (R+)
negatively-regulates (R-)

4.3 GO Evidences

http://www.candidagenome.org/cgi-bin/GO/goEvidence.pl
- EXP - Experiment
- IC - Inferred by Curator
- IDA - Inferred from Direct Assay
- IEA - Inferred from Electronic Annotation
- IEP - Inferred from Expression Pattern
- IGC - Inferred from Genomic Context
- IGI - Inferred from Genetic Interaction
- IMP - Inferred from Mutant Phenotype
- IPI - Inferred from Physical Interaction
- ISA - Inferred from Sequence Alignment
- ISM - Inferred from Sequence Model
- ISO - Inferred from Sequence Orthology
- ISS - Inferred from Sequence or Structural Similarity
- NAS - Non-traced Author Statement
- ND - No biological Data available
- RCA - Reviewed Computational Analysis
- TAS - Traceable Author Statement

$\S$4.4 Basic GO Query Exercise

Go to http://www.geneontology.org/amigo/
Search for human Cytochrome c in genes and gene products
How many GO terms are associated with this protein? How many different evidences?
Which association, in your opinion, has highest confidence? Why?

$\S$4.5 GO Programming Exercise

Here we will use the Python package - GOATOOLS to query the GO. This package can read the GO structure stored in OBO format, which is available from the GO website. After loading this file, it is convenient to traverse the GO hierarchy, search for particular GO terms, and find out which other terms they are related to and how.

You can install the goatools through pip tool.

The GOATOOLS package contains the function obo_parser.GODag() to load the GO file. Each GO term in the resulting object is an instance of the GOTerm class, which contains many useful attributes, including:

GOTerm.name: textual definition;
GOTerm.namespace: the ontology the term belongs to (i.e., MF, BP, CC);
GOTerm.parents: list of parent terms;
GOTerm.children: list of children terms;
GOTerm.level: shortest distance to the root node.

$\S$Exercise A

Download the GO basic file in OBO format (go-basic.obo), and load it using the function obo_parser.GODag() from the package GOATOOLS.
Answer the following questions:
- What is the name of the GO term GO:0048527?
- What are the immediate parent(s) of the term GO:0048527?
- What are the immediate children of the term GO:0048527?
- Recursively find all the parent and child terms of the term GO:0048527. Hint: use your solutions to the previous two questions, with a recursive loop.
- How many GO terms have the word "growth" in their name? Hint: Use re.match to find the terms.
- What is the lowest common ancestor term of GO:0048527 and GO:0097178? Hint: Find the common parents (ancestors) with the deepest level.
- Which GO terms regulate GO:0007124 (pseudohyphal growth)? Hint: load the relationship tags and look for terms which define regulation.

$\S$Exercise B

The GOATOOLS package also includes functions to visualize the GO graphs. For instance, it is possible to depict the location of a particular GO term in the ontology using the method GOTerm.draw_lineage().

Visualize the GO term GO:0097190. From the figure, what is the name of this term?
Using this figure, what is the most specific term that is in the parent terms of both GO:0097191 (extrinsic apoptotic signaling pathway) and GO:0038034 (signal transductional in absence of ligand)? This is referred to as the lowest common ancestor.

$\S$Exercise C

An alternative to GOATOOLS and OBO files is to retrieve information relating to a specific term from a web service. One such service is the EMBL-EBI QuickGO, which can provide descriptive information about GO terms in OBO-XML format via the following URL:

http://www.ebi.ac.uk/QuickGO/GTerm?id=<GO_ID>&format=oboxml

Write a Python function get_oboxml(go_id) to
- Use urllib package to request the OBO-XML;
- Parse the XML result using the xmltodict package to convert it into easy-to-use dict;
Use the function to find the name and description of the GO term GO:0048527 (lateral root development). Hint: print out the dictionary returned by the function or create a visualization using the Python package visualisedictionary to study the structure.

$\S$Exercise D: Retrieve GO Annotations

In this exercise, we will learn how to parse a GAF file (GO Annotation File) downloaded from the UniProt-GOA database using an iterator from the BioPython package (Bio.UniProt.GOA.gafiterator):

from Bio.UniProt.GOA import gafiterator
import gzip

fname = "gene_association.goa_arabidopsis.gz"
with gzip.open(fname, "rt") as fp:
	for annotation in gafiterator(fp):
		print(annotation['DB_Object_ID'])

A GAF file is a tab-delimited file containing 17 fields including:

DB: the protein database;
DB_Object_ID: protein ID;
Qualifier: annotation qualifier (such as NOT);
GO_ID: GO term;
Evidence: evidence code.

Find the total number of annotations for Arabidopsis thaliana with NOT qualifiers. What is this as a percentage of the total number of annotations for this species?
How many genes (of Arabidopsis thaliana) have the annotation GO:0048527 (lateral root development)?
Generate a list of annotated proteins which have the word "growth" in their name.
There are 21 evidence codes used in the Gene Ontology project. Many of these are inferred, either by curator or automatically. Find the counts of each evidence code in the Arabidopsis thaliana annotation file.

(Optional)$\S$Exercise E: Semantic similarity of GO Terms

This exercise will focus upon computing the semantic similarity between GO terms, based on the either graph-based or information-theoretic metrics. The former relies on the structure of the Gene Ontology graph, while the latter also accounts for the information content of the terms.

Write a Python function to compute the corresponding similarity metrics:

Since the Gene Ontology graph is a directed acyclic graph (有向无环图), we can use the inverse of the the number of edges separting two terms $(t_1, t_2)$.
Another example of an information-theoretic metric is Resnik similarity measure - the information content of the most informative common ancestor of the two terms in question.

(Optional)5. KEGG

KEGG (Kyoto Encyclopedia of Genes and Genomes)

KEGG
Integrated database of biological systems, genetic building blocks and chemical building blocks
Four major components of KEGG

KEGG API

The KEGG API is a REST-style Application Programming Interface to the KEGG resources using the URL form:

	http://rest.kegg.jp/<operation>/<argument>[/<argument2>[/<argument3> ...]]

operations

info - display database release information and linked db information
list - list the entry identifiers and associated definition
find - find entries with matching query keyword or other query data
get  - retrieve given database entries
conv - convert KEGG identifiers to/from outside identifiers
link - find related entries by using database cross-reference
ddi  - find adverse drug-drug interactions

databases

pathway  - KEGG PATHWAY maps (path:map0XXXX)
brite    - KEGG BRITE functional hierarchies (br:0XXXX)
module   - KEGG modules (md:M00XXX)
ko       - KO functional orthologs (ko:KXXXXX)
genome   - KEGG organism code (gn:TXXXXX)
<org>    - GENES in KEGG organisms 
vg       - Genes in viruses category (vg)
ag       - Genes in addendum category (ag)
compound - Small molecules (cpd)
glycan   - Glycans (gl)
reaction - Biochemical reactions (rn)
rclass   - Reaction class (rc)
enzyme   - Enzyme nomenclature (ec)
network  - Network elements (ne)
variant  - Human gene variants (hsa_var)
disease  - Human diseases (ds)
drug     - Drugs (dr)
dgroup   - Drug groups (dg)
environ  - Health-related substances (ev)
genes    - vg + ag
ligand   - compound, glycan, reaction, enzyme
kegg     - all
<medicus>- translational bioinformatics resource
<outside>- outside databases

org

hsa - human
eco - E.coli
...

medicus

disease_ja    - Disease (ds_ja)
drug_ja       - Drug (dr_ja)
dgroup_ja     - Drug group (dg_ja)
environ_ja    - Environment factor (ev_ja)
compound_ja   - Compound (cpd_ja)
brite_ja      - BRITE (br)
atc           - 7-letter ATC code (atc)
jtc           - Therapeutic category code (jtc)
ndc           - National Drug Code (ndc)
yj            - YJ code (yj)

outside databases

pubmed        - NCBI Pubmed database ID (pmid)
ncbi-geneid   - NCbI Gene ID
ncbi-proteinid- NCBI Protein ID
uniprot       - Uniprot ID (up)
pubchem       - Pubchem ID
chebi         - Chebi ID

`<kid>` = KEGG identifier

`<gene>` = Gene entry name or accession

`<entry>` = Database entry name or accession

`<dbentry>` = `<kid> | <org>:<gene> | <database>:<entry>`

`<dbentries>` = `<dbentry>1[+<dbentry>2...]`

Output format

tab-delimited text from list/find/conv/link
flat-format from get
text-message from info

$\S$5.1 Basic Search

KEGG PATHWAY is presented in a set of wiring diagrams of molecular interactions, reactions and relations, which is the "heart" of KEGG. It enables the mapping of individual elements of a genome (or several genomes) in the context of large-scale dynamic systems, such as metabolism and other cellular processes. It is hugely valuable resource for interpreting genes and gene products in a cellular, systems-level context.

Enter terpenoid in the search field and click Go. By default, the organism is map, which stands for generic reference pathway.
QUESTION: How many pathways are returned for this search?
Click on map00900, which will show a reference map, interactive:
- clicking on a rectangular node will take you to an entry in the KEGG ORTHOLOGY (KO) database
- clicking on a circular node will take you to an entry in the KEGG COMPOUND database
- clicking on a red node will take you to the connected entry in the KEGG PATHWAY database.
Return to the KEGG PATHWAY search page, and enter ksk in the Organism field, then search again for terpenoid.
QUESTION: How many pathways are returned for this search?

5.2 KEGG API

$\S$5.3 Exercise

http://rest.kegg.jp/list/gn will list the genomes/organisms within the KEGG database. What does the triple-letter code bha stand for? And what about ath?
How many terms are contained in the database for the species ath? Are they all proteins? Hint: http://rest.kegg.jp/list/ath
How many pathways are stored in the PATHWAY database?
What is the role for the pathway map00010? Hint: https://www.genome.jp/kegg-bin/show_pathway?map=map00010&show_description=show
What is the role for the pathway map00030? Hint: https://www.genome.jp/kegg-bin/show_pathway?map=map00030&show_description=show
List the connecting reactions and compounds for the above two pathways.

$\S$5.4 Programming Exercises

Write a python program to extract all the reactions and compounds to connect all-pairs of pathways.

6. BioMart

http://ensembl.org/biomart

6.1 Basic Notations

SELECT <Attributes> FROM <Dataset> WHERE <Filter>;

Mart: The host for querying
Database: Which database (genes/variants, ...)
Dataset: Which dataset (organism) would you like to extract information from.
Filter: Which genes you want to retrieve information for.
Attributes: What information you want to have for the genes of interest.

$\S$6.2 Exercise

Basic search

Let's start with a simple question: How many proteins in human have the retinol binding domains (IPR002449). Retrieve the list of gene names and descriptions.

Challenge

The file data/golub_243_genes.txt hosts the 243 differentially expressed genes between AML and ALL patients and their corresponding gene expression levels in log2 transformation measured by Affymetrix Hgu6800 chips.
The first column is the AFFY HuGeneFL probe IDs. Could you please convert these IDs to the NCBI Entrez Gene ID.
Then find which KEGG pathway has the highest number of associations with these differentially expressed genes.
Which of the above genes have no orthologs in the species Saccharomyces cerevisiae?
(Optional)You could find the Perl scripts for the querying. Can you rewrite it into Python?