2B. Day 1: Biological databases and bioinformatics e‐resources (Hands on Session)
Biological Databases
Biological databases are integral to modern life sciences research, serving as organised repositories of biological information that facilitate data storage, retrieval, and analysis. These databases house a vast array of molecular, genetic, and functional information, providing researchers with valuable resources for understanding the complexities of living organisms. Genomic databases, such as GenBank and Ensembl, store DNA sequences from various species, enabling the comparison of genomes and the identification of genes. Protein databases, including UniProt, catalogue protein sequences and functional annotations, aiding the study of protein structure and function. Metabolic databases, like KEGG, offer insights into biochemical pathways and metabolic networks. Biological databases are crucial for interdisciplinary research, allowing scientists to integrate genomics, proteomics, and metabolomics information. They support bioinformatics tools and software applications, facilitating the analysis of large datasets and the discovery of patterns or relationships within biological information.
These databases foster collaboration and knowledge-sharing within the scientific community, contributing to advancements in medicine, agriculture, and environmental science. As our understanding of living systems grows, biological databases play an increasingly vital role in accelerating research and driving innovation in the biological sciences.
Genomics
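i. NCBI: The National Center for Biotechnology Information (NCBI), part of the National Library of Medicine, provides access to biomedical and genomic information through resources such as GenBank (sequence database), PubMed (literature database), and BLAST (sequence similarity search tool), facilitating research in molecular biology, bioinformatics, and genetics.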
ii. EBI/EMBL: EBI, part of the European Molecular Biology Laboratory, offers access to bioinformatics data, tools, and resources, supporting research in genomics, proteomics, structural biology, and more, contributing to advancing life sciences globally.
iii. DDBJ: The DNA Data Bank of Japan (DDBJ) collects and provides nucleotide sequence data. It collaborates with NCBI and EBI in the International Nucleotide Sequence Database Collaboration, ensuring global accessibility to genetic information for scientific and medical research.
iv. SRA (Sequence Read Archive): SRA is a publicly available database serving as a high-throughput sequencing data repository. It is maintained by the National Center for Biotechnology Information (NCBI), which is part of the United States National Library of Medicine. The SRA database is a critical resource for researchers in genomics, transcriptomics, and related fields, providing access to a vast collection of raw sequencing data.
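Records in these archives can also be requested programmatically. The following is a minimal sketch, assuming NCBI's public E-utilities `efetch` service; the helper name `efetch_url` and the accession `YOUR_ACCESSION` are placeholders introduced here for illustration. The function only builds the request URL and does not contact the server.

```python
from urllib.parse import urlencode

# Base address of NCBI's E-utilities efetch service.
EFETCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession, db="nuccore", rettype="fasta"):
    """Build (but do not send) an efetch URL for a single record.

    db      -- Entrez database to query, e.g. "nuccore" for nucleotide records
    rettype -- requested record format, e.g. "fasta" or "gb"
    """
    params = {"db": db, "id": accession, "rettype": rettype, "retmode": "text"}
    return EFETCH_BASE + "?" + urlencode(params)


# "YOUR_ACCESSION" is a placeholder; replace it with a real accession number.
print(efetch_url("YOUR_ACCESSION"))
```

The printed URL can be pasted into a browser or passed to a download tool. For bulk retrieval, NCBI's usage guidelines on request rates should be checked first.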
Proteomics
Proteomics is a relatively new ‘omics’ field that has developed rapidly, especially in therapeutics. The word proteome was coined by Marc Wilkins in 1994 and first appeared in print in 1995. Proteomics is a branch of molecular biology that involves the comprehensive study of proteins, including their structures, functions, interactions, and abundances within a biological system. It aims to understand the roles and behaviours of proteins in various cellular processes, tissues, and organisms. Proteomic techniques provide insights into the dynamic and complex nature of the proteome, which is the entire set of proteins expressed by a genome, cell, tissue, or organism at a given time. Because proteins carry out most cellular functions, proteomics can give a more direct view of an organism's structure and function than genomics alone.
Proteomics databases are crucial in organising, storing, and disseminating proteomic data. Some prominent proteomics databases include:
i. UniProt: UniProt is a comprehensive resource that provides information on protein sequences, functions, and structures. It includes the UniProtKB (Knowledgebase), UniRef (Reference Clusters), and UniParc (Protein Archive).
ii. PRIDE-PRoteomics IDEntifications Database: PRIDE is a database for MS-based proteomics data. It allows researchers to submit, browse, and analyse mass spectrometry data, including peptide and protein identifications.
iii. PeptideAtlas: PeptideAtlas is a resource that catalogues observed peptides from tandem mass spectrometry experiments, providing a reference for mass spectrometry-based proteomics.
iv. MassIVE-Mass Spectrometry Interactive Virtual Environment: MassIVE is a community resource for sharing mass spectrometry-based proteomics data. It includes datasets, tools, and analyses contributed by the community.
v. ProteomeXchange: ProteomeXchange is a consortium that facilitates the exchange of proteomics data. It integrates multiple proteomics repositories, including PRIDE, MassIVE, and others.
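Sequences downloaded from UniProtKB in FASTA format carry most of their annotation in the header line. The sketch below parses such a header, assuming the usual layout `>db|Accession|EntryName ProteinName OS=Organism OX=TaxID GN=Gene PE=n SV=n`. The function name and the example header (accession `X00001`, entry `EXMP_BOVIN`) are hypothetical and used only for illustration. Gene names containing spaces are not handled.

```python
import re

# db is "sp" (reviewed, Swiss-Prot) or "tr" (unreviewed, TrEMBL).
# The GN= field is optional, so it sits in a non-capturing optional group.
HEADER_RE = re.compile(
    r"^>(?P<db>sp|tr)\|(?P<accession>[^|]+)\|(?P<entry_name>\S+)\s+"
    r"(?P<protein>.+?)\s+OS=(?P<organism>.+?)\s+OX=(?P<taxid>\d+)"
    r"(?:\s+GN=(?P<gene>\S+))?\s+PE=(?P<pe>\d)\s+SV=(?P<sv>\d+)$"
)


def parse_uniprot_header(line):
    """Return the fields of a UniProtKB FASTA header as a dictionary."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError("not a recognised UniProtKB FASTA header: " + line)
    return match.groupdict()


# Hypothetical header, for illustration only.
EXAMPLE_HEADER = ">sp|X00001|EXMP_BOVIN Example protein OS=Bos taurus OX=9913 GN=EXMP PE=1 SV=1"

fields = parse_uniprot_header(EXAMPLE_HEADER)
print(fields["accession"], fields["organism"], fields["gene"])
```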
Metabolic and pathway databases
Metabolic and pathway databases are repositories of information that organise and provide access to data related to metabolic pathways, biochemical reactions, and the interconnected networks of molecules within living organisms. These databases are valuable resources for researchers studying metabolism, biochemistry, and systems biology. They facilitate the exploration of metabolic pathways, the identification of metabolites, and understanding the relationships between different biochemical reactions. Here are some notable metabolic and pathway databases:
i. KEGG-(Kyoto Encyclopedia of Genes and Genomes): KEGG is a comprehensive resource that includes information on pathways, diseases, drugs, and organisms. It provides a wealth of data on metabolic pathways, signalling pathways, and other biological processes.
ii. Reactome: Reactome is a curated database covering many biological pathways, including metabolism, signal transduction, and the cell cycle. It integrates pathway information with functional annotations and data on protein-protein interactions.
iii. MetaCyc: MetaCyc is a database of experimentally determined metabolic pathways and enzymes. It covers various organisms and provides detailed information on individual reactions and pathways.
iv. HMDB-(Human Metabolome Database): HMDB is a database that focuses on human metabolism. It provides information on small molecule metabolites found in the human body, including their structures, pathways, and associated diseases.
v. BRENDA-(BRaunschweig ENzyme DAtabase): BRENDA is a comprehensive enzyme information system that includes data on enzyme function, properties, and kinetics. It is a valuable resource for researchers studying biochemical reactions and enzymology.
vi. BioCyc: BioCyc is a collection of Pathway/Genome Databases (PGDBs) that cover a wide range of organisms. Each PGDB in BioCyc is a collection of metabolic pathways, enzymes, and associated information.
vii. WikiPathways: WikiPathways is a collaborative platform that allows the community to contribute and edit pathway information. It provides a wiki-style interface for curating pathway diagrams and associated annotations.
viii. MetaboLights: MetaboLights is a repository for metabolomics experiments and derived information. It includes data on metabolites, experimental conditions, and associated pathways.
ix. SMPDB-(Small Molecule Pathway Database): SMPDB is a database that focuses on small molecules and their involvement in metabolic pathways. It includes information on metabolites, enzymes, and diseases associated with specific pathways.
These databases are crucial in systems biology and bioinformatics, providing researchers with structured and curated information to analyse and interpret metabolic pathways. They are essential for understanding the molecular basis of various biological processes and exploring the connections between genes, proteins, and metabolites in different organisms.
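Several of these resources return plain, tab-delimited text that is easy to process in a script. As a minimal sketch, assuming the two-column output of KEGG's REST `list` operation (for example a pathway list for the organism code `bta`, Bos taurus), the helper below turns such text into a dictionary. The function name and the sample text are illustrative assumptions, not downloaded data.

```python
def parse_kegg_list(text):
    """Map entry IDs to descriptions from tab-delimited 'list' output."""
    entries = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        entry_id, _, description = line.partition("\t")
        # Some outputs prefix pathway IDs with "path:"; drop it if present.
        if entry_id.startswith("path:"):
            entry_id = entry_id[len("path:"):]
        entries[entry_id] = description.strip()
    return entries


# Illustrative sample in the assumed two-column format.
SAMPLE = (
    "bta00010\tGlycolysis / Gluconeogenesis - Bos taurus (cow)\n"
    "bta00020\tCitrate cycle (TCA cycle) - Bos taurus (cow)\n"
)

for pathway_id, name in parse_kegg_list(SAMPLE).items():
    print(pathway_id, "->", name)
```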
Expression databases
Expression databases are specialised repositories that store and provide access to information related to gene expression patterns across different biological conditions, tissues, developmental stages, or experimental treatments. These databases play a crucial role in understanding how genes are regulated and expressed, providing valuable insights into the functional roles of genes in various biological processes. A list of expression databases is given below:
i. GEO-(Gene Expression Omnibus): Maintained by the National Center for Biotechnology Information (NCBI), GEO is a comprehensive public repository for gene expression data, including microarray and RNA-Seq data.
ii. ArrayExpress: Hosted by the European Bioinformatics Institute (EBI), ArrayExpress is a database that archives functional genomics experiments, including gene expression data from microarrays and high-throughput sequencing.
iii. Expression Atlas: Also hosted by EBI, Expression Atlas integrates gene and protein expression data from ArrayExpress with functional information from other resources, providing a comprehensive view of gene expression across different conditions.
iv. HPA-(Human Protein Atlas): HPA offers information on the tissue-specific expression of human proteins, including mRNA expression data, immunohistochemistry images, and antibody-based profiling.
v. FGED-(Functional Genomics Data Society): FGED is not a data repository itself. It develops standards for reporting gene expression and functional genomics experiments, such as MIAME, fostering data sharing and interoperability among different databases.
vi. GDC-(NCI Genomic Data Commons): GDC is a platform that provides access to various cancer genomics datasets, including gene expression data, to support cancer research and precision medicine initiatives.
vii. Gentrepid: Gentrepid integrates information on gene function, gene expression, and disease association, aiming to predict the functional consequences of genetic variation.
viii. Single Cell Expression Atlas: This EBI resource focuses on single-cell RNA-Seq data, providing information on gene expression across different tissues and conditions at the single-cell level.
ix. Gemma: Gemma is a gene expression database that allows users to search, visualise, and analyse gene expression data sets across different species.
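When working with GEO, it helps to recognise what kind of record an accession refers to: GSE for a series, GSM for a sample, GPL for a platform, and GDS for a curated dataset. The sketch below classifies an accession by its prefix; the function name is hypothetical and the accession numbers are arbitrary examples.

```python
import re

# GEO accession prefixes and the record type each one denotes.
GEO_TYPES = {
    "GSE": "series",
    "GSM": "sample",
    "GPL": "platform",
    "GDS": "curated dataset",
}


def classify_geo_accession(accession):
    """Return the GEO record type for an accession, or None if unrecognised."""
    match = re.fullmatch(r"(GSE|GSM|GPL|GDS)(\d+)", accession.strip().upper())
    if match is None:
        return None
    return GEO_TYPES[match.group(1)]


# Arbitrary accession numbers, for illustration only.
for acc in ["GSE12345", "GSM67890", "GPL570", "SRR000001"]:
    print(acc, "->", classify_geo_accession(acc))
```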
Virus and bacterial databases
i. ICTVdB: ICTVdB contains taxonomic information for thousands of viruses.
ii. BacDive: BacDive contains taxonomic information for thousands of bacteria.
Animal-specific databases
i. Animal QTL Database: This database collects publicly available trait mapping data, including QTL (phenotype/expression, eQTL), candidate gene and association data (GWAS), and copy number variations (CNV) mapped to livestock animal genomes. It aims to facilitate locating and comparing discoveries within and between species.
ii. FAANG Community: The Functional Annotation of Animal Genomes (FAANG) is an international project aiming to create comprehensive maps of functional elements in animal genomes. It is an essential resource for researchers interested in functional genomics data for various animal species.
iii. Animal SNPAtlas: This database provides information about single nucleotide polymorphisms (SNPs) in animals, allowing researchers to explore genetic variability within and between populations of different animal species.
iv. Fish SNP Database: This database likely focuses on providing information about single nucleotide polymorphisms (SNPs) in fish species. It's a valuable resource for researchers interested in genetic variations specific to fish.
v. Cattle Gene Atlas: This database provides comprehensive information on the genetic elements and functionalities specific to cattle. It may include data on gene expression, regulation, and other genomic features specific to cattle.
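Genome-mapped features such as QTL are commonly exchanged as GFF3 files, which have nine tab-separated columns with `key=value` attributes in the last one. The following is a minimal sketch of reading one such line, assuming the data have been exported in GFF3 format. The function name, the example record, and its attribute keys are hypothetical and do not reproduce any database's actual export.

```python
from urllib.parse import unquote


def parse_gff3_line(line):
    """Split one GFF3 feature line into coordinates and an attribute dict."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("expected 9 tab-separated GFF3 columns")
    seqid, source, ftype, start, end, score, strand, phase, attrs = cols
    attributes = {}
    for pair in attrs.split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            # GFF3 percent-encodes reserved characters in attribute values.
            attributes[key.strip()] = unquote(value.strip())
    return {
        "seqid": seqid,
        "type": ftype,
        "start": int(start),
        "end": int(end),
        "attributes": attributes,
    }


# Hypothetical record: coordinates and attribute keys are made up.
EXAMPLE = "Chr.6\tExample_source\tQTL\t1000\t2000\t.\t.\t.\tQTL_ID=0000;Name=Milk yield"

record = parse_gff3_line(EXAMPLE)
print(record["seqid"], record["start"], record["end"], record["attributes"]["Name"])
```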
Time for the hands-on session
QTL_Exercise.pdf: Exploration of the Animal QTL Database