Course: VT25 Bioinformatics 1 (SC00037)

The aim of these exercises is to introduce you to:

The frequently used NCBI system for accessing molecular biology information
The UCSC Genome Browser, a web-based system for accessing genome information and other molecular biology data
Ensembl and their retrieval engine BioMart

Here we will examine the blood coagulation Factor IX, which is associated with the bleeding disorder Hemophilia B.

NCBI

Global search

Go to NCBI. You will see all the different databases listed on your left.

Enter Factor IX and see how many hits you get in the different datasets.

Q1. What would happen if, instead of searching for Factor IX, you searched for F9?
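The same comparison can be scripted against NCBI's E-utilities. The sketch below only builds ESearch URLs for the two spellings of the query in a few databases; it does not contact NCBI. The helper name esearch_url and the choice of databases are illustrative assumptions. When such a URL is opened, the JSON response is expected to report the number of hits in a count field under esearchresult.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db: str, term: str) -> str:
    """Build an ESearch URL that asks one Entrez database for matches to `term`."""
    query = urlencode({"db": db, "term": term, "retmode": "json"})
    return f"{EUTILS}/esearch.fcgi?{query}"


# Print one URL per database and spelling; paste any of them into a browser.
for db in ("gene", "protein", "nuccore"):
    for term in ("Factor IX", "F9"):
        print(db, repr(term), esearch_url(db, term))
```

Opening the URLs for both spellings side by side makes it easy to see how much the hit counts depend on the exact search term.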

From the Gene section (database), locate the gene that codes for the human Factor IX and display the record. Scroll down and see what information you can find about this gene.

Q2. How many different names (aliases) do you find for this gene?

Q3. In what tissue is this gene the most expressed? (under Expression, select RNA from 20 human tissues)

Q4. What medical conditions have been associated with this gene? (Look under Phenotypes)

Q5. How many orthologous genes can you find? (Look under General gene information)

Q6. How many different isoforms are there of this gene? (Look under NCBI Reference Sequences (RefSeq))

Let's have a look at isoform 1. Click on NP_000124.1

Q7. How many amino acids are there in isoform 1?
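If you download a protein record in FASTA format, the length can also be counted with a few lines of code. This is a minimal sketch: the function name fasta_length is an assumption, and the toy record below is made up and is not the Factor IX sequence. The EFETCH_URL constant shows how the real record could be requested from E-utilities, but the code does not download it.

```python
# URL that would return the isoform 1 protein record as FASTA text (not fetched here).
EFETCH_URL = (
    "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
    "?db=protein&id=NP_000124.1&rettype=fasta&retmode=text"
)


def fasta_length(fasta_text: str) -> int:
    """Count residues in a single-record FASTA string, ignoring the header line."""
    lines = fasta_text.strip().splitlines()
    if not lines or not lines[0].startswith(">"):
        raise ValueError("expected a FASTA header line starting with '>'")
    return sum(len(line.strip()) for line in lines[1:])


# Made-up toy record for illustration only.
toy_record = ">toy_protein illustrative only\nMKTAYIAK\nQRQIS\n"
print(fasta_length(toy_record))
```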

Click on Graphics. Now you will see the features of the protein and not the gene!

Q8. Why do you think one part of the protein is called heavy chain and the other light chain?

3D structure database

Go back to the Gene report.

Under the NCBI Reference Sequences (RefSeq) section follow the link to UniProtKB/Swiss-Prot ID P00740. Scroll down and scan what information you can get here.

Under Subcellular location:

Q9. Where can F9 be found, according to the GO Annotation? See under the subcellular location section

Posttranslational modifications (PTMs) are covalent processing events that change the properties of a protein, either by proteolytic cleavage or by adding a modifying group, such as acetyl, phosphoryl, glycosyl or methyl, to one or more amino acids.

Under PTM:

Q10. What is a propeptide? How many glycosylated residues does F9 have? See under the PTM/Processing section

Under Structure, locate the 3D structure with the entry ID 1RFN and follow any of the external database links. This will open one of the portals for the PDB database.

This structure was elucidated via X-ray diffraction using a solution that, besides F9, contained three ligands.

Q11. What are those ligands?

Q12. What are the weights of the two different chains? (Look under Structure analysis)

OMIM database

Information about diseases associated with this gene can be found in the Online Mendelian Inheritance in Man (OMIM) database.

Go back to the Gene report and click OMIM under Related information.

Click on the entry 300746 - Coagulation factor IX; F9.

There is a lot of information related to this specific gene.

Q13. What is the mode of inheritance for Hemophilia B?

Under Allelic Variants there is a list of mutations that are linked to disease. One of the allelic variants is called SEATTLE-2 and causes severe hemophilia B.

Q14. What is the difference between this variant and the normal gene? What is the effect on the protein with this change in the gene?

Query retrieval

In this section you will try some queries with the NCBI search tool to understand the difference in the syntax as well as learn how to perform advanced searches to retrieve specific data.

To begin with, we will use the human MLH1 gene, implicated in colon cancer, to illustrate some principles.

Boolean operators

Go to the Nucleotide Database, perform a search for the following terms and make note of the number of hits you get.

Term	Counts
MLH1 AND cancer AND human
MLH1 AND cancer OR human
MLH1 AND (cancer OR human)
MLH1 AND cancer NOT human

Q15. Why do you get a different number of hits? Draw a Venn diagram for each term that illustrates what is included in the different searches.
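The logic behind the four searches can be mimicked with Python sets. This is a toy sketch: the record IDs are invented, and it assumes that Entrez evaluates Boolean operators from left to right unless parentheses are used, so that MLH1 AND cancer OR human is read as (MLH1 AND cancer) OR human.

```python
# Invented record IDs that match each individual term.
mlh1 = {1, 2, 3, 4}
cancer = {2, 3, 5, 6}
human = {3, 4, 6, 7}

# AND is set intersection (&), OR is union (|), NOT is difference (-).
searches = {
    "MLH1 AND cancer AND human": mlh1 & cancer & human,
    "MLH1 AND cancer OR human": (mlh1 & cancer) | human,  # left to right
    "MLH1 AND (cancer OR human)": mlh1 & (cancer | human),
    "MLH1 AND cancer NOT human": (mlh1 & cancer) - human,
}

for term, records in searches.items():
    print(f"{term:28} {len(records)} hits: {sorted(records)}")
```

Each line of output corresponds to one region of a three-circle Venn diagram, which may help when drawing the diagrams by hand.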

Go back and search "MLH1 cancer human" (with the quotation marks). Did you get any hits? Try again without the quotation marks.

To the right you can see the Search details (i.e., how the search engine interpreted your query as a Boolean expression). You can edit the expression in this box and see how the counts change. For instance, remove OR human[All Fields], and then add AND colon[Title].

Q16. How does the count change with these two settings?

Advanced Search Builder

As you saw above you can be more specific and specify in what fields to search for a term.

Click on Advanced under the search box.

Q17. How many records do you retrieve when searching for sequences where MLH1 is found in the title of the record and where the sequence is from human?

Q18. How many chimpanzee sequences that have the word mRNA in their titles are available?

Q19. How many of the previous sequences are at most 100 nucleotides long? Hint: Add the Sequence Length field and enter a range in the format 1:100

Q20. How many of the previous sequences are left if you filter away sequences described as partial?
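The Advanced Search Builder ultimately produces a text query in which each term carries a field tag in square brackets. The sketch below assembles such strings by hand. The helper names tagged and all_of are assumptions, and so is the reading of "partial" as a word in the title; compare the strings with what the builder itself generates before relying on them.

```python
def tagged(term: str, field: str) -> str:
    """Attach an Entrez field tag to a term, e.g. MLH1[Title]."""
    return f"{term}[{field}]"


def all_of(*clauses: str) -> str:
    """Join clauses so that every one of them must match."""
    return " AND ".join(clauses)


# One possible way to phrase the searches from Q17 to Q20.
q17 = all_of(tagged("MLH1", "Title"), tagged("human", "Organism"))
q18 = all_of(tagged("chimpanzee", "Organism"), tagged("mRNA", "Title"))
q19 = all_of(q18, tagged("1:100", "SLEN"))  # SLEN is the Sequence Length field
q20 = f"{q19} NOT {tagged('partial', 'Title')}"  # assumes 'partial' appears in the title

for label, query in (("Q17", q17), ("Q18", q18), ("Q19", q19), ("Q20", q20)):
    print(label, query)
```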

You might also want to retrieve a list of genes that have been reported in connection with a certain condition or process. Go to the Gene database and try to identify genes involved in mitochondrial biogenesis in humans.

Q21. What is your search query?

Q22. How many genes did you find?

As a note of caution: Always browse your results, for instance, check if they are curated or predicted entries, if there are alternative spliced transcripts, if they are non-coding, etc.

UCSC Genome Browser

The Browser

Go to the UCSC Genome browser. At the top of the page you see several links, choose Genome Browser. Under Genomes choose Human hg38 (otherwise you may have different results).

Type Factor IX in the search box. You will get a list of matching sequences. Follow the link F9 (ENST00000218099.7).

In the track section, you can configure the page according to your interest, displaying or hiding genomic information.

For now, set the following tracks (all other tracks may be set to hide):

Base Position -> dense
NCBI RefSeq -> full
OMIM Alleles -> squish
Encode Regulation -> show
Conservation -> full

Click on refresh

The first thing you should notice is that there are two F9 records in the NCBI RefSeq genes track.

Q23. Why is that? What is the difference?

Zoom in on the smallest exon until you see the amino acid sequence:

Q24. What is the amino acid sequence of this exon?

A nice feature of this browser is that we can inspect the genomic environment. Zoom out to see other genes in the screen (you may zoom out 100x).

Q25. What human gene is located "to the right" of Factor IX, and what strand is it on (plus or minus)? Is it the same strand as F9?

The OMIM Alleles track shows all variants in the OMIM database that have been associated with dbSNP identifiers (single nucleotide polymorphisms that are common in the population).

Zoom again so you only have F9 displayed.

Q26. Why are they mostly distributed over the exons? You can click on OMIM Allelic Variant Phenotypes to display all the individual variants. Click again and you will condense the information. If you right-click you can use different format views.

Now, zoom in on the sixth exon of the first transcript (the one with 8 exons). Right-click on OMIM Alleles and select full.

Q27. What other diseases/phenotypes besides Hemophilia B are related to the variants in this exon? Click on them to learn why they are different from Hemophilia B.

The ENCODE Regulation section shows information relevant to the regulation of transcription at different levels. By default the H3K27Ac Mark track is displayed (this shows where enhancer regions may be located, based on the modification of histone proteins).

Right-click on the track and select Configure Layered H3K27Ac track set ... (there will be 9 tracks)
Click on TF Clusters (this track shows DNA regions where transcription factors, proteins responsible for modulating gene transcription, bind)
Select full as Display mode
Select All as Filter by factor
Click Submit

Q28. Mention some transcription factors (the darker the rectangle the more confidence there is for the specific TF).

Looking at the Conservation section, by default you will see:

the Cons 100 Verts track, which shows base-wise conservation across 100 vertebrates, obtained by PhyloP
the Multiz Align track, which shows the alignment of 100 vertebrates in pack mode, obtained by Multiz

Q29. According to the Cons 100 Verts track, what are the most conserved regions? Why are these regions strongly conserved across different organisms?

Tools

You can also use the browser to identify a specific sequence. Below is a primer pair designed to amplify one exon from a human gene:

1) TCTCTCCAACTTTGCACTTTTC  Forward
2) AAGGCTAAGGTCAGCCATGA    Reverse

Try BLAT under Tools to find the location of the sequences. Click on the browser link to visualize the results (zoom out to see both hits).

Q30. Are the primers unique? For both primers, what is the start and end position in bp? To what strand do they bind?

Now run In-Silico PCR using the same sequences and with default parameters.

Q31. What is the name of the gene and the exon number that is supposed to be amplified? Hint: Click on the link to get to the browser
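To see why the two primers end up on opposite strands, it can help to mimic the PCR search in code. The sketch below is a simplification that only accepts exact matches, whereas BLAT and In-Silico PCR tolerate mismatches and search the whole genome. The template is synthetic (made-up flanks and insert), and the function names are assumptions.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def in_silico_pcr(template: str, forward: str, reverse: str):
    """Return the plus-strand amplicon, or None if either primer has no exact match."""
    template = template.upper()
    start = template.find(forward.upper())
    if start == -1:
        return None
    # The reverse primer binds the plus strand as its reverse complement.
    reverse_site = reverse_complement(reverse)
    site = template.find(reverse_site, start + len(forward))
    if site == -1:
        return None
    return template[start:site + len(reverse_site)]


FORWARD = "TCTCTCCAACTTTGCACTTTTC"
REVERSE = "AAGGCTAAGGTCAGCCATGA"

# Synthetic template: the flanks and the 10-base insert are invented.
template = "GGGG" + FORWARD + "ACGTACGTAC" + reverse_complement(REVERSE) + "CCCC"
amplicon = in_silico_pcr(template, FORWARD, REVERSE)
print(reverse_complement(REVERSE))
print(len(amplicon), amplicon)
```

On the plus strand the reverse primer appears as its reverse complement, which is why the two BLAT hits are reported on different strands.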

The Table Browser

UCSC also provides text-based access to the genome assemblies and annotation data stored in the Genome Browser.

Let's suppose we would like to get the sequences of all genes in chrX that are transcribed from the positive strand and that overlap with expression data.

Go to the Table Browser under Tools and make sure the following is selected, since we want human sequences from RefSeq:

clade: mammal
genome: human
assembly: Dec. 2013 (GRCh38/hg38)
group: Genes and predictions
track: NCBI RefSeq
table: RefSeq Curated
Click Submit

Now we need to filter the data, set the following:

position: chrX
filter: create, set strand to +

intersection: create, then
- group: expression
- track: GTEx Gene V8
Click Submit

This will select only genes on the positive strand of chrX that overlap with some kind of expression data from GTEx.
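The filter and intersection steps amount to three tests per gene: chromosome, strand, and overlap with at least one feature of the second track. The sketch below applies them to invented toy features; the names Feature, overlaps and select are assumptions, coordinates are treated as 0-based half-open intervals, and the overlap test ignores the strand of the expression features.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive
    strand: str
    name: str


def overlaps(a: Feature, b: Feature) -> bool:
    """True if the two features share at least one base on the same chromosome."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def select(genes, expression, chrom="chrX", strand="+"):
    """Keep genes on the given chromosome and strand that overlap any expression feature."""
    return [
        gene for gene in genes
        if gene.chrom == chrom
        and gene.strand == strand
        and any(overlaps(gene, feature) for feature in expression)
    ]


# Invented toy data, not real annotation.
genes = [
    Feature("chrX", 100, 200, "+", "geneA"),
    Feature("chrX", 300, 400, "-", "geneB"),
    Feature("chrX", 500, 600, "+", "geneC"),
    Feature("chr1", 100, 200, "+", "geneD"),
]
expression = [
    Feature("chrX", 150, 350, "+", "exp1"),
    Feature("chr1", 100, 200, "+", "exp2"),
]

print([gene.name for gene in select(genes, expression)])
```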

Now, set the output format as sequence. Click on summary/statistics to see how many records matched your criteria.

Q32. How many sequence items are you counting?

You can now download the data by clicking get output, for further studies/processing. But we will skip this for the time being.

Visualization of your own data

If you have information obtained from your own experimental work, you may upload it to the UCSC browser. If, for instance, you have located novel variants or novel expressed transcripts, it could be interesting to view their location in the context of several information resources. Here is a very simple toy example of how to add a custom track to the browser.

Go to the Genome Browser
Click on add/manage custom tracks
Paste the following in the Paste URLs or data box

browser position chr22:20114500-20115500
track name=coords description="Chromosome coordinates list" visibility=2 color=255,0,0
chr22 20114574 20114685
chr22 20114760 20114875
chr22 20114966 20115079

Submit

In the resulting page all your custom tracks will be displayed.

Check that Genome Browser is selected in the view in menu
Select the Track to Display
Click go to first annotation

You should now see a new track with lines corresponding to the regions listed above in red.
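The data lines of the custom track are in BED format: chromosome, start and end, where the start is 0-based and the end is not included. A short parser makes this concrete. The function name parse_bed is an assumption, and the track line is written here with space-separated attributes.

```python
CUSTOM_TRACK = """\
browser position chr22:20114500-20115500
track name=coords description="Chromosome coordinates list" visibility=2 color=255,0,0
chr22 20114574 20114685
chr22 20114760 20114875
chr22 20114966 20115079
"""


def parse_bed(text: str):
    """Return (chrom, start, end) tuples, skipping browser, track and comment lines."""
    intervals = []
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0] in ("browser", "track") or line.startswith("#"):
            continue
        intervals.append((fields[0], int(fields[1]), int(fields[2])))
    return intervals


for chrom, start, end in parse_bed(CUSTOM_TRACK):
    # With a 0-based start and an exclusive end, the length is simply end - start.
    print(chrom, start, end, "length:", end - start)
```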

Q33. In this case what is the name of the gene overlapping with your custom tracks?

Ensembl

The Browser

Go to the Ensembl genome browser and choose the human assembly GRCh38.

Search for F9 or directly use its Ensembl ID: ENSG00000101981

Q34. How many hits did you find? What kind of hits?

Select the gene record.
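The gene record can also be requested programmatically through the Ensembl REST service. This is a minimal sketch: the helper names lookup_url and fetch_json are assumptions, and fetch_json needs network access, so it is defined but not called here. With expand=1 the response is expected to include the transcripts of the gene, but check the actual JSON rather than relying on particular field names.

```python
import json
from urllib.request import Request, urlopen

ENSEMBL_REST = "https://rest.ensembl.org"


def lookup_url(stable_id: str) -> str:
    """Build the lookup URL for an Ensembl stable ID, asking for nested objects."""
    return f"{ENSEMBL_REST}/lookup/id/{stable_id}?expand=1"


def fetch_json(url: str) -> dict:
    """Download and decode a JSON response (requires network access)."""
    request = Request(url, headers={"Content-Type": "application/json"})
    with urlopen(request, timeout=30) as response:
        return json.load(response)


url = lookup_url("ENSG00000101981")
print(url)
# To retrieve the record when you are online: record = fetch_json(url)
```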

Q35. How many orthologues does this gene have in Ensembl? Was it the same in NCBI?

Q36. How many transcripts/splice variants does this gene have? How reliable are they? Hint: Click on Show transcript table and explore the Flags column

Q37. What is the length in bp of the protein coding transcripts?

Click on Summary in the left menu. In the viewer locate the Regulatory Build. One of the annotated regulators is CTCF

Q38. Is this annotation consistent with the UCSC annotation on transcription factors?

Ensembl has some nice features to highlight functional elements in the nucleotide or amino acid sequence.

Click on Sequence in the left menu and scroll down.

Q39. What is the nucleotide sequence for exon 2 (copy/paste)?

Now click on the transcript F9-201 in the transcript table, and then Exons in the left menu. Scroll down and check how the sequence has now been annotated.

Q40. How many known stop gained variants can you find in exon 2?

Click on Protein in the left menu and scroll down.

Q41. Locate the same amino acid sequence as we previously identified in UCSC for the smallest exon. Is it the same? (copy/paste)

BioMart

Now let's do some data mining. We will identify sheep genes that are orthologues to human coding genes related to coagulation.

Go to BioMart.

First, let's identify the human coding genes that are related to coagulation.

Select the human database:

Database: Ensembl Genes 111
Dataset: Human genes (GRCh38.p14)

Select protein coding genes that are involved in the coagulation process:

On your left click on Filters
Gene -> Gene type -> protein_coding
Gene Ontology -> GO Term Name
- Type: coagulation

Select the information we would like to display:

On the left menu click on Attributes
Features -> External -> GO -> GO term definition

By clicking on Results you will see the list of genes, and we can corroborate that they are involved in coagulation by reading the GO term definition.

Q42. How many coding genes are involved in coagulation? Hint: Click on Count
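BioMart can show any query you build in the web interface as XML (the XML button on the results page), and the same XML can be sent to the martservice endpoint at www.ensembl.org/biomart/martservice. The sketch below assembles such a document for the first step above. The dataset, filter and attribute names used here (hsapiens_gene_ensembl, biotype, go_parent_name, definition_1006 and so on) are assumptions about BioMart's internal names; verify them against the XML that the interface generates for your own query.

```python
import xml.etree.ElementTree as ET


def biomart_query(dataset: str, filters: dict, attributes: list) -> str:
    """Assemble a BioMart query document as an XML string."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "1",
        "uniqueRows": "1",
        "count": "",
        "datasetConfigVersion": "0.6",
    })
    dataset_node = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    for name, value in filters.items():
        ET.SubElement(dataset_node, "Filter", {"name": name, "value": value})
    for name in attributes:
        ET.SubElement(dataset_node, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")


# Internal names below are assumptions; check them with the XML button in BioMart.
xml_text = biomart_query(
    dataset="hsapiens_gene_ensembl",
    filters={"biotype": "protein_coding", "go_parent_name": "coagulation"},
    attributes=["ensembl_gene_id", "external_gene_name", "definition_1006"],
)
print(xml_text)
```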

Now, let's extract the sheep orthologues of these genes:

Under Filters
MULTI SPECIES COMPARISONS -> Homologue filters -> Orthologous Sheep Genes
Under Attributes
Homologues -> ORTHOLOGUES [P-T] -> Sheep Orthologues
Select the following:
- Sheep gene name
- Sheep chromosome/scaffold name
- Sheep chromosome/scaffold start (bp)
- Sheep chromosome/scaffold end (bp)
- Sheep orthology confidence [0 low, 1 high]

Count and get the results.

Q43. How many genes did you find that were orthologues?

Home: Bioinformatics 1 (SC00037)

NCBI: Developed by Tore Samuelsson, Marcela Dávila, 2010. Modified by Marcela Dávila and Katarina Truvé, 2017. Updated by Marcela Dávila, 2022
UCSC: Developed by Tore Samuelsson and Marcela Dávila, 2010. Updated by Marcela Dávila, 2022
Ensembl: Developed by Katarina Truvé, 2017. Updated by Marcela Dávila, 2022

B1 I: Genome Browsers - BDC-training/VT25 GitHub Wiki
