B1 I: Genome Browsers - BDC-training/VT25 GitHub Wiki
Course: VT25 Bioinformatics 1 (SC00037)
The aim of these exercises is to introduce you to:
- The frequently used NCBI system for accessing molecular biology information
- The UCSC Genome Browser, a web-based system for accessing genome information and other molecular biology data
- Ensembl and their retrieval engine BioMart
Here we will examine blood coagulation Factor IX, which is associated with the bleeding disorder Hemophilia B.
Go to NCBI. You will see all the different databases listed on your left.
Enter Factor IX and see how many hits you get in the different databases.
Q1. What would happen if, instead of searching for Factor IX, you searched for F9?
From the Gene section (database), locate the gene that codes for human Factor IX and display the record. Scroll down and see what information you can find about this gene.
Q2. How many different names (aliases) do you find for this gene?
Q3. In which tissue is this gene most highly expressed? (Under Expression, select RNA from 20 human tissues)
Q4. What medical conditions have been associated with this gene? (Look under Phenotypes)
Q5. How many orthologous genes can you find? (Look under General gene information)
Q6. How many different isoforms are there of this gene? (Look under NCBI Reference Sequences (RefSeq))
Let's have a look at isoform 1. Click on NP_000124.1
Q7. How many amino acids are there in isoform 1?
Click on Graphics. Now you will see the features of the protein and not the gene!
Q8. Why do you think one part of the protein is called heavy chain and the other light chain?
Go back to the Gene report.
Under the NCBI Reference Sequences (RefSeq) section, follow the link to UniProtKB/Swiss-Prot ID P00740.
Scroll down and scan what information you can get here.
Under Subcellular location:
Q9. Where can F9 be found, according to the GO Annotation? See the Subcellular location section.
Posttranslational modifications (PTMs) are covalent processing events that change the properties of a protein, either by proteolytic cleavage or by adding a modifying group, such as acetyl, phosphoryl, glycosyl or methyl, to one or more amino acids.
Under PTM:
Q10. What is a propeptide? How many glycosylated residues does F9 have? See the PTM/Processing section.
Under Structure, locate the 3D structure with the entry ID 1RFN and follow any of the external database links. This will open one of the portals for the PDB database.
This structure was elucidated via X-ray diffraction using a solution that, besides containing F9, contained three ligands.
Q11. What are those ligands?
Q12. What are the weights of the two different chains? (Look under Structure analysis)
Information about diseases associated with this gene can be found in the Online Mendelian Inheritance in Man (OMIM) database.
Go back to the Gene report and click OMIM under Related information.
Click on the entry 300746 - Coagulation factor IX; F9.
There is a lot of information related to this specific gene.
Q13. What is the mode of inheritance for Hemophilia B?
Under Allelic Variants there is a list of mutations that are linked to disease.
One of the allelic variants is called SEATTLE-2 and causes severe hemophilia B.
Q14. What is the difference between this variant and the normal gene? What effect does this change in the gene have on the protein?
In this section you will try some queries with the NCBI search tool to understand the difference in the syntax as well as learn how to perform advanced searches to retrieve specific data.
To begin with, we will use the human MLH1 gene, implicated in colon cancer, to illustrate some principles.
Go to the Nucleotide database, perform a search for each of the following terms and make a note of the number of hits you get.
Term | Counts |
---|---|
MLH1 AND cancer AND human | |
MLH1 AND cancer OR human | |
MLH1 AND (cancer OR human) | |
MLH1 AND cancer NOT human | |
Q15. Why do you get different numbers of hits? Draw a Venn diagram for each term that illustrates what is included in each search.
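The four searches can also be reasoned about as set operations. The sketch below uses invented record IDs (not real NCBI data) and assumes that Boolean operators are evaluated from left to right unless parentheses are used, so that MLH1 AND cancer OR human is read as (MLH1 AND cancer) OR human. The function name `count_hits` is made up for this illustration.

```python
# Toy record IDs; these sets are invented for illustration only.
mlh1 = {1, 2, 3, 4}      # records mentioning MLH1
cancer = {2, 3, 5, 6}    # records mentioning cancer
human = {3, 4, 6, 7}     # records mentioning human


def count_hits():
    """Return the number of toy records matched by each query shape."""
    return {
        "MLH1 AND cancer AND human": len(mlh1 & cancer & human),
        # Assumed left-to-right evaluation: (MLH1 AND cancer) OR human
        "MLH1 AND cancer OR human": len((mlh1 & cancer) | human),
        # Parentheses force the OR to be evaluated first
        "MLH1 AND (cancer OR human)": len(mlh1 & (cancer | human)),
        # NOT removes records from the running result
        "MLH1 AND cancer NOT human": len((mlh1 & cancer) - human),
    }


for query, n in count_hits().items():
    print(f"{query}: {n}")
```

Each set expression corresponds to one region (or union of regions) of a three-circle Venn diagram, which is a convenient way to check your drawings.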
Go back and search "MLH1 cancer human". Did you get any hits? Try without the quotation marks.
To the right you can see the Search details (i.e. how the search engine interpreted your question in the form of a Boolean expression).
You can change the settings in this box and see how the counts change.
For instance, remove OR human[All Fields], and then add AND colon[Title].
Q16. How does the count change with these two settings?
As you saw above you can be more specific and specify in what fields to search for a term.
Click on Advanced under the search box.
Q17. How many records do you retrieve when searching for sequences where MLH1 is found in the title of the record and where the sequence is from human?
Q18. How many chimpanzee sequences that have the word mRNA in their titles are available?
Q19. How many of the previous sequences are less than 100 nucleotides long? Hint: Under Sequence length, enter a range in the format 1:100
Q20. How many of the previous sequences are left if you filter out sequences described as partial?
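The Advanced page essentially assembles a query string from field-qualified terms. A minimal sketch of that idea follows. The helpers `field` and `build_query` are hypothetical, the gene, organism and length values are placeholders rather than answers to the questions above, and the field names (Title, Organism, Sequence Length) are assumed to match the ones offered in the Advanced search builder, so check them against what you see on the page.

```python
def field(term, tag):
    """Attach a search field tag to a term, e.g. BRCA1[Title]."""
    return f"{term}[{tag}]"


def build_query(required, excluded=()):
    """Join required terms with AND and append each excluded term with NOT."""
    query = " AND ".join(required)
    for term in excluded:
        query += f" NOT {term}"
    return query


# Placeholder values, deliberately different from the exercise questions.
query = build_query(
    required=[
        field("BRCA1", "Title"),
        field("mouse", "Organism"),
        field("1:500", "Sequence Length"),  # range written as start:end
    ],
    excluded=[field("partial", "Title")],
)
print(query)
```

Pasting a string built this way into the search box should behave like the equivalent selections made in the Advanced form.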
You might also be interested in retrieving a list of genes that have been reported in connection with a certain condition or process. Go to the Gene database and try to identify genes involved in mitochondrial biogenesis in humans.
Q21. What is your search query?
Q22. How many genes did you find?
As a note of caution: Always browse your results, for instance, check if they are curated or predicted entries, if there are alternative spliced transcripts, if they are non-coding, etc.
Go to the UCSC Genome Browser. At the top of the page you see several links; choose Genome Browser. Under Genomes choose Human hg38 (otherwise you may get different results).
Type Factor IX in the search box. You will get a list of matching sequences. Follow the link F9 (ENST00000218099.7).
In the track section, you can configure the page according to your interest, displaying or hiding genomic information.
For now, set the following tracks (set all other tracks to hide):
- Base Position -> dense
- NCBI RefSeq -> full
- OMIM Alleles -> squish
- ENCODE Regulation -> show
- Conservation -> full
Click on refresh.
The first thing you should notice is that there are two F9 records in the NCBI RefSeq genes track.
Q23. Why is that? What is the difference?
Zoom in on the smallest exon until you see the amino acid sequence:
Q24. What is the amino acid sequence of this exon?
A nice feature of this browser is that we can inspect the genomic environment. Zoom out to see other genes in the screen (you may zoom out 100x).
Q25. What human gene is located "to the right" of Factor IX, and what strand is it on (plus or minus)? Is it the same strand as F9?
The OMIM Alleles track shows all variants in the OMIM database that have been associated with dbSNP identifiers (single nucleotide polymorphisms that are common in the population).
Zoom again so you only have F9 displayed.
Q26. Why are they mostly distributed over the exons? You can click on OMIM Allelic Variant Phenotypes to display all the individual variants. Click again and you will condense the information. If you right-click you can choose different display modes.
Now, zoom in on the sixth exon of the first transcript (the one with 8 exons). Right-click on OMIM Alleles and select full.
Q27. What other diseases/phenotypes besides Hemophilia B are related to the variants in this exon? Click on them to learn why they are different from Hemophilia B.
The ENCODE Regulation section shows information relevant to the regulation of transcription at different levels. By default the H3K27Ac Mark track is displayed (this histone modification indicates where enhancer regions may be located).
- Right-click on the track and select Configure Layered H3K27Ac track set ... (there will be 9 tracks)
- Click on TF Clusters (this track shows DNA regions bound by transcription factors, the proteins responsible for modulating gene transcription)
- Select full as Display mode
- Select All as Filter by factor
- Click Submit
Q28. Name some of the transcription factors (the darker the rectangle, the more confidence there is for the specific TF).
Looking at the Conservation section, by default you will see:
- the Cons 100 Verts track, which shows base-wise conservation across 100 vertebrates as computed by PhyloP
- the Multiz Align track, which shows the Multiz alignment of 100 vertebrates in pack mode
Q29. According to the Cons 100 Verts track, what are the most conserved regions? Why are these regions strongly conserved across different organisms?
You can also use the browser to identify a specific sequence. Below is a primer pair designed to amplify one exon from a human gene:
1) TCTCTCCAACTTTGCACTTTTC Forward
2) AAGGCTAAGGTCAGCCATGA Reverse
Try BLAT under Tools to find the location of the sequences. Click on the browser link to visualize the results (zoom out to see both hits).
Q30. Are the primers unique? For both primers, what are the start and end positions in bp? To which strand do they bind?
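When thinking about which strand each primer binds to, it can help to look at the reverse complement of a primer, since the two primers of a pair match opposite strands. The following is a small sketch using only the two primer sequences given above; the function name `reverse_complement` is our own choice for this illustration.

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


forward = "TCTCTCCAACTTTGCACTTTTC"
reverse = "AAGGCTAAGGTCAGCCATGA"

# The reverse primer as it would read on the same strand as the forward primer.
print(reverse_complement(reverse))
```

Searching the genome for the forward primer and for the reverse complement of the reverse primer should place both on the same strand, with the amplified region between them.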
Now run In-Silico PCR using the same sequences and with default parameters.
Q31. What is the name of the gene and the exon number that is supposed to be amplified? Hint: Click on the link to get to the browser.
UCSC also provides text-based access to the genome assemblies and annotation data stored in the Genome Browser.
Let's suppose we would like to get the sequences of all genes in chrX that are transcribed from the positive strand and that overlap with expression data.
Go to the Table Browser under Tools and make sure the following is selected, since we want human sequences from RefSeq:
- clade: mammal
- genome: human
- assembly: Dec. 2013 (GRCh38/hg38)
- group: Genes and predictions
- track: NCBI RefSeq
- table: RefSeq Curated
- Click Submit
Now we need to filter the data. Set the following:
- position: chrX
- filter: create, set strand to +
- intersection: create, then
  - group: expression
  - track: GTEx Gene V8
  - Click Submit
This will select only genes on the positive strand of chrX that overlap with some kind of expression data from GTEx.
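Conceptually, the filter and the intersection amount to a strand test followed by an interval overlap test. Here is a minimal sketch of that logic on invented intervals. The gene names, coordinates and the helpers `overlaps` and `select` are hypothetical, and the sketch assumes half-open coordinates and that any overlap counts, regardless of the strand of the expression record.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int   # 0-based, inclusive
    end: int     # exclusive
    strand: str
    name: str


def overlaps(a, b):
    """True if two half-open intervals on the same chromosome share a base."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def select(genes, expression, chrom="chrX", strand="+"):
    """Keep genes on the given chromosome and strand that overlap expression data."""
    return [
        g.name
        for g in genes
        if g.chrom == chrom
        and g.strand == strand
        and any(overlaps(g, e) for e in expression)
    ]


# Invented toy data, not real annotations.
genes = [
    Interval("chrX", 100, 500, "+", "geneA"),
    Interval("chrX", 600, 900, "-", "geneB"),
    Interval("chrX", 1000, 1500, "+", "geneC"),
    Interval("chr1", 100, 500, "+", "geneD"),
]
expression = [
    Interval("chrX", 400, 700, "+", "expr1"),
    Interval("chr1", 100, 500, "+", "expr2"),
]

selected = select(genes, expression)
print(selected)
```

In this toy data only geneA passes: geneB is on the minus strand, geneC has no overlapping expression record, and geneD is on another chromosome.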
Now, set the output format to sequence. Click on summary/statistics to see how many records matched your criteria.
Q32. How many sequence items are you counting?
You can now download the data by clicking get output, for further studies/processing. But we will skip this for the time being.
If you have information obtained from your own experimental work you may upload it to the UCSC browser. If for instance, you have located novel variants, or novel expressed transcripts it could be interesting to view their location in context of several information resources. Here is a very simple toy example of how to add a custom track to the browser.
- Go to the Genome Browser
- Click on add/manage custom tracks
- Paste the following in the Paste URLs or data box
browser position chr22:20114500-20115500
track name=coords description="Chromosome coordinates list" visibility=2 color=255,0,0
chr22 20114574 20114685
chr22 20114760 20114875
chr22 20114966 20115079
Click Submit.
In the resulting page all your custom tracks will be displayed.
- Check that Genome Browser is selected under view in
- Select the track to display
- Click go to first annotation
You should now see a new track with lines corresponding to the regions listed above in red.
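The three data lines of the custom track are in BED format: chromosome, start and end. As a rough sketch of how such text could be read programmatically, the code below skips the browser and track header lines and computes the length of each region. The function name `parse_bed` is our own, the header lines are shortened for the example, and the length calculation relies on the BED convention that the start is 0-based and the end is not included.

```python
TRACK_TEXT = """\
browser position chr22:20114500-20115500
track name=coords description="Chromosome coordinates list"
chr22 20114574 20114685
chr22 20114760 20114875
chr22 20114966 20115079
"""


def parse_bed(text):
    """Return (chrom, start, end) tuples, ignoring header and blank lines."""
    regions = []
    for line in text.splitlines():
        if not line.strip() or line.startswith(("browser", "track", "#")):
            continue
        chrom, start, end = line.split()[:3]
        regions.append((chrom, int(start), int(end)))
    return regions


regions = parse_bed(TRACK_TEXT)
# BED intervals are half-open, so the length is simply end - start.
lengths = [end - start for _, start, end in regions]
print(regions)
print(lengths)
```

The same approach scales to your own coordinate lists before you paste them into the custom track box.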
Q33. In this case what is the name of the gene overlapping with your custom tracks?
Go to the Ensembl genome browser and choose the human assembly GRCh38.
Search for F9 or directly use its Ensembl ID: ENSG00000101981
Q34. How many hits did you find? What kind of hits?
Select the gene record.
Q35. How many orthologues does this gene have in Ensembl? Was it the same in NCBI?
Q36. How many transcripts/splice variants does this gene have? How reliable are they? Hint: Click on Show transcript table and explore the Flags column.
Q37. What is the length in bp of the protein coding transcripts?
Click on Summary in the left menu. In the viewer locate the Regulatory Build. One of the annotated regulators is CTCF.
Q38. Is this annotation consistent with the UCSC annotation of transcription factors?
Ensembl has some nice features to highlight functional elements in the nucleotide or amino acid sequence.
Click on Sequence in the left menu and scroll down.
Q39. What is the nucleotide sequence for exon no 2 (copy/paste)?
Now click on the transcript F9-201 in the transcript table, and then Exons in the left menu. Scroll down and check how the sequence has now been annotated.
Q40. How many known stop gained variants can you find in exon 2?
Click on Protein in the left menu and scroll down.
Q41. Locate the same amino acid sequence as we previously identified in UCSC for the smallest exon. Is it the same? (copy/paste)
Now let's do some data mining. We will identify sheep genes that are orthologues to human coding genes related to coagulation.
Go to BioMart.
First, let's identify the human coding genes that are related to coagulation.
Select the human database:
- Database: Ensembl Genes 111
- Dataset: Human genes (GRCh38.p14)
Select protein coding genes that are involved in the coagulation process:
- On your left click on Filters
  - Gene -> Gene type -> protein_coding
  - Gene Ontology -> GO Term Name
    - Type: coagulation
Select the information we would like to display:
- On the left menu click on Attributes
  - Features -> External -> GO -> GO Term Definition
By clicking on Results you will see the list of genes, and we can corroborate that they are involved in coagulation by reading the GO term definition.
Q42. How many coding genes are involved in coagulation? Hint: Click on Count
Now, let's extract the sheep orthologues of these genes:
- Under Filters: MULTI SPECIES COMPARISONS -> Homologue filters -> Orthologous Sheep Genes
- Under Attributes: Homologues -> ORTHOLOGUES [P-T] -> Sheep Orthologues
- Select the following:
  - Sheep gene name
  - Sheep chromosome/scaffold name
  - Sheep chromosome/scaffold start (bp)
  - Sheep chromosome/scaffold end (bp)
  - Sheep orthology confidence [0 low, 1 high]
Count and get the results.
Q43. How many genes did you find that were orthologues?
NCBI: Developed by Tore Samuelsson, Marcela Dávila, 2010. Modified by Marcela Dávila and Katarina Truvé, 2017. Updated by Marcela Dávila, 2022
UCSC: Developed by Tore Samuelsson and Marcela Dávila, 2010. Updated by Marcela Dávila, 2022
Ensembl: Developed by Katarina Truvé, 2017. Updated by Marcela Dávila, 2022