B1 II: Examining HIV genes and proteins

Course: VT25 Bioinformatics 1 (SC00037)


The purpose of these exercises is to introduce common procedures in biological sequence alignment and software frequently used.



HIV (Human Immunodeficiency Virus) is a lentivirus (a type of retrovirus), i.e. it has an RNA genome and replicates through a DNA intermediate. The HIV genome contains only 9 genes. A schematic view of the HIV genome is shown above.

For many of the following exercises we will make use of programs that are part of the EMBOSS package (The European Molecular Biology Open Source Software Suite). Use any of these EMBOSS Explorer servers:

Open Reading Frames

Protein synthesis (translation) normally starts at the codon AUG (translated to M). Let's examine the complete nucleotide sequence of the HIV genome: genome_dna.fa. The EMBOSS program plotorf identifies open reading frames that start with such a codon and shows them graphically. Locate the program on one of the servers and run it with default parameters.
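To make the idea of an ORF scan concrete, here is a minimal Python sketch that looks for ATG...stop stretches in the three forward reading frames. It is not how plotorf is implemented; the function name `find_orfs` and the `min_codons` cutoff are assumptions made for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=30):
    """Return (start, end, frame) for ATG...stop ORFs on the forward strand.

    Positions are 1-based and inclusive, and the stop codon is counted
    as part of the ORF. min_codons is an arbitrary illustrative cutoff.
    """
    dna = dna.upper().replace("U", "T")
    orfs = []
    for frame in range(3):
        start = None  # index of the first ATG seen since the last stop codon
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start + 1, i + 3, frame + 1))
                start = None
    return orfs
```

Calling `find_orfs` on the sequence from genome_dna.fa (read in as a plain string) would give a list of candidate ORFs to hold against the plotorf picture. Only the forward strand is scanned here.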

Compare the ORFs you got to these proteins:

Protein name Nucleotide positions
GAG 336...1838
POL 1631...4642
VIF 4587...5165
VPR 5105...5341
VPU 5608...5856
ENV 5771...8341
NEF 8343...8714
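The coordinates in this table already tell you which reading frame each gene uses. The Python sketch below works this out with simple arithmetic. It assumes that frame 1 begins at nucleotide 1 of the genome; the frame labels in the plotorf output may follow a different convention, so check them against the plot.

```python
# Coordinates copied from the table above (1-based, inclusive).
GENES = {
    "GAG": (336, 1838),
    "POL": (1631, 4642),
    "VIF": (4587, 5165),
    "VPR": (5105, 5341),
    "VPU": (5608, 5856),
    "ENV": (5771, 8341),
    "NEF": (8343, 8714),
}

def reading_frame(start):
    """Forward frame (1, 2 or 3), assuming frame 1 begins at nucleotide 1."""
    return (start - 1) % 3 + 1

def codon_count(start, end):
    """Number of codons in the interval, or None if its length is not a multiple of 3."""
    length = end - start + 1
    return length // 3 if length % 3 == 0 else None

def overlaps(a, b):
    """True if two (start, end) intervals share at least one nucleotide."""
    return a[0] <= b[1] and b[0] <= a[1]

for name, (start, end) in GENES.items():
    print(name, "frame", reading_frame(start), "codons", codon_count(start, end))
```

With these coordinates GAG and POL overlap but start in different frames, which is worth keeping in mind when you compare the table with the ORFs drawn by plotorf.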

Q1. Do the predictions from plotorf correspond to the real encoded proteins?

Translation of nucleotide sequences

An mRNA molecule directs the synthesis of proteins. Each codon (=triplet, three base sequence) specifies an amino acid according to the genetic code. sixpack is another program from the EMBOSS package that translates a sequence into its six possible reading frames. Use the program to translate the mRNA of the GAG gene: gag_mrna.fa, using default parameters.
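A six-frame translation can be sketched in a few lines of Python. The code below uses the standard genetic code and is only an illustration of the principle, not sixpack itself; the frame labels F1 to F3 and R1 to R3 are an assumption and may not match the numbering that sixpack uses for the reverse frames.

```python
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons enumerated in T, C, A, G order.
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Reverse complement of a DNA string containing only A, C, G, T."""
    return dna.translate(COMPLEMENT)[::-1]

def translate(dna):
    """Translate complete codons; '*' marks a stop, 'X' an unknown codon."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )

def six_frames(seq):
    """Return the three forward (F1-F3) and three reverse (R1-R3) translations."""
    dna = seq.upper().replace("U", "T")
    rev = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"F{offset + 1}"] = translate(dna[offset:])
        frames[f"R{offset + 1}"] = translate(rev[offset:])
    return frames
```

A frame that encodes a protein typically shows a long run of amino acids without `*`, whereas the other frames are interrupted by stop codons.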

The program will create two output files on the same page: outfile and outseq.

Look at the amino acid sequences for the first 120 nucleotides in the sixpack outfile.

Q2. Is any of the reading frames likely to encode a protein?

This is the GAG protein:

>Gag_protein gi|2801504|gb|AAC82593.1| Gag [Human immunodeficiency virus 1]
MGARASVLSGGELDRWEKIRLRPGGKKKYKLKHIVWASRELERFAVNPGLLETSEGCRQILGQLQPSLQT
GSEELRSLYNTVATLYCVHQRIEIKDTKEALDKIEEEQNKSKKKAQQAAADTGHSNQVSQNYPIVQNIQG
QMVHQAISPRTLNAWVKVVEEKAFSPEVIPMFSALSEGATPQDLNTMLNTVGGHQAAMQMLKETINEEAA
EWDRVHPVHAGPIAPGQMREPRGSDIAGTTSTLQEQIGWMTNNPPIPVGEIYKRWIILGLNKIVRMYSPT
SILDIRQGPKEPFRDYVDRFYKTLRAEQASQEVKNWMTETLLVQNANPDCKTILKALGPAATLEEMMTAC
QGVGGPGHKARVLAEAMSQVTNSATIMMQRGNFRNQRKIVKCFNCGKEGHTARNCRAPRKKGCWKCGKEG
HQMKDCTERQANFLGKIWPSYKGRPGNFLQSRPEPTAPPEESFRSGVETTTPPQKQEPIDKELYPLTSLR
SLFGNDPSSQ

Q3. Comparing this protein sequence with your sixpack results, is the GAG protein encoded by this mRNA? Hint: You can use the needle program from the EMBOSS package.
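needle performs a global (Needleman-Wunsch) alignment of two sequences. The Python sketch below shows the dynamic-programming idea with a deliberately simplified scoring scheme: +1 for a match, -1 for a mismatch and -2 per gap position. These values are assumptions chosen for readability. For proteins, needle itself uses a substitution matrix and separate gap opening and extension penalties, so its scores will differ.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment with a simple linear gap penalty.

    Returns (score, aligned_a, aligned_b).
    """
    def sub(x, y):
        return match if x == y else mismatch

    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]),  # align two residues
                score[i - 1][j] + gap,                          # gap in b
                score[i][j - 1] + gap,                          # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))
```

If one of the sixpack frames encodes the GAG protein, aligning that frame against the protein sequence above should give an alignment with identical residues along its whole length.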

Finding repeats

Repeat regions within the genome can be identified when the sequence is compared to itself. The dottup program performs such comparisons and plots them for visualization. The hiv_ltr_dna.fa sequence has a portion of the HIV genome covering the first and last 500 nucleotides. Run dottup with default parameters.
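A dot plot of this kind marks every position where the two sequences share an identical word. The Python sketch below illustrates the principle and then joins neighbouring off-diagonal hits into runs, whose length approximates the length of a direct repeat. The function names and the word size of 10 are assumptions for illustration; check the dottup options for the word size it actually uses.

```python
from collections import defaultdict

def word_matches(seq_a, seq_b, word_size=10):
    """Return (i, j) 0-based start positions where both sequences share an exact word."""
    index = defaultdict(list)
    for j in range(len(seq_b) - word_size + 1):
        index[seq_b[j:j + word_size]].append(j)
    hits = []
    for i in range(len(seq_a) - word_size + 1):
        for j in index.get(seq_a[i:i + word_size], ()):
            hits.append((i, j))
    return hits

def off_diagonal_runs(hits, word_size):
    """Join off-diagonal hits (i < j) into runs; return (i_start, j_start, length)."""
    hit_set = set(hits)
    runs = []
    for i, j in sorted(hits):
        if i >= j or (i - 1, j - 1) in hit_set:
            continue  # skip the main diagonal, mirror images and run interiors
        steps = 0
        while (i + steps, j + steps) in hit_set:
            steps += 1
        runs.append((i, j, steps + word_size - 1))
    return runs
```

When a sequence is compared with itself, the main diagonal is always filled. A repeat shows up as an additional line parallel to it.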

In the schematic representation of the HIV genome (see above) you can see a region in the 5' terminal part similar to a region in the 3' terminal end.

Q4. Are you able to identify this region in the dottup plot? How long is this repeat (approximately from the graph)?

BLAST and homologues to HIV proteins

Acquired Immune Deficiency Syndrome (AIDS) is caused by two closely related viruses, Human Immunodeficiency Virus type 1 (HIV-1) and Human Immunodeficiency Virus type 2 (HIV-2). HIV-1 is responsible for the global pandemic, while HIV-2 has, until recently, been restricted to West Africa and appears to be less virulent in its effects. Viruses related to HIV have been found in many species of non-human primates (monkeys, apes, ...) and have been named Simian Immunodeficiency Virus, SIV.

To identify similar proteins in other organisms we use BLAST, which you can reach at NCBI. By default BLAST is configured so that it only gives you the best hits (Expect threshold = 0.05). To find distantly related sequences you need to raise this threshold, which makes the search less stringent.

Set up the following:

  • Query Sequence: rev_prot.fa (the Rev protein of HIV, a regulator of viral gene expression)
  • Database: Swiss-Prot
  • Under Algorithm parameters:
    • Max target sequences: 5000
    • Expect threshold: 10
  • Click on BLAST

You should find numerous hits that correspond to REV proteins from different HIV-1 isolates. Most have good E-values (smaller than 0.001). Compare the first REV protein hit and the last REV protein hit.
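The E-value is the number of hits with at least a given score that you would expect to see by chance in a database of that size. A common approximation relates it to the bit score S' as E = m * n * 2^(-S'), where m is the query length and n the total database length. The Python sketch below uses this approximation to show how the threshold translates into a minimum bit score. The query and database sizes are made-up round numbers, and BLAST itself works with corrected (effective) lengths, so treat the output as an order-of-magnitude illustration only.

```python
import math

def expect_value(bit_score, query_length, database_length):
    """Approximate E-value: m * n * 2**(-S'), with S' the bit score."""
    return query_length * database_length * 2.0 ** (-bit_score)

def bit_score_needed(e_threshold, query_length, database_length):
    """Bit score at which the approximate E-value equals the threshold."""
    return math.log2(query_length * database_length / e_threshold)

# Illustrative sizes only (assumptions, not values read from NCBI).
QUERY_LENGTH = 116
DATABASE_LENGTH = 200_000_000

for threshold in (0.05, 10):
    needed = bit_score_needed(threshold, QUERY_LENGTH, DATABASE_LENGTH)
    print(f"Expect threshold {threshold}: bit score of about {needed:.1f} or more")
```

Raising the threshold from 0.05 to 10 lowers the required bit score by log2(200), a little under 8 bits, which is why weaker and more distant matches start to appear in the hit list.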

Q5. List some differences

You may also find hits to REV proteins from human HIV-2 and monkey viruses named Simian immunodeficiency virus (SIV).

Q6. How many hits to HIV-2 REV and SIV REV do you find? Hint: you can order the descriptions or look at the Taxonomy tab

Multiple alignments - Origin of HIV

env_prot.fa has 13 different protein sequences from isolates of HIV-1, HIV-2, chimpanzee (SIVCZ) and macaque monkey (SIVM1 and SIVML).

Go to EBI. Click on Find data resources and locate Clustal Omega, the web interface of a multiple sequence alignment program. Upload the env_prot.fa file. Check that you are selecting protein as format and run with default parameters.

Q7. You should be able to identify two major groups. List the sequence names from each group
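One way to make the grouping less subjective is to compute the percent identity between every pair of aligned sequences. The Python sketch below does this for sequences that have already been aligned (all of the same length, with "-" for gaps). The choice to ignore columns where either sequence has a gap is an assumption; the identity matrix reported by Clustal Omega may be calculated differently.

```python
def percent_identity(a, b):
    """Percent identical residues over columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)

def identity_matrix(alignment):
    """Map (name_a, name_b) -> percent identity for a dict of aligned sequences."""
    names = list(alignment)
    return {
        (p, q): percent_identity(alignment[p], alignment[q])
        for p in names
        for q in names
    }
```

Sequences that belong to the same group should show clearly higher identities with each other than with members of the other group.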

Click on the Phylogenetic Tree tab. You will see the phylogenetic tree (evolutionary order) of your sequences, both as text and as a phylogram. The tree is a little difficult to read, so click on the node highlighted in the picture below and select Reroot on this node.

Q8. What does this tree tell us about the phylogenetic relationship of HIV-1, HIV-2 and SIV?

Multiple alignments - HIV drug resistance

A number of drugs against HIV have been developed. One example is AZT, which acts as an inhibitor of the reverse transcriptase (RT) encoded by the HIV genome. AZT binds to the active site of the RT and as a result blocks its polymerase activity. However, the mutation frequency of the HIV genome is very high, and resistance to AZT develops easily. This typically occurs by changing amino acids close to the active site so that the affinity for AZT is reduced.

The rt_isolates.fa file contains amino acid sequences of the RT from AZT resistant as well as sensitive strains.

Use Clustal Omega again and make a multiple alignment of the RT isolates.

Examine the alignment closely. You should be able to identify two amino acid positions that have been mutated in all the AZT resistant strains but NOT in the sensitive strain. These mutations are responsible for their resistance to the drug.
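The same search can be expressed as a small Python sketch: walk through the alignment column by column and report the columns where every resistant sequence differs from the sensitive one. The function name and the toy sequences in the usage note are made up for illustration and are not RT data. Positions are counted along the sensitive sequence with gaps skipped, so they correspond to the numbering of that sequence in the file.

```python
def resistance_columns(sensitive, resistant):
    """Find positions where all resistant sequences differ from the sensitive one.

    sensitive: aligned sequence of the sensitive strain
    resistant: list of aligned sequences of resistant strains (same length)
    Returns (position, sensitive_residue, resistant_residues) tuples, where
    position is 1-based in the ungapped sensitive sequence.
    """
    results = []
    position = 0  # residue number in the sensitive sequence (gaps not counted)
    for col, ref in enumerate(sensitive):
        if ref == "-":
            continue
        position += 1
        column = [seq[col] for seq in resistant]
        if all(res != ref and res != "-" for res in column):
            results.append((position, ref, "".join(column)))
    return results
```

For example, with the invented sequences `resistance_columns("MKTLY", ["MKALY", "MKSLF", "MKALF"])` only position 3 is reported, because at position 5 one of the resistant sequences still carries the sensitive residue.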

Q9. What are these positions and what are the amino acid changes?

HIV-1 RT functional domains

Now let's identify the functional domains of the RT isolates, in which we just identified mutations that may give drug resistance.

Go to InterPro, a database that classifies proteins into families and predicts their domains using predictive models. Analyze the sensitive strain by pasting its sequence into the Search by sequence box and leaving all default values as they are.

Q10. From the resulting information, what is the function of the RT protein?

Click on the Entries tab. Click on the Select your database drop down menu and select the Pfam database.

Q11. How many Pfam domains did you find? What are their roles?

Let's investigate the Reverse transcriptase thumb domain:

  • Click on the Family RVT_thumb link
  • Go to the Alignment section
  • Under Available alignments, select Seed

The seed sequences are selected sequences used to generate statistical models which in turn will be used to scan sequences for this domain.
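To get a feeling for what such a statistical model captures, the Python sketch below turns a small alignment into per-column residue frequencies and reports the most conserved column. This is a strong simplification: Pfam models are profile hidden Markov models, which also describe insertions and deletions and do not rely on raw counts alone. The function names and the toy alignment in the usage note are assumptions for illustration.

```python
from collections import Counter

def column_profile(seed_alignment):
    """Per-column residue frequencies for a list of aligned sequences, ignoring gaps."""
    profile = []
    for column in zip(*seed_alignment):
        counts = Counter(res for res in column if res != "-")
        total = sum(counts.values())
        profile.append({res: n / total for res, n in counts.items()} if total else {})
    return profile

def most_conserved(profile):
    """Return (column_index, residue, frequency) for the highest single-residue frequency.

    If several columns tie, the first one is returned.
    """
    return max(
        ((i, res, freq) for i, col in enumerate(profile) for res, freq in col.items()),
        key=lambda item: item[2],
    )
```

For the invented alignment `["ACDW", "GC-W", "GCEY"]` the second column contains only C, so `most_conserved` returns column index 1 with a frequency of 1.0.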

Q12. What organisms were included for this purpose?

Q13. Just by eye, can you identify any conserved region? If you do, which sequence would it be?

Click on Signature in the left-side menu.

Q14. What is the most conserved amino acid? Was this amino acid in the sequence you suggested above? What is the probability of that residue? Hint: Click on the amino acid (in the graph)

Now let’s look at the species distribution of RNase_H and RVT_connect. Go back to the main results page. Select RNase_H and click on Taxonomy in the left menu. You will see an interactive Krona plot (a chart of hierarchical data that shows the abundance of organisms). Do the same for the Reverse transcriptase connection domain (RVT_connect).

Q15. Are they distributed among the same species? Why?

HIV-1 RT structure

This is an animation that describes, in a simple manner, the life cycle of HIV and explains how the virus may be battled through inhibition of critical mechanisms.

We will focus on the Reverse Transcriptase (RT). Let's identify the key elements that are targeted to generate treatments for an HIV infection and understand how HIV responds by developing resistance to these drugs.

1RTD is an X-ray crystallography structure of HIV-1 RT in complex with DNA. Open the link; it will take you to the Structure database at NCBI. In the Molecular Graphic window you will find a link to launch the full-featured 3D viewer. Follow the link to see the whole structure displayed.

Select:

Style -> Surface Type -> Molecular Surface

Look at the structure from different angles by dragging with the mouse. You will identify the following:

  • chainA in light gray (1RTD_A). It is composed of two domains:
    • the polymerase domain: that catalyzes the polymerization of a complementary DNA strand using an RNA template
    • the RNase H domain: that catalyzes degradation of the RNA template
  • chainB in yellow (1RTD_B). Composed only of a polymerase domain
  • In blue and pink you see the DNA-RNA complex (1RTD_E and 1RTD_F)
  • In green, a magnesium ion (Mg) can be spotted

To identify other components, select:

Style -> Remove Surface

and then:

Style -> Proteins -> Hide

Now you can easily see the DNA-RNA complex together with:

  • Four magnesium ions in green (1RTD_MG, 1RTD_MG2, 1RTD_MG2 and 1RTD_MG4) that are needed to stabilize the structural conformation
  • A thymine (T) nucleotide that will be incorporated into the DNA by this machinery

Let's put back chainA.

In the Sequences and Annotations menu, under Proteins, select Protein 1RTD_A and then:

Style -> Protein -> Ribbon

You can see how the DNA-RNA complex sits along the protein, guiding it to incorporate the thymine.

Let's focus on the hand:

Style -> Protein -> Hide

To highlight it:

Select -> Advanced

In the new window fill in with the following values:

Select: .A:1-324
Name:   polymerase 

Click on Save Selection and then:

Style -> Protein -> Ribbon
Color -> Unicolor -> Red

To highlight the thumb, create a new selection under Select -> Advanced:

Select: .A:245-324
Name:   thumb

Click on Save Selection and then:

Style -> Protein -> Ribbon
Color -> Unicolor -> Yellow

Rotate the figure until you see the hand as in the figure:

Let's add some catalytic aspartates that are critical for the polymerase function. Under Select -> Advanced, create a new selection:

Select: .A:110,185,186
Name:   aspartates

Click on Save Selection and then:

Style -> Protein -> Sphere
Color -> Unicolor -> Cyan

And finally, a key tyrosine that stabilizes the template-primer with a hydrogen bond. Under Select -> Advanced, create a new selection:

Select: .A:183
Name:   tyrosine

Click on Save Selection and then:

Style -> Protein -> Sphere
Color -> Unicolor -> Grey

All these key elements are situated close to each other, coordinating the elongation of the DNA-RNA complex.
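"Close to each other" can also be checked numerically. The Python sketch below reads C-alpha coordinates from a structure file in the legacy fixed-column PDB format and measures distances between residues in Ångström. It assumes that you have downloaded such a file for 1RTD yourself and that the residue numbers in the file match those used in the selections above; both are assumptions, and the function names are illustrative.

```python
import math

def ca_coordinates(pdb_text, chain):
    """Map residue number -> (x, y, z) of the C-alpha atom for one chain.

    Expects legacy PDB format, where ATOM records use fixed columns.
    """
    coords = {}
    for line in pdb_text.splitlines():
        if (
            line.startswith("ATOM")
            and line[12:16].strip() == "CA"
            and line[21] == chain
        ):
            resnum = int(line[22:26])
            xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            coords.setdefault(resnum, xyz)  # keep the first alternate location
    return coords

def distance(coords, res_a, res_b):
    """Euclidean distance between the C-alpha atoms of two residues."""
    return math.dist(coords[res_a], coords[res_b])

def pairwise_distances(coords, residues):
    """Distances for every pair of the given residue numbers."""
    return {
        (a, b): distance(coords, a, b)
        for i, a in enumerate(residues)
        for b in residues[i + 1:]
    }
```

After reading the file contents into a string, `pairwise_distances(ca_coordinates(text, "A"), [110, 183, 185, 186])` would list the distances between the aspartates and the tyrosine highlighted above.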

In a previous exercise, you aligned RT sequences from AZT resistant strains using Clustal Omega. You identified two residues that are mutated in all three AZT resistant isolates. Highlight these positions in the structure.

Q16. Take a screenshot of the structure showing all relevant elements and add it to your report.

You will see that these mutations are located where the main process, reverse transcription, takes place. Because the effect of the drug depends on the amino acids at this site, the virus is resistant to the drug when these mutations are present.



Developed by Tore Samuelsson and Marcela Dávila, 2010. Modified by Marcela Dávila, 2017. Modified by Marcela Dávila, 2019. Updated by Marcela Dávila, 2022. Updated by Marcela Dávila, 2023.
