DNA, RNA & Protein Basics

Module 1.1: DNA, RNA & Protein Basics
Prerequisites: None
Next: Biostatistics Refresher


1. Concept & Motivation

All bioinformatics workflows start with the central dogma of molecular biology:

  1. DNA stores the genetic information.
  2. RNA is the transcribed intermediate.
  3. Proteins are the translated functional products.

In this tutorial you will:

  • Generate or load a DNA sequence
  • Compute base‐composition statistics
  • Transcribe DNA → RNA
  • Translate DNA → Protein
  • Interpret the numeric and biological meaning of each step

2. Data Description

We’ll work with a synthetic DNA sequence of 300 bp. This ensures everyone can follow along without downloading large files.


3. Hands-On Code

3.1 Generate a Random DNA Sequence

from Bio.Seq import Seq
import random

# 1) Create a 300-bp random DNA string
bases = ["A", "C", "G", "T"]
dna_str = "".join(random.choices(bases, k=300))

# Wrap in a Biopython Seq object
dna = Seq(dna_str)

# Preview first 60 nt
print("DNA (1–60):", dna[:60], "…")

Output:

DNA (1–60): CCTCAGTAACCGAACTGATAACGAAGTCAAGCCGAAACGTGCTAGGACATGACTTCGGCA …
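Because the sequence is random, each run prints a different preview. If a repeatable sequence is wanted, one option is to draw from a seeded generator. The sketch below uses only the standard library; the seed value 42 and the names rng and rng_again are arbitrary choices for illustration, not part of the original tutorial.

```python
import random

# Assumption: 42 is an arbitrary seed chosen only for illustration
rng = random.Random(42)
bases = ["A", "C", "G", "T"]
dna_str = "".join(rng.choices(bases, k=300))

# A second generator with the same seed reproduces the same sequence
rng_again = random.Random(42)
dna_str_again = "".join(rng_again.choices(bases, k=300))

print("Length:", len(dna_str))
print("Reproducible:", dna_str == dna_str_again)
```

The resulting string can be wrapped in Seq(dna_str) exactly as above.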

3.2 Compute Base Composition

from Bio.Seq import Seq
import random
from collections import Counter
import pandas as pd

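# Generate the DNA
# Note: this draws a fresh random sequence, so it differs from the one
# previewed in 3.1 (which is why the outputs in 3.3 and 3.4 do not match that preview)
bases = ["A", "C", "G", "T"]
dna = Seq("".join(random.choices(bases, k=300)))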

# Count each base
counts = Counter(dna)
total = sum(counts.values())

# Build a DataFrame of counts and percentages
comp = {
    base: {"Count": cnt, "Percentage": cnt/total*100}
    for base, cnt in counts.items()
}
df_comp = pd.DataFrame.from_dict(comp, orient="index")
df_comp

Output:

(Image: DataFrame listing the Count and Percentage for each base.)
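The per-base counts can be collapsed into a single GC percentage. The following is a minimal sketch in plain Python; the helper name gc_percent is hypothetical, and the example string is simply the first 20 nt of the preview printed in 3.1.

```python
from collections import Counter


def gc_percent(seq: str) -> float:
    """Return the percentage of bases in seq that are G or C."""
    counts = Counter(seq.upper())
    total = sum(counts.values())
    if total == 0:
        return 0.0  # avoid dividing by zero on an empty sequence
    return (counts["G"] + counts["C"]) / total * 100


# First 20 nt of the preview shown in 3.1
example = "CCTCAGTAACCGAACTGATA"
print(f"GC content: {gc_percent(example):.1f}%")  # 9 of 20 bases are G or C
```

Passing str(dna) instead of the example string should give the GC content of the full 300-bp sequence.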

3.3 Transcription (DNA → RNA)

# Transcribe DNA into RNA
rna = dna.transcribe()
print("RNA (1–60):", rna[:60], "…")

What happens: Every T is replaced with U. The input is treated as the coding strand, so the result reads like the corresponding mRNA.

Output:

RNA (1–60): GCAGCUCUCGUGACUCAACCAUCGUGCAUGGCUGAUGAGGAUCUCGUUUUCCCGUGUUGG …
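To make the substitution concrete, here is a plain-Python sketch of the same idea. The function names transcribe and back_transcribe are illustrative only, and the short DNA string is inferred by reversing the first 12 nt of the RNA output above.

```python
def transcribe(dna: str) -> str:
    """Coding-strand DNA to RNA: replace every T with U."""
    return dna.upper().replace("T", "U")


def back_transcribe(rna: str) -> str:
    """RNA back to coding-strand DNA: replace every U with T."""
    return rna.upper().replace("U", "T")


# DNA inferred from the first 12 nt of the RNA output above
coding_dna = "GCAGCTCTCGTG"
rna_fragment = transcribe(coding_dna)
print(rna_fragment)
print(back_transcribe(rna_fragment) == coding_dna)
```

Only the letters change; the length and order of the sequence stay the same.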

3.4 Translation (DNA → Protein)

# Translate DNA into Protein (stop at first stop codon)
protein = dna.translate(to_stop=True)
print("Protein (1–30):", protein[:30], "…")
print("Protein length:", len(protein), "amino acids")

What happens: The sequence is read in codons (3 nt → 1 aa) until a stop codon is encountered.

Output:

Protein (1–30): AALVTQPSCMADEDLVFPCWVSWGSFAP …
Protein length: 28 amino acids
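The codon-by-codon reading can be sketched without Biopython. This is a simplified illustration rather than how Biopython implements translate(): it builds the standard genetic code from a compact 64-letter string, reads only frame 1, and raises a KeyError on anything other than A, C, G, T. The names CODON_TABLE and translate are assumptions made for this example.

```python
from itertools import product

# Standard genetic code, with codons ordered by T, C, A, G at each position
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str, to_stop: bool = False) -> str:
    """Translate frame 1 of a DNA string; '*' marks a stop codon."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    protein = []
    for i in range(0, usable, 3):
        aa = CODON_TABLE[dna[i:i + 3]]  # KeyError for ambiguous bases such as N
        if aa == "*" and to_stop:
            break
        protein.append(aa)
    return "".join(protein)


# First 33 nt of the sequence behind the RNA output in 3.3 (U written as T)
print(translate("GCAGCTCTCGTGACTCAACCATCGTGCATGGCT"))
print(translate("ATGTAAGGG", to_stop=True))
```

The first call should reproduce the opening residues of the protein shown above; the second stops at the TAA codon.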

4. Interpretation & Discussion

  • Base composition: GC content (the percentage of bases that are G or C) influences DNA stability and sequencing bias.
  • RNA sequence: Shows how transcription replaces every T with U.
  • Protein: The length (len(protein)) tells you how many amino acids are encoded before the first stop codon. In a random sequence, early stops are expected by chance; in real coding sequences, premature stops or unexpected amino acids can indicate frameshifts or sequencing errors.
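As a rough sanity check on the protein length, the arithmetic below estimates how soon a stop codon should appear in a uniformly random sequence. It assumes each base is drawn independently with equal probability (as random.choices does here) and ignores the fact that the sequence ends after 100 codons, so treat the numbers as approximate.

```python
# 3 of the 64 codons (TAA, TAG, TGA) are stops in the standard genetic code
p_stop = 3 / 64

# Mean number of sense codons before the first stop (geometric distribution)
expected_before_stop = (1 - p_stop) / p_stop

# Probability that none of the 100 codons in a 300-bp sequence is a stop
n_codons = 300 // 3
p_no_stop = (1 - p_stop) ** n_codons

print(f"Expected codons before first stop: {expected_before_stop:.1f}")
print(f"P(no stop in {n_codons} codons): {p_no_stop:.4f}")
```

Under these assumptions, a protein of 28 amino acids, as in the output above, is in line with what chance alone would produce.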

5. Exercises

  1. Vary the sequence length (e.g. 500 bp, 1000 bp) and observe how the base percentages converge towards 25% each.
  2. Inject a known motif (e.g. TATAAT) at positions 50 – 55, then search for it in the DNA and RNA.
  3. Translate without to_stop to see internal stop codons.
dna.translate()

Output:

Seq('AALVTQPSCMADEDLVFPCWVSWGSFAP*TGPPSLRWCSRIWSAQIATVEN*VM...SST')
  4. Load a real gene from UniProt or NCBI and compare its composition to your synthetic example.
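For exercise 2, one possible starting point is sketched below in plain Python. It assumes that "positions 50 – 55" are 1-based and inclusive, which corresponds to the Python slice [49:55]; the seed and the helper name find_all are illustrative choices, not part of the tutorial.

```python
import random

rng = random.Random(0)  # arbitrary seed, for illustration only
dna_str = "".join(rng.choices("ACGT", k=300))

motif = "TATAAT"
start = 49  # 0-based index of 1-based position 50 (assumed convention)

# Overwrite six bases so that the sequence length stays at 300
dna_with_motif = dna_str[:start] + motif + dna_str[start + len(motif):]


def find_all(seq: str, pattern: str) -> list:
    """Return the 0-based start index of every occurrence of pattern."""
    return [
        i for i in range(len(seq) - len(pattern) + 1)
        if seq.startswith(pattern, i)
    ]


# In RNA the motif reads UAUAAU, so search for the transcribed form
rna_with_motif = dna_with_motif.replace("T", "U")
rna_motif = motif.replace("T", "U")

print("DNA hits:", find_all(dna_with_motif, motif))
print("RNA hits:", find_all(rna_with_motif, rna_motif))
```

Collecting every hit, rather than only the first, matters because a random sequence can contain the motif elsewhere by chance.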