DNA RNA Protein Basics - iffatAGheyas/bioinformatics-tutorial-wiki GitHub Wiki
Module 1.1: DNA, RNA & Protein Basics
Prerequisites: None
Next: Biostatistics Refresher
1. Concept & Motivation
Most bioinformatics workflows build on the central dogma of molecular biology:
- DNA stores the genetic information.
- RNA is the transcribed intermediate.
- Proteins are the translated functional products.
In this tutorial you will:
- Generate or load a DNA sequence
- Compute base‐composition statistics
- Transcribe DNA → RNA
- Translate DNA → Protein
- Interpret the numeric and biological meaning of each step
2. Data Description
We’ll work with a synthetic DNA sequence of 300 bp. This ensures everyone can follow along without downloading large files.
3. Hands-On Code
3.1 Generate a Random DNA Sequence
from Bio.Seq import Seq
import random
# 1) Create a 300-bp random DNA string
bases = ["A", "C", "G", "T"]
dna_str = "".join(random.choices(bases, k=300))
# Wrap in a Biopython Seq object
dna = Seq(dna_str)
# Preview first 60 nt
print("DNA (1–60):", dna[:60], "…")
Output:
DNA (1–60): CCTCAGTAACCGAACTGATAACGAAGTCAAGCCGAAACGTGCTAGGACATGACTTCGGCA …
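Because random.choices draws a new sequence on every run, your preview will not match the output shown above. One way to make runs repeatable is sketched below, assuming a dedicated `random.Random` generator; the helper name `random_dna` and the seed value 42 are illustrative and not part of the original tutorial.

```python
import random

def random_dna(length, seed=None):
    """Return a random DNA string of the given length (hypothetical helper)."""
    rng = random.Random(seed)  # independent generator; leaves the global state alone
    return "".join(rng.choices("ACGT", k=length))

# The same seed always yields the same 300-bp sequence
dna_str = random_dna(300, seed=42)
print("DNA (1–60):", dna_str[:60], "…")
```

The resulting string can be wrapped with `Seq(dna_str)` exactly as in the code above.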
3.2 Compute Base Composition
from Bio.Seq import Seq
import random
from collections import Counter
import pandas as pd
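# Generate the DNA again so this cell runs on its own.
# Note: this draws a NEW random sequence, so it will not match the preview in 3.1;
# sections 3.3 and 3.4 use whichever `dna` was created most recently.
bases = ["A", "C", "G", "T"]
dna = Seq("".join(random.choices(bases, k=300)))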
# Count each base
counts = Counter(dna)
total = sum(counts.values())
# Build a DataFrame of counts and percentages
comp = {
base: {"Count": cnt, "Percentage": cnt/total*100}
for base, cnt in counts.items()
}
df_comp = pd.DataFrame.from_dict(comp, orient="index")
df_comp
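Output: a table with one row per base and the columns Count and Percentage. The values differ on every run because the sequence is random.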
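The same counting step can be sketched without pandas to see how the percentages behave as the sequence grows. This is a minimal illustration, not the tutorial's own method: the function name `base_percentages`, the seed and the three lengths are assumptions chosen for the example. Reporting bases in a fixed A, C, G, T order also keeps a base with zero occurrences in the result.

```python
import random
from collections import Counter

def base_percentages(seq):
    """Return {base: percentage} for A, C, G and T, in that fixed order."""
    counts = Counter(seq)
    total = len(seq)
    return {base: counts[base] / total * 100 for base in "ACGT"}

rng = random.Random(0)  # arbitrary seed, used only so reruns give the same numbers
for length in (300, 3000, 30000):
    seq = "".join(rng.choices("ACGT", k=length))
    pct = base_percentages(seq)
    print(length, {base: round(p, 1) for base, p in pct.items()})
```

With uniformly drawn bases, longer sequences should report percentages closer to 25% each.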
3.3 Transcription (DNA → RNA)
# Transcribe DNA into RNA
rna = dna.transcribe()
print("RNA (1–60):", rna[:60], "…")
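What happens: Every T is replaced with U. Biopython treats the input as the coding (sense) strand, so the result reads like the mRNA; no complementing or reversing takes place.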
Output:
RNA (1–60): GCAGCUCUCGUGACUCAACCAUCGUGCAUGGCUGAUGAGGAUCUCGUUUUCCCGUGUUGG …
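To make the T → U replacement concrete, here is a dependency-free sketch. It also shows that building the mRNA by base-pairing against the template (antisense) strand gives the same result as the coding-strand shortcut. The function names and the 12-nt example string are illustrative assumptions, not Biopython's API.

```python
# Lookup table that maps each DNA base to its complement
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def transcribe(coding_dna):
    """Coding-strand shortcut: replace every T with U."""
    return coding_dna.replace("T", "U")

def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]

def transcribe_from_template(template_dna):
    """Build the mRNA by pairing against the template strand."""
    return transcribe(reverse_complement(template_dna))

coding = "GCAGCTCTCGTG"
template = reverse_complement(coding)      # CACGAGAGCTGC
print(transcribe(coding))                  # GCAGCUCUCGUG
print(transcribe_from_template(template))  # GCAGCUCUCGUG
```

Both routes agree, which is why working directly on the coding strand is a convenient shortcut.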
3.4 Translation (DNA → Protein)
# Translate DNA into Protein (stop at first stop codon)
protein = dna.translate(to_stop=True)
print("Protein (1–30):", protein[:30], "…")
print("Protein length:", len(protein), "amino acids")
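What happens: Starting at the first base, the sequence is read in non-overlapping codons (3 nt → 1 aa) using the standard genetic code until the first stop codon, which is not included in the output. No start codon is required, and a 300-bp sequence can yield at most 100 amino acids.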
Output:
Protein (1–30): AALVTQPSCMADEDLVFPCWVSWGSFAP …
Protein length: 28 amino acids
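The codon-by-codon reading can be sketched in plain Python. This is a simplified stand-in for illustration, not Biopython's implementation: it assumes the standard genetic code, unambiguous A/C/G/T input, and silently ignores any trailing bases that do not fill a codon. The names `CODON_TABLE` and `translate` are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Amino acids of the standard genetic code, listed for codons in TCAG order
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna, to_stop=False):
    """Translate DNA in frame 1; '*' marks a stop codon."""
    protein = []
    usable = len(dna) - len(dna) % 3  # drop an incomplete final codon
    for i in range(0, usable, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*" and to_stop:
            break  # stop codon itself is not added to the protein
        protein.append(aa)
    return "".join(protein)

print(translate("ATGGCCTAAGGG"))                # MA*G
print(translate("ATGGCCTAAGGG", to_stop=True))  # MA
```

A dictionary lookup per codon keeps the logic visible; for real data, Biopython additionally handles ambiguous bases and alternative genetic codes.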
4. Interpretation & Discussion
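- Base composition: GC content, (G + C) / total bases, influences DNA stability (G-C pairs form three hydrogen bonds, A-T pairs two) and sequencing bias. In a uniformly random sequence each base should sit near 25%.
- RNA sequence: Shows how transcription swaps T → U while leaving the other bases unchanged.
- Protein: The length, len(protein), tells you how many amino acids are encoded before the first stop codon. In a random sequence, 3 of the 64 codons are stops, so a stop appears by chance about once every 21 codons. In a real coding sequence, unexpected residues or premature stops can indicate frameshifts or sequencing errors.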
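GC content is not computed by the code in section 3.2, so a minimal sketch is given here. The function name `gc_content` is an assumption made for this example; recent Biopython releases also ship a helper for this in `Bio.SeqUtils`, so check the documentation for your installed version.

```python
def gc_content(seq):
    """Return the fraction of G and C bases in seq (0.0 for an empty sequence)."""
    seq = str(seq).upper()  # accept plain strings or Seq-like objects
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)

# 4 G + 4 C out of 12 bases
print(f"{gc_content('GCAGCTCTCGTG'):.1%}")  # 66.7%
```

For a 300-bp random sequence the value should land near 50%, with the run-to-run spread shrinking as the length grows.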
5. Exercises
- Vary the sequence length (e.g. 500 bp, 1000 bp) and observe how percentages stabilise.
- Inject a known motif (e.g. TATAAT) at positions 50–55, then search for it in the DNA and in the RNA, where it appears as UAUAAU.
- Translate without to_stop to see internal stop codons, which are shown as *:
dna.translate()
Output:
Seq('AALVTQPSCMADEDLVFPCWVSWGSFAP*TGPPSLRWCSRIWSAQIATVEN*VM...SST')
- Load a real gene from UniProt or NCBI and compare its composition to your synthetic example.
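As a starting point for the motif exercise, the sketch below overwrites six bases with the motif and then lists every match. It is one possible approach under stated assumptions: positions are 1-based as in the exercise text, the helper names `inject_motif` and `find_all` are hypothetical, and plain strings are used instead of Seq objects.

```python
import random

def inject_motif(dna, motif, start):
    """Overwrite dna with motif, beginning at 1-based position start."""
    i = start - 1  # convert to a 0-based index
    if i < 0 or i + len(motif) > len(dna):
        raise ValueError("motif does not fit at that position")
    return dna[:i] + motif + dna[i + len(motif):]

def find_all(seq, motif):
    """Return the 1-based start of every match, including overlapping ones."""
    hits = []
    i = seq.find(motif)
    while i != -1:
        hits.append(i + 1)
        i = seq.find(motif, i + 1)
    return hits

rng = random.Random(1)  # arbitrary seed
dna = "".join(rng.choices("ACGT", k=300))
dna = inject_motif(dna, "TATAAT", start=50)  # occupies positions 50-55
rna = dna.replace("T", "U")

print(find_all(dna, "TATAAT"))  # contains 50; chance matches elsewhere are possible
print(find_all(rna, "UAUAAU"))  # same positions, motif written with U
```

Searching the RNA for TATAAT returns nothing, because every T has become U.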