DNA RNA Protein Basics - iffatAGheyas/bioinformatics-tutorial-wiki GitHub Wiki

Module 1.1: DNA, RNA & Protein Basics
Prerequisites: None
Next: Biostatistics Refresher

1. Concept & Motivation

Most bioinformatics workflows build on the central dogma of molecular biology:

DNA stores the genetic information.
RNA is the transcribed intermediate.
Proteins are the translated functional products.

In this tutorial you will:

Generate or load a DNA sequence
Compute base‐composition statistics
Transcribe DNA → RNA
Translate DNA → Protein
Interpret the numeric and biological meaning of each step

2. Data Description

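We’ll work with a synthetic, randomly generated DNA sequence of 300 bp, so everyone can follow along without downloading large files. Because the sequence is random and no seed is set, your outputs will differ from the ones shown here.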

3. Hands-On Code

3.1 Generate a Random DNA Sequence

from Bio.Seq import Seq
import random

# Create a 300-bp random DNA string
bases = ["A", "C", "G", "T"]
dna_str = "".join(random.choices(bases, k=300))

# Wrap in a Biopython Seq object
dna = Seq(dna_str)

# Preview first 60 nt
print("DNA (1–60):", dna[:60], "…")

Output:

DNA (1–60): CCTCAGTAACCGAACTGATAACGAAGTCAAGCCGAAACGTGCTAGGACATGACTTCGGCA …
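
The sequence above changes on every run because the random generator is not seeded. A minimal sketch of a reproducible variant follows; the helper name `random_dna` and the seed value 42 are illustrative assumptions, not part of the original tutorial.

```python
import random

def random_dna(length, seed=None):
    """Return a random DNA string; the same seed always gives the same sequence."""
    rng = random.Random(seed)  # private generator, leaves the global random state untouched
    return "".join(rng.choices("ACGT", k=length))

# Hypothetical usage: a 300-bp sequence that is identical on every run
dna_str = random_dna(300, seed=42)
print("Length:", len(dna_str))
```

The resulting string can be wrapped with `Seq(dna_str)` exactly as in the cell above.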

3.2 Compute Base Composition

from Bio.Seq import Seq
import random
from collections import Counter
import pandas as pd

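# Regenerate the DNA so this cell runs on its own.
# Note: this is a new random sequence, not the one previewed in 3.1;
# the outputs in 3.3 and 3.4 are derived from this one.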
bases = ["A", "C", "G", "T"]
dna = Seq("".join(random.choices(bases, k=300)))

# Count each base
counts = Counter(dna)
total = sum(counts.values())

# Build a DataFrame of counts and percentages
comp = {
    base: {"Count": cnt, "Percentage": cnt/total*100}
    for base, cnt in counts.items()
}
df_comp = pd.DataFrame.from_dict(comp, orient="index")
df_comp

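Output: a table with one row per base (A, C, G, T) and two columns, Count and Percentage. The values depend on the random sequence generated.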

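The per-base counts can be reduced to a single GC-content figure. The sketch below uses plain Python rather than Biopython, and the function name `gc_content` is an assumption made for illustration. Biopython also ships GC helpers in `Bio.SeqUtils`, but their names vary by version.

```python
from collections import Counter

def gc_content(seq):
    """Return GC content as a percentage of the A/C/G/T bases in seq."""
    counts = Counter(str(seq).upper())
    acgt = sum(counts[b] for b in "ACGT")  # ignore anything that is not A, C, G or T
    if acgt == 0:
        return 0.0
    return (counts["G"] + counts["C"]) / acgt * 100

# Hypothetical usage on a short toy sequence
print("GC %:", gc_content("ATGCGC"))
```

Passing the tutorial's `dna` object also works, because the function converts its input to a string first.
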
3.3 Transcription (DNA → RNA)

# Transcribe DNA into RNA
rna = dna.transcribe()
print("RNA (1–60):", rna[:60], "…")

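What happens: transcribe() treats the input as the coding strand and replaces every T with U, so the result reads like the mRNA. It does not complement or reverse the sequence.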

Output:

RNA (1–60): GCAGCUCUCGUGACUCAACCAUCGUGCAUGGCUGAUGAGGAUCUCGUUUUCCCGUGUUGG …
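
The difference between transcribing a coding strand and a template strand can be sketched in plain Python. Both function names below are illustrative assumptions. The first mirrors the T → U swap described above; the second shows the extra reverse-complement step needed when you start from the template strand instead.

```python
def transcribe(coding_dna):
    """Coding strand -> mRNA: swap every T for U."""
    return coding_dna.upper().replace("T", "U")

def transcribe_template(template_dna):
    """Template strand -> mRNA: reverse-complement first, then swap T for U."""
    complement = str.maketrans("ACGT", "TGCA")
    coding = template_dna.upper().translate(complement)[::-1]
    return transcribe(coding)

# Hypothetical usage: both calls describe the same double-stranded molecule
print(transcribe("ATGC"))
print(transcribe_template("GCAT"))
```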

3.4 Translation (DNA → Protein)

# Translate DNA into protein, reading from the first base and stopping at the first stop codon
protein = dna.translate(to_stop=True)
print("Protein (1–30):", protein[:30], "…")
print("Protein length:", len(protein), "amino acids")

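What happens: The sequence is read in codons (3 nt → 1 aa) from the first base, using the standard genetic code, until a stop codon is encountered. No start codon (ATG) is required, and the stop codon itself is not included in the result.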

Output:

Protein (1–30): AALVTQPSCMADEDLVFPCWVSWGSFAP …
Protein length: 28 amino acids
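
To make the codon-by-codon reading concrete, here is a minimal plain-Python sketch of translation with the standard genetic code. The names `CODON_TABLE` and `translate` are illustrative; this is not how Biopython implements it. It assumes the input contains only A, C, G and T (or U), and it ignores any trailing bases that do not fill a codon.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(seq, to_stop=False):
    """Translate DNA (or RNA) in frame +1; '*' marks a stop codon."""
    dna = seq.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*" and to_stop:
            break  # mimic to_stop=True: end before the first stop codon
        protein.append(aa)
    return "".join(protein)

# The first 60 nt of the RNA preview shown in section 3.3
rna_preview = "GCAGCUCUCGUGACUCAACCAUCGUGCAUGGCUGAUGAGGAUCUCGUUUUCCCGUGUUGG"
print(translate(rna_preview))  # AALVTQPSCMADEDLVFPCW
```

The printed peptide matches the first 20 residues of the protein shown above.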

4. Interpretation & Discussion

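Base composition: GC content, the percentage of bases that are G or C, influences DNA stability (G:C pairs form three hydrogen bonds versus two for A:T, so GC-rich DNA melts at a higher temperature) and can bias sequencing coverage. A uniformly random sequence like ours should sit near 50% GC.
RNA sequence: Identical to the DNA except that every T is now U, because transcribe() treats the input as the coding strand.
Protein: The length (len(protein)) tells you how many amino acids are encoded before the first stop codon. In random DNA, 3 of the 64 codons are stops, so a stop turns up after about 20 codons on average, and a short product like the 28 amino acids above is expected. In a real coding sequence, an unexpected amino-acid composition or a premature stop can indicate a frameshift or a sequencing error.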
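
How soon a stop codon should appear in random DNA can be estimated with a short calculation. The sketch below assumes independent, uniformly distributed bases and the standard genetic code (3 stop codons out of 64), and it ignores the finite length of the sequence. The function names are illustrative.

```python
# Probability that a random codon is a stop (TAA, TAG or TGA)
P_STOP = 3 / 64

def expected_aa_before_stop(p_stop=P_STOP):
    """Mean number of sense codons before the first stop (geometric model)."""
    return (1 - p_stop) / p_stop

def prob_no_stop(n_codons, p_stop=P_STOP):
    """Probability that n_codons consecutive random codons contain no stop."""
    return (1 - p_stop) ** n_codons

print("Expected amino acids before first stop:", round(expected_aa_before_stop(), 1))
print("P(no stop in 100 codons):", prob_no_stop(100))
```

Under these assumptions, a 300-bp random sequence (100 codons) has a chance of under 1% of translating all the way through without a stop.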

5. Exercises

Vary the sequence length (e.g. 500 bp, 1000 bp) and observe how percentages stabilise.
Inject a known motif (e.g. TATAAT) at positions 50–55, then search for it in the DNA and for its transcribed form (UAUAAU) in the RNA.
Translate without to_stop to see internal stop codons, which appear as * in the output.

dna.translate()

Output:

Seq('AALVTQPSCMADEDLVFPCWVSWGSFAP*TGPPSLRWCSRIWSAQIATVEN*VM...SST')

Load a real gene from NCBI (or a protein sequence from UniProt) and compare its composition to your synthetic example.
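
As a starting point for the motif exercise, one possible approach is sketched below in plain Python. The helper names `inject_motif` and `find_all`, the seed, and the choice to overwrite bases rather than insert them are all assumptions. Positions are treated as 1-based, so positions 50–55 correspond to the Python slice [49:55].

```python
import random

def inject_motif(seq, motif, start):
    """Overwrite seq with motif beginning at 1-based position start."""
    i = start - 1
    if i < 0 or i + len(motif) > len(seq):
        raise ValueError("motif does not fit at that position")
    return seq[:i] + motif + seq[i + len(motif):]

def find_all(seq, motif):
    """Return the 0-based start index of every (possibly overlapping) match."""
    hits, i = [], seq.find(motif)
    while i != -1:
        hits.append(i)
        i = seq.find(motif, i + 1)
    return hits

rng = random.Random(0)  # arbitrary seed, for reproducibility only
dna_str = "".join(rng.choices("ACGT", k=300))
dna_str = inject_motif(dna_str, "TATAAT", start=50)
rna_str = dna_str.replace("T", "U")

# Report 1-based positions; chance matches elsewhere are possible
print("DNA hits:", [i + 1 for i in find_all(dna_str, "TATAAT")])
print("RNA hits:", [i + 1 for i in find_all(rna_str, "UAUAAU")])
```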