Reference Genome Preparation
Before you can map reads, you need a FASTA file of your reference genome and an index for your chosen aligner.
1. Obtain or build a FASTA reference
Download a published genome (e.g. E. coli K-12 MG1655 RefSeq):

wget -O ref_genome.fa.gz \
  ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.fna.gz
gunzip ref_genome.fa.gz

(Optional) Build your own reference by concatenating contigs:

cat contig1.fa contig2.fa … > ref_genome.fa
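When contigs from several files are concatenated, two records can end up with the same identifier, which tends to cause trouble in downstream tools that key on sequence names. A minimal sketch of a check follows; the function name `find_duplicate_ids` is hypothetical and not part of any toolkit.

```shell
# Print every sequence ID (the first word of a ">" header line) that occurs
# more than once in the FASTA file given as $1.
# Prints nothing when all IDs are unique.
find_duplicate_ids() {
  grep '^>' "$1" | sed 's/^>//' | awk '{print $1}' | sort | uniq -d
}
```

Running `find_duplicate_ids ref_genome.fa` after the `cat` step should print nothing; any ID it does print needs renaming before indexing.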

Verify the FASTA header & sequence:

head -n5 ref_genome.fa
# Expected first line: >NC_000913.3 Escherichia coli str. K-12 substr. MG1655, complete genome
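Looking at the first few lines confirms the header, but not that the file is complete. As a rough complement, the sketch below counts records and bases with `awk`; the function name `fasta_stats` is an illustrative assumption.

```shell
# Print "<records> <bases>" for the FASTA file given as $1.
# Header lines (starting with ">") are counted as records; whitespace and
# carriage returns are stripped from sequence lines before counting bases.
fasta_stats() {
  awk '/^>/ {n++; next}
       {gsub(/[ \t\r]/, ""); bases += length($0)}
       END {print n+0, bases+0}' "$1"
}
```

For the single-chromosome E. coli download above, `fasta_stats ref_genome.fa` should report one record of roughly 4.64 million bases; a much smaller number suggests a truncated download.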

2. Index the reference for fast mapping
Each aligner needs its own index format:
- BWA‐MEM
```
bwa index ref_genome.fa
```
  Creates ref_genome.fa.amb, .ann, .bwt, .pac, and .sa next to the FASTA file.
- Bowtie2
```
bowtie2-build ref_genome.fa ref_bt2_index
```
  Produces six files: ref_bt2_index.1.bt2 through ref_bt2_index.4.bt2, plus ref_bt2_index.rev.1.bt2 and ref_bt2_index.rev.2.bt2.
- Minimap2 (for long reads)
```
minimap2 -d ref_minimap2.mmi ref_genome.fa
```
  Generates a single ref_minimap2.mmi index.
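An interrupted indexing run can leave a partial set of files behind, so it may be worth confirming that every expected file exists before aligning. The helpers below are a sketch with hypothetical names (`require_files`, `check_bwa_index`, `check_bowtie2_index`), and they assume a standard small Bowtie2 index with `.bt2` files (very large genomes use `.bt2l` instead).

```shell
# Return 0 if every listed file exists and is non-empty; otherwise name the
# first missing or empty file on stderr and return 1.
require_files() {
  for f in "$@"; do
    if [ ! -s "$f" ]; then
      echo "missing or empty: $f" >&2
      return 1
    fi
  done
}

# $1 is the FASTA path that was passed to `bwa index`.
check_bwa_index() {
  require_files "$1.amb" "$1.ann" "$1.bwt" "$1.pac" "$1.sa"
}

# $1 is the index basename that was passed to `bowtie2-build`.
# Assumption: small index (.bt2), not the large-index .bt2l variant.
check_bowtie2_index() {
  require_files "$1.1.bt2" "$1.2.bt2" "$1.3.bt2" "$1.4.bt2" \
                "$1.rev.1.bt2" "$1.rev.2.bt2"
}
```

With the names used on this page, that would be `check_bwa_index ref_genome.fa`, `check_bowtie2_index ref_bt2_index`, and `require_files ref_minimap2.mmi`.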

With the FASTA and its index in place, you’re ready to align your reads efficiently in the next step.