Reference Genome Preparation - iffatAGheyas/bioinformatics-tutorial-wiki GitHub Wiki

  • Reference Genome Preparation
    Before you can map reads, you need a FASTA file of your reference genome and an index for your chosen aligner.

    1. Obtain or build a FASTA reference
  • Download a published genome (e.g. E. coli K-12 MG1655 RefSeq):

wget -O ref_genome.fa.gz \
  https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.fna.gz
gunzip ref_genome.fa.gz
  • (Optional) Build your own reference by concatenating contigs:
cat contig1.fa contig2.fa … > ref_genome.fa
  • Verify the FASTA header & sequence:
head -n5 ref_genome.fa
       # Expected first line:
       # >NC_000913.3 Escherichia coli str. K-12 substr. MG1655, complete genome
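Looking at the first five lines only confirms that the file starts correctly. A slightly stronger check, sketched below, counts the records and bases in the whole file. The function name `fasta_summary` is a hypothetical helper for illustration, not part of any aligner or toolkit, and it assumes a plain (uncompressed) FASTA file.

```shell
# Hypothetical helper (not part of any toolkit): summarise a FASTA file.
fasta_summary() {
  fasta="$1"
  # Fail early if the file is missing or empty.
  [ -s "$fasta" ] || { echo "missing or empty: $fasta" >&2; return 1; }
  # The first line of a FASTA file must be a header starting with '>'.
  head -n1 "$fasta" | grep -q '^>' \
    || { echo "no FASTA header on line 1: $fasta" >&2; return 1; }
  # Count header lines and sum the length of every sequence line,
  # ignoring spaces, tabs and Windows carriage returns.
  awk '/^>/ { n++; next }
       { gsub(/[ \t\r]/, ""); bases += length($0) }
       END { printf "sequences=%d bases=%d\n", n, bases }' "$fasta"
}

# Usage:
#   fasta_summary ref_genome.fa
```

For the E. coli K-12 MG1655 download above, a healthy file should report a single sequence of about 4.6 Mb. After concatenating contigs, the sequence count should equal the number of contigs you combined.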
    2. Index the reference for fast mapping
    Each aligner needs its own index format:

    • BWA‐MEM

      bwa index ref_genome.fa
      

      Creates ref_genome.fa.amb, .ann, .bwt, .pac and .sa next to the FASTA. When aligning, pass ref_genome.fa itself as the index prefix.

    • Bowtie2

      bowtie2-build ref_genome.fa ref_bt2_index
      

      Produces six ref_bt2_index.*.bt2 files (.bt2l for very large genomes). When aligning, pass the prefix ref_bt2_index to bowtie2 -x.

    • Minimap2 (for long reads)

      minimap2 -d ref_minimap2.mmi ref_genome.fa
      

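      Generates a single ref_minimap2.mmi index. The k-mer and minimizer window sizes are fixed when the index is built, so add the same preset (-x) here that you plan to use for mapping.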
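An interrupted indexing run can leave some files missing or empty, so it can help to confirm the outputs before aligning. The sketch below assumes the file names used in the commands above. `check_index` is a hypothetical helper, not a command shipped with BWA, Bowtie2 or Minimap2.

```shell
# Hypothetical helper: succeed only if every listed file exists and is non-empty.
check_index() {
  missing=0
  for f in "$@"; do
    if [ ! -s "$f" ]; then
      echo "missing or empty: $f" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage (file names assume the index commands shown above):
#   check_index ref_genome.fa.amb ref_genome.fa.ann ref_genome.fa.bwt \
#               ref_genome.fa.pac ref_genome.fa.sa            # BWA
#   check_index ref_bt2_index.1.bt2 ref_bt2_index.2.bt2 \
#               ref_bt2_index.3.bt2 ref_bt2_index.4.bt2 \
#               ref_bt2_index.rev.1.bt2 ref_bt2_index.rev.2.bt2  # Bowtie2
#   check_index ref_minimap2.mmi                              # Minimap2
```

The function only tests that the files are present and non-empty; it does not verify that an index matches the current FASTA, so rebuild the index whenever the reference changes.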

With the FASTA and its index in place, you’re ready to align your reads efficiently in the next step.