Reference genomes - MeryemAk/dRNASeq GitHub Wiki

Download reference genomes and annotation files for all species

Download reference genome and annotation file for Homo sapiens

  1. Go to NCBI. Type Homo sapiens in the search bar and click the search button.
  1. The official reference genome is marked with a verified sign. Click on its assembly name.
  1. On the next page, click the Download button. In the window that opens, select both the FASTA and GTF files, then click Download.
  1. A zip file is saved to the Downloads/ directory of the local machine. Extract the zip file and press Enter in the window that opens.
  1. Open the extracted folder and follow the path below. The final directory should contain two files: genomic.gtf and GCF_000001405.40_GRCh38.p14_genomic.fna.
ncbi_dataset > ncbi_dataset > data > GCF_000001405.40/
  1. Rename these two files:
  • genomic.gtf --> human_annotation.gtf
  • GCF_000001405.40_GRCh38.p14_genomic.fna --> human_ref.fna
  1. Move the two renamed files to the reference_genomes directory in the dRNASeq pipeline.

Note:
If files with names ending in Zone.Identifier appear, delete them with:

find . -name "*Zone.Identifier" -type f -delete
  1. The reference genome needs to be indexed for Minimap2. A prebuilt index saves Minimap2 from re-indexing the genome on every run, which speeds up alignment. Open the terminal and execute the following commands:
cd $HOME/dRNASeq/reference_genomes/                 # Navigate to reference_genomes/ directory
minimap2 -x map-ont -d human_ref.mmi human_ref.fna  # Execute the indexing command

This command creates human_ref.mmi, which is necessary for mapping.
The general indexing command looks like this:
minimap2 -x map-ont -d <output_index_file> <input_reference_genome>

  • -x map-ont → Selects the preset for Oxford Nanopore reads
  • -d → Writes the index to the given output file
  • <output_index_file> → The .mmi file that stores the indexed genome
  • <input_reference_genome> → The .fna file containing the reference sequences
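
The manual rename, move, and index steps above can also be scripted. The following is a minimal sketch, not part of the dRNASeq repository: the function name stage_reference and its arguments are hypothetical, and it assumes the extracted dataset directory contains genomic.gtf and exactly one *_genomic.fna file. It copies rather than moves, so the downloaded originals stay in place, and it skips indexing with a message when minimap2 is not on the PATH.

```shell
# Hypothetical helper (not part of the dRNASeq repository).
# Usage: stage_reference <extracted_dataset_dir> <prefix> <reference_genomes_dir>
stage_reference() {
    local src="$1" prefix="$2" dest="$3"
    local gtf="$src/genomic.gtf"
    local fna
    # Pick the genome FASTA; assumes exactly one *_genomic.fna in the directory
    fna=$(find "$src" -maxdepth 1 -name '*_genomic.fna' -type f | head -n 1)

    if [ ! -f "$gtf" ] || [ -z "$fna" ]; then
        echo "Expected genomic.gtf and a *_genomic.fna file in $src" >&2
        return 1
    fi

    mkdir -p "$dest"
    # Copy under the names used in this guide (originals are left untouched)
    cp "$gtf" "$dest/${prefix}_annotation.gtf"
    cp "$fna" "$dest/${prefix}_ref.fna"

    # Build the Minimap2 index with the same command as in the step above
    if command -v minimap2 >/dev/null 2>&1; then
        minimap2 -x map-ont -d "$dest/${prefix}_ref.mmi" "$dest/${prefix}_ref.fna"
    else
        echo "minimap2 not found; skipping index creation" >&2
    fi
}
```

Assuming the zip file was extracted inside Downloads/ (adjust the first argument to wherever the extracted folder actually is), a call for the human genome could look like this:

```shell
stage_reference "$HOME/Downloads/ncbi_dataset/ncbi_dataset/data/GCF_000001405.40" human "$HOME/dRNASeq/reference_genomes"
```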

Download reference genome and annotation file for Candida albicans

Follow the same steps as for Homo sapiens, searching for Candida albicans (or any other fungus) instead.

After downloading the files, rename them inside the 'GCF' folder:

  • genomic.gtf --> candida_annotation.gtf
  • the .fna file --> candida_ref.fna
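
Once an organism is in place, a quick check can confirm that reference_genomes/ holds the expected files. This is a hedged sketch: the function name check_reference is hypothetical, and the candida_ref.mmi name assumes the Candida index is built with the same naming pattern as the human one.

```shell
# Hypothetical helper: report which expected files exist for a species prefix.
# Usage: check_reference <reference_genomes_dir> <prefix>
check_reference() {
    local dir="$1" prefix="$2" missing=0 f
    for f in "${prefix}_annotation.gtf" "${prefix}_ref.fna" "${prefix}_ref.mmi"; do
        # -s is true only when the file exists and is not empty
        if [ -s "$dir/$f" ]; then
            echo "ok       $f"
        else
            echo "missing  $f"
            missing=1
        fi
    done
    # Exit status 0 when everything is present, 1 otherwise
    return "$missing"
}
```

For example, `check_reference "$HOME/dRNASeq/reference_genomes" candida` lists each of the three files as ok or missing.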

Download reference genome and annotation file for any bacteria

  1. Go to NCBI. Type Staphylococcus aureus (or the name of any other bacterium) in the search bar and click the search button.
  1. The official reference genome is marked with a verified sign. Copy its RefSeq number (marked with a red square).
  1. Navigate to the reference_genomes/ directory in the terminal and check whether the script create_bacteria_index.sh is located there.
cd $HOME/dRNASeq/reference_genomes/   # Navigate to reference_genomes/ directory
ls                                    # Check contents of this directory

  1. Open the script in an editor. This guide uses nano; if nano is not available, edit the file in Notepad on Windows instead. Type the following command in the terminal to open the script in nano (the -l option shows line numbers):
nano -l create_bacteria_index.sh

Move down to line 17 and press Enter. Add the copied RefSeq number on the new line below the previous bacterium, following the same format as the existing entries in the script.

  1. Repeat steps 1-4 for every bacterium you want to add.

  1. Save the file with Ctrl + S (or Ctrl + O followed by Enter if Ctrl + S does not respond) and exit the nano editor with Ctrl + X.

  1. Install the NCBI genome download tool from the terminal with:

pip install --no-cache-dir ncbi-genome-download
  1. Run the script to download all reference genomes and annotation files.
chmod +x create_bacteria_index.sh        # Make the script executable
./create_bacteria_index.sh               # Run the script
  1. When the script finishes, it has created the following files and directories:
  • bacteria_annotation.gtf: annotations for the selected bacteria, necessary for counting with bambu
  • bacteria_index.mmi: clustered and indexed reference genomes for the selected bacteria, necessary for mapping with Minimap2
  • bacteria_seq.fna: reference genomes for the selected bacteria in FASTA format, necessary for counting with bambu
  • bacteria_seq_mmseqs_all_seqs.fasta: all sequences from the clustering process, including duplicates (not used for downstream analysis)
  • bacteria_seq_mmseqs_cluster.tsv: cluster relationships, showing which sequences belong to which cluster
  • bacteria_seq_mmseqs_rep_seq.fasta: the representative sequence of each cluster; this file is used to create bacteria_index.mmi
  • fna_files/: directory containing all fna files for the selected bacteria
  • gff_files/: directory containing all gff files for the selected bacteria
  • gtf_files/: directory containing all gtf files for the selected bacteria
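
Because RefSeq numbers are pasted into create_bacteria_index.sh by hand in step 4, a typo only surfaces when the download fails. The sketch below checks the general shape of a RefSeq assembly accession (GCF_, nine digits, a dot, and a version number, as in GCF_000001405.40) before it is added to the script. The function name is hypothetical, and the check covers the format only, not whether the assembly exists at NCBI.

```shell
# Hypothetical helper: succeed if the argument looks like a RefSeq assembly accession.
# Usage: is_refseq_accession <accession>
is_refseq_accession() {
    # GCF_ + exactly nine digits + "." + one or more version digits
    printf '%s\n' "$1" | grep -Eq '^GCF_[0-9]{9}\.[0-9]+$'
}

# Example call, using the human accession from earlier on this page
if is_refseq_accession "GCF_000001405.40"; then
    echo "format ok"
else
    echo "does not look like a RefSeq assembly accession"
fi
```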


Note:
To view any file, type: cat <filename>
To view the contents of any directory, type: ls <directory name>
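
For large files such as bacteria_seq.fna, cat prints the entire file to the terminal. As a lighter alternative, the sketch below counts the sequence records in a FASTA file and lists only the first few header lines. The function name fasta_summary is hypothetical; it relies only on the FASTA convention that each record starts with a line beginning with >.

```shell
# Hypothetical helper: summarize a FASTA file without printing every sequence.
# Usage: fasta_summary <fasta_file> [number_of_headers_to_show]
fasta_summary() {
    local file="$1" n="${2:-5}"
    if [ ! -f "$file" ]; then
        echo "No such file: $file" >&2
        return 1
    fi
    # Each FASTA record begins with a header line starting with '>'
    echo "records: $(grep -c '^>' "$file")"
    # Show only the first n header lines
    grep '^>' "$file" | head -n "$n"
}
```

For example, `fasta_summary bacteria_seq.fna 3` prints the record count followed by the first three headers.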
