Reference genomes - MeryemAk/dRNASeq GitHub Wiki
- Go to NCBI. Type Homo sapiens in the search bar and hit the search button.
- The official reference genome is shown with a verified sign next to it. Click on the assembly name.
- On the next page, click the Download button. A new window will open; select both the FASTA and GTF files, then click the Download button.
- A zip file will be downloaded to the Downloads/ directory of the local machine. Extract the zip file and hit Enter in the window that opens.
- Double-click the extracted folder and follow the path below. The final directory should contain two files: genomic.gtf and GCF_000001405.40_GRCh38.p14_genomic.fna.
ncbi_dataset > ncbi_dataset > data > GCF_000001405.40/
- Change the names of these two files:
  - genomic.gtf --> human_annotation.gtf
  - GCF_000001405.40_GRCh38.p14_genomic.fna --> human_ref.fna
- Move the renamed files to the reference_genomes directory in the dRNASeq pipeline.
Note:
If files like Zone.Identifier show up, just delete them with: find . -name "*Zone.Identifier" -type f -delete
- The reference genome needs to be indexed for Minimap2. Indexing is necessary because it significantly speeds up alignment and reduces memory usage. To do this, open the terminal and execute the following commands:
cd $HOME/dRNASeq/reference_genomes/ # Navigate to reference_genomes/ directory
minimap2 -x map-ont -d human_ref.mmi human_ref.fna # Execute the indexing command
This command creates human_ref.mmi, which is necessary for mapping.
The general indexing command looks like this:
minimap2 -x map-ont -d <output_index_file> <input_reference_genome>
- -x map-ont → Selects the preset for Oxford Nanopore reads
- -d → Creates an index and writes it to the given output file
- <output_index_file> → The .mmi file that stores the indexed genome
- <input_reference_genome> → The .fna file containing the reference sequences
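
The index only needs to be rebuilt when the reference file changes, so the indexing command can be wrapped in a small check. This is a minimal sketch and not part of the dRNASeq pipeline: the function names `index_is_stale` and `build_index` are hypothetical, and it assumes `minimap2` is available on the PATH.

```shell
# Hypothetical helpers, not part of the dRNASeq pipeline.

# Succeeds (exit status 0) when the index file ($2) is missing
# or older than the reference file ($1).
index_is_stale() {
    [ ! -f "$2" ] || [ "$1" -nt "$2" ]
}

# Build the Minimap2 index only when it is missing or out of date.
# Usage: build_index <input_reference_genome> <output_index_file>
build_index() {
    if [ ! -f "$1" ]; then
        echo "Reference not found: $1" >&2
        return 1
    fi
    if index_is_stale "$1" "$2"; then
        # Same command as above, with the file names as arguments.
        minimap2 -x map-ont -d "$2" "$1"
    else
        echo "Index is up to date: $2"
    fi
}
```

For the human reference this would be called as `build_index human_ref.fna human_ref.mmi` from inside the reference_genomes/ directory.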
Follow the same steps as for Homo sapiens to obtain the reference for Candida albicans or any other fungus.
After downloading the files, rename them within the 'GCF' folder:
- genomic.gtf --> candida_annotation.gtf
- fna file --> candida_ref.fna
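
The rename-and-move steps are the same for every organism, so they can also be done from the terminal. The sketch below is one possible way to do it, not something the pipeline provides: the function name `stage_reference` and its three arguments are hypothetical. It copies rather than moves, so the extracted download stays intact.

```shell
# Hypothetical helper, not part of the dRNASeq pipeline.
# Copies genomic.gtf and the single .fna file from an extracted NCBI
# 'GCF' folder into a destination directory under the new names.
# Usage: stage_reference <gcf_folder> <dest_dir> <prefix>
stage_reference() {
    data_dir="$1"
    dest_dir="$2"
    prefix="$3"
    gtf="$data_dir/genomic.gtf"

    # The .fna name contains the assembly accession, so find it by extension.
    fna=""
    for f in "$data_dir"/*.fna; do
        if [ -f "$f" ]; then
            fna="$f"
            break
        fi
    done

    if [ ! -f "$gtf" ] || [ -z "$fna" ]; then
        echo "Expected genomic.gtf and a .fna file in $data_dir" >&2
        return 1
    fi

    mkdir -p "$dest_dir"
    cp "$gtf" "$dest_dir/${prefix}_annotation.gtf"
    cp "$fna" "$dest_dir/${prefix}_ref.fna"
}
```

Called with the prefix `human` or `candida`, this would produce the file names listed above; the first argument is wherever the 'GCF' folder ended up after extraction.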
- Go to NCBI. Type Staphylococcus aureus (or the name of any other bacterium) in the search bar and hit the search button.
- The official reference genome is shown with a verified sign next to it. Copy its RefSeq number (red square).
- Navigate to the reference_genomes/ directory in the terminal and check whether the script create_bacteria_index.sh is located there.
cd $HOME/dRNASeq/reference_genomes/ # Navigate to reference_genomes/ directory
ls # Check contents of this directory
- Open the script in an editor. Here we will use nano; if this doesn't work, edit the file in Notepad on Windows. Type the following command in the terminal and the nano text editor will open.
nano -l create_bacteria_index.sh
Move down to line 17 and press Enter. Add the copied RefSeq number underneath the previous bacteria, following the same format used in other examples within the script.

- Repeat steps 1-4 for any bacteria you want to add.
- Save the file with Ctrl + S and exit the nano editor with Ctrl + X.
- Install the NCBI genome download tool in the terminal with:
pip install --no-cache-dir ncbi-genome-download
- Run the script to download all reference genomes and annotation files.
chmod +x create_bacteria_index.sh # Make the script executable
./create_bacteria_index.sh # Run the script
- Several files are created after the script is done executing.
  - bacteria_annotation.gtf: annotation files for the selected bacteria, necessary for counting with bambu
  - bacteria_index.mmi: clustered and indexed reference genomes for the selected bacteria, necessary for mapping with Minimap2
  - bacteria_seq.fna: reference genomes for the selected bacteria in FASTA format, necessary for counting with bambu
  - bacteria_seq_mmseqs_all_seqs.fasta: contains all sequences from the clustering process, including duplicates (not used for downstream analysis)
  - bacteria_seq_mmseqs_cluster.tsv: provides information on cluster relationships, showing which sequences belong to which cluster
  - bacteria_seq_mmseqs_rep_seq.fasta: contains the representative sequence from each cluster; this file was used to create bacteria_index.mmi
  - fna_files/: directory containing all fna files for the selected bacteria.
  - gff_files/: directory containing all gff files for the selected bacteria.
  - gtf_files/: directory containing all gtf files for the selected bacteria.
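
To confirm that the script finished properly, it can help to check that the three files needed for mapping and counting exist and are not empty. A minimal sketch, assuming the file names listed above; the function name `check_bacteria_outputs` is hypothetical and not part of the pipeline.

```shell
# Hypothetical helper, not part of the dRNASeq pipeline.
# Reports which of the files needed downstream are missing or empty.
# Usage: check_bacteria_outputs [directory]   (defaults to the current one)
check_bacteria_outputs() {
    dir="${1:-.}"
    missing=0
    for f in bacteria_annotation.gtf bacteria_index.mmi bacteria_seq.fna; do
        # -s is true only if the file exists and has a size greater than zero.
        if [ ! -s "$dir/$f" ]; then
            echo "Missing or empty: $dir/$f" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Run inside reference_genomes/ as `check_bacteria_outputs`; it prints nothing and returns success when all three files are present.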

Note:
To view any file, type: cat <filename>
To view the contents of any directory, type: ls <directory name>
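
Since cat prints the whole file, it can be unwieldy for genome-sized files. As a sketch, the two helpers below show only the first lines of a file and count the sequences in a FASTA file by counting its header lines. The names `preview_file` and `count_fasta_records` are hypothetical.

```shell
# Hypothetical helpers, not part of the dRNASeq pipeline.

# Print only the first lines of a file (10 by default).
# Usage: preview_file <filename> [number_of_lines]
preview_file() {
    head -n "${2:-10}" "$1"
}

# Count the sequences in a FASTA file: every record starts with a '>' header.
# Usage: count_fasta_records <filename>
count_fasta_records() {
    grep -c '^>' "$1"
}
```

For example, `preview_file bacteria_annotation.gtf 5` would show the first five annotation lines, and `count_fasta_records bacteria_seq.fna` the number of sequences in the combined reference.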