Download reference genomes and annotation files for all species

Download reference genome and annotation file for Homo sapiens

Go to NCBI. Type Homo sapiens in the search bar and hit the search button.

The official reference genome is shown with a verified sign next to it. Click on the assembly name.

On the next page, click on the Download button. In the new window that opens, select both the FASTA and the GTF file, then click on the Download button.

A zip file will be downloaded to the Downloads/ directory of the local machine. Extract the zip file and press Enter in the window that opens.

Double-click on the extracted folder and follow the path below. The final directory should contain two files: genomic.gtf and GCF_000001405.40_GRCh38.p14_genomic.fna.

ncbi_dataset > ncbi_dataset > data > GCF_000001405.40/

Rename these two files:

genomic.gtf --> human_annotation.gtf
GCF_000001405.40_GRCh38.p14_genomic.fna --> human_ref.fna

Move the renamed files to the reference_genomes directory in the dRNASeq pipeline.

Note:
If files like Zone.Identifier show up, just delete those with:
find . -name "*Zone.Identifier" -type f -delete
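
For readers who prefer the terminal, the rename and move steps above can be sketched as a small shell function. This is a minimal sketch and not part of the dRNASeq pipeline: the helper name stage_reference and the example paths are assumptions, and it expects the zip file to be extracted already.

```shell
#!/usr/bin/env bash
# stage_reference is a hypothetical helper, not part of the dRNASeq pipeline.
# It renames genomic.gtf and the *_genomic.fna file and moves both
# into the reference_genomes directory.
stage_reference() {
    local src_dir="$1"    # extracted folder, e.g. .../data/GCF_000001405.40
    local dest_dir="$2"   # e.g. $HOME/dRNASeq/reference_genomes
    local prefix="$3"     # e.g. human or candida

    local gtf="$src_dir/genomic.gtf"
    local fna
    # Pick the first FASTA file that follows the NCBI naming pattern
    fna=$(find "$src_dir" -maxdepth 1 -name '*_genomic.fna' -type f | head -n 1)

    if [ ! -f "$gtf" ] || [ -z "$fna" ]; then
        echo "Expected genomic.gtf and a *_genomic.fna file in $src_dir" >&2
        return 1
    fi

    mkdir -p "$dest_dir"
    mv "$gtf" "$dest_dir/${prefix}_annotation.gtf"
    mv "$fna" "$dest_dir/${prefix}_ref.fna"
}

# Example call (paths are assumptions; adjust to where the zip was extracted):
# stage_reference "$HOME/Downloads/ncbi_dataset/ncbi_dataset/data/GCF_000001405.40" \
#     "$HOME/dRNASeq/reference_genomes" human
```

Passing candida as the third argument would produce candida_annotation.gtf and candida_ref.fna in the same way.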

The reference genome needs to be indexed for Minimap2. Building the index once and saving it to a file avoids re-indexing the genome on every alignment run, which shortens the mapping step. To do this, open the terminal and execute the following commands:

cd $HOME/dRNASeq/reference_genomes/                 # Navigate to reference_genomes/ directory
minimap2 -x map-ont -d human_ref.mmi human_ref.fna  # Execute the indexing command

This command creates human_ref.mmi, which is necessary for mapping.
The general indexing command looks like this:
minimap2 -x map-ont -d <output_index_file> <input_reference_genome>

-x map-ont → Applies the preset for Oxford Nanopore reads
-d → Writes the index to the given output file
<output_index_file> → The .mmi file that stores the indexed genome
<input_reference_genome> → The .fna file containing the reference sequences
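
Because indexing a large genome takes a while, it can be convenient to skip the step when an up-to-date index already exists. The following is a minimal sketch under stated assumptions: build_index is a hypothetical helper (not part of the pipeline), and it assumes the index should sit next to the reference with the extension replaced by .mmi. The minimap2 call itself is the one shown above.

```shell
#!/usr/bin/env bash
# build_index is a hypothetical helper, not part of the dRNASeq pipeline.
# It runs the indexing command only when the .mmi file is missing
# or older than the reference FASTA.
build_index() {
    local ref="$1"              # e.g. human_ref.fna
    local idx="${ref%.*}.mmi"   # assumed naming: human_ref.fna -> human_ref.mmi

    if [ ! -f "$ref" ]; then
        echo "Reference not found: $ref" >&2
        return 1
    fi

    # -nt is true when the index was modified more recently than the reference
    if [ -f "$idx" ] && [ "$idx" -nt "$ref" ]; then
        echo "Index $idx is up to date, skipping"
        return 0
    fi

    minimap2 -x map-ont -d "$idx" "$ref"
}

# Example call:
# build_index "$HOME/dRNASeq/reference_genomes/human_ref.fna"
```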

Download reference genome and annotation file for Candida albicans

Follow the same steps as for Homo sapiens for Candida albicans or any other fungus.

After downloading the files, rename them within the 'GCF' folder:

genomic.gtf --> candida_annotation.gtf
fna file --> candida_ref.fna

Download reference genome and annotation file for any bacteria

Go to NCBI. Type Staphylococcus aureus (or the name of any other bacterium) in the search bar and hit the search button.

The official reference genome is shown with a verified sign next to it. Copy its RefSeq number (marked with a red square).

Navigate to the reference_genomes/ directory in the terminal and check whether the script create_bacteria_index.sh is located there.

cd $HOME/dRNASeq/reference_genomes/   # Navigate to reference_genomes/ directory
ls                                    # Check contents of this directory

Open the script in an editor. Here we use nano; if this doesn't work, edit the file in Notepad on Windows instead. Type the following command in the terminal to open the nano text editor:

nano -l create_bacteria_index.sh

Move down to line 17 and press enter. Add the copied RefSeq number underneath the previous bacteria, following the same format used in other examples within the script.
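
A mistyped accession only surfaces when the download fails, so a quick format check before editing the script can help. RefSeq assembly accessions consist of GCF_, nine digits, a dot and a version number (for example GCF_000001405.40, the human assembly used earlier). The sketch below checks only this format, not whether the assembly exists; the function name is hypothetical.

```shell
#!/usr/bin/env bash
# is_refseq_accession is a hypothetical helper, not part of the dRNASeq pipeline.
# It succeeds when its argument looks like a RefSeq assembly accession.
is_refseq_accession() {
    printf '%s\n' "$1" | grep -Eq '^GCF_[0-9]{9}\.[0-9]+$'
}

# Usage: check a copied value before pasting it into the script
if is_refseq_accession "GCF_000001405.40"; then
    echo "looks like a RefSeq assembly accession"
else
    echo "unexpected format"
fi
```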

Repeat steps 1-4 for each bacterium you want to add.
Save the file with Ctrl + S (or Ctrl + O followed by Enter on older nano versions) and exit the nano editor with Ctrl + X.
Install the NCBI genome download tool in the terminal with:

pip install --no-cache-dir ncbi-genome-download

Run the script to download all reference genomes and annotation files.

chmod +x create_bacteria_index.sh        # Make the script executable
./create_bacteria_index.sh               # Run the script

Several files and directories are created once the script has finished:

bacteria_annotation.gtf: annotations for the selected bacteria, necessary for counting with bambu
bacteria_index.mmi: clustered and indexed reference genomes for the selected bacteria, necessary for mapping with Minimap2
bacteria_seq.fna: reference genomes for the selected bacteria in FASTA format, necessary for counting with bambu
bacteria_seq_mmseqs_all_seqs.fasta: contains all sequences from the clustering process, including duplicates (not used for downstream analysis)
bacteria_seq_mmseqs_cluster.tsv: provides information on cluster relationships, showing which sequences belong to which cluster
bacteria_seq_mmseqs_rep_seq.fasta: contains the representative sequence of each cluster; this file was used to create bacteria_index.mmi
fna_files/: directory containing all fna files for the selected bacteria.
gff_files/: directory containing all gff files for the selected bacteria.
gtf_files/: directory containing all gtf files for the selected bacteria.
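
To confirm that the script produced everything listed above, a small check can be run in the reference_genomes/ directory. This is a sketch only: check_bacteria_outputs is a hypothetical helper, and the names it looks for are taken from the list above.

```shell
#!/usr/bin/env bash
# check_bacteria_outputs is a hypothetical helper, not part of the dRNASeq pipeline.
# It reports expected outputs that are missing or empty and
# returns a non-zero status if any problem was found.
check_bacteria_outputs() {
    local dir="${1:-.}"   # directory to check, defaults to the current one
    local status=0
    local name
    for name in bacteria_annotation.gtf bacteria_index.mmi bacteria_seq.fna \
                bacteria_seq_mmseqs_all_seqs.fasta bacteria_seq_mmseqs_cluster.tsv \
                bacteria_seq_mmseqs_rep_seq.fasta fna_files gff_files gtf_files; do
        if [ ! -e "$dir/$name" ]; then
            echo "missing: $name"
            status=1
        elif [ -f "$dir/$name" ] && [ ! -s "$dir/$name" ]; then
            echo "empty: $name"
            status=1
        fi
    done
    return "$status"
}

# Example call:
# check_bacteria_outputs "$HOME/dRNASeq/reference_genomes"
```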

Note:
To view any file, type: cat <filename>
To view the contents of any directory, type: ls <directory name>
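
Reference genomes and annotation files can be very large, so printing them in full with cat may flood the terminal. As a hedged alternative, the sketch below previews only the first lines of a file and counts the records in a FASTA file by their > header lines. Both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of the dRNASeq pipeline.

# Print the first lines of a file (5 by default)
preview_file() {
    head -n "${2:-5}" "$1"
}

# Count FASTA records: each record starts with a line beginning with ">"
count_fasta_records() {
    grep -c '^>' "$1"
}

# Example calls:
# preview_file bacteria_annotation.gtf 10
# count_fasta_records bacteria_seq.fna
```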

Reference genomes - MeryemAk/dRNASeq GitHub Wiki
