Looking up missing gene data via the command line

Problem:

You have gene information (e.g. some genes of interest for a certain project), but the data is incomplete.

Solution:

"ncbi-entrez-direct" (NCBI Entrez Direct) is a set of command line tools that retrieves the desired information from the NCBI databases, given certain gene identifiers.

Installation:

sudo apt install ncbi-entrez-direct

It is also possible to create a conda environment for this tool; the Bioconda package is named "entrez-direct".

Usage:

esearch -db gene -query "{gene_name}[Gene] AND {taxon_id}[Organism]" | efetch -format docsum | xtract -pattern DocumentSummary -element {element_1} {element_2} {element_3} {...}

Example:

esearch -db gene -query "RAD9A[Gene] AND txid9606[Organism]" | efetch -format docsum | xtract -pattern DocumentSummary -element Id Name Description

This looks up the gene "RAD9A" in the organism "Homo sapiens" (taxon ID 9606, written as "txid9606" in the query) and returns the gene ID, the gene's standard name and a description. The output of this example is printed to the terminal and looks like this:

5883	RAD9A	RAD9 checkpoint clamp component A

The output consists of the gene ID, the gene name and a description, separated by tabs. Please note that the description string in this case contains space characters, so the line should be split on tabs rather than on whitespace.
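Because the description contains spaces, splitting on tabs only keeps the three fields intact. A minimal Python sketch of this, assuming exactly the three elements from the example above were requested; the names `GeneRecord` and `parse_xtract_line` are made up for illustration and are not part of the tool:

```python
from typing import NamedTuple


class GeneRecord(NamedTuple):
    """One output line of the example above (hypothetical helper type)."""
    gene_id: str
    name: str
    description: str


def parse_xtract_line(line: str) -> GeneRecord:
    """Split one tab-separated output line into its three fields."""
    # Split on tabs only, so spaces inside the description are preserved.
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3:
        # Assumption: exactly Id, Name and Description were requested.
        raise ValueError(f"expected 3 tab-separated fields, got {len(fields)}")
    return GeneRecord(*fields)


# The example line from above, with tabs written as \t.
record = parse_xtract_line("5883\tRAD9A\tRAD9 checkpoint clamp component A\n")
print(record.description)  # prints: RAD9 checkpoint clamp component A
```

Splitting on any whitespace instead would break the description into four separate words.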

It is also possible to use other search queries (https://www.ncbi.nlm.nih.gov/books/NBK49540/) or to extract other elements; further examples can be found online. The tool can of course also be integrated into a script (e.g. in Python) to process, for example, a list of genes.
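As a rough sketch of such a script, the following Python code runs the same three-step pipeline for each gene in a list. It assumes that `esearch`, `efetch` and `xtract` are installed and on the PATH and that network access is available. The function names (`build_query`, `build_commands`, `lookup_gene`, `lookup_genes`) are hypothetical and only serve this illustration.

```python
import subprocess
from typing import Dict, Iterable, List, Sequence


def build_query(gene_name: str, taxon_id: str) -> str:
    """Fill the query template from the Usage section."""
    return f"{gene_name}[Gene] AND {taxon_id}[Organism]"


def build_commands(gene_name: str, taxon_id: str,
                   elements: Sequence[str]) -> List[List[str]]:
    """Return the three pipeline stages as argument lists."""
    return [
        ["esearch", "-db", "gene", "-query", build_query(gene_name, taxon_id)],
        ["efetch", "-format", "docsum"],
        ["xtract", "-pattern", "DocumentSummary", "-element", *elements],
    ]


def lookup_gene(gene_name: str, taxon_id: str = "txid9606",
                elements: Sequence[str] = ("Id", "Name", "Description"),
                ) -> List[List[str]]:
    """Run the pipeline for one gene; one list of fields per output line."""
    data = None
    for command in build_commands(gene_name, taxon_id, elements):
        if data is None:
            # First stage: give it an empty stdin as a precaution, so it
            # cannot consume the input of the calling script.
            result = subprocess.run(command, stdin=subprocess.DEVNULL,
                                    capture_output=True, text=True, check=True)
        else:
            # Later stages read the previous stage's output, like a pipe.
            result = subprocess.run(command, input=data,
                                    capture_output=True, text=True, check=True)
        data = result.stdout
    return [line.split("\t") for line in data.splitlines() if line]


def lookup_genes(gene_names: Iterable[str],
                 taxon_id: str = "txid9606") -> Dict[str, List[List[str]]]:
    """Look up every gene in the list and collect the rows per gene."""
    return {name: lookup_gene(name, taxon_id) for name in gene_names}


# Show the commands for the example gene without contacting NCBI.
for command in build_commands("RAD9A", "txid9606", ("Id", "Name", "Description")):
    print(command)
```

With the tools installed, `lookup_genes(["RAD9A"])` would run the query from the example above. A query may match more than one record, which is why each gene maps to a list of rows rather than a single row.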

Looking up missing gene data - core-unit-bioinformatics/knowledge-base GitHub Wiki
