Looking up missing gene data
Problem:
- You have gene information (e.g. some genes of interest for a certain project), but the data is incomplete.
Solution:
- "ncbi-entrez-direct" (NCBI Entrez Direct) is a set of command line tools for querying the NCBI databases. Given a gene identifier such as a gene symbol, it retrieves the associated information.
Installation:
sudo apt install ncbi-entrez-direct
It is also possible to install the tool into a conda environment (for example via the bioconda package "entrez-direct").
Usage:
esearch -db gene -query "{gene_name}[Gene] AND {taxon_id}[Organism]" | efetch -format docsum | xtract -pattern DocumentSummary -element {element_1} {element_2} {...}
Example:
esearch -db gene -query "RAD9A[Gene] AND txid9606[Organism]" | efetch -format docsum | xtract -pattern DocumentSummary -element Id Name Description
This looks up the gene "RAD9A" in the organism "Homo sapiens" (NCBI taxonomy ID 9606, written as "txid9606" in the query) and returns the gene ID, the gene's standard name and a description. The output of this example is printed to the terminal and looks like this:
5883 RAD9A RAD9 checkpoint clamp component A
The three fields (gene ID, gene name and description) are tab-separated. Please note that the description string in this case contains space characters, so the output should be split on tabs rather than on arbitrary whitespace.
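Since the description contains spaces, a script reading this output needs to split on tab characters only. A minimal Python sketch, assuming the three-column output of the example above (the function name parse_gene_line is hypothetical, not part of the tool):

```python
def parse_gene_line(line):
    """Split one xtract output line into (gene_id, name, description)."""
    # Split on tabs only: the description field may contain spaces.
    gene_id, name, description = line.rstrip("\n").split("\t")
    return gene_id, name, description


# Example line as produced by the RAD9A query above (fields tab-separated).
example = "5883\tRAD9A\tRAD9 checkpoint clamp component A\n"
print(parse_gene_line(example))
```

If a different set of elements is requested from xtract, the number of unpacked fields has to be adjusted accordingly.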
Other search queries are possible (see https://www.ncbi.nlm.nih.gov/books/NBK49540/), as are other output elements; further examples are easy to find online. The tool can also be called from a script (e.g. in Python) to process a list of genes.
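As an illustration of the script integration, the following Python sketch builds the pipeline from the Usage section for each gene in a list and runs it through the shell. It is one possible approach under stated assumptions, not an official interface: the function names build_pipeline and lookup_genes are hypothetical, the tools are assumed to be installed and on the PATH, and network access to NCBI is assumed.

```python
import shlex
import subprocess


def build_pipeline(gene, taxon_id, elements=("Id", "Name", "Description")):
    """Return the esearch | efetch | xtract pipeline for one gene as a string."""
    # Quote the query so that gene names cannot break the shell command.
    query = shlex.quote(f"{gene}[Gene] AND {taxon_id}[Organism]")
    wanted = " ".join(shlex.quote(e) for e in elements)
    return (
        f"esearch -db gene -query {query} "
        "| efetch -format docsum "
        f"| xtract -pattern DocumentSummary -element {wanted}"
    )


def lookup_genes(genes, taxon_id):
    """Run the pipeline once per gene and return the rows as lists of fields."""
    rows = []
    for gene in genes:
        result = subprocess.run(
            build_pipeline(gene, taxon_id),
            shell=True,                # needed because the command uses pipes
            stdin=subprocess.DEVNULL,  # keep the tools from reading the script's stdin
            capture_output=True,
            text=True,
            check=True,                # only reflects the exit status of xtract
        )
        for line in result.stdout.splitlines():
            if line.strip():
                rows.append(line.split("\t"))  # tab-separated, see above
    return rows


# Show the command that would be run for the example gene (no network access).
print(build_pipeline("RAD9A", "txid9606"))
```

Calling lookup_genes(["RAD9A"], "txid9606") would then run the actual query and return one list of fields per matching record. For long gene lists it may be advisable to pause between requests, since NCBI limits the request rate.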