Metadata Retrieval - prekijpatel/MetaMiner GitHub Wiki

Data Retrieval

MetaMiner retrieves genomic metadata from NCBI servers using NCBI’s own command-line tool, the NCBI Datasets CLI, which runs in the background of the software. Relying on NCBI’s official tool helps preserve the accuracy and integrity of the metadata during retrieval. Metadata retrieved through the CLI is typically in JSON Lines format. However, users may instead have metadata downloaded from the NCBI Datasets web interface, which may be in JSON format.

Both the JSON and JSON Lines formats, while rich in structure, are not ideal for large-scale filtering, sorting, or data manipulation—especially when working with thousands of records. Therefore, a transformation step was introduced before any cleaning or downstream processing.
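The two input formats described above can be read into one common shape (a list of record dictionaries) before any transformation. The following is a minimal sketch, not MetaMiner's actual loader: the function name `load_records` is hypothetical, the accessions are made-up placeholders, and the `"reports"` wrapper key for a JSON download is an assumption that should be checked against the real file.

```python
import json


def load_records(text):
    """Return a list of record dicts from JSON or JSON Lines text."""
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        # Several JSON documents in one text: treat it as JSON Lines,
        # one record per non-empty line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    if isinstance(parsed, list):
        return parsed
    if isinstance(parsed, dict) and isinstance(parsed.get("reports"), list):
        # Assumed wrapper key for a JSON download; adjust to the real file.
        return parsed["reports"]
    # A single record (this also covers a one-line JSON Lines file).
    return [parsed]


# Placeholder inputs, for illustration only.
jsonl_text = '{"accession": "SAMN00000001"}\n{"accession": "SAMN00000002"}\n'
json_text = '{"reports": [{"accession": "SAMN00000001"}]}'

jsonl_records = load_records(jsonl_text)
json_records = load_records(json_text)
```

Trying a whole-document parse first and falling back to line-by-line parsing means the caller does not need to know in advance which of the two formats a file uses.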


Data Transformation

MetaMiner transforms the raw JSON/JSONL into a structured, tabular format (a DataFrame), which is easier to work with in large-scale analyses. MetaMiner is not alone in this: NCBI also provides a CLI tool called dataformat for similar purposes.

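However, we find the transformation approach taken with MetaMiner simpler: as the example below shows, it yields one row per accession with one column per attribute, rather than one row per attribute.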

Example Transformation:

# JSON snippet of BioSample attributes
{
  "accession": "SAMN12345678",
  "attributes": [
    {"name": "strain", "value": "06-00048"},
    {"name": "collection_date", "value": "2006"},
    {"name": "isolate_name_alias", "value": "CFSAN004178"},
    {"name": "geo_loc_name", "value": "USA:CA"},
    {"name": "serovar", "value": "O36:H14"},
    {"name": "host", "value": "homo sapiens"},
    {"name": "host_disease", "value": "None, Healthy"},
    {"name": "isolation_source", "value": "Nares"}
  ]
}

MetaMiner-based transformation:

| accession | strain | collection_date | isolate_name_alias | geo_loc_name | serovar | host | host_disease | isolation_source |
|---|---|---|---|---|---|---|---|---|
| SAMN12345678 | 06-00048 | 2006 | CFSAN004178 | USA:CA | O36:H14 | homo sapiens | None, Healthy | Nares |

Dataformat-based transformation:

| accession | attribute_name | attribute_value |
|---|---|---|
| SAMN12345678 | strain | 06-00048 |
| SAMN12345678 | collection_date | 2006 |
| SAMN12345678 | isolate_name_alias | CFSAN004178 |
| SAMN12345678 | geo_loc_name | USA:CA |
| SAMN12345678 | serovar | O36:H14 |
| SAMN12345678 | host | homo sapiens |
| SAMN12345678 | host_disease | None, Healthy |
| SAMN12345678 | isolation_source | Nares |
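
The difference between the two layouts above can be sketched in a few lines of Python. This is an illustration only, reproducing the two output shapes from the example record; it is neither MetaMiner's nor dataformat's actual implementation, and the function names `to_wide` and `to_long` are hypothetical.

```python
# The example BioSample record, with attributes as a list of name/value pairs.
record = {
    "accession": "SAMN12345678",
    "attributes": [
        {"name": "strain", "value": "06-00048"},
        {"name": "collection_date", "value": "2006"},
        {"name": "isolate_name_alias", "value": "CFSAN004178"},
        {"name": "geo_loc_name", "value": "USA:CA"},
        {"name": "serovar", "value": "O36:H14"},
        {"name": "host", "value": "homo sapiens"},
        {"name": "host_disease", "value": "None, Healthy"},
        {"name": "isolation_source", "value": "Nares"},
    ],
}


def to_wide(record):
    """One row per accession: each attribute name becomes a column."""
    row = {"accession": record["accession"]}
    for attr in record.get("attributes", []):
        row[attr["name"]] = attr["value"]
    return row


def to_long(record):
    """One row per attribute: the accession is repeated on every row."""
    return [
        {
            "accession": record["accession"],
            "attribute_name": attr["name"],
            "attribute_value": attr["value"],
        }
        for attr in record.get("attributes", [])
    ]


wide_row = to_wide(record)    # a single dict with 9 keys
long_rows = to_long(record)   # a list of 8 dicts with 3 keys each
```

A list of such wide rows, one per record, can be passed to `pandas.DataFrame(...)` to obtain a table with one line per accession, which suits filtering and sorting by attribute.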

Once the data is transformed, MetaMiner saves the transformed metadata (which essentially becomes your raw metadata) at the selected location in TSV format. This ensures that users can always revisit or reprocess the original parsed data if needed.
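
Saving transformed rows as TSV can be sketched with Python's standard `csv` module. This is a minimal sketch under assumptions, not MetaMiner's actual writer: the function name `write_tsv` is hypothetical, the second row is a made-up placeholder, and the output goes to an in-memory buffer where a real run would open a file at the selected location.

```python
import csv
import io


def write_tsv(rows, handle):
    """Write a list of row dicts as TSV and return the column order used."""
    # Records may carry different attributes, so collect the union of
    # all keys in first-seen order to build the header.
    columns = []
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)
    writer = csv.DictWriter(
        handle,
        fieldnames=columns,
        delimiter="\t",
        restval="",          # empty cell where a record lacks an attribute
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return columns


rows = [
    {"accession": "SAMN12345678", "strain": "06-00048", "host": "homo sapiens"},
    {"accession": "SAMN00000001", "collection_date": "2006"},  # placeholder record
]

buffer = io.StringIO()
columns = write_tsv(rows, buffer)
tsv_text = buffer.getvalue()
```

Building the header from the union of keys keeps every attribute seen in any record, so no parsed value is dropped from the saved file.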

After saving, MetaMiner moves on to the next phase (step 3): cleaning, sorting, and categorizing the data.