Metadata Retrieval - prekijpatel/MetaMiner GitHub Wiki
Data Retrieval
MetaMiner retrieves genomic metadata from NCBI servers using NCBI’s own command-line tool, the NCBI Datasets CLI, which runs in the background of the software. Relying on NCBI’s official tool helps preserve the accuracy and integrity of the metadata during retrieval. CLI-based metadata is typically retrieved in JSON Lines format. However, users may have metadata downloaded from the NCBI Datasets web interface, which may appear in JSON format.
Both the JSON and JSON Lines formats, while rich in structure, are not ideal for large-scale filtering, sorting, or data manipulation, especially when working with thousands of records. Therefore, a transformation step was introduced before any cleaning or downstream processing.
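As a rough illustration of handling both inputs, the sketch below loads records from either JSON or JSON Lines text. It is not MetaMiner's actual code: the function name `load_records` is hypothetical, and the assumption that a JSON download wraps its records in a top-level `reports` list should be checked against your own files.

```python
import json


def load_records(text):
    """Return a list of record dicts from JSON or JSON Lines text.

    Hypothetical helper for illustration; not part of MetaMiner.
    """
    text = text.strip()
    if not text:
        return []
    try:
        # A regular JSON file parses as one document.
        parsed = json.loads(text)
    except json.JSONDecodeError:
        # Several documents in one file: treat it as JSON Lines,
        # one record per non-empty line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    if isinstance(parsed, list):
        return parsed
    # Assumption: a JSON download keeps its records under "reports".
    if isinstance(parsed, dict) and isinstance(parsed.get("reports"), list):
        return parsed["reports"]
    # A single record (for example, a one-line JSON Lines file).
    return [parsed]
```

Either way the caller ends up with a plain list of dictionaries, so the later steps do not need to know which format the file came in.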
Data Transformation
MetaMiner transforms the raw JSON/JSONL to a structured, tabular format (DataFrame), making it easier to work with in large-scale analyses. MetaMiner is not the only tool that does this: NCBI also provides a CLI tool called dataformat for similar purposes.
However, to us, the transformation approach taken with MetaMiner seems simpler: it produces one row per record with one column per attribute, whereas dataformat produces one row per attribute.
Example Transformation:
# JSON snippet of BioSample attributes
{
  "accession": "SAMN12345678",
  "attributes": [
    {"name":"strain","value":"06-00048"},
    {"name":"collection_date","value":"2006"},
    {"name":"isolate_name_alias","value":"CFSAN004178"},
    {"name":"geo_loc_name","value":"USA:CA"},
    {"name":"serovar","value":"O36:H14"},
    {"name":"host","value":"homo sapiens"},
    {"name":"host_disease","value":"None, Healthy"},
    {"name":"isolation_source","value":"Nares"}
  ]
}
MetaMiner-based transformation:
accession | strain | collection_date | isolate_name_alias | geo_loc_name | serovar | host | host_disease | isolation_source |
---|---|---|---|---|---|---|---|---|
SAMN12345678 | 06-00048 | 2006 | CFSAN004178 | USA:CA | O36:H14 | homo sapiens | None, Healthy | Nares |
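The wide layout above can be approximated with a few lines of Python. This is only a sketch of the idea under the assumption that each record carries an `attributes` list of name/value pairs as in the snippet. The function name `flatten_biosample` is hypothetical and MetaMiner's own implementation may differ.

```python
def flatten_biosample(record):
    """Turn one BioSample record into a single flat row (dict).

    Hypothetical helper for illustration; not MetaMiner's actual code.
    """
    row = {"accession": record.get("accession")}
    for attr in record.get("attributes", []):
        # Each attribute name becomes a column. If a name repeats,
        # the later value overwrites the earlier one in this sketch.
        row[attr["name"]] = attr.get("value")
    return row


record = {
    "accession": "SAMN12345678",
    "attributes": [
        {"name": "strain", "value": "06-00048"},
        {"name": "collection_date", "value": "2006"},
        {"name": "isolate_name_alias", "value": "CFSAN004178"},
        {"name": "geo_loc_name", "value": "USA:CA"},
        {"name": "serovar", "value": "O36:H14"},
        {"name": "host", "value": "homo sapiens"},
        {"name": "host_disease", "value": "None, Healthy"},
        {"name": "isolation_source", "value": "Nares"},
    ],
}

row = flatten_biosample(record)
print(row["accession"], row["strain"], row["isolation_source"])
```

A list of such rows can then be handed to a tabular library of your choice to build the DataFrame.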
dataformat-based transformation:
accession | attribute_name | attribute_value |
---|---|---|
SAMN12345678 | strain | 06-00048 |
SAMN12345678 | collection_date | 2006 |
SAMN12345678 | isolate_name_alias | CFSAN004178 |
SAMN12345678 | geo_loc_name | USA:CA |
SAMN12345678 | serovar | O36:H14 |
SAMN12345678 | host | homo sapiens |
SAMN12345678 | host_disease | None, Healthy |
SAMN12345678 | isolation_source | Nares |
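To see why the long layout needs extra work before filtering by column, consider the following sketch, which pivots long (accession, attribute_name, attribute_value) rows back into one row per accession. The function name `pivot_long_rows` is hypothetical, and the input tuples are copied from the table above rather than produced by running dataformat.

```python
def pivot_long_rows(rows):
    """Group (accession, name, value) tuples into one dict per accession.

    Hypothetical helper for illustration only.
    """
    wide = {}
    for accession, name, value in rows:
        # Create the row on first sight of an accession, then add the column.
        wide.setdefault(accession, {"accession": accession})[name] = value
    return list(wide.values())


long_rows = [
    ("SAMN12345678", "strain", "06-00048"),
    ("SAMN12345678", "collection_date", "2006"),
    ("SAMN12345678", "isolate_name_alias", "CFSAN004178"),
    ("SAMN12345678", "geo_loc_name", "USA:CA"),
    ("SAMN12345678", "serovar", "O36:H14"),
    ("SAMN12345678", "host", "homo sapiens"),
    ("SAMN12345678", "host_disease", "None, Healthy"),
    ("SAMN12345678", "isolation_source", "Nares"),
]

wide_rows = pivot_long_rows(long_rows)
print(len(wide_rows), wide_rows[0]["serovar"])
```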
Once the data is transformed, MetaMiner saves the transformed metadata (which essentially becomes your raw metadata) at the selected location in TSV format. This ensures that users can always revisit or reprocess the original parsed data if needed.
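For readers who want to reproduce the saving step outside MetaMiner, here is a minimal sketch using Python's standard `csv` module. The function name `write_tsv` is hypothetical, the second accession is invented for the example, and leaving missing attributes as empty cells is an assumption about how gaps should be represented.

```python
import csv
import io


def write_tsv(rows, handle):
    """Write a list of flat dicts as TSV and return the column order.

    Hypothetical helper for illustration; not MetaMiner's actual code.
    """
    # Collect columns in first-seen order, since records may differ.
    columns = []
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)
    writer = csv.DictWriter(
        handle,
        fieldnames=columns,
        delimiter="\t",
        restval="",  # assumption: missing attributes become empty cells
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return columns


rows = [
    {"accession": "SAMN12345678", "host": "homo sapiens"},
    {"accession": "SAMN00000001", "serovar": "O36:H14"},  # made-up accession
]
buffer = io.StringIO()
columns = write_tsv(rows, buffer)
tsv_text = buffer.getvalue()
print(tsv_text)
```

To write to disk instead of an in-memory buffer, pass a file opened with `open(path, "w", newline="", encoding="utf-8")`.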
After saving, MetaMiner moves on to the next phase (step 3): cleaning, sorting, and categorizing the data.