2.2. sspi_raw_api_data - tjmisko/sspi-data-webapp GitHub Wiki
The wrapper class for sspi_raw_api_data implements three methods of interest:
- `sspi_raw_api_data.raw_insert_one(document)` inserts `document` into the raw database. It is a drop-in replacement for `insert_one`, the usual pymongo collection method, that additionally runs some validation logic, handles the formatting of `**kwargs` (more on which below) as dictionary entries, and implements autofragmentation, which splits up documents that are too large to fit into MongoDB.
- `sspi_raw_api_data.raw_insert_many(document_list)` inserts all documents in `document_list` into the raw database, running the same validation logic. Under the hood, it simply calls `raw_insert_one` for each document in the list. You can use it as a drop-in replacement for pymongo's `insert_many` when working with raw data.
- `sspi_raw_api_data.fetch_raw_data(IndicatorCode)` replaces a call to pymongo's `find`, returning all data for the given indicator and, importantly, automatically reassembling the fragmented documents produced by the methods above. You may pass keyword arguments to filter queries (see below).
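To make the idea of autofragmentation and reassembly concrete, here is a minimal, self-contained sketch. It is not the wrapper's actual implementation: the function names, the `FragmentGroupID` and `FragmentNumber` keys, and the tiny size limit are all assumptions chosen for illustration. MongoDB's real limit is 16 MiB per BSON document, and a real implementation would measure encoded bytes rather than characters.

```python
import uuid

# Hypothetical limit, in characters, kept tiny so the behaviour is visible.
# MongoDB's actual limit is 16 MiB per BSON document.
MAX_FRAGMENT_CHARS = 50


def fragment_document(document: dict, max_chars: int = MAX_FRAGMENT_CHARS) -> list:
    """Split a document whose string "Raw" payload is too large into ordered fragments."""
    raw = document["Raw"]
    if not isinstance(raw, str) or len(raw) <= max_chars:
        return [document]  # small enough: store unchanged
    group_id = str(uuid.uuid4())  # hypothetical key tying the fragments together
    chunks = [raw[i:i + max_chars] for i in range(0, len(raw), max_chars)]
    return [
        {**document, "Raw": chunk, "FragmentGroupID": group_id, "FragmentNumber": n}
        for n, chunk in enumerate(chunks)
    ]


def reassemble_documents(stored: list) -> list:
    """Rebuild the original documents from a mix of whole and fragmented ones."""
    whole, groups = [], {}
    for doc in stored:
        if "FragmentGroupID" in doc:
            groups.setdefault(doc["FragmentGroupID"], []).append(doc)
        else:
            whole.append(doc)
    for fragments in groups.values():
        fragments.sort(key=lambda d: d["FragmentNumber"])
        # Copy the shared fields, dropping the bookkeeping keys.
        merged = {k: v for k, v in fragments[0].items()
                  if k not in ("FragmentGroupID", "FragmentNumber")}
        merged["Raw"] = "".join(f["Raw"] for f in fragments)
        whole.append(merged)
    return whole


# Usage: a 160-character payload is split into four fragments and restored.
big = {"Raw": "x,y\n" * 40, "Source": {"OrganizationCode": "WB", "QueryCode": "SI.POV.GINI"}}
fragments = fragment_document(big)
restored = reassemble_documents(fragments)
```

The point of the round trip is that callers never see the fragments: whatever goes in through the insert side comes back out of the fetch side as a single document.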
- All data in the `sspi_raw_api_data` database is inserted by a `collect` route as the result of a call to an external data API.
- Raw data is queried by a `clean` route, which cleans it and stores it in `sspi_clean_api_data`.
Important
Raw data should be stored in as raw a format as possible for reproducibility and error-catching. For example, if the data you're pulling is a `.csv` file, then "as raw as possible" means something like decoding the file as a string and storing that raw string, as opposed to loading the file as a dataframe and then dumping that dataframe as a JSON object. We prefer the string because it limits the number of places where mistakes can creep in. It is easy to check whether the string in the database matches the string you'd get from the source by other means, so if something unexpected happens downstream, you don't have to debug your collector function: you know the error is not there.
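As a rough illustration of the "store the string" advice, the sketch below decodes a CSV payload once and keeps the text untouched. The response bytes, the `EXAMPLE_ORG` and `EXAMPLE_QUERY` codes, and the `sha256_of` helper are all hypothetical. In a real collector, the resulting document would be passed to `raw_insert_one`.

```python
import hashlib

# Hypothetical bytes standing in for the body of an HTTP response
# from a provider that serves a .csv file.
response_body = b"country,year,value\nAFE,2024,\nZAF,2014,63.0\n"

# Decode once and store the text unchanged: no parsing, no type coercion.
raw_text = response_body.decode("utf-8")

document = {
    "Raw": raw_text,
    # Placeholder codes for illustration only.
    "Source": {"OrganizationCode": "EXAMPLE_ORG", "QueryCode": "EXAMPLE_QUERY"},
}


def sha256_of(text: str) -> str:
    """Hash the UTF-8 encoding of a string (hypothetical helper)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Because nothing was transformed, the stored string can be compared
# directly against a fresh download of the same file.
stored_matches_source = sha256_of(document["Raw"]) == hashlib.sha256(response_body).hexdigest()
```

Had the file been loaded into a dataframe and dumped back out, this comparison would no longer be meaningful, since empty cells, number formatting, and column order could all have changed along the way.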
- `Raw` is the raw data returned by the API call. If the API call returns an array of JSON objects, then pass the list to `raw_insert_many`. Otherwise, use `raw_insert_one`.
- `Source` is the all-important source information. This is how we will query data out of the database later. Of particular importance are the `QueryCode` and the `OrganizationCode`, which must be present for every raw observation. Basically, you can think of this pair as a way of specifying where to go and which query to run.
- `CollectionInfo` must contain a `Date` and a `Username`. This is handled automatically by the `raw_insert_one` method.
{
  "CollectionInfo": {
    "Date": "2025-09-03 21:43",
    "Username": "tjmisko"
  },
  "Raw": {
    "country": {
      "id": "ZH",
      "value": "Africa Eastern and Southern"
    },
    "countryiso3code": "AFE",
    "date": "2024",
    "decimal": 1,
    "indicator": {
      "id": "SI.POV.GINI",
      "value": "Gini index"
    },
    "obs_status": "",
    "unit": "",
    "value": null
  },
  "Source": {
    "BaseURL": "https://api.worldbank.org/v2/country/all/indicator/SI.POV.GINI",
    "OrganizationCode": "WB",
    "OrganizationName": "World Bank",
    "OrganizationSeriesCode": "SI.POV.GINI",
    "QueryCode": "SI.POV.GINI",
    "URL": "https://api.worldbank.org/v2/country/all/indicator/SI.POV.GINI?per_page=1000&format=json&page=1"
  }
}
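The document above can be assembled and checked with a few lines of Python. The sketch below is an assumption about what the validation might look like, not the wrapper's actual code: `build_raw_document` and `REQUIRED_SOURCE_KEYS` are hypothetical names, and the only rules it enforces are the ones stated above (every `Source` needs a `QueryCode` and an `OrganizationCode`, and `CollectionInfo` carries a `Date` and a `Username`).

```python
from datetime import datetime
from typing import Any, Optional

# The two keys that, per the text above, must be present on every raw observation.
REQUIRED_SOURCE_KEYS = ("OrganizationCode", "QueryCode")


def build_raw_document(raw: Any, source: dict, username: str,
                       now: Optional[datetime] = None) -> dict:
    """Wrap a raw payload with Source and CollectionInfo (hypothetical helper)."""
    missing = [key for key in REQUIRED_SOURCE_KEYS if key not in source]
    if missing:
        raise ValueError(f"Source is missing required keys: {missing}")
    now = now or datetime.now()
    return {
        "CollectionInfo": {
            # Same "YYYY-MM-DD HH:MM" layout as the example document.
            "Date": now.strftime("%Y-%m-%d %H:%M"),
            "Username": username,
        },
        "Raw": raw,
        "Source": dict(source),  # copy so the caller's dict is not shared
    }


# Usage: one observation from a World Bank response, with a fixed timestamp.
doc = build_raw_document(
    raw={"countryiso3code": "AFE", "date": "2024", "value": None},
    source={"OrganizationCode": "WB", "QueryCode": "SI.POV.GINI"},
    username="tjmisko",
    now=datetime(2025, 9, 3, 21, 43),
)
```

Rejecting a document at insert time is much cheaper than discovering later that an observation cannot be found because its `QueryCode` or `OrganizationCode` was never recorded.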