2.2. sspi_raw_api_data - tjmisko/sspi-data-webapp GitHub Wiki

SSPI Raw API Data

Database Methods

The wrapper class for sspi_raw_api_data implements three methods of interest:

  1. sspi_raw_api_data.raw_insert_one(document) inserts document into the raw database. It is a drop-in replacement for insert_one, the usual pymongo collection method. On top of the insert, it runs some validation logic, handles the formatting of **kwargs (more on which below) as dictionary entries, and implements autofragmentation, which splits up documents that are too large to fit into MongoDB.
  2. sspi_raw_api_data.raw_insert_many(document_list) inserts all documents in document_list into the raw database, running the same validation logic as raw_insert_one. In fact, all it does under the hood is call raw_insert_one for each document in the list! You can use it as a drop-in replacement for pymongo's insert_many when working with raw data.
  3. sspi_raw_api_data.fetch_raw_data(IndicatorCode) replaces a call to pymongo's find, returning all data for the given indicator and, importantly, automatically handling the reassembly of fragmented documents produced by the methods above. You may pass keyword arguments to filter queries (see below).
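
To make the idea of autofragmentation and reassembly concrete, here is a minimal sketch, not the wrapper's actual implementation. The function names (fragment_document, reassemble_fragments) and the FragmentGroupID and FragmentNumber fields are assumptions made up for illustration, and the size limit is counted in characters and set absurdly low so the split is visible. MongoDB's real limit is 16 MB per BSON document.

```python
import uuid

# Tiny limit for demonstration only; MongoDB caps a BSON document at 16 MB.
MAX_FRAGMENT_CHARS = 10


def fragment_document(raw: str, source: dict, max_chars: int = MAX_FRAGMENT_CHARS) -> list:
    """Split one oversized raw string into several smaller documents.

    Hypothetical helper: the field names below are assumptions, not the
    ones used by the real sspi_raw_api_data wrapper.
    """
    group_id = str(uuid.uuid4())  # ties the fragments of one document together
    pieces = [raw[i:i + max_chars] for i in range(0, len(raw), max_chars)] or [""]
    return [
        {"Raw": piece, "Source": source, "FragmentGroupID": group_id, "FragmentNumber": n}
        for n, piece in enumerate(pieces)
    ]


def reassemble_fragments(fragments: list) -> list:
    """Group fragments by FragmentGroupID and join their Raw strings in order."""
    groups = {}
    for frag in fragments:
        groups.setdefault(frag["FragmentGroupID"], []).append(frag)
    documents = []
    for frags in groups.values():
        frags.sort(key=lambda f: f["FragmentNumber"])
        documents.append({"Raw": "".join(f["Raw"] for f in frags), "Source": frags[0]["Source"]})
    return documents


source = {"OrganizationCode": "WB", "QueryCode": "SI.POV.GINI"}
raw = "country,year,value\nAFE,2024,\n"

fragments = fragment_document(raw, source)
# Fragments may come back from the database in any order; reassembly sorts them.
restored = reassemble_fragments(list(reversed(fragments)))
```

The caller never sees the fragments: what goes in through the insert method comes back out of the fetch method as a single document with the original Raw content.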

Inputs and Outputs

  • All data in the sspi_raw_api_data database is inserted by a collect route as the result of a call to an external data API.
  • Raw data is queried by a clean route, which cleans it and stores it in sspi_clean_api_data.

Important
Raw data should be stored in as raw a format as possible for reproducibility and error-catching. For example, if the data you're pulling is a .csv file, then "as raw as possible" would mean something like decoding the file as a string and storing that raw string, as opposed to loading the file as a dataframe and then dumping that dataframe as a JSON object. We prefer the string because it limits the number of places where mistakes can creep in. It is easy to check whether the string in the database matches the string you'd get from the source by other means. If the two match, you don't have to debug your collector function when something unexpected happens downstream: you know the error is not there.
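
The .csv case above might look like the following sketch. The CSV content and the Source codes are made up for illustration, and csv_bytes stands in for the body of an HTTP response. The only transformation at collection time is decoding bytes to a string. Parsing is deferred to the clean step.

```python
import csv
import io

# Stand-in for the raw bytes of a downloaded .csv file (made-up data).
csv_bytes = "country,year,value\nAAA,2024,1.5\nBBB,2024,\n".encode("utf-8")

# Collection time: decode and store the string untouched.
raw_string = csv_bytes.decode("utf-8")
document = {
    "Raw": raw_string,
    # Hypothetical codes, used here only to make the document complete.
    "Source": {"OrganizationCode": "EX", "QueryCode": "EXAMPLE_SERIES"},
}

# Checking the stored value against the source is a plain comparison.
stored_matches_source = document["Raw"].encode("utf-8") == csv_bytes

# Clean time: parse the stored string into rows only when it is needed.
rows = list(csv.DictReader(io.StringIO(document["Raw"])))
```

Because nothing is parsed during collection, a surprising value in rows can be traced either to the source itself or to the cleaning code, never to the collector.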

Required Fields

  • Raw is the raw data returned by the API call. If the API call returns an array of JSON objects, then pass the list to raw_insert_many. Otherwise, use raw_insert_one.
  • Source is the all-important source information. This is how we will query data out of the database later. Of particular importance are the QueryCode and the OrganizationCode, which must be present for every raw observation. Basically, you can think of this pair as a way of specifying where to go and which query to run.
  • CollectionInfo must contain a Date and a Username. This is handled automatically by the raw_insert_one method.
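
As a rough illustration of how the three required fields fit together, the sketch below builds a document by hand and rejects a Source that lacks either required code. The helper build_raw_document is hypothetical. In practice raw_insert_one fills in CollectionInfo for you, and its actual validation rules may differ from what is shown here.

```python
from datetime import datetime

REQUIRED_SOURCE_KEYS = ("OrganizationCode", "QueryCode")


def build_raw_document(raw, source: dict, username: str) -> dict:
    """Assemble a raw document with the three required top-level fields.

    Hypothetical helper for illustration; not part of the wrapper class.
    """
    missing = [key for key in REQUIRED_SOURCE_KEYS if key not in source]
    if missing:
        raise ValueError(f"Source is missing required keys: {missing}")
    return {
        "CollectionInfo": {
            "Date": datetime.now().strftime("%Y-%m-%d %H:%M"),
            "Username": username,
        },
        "Raw": raw,
        "Source": source,
    }


doc = build_raw_document(
    raw={"countryiso3code": "AFE", "date": "2024", "value": None},
    source={"OrganizationCode": "WB", "QueryCode": "SI.POV.GINI"},
    username="tjmisko",
)

# A Source without a QueryCode cannot be queried later, so it is rejected.
try:
    build_raw_document(raw={}, source={"OrganizationCode": "WB"}, username="tjmisko")
    rejected = False
except ValueError:
    rejected = True
```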

Example Structure

{
  "CollectionInfo": {
    "Date": "2025-09-03 21:43",
    "Username": "tjmisko"
  },
  "Raw": {
    "country": {
      "id": "ZH",
      "value": "Africa Eastern and Southern"
    },
    "countryiso3code": "AFE",
    "date": "2024",
    "decimal": 1,
    "indicator": {
      "id": "SI.POV.GINI",
      "value": "Gini index"
    },
    "obs_status": "",
    "unit": "",
    "value": null
  },
  "Source": {
    "BaseURL": "https://api.worldbank.org/v2/country/all/indicator/SI.POV.GINI",
    "OrganizationCode": "WB",
    "OrganizationName": "World Bank",
    "OrganizationSeriesCode": "SI.POV.GINI",
    "QueryCode": "SI.POV.GINI",
    "URL": "https://api.worldbank.org/v2/country/all/indicator/SI.POV.GINI?per_page=1000&format=json&page=1"
  }
}