2.3. sspi_clean_api_data - tjmisko/sspi-data-webapp GitHub Wiki


SSPI Clean API Data

Contents and Format

The sspi_clean_api_data collection contains clean datasets for the SSPI produced by processing raw data.

Every sspi_clean_api_data document must be uniquely identified by its DatasetCode, CountryCode, and Year. Passing insert_many a document list that contains two documents with identical DatasetCode, CountryCode, and Year values will result in an error. A duplicate usually means the data were not filtered correctly, though there are other possible causes. Go back and examine how the duplicate document made it into your dataflow.
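One way to catch duplicates before calling insert_many is to count documents by their identifying key. The following is a minimal sketch, not part of the project's codebase: the helper name find_duplicate_keys and the dataset code EXAMPLE_DS are hypothetical and used only for illustration.

```python
from collections import Counter


def find_duplicate_keys(documents):
    """Return each (DatasetCode, CountryCode, Year) key that appears more than once."""
    counts = Counter(
        (doc["DatasetCode"], doc["CountryCode"], doc["Year"]) for doc in documents
    )
    return [key for key, count in counts.items() if count > 1]


# EXAMPLE_DS is a made-up dataset code for illustration only.
docs = [
    {"DatasetCode": "EXAMPLE_DS", "CountryCode": "USA", "Year": 2020, "Value": 1.0, "Unit": "Index"},
    {"DatasetCode": "EXAMPLE_DS", "CountryCode": "USA", "Year": 2020, "Value": 1.5, "Unit": "Index"},
    {"DatasetCode": "EXAMPLE_DS", "CountryCode": "CAN", "Year": 2020, "Value": 2.0, "Unit": "Index"},
]

duplicates = find_duplicate_keys(docs)
if duplicates:
    print(f"Duplicate keys found: {duplicates}")
```

Running a check like this inside a cleaner, before insertion, points directly at the offending keys, which makes it easier to trace where the filtering went wrong.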

Required fields

  • DatasetCode: An all-caps string identifying the dataset to which the document corresponds. Refer to ./datasets for metadata and information about datasets and dataset codes.
  • CountryCode: Recorded as the three-letter, all-caps [ISO-3166-1 alpha-3] country code. Use the country lookup utility functions to convert any other format to the standard format.
  • Year: Store year variables as integers, not strings.
  • Value: Stores the observation value as an integer or floating point number.
  • Unit: Stores a string identifying the units in which Value is measured. If the data are sourced from another published index, record the units as "Index".
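The field requirements above can be expressed as a small validation helper. This is a sketch under stated assumptions rather than the project's own validator: the function name validation_errors and the dataset code EXAMPLE_DS are hypothetical, and the CountryCode check only verifies the three-capital-letter shape, not that the code is actually assigned in ISO-3166-1.

```python
import re

REQUIRED_FIELDS = ("DatasetCode", "CountryCode", "Year", "Value", "Unit")


def validation_errors(doc):
    """Return a list of human-readable problems; an empty list means the document passes."""
    missing = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in doc]
    if missing:
        return missing

    errors = []
    code = doc["DatasetCode"]
    if not (isinstance(code, str) and code.isupper()):
        errors.append("DatasetCode must be an all-caps string")
    country = doc["CountryCode"]
    if not (isinstance(country, str) and re.fullmatch(r"[A-Z]{3}", country)):
        errors.append("CountryCode must be three capital letters")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(doc["Year"], int) or isinstance(doc["Year"], bool):
        errors.append("Year must be an integer")
    if not isinstance(doc["Value"], (int, float)) or isinstance(doc["Value"], bool):
        errors.append("Value must be an integer or float")
    if not isinstance(doc["Unit"], str):
        errors.append("Unit must be a string")
    return errors


# EXAMPLE_DS is a made-up dataset code for illustration only.
good = {"DatasetCode": "EXAMPLE_DS", "CountryCode": "USA", "Year": 2020, "Value": 0.5, "Unit": "Index"}
bad = {"DatasetCode": "EXAMPLE_DS", "CountryCode": "usa", "Year": "2020", "Value": 0.5, "Unit": "Index"}

print(validation_errors(good))
print(validation_errors(bad))
```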

Loading Data

All data in the sspi_clean_api_data collection is inserted by cleaner functions. Cleaners can be found in

./sspi_flask_app/api/core/datasets/[DATASET_CODE].py

files. By convention, they are named clean_[dataset_code], and they must be identified with the @dataset_cleaner("[DATASET_CODE]") decorator to be found by the clean route.

To initiate the cleaning process after collecting raw data (see collect), call sspi clean [SERIES_CODE] or navigate to /api/v1/clean/{SERIES_CODE} in your browser. If SERIES_CODE is a DATASET_CODE, then only the data for that dataset will be cleaned. If SERIES_CODE is an ITEM_CODE, e.g. an indicator code, then all the datasets on which that item depends will be cleaned if data is available in sspi_raw_api_data.
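The resolution rule described above can be sketched as a small lookup. Everything here is hypothetical: the function name resolve_dataset_codes, the example codes, and the dependency mapping are assumptions made for illustration and do not reflect how the project stores or resolves series codes.

```python
def resolve_dataset_codes(series_code, dataset_codes, item_dependencies):
    """Return the dataset codes that a clean request for series_code would cover.

    dataset_codes: set of known dataset codes.
    item_dependencies: mapping from item code to the dataset codes it depends on.
    """
    if series_code in dataset_codes:
        return [series_code]  # a dataset code cleans only itself
    if series_code in item_dependencies:
        return list(item_dependencies[series_code])  # an item cleans all its datasets
    raise KeyError(f"Unknown series code: {series_code}")


# Made-up codes for illustration only.
known_datasets = {"EXAMPLE_DS_A", "EXAMPLE_DS_B"}
dependencies = {"EXAMPLE_ITEM": ["EXAMPLE_DS_A", "EXAMPLE_DS_B"]}

single = resolve_dataset_codes("EXAMPLE_DS_A", known_datasets, dependencies)
several = resolve_dataset_codes("EXAMPLE_ITEM", known_datasets, dependencies)
print(single, several)
```

The sketch omits the availability check: per the description above, a dataset is cleaned only if raw data for it is present in sspi_raw_api_data.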

N.B. By default, running clean deletes any clean data with a matching DatasetCode from the sspi_clean_api_data collection in order to prevent duplicates. Since raw data is the source of truth, this is not particularly destructive, but it is something to be aware of when you run the operation.
