2.3. sspi_clean_api_data - tjmisko/sspi-data-webapp GitHub Wiki
The sspi_clean_api_data collection contains clean datasets for the SSPI produced by processing raw data.
All sspi_clean_api_data documents must be uniquely identified by the combination of DatasetCode, CountryCode, and Year. Passing insert_many a document list that contains two documents with identical DatasetCode, CountryCode, and Year values will result in an error. A duplicate typically means that the data has not been filtered correctly, though there are other possible causes. Go back and examine how the duplicate document made it into your dataflow.
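One way to track down such a duplicate is to count the identifying keys before inserting. The following is a minimal sketch, not part of the project: the helper name find_duplicate_keys and the EXAMPLE_DATASET code are made up for illustration.

```python
from collections import Counter


def find_duplicate_keys(documents):
    """Return the (DatasetCode, CountryCode, Year) keys that occur more than once."""
    counts = Counter(
        (doc["DatasetCode"], doc["CountryCode"], doc["Year"]) for doc in documents
    )
    return sorted(key for key, n in counts.items() if n > 1)


# Illustrative documents only; the dataset code and values are invented.
docs = [
    {"DatasetCode": "EXAMPLE_DATASET", "CountryCode": "USA", "Year": 2020, "Value": 1.0, "Unit": "Index"},
    {"DatasetCode": "EXAMPLE_DATASET", "CountryCode": "USA", "Year": 2020, "Value": 1.5, "Unit": "Index"},
    {"DatasetCode": "EXAMPLE_DATASET", "CountryCode": "CAN", "Year": 2020, "Value": 2.0, "Unit": "Index"},
]

duplicates = find_duplicate_keys(docs)
print(duplicates)
```

Running a check like this on the document list just before the insert shows which country-year pairs need a closer look in the filtering step.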
- DatasetCode: An all-caps string identifying the dataset to which the document corresponds. Refer to ./datasets for metadata and information about datasets and dataset codes.
- CountryCode: Recorded as the three-letter, all-caps ISO 3166-1 alpha-3 country code. Use the country lookup utility functions to convert any other format to the standard format.
- Year: Store year variables as integers, not strings.
- Value: Stores the observation value as an integer or floating-point number.
- Unit: A string identifying the units in which Value is measured. If the data are sourced from another published index, record the units as "Index".
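The field rules above can be turned into a quick sanity check. This is a sketch under assumptions: validate_clean_document is a hypothetical helper, not a function from the project, and the CountryCode test only checks the shape of the code (three capital letters), not whether it is an assigned ISO 3166-1 alpha-3 code.

```python
import re

REQUIRED_FIELDS = ("DatasetCode", "CountryCode", "Year", "Value", "Unit")


def validate_clean_document(doc):
    """Return a list of problems; an empty list means the document passed."""
    missing = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in doc]
    if missing:
        return missing
    problems = []
    code = doc["DatasetCode"]
    if not isinstance(code, str) or code != code.upper():
        problems.append("DatasetCode must be an all-caps string")
    country = doc["CountryCode"]
    if not isinstance(country, str) or not re.fullmatch(r"[A-Z]{3}", country):
        problems.append("CountryCode must be three capital letters")
    # bool is a subclass of int in Python, so it is excluded explicitly.
    if isinstance(doc["Year"], bool) or not isinstance(doc["Year"], int):
        problems.append("Year must be an integer")
    if isinstance(doc["Value"], bool) or not isinstance(doc["Value"], (int, float)):
        problems.append("Value must be an integer or float")
    if not isinstance(doc["Unit"], str):
        problems.append("Unit must be a string")
    return problems


# Illustrative documents; EXAMPLE_DATASET is an invented dataset code.
good = {"DatasetCode": "EXAMPLE_DATASET", "CountryCode": "USA", "Year": 2020, "Value": 1.5, "Unit": "Index"}
bad = dict(good, CountryCode="us", Year="2020")

print(validate_clean_document(good))
print(validate_clean_document(bad))
```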
All data in the sspi_clean_api_data collection is inserted by a cleaner function. Cleaners can be found in ./sspi_flask_app/api/core/datasets/[DATASET_CODE].py files. By convention, they are named clean_[dataset_code], and they must be marked with the @dataset_cleaner("[DATASET_CODE]") decorator to be found by the clean route.
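To illustrate the naming and decorator convention, here is a self-contained sketch. It makes several assumptions: the dataset_cleaner defined below is a stand-in registry written for this example, not the project's decorator (whose import path and behavior are not shown on this page), and the cleaner's signature, the raw row fields (iso3, year, value), and the EXAMPLE_DATASET code are all hypothetical.

```python
# Stand-in for the project's decorator: it only records the function
# under its dataset code so that a route could look it up later.
CLEANER_REGISTRY = {}


def dataset_cleaner(dataset_code):
    def register(func):
        CLEANER_REGISTRY[dataset_code] = func
        return func
    return register


@dataset_cleaner("EXAMPLE_DATASET")
def clean_example_dataset(raw_rows):
    """Hypothetical cleaner: convert raw rows into clean documents."""
    cleaned = []
    for row in raw_rows:
        if row.get("value") is None:
            continue  # skip observations with no value
        cleaned.append({
            "DatasetCode": "EXAMPLE_DATASET",
            "CountryCode": row["iso3"].upper(),
            "Year": int(row["year"]),      # years are stored as integers
            "Value": float(row["value"]),
            "Unit": "Index",
        })
    return cleaned


# Invented raw rows for illustration.
raw_rows = [
    {"iso3": "usa", "year": "2020", "value": "1.5"},
    {"iso3": "can", "year": "2020", "value": None},
]
cleaned = CLEANER_REGISTRY["EXAMPLE_DATASET"](raw_rows)
print(cleaned)
```

The real cleaners may read from sspi_raw_api_data and insert into sspi_clean_api_data themselves; consult an existing file under ./sspi_flask_app/api/core/datasets/ for the actual pattern.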
To initiate the cleaning process after collecting raw data (see collect), call sspi clean [SERIES_CODE] or navigate to /api/v1/clean/{SERIES_CODE} in your browser. If SERIES_CODE is a DATASET_CODE, then only the data for that dataset will be cleaned. If SERIES_CODE is an ITEM_CODE, e.g. an indicator code, then all the datasets on which that item depends will be cleaned if data is available in sspi_raw_api_data.
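The way a SERIES_CODE expands into a set of datasets can be sketched as follows. This is only an illustration of the rule described above: the function datasets_to_clean, the metadata tables, and every code in them are hypothetical and do not reflect how the project stores or resolves its metadata.

```python
# Hypothetical metadata: known dataset codes and the datasets each item depends on.
DATASET_CODES = {"EXAMPLE_DATASET_A", "EXAMPLE_DATASET_B", "EXAMPLE_DATASET_C"}
ITEM_DEPENDENCIES = {
    "EXAMPLE_INDICATOR": ["EXAMPLE_DATASET_A", "EXAMPLE_DATASET_B"],
}


def datasets_to_clean(series_code, datasets_with_raw_data):
    """Resolve a series code to the list of dataset codes that would be cleaned."""
    if series_code in DATASET_CODES:
        # A dataset code selects only that dataset.
        return [series_code]
    if series_code in ITEM_DEPENDENCIES:
        # An item code selects every dependency that has raw data available.
        return [
            code for code in ITEM_DEPENDENCIES[series_code]
            if code in datasets_with_raw_data
        ]
    raise ValueError(f"unknown series code: {series_code}")


print(datasets_to_clean("EXAMPLE_DATASET_C", set()))
print(datasets_to_clean("EXAMPLE_INDICATOR", {"EXAMPLE_DATASET_A"}))
```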
N.B. By default, running clean will delete any clean data with a matching DatasetCode in the sspi_clean_api_data collection in order to avoid duplicates. Since raw data is the source of truth, this is not particularly destructive, but be aware of it when you run the operation.
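The delete-then-insert behavior can be pictured with an in-memory stand-in for the collection. This sketch assumes nothing about the project's database wrapper: replace_clean_data is a hypothetical helper operating on a plain Python list, and the dataset codes are invented.

```python
def replace_clean_data(collection, dataset_code, new_documents):
    """Drop every document for dataset_code, then add the freshly cleaned ones.

    Returns the number of documents that were removed.
    """
    kept = [doc for doc in collection if doc["DatasetCode"] != dataset_code]
    removed = len(collection) - len(kept)
    collection[:] = kept + list(new_documents)  # mutate the list in place
    return removed


# A plain list standing in for the clean collection; all values are illustrative.
clean_collection = [
    {"DatasetCode": "EXAMPLE_DATASET_A", "CountryCode": "USA", "Year": 2020, "Value": 1.0, "Unit": "Index"},
    {"DatasetCode": "EXAMPLE_DATASET_B", "CountryCode": "USA", "Year": 2020, "Value": 7.0, "Unit": "Index"},
]
fresh = [
    {"DatasetCode": "EXAMPLE_DATASET_A", "CountryCode": "USA", "Year": 2020, "Value": 1.2, "Unit": "Index"},
]

removed_count = replace_clean_data(clean_collection, "EXAMPLE_DATASET_A", fresh)
print(removed_count, len(clean_collection))
```

Because stale documents for the dataset are removed before the new ones arrive, rerunning the operation does not produce a second copy of the same DatasetCode, CountryCode, Year key; documents for other datasets are left alone.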