3.2. Clean - tjmisko/sspi-data-webapp GitHub Wiki
Input: `SeriesCode`

Behavior: Cleans raw data previously collected for a dataset or item, transforming it into structured, analysis-ready form and saving it to `sspi_clean_api_data`.

Output: Event Stream or JSON (depending on request)

Command:

```shell
sspi clean [SeriesCode] [--remote]
```

Local URL: `http://localhost:5000/api/v1/clean/<SeriesCode>`

Remote URL: `https://sspi.world/api/v1/clean/<SeriesCode>`
A clean route applies a dataset’s registered @dataset_cleaner function to the corresponding raw data.
This process transforms raw, heterogeneous data from API responses (e.g., XML, CSV, XLSX, JSON) into the standard SSPI series document format stored in sspi_clean_api_data.
Each dataset must define an @dataset_cleaner that specifies how its raw data should be parsed, filtered, and reformatted for downstream computation and visualization.
- `SeriesCode` identifies the dataset(s) to clean. The cleaner will automatically resolve dependencies (e.g., sub-datasets required for computation).
- Providing a `DatasetCode` will run the cleaner for that dataset only.
- Providing an `ItemCode` will run the cleaner for all datasets on which the item depends.
- Cleaning functions are registered at the dataset level using the `@dataset_cleaner` decorator.
- Core functionality of the `DatasetCleaner`:
    - deletes previous cleaned records for the dataset (to prevent duplication),
    - fetches corresponding raw data from `sspi_raw_api_data`,
    - parses and filters that data according to dataset-specific logic,
    - inserts the cleaned results into `sspi_clean_api_data`,
    - updates metadata such as temporal coverage in `sspi_metadata`.
- Variable return values:
    - If a `SeriesCode` depends on only one dataset, the cleaner is executed directly, and the cleaned dataset is returned as JSON.
    - If multiple datasets are involved, an event stream is returned showing sequential progress for each dataset.
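The multi-dataset case can be sketched as a generator that yields one progress message per dependent dataset. This is an illustrative sketch, not the project's actual implementation; the names `clean_dataset` and `stream_clean` are hypothetical.

```python
# Hypothetical sketch of streaming clean progress: one event line per
# dependent dataset, yielded in order. Names are illustrative only.

def clean_dataset(dataset_code: str) -> list[dict]:
    """Stand-in for a registered @dataset_cleaner function."""
    return [{"DatasetCode": dataset_code}]

def stream_clean(dataset_codes: list[str]):
    """Yield one progress line per dataset, mirroring the event stream."""
    total = len(dataset_codes)
    for i, code in enumerate(dataset_codes, start=1):
        yield f"Cleaning dataset {code} ( {i} of {total} )"
        clean_dataset(code)

events = list(stream_clean(["UNSDG_TERRST", "UNSDG_FRSHWT", "UNSDG_MARINE"]))
```

In a Flask route, a generator like this would typically be wrapped in a streaming response so the client sees each line as the corresponding cleaner finishes.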
Clean a single dataset locally:

```shell
sspi clean UNSDG_FRSHWT
```

Clean the same dataset using the remote server:

```shell
sspi clean UNSDG_FRSHWT --remote
```

Running `sspi clean BIODIV` runs the cleaners for all datasets on which BIODIV depends, returning an event stream:

```
Cleaning dataset UNSDG_TERRST ( 1 of 3 )
Cleaning dataset UNSDG_FRSHWT ( 2 of 3 )
Cleaning dataset UNSDG_MARINE ( 3 of 3 )
```
Clean routes are registered in the `dataset_bp` Blueprint (`sspi_flask_app/api/core/datasets/`); the CLI command is at `cli/commands/clean`.
> [!IMPORTANT]
> Common Issue #1: When making changes to `@dataset_cleaner` functions, you must kill the Flask development server and reload it in order for those changes to take effect, even if running the development server with the `--debug` flag.
The reason for this issue is that the registry is built when the modules containing the datasets are imported, which happens at application startup. A design in which cleaners reload dynamically when changed would be possible, but the added complexity simply isn't worth the effort.
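An import-time decorator registry like the one described above can be sketched as follows. This is a minimal illustration of the pattern, not the project's actual code; the name `DATASET_CLEANERS` is hypothetical.

```python
# Minimal sketch of an import-time decorator registry, illustrating why
# editing a cleaner requires a server restart: the mapping is populated
# exactly once, when the defining module is imported.

DATASET_CLEANERS: dict = {}

def dataset_cleaner(dataset_code: str):
    """Register a cleaning function under its DatasetCode at import time."""
    def decorator(func):
        DATASET_CLEANERS[dataset_code] = func
        return func
    return decorator

@dataset_cleaner("UNSDG_FRSHWT")
def clean_unsdg_frshwt():
    return "cleaned UNSDG_FRSHWT"

# Routes look up the cleaner by DatasetCode and invoke it.
result = DATASET_CLEANERS["UNSDG_FRSHWT"]()
```

Because registration happens as a side effect of module import, redefining a function in the source file has no effect on the already-populated mapping until the process restarts.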
Querying from the raw database (during the fetch phase documented in step 2 above) depends on the `sspi_metadata.get_source_info` and `sspi_raw_api_data.fetch_raw_data` methods. These in turn depend on the underlying structure of the `DatasetDetail` metadata, which is loaded from the corresponding `documentation.md` files specified in `datasets/`. This coupling is mediated by the `DatasetCode`.
- The crucial fields for raw data fetching in the documentation frontmatter are `Source.OrganizationCode` and `Source.QueryCode`. These two fields are used directly to query the data in the methods above and are needed in every clean route.
- Ensure also that the dataset files and dataset metadata files are correctly organized (grouped in directories by `OrganizationCode`) and correctly named. Mismatches can be responsible for errors.
> [!IMPORTANT]
> Common Issue #2: Be sure to keep the dataset documentation file in sync with the dataset cleaner file! After making changes to the metadata files, you must reload them by calling `sspi metadata reload`.
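For illustration, the frontmatter of a `documentation.md` file might carry the two crucial fields like this. This is a hypothetical sketch of the shape only; the exact field set and values in the real files may differ.

```yaml
# Hypothetical documentation.md frontmatter sketch (shape only)
DatasetCode: UNSDG_FRSHWT
Source:
  OrganizationCode: UNSDG   # selects the directory / source organization
  QueryCode: ER_PTD_FRHWTR  # passed through to the raw-data query
```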
Each cleaner follows a similar structure:
- Delete existing cleaned data.
- Fetch corresponding raw data.
- Transform and filter into structured tabular form.
- Insert cleaned results into the database.
- Record dataset metadata (e.g., available years, last update).
```python
@dataset_cleaner("UNSDG_FRSHWT")
def clean_unsdg_frshwt():
    # Delete existing cleaned data to prevent duplication
    sspi_clean_api_data.delete_many({"DatasetCode": "UNSDG_FRSHWT"})
    # Fetch the corresponding raw data
    source_info = sspi_metadata.get_source_info("UNSDG_FRSHWT")
    raw_data = sspi_raw_api_data.fetch_raw_data(source_info)
    # Transform and filter into the standard SSPI series document format
    extracted_unsdg_frshwt = extract_sdg(raw_data)
    idcode_map = {"ER_PTD_FRHWTR": "UNSDG_FRSHWT"}
    rename_map = {"units": "Unit", "seriesDescription": "Description"}
    drop_list = ["goal", "indicator", "series", "seriesCount", "target", "geoAreaCode", "geoAreaName"]
    unsdg_frshwt = filter_sdg(extracted_unsdg_frshwt, idcode_map, rename_map, drop_list)
    # Insert cleaned results and record dataset metadata
    sspi_clean_api_data.insert_many(unsdg_frshwt)
    sspi_metadata.record_dataset_range(unsdg_frshwt, "UNSDG_FRSHWT")
    return unsdg_frshwt
```

```python
@dataset_cleaner("WB_SANSRV")
def clean_wb_sansrv():
    # Delete existing cleaned data to prevent duplication
    sspi_clean_api_data.delete_many({"DatasetCode": "WB_SANSRV"})
    # Fetch the corresponding raw data
    source_info = sspi_metadata.get_source_info("WB_SANSRV")
    raw_data = sspi_raw_api_data.fetch_raw_data(source_info)
    # Transform and filter using the shared World Bank helper
    cleaned_data = clean_wb_data(raw_data, "WB_SANSRV", "Percent")
    # Insert cleaned results and record dataset metadata
    sspi_clean_api_data.insert_many(cleaned_data)
    sspi_metadata.record_dataset_range(cleaned_data, "WB_SANSRV")
    return parse_json(cleaned_data)
```