3.2. Clean - tjmisko/sspi-data-webapp GitHub Wiki

Clean

Input: SeriesCode
Behavior: Cleans raw data previously collected for a dataset or item, transforming it into structured, analysis-ready form and saving it to sspi_clean_api_data.
Output: Event Stream or JSON (depending on request)


Command:

sspi clean [SeriesCode] [--remote]

Local URL:

http://localhost:5000/api/v1/clean/<SeriesCode>

Remote URL:

https://sspi.world/api/v1/clean/<SeriesCode>

Details and Notes

A clean route applies a dataset’s registered @dataset_cleaner function to the corresponding raw data. This process transforms raw, heterogeneous data from API responses (e.g., XML, CSV, XLSX, JSON) into the standard SSPI series document format stored in sspi_clean_api_data.

Each dataset must define an @dataset_cleaner that specifies how its raw data should be parsed, filtered, and reformatted for downstream computation and visualization.
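The registration pattern can be sketched as follows. This is a minimal illustration of how a decorator-based registry works at import time, not the actual sspi-data-webapp implementation; the names DATASET_CLEANERS and run_cleaner are assumptions.

```python
# Minimal sketch of a decorator-based cleaner registry.
# Registration happens as a side effect of importing the module,
# which is why a server restart is needed after editing a cleaner.

DATASET_CLEANERS = {}

def dataset_cleaner(dataset_code):
    """Register a cleaning function under its DatasetCode at import time."""
    def decorator(func):
        DATASET_CLEANERS[dataset_code] = func
        return func
    return decorator

@dataset_cleaner("UNSDG_FRSHWT")
def clean_unsdg_frshwt():
    return "cleaned UNSDG_FRSHWT"

def run_cleaner(dataset_code):
    """Look up and invoke the registered cleaner for a DatasetCode."""
    return DATASET_CLEANERS[dataset_code]()
```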

  • SeriesCode identifies the dataset(s) to clean. The cleaner will automatically resolve dependencies (e.g., sub-datasets required for computation).
    • Providing a DatasetCode will run the cleaner for that dataset only.
    • Providing an ItemCode will run the cleaner for all datasets on which the item depends.
  • Cleaning functions are registered at the dataset level using the @dataset_cleaner decorator.
  • Core functionality of the DatasetCleaner:
    1. deletes previous cleaned records for the dataset (to prevent duplication),
    2. fetches corresponding raw data from sspi_raw_api_data,
    3. parses and filters that data according to dataset-specific logic,
    4. inserts the cleaned results into sspi_clean_api_data,
    5. updates metadata such as temporal coverage in sspi_metadata.
  • Variable return values:
    • If a SeriesCode depends on only one dataset, the cleaner is executed directly, and the cleaned dataset is returned as JSON data.
    • If multiple datasets are involved, an event stream is returned showing sequential progress for each dataset.
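The single-versus-multiple dispatch above can be sketched like this. The DEPENDENCIES table, resolve_datasets, and the clean function are illustrative assumptions; only the progress-message format follows the event stream shown later on this page.

```python
# Hedged sketch: one dataset -> return cleaned JSON directly;
# multiple datasets -> return a generator of progress messages (event stream).
import json

DEPENDENCIES = {
    "UNSDG_FRSHWT": ["UNSDG_FRSHWT"],  # a DatasetCode resolves to itself
    "BIODIV": ["UNSDG_TERRST", "UNSDG_FRSHWT", "UNSDG_MARINE"],  # an ItemCode fans out
}

def resolve_datasets(series_code):
    """Resolve a SeriesCode to the list of datasets it depends on."""
    return DEPENDENCIES[series_code]

def clean(series_code, run_cleaner):
    datasets = resolve_datasets(series_code)
    if len(datasets) == 1:
        # Single dataset: execute the cleaner directly, return JSON.
        return json.dumps(run_cleaner(datasets[0]))
    # Multiple datasets: stream sequential progress for each dataset.
    def stream():
        for i, code in enumerate(datasets, start=1):
            yield f"Cleaning dataset {code} ( {i} of {len(datasets)} )"
            run_cleaner(code)
    return stream()
```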

Example Usages

Clean a single dataset locally:

sspi clean UNSDG_FRSHWT

Clean the same dataset using the remote server:

sspi clean UNSDG_FRSHWT --remote

Running sspi clean BIODIV runs the cleaners for all datasets on which BIODIV depends, returning an event stream:

Cleaning dataset UNSDG_TERRST ( 1 of 3 )
Cleaning dataset UNSDG_FRSHWT ( 2 of 3 )
Cleaning dataset UNSDG_MARINE ( 3 of 3 )

Notes for Development

Clean routes are registered in the dataset_bp Blueprint (sspi_flask_app/api/core/datasets/); the CLI command is at cli/commands/clean.

[!IMPORTANT] Common Issue #1: When making changes to @dataset_cleaner functions, you must kill and restart the Flask development server for those changes to take effect, even when running with the --debug flag.

The reason for the issue above is that the cleaner registry is built when the modules containing the datasets are imported, which happens once at application startup. A design in which cleaners reload dynamically when changed would be possible, but the added complexity simply isn't worth the effort.

Synchronizing Data and Metadata

Querying from the raw database (during the fetch phase documented in step 2 above) depends on the sspi_metadata.get_source_info and sspi_raw_api_data.fetch_raw_data methods. These in turn depend on the underlying structure of the DatasetDetail metadata, which are loaded from the corresponding documentation.md files specified in datasets/. This coupling is mediated by the DatasetCode.

  • The crucial fields for raw data fetching in the documentation frontmatter are Source.OrganizationCode and Source.QueryCode. These two fields are used directly to query the raw data in the methods above, so every clean route depends on them being correct.
  • Also ensure that the dataset files and dataset metadata files are correctly organized (grouped in directories by OrganizationCode) and named. Mismatches between the two are a common source of errors.
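A documentation.md frontmatter for a dataset might look like the sketch below. Only Source.OrganizationCode and Source.QueryCode are named on this page; the DatasetCode field and the specific QueryCode value (borrowed from the idcode_map in the example cleaner further down) are illustrative assumptions, not the actual file contents.

```markdown
---
DatasetCode: UNSDG_FRSHWT
Source:
  OrganizationCode: UNSDG
  QueryCode: ER_PTD_FRHWTR
---
```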

[!IMPORTANT] Common Issue #2: Be sure to keep the dataset documentation file in sync with the dataset cleaner file! After making changes to the metadata files, you must reload them by calling sspi metadata reload.


Example Cleaner Implementations

Each cleaner follows a similar structure:

  1. Delete existing cleaned data.
  2. Fetch corresponding raw data.
  3. Transform and filter into structured tabular form.
  4. Insert cleaned results into the database.
  5. Record dataset metadata (e.g., available years, last update).

Example: UNSDG Freshwater Dataset

@dataset_cleaner("UNSDG_FRSHWT")
def clean_unsdg_frshwt():
    # 1. Delete previously cleaned records to prevent duplication
    sspi_clean_api_data.delete_many({"DatasetCode": "UNSDG_FRSHWT"})
    # 2. Fetch the corresponding raw data
    source_info = sspi_metadata.get_source_info("UNSDG_FRSHWT")
    raw_data = sspi_raw_api_data.fetch_raw_data(source_info)
    # 3. Parse and filter according to dataset-specific logic
    extracted_unsdg_frshwt = extract_sdg(raw_data)
    idcode_map = {"ER_PTD_FRHWTR": "UNSDG_FRSHWT"}
    rename_map = {"units": "Unit", "seriesDescription": "Description"}
    drop_list = ["goal", "indicator", "series", "seriesCount", "target", "geoAreaCode", "geoAreaName"]
    unsdg_frshwt = filter_sdg(extracted_unsdg_frshwt, idcode_map, rename_map, drop_list)
    # 4. Insert the cleaned results
    sspi_clean_api_data.insert_many(unsdg_frshwt)
    # 5. Record temporal coverage metadata
    sspi_metadata.record_dataset_range(unsdg_frshwt, "UNSDG_FRSHWT")
    return unsdg_frshwt

Example: World Bank Sanitation Service Dataset

@dataset_cleaner("WB_SANSRV")
def clean_wb_sansrv():
    # 1. Delete previously cleaned records to prevent duplication
    sspi_clean_api_data.delete_many({"DatasetCode": "WB_SANSRV"})
    # 2. Fetch the corresponding raw data
    source_info = sspi_metadata.get_source_info("WB_SANSRV")
    raw_data = sspi_raw_api_data.fetch_raw_data(source_info)
    # 3. Parse and filter using the shared World Bank helper
    cleaned_data = clean_wb_data(raw_data, "WB_SANSRV", "Percent")
    # 4. Insert the cleaned results
    sspi_clean_api_data.insert_many(cleaned_data)
    # 5. Record temporal coverage metadata
    sspi_metadata.record_dataset_range(cleaned_data, "WB_SANSRV")
    return parse_json(cleaned_data)