3.3. Compute - tjmisko/sspi-data-webapp GitHub Wiki
Input: IndicatorCode
Behavior: Fetches data from sspi_raw_api_data and/or sspi_bulk_data, extracts and cleans data from the raw data documents, runs computations necessary to compute the Intermediate and Indicator Value and Score, and inserts documents identified at the IndicatorCode, CountryCode, and Year Level into the sspi_clean_api_data database.
Output: Returns a data message on how many documents were inserted into the database.
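As a rough illustration of this contract, here is a minimal sketch in plain Python. It is not the project's implementation: plain lists stand in for the `sspi_raw_api_data` and `sspi_clean_api_data` collections, and the function name `compute_indicator`, the `"Raw"` field layout, and the `XXXXXX` indicator code are all hypothetical.

```python
def compute_indicator(indicator_code, raw_collection, clean_collection):
    """Fetch raw documents, clean them, store the results, and report the count."""
    # Fetch: keep only the raw documents collected for this indicator.
    raw_documents = [
        doc for doc in raw_collection if doc["IndicatorCode"] == indicator_code
    ]
    # Extract and clean: one document per IndicatorCode, CountryCode, and Year.
    clean_documents = [
        {
            "IndicatorCode": indicator_code,
            "CountryCode": doc["Raw"]["country"],
            "Year": int(doc["Raw"]["year"]),
            "Value": float(doc["Raw"]["value"]),
        }
        for doc in raw_documents
    ]
    # Insert, then return a message reporting how many documents were stored.
    clean_collection.extend(clean_documents)
    return f"Inserted {len(clean_documents)} documents for {indicator_code}"


# Hypothetical raw documents; "XXXXXX" is a placeholder for some other indicator.
raw_collection = [
    {"IndicatorCode": "BIODIV", "Raw": {"country": "AUS", "year": "2018", "value": "0.5"}},
    {"IndicatorCode": "XXXXXX", "Raw": {"country": "AUS", "year": "2018", "value": "1.0"}},
]
clean_collection = []
message = compute_indicator("BIODIV", raw_collection, clean_collection)
```

Real compute routes will differ in how they fetch and clean, but the shape (fetch, clean, insert, report) stays the same.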
Every compute method looks different, since our needs are different for each indicator. A couple general guidelines:
- Helper methods make your code more readable and easier to reason about. A good rule of thumb is to give a helper a single job and to describe its job in the function name with a verb phrase. An example might be `extract_all_series` or `flatten_nested_dictionary`.
- Where possible, try to write methods that are reusable for indicators from the same data source. Store these reusable helper methods in `sspi-data-webapp/sspi_flask_app/api/datasource/<source>.py`.
- All that being said: don't kill yourself over making it pretty. Quick and dirty and done is better than beautiful and only in your head.
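To make the single-job guideline concrete, here is one way a helper named `flatten_nested_dictionary` could look. This is an illustrative sketch, not the version in the repository; the signature, the `.` separator, and the sample `raw_observation` are assumptions.

```python
def flatten_nested_dictionary(nested, parent_key="", separator="."):
    """Flatten a nested dict into a single-level dict with joined keys."""
    flat = {}
    for key, value in nested.items():
        # Build the full key path, e.g. "geo" + "code" -> "geo.code".
        full_key = f"{parent_key}{separator}{key}" if parent_key else str(key)
        if isinstance(value, dict):
            # Recurse into nested dictionaries and merge their flattened keys.
            flat.update(flatten_nested_dictionary(value, full_key, separator))
        else:
            flat[full_key] = value
    return flat


# Hypothetical raw observation, shaped like a nested API response.
raw_observation = {"geo": {"code": "AUS"}, "time": {"year": 2018}, "value": 0.5}
flat_observation = flatten_nested_dictionary(raw_observation)
```

The helper does exactly what its name says and nothing else, so it can be reused for any indicator whose source returns nested records.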
Here are the steps I would take to implement a compute route for the more complicated indicators which have a number of intermediates:
- Read in and process raw data one intermediate at a time. Depending on the form of your raw data, this may take more or less work. By the end of this step, the goal is to have a list of documents that will eventually become your intermediates dictionary. Each document in the list should have an `IntermediateCode`, `CountryCode`, `Year`, `Value`, and `Unit` associated with it.
- Append together your lists of intermediate documents into one long list. Your data is now in long format under the five headings above.
- Call the `score_indicator` function (imported from utilities.resources) on your list of documents. This function does the heavy lifting of organizing the intermediates into the correct format. The score function should be a lambda function which takes the normalized intermediates in the data as arguments. You should not goalpost the intermediates yourself, since this gets handled automatically by the `score_indicator` function. Do not worry about handling cases for missing data: it gets handled behind the scenes.
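The first two steps above can be sketched roughly as follows. The helper name `extract_intermediate_documents` and the shape of the raw rows (`country`, `year`, `value` keys with string values) are hypothetical; real raw data will usually need more cleaning than this.

```python
def extract_intermediate_documents(raw_rows, intermediate_code, unit):
    """Convert the raw rows for one intermediate into long-format documents."""
    return [
        {
            "IntermediateCode": intermediate_code,
            "CountryCode": row["country"],
            "Year": int(row["year"]),
            "Value": float(row["value"]),
            "Unit": unit,
        }
        for row in raw_rows
    ]


# Hypothetical raw rows, one list per intermediate.
raw_terrst = [
    {"country": "AUS", "year": "2018", "value": "0.5"},
    {"country": "URU", "year": "2018", "value": "0.5"},
]
raw_frshwt = [{"country": "AUS", "year": "2018", "value": "0.5"}]

# Step 1: process one intermediate at a time.
terrst_documents = extract_intermediate_documents(raw_terrst, "TERRST", "Index")
frshwt_documents = extract_intermediate_documents(raw_frshwt, "FRSHWT", "Index")

# Step 2: append the per-intermediate lists into one long-format list.
intermediate_list = terrst_documents + frshwt_documents
```

The resulting list has one document per intermediate, country, and year, which is the shape that gets passed on to the scoring step.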
For example, here's how I would handle setting up the compute BIODIV route. Assuming I've cleaned the data to the point of getting the intermediate list `biodiv_intermediate_list` (which can take some work, to be sure), I can simply call the `score_indicator` function below, and it will spit out properly formatted data that's ready to be inserted into `sspi_clean_api_data`.
```python
# score_indicator is imported from utilities.resources, as noted above.
biodiv_intermediate_list = [
    {"IntermediateCode": "TERRST", "CountryCode": "AUS", "Year": 2018, "Value": 0.5, "Unit": "Index"},
    {"IntermediateCode": "FRSHWT", "CountryCode": "AUS", "Year": 2018, "Value": 0.5, "Unit": "Index"},
    {"IntermediateCode": "MARINE", "CountryCode": "AUS", "Year": 2018, "Value": 0.5, "Unit": "Index"},
    {"IntermediateCode": "TERRST", "CountryCode": "URU", "Year": 2018, "Value": 0.5, "Unit": "Index"},
    {"IntermediateCode": "FRSHWT", "CountryCode": "URU", "Year": 2018, "Value": 0.5, "Unit": "Index"},
    ...
]
clean_data, incomplete_data = score_indicator(
    biodiv_intermediate_list,
    IndicatorCode="BIODIV",
    # Average the three normalized intermediates.
    score_function=lambda TERRST, FRSHWT, MARINE: (TERRST + FRSHWT + MARINE) / 3,
    unit="Index",
)
sspi_clean_api_data.insert_many(clean_data)
```

The `score_indicator` function is a big helper function which standardizes how we operate on our partially cleaned data to make it easier to work with:
- We validate that all documents in `intermediate_document_list` have exactly the format we require by checking that all fields are accounted for, and that no extra fields are present.
- Depending on the value of `"ScoreBy"`, we prepare the intermediate documents to be scored.
  - If `ScoreBy` is `"Score"` (the default), we score each intermediate according to its goalposts. To obtain goalpost information, we execute a many-to-one merge on the metadata loaded from `IntermediateDetails.csv`, matching goalposts to each of the intermediates in `intermediate_document_list`. This information is entered into each document under the `"LowerGoalpost"` and `"UpperGoalpost"` fields. We then generate the `"Score"` for each intermediate accordingly.
  - If `ScoreBy` is `"Value"`, then we skip the addition of goalposts, so that each intermediate document has no `"Score"`, `"LowerGoalpost"`, or `"UpperGoalpost"` field. Because we have skipped the goalposting step, the `"ScoreFunction"` operates directly on the `"Value"` field of each intermediate instead of the `"Score"` field.
- We group the intermediates by `"CountryCode"` and `"Year"` and format the documents.
- If all the intermediates are defined and `"ScoreBy"` is `"Score"`, we apply the ScoreFunction to the `"Score"` values of the intermediates to generate the `"Score"` field for the indicator document. Otherwise we record that