TimePartitioning - prio-data/viewser Wiki

ViEWS Time-Partitioning Scheme


ViEWS uses a time-partitioning scheme that splits the available data into three partitions (periods): training, calibration, and testing/forecasting. The time periods for these partitions are defined by the time stamps of the observed outcomes. The approach is described in depth in Appendix A of Hegre et al. (2020).

τ refers to calendar time, and we add subscripts to identify when the partitions start and end. Because the partitions differ between evaluation and true forecasting, we also add the superscript e to the notation for our evaluation partitions. The periodization table below shows the partitioning of data for estimating model weights, hyper-parameter tuning, evaluation, and forecasting.

Periodization

After calibration, EBMA, and hyper-parameter tuning, we retrain our models using both the training and calibration partitions.
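The partitioning described above can be sketched with plain pandas. This is only an illustration, not the library's implementation: the calibration and test boundaries, the helper select_partition, and the toy panel are all assumptions made up for the example. Only the training range 121 to 396 is borrowed from the usage example on this page.

```python
import pandas as pd

# Month boundaries per partition. The calibration and test ranges are
# hypothetical values chosen only for this illustration.
PARTITIONS = {
    "train": (121, 396),
    "calibration": (397, 432),
    "test": (433, 468),
}


def select_partition(df: pd.DataFrame, start: int, end: int) -> pd.DataFrame:
    """Return rows whose first index level (the month) lies in [start, end]."""
    months = df.index.get_level_values(0)
    return df[(months >= start) & (months <= end)]


# Toy panel indexed by (time, unit): months 121..468 for two units.
index = pd.MultiIndex.from_product(
    [range(121, 469), [530, 531]], names=["time", "unit"]
)
data = pd.DataFrame({"ln_ged_sb_dep": 0.0}, index=index)

parts = {name: select_partition(data, *bounds) for name, bounds in PARTITIONS.items()}

# After calibration and tuning, models are retrained on the training and
# calibration partitions together.
retrain = pd.concat([parts["train"], parts["calibration"]])
```

The partitions are selected by the month of the outcome, so the retraining set simply spans the first month of training through the last month of calibration.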

Usage


1. Define the partitioning scheme

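# Assumes data_partitioner and legacy have already been imported.
# Period "A": train on months 121-396, predict for months 397-432.
partitioner = data_partitioner.DataPartitioner.from_legacy_periods([
    legacy.Period("A",
                  train_start=121, train_end=396,
                  predict_start=397, predict_end=432)
])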

2. Apply the partitioner

training_a = partitioner("A", "train", hh_data_model)
# Print the first and last time values of the training partition
print(training_a.index.get_level_values(0)[[0, -1]])

3. Train the model

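from stepshift import views
from sklearn.ensemble import RandomForestRegressor

# One model is estimated per step (1, 2, and 3 months ahead)
mdl = views.StepshiftedModels(
    RandomForestRegressor(),
    [*range(1, 4)],
    "ln_ged_sb_dep")
# Fit on the training partition selected in step 2
mdl.fit(training_a)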

4. Generate the predictions

predictions = mdl.predict(partitioner("A","predict",hh_data)) 

The resulting object contains rows starting at predict_start=397 and ending at predict_end=432:

time unit step_pred_1 step_pred_2 step_pred_3 step_combined
397 530 3.248208 0.068794 0.071361 3.248208
398 530 0.060043 3.139737 0.071361 3.139737
399 530 0.060043 0.068794 3.100956 3.100956
400 530 0.060043 0.068794 0.071361 NaN
401 530 0.060043 0.068794 0.071361 NaN
402 530 0.060043 0.068794 0.071361 NaN
403 530 0.060043 0.068794 0.071361 NaN
404 530 1.534400 0.068794 0.071361 NaN
405 530 0.060043 1.540886 0.071361 NaN
406 530 0.060043 0.068794 1.649546 NaN
407 530 0.060043 0.068794 0.071361 NaN
408 530 3.075878 0.068794 0.071361 NaN
409 530 3.255441 2.440132 0.071361 NaN
410 530 0.060043 3.509400 2.962537 NaN
411 530 0.060043 0.068794 3.711153 NaN
412 530 0.060043 0.068794 0.071361 NaN
413 530 0.060043 0.068794 0.071361 NaN
414 530 0.060043 0.068794 0.071361 NaN
415 530 0.060043 0.068794 0.071361 NaN
416 530 1.748024 0.068794 0.071361 NaN
417 530 0.060043 1.963898 0.071361 NaN
418 530 0.060043 0.068794 1.900073 NaN
419 530 0.060043 0.068794 0.071361 NaN
420 530 0.060043 0.068794 0.071361 NaN
421 530 2.670264 0.068794 0.071361 NaN
422 530 0.060043 2.285300 0.071361 NaN
423 530 0.060043 0.068794 2.428119 NaN
424 530 0.060043 0.068794 0.071361 NaN
425 530 0.060043 0.068794 0.071361 NaN
426 530 0.969150 0.068794 0.071361 NaN
427 530 0.060043 1.005730 0.071361 NaN
428 530 0.060043 0.068794 1.022231 NaN
429 530 0.060043 0.068794 0.071361 NaN
430 530 0.060043 0.068794 0.071361 NaN
431 530 0.060043 0.068794 0.071361 NaN
432 530 0.060043 0.068794 0.071361 NaN

Naming conventions and partition definition

To retain the naming conventions we have established, the columns currently called step_pred_1, step_pred_2, etc. should be called ss_1, ss_2, etc., or, if preferred, step_spec_1, step_spec_2, etc. The column called step_combined is in line with convention (but was called sc in views2). The advantage of the two-letter abbreviation is that when we generate ensembles, there will be a large number of columns, each with a different model-name prefix followed by _ss_1 or _step_spec_1.
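A minimal sketch of the renaming, assuming the predictions are held in a pandas DataFrame. The helper rename_step_columns and the model name "rf" are hypothetical and not part of any ViEWS package; step_combined is left untouched here.

```python
import pandas as pd


def rename_step_columns(df, model_name=None, short=True):
    """Rename step_pred_N columns to ss_N (or step_spec_N if short=False).

    If model_name is given, it is prepended as a prefix, e.g. rf_ss_1.
    Columns that do not start with step_pred_ are left unchanged.
    """
    old_prefix = "step_pred_"
    suffix = "ss" if short else "step_spec"
    prefix = f"{model_name}_" if model_name else ""
    mapping = {
        col: f"{prefix}{suffix}_{col[len(old_prefix):]}"
        for col in df.columns
        if col.startswith(old_prefix)
    }
    return df.rename(columns=mapping)


preds = pd.DataFrame(
    {"step_pred_1": [0.1], "step_pred_2": [0.2], "step_combined": [0.1]}
)
renamed = rename_step_columns(preds, model_name="rf")
long_names = rename_step_columns(preds, short=False)
```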

Throughout, the partition is defined in terms of the month of the actuals we are targeting, not in terms of the last month with data or the last month in the training set. The partial exception is step_combined, which is defined both in terms of the last month in the training set and the month of the actual. Accordingly, the step_combined series starts at predict_start and covers one month per step in the call to StepshiftedModels; in the example above, with three steps, it runs from month 397 to month 399.
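The pattern in the example output above (step_combined equals step_pred_1 at month 397, step_pred_2 at 398, and step_pred_3 at 399) can be reproduced with a short pandas sketch. The function combine_steps is a hypothetical helper inferred from that output, not the stepshift implementation; the numbers are the first four rows of the table.

```python
import numpy as np
import pandas as pd


def combine_steps(preds, steps, predict_start):
    """Build step_combined: for each step s, take step_pred_s at the month
    that lies s months after the last training month (predict_start + s - 1)."""
    combined = pd.Series(np.nan, index=preds.index, name="step_combined")
    months = preds.index.get_level_values(0)
    for step in steps:
        mask = months == predict_start + step - 1
        combined.loc[mask] = preds.loc[mask, f"step_pred_{step}"].to_numpy()
    return combined


# First four rows of the example output, for unit 530.
index = pd.MultiIndex.from_product([[397, 398, 399, 400], [530]], names=["time", "unit"])
preds = pd.DataFrame(
    {
        "step_pred_1": [3.248208, 0.060043, 0.060043, 0.060043],
        "step_pred_2": [0.068794, 3.139737, 0.068794, 0.068794],
        "step_pred_3": [0.071361, 0.071361, 3.100956, 0.071361],
    },
    index=index,
)
step_combined = combine_steps(preds, steps=[1, 2, 3], predict_start=397)
```

Months beyond the last step (400 onward in the example) stay NaN, since no step model targets them from the last training month.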

The figures below are illustrations of the process from Hegre et al. (2020).

Timeshifting

Test_vs_forecast_2020

Time-Partitioning Illustrations


The following diagram shows predictions from a step = 1 model, which explains why there are leading missing values for predictions and lagging missing values for the independent variables.

ViEWS 3 - Frame 3
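The same alignment can be shown in a few lines of pandas. This is a toy sketch for a single unit with made-up months and values, and the "model" is a stand-in that just passes the feature through; it only illustrates where the missing values come from.

```python
import pandas as pd

step = 1

# Hypothetical months 1..5 with an observed independent variable x.
features = pd.Series([0.1, 0.2, 0.3, 0.4, 0.5], index=range(1, 6), name="x")

# Stand-in for a fitted step model: the prediction made from data at month t
# is indexed by the month it targets, t + step.
predictions = features.copy()
predictions.index = predictions.index + step
predictions.name = "step_pred_1"

# Put both series on one timeline covering months 1..6.
timeline = pd.concat([features, predictions], axis=1).sort_index()
```

In timeline, step_pred_1 is missing in month 1 (no earlier data to predict from), and x is missing in month 6 (the prediction reaches one month past the last observed data).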

The following diagram shows how stepshifting is used to predict into the future.

ViEWS 3 - Point of stepshifting (1)

References