11 Time Series Modelling - Observatorio-do-Trabalho-de-Pernambuco/documentation GitHub Wiki

Modeling the CAGED Data using SARIMA

This page documents our approach to modeling the CAGED employment data, with a focus on the use of SARIMA and SARIMAX models. It covers the statistical background, data preprocessing, modeling decisions, and strategies for handling methodology changes in the CAGED series.


Overview

  • Objective: Forecast CAGED employment data with robust statistical models.
  • Approach: Leverage the SARIMA and SARIMAX frameworks for univariate time series forecasting, without and with exogenous regressors.
  • Tools: darts Python package for streamlined time series modeling.

CAGED Data Description

CAGED (Cadastro Geral de Empregados e Desempregados) is a Brazilian government database that tracks formal employment statistics. It records monthly data on employment movements, including admissions and dismissals, providing insights into the labor market dynamics.

In our modeling, we focus on the monthly balance (saldo), which represents the net difference between admissions and dismissals. This saldo reflects the workforce movement and is a key indicator of employment trends in Brazil.

It's important to note that there was a methodology change in the CAGED system, transitioning from "CAGED Antigo" to "CAGED Novo." This change affects the data's structure and reporting, which we account for in our modeling approach.

What are SARIMA and SARIMAX?

SARIMA: Seasonal AutoRegressive Integrated Moving Average

SARIMA extends ARIMA by explicitly modeling seasonality in a time series. The model is denoted as:

$$ SARIMA(p, d, q)(P, D, Q)_m $$

Where:

  • $p$: order of non-seasonal autoregression
  • $d$: order of non-seasonal differencing
  • $q$: order of non-seasonal moving average
  • $P$: order of seasonal autoregression
  • $D$: order of seasonal differencing
  • $Q$: order of seasonal moving average
  • $m$: number of periods per season

Mathematical formulation:

$$ \Phi_P(B^m)\phi_p(B)\nabla^d\nabla_m^D y_t = \Theta_Q(B^m)\theta_q(B)\varepsilon_t $$

where $\nabla^d$ is non-seasonal differencing and $\nabla_m^D$ is seasonal differencing.
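
As a worked illustration, take SARIMA(0, 1, 1)(0, 1, 1) with m = 12 on monthly data. These orders are chosen for simplicity and are not the ones fitted to CAGED. Writing $B$ for the backshift operator ($B y_t = y_{t-1}$), so that $\nabla = 1 - B$ and $\nabla_{12} = 1 - B^{12}$, and assuming the sign convention $\theta_q(B) = 1 + \theta_1 B + \dots + \theta_q B^q$, the general equation reduces to:

$$ (1 - B)(1 - B^{12}) y_t = (1 + \theta_1 B)(1 + \Theta_1 B^{12}) \varepsilon_t $$

Expanding both sides gives:

$$ y_t - y_{t-1} - y_{t-12} + y_{t-13} = \varepsilon_t + \theta_1 \varepsilon_{t-1} + \Theta_1 \varepsilon_{t-12} + \theta_1 \Theta_1 \varepsilon_{t-13} $$

The left-hand side compares this month's change with the change observed in the same month one year earlier, which is how the seasonal part of the model enters.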

SARIMAX: SARIMA with Exogenous Variables

SARIMAX generalizes SARIMA by allowing external (exogenous) variables:

$$ y_t = \text{SARIMA terms} + \beta_1 X_{1,t} + ... + \beta_k X_{k,t} + \varepsilon_t $$

This allows the model to incorporate additional explanatory information.
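
One common way to make the shorthand above precise is a regression with SARIMA errors, which is, for example, the default formulation of the SARIMAX class in statsmodels. The sketch below reuses the operators from the SARIMA equation; the error term $u_t$ is notation introduced here for illustration:

$$ y_t = \beta_1 X_{1,t} + \dots + \beta_k X_{k,t} + u_t, \qquad \Phi_P(B^m)\phi_p(B)\nabla^d\nabla_m^D u_t = \Theta_Q(B^m)\theta_q(B)\varepsilon_t $$

In this form each $\beta_i$ measures the shift in the level of $y_t$ associated with a one-unit change in $X_{i,t}$, while the seasonal and autoregressive dynamics are carried by $u_t$.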


Stationarity and Differencing

SARIMA-type models assume that the series is stationary once it has been differenced. We applied non-seasonal and seasonal differencing as needed to remove trend and seasonality, then checked the differenced series for stationarity using statistical tests and visual inspection.
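
The effect of differencing can be sketched on a synthetic series. The example below is a minimal illustration in plain Python, not the code used for CAGED: the `difference` helper and the toy series (a linear trend plus a fixed 12-month pattern) are assumptions made for the example. In practice the differenced series would then be passed to a stationarity test; this page does not name the tests that were used.

```python
def difference(values, lag=1):
    """Return the lag-differenced series: values[t] - values[t - lag]."""
    return [values[t] - values[t - lag] for t in range(lag, len(values))]


# Synthetic monthly series: linear trend plus a fixed 12-month seasonal pattern.
seasonal_pattern = [5, 3, 0, -2, -4, -1, 2, 4, 6, 1, -3, -11]
series = [10 * t + seasonal_pattern[t % 12] for t in range(48)]

# Seasonal differencing (D = 1, m = 12) cancels the repeating pattern and
# leaves only the yearly increase of the trend, which is constant here.
seasonal_diff = difference(series, lag=12)

# Non-seasonal differencing (d = 1) then removes that remaining constant level.
fully_differenced = difference(seasonal_diff, lag=1)
```

Each differencing step shortens the series by its lag, so 48 observations become 36 after the seasonal step and 35 after the non-seasonal one.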


Modeling with Darts

All modeling was performed using the darts time series package, which simplifies training, prediction, and evaluation of statistical and machine learning time series models.

  • Data Loading: The data was loaded with the custom query_athena_to_polars function from our athena_utils.py module, which runs a query on Athena and returns the result directly as a polars LazyFrame or DataFrame. We then removed salary outliers and went on to model the movement of the CAGED series.
  • Train/Test Split:
    The original CAGED series was split into a training and test set, with a dedicated 1-year (12 months) test horizon for model validation and comparison.
  • Validation:
    Model performance was assessed on the holdout year to ensure generalization and prevent over-fitting.
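
The split and validation steps above can be sketched as follows. This is a hedged outline rather than the project's actual code: the function names, the `start` argument, and the model orders are placeholders, and the choice of mean absolute error as the metric is an assumption. The darts calls used (`TimeSeries.from_times_and_values`, `ARIMA` with `seasonal_order`, `fit`, `predict`) come from the public darts API.

```python
def split_holdout(values, horizon=12):
    """Split a sequence into a training part and the last `horizon` points."""
    if horizon <= 0 or horizon >= len(values):
        raise ValueError("horizon must be between 1 and len(values) - 1")
    return values[:-horizon], values[-horizon:]


def mean_absolute_error(actual, predicted):
    """Average absolute forecast error over the holdout window."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def fit_and_score_sarima(start, saldo, horizon=12):
    """Fit a seasonal ARIMA with darts and score it on the last `horizon` months.

    `start` is the first month of the series (e.g. "2010-01-01") and `saldo`
    the monthly balances. The orders below are placeholders, not tuned values.
    """
    # Imported here so the helpers above can be used without darts installed.
    import numpy as np
    import pandas as pd
    from darts import TimeSeries
    from darts.models import ARIMA

    index = pd.date_range(start=start, periods=len(saldo), freq="MS")
    series = TimeSeries.from_times_and_values(index, np.asarray(saldo, dtype=float))

    # Hold out the final `horizon` months for validation.
    train, test = series[:-horizon], series[-horizon:]

    model = ARIMA(p=1, d=1, q=1, seasonal_order=(1, 1, 1, 12))
    model.fit(train)
    forecast = model.predict(horizon)

    return mean_absolute_error(test.values().ravel(), forecast.values().ravel())
```

Keeping the split and the metric as small standalone helpers makes it straightforward to score several candidate models on the same holdout year.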

Handling Methodology Change: CAGED Antigo vs. CAGED Novo

The CAGED data experienced a methodology change, switching from the "CAGED Antigo" to the "CAGED Novo" system. To properly account for this structural break:

  • Several approaches were compared for modeling the change.
  • The most effective strategy was to introduce a dummy variable that indicates the periods before and after the methodology switch.
  • This dummy variable was included as an exogenous regressor in the SARIMAX model, allowing the model to adapt to the level shift caused by the new methodology.
  • Other exogenous regressors, such as average age and average salary, were also tested in the model. Their inclusion did not improve forecasting performance, which suggests that these variables have no simple linear influence on the CAGED employment series in this context.
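
The dummy-variable strategy described above can be sketched as follows. This is an illustration under stated assumptions, not the deployed code: the helper names and the model orders are hypothetical, and the switch month is assumed to be January 2020, which this page does not state. In darts, exogenous regressors are passed as `future_covariates`, so the dummy has to cover the forecast window as well as the training period.

```python
from datetime import date


def methodology_dummy(months, break_month):
    """Return 0.0 for months before `break_month` and 1.0 from it onwards."""
    return [1.0 if month >= break_month else 0.0 for month in months]


# Assumed switch month (not stated on this page).
BREAK_MONTH = date(2020, 1, 1)

example_months = [date(2019, 11, 1), date(2019, 12, 1), date(2020, 1, 1), date(2020, 2, 1)]
example_dummy = methodology_dummy(example_months, BREAK_MONTH)


def forecast_with_dummy(start, saldo, break_month=BREAK_MONTH, horizon=12):
    """Fit a seasonal ARIMA with the methodology dummy as exogenous regressor."""
    # Imported here so the helper above can be used without darts installed.
    import numpy as np
    import pandas as pd
    from darts import TimeSeries
    from darts.models import ARIMA

    # Build the dummy for the observed months plus the forecast horizon.
    full_index = pd.date_range(start=start, periods=len(saldo) + horizon, freq="MS")
    dummy = methodology_dummy([ts.date() for ts in full_index], break_month)

    target = TimeSeries.from_times_and_values(
        full_index[: len(saldo)], np.asarray(saldo, dtype=float)
    )
    covariate = TimeSeries.from_times_and_values(full_index, np.asarray(dummy))

    # Placeholder orders; the fitted coefficient of the dummy captures the level shift.
    model = ARIMA(p=1, d=1, q=1, seasonal_order=(1, 1, 1, 12))
    model.fit(target, future_covariates=covariate)
    return model.predict(horizon, future_covariates=covariate)
```

The training window must contain months on both sides of the switch; otherwise the dummy is constant and its coefficient cannot be estimated.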

Model Choice for Short-Term Forecasting

After testing and evaluation, we have decided to deploy the SARIMA model as our primary approach for short-term forecasting of the CAGED data. This decision is based on two main factors:

  • Interpretability: SARIMA models provide clear insights into trend, seasonality, and the effects of exogenous variables, making it easier to communicate results and justify forecasts to stakeholders.
  • Strong Performance: In our validation tests, SARIMA consistently delivered robust and accurate forecasts for the 1-year test horizon, often outperforming more complex or less transparent alternatives.

Given these advantages, SARIMA will remain our model of choice for operational short-term forecasting tasks.

TL;DR

  • SARIMA and SARIMAX models were implemented to forecast the CAGED employment series.
  • Stationarity was ensured through appropriate differencing.
  • The darts package streamlined model development and evaluation.
  • A 1-year holdout period was used for robust validation.
  • The methodology change in CAGED was handled via a dummy variable, improving model accuracy and interpretability.