Backlog - 504523tb/CIEM6302-Advanced-Data-Science-for-Traffic-and-Transportation-Engineering-PrimeVision-G1 GitHub Wiki

Welcome to the CIEM6302-Advanced-Data-Science-for-Traffic-and-Transportation-Engineering-PrimeVision-G1 wiki!

Last update: 30 October 2024

Weekly meetings:

Every Tuesday 13:45 at de Bouwplaats

Research Question:

How can time-series forecasting methods be used to predict future parcel volume in warehouses?

Subquestions:

What does the data look like?
What models can potentially yield effective modeling results?

Data

The data consists of entries per date and per sorting belt. Each entry gives a number of events, which represents the number of packages sorted by that chute on that day.
Additional features are derived from the sorting date and from the volume on the previous day (so-called lag features).
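A minimal, dependency-free sketch of how such date and lag features could be derived. The helper name `add_lag_features`, the feature names and the volumes are made up for illustration; the notebook itself may build these features differently.

```python
from datetime import date, timedelta


def add_lag_features(daily_volume, n_lags=1):
    """Build one feature row per day from a {date: volume} mapping.

    Hypothetical helper: each row holds calendar features derived from the
    date plus the volume of the previous `n_lags` days.
    """
    rows = []
    for day in sorted(daily_volume):
        lags = [daily_volume.get(day - timedelta(days=k)) for k in range(1, n_lags + 1)]
        if any(value is None for value in lags):
            continue  # skip days without a complete lag history
        row = {
            "date": day,
            "weekday": day.weekday(),  # Monday = 0 ... Sunday = 6
            "month": day.month,
            "day_of_year": day.timetuple().tm_yday,
            "volume": daily_volume[day],
        }
        for k, value in enumerate(lags, start=1):
            row[f"lag_{k}"] = value
        rows.append(row)
    return rows


# Made-up volumes for a single output belt
volumes = {date(2024, 1, 1) + timedelta(days=i): v for i, v in enumerate([120, 135, 128, 90])}
features = add_lag_features(volumes, n_lags=1)
```

The first day is dropped because it has no previous day to take a lag from.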

Tech stack

Operating systems: Windows
Server-Side Programming: Python
Version control: GitHub
Development environment: VS Code, JupyterLab
Documentation: GitHub Wiki

Backlog Sprint week 7 & 8

Work out storyline
Finish model engineering
Compare model performances
Create standard input & output

https://github.com/504523tb/CIEM6302-Advanced-Data-Science-for-Traffic-and-Transportation-Engineering-PrimeVision-G1/blob/main/GantChart_updated1/1730303330024-72faacd7-fbf9-4705-b6cc-9d23ecfa24ec_1.jpg
https://github.com/504523tb/CIEM6302-Advanced-Data-Science-for-Traffic-and-Transportation-Engineering-PrimeVision-G1/blob/main/GantChart_updated/1730303330183-115241d8-31ff-4ccc-8f92-ebdd8d82a282_1.jpg
https://github.com/504523tb/CIEM6302-Advanced-Data-Science-for-Traffic-and-Transportation-Engineering-PrimeVision-G1/blob/main/GantChart_updated/1730303330183-115241d8-31ff-4ccc-8f92-ebdd8d82a282_2.jpg

Update meetings

Update 29-10-2024

Progress tasks last week

Work out storyline - include research questions, conclusion (pros and cons)
- Introduction - finished
- Research question - finished
- Conclusion - in progress
Finish model engineering - finished
Compare model performances - results collected, comments still need to be added
Create standard input & output - finished

New tasks

Hand in notebook
Prepare slides for the presentation

Meeting 7 - 22-10-2024 - 13:45-16:00

Agenda

Progress tasks last week
Sprint planning
Questions Nicolas
Discussion methods
Planning coming week

Progress tasks last week

Tasks update

Storyline (Tom)
- Correlations - finished
- Restructuring - finished
- Outlier analysis - finished
- Capital / rural analysis (include map) - finished
Linear regression (Kristian)
- different features / lags - add some text, finished, can be included in main notebook
ARIMA (Tom)
- add to main - included in main notebook, some bugs remain
CEEMDAN
- parallel computing (Sun) - tried, but doesn't work that well
- explanation (Jelmer & Sun) - wave decomposition, bad results for all output belts
- comparisons GRU vs LSTM (Jelmer) - GRU decomposition aggregate results
- results, but not dashboard
Neural Network (Mats)
- two files - extra model for rural, currently merging those files
- feature engineering - tried sine and cosine encoding of weekdays, but embeddings work better
- explanation - currently working on it
Dashboard forecast (Tom)
- Standardize input + output - structure created
- Make accessible for forecast - working for ARIMA, can be extended to different models

Sprint planning

Work out storyline - include research questions, conclusion (pros and cons)
Finish model engineering - finished, add to main notebook, align with preparation
Compare model performances - Align models with same measures
Create standard input & output - Created for ARIMA, need to align for other models

Discussion methods

Include comparison capital vs rural (one hot encoding)
Report running time
RMSE, VSE, MAE
Add comments
Asymmetric loss function
- The key to addressing your problem is to use an asymmetric loss function that penalizes overpredictions more heavily than underestimations.
- Mats will look into that
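A minimal sketch of such an asymmetric loss, written as a weighted squared error. The function name and the weights (2.0 for overprediction, 1.0 for underprediction) are illustrative assumptions, not the values used in the project.

```python
def asymmetric_squared_error(y_true, y_pred, over_weight=2.0, under_weight=1.0):
    """Mean squared error in which overprediction costs more than underprediction.

    The weights are placeholders; they would have to be tuned to the real
    cost of overshooting versus undershooting.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    total = 0.0
    for actual, predicted in zip(y_true, y_pred):
        error = predicted - actual  # positive error means overprediction
        weight = over_weight if error > 0 else under_weight
        total += weight * error ** 2
    return total / len(y_true)


# Same absolute error of 10 parcels, but in opposite directions
loss_over = asymmetric_squared_error([100, 100], [110, 110])
loss_under = asymmetric_squared_error([100, 100], [90, 90])
```

With these weights the overprediction case is penalized twice as heavily as the underprediction case. For training a neural network, the same idea would have to be expressed with the tensor operations of the framework in use.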

Sprint planning (week 7 & 8)

Work out storyline
Finish model engineering
Compare model performances
Create standard input & output

Planning coming week

Work out storyline - work out introduction, research questions (Kristian), conclusion (pros and cons) (Jelmer)
Finish Gantt chart (Jelmer)
Store running time (Everyone)
Add models to main notebook (Everyone)
Add comments (Everyone)
Add plots for three different output belts (Everyone)
Align with preparation (Tom)
Create standard input & output - Created for ARIMA, need to align for other models (Tom)
Compare model performances - Align models with same measures (Tom)
Compare model performances - Running (Mats)
Look into asymmetric loss function (Mats)
Notebook finished before end of the weekend! (Everyone)

Meeting 6 - 15-10-2024 - 13:45-17:00

Agenda

Progress tasks last week
Sprint planning
Questions Nicolas
Discussion methods
Planning coming week

Progress tasks last week

Tasks update

ARIMA (Tom)
Visualisations and code align (Tom)
- Standardized
Work out linear regression (Kristian)
- Experiment with different lags
- Polynomial regression
- Multi-linear regression not possible for time-series
- Day of the week as extra feature
LSTM (Sun & Jelmer)
- Aggregated prediction at the moment
- 3 minutes training time per output belt
- Lag of meaningful features
- Wave deconstruction (can be limited)
- GRU instead of LSTM
RNN (Mats)
- LSTM instead of RNN
- 365x150 dataset gives good average results but not daily

Sprint retrospective

Finish visualization + outlier analysis
- Visualization + outlier analysis is finished
Predict correlations using linear regression
- Single correlations calculated, quite low
- Cross correlations need to be calculated (ask Nicolas)
Finish model engineering
- ARIMA finished
- ANN not finished yet
- LSTM not finished yet
Split train, test and validation set
- Train: January-October
- Test: 2 weeks
Choose best model
- Continue on different models
- Evaluate on
  - KPIs (RMSE & MAE): absolute value and square of the error per day per output belt, normalized (square root, number of observations) over all data; MSE and Variance SE for each day of the planning horizon
  - Running time (Google Colab, same computer)
  - Interpretability
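The RMSE and MAE per output belt mentioned above could be computed along these lines. This is a sketch with made-up belt names and values; the record layout `(belt, actual, predicted)` is an assumption.

```python
import math
from collections import defaultdict


def rmse(actual, predicted):
    """Root mean squared error over paired observations."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error over paired observations."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def scores_per_belt(records):
    """Group (belt, actual, predicted) records by belt and score each group."""
    grouped = defaultdict(lambda: ([], []))
    for belt, actual, predicted in records:
        grouped[belt][0].append(actual)
        grouped[belt][1].append(predicted)
    return {belt: {"rmse": rmse(a, p), "mae": mae(a, p)} for belt, (a, p) in grouped.items()}


# Made-up daily actuals and predictions for two belts
records = [("belt_A", 10, 12), ("belt_A", 20, 16), ("belt_B", 5, 5)]
scores = scores_per_belt(records)
```

Scoring every model with the same two functions keeps the comparison between models consistent.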

Questions Nicolas

Can you guys do manual feature engineering?
- I liked the idea of rural vs the capital
- One hot encoding for the locations (feature)
Do you guys have the timing for training the models?
- Add running time of the training (repeatable part or pre-repeatable part?)
How are you handling predictions?
- Predicting per output belt for time horizon of two weeks
Focus is on predicting two weeks ahead
- How are you handling this?
- A simple approach could be to predict the first day, and then use the prediction of that day to predict the following day. Repeat until 14 days are finished
- I think most of you are using the true data to predict the next day, ignoring that you wouldn’t have it
- I think we predict on 2 weeks simultaneously
Correlations (cross-correlations)
- Don't look at cross-correlations
- Look at correlation with day before
Linear regression
- 6 lag days
- Different models (RF, Linear)
- One-hot encoding sorting center & output_belt
- Sine, cosine (date in time)
Check if there is relation between # output belt & number of events
RMSE & MAE
Add comments
Asymmetric loss function
- The key to addressing your problem is to use an asymmetric loss function that penalizes overpredictions more heavily than underestimations.
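The recursive approach suggested above for the two-week horizon (predict the first day, then feed that prediction back in to predict the next day) can be sketched as follows. The one-step model `mean_of_lags` is only a placeholder for any fitted model, and the window of 6 lags follows the "6 lag days" note; both are assumptions for illustration.

```python
def recursive_forecast(history, predict_next, horizon=14, n_lags=6):
    """Forecast `horizon` days ahead by feeding each prediction back as a lag.

    `predict_next` is any one-step model that maps the latest lag window
    to the next value. True future values are never used.
    """
    window = list(history[-n_lags:])
    forecast = []
    for _ in range(horizon):
        next_value = predict_next(window)
        forecast.append(next_value)
        window = window[1:] + [next_value]  # slide the window forward by one day
    return forecast


def mean_of_lags(lags):
    """Placeholder one-step model: the average of the lag window."""
    return sum(lags) / len(lags)


# Made-up daily volumes for one output belt
history = [100, 110, 90, 105, 95, 100, 102]
two_weeks = recursive_forecast(history, mean_of_lags)
```

From day 7 onward the window contains only predicted values, which is why errors can accumulate towards the end of the horizon.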

Discussion methods

RNN is replaced by LSTM (Mats)
- 365 x 150 dataset
LSTM and GRU both considered (Jelmer & Sun)
- currently aggregate level
- separate outliers from normal model to prevent underfit
Linear regression (Kristian)
- Day of the week as extra feature

Sprint planning (week 7 & 8)

Work out storyline
Finish model engineering
Compare model performances
Create standard input & output

Planning coming week

Storyline (Tom)
- Correlations
- Restructuring
- Outlier analysis
- Capital / rural analysis (include map)
Linear regression (Kristian)
- different features / lags
ARIMA (Tom)
- add to main
CEEMDAN
- parallel computing (Sun)
- explanation (Jelmer & Sun)
- comparisons GRU vs LSTM (Jelmer)
Neural Network (Mats)
- feature engineering
- explanation
Dashboard forecast (Tom)
- Standardize input + output
- Make accessible for forecast

Meeting 5 - 08-10-2024 - 14:00-16:00

Agenda

Progress tasks last week
Sprint planning
Discussion methods
Planning coming week

Progress tasks last week

Tasks finished

Include boxplots and plots (Everyone) - Tom will align everything
Check outliers (Jelmer) - Later
Work out multiple regression (Kristian) - Small adjustment needed
Cosine + sine + day of the year (Kristian) - Done
LSTM (Sun) - Needs further explanation
RNN (Mats) - Needs further explanation
ARIMA (Tom) - First results
Storyline + presentation (Jelmer) - Storyline later

Sprint planning

Finish visualization + outlier analysis
- Tom will finish visualization this week
- Finish outlier analysis after finishing models
Predict correlations using linear regression
Finish model engineering
- ARIMA
- ANN
- LSTM
Split train, test and validation set
- Train: January-October
- Test: 2 weeks
Choose best model
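A sketch of the chronological split described above, with the test set being the two weeks that follow the training period. The row layout, the dates and the helper name are assumptions for illustration.

```python
from datetime import date, timedelta


def chronological_split(rows, test_start, test_days=14):
    """Split rows by date: everything before `test_start` is training data,
    the next `test_days` days form the test set. No shuffling, so the model
    never sees data from the period it is evaluated on.
    """
    test_end = test_start + timedelta(days=test_days)
    train = [row for row in rows if row["date"] < test_start]
    test = [row for row in rows if test_start <= row["date"] < test_end]
    return train, test


# Made-up daily rows from 1 October to 29 November
rows = [{"date": date(2024, 10, 1) + timedelta(days=i), "volume": 100 + i} for i in range(60)]
train, test = chronological_split(rows, test_start=date(2024, 11, 1))
```

Rows after the two test weeks are left out of both sets.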

Discussion methods

Include RMSE and MAE

Planning coming week

ARIMA (Tom)
Visualisations and code align (Tom)
Work out linear regression (Kristian)
LSTM (Sun & Jelmer)
RNN (Mats)

Midterm PrimeVision - 08-10-2024 - 10:00-11:00

Feature embedding might be overkill (when only considering day, week, month)
Embedding with (trained) PCA
No bidirectional LSTM, since you are predicting the future
Why mix LSTM with transformer? Attention is already included in LSTM, delve into that or discard it
ARIMA: check how the accuracy falls off when predicting further ahead
Take a look at smaller sections (for example, look at individual months, bundle extremes (high outliers or low outliers))
Mean absolute error and mean squared error on the data
Feature importance for the model
- Feature weight
- Lean split
Absolute error over time
Variance actuals vs predictions
Report running time
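To see how accuracy falls off when predicting further ahead, the absolute error can be averaged per day of the forecast horizon. A minimal sketch with made-up numbers; each inner list is assumed to be one forecast run (or one output belt) over the same horizon.

```python
def mae_per_horizon_day(actuals, forecasts):
    """Mean absolute error for each day of the horizon, averaged over runs.

    `actuals` and `forecasts` are lists of equally long sequences, one
    sequence per forecast run or output belt.
    """
    horizon = len(actuals[0])
    result = []
    for step in range(horizon):
        errors = [abs(a[step] - f[step]) for a, f in zip(actuals, forecasts)]
        result.append(sum(errors) / len(errors))
    return result


# Two made-up runs with a three-day horizon
actuals = [[100, 100, 100], [50, 50, 50]]
forecasts = [[101, 104, 110], [49, 48, 44]]
errors_by_day = mae_per_horizon_day(actuals, forecasts)
```

Plotting `errors_by_day` against the horizon day would give the "absolute error over time" view from the feedback above.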

Meeting 4 - 01-10-2024 - 13:45-16:00

Agenda

Progress tasks last week
Sprint retrospective
Meeting PrimeVision
Meeting Nicolas
Discussion methods
Sprint planning coming weeks (week 5 & 6)
Planning coming week

Progress tasks last week

Tasks finished

Make boxplots (year, month, weekday, sorting center) (Jelmer) - Adjust to small plots
Create main file for visualizations (Tom) - Done
- Aggregate planning horizon (Mats) - Add and adjust for VANTAA
- boxplots (Jelmer) - Add and adjust for VANTAA
- Day of month (Sun) - Adjust for VANTAA
- Chute outliers (Tom) - Adjust for fewer chutes (no focus)
Create Artificial Neural Network as baseline (Mats) - Adjust splitting train and test
Investigate ARIMA (Tom) - Continue
Investigate Recurrent Neural Network (LSTM) (Sun, Jelmer, Kristian) - Continue

Sprint retrospective

Import and load data - finished
Clean data - finished
Visualize data - boxplots and decide which are interesting
Remove outliers - investigate with PrimeVision
Add extra features - day, month, weekday added (experiment later on with seasonality, holidays)

Notes: some ambiguities in the data and the goal led to delays -> Improvement: ask questions directly and make choices early

Meeting PrimeVision

Some output belts have less demand at the beginning of the year and more at the end of the year compared to other belts; the reverse also happens. Is there an explanation for that?
- Not all chutes are always open -> just assume that a chute is always used for some region
- Send email for extra clarification
What is the final model that they want to have?
- Input: last days, last year (data we have)? For what period do you want the forecast?
  - Directly into the future
  - Check how much historical data is needed for good prediction (the less, the better)
- Output: Predictions two weeks for all output belts and all sorting centers?
  - Try for all output belts and otherwise only for all sorting centers
What to do with outliers? When are they outliers?
- Check on performance

Meeting Nicolas (TA)

Answer mail Nicolas + schedule meeting
Outliers
- Check whether there is time-trend
- Correlation
- Better to overshoot or undershoot?
Date
- Use sine and cosine to connect day 7 to day 1
- Add day of the year
- Output belt should be a feature
Testing
- Use previous months to predict this month

Discussion methods

Multiple regression instead of linear regression (compare which variables have most influence)
- Weekday contributes most to performance
- Week and month are similar -> remove month

Presentation

Explain problem
Explain data
Visualization
Methods
Next steps

Sprint planning coming weeks (week 5 & 6)

Finish visualization + outlier analysis
Predict correlations using linear regression
Finish model engineering
- ARIMA
- ANN
- LSTM
Split train, test and validation set
Choose best model

Planning coming week

Include boxplots and plots (Everyone)
Check outliers (Jelmer)
Work out multiple regression (Kristian)
Cosine + sine + day of the year (Kristian)
LSTM (Sun)
RNN (Mats)
ARIMA (Tom)
Storyline + presentation (Jelmer)

Meeting 3 - 24-09-2024 - 13:45-15:30

Agenda

Progress tasks last week
Conclusions visualization
Progress sprint planning (week 3 & 4)
Planning coming week

Progress tasks last week

Tasks finished

Finish step 1 & 2 in class (everyone)
Visualize aggregate demand over planning horizon (seasonality) (Mats)
Visualize aggregate demand on day of the month (Dexin)
Visualize per chute demand over planning horizon (in percentages) (Tom)
Visualize per chute demand on weekday (in percentages) (Tom)
Overview of data quality (Jelmer)
Visualize aggregate demand on weekday (Kristian)

Conclusions visualization

global pattern (aggregate)
more in december
stable over the year
weekly pattern (Tuesday and Wednesday busiest, Saturday much less, Sunday the least)
cluster based on size of sorting center
try to find reason of peaks
make boxplot for year, month, day, per sorting center
few output belts have a high percentage of output
Sine pattern over the days of the month; needs to be corrected for the differing number of days per month, and shown in one plot
Questions:
create separate neural networks for different sorting centers?
Some output belts have less demand at the beginning of the year and more at the end of the year compared to other belts; the reverse also happens. Is there an explanation for that?

Progress sprint planning (week 3 & 4)

Import and load data - finished
Clean data - finished
Visualize data - boxplots and decide which are interesting
Remove outliers - investigate with PrimeVision
Add extra features - day, month, weekday added (experiment later on with seasonality, holidays)

Planning coming week

Make boxplots (year, month, weekday, sorting center) (Jelmer)
Create main file for visualizations (Tom)
- Aggregate planning horizon (Mats)
- boxplots (Jelmer)
- Day of month (Sun)
- Chute outliers (Tom)
Create Artificial Neural Network as baseline (Mats)
Investigate ARIMA (Tom)
Investigate Recurrent Neural Network (LSTM) (Sun, Jelmer, Kristian)

Meeting 2 - 17-09-2024 - 13:45-15:30

Agenda

Meet with PrimeVision
Revise planning
Tasks division

Meeting PrimeVision

Objective:
- calculate amount of volume per chute
- chute = output belt, physical location
Functioning of software:
- Enter date and give prediction for next two weeks
Split data per sorting center
Filter on event type since we need to count them only once
Implementation
- First try in Python
- if we have spare time, try to implement in AWS native together with people from PrimeVision
Data
- Should be available today

Revise planning

Read about machine learning techniques
Take time series into account (trends last period)
- Otherwise we make prediction for every day of the year

Steps:

Import and load data
Clean data
Visualize data
Remove outliers + add extra features
Advanced data analysis
- Predict correlations using linear regression
Split train, test and validation set
Normalize data
Simple Neural Network
LSTM / deep Neural Network
Analyze results

Tasks division

Finish step 1 & 2 in class (everyone)
Visualize aggregate demand over planning horizon (seasonality) (Mats)
Visualize aggregate demand on weekday (Kristian)
Visualize aggregate demand on day of the month (Dexin)
Visualize per chute demand over planning horizon (in percentages) (Tom)
Visualize per chute demand on weekday (in percentages) (Tom)
Overview of data quality (Jelmer)

Meeting 1 - 10-09-2024 - 15:45-17:00

Agenda

Discuss backlog

Discuss backlog

Create research question + subquestions
Create sprint planning
Create backlog
Tech stack