Backlog
Welcome to the CIEM6302-Advanced-Data-Science-for-Traffic-and-Transportation-Engineering-PrimeVision-G1 wiki!
Last update: 30 October 2024
Weekly meetings:
- Every Tuesday 13:45 at de Bouwplaats
Research Question:
How can time-series forecasting methods be used to predict future parcel volume in warehouses?
Subquestions:
- What does the data look like?
- What models can potentially yield effective modeling results?
Data
- The data consists of one entry per date and output belt (chute). Each entry gives a number of events, which represents the number of packages sorted by that chute on that day.
- Additional features are derived from the sorting date and from the volume on the previous day (so-called lag features).
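The date-derived and lag features described above can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: the column names `date`, `output_belt` and `n_events` and all sample values are assumptions.

```python
import pandas as pd

# Hypothetical daily counts for two output belts (column names are assumed).
df = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-02", "2023-01-03", "2023-01-04"] * 2),
    "output_belt": ["A"] * 3 + ["B"] * 3,
    "n_events": [10, 12, 9, 100, 90, 110],
})

# Sort so that "previous row" means "previous day on the same belt".
df = df.sort_values(["output_belt", "date"]).reset_index(drop=True)

# Features derived from the sorting date.
df["weekday"] = df["date"].dt.dayofweek      # Monday = 0
df["month"] = df["date"].dt.month
df["day_of_year"] = df["date"].dt.dayofyear

# Lag feature: volume of the previous row within the same belt.
# The first row of each belt has no predecessor and becomes NaN.
df["lag_1"] = df.groupby("output_belt")["n_events"].shift(1)

print(df)
```

Note that `shift(1)` moves by one row, not by one calendar day, so this sketch assumes every belt has an entry for every day. If days are missing, the series would need to be reindexed to a full date range first.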
Tech stack
- Operating systems: Windows
- Server-Side Programming: Python
- Version control: GitHub
- Development environment: VS Code, JupyterLab
- Documentation: GitHub Wiki
Backlog Sprint week 7 & 8
- Work out storyline
- Finish model engineering
- Compare model performances
- Create standard input & output
Gantt chart:
- https://github.com/504523tb/CIEM6302-Advanced-Data-Science-for-Traffic-and-Transportation-Engineering-PrimeVision-G1/blob/main/GantChart_updated1/1730303330024-72faacd7-fbf9-4705-b6cc-9d23ecfa24ec_1.jpg
- https://github.com/504523tb/CIEM6302-Advanced-Data-Science-for-Traffic-and-Transportation-Engineering-PrimeVision-G1/blob/main/GantChart_updated/1730303330183-115241d8-31ff-4ccc-8f92-ebdd8d82a282_1.jpg
- https://github.com/504523tb/CIEM6302-Advanced-Data-Science-for-Traffic-and-Transportation-Engineering-PrimeVision-G1/blob/main/GantChart_updated/1730303330183-115241d8-31ff-4ccc-8f92-ebdd8d82a282_2.jpg
Update meetings
Update 29-10-2024
Progress tasks last week
- Work out storyline - include research questions, conclusion (pros and cons)
- Introduction - finished
- Research question - finished
- Conclusion - in progress
- Finish model engineering - finished
- Compare model performances - results collected, comments still need to be added
- Create standard input & output - finished
New tasks
- Hand in notebook
- Prepare slides for the presentation
Meeting 7 - 22-10-2024 - 13:45-16:00
Agenda
- Progress tasks last week
- Sprint planning
- Questions Nicolas
- Discussion methods
- Planning coming week
Progress tasks last week
Tasks update
- Storyline (Tom)
- Correlations - finished
- Restructuring - finished
- Outlier analysis - finished
- Capital / rural analysis (include map) - finished
- Linear regression (Kristian)
- different features / lags - add some text, finished, can be included in main notebook
- ARIMA (Tom)
- add to main - included in main notebook, some bugs
- CEEMDAN
- parallel computing (Sun) - tried, but doesn't work that well
- explanation (Jelmer & Sun) - wave decomposition, bad results for all output belts
- comparison GRU vs LSTM (Jelmer) - GRU decomposition, aggregate results
- results, but not dashboard
- Neural Network (Mats)
- two files - extra model for rural, currently merging those files
- feature engineering - tried sine and cosine encoding of weekdays, but embeddings work better
- explanation - currently working on it
- Dashboard forecast (Tom)
- Standardize input + output - structure created
- Make accessible for forecast - working for ARIMA, can be extended to different models
Sprint planning
- Work out storyline - include research questions, conclusion (pros and cons)
- Finish model engineering - finished, add to main notebook, align with preparation
- Compare model performances - Align models with same measures
- Create standard input & output - Created for ARIMA, need to align for other models
Discussion methods
- Include comparison capital vs rural (one hot encoding)
- Report running time
- RMSE, VSE, MAE
- Add comments
- Asymmetric loss function
- The key to addressing your problem is to use an asymmetric loss function that penalizes overpredictions more heavily than underestimations.
- Mats will look into that
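The asymmetric loss idea noted above can be sketched in plain Python. This is an illustrative sketch only, not the loss the group implemented: the function name, the squared-error form and the weights `over_weight` and `under_weight` are assumptions.

```python
def asymmetric_squared_error(y_true, y_pred, over_weight=2.0, under_weight=1.0):
    """Mean squared error with separate weights for over- and underprediction."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("inputs must not be empty")
    total = 0.0
    for actual, predicted in zip(y_true, y_pred):
        error = predicted - actual  # positive means overprediction
        weight = over_weight if error > 0 else under_weight
        total += weight * error ** 2
    return total / len(y_true)


actual = [100, 100]
loss_over = asymmetric_squared_error(actual, [110, 100])   # overpredicts by 10
loss_under = asymmetric_squared_error(actual, [90, 100])   # underpredicts by 10
print(loss_over, loss_under)
```

With the default weights, an overprediction costs twice as much as an underprediction of the same size. Swapping the two weights reverses the preference, which is useful if undershooting turns out to be the more costly mistake.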
Sprint planning (week 7 & 8)
- Work out storyline
- Finish model engineering
- Compare model performances
- Create standard input & output
Planning coming week
- Work out storyline - work out introduction, research questions (Kristian), conclusion (pros and cons) (Jelmer)
- Finish Gantt chart (Jelmer)
- Store running time (Everyone)
- Add models to main notebook (Everyone)
- Add comments (Everyone)
- Add plots for three different output belts (Everyone)
- Align with preparation (Tom)
- Create standard input & output - Created for ARIMA, need to align for other models (Tom)
- Compare model performances - Align models with same measures (Tom)
- Compare model performances - Running (Mats)
- Look into asymmetric loss function (Mats)
- Notebook finished before end of the weekend! (Everyone)
Meeting 6 - 15-10-2024 - 13:45-17:00
Agenda
- Progress tasks last week
- Sprint planning
- Questions Nicolas
- Discussion methods
- Planning coming week
Progress tasks last week
Tasks update
- ARIMA (Tom)
- Visualisations and code align (Tom)
- Standardized
- Work out linear regression (Kristian)
- Experiment with different lags
- Polynomial regression
- Multi-linear regression not possible for time-series
- Day of the week as extra feature
- LSTM (Sun & Jelmer)
- Aggregated prediction at the moment
- 3 minutes training time per output belt
- Lag of meaningful features
- Wave decomposition (can be limited)
- GRU instead of LSTM
- RNN (Mats)
- LSTM instead of RNN
- 365x150 dataset gives good average results but not daily
Sprint retrospective
- Finish visualization + outlier analysis
- Visualization + outlier analysis is finished
- Predict correlations using linear regression
- Single correlations calculated, quite low
- Cross correlations need to be calculated (ask Nicolas)
- Finish model engineering
- ARIMA finished
- ANN not finished yet
- LSTM not finished yet
- Split train, test and validation set
- Train: January-October
- Test: 2 weeks
- Choose best model
- Continue on different models
- Evaluate on
- KPIs (RMSE & MAE) - absolute and squared error per day per output belt, normalized (square root, number of observations) over all data - MSE and variance of the squared error for each day of the planning horizon
- Running time (Google Colab, same computer)
- Interpretability
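The RMSE and MAE evaluation per output belt could look roughly like the sketch below. The record layout `(output_belt, actual, predicted)` and all values are hypothetical and only serve to show the computation.

```python
import math
from collections import defaultdict


def rmse(errors):
    """Root mean squared error of a list of prediction errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def mae(errors):
    """Mean absolute error of a list of prediction errors."""
    return sum(abs(e) for e in errors) / len(errors)


# Hypothetical records: (output_belt, actual, predicted), one per day.
records = [
    ("belt_1", 10, 12),
    ("belt_1", 20, 16),
    ("belt_2", 5, 5),
    ("belt_2", 8, 11),
]

# Collect the errors per output belt.
errors_by_belt = defaultdict(list)
for belt, actual, predicted in records:
    errors_by_belt[belt].append(predicted - actual)

# Scores per belt and over all data.
scores = {
    belt: {"rmse": rmse(errs), "mae": mae(errs)}
    for belt, errs in errors_by_belt.items()
}
all_errors = [e for errs in errors_by_belt.values() for e in errs]
overall = {"rmse": rmse(all_errors), "mae": mae(all_errors)}

print(scores)
print(overall)
```

Computing the same two measures with the same code for every model is one way to keep the model comparison consistent.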
Questions Nicolas
- Can you guys do manual feature engineering?
- I liked the idea of rural vs the capital
- One hot encoding for the locations (feature)
- Do you guys have the timing for training the models?
- Add running time of the training (repeatable part or pre-repeatable part?)
- How are you handling predictions?
- Predicting per output belt for time horizon of two weeks
- Focus is on predicting two weeks ahead
- How are you handling this?
- A simple approach could be to predict the first day, and then use the prediction of that day to predict the following day. Repeat until 14 days are finished
- I think most of you are using the true data to predict the next day, ignoring that you wouldn’t have it
- I think we predict on 2 weeks simultaneously
- Correlations (cross-correlations)
- Don't look at cross-correlations
- Look at correlation with day before
- Linear regression
- 6 lag days
- Different models (RF, Linear)
- One-hot encoding sorting center & output_belt
- Sine, cosine (date in time)
- Check if there is relation between # output belt & number of events
- RMSE & MAE
- Add comments
- Asymmetric loss function
- The key to addressing your problem is to use an asymmetric loss function that penalizes overpredictions more heavily than underestimations.
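The day-by-day approach suggested in these notes (predict the first day, then reuse that prediction as input for the following day, repeated for 14 days) can be sketched as below. The function names, the 7-day window and the placeholder one-step model are assumptions; any one-step model from the notebook could be plugged in as `predict_next`.

```python
def recursive_forecast(history, predict_next, horizon=14, n_lags=7):
    """Forecast `horizon` steps by feeding each prediction back in as a lag."""
    if len(history) < n_lags:
        raise ValueError("history must contain at least n_lags values")
    window = list(history[-n_lags:])
    forecast = []
    for _ in range(horizon):
        next_value = predict_next(window)
        forecast.append(next_value)
        # Slide the window: drop the oldest value and append the prediction,
        # never a true future observation.
        window = window[1:] + [next_value]
    return forecast


def seasonal_naive(window):
    """Placeholder one-step model: repeat the value from 7 days earlier."""
    return window[-7]


# Hypothetical volumes for Monday to Sunday of the last observed week.
history = [120, 130, 125, 118, 110, 40, 20]
forecast = recursive_forecast(history, seasonal_naive)
print(forecast)
```

Because only predictions enter the window after the first step, errors can accumulate over the horizon. That is the trade-off compared with predicting all 14 days simultaneously.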
Discussion methods
- RNN is replaced by LSTM (Mats)
- 365 x 150 dataset
- LSTM and GRU both considered (Jelmer & Sun)
- currently aggregate level
- separate outliers from normal model to prevent underfit
- Linear regression (Kristian)
- Day of the week as extra feature
Sprint planning (week 7 & 8)
- Work out storyline
- Finish model engineering
- Compare model performances
- Create standard input & output
Planning coming week
- Storyline (Tom)
- Correlations
- Restructuring
- Outlier analysis
- Capital / rural analysis (include map)
- Linear regression (Kristian)
- different features / lags
- ARIMA (Tom)
- add to main
- CEEMDAN
- parallel computing (Sun)
- explanation (Jelmer & Sun)
- comparison GRU vs LSTM (Jelmer)
- Neural Network (Mats)
- feature engineering
- explanation
- Dashboard forecast (Tom)
- Standardize input + output
- Make accessible for forecast
Meeting 5 - 08-10-2024 - 14:00-16:00
Agenda
- Progress tasks last week
- Sprint planning
- Discussion methods
- Planning coming week
Progress tasks last week
Tasks finished
- Include boxplots and plots (Everyone) - Tom will align everything
- Check outliers (Jelmer) - Later
- Work out multiple regression (Kristian) - Small adjustment needed
- Cosine + sine + day of the year (Kristian) - Done
- LSTM (Sun) - Needs further explanation
- RNN (Mats) - Needs further explanation
- ARIMA (Tom) - First results
- Storyline + presentation (Jelmer) - Storyline later
Sprint planning
- Finish visualization + outlier analysis
- Tom will finish visualization this week
- Finish outlier analysis after finishing models
- Predict correlations using linear regression
- Finish model engineering
- ARIMA
- ANN
- LSTM
- Split train, test and validation set
- Train: January-October
- Test: 2 weeks
- Choose best model
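A chronological split along these lines (train on January to October, test on the following two weeks) might be sketched as follows. The year 2023, the dummy values and the function name are assumptions. A validation set could be carved out of the end of the training period in the same way.

```python
from datetime import date, timedelta


def chronological_split(rows, train_end, test_days=14):
    """Split (date, value) rows by time.

    Training covers everything up to and including `train_end`;
    testing covers the `test_days` days directly after it.
    """
    test_end = train_end + timedelta(days=test_days)
    train = [row for row in rows if row[0] <= train_end]
    test = [row for row in rows if train_end < row[0] <= test_end]
    return train, test


# Hypothetical daily series for one output belt in 2023 (values are dummies).
start = date(2023, 1, 1)
rows = [(start + timedelta(days=i), i) for i in range(365)]

train, test = chronological_split(rows, train_end=date(2023, 10, 31))
print(len(train), len(test), test[0][0], test[-1][0])
```

Splitting by date rather than at random keeps future observations out of the training data, which matters for time-series forecasting.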
Discussion methods
- Include RMSE and MAE
Planning coming week
- ARIMA (Tom)
- Visualisations and code align (Tom)
- Work out linear regression (Kristian)
- LSTM (Sun & Jelmer)
- RNN (Mats)
Midterm PrimeVision - 08-10-2024 - 10:00-11:00
- Feature embedding might be overkill (when only considering day, week, month)
- Embedding with (trained) PCA
- No bidirectional LSTM, since you are predicting the future
- Why mix LSTM with transformer? Attention is already included in LSTM, delve into that or discard it
- ARIMA: try to see how the accuracy falls off when predicting further ahead
- Take a look at smaller sections (for example look at individual months, bundle extremes (high outliers or low outliers))
- Mean absolute error and Mean Squared error data
- Feature importance for the model
- Feature weight
- Lean split
- Absolute error over time
- Variance actuals vs predictions
- Report running time
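One way to follow up on the feedback about accuracy further ahead and absolute error over time is to compute the mean absolute error separately for each step of the forecast horizon. The sketch below assumes forecasts made from several starting dates. The data layout and values are hypothetical.

```python
def mae_per_horizon(actuals, forecasts):
    """Mean absolute error for each step ahead.

    `actuals` and `forecasts` are lists of equal-length sequences,
    one sequence per forecast origin (starting date).
    """
    horizon = len(forecasts[0])
    result = []
    for step in range(horizon):
        errors = [abs(f[step] - a[step]) for a, f in zip(actuals, forecasts)]
        result.append(sum(errors) / len(errors))
    return result


# Hypothetical example: two forecast origins, three days ahead each.
actuals = [[100, 110, 120], [90, 95, 105]]
forecasts = [[102, 105, 130], [89, 101, 90]]

errors_by_step = mae_per_horizon(actuals, forecasts)
print(errors_by_step)
```

Plotting this list against the step number shows how quickly a model degrades over the two-week horizon.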
Meeting 4 - 01-10-2024 - 13:45-16:00
Agenda
- Progress tasks last week
- Sprint retrospective
- Meeting PrimeVision
- Meeting Nicolas
- Discussion methods
- Sprint planning coming weeks (week 5 & 6)
- Planning coming week
Progress tasks last week
Tasks finished
- Make boxplots (year, month, weekday, sorting center) (Jelmer) - Adjust to small plots
- Create main file for visualizations (Tom) - Done
- Aggregate planning horizon (Mats) - Add and adjust for VANTAA
- boxplots (Jelmer) - Add and adjust for VANTAA
- Day of month (Sun) - Adjust for VANTAA
- Chute outliers (Tom) - Adjust for fewer chutes (no focus)
- Create Artificial Neural Network as baseline (Mats) - Adjust splitting train and test
- Investigate ARIMA (Tom) - Continue
- Investigate Recurrent Neural Network (LSTM) (Sun, Jelmer, Kristian) - Continue
Sprint retrospective
- Import and load data - finished
- Clean data - finished
- Visualize data - boxplots and decide which are interesting
- Remove outliers - investigate with PrimeVision
- Add extra features - day, month, weekday added (experiment later on with seasonality, holidays)
Notes: some ambiguities in the data and the goal led to delays -> Improvement: ask questions directly and make choices early
Meeting PrimeVision
- Some output belts have less demand in beginning of the year and more at the end of the year in comparison to other belts? The other way around also happens. Is there an explanation for that?
- Not all chutes are always open -> just assume that a chute is always used for some region
- Send email for extra clarification
- What is the final model that they want to have?
- Input: last days, last year (data we have)? For what period do you want the forecast?
- Directly into the future
- Check how much historical data is needed for good prediction (the less, the better)
- Output: Predictions two weeks for all output belts and all sorting centers?
- Try for all output belts and otherwise only for all sorting centers
- What to do with outliers? When are they outliers?
- Check on performance
Meeting Nicolas (TA)
- Answer mail Nicolas + schedule meeting
- Outliers
- Check whether there is time-trend
- Correlation
- Better to overshoot or undershoot?
- Date
- Use sine and cosine to connect day 7 to day 1
- Add day of the year
- Output belt should be a feature
- Testing
- Use previous months to predict this month
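The sine and cosine suggestion above can be illustrated with a small sketch. The helper names are assumptions. The same function could be applied to the day of the year with a period of 365.

```python
import math


def cyclical_encode(value, period):
    """Map a value in [0, period) onto the unit circle as (sin, cos)."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def distance(a, b):
    """Euclidean distance between two encoded points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


# Weekdays numbered 0..6 with Monday = 0.
monday, tuesday, sunday = (cyclical_encode(day, 7) for day in (0, 1, 6))

# In the encoded space Sunday is as close to Monday as Tuesday is,
# whereas the raw numbers 6 and 0 are far apart.
print(distance(sunday, monday), distance(monday, tuesday))
```

Using both sine and cosine is what makes each weekday map to a unique point. With only one of the two, different days would share the same value.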
Discussion methods
- Multiple regression instead of linear regression (compare which variables have most influence)
- Weekday contributes most to performance
- Week and month are similar -> remove month
Presentation
- Explain problem
- Explain data
- Visualization
- Methods
- Next steps
Sprint planning coming weeks (week 5 & 6)
- Finish visualization + outlier analysis
- Predict correlations using linear regression
- Finish model engineering
- ARIMA
- ANN
- LSTM
- Split train, test and validation set
- Choose best model
Planning coming week
- Include boxplots and plots (Everyone)
- Check outliers (Jelmer)
- Work out multiple regression (Kristian)
- Cosine + sine + day of the year (Kristian)
- LSTM (Sun)
- RNN (Mats)
- ARIMA (Tom)
- Storyline + presentation (Jelmer)
Meeting 3 - 24-09-2024 - 13:45-15:30
Agenda
- Progress tasks last week
- Conclusions visualization
- Progress sprint planning (week 3 & 4)
- Planning coming week
Progress tasks last week
Tasks finished
- Finish step 1 & 2 in class (everyone)
- Visualize aggregate demand over planning horizon (seasonality) (Mats)
- Visualize aggregate demand on day of the month (Dexin)
- Visualize per chute demand over planning horizon (in percentages) (Tom)
- Visualize per chute demand on weekday (in percentages) (Tom)
- Overview of data quality (Jelmer)
- Visualize aggregate demand on weekday (Kristian)
Conclusions visualization
- global pattern (aggregate)
- more in December
- stable over the year
- weekly pattern (Tuesday and Wednesday busiest, Saturday far less, Sunday the least)
- cluster based on size of sorting center
- try to find reason of peaks
- make boxplot for year, month, day, per sorting center
- few output belts have a high percentage of output
- Sine pattern over days of the month, needs to be fixed to correct for month days in a year, plus in one plot
- Questions:
- create separate neural networks for different sorting centers?
- Some output belts have less demand in beginning of the year and more at the end of the year in comparison to other belts? The other way around also happens. Is there an explanation for that?
Progress sprint planning (week 3 & 4)
- Import and load data - finished
- Clean data - finished
- Visualize data - boxplots and decide which are interesting
- Remove outliers - investigate with PrimeVision
- Add extra features - day, month, weekday added (experiment later on with seasonality, holidays)
Planning coming week
- Make boxplots (year, month, weekday, sorting center) (Jelmer)
- Create main file for visualizations (Tom)
- Aggregate planning horizon (Mats)
- boxplots (Jelmer)
- Day of month (Sun)
- Chute outliers (Tom)
- Create Artificial Neural Network as baseline (Mats)
- Investigate ARIMA (Tom)
- Investigate Recurrent Neural Network (LSTM) (Sun, Jelmer, Kristian)
Meeting 2 - 17-09-2024 - 13:45-15:30
Agenda
- Meet with PrimeVision
- Revise planning
- Tasks division
Meeting PrimeVision
- Objective:
- calculate amount of volume per chute
- chute = output belt, physical location
- Functioning of software:
- Enter date and give prediction for next two weeks
- Split data per sorting center
- Filter on event type since we need to count them only once
- Implementation
- First try in Python
- If we have spare time, try to implement in AWS native together with people from PrimeVision
- Data
- Should be available today
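The counting step described in this meeting (split per sorting center, filter on event type so each parcel is counted once, count per chute per day) could be sketched as below. The field names, the event-type labels `sorted` and `scanned`, and the second sorting center name are hypothetical, since the real data schema is not described here.

```python
from collections import Counter

# Hypothetical event type that marks a parcel being sorted to a chute.
COUNTED_EVENT_TYPE = "sorted"


def daily_volume(events, event_type=COUNTED_EVENT_TYPE):
    """Count events of one type per (sorting center, date, output belt)."""
    counts = Counter()
    for event in events:
        if event["event_type"] != event_type:
            continue  # skip other event types so parcels are not double counted
        key = (event["sorting_center"], event["date"], event["output_belt"])
        counts[key] += 1
    return counts


# Hypothetical raw events (field names are assumptions).
events = [
    {"sorting_center": "VANTAA", "date": "2023-01-02", "output_belt": 1, "event_type": "sorted"},
    {"sorting_center": "VANTAA", "date": "2023-01-02", "output_belt": 1, "event_type": "scanned"},
    {"sorting_center": "VANTAA", "date": "2023-01-02", "output_belt": 1, "event_type": "sorted"},
    {"sorting_center": "VANTAA", "date": "2023-01-02", "output_belt": 2, "event_type": "sorted"},
    {"sorting_center": "center_B", "date": "2023-01-02", "output_belt": 1, "event_type": "sorted"},
]

volumes = daily_volume(events)
print(volumes)
```

Keeping the sorting center in the key means the result can later be split per sorting center without recounting.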
Revise planning
- Read about machine learning techniques
- Take time series into account (trends last period)
- Otherwise we make prediction for every day of the year
Steps:
- Import and load data
- Clean data
- Visualize data
- Remove outliers + add extra features
- Advanced data analysis
- Predict correlations using linear regression
- Split train, test and validation set
- Normalize data
- Simple Neural Network
- LSTM / deep Neural Network
- Analyze results
Tasks division
- Finish step 1 & 2 in class (everyone)
- Visualize aggregate demand over planning horizon (seasonality) (Mats)
- Visualize aggregate demand on weekday (Kristian)
- Visualize aggregate demand on day of the month (Dexin)
- Visualize per chute demand over planning horizon (in percentages) (Tom)
- Visualize per chute demand on weekday (in percentages) (Tom)
- Overview of data quality (Jelmer)
Meeting 1 - 10-09-2024 - 15:45-17:00
Agenda
- Discuss backlog
Discuss backlog
- Create research question + subquestions
- Create sprint planning
- Create backlog
- Tech stack