Backlog - 504523tb/CIEM6302-Advanced-Data-Science-for-Traffic-and-Transportation-Engineering-PrimeVision-G1 GitHub Wiki
Welcome to the CIEM6302-Advanced-Data-Science-for-Traffic-and-Transportation-Engineering-PrimeVision-G1 wiki!
Last update: 01 October 2024
Meetings:
- Sprint planning meetings: every Tuesday 13:45 at de Bouwplaats
- Additional update meeting: every Friday
Research Question:
How can time-series forecasting methods be used to predict future parcel volume in warehouses?
Subquestions:
- Which time-series forecasting methods can accurately predict the parcel volume?
- Which features significantly contribute to the prediction accuracy?
Data
- (Anonymized) parcel volume data with size (width, height, length), weight, destination (country, ZIP), and parcel product (SLA) over time. If possible, we will add anonymized shipping parties.
- Timestamps for previously announced parcels and the time they physically arrived at a facility (Site ID and Site ZIP).
- Sorting Results at the moment parcels arrive at a physical facility (chute and working shift).
Tech stack
- Operating systems: Windows
- Server-Side Programming: Python
- Cloud platform: AWS
- Version control: GitHub
- Development environment: VS Code, JupyterLab
- Documentation: GitHub Wiki
Backlog sprint 1 (week 3 & 4):
- Import and load data
- Clean data
- Visualize data
- Remove outliers
- Add extra features
Meeting 5 - 08-10-2024 - 14:00-16:00
Agenda
- Progress tasks last week
- Sprint planning
- Discussion methods
- Planning coming week
Progress tasks last week
Tasks finished
- Include boxplots and plots (Everyone) - Tom will align everything
- Check outliers (Jelmer) - Later
- Work out multiple regression (Kristian) - Small adjustment needed
- Cosine + sine + day of the year (Kristian) - Done
- LSTM (Sun) - Needs further explanation
- RNN (Mats) - Needs further explanation
- ARIMA (Tom) - First results
- Storyline + presentation (Jelmer) - Storyline later
Sprint planning
- Finish visualization + outlier analysis
- Tom will finish visualization this week
- Finish outlier analysis after finishing models
- Predict correlations using linear regression
- Finish model engineering
- ARIMA
- ANN
- LSTM
- Split train, test and validation set
- Train: January-October
- Test: 2 weeks
- Choose best model
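A minimal sketch of the chronological split planned above, assuming the data has already been aggregated to one `(day, volume)` row per day. The function name `chronological_split`, the `test_days` parameter, and the sample volumes are illustrative and not taken from the project code. The last two weeks are held out as the test set and everything before them is used for training, so no future information leaks into training.

```python
from datetime import date, timedelta


def chronological_split(rows, test_days=14):
    """Split (day, volume) rows so the last `test_days` days form the test set."""
    rows = sorted(rows, key=lambda r: r[0])
    # Every day after the cutoff belongs to the test set.
    cutoff = rows[-1][0] - timedelta(days=test_days)
    train = [r for r in rows if r[0] <= cutoff]
    test = [r for r in rows if r[0] > cutoff]
    return train, test


# Made-up daily volumes, for illustration only.
start = date(2024, 1, 1)
rows = [(start + timedelta(days=i), 100 + i) for i in range(60)]
train, test = chronological_split(rows)
print(len(train), len(test))
```

A validation set could be carved out of the end of the training period in the same way, by calling the function again on `train`.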
Discussion methods
- Include RMSE and MAE
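For reference, both metrics can be computed in a few lines. This is a sketch on made-up numbers, not the project's evaluation code; the function names `mae` and `rmse` are chosen here for illustration.

```python
import math


def mae(actual, predicted):
    """Mean absolute error: average size of the errors, in parcels."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error: like MAE, but large errors weigh more."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


# Made-up daily volumes, for illustration only.
actual = [120, 95, 130, 80]
predicted = [110, 100, 125, 90]
print(mae(actual, predicted))   # 7.5
print(rmse(actual, predicted))
```

RMSE is never smaller than MAE on the same data, and a large gap between the two suggests that a few days with very large errors dominate.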
Planning coming week
- ARIMA (Tom)
- Align visualizations and code (Tom)
- Work out linear regression (Kristian)
- LSTM (Sun & Jelmer)
- RNN (Mats)
Midterm PrimeVision - 08-10-2024 - 10:00-11:00
- Feature embedding might be overkill (when only considering day, week, month)
- Embedding with (trained) PCA
- No bidirectional LSTM, since you are predicting the future
- Why mix LSTM with a transformer? Attention is already included in LSTM; delve into that or discard it
- ARIMA: check how the accuracy falls off when predicting further ahead
- Take a look at smaller sections (for example, look at individual months, or bundle extremes such as high or low outliers)
- Mean absolute error and mean squared error data
- Feature importance for the model
- Feature weight
- Lean split
- Absolute error over time
- Variance actuals vs predictions
- Report running time
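One way to look at how accuracy falls off when predicting further ahead is to average the absolute error per forecast step over several forecast origins. The sketch below assumes the forecasts have already been produced by some model (ARIMA or otherwise); `mae_by_horizon` and the numbers are illustrative only.

```python
def mae_by_horizon(actuals, forecasts):
    """Mean absolute error per step ahead.

    `actuals` and `forecasts` are lists of equal-length sequences,
    one sequence per forecast origin; index h is the (h+1)-step-ahead value.
    """
    horizon = len(forecasts[0])
    n = len(forecasts)
    return [
        sum(abs(a[h] - f[h]) for a, f in zip(actuals, forecasts)) / n
        for h in range(horizon)
    ]


# Two made-up forecast origins with a three-day horizon.
actuals = [[100, 110, 120], [90, 95, 105]]
forecasts = [[98, 105, 130], [92, 100, 90]]
print(mae_by_horizon(actuals, forecasts))  # [2.0, 5.0, 12.5]
```

Plotting the resulting list against the step number gives the "absolute error over time" view mentioned in the feedback.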
Meeting 4 - 01-10-2024 - 13:45-16:00
Agenda
- Progress tasks last week
- Sprint retrospective
- Meeting PrimeVision
- Meeting Nicolas
- Discussion methods
- Sprint planning coming weeks (week 5 & 6)
- Planning coming week
Progress tasks last week
Tasks finished
- Make boxplots (year, month, weekday, sorting center) (Jelmer) - Adjust to small plots
- Create main file for visualizations (Tom) - Done
- Aggregate planning horizon (Mats) - Add and adjust for VANTAA
- boxplots (Jelmer) - Add and adjust for VANTAA
- Day of month (Sun) - Adjust for VANTAA
- Chute outliers (Tom) - Adjust for fewer chutes (no focus)
- Create Artificial Neural Network as baseline (Mats) - Adjust splitting train and test
- Investigate ARIMA (Tom) - Continue
- Investigate Recurrent Neural Network (LSTM) (Sun, Jelmer, Kristian) - Continue
Sprint retrospective
- Import and load data - finished
- Clean data - finished
- Visualize data - boxplots and decide which are interesting
- Remove outliers - investigate with PrimeVision
- Add extra features - day, month, weekday added (experiment later on with seasonality, holidays)
Notes: some ambiguities in the data and goal led to delay -> Improvement: ask questions directly and make choices early
Meeting PrimeVision
- Some output belts have less demand at the beginning of the year and more at the end of the year in comparison to other belts? The other way around also happens. Is there an explanation for that?
- Not all chutes are always open -> just assume that a chute is always used for some region
- Send email for extra clarification
- What is the final model that they want to have?
- Input: last days, last year (data we have)? For what period do you want the forecast?
- Directly into the future
- Check how much historical data is needed for good prediction (the less, the better)
- Output: Predictions two weeks for all output belts and all sorting centers?
- Try for all output belts and otherwise only for all sorting centers
- What to do with outliers? When are they outliers?
- Check on performance
Meeting Nicolas (TA)
- Answer mail Nicolas + schedule meeting
- Outliers
- Check whether there is time-trend
- Correlation
- Better to overshoot or undershoot?
- Date
- Use sine and cosine to connect day 7 to day 1
- Add day of the year
- Output belt should be a feature
- Testing
- Use previous months to predict this month
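The sine and cosine suggestion can be sketched as follows. Each calendar value is mapped onto a circle so that the last value of a cycle sits next to the first one. The weekday numbering (0 = Monday to 6 = Sunday) and the helper names `cyclical` and `distance` are assumptions made for this example.

```python
import math


def cyclical(value, period):
    """Encode a periodic value as a (sin, cos) pair on the unit circle."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def distance(a, b):
    """Euclidean distance between two encoded points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


# Assumed numbering: 0 = Monday ... 6 = Sunday.
monday, tuesday, sunday = cyclical(0, 7), cyclical(1, 7), cyclical(6, 7)

# With raw integers, Sunday (6) is far from Monday (0).
# After encoding, Sunday is as close to Monday as Tuesday is.
print(distance(sunday, monday), distance(monday, tuesday))
```

The same function applies to the day of the year with a period of 365 (or 366 in a leap year such as 2024).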
Discussion methods
- Multiple regression instead of linear regression (compare which variables have most influence)
- Weekday contributes most to performance
- Week and month are similar -> remove month
Presentation
- Explain problem
- Explain data
- Visualization
- Methods
- Next steps
Sprint planning coming weeks (week 5 & 6)
- Finish visualization + outlier analysis
- Predict correlations using linear regression
- Finish model engineering
- ARIMA
- ANN
- LSTM
- Split train, test and validation set
- Choose best model
Planning coming week
- Include boxplots and plots (Everyone)
- Check outliers (Jelmer)
- Work out multiple regression (Kristian)
- Cosine + sine + day of the year (Kristian)
- LSTM (Sun)
- RNN (Mats)
- ARIMA (Tom)
- Storyline + presentation (Jelmer)
Meeting 3 - 24-09-2024 - 13:45-15:30
Agenda
- Progress tasks last week
- Conclusions visualization
- Progress sprint planning (week 3 & 4)
- Planning coming week
Progress tasks last week
Tasks finished
- Finish step 1 & 2 in class (everyone)
- Visualize aggregate demand over planning horizon (seasonality) (Mats)
- Visualize aggregate demand on day of the month (Dexin)
- Visualize per chute demand over planning horizon (in percentages) (Tom)
- Visualize per chute demand on weekday (in percentages) (Tom)
- Overview of data quality (Jelmer)
- Visualize aggregate demand on weekday (Kristian)
Conclusions visualization
- Global pattern (aggregate)
- More in December
- Stable over the year
- Weekly pattern (Tuesday and Wednesday busiest, Saturday far less, Sunday the least)
- Cluster based on size of sorting center
- Try to find the reason for peaks
- Make boxplots for year, month, day, per sorting center
- A few output belts have a high percentage of output
- Sine pattern over the days of the month; needs to be fixed to correct for month days in a year, plus shown in one plot
- Questions:
- Create separate neural networks for different sorting centers?
- Some output belts have less demand at the beginning of the year and more at the end of the year in comparison to other belts? The other way around also happens. Is there an explanation for that?
Progress sprint planning (week 3 & 4)
- Import and load data - finished
- Clean data - finished
- Visualize data - boxplots and decide which are interesting
- Remove outliers - investigate with PrimeVision
- Add extra features - day, month, weekday added (experiment later on with seasonality, holidays)
Planning coming week
- Make boxplots (year, month, weekday, sorting center) (Jelmer)
- Create main file for visualizations (Tom)
- Aggregate planning horizon (Mats)
- boxplots (Jelmer)
- Day of month (Sun)
- Chute outliers (Tom)
- Create Artificial Neural Network as baseline (Mats)
- Investigate ARIMA (Tom)
- Investigate Recurrent Neural Network (LSTM) (Sun, Jelmer, Kristian)
Meeting 2 - 17-09-2024 - 13:45-15:30
Agenda
- Meet with PrimeVision
- Revise planning
- Tasks division
Meeting PrimeVision
- Objective:
- calculate amount of volume per chute
- chute = output belt, physical location
- Functioning of software:
- Enter date and give prediction for next two weeks
- Split data per sorting center
- Filter on event type since we need to count them only once
- Implementation
- First try in Python
- If we have spare time, try to implement in AWS native together with people from PrimeVision
- Data
- Should be available today
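The objective and the filtering step above could look roughly like this. Parcels are counted per sorting center, day, and chute, keeping only one event type so that each parcel is counted once. The field names (`site_id`, `day`, `chute`, `event_type`) and the event type value `"SORTED"` are placeholders, since the actual column names in the PrimeVision data are not given here.

```python
from collections import Counter


def volume_per_chute(events, event_type="SORTED"):
    """Count parcels per (site, day, chute), using one event type only."""
    counts = Counter()
    for event in events:
        # Skip other event types so a parcel is not counted twice.
        if event["event_type"] == event_type:
            counts[(event["site_id"], event["day"], event["chute"])] += 1
    return counts


# Made-up events, for illustration only.
events = [
    {"site_id": "A", "day": "2024-01-02", "chute": 1, "event_type": "ANNOUNCED"},
    {"site_id": "A", "day": "2024-01-02", "chute": 1, "event_type": "SORTED"},
    {"site_id": "A", "day": "2024-01-02", "chute": 1, "event_type": "SORTED"},
    {"site_id": "B", "day": "2024-01-02", "chute": 3, "event_type": "SORTED"},
]
counts = volume_per_chute(events)
print(counts[("A", "2024-01-02", 1)])  # 2
```

Because the sorting center is part of the key, the result can be split per sorting center afterwards without a second pass over the raw events.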
Revise planning
- Read about machine learning techniques
- Take time series into account (trends last period)
- Otherwise we make prediction for every day of the year
Steps:
- Import and load data
- Clean data
- Visualize data
- Remove outliers + add extra features
- Advanced data analysis
- Predict correlations using linear regression
- Split train, test and validation set
- Normalize data
- Simple Neural Network
- LSTM / deep Neural Network
- Analyze results
Tasks division
- Finish step 1 & 2 in class (everyone)
- Visualize aggregate demand over planning horizon (seasonality) (Mats)
- Visualize aggregate demand on weekday (Kristian)
- Visualize aggregate demand on day of the month (Dexin)
- Visualize per chute demand over planning horizon (in percentages) (Tom)
- Visualize per chute demand on weekday (in percentages) (Tom)
- Overview of data quality (Jelmer)
Meeting 1 - 10-09-2024 - 15:45-17:00
Agenda
- Discuss backlog
Discuss backlog
- Create research question + subquestions
- Create sprint planning
- Create backlog
- Tech stack