Data Science - Fleet-Analytics-Dashboard/Application GitHub Wiki

1. General

As a step towards normalization of the tables in our database, we decided to split the original dataset into:

  • Vehicle_data
    • Containing all vehicle-related information
  • Driving_data
    • Containing all the observations for each journey of our vehicles

Through further simulation and calculation, we also created the following tables:

  • Vehicle_cost_data
    • Containing a calculation of the cost per vehicle per month
  • Driver_names
    • A list of personal information about each driver
  • 10_fold_cross_validation_maintenance
    • The result of the 10-fold cross-validation of our maintenance-prediction model

For the simulation of our data, we mainly used NumPy's random number functions. The following variables were created using different distributions:

  • Business Goals
    • The data points were simulated using a normal distribution and sorted in ascending order.
  • Vehicle build year
    • A random integer between 2000 and 2018 was created using the randint() function
  • Load capacity
    • Depending on the vehicle class, we created random integer values representing each vehicle's load capacity in pounds, using the randint() function
  • Vehicle status
    • The statuses 'accident', 'unused', 'idle', 'on time', and 'delayed' were set.
    • Different probabilities were set for each status to ensure a realistic distribution
    • Afterwards, the status of each vehicle with a service scheduled for the current week was set to 'maintenance'
  • Fuel cost
    • Depending on the vehicle class, a random float value representing the fuel consumption in miles per gallon was created
    • The average price for a gallon of diesel was divided by the simulated consumption and multiplied by the total distance of the trip to calculate the fuel cost in dollars per trip
    • For the further calculation of the fuel cost per vehicle, see the cost calculation section below
  • Accident probability
    • A random integer between 0 and 100 (excluding 100) was created to represent the likelihood of an accident for each vehicle
    • We planned to implement a proper prediction model here, but this was postponed due to lack of time and other priorities
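The simulation steps above can be sketched as follows. The fleet size, status probabilities, diesel price, and the consumption and distance ranges are assumptions for illustration; the original values are not given in the text.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
n = 100  # hypothetical fleet size

# Build year: random integer between 2000 and 2018 (inclusive)
build_year = rng.integers(2000, 2019, size=n)

# Vehicle status: fixed categories with hand-picked probabilities
# (the probabilities below are assumed, not the project's actual values)
statuses = ['accident', 'unused', 'idle', 'on time', 'delayed']
probs = [0.02, 0.08, 0.20, 0.55, 0.15]
status = rng.choice(statuses, size=n, p=probs)

# Fuel cost per trip: price per gallon divided by consumption (mpg),
# multiplied by the trip distance in miles
diesel_price = 3.20                    # assumed average $/gallon
mpg = rng.uniform(6.0, 9.0, size=n)    # assumed class-dependent range
distance = rng.uniform(50, 500, size=n)
fuel_cost = diesel_price / mpg * distance

# Accident probability: random integer in [0, 100)
accident_prob = rng.integers(0, 100, size=n)
```

A `default_rng` generator is used here in place of the legacy `np.random.randint`-style calls; both approaches draw from the same distributions.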

Furthermore, we created the following variables for each vehicle:

  • Position
    • We used geomidpoint to generate a random position within a given rectangle across the USA and Canada for each vehicle and added it to the data frame
  • Licence plate
    • We generated a licence plate with a sequential number, sorted by the vehicle ID
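A minimal sketch of the two per-vehicle variables above. The bounding-box coordinates and the plate format are assumptions; the text does not specify the rectangle used with geomidpoint or the plate scheme, so a plain uniform draw stands in for the geomidpoint tool here.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n = 5  # hypothetical fleet size

# Random position inside a rectangle roughly covering the
# contiguous USA and southern Canada (assumed corner coordinates)
lat = rng.uniform(30.0, 55.0, size=n)
lon = rng.uniform(-125.0, -70.0, size=n)

# Licence plate with a sequential number, ordered by vehicle id
# (the "FLT-" prefix is a made-up format for illustration)
vehicle_id = np.arange(1, n + 1)
plates = [f"FLT-{vid:04d}" for vid in vehicle_id]
```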

2. Costs calculation

  • Fuel Costs: From the calculated fuel cost per trip, we split the data frame into months using the day_id. Then we calculated the monthly fuel costs per vehicle by summing the fuel costs for each vehicle ID.
  • Insurance Costs: Simulated using a normal distribution around $200 a month with a standard deviation of $20.
  • Maintenance Costs: Simulated using a normal distribution around $1200 a month with a standard deviation of $100.
  • Total Costs: The sum of the three cost components above.
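The cost aggregation above can be sketched with pandas. The trip data is simulated here, and the 30-day month assumption for splitting `day_id` is ours; only the $200/$20 and $1200/$100 distribution parameters come from the text.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)

# Hypothetical per-trip records: vehicle id, day of trip, fuel cost
trips = pd.DataFrame({
    'vehicle_id': rng.integers(1, 4, size=200),
    'day_id': rng.integers(1, 61, size=200),   # two months of days
    'fuel_cost': rng.uniform(20, 120, size=200),
})

# Split monthly via day_id (assuming 30-day months),
# then sum the fuel costs per vehicle and month
trips['month'] = (trips['day_id'] - 1) // 30 + 1
costs = trips.groupby(['vehicle_id', 'month'])['fuel_cost'].sum().reset_index()

# Insurance and maintenance: normal distributions around $200 and $1200
costs['insurance_cost'] = rng.normal(200, 20, size=len(costs))
costs['maintenance_cost'] = rng.normal(1200, 100, size=len(costs))

# Total costs: sum of the three components
costs['total_cost'] = (costs['fuel_cost']
                       + costs['insurance_cost']
                       + costs['maintenance_cost'])
```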

3. Prediction

We decided to implement a prediction model for maintenance to help the fleet manager plan delivery contracts with ease. Through the maintenance prediction, a shortage of vehicles due to maintenance can be recognised earlier and therefore handled in advance.

Our idea was to determine the maintenance need for each vehicle after a trip from sensor data for different parts. Each sensor should return a categorical value for each part, with 1 representing a problem and 0 representing no problem. If a certain number of sensors report a problem after a trip, then the vehicle would have to undergo maintenance. Since our dataset did not include such sensor data, we created three sensors using thresholds on existing variables:

  • Tire sensor
    • Using 'maximum_rolling_power_density_demand'
  • Brake sensor
    • Using 'max_deceleration_event_duration'
  • Engine sensor
    • Using 'maximum_kinetic_power_density_demand'

The threshold for each variable was set at its 75th percentile, so a sensor reports a problem whenever its variable falls in the upper quartile. We then decided that if two or more sensors report a problem, the vehicle would need to undergo maintenance.
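The sensor construction can be sketched as follows. The column values are simulated stand-ins for the real driving_data variables; the column names, the 75th-percentile thresholds, and the two-sensor rule come from the text.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=2)

# Simulated stand-ins for the three driving_data variables
df = pd.DataFrame({
    'maximum_rolling_power_density_demand': rng.gamma(2.0, 1.0, 500),
    'max_deceleration_event_duration': rng.gamma(2.0, 1.0, 500),
    'maximum_kinetic_power_density_demand': rng.gamma(2.0, 1.0, 500),
})

sensor_map = {
    'tire_sensor': 'maximum_rolling_power_density_demand',
    'brake_sensor': 'max_deceleration_event_duration',
    'engine_sensor': 'maximum_kinetic_power_density_demand',
}

# Each sensor reports 1 when its variable exceeds the 75th percentile
for sensor, col in sensor_map.items():
    df[sensor] = (df[col] > df[col].quantile(0.75)).astype(int)

# Two or more alarms after a trip -> maintenance needed (target label)
df['maintenance'] = (df[list(sensor_map)].sum(axis=1) >= 2).astype(int)
```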

With our data foundation set, we had to decide on a model to use for our prediction. Based on current developments in the data science community, we decided to try the XGBoost algorithm. XGBoost uses gradient boosting to predict the target variable. For classification problems, it can also return the probability with which the prediction was classified.

To check whether the model would be a good fit for our dataset, we performed 10-fold cross-validation and calculated the root mean squared error (RMSE). The RMSE on the test fold in the last round was about 0.072, which is why we decided to continue using XGBoost.
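The validation procedure can be sketched as below. The data here is synthetic, and since XGBoost may not be installed everywhere, scikit-learn's `GradientBoostingClassifier` stands in for it; both implement gradient boosting and expose predicted class probabilities, so the 10-fold CV and RMSE computation carry over unchanged.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(seed=3)

# Hypothetical feature matrix and 0/1 maintenance label
X = rng.normal(size=(300, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=300) > 0).astype(int)

# 10-fold cross-validation, scoring each test fold by RMSE between
# the true label and the predicted probability of class 1
kf = KFold(n_splits=10, shuffle=True, random_state=0)
rmses = []
for train_idx, test_idx in kf.split(X):
    model = GradientBoostingClassifier(random_state=0)
    model.fit(X[train_idx], y[train_idx])
    proba = model.predict_proba(X[test_idx])[:, 1]
    rmses.append(mean_squared_error(y[test_idx], proba) ** 0.5)
```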

In the following steps, we applied one-hot encoding to all remaining categorical variables and extracted one trip per vehicle from the driving_data table. We did this to be able to predict a final maintenance need for each vehicle. We then used all remaining data points as training data, trained our model, and calculated the probability of the classification being 1 (maintenance need). Based on this probability, we created additional classes representing the weeks remaining until the predicted maintenance. These results were visualised in the maintenance calendar.
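The encoding and binning steps can be sketched as follows. The `vehicle_class` column, the probabilities, and the bin edges mapping probability to weeks are illustrative assumptions; the text does not state the actual cut-offs used.

```python
import pandas as pd

# Hypothetical per-vehicle frame: one leftover categorical variable
# and the model's predicted probability of class 1 (maintenance need)
df = pd.DataFrame({
    'vehicle_class': ['light', 'heavy', 'medium', 'light', 'heavy'],
    'proba_maintenance': [0.95, 0.72, 0.41, 0.15, 0.03],
})

# One-hot encode the remaining categorical variables
encoded = pd.get_dummies(df, columns=['vehicle_class'])

# Bin the probability into classes for the weeks remaining until
# predicted maintenance: higher probability -> maintenance due sooner
bins = [0.0, 0.25, 0.5, 0.75, 1.0]          # assumed cut-offs
labels = ['4+ weeks', '3 weeks', '2 weeks', '1 week']
df['weeks_until_maintenance'] = pd.cut(df['proba_maintenance'], bins=bins,
                                       labels=labels, include_lowest=True)
```

The resulting `weeks_until_maintenance` classes are what a maintenance calendar visualisation would consume.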