ml_report - CankayaUniversity/ceng-407-408-2020-2021-Monitoring-System-of-Water-Quality-and-Efficiency-of-Wastewater-Treatment GitHub Wiki

ML Report

Overview

This document explains which ML algorithms we used and how they behaved on our data. We tested only on the Akarsu dataset because it is larger than the other datasets (Deniz, Göl, Artıma).

Data

The dataset was provided by the Ministry of Environment and Urban Planning and includes water samples taken from rivers, seas, lakes and water treatment plants. However, this dataset has several issues:

  • Samples are recorded at uneven time intervals, which makes time-series prediction difficult.
  • Many values are missing.
  • Many values are incorrect.
  • Outliers are present.

Dealing with these issues cost us a large amount of data, so in this comparison we use only the Akarsu dataset to compare model performance.
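The report does not say how the missing values and outliers were handled, so the following is only a minimal sketch of one common approach: drop rows with missing values, then drop rows outside the 1.5 × IQR fences. The helper names, the sample values and the choice of the IQR rule are all assumptions made for illustration.

```python
from statistics import quantiles


def drop_missing(rows, columns):
    """Keep only rows where every listed column has a value."""
    return [r for r in rows if all(r.get(c) is not None for c in columns)]


def iqr_bounds(values, k=1.5):
    """Return (low, high) fences placed k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def drop_outliers(rows, column, k=1.5):
    """Remove rows whose value in `column` falls outside the IQR fences."""
    low, high = iqr_bounds([r[column] for r in rows], k)
    return [r for r in rows if low <= r[column] <= high]


# Made-up measurements for one feature column (not taken from the real dataset).
samples = [
    {"Nitrat_Azotu": 1.0},
    {"Nitrat_Azotu": 1.2},
    {"Nitrat_Azotu": None},   # missing value
    {"Nitrat_Azotu": 1.1},
    {"Nitrat_Azotu": 0.9},
    {"Nitrat_Azotu": 1.3},
    {"Nitrat_Azotu": 1.0},
    {"Nitrat_Azotu": 1.1},
    {"Nitrat_Azotu": 50.0},   # obvious outlier
]

complete = drop_missing(samples, ["Nitrat_Azotu"])
cleaned = drop_outliers(complete, "Nitrat_Azotu")
print(len(samples), len(complete), len(cleaned))
```

Each filter discards whole rows, which is one reason cleaning can shrink a dataset as sharply as described above.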

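The Akarsu dataset has the following 7 feature columns: Fekal_Koliform (fecal coliform), Toplam_Koliform (total coliform), Toplam_Fosfor (total phosphorus), Toplam_Kjeldahl_Azotu (total Kjeldahl nitrogen), Kimyasal_Oksijen_İhtiyacı (chemical oxygen demand), Nitrat_Azotu (nitrate nitrogen) and Çözünmüş_Oksijen (dissolved oxygen).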

LSTM Models

Both LSTM models use the same hyperparameters so that the comparison is fair. Validation is done by slicing one sample location out of the dataset and comparing the predicted results with the actual ones. Accuracy, Mean Squared Error and validation with the test set are used to compare these models.

  • Optimizer: Adam
  • Epochs: 20
  • n_timesteps = 16
  • Batch size = 8
  • Neurons = 100
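The validation procedure and the n_timesteps setting above can be sketched roughly as follows. This is an illustration under assumptions, not the project's code: the `location` and `features` keys, the next-row prediction target and the naive baseline that stands in for a trained model are all hypothetical.

```python
N_TIMESTEPS = 16  # taken from the hyperparameter list above


def split_by_location(rows, held_out):
    """Hold one sampling location out of training and use it for validation."""
    train = [r for r in rows if r["location"] != held_out]
    test = [r for r in rows if r["location"] == held_out]
    return train, test


def make_windows(series, n_timesteps=N_TIMESTEPS):
    """Turn a time-ordered list of feature rows into (window, next_row) pairs."""
    inputs, targets = [], []
    for start in range(len(series) - n_timesteps):
        inputs.append(series[start:start + n_timesteps])
        targets.append(series[start + n_timesteps])
    return inputs, targets


def mean_squared_error(actual, predicted):
    """Average squared difference over every value in two lists of rows."""
    errors = [
        (a - p) ** 2
        for row_a, row_p in zip(actual, predicted)
        for a, p in zip(row_a, row_p)
    ]
    return sum(errors) / len(errors)


# Hypothetical data: 20 samples at location "A" and 18 at "B", 2 features each.
rows = [{"location": "A", "features": [float(i), float(i) * 2]} for i in range(20)]
rows += [{"location": "B", "features": [float(i), float(i) + 1]} for i in range(18)]

train_rows, test_rows = split_by_location(rows, held_out="B")
test_series = [r["features"] for r in test_rows]
X_test, y_test = make_windows(test_series)

# Naive baseline in place of a trained model: repeat the last row of each window.
baseline = [window[-1] for window in X_test]
mse = mean_squared_error(y_test, baseline)
print(len(X_test), mse)
```

With 16 timesteps per window, a location needs at least 17 samples to yield a single validation pair, so short or heavily cleaned locations contribute little.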

Single LSTM

This model succeeded in predicting the increases and decreases of some features but failed for others. The features it predicted well are Toplam_Fosfor, Kimyasal_Oksijen_İhtiyacı and Çözünmüş_Oksijen.

  • Accuracy: 91%
  • Mean-Squared Error: 0.0015

Line and Scatter Plots

Plots: Çözünmüş Oksijen, Kimyasal Oksijen İhtiyacı, Toplam Fosfor

Stacked LSTM

This model extends the single LSTM: we set return_state and return_sequences and added a second LSTM layer with 100 neurons. It fits better than the single LSTM. We got a good fit for 4 of the 7 features: Toplam_Fosfor, Toplam_Kjeldahl_Azotu, Kimyasal_Oksijen_İhtiyacı and Çözünmüş_Oksijen.

  • Accuracy: 91%
  • Mean-Squared Error: 0.0013
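To show why return_sequences matters when one recurrent layer is stacked on another, here is a toy sketch in plain Python. It uses a single-unit tanh recurrent cell rather than a real LSTM, and the function name and weights are made up for illustration. The point is only that the first layer must emit one output per timestep so that the second layer has a sequence to read.

```python
import math


def recurrent_layer(sequence, w_in, w_rec, return_sequences=False):
    """Toy single-unit tanh recurrent layer (not an LSTM).

    sequence: list of floats, one input per timestep.
    Returns every hidden state if return_sequences is True,
    otherwise only the final hidden state.
    """
    hidden = 0.0
    states = []
    for x in sequence:
        hidden = math.tanh(w_in * x + w_rec * hidden)
        states.append(hidden)
    return states if return_sequences else states[-1]


inputs = [0.1 * t for t in range(16)]  # 16 timesteps, matching n_timesteps

# First layer: returns the whole sequence of hidden states.
first = recurrent_layer(inputs, w_in=0.5, w_rec=0.3, return_sequences=True)

# Second layer: reads that sequence and keeps only its final state,
# which is what an output layer would consume.
second = recurrent_layer(first, w_in=0.8, w_rec=0.2)
print(len(first), second)
```

Without return_sequences on the first layer, the second layer would receive a single value instead of 16 timesteps and could not run its own recurrence.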

Line and Scatter Plots

Plots: Çözünmüş Oksijen, Kimyasal Oksijen İhtiyacı, Toplam Fosfor, Toplam_Kjeldahl_Azotu

ARIMA

The Autoregressive Integrated Moving Average (ARIMA) model is a class of statistical models for analyzing and forecasting time series data. It differs from the machine learning techniques we used and is commonly applied to tasks such as stock price prediction.

Prediction in ARIMA is done on a single selected column. As a result, the model cannot capture relations between features; it forecasts a feature only from that feature's own past values and past forecast errors.
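The single-column idea can be illustrated with a heavily simplified sketch: difference the series once (the "integrated" part), fit one autoregressive coefficient by least squares, and forecast one step ahead. This corresponds roughly to an ARIMA(1,1,0) model with no intercept and leaves out the moving-average term entirely. It is not the configuration used in this report, which is not stated, and a real analysis would normally rely on a library such as statsmodels. The function names and the series are hypothetical.

```python
def difference(series):
    """First-order differencing: change between consecutive observations."""
    return [b - a for a, b in zip(series, series[1:])]


def fit_ar1(values):
    """Least-squares estimate of phi in values[t] ~= phi * values[t-1].

    No intercept; assumes the values are not all zero.
    """
    numerator = sum(cur * prev for prev, cur in zip(values, values[1:]))
    denominator = sum(prev * prev for prev in values[:-1])
    return numerator / denominator


def forecast_next(series):
    """One-step forecast: last value plus the predicted next difference."""
    diffs = difference(series)
    phi = fit_ar1(diffs)
    return series[-1] + phi * diffs[-1]


# Hypothetical measurements of a single feature, in time order.
series = [1.0, 2.0, 4.0, 8.0, 16.0]
prediction = forecast_next(series)
print(prediction)
```

Only one column enters the computation, so any relationship between, say, dissolved oxygen and chemical oxygen demand is invisible to this kind of model.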

ARIMA Results
