State of the art Machine Learning to predict web application crashes 💉 - urcuqui/Data-Science GitHub Wiki
[1] Resource exhaustion has different causes: overload, inadequate system resource planning, or transient software failures that consume resources until a crash. The article proposes a framework based on machine learning techniques to predict the time to crash when the system suffers software errors, such as memory leaks, that consume resources randomly and gradually; the study places its emphasis on memory leaks. The authors explain that clustering is the most used technique to minimize outages in industry, and that today most business-critical servers apply some sort of server redundancy, load balancing, and fail-over techniques.
They explain that, in the literature, the three reasons for downtime are:
- Human or operator errors (40%)
- Software errors (40%)
- Hardware errors (20%)
According to [1][2], software errors can be divided into two main categories:
- Heisenbugs: they manifest themselves and disappear nondeterministically, which makes them harder to analyze. These failures are related to memory, CPU time, processes, connectivity, etc.
- Bohrbugs: they are deterministic and relatively easy to fix through traditional testing and debugging.
As explained above, the work proposes a framework focused on predicting the time until crash of application servers subject to nondeterministic software faults. Moreover, since their main goal was a real-time system, they only considered machine learning techniques with low computational cost.
They took a set of predefined measurements M1, M2, ..., Mn of a system. These measurements are taken at different points in time: M(i,t) is the value of Mi seen at time t.
Some examples are: memory, response time, number of active processes, cpu usage, requests per unit time, etc.
The relation between workload, internal state, and measurements is so complex that they treat the M(i,t) as random variables, i.e., as values coming from some (unknown) probability distribution.
They denote by R(i,t) the expected value of M(i,t). The choice of the name R(i,t) is not neutral: we think of M(i,t) as measuring some kind of resource Ri over time.
System stability hypothesis: assuming the workload characteristics remain constant, after a sufficiently long time t the system will converge to a set of values of R(i,t) that depends on the workload only; in particular, the rate of change lambda(i,t) tends to 0.
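The hypothesis can be written compactly as follows (a reconstruction from the definitions above, with λ(i,t) taken to be the rate of change of R(i,t)):

```latex
R(i,t) = \mathbb{E}\left[M(i,t)\right], \qquad
\lambda(i,t) = \frac{\partial R(i,t)}{\partial t}, \qquad
\lim_{t \to \infty} \lambda(i,t) = 0 \quad \text{(constant workload)}
```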
This paper proposes a strategy for dealing with this kind of failure: by monitoring the speed at which R(i,t) varies (equivalently, monitoring the evolution of lambda(i,t)), we can estimate a time Tfail > t at which R(i,Tfail) reaches the exhaustion point of resource Ri.
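The idea can be sketched as follows. This is a minimal illustration, not the paper's actual code: it fits the recent slope lambda of a resource series R(i,t) by linear regression and projects when the fitted line crosses the exhaustion limit.

```python
import numpy as np

def estimate_tfail(times, values, limit):
    """Fit R(i,t) ~ a + lambda*t on the recent samples and solve for the
    time at which the fitted line reaches `limit`. Returns None when the
    resource is not growing (lambda <= 0), i.e., no exhaustion predicted."""
    lam, a = np.polyfit(times, values, 1)  # slope lambda, intercept a
    if lam <= 0:
        return None
    return (limit - a) / lam

# Memory usage growing ~2 MB per time unit toward a 1000 MB limit
t = np.arange(10)
r = 100 + 2.0 * t
print(estimate_tfail(t, r, 1000.0))  # ~450.0 time units until exhaustion
```

In practice the slope would be re-fitted over a sliding window so the estimate of Tfail is updated as new measurements arrive.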
Their framework is composed of two parts: the Application Server Machine and the Prediction Framework Machine.
- The Monitoring Agent collects system metrics such as throughput, response time, workload, system load, disk usage, swap used, number of processes, number of threads, free system memory, memory occupied by Tomcat, number of HTTP connections received, and number of connections to the database. All of these metrics are obtained through a modified version of Nagios.
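As a rough stand-in for the modified-Nagios agent (the names and structure here are assumptions for illustration), a monitoring agent boils down to periodically sampling a set of named metric probes and appending timestamped rows to a dataset:

```python
import time

class MonitoringAgent:
    def __init__(self, metrics):
        # metrics: dict mapping metric name -> zero-argument sampler function
        self.metrics = metrics
        self.rows = []

    def sample_once(self):
        """Take one timestamped sample of every registered metric."""
        row = {"t": time.time()}
        for name, sampler in self.metrics.items():
            row[name] = sampler()
        self.rows.append(row)
        return row

    def run(self, samples, interval):
        """Poll all metrics `samples` times, `interval` seconds apart."""
        for _ in range(samples):
            self.sample_once()
            time.sleep(interval)

# Dummy samplers standing in for real probes (free memory, process count, ...)
agent = MonitoringAgent({
    "free_memory_mb": lambda: 512.0,
    "active_processes": lambda: 87,
})
agent.run(samples=3, interval=0.01)
print(len(agent.rows))  # 3 sampled rows
```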
They collect data from several system executions, possibly under different workloads, using the monitoring tool. Afterwards, an expert on the team locates the crashes in the dataset. The resulting data set is used as input for the Enriching Process.
Experimental Setup
In their experiment, they used a multi-tier e-commerce site that simulates an online bookstore, following the standard configuration of the TPC-W benchmark (obsolete as of 4/28/05).
The TPC-W benchmark (http://www.tpc.org/tpcw/) makes it possible to run different experiments with different parameters in a controlled environment.
TPC-W provides Emulated Browsers (EBs), where each client accesses the website (a simulation of an online bookstore); these interactions are called sessions. Each session is a sequence of logically connected requests; between two consecutive requests from the same session, the EB waits for a think time that emulates the user reading the page.
To simulate a transient failure that consumes resources until their exhaustion, they modified a servlet (the TPCW_search_request_servlet class) of TPC-W. The servlet computes a random number between 0 and N that determines how many requests will use the servlet before the next memory leak is injected.
The variation of memory consumption depends on the number of clients and the frequency of servlet visits: on average, there is a memory leak injection every N/2 requests. According to the TPC-W specification, this frequency depends on the workload chosen; the tool offers three workload mixes (browsing, shopping, and ordering). Their experiments were conducted using the shopping distribution.
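The fault-injection logic described above can be sketched like this (the real modification was made inside TPC-W's Java servlet; this Python version only illustrates the mechanism): after a random number of requests between 0 and N, a chunk of memory is retained forever, emulating a gradual leak.

```python
import random

LEAKED = []          # references kept alive forever: the injected "leak"
N = 100              # upper bound on requests between two injections
_countdown = random.randint(0, N)

def handle_request():
    """Serve one (simulated) request, occasionally injecting a leak."""
    global _countdown
    _countdown -= 1
    if _countdown <= 0:
        LEAKED.append(bytearray(1024))       # retain ~1 KB, never freed
        _countdown = random.randint(0, N)    # schedule the next injection

for _ in range(1000):
    handle_request()
# With injections every N/2 = 50 requests on average,
# roughly 20 leaks are expected after 1000 requests.
```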
They collected data from several system executions, possibly under different workloads, and an expert located where in the data set the crashes occurred (for example, which values of response time or throughput are unacceptable).
Through the WEKA application, they used three algorithms: J48, NaiveBayes, and IBk (k-nearest neighbors), with the default options.
The results are shown in the form of confusion matrices, indicating how many examples of each class (red, orange, green) are classified into each class.
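As a toy stand-in for the WEKA setup (the features, thresholds, and data below are invented for illustration), here is a 1-nearest-neighbour classifier, the idea behind IBk, labelling samples red/orange/green by their proximity to crash, plus the kind of confusion matrix the authors report:

```python
import math
from collections import defaultdict

def nn_predict(train, x):
    """Return the label of the training point closest to x (1-NN)."""
    return min(train, key=lambda point: math.dist(point[0], x))[1]

# Hypothetical (free_memory_mb, response_time_s) samples -> crash-proximity class
train = [((900, 0.1), "green"), ((400, 0.5), "orange"), ((50, 3.0), "red")]
test = [((850, 0.2), "green"), ((300, 0.6), "orange"),
        ((60, 2.5), "red"), ((500, 0.4), "orange")]

# confusion[(actual, predicted)] counts examples of `actual` labelled `predicted`
confusion = defaultdict(int)
for x, actual in test:
    confusion[(actual, nn_predict(train, x))] += 1

for (actual, predicted), count in sorted(confusion.items()):
    print(f"{actual} -> {predicted}: {count}")
```

On this tiny synthetic set every test point lands on the diagonal of the matrix; on the real monitoring data the off-diagonal cells are what reveal how often the predictor mistakes, say, an orange state for a green one.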
References
- [1] Alonso, J., Berral, J., Gavalda, R., & Torres, J. Predicting web application crashes using machine learning.
- [2] Hock, M., Neumeister, F., Zitterbart, M., & Bless, R. (2017, October). TCP LoLa: Congestion control for low latencies and high throughput. In Local Computer Networks (LCN), 2017 IEEE 42nd Conference on (pp. 215-218). IEEE.
- [3] Williams, N., Zander, S., & Armitage, G. (2006). A preliminary performance comparison of five machine learning algorithms for practical IP traffic flow classification. ACM SIGCOMM Computer Communication Review, 36(5), 5-16.
- [4] Chaudhuri, A. (2017, October). Hierarchical support vector regression for QoS prediction of network traffic data. In Proceedings of the 1st International Conference on Internet of Things and Machine Learning (p. 15). ACM.
- [5] Mirza, M., Sommers, J., Barford, P., & Zhu, X. (2007, June). A machine learning approach to TCP throughput prediction. In ACM SIGMETRICS Performance Evaluation Review (Vol. 35, No. 1, pp. 97-108). ACM.