專案PPT - cookie10142008/tsmc_intern GitHub Wiki

Anomaly Diagnosis of Database Performance by Machine Learning
F15MITD 2019/08/26
Background & Benefits of SQL Performance Shift Detection
Intern Objective for Solution of Abnormal Pattern
Construct an efficient way to identify the SQL causing anomaly
Find out the abnormal patterns in CIMDB
Find out the relationship between SQL and CPU usage
Find out which SQL causing abnormal pattern
5 Steps to Implement
Advisement for TSMC
Intern in TSMC Ba* ckground
Currently 300mm FAB production line are depending on IT system fully automation, these huge LOT processed data and system transaction need to rely on lots of I/O to Oracle/DB2 database.
But…
inappropriate AP logic and bad SQL performance may cause highly I/O usage and CPU trend up issue
sharply slow down the FAB Track in/out frequency.
current abnormal CPU usage are taking manual monitor without preventing previously
substantially take time and human power to fix/find out root causes.

Early detect abnormal patterns in CIMDB
Detect whether the abnormal pattern is related to SQL or not
Find out which SQL causing abnormal pattern

Step 1: Identify Abnormal Pattern
Automate way of finding peak in CPU usage charts
Step 2: Feature Engineering
Enhance referenced data/parameters for machine learning detection model
Step 3: Feature Selection
Select key features for ML detection model
Step 4: Model Selection
Build different ML detection models to find out
which SQL causing Peak
Step 5: Model Evaluation
Evaluate the accuracy of ML detection model
based on the peak found
Concept:
- predefine threshold and if CPU usage is over it, see it as peak
Problem:
- spend time on predefining threshold
- DB trend may change
Concept:
- use K-means ML algorithm to split dataset into peak cluster and non-peak cluster
Advantage:
- ML help to automatically define threshold and identify peak
- adjust threshold whenever we change our needs
Task2：Pre-task of feature engineering
Purpose: Collect more features for ML model as reference to identify SQL causing abnormal pattern
Method:
Remove the noise (like empty values that affect the judgement of ML model )
Data type transform (transform non-numeric type value(e.g., letters) into numeric type that machine can recognize)
Join table (Combine SQL performance table & Frequency of SQL usage table)
Benefit: Improve the accuracy of ML model
Task2：Check whether SQL have something to do with CPU usage or not (calculate Pearson Correlation)
Purpose: Select key features making the detection model more precisely
Method: PCA(Principal components analysis) (features: 9987 => 120)
Result: - cannot improve the accuracy of ML detection model
Purpose: Build model with previously processed parameters to identify which SQL causing abnormal pattern
Method: Refer to “sk-learn” model selection and validation result, try appropriate classification model and choose the better one
Benefit: Get the better model to make more precise result
Criteria of evaluation: Hit rate / False alarm rate
Task3：Find out which SQL causing abnormal pattern
(Hit rate/False alarm rate)
SVM:0.5/0.16
KNN:0/0.06
AdaBoost: 0.36/0.07
Gradient Boosting:0.63/0.036
Ensemble-XGBoost:0.909/0.117
About Project
Inspiration of current methodology
Collect more DB data from Hsinchu branch to predict which combination of SQL usage will cause peak in the future
E.g., simultaneously using A,B,C 3 SQL will definitely cause peak
About Environment
Set metal gate at the entrance before current metal gates

        avoid unconsciously bringing cellphone, laptop

Easy way to get open learning resources & analysis tools of ML
Only teamwork can finally reach the goal
Everyone connects, nobody can make mistakes
Only keeping corporate’s intelligence confidential can maintain competitive advantages