專案PPT - cookie10142008/tsmc_intern GitHub Wiki
- Anomaly Diagnosis of Database Performance by Machine Learning
- F15MITD 2019/08/26
- Background & Benefits of SQL Performance Shift Detection
- Intern Objective for Solution of Abnormal Pattern
- Construct an efficient way to identify the SQL causing anomaly
- Find out the abnormal patterns in CIMDB
- Find out the relationship between SQL and CPU usage
- Find out which SQL causing abnormal pattern
- 5 Steps to Implement
- Advisement for TSMC
- Intern in TSMC Ba* ckground
- Currently 300mm FAB production line are depending on IT system fully automation, these huge LOT processed data and system transaction need to rely on lots of I/O to Oracle/DB2 database.
- But…
- inappropriate AP logic and bad SQL performance may cause highly I/O usage and CPU trend up issue
- sharply slow down the FAB Track in/out frequency.
- current abnormal CPU usage are taking manual monitor without preventing previously
- substantially take time and human power to fix/find out root causes.
- Early detect abnormal patterns in CIMDB
- Detect whether the abnormal pattern is related to SQL or not
- Find out which SQL causing abnormal pattern
-
Step 1: Identify Abnormal Pattern
-
Automate way of finding peak in CPU usage charts
-
Step 2: Feature Engineering
-
Enhance referenced data/parameters for machine learning detection model
-
Step 3: Feature Selection
-
Select key features for ML detection model
-
Step 4: Model Selection
-
Build different ML detection models to find out
-
which SQL causing Peak
-
Step 5: Model Evaluation
-
Evaluate the accuracy of ML detection model
-
based on the peak found
-
Concept:
-
- predefine threshold and if CPU usage is over it, see it as peak
-
Problem:
-
- spend time on predefining threshold
-
- DB trend may change
-
Concept:
-
- use K-means ML algorithm to split dataset into peak cluster and non-peak cluster
-
Advantage:
-
- ML help to automatically define threshold and identify peak
-
- adjust threshold whenever we change our needs
-
Task2:Pre-task of feature engineering
-
Purpose: Collect more features for ML model as reference to identify SQL causing abnormal pattern
-
Method:
-
Remove the noise (like empty values that affect the judgement of ML model )
-
Data type transform (transform non-numeric type value(e.g., letters) into numeric type that machine can recognize)
-
Join table (Combine SQL performance table & Frequency of SQL usage table)
-
Benefit: Improve the accuracy of ML model
-
Task2:Check whether SQL have something to do with CPU usage or not (calculate Pearson Correlation)
-
Purpose: Select key features making the detection model more precisely
-
Method: PCA(Principal components analysis) (features: 9987 => 120)
-
Result: - cannot improve the accuracy of ML detection model
-
Purpose: Build model with previously processed parameters to identify which SQL causing abnormal pattern
-
Method: Refer to “sk-learn” model selection and validation result, try appropriate classification model and choose the better one
-
Benefit: Get the better model to make more precise result
-
Criteria of evaluation: Hit rate / False alarm rate
-
Task3:Find out which SQL causing abnormal pattern
-
(Hit rate/False alarm rate)
-
SVM:0.5/0.16
-
KNN:0/0.06
-
AdaBoost: 0.36/0.07
-
Gradient Boosting:0.63/0.036
-
Ensemble-XGBoost:0.909/0.117
-
About Project
-
Inspiration of current methodology
-
Collect more DB data from Hsinchu branch to predict which combination of SQL usage will cause peak in the future
-
E.g., simultaneously using A,B,C 3 SQL will definitely cause peak
-
About Environment
-
Set metal gate at the entrance before current metal gates
-
avoid unconsciously bringing cellphone, laptop -
Easy way to get open learning resources & analysis tools of ML
-
Only teamwork can finally reach the goal
-
Everyone connects, nobody can make mistakes
-
Only keeping corporate’s intelligence confidential can maintain competitive advantages