# Python_ML

์ด๊ฑฐ๋ถ€ํ„ฐ ๋ณด์ž https://lovit.github.io/nlp/representation/2018/03/26/word_doc_embedding/

https://lovit.github.io/nlp/machine%20learning/2018/03/21/kmeans_cluster_labeling/

https://stackoverrun.com/ko/q/7967067

http://doc.mindscale.kr/km/unstructured/04.html

https://frhyme.github.io/python-lib/document-clustering/

[DAY01] Dataset & Performance metrics
======================================

https://www.notion.so/blissray/Machine-Learning-Gachon-IME-2020-Spring-22692a7825d8412ca2a1336411ef4191

http://www.phontron.com/class/nn4nlp2019/schedule.html // spaCy / TfidfVectorizer (sklearn.feature_extraction.text) / t-SNE

https://lss.fnal.gov/archive/2019/slides/fermilab-slides-19-718-cms.pdf http://www.iasonltd.com/wp-upload/all/2020_NPL_Classification_A_Random_Forest_Approach_(Rev).pdf https://ieeexplore.ieee.org/abstract/document/8695493

ํŒŒ์ดํ† ์น˜

https://medium.com/@dodghekgoo/nlp-%EC%8B%9C%EC%9E%91%ED%95%98%EA%B8%B0-%EA%B8%B0%EC%B4%88%ED%8E%B8-d07405383453

ime2020 https://wikidocs.net/21667


Dataset split

โ–ก [Module] pandas

- pandas.concat

  • ๋‘ ๊ฐœ ์ด์ƒ์˜ ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„์„ ํ•˜๋‚˜๋กœ ํ•ฉ์น˜๋Š” ๋ฐ์ดํ„ฐ ์—ฐ๊ฒฐ (cf.merge ์™€๋Š” ๋‹ค๋ฅด๋‹ค)
  • ์ผ๋ฐ˜์ ์œผ๋กœ ์œ„/์•„๋ž˜๋กœ ๋ฐ์ดํ„ฐ ํ–‰์„ ์—ฐ๊ฒฐ.
  • ์˜†์œผ๋กœ ๋ฐ์ดํ„ฐ ์—ด์„ ์—ฐ๊ฒฐํ•˜๋ ค๋ฉด axis=1

Link: Pandas

```python
pd.concat(df_list, axis=1)  # df_list: a list of DataFrames to join column-wise
```
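A minimal sketch of both concat directions, using two hypothetical toy frames:

```python
import pandas as pd

# Hypothetical toy frames with the same columns
a = pd.DataFrame({'x': [1, 2], 'y': [3, 4]})
b = pd.DataFrame({'x': [5, 6], 'y': [7, 8]})

rows = pd.concat([a, b], ignore_index=True)  # default axis=0: stack rows
cols = pd.concat([a, b], axis=1)             # axis=1: join columns side by side
print(rows)
print(cols)
```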

โ–ก [Module] scikit-learn (3. Model selection and evaluation)

- 3.1 Cross-validation: evaluating estimator performance

  • To avoid overfitting, it is common practice when performing a (supervised) machine learning experiment to hold out part of the available data as a test set X_test, y_test.
  • Here is a flowchart of the typical cross-validation workflow in model training. The best parameters can be determined by grid search techniques.


  • A random split into training and test sets:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets

X, y = datasets.load_iris(return_X_y=True)

# Hold out 40% of the data for testing (evaluating) the classifier
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)
```
  • There is still a risk of overfitting on the test set because the parameters can be tweaked until the estimator performs optimally.
  • To solve this problem, yet another part of the dataset can be held out as a so-called โ€œvalidation setโ€.
  • Training proceeds on the training set, after which evaluation is done on the validation set, and when the experiment seems to be successful, final evaluation can be done on the test set.
  • k-fold Cross-validation (CV for short): the training set is split into k smaller sets.
    • A model is trained using k-1 of the folds as training data
    • The resulting model is validated on the remaining part of the data
    • The performance measure reported by k-fold cross-validation is then the average of the values computed in the loop.
    • Pros: does not waste too much data
    • Cons: computationally expensive


```python
sklearn.model_selection.cross_val_score(estimator, X, y=None, *, groups=None,
    scoring=None, cv=None, n_jobs=None, verbose=0, fit_params=None,
    pre_dispatch='2*n_jobs', error_score=nan)

# Arguments
# estimator: the model to evaluate (an object implementing fit)
# cv: int number of folds; the default is 5-fold
# scoring: parameter selecting the model-evaluation rule
#   see https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
#   suitable evaluation functions are defined for classification / clustering / regression

# Return
# scores: the cross-validation scores
```

```python
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

X, y = datasets.load_iris(return_X_y=True)
clf = svm.SVC(kernel='linear')  # the first argument must be an estimator, not data

scores = cross_val_score(clf, X, y, cv=5)
print(f"Accuracy : {scores.mean()}")

scores = cross_val_score(clf, X, y, cv=5, scoring='f1_macro')
```
  • cross_validate != cross_val_score
    • cross_validate can evaluate multiple metrics at once (see the sketch below)

Link : Scikit-learn

โ–ก Exploratory Data Analysis (EDA)

"Torture the data, and it will confess to anything." - Ronald Coase


- Definition: the process of observing and understanding collected data from various angles.

  • EDA is the process of visualizing and analyzing data to extract insights from it.
  • In other words, EDA is the process of summarizing important characteristics of data in order to gain a better understanding of the dataset.
  • Talk with people who know the data well!
  • Review the distribution and values of the data to better understand the phenomenon it represents and to discover potential problems.
  • By examining the data from various angles, you can find patterns at the problem-definition stage and revise existing hypotheses or form new ones.

- Process: build the analysis plan around the research questions and hypotheses!

  • The plan covers which attributes and attribute relations to observe closely, what the best methods are, and so on.
  1. Check the goal of the analysis and which variables exist. Check whether each variable has a name/description.
  2. Look over the data as a whole: check head/tail, outliers, missing values, and so on.
  3. Observe individual attribute values: check whether each attribute has the expected range and distribution; if not, find out why.
  4. Focus on relations between attributes to find patterns that observing individual attributes does not reveal (correlation, visualization):

```python
sns.pairplot(main_df[numeric_col_name])
```

- Common methods used for EDA

  • Descriptive statistics
    • numerical data
      • use the df.describe() method for a brief summary / basic statistics
      • use the sns.boxplot() method for a graph of the distribution
    • categorical data
      • use the value_counts() method for a summary

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy import stats

df = pd.read_csv('input.csv')  # read data from a CSV file
df.head()

# for numerical data
df.describe()  # mean, standard deviation, max & min values

# for categorical data
df['column_name'].value_counts()
```
  • Grouping of data
    • the groupby() method and pivot tables
  • Handling missing values in the dataset
  • ANOVA: analysis of variance
  • Correlation (a small grouping/correlation sketch follows this list)
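A small sketch of grouping, a pivot-style view, and correlation on a hypothetical frame:

```python
import pandas as pd

# Hypothetical frame: one categorical and two numeric columns
df = pd.DataFrame({'group': ['a', 'a', 'b', 'b'],
                   'x': [1.0, 2.0, 3.0, 4.0],
                   'y': [2.0, 4.0, 5.0, 9.0]})

print(df.groupby('group')['x'].mean())                            # grouping
print(df.pivot_table(values='x', index='group', aggfunc='mean'))  # pivot view
print(df[['x', 'y']].corr())                                      # correlation matrix
```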
 
> Link : [EDA](https://medium.com/code-heroku/introduction-to-exploratory-data-analysis-eda-c0257f888676)


y label ์˜ ๊ฒฐ๊ณผ์— ๋”ฐ๋ผ ๊ฐ column ๊ณผ์˜ data ๋ถ„ํฌ๋„๋ฅผ ๋ณผ ์ˆ˜ ์ž‡์Œ 

 
## โ–ก Stratified sampling 
 ### - ์›๋ณธ ๋ฐ์ดํ„ฐ์˜ y ๋˜๋Š” ๊ทธ๋ฃน ๋น„์œจ ๊ณ ๋ คํ•˜์—ฌ ์ƒ˜ํ”Œ๋ง  

[์ตœ์ง„ / Jin Choi] 2020-08-24 11:06
์ธตํ™” ์ถ”์ถœ Stratified sampling : ์›๋ณธ ๋ฐ์ดํ„ฐ์˜ y ๋˜๋Š” ๊ทธ๋ฃน์˜ ๋น„์œจ์„ ๊ณ ๋ คํ•˜์—ฌ ์ƒ˜ํ”Œ๋ง ์‹คํ–‰

[์ตœ์ง„ / Jin Choi] 2020-08-24 11:10
pandas ์˜ Series 
 - pd.Series(list) ๊ฐ€ ์žˆ๋‹ค๋ฉด index ์™€ values ๊ฐ€ ๋™์‹œ์— ๋“ค์–ด๊ฐ„๋‹ค. 
 - pd.Series(list, index=index_list) ๋ฅผ ์‚ฌ์šฉํ•˜๋ฉด index ๋ฅผ ์ง์ ‘ ์ง€์ •ํ•ด์ค„ ์ˆ˜ ์žˆ์Œ
 - pd.Series(dictionary) dict ํ˜•ํƒœ๋ฅผ ๋„ฃ์–ด์ฃผ๋ฉด ์ด์— ๋งž๊ฒŒ key (index) : val (data) ๋กœ series ์ƒ์„ฑ
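The three construction patterns above, as a runnable sketch:

```python
import pandas as pd

s1 = pd.Series([10, 20, 30])                         # default integer index
s2 = pd.Series([10, 20, 30], index=['a', 'b', 'c'])  # explicit index
s3 = pd.Series({'a': 10, 'b': 20, 'c': 30})          # dict: key -> index, value -> data
```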

[์ตœ์ง„ / Jin Choi] 2020-08-24 11:11
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html




# Performance metrics

 - Model: classification (e.g. classifying cats vs. dogs) / regression (e.g. predicting product sales or stock prices)

## โ–ก Regression model 
 ### - Scale-dependent errors
   * RMSE (Root Mean Squared Error)
     + Drawback: it is affected by the scale of the prediction target.
> <img width="300" src="https://wikimedia.org/api/rest_v1/media/math/render/svg/e258221518869aa1c6561bb75b99476c4734108e"><br/>

```python
from sklearn.metrics import mean_squared_error

# RMSE is the square root of MSE
RMSE = mean_squared_error(y, y_pred) ** 0.5
```

- Percentage errors

  • MAPE (Mean Absolute Percentage Error)
    • Drawback: when the actual value is smaller than 1, MAPE can come out close to infinity.


- Scaled errors

  • MASE (Mean Absolute Scaled Error)
    • The difference between prediction and actual divided by the average fluctuation the series normally moves by.
    • While MAPE divides the error by the actual value, MASE measures how large the error is relative to the usual fluctuation.
    • Useful when predicting high-volatility and low-volatility indicators together (a sketch follows this list).


- ๋ฌธ์ œ์˜ ํŠน์„ฑ์— ๋งž๋Š” loss function ์„ ์„ ํƒํ•˜์—ฌ ๋ชจ๋ธ์— ์ ์šฉํ•˜์ž !

from keras import losses
model.compile(loss=losses.mean_squared_error, optimizer='sgd')

Link: Memory_Segment

โ–ก Classification model

[์ตœ์ง„ / Jin Choi] 2020-08-24 11:24

  regression metrics:
  1. root mean squared error (RMSE): what does an RMSE value of 27 actually mean?
  2. R squared: R² = 1 - MSE / ((1/N) * Σ(y_i - y_mean)²)
  3. mean absolute error (MAE): the mean of the absolute deviations (a sketch of all three follows).
[์ตœ์ง„ / Jin Choi] 2020-08-24 11:51 confusion matrixs (ํ˜ผํ•ฉ ํ–‰๋ ฌ) : ์‹ค์ œ์™€ ์˜ˆ์ธก ๋ ˆ์ด๋ธ”์˜ ์ผ์น˜ ๊ฐœ์ˆ˜๋ฅผ matrix ํ˜•ํƒœ๋กœ ํ‘œํ˜„

[์ตœ์ง„ / Jin Choi] 2020-08-24 11:57 https://datascienceschool.net/view-notebook/731e0d2ef52c41c686ba53dcaf346f32/

[์ตœ์ง„ / Jin Choi] 2020-08-24 12:01 ๋ถˆ๊ท ์ผํ•œ dataset ์ข…๋ฅ˜

  • ํ•˜๋ฒ„๋“œ ์ž…ํ•™ ์ง€์›์ž์˜ ํ•ฉ๊ฒฉ๋ฅ ์€ 2 % ์ธ๋ฐ ์šฐ๋ฆฌ ๋ชจ๋ธ์€ ๋‹ค ๋–จ์–ด์ง„๋‹ค๊ณ  ์˜ˆ์ƒํ•˜๋ฉด ์ •ํ™•๋„๋Š” 0.98 ์ด์ง€๋งŒ, Accuracy ๋งŒ์œผ๋กœ๋งŒ ๋ชจ๋“  data๋ฅผ ์ฒ˜๋ฆฌํ•  ์ˆ˜๋Š” ์—†์Œ.

[์ตœ์ง„ / Jin Choi] 2020-08-24 12:03 imbalanced data ๊ฐ€ ๋Œ€๋ถ€๋ถ„์ด๊ธฐ ๋•Œ๋ฌธ์— ๋”ฅ๋Ÿฌ๋‹์—์„œ ์ด๋ฅผ ๋” ์ž˜ ๋‹ค๋ฃฌ๋‹ค. ๋จธ์‹  ๋Ÿฌ๋‹์—์„œ๋Š” (ํ†ต๊ณ„ ๊ธฐ๋ฐ˜์˜) ...

ML (ํ†ต๊ณ„ํ•™ ๊ธฐ๋ฐ˜์˜ ๋ชจ๋ธ) / DL (NN ๋ชจ๋ธ๋“ค) ์ตœ๊ทผ ๋”ฅ๋Ÿฌ๋‹์—์„œ ์—ฌ๋Ÿฌ ๊ธฐ๋ฒ•์œผ๋กœ imbalance data ์ ์šฉํ•œ ์‚ฌ๋ก€๊ฐ€ ๋งŽ์•„์„œ ์‚ฌ๋ก€ ์†Œ๊ฐœ์ •๋„๋งŒ ํ•  ์˜ˆ์ •..

[์ตœ์ง„ / Jin Choi] 2020-08-24 12:03 ์ด๋Ÿฌํ•œ ๊ฒฝ์šฐ์—๋Š” ์ •๋ฐ€๋„ (Precision) ์„ ์‚ฌ์šฉํ•œ๋‹ค.

[์ตœ์ง„ / Jin Choi] 2020-08-24 12:05 ์ •๋ฐ€๋„๋Š” trade off ๊ฐ€ ์žˆ์Œ

โ–ก Imbalanced data

[์ตœ์ง„ / Jin Choi] 2020-08-24 12:05 https://databuzz-team.github.io/2018/10/21/Handle-Imbalanced-Data/

[์ตœ์ง„ / Jin Choi] 2020-08-24 12:08 https://towardsdatascience.com/handling-imbalanced-datasets-in-deep-learning-f48407a0e758

[์ตœ์ง„ / Jin Choi] 2020-08-24 12:08 https://towardsdatascience.com/methods-for-dealing-with-imbalanced-data-5b761be45a18

[์ตœ์ง„ / Jin Choi] 2020-08-24 12:10 Sensitivity, Recall ๋ฏผ๊ฐ๋„ : ์‹ค์ œ ๊ธ์ • ๋ฐ์ดํ„ฐ ์ค‘ ๊ธ์ •์ด๋ผ๊ณ  ์˜ˆ์ธกํ•œ ๋น„์œจ. ์–ผ๋งˆ๋‚˜ ์ž˜ ๊ธ์ •์ด๋ผ๊ณ  ์˜ˆ์ธกํ•˜๋Š”์ง€ ? ์š”์ƒˆ๋Š” recall ์ด๋ผ๊ณ ๋ถ€๋ฆ„

[์ตœ์ง„ / Jin Choi] 2020-08-24 12:14 precision ๊ณผ recall ์€ trade off ์ด๊ธฐ ๋•Œ๋ฌธ์— F1 Score ์ด๋ผ๋Š” ํ†ตํ•ฉ์  ์ง€ํ‘œ๊ฐ€ ์กด์žฌํ•œ๋‹ค..! F1 = 2 ((precision * recall) / (precision + recall))

[์ตœ์ง„ / Jin Choi] 2020-08-24 12:15 ํ•™์‚ฌ ๊ฒฝ๊ณ  ์˜ˆ์ธก --> recall ์•”์„ ์˜ˆ์ธก --> precision (์•„๋ฌด๋‚˜ ์•”์ด๋ผ๊ณ  ํ•˜๋ฉด ๋ฆฌ์Šคํฌ๊ฐ€ ํฌ๋‹ค..!)

[์ตœ์ง„ / Jin Choi] 2020-08-24 12:16 precision recall curve ์‚ฌ์šฉํ•˜์—ฌ ์˜ˆ์ธก ํ™•๋ฅ  threshold ๋ณ€ํ™”์‹œ์ผœ precision/recall ์ธก์ •

[์ตœ์ง„ / Jin Choi] 2020-08-24 12:17 ์ž„์ง€ํ˜œ๋‹˜/๊ณต์ธ์šฑ๋‹˜/์‹ ์•„ํ˜•๋‹˜/๊น€์ƒ๊ฐ‘๋‹˜/์ด๊ทœํ™๋‹˜ ๋ฐ ๊ธฐํƒ€ ๋ฉ”์ผ๋“ฑ

[์ตœ์ง„ / Jin Choi] 2020-08-24 18:47 https://github.com/jjin-choi

[์ตœ์ง„ / Jin Choi] 2020-08-24 18:49 jjin-choi / Cj+*

[์ตœ์ง„ / Jin Choi] 2020-08-24 20:48 https://3months.tistory.com/325

[์ตœ์ง„ / Jin Choi] 2020-08-24 20:49 https://blog.naver.com/PostView.nhn?blogId=youji4ever&logNo=221484324353&parentCategoryNo=&categoryNo=10&viewDate=&isShowPopularPosts=false&from=postView

[์ตœ์ง„ / Jin Choi] 2020-08-24 20:49 https://m.blog.naver.com/youji4ever/221705683091

2020๋…„ 8์›” 25์ผ ํ™”์š”์ผ

[์ตœ์ง„ / Jin Choi] 2020-08-25 08:44

  1. ํ”„๋กœ์ ํŠธ ๋Œ€์ƒ ๋ฐ์ดํ„ฐ ๋ฐ ํ”„๋กœ์ ํŠธ ๋ชฉ์  ์ •ํ•˜๊ธฐ
  2. ํ”„๋กœ์ ํŠธ์—์„œ ๊ฒ€์ฆํ•˜๊ณ  ์‹ถ์€ ๊ฐ€์„ค ์ •ํ•˜๊ธฐ
  3. ๊ฐ€์„ค์— ๋Œ€ํ•œ ํ™•์ธ ๋ฐฉ๋ฒ• (์‹œ๊ฐํ™”, ํ†ต๊ณ„ ๋“ฑ) (EDA, Feature Engineering...)
  4. ๊ธฐ๋ณธ์ ์ธ ๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ ๋ฐฉํ–ฅ ์ •ํ•˜๊ธฐ (๋ชจ๋ธ๋ง)
  5. ๋ฐ์ดํ„ฐ ์ •๋ฆฌ ๋ฐฉํ–ฅ์œผ๋กœ ์ „์ฒ˜๋ฆฌ ์ง„ํ–‰

[์ตœ์ง„ / Jin Choi] 2020-08-25 08:45 ์ตœ์ข… ์‚ฐ์ถœ๋ฌผ

  1. ๋ถ„์„ ๋Œ€์ƒ DATASET
  2. ์‹œ๊ฐํ™” notebook ํŽ˜์ด์ง€
  • 3๊ฐœ ์ด์ƒ ๊ฐ€์„ค๊ณผ ๊ทธ ๊ฐ€์„ค ๊ฒ€์ฆ์„ ์œ„ํ•œ ์‹œ๊ฐํ™” ๊ฒฐ๊ณผ๋ฌผ
  1. ์˜ˆ์ธก ๋ชจ๋ธ๊ฐœ๋ฐœ notbook ํŽ˜์ด์ง€
  • ์ „์ฒ˜๋ฆฌ ๋ชจ๋“ˆ ๊ฐœ๋ฐœํ•˜๊ธฐ
  • hyper parameter ๋“ฑ

[์ตœ์ง„ / Jin Choi] 2020-08-25 08:46 dacon.io

[์ตœ์ง„ / Jin Choi] 2020-08-25 09:12 ์ง€๊ธˆ ํ• ์ผ

  1. ๋ฐ์ดํ„ฐ์…‹ ์„ ์ •
  2. ๋ถ„์„ ์ฃผ์ œ ์„ ์ •
  3. ๊ฐ€์„ค ์„ ์ •
  • ๋จธ์‹ ๋Ÿฌ๋‹ ํ•™์Šต ๋ฐฉ๋ฒ•๋“ค Gradient descent based learning Probability theory based learning Information theory based learning (์šฐ๋ฆฌ๊ฐ€ ํ•  ๊ฒƒ !)

  • Entropy ๊ด€๋ จ ๋‚ด์šฉ ์ •๋ฆฌํ•˜๊ธฐ ๊ณต์‹ ๋ฐ์ดํ„ฐ์˜ label ์ด ์กด์žฌํ•  ํ™•๋ฅ , - ๊ฐ€ ๋ถ™์–ด์žˆ๋Š” ๋กœ๊ทธ์ด๊ธฐ ๋•Œ๋ฌธ์— ํ™•๋ฅ ์ด 1 ์ด๋ฉด entropy ๊ฐ€ 0 ํ™•๋ฅ ์ด 1 ์ด๋ผ๋Š” ๊ฑด ๋ถˆํ™•์‹ค์„ฑ์ด ์—†์œผ๋ฏ€๋กœ entropy ๊ฐ€ 0 ํ™•๋ฅ ์ด ์ž‘์„์ˆ˜๋ก ๋ชจํ˜ธ์„ฑ์ด ์ปค์ง

Growing a decision tree: choose the branch attribute by asking which attribute provides more certain information about the target label. The criterion for "certain information" differs per algorithm. After growing the tree, generalize it through pruning.

decision tree ํŠน์ง• ํ›ˆ๋ จ ์‹œ๊ฐ„์ด ๊ธธ๊ณ  ๋ฉ”๋ชจ๋ฆฌ ๊ณต๊ฐ„ ๋งŽ์ด ์‚ฌ์šฉ ์ง๊ด€์  ๊ฒฐ๊ณผ ํ‘œํ˜„ top-down / recursive / divide and conquer ๊ธฐ๋ฒ• greedy ์•Œ๊ณ ๋ฆฌ์ฆ˜ -> ๋ถ€๋ถ„ ์ตœ์ ํ™”

๊ด€์ธก์น˜์˜ ์ ˆ๋Œ€๊ฐ’์ด ์•„๋‹Œ ์ˆœ์„œ๊ฐ€ ์ค‘์š” -> outlier ์— ์ด์  ์ž๋™ใ…Ž์  ๋ณ€์ˆ˜ ๋ถ€๋ถ„ ์„ ํƒ scaling ํ•„์š” ์—†์Œ

์•Œ๊ณ ๋ฆฌ์ฆ˜

ID3 -> C4.5, CART ์—ฐ์†ํ˜• ๋ณ€์ˆ˜๋ฅผ ์œ„ํ•œ regression tree ๋„ ์กด์žฌ
Information gain

  • ์—”ํŠธ๋กœํ”ผ ํ•จ์ˆ˜๋ฅผ ๋„์ž…ํ•˜์—ฌ branch splitting
  • ์—”ํŠธ๋กœํ”ผ ์‚ฌ์šฉํ•˜์—ฌ ์†์„ฑ๋ณ„ ๋ถ„๋ฅ˜์‹œ impurity ์ธก์ •ํ•˜๋Š” ์ง€ํ‘œ
  • ์ „์ฒด ์—”ํŠธ๋กœํ”ผ - ์†์„ฑ๋ณ„ ์—”ํŠธ๋กœํ”ผ๋กœ ์†์„ฑ๋ณ„ information gain ๊ณ„์‚ฐ
  • ์ „์ฒด ๋ฐ์ดํ„ฐ d์˜ ์ •๋ณด๋Ÿ‰, ์†์„ฑ a๋กœ ๋ถ„๋ฅ˜์‹œ ์ •๋ณด๋Ÿ‰, a ์†์„ฑ์˜ ์ •๋ณด ์†Œ๋“ ๋“ฑ์œผ๋กœ ๊ตฌํ•  ์ˆ˜ ์ž‡์Œ

C4.5 & Gini index

Problem with information gain: it prefers attributes with many distinct values, because the label subsets become small and the attribute's overall entropy drops. C4.5 was proposed to compensate for this: it normalizes with a log-based split-information value and uses that (the gain ratio) instead.

CART ์•Œ๊ณ ๋ฆฌ์ฆ˜

gini index : entropy ์™€ ๋น„์Šทํ•œ ๊ทธ๋ž˜ํ”„๊ฐ€ ๊ทธ๋ ค์ง

https://scikit-learn.org/stable/modules/tree.html

tree ๋ผ๋Š” ์•Œ๊ณ ๋ฆฌ์ฆ˜์€, ์ด๋Š” ๋˜๋„๋ก์ด๋ฉด branch ๊ฐ€ ์ผ์–ด๋‚ฌ์„๋•Œ ์ƒˆ๋กœ ์ƒ์„ฑ๋˜๋Š” branch ๋ฅผ ๋ชจํ˜ธ์„ฑ์ด ์ ์€ ๋ฐฉํ–ฅ์œผ๋กœ y ๋ผ๋ฒจ์ด ํ•œ์กฑ์œผ๋กœ ๋งŽ์ด ์น˜์šฐ์ ธ์„œ ์žˆ๋Š” ํ˜•ํƒœ๋กœ ์ธก์ •ํ•˜๋Š” ๋ฐฉ๋ฒ•์€ gini ์™€ entropy ๋ฅผ ์“ธ ์ˆ˜ ์žˆ์Œ. binary ํ˜•ํƒœ๋กœ ๊ตฌํ˜„๋˜์–ด์žˆ๊ณ  ์ตœ๊ทผ์—๋Š” gini ๊ฐ€ ๋””ํดํŠธ


Tree pruning

decision tree ์ƒ๊ธฐ๋Š” ๋ฌธ์ œ์  : leaf node ๊ฐ€ ๋„ˆ๋ฌด ๋งŽ์€ ๊ฒฝ์šฐ over-fitting impurity ๋˜๋Š” variance ๋Š” ๋‚ฎ์€๋ฐ, node์— ๋ฐ์ดํ„ฐ๊ฐ€ 1๊ฐœ ์–ด๋–ค ์‹œ์ ์—์„œ tree pruning ํ•ด์•ผํ• ์ง€?

๋ฐฉ๋ฒ•:

  1. pre-pruning ์‚ฌ์ „์— ๊ฐ’์„ ์ •ํ•˜์ž ! 5๋‹จ๊ณ„ ์ดํ•˜๋Š” ๋‚ด๋ ค๊ฐ€์ง€ ๋งˆ ~ ํ•˜์œ„ ๋…ธ๋“œ ๊ฐœ์ˆ˜, ํ•˜์œ„ ๋…ธ๋“œ์˜ label ๋น„์œจ์„ ์ •ํ•œ๋‹ค. threshold ์žก์„ ์ˆ˜๊ฐ€ ์—†์Œ. ์‹คํ—˜์ ์œผ๋กœ ์žก์„ ์ˆ˜๋ฐ–์— ์—†์Œ.. CHAID ๋“ฑ ์‚ฌ์šฉ ๊ณ„์‚ฐ ํšจ์œจ์ด ์ข‹๊ณ  ์ž‘์€ dataset ์—์„œ ์ž˜ ์ž‘๋™ ์†์„ฑ์„ ๋†“์น  ์ˆ˜ ์žˆ์Œ. under-fitting ๊ฐ€๋Šฅ์„ฑ ๋†’์Œ

  2. post-pruning ์˜ค๋ถ„๋ฅ˜์œจ ์ตœ์†Œํ™”

๋จธ์‹ ๋Ÿฌ๋‹ ๊ต๊ณผ์„œ !! ๊ผญ ์ฝ์–ด๋ณด๊ธฐ


DT in sklearn

sklearn.tree.DecisionTreeClassifier

splitter ์‹ ๊ฒฝ ์•ˆ์จ๋‘ ๋Œ max_depth ๊นŠ๊ฒŒ ๋“ค์–ด๊ฐ€๋Š” ๋‹จ๊ณ„

decision parameter ๋Š” ๋„ˆ๋ฌด ๋งŽ๊ธฐ ๋•Œ๋ฌธ์— params = [] ๋ฆฌ์ŠคํŠธ๋กœ ๋งŒ๋“ค์–ด์„œ append ํ•˜๋ฉด ํŽธํ•˜๋‹ค

๊ทธ๋ฆฌ๋“œ ์„œ์น˜ (๋„ˆ๋ฌด ๋ฐฉ๋Œ€ํ•ด์„œ ๋ฒ ์ด์ง€์•ˆ์ด๋‚˜ ๋‹ค๋ฅธ ๋ฐฉ์‹ ์จ์•ผํ•˜๊ธฐ๋„ ํ•จ)

๋ชจ๋ธ์—์„œ ๊ฒฐ๊ณผ๊ฐ€ ๋‚˜์™”์„ ๋•Œ ์–ด๋–ค feature ๊ฐ€ ๊ฐ€์žฅ ์ค‘์š”ํ•œ์ง€ ์•Œ๋ ค์ฃผ๋Š” ์žฅ์  !

Splitting a continuous attribute

  • Compared with discrete nominal data there are many possible split intervals: use every data point as a candidate split point, use the median and the quartiles, or use the points where the y-class value changes.

What are chart_studio and plotly.figure_factory for?

Try "House Prices: Advanced Regression Techniques".

Splitting features into numerical/categorical groups up front makes them easier to inspect later.

Try house_price_with_dt using cross validation.


์•™์ƒ๋ธ” ๋ชจ๋ธ ํ•˜๋‚˜์˜ ๋ชจ๋ธ์ด ์•„๋‹ˆ๋ผ ์—ฌ๋Ÿฌ ๊ฐœ ๋ชจ๋ธ์˜ ํˆฌํ‘œ๋กœ y๊ฐ’ ์˜ˆ์ธก regression ๋ฌธ์ œ์—์„œ๋Š” ํ‰๊ท ๊ฐ’์„ ์˜ˆ์ธก ์‹ค์‹œ๊ฐ„์œผ๋กœ ์„œ๋น„์Šค ํ•˜๋Š” ๊ฒฝ์šฐ์—๋Š” ๋ฆฌ์†Œ์Šค๋ฅผ ๋„ˆ๋ฌด ๋งŽ์ด ์‚ฌ์šฉํ•˜๊ฒŒ ๋จ

kaggle ๋Œ€์„ธ ๊ธฐ๋ฒ• (structed dataset)

  • keywords

valia (๊ฐ€์žฅ ๊ธฐ๋ณธ์ ์ธ) ensemble boosting bagging adaptive boosting

sklearn.ensemble.VotingClassifier
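A minimal VotingClassifier sketch; the three base estimators are illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
voting = VotingClassifier(
    estimators=[('lr', LogisticRegression(max_iter=1000)),
                ('dt', DecisionTreeClassifier(max_depth=4)),
                ('nb', GaussianNB())],
    voting='hard')  # 'hard': majority vote; 'soft' averages predicted probabilities
print(cross_val_score(voting, X, y, cv=5).mean())
```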

// bagging // 2_bagging.ipynb: rather than simply building classifiers from the same dataset, note that a dataset is itself a sample of the population, so build a model per sampled dataset and ensemble them. Build diverse classifiers from diverse sampled datasets.

Bootstrapping: random sampling with replacement from the training data (each drawn point is put back before the next draw); extract n subset training datasets.

(The word also refers to starting something from scratch without external input.)

Train n models on the bootstrap subset samples -> ensemble.

from sklearn.model_selection import GridSearchCV — how many sub-samples to draw is hyperparameter tuning that a human must set; this module lays the candidate values out in matrix form, for example:

max_* (e.g. max_samples): 1 / 0.9 / 0.6 / 0.3
n_estimators: 10 / 20 / 30 / 40

Given these hyperparameter inputs, it tries every combination and returns the best-optimized model (a sketch follows).

(๋”ฅ๋Ÿฌ๋‹์—์„œ๋Š” grid search ํ•˜๊ธฐ ํž˜๋“ค๋‹ค. ๋ณดํ†ต randomized search / ๋ฒ ์ด์ง€์•ˆ ์„ ์‚ฌ์šฉํ•œ๋‹ค.) auto-sklearn

https://automl.github.io/auto-sklearn/master/

Out-of-bag error

OOB error estimation

When bagging, measure performance on the data not included in each bag; similar to how a validation set is handled.

grid.best_estimator_.oob_score_

 
// random forest //
Simple yet good performance.

Train on m subset datasets with low correlation;
when splitting, randomly select n candidate features to examine,
e.g. use only 7 out of 10 and split on whichever reduces ambiguity most.

If the total number of features is p and n = p, it is a bagged tree.

Features can be reused; n is typically sqrt(p) or p/3.
High-variance trees -> last node 1~5.

In short, it is bagging combined with decision trees, except that each split looks at only n features rather than all of them.
How this n is chosen is as follows;
summarize max_features in sklearn.ensemble.RandomForestClassifier:
auto (sqrt(n_features)) / sqrt / log2 / None (n_features) — a sketch follows.
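A small sketch; the dataset and n_estimators value are illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# max_features='sqrt' examines sqrt(n_features) candidates per split;
# max_features=None would examine all features, i.e. a plain bagged tree
rf = RandomForestClassifier(n_estimators=100, max_features='sqrt', random_state=0)
print(cross_val_score(rf, X, y, cv=5).mean())
```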



//// time series ////

Time-series data needs time-specific functionality.

Besides Python's datetime module, pendulum makes things easier.

Python's datetime module is good too.

If a column holds datetimes, convert it with to_datetime.

Linux usually follows utf-8 while Windows encodes files in cp949,
so be aware of which one you are reading.

Summarize pd.crosstab.

// bike demand //

Convert the time-related column to datetime and set it as the index;
then you can look things up with **resample** or groupby.
resample matters.

```python
df['count'].resample('D').sum()  # D/M/W/Q are resampling aliases (look them up online)
```

resampling - filter

For selection, build a date_range and use it:

```python
period = pd.date_range(
   start='2011-01-01', end='2011-05-31', freq='M')
df['count'].resample('D').sum()[period]
```

// time shifting //

Apply a shift of 2: `.shift(periods=2, fill_value=0)` -> everything moves over by 2 slots.

// moving average // Time-series data is noisy -> the moving-average method reduces noise while showing the trend.

  • rolling / expanding: average the data over a 30-day window (e.g. 1/1 ~ 1/31); a sketch follows this list.

  • secondary y-axis (for plotting two scales together)
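A minimal sketch of rolling vs. expanding on a hypothetical daily series:

```python
import numpy as np
import pandas as pd

idx = pd.date_range('2011-01-01', periods=90, freq='D')
s = pd.Series(np.random.randn(90).cumsum(), index=idx)  # hypothetical daily data

ma30 = s.rolling(window=30).mean()       # fixed 30-day moving average
exp = s.expanding(min_periods=1).mean()  # average of everything seen so far
```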

// predictions //

Data sampling for time-series datasets: depending on the data's characteristics, sampling methods differ from ordinary data; measuring the average performance of nested cross-validation is typical. When seasonality appears, split along the time axis: pick a datetime period, cut it out, and predict it.

sMAPE is used a lot.

Usable models: statsmodels ARIMA https://bit.ly/3gv9ZZx, Facebook's Prophet https://bit.ly/3huWKsS, treating it as a regression problem after preprocessing, interpretable DL for time series (N-BEATS: neural basis expansion analysis for interpretable time series forecasting).

When using id values as metadata, use embedding techniques (deep learning).

https://github.com/blissray/s-python/tree/master/day3/imbalanced_dataset — unlike having a single y value, a y value needs to be predicted per page.


/ 4์ผ์ฐจ/

imbalanced data

๋ณดํ†ต ํŠธ๋ ˆ์ด๋‹ data ๋Š” F:T = 5:5 ๋กœ ๋น„์œจ์„ ์ตœ๋Œ€ํ•œ ๋งž์ถ”๊ณ  test data ๋Š” original data set ์˜ ๋น„์œจ์„ ๊ทธ๋Œ€๋กœ ๋งž์ถ”์ž.

dataset resampling imbalanced class ๊ฐ€ ์ถฉ๋ถ„ํžˆ ๋งŽ์œผ๋ฉด under sampling -> false ๋ฐ์ดํ„ฐ๋ฅผ ์ค„์ž„ ๋ถ€์กฑํ•˜๋‹ค๋ฉด over sampling -> true ๋ฐ์ดํ„ฐ๋ฅผ ๋Š˜๋ฆผ ๋ฐ์ดํ„ฐ๋ฅผ ๋Š˜๋ฆฌ๋Š”๊ฒŒ ์ข‹์€ ๋ฐฉ์‹ gpt? ๋กœ ํ…์ŠคํŠธ augmentation ๊ฐ€๋Šฅ

๋”ฅ๋Ÿฌ๋‹์—์„œ๋Š” ๋ฐ์ดํ„ฐ ์ž„๋ฒ ๋”ฉ ๊ฐ€๋Šฅ scikit learn ์˜ imbalanced dataset ํ™•์žฅ ๋ชจ๋“ˆ SMOTE ๋Š”์ž˜ ์•ˆ๋จ.

/ under sampling /

random: select randomly down to the size of the smallest class; NearMiss: heuristics based on the NN algorithm; AllKNN: keeps only the data closest within its own class; instance hardness threshold: select samples based on the probability a model produces (predict_proba).

/ over sampling / random: copy some of the currently available data (a sketch follows).
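A sketch with the imbalanced-learn extension module (requires `pip install imbalanced-learn`; the synthetic dataset is illustrative):

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print(Counter(y))  # imbalanced class counts

X_u, y_u = RandomUnderSampler(random_state=0).fit_resample(X, y)  # shrink majority
X_o, y_o = RandomOverSampler(random_state=0).fit_resample(X, y)   # copy minority
X_s, y_s = SMOTE(random_state=0).fit_resample(X, y)               # synthesize minority
print(Counter(y_u), Counter(y_o), Counter(y_s))
```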

/ Performance metrics / Besides F1, the ROC curve (Receiver Operating Characteristic) is also widely used.

The ratio between what is wrongly classified and what is correctly classified? (false-positive rate vs. true-positive rate; a sketch follows)

lh ๊ณต์‚ฌ์—์„œ ํ•˜๋Š” compas ๋ผ๋Š” ๊ฑฐ

How to handle an imbalanced dataset: get more data! Ensembles and more sophisticated algorithms are also worth trying.

Use melt.

Visualize the data!

[DAY06] Naive Bayes Classification

Probability

โ–ก Bayes's theorem



One hot ์œผ๋กœ ๋จผ์ € ๋ฐ”๊ฟ”์ค˜์•ผ ํ•œ๋‹ค.

np.where(Y_data==True) # Y_data ๊ฐ€ True ์ธ ๊ฐ’์˜ index ๊ฐ€ return 

sklearn.naive_bayes.BernoulliNB NB ์—์„œ 0 ๋˜๋Š” 1๋กœ ,,

  • Multinomial Naive Bayes: for problems where X is not binary but takes values of 1 or more.

  • Bag of words: assign an index to each word and represent a sentence as a vector of word counts.

  • Multinomial naive bayes: the formula is computed differently.

Numerator:

Denominator: how many times the given document class occurred + the smoothing value

  • CountVectorizer..

vector ํ™” ์‹œํ‚ค๋Š” ๋ฐฉ๋ฒ•์€ ๋‹ค์–‘ํ•œ๋ฐ, sklearn ์„ ์ด์šฉํ•˜์ž. stop word ๋“ฑ๋“ฑ text data ์ผ๋•Œ๋„

/ processing

  1. data preparation: loading the data / tagging
  2. preprocessing: cleansing / tokenization (split on whitespace) / stopword removal / stemming
  3. vectorizer hyperparameters (n-grams, thresholds, TF-IDF or bag of words); the weakness of bag of words is that a frequent word is not necessarily an important one. With large data BERT is preferable; for log data TF-IDF is also fine.

multinominal ์ด๋ž‘ logistic regression ๋งŒ ์“ธ ์ˆ˜ ์žˆ๋‹ค. text ๋ฐ์ดํ„ฐ ์ด๋ฏ€๋กœ

sklearn ์˜ pipeline ์ด์šฉํ•˜๊ธฐ fit(X, y) ์—์„œ ๋ณ€ํ™˜๋œ๊ฑธ fit ํ•˜๊ณ  ... ์˜ˆ์ธกํ•˜๊ณ  ..

gridsearch

text data ์ฒ˜๋ฆฌ ๋ฐฉ๋ฒ•

  1. ์„ค์น˜ conda install -c conda-forge fasttext conda install -c conda-forge transformers

pip install --upgrade gensim pip install -U spacy python -m spacy download en python -m spacy download en_core_web_lg python -m spacy validate

https://datascienceschool.net/view-notebook/3e7aadbf88ed4f0d87a76f9ddc925d69/ — summarize this content. https://chan-lab.tistory.com/27


2020.10.03

The 5 clustering algorithms data scientists need to know

Clustering? Grouping a set of data points in an unsupervised way. Data points in the same group have similar features and properties, and should not share those features and properties with data points in other groups. https://michigusa-nlp.tistory.com/27 https://ratsgo.github.io/machine%20learning/2017/04/16/clustering/

https://astralworld58.tistory.com/58

https://skyil.tistory.com/33

  1. K-means clustering
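A minimal k-means sketch on synthetic blobs (illustrative data):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

km = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = km.fit_predict(X)     # cluster index for each point
centers = km.cluster_centers_  # learned centroids
```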

2021.08.10

tsne : https://lovit.github.io/nlp/representation/2018/09/28/tsne/

โš ๏ธ **GitHub.com Fallback** โš ๏ธ