statistics - k821209/pipelines GitHub Wiki

Standard deviation and standard error

The standard deviation measures how far individual values spread from the mean.

The standard error measures how far a sample mean is likely to fall from the population mean. Since the population mean is unknown, you have to repeat the experiment with replicates; the standard error can be thought of as the variation among replicates. So if a paper reports a standard error for an experiment that has no replicates, that is worth flagging.
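In code the difference is just a factor of √n. A minimal sketch with made-up replicate values (the numbers are illustrative, not from any real experiment):

```python
import math
import statistics

# Hypothetical measurements from five replicate experiments (made-up numbers)
replicates = [9.8, 10.4, 10.1, 9.6, 10.6]

sd = statistics.stdev(replicates)        # spread of individual values around the mean
sem = sd / math.sqrt(len(replicates))    # expected spread of the sample mean itself

print(f"mean = {statistics.mean(replicates):.3f}")
print(f"standard deviation = {sd:.3f}")
print(f"standard error of the mean = {sem:.3f}")
```

The standard error shrinks as more replicates are added, while the standard deviation stabilizes around the population value.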

Random variable

http://darkpgmr.tistory.com/147

ํ˜„์‹ค์„ธ๊ณ„์—์„œ ์–ด๋–ค ํ”„๋กœ์„ธ์Šค์— ์˜ํ•ด์„œ ์„ ํƒ ๋‹นํ•  ํ•ญ๋ชฉ. ์ฝ”์ธ์˜ Head, Tail ์„ 0, 1๋กœ ์ฝ”๋”ฉํ•˜์—ฌ ์ด 100ํšŒ ์ค‘ 0: 49ํšŒ 1:51ํšŒ ๊ด€์ฐฐ. ์—ฌ๊ธฐ์„œ 0,1์ด Random variable์ž„. ์™œ ๋ Œ๋ค์ธ๊ฐ€. ํ™•๋ฅ ๊ณต๊ฐ„์•ˆ์—์„œ, (0์˜ ๋ถ„๋ฅ , 1์˜ ๋ถ„๋ฅ ), ๋ Œ๋ค์œผ๋กœ ์„ ํƒ๋˜์–ด ์Šค์ฝ”์–ด ๋˜๊ธฐ ๋•Œ๋ฌธ

๋‹ค์‹œ๋งํ•˜๋ฉด ๊ด€์ฐฐํ•˜๊ธฐ์œ„ํ•ด ๊ณ ์ •๋˜์–ด ์žˆ๋Š” ๋ˆˆ์ด๋ผ๊ณ  ๋ณด๋ฉด ์‰ฌ์šธ๋“ฏ.์œ„์˜ ์˜ˆ๋ฅผ ๋ณธ๋‹ค๋ฉด 0์„ ๋ณด๋Š” ๋ˆˆ, 1์„ ๋ณด๋Š” ๋ˆˆ์œผ๋กœ ๋‚˜๋ˆ„์–ด์„œ ์ƒ๊ฐํ•  ์ˆ˜ ์žˆ์„ ๋“ฏ. ์ข€๋” ์‰ฝ๊ฒŒ ์ƒ๊ฐํ•œ๋‹ค๋ฉด, ๋ฐ์ดํ„ฐ์˜ ํ‘œ๋ฅผ ๊ทธ๋ฆด๋•Œ column๋ช…์ด๋ผ๊ณ  ์ƒ๊ฐํ•˜๋ฉด ๋” ์‰ฝ๋‹ค. column๋ช…์€ "์ผ์ผ๊ตํ†ต๋Ÿ‰" ์ด ๋  ์ˆ˜ ์žˆ๊ณ  ์ง€์‹๋…ธ๋™์ž๋“ค์— ์˜ํ•ด์„œ ์ง€์†์ ์œผ๋กœ ๊ธฐ๋ก๋ ๊ฒƒ์ด๋‹ค. ๊ทธ๋ ‡๊ฒŒ ๊ธฐ๋ก๋œ ๊ฐ’๋“ค์˜ ๋ถ„ํฌ๋ฅผ ํ†ตํ•ด ๋ฐ€๋„ ๋ถ„ํฌ๋ฅผ ํ† ๋Œ€๋กœ ์–ด๋–ค ๊ฐ’์„ ๊ฐ€์งˆ ๊ฐ€๋Šฅ์„ฑ์„ ์ถ”์ •ํ•  ์ˆ˜ ์žˆ๋‹ค ์ด๋ฅผ density estimation์ด๋ผ ํ•œ๋‹ค. ์ด๋ ‡๊ฒŒ ๋‚˜์˜จ ๋ฐ์ดํ„ฐ๋Š” ๊ถ๊ทน์ ์œผ๋กœ๋Š” ํ™•๋ฅ ๋ฐ€๋„ํ•จ์ˆ˜๋กœ ๋ฐ”๋€Œ๊ฒŒ ๋œ๋‹ค.

ํ™•๋ฅ ๋ฐ€๋„ํ•จ์ˆ˜๋กœ ๋ฐ”๊ฟ€๋•Œ! Parametric๋ฐฉ์‹์€ ์ด๋ฏธ ์•Œ๊ณ  ์žˆ๋Š” ๋ถ„ํฌ์— ๊ปด๋งž์ถ”๋Š” ๋ฐฉ๋ฒ•์ž„. ์ด๋ฏธ ์•Œ๊ณ  ์žˆ๋Š” ๋ถ„ํฌ์— ๋Œ€ํ•œ ๊ณต์‹์„ ๊ณต๋ถ€ ์ž˜ํ•˜๋Š” ์‚ฌ๋žŒ๋“ค์ด ๋งŒ๋“ค์–ด๋†จ๊ธฐ ๋•Œ๋ฌธ์— ๋ถ„์‚ฐ๊ฐ’์ด๋ž‘ ํ‰๊ท ๊ฐ’ ๊ฐ™์€๊ฒƒ๋งŒ ๊ตฌํ•ด์ฃผ๋ฉด ์ผ์€ ๋๋‚œ๋‹ค. ๋งค์šฐ ์‰ฝ๋‹ค. ํ•˜์ง€๋งŒ ๋ณดํ†ต ๋ถ„ํฌ๋ชจ์–‘์„ ์•Œ๊ณ  ์žˆ๋Š” ๊ฒฝ์šฐ๋Š” ๊ฑฐ์˜ ์กด์žฌํ•˜์ง€ ์•Š๊ธฐ ๋•Œ๋ฌธ์— non-parametric ์ฆ‰ ๋ถ„ํฌ(๊ณต์‹)์„ ๋ชจ๋ฅด๋Š” ๋ฐฉ๋ฒ•์„ ๋งŽ์ด ์‚ฌ์šฉํ•œ๋‹ค. ์ด๋•Œ ๊ฐ€์žฅ ๋งŽ์ด ์‚ฌ์šฉํ•˜๋Š” ๋ฐฉ๋ฒ•์ด ๊ทธ๋ƒฅ Histogram๊ทธ๋ฆฌ๋Š”๊ฒƒ์ด๋‹ค. ๊ทธ๋Ÿฐ๋ฐ Histogram๊ฐ€์ง€๊ณ  ๊ณต์‹์ด ๋‚˜์˜ค์ง€๋Š” ์•Š๊ธฐ ๋•Œ๋ฌธ์— KDE (Kernel density estimation)์ด๋ผ๋Š” ๊ฒƒ์„ ํ•œ๋‹ค. ๊ฐ๊ฐ์˜ ๊ด€์ฐฐ๊ฐ’ํ•˜๋‚˜, ์˜ˆ๋ฅผ๋“ค๋ฉด x1, ์„ ๊ฐ€์žฅ ์ž˜ ๋ฟœ์„ ์ˆ˜ ์žˆ๋Š” ํ•จ์ˆ˜๋กœ ๋ฐ”๊พผ๋‹ค. (๋ญ ๋Œ€์นญ์ด ๋˜๋Š” ํ•จ์ˆ˜๊ฐ€ ์–ด์ฉŒ๊ณ  ํ•˜๋Š” ๊ฒƒ์ด ์ปค๋„์ž„. ๋ญ”๋ง์ธ์ง€?) ๊ทธ๋ฆฌ๊ณ  ๊ทธ ํ•จ์ˆ˜๋“ค์„ ๋ชจ๋‘ ํ‰๊ท ํ•ด์„œ (์–ด๋–ป๊ฒŒ? ใ…กใ…ก;) ํ•˜๋‚˜์˜ ํ•จ์ˆ˜๋กœ ๋งŒ๋“œ๋Š”๊ฒƒ์ด KDE์ด๋‹ค. ๊ทธ๋ž˜์„œ. ๋ญ” ์ปค๋„์„ ์“ธ๊ฑฐ๋ƒ๊ณ  ๋ฌผ์–ด๋ณด๋Š”๊ฑด ๊ทธ ํ•˜๋‚˜ํ•˜๋‚˜์˜ ๋‹จ์œ„ ์ปค๋„ํ•จ์ˆ˜๋ฅผ ๋ญ˜ ์“ธ๊ฑฐ๋ƒ ํ•˜๋Š”๊ฑฐ๋ผ๊ณ  ์ดํ•ดํ•˜๋ฉด๋œ๋‹ค.

# https://stats.stackexchange.com/questions/73032/linear-kernel-and-non-linear-kernel-for-support-vector-machine
Andrew Ng gives a nice rule of thumb explanation in this video starting 14:46, though the whole video is worth watching.

Key Points

Use linear kernel when number of features is larger than number of observations.
Use gaussian kernel when number of observations is larger than number of features.
If number of observations is larger than 50,000 speed could be an issue when using gaussian kernel; hence, one might want to use linear kernel.
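The quoted rule of thumb is easy to encode. This helper just restates the heuristic above (the 50,000 cutoff comes from the quoted answer, not from any hard rule):

```python
def pick_svm_kernel(n_features: int, n_observations: int) -> str:
    """Rule-of-thumb kernel choice from the quoted answer (heuristic only)."""
    if n_features > n_observations:
        return "linear"        # more features than observations
    if n_observations > 50_000:
        return "linear"        # gaussian kernel would be slow at this scale
    return "rbf"               # gaussian kernel otherwise

print(pick_svm_kernel(10_000, 500))    # linear
print(pick_svm_kernel(20, 5_000))      # rbf
print(pick_svm_kernel(20, 100_000))    # linear
```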

penalty parameter C

์„œํฌํŠธ ๋ฒกํ„ฐ๋จธ์‹ ์—์„œ ๋‚˜์˜ค๋Š” C ๊ฐ’. C๊ฐ’์ด ์ปค์ง€๋ฉด, ๋ถ„๋ฅ˜๊ฐ€ ์ตœ๋Œ€ํ•œ ์ •ํ™•ํ•ด์ง€๋„๋ก ๋งŒ๋“ค์–ด์ค€๋‹ค. ์•ฝ๊ฐ„ ๋‹ค๋ฅธ ๋ถ„๋ฅ˜์— ์น˜์šฐ์นœ ๊ด€์ฐฐ๊ฐ’์„ ๋งž์ถ”๊ธฐ ์œ„ํ•ด ์–ต์ง€๋กœ๋ผ๋„ ๋…ธ๋ ฅํ•˜๊ฒŒ ๋œ๋‹ค. ์˜ˆ์ƒํ•˜๋“ฏ, ๊ทธ๋ ‡๊ฒŒ ๋˜๋ฉด ์ผ๋ฐ˜ํ™”๊ฐ€ ์ž˜ ์•ˆ๋˜๊ฒ ์ง€ ํŠธ๋ ˆ์ด๋‹ ์…‹์— ๊ณผ์ ํ•ฉํ•˜๊ฒŒ ๋ ๊ฒƒ. "์ ๋‹นํ•œ" C๊ฐ’์„ ์ •ํ•ด์ฃผ๋Š” ๊ฒƒ์ด ์ค‘์š”ํ•˜๋‹ค.

P value

๊ด€์ฐฐ ๋””์ž์ธ :

  • ๋™์ „์„ 50๋ฒˆ ๋˜์ ธ์„œ ์•ž๋ฉด์ด ๋‚˜์˜ค๋Š” ์ˆ˜

๊ฐ€์ • ๋””์ž์ธ :

  • ๋™์ „์ด ์•ž๋ฉด์ด ๋‚˜์˜ค๋Š” ํ™•๋ฅ ์€ 0.5, ๊ด€์ฐฐ ๋””์ž์ธ์— ๋”ฐ๋ผ 1000๋ฒˆ์˜ ๋ฐ˜๋ณต๊ด€์ฐฐ ์‹œ๋ฎฌ๋ ˆ์ด์…˜
  • 50๋ฒˆ์„ ๋˜์ ธ์„œ ์•ž๋ฉด์ด ๋‚˜์˜ค๋Š” ์ˆ˜ (๋ฐ์ดํ„ฐํฌ์ธํŠธ) * 1000

Actual observation:

  • out of 50 flips, 40 came up heads.

ํ•ด์„ :

  • ๊ฐ€์ •๋””์ž์ธ์—์„œ ๋‚˜์˜จ ๋ถ„ํฌ์—์„œ ์‹ค์ œ๊ด€์ฐฐ + ๋”๊ทน๋‹จ๊ฐ’๋“ค์„ ๋”ํ•ด์„œ ๋ถ„ํฌ์˜์—ญ์˜ ๋„“์ด๊ฐ€ p value. (์ „ํ†ต์ ์ธ ๊ฐœ๋…)

A p-value means only one thing (although it can be phrased in a few different ways), it is: The probability of getting the results you did (or more extreme results) given that the null hypothesis is true. (http://www.labstats.net/articles/pvalue.html)
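The design above translates directly into a simulation. A sketch of the empirical one-sided p-value for observing 40 heads in 50 flips under a fair coin:

```python
import random

random.seed(42)  # reproducible sketch

def heads_in_50_flips():
    """One data point of the observation design: heads out of 50 fair flips."""
    return sum(random.random() < 0.5 for _ in range(50))

# Assumption design: 1000 simulated repetitions under p(head) = 0.5
null_counts = [heads_in_50_flips() for _ in range(1000)]

# Actual observation
observed = 40

# p-value: share of simulated counts at least as extreme as the observation
p_value = sum(c >= observed for c in null_counts) / len(null_counts)
print(p_value)  # very small: 40/50 heads is far out in the tail
```

With only 1000 repetitions the estimate may come out exactly 0; the resolution of a simulated p-value is limited by the number of repetitions.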

Application

๊ด€์ฐฐ๋””์ž์ธ :

  • 50๊ฐœ์˜ edge๋ฅผ ๊ฐ€์ง€๋Š” ๋„คํŠธ์›Œํฌ์˜ ํ‰๊ท  degree ๊ฐ€์ •๋””์ž์ธ :
  • 50๊ฐœ์˜ edge๋ฅผ ๊ฐ€์ง€๋Š” ๋„คํŠธ์›Œํฌ๋ฅผ 1000๋ฒˆ ๋ฌด์ž‘์œ„๋กœ ์ƒ์„ฑํ•˜๊ณ  ํ‰๊ท  degree ๊ตฌํ•จ (๋„คํŠธ์›Œํฌ๋Š” ๋ฌด์ž‘์œ„์ด๋‹ค๊ฐ€ ๊ฐ€์„ค) ์‹ค์ œ ๊ด€์ฐฐ :
  • 50๊ฐœ์˜ edge๋ฅผ ๊ฐ€์ง€๋Š” ์‹ค์ œ ppi์˜ ํ‰๊ท  degree๋Š” ํ•ด๋‹น ๊ฐ€์ •๋””์ž์ธ์˜ ๋ถ„ํฌ์—์„œ ์–ด๋А ์˜์—ญ์„ ์ฐจ์ง€ํ•˜๋Š”๊ฐ€ (๊ทน๋‹จ๊ฐ’๊นŒ์ง€)