Anomaly Detection - yarak001/machine_learning_common GitHub Wiki

  • anomaly detection
    • ์ด์ƒ์น˜ ํƒ์ง€

      • Novelty vs. Anomaly vs. Outlier
        • Novelty: data์˜ ๋ณธ์งˆ์ ์ธ ํŠน์„ฑ์€ ๊ฐ™์ง€๋งŒ ์œ ํ˜•์ด ๋‹ค๋ฅธ ๊ด€์ธก์น˜, ๊ธ์ •์  ์˜๋ฏธ, ๋ถ„์„์‹œ detection์˜ ๋Œ€์ƒ e.g. ์ผ๋ฐ˜ ํ˜ธ๋ž‘์ด๊ฐ€ ์ •์ƒ data๋ผ๊ณ  ํ• ๋•Œ ๋ฐฑํ˜ธ
        • Anomaly: ๋Œ€๋ถ€๋ถ„์˜ data์™€ ํŠน์„ฑ์ด ๋‹ค๋ฅธ ๊ด€์ธก์น˜, ๋ถ€์ •์  ์˜๋ฏธ, ๋ถ„์„์‹œ detection์˜ ๋Œ€์ƒ e.g. ์ผ๋ฐ˜ ํ˜ธ๋ž‘์ด๊ฐ€ ์ •์ƒ data๋ผ๊ณ  ํ• ๋•Œ ๋ผ์ด๊ฑฐ
        • Outlier: ๋Œ€๋ถ€๋ถ„์˜ data์™€ ๋ณธ์งˆ์ ์ธ ํŠน์„ฑ์ด ๋‹ค๋ฅธ ๊ด€์ธก์น˜, ๋ถ€์ •์  ์˜๋ฏธ, ๋ถ„์„์‹œ detectionํ›„ ์ œ๊ฑฐํ•ด์•ผํ•  ๋Œ€์ƒ e.g. ์ผ๋ฐ˜ ํ˜ธ๋ž‘์ด๊ฐ€ ์ •์ƒ data๋ผ๊ณ  ํ• ๋•Œ ์‚ฌ์ž => ๋ถ„์„์— ๋ถ€์ •์ ์ธ ์˜ํ–ฅ์„ ๋ฏธ์นจ
        • ๋ถ€์ •์  ์˜๋ฏธ : Novelty < Anomaly < Outlier
        • ์ด์ƒ์น˜ํƒ์ง€ = Novelty = Anomaly = Outlier
      • ์ด์ƒ์น˜ ํƒ์ง€ algorithm
        • ํ™œ์šฉ data๋Š” ์ผ๋ฐ˜์ ์œผ๋กœ ๋‹ค์ˆ˜์˜ ์ •์ƒ data์™€ ์†Œ์ˆ˜์˜ ์ด์ƒ์น˜ data๋กœ ๊ตฌ์„ฑ => (์‹ฌํ•œ) ๋ถˆ๊ท ํ˜•
        • ์ง€๋„ํ•™์Šต ๊ธฐ๋ฐ˜ ๋ถ„๋ฅ˜ model์„ ์‚ฌ์šฉํ•œ๋‹ค๋ฉด class ๋ถˆ๊ท ํ˜•์œผ๋กœ ์ด์ƒ์น˜ data๋ฅผ ์˜ฌ๋ฐ”๋ฅด๊ฒŒ ๊ตฌ๋ถ„ํ•˜๊ธฐ ์–ด๋ ค์›€(๋Œ€๋ถ€๋ถ„ ์ •์ƒ์œผ๋กœ ์˜ˆ์ธกํ•˜๋Š” ์˜ค๋ฅ˜๋ฅผ ๋ฒ”ํ•จ)
        • ๋‹ค์ˆ˜์˜ ์ •์ƒ data์™€ ์†Œ์ˆ˜์˜ ์ด์ƒ์น˜ data๋ฅผ ๊ตฌ๋ถ„ํ•˜๊ธฐ ์œ„ํ•ด ์ •์ƒ data๋งŒ์œผ๋กœ ํ•™์Šตํ•˜์—ฌ ์ด์ƒ์น˜ data๋ฅผ ํƒ์ง€ํ•  ์ˆ˜ ์žˆ๋Š” algorithm์ด ์ œ์•ˆ๋จ
        • one-class classification
        • ์–ธ์ œ ์‚ฌ์šฉ?
          • ์ •์ƒ data๋งŒ ์žˆ๊ณ  ์ด์ƒ์น˜ data๊ฐ€ ์—†๋Š” ๊ฒฝ์šฐ(์ด์ƒ์น˜ data๊ฐ€ ์—†์–ด ํ‰๊ฐ€ ์–ด๋ ค์›€)
        • -> ์ด์ƒ์น˜ data๊ฐ€ ์—†๋”๋ผ๋„ ์ผ๋‹จ model์„ ์ƒ์„ฑํ•˜์ž
          • ์ •์ƒ data๋Š” ์ถฉ๋ถ„ํžˆ ์žˆ์œผ๋‚˜ ์ด์ƒ์น˜ data๊ฐ€ ๊ทน์†Œ์ˆ˜์ธ ๊ฒฝ์šฐ
        • issue
          • ์ •์ƒ์น˜ data๋ฅผ ์ •์˜ํ•˜๊ธฐ ์–ด๋ ค์šด ๊ฒฝ์šฐ
          • ์ด์ƒ์น˜๋Š” ํƒ์ง€ํ•˜๋‚˜ ์–ด๋–ค ์ข…๋ฅ˜์˜ ์ด์ƒ์น˜ ์ธ์ง€๋Š” ๋ชจ๋ฆ„(์ด์ƒ์น˜ ์ข…๋ฅ˜๊ฐ€ ๋งŽ์Œ)
            • -> ๋ณ„๋„์˜ ์‚ฌํ›„๋ถ„์„(post-hoc analysis) ํ•„์š”
        • ์ด์ƒ์น˜ ํƒ์ง€ ์‚ฌ๋ก€
    • ๋ฐ€๋„๊ธฐ๋ฐ˜

      • ์ •์ƒ data๋กœ๋ถ€ํ„ฐ ์ถ”์ •๋œ ๋ฐ€๋„๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ๊ฐ ๊ฐ์ฒด์˜ ์ •์ƒ/์ด์ƒ ์—ฌ๋ถ€๋ฅผ ํŒ๋‹จํ•˜๋Š” ๋ฐฉ๋ฒ•๋ก 
      • ๋ฐ€๋„ ์žˆ๊ฒŒ ๋ญ‰์ณ ์žˆ๋Š” ๊ด€์ธก์น˜ -> ์ •์ƒ ๊ด€์ธก์น˜๋กœ ์˜ˆ์ธก
      • ๋Œ€๋ถ€๋ถ„ ๊ด€์ธก์น˜์—์„œ ๋ฉ€๋ฆฌ ๋–จ์–ด์ง„ ๊ด€์ธก์น˜ -> ๋ถˆ๋Ÿ‰ ๊ด€์ธก์น˜๋กœ ์—์ธก
      • ๋‹ค์–‘ํ•œ ๋ฐฉ๋ฒ•๋ก 
        • Gaussiaan Density Estimation -> ์ •๊ทœ๋ถ„ํฌ
          • ๊ฐ ๊ฐ์ฒด๊ฐ€ ์ƒ์„ฑ๋  ํ™•๋ฅ ์„ ํ•˜๋‚˜์˜ ์ •๊ทœ๋ถ„ํฌ๋กœ ๊ฐ€์ •ํ•˜๋Š” ๋ฐฉ๋ฒ•๋ก 
          • ์ •์ƒ data๋“ค์„ ํ†ตํ•˜์—ฌ ์ •๊ทœ๋ถ„ํฌ๋ฅผ ์ถ”์ •ํ•จ
        • Mixture of Gaussian Density Estimation -> ์ •๊ทœ๋ถ„ํฌ์˜ ํ™•์žฅ
          • ๊ฐ ๊ฐ์ฒด๊ฐ€ ์ƒ์„ฑ๋  ํ™•๋ฅ ์„ ์—ฌ๋Ÿฌ ์ •๊ทœ๋ถ„ํฌ์˜ ์„ ํ˜• ๊ฒฐํ•ฉ์œผ๋กœ ๊ฐ€์ •ํ•˜๋Š” ๋ฐฉ๋ฒ•๋ก 
          • ์ •์ƒ data๋ฅผ ํ†ตํ•˜์—ฌ ์ „์ฒด data์˜ ๋ถ„ํฌ๋ฅผ ์ถ”์ •ํ•จ
        • Local Outlier Factor
          • ๊ฐ•์˜๊ต์ˆ˜์˜ ๊ฐœ์ธ์ ์ธ ์ƒ๊ฐ์œผ๋กœ niceํ•œ ๋ฐฉ๋ฒ•
          • ๊ฐ ๊ด€์ธก์น˜์˜ ์ด์ƒ์น˜ scoreํ• ๋‹นํ•˜๋Š” ๋ฐฉ๋ฒ•์— ๋Œ€ํ•œ ์„ค๋ช…, threshold๋ฅผ ๋ถ€์—ฌ ๋ฐฉ๋ฒ•์— ๋Œ€ํ•œ ์„ค๋ช…์€ ์—†์Œ
          • ํ•œ ๊ฐ์ฒด์˜ ์ฃผ๋ณ€ data ๋ฐ€๋„๋ฅผ ๊ณ ๋ คํ•œ ์ด์ƒ์น˜ ํƒ์ง€ algorithm
          • ์ •์ƒ ๊ฐ์ฒด๋Š” ์ฃผ๋กœ ์ฃผ๋ณ€ data๊ฐ€ ๋งŽ์ด ์กด์žฌํ•˜๋ฉฐ, ๋ถˆ๋Ÿ‰ ๊ฐ์ฒด๋Š” ์ฃผ๋กœ ๋‹จ๋…์œผ๋กœ ์กด์žฌํ•จ์„ ๊ฐ€์ •
          • ๊ณผ์ •
            1. K-distance of object p
            • ์ž๊ธฐ ์ž์‹ (p)๋ฅผ ์ œ์™ธํ•˜๊ณ  k๋ฒˆ์งธ๋กœ ๊ฐ€๊นŒ์šด ์ด์›ƒ๊ณผ์˜ ๊ฑฐ๋ฆฌ
            1. K-distance neighborhood of object p (Nk(p))
            • k๋ฒˆ์งธ๋กœ ๊ฐ€๊นŒ์šด ์ด์›ƒ๊ณผ์˜ ๊ฑฐ๋ฆฌ๋ฅผ ์›์œผ๋กœ ํ‘œํ˜„ํ•  ๋•Œ, ์› ์•ˆ์— ํฌํ•จ๋˜๋Š” ๋ชจ๋“  ๊ฐ์ฒด๋“ค์˜ ์ˆ˜
            1. Reachability Distance (Reachability distancek(p,o))
            • o๋ฅผ ๊ธฐ์ค€์œผ๋กœ k๋ฒˆ์žฌ ๊ฐ€๊ฐ€์šด ์ด์›ƒ๊ณผ์˜ ๊ฑฐ๋ฆฌ(k-distance of o)์™€ o์™€ p์‚ฌ์ด ๊ฑฐ๋ฆฌ ๊ฐ„์˜ ์ตœ๋Œ€ ๊ฐ’
            • Reachability-distancek(p,o) = max{K-distance of o, d(p,o)}
            1. Local reachability density of object p: (lrdk(p))
            • ์—ฌ๋Ÿฌ reachability distance๋ฅผ ํ•˜๋‚˜์˜ ์ง€ํ‘œ๋กœ ๊ณ„์‚ฐํ•œ ๊ฐ’
            • ์ž๊ธฐ ์ž์‹ (p) ์ฃผ๋ณ€์˜ ๋ฐ€๋„๊ฐ€ ๋†’์€ ๊ฒฝ์šฐ -> ์‹์˜ ๋ถ„๋ชจ๊ฐ€ ์ž‘๊ฒŒ ๋˜์–ด lrdk(p)๊ฐ’์ด ์ปค์ง€๊ฒŒ ๋จ -> ์ •์ƒ
            • ์ž๊ธฐ ์ž์‹ (p) ์ฃผ๋ณ€์˜ ๋ฐ€๋„๊ฐ€ ๋‚ฎ์€ ๊ฒฝ์šฐ -> ์‹์˜ ๋ถ„๋ชจ๊ฐ€ ํฌ๊ฒŒ ๋˜์–ด lrdk(p)๊ฐ’์ด ์ž‘์•„์ง€๊ฒŒ ๋จ -> ๋ถˆ๋Ÿ‰
            1. Local Outlier Factor of object p (lrdk(p))
            • ์ž๊ธฐ ์ž์‹ (p)์˜ ์ตœ์ข… Local Outlier Factor ๊ฐ’์„ ๊ณ„์‚ฐ
        • Issue
          • K๋ฅผ ์–ด๋–ค ๊ฐ’์œผ๋กœ ์ •ํ• ๊นŒ?
      • LOF๊ฐ€ ์–ด๋–ค ๊ฐ’ ๋ณด๋‹ค ์ปค์•ผ์ง€ anomaly๋กœ ์ •์˜ํ• ๊นŒ?(thresholding)
      • ํ•œ dataset์— 2.1์˜ LOF๊ฐ’์„ ๊ฐ€์ง€๋Š” ๊ด€์ธก์น˜์™€ ๋˜ ๋‹ค๋ฅธ dataset์—์„œ 2.1์˜ LOF๊ฐ’์„ ๊ฐ€์ง€๋Š” ๊ด€์ธก์น˜๋Š” ๋ชจ๋‘ ๋™์ผํ•˜๊ฒŒ anomaly์ธ๊ฐ€?(ํ˜น์€ ๋ชจ๋‘ ๋™์ผํ•˜๊ฒŒ ์ •์ƒ์ธ๊ฐ€?)
      • ๊ณ„์‚ฐ๋ณต์žก๋„
      • ๊ณ ์ฐจ์›
    • Model ๊ธฐ๋ฐ˜

      • Isolation Forest
      • One-Class Support Vector Machine
      • Support Vector Data Description
    • ์žฌ๊ตฌ์ถ• ์˜ค์ฐจ ๊ธฐ๋ฐ˜

      • ์žฌ๊ตฌ์ถ• ์˜ค์ฐจ์˜ ๊ฐœ๋…
      • Principal Component Analysis
      • Autoencoder
    • ์ƒ์„ฑ์  ์ ๋Œ€ ์‹ ๊ฒฝ๋ง(GAN) ๊ธฐ๋ฐ˜

      • ์ƒ์„ฑ์  ์ ๋Œ€ ์‹ ๊ฒฝ๋ง์˜ ๊ฐœ๋…
      • AnoGAN
      • GANomaly

โš ๏ธ **GitHub.com Fallback** โš ๏ธ