F1score - BD-SEARCH/MLtutorial GitHub Wiki


01. precision, recall, accuracy

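a. Premise

  • TT : actual answer T, experimental result T (a), i.e. a true positive (TP)
  • TF : actual answer T, experimental result F (b), i.e. a false negative (FN)
  • FT : actual answer F, experimental result T (c), i.e. a false positive (FP)
  • FF : actual answer F, experimental result F (d), i.e. a true negative (TN)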

b. precision

  • a/(a+c)
  • Of the items the experiment judged True, the fraction whose actual answer is True

c. recall

  • a/(a+b)
  • Of the items whose actual answer is True, the fraction the experiment judged True

d. accuracy

  • (a+d)/(a+b+c+d)
  • Of all results, the fraction that are correct
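The three formulas above can be checked with a few lines of Python. This is a minimal sketch: the function name `confusion_metrics`, the choice to return 0.0 for an empty denominator, and the counts in the usage line are all assumptions made for illustration, not part of the original page.

```python
def confusion_metrics(a, b, c, d):
    """Return (precision, recall, accuracy) from the four counts.

    a: actual T, predicted T    b: actual T, predicted F
    c: actual F, predicted T    d: actual F, predicted F
    """
    # Returning 0.0 for an empty denominator is a convention assumed here.
    precision = a / (a + c) if (a + c) else 0.0
    recall = a / (a + b) if (a + b) else 0.0
    total = a + b + c + d
    accuracy = (a + d) / total if total else 0.0
    return precision, recall, accuracy


# Made-up counts, for illustration only.
precision, recall, accuracy = confusion_metrics(a=40, b=10, c=20, d=30)
print(precision, recall, accuracy)
```

With these counts, precision is 40/60, recall is 40/50, and accuracy is 70/100.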

02. F-measure

  • A weighted harmonic mean of precision and recall; the F1 score is the case where both are weighted equally
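One common way to write the weighting is the F-beta score, where beta > 1 favours recall and beta < 1 favours precision. The sketch below assumes that definition; the function name `f_measure` and the sample values are hypothetical.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F-beta score)."""
    beta_sq = beta * beta
    denominator = beta_sq * precision + recall
    if denominator == 0:
        return 0.0  # assumed convention when both inputs are zero
    return (1 + beta_sq) * precision * recall / denominator


# Made-up values: precision 0.5, recall 1.0.
f1 = f_measure(0.5, 1.0)            # equal weighting
f2 = f_measure(0.5, 1.0, beta=2.0)  # recall counts more
print(round(f1, 4), round(f2, 4))
```

Because recall is the higher of the two values here, the recall-weighted F2 comes out above F1.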

a. Macro & micro average


  • macro average
    • Computes the f1 score of each class and averages them without weighting.
    • Treats every class equally, regardless of class size.
    • Analogy: comparing the grades of each class (homeroom) in a school.
  • micro average
    • Sums the FP, FN, TP, and TN counts over all classes, then computes precision, recall, and f1 score from those totals.
    • Reflects overall performance.
    • Analogy: the grades of all students in the school taken together.
  • To weight every sample equally, use the micro average; to weight every class equally, use the macro average.
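The difference between the two averages shows up most clearly when class sizes are unbalanced. The following sketch uses hypothetical counts for a two-class problem (one large class, one small class); the helper `f1_from_counts` and every number in it are assumptions for illustration.

```python
def f1_from_counts(tp, fp, fn):
    """F1 score from raw counts; 0.0 when there is nothing to score."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Hypothetical per-class counts: one large class and one small class.
counts = {
    "large": {"tp": 90, "fp": 9, "fn": 10},
    "small": {"tp": 1, "fp": 10, "fn": 9},
}

# Macro: score each class first, then take the unweighted mean.
per_class = [f1_from_counts(**c) for c in counts.values()]
macro_f1 = sum(per_class) / len(per_class)

# Micro: pool the counts over all classes, then score once.
total = {key: sum(c[key] for c in counts.values()) for key in ("tp", "fp", "fn")}
micro_f1 = f1_from_counts(**total)

print(f"macro F1 = {macro_f1:.3f}, micro F1 = {micro_f1:.3f}")
# prints: macro F1 = 0.500, micro F1 = 0.827
```

The micro average stays high because the large class dominates the pooled counts, while the macro average is pulled down by the poorly predicted small class.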

03. Edit distance

  • Measures how similar two strings are
  • The minimum number of operations (insertions, deletions, substitutions) needed to turn string A into string B
    • Base cases: cost(i,0) = i and cost(0,j) = j
    • If the two characters being compared are the same: cost(i,j) = cost(i-1, j-1)
    • If the two characters being compared differ: cost(i,j) = 1 + min( cost(i-1,j), cost(i,j-1), cost(i-1,j-1) )
def _edit_dist_init(len1, len2):
    A = []
    for i in range(len1):
        A.append([0] * len2)
    
    # fill in the first column (i,0) and the first row (0,j)
    for i in range(len1):
        A[i][0] = i
    for j in range(len2):
        A[0][j] = j
    
    return A
   
def _edit_dist_step(A, i, j, s1, s2, transpositions=False):
    c1 = s1[i-1]
    c2 = s2[j-1]
    
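    a = A[i-1][j] + 1 # skip a character in s1 (deletion)
    b = A[i][j-1] + 1 # skip a character in s2 (insertion)
    c = A[i-1][j-1] + (c1!=c2) # substitution (free when the characters match)
    d = c+1 # placeholder that min() never selects
    
    # transposition: the last two characters are swapped between s1 and s2
    if transpositions and i>1 and j>1:
        if s1[i-2] == c2 and s2[j-2] == c1:
            d = A[i-2][j-2] + 1
    
    # always store the result, whether or not transpositions are enabled
    A[i][j] = min(a,b,c,d)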
    
def edit_distance(s1, s2, transpositions=False):
    len1 = len(s1)
    len2 = len(s2)
    lev = _edit_dist_init(len1 + 1, len2 + 1)
    
    for i in range(len1):
        for j in range(len2):
            _edit_dist_step(lev, i+1, j+1, s1, s2, transpositions=transpositions)
    return lev[len1][len2]
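As a cross-check on the recurrence described in this section, here is a compact sketch that keeps only two rows of the cost table and handles insertions, deletions, and substitutions (no transpositions). The function name `levenshtein` and the test strings are chosen for illustration and are not part of the original page.

```python
def levenshtein(s1, s2):
    """Edit distance using only insertions, deletions, and substitutions."""
    # previous[j] is the cost of turning the already processed prefix of s1 into s2[:j].
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]  # turning s1[:i] into the empty string takes i deletions
        for j, c2 in enumerate(s2, start=1):
            if c1 == c2:
                cost = previous[j - 1]
            else:
                cost = 1 + min(previous[j], current[j - 1], previous[j - 1])
            current.append(cost)
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # prints 3
print(levenshtein("flaw", "lawn"))       # prints 2
```

Keeping only two rows lowers memory use from O(len1 × len2) to O(len2), at the cost of not being able to trace back the actual edit operations.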

04. jaccard distance

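  • Measures similarity by treating the two objects as sets: the size of their intersection divided by the size of their union. The Jaccard distance is 1 minus this similarity.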
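def jacc_sim(query, document):
    # Jaccard similarity: |A ∩ B| / |A ∪ B|
    a = set(query).intersection(set(document))
    b = set(query).union(set(document))
    # note: raises ZeroDivisionError when both inputs are empty
    return len(a)/len(b)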
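Since this section is titled "distance", it may help to see the distance form next to the similarity. The sketch below is self-contained; the name `jaccard_distance`, the decision to treat two empty sets as identical (distance 0.0), and the sample sentences are assumptions for illustration.

```python
def jaccard_distance(query, document):
    """1 - |A ∩ B| / |A ∪ B| for the sets built from the two inputs."""
    set_q, set_d = set(query), set(document)
    union = set_q | set_d
    if not union:
        return 0.0  # assumption: two empty sets are treated as identical
    return 1.0 - len(set_q & set_d) / len(union)


# Made-up token lists: they share 2 of 6 distinct words.
query = "the quick brown fox".split()
document = "the lazy brown dog".split()
print(jaccard_distance(query, document))
```

Here the similarity is 2/6, so the distance is 2/3. Passing raw strings instead of token lists would compare sets of characters, which is usually not what is wanted for documents.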

05. smith waterman distance

  • Usually used to find matching regions between DNA sequences (local sequence alignment)
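The original page gives no formula for this section, so the following is only a rough sketch of the usual Smith-Waterman scoring table with a linear gap penalty. The function name `smith_waterman_score` and the scoring values (match +2, mismatch -1, gap -1) are assumptions chosen for illustration; real applications typically use a substitution matrix and tuned gap penalties.

```python
def smith_waterman_score(seq1, seq2, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between seq1 and seq2 (score only, no traceback)."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    # H[i][j] is the best score of a local alignment ending at seq1[i-1], seq2[j-1].
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = H[i - 1][j - 1] + (match if seq1[i - 1] == seq2[j - 1] else mismatch)
            up = H[i - 1][j] + gap    # gap in seq2
            left = H[i][j - 1] + gap  # gap in seq1
            # Clamping at 0 lets an alignment restart anywhere, which makes it local.
            H[i][j] = max(0, diagonal, up, left)
            best = max(best, H[i][j])
    return best


# Made-up sequences that share only the region "ACG".
print(smith_waterman_score("TTACGTT", "GGACGGG"))  # prints 6
```

Unlike the edit distance above, a higher value here means the sequences are more similar, and only the best-matching region contributes to the score.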