Slightly different FillFactor - minseok127/HashSimulator GitHub Wiki

Introduce

Distribution wiki ๋ฐ FillFactor wiki์—์„œ๋Š” root-square-mean์„ ํ†ตํ•ด FillFactor๋ฅผ ์œ ๋„ํ–ˆ์Šต๋‹ˆ๋‹ค.

ํ•˜์ง€๋งŒ ์ง์ ‘ root-square-mean์„ ํ†ตํ•ด ์ˆ˜์‹์„ ์ „๊ฐœํ•ด๋ดค์„ ๋•Œ, ๊ฐ™์€ f๋ฅผ ์œ ๋„ํ•˜์ง€ ๋ชปํ•˜์˜€์Šต๋‹ˆ๋‹ค.

๊ทธ๋ž˜์„œ ์œ„์˜ ํŽ˜์ด์ง€๋“ค๊ณผ๋Š” ๋‹ค๋ฅธ ๋ฐฉ์‹์œผ๋กœ FillFactor์™€ ๋น„์Šทํ•˜์ง€๋งŒ ๋‹ค๋ฅธ ์ˆ˜์‹์„ ์œ ๋„ํ•ด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

Contents

What is a Work?

์–ด๋–ค bin์— 5๊ฐœ์˜ data๊ฐ€ ๋“ค์–ด์žˆ๋‹ค๊ณ  ์ƒ๊ฐํ•ด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

Hash Table์ด Chaining Rule์„ ์‚ฌ์šฉํ•œ๋‹ค๊ณ  ๊ฐ€์ •ํ–ˆ์„ ๋•Œ,

Linked List์˜ 1๋ฒˆ์งธ Data์— ์ ‘๊ทผํ•˜๊ธฐ ์œ„ํ•ด์„œ๋Š” 1๊ฐœ์˜ ๋…ธ๋“œ๋ฅผ ํ™•์ธํ•˜๋ฉด ๋ฉ๋‹ˆ๋‹ค.

Linked List์˜ 2๋ฒˆ์งธ Data์— ์ ‘๊ทผํ•˜๊ธฐ ์œ„ํ•ด์„œ๋Š” 2๊ฐœ์˜ ๋…ธ๋“œ๋ฅผ ํ™•์ธํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

Linked List์˜ 3๋ฒˆ์งธ Data์— ์ ‘๊ทผํ•˜๊ธฐ ์œ„ํ•ด์„œ๋Š” 3๊ฐœ์˜ ๋…ธ๋“œ๋ฅผ ํ™•์ธํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

...

์ด์ฒ˜๋Ÿผ bin์— ๋“ค์–ด์žˆ๋Š” ์–ด๋– ํ•œ data์— ์ ‘๊ทผํ•˜๊ธฐ ์œ„ํ•ด์„œ๋Š” ๊ทธ data๊ฐ€ Linked List์— ์œ„์น˜ํ•œ index๋งŒํผ์˜ ํ™•์ธ์„ ํ•ด์•ผํ•ฉ๋‹ˆ๋‹ค.

๊ทธ๋ฆฌ๊ณ  bin์— ๋“ค์–ด์žˆ๋Š” ๋ชจ๋“  data์— ์ ‘๊ทผํ•˜๊ธฐ ์œ„ํ•ด์„œ๋Š” ๋“ฑ์ฐจ์ˆ˜์—ด์˜ ํ•ฉ๋งŒํผ์˜ ํ™•์ธ์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.

์ฃผ์–ด์ง„ ๋ชจ๋“  key๋“ค์— ๋Œ€ํ•ด์„œ ์ด์— ๋Œ€์‘ํ•˜๋Š” data๋“ค์„ ์ฐพ๊ธฐ ์œ„ํ•ด ํ•„์š”ํ•œ ๋ชจ๋“  ํ™•์ธ์˜ ์ˆ˜๋ฅผ Work๋ผ๊ณ  ์ •์˜ํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค.

Assume

์ด์ „ ํŽ˜์ด์ง€์™€ ๊ฐ™์€ ๋ณ€์ˆ˜๋“ค์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.

Ideal Work

Hash Table์—์„œ Ideal Distribution์€ Uniform Distribution์ž…๋‹ˆ๋‹ค.

Uniform Distribution์˜ Work๋ฅผ Ideal Work(IW)๋ผ๊ณ  ์ •์˜ํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค.

๊ทธ๋ฆฌ๊ณ  ์ด๋Ÿฌํ•œ ์ด์ƒ์ ์ธ table์ด F๊ฐœ์˜ bins๋งŒ ์‚ฌ์šฉํ•œ๋‹ค๊ณ  ๊ฐ€์ •ํ•˜๋ฉด IW๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค.

double c = k / f  // f๊ฐœ์˜ bins์— ๊ฐ๊ฐ ๋ถ„๋ฐฐ๋  data ์ˆ˜
double IW = f * (c * (c + 1) / 2) // ์‚ฌ์šฉํ•˜๋Š” bins์˜ ๊ฐœ์ˆ˜ * ๊ฐ bins๋งˆ๋‹ค ๊ธฐ๋Œ€๋˜๋Š” work

Real Work

Testํ•  Hash Function์œผ๋กœ๋ถ€ํ„ฐ ๋งŒ๋“ค์–ด์ง„ Distribution์˜ Work๋ฅผ Real Work(RW)๋ผ๊ณ  ์ •์˜ํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค.

double RW = 0;
for (int i = 0 ; i < N; i++) {
    RW += (bin[i] * (bin[i] + 1)) / 2 // bin[i]๋Š” i๋ฒˆ์งธ bin์— ๋“ค์–ด์žˆ๋Š” data์˜ ์ˆ˜
}

๊ทธ๋Ÿฐ๋ฐ ์ด ๊ณ„์‚ฐ์„ ์ˆ˜์‹์œผ๋กœ ๋‚˜ํƒ€๋‚ด๋ณด๋ฉด

ฮฃ bi * (bi + 1) / 2 = (ฮฃ bi^2 + ฮฃ bi) / 2

์ด์™€ ๊ฐ™์ด ํ‘œํ˜„ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

๊ทธ๋Ÿฐ๋ฐ ฮฃ bi = k ๋กœ ํ‘œํ˜„ํ•  ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์—(๋ชจ๋“  bin์— ๋“ค์–ด์žˆ๋Š” data์˜ ์ˆ˜๋Š” ์ „์ฒด key์˜ ๊ฐœ์ˆ˜์™€ ๊ฐ™์Œ)

RW = (ฮฃ bi^2) / 2 + (k / 2) ๋กœ ํ‘œํ˜„ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

Derive

์ฃผ์–ด์ง„ k๊ฐœ์˜ key๋“ค์— ๋Œ€ํ•ด์„œ Real Distribution๊ณผ f๊ฐœ์˜ bin์„ ์‚ฌ์šฉํ•˜๋Š” Ideal Distribution์˜ Work๊ฐ€ ๊ฐ™๋‹ค๊ณ  ๊ฐ€์ •ํ•ด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

(ฮฃ bi^2) / 2 + (k / 2) = f * (k/f)((k/f) + 1)/2
=> (ฮฃ bi^2) + k = k * k / f + k
=> (ฮฃ bi^2) = k * k / f
=> f = k * k / (ฮฃ bi^2)

์ด์™€ ๊ฐ™์ด f๋ฅผ ์œ ๋„ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

๊ทธ๋ฆฌ๊ณ  r์€ bins๋“ค์˜ root-square-mean์ด๊ธฐ ๋•Œ๋ฌธ์—

r = ( (b1^2 + b2^2 + ... + bn^2) / n )^(1/2)
n * r * r = b1^2 + b2^2 + ... + bn^2

์ด๋ฏ€๋กœ f = k * k / (n * r * r) ์œผ๋กœ ํ‘œํ˜„ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

Conclusion

double f = (k*k - 1) / (n*r*r - k); // smhasher's f
double f = k * k / (n * r * r); // my f

์ด๋ฒˆ ํ”„๋กœ์ ํŠธ์—์„œ๋Š” ์œ ๋„ ๊ณผ์ •์„ ์ดํ•ดํ•œ ๋‘ ๋ฒˆ์งธ f๋ฅผ ์ด์šฉํ•˜์—ฌ FillFactor๋ฅผ ๊ณ„์‚ฐํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค.