ggml issue - beyondnlp/nlp GitHub Wiki

I was using the gptneox code from the llm.cpp git repo linked below.

While using it I ran into the error described below, and I want to share what I have found so far.

The problem occurs at tensor->type inside the ggml_element_size() function.

The value of this variable falls outside the valid range, which causes a segmentation fault.
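As a rough illustration of why an out-of-range type value can crash, here is a minimal sketch. It assumes the element size is looked up in a table indexed by the type. All names here (demo_tensor, DEMO_TYPE_SIZE, demo_element_size) are hypothetical stand-ins and not the real ggml definitions. A range check before the lookup turns the crash into a detectable error value.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; NOT the real ggml definitions. */
enum { DEMO_TYPE_F32 = 0, DEMO_TYPE_F16 = 1, DEMO_TYPE_COUNT = 2 };

struct demo_tensor {
    int type; /* expected to be one of the DEMO_TYPE_* values */
};

/* Element size in bytes for each valid type. */
static const size_t DEMO_TYPE_SIZE[DEMO_TYPE_COUNT] = { 4, 2 };

/* Returns 1 when type is a valid index into DEMO_TYPE_SIZE, 0 otherwise. */
static int demo_type_is_valid(const struct demo_tensor *t) {
    return t->type >= 0 && t->type < DEMO_TYPE_COUNT;
}

/* Checked lookup: returns 0 instead of reading past the end of the table.
   An unchecked DEMO_TYPE_SIZE[t->type] with a corrupted type would read
   arbitrary memory and could fault. */
static size_t demo_element_size(const struct demo_tensor *t) {
    if (!demo_type_is_valid(t)) {
        return 0;
    }
    return DEMO_TYPE_SIZE[t->type];
}

/* Example use: a valid tensor yields its size, a corrupted one yields 0. */
static void demo_usage(void) {
    struct demo_tensor ok  = { DEMO_TYPE_F16 };
    struct demo_tensor bad = { 1065353216 }; /* arbitrary out-of-range value */
    assert(demo_element_size(&ok) == 2);
    assert(demo_element_size(&bad) == 0);
}
```

A check like this does not fix the underlying corruption. It only makes the failure visible earlier and in a controlled way.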

So I set out to find where tensor->type gets changed.

ctx.model.kv_self.v[0].type in the main function is the same variable as the tensor->type mentioned above, so I used gdb to repeatedly track down where its value changes.
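In gdb this kind of tracking is usually done with a watchpoint (the `watch` command). When that is too slow, a similar check can be approximated in code by comparing the field against its last known value after each step. The sketch below is only an illustration under that assumption. The struct, the step functions, and the helper names are all hypothetical and are not taken from ggml or gptneox.cpp.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a tensor header; NOT the real ggml struct. */
struct watch_tensor {
    int   type;
    float data[4];
};

/* Remembers the last value seen so that a change can be reported. */
struct type_watch {
    const int *addr;
    int        last;
};

static void watch_init(struct type_watch *w, const int *addr) {
    w->addr = addr;
    w->last = *addr;
}

/* Returns 1 and logs the step name if the watched value changed
   since the previous check; returns 0 otherwise. */
static int watch_check(struct type_watch *w, const char *where) {
    if (*w->addr == w->last) {
        return 0;
    }
    fprintf(stderr, "type changed %d -> %d after %s\n", w->last, *w->addr, where);
    w->last = *w->addr;
    return 1;
}

/* A well-behaved step: touches only the data. */
static void step_ok(struct watch_tensor *t) {
    t->data[0] = 1.0f;
}

/* Simulates a stray write: type is the first member, so writing bytes
   at the start of the struct overwrites it. */
static void step_stray_write(struct watch_tensor *t) {
    memset(t, 0x7f, sizeof t->type);
}

/* Runs both steps and checks the field after each one. */
static int watch_demo(void) {
    struct watch_tensor t = { 0, { 0 } };
    struct type_watch w;
    watch_init(&w, &t.type);

    step_ok(&t);
    int first = watch_check(&w, "step_ok");

    step_stray_write(&t);
    int second = watch_check(&w, "step_stray_write");

    assert(first == 0);
    assert(second == 1);
    return first == 0 && second == 1;
}
```

Placing such checks between larger stages first, and then between smaller ones, narrows down the first step that changes the value without stopping the program on every memory write.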

I turned on the debugging option and, since it could have been a threading problem, ran everything in a single thread, which took a long time.

(Fortunately, the problem also reproduced in a single thread, so it is not a threading issue.)

Where the code dies: https://github.com/byroneverson/llm.cpp/blob/80f3a1ef957072205c7ed7d4e27ece67a0206a3a/arch/gptneox/gptneox.cpp#L2541C13-L2541C42

๊ฐ’์ด ๋ณ€๊ฒฝ๋˜๋Š” ์œ„์น˜ Where the value changes https://github.com/byroneverson/llm.cpp/blob/80f3a1ef957072205c7ed7d4e27ece67a0206a3a/ggml.c#L7754

์ž๋ฃŒ๊ตฌ์กฐ๋ฅผ 100% ์™„์ „ํžˆ ํŒŒ์•…ํ•œ๊ฒƒ์€ ์•„๋‹ˆ๋ผ์„œ ์–ด๋””์—์„œ ๋ฌธ์ œ๊ฐ€ ์ƒ๊ฒผ๊ณ  ๊ทธ ๋ฌธ์ œ๋ฅผ ์ฒ˜์Œ ๋ฐœ์ƒํ•˜๋Š” ์œ„์น˜๋Š” ์ฐพ์•˜์ง€๋งŒ ์ •ํ™•ํ•œ ํ•ด๊ฒฐ ํฌ์ธํŠธ๋Š” ์•„์ง ์ฐพ๊ณ  ์žˆ๋Š” ์ค‘์ด๋‹ค. I didn't completely understand the data structure 100%, so I found where the problem occurred and where it first occurred. The exact solution point is still being sought.