ggml issue - beyondnlp/nlp GitHub Wiki
I was using the gptneox git repo.
While using it I ran into the following error, and I want to share what I have found so far.
The problem occurs in the ggml_element_size() function, where it reads tensor->type.
The value of this variable is outside its valid range, and that causes a segmentation fault.
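To illustrate why an out-of-range type value can crash a size lookup, here is a minimal sketch. It assumes the element size is read from a table indexed by the type, which is how ggml versions of that period appear to work; the names `demo_type`, `demo_tensor`, `DEMO_TYPE_SIZE` and the two functions are hypothetical stand-ins, not the real ggml definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for the ggml type enum and tensor struct. */
enum demo_type { DEMO_TYPE_F32 = 0, DEMO_TYPE_F16 = 1, DEMO_TYPE_COUNT = 2 };

struct demo_tensor {
    enum demo_type type;
};

/* Element size in bytes for each valid type value. */
static const size_t DEMO_TYPE_SIZE[DEMO_TYPE_COUNT] = { 4, 2 };

/* Returns 1 when the type can safely be used as a table index. */
int demo_type_is_valid(const struct demo_tensor *t) {
    int type = (int) t->type;
    if (type < 0 || type >= DEMO_TYPE_COUNT) {
        fprintf(stderr, "corrupt tensor type: %d\n", type);
        return 0;
    }
    return 1;
}

/* Table lookup guarded by an assertion. Without the guard, a corrupted
 * type reads past the end of DEMO_TYPE_SIZE, which is undefined behaviour
 * and may show up as a segmentation fault. */
size_t demo_element_size(const struct demo_tensor *t) {
    assert(demo_type_is_valid(t));
    return DEMO_TYPE_SIZE[t->type];
}
```

With a guard like this, a corrupted tensor fails at the assertion with a readable message instead of crashing inside the lookup. It does not explain why the value was overwritten in the first place.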
So I set out to find where tensor->type gets changed.
ctx.model.kv_self.v[0].type in the main function is the same variable as the tensor->type mentioned above, so I used gdb to keep tracking where its value changes.
I turned on the debugging option and, because it could have been a threading problem, ran everything in a single thread, which took a long time.
(Fortunately, the problem also reproduces in a single thread, so it is not a threading issue.)
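As a complement to tracking the value in gdb, the same search can be narrowed with checkpoints compiled into the code. The sketch below is only an illustration under assumptions: `probe_check`, `PROBE` and `PROBE_TYPE_COUNT` are hypothetical helpers, and the valid range of type values would have to be taken from the real enum.

```c
#include <assert.h>
#include <stdio.h>

#define PROBE_TYPE_COUNT 2 /* hypothetical number of valid type values */

/* Location of the most recent checkpoint that saw a valid type. */
static const char *probe_last_ok = "(none)";

/* Returns 1 if the type is still valid. Otherwise reports the interval
 * between the last good checkpoint and this one, and returns 0. */
int probe_check(int type, const char *where) {
    assert(where != NULL);
    if (type >= 0 && type < PROBE_TYPE_COUNT) {
        probe_last_ok = where;
        return 1;
    }
    fprintf(stderr, "type became %d between %s and %s\n",
            type, probe_last_ok, where);
    return 0;
}

/* Expands to a call tagged with the current file and line. */
#define PROBE_STR2(x) #x
#define PROBE_STR(x) PROBE_STR2(x)
#define PROBE(type) probe_check((int) (type), __FILE__ ":" PROBE_STR(__LINE__))
```

Placing `PROBE(ctx.model.kv_self.v[0].type);` before and after suspect calls would bracket the statement that overwrites the field, and moving the probes closer together on each run narrows the interval.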
Where the code dies: https://github.com/byroneverson/llm.cpp/blob/80f3a1ef957072205c7ed7d4e27ece67a0206a3a/arch/gptneox/gptneox.cpp#L2541C13-L2541C42
Where the value changes: https://github.com/byroneverson/llm.cpp/blob/80f3a1ef957072205c7ed7d4e27ece67a0206a3a/ggml.c#L7754
I do not fully understand the data structures yet. I have found where the problem arises and where it first occurs, but I am still looking for the exact point to fix.