cuda - modrpc/info GitHub Wiki
- gpudirect-storage: nvme-ssd<->gpu without cpu intervention
- how-to-overlap-data-transfer
- flash-attention
- hp-omen-max-45L system
- ryzen 9 9950x3d (max 5.7ghz boost clock, 16 core, 32 threads, 128mb l3-cache (2nd-gen v-cache))
- kingston-fury 128gb (32gb x 4): ddr5-5200 mt/s expo udimm rgb
- geforce rtx 5090 gddr7 32gb - up to 3352 tops (hp oem)
- pcie 4.0 (gen4) tlc m.2 nvme ssd 4tb
- b850 chipset
- 1200w 80 plus gold modular power supply
- cores:
| Core type | Official name | Primary role (specialty) | AI/graphics use cases |
|---|---|---|---|
| GPU cores | CUDA Cores | General-purpose compute: the basic units that handle arithmetic and logic operations | Traditional 3D rendering, physics engines, general GPGPU work |
| Tensor cores | Tensor Cores | Matrix-math acceleration (deep learning): the huge matrix multiplications at the heart of AI models | LLM (Llama 3.1) inference, image upscaling (DLSS), fine-tuning |
| RT cores | Ray Tracing Cores | Ray-tracing acceleration (visual realism): computes light reflection, refraction, and shadow rays in hardware | Real-time reflections in games, high-quality 3D rendering |
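The "general-purpose compute on CUDA cores" row above is the classic vector-add pattern: one thread per element, each doing plain scalar math. A minimal sketch (managed memory is used only to keep the example short; build with `nvcc`):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Each thread handles one element; the scalar add executes on CUDA cores.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    // Managed memory is supported per the deviceQuery output below.
    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int threads = 256;
    int blocks = (n + threads - 1) / threads;  // round up to cover all n
    vecAdd<<<blocks, threads>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %.1f\n", c[0]);  // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```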
| Core type | RTX 4090 (Ada) | RTX 5090 (Blackwell) | Change and notes |
|---|---|---|---|
| CUDA cores | 16,384 | 21,760 | ~33% more; large jump in general-purpose compute |
| Tensor cores | 512 | 680 (5th gen) | More cores plus native FP4 support for 2x+ effective AI throughput |
| RT cores | 128 | 170 (4th gen) | Faster high-resolution ray tracing and 3D rendering |
| SM count | 128 | 170 | More "workshops", each bundling the cores above |
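The SM count in the table is visible at runtime via `cudaGetDeviceProperties`; cores-per-SM is an architecture constant, not a runtime field, so the 128 below is an assumption taken from the deviceQuery output later on this page:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // deviceQuery reports 170 SMs x 128 CUDA cores/SM on the RTX 5090;
    // the 128 is assumed here, since the runtime only exposes the SM count.
    int cores_per_sm = 128;
    printf("SMs: %d\n", prop.multiProcessorCount);
    printf("CUDA cores (est.): %d\n", prop.multiProcessorCount * cores_per_sm);
    printf("VRAM: %.1f GB\n", prop.totalGlobalMem / 1e9);
    return 0;
}
```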
- structure:
| Level | Component | Role and capability | Analogy |
|---|---|---|---|
| Level 1: system level (graphics card) | Blackwell architecture | Next-generation GPU design optimized for AI compute efficiency and hardware acceleration | Blueprint for the whole system |
| | 32GB GDDR7 VRAM | Ultra-fast memory that holds large AI models (e.g. Llama 3.1 70B) | Data warehouse |
| | PCIe 5.0 x16 interface | External link to the motherboard and system RAM (~64 GB/s) | Road outside the factory |
| Level 2: data transfer (interconnect) | 512-bit memory bus | Dedicated interface wiring the GPU die directly to VRAM | Private highway |
| | 1.79 TB/s bandwidth | Streams data from VRAM to the GPU cores at 1.79 TB per second | High-speed conveyor belt |
| Level 3: inside the GPU die (on-chip logic) | GPC (graphics processing clusters) | Largest physical/logical block inside the GPU | Large work zone |
| | L2 cache | Central on-chip staging area that serves data without a round trip to VRAM | Shared storage between cores |
| | SM (streaming multiprocessors) | Unit where dozens of cores are grouped and the actual computation happens | Core workshop |
| Level 4: compute core level (core unit) | 5th-gen Tensor cores | Dedicated accelerators for AI matrix math | AI specialist |
| | Native FP4 support | Hardware-level acceleration of 4-bit floating-point operations | High-speed quantized processing |
| | Inference capability | Runs quantized Llama 3.1 70B at roughly 8-15+ tokens/s | Real-time conversation speed |
| Aspect | Blocking (cudaMemcpy) | Non-blocking (cudaMemcpyAsync) |
|---|---|---|
| CPU control flow | Returns only after the transfer completes | Returns immediately after the command is issued |
| Memory requirement | Works with ordinary pageable memory | Requires pinned memory (prevents OS paging) |
| Parallelism | Sequential (compute after I/O) | Overlap possible (I/O and compute run concurrently) |
| Implementation difficulty | Low | High (streams and synchronization must be managed) |
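The non-blocking column maps to a standard pattern: pinned host memory, the work split into chunks, one stream per chunk so copies and kernels from different chunks overlap (the deviceQuery output below confirms 2 copy engines, so H2D and D2H can also overlap each other). A minimal sketch:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    const int n = 1 << 22, chunks = 4, chunk = n / chunks;
    float *h, *d;
    cudaMallocHost(&h, n * sizeof(float));  // pinned: required for true async copies
    cudaMalloc(&d, n * sizeof(float));
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    cudaStream_t s[chunks];
    for (int c = 0; c < chunks; ++c) cudaStreamCreate(&s[c]);

    // Per-chunk pipelines: the copy engines move chunk k while the SMs
    // compute on chunk k-1 in a different stream.
    for (int c = 0; c < chunks; ++c) {
        float* hp = h + c * chunk;
        float* dp = d + c * chunk;
        size_t bytes = chunk * sizeof(float);
        cudaMemcpyAsync(dp, hp, bytes, cudaMemcpyHostToDevice, s[c]);
        scale<<<(chunk + 255) / 256, 256, 0, s[c]>>>(dp, chunk);
        cudaMemcpyAsync(hp, dp, bytes, cudaMemcpyDeviceToHost, s[c]);
    }
    cudaDeviceSynchronize();  // the "synchronization management" cost of async

    printf("h[0] = %.1f\n", h[0]);  // expect 2.0
    for (int c = 0; c < chunks; ++c) cudaStreamDestroy(s[c]);
    cudaFreeHost(h); cudaFree(d);
    return 0;
}
```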
- lspci
$ lspci | grep -i nvidia
- nvidia-smi
$ nvidia-smi
- deviceQuery
# 1. Clone the samples repository
git clone https://github.com/NVIDIA/cuda-samples.git
cd cuda-samples
# 2. Build (requires CMake)
mkdir build && cd build
cmake ..
make -j$(nproc)
# 3. Run (from the deviceQuery build location)
cd Samples/1_Utilities/deviceQuery
./deviceQuery
- rtx 5090 results:
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "NVIDIA GeForce RTX 5090"
CUDA Driver Version / Runtime Version 13.2 / 13.2
CUDA Capability Major/Minor version number: 12.0
Total amount of global memory: 32111 MBytes (33670758400 bytes)
(170) Multiprocessors, (128) CUDA Cores/MP: 21760 CUDA Cores
GPU Max Clock rate: 2407 MHz (2.41 GHz)
Memory Clock rate: 14001 Mhz
Memory Bus Width: 512-bit
L2 Cache Size: 100663296 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total shared memory per multiprocessor: 102400 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Managed Memory: Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 13.2, CUDA Runtime Version = 13.2, NumDevs = 1
Result = PASS
- download installer for Linux WSL-Ubuntu 2.0 x86_64
wget https://developer.download.nvidia.com/compute/cuda/repos/wsl-ubuntu/x86_64/cuda-wsl-ubuntu.pin
sudo mv cuda-wsl-ubuntu.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/13.2.0/local_installers/cuda-repo-wsl-ubuntu-13-2-local_13.2.0-1_amd64.deb
sudo dpkg -i cuda-repo-wsl-ubuntu-13-2-local_13.2.0-1_amd64.deb
sudo cp /var/cuda-repo-wsl-ubuntu-13-2-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda-toolkit-13-2
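After the install, the toolkit can be smoke-tested with a tiny program that asks the runtime for its own version and the device count; build it with the freshly installed `nvcc`:

```cuda
// Save as check.cu, then:  nvcc check.cu -o check && ./check
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int runtime = 0, driver = 0, devs = 0;
    cudaRuntimeGetVersion(&runtime);  // e.g. 13020 for CUDA 13.2
    cudaDriverGetVersion(&driver);
    cudaGetDeviceCount(&devs);
    printf("runtime %d, driver %d, devices %d\n", runtime, driver, devs);
    return devs > 0 ? 0 : 1;  // nonzero exit if no CUDA device is visible
}
```

A device count of 0 under WSL usually means the Windows-side driver is missing or outdated, since WSL uses the host driver rather than a Linux one.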