cuda - modrpc/info GitHub Wiki

Table of Contents

nvidia

overview

i/o optimization

nvidia-rtx-5090

  • hp-omen-max-45L system
    • ryzen 9 9950x3d (max 5.7ghz boost clock, 16 cores, 32 threads, 128mb l3-cache (2nd-gen v-cache))
    • kingston-fury 128gb (32gb x 4): ddr5-5200mt/s expo udimm rgb
    • geforce rtx 5090 gddr7 32gb - up to 3352 tops (hp oem)
    • gen4 pcie 4.0 tlc m.2 nvme ssd 4tb
    • b850 chipset
    • 1200w gold 80 modular power supply
  • cores:
| Core type | Official name | Primary role (specialty) | AI / graphics examples |
|---|---|---|---|
| GPU cores | CUDA Cores | General-purpose compute: the basic units that handle arithmetic and logic operations. | Traditional 3D rendering, physics-engine calculations, general-purpose workloads |
| Tensor cores | Tensor Cores | Matrix-math acceleration (deep learning): executes the huge matrix multiply-accumulates ($A \times B + C$) at the heart of AI models at very high speed. | LLM (Llama 3.1) inference, image upscaling (DLSS), fine-tuning |
| RT cores | Ray Tracing Cores | Ray-tracing acceleration (visual realism): computes light reflection, refraction, and shadows in hardware. | Real-time in-game reflections, high-quality 3D rendering and visualization |

| Core type | RTX 4090 (Ada) | RTX 5090 (Blackwell) | Change and notes |
|---|---|---|---|
| CUDA cores | 16,384 | 21,760 | ~33% increase; a major jump in general-purpose throughput |
| Tensor cores | 512 | 680 (5th gen) | More units plus native FP4 acceleration; AI performance feels more than doubled |
| RT cores | 128 | 170 (4th gen) | Maximizes high-fidelity ray-tracing and 3D-rendering efficiency |
| SM count | 128 | 170 | More of the "work crews" that house the cores themselves |
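The division of labor above can be made concrete with a minimal kernel. Ordinary scalar arithmetic like the element-wise add below runs on the CUDA cores, scheduled across the SMs in warps of 32 threads. This is an illustrative sketch, not from this wiki; all names are arbitrary, and it assumes managed memory support (confirmed by the deviceQuery output later on this page).

```cuda
// Illustrative sketch: plain element-wise math of the kind that executes
// on CUDA cores, one thread per array element.
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) c[i] = a[i] + b[i];                  // scalar FP op on a CUDA core
}

int main(void) {
    const int N = 1 << 20;
    float *a, *b, *c;
    cudaMallocManaged(&a, N * sizeof(float));  // managed memory keeps the
    cudaMallocManaged(&b, N * sizeof(float));  // example short
    cudaMallocManaged(&c, N * sizeof(float));
    for (int i = 0; i < N; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    vecAdd<<<(N + 255) / 256, 256>>>(a, b, c, N);  // 4096 blocks x 256 threads
    cudaDeviceSynchronize();                        // wait for the kernel

    printf("c[0] = %f\n", c[0]);  // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

Tensor cores, by contrast, are not targeted by scalar code like this; they are engaged through matrix-math paths (cuBLAS, cuDNN, or WMMA intrinsics).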
  • structure:
| Level | Component | Role and capability | Analogy |
|---|---|---|---|
| Level 1: System (graphics card) | Blackwell architecture | Next-generation GPU design, optimized for AI compute efficiency and hardware acceleration | The system's overall blueprint |
| | 32 GB GDDR7 VRAM | Ultra-fast memory that holds large AI models (e.g. Llama 3.1 70B) | Data warehouse |
| | PCIe 5.0 x16 interface | External path to the motherboard and system RAM (~64 GB/s) | Road outside the factory |
| Level 2: Data transfer (interconnect) | 512-bit memory bus | Dedicated interface directly linking the GPU die and VRAM | Private highway |
| | 1.79 TB/s bandwidth | Ships VRAM data to the GPU cores at 1.79 TB per second | Ultra-fast conveyor belt |
| Level 3: Inside the GPU die (on-chip logic) | GPCs (Graphics Processing Clusters) | The largest physical/logical block units inside the GPU | Large work zones |
| | L2 cache | Central on-chip staging area that serves data without a round trip to VRAM | Depot next to the cores |
| | SMs (Streaming Multiprocessors) | The units where tens of thousands of cores are concentrated and the actual computation happens | The key work crews |
| Level 4: Compute cores (core unit) | 5th-generation Tensor Cores | Dedicated accelerators for AI matrix math | AI specialists |
| | Native FP4 support | Hardware-level acceleration of 4-bit floating-point arithmetic | Fast quantized compute |
| | Inference capability | Runs quantized Llama 3.1 70B at 8–15+ characters per second | Real-time conversation speed |

cuda

blocking vs non-blocking i/o

| Aspect | Blocking (cudaMemcpy) | Non-blocking (cudaMemcpyAsync) |
|---|---|---|
| CPU control flow | Returns only after the transfer completes | Returns immediately after issuing the command |
| Memory requirement | Ordinary pageable memory is fine | Pinned memory required (prevents OS paging) |
| Parallelism | Sequential (compute after I/O) | Overlap possible (I/O and compute run concurrently) |
| Implementation difficulty | Low | High (requires stream and synchronization management) |

basics

  • lspci
$ lspci | grep -i nvidia
  • nvidia-smi
$ nvidia-smi
  • deviceQuery
# 1. Clone the samples repository
git clone https://github.com/NVIDIA/cuda-samples.git
cd cuda-samples

# 2. Build (requires CMake)
mkdir build && cd build
cmake ..
make -j$(nproc)

# 3. Run (example deviceQuery location)
cd Samples/1_Utilities/deviceQuery
./deviceQuery
  • rtx 5090 results:
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA GeForce RTX 5090"
  CUDA Driver Version / Runtime Version          13.2 / 13.2
  CUDA Capability Major/Minor version number:    12.0
  Total amount of global memory:                 32111 MBytes (33670758400 bytes)
  (170) Multiprocessors, (128) CUDA Cores/MP:    21760 CUDA Cores
  GPU Max Clock rate:                            2407 MHz (2.41 GHz)
  Memory Clock rate:                             14001 Mhz
  Memory Bus Width:                              512-bit
  L2 Cache Size:                                 100663296 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        102400 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 13.2, CUDA Runtime Version = 13.2, NumDevs = 1
Result = PASS
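The figures deviceQuery prints can also be read directly in your own code through the runtime API's `cudaGetDeviceProperties`. A minimal sketch (not part of the samples; field names are from the `cudaDeviceProp` struct):

```cuda
// Sketch: query the same device properties deviceQuery reports.
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    cudaDeviceProp p;
    cudaGetDeviceProperties(&p, 0);  // device 0
    printf("%s: CC %d.%d, %d SMs, %zu bytes global mem, %d-bit bus\n",
           p.name, p.major, p.minor, p.multiProcessorCount,
           p.totalGlobalMem, p.memoryBusWidth);
    return 0;
}
```

On the system above this would report compute capability 12.0 with 170 SMs, matching the output shown.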

cuda on wsl-2

  • download installer for Linux WSL-Ubuntu 2.0 x86_64
wget https://developer.download.nvidia.com/compute/cuda/repos/wsl-ubuntu/x86_64/cuda-wsl-ubuntu.pin
sudo mv cuda-wsl-ubuntu.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/13.2.0/local_installers/cuda-repo-wsl-ubuntu-13-2-local_13.2.0-1_amd64.deb
sudo dpkg -i cuda-repo-wsl-ubuntu-13-2-local_13.2.0-1_amd64.deb
sudo cp /var/cuda-repo-wsl-ubuntu-13-2-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda-toolkit-13-2
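After the packages install, the toolkit binaries are typically not on `PATH` yet. A common post-install step (the path assumes the default toolkit location for this version; adjust if yours differs):

```shell
# Make nvcc and the CUDA libraries visible in this shell (add to ~/.bashrc
# to persist). Paths assume the default /usr/local/cuda-13.2 install.
export PATH=/usr/local/cuda-13.2/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-13.2/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
nvcc --version   # should now resolve and report release 13.2
```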
โš ๏ธ **GitHub.com Fallback** โš ๏ธ