Running the larger Google Gemma 7B 35GB LLM on a single GPU for over 7x Inference Performance gains
See the companion Medium article: https://medium.com/@obrienlabs/running-the-larger-google-gemma-7b-35gb-llm-for-7x-inference-performance-gain-8b63019523bb and https://ai.google/
TLDR; Run Gemma 7B on a single GPU with over 40 GB of VRAM, preferably an 80 GB H100, a 40 GB A100 or a 48 GB RTX A6000. Performance will be at least 7x that of a multi-GPU deployment because the PCIe bus is avoided.
In a previous article I described how Google open sourced its larger Gemma 7B model (actually about 8B parameters) on huggingface.co: https://obrienlabs.medium.com/google-gemma-7b-and-2b-llm-models-are-now-available-to-developers-as-oss-on-hugging-face-737f65688f0d. This article describes how to run the Gemma 7B model on large-VRAM GPUs such as the A6000 (similar to the A100 and H100) that can hold the 35 GB footprint of the 7B model. An 8 to 10x speedup occurs if we avoid the PCIe bus transfers (16 to 32 GB/s) incurred when a model is split across GPUs.
For example, Google Gemma 7B on the 48 GB RTX A6000 (GA102) GPU runs at around 14 tokens/sec vs 0.9 tokens/sec for dual L4s (AD104), primarily because the 300 to 1000 GB/s of VRAM bandwidth is throttled by the 16 to 32 GB/s PCIe bus. If you were to split the 35 GB model across even two A6000s with a combined VRAM size of 96 GB, the model would run slower than on a single 48 GB card.
LLM model inference/serving is best done on a single GPU. While model training benefits from multiple GPUs via distributed training jobs, inference of a single request should stay on one GPU. If an inference run is split between multiple GPUs, performance can drop down to CPU levels.
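A rough, roofline-style estimate illustrates why the bus matters. The sketch below assumes, as a simplification, that every generated token requires streaming the full ~35 GB of weights over whichever link feeds the compute, so the numbers are illustrative ceilings rather than measurements:

# Back-of-envelope tokens/sec ceiling: full weights streamed once per generated token
# (a simplification: real inference also depends on compute, the KV cache and batching).
MODEL_BYTES = 35e9  # Gemma 7B at full precision, ~35 GB

links_gb_per_s = {
    "PCIe 3.0 x16 (model split across GPUs)": 16,
    "PCIe 4.0 x16 (model split across GPUs)": 32,
    "RTX A6000 VRAM (single GPU)": 768,
    "H100 SXM HBM3 (single GPU)": 3350,
}

for name, gb_per_s in links_gb_per_s.items():
    tokens_per_sec = (gb_per_s * 1e9) / MODEL_BYTES
    print(f"{name:>40}: ~{tokens_per_sec:5.1f} tokens/sec ceiling")

The 0.5 to 0.9 tokens/sec PCIe ceilings line up with the dual-L4 result above, while the single A6000 sits comfortably under its ~22 tokens/sec VRAM ceiling.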
Note: results will be very different on systems with high-bandwidth GPU interconnects such as NVLink 4 or 5, NVSwitch or InfiniBand.
Note: the Gemma 7B model used here is the default checkpoint, not fine-tuned or quantized. See https://ai.google.dev/gemma/docs?hl=en
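If you do want to trade some output quality for a smaller footprint, the transformers library can load the checkpoint in bfloat16 (roughly half the VRAM) or in 4-bit via bitsandbytes. This is a minimal sketch, assuming the accelerate and bitsandbytes packages are installed; it is not what the benchmarks below use:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "google/gemma-7b"  # gated repo: pass token=access_token as in gemma-gpu.py below

# Option 1: bfloat16 weights - roughly 17-18 GB of VRAM instead of ~34 GB at full precision
model_bf16 = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto")

# Option 2: 4-bit NF4 quantization via bitsandbytes - roughly 5-6 GB of VRAM
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                                bnb_4bit_compute_dtype=torch.bfloat16)
model_4bit = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto")

In practice you would pick one of the two options; loading both at once doubles the footprint.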
Software
We will be using the Hugging Face transformers Python library, along with accelerate (required for device_map="auto") and a CUDA-enabled build of PyTorch. Use a recent Python release, ideally 3.11.
gemma-gpu.py
import os

# default dual GPU - either the PCIe bus or an NVLink bridge - slowdowns - VRAM minimum of 20 GB on either GPU
#os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
# specific GPU - the model must fit entirely in VRAM:
# RTX 3500 Ada = 12G, A4000 = 16G, A4500 = 20G, A6000 = 48G, RTX 4000 Ada = 20G, RTX 5000 Ada = 32G, RTX 6000 Ada = 48G
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

from datetime import datetime
from transformers import AutoTokenizer, AutoModelForCausalLM

access_token = 'hf_cfTP…XCQqH'  # Hugging Face token with access to the gated Gemma repo
model_id = "google/gemma-7b"

tokenizer = AutoTokenizer.from_pretrained(model_id, token=access_token)
# gpu
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", token=access_token)
# cpu
#model = AutoModelForCausalLM.from_pretrained(model_id, token=access_token)

input_text = ("how is gold made in collapsing neutron stars"
              " - specifically what is the ratio created during the beta and r process.")

time_start = datetime.now().strftime("%H:%M:%S")
print("generate start: ", time_start)
# gpu
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
# cpu
#input_ids = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**input_ids, max_new_tokens=10000)
print(tokenizer.decode(outputs[0]))
print("end", datetime.now().strftime("%H:%M:%S"))
Hardware
Gemma 7B will not run on limited-VRAM single GPUs like the RTX 3500 Ada (12 GB), A4000 (16 GB), A4500 (20 GB) or L4 (24 GB), because the 7B model needs a minimum of roughly 35 GB of VRAM at its default precision.
While the Gemma 2B model can run on lower-VRAM GPUs of 12 GB or larger, the 35 GB Gemma 7B model runs best on a single GPU of 40 GB or larger such as the 48 GB RTX A6000; it can also be split across a pair of 20 GB+ GPUs such as the L4 (2 x 24 GB), the RTX A4500 (2 x 20 GB) or the consumer 4090 (2 x 24 GB). If the model is spread across two GPUs you will experience a slowdown across the PCIe bus, even with NVLink across the cards. A single GPU is optimal.
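Before loading, you can confirm that a single card has enough VRAM for the full-precision weights; a minimal sketch using PyTorch's CUDA device queries (the 35 GB figure and 2 GB headroom are assumptions for illustration):

import torch

MODEL_VRAM_GB = 35   # approximate full-precision footprint of Gemma 7B
HEADROOM_GB = 2      # illustrative margin for the KV cache and CUDA overhead

for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    total_gb = props.total_memory / 1e9
    verdict = "can hold" if total_gb >= MODEL_VRAM_GB + HEADROOM_GB else "cannot hold"
    print(f"GPU {i} ({props.name}, {total_gb:.0f} GB) {verdict} the ~{MODEL_VRAM_GB} GB model")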
Deployment
Performance
The LLM output for the neutron star question is around 170 tokens.
Single NVIDIA A6000 — Ampere GA102 (see L40s equivalent on GCP)
12 seconds for 170 tokens = 14 tokens/sec
98% GPU utilization of 10k cores, 34 GB of 48 GB VRAM used
85% TDP: 250 W of 300 W
768 GB/s memory bandwidth (384-bit bus)
$ nvidia-smi
Fri Apr 19 21:23:02 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 551.86 Driver Version: 551.86 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA RTX A6000 WDDM | 00000000:01:00.0 Off | Off |
| 38% 71C P2 263W / 300W | 34075MiB / 49140MiB | 98% Default |
CPU: Intel Core i9-14900K, overclocked 6400 MHz DDR5 RAM
89 seconds for 170 tokens (7.4x slower than the A6000) = 1.9 tokens/sec
90% CPU utilization of 32 vCores (24 + 8)
33 GB of 64 GB RAM used
6400 MHz DDR5-to-CPU bandwidth is 102 GB/s
GPU: Dual NVIDIA 4090 (Ada AD102)
102 seconds for 170 tokens (8.5x slower than the A6000) = 1.7 tokens/sec
70% GPU utilization of 2 x 16k cores, 34 GB of 48 GB combined VRAM used
22% TDP: 220 W of 900 W combined
1008 GB/s memory bandwidth per card (384-bit bus)
60% PCIe bus interface load, no NVLink
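The utilization, VRAM and PCIe-load figures above can be sampled while generate() is running, either with nvidia-smi as shown earlier or programmatically. A small sketch using the pynvml bindings (an assumption: the nvidia-ml-py package is installed); the PCIe counter is a short instantaneous sample, so treat it as indicative rather than an average:

import time
import pynvml

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]

for _ in range(10):  # sample for ~10 seconds while generation runs in another process
    for i, h in enumerate(handles):
        util = pynvml.nvmlDeviceGetUtilizationRates(h)   # GPU utilization in %
        mem = pynvml.nvmlDeviceGetMemoryInfo(h)          # bytes used / total
        tx = pynvml.nvmlDeviceGetPcieThroughput(h, pynvml.NVML_PCIE_UTIL_TX_BYTES)  # KB/s
        print(f"GPU{i}: util {util.gpu}%  "
              f"vram {mem.used / 1e9:.1f}/{mem.total / 1e9:.1f} GB  "
              f"pcie tx {tx / 1e6:.2f} GB/s")
    time.sleep(1)

pynvml.nvmlShutdown()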