Llama2-70B MLPerf Benchmark Setup (NVIDIA)
Clone the MLPerf Inference v4.1 results repository and enter the NVIDIA submission directory:
git clone https://github.com/mlcommons/inference_results_v4.1.git
cd inference_results_v4.1/closed/NVIDIA
Set a scratch directory that will store all inference models and data:
export MLPERF_SCRATCH_PATH=/path/to/scratch/space
Create the directories required to run the MLPerf inference workloads:
mkdir -p $MLPERF_SCRATCH_PATH/data $MLPERF_SCRATCH_PATH/models $MLPERF_SCRATCH_PATH/preprocessed_data
To run the container as root without adding root to the docker group, make the following change in docker/Dockerfile.user.
nano docker/Dockerfile.user
Comment out the following lines:
#RUN echo root:root | chpasswd \
# && groupadd -f -g ${GID} ${GROUP} \
# && useradd -G sudo -g ${GID} -u ${UID} -m ${USER} \
# && echo ${USER}:${USER} | chpasswd \
# && echo -e "\nexport PS1=\"(mlperf) \\u@\\h:\\w\\$ \"" | tee -a /home/${USER}/.bashrc \
# && echo -e "\n%sudo ALL=(ALL:ALL) NOPASSWD:ALL\n" | tee -a /etc/sudoers
Update with:
RUN if ! id -u ${USER} > /dev/null 2>&1; then \
groupadd -f -g ${GID} ${GROUP} && \
useradd -G sudo -g ${GID} -u ${UID} -m ${USER} && \
echo ${USER}:${USER} | chpasswd && \
echo -e "\nexport PS1=\"(mlperf) \\u@\\h:\\w\\$ \"" | tee -a /home/${USER}/.bashrc && \
echo -e "\n%sudo ALL=(ALL:ALL) NOPASSWD:ALL\n" | tee -a /etc/sudoers; \
fi
Launch the docker container with the required mount directories:
make prebuild DOCKER_ARGS="-v <host directory containing the data, models and preprocessed_data directories>:/home"
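For example, assuming the scratch space exported earlier holds the data, models and preprocessed_data directories (the path below is illustrative):
make prebuild DOCKER_ARGS="-v /path/to/scratch/space:/home"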
Inside the docker container, point MLPERF_SCRATCH_PATH at the mounted path used for storing inference data:
export MLPERF_SCRATCH_PATH=/home
Make sure that the container has the MLPERF_SCRATCH_PATH set correctly
echo $MLPERF_SCRATCH_PATH
Make sure that the container mounted the scratch space correctly
ls -al $MLPERF_SCRATCH_PATH
To make sure that the build/ directory isn't dirty
make clean
To link the build/ directory to the scratch space
make link_dirs
ls -al build/
You should see output like the following:
lrwxrwxrwx 1 user group 35 Jun 24 18:49 data -> $MLPERF_SCRATCH_PATH/data
lrwxrwxrwx 1 user group 37 Jun 24 18:49 models -> $MLPERF_SCRATCH_PATH/models
lrwxrwxrwx 1 user group 48 Jun 24 18:49 preprocessed_data -> $MLPERF_SCRATCH_PATH/preprocessed_data
Install rclone and download the OpenOrca dataset from the public MLCommons bucket:
sudo -v ; curl https://rclone.org/install.sh | sudo bash
rclone config create mlc-inference s3 provider=Cloudflare access_key_id=f65ba5eef400db161ea49967de89f47b secret_access_key=fbea333914c292b854f14d3fe232bad6c5407bf0ab1bebf78833c2b359bdfd2b endpoint=https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com
rclone copy mlc-inference:mlcommons-inference-wg-public/open_orca ./open_orca -P
Note: unzip the Llama dataset pickle file before running preprocessing.
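A minimal sketch, assuming the rclone copy above landed in ./open_orca, the pickle is gzip-compressed (file names may differ), and the preprocessing step expects the raw data under $MLPERF_SCRATCH_PATH/data:
gunzip ./open_orca/*.pkl.gz
mv ./open_orca $MLPERF_SCRATCH_PATH/data/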
Preprocess the raw data for the Llama2-70B inference workload:
make preprocess_data BENCHMARKS="llama2-70b"
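To sanity-check the preprocessing step, list the preprocessed data directory; an open_orca subdirectory should appear (the exact layout may vary):
ls -al $MLPERF_SCRATCH_PATH/preprocessed_data/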
Add the current system for running inference workloads:
python3 -m scripts.custom_systems.add_custom_system
- When prompted, enter a lowercase 'y' (yes) and then provide a custom system ID such as DGXH100_INTEL or DGXH100_AMD.
- After entering a system ID, the script will generate (or append to, if already existing) a file at
code/common/systems/custom_list.py
- If this is your first time running NVIDIA's MLPerf Inference for this system, enter 'y' at the prompt. This will generate config files for every benchmark, located at configs/[benchmark]/[scenario]/custom.py
- Edit the hyperparameters specific to the model (e.g. llama2-70b), scenario (e.g. Offline) and system name (e.g. class DGXH100_INTEL) in
configs/<model>/<scenario>/custom.py
to get the best performance; an illustrative entry is sketched after this list.
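As an illustration, an Offline entry in configs/llama2-70b/Offline/custom.py for a system named DGXH100_INTEL might look roughly like the sketch below. The field names and values here are assumptions for illustration only; keep the fields emitted by the add_custom_system script and tune the values for your hardware:
from . import *

@ConfigRegistry.register(HarnessType.Custom, AccuracyTarget.k_99, PowerSetting.MaxP)
class DGXH100_INTEL(OfflineGPUBaseConfig):
    system = KnownSystem.DGXH100_INTEL
    gpu_batch_size = 1024        # illustrative value; tune for your GPUs
    offline_expected_qps = 80    # illustrative value; tune for your GPUs

@ConfigRegistry.register(HarnessType.Custom, AccuracyTarget.k_99_9, PowerSetting.MaxP)
class DGXH100_INTEL_HighAccuracy(DGXH100_INTEL):
    pass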
Build all MLPerf dependencies inside the container:
make build
Quantize the Llama2-70B checkpoint to FP8 using the TensorRT-LLM quantization script (run inside the container):
python3 /work/build/TRTLLM/examples/quantization/quantize.py \
--dtype float16 \
--qformat fp8 \
--kv_cache_dtype fp8 \
--output_dir=/work/build/models/Llama2/fp8-quantized-ammo/llama2-70b-chat-hf-tp1pp1-fp8 \
--model_dir=/home/models/Llama2/Llama2-70b-chat-hf/ \
--calib_size 1024 \
--tp_size 1 \
--calib_dataset /work/build/preprocessed_data/open_orca/mlperf_llama2_openorca_calibration_1k/
If you have not built TRT-LLM yet, or your TRT-LLM build is outdated, rebuild it:
rm -rf build/TRTLLM && make clone_trt_llm && make build_trt_llm
Generate the TensorRT engines and run the harness in accuracy mode for the Offline scenario:
make generate_engines RUN_ARGS="--benchmarks=llama2-70b --scenarios=Offline --config_ver=high_accuracy"
make run_harness RUN_ARGS="--benchmarks=llama2-70b --scenarios=Offline --config_ver=high_accuracy --test_mode=AccuracyOnly"
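After the accuracy run, throughput can be measured with the same harness in performance mode; this invocation simply mirrors the accuracy command above with the test mode switched (an assumption based on the harness's standard PerformanceOnly mode):
make run_harness RUN_ARGS="--benchmarks=llama2-70b --scenarios=Offline --config_ver=high_accuracy --test_mode=PerformanceOnly"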