Llama2-70B MLPerf Benchmark Setup (NVIDIA)

MLPerf Llama2-70B Inference instructions

Environment Setup:

Clone MLPerf repo:

git clone https://github.com/mlcommons/inference_results_v4.1.git
cd inference_results_v4.1/closed/NVIDIA 

Set the directory that will store all inference models and data:

export MLPERF_SCRATCH_PATH=/path/to/scratch/space

Create required directories to run MLPerf inference workloads

mkdir -p $MLPERF_SCRATCH_PATH/data $MLPERF_SCRATCH_PATH/models $MLPERF_SCRATCH_PATH/preprocessed_data

Make the following changes to execute as root without adding root to the docker group:

nano docker/Dockerfile.user  

Comment out the following lines:

#RUN echo root:root | chpasswd \ 
# && groupadd -f -g ${GID} ${GROUP} \ 
# && useradd -G sudo -g ${GID} -u ${UID} -m ${USER} \ 
# && echo ${USER}:${USER} | chpasswd \ 
# && echo -e "\nexport PS1=\"(mlperf) \\u@\\h:\\w\\$ \"" | tee -a /home/${USER}/.bashrc \ 
# && echo -e "\n%sudo ALL=(ALL:ALL) NOPASSWD:ALL\n" | tee -a /etc/sudoers 

Replace them with:

RUN if ! id -u ${USER} > /dev/null 2>&1; then \ 
        groupadd -f -g ${GID} ${GROUP} && \ 
        useradd -G sudo -g ${GID} -u ${UID} -m ${USER} && \ 
        echo ${USER}:${USER} | chpasswd && \ 
        echo -e "\nexport PS1=\"(mlperf) \\u@\\h:\\w\\$ \"" | tee -a /home/${USER}/.bashrc && \ 
        echo -e "\n%sudo ALL=(ALL:ALL) NOPASSWD:ALL\n" | tee -a /etc/sudoers; \ 
    fi 

Inside Docker Container

Launch the docker container with the required mount directories:

make prebuild DOCKER_ARGS="-v <path containing the data, models and preprocessed_data directories>:/home"
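
For example, if data, models and preprocessed_data live directly under the scratch space created earlier, the scratch directory itself can be mounted (an illustrative invocation, not the only valid layout):

make prebuild DOCKER_ARGS="-v $MLPERF_SCRATCH_PATH:/home"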

The following commands are to be executed inside the docker container.

Set the path inside the docker container for storing inference data:

export MLPERF_SCRATCH_PATH=/home 

Make sure that the container has the MLPERF_SCRATCH_PATH set correctly

echo $MLPERF_SCRATCH_PATH  

Make sure that the container mounted the scratch space correctly

ls -al $MLPERF_SCRATCH_PATH 

To make sure that the build/ directory isn't dirty

make clean

To link the build/ directory to the scratch space

make link_dirs  
ls -al build/  

You should see output like the following:

lrwxrwxrwx  1 user group   35 Jun 24 18:49 data -> $MLPERF_SCRATCH_PATH/data 
lrwxrwxrwx  1 user group   37 Jun 24 18:49 models -> $MLPERF_SCRATCH_PATH/models 
lrwxrwxrwx  1 user group   48 Jun 24 18:49 preprocessed_data -> $MLPERF_SCRATCH_PATH/preprocessed_data 

Download the preprocessed dataset from a Cloudflare R2 bucket:

sudo -v ; curl https://rclone.org/install.sh | sudo bash 
rclone config create mlc-inference s3 provider=Cloudflare access_key_id=f65ba5eef400db161ea49967de89f47b secret_access_key=fbea333914c292b854f14d3fe232bad6c5407bf0ab1bebf78833c2b359bdfd2b endpoint=https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com 
rclone copy mlc-inference:mlcommons-inference-wg-public/open_orca ./open_orca -P 

Note: Unzip the downloaded Llama dataset pickle file(s) before preprocessing.
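
For example, assuming the bucket ships the dataset as gzip-compressed pickle files under ./open_orca (exact filenames may differ between releases):

gunzip ./open_orca/*.pkl.gz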

Download the Llama2-70B model for inference:
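
A minimal sketch using huggingface-cli, assuming you have been granted access to meta-llama/Llama-2-70b-chat-hf and have a Hugging Face token; the target path matches the --model_dir used in the quantization step below:

huggingface-cli login
huggingface-cli download meta-llama/Llama-2-70b-chat-hf --local-dir $MLPERF_SCRATCH_PATH/models/Llama2/Llama2-70b-chat-hf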

Preprocess the raw data for the Llama2-70B inference workload:

make preprocess_data BENCHMARKS="llama2-70b"

Add the current system for running inference workloads:

python3 -m scripts.custom_systems.add_custom_system 
  1. When prompted for a custom system ID, enter a lowercase 'y' (yes) and provide a name such as DGXH100_INTEL or DGX_H100_AMD.
  2. After entering a system ID, the script will generate (or append to, if it already exists) a file at code/common/systems/custom_list.py.
  3. If this is your first time running NVIDIA's MLPerf Inference for this system, enter 'y' at the prompt. This will generate config files for every single benchmark, located at configs/[benchmark]/[scenario]/custom.py.
  4. Edit the hyperparameters specific to the model (e.g. llama2-70b), scenario (e.g. Offline) and system name (e.g. class DGXH100_INTEL) in configs/<model>/<scenario>/custom.py to get the best performance; see the example below.
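
For example, to open the Offline configuration for Llama2-70B generated in step 3 (the benchmark directory name is assumed to be llama2-70b, matching the BENCHMARKS value used elsewhere in this guide):

nano configs/llama2-70b/Offline/custom.py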

Convert the model into a quantized (FP8) one:

Build all MLPerf dependencies inside the container:

make build 
python3 /work/build/TRTLLM/examples/quantization/quantize.py \
    --dtype float16 \
    --qformat fp8 \
    --kv_cache_dtype fp8 \
    --output_dir=/work/build/models/Llama2/fp8-quantized-ammo/llama2-70b-chat-hf-tp1pp1-fp8 \
    --model_dir=/home/models/Llama2/Llama2-70b-chat-hf/ \
    --calib_size 1024  \
    --tp_size 1 \
    --calib_dataset /work/build/preprocessed_data/open_orca/mlperf_llama2_openorca_calibration_1k/  

If you have not built TRTLLM yet, or your TRTLLM build is outdated, rebuild it before running the quantization step above:

rm -rf build/TRTLLM && make clone_trt_llm && make build_trt_llm  

Building the engine:

make generate_engines RUN_ARGS="--benchmarks=llama2-70b --scenarios=Offline --config_ver=high_accuracy" 

Running the benchmark:

make run_harness RUN_ARGS="--benchmarks=llama2-70b --scenarios=Offline --config_ver=high_accuracy --test_mode=AccuracyOnly" 
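
To measure throughput rather than accuracy, the same harness can be run in performance mode (assuming the same benchmark, scenario and config flags as the accuracy run above):

make run_harness RUN_ARGS="--benchmarks=llama2-70b --scenarios=Offline --config_ver=high_accuracy --test_mode=PerformanceOnly"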
