7. Experiment Results - OpenMPDK/SMDK GitHub Wiki
We believe that identifying the best- and worst-fit use cases plays a pivotal role in the success of a new memory device when users investigate adopting it. Thus, we assumed three possible CXL use cases and present the results for those who are considering adopting CXL memory in their systems. The experiments were conducted under our lab conditions on a developing CXL system, so the outcomes should be taken as a reference.
Including the results of this chapter, all the experimental results on this page are based on prototypes of a CXL-enabled system and a CXL memory module, so the results may vary depending on the evaluation environment.
The table below describes the HW/SW testbed information that we used for the experiments.
| HW / SW | Description |
|---|---|
| CPU / Board | Prototype CPU and board system which supports PCIe Gen5 I/F and CXL |
| DRAM | Samsung DDR5 DIMM 4800MT/s |
| CXL Memory Expander | Samsung CMM-D prototype |
| OS | Ubuntu 20.04 LTS, CentOS-7 x86 2009, Fedora Workstation 38, OpenSUSE Leap 15.5 |
| Kernel | SMDK kernel: 5.17.0-rc5-smdk and later (latest: 6.6.0-smdk) |
Redis Configuration
- maxmemory: 8g
- maxmemory-policy: allkeys-lru
- maxmemory-samples: 10
Memtier Configuration
- target 60GB W/L
- pipeline_num: 1
- thread_num: 24 (using taskset -c 24-47)
- client_num: 50
- key_pattern P:P
- ratio 1:0 / 0:1 for Set/Get each
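The Memtier settings above correspond to a command line roughly like the following (a sketch; the server address, port, and value size are assumptions, and flag spellings should be checked against your memtier_benchmark version):

```shell
# Pin memtier to cores 24-47 and drive Redis with the parameters above.
# 100% SET workload at a 128B value size; use --ratio=0:1 for the GET run.
taskset -c 24-47 memtier_benchmark \
    -s 127.0.0.1 -p 6379 \
    --threads=24 --clients=50 --pipeline=1 \
    --key-pattern=P:P --ratio=1:0 \
    --data-size=128
```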
SMDK Configuration (CXLMALLOC_CONF)
| use_exmem | Test A exmem_size | Test A normal_size | Test A priority | Test B exmem_size | Test B normal_size | Test B priority | maxmemory_policy | use_auto_arena_scaling |
|---|---|---|---|---|---|---|---|---|
| TRUE | 131072 | 2048 | normal | 131072 | 2048 | normal | remain | FALSE |
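For reference, the table's settings map onto the CXLMALLOC_CONF environment variable roughly as follows (a sketch, assuming the comma-separated key:value format; the library path is an assumption):

```shell
# Test B (CXL + DRAM) settings from the table above, applied to the Redis server.
export LD_PRELOAD=/usr/lib/libcxlmalloc.so   # path is an assumption
export CXLMALLOC_CONF=use_exmem:true,exmem_size:131072,normal_size:2048,priority:normal,maxmemory_policy:remain,use_auto_arena_scaling:false
redis-server ./redis.conf
```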
The table below summarizes the benchmark results using the Memtier benchmark tool. Performance improves by up to 2.5x when the system memory is expanded (DRAM + CXL).
UX A. More system memory
| Relative Bandwidth (Set) | Test A - DRAM | Test B - CXL + DRAM |
|---|---|---|
| 128B | 1 | 2.08 |
| 256B | 1 | 2.57 |
| 512B | 1 | 2.18 |
| 1KB | 1 | 1.82 |
| HW / SW | Description |
|---|---|
| CPU / Board | Prototype CPU and board system which supports PCIe Gen5 I/F and CXL |
| DRAM | Samsung DDR5 DIMM 4800MT/s |
| CXL Memory Expander | Samsung CMM-D prototype |
| OS | Ubuntu 20.04 LTS, CentOS-7 x86 2009, Fedora Workstation 38, OpenSUSE Leap 15.5 |
| Kernel | SMDK kernel: 5.17.0-rc5-smdk and later (latest: 6.6.0-smdk) |
Memcached Configuration
- UX A
  - #threads: 24
  - maxmemory: 2048
  - memory pre-allocation by extstore options
- UX B
  - #threads: 24
  - extstore write buffer: 2MiB
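The settings above correspond to a memcached invocation roughly like this (a sketch; the extstore path and file size are assumptions):

```shell
# UX A: 24 worker threads, 2048MB memory limit; extstore options
# pre-allocate the external storage file.
memcached -t 24 -m 2048 -o ext_path=/mnt/nvme/extstore:64G

# UX B: 24 worker threads with a 2MiB extstore write buffer.
memcached -t 24 -o ext_path=/mnt/nvme/extstore:64G,ext_wbuf_size=2
```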
Memtier Configuration
- target 60GB W/L (cf. UX B: 10GB W/L)
- pipeline_num: 1
- thread_num: 24 (using taskset -c 24-47)
- client_num: 50
- key_pattern P:P
- ratio 1:0 / 0:1 for Set/Get each
- The performance of UX B - Test A (DRAM + Storage) was calculated based on the time when all the value data had been written to storage.
SMDK Configuration (CXLMALLOC_CONF)
| use_exmem | Test A exmem_size | Test A normal_size | Test A priority | Test B exmem_size | Test B normal_size | Test B priority | maxmemory_policy | use_auto_arena_scaling |
|---|---|---|---|---|---|---|---|---|
| TRUE | 131072 | 2048 | normal | 131072 | 2048 | normal | remain | FALSE |
| TRUE | 131072 | 131072 | normal | 131072 | 131072 | exmem | remain | FALSE |
The tables below summarize the benchmark results using the Memtier benchmark tool. In UX A, the performance figures are similar to those of the DRAM-only system; in the memory scale-up case (UX B, CXL vs. DRAM + Storage), performance is greatly improved compared to the comparison group.
UX A. More system memory
| Relative Bandwidth (Set) | Test A - DRAM | Test B - CXL + DRAM |
|---|---|---|
| 128B | 1 | 0.96 |
| 256B | 1 | 0.98 |
| 512B | 1 | 0.83 |
| 1KB | 1 | 0.91 |
UX B. Memory scale-up (vs. system memory + storage)
| Relative Bandwidth | Test A - DRAM + Storage | Test B - CXL scale-up |
|---|---|---|
| Set 4KB | 1 | 3.2 |
| Set 512KB | 1 | 253.3 |
| Get 4KB | 1 | 14.1 |
| Get 512KB | 1 | 7.1 |
We devised an experiment methodology to achieve aggregated memory bandwidth (UX C) on a heterogeneous memory system that equips both DRAM and CXL memory.
The key ideas are:
- When DRAM bandwidth is saturated by an application, we additionally allocate CXL memory to be used by the application.
- CPU/memory resources are isolated and reserved for the application.
- While running, the application triggers a sufficient memory workload on the isolated HW resources to draw the maximum bandwidth out of the memory.
We also use the following notation for a detailed explanation. Please refer to the composition of the reference testbed below.
- Dmax_bw = the maximum BW of DDR DRAM (e.g., 28GB/s)
- Cmax_bw = the maximum BW of CXL DRAM
- Duse_bw = in-use BW of DDR DRAM (e.g., < 25GB/s)
- Cuse_bw = in-use BW of CXL DRAM
- Dmax_wl = the minimum workload size for DDR BW saturation
- Cmax_wl = the minimum workload size for CXL BW saturation
- Duse_wl = in-use WL size on DDR DRAM from the running application
- Cuse_wl = in-use WL size on CXL DRAM from the running application
- Texec = application execution time
Hence, the condition to reach the peak aggregated BW is Duse_wl > Dmax_wl and Cuse_wl > Cmax_wl, held throughout Texec.
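Restating the condition with the notation above, the peak bandwidth the methodology aims for can be written as:

```latex
% Peak aggregated bandwidth on the DRAM + CXL system
BW_{agg} \approx D_{max\_bw} + C_{max\_bw}
% reached when both devices stay saturated for the whole run:
D_{use\_wl} > D_{max\_wl} \;\wedge\; C_{use\_wl} > C_{max\_wl}
\quad \text{for } 0 \le t \le T_{exec}
```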
Applying the methodology, we conducted experiments on ML/AI applications, GPT, BERT, and NASNet, to validate how the additional bandwidth from CXL memory actually helps application performance. Given the test set, the ML/AI applications show increased inference throughput (IPM, Inferences Per Minute). The following section explains the test results of a GPT-based application that generates sentences.
This experiment shows the improved throughput of the GPT2 application with SMDK's intelligent tiering.
The table below describes HW/SW testbed for the experiments.
| HW / SW | Description |
|---|---|
| CPU / Board | Prototype CPU and board system which supports PCIe Gen5 I/F and CXL |
| DRAM | Samsung 64GB DDR5 DIMM 4800MT/s |
| CXL Memory Expander | Samsung 128GB CMM-D prototype |
| OS | Ubuntu 20.04 LTS |
| SMDK | SMDK v1.5 |
GPT2 Model Configuration
- Pytorch >= 2.0, Python >= 3.10
- Model type : GPT2-base, 12-layer, 768-hidden, 12-heads, 117M parameters, batch_size 8, max-length 128
- Dataset: imdb reviews
SMDK Configuration (CXLMALLOC_CONF)
- use_exmem: true
- use_adaptive_interleaving: true
- adaptive_interleaving_policy: bw_saturation
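These settings translate to an environment roughly like the following before launching the GPT2 workload (a sketch; the library path and script name are assumptions):

```shell
export LD_PRELOAD=/usr/lib/libcxlmalloc.so   # path is an assumption
export CXLMALLOC_CONF=use_exmem:true,use_adaptive_interleaving:true,adaptive_interleaving_policy:bw_saturation
python3 gpt2_inference.py                    # hypothetical script name
```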
The figures below depict the results of the experiment on the GPT2 application.
With vanilla Linux, i.e., without adaptive interleaving, CXL DRAM was not utilized by the GPT2 application even though DRAM bandwidth was saturated.
This is because vanilla Linux does not consider the bandwidth saturation of DDR DRAM; it only cares about the available capacity of DDR DRAM.
As a result, throughput is limited to around 240 IPM, leaving the additional bandwidth of CXL DRAM unused.
Meanwhile, adaptive interleaving automatically detects when DRAM bandwidth saturation occurs by monitoring the in-use memory bandwidth of the system using the CPU PMU. Once DDR DRAM bandwidth is saturated, CXL DRAM is used by the application from that point on. As a result, throughput improves to 360 IPM, around 1.5x higher than the vanilla case.
This experiment shows the improved throughput of the GPT2 application using the optimization path, specifically by modifying PyTorch to allocate CXL memory via s_posix_memalign for specific types of pre-trained weights.
The table below describes HW/SW testbed for the experiments.
| HW / SW | Description |
|---|---|
| CPU / Board | Prototype CPU and board system which supports PCIe Gen5 I/F and CXL, logical cores: 144 |
| DRAM | Samsung 64GB DDR5 DIMM 4800MT/s |
| CXL Memory Expander | Samsung 128GB CMM-D prototype |
| OS | Ubuntu 22.04 LTS |
| SMDK | SMDK v1.5 |
GPT2 Model Configuration
- Pytorch >= 2.0, Python >= 3.10
- Model type : GPT2-large, 36-layer, 1280-hidden, 20-heads, 774M parameters, max-length 128
- Dataset: imdb reviews
SMDK Configuration (SMALLOC_CONF)
- use_auto_arena_scaling:true
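For reference, the single setting above would be exported roughly as follows (a sketch, assuming SMALLOC_CONF follows the same key:value format as CXLMALLOC_CONF):

```shell
export SMALLOC_CONF=use_auto_arena_scaling:true
```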
By using the optimized path, CPU cores are fully utilized by the GPT2 application even though DRAM bandwidth is moderately saturated. In the non-optimized path, pre-trained weights are stored in DRAM and consume its available capacity. Since the pre-trained weights of a large language model are part of the model rather than the inference data, keeping them in CXL memory lets DRAM serve more requests. In this experiment, therefore, some weights are offloaded through the optimization path to CXL DDR during weight initialization, which helps DDR serve more inference requests while keeping memory bandwidth saturated.
As a result, the throughput, which is limited to 1 (normalized) without the optimization path, increases to 1.99 (normalized) with it.
We also observed that in the optimization path the GPT2-large pre-trained weights occupy 4GB per instance. If the total cores used are increased from 96 to 144 and the CXL DDR usage from 51.32% to 90%, the performance growth factor is expected to increase from 1.99 to 2.5. Improved growth can also be expected if the per-instance weight usage increases from 4GB to 6GB. This observation is based on the GPT2 variants and their memory consumption.
Virtualization is also an important SW stack and a promising use case for CXL memory adoption. Container technology, a thin form of virtualization, is widely used in industry, so we integrated containers with the SMDK and conducted some experiments.
- Install required container runtime and Docker.
- When creating the container image of an application, an SMDK plugin needs to be included. (e.g., libcxlmalloc.so for the compatible path)
- Start Docker container as usual. No additional setting is needed when starting a container.
- When running the application, set and export required configurations according to the plugin. (e.g., LD_PRELOAD and CXLMALLOC_CONF for libcxlmalloc.so) Please refer to Compatible path section for more details.
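The steps above can be sketched as a terminal session (the image name, library path, and application command are hypothetical):

```shell
# The image already bundles the SMDK plugin (e.g., libcxlmalloc.so);
# start the container as usual, with no extra settings.
docker run -it gpt2-smdk:latest bash

# Inside the container, configure the compatible path, then run the app.
export LD_PRELOAD=/usr/lib/libcxlmalloc.so
export CXLMALLOC_CONF=use_exmem:true,exmem_size:131072,normal_size:2048,priority:normal
python3 run_inference.py
```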
The Docker container and SMDK integration worked well. Below is the list of the application containers we tested.
- ML/AI Applications
- GPT2 Inference (with python, pytorch framework)
- BERT Inference (with python, tensorflow framework)
- NASNet Inference (with python, tensorflow framework)
- DLRM Inference (with python, pytorch framework)
- In-memory Database(IMDB) Applications
- Redis
- Memcached