7. Experiment Results - OpenMPDK/SMDK GitHub Wiki
We believe that identifying the best- and worst-fit use cases plays a pivotal role in the success of a new memory device when users investigate adopting it. Thus, we assumed three possible CXL use cases and present the results for those who are considering adopting CXL memory in their systems. The experiments were conducted under our lab conditions on a developing CXL system, so the outcomes should be taken as a reference.
Including the results of this chapter, all the experimental results on this page are based on prototypes of a CXL-enabled system and a CXL memory module, so the results may vary depending on the evaluation environment.
The table below describes the HW/SW testbed information that we used for the experiments.
| HW / SW | Description |
|---|---|
| CPU / Board | Prototype CPU and board system which supports PCIe Gen5 I/F and CXL |
| DRAM | Samsung DDR5 DIMM 4800MT/s |
| CXL Memory Expander | Samsung CMM-D prototype |
| OS | Ubuntu 20.04 LTS, CentOS-7 x86 2009, Fedora Workstation 38, OpenSUSE Leap 15.5 |
| Kernel | SMDK kernel: 5.17.0-rc5-smdk and later (latest: 6.6.0-smdk) |
Redis Configuration
- maxmemory: 8g
- maxmemory-policy: allkeys-lru
- maxmemory-samples: 10
Memtier Configuration
- target 60GB W/L
- pipeline_num: 1
- thread_num: 24 (using taskset -c 24-47)
- client_num: 50
- key_pattern P:P
- ratio 1:0 / 0:1 for Set/Get each
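The Memtier settings above correspond to a command line roughly like the following (a sketch; the server address, port, and value size are assumptions, and flag spellings should be checked against your memtier_benchmark version):

```shell
# Pin memtier to cores 24-47 and drive Redis with the parameters above.
# 100% SET workload at a 128B value size; use --ratio=0:1 for the GET run.
taskset -c 24-47 memtier_benchmark \
    -s 127.0.0.1 -p 6379 \
    --threads=24 --clients=50 --pipeline=1 \
    --key-pattern=P:P --ratio=1:0 \
    --data-size=128
```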
SMDK Configuration (CXLMALLOC_CONF)
| use_exmem | Test A exmem_size | Test A normal_size | Test A priority | Test B exmem_size | Test B normal_size | Test B priority | maxmemory_policy | use_auto_arena_scaling |
|---|---|---|---|---|---|---|---|---|
| TRUE | 131072 | 2048 | normal | 131072 | 2048 | normal | remain | FALSE |
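For reference, the table's settings map onto the CXLMALLOC_CONF environment variable roughly as follows (a sketch, assuming the comma-separated key:value format; the library path is an assumption):

```shell
# Test B (CXL + DRAM) settings from the table above, applied to the Redis server.
export LD_PRELOAD=/usr/lib/libcxlmalloc.so   # path is an assumption
export CXLMALLOC_CONF=use_exmem:true,exmem_size:131072,normal_size:2048,priority:normal,maxmemory_policy:remain,use_auto_arena_scaling:false
redis-server ./redis.conf
```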
The table below summarizes the benchmark results using the Memtier benchmark tool. Performance improves by up to 2.5x when the system memory is expanded (DRAM + CXL).
UX A. More system memory
| Relative Bandwidth (Set) | Test A - DRAM | Test B - CXL + DRAM |
|---|---|---|
| 128B | 1 | 2.08 |
| 256B | 1 | 2.57 |
| 512B | 1 | 2.18 |
| 1KB | 1 | 1.82 |
| HW / SW | Description |
|---|---|
| CPU / Board | Prototype CPU and board system which supports PCIe Gen5 I/F and CXL |
| DRAM | Samsung DDR5 DIMM 4800MT/s |
| CXL Memory Expander | Samsung CMM-D prototype |
| OS | Ubuntu 20.04 LTS, CentOS-7 x86 2009, Fedora Workstation 38, OpenSUSE Leap 15.5 |
| Kernel | SMDK kernel: 5.17.0-rc5-smdk and later (latest: 6.6.0-smdk) |
Memcached Configuration
- UX A
  - #threads: 24
  - maxmemory: 2048
  - memory pre-allocation by extstore options
- UX B
  - #threads: 24
  - extstore write buffer: 2MiB
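The settings above correspond to a memcached invocation roughly like this (a sketch; the extstore path and file size are assumptions):

```shell
# UX A: 24 worker threads, 2048MB memory limit; extstore options
# pre-allocate the external storage file.
memcached -t 24 -m 2048 -o ext_path=/mnt/nvme/extstore:64G

# UX B: 24 worker threads with a 2MiB extstore write buffer.
memcached -t 24 -o ext_path=/mnt/nvme/extstore:64G,ext_wbuf_size=2
```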
Memtier Configuration
- target 60GB W/L (cf. UX B: 10GB W/L)
- pipeline_num: 1
- thread_num: 24 (using taskset -c 24-47)
- client_num: 50
- key_pattern P:P
- ratio 1:0 / 0:1 for Set/Get each
- The performance of UX B - Test A (DRAM + Storage) was calculated based on the time when all the value data had been written to storage.
SMDK Configuration (CXLMALLOC_CONF)
| use_exmem | Test A exmem_size | Test A normal_size | Test A priority | Test B exmem_size | Test B normal_size | Test B priority | maxmemory_policy | use_auto_arena_scaling |
|---|---|---|---|---|---|---|---|---|
| TRUE | 131072 | 2048 | normal | 131072 | 2048 | normal | remain | FALSE |
| TRUE | 131072 | 131072 | normal | 131072 | 131072 | exmem | remain | FALSE |
The tables below summarize the benchmark results using the Memtier benchmark tool. In UX A, the performance figures are similar to those of the DRAM-only system; in the memory scale-up case (UX B, CXL vs. DRAM + Storage), performance is greatly improved compared to the comparison group.
UX A. More system memory
| Relative Bandwidth (Set) | Test A - DRAM | Test B - CXL + DRAM |
|---|---|---|
| 128B | 1 | 0.96 |
| 256B | 1 | 0.98 |
| 512B | 1 | 0.83 |
| 1KB | 1 | 0.91 |
UX B. Memory scale-up (vs. system memory + storage)
| Relative Bandwidth | Test A - DRAM + Storage | Test B - CXL scale-up |
|---|---|---|
| Set 4KB | 1 | 3.2 |
| Set 512KB | 1 | 253.3 |
| Get 4KB | 1 | 14.1 |
| Get 512KB | 1 | 7.1 |
We devised an experiment methodology to achieve aggregated memory bandwidth (UX C) on a heterogeneous memory system that equips both DRAM and CXL memory.
The key ideas are:
- When DRAM bandwidth is saturated by an application, we additionally allocate CXL memory to be used by the application.
- CPU/memory resources are isolated and reserved for the application.
- While running, the application triggers a sufficient memory workload on the isolated HW resources to draw the maximum bandwidth out of the memory.
We also use the following notation for a detailed explanation. Please refer to the composition of the reference testbed below.
- Dmax_bw = the maximum BW of DDR DRAM (e.g., 28GB/s)
- Cmax_bw = the maximum BW of CXL DRAM
- Duse_bw = in-use BW of DDR DRAM (e.g., < 25GB/s)
- Cuse_bw = in-use BW of CXL DRAM
- Dmax_wl = the minimum workload size for DDR BW saturation
- Cmax_wl = the minimum workload size for CXL BW saturation
- Duse_wl = in-use WL size on DDR DRAM from the running application
- Cuse_wl = in-use WL size on CXL DRAM from the running application
- Texec = application execution time
Hence, the condition to reach the peak aggregated BW is Duse_wl > Dmax_wl and Cuse_wl > Cmax_wl, held throughout Texec.
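Restating the condition with the notation above, the peak bandwidth the methodology aims for can be written as:

```latex
% Peak aggregated bandwidth on the DRAM + CXL system
BW_{agg} \approx D_{max\_bw} + C_{max\_bw}
% reached when both devices stay saturated for the whole run:
D_{use\_wl} > D_{max\_wl} \;\wedge\; C_{use\_wl} > C_{max\_wl}
\quad \text{for } 0 \le t \le T_{exec}
```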
Applying the methodology, we conducted experiments on ML/AI applications, GPT, BERT, and NASNet, to validate how the additional bandwidth from CXL memory actually helps application performance. Given the test set, the ML/AI applications show increased inference throughput (IPM, Inferences Per Minute). The following section explains the test results of a GPT-based application that generates sentences.
This experiment shows the improved throughput of the GPT2 application with SMDK's intelligent tiering.
The table below describes HW/SW testbed for the experiments.
| HW / SW | Description |
|---|---|
| CPU / Board | Prototype CPU and board system which supports PCIe Gen5 I/F and CXL |
| DRAM | Samsung 64GB DDR5 DIMM 4800MT/s |
| CXL Memory Expander | Samsung 128GB CMM-D prototype |
| OS | Ubuntu 20.04 LTS |
| SMDK | SMDK v1.5 |
GPT2 Model Configuration
- Pytorch >= 2.0, Python >= 3.10
- Model type : GPT2-base, 12-layer, 768-hidden, 12-heads, 117M parameters, batch_size 8, max-length 128
- Dataset: imdb reviews
SMDK Configuration (CXLMALLOC_CONF)
- use_exmem: true
- use_adaptive_interleaving: true
- adaptive_interleaving_policy: bw_saturation
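These settings translate to an environment roughly like the following before launching the GPT2 workload (a sketch; the library path and script name are assumptions):

```shell
export LD_PRELOAD=/usr/lib/libcxlmalloc.so   # path is an assumption
export CXLMALLOC_CONF=use_exmem:true,use_adaptive_interleaving:true,adaptive_interleaving_policy:bw_saturation
python3 gpt2_inference.py                    # hypothetical script name
```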
The figures below depict the results of the experiment on the GPT2 application.
With vanilla Linux, i.e., without adaptive interleaving, CXL DRAM was not utilized by the GPT2 application even though DRAM bandwidth was saturated.
This is because vanilla Linux does not consider the bandwidth saturation of DDR DRAM; it only cares about the available capacity of DDR DRAM.
As a result, throughput is limited to around 240 IPM, leaving the additional bandwidth of CXL DRAM unused.
Meanwhile, adaptive interleaving automatically detects when DRAM bandwidth saturation occurs by monitoring the in-use memory bandwidth of the system using the CPU PMU. Once DDR DRAM bandwidth is saturated, CXL DRAM is used by the application from that point on. As a result, throughput improves to 360 IPM, around 1.5x higher than the vanilla case.
This experiment shows the improved throughput of the GPT2 application using the optimization path, specifically by modifying PyTorch to allocate CXL memory via s_posix_memalign for specific types of pre-trained weights.
The table below describes HW/SW testbed for the experiments.
| HW / SW | Description |
|---|---|
| CPU / Board | Prototype CPU and board system which supports PCIe Gen5 I/F and CXL, logical cores: 144 |
| DRAM | Samsung 64GB DDR5 DIMM 4800MT/s |
| CXL Memory Expander | Samsung 128GB CMM-D prototype |
| OS | Ubuntu 22.04 LTS |
| SMDK | SMDK v1.5 |
GPT2 Model Configuration
- Pytorch >= 2.0, Python >= 3.10
- Model type : GPT2-large, 36-layer, 1280-hidden, 20-heads, 774M parameters, max-length 128
- Dataset: imdb reviews
SMDK Configuration (SMALLOC_CONF)
- use_auto_arena_scaling:true
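For reference, the single setting above would be exported roughly as follows (a sketch, assuming SMALLOC_CONF follows the same key:value format as CXLMALLOC_CONF):

```shell
export SMALLOC_CONF=use_auto_arena_scaling:true
```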
By using the optimized path, CPU cores are fully utilized by the GPT2 application even though DRAM bandwidth is moderately saturated. In the non-optimized path, pre-trained weights are stored in DRAM and consume its available capacity. Since the pre-trained weights of a large language model are part of the model rather than the inference data, keeping them in CXL memory lets DRAM serve more requests. In this experiment, therefore, some weights are offloaded through the optimization path to CXL DDR during weight initialization, which helps DDR serve more inference requests while keeping memory bandwidth saturated.
As a result, the throughput, which is limited to 1 (normalized) without the optimization path, increases to 1.99 (normalized) with it.
We also observed that in the optimization path the GPT2-large pre-trained weights occupy 4GB per instance. If the total cores used are increased from 96 to 144 and the CXL DDR usage from 51.32% to 90%, the performance growth factor is expected to increase from 1.99 to 2.5. Improved growth can also be expected if the per-instance weight usage increases from 4GB to 6GB. This observation is based on the GPT2 variants and their memory consumption.
Virtualization is also an important SW stack and a promising use case for CXL memory adoption. Container technology, a thin form of virtualization, is widely used in industry, so we integrated containers with the SMDK and conducted some experiments.
- Install required container runtime and Docker.
- When creating the container image of an application, an SMDK plugin needs to be included. (e.g., libcxlmalloc.so for the compatible path)
- Start Docker container as usual. No additional setting is needed when starting a container.
- When running the application, set and export required configurations according to the plugin. (e.g., LD_PRELOAD and CXLMALLOC_CONF for libcxlmalloc.so) Please refer to Compatible path section for more details.
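The steps above can be sketched as a terminal session (the image name, library path, and application command are hypothetical):

```shell
# The image already bundles the SMDK plugin (e.g., libcxlmalloc.so);
# start the container as usual, with no extra settings.
docker run -it gpt2-smdk:latest bash

# Inside the container, configure the compatible path, then run the app.
export LD_PRELOAD=/usr/lib/libcxlmalloc.so
export CXLMALLOC_CONF=use_exmem:true,exmem_size:131072,normal_size:2048,priority:normal
python3 run_inference.py
```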
The Docker container and SMDK integration worked well. Below is the list of the application containers we tested.
- ML/AI Applications
- GPT2 Inference (with python, pytorch framework)
- BERT Inference (with python, tensorflow framework)
- NASNet Inference (with python, tensorflow framework)
- DLRM Inference (with python, pytorch framework)
- In-memory Database(IMDB) Applications
- Redis
- Memcached