2. SMDK Architecture - OpenMPDK/SMDK GitHub Wiki

2.1 High Level Architecture


The picture below shows the high-level architecture of SMDK.


image


  • The top layer of SMDK is the memory user interface. This layer consists of a CLI tool, libraries, application programming interfaces (APIs), sets of pre-built and reusable code, and the connections used to access that code.

    • The allocator library provides the compatible path and the optimization path for application integration. Using this library, system developers can easily incorporate CXL memory into existing systems without modifying existing applications. Additionally, they can rewrite application source code to achieve a higher level of optimization.
    • The CLI tool provides a unified way to query and control a CXL device and SMDK functions.
  • The middle layer of SMDK is the userspace memory tiering engine with scalable near/far memory management. This layer enables best-fit memory use cases based on the memory access pattern and footprint of applications.

  • The bottom layer of SMDK is the set of primitive logical memory views provided by the SMDK kernel. The kernel changes are geared to provide flexible memory utilization for CXL users from a system point of view.

Please note that SMDK is being developed to support SDM (Software-Defined Memory) across the full software stack.




2.2 User Interface


As its user interface, SMDK provides a memory allocator that aims to orchestrate the capacity and bandwidth of DRAM and CXL memory. The SMDK allocator is also designed to scale in performance by assigning pair(s) of lock-free DRAM/CXL pools to handle concurrent memory allocation requests. The SMDK allocator is an extension of the Jemalloc allocator and is used in both the compatible path and the optimization path.

The figure below describes the workflow of SMDK allocator.

image



The following sections describe the two paths for application integration, the compatible path and the optimization path, and the language bindings of each.

Compatible Path

  • Background - When a new memory device emerges, it has been common to provide a novel API and programming model to enable the device. However, this requires application developers and open-source community members to modify their applications' source code to adopt and use the new device properly. This approach harms the reliability of running services and increases management costs. To address this issue, SMDK offers a compatible API designed to resolve these pain points, based on the voice of the customer (VOC) in the industry.

  • SMDK enables an application to tier CXL/DDR memory without software modification. In other words, there is no need to change the API or programming model. SMDK also provides transparent memory management by inheriting and extending the Linux process/VMM design.

  • Technically, the compatible path is not an API but a way for application developers and service operators to use CXL memory without modifying their applications.

  • The compatible path provides heap segment tiering through standard heap allocation APIs and system calls such as malloc(), realloc(), and mmap(), overriding those of libc. In particular, the path includes an intelligent allocation engine that enables DDR/CXL memory tiering use cases based on memory priority, size, and bandwidth.

  • The compatible path provides not only heap segment tiering, but also user process tiering and resource isolation.
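
To make the "no source change" point concrete, here is a minimal sketch of application-side code that uses only the standard heap API. It does not use any SMDK interface. The helper name `roundtrip` is hypothetical, and the idea that an override would be picked up through the process's normal symbol lookup is an assumption about deployment rather than something stated above.

```python
import ctypes

# Resolve malloc()/free() from the libraries already loaded into this process.
# The caller never names an allocator: whichever malloc() comes first in the
# symbol search order is the one that gets called (assumption: an overriding
# allocator library would be found the same way).
libc = ctypes.CDLL(None)
libc.malloc.restype = ctypes.c_void_p
libc.malloc.argtypes = [ctypes.c_size_t]
libc.free.restype = None
libc.free.argtypes = [ctypes.c_void_p]


def roundtrip(payload: bytes) -> bytes:
    """Copy a non-empty payload into a malloc()'d buffer, read it back, free it."""
    buf = libc.malloc(len(payload))
    if not buf:
        raise MemoryError("malloc returned NULL")
    try:
        ctypes.memmove(buf, payload, len(payload))
        return ctypes.string_at(buf, len(payload))
    finally:
        libc.free(buf)
```

The code contains nothing CXL-specific, which is the property the compatible path relies on: where the buffer physically lands is decided below the malloc() call, not by the caller.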

image


Optimization Path

  • The optimization path is, literally, an API for achieving a higher level of optimization by rewriting the application.

  • The optimization API comprises the Allocation API for memory (de)allocation, the Metadata API for acquiring online memory status, and the PNM API for processing-near-memory.

image


Language Binding

The SMDK allocator provides language binding interfaces for application use. In SMDK, the compatible path and the optimization path differ in their language binding coverage.

  • Compatible Path - C, C++, Python, Java, Go

    • Python, Java, and Go runtimes design and implement their internal memory management schemes on top of primitive heap allocation methods such as malloc() and mmap(). The SMDK compatible library therefore supports language bindings for these languages.
  • Optimization Path - C, C++, Python

    • The SMDK optimization library offers proprietary APIs to implement an application with sophisticated CXL/DDR memory use.

The picture below depicts SMDK from the language binding perspective.

image



Furthermore, a CLI tool is available as the unified interface to manage CXL device(s) on SMDK. Some commands directly communicate with a CXL device, while others interact with SMDK components to control memory tiering functionalities.

image




2.3 Intelligent Tiering Engine

Background - Many memory tiering scenarios are applicable on a CXL-capable system, e.g., bandwidth/capacity aggregation, page migration, bandwidth/capacity isolation, memory ballooning, and QoS throttling.

While the needs and value of each scenario depend on the client, the CXL industry has recently focused on finding a killer application that utilizes the additional CXL capacity and bandwidth, because these are the most fundamental value of CXL memory expansion.

The Intelligent Tiering Engine is geared to support a wide range of DDR/CXL memory tiering scenarios in a compatible and optimized way. It works across user space and kernel space, depending on the tiering scenario.

As of v1.4 - Given priority, preference and OOM policy, the engine helps DDR/CXL capacity aggregation.

Since v1.5 - Given interleaving policy, the engine helps DDR/CXL bandwidth aggregation.

This engine dynamically tiers DDR/CXL memory in software, reflecting the online status of the memory system. At system boot-up, it creates a characteristics map of the online memory nodes, which is updated whenever a memory node is onlined or offlined. It then keeps track of live memory utilization with the help of CPU hardware. The ultimate purpose of the tiering engine is to make the benefits of CXL memory expansion easy to fully utilize.
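
The capacity aggregation case can be pictured with a small simulation. This is a sketch only, not the engine's implementation: the `Tier` class, the `allocate` function, and the `allow_fallback` flag are hypothetical stand-ins for the priority and OOM policy settings mentioned above, and the page counts are made up.

```python
from dataclasses import dataclass


@dataclass
class Tier:
    name: str        # e.g. "DDR" or "CXL" (illustrative labels)
    free_pages: int  # pages currently available on this tier


def allocate(tiers, npages, allow_fallback=True):
    """Place npages on tiers in priority order and return {tier name: pages}.

    tiers is ordered by priority (highest first). With allow_fallback=False
    only the highest-priority tier may be used, and the request fails with
    MemoryError once that tier is full (a stand-in for an OOM policy).
    """
    candidates = tiers if allow_fallback else tiers[:1]
    placement = {}
    remaining = npages
    for tier in candidates:
        take = min(tier.free_pages, remaining)
        if take:
            placement[tier.name] = take
            remaining -= take
    if remaining:
        raise MemoryError(f"{remaining} page(s) could not be placed")
    # Commit only after the whole request is known to fit.
    for tier in candidates:
        tier.free_pages -= placement.get(tier.name, 0)
    return placement
```

With DDR given priority and fallback allowed, a request larger than the free DDR capacity spills into CXL, so the two tiers behave as one larger pool. With fallback disabled, the same request fails instead of spilling.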

image




2.4 Kernel


SMDK introduces three interfaces - System RAM, Swap, and Cache - that let applications make use of CXL memory by inheriting and expanding the Linux VMM.

The Memory Zone/Partition, Swap, and Cache sections below explain the three interfaces respectively.


Memory Zone

Background - Historically, the Linux VMM has had a hierarchy of logical memory views, from node to zone to buddy and finally to page granularity.

image

The Linux VMM has expanded the set of zones over time to make better use of physical memory as HW and SW needs evolved, resulting in the DMA/DMA32, MOVABLE, DEVICE, NORMAL, and HIGHMEM zones.

| Linux Zone | Trigger | Description | Option |
|---|---|---|---|
| ZONE_NORMAL | Initial design | Normal addressable memory. | mandatory |
| ZONE_DMA | HW (I/O) | Addressable memory for some peripheral devices supporting limited address resolution. | configurable |
| ZONE_HIGHMEM | HW (CPU) | A memory area that is only addressable by the kernel through mapping portions into its own address space. This is, for example, used by i386 to allow the kernel to address memory beyond 900MB. | configurable |
| ZONE_DEVICE | HW (I/O) | Offers paging and mmap for device-driver-identified physical address ranges (e.g., pmem, hmm, p2pdma). | configurable |
| ZONE_MOVABLE | SW | Similar to ZONE_NORMAL, except that it contains movable pages. The main use case for ZONE_MOVABLE is to make memory offlining more likely to succeed. | configurable |

SMDK uses ZONE_MOVABLE to manage CXL memory because memory in ZONE_MOVABLE is easier to offline than memory in other zones, which suits the composability and pluggability of CXL memory devices.


Memory Partition

Since CXL v1.1 systems, the industry has treated a single HDM range of CXL memory as a single logical memory-only node in the Linux system. Hence, "CXL Memory : Node = 1 : 1", which is the "Single Node" status shown in the picture below.

However, we think this approach has drawbacks, in that service operators or high-level application developers have to be aware of the existence of CXL memory and manually control it themselves using 3rd party tools like numactl and libnuma. The more CXL memories a system has, the more management effort is needed.

In addition, it is not compatible with the traditional Linux approach, which presents an array of physical DRAM as one logical NUMA node.

From CXL v2.0 systems onward, a H/W interleaving method is supported, which allows multiple CXL devices to be interleaved with the help of the system/BIOS and the device.

SMDK provides not only this H/W method but also a more flexible S/W interleaving method (cf. H/W and S/W RAID).

We suggest an abstraction layer that hides this complexity and minimizes the effort by grouping multiple CXL devices into a logical partition in a user-configurable manner.

  • The node partition represents the CXL memories on the system grouped into a memory node.

  • The noop partition represents each CXL memory on the system as a separate memory node.

  • The n-way partition represents the specified number of CXL memories on the system grouped into a memory node.

These partitions, except noop, support SW interleaving of the multiple CXL memories in the partition. As a result, bandwidth is aggregated across the connected CXL devices.

In addition, we assume CXL memory can be used not only through a memory interface but also through a device interface such as DAX. SMDK allows both.
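
The three partition types can be sketched as a grouping function over device names. This is an illustration under stated assumptions, not SMDK code: the function name `partition`, the mode strings, and the `ways` parameter are hypothetical, the node partition is read here as placing all listed devices into one node, and the n-way partition is read as forming groups of `ways` devices in listing order.

```python
def partition(devices, mode, ways=None):
    """Return a list of groups; each group stands for one logical memory node.

    devices: list of device names, e.g. ["cxl0", "cxl1"] (illustrative).
    mode:    "node", "noop", or "n-way" (hypothetical labels).
    ways:    group size, used only by "n-way".
    """
    if mode == "node":
        # Assumption: every listed device is grouped into a single node.
        return [list(devices)] if devices else []
    if mode == "noop":
        # No grouping: each device is its own memory node.
        return [[dev] for dev in devices]
    if mode == "n-way":
        if not ways or ways < 1:
            raise ValueError("n-way partition needs ways >= 1")
        # Assumption: consecutive devices are grouped 'ways' at a time.
        return [list(devices[i:i + ways]) for i in range(0, len(devices), ways)]
    raise ValueError(f"unknown partition mode: {mode!r}")
```

Under this reading, four devices yield one node in "node" mode, four nodes in "noop" mode, and two nodes in "n-way" mode with `ways=2`. Only groups with more than one member have anything to interleave across, which matches the exception made for noop above.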

image

The figure below further depicts the SMDK node partition with multiple CXL memories.

A single CXL memory is managed as a sub-zone that possesses its own buddy list, while multiple CXL memories are grouped into a so-called super-zone. When a memory request comes from a thread context, the CXL super-zone assigns the proper number of pages, in order, out of the managed sub-zone array. In this way, the operation aggregates bandwidth across the connected CXL memory array.

Without the super-zone design, memory requests from a process are not balanced, but concentrated on a single CXL device. This is because a zone's buddy list is composed in a sequential order of connected CXL devices.
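
The difference between the two behaviors can be sketched with a toy model. Both functions below are hypothetical and only count how many pages each device would serve. The one-page rotation granularity in `superzone_alloc` is an assumption, since the text only says that pages are assigned in order out of the sub-zone array.

```python
def sequential_alloc(subzone_free, npages):
    """Baseline: one buddy list ordered by device, so the first device fills first."""
    counts = {}
    remaining = npages
    for dev, free in subzone_free.items():
        take = min(free, remaining)
        if take:
            counts[dev] = take
            remaining -= take
    if remaining:
        raise MemoryError("not enough free pages")
    return counts


def superzone_alloc(subzone_free, npages):
    """Super-zone sketch: rotate across sub-zones, one page per turn (assumed)."""
    counts = {dev: 0 for dev in subzone_free}
    remaining = npages
    while remaining > 0:
        progressed = False
        for dev, free in subzone_free.items():
            if remaining > 0 and counts[dev] < free:
                counts[dev] += 1
                remaining -= 1
                progressed = True
        if not progressed:
            raise MemoryError("all sub-zones exhausted")
    return {dev: n for dev, n in counts.items() if n}
```

For four devices with 1024 free pages each, a 512-page request lands entirely on the first device in the sequential model, but is split 128 pages per device in the rotating model, so accesses to those pages can draw on all four devices' bandwidth.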

image


Swap

Background - Swap has typically been used to resolve memory pressure in a system. When the PFRA (Page Frame Reclaiming Algorithm) runs, victim pages are stored in the target swap device or file (swap-out) and are later moved back into the VM space (swap-in). Usually a disk device is selected as the swap device because it provides data persistency, but a DRAM device is also used to store volatile swap data by (de)compressing the pages being swapped (e.g., zSwap).

zSwap is designed to logically expand DRAM density with (de)compression technology using host CPU cycles, and it has been widely used so far. One of the CXL philosophies, however, is to adopt a larger amount of physical memory in a system. This implies that CPU cycles can be spent on more valuable work and possibly no longer need to be consumed compressing swapped data.

Based on these architectural thoughts, SMDK provides a CXL Swap interface that allows application software to use a CXL memory as a swap device for volatile swap data.

image


Cache

Background - When memory pressure occurs in Linux, a victim page is selected by the PFRA. If the victim is a clean file-backed page, it is removed from the VFS pagecache and then inserted into the free page list. Later, when the evicted page is referenced again, it causes disk I/O, which degrades performance on file I/O workloads.

CXL Cache is a 2nd-level page cache with pluggable and page-granularity attributes that stores clean file-backed pages. With CXL Cache, a file-backed page is looked up in the following memory order - pagecache (near), cxlcache (far), disk (farthest). CXL Cache allows one cache pool per FS mount point, so the pools are scalable and isolated from each other. CXL Cache is an ephemeral cache since it stores only clean pages, never dirty pages. Thus, a CXL memory device used for CXL Cache is pluggable. To increase the hit rate, CXL Cache provides inclusive policies that configure how the cached data is maintained.
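
The near/far/farthest lookup order can be sketched with three dictionaries. This is a toy model, not the kernel implementation: the class `TieredFileCache` and its method names are hypothetical, and keeping a page in the CXL cache after a hit is one possible reading of an inclusive policy.

```python
class TieredFileCache:
    """Toy model of pagecache (near) -> cxlcache (far) -> disk (farthest)."""

    def __init__(self, disk):
        self.pagecache = {}   # near: first-level page cache
        self.cxlcache = {}    # far: second-level cache holding clean pages only
        self.disk = disk      # farthest: backing store, always has the data
        self.disk_reads = 0   # counts how often disk I/O was needed

    def read(self, key):
        if key in self.pagecache:
            return self.pagecache[key]
        if key in self.cxlcache:
            data = self.cxlcache[key]   # far hit: no disk I/O
        else:
            data = self.disk[key]       # miss in both caches
            self.disk_reads += 1
        self.pagecache[key] = data
        return data

    def evict_clean(self, key):
        """Reclaim a clean page: keep it in the CXL cache instead of dropping it."""
        self.cxlcache[key] = self.pagecache.pop(key)

    def unplug_cxl(self):
        """Losing the CXL cache is safe because it never holds dirty data."""
        self.cxlcache.clear()
```

A page that is read, evicted, and read again costs one disk read in this model instead of two. After `unplug_cxl()`, the next read simply falls through to disk, which is why a device backing the cache can be removed without data loss.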

image


Device Driver

SMDK extends the CXL/DAX device driver to manage CXL metadata, such as the CXL node id and memory range, according to the memory partition configuration, and allows the DAX device driver to reflect this information when memory is onlined or offlined. The SMDK device driver registers a notifier for memory node on/offline events to update the CXL metadata based on the on/offline status of the CXL memory device.

Please note that the metadata is mutable, reflecting the grouping status of a CXL memory, configurable via SMDK plugins.

SMDK exports a set of sysfs entries to report the static and dynamic information of CXL device(s). Specifically, /sys/kernel/cxl/devices/cxlX exports the address, size, and socket location of CXL device X, while /sys/kernel/cxl/nodes/nodeY reports the CXL device(s) currently in memory node Y.
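
A minimal sketch of how a tool might enumerate these entries is shown below. Only the two directory paths come from the text above. The function name `list_cxl_entries` is hypothetical, and because the names of the attribute files inside each entry are not given here, the sketch lists entry names only and does not read any attributes.

```python
from pathlib import Path


def list_cxl_entries(root="/sys/kernel/cxl"):
    """Return the entry names found under <root>/devices and <root>/nodes.

    Missing directories (for example on a kernel without these sysfs
    entries) are reported as empty lists rather than raising.
    """
    root = Path(root)
    result = {}
    for sub in ("devices", "nodes"):
        directory = root / sub
        if directory.is_dir():
            result[sub] = sorted(entry.name for entry in directory.iterdir())
        else:
            result[sub] = []
    return result
```

Because the root is a parameter, the same function can be pointed at a scratch directory for testing. On a system that lacks these entries it returns two empty lists.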

image
