MXL Inter-host memory sharing ‐ Draft
Disclaimer
The following text refers to a specific vendor; however, this reference is made purely for informational clarity. It should not be interpreted as an endorsement or indication of bias.
Inter-node memory sharing
For inter-node memory sharing, we intend to use libfabric, which supports host-to-host transfers very well. Although outside the technical scope of this document, device-to-host and device-to-device memory sharing are also supported by the libfabric framework.
Libfabric also appears to support CUDA well through HMEM memory registration, and it seems to support GPUDirect RDMA via dma-buf without having to rely on the NVIDIA OFED driver.
Intel GPU RDMA-like functionality seems to be available (to be confirmed with Intel's MXL member).
Memory Models
Publish/Subscribe
The current implementation of the MXL flow uses a publish/subscribe scheme. This is possible because a single buffer is shared between a publisher and one or more subscribers. One important fact to notice is that neither the publisher (FlowWriter) nor the subscribers (FlowReaders) own the buffer. Also, the intent was never to move a buffer from one memory region to another. The workflow in this scheme is:
- A flow is created (allocating memory for the buffer).
- The FlowWriter opens a grain with the OpenGrain method
- The FlowWriter fills the grain
- The FlowWriter commits the grain with the Commit method
While the FlowWriter does its job, the FlowReaders query grains with the GetGrain method. See the schematic representation below.
This memory model is possible in the case of intra-node memory sharing or intra-device memory sharing.
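As an illustration of this workflow, here is a minimal sketch of the publish/subscribe loop. The FlowWriter/FlowReader types and method names below are simplified stand-ins mirroring the OpenGrain/Commit/GetGrain steps listed above, not the actual MXL SDK signatures; the shared ring buffer is deliberately owned by neither side.

```cpp
// Illustrative stand-ins for the publish/subscribe workflow; not the real MXL SDK API.
#include <cstdint>
#include <vector>

struct Grain { std::vector<uint8_t> payload; };                        // placeholder grain
struct SharedRing { std::vector<Grain> grains; uint64_t head = 0; };   // shared buffer, owned by neither side

struct FlowWriter {
    SharedRing* ring;
    Grain& openGrain(uint64_t index) { return ring->grains[index % ring->grains.size()]; } // OpenGrain
    void commit(uint64_t index) { ring->head = index; }                                     // Commit
};

struct FlowReader {
    SharedRing const* ring;
    Grain const& getGrain(uint64_t index) const { return ring->grains[index % ring->grains.size()]; } // GetGrain
};

int main()
{
    SharedRing ring{std::vector<Grain>(8)};   // creating the flow allocates the buffer
    FlowWriter writer{&ring};
    FlowReader reader{&ring};

    Grain& g = writer.openGrain(0);           // the FlowWriter opens a grain
    g.payload.assign(1920 * 1080 * 4, 0);     // the FlowWriter fills the grain
    writer.commit(0);                         // the FlowWriter commits the grain
    (void)reader.getGrain(0);                 // meanwhile, FlowReaders query grains independently
    return 0;
}
```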
Point-to-Point push (proposal)
When we need to move a buffer from one memory region to another, we cannot rely on the same publish/subscribe scheme, because there is no central buffer that can be shared between the two processes. An additional requirement is that we should be able to move a buffer asynchronously; more specifically, we should be able to move grains as soon as they are available.
To push a buffer from one peer to another, we have two choices:
- Send/Receive Transfers: This method requires the receiver to be aware of the specific grain index that the sender intends to transmit.
- Remote Memory Writes: This approach eliminates the need for ongoing synchronization but requires an initial setup phase, including memory registration and the exchange of access credentials.
The workflow for the remote memory write scheme would be as follows:
- Targets are created
- The targets register the receiving buffer memory region
- Each target generates a targetInfo object, which should contain the setup URL (endpoint address: “{provider}://{node}:{service}”), the rkey (for remote memory access), the buffer address and the number of grains in the buffer. This is done using mxlFabricsTargetSetup().
- The targetInfo objects are then sent to the initiator by some out-of-band method.
- The initiator registers the targets using the targetInfo objects.
- The initiator can then transfer grains using TransferGrain to transfer to all targets, or TransferGrainToTarget to transfer to a specific target only. Under the hood the library will use the libfabric write method, and the address will be calculated as follows (see the sketch after this list):
- offset = grainIndex % numberOfGrains
- addr = baseAddr + offset * grainSize
- On the target side, the target can inspect the completion of a specific grainIndex with TargetGetGrain (which can be blocking or non-blocking), or it can register a callback function that will be called when a new grain is available.
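As a small illustration of the address calculation above, here is a sketch of a helper the initiator could use. Only the formula comes from this document; the function name and parameters are stand-ins for whatever the fabrics implementation uses internally (for example inside TransferGrain).

```cpp
// Illustrative helper encoding the address calculation described above.
#include <cstdint>

uint64_t remoteGrainAddress(uint64_t baseAddr, uint64_t grainIndex,
                            uint64_t numberOfGrains, uint64_t grainSize)
{
    uint64_t const offset = grainIndex % numberOfGrains; // slot in the target's ring buffer
    return baseAddr + offset * grainSize;                // absolute address for the remote write
}
```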
Our proposal introduces a new interface, referred to as the "fabrics" interface, which exposes both an initiator and a target object, along with the full set of semantics required to operate them.
Link to the fabrics API draft (https://git.ebu.io/dmf/mxl/-/blob/feature/fabrics-api/lib/include/mxl/fabrics.h?ref_type=heads)
Libfabric API
One possible implementation of this point-to-point memory sharing interface is to use libfabric. The intent is to use only a subset of the libfabric API.
Provider Selection
MXL should provide a simple call in which the user supplies a subset of parameters used to query the available providers and select the first one that matches (see the sketch after the parameter list).
Parameters:
- bind address / service (port)
- Provider type
- Transmit completion queue size
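A minimal sketch of how such a call could map onto libfabric's fi_getinfo(), assuming an RDM endpoint and using only the parameters listed above; the capability bits and the queue depth value are illustrative assumptions, and error handling is trimmed.

```cpp
// Sketch: select the first libfabric provider matching a few user-supplied hints.
#include <rdma/fabric.h>
#include <cstring>

fi_info* selectProvider(char const* node, char const* service, char const* providerName)
{
    fi_info* hints = fi_allocinfo();
    hints->ep_attr->type = FI_EP_RDM;                      // reliable datagram endpoint
    hints->caps = FI_MSG | FI_RMA;                         // messaging + remote memory writes (assumed caps)
    hints->fabric_attr->prov_name = strdup(providerName);  // e.g. "verbs", "tcp", "shm"
    hints->tx_attr->size = 1024;                           // transmit completion queue depth (assumed value)

    fi_info* info = nullptr;
    // FI_SOURCE: interpret node/service as the local bind address and port.
    if (fi_getinfo(FI_VERSION(1, 18), node, service, FI_SOURCE, hints, &info) != 0)
        info = nullptr;                                    // no matching provider
    fi_freeinfo(hints);
    return info;                                           // first match, or nullptr
}
```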
Address Vector
The address vector is a record of the destination endpoints we can operate on. MXL should:
- Add/remove libfabric addresses in the address vector; we need to book-keep the addresses we add and remove.
- Provide a function to query the libfabric address (see the sketch after this list).
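A minimal sketch of this bookkeeping on top of libfabric, assuming an already-opened address vector and endpoint; the wrapper names are illustrative and error handling is trimmed.

```cpp
// Sketch: insert/remove peer addresses in an address vector and query our own address.
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_endpoint.h>
#include <rdma/fi_cm.h>

fi_addr_t addPeer(fid_av* av, void const* peerAddr)
{
    fi_addr_t fiAddr = FI_ADDR_UNSPEC;
    fi_av_insert(av, peerAddr, 1, &fiAddr, 0, nullptr);    // book-keep the returned fi_addr_t
    return fiAddr;                                         // used as dest_addr for transfers
}

void removePeer(fid_av* av, fi_addr_t fiAddr)
{
    fi_av_remove(av, &fiAddr, 1, 0);                       // drop it from our records as well
}

size_t localAddress(fid_ep* ep, void* buf, size_t buflen)
{
    size_t len = buflen;
    fi_getname(&ep->fid, buf, &len);                       // raw libfabric address to share out-of-band
    return len;
}
```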
Operations
- MXL should provide an API that allows a user to perform remote WRITE operations on an endpoint. We should probably only expose the scatter-gather version (quite useful for audio).
- Provide functions to register/deregister buffers; the user should be able to register different types of buffers (host, CUDA, etc.).
- For completions, MXL should provide a blocking completion function (ideally with a configurable poll time and a wait delay before polling) as well as a non-blocking version (which is simply a delegate). See the sketch after this list.
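A minimal sketch of how these three operation groups could map onto libfabric, assuming an already-configured domain, endpoint and completion queue opened with FI_CQ_FORMAT_DATA. Host memory registration is shown; CUDA buffers would go through the FI_HMEM capability and fi_mr_regattr instead. Error handling is trimmed.

```cpp
// Sketch: buffer registration, scatter-gather remote write, blocking/non-blocking completion.
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_endpoint.h>
#include <rdma/fi_rma.h>
#include <rdma/fi_eq.h>
#include <sys/uio.h>

fid_mr* registerHostBuffer(fid_domain* domain, void* buf, size_t len)
{
    fid_mr* mr = nullptr;
    // FI_REMOTE_WRITE lets peers write into this region; FI_WRITE lets us write from it.
    // CUDA buffers would instead use fi_mr_regattr with iface = FI_HMEM_CUDA.
    fi_mr_reg(domain, buf, len, FI_WRITE | FI_REMOTE_WRITE, 0, 0, 0, &mr, nullptr);
    return mr;
}

// Scatter-gather remote write: e.g. several audio channel slices pushed in one call.
void writeGather(fid_ep* ep, iovec const* iov, void** desc, size_t iovCount,
                 fi_addr_t dest, uint64_t remoteAddr, uint64_t rkey, void* context)
{
    fi_writev(ep, iov, desc, iovCount, dest, remoteAddr, rkey, context);
}

// Blocking completion: wait up to timeoutMs for one entry.
bool waitForCompletion(fid_cq* cq, int timeoutMs)
{
    fi_cq_data_entry entry{};
    return fi_cq_sread(cq, &entry, 1, nullptr, timeoutMs) == 1;
}

// Non-blocking completion: a simple delegate to fi_cq_read.
bool pollCompletion(fid_cq* cq)
{
    fi_cq_data_entry entry{};
    return fi_cq_read(cq, &entry, 1) == 1;   // returns -FI_EAGAIN when nothing has completed
}
```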
Demo/Reference application
To showcase a usage example of the new API, we should provide a proxy that enables communication between the MXL flow domain and the MXL fabrics domain, in both directions.
Specifically, the demo application will allow a user to expose an MXL flow to a libfabric endpoint. Within the application, the user will be able to invoke two components: a libfabric sender and a libfabric receiver. Each sender and receiver will be capable of handling a single MXL flow.
The sender consists of an MXL Flow Reader and a Libfabric Sender endpoint, while the receiver is composed of a Libfabric Receiver endpoint and an MXL Flow Writer. The sender is responsible for transmitting an MXL flow over the libfabric link. The sender operates asynchronously, attempting to write grains through libfabric as soon as they are published.
To implement this, we have two choices of operations:
1. Tag Matching Send/Recv Operations
In this approach, the sender and receiver use libfabric’s tag-matching capabilities. Each grain is identified by a unique tag (see the sketch at the end of this option).
- Setup:
- The receiver needs to synchronize with the sender: the sender sends the receiver the last grain index that was published to its flow.
- Operation:
- The receiver posts a receive with the tag currentGrainIndex = lastGrainIndex + 1. Upon completion, it posts the next receive with currentGrainIndex + 1.
- The sender constantly checks if a new grain is available to be sent.
- Advantage:
- It plugs directly into the current MXL flow API
- Challenges:
- Risk of the sender and receiver falling out of sync.
- How does the sender send its current grain index to the receiver: gRPC? Plain TCP with a custom protocol? RTCP?
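A minimal sketch of this option, using the grain index directly as the libfabric tag so that each posted receive matches exactly one grain. It assumes an endpoint opened with the FI_TAGGED capability and that lastGrainIndex has already been exchanged out of band; the wrapper names are illustrative and error handling is trimmed.

```cpp
// Sketch: tag-matching send/recv where the tag is the grain index.
#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>
#include <rdma/fi_tagged.h>

// Sender: as soon as grain `grainIndex` is published, send it with that index as the tag.
void sendGrain(fid_ep* ep, void const* grain, size_t len, void* desc,
               fi_addr_t receiver, uint64_t grainIndex)
{
    fi_tsend(ep, grain, len, desc, receiver, /*tag=*/grainIndex, nullptr);
}

// Receiver: post the receive for the next expected grain; on completion, post index + 1.
void postNextReceive(fid_ep* ep, void* buf, size_t len, void* desc, uint64_t expectedIndex)
{
    fi_trecv(ep, buf, len, desc, FI_ADDR_UNSPEC, /*tag=*/expectedIndex, /*ignore=*/0, nullptr);
}
```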
2. Remote Memory Write Operations
In this method, the sender writes directly into the receiver's buffers using remote memory access (see the sketch at the end of this option).
- Setup: The receiver registers every buffer and shares the rkeys and addresses with the sender.
- Operation:
- The sender calculates the buffer offset using the grain index and address handles.
- The sender appends the grain index in the user data (or immediate data) so the receiver can identify and commit the written grain.
- The sender sends a write request as soon as a new grain is available.
- The receiver is not involved, but gets a notification when a new grain is available in its buffer.
- Advantage:
- The receiver is not involved in the transfer and does not require constant synchronization with the sender.
- Challenges:
- We need to extend the flow API to be able to query all the grain addresses.
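A minimal sketch of this option. It assumes the receiver registered its grain ring with FI_REMOTE_WRITE, that the provider generates remote completion entries for incoming writes (FI_RMA_EVENT), and that the receiver's completion queue uses FI_CQ_FORMAT_DATA; the wrapper names are illustrative and error handling is trimmed.

```cpp
// Sketch: remote memory write carrying the grain index as immediate data.
#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>
#include <rdma/fi_rma.h>
#include <rdma/fi_eq.h>
#include <cstdint>

// Sender side: write grain `grainIndex` into the matching slot of the receiver's ring.
void pushGrain(fid_ep* ep, void const* grain, size_t grainSize, void* desc, fi_addr_t receiver,
               uint64_t remoteBase, uint64_t rkey, uint64_t grainIndex, uint64_t numberOfGrains)
{
    uint64_t const remoteAddr = remoteBase + (grainIndex % numberOfGrains) * grainSize;
    // The immediate data carries the grain index to the receiver's completion queue.
    fi_writedata(ep, grain, grainSize, desc, /*data=*/grainIndex, receiver, remoteAddr, rkey, nullptr);
}

// Receiver side: a completion flagged FI_REMOTE_CQ_DATA tells us which grain just landed.
bool nextArrivedGrain(fid_cq* cq, uint64_t& grainIndexOut)
{
    fi_cq_data_entry entry{};
    if (fi_cq_read(cq, &entry, 1) != 1)
        return false;                          // nothing arrived yet
    if (!(entry.flags & FI_REMOTE_CQ_DATA))
        return false;                          // not a remote-write notification
    grainIndexOut = entry.data;                // grain index supplied by the sender
    return true;                               // caller can now commit this grain to its flow
}
```

Carrying the grain index as immediate data is what lets the receiver commit the grain to its local flow without any other synchronization with the sender.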
For the demo app, the second method was preferred for simplicity. Below is a diagram of the demo app.
Initial API Draft
Draft of API (https://github.com/dmf-mxl/mxl/blob/feature/fabrics-api/lib/include/mxl/fabrics.h)
Demo application (https://github.com/dmf-mxl/mxl/blob/feature/fabrics-api/tools/mxl-fabrics-sample/demo.cpp)
References
- hmem_cuda (libfabric/src/hmem_cuda.c at 66fd5a38e35b5f0cc0f044e9e99f57e1531c49e1 · ofiwg/libfabric)
- hmem (libfabric/src/hmem.c at 66fd5a38e35b5f0cc0f044e9e99f57e1531c49e1 · ofiwg/libfabric)
- Buffer Sharing and Synchronization (dma-buf) — The Linux Kernel documentation (https://docs.kernel.org/driver-api/dma-buf.html)
- GPUDirect (https://developer.download.nvidia.com/CUDA/training/cuda_webinars_GPUDirect_uva.pdf)
- CUDA inter-process communication (https://docs.nvidia.com/cuda/cuda-c-programming-guide/#interprocess-communication)