Yossi's notes - openucx/ucx GitHub Wiki

Design improvements:

  • add protection flags to mem_reg/mem_dereg, so we would be able to send from read-only memory
  • uct_completion_t should be returned from UCT rather than passed to UCT, the same way we do for UCP, because the user would have to allocate it in advance anyway.
  • modify AM callback signature to accept only the data, and obtain the descriptor by calling another function
  • refactor MM PD
  • remove sockaddr structs
  • RTE API - ucp_ep_create(worker, cb, arg) -> the callback will retrieve the address on-demand.
  • event-based API
  • Extract more IB common code
  • Move 'stats' library under 'tools'
  • Inconsistency with atomic/get bcopy API: in case the transport completes the operation immediately (e.g. mmap), it should still call the callback. This means callbacks are called from communication functions, which in turn means communication functions cannot be called from callbacks.
  • const correctness
  • Move 'perf' library under 'tools'

UCM:

  • Support mprotect()
  • Save the mmap-ed pointers in page table structure, rather than in a list
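
The page-table idea above could look roughly like the following. This is only a sketch under stated assumptions: 4 KiB pages, 48-bit virtual addresses, four levels of 9 bits each, and hypothetical names (pt_node_t, pt_insert, pt_lookup) that are not taken from the UCM code. Compared with a list, lookup cost is bounded by the number of levels rather than by the number of mappings.

```c
#include <stdint.h>
#include <stdlib.h>

/* Assumptions (not from UCX): 4 KiB pages, 48-bit virtual addresses. */
#define PT_PAGE_SHIFT 12
#define PT_LEVEL_BITS 9
#define PT_LEVELS     4                      /* 4 * 9 + 12 = 48 address bits */
#define PT_FANOUT     (1u << PT_LEVEL_BITS)

typedef struct pt_node {
    void *slot[PT_FANOUT];  /* child nodes, or user values at the last level */
} pt_node_t;

/* Index into the node at the given level (0 = root) for this address. */
static unsigned pt_index(uintptr_t addr, int level)
{
    int shift = PT_PAGE_SHIFT + PT_LEVEL_BITS * (PT_LEVELS - 1 - level);
    return (unsigned)(((uint64_t)addr >> shift) & (PT_FANOUT - 1));
}

/* Record `value` for the page containing addr. Returns 0, or -1 if out of memory. */
static int pt_insert(pt_node_t *root, uintptr_t addr, void *value)
{
    pt_node_t *node = root;
    int level;

    for (level = 0; level < PT_LEVELS - 1; ++level) {
        unsigned i = pt_index(addr, level);
        if (node->slot[i] == NULL) {
            node->slot[i] = calloc(1, sizeof(pt_node_t));
            if (node->slot[i] == NULL) {
                return -1;
            }
        }
        node = node->slot[i];
    }
    node->slot[pt_index(addr, PT_LEVELS - 1)] = value;
    return 0;
}

/* Return the value stored for the page containing addr, or NULL if none. */
static void *pt_lookup(const pt_node_t *root, uintptr_t addr)
{
    const pt_node_t *node = root;
    int level;

    for (level = 0; level < PT_LEVELS - 1 && node != NULL; ++level) {
        node = node->slot[pt_index(addr, level)];
    }
    return (node == NULL) ? NULL : node->slot[pt_index(addr, PT_LEVELS - 1)];
}
```

The root is expected to be zero-initialized (for example with calloc). A multi-page mmap region would need one pt_insert call per page, or larger entries at upper levels; neither is shown here.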

Problems:

  • Update CodeStyle with:
    • include file order
    • local variable names structure
    • number for each rule
    • add checkpatch.pl
    • space lines
  • Need protocol sync before destroying RC QP - solved by ignoring errors
  • Fail if there were warnings during test
  • Post receives only if there is active message handler (improve time with valgrind) - can't be done because we have control messages

Zero copy E2E:

  • memory hooks
  • page table
  • organize files in uct/base
  • rename pd to md
  • registration cache
  • expose memory registration performance
  • zero copy protocols
  • rndv protocols

UD:

  • Base AV
  • More efficient TX moderation
  • Common progress for verbs/accel
  • Scheduling
  • Reliability
  • Ring of control SKBs

WIP:

  • Tag matching API for UCP
  • Implement RMA on UCP
  • Create PD independently, use it to create iface (needed: uGNI PoC)
  • Performance tests for UCP
  • SIDR connection establishment
  • UCP bootstrap - use one transport to bootstrap others.
  • Add worker API
  • Implement UCT AM callback which holds reference to the message.
  • When an operation cannot be initiated, UCT would return either NO_EP_RESOURCES or NO_IFACE_RESOURCES.
  • Add more allocators for TL buffers (huge pages, mmap, ...)
  • Rename uct_lkey_t to uct_mem_region_t

API features:

  • Flags for communication: solicited event, completion,...
  • Advertise required alignment for operations and best-performance alignment for operations
  • Make sure communication can be initiated from callbacks.
  • Pass configuration to UCP_CONTEXT
  • Add timers support for async API

IB features:

  • RoCE
  • RRoCE (GID index)
  • Path Query (RDMA CM / IB CM)
  • LMC
  • Non-default P_Key index
  • SL

Usability/debug improvements:

  • In debug mode - check that EP is connected before sending
  • Log by categories/objects
  • Support custom env prefix
  • Dump statistics to shared memory / unix socket.
  • All configuration variables should begin with UCX_
  • Check for constant_tsc bit, and take CPU frequency from sysfs instead of /proc/cpuinfo.
  • Add doxygen
  • In ucx_perftest, use PMI/librte instead of MPI
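
For the constant_tsc item above, one possible shape of the check is sketched below. It assumes a Linux system where the CPU flags appear on a line starting with "flags" in /proc/cpuinfo, and where a frequency in kHz can be read from a cpufreq sysfs file such as /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq. The function names are hypothetical, and both paths are passed in as parameters rather than hard-coded.

```c
#include <stdio.h>
#include <string.h>

/* Return 1 if `flag` appears as a whole whitespace-separated token in `line`. */
static int line_has_flag(const char *line, const char *flag)
{
    size_t len = strlen(flag);
    const char *p = line;

    if (len == 0) {
        return 0;
    }
    while ((p = strstr(p, flag)) != NULL) {
        int starts = (p == line) || (p[-1] == ' ') || (p[-1] == '\t');
        int ends   = (p[len] == '\0') || (p[len] == ' ') || (p[len] == '\n');
        if (starts && ends) {
            return 1;
        }
        p += len;
    }
    return 0;
}

/* Return 1 if constant_tsc is advertised, 0 if not, -1 if the file cannot
 * be opened. A flags line longer than the buffer is only partially scanned. */
static int cpu_has_constant_tsc(const char *cpuinfo_path)
{
    char line[4096];
    int found = 0;
    FILE *f = fopen(cpuinfo_path, "r");

    if (f == NULL) {
        return -1;
    }
    while (fgets(line, sizeof(line), f) != NULL) {
        if ((strncmp(line, "flags", 5) == 0) &&
            line_has_flag(line, "constant_tsc")) {
            found = 1;
            break;
        }
    }
    fclose(f);
    return found;
}

/* Read a frequency in kHz from a sysfs file; returns -1 on failure. */
static long read_freq_khz(const char *sysfs_path)
{
    long khz = -1;
    FILE *f = fopen(sysfs_path, "r");

    if (f == NULL) {
        return -1;
    }
    if (fscanf(f, "%ld", &khz) != 1) {
        khz = -1;
    }
    fclose(f);
    return khz;
}
```

A caller would typically use the sysfs value only when cpu_has_constant_tsc() returns 1, and fall back to another clock source otherwise.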

Performance improvements:

  • Separate rx/tx progress
  • likely/unlikely
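
The likely/unlikely item refers to branch-prediction hints. A minimal sketch, assuming GCC or Clang (which provide __builtin_expect) and using hypothetical macro names, might be:

```c
#include <stddef.h>

/* Hypothetical macro names. __builtin_expect is a GCC/Clang builtin;
 * other compilers get a plain boolean with no hint. */
#if defined(__GNUC__) || defined(__clang__)
#  define my_likely(x)   __builtin_expect(!!(x), 1)
#  define my_unlikely(x) __builtin_expect(!!(x), 0)
#else
#  define my_likely(x)   (!!(x))
#  define my_unlikely(x) (!!(x))
#endif

/* Append a value to a fixed-size array. The "full" case is marked as the
 * rare path so the compiler can lay out the common path as fall-through.
 * Returns 0 on success, -1 if the array is full. */
static int ring_push(int *ring, size_t capacity, size_t *count, int value)
{
    if (my_unlikely(*count >= capacity)) {
        return -1;
    }
    ring[(*count)++] = value;
    return 0;
}
```

The hints only pay off on branches that are heavily skewed one way, such as error checks on the fast path.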

Tests:

  • Bidirectional tests
  • Performance tests with multiple nodes (e.g. pairs, all2all)
  • Performance test should take expected performance from resource capabilities.
  • Count warnings during gtest, and fail the test if they happen
  • Print warning from perftest if not running with optimal performance:
    • not in release mode
  • RTE support in gtest - maybe not needed; uGNI supports loopback.
  • AM message rate/bandwidth
  • Check capability flags in tests
  • In p2p_test, define sender_entity and receiver_entity

RC:

  • Don't use descriptor in atomic add - pass a global /dev/null buffer.
  • Use scatter-to-CQ for atomic/get replies
  • Handle SRQ watermark event.
  • Remove RC EPs from the hash table when they are removed (refcount)
  • Get rid of RC iface counters. Instead, have an array with all EPs which have pending sends. This should make the flush operation faster.
  • Avoid RX descriptor allocation when calling AM callback and it returns UCS_OK
  • Update callbacks API
  • Statistics for RC.
  • Configure all RC QP parameters.
  • Parameter checks in debug mode.
  • Log data packets in RC.
  • Allocate and fill descriptor only after making sure there are send resources.
  • Update bw/latency for transports.
  • Construct WQE with SSE.
  • Performance tests for GET and Atomics.
  • Inline sends with >1 WQEBB.
  • Use NOP for flush
  • Handle async events in IB and print full information
  • Separate parameter for send CQ size.
  • Check for send CQ resources.
  • Normalize transport names
  • GET support
  • Scatter-to-CQ for 64 bytes
  • Atomic operations
  • Avoid queuing 2 callbacks for some flows (atomic add, am zcopy)
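
The item above about replacing RC iface counters with an array of EPs that have pending sends could be sketched as follows. All type and function names here are hypothetical and the capacity is fixed for brevity; this is not the UCT RC implementation. Each EP remembers its slot in the array, so removal is a constant-time swap with the last entry, and an iface-level flush only needs to check whether the array is empty.

```c
#include <stddef.h>

#define SKETCH_MAX_PENDING_EPS 64  /* hypothetical fixed capacity */

typedef struct sketch_ep {
    unsigned outstanding;    /* sends posted but not yet completed */
    size_t   pending_index;  /* slot in iface->pending[], valid if outstanding > 0 */
} sketch_ep_t;

typedef struct sketch_iface {
    sketch_ep_t *pending[SKETCH_MAX_PENDING_EPS];
    size_t       num_pending;
} sketch_iface_t;

/* Called when a send is posted on ep. Returns 0, or -1 if the array is full. */
static int sketch_send_posted(sketch_iface_t *iface, sketch_ep_t *ep)
{
    if (ep->outstanding == 0) {
        if (iface->num_pending == SKETCH_MAX_PENDING_EPS) {
            return -1;
        }
        ep->pending_index = iface->num_pending;
        iface->pending[iface->num_pending++] = ep;
    }
    ++ep->outstanding;
    return 0;
}

/* Called when a send on ep completes; assumes ep->outstanding > 0. */
static void sketch_send_completed(sketch_iface_t *iface, sketch_ep_t *ep)
{
    if (--ep->outstanding == 0) {
        /* Swap-remove: move the last entry into the freed slot. */
        sketch_ep_t *last = iface->pending[--iface->num_pending];
        iface->pending[ep->pending_index] = last;
        last->pending_index = ep->pending_index;
    }
}

/* The iface is flushed when no EP has outstanding sends. */
static int sketch_iface_is_flushed(const sketch_iface_t *iface)
{
    return iface->num_pending == 0;
}
```

A per-EP flush would still check ep->outstanding directly; the array only speeds up the iface-wide case.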

Autoconf:

  • HAVE_TL_xx in automake can be on, even if HAVE_IB is off
  • -libverbs is added to LIBS (global)
  • -libcm is added to LIBS