04 15 2020 - openucx/ucx GitHub Wiki

Participants:

  • Manjunath Gorentla (Mellanox)
  • Sergey Lebedev (Mellanox)
  • Valentin Petrov (Mellanox)
  • Alex Margolin (Huawei)

Telecon

  • Trial run with Microsoft Teams for next week's call.

Resource abstraction discussion (continued)

Detailed discussion on the synchronization model for collectives

  • No synchronization on entry or exit

    • The buffer ownership transition between the user and the library is a local decision.
    • Does not wait for the completion of other processes/threads in the collective
    • The read/write window starts when the first process/thread enters the collective and continues until the last process/thread exits the collective
  • Synchronization on both entry and exit

    • The buffer ownership decision is coordinated with all other processes/threads participating in the collective
    • The read/write window starts when all processes/threads have entered the collective and continues until all processes/threads are ready to exit the collective
  • NO_SYNC_ON_ENTRY

  • NO_SYNC_ON_EXIT

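  • Preference was to start with the first two models (no synchronization on entry or exit, and synchronization on both entry and exit) and add NO_SYNC_ON_ENTRY and NO_SYNC_ON_EXIT later if we discover more optimization opportunities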
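
One way to picture the four synchronization models is as two independent flag bits. The sketch below is illustrative only and was not agreed at the meeting: the `COLL_` prefix, the helper functions, and the choice to express "no synchronization on entry or exit" as both bits set are assumptions made for this example.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits. Only the names NO_SYNC_ON_ENTRY and
 * NO_SYNC_ON_EXIT come from the notes; the prefix is an assumption. */
#define COLL_NO_SYNC_ON_ENTRY (1u << 0)
#define COLL_NO_SYNC_ON_EXIT  (1u << 1)

/* Assumed encoding of the two models preferred as a starting point. */
#define COLL_SYNC_NONE (COLL_NO_SYNC_ON_ENTRY | COLL_NO_SYNC_ON_EXIT)
#define COLL_SYNC_BOTH 0u

/* True if the read/write window opens only after all participants
 * have entered the collective. */
static bool coll_syncs_on_entry(unsigned flags)
{
    return (flags & COLL_NO_SYNC_ON_ENTRY) == 0;
}

/* True if the read/write window stays open until all participants
 * are ready to exit the collective. */
static bool coll_syncs_on_exit(unsigned flags)
{
    return (flags & COLL_NO_SYNC_ON_EXIT) == 0;
}

/* Human-readable name for a flag combination. */
static const char *coll_sync_model_name(unsigned flags)
{
    assert((flags & ~COLL_SYNC_NONE) == 0); /* only the two known bits */
    switch (flags) {
    case COLL_SYNC_BOTH:        return "sync on entry and exit";
    case COLL_NO_SYNC_ON_ENTRY: return "no sync on entry";
    case COLL_NO_SYNC_ON_EXIT:  return "no sync on exit";
    default:                    return "no sync on entry or exit";
    }
}
```

With this encoding, the two models preferred as a starting point are the all-bits-clear and all-bits-set cases, and the two deferred models are the single-bit cases.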

Should context creation be a local or a collective operation?

  • Context creation as a local operation
    • Performance advantages for systems/configurations that do not require collective operation for resource creation
  • Context creation as a collective operation
    • Useful for networks and configurations where resources are global (for example, a resource that is a combination of network and switch (group))
  • Design choices for supporting both local and collective context creation operation
    • Separate interfaces: one for collective context creation and another for local context creation
    • A single interface with OOB collective operation as a parameter
      • At the meeting, the inclination was towards using a single interface
    • Flesh out semantics for collective context creation
    • When should a user choose the local operation over the collective operation? Can we capture this as a recommendation for the user?
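
The single-interface option could look roughly like the sketch below, where passing an out-of-band (OOB) collective makes context creation collective and omitting it keeps creation local. Every name here (`ctx_params_t`, `ctx_create`, `oob_allgather_fn`, `loopback_allgather`, `CTX_MAX_PEERS`) is hypothetical, and the semantics of the collective path were still to be fleshed out, so this shows only the shape of the interface, not an agreed design.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CTX_MAX_PEERS 64 /* arbitrary bound for this sketch */

/* Hypothetical OOB allgather supplied by the user: each peer contributes
 * len bytes and receives len * n_peers bytes. Returns 0 on success. */
typedef int (*oob_allgather_fn)(const void *sendbuf, void *recvbuf,
                                size_t len, void *oob_ctx);

typedef struct {
    oob_allgather_fn allgather; /* NULL selects local creation */
    void            *oob_ctx;   /* opaque user state for the callback */
    int              rank;      /* caller's rank among the peers */
    int              n_peers;   /* number of participants */
} ctx_params_t;

typedef struct {
    int is_collective; /* 1 if created through the OOB collective */
    int n_peers;       /* peers known to this context */
} ctx_t;

/* Single entry point: local when no OOB collective is given,
 * collective otherwise. Returns 0 on success, -1 on error. */
static int ctx_create(const ctx_params_t *params, ctx_t *ctx)
{
    int ranks[CTX_MAX_PEERS];
    int my_rank;

    assert(params != NULL && ctx != NULL);
    ctx->is_collective = 0;
    ctx->n_peers       = 1;
    if (params->allgather == NULL) {
        return 0; /* local path: no coordination with other peers */
    }
    if (params->n_peers < 1 || params->n_peers > CTX_MAX_PEERS ||
        params->rank < 0 || params->rank >= params->n_peers) {
        return -1;
    }
    /* Collective path: exchange ranks as a stand-in for resource info. */
    my_rank = params->rank;
    if (params->allgather(&my_rank, ranks, sizeof(my_rank),
                          params->oob_ctx) != 0) {
        return -1;
    }
    if (ranks[params->rank] != params->rank) {
        return -1; /* exchange returned inconsistent data */
    }
    ctx->is_collective = 1;
    ctx->n_peers       = params->n_peers;
    return 0;
}

/* Single-process stand-in for an OOB allgather, for demonstration only. */
static int loopback_allgather(const void *sendbuf, void *recvbuf,
                              size_t len, void *oob_ctx)
{
    (void)oob_ctx;
    memcpy(recvbuf, sendbuf, len);
    return 0;
}
```

A caller that wants local creation leaves `allgather` as `NULL`; a caller on a system with global resources passes a real OOB collective (for example, one built on the job launcher's exchange mechanism) in its place.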

Next Meeting:

  • April 22nd
  • Potential agenda
    • Resource abstraction and affinity
    • Component architecture
    • Multifacet API