01 29 2020 - openucx/ucx GitHub Wiki

Participants:

  • Alex Margolin (Huawei)
  • Devendar Bureddy (Mellanox)
  • Dmitry Gladkov (Mellanox)
  • Evgeny Leksikov (Mellanox)
  • Gil Bloch (Mellanox)
  • James Dinan (NVIDIA)
  • Ken Raffenetti (ANL)
  • Manjunath Gorentla (Mellanox)
  • Pavel Shamis (ARM)
  • Sourav Chakraborty (AMD)
  • Sergey Lebedev (Mellanox)
  • Valentin Petrov (Mellanox)
  • John K (?)

Discussion on high-level features

Most of the discussion centered on the content of the slides.

  • Supporting blocking operations? Do we really need them?

    • From the implementation perspective, it is desirable to have both blocking and non-blocking operations, as this can provide low latency and optimize resource usage. Some programming models do not require non-blocking operations.
    • From the usage perspective, it can be challenging if the library blocks.
  • Why do we need hardware collectives as a first-class citizen? How does it impact the API?

    • It helps with resource allocation, sharing, and utilization.
  • Symmetric memory API

    • It seems tangential to the collectives API
    • Resources in the context of the group make sense for collectives.
    • Example use case: sharing memory between teams. In hierarchical collectives, memory can be shared between the shared-memory collectives and the p2p collectives.
  • Reproducible reductions?

    • Are the requirements different from MPI's? No, they are the same. In MPI, reproducibility is a recommendation, not a requirement. Typically, it is required by RFPs.
  • Other features, such as support for cloud-based and VM-based collectives, support for heterogeneity, collective I/O, and performance interfaces, are good to discuss, but they are longer-term requirements.

  • Fault-tolerance

    • We need to address this based on the needs of cloud apps/installations, as the MPI community will not lead the way. The exact semantics and requirements are yet to be determined.
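
To illustrate the reproducible reductions item above: floating-point addition is not associative, so a sum reduction can return different results depending on the order in which contributions are combined. The sketch below is not from the meeting. The function names and the "arrival order" model are hypothetical, and fixing the combination order by rank is only one possible way to get run-to-run reproducibility.

```python
def reduce_in_arrival_order(contributions, arrival_order):
    """Sum contributions in the order the ranks happen to arrive."""
    total = 0.0
    for rank in arrival_order:
        total += contributions[rank]
    return total


def reduce_in_rank_order(contributions, arrival_order):
    """Sum contributions in a fixed order (rank 0, 1, 2, ...),
    regardless of the order in which they arrived."""
    return reduce_in_arrival_order(contributions, sorted(arrival_order))


# One value per rank; the large values cancel, the small ones can be lost
# to rounding depending on when they are added.
contributions = [1e16, 1.0, -1e16, 1.0]

print(reduce_in_arrival_order(contributions, [0, 1, 2, 3]))  # 1.0
print(reduce_in_arrival_order(contributions, [0, 2, 1, 3]))  # 2.0
print(reduce_in_rank_order(contributions, [0, 2, 1, 3]))     # 1.0
print(reduce_in_rank_order(contributions, [3, 1, 0, 2]))     # 1.0
```

The fixed-order variant gives the same answer on every run, at the cost of not combining contributions as soon as they arrive.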

Discussion on abstractions

  • Most of the discussions were around the abstractions on the slide.

  • The missing abstractions will be discussed based on the needs of the high-level features.

  • What are the requirements from the perspective of GPU Collectives? Do we need to incorporate the notion of GPU stream/context?

    • NVIDIA to provide input.

Concerns/Questions

  • When will we release/finalize the API proposal?
    • We should have an incremental approach.
    • We need to keep all features and requirements in mind while designing the API and ensure that we have room for extensions, applying the lessons learned from the UCX experience.
    • The current approach we envision is to define the basic abstractions (as discussed in the slides), with many features manifesting as configurations of these basic abstractions. This also enables extensions for features that are not yet envisioned or not urgent. This approach is incremental.
    • Once we have defined the basic abstractions and are convinced that we have mechanisms for incrementally adding features, we should have a good release point.
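
One way to picture "features as configurations of basic abstractions" is a parameter structure with a field mask, so that new optional fields can be added later without breaking existing callers. The sketch below only illustrates that idea and is not a proposed API. `TeamField`, `TeamParams`, `resolve_team_config`, and the two example features are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from enum import IntFlag


class TeamField(IntFlag):
    """Hypothetical bits naming which optional fields the caller filled in."""
    REPRODUCIBLE = 1 << 0
    BLOCKING = 1 << 1
    # Future features would be added as new bits here.


@dataclass
class TeamParams:
    """Hypothetical creation parameters for a basic abstraction (a team)."""
    field_mask: TeamField = TeamField(0)
    reproducible: bool = False
    blocking: bool = False


def resolve_team_config(params):
    """Start from library defaults and apply only the fields that the
    caller marked as valid in field_mask."""
    config = {"reproducible": False, "blocking": False}
    if params.field_mask & TeamField.REPRODUCIBLE:
        config["reproducible"] = params.reproducible
    if params.field_mask & TeamField.BLOCKING:
        config["blocking"] = params.blocking
    return config


# The caller sets both values but marks only one as valid; the other is ignored.
params = TeamParams(field_mask=TeamField.REPRODUCIBLE,
                    reproducible=True, blocking=True)
print(resolve_team_config(params))
```

Because unmarked fields fall back to defaults, a caller written before a new field existed keeps working unchanged when that field is added.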

Next meeting: Feb 12th

  • Potential Agenda
    • Go over Alex’s slides — Hierarchical vs Reactive approach
    • Go over some of the details of abstractions