01 22 2020 - openucx/ucx GitHub Wiki

Participants:

Akshay Venkatesh (NVIDIA)
Alex Margolin (Huawei)
Alina Sklarevich-Karni (Mellanox)
Brad Benton (AMD)
Devendar Bureddy (Mellanox)
Dmitry Gladkov (Mellanox)
Evgeny Leksikov (Mellanox)
Gil Bloch (Mellanox)
Jeffery Kuehn (LANL)
James Dinan (NVIDIA)
Josh Ladd (Mellanox)
Ken Raffenetti (ANL)
Manjunath Gorentla (Mellanox)
Matthew Baker (ORNL)
Mike Dubman (Mellanox)
Mikhail Brinskii (Mellanox)
Naveen Ravi (HPE/Cray)
Oscar Hernandez (ORNL)
Pavel Shamis (ARM)
Peter Entschev (NVIDIA)
Peter Rudenko (Mellanox)
Sergey Oblomov (Mellanox)
Sourav Chakraborty (AMD)
Valentin Petrov (Mellanox)
Yossi Itigin (Mellanox)

Discussion on the project structure (continued)

Most of the discussion covered what is already captured on the mailing list. There were a few additional points and in-depth discussion of some case studies.

Email thread references: https://elist.ornl.gov/mailman/private/ucx-group/2020-January/001117.html

Integrated vs. Separate code

Perspective #1: UCS and UCM change fast. In the integrated approach, these changes are easy to manage; with external code, it would be challenging because external interfaces need to be managed and supported. Exposing UCS and UCM should solve the problem of the collectives requiring these utilities. However, given the history, can the UCX community define explicit interfaces for collectives?

Perspective #2 (HCOLL experience): HCOLL has been using UCS as a utility for some time. The pace at which UCS changes was not an issue, and the changes were easy to track.

Others: How do we deal with the logistics of the collectives code requiring features from UCX? We need a mechanism for placing requests with the UCX community. The programming models use UCX, define its usage, and drive its features. We need a coordinating mechanism.

Code bloat

Participants disagreed on the definition of code bloat. Perspective #1: Code is not bloat when the feature is useful. Perspective #2: It is bloat in this case, particularly for use cases that do not require collectives, or vice versa.

Assumptions/API semantics: Both the integrated and the separate approach should use well-defined interfaces. Using internal headers should not be allowed.

Use of other p2p (like MPI): Perspective #1: Supporting MPI can be done in an integrated approach. Perspective #2: It cannot be done without breaking abstraction, and it is not clean in the integrated approach.

Separate approach - Everyone agrees it can be done. Also, from the architecture perspective, there is no need to limit it to UCX (though we prefer and encourage using UCX). Other projects, such as OneCCL, are already doing it.

Software call overheads: Not an issue from either perspective.

Resource duplication: Engineering: Resource duplication was discussed at length from the internal software and engineering perspective. In most cases, it was concluded that good design should address it. We can’t recall a case where a separate library is worse than the integrated approach or vice versa. If people remember a case study, we should capture it here.

Usage: Usage experience seems to (potentially) show that resource duplication led to a negative experience. Based on experience using LCF systems with PAMI, there was skepticism that a user can determine that: collectives on these systems are complex, with multiple layers and multiple libraries supporting collectives (PAMI, SHARP, HCOLL). Further, users have access only to binaries. Without exact technical details of which resource duplication led to this negative behavior, this does not add to the discussion.

Cost of bad choice? No new point was made here.

Future support for new runtimes - No new point was made here.

Maintainership, Versioning, Schedule - No new point was made here, but there was a lot of discussion around the points described on the email list.

For the separate approach, separate releases are easier to handle.