UCC Virtual F2F Meeting Information - openucx/ucx GitHub Wiki

UCC Virtual F2F Meeting (May 11-13th and May 18-19th)

Artifacts

Registration

Please fill in the form here

Agenda

Day1

Meeting Notes

Monday, May 11th, 2020

Time | Topic
7:00 - 7:30 AM PT | Kickoff and Opening Remarks (Gilad Shainer)
7:30 - 8:15 AM PT | Highlights of UCC API (Review) (Manju)
8:15 - 8:30 AM PT | Break
8:30 - 9:30 AM PT | Teams API (Manju; All/Discussion)
9:30 - 9:45 AM PT | Break
9:45 - 11:00 AM PT | Endpoints / Collective Operations (Manju; All/Discussion)

Day_1_Notes

Participants

  • Manjunath Gorentla Venkata
  • Alex Margolin
  • Sergey Lebedev
  • Valentin Petrov
  • Rami Nudelman
  • Baker, Matthew
  • Tony
  • Gilad Shainer
  • James S Dinan
  • Chambreau, Chris
  • Gil Bloch
  • Dmitry Gladkov
  • Arturo
  • Pavel Shamis
  • Ravi, Naveen
  • Raffenetti, Kenneth J.
  • Akshay Venkatesh

Discussion

  • Initialization

    • Have a flexible infrastructure for initialization and selection of library functionality
    • Discuss final options during component arch discussion
    • UCC config interface to follow UCS config. 
    • Rename ucc_config to ucc_params to reflect UCX style  
  • Context

    • Do we need a sync-model configuration option at context creation?
      • Yes, to enable RDMA-based implementations
      • Drawback: users might have to create more contexts (sync and non-sync)
        • Yes, this might require multiple objects, but not necessarily multiple resources
        • Explore an explicit device abstraction and the ability to express affinity, and propose it to the WG
  • Team Creation

    • Need to revisit endpoints (as this seems to be implementation specific) after presentation from Alex
    • Can we hide endpoints from the interface and enable an implementation-agnostic way of creating teams?
  • Collective Operations

    • Need to define the mapping from the programming model's (src, dst) to UCC's (src, dst) for cases like MPI broadcast, which has only one set of buffers.
    • Is there a need for multiple outstanding persistent collective operations of the same type? No use case yet.

Day2

Tuesday, May 12th, 2020

Time | Topic
7:00 - 7:45 AM PT | Topology Aware Collectives (Sameh)
7:45 - 8:00 AM PT | Break
8:00 - 8:45 AM PT | Collectives API - the Reactive alternative (Alex)
8:45 - 9:00 AM PT | Break
9:00 - 11:00 AM PT | Task and Plan API Discussion

Day_2_Notes

Participants

  • Manjunath Gorentla Venkata
  • Richard Graham
  • Sameh
  • Gil Bloch
  • Ravi, Naveen
  • Alex Margolin
  • Tony
  • Raffenetti, Kenneth J.
  • Sergey Lebedev
  • Rami Nudelman
  • Arturo
  • James Dinan
  • Pavel Shamis
  • Geoffroy
  • Valentin Petrov

Topology aware collectives

WG to sync with Sameh (IBM) about topology definition as we abstract topology, device, and affinity

Multiple-level API?

  • Option 1: Standardize ucc and ucc_mpi interfaces
  • Option 2: Standardize only ucc interfaces

Discussion on UCC base, UCC MPI

  • For now focus on UCC base and continue the discussion on UCC MPI in the working group 
  • Option for UCC MPI (driver) - provide as a part of UCC project (example contrib directory) 
  • (Alex correct this if needed)

Task API

Task API is useful (feedback from the WG)

  • To be considered for a later version of API (not the first version)
  • It is useful to address the use-cases that include 
    • computation + communication
    • Pipelined protocols
    • provide a use case for bundled collectives
    • Propose Task API to the working group  

Topology Information

What topology information to abstract and what to pass? 

  • Capture the distance between the processes/threads that form the team/group
  • Capture the distance between a context (resource) and devices (GPU/CPU)
  • Where should this information be passed: at team creation or at init?
  • Action item for the working group: Propose an API that covers the above requirements
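
One way to picture the requirements above is a small descriptor that records pairwise distances between team members and between a context and its devices. The sketch below is only an illustration: `TopologyInfo`, its fields, and the integer distance scale are hypothetical assumptions, not a proposed UCC interface.

```python
from dataclasses import dataclass, field


@dataclass
class TopologyInfo:
    """Hypothetical container for the distances discussed above."""
    # Distance between team members, keyed by an unordered pair of ranks.
    member_distance: dict = field(default_factory=dict)
    # Distance between a context (resource) and a device (GPU/CPU),
    # keyed by a (context, device) tuple.
    device_distance: dict = field(default_factory=dict)

    def set_member_distance(self, a, b, dist):
        self.member_distance[frozenset((a, b))] = dist

    def get_member_distance(self, a, b):
        if a == b:
            return 0  # a member is at distance zero from itself
        return self.member_distance[frozenset((a, b))]

    def nearest_device(self, context):
        # Pick the device with the smallest recorded distance to the context.
        candidates = {dev: d for (ctx, dev), d in self.device_distance.items()
                      if ctx == context}
        return min(candidates, key=candidates.get)


topo = TopologyInfo()
topo.set_member_distance(0, 1, 1)   # assumed scale: 1 = same node
topo.set_member_distance(0, 2, 3)   # assumed scale: 3 = different node
topo.device_distance[("ctx0", "gpu0")] = 1
topo.device_distance[("ctx0", "gpu1")] = 2
```

Whether such a descriptor would be passed at team creation or at init is exactly the open question recorded above; the sketch is neutral on that point.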

Endpoints

  • Endpoint in UCC is member_index in UCG
  • Move the endpoint to the team_config structure
  • Make endpoint an input 
  • If no input is provided, the library will create the endpoints and make them available via the get_attrib interface
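
The endpoint decisions above can be modeled in a few lines. This is a minimal sketch, not the UCC API: `TeamConfig`, `Team`, and the default policy of reusing the member index are assumptions made for illustration, and `get_attrib` only mimics the query named in the notes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TeamConfig:
    """Hypothetical stand-in for the team_config structure."""
    team_size: int
    my_index: int                   # position of the caller within the team
    endpoint: Optional[int] = None  # user-supplied endpoint, if any


class Team:
    def __init__(self, config: TeamConfig):
        if config.endpoint is not None:
            # The endpoint is an input: use what the caller provided.
            self._endpoint = config.endpoint
        else:
            # Assumed default policy: the library reuses the member index.
            self._endpoint = config.my_index

    def get_attrib(self, name: str):
        # Models the get_attrib query mentioned above.
        if name == "endpoint":
            return self._endpoint
        raise KeyError(name)


user_team = Team(TeamConfig(team_size=4, my_index=2, endpoint=42))
auto_team = Team(TeamConfig(team_size=4, my_index=2))
```

Here `user_team.get_attrib("endpoint")` returns the caller's value, while `auto_team` reports the library-created one.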

Day3

Wednesday, May 13th, 2020

Time | Topic
7:00 - 8:00 AM PT | GPUs/DL (NVIDIA/IBM/All)
8:00 - 8:45 AM PT | Multirail Discussion (Sergey; All)
8:45 - 9:00 AM PT | Break
9:00 - 9:30 AM PT | Algorithm Selection Models (All)
9:30 - 10:00 AM PT | Memory registration and Global Symmetric Memory (All)
10:00 - 11:00 AM PT | Document on differences and plan to converge

Day_3_Notes

Participants

  • Manjunath Gorentla Venkata
  • Sameh
  • Arturo
  • Valentin Petrov
  • Devendar Bureddy
  • Sergey Lebedev
  • Rami Nudelman
  • Alex Margolin
  • James Dinan
  • Sreeram Potluri
  • Pavel Shamis
  • Raffenetti, Kenneth
  • Geoffroy Vallee
  • Gil Bloch

UCC and GPUs / DL/AI (NVIDIA/IBM/All)

  • Goals

    • UCC should support GPU-aware MPI collectives
    • UCC should be cognizant of DL/AI requirements and should design interfaces for them
      • (participants were in consensus)
  • Relevant use cases/interfaces besides MPI and OpenSHMEM

    • Single process/thread utilizing multiple GPUs
    • Aggregate or bundled collectives - the motivation is to reduce the launch overhead.
      • A series of collectives launched
      • NCCL addresses this with ncclGroupStart/End interfaces
  • Missing abstractions from the UCC interface proposals

    • Memory type: The library should know the type of memory passed to the collective operation.
      • Host memory, device memory
      • Where to abstract this information?
        • Passing this information to the team creation operation should be enough. The user might have to create a team that is specific to a memory type.
      • Passing this information to each invocation is useful, but there is no use case yet.
      • The abstraction should support other accelerators and memory types (CUDA, ROCm, Smart NIC, DRAM, HBM)
    • Device abstraction and affinity
    • How do you handle the GPU device context?
      • Can this be abstracted onto the UCC context?
    • How do you handle CUDA streams?
  • Next steps / Questions

    • Design for missing abstractions
    • Ping AMD and IBM
    • Error handling / Managing asynchronous errors
      • More details required
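
The launch-overhead motivation for aggregate or bundled collectives, noted under the use cases above, can be illustrated with a toy model. The `Launcher` class and its `bundle()` context manager are hypothetical; they only imitate the role of a group start/end pair and do not reflect any UCC or NCCL signature.

```python
from contextlib import contextmanager


class Launcher:
    """Hypothetical launcher that counts launches to show the saving."""

    def __init__(self):
        self.launch_count = 0
        self.executed = []
        self._pending = None   # queued ops while a bundle is open

    def post(self, name):
        if self._pending is not None:
            self._pending.append(name)   # defer until the bundle closes
        else:
            self._launch([name])         # immediate: one launch per op

    def _launch(self, ops):
        self.launch_count += 1
        self.executed.extend(ops)

    @contextmanager
    def bundle(self):
        # Plays the role of a group start/end pair.
        self._pending = []
        try:
            yield self
            if self._pending:
                self._launch(self._pending)  # one launch for the series
        finally:
            self._pending = None


unbundled = Launcher()
for op in ("allreduce", "bcast", "allgather"):
    unbundled.post(op)

bundled = Launcher()
with bundled.bundle():
    for op in ("allreduce", "bcast", "allgather"):
        bundled.post(op)
```

Both launchers run the same three operations in the same order; the bundled one pays the launch cost once instead of three times.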

Multirail support

  • Goal

    • UCC should support multirail collectives (participants were in consensus)
  • Lessons from Sergey’s implementation

    • Multirail support can be implemented “easily” if basic collectives are expressed as components that can be composed to implement the UCC API.
    • Hierarchical collectives are implemented this way in XCCL
  • Missing abstractions

    • The team create operation should accept multiple UCC contexts (resources)
    • Information about the distance between the contexts (assuming contexts are mapped as one context per HCA)

Topology Infrastructure

  • The topology information is needed for multirail, UCG’s group create operation, and GPU-aware collectives

  • What topology information is needed?

    • Distance between the participants of the team in the team create operation
    • Distance between the network resources (HCAs) and the thread invoking the team create operation
    • Distance between the GPUs and thread invoking the team create operation
  • Who should implement it? UCC or an external library?

    • Can we pass this information from the external libraries (hwloc, ompi)? If so, how to abstract it?
    • Can the library implement it?
      • This is an expensive operation and a huge undertaking.
  • Next steps:

    • Prototype interfaces and work with IBM to understand the pitfalls.

Algorithm Selection

  1. HCOLL model
  2. libcoll/Intel model
  3. User query model
  4. Adaptive model

A common thread for all the models is the selection attributes. The selection attributes can include algorithm type, message range, collective implementation type (XCCL, XUCG, hardware), and more.

  • Next steps:
    • Define the selection attributes.
    • In version 1.0, keep the selection interfaces internal rather than external. Gather experience, then make them public.
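
As a rough illustration of how selection attributes could drive a choice, the sketch below keeps a table of candidates keyed by collective, implementation type, algorithm type, and message range. Everything here is an assumption for illustration: the `SelectionEntry` fields, the algorithm names, the thresholds, and the pairing of algorithms with XCCL or XUCG are made up and not taken from any of the four models.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SelectionEntry:
    """Hypothetical selection attributes for one candidate."""
    collective: str       # e.g. "allreduce"
    implementation: str   # e.g. "XCCL", "XUCG", "hardware"
    algorithm: str        # algorithm type; names here are illustrative
    min_bytes: int        # inclusive lower bound of the message range
    max_bytes: int        # exclusive upper bound of the message range


TABLE = [
    SelectionEntry("allreduce", "XUCG", "recursive_doubling", 0, 4096),
    SelectionEntry("allreduce", "XCCL", "ring", 4096, 1 << 62),
    SelectionEntry("alltoall", "XUCG", "pairwise", 0, 1 << 62),
]


def select(collective, msg_bytes, table=TABLE):
    """Return the first entry whose collective and message range match."""
    for entry in table:
        if (entry.collective == collective
                and entry.min_bytes <= msg_bytes < entry.max_bytes):
            return entry
    raise LookupError(f"no entry for {collective} at {msg_bytes} bytes")
```

A table like this could stay internal in version 1.0, as the next steps suggest, and be exposed later once the attribute set has settled.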

Day4

Monday, May 18th, 2020

Time | Topic
7:00 - 7:45 AM PT | OMPI-X / ADAPT (George Bosilca/Talk)
7:45 - 8:00 AM PT | Break
8:00 - 9:00 AM PT | Component Architecture (Review for non-WG participants) (Alex/Val/Discussion)
9:00 - 9:30 AM PT | Memory registration and symmetric memory API (Manju; All; Discussion)
9:30 - 9:45 AM PT | Break
9:45 - 10:30 AM PT | Library initialization parameters
10:30 - 11:00 AM PT | Documentation / Code Structure

Day_4_Notes

Participants

  • Manjunath Gorentla Venkata
  • George
  • Arturo
  • Valentin Petrov
  • Sergey Lebedev
  • Rami Nudelman
  • Alex Margolin
  • Pavel Shamis
  • Raffenetti, Kenneth
  • Geoffroy Vallee
  • Tony

Version 1.0 of the component architecture (from Val’s presentation)

  • Component Architecture Overview

    • Abstractions

      • Collective layer with multiple collective implementations (XCCL, XUCG, Hardware)
      • Basic collective layer with primitive collectives (p2p_collectives, SHARP)
      • P2P layer
      • Services layer
      • Resolves:

        • Addresses a majority of the component architecture requirements identified in the previous iteration, such as:
          • Avoiding circular dependencies
          • Ability to provide a thin layer over hardware collectives
      • To address:

        • Ability to share resources between multiple implementations, for example sharing p2p (or SHARP) resources between XCCL and XUCG
        • Ability to choose from multiple collective components (e.g., allreduce from XCCL and all-to-all from XUCG). Add a selection component that encompasses multiple collective implementations.
        • Ability to share and reuse code at a fine-grained level.
  • Next Steps

    • Develop fine-grained component architecture for XCCL and XUCG
    • Identify the components that can be shared
    • Identify a way to share resources between different implementations

Day5

Tuesday, May 19th, 2020 (Hackathon Mode)

Time | Topic
7:00 - 8:30 AM PT | Flesh out the component architecture
8:30 - 8:45 AM PT |
8:45 - 10:30 AM PT | Review and flesh out the spec document
10:30 - 11:00 AM PT | Next Steps

Topics

(Laundry List)

  • Kickoff (Gilad)
  • Highlights of UCC API (Review for non-WG participants) (Manju)
  • OMPI-X / ADAPT (George Bosilca/Talk)
  • Requirements from the AI Users/Deep Learning/GPUs (NVIDIA; All)
  • API Discussion (in case not completed in WG)
    • Library Initialization
    • Resource Abstraction (Contexts)
    • Teams API (Manju; All/Discussion)
    • Endpoints (Manju; All/Discussion)
    • Collective Operations (Manju; All/Discussion)
    • Task API (Manju; All/Discussion)
    • Alternative Control-path API (Initialization and communicator creation) (Alex; All/Discussion)
    • Alternative Data-path API (Starting and progressing collectives) (Alex; All/Discussion)
  • Component Architecture (Review for non-WG participants)(Alex/Val/Discussion)
  • Flesh out UCC.H Header (All)
  • Unit tests and CI infrastructure (?)
  • Documentation (doxygen ?)(?)
  • Multirail Support (Sergey)
  • Topology-aware collectives (Sameh/Talk)
  • Memory registration (Discussion)
  • Algorithm selection (Discussion)