Meeting 2021 09 30 - openpmix/openpmix GitHub Wiki

09/30/2021 PMIx call notes

Attendees

  • Michael Karo (Altair)
  • Ralph Castain (Nanook)
  • Howard Pritchard (LANL)
  • Matt Baker (ORNL)
  • Brian Barrett (Amazon)
  • Thomas Naughton (ORNL)

Notes

  • PMIx/PRRTE rc1 release candidates

    • Some fixes being picked up, e.g., an ancient hwloc version tripped things up (fixed)
  • Review recent changes (drop pandoc, return "partial success" on collectives)

    • Moved the man page source to an external repo; only the generated nroff files are kept in the repo
    • Removes the problem of requiring pandoc for building tarballs, etc.
    • A shift from pandoc to Sphinx is also possible in the future, so this likely avoids pushing that problem onto others later
    • When shifting to Sphinx (ReadTheDocs), may want to follow the approach used with Libfabric
  • PMIx updates have included a few fixes, e.g., for resource leaks

  • Previous question about incomplete fence

    • The local PMIx server aggregates its local participants before passing the operation up to the host

    • This was driven in part by RMgrs, to avoid excessive pings from lots of local processes.

    • However, this raises a question about the error path if one of them fails (what do you do: wait forever?). In practice, the RMgr will generally detect the failure and deal with it by killing the job. In ULFM or other FT-related scenarios, however, you do not have that luxury: you need to allow the collective to complete and then notify the host that everyone who could complete has done so.

    • This worked in the past, but had been broken.

    • When the operation did return, it returned SUCCESS, which was a bit of a misnomer because some participants may have failed while only a subset completed.

    • So a "partially completed" return code (PMIX_ERR_PARTIAL_SUCCESS) was created to indicate this scenario.

    • This change was made in Fence but not in the group collectives. Not sure if this is correct for group collectives.

    • Operations currently changed to return PMIX_ERR_PARTIAL_SUCCESS:

      • Fence
      • Connect
      • Disconnect
    • Changes included in rc2

    • Updated OMPI master and v5.0.x

    • Looks like there may be an issue from Aurelien about not returning a slot on failure (i.e., a resource leak). Possibly related to the current partial-completion change

  • Issue with skew on a few envvars; need to sync with the OMPI side

    • XXX: I MISSED GENERAL DESCRIPTION FOR ISSUE
    • TODO: Michael to add OMPI issue
    • TODO: Add Issue# here
  • Testing of OMPI/PRRTE side looking good

  • Appear not to be getting Coverity output again (from prte/pmix)

    • In the future, something like the Clang static analyzer could be a useful way of catching these things at dist time
    • xlc warnings will be a big item
  • PMIx Python deadlock / GIL management

    • Issue#2306 https://github.com/openpmix/openpmix/issues/2306
    • Add an attribute on the Cython-generated code to avoid holding the Python global interpreter lock (GIL), which fixes the problem of callbacks hanging due to lock interleaving.
    • Would be good to have this in upstream/release to get more testing
    • Plan to have PR ready soon (est. this week-ish)
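
As a rough illustration of the incomplete-fence discussion above, the following Python sketch models a local server that aggregates its local participants and reports one result upward, returning a partial-success status when only a subset completes. It does not use the PMIx API: `Status`, `local_fence`, and `handle_fence` are hypothetical names made up for this sketch, `Status.PARTIAL_SUCCESS` merely stands in for PMIX_ERR_PARTIAL_SUCCESS, and the caller-side policy is an assumption, not a description of what PMIx, PRRTE, or OMPI actually do.

```python
from enum import Enum


class Status(Enum):
    SUCCESS = 0
    # Stand-in for PMIX_ERR_PARTIAL_SUCCESS (the real constant's value is not used here).
    PARTIAL_SUCCESS = 1


def local_fence(local_ranks, failed_ranks):
    """Model one local server aggregating its local participants.

    Returns a single (status, completed_ranks) result, i.e. one report
    to the host instead of one per local process.
    """
    completed = sorted(r for r in local_ranks if r not in failed_ranks)
    if len(completed) == len(local_ranks):
        return Status.SUCCESS, completed
    # Some participants failed: report what did complete instead of
    # waiting forever or claiming full success.
    return Status.PARTIAL_SUCCESS, completed


def handle_fence(status, completed, fault_tolerant):
    """Hypothetical caller-side policy for the two outcomes."""
    if status is Status.SUCCESS:
        return "continue"
    if fault_tolerant:
        # ULFM-style case: carry on with the subset that completed.
        return f"continue with survivors {completed}"
    # Non-FT case: the resource manager would normally kill the job.
    return "abort"


if __name__ == "__main__":
    status, done = local_fence([0, 1, 2, 3], failed_ranks={2})
    print(status.name, done)
    print(handle_fence(status, done, fault_tolerant=True))
```

The point of the distinct status is that a fault-tolerant caller can tell "everyone arrived" apart from "only a subset arrived", which a plain SUCCESS return could not express.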