Meeting 2019 11 14 - openpmix/openpmix GitHub Wiki

Assorted Meeting Notes:

  • Next meeting Dec. 5
  • SC'19 BoF
  • GitHub Actions - Mellanox CI
    • Some new PRs ensure it triggers on both PRs and final commits
    • "CI / mlnx (pull_request)" - see the "Checks" tab for details
    • When it runs on 'master', it would be nice to have a "badge" we can put on the front page
    • Mellanox is refining the tests further
  • Webhooks
    • Travis and Signed-off-by checks
  • v3.2 rebranch
    • No update yet; will try to get to it over the next week or so.
    • v3.1 is up to date
    • Target a v3.2.0 release in early/mid-December
  • New OpenPMIx issues:
    • https://github.com/openpmix/openpmix/issues/1544
      • Code calls PMIx_Get with PMIX_RANK_UNDEF instead of a specific rank
      • This likely triggers a dmodex request into JSM that JSM may not handle.
      • It is unclear why this makes an upcall into JSM if the data is being fully collected.
  • Memory leak in dstore (tested against v3.1.4)
    • Reported 8 MB leak with every job launch (via JSM job steps in same allocation)
      • Happens when chaining the job steps (one starts just before last completes)
      • dstore keeps the prior job's data (in case it is needed) but never cleans it up, so chaining job steps causes a steadily growing memory leak.
      • How should we address this?
    • deregister_nspace is being called, but it notices that other namespaces exist in the same 'session', so it won't delete them.
      • It assumes that anything else running in the session is related.
      • In this case, they are actually different jobs.
      • The intent was not to clean up the namespace until the last local process in that namespace completes. This lets remote processes keep accessing the data on this node while a local process from that namespace is still running.
    • This can likely be reproduced under PRRTE:
      • Start job A ; sleep 10 ; start job B ; job A finishes ; sleep 10 ; ...
    • Dave & Ralph to work on a fix
      • Target a bug fix in the v3.1.5 release
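The retention problem in the dstore item can be illustrated with a small toy model in C. This is a hypothetical sketch, not the actual dstore code: nspace_t, session_t, register_nspace, deregister_session_based, deregister_per_nspace and chain_jobs are invented names, and deregistration is assumed to be called once the last local process of a namespace has completed. It contrasts the session-based check described in the notes with one possible per-namespace release policy, while chaining job steps the way the PRRTE reproducer does.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model of the dstore retention logic described above.
 * Names and structure are illustrative only, not the real PMIx code. */
#define MAX_NSPACES 64

typedef struct {
    bool registered; /* job step still running on this node */
    bool has_data;   /* dstore still holds this job's data (~8 MB each) */
} nspace_t;

typedef struct {
    nspace_t ns[MAX_NSPACES];
    int count;
} session_t;

static int register_nspace(session_t *s) {
    assert(s->count < MAX_NSPACES);
    int idx = s->count++;
    s->ns[idx].registered = true;
    s->ns[idx].has_data = true;
    return idx;
}

/* True if any namespace other than idx is still registered. */
static bool other_registered(const session_t *s, int idx) {
    for (int i = 0; i < s->count; i++) {
        if (i != idx && s->ns[i].registered) {
            return true;
        }
    }
    return false;
}

/* Behavior as described in the notes: assume everything in the session is
 * related, so keep the data while any other namespace is live. Retained data
 * is dropped only when no namespace in the session is still registered. */
static void deregister_session_based(session_t *s, int idx) {
    s->ns[idx].registered = false;
    if (other_registered(s, idx)) {
        return;
    }
    for (int i = 0; i < s->count; i++) {
        s->ns[i].has_data = false;
    }
}

/* One possible fix (an assumption, not the agreed design): release a
 * namespace's data once its own last local process has completed,
 * regardless of other jobs in the session. */
static void deregister_per_nspace(session_t *s, int idx) {
    s->ns[idx].registered = false;
    s->ns[idx].has_data = false;
}

/* Chain njobs steps: each step starts just before the previous one finishes
 * ("start A; start B; A finishes; ..."). Returns how many namespaces still
 * have data held at the end. */
static int chain_jobs(void (*deregister)(session_t *, int), int njobs) {
    session_t s = {0};
    int prev = register_nspace(&s);
    for (int j = 1; j < njobs; j++) {
        int next = register_nspace(&s); /* next step starts...          */
        deregister(&s, prev);           /* ...just before prev finishes */
        prev = next;
    }
    int held = 0;
    for (int i = 0; i < s.count; i++) {
        held += s.ns[i].has_data ? 1 : 0;
    }
    return held;
}
```

In this model, chain_jobs(deregister_session_based, 10) ends with all 10 namespaces' data still held, because some other step is always live when a step deregisters. chain_jobs(deregister_per_nspace, 10) holds only the last, still-running step. A single unchained job is released under either policy, which is consistent with the leak appearing only when job steps overlap.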