Meeting 2021 07 08 - openpmix/openpmix GitHub Wiki

07/08/2021 PMIx call notes

Attendees

  • Ralph Castain (Nanook)
  • Thomas Naughton (ORNL)
  • Aurelien Bouteiller (UTK)
  • Michael Karo (Altair)
  • Samuel Gutierrez (LANL)
  • Austen Lauria (IBM)

Notes


This week, Ralph fixed two older issues in prrte:

  • spawn request 'could not be found' error

    • a spawn request is originally stored in the hotel; when the spawn is done, the request is removed from the hotel
    • this avoids hangs should the spawn operation fail for some reason
    • there is one eviction time for all rooms, so different per-request timeouts are handled at eviction: check whether the request's own timeout has been met, and return the request to the hotel if it has not
    • the code was using the prte_hotel_checkin API to return the request, but this API puts the object into a new room, not the old one, and the request still held the old room number. Thus, a subsequent retrieval attempt would generate the "not found" error
    • fix: add a new API to put the object back into its old room after eviction
  • Lost output from manystress test

    • Part 1: there was a race condition in fork/exec code that could lead to thread-lock. Refactored code to avoid this issue. - https://github.com/openpmix/prrte/pull/1017
    • Part 2: Error in code to retrieve the parent process ID
      • prte tracks who made the spawn request so it can report info back and let the new job inherit the parent job's policies
      • the get_attribute function returns a malloc'd pointer to the process ID, while the code accessed it as a static memory location
      • occasionally this would assign the most recently spawned job as the "parent" of the one being spawned - e.g., if Job15 came first followed by Job16, then Job16 would incorrectly be given a parent of Job15. If Job15 completed first, then PRRTE would automatically terminate all its child jobs, and Job16 would receive a "sigkill" termination. Depending on the race condition, this would prevent Job16 from generating its expected output.
      • Fixed by properly treating the returned value as a malloc'd location
  • Both issues are fixed; PMIx and PRRTE now appear to be in good shape
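
The room-number bug in the first item can be illustrated with a small model. This is only a sketch of the idea, not PRRTE code: the `Hotel` class, its `recheck` method, `SpawnRequest`, and the timing values are all hypothetical names and numbers chosen for illustration, and the "first free room" policy is an assumption.

```python
class Hotel:
    """Toy model: fixed rooms, one eviction time shared by all of them."""

    def __init__(self, num_rooms, eviction_time):
        self.eviction_time = eviction_time
        self.rooms = [None] * num_rooms       # occupant per room, None if free
        self.deadlines = [None] * num_rooms   # eviction deadline per room

    def _occupy(self, room, occupant, now):
        self.rooms[room] = occupant
        self.deadlines[room] = now + self.eviction_time

    def checkin(self, occupant, now):
        """Put the occupant in the first free room and return that room number."""
        room = self.rooms.index(None)  # raises ValueError if the hotel is full
        self._occupy(room, occupant, now)
        return room

    def recheck(self, occupant, room, now):
        """Return an evicted occupant to the room it held before (hypothetical)."""
        if self.rooms[room] is not None:
            raise RuntimeError(f"room {room} is occupied")
        self._occupy(room, occupant, now)

    def checkout(self, room):
        """Remove and return the occupant; fail if the room is empty."""
        occupant = self.rooms[room]
        if occupant is None:
            raise LookupError(f"room {room}: not found")
        self.rooms[room] = None
        self.deadlines[room] = None
        return occupant

    def evict_expired(self, now):
        """Empty every room past its deadline; return (room, occupant) pairs."""
        evicted = []
        for room, occupant in enumerate(self.rooms):
            if occupant is not None and self.deadlines[room] <= now:
                evicted.append((room, occupant))
                self.rooms[room] = None
                self.deadlines[room] = None
        return evicted


class SpawnRequest:
    def __init__(self, name, timeout, now):
        self.name = name
        self.expires = now + timeout  # per-request timeout, separate from eviction
        self.room = None


def on_eviction(hotel, req, room, now, use_recheck):
    """At eviction: time the request out, or return it to the hotel."""
    if now >= req.expires:
        return "timed out"
    if use_recheck:
        hotel.recheck(req, room, now)  # same room, so req.room stays valid
    else:
        hotel.checkin(req, now)        # may pick another room; req.room goes stale
    return "requeued"


def run(use_recheck):
    hotel = Hotel(num_rooms=2, eviction_time=5)
    other = SpawnRequest("other", timeout=12, now=0)
    req = SpawnRequest("spawn", timeout=12, now=0)
    other.room = hotel.checkin(other, now=0)   # lands in room 0
    req.room = hotel.checkin(req, now=0)       # lands in room 1
    hotel.checkout(other.room)                 # "other" completes, room 0 is free
    for room, evicted in hotel.evict_expired(now=5):
        on_eviction(hotel, evicted, room, now=5, use_recheck=use_recheck)
    try:
        hotel.checkout(req.room)               # the spawn finally completes
        return "retrieved"
    except LookupError:
        return "not found"
```

In this model, `run(use_recheck=False)` re-admits the evicted request through `checkin`, which places it in room 0 while the request still records room 1, so the later lookup fails. `run(use_recheck=True)` puts it back in room 1 and the lookup succeeds.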


Another problem with the debuggers was filed.

  • Seems to be some clarification required in the tools standard.
  • When you start an intermediate launcher, how do you tell that launcher that you want the output of the job it launches forwarded to you?
  • The problem is in the examples; Ralph is working to fix them and check that they are all correct.
  • Tools support is fine, examples just aren't quite right. Ralph will work on clarifying text for the tools chapter.
  • https://github.com/openpmix/prrte/issues/1019

Aurelien is having an issue on Cori with a large single-node launch

  • works at 32 cores, times out at 68 (1 node)
  • Bit of a head scratcher
  • Aurelien is checking for hwloc delays or possibly DNS delays.
    • The mapping is correct, and binding looks right.
  • Maybe enable --prtemca state_base_verbose 5 to get timestamps on state changes
  • See also: contrib/states/ for some chopper scripts to help analyze the output

Recoverable jobs issue (#8925):

  • One problem was OMPI never registered a default error handler.
  • The lost connection event is not allowed to go to the default error handler; it needs a specific event registration. This approach is debatable.
  • Ralph removed the fix from PMIx in which PMIx_Abort() called exit(), and put the above changes into OMPI.
  • Possible discussion: What should we do about default error handlers? PMIx generates the "lost connection" event - should that be allowed to go to default handlers (currently does not)? Ralph will go over events to identify those that are being restricted to event-specific handlers and send out a note.
  • https://github.com/open-mpi/ompi/issues/8925
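
The handler question above can be pictured with a toy dispatcher. This is a sketch only and does not use the PMIx API: `EventRegistry`, the string event codes, and the rule that default handlers run after event-specific ones are all assumptions made for illustration.

```python
class EventRegistry:
    """Toy dispatcher: event-specific handlers, then default handlers."""

    def __init__(self, restricted=()):
        self.specific = {}                 # event code -> list of handlers
        self.defaults = []                 # handlers registered without codes
        self.restricted = set(restricted)  # codes never passed to default handlers

    def register(self, handler, codes=None):
        """Register for specific codes, or as a default handler if codes is None."""
        if codes is None:
            self.defaults.append(handler)
        else:
            for code in codes:
                self.specific.setdefault(code, []).append(handler)

    def notify(self, code):
        """Deliver the event and return how many handlers ran."""
        handlers = list(self.specific.get(code, []))
        if code not in self.restricted:
            handlers += self.defaults      # restricted codes skip the defaults
        for handler in handlers:
            handler(code)
        return len(handlers)


def demo():
    seen = []
    registry = EventRegistry(restricted={"LOST_CONNECTION"})
    registry.register(lambda code: seen.append(("default", code)))

    counts = [registry.notify("PROC_ABORTED")]        # default handler runs
    counts.append(registry.notify("LOST_CONNECTION")) # nobody is told

    # Only an event-specific registration receives the restricted event.
    registry.register(lambda code: seen.append(("specific", code)),
                      codes=["LOST_CONNECTION"])
    counts.append(registry.notify("LOST_CONNECTION"))
    return counts, seen
```

In this model a client that registers only a default handler never hears about the restricted event, which is the behavior under discussion. Emptying the `restricted` set corresponds to the alternative of letting such events reach default handlers.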

Status on Issue#2210


General notes:

  • Some work on Python bindings, posting bugs/fixes
    • Kudos to Matt Baker!
  • Ralph out next week
  • Note: PMIx Tools Working Group starting
  • Idea for pInspector tool
    • Gathering system info
    • Ralph going to start work on this
    • Interested parties welcome to collaborate!