Meeting 2021 07 08 - openpmix/openpmix GitHub Wiki
07/08/2021 PMIx call notes
Attendees
- Ralph Castain (Nanook)
- Thomas Naughton (ORNL)
- Aurelien Bouteiller (UTK)
- Michael Karo (Altair)
- Samuel Gutierrez (LANL)
- Austen Lauria (IBM)
Notes
This week, Ralph fixed two older issues in prrte:
- Spawn request "could not be found" error
  - A spawn request is stored in the hotel when it is issued and removed from the hotel when it completes.
  - This avoids hangs should the spawn operation fail for some reason.
  - There is one eviction time for all rooms, so different per-request timeouts are handled at eviction: check whether the request's timeout has been met, and return the request to the hotel if it has not.
  - The code was using the prte_hotel_checkin API to return the request, but this API puts the object into a new room, not the old one, and the request still had the old room number. Thus, a subsequent retrieval attempt would generate the "not found" error.
  - Fix: add a new API to put the object back into its old room after eviction.
- Lost output from manystress test
  - Part 1: there was a race condition in the fork/exec code that could lead to thread-lock. Refactored the code to avoid this issue.
    - https://github.com/openpmix/prrte/pull/1017
  - Part 2: error in the code that retrieves the parent process ID
    - prte tracks who made each spawn request so it can report back info and let the child inherit the parent job's policies.
    - The get_attribute function returns a malloc'd pointer to the process ID, while the code accessed it as a static memory location.
    - Occasionally this would assign the last prior job that was spawned as the "parent" of the one being spawned. E.g., if Job15 came first followed by Job16, then Job16 would incorrectly be given a parent of Job15. If Job15 completed first, PRRTE would automatically terminate all of its child jobs, and Job16 would receive a SIGKILL. Depending on the race condition, this would prevent Job16 from generating its expected output.
    - Fixed by properly treating the returned value as a malloc'd location.
- These are fixed, and we feel good that PMIx and PRRTE are in good shape.
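The room-number mismatch behind the "could not be found" error can be sketched with a toy hotel. This is a minimal illustration and not the PRRTE code: every `toy_` name, the four-room size, and the timeout fields are assumptions made up for the example. It only shows why a fresh check-in can leave a request unreachable at the room number it remembers, while re-occupying the old room keeps it reachable.

```c
#include <stddef.h>

#define TOY_ROOMS 4   /* hypothetical hotel size */

/* Toy hotel: each room holds at most one occupant pointer. */
typedef struct {
    void *occupant[TOY_ROOMS];
} toy_hotel_t;

/* Toy request: remembers its room and a per-request timeout. */
typedef struct {
    int room;      /* room number the request believes it occupies */
    int elapsed;   /* eviction periods seen so far */
    int timeout;   /* give up after this many eviction periods */
} toy_request_t;

void toy_hotel_init(toy_hotel_t *h)
{
    for (int i = 0; i < TOY_ROOMS; i++) {
        h->occupant[i] = NULL;
    }
}

/* Check in: place the occupant in the FIRST free room and report it. */
int toy_hotel_checkin(toy_hotel_t *h, void *occ, int *room)
{
    for (int i = 0; i < TOY_ROOMS; i++) {
        if (NULL == h->occupant[i]) {
            h->occupant[i] = occ;
            *room = i;
            return 0;
        }
    }
    return -1;   /* hotel is full */
}

/* Evict: empty the room and hand back whoever was in it. */
void *toy_hotel_evict(toy_hotel_t *h, int room)
{
    if (room < 0 || room >= TOY_ROOMS) {
        return NULL;
    }
    void *occ = h->occupant[room];
    h->occupant[room] = NULL;
    return occ;
}

/* Re-occupy: put an evicted occupant back into the SAME room. */
int toy_hotel_reoccupy(toy_hotel_t *h, int room, void *occ)
{
    if (room < 0 || room >= TOY_ROOMS || NULL != h->occupant[room]) {
        return -1;
    }
    h->occupant[room] = occ;
    return 0;
}

/* Look up a room without removing its occupant. */
void *toy_hotel_peek(const toy_hotel_t *h, int room)
{
    return (room < 0 || room >= TOY_ROOMS) ? NULL : h->occupant[room];
}

/* Returns 1 if request b is still found at its remembered room after
 * being evicted and returned to the hotel, 0 otherwise. */
int toy_found_after_return(int use_reoccupy)
{
    toy_hotel_t h;
    toy_request_t a = { -1, 0, 1 };
    toy_request_t b = { -1, 0, 3 };
    int scratch;

    toy_hotel_init(&h);
    toy_hotel_checkin(&h, &a, &a.room);   /* a takes room 0 */
    toy_hotel_checkin(&h, &b, &b.room);   /* b takes room 1 */
    toy_hotel_evict(&h, a.room);          /* a completes; room 0 is free */

    toy_hotel_evict(&h, b.room);          /* shared eviction timer fires */
    b.elapsed++;
    if (b.elapsed < b.timeout) {          /* b's own timeout not yet met */
        if (use_reoccupy) {
            toy_hotel_reoccupy(&h, b.room, &b);   /* back into room 1 */
        } else {
            toy_hotel_checkin(&h, &b, &scratch);  /* lands in room 0 */
        }
    }
    return toy_hotel_peek(&h, b.room) == &b;
}
```

Calling `toy_found_after_return(0)` follows the check-in path, where the request lands in a different room than the one it remembers; `toy_found_after_return(1)` follows the re-occupy path.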
Another problem with the debuggers was filed.
- Seems to be some clarification required in the tools standard.
- When you start an intermediate launcher, how do you tell that launcher that you want the output of the job it launches forwarded to you?
- It's a problem in the examples, Ralph is working to fix them and check that they are all correct.
- Tools support is fine, examples just aren't quite right. Ralph will work on clarifying text for the tools chapter.
- https://github.com/openpmix/prrte/issues/1019
Aurelien having issue on Cori with large single node launch
- Works at 32 cores, times out at 68 (1 node)
- Bit of a head scratcher
- Aurelien is checking for hwloc delays or possibly DNS delays.
- The mapping is correct, and binding looks right.
- Maybe enable --prtemca state_base_verbose 5 to get timestamps on state changes
- See also: contrib/states/ for some chopper scripts to help analyze the output
Recoverable jobs issue (#8925):
- One problem was OMPI never registered a default error handler.
- The lost connection event is not allowed to go to the default error handler; it needs a specific event registration. This approach is debatable.
- Ralph backed out the fix in PMIx where PMIx_Abort() called exit(), and put the above changes into OMPI.
- Possible discussion: What should we do about default error handlers? PMIx generates the "lost connection" event - should that be allowed to go to default handlers (currently does not)? Ralph will go over events to identify those that are being restricted to event-specific handlers and send out a note.
- https://github.com/open-mpi/ompi/issues/8925
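The default-handler question above can be made concrete with a toy dispatcher. This is a sketch under stated assumptions, not PMIx's event API: the `toy_` names, the two event codes, and the `specific_only` flag are all hypothetical. It only models the behavior described in the notes, where some events are delivered solely to handlers registered for that specific event and never fall through to a default handler.

```c
#include <stddef.h>

/* Hypothetical event codes for the illustration. */
typedef enum {
    TOY_EV_PROC_ABORTED,
    TOY_EV_LOST_CONNECTION,
    TOY_EV_COUNT
} toy_event_t;

typedef void (*toy_handler_fn)(toy_event_t ev);

typedef struct {
    toy_handler_fn specific[TOY_EV_COUNT];  /* per-event handlers */
    toy_handler_fn fallback;                /* default handler */
    int specific_only[TOY_EV_COUNT];        /* nonzero: skip the default */
} toy_registry_t;

/* Counts handler invocations so the effect of dispatch is observable. */
int toy_calls = 0;

void toy_count_handler(toy_event_t ev)
{
    (void) ev;
    toy_calls++;
}

/* Start with no handlers; mark "lost connection" as specific-only,
 * mirroring the restriction described in the notes. */
void toy_registry_init(toy_registry_t *r)
{
    for (int i = 0; i < TOY_EV_COUNT; i++) {
        r->specific[i] = NULL;
        r->specific_only[i] = 0;
    }
    r->fallback = NULL;
    r->specific_only[TOY_EV_LOST_CONNECTION] = 1;
}

/* Returns 1 if some handler ran, 0 if the event was dropped. */
int toy_dispatch(toy_registry_t *r, toy_event_t ev)
{
    if (NULL != r->specific[ev]) {
        r->specific[ev](ev);          /* event-specific handler wins */
        return 1;
    }
    if (!r->specific_only[ev] && NULL != r->fallback) {
        r->fallback(ev);              /* fall through to the default */
        return 1;
    }
    return 0;                         /* restricted event, nobody asked */
}
```

With only `fallback` set, `toy_dispatch` delivers `TOY_EV_PROC_ABORTED` but drops `TOY_EV_LOST_CONNECTION`; once a specific handler is registered for the latter, it is delivered. Clearing the `specific_only` flag models the alternative under discussion, where the event would be allowed to reach default handlers.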
Status on Issue #2210
- https://github.com/openpmix/openpmix/issues/2210
- Austen to follow up further
General notes:
- Some work on Python bindings, posting bugs/fixes
- Kudos to Matt Baker!
- Ralph out next week
- Note: PMIx Tools Working Group starting
- Idea for pInspector tool
- Gathering system info
- Ralph going to start work on this
- Interested parties welcome to collaborate!