Apr212016 - openpmix/openpmix GitHub Wiki

PMIx Event Notification Meeting

Date

April 21, 2016

Attendees:

David, Josh H., and George.

Minutes:

Discussed the SIAM PP16 meeting, PMIx, and talks around error recovery.

George proposed that the two key features of any fault mitigation system are consensus (agreement) and reliable broadcast for event notification (known as revoke in ULFM, for example). MPI-FT and the ULFM proposal for MPI already have these features built in, but any system that allows recovery from failures needs them, so other languages and environments will likely be interested as well. Rather than reinventing them for each system, it may be beneficial to build these features into PMIx.

Although ULFM can get by with a consensus that uses a simple AND or OR reduction on a bool, other systems would expect further options, such as a consensus that more closely resembles an allgather.
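To make the distinction concrete, here is a minimal sketch of the two flavors of consensus result. It is a single-process simulation only: the function names are hypothetical, it is not the PMIx or ULFM API, and it ignores the hard part of a real agreement protocol, which must tolerate failures while it runs.

```python
from typing import Dict, List, TypeVar

T = TypeVar("T")


def and_consensus(votes: Dict[int, bool]) -> bool:
    """Reduction-style result: one bool, the AND of every rank's vote."""
    return all(votes.values())


def or_consensus(votes: Dict[int, bool]) -> bool:
    """Reduction-style result: one bool, the OR of every rank's vote."""
    return any(votes.values())


def allgather_consensus(contributions: Dict[int, T]) -> List[T]:
    """Allgather-style result: every rank's contribution, in rank order."""
    return [contributions[rank] for rank in sorted(contributions)]


if __name__ == "__main__":
    # Hypothetical input: each rank votes on whether it can continue.
    votes = {0: True, 1: True, 2: False}
    print(and_consensus(votes))
    print(or_consensus(votes))

    # Hypothetical input: each rank reports the ranks it suspects have failed.
    suspects = {0: [3], 1: [3, 4], 2: []}
    print(allgather_consensus(suspects))
```

In the reduction case every participant agrees on a single bool. In the allgather-like case every participant agrees on the full set of individual contributions, which gives a runtime more information to act on at the cost of a larger agreed value.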

Conclusions:

  • Will continue to discuss support for failure recovery in future meetings. Will be interested to get input from others.