WeeklyTelcon_20160517 - ICLDisco/ompi GitHub Wiki
Open MPI Weekly Telcon
Dialup Info: (Do not post to public mailing list or public wiki)
Attendees
Geoff Paulsen
Jeff Squyres
Brad Benton
Howard
Josh Hursey
Joshua Ladd
Nathan Hjelm
Ralph
Sylvain Jeaugey
Todd Kordenbrock
Agenda
Review 1.10
Milestones: https://github.com/open-mpi/ompi-release/milestones/v1.10.3
1161 - Open IB Error Path - Gilles asked Mike to review; now in 2nd iteration.
Joshua Ladd tagged on 2.x version.
1150 - 2 places in Init and 1 in Finalize where we do RTE Barrier.
If launched with mpirun, it works just fine.
But direct launch will hang under Cray or Slurm PMIx because those have blocking RTE barriers, and those don't progress.
Patched it in master with MPI Barrier to make other things progress.
Will need to block 2.0.x for this fix also. Ralph will create PR.
Once these get in, do another RC and move this out.
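The hang described for 1150 can be illustrated with a toy model. This is only a sketch, not Open MPI or PMIx code: `ProgressEngine`, `ToyBarrier`, and both barrier functions are hypothetical names, and the assumption is that barrier completion depends on events that only run when someone calls the progress function.

```python
from collections import deque


class ProgressEngine:
    """Toy stand-in for a communication progress engine (hypothetical)."""

    def __init__(self):
        self.pending = deque()

    def post(self, callback):
        self.pending.append(callback)

    def progress(self):
        """Run one pending event; return True if something ran."""
        if not self.pending:
            return False
        self.pending.popleft()()
        return True


class ToyBarrier:
    """Barrier whose arrivals are delivered only through progress()."""

    def __init__(self, engine, nprocs):
        self.engine = engine
        self.nprocs = nprocs
        self.arrived = 0

    def start(self):
        for _ in range(self.nprocs):
            self.engine.post(self._arrive)

    def _arrive(self):
        self.arrived += 1

    def done(self):
        return self.arrived == self.nprocs


def barrier_without_progress(barrier, max_spins=1000):
    """Blocking wait that never drives progress: it cannot complete."""
    barrier.start()
    for _ in range(max_spins):
        if barrier.done():
            return True
    return False


def barrier_with_progress(engine, barrier, max_spins=1000):
    """Wait that drives progress on every spin; returns the spin count."""
    barrier.start()
    spins = 0
    while not barrier.done():
        if spins >= max_spins:
            raise RuntimeError("barrier did not complete")
        engine.progress()
        spins += 1
    return spins
```

In this model the first function gives up after `max_spins` because nothing ever delivers the arrivals, while the second completes as soon as the posted events have been run.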
Review 2.0.x
Wiki: https://github.com/open-mpi/ompi/wiki/Releasev20
Blocker Issues: https://github.com/open-mpi/ompi/issues?utf8=%E2%9C%93&q=is%3Aopen+milestone%3Av2.0.0+label%3Ablocker
Milestones: https://github.com/open-mpi/ompi-release/milestones/v2.0.0
PMIx barrier
Nathan will review 1164.
PR 1673: the multi-threaded issues that George ran into are a doozie.
Free path in C++: one thread was in the dereg hooks in a delete.
Another thread was trying to allocate space and triggering internal garbage collection.
Classic deadlock.
Nathan reworked the rcache / mpool code to not hold lock while doing deletes.
All locks are always on in RDMA because no way around it.
Last rcache bug: if you had > 100 registrations associated with a memory registration being munmapped, it ran into an infinite loop.
Nathan and George testing.
IBM will do some multi-threaded testing as well.
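The "do not hold the lock while doing deletes" approach mentioned above for the rcache / mpool rework can be sketched in general terms. This is an illustrative pattern only, not the actual rcache code: `ToyRegistrationCache` and its methods are hypothetical, and the lock is deliberately non-reentrant so that a delete hook needing the lock would deadlock if it ran while the lock was held.

```python
import threading


class ToyRegistrationCache:
    """Hypothetical cache that runs delete hooks outside its lock."""

    def __init__(self):
        self._lock = threading.Lock()  # non-reentrant on purpose
        self._entries = {}
        self.deregistered = []

    def register(self, key, on_delete):
        with self._lock:
            self._entries[key] = on_delete

    def size(self):
        with self._lock:
            return len(self._entries)

    def evict_all(self):
        # Step 1: detach the entries while holding the lock.
        with self._lock:
            victims = list(self._entries.items())
            self._entries.clear()
        # Step 2: run the delete hooks with the lock released, so a hook
        # that re-enters the cache cannot block on a lock we still hold.
        for key, on_delete in victims:
            on_delete(key)
            self.deregistered.append(key)
```

A hook that calls `size()` from inside `evict_all()` works here; it would block forever if the hooks were invoked inside the `with self._lock:` block instead.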
PowerPC issues as well. Nathan had to revise table a bit.
ppc64le: if you do a dlsym, the pointer is into the table of contents (TOC): 1 is real address.
The problem is that the TOC is getting patched.
When patching, need to patch the real function, not the other.
ppc64BE - may still
1162 - multiple threads make same endpoint simultaneously.
Nathan thought he handled that case.
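For readers unfamiliar with this class of bug, a generic way to make "create the endpoint if it does not exist" safe under concurrent callers is to do the lookup and the creation under one lock. The sketch below is not the Open MPI fix for 1162; `ToyEndpointTable` and `race` are hypothetical names used only for illustration.

```python
import threading


class ToyEndpointTable:
    """Hypothetical endpoint table with a locked get-or-create."""

    def __init__(self):
        self._lock = threading.Lock()
        self._endpoints = {}
        self.created = 0

    def get_or_create(self, peer):
        # Lookup and creation happen under the same lock, so two threads
        # asking for the same peer cannot both create an endpoint.
        with self._lock:
            endpoint = self._endpoints.get(peer)
            if endpoint is None:
                endpoint = {"peer": peer}
                self._endpoints[peer] = endpoint
                self.created += 1
            return endpoint


def race(table, peer, nthreads=8):
    """Have nthreads request the same endpoint at (nearly) the same time."""
    results = [None] * nthreads
    start = threading.Barrier(nthreads)

    def worker(index):
        start.wait()
        results[index] = table.get_or_create(peer)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(nthreads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Every thread in `race` gets back the same endpoint object, and `created` stays at 1.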
One thing we forgot to do for 2.0.0rc2: send it to users-alias. Will do for rc3.
Put announcement about Migration guide into Announcement list.
IBM trying to ramp up MTT testing. Hopefully will have Power8 XL compiler testing soon.
Some issues passing certain flags to XL compilers. Josh Hursey is working on it.
Cisco / Intercomm create failures.
Getbyte offset test requires v2.0.0 or greater and spins until timeout on 1.10.
2nd month of RED. Can't seem to break out of it.
IBM wants to get Jenkins on Power8LE enabled this week. Looks like got correct permission, using the polling method.
If people push quickly and there are multiple pushes within a polling interval, it'll just pick up the last.
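The polling behavior noted above can be modeled in a few lines. This is a toy simulation under stated assumptions, not Jenkins code: `builds_triggered` is a hypothetical function, `push_times` must be sorted by time, and the poller is assumed to look only at the newest commit each time it wakes up.

```python
def builds_triggered(push_times, poll_interval, horizon):
    """Return the commits built by a poller that only checks the newest head.

    push_times: list of (time, commit) pairs sorted by time.
    poll_interval: time between polls; the first poll is at poll_interval.
    horizon: last time at which a poll may happen.
    """
    built = []
    last_seen = None
    now = poll_interval
    while now <= horizon:
        visible = [commit for when, commit in push_times if when <= now]
        head = visible[-1] if visible else None
        if head is not None and head != last_seen:
            built.append(head)
            last_seen = head
        now += poll_interval
    return built
```

With pushes `a`, `b`, `c` at times 1, 2, 3 and `d` at time 12, polling every 10 time units builds only `c` and `d`; the intermediate pushes `a` and `b` never get their own run.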
Jenkins servers have been hanging / restarting lately.
Howard saw that there was a cron job doing automated updates of Jenkins. Last Wednesday Jenkins was updated with a security fix, but that broke a lot of the GitHub integration.
Pull Request 1650 still causing red X on Mellanox Jenkins.
Red X on master because of an issue that hasn't been resolved.
Need Nathan or Josh Hursey or someone to follow up. Who knows the AMCA code the best? We could move AMCA out
MCA variable system
envlist being available in an aggregate.
MTT Dev status:
Jenkins is still the best of the worst for running in non-cloud
Hudson is enterprise pay-for solution, but we want free
Josh posted documentation on the wiki, but not the scripts yet.
MTT: some new development to clean out MTT GitHub to MTT devel list.
Clear out some issues and set a new milestone, etc.
There is an alternative for Travis, but that hasn't been an issue.
What is combinatorial Executor for MTT?
Ralph explains: if you have two different ompi builds (different configure lines)
Big list of tests.
The existing sequential executor would sequentially build both.
But when building tests, it wouldn't automatically build for both; you have to tell it.
The Combinatorial executor would do that. Build list of tests for EACH configured OMPI build.
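The difference Ralph describes can be sketched as two ways of expanding a build plan. This is not MTT code: `sequential_plan`, `combinatorial_plan`, the tuple layout, and the build and test names in the usage note are all hypothetical, chosen only to show the cross product of OMPI builds and tests.

```python
from itertools import product


def sequential_plan(builds, tests, test_target):
    """Build every OMPI configuration, but build tests only for one of them.

    test_target is the single build the user explicitly told it to use.
    """
    plan = [("build_ompi", build) for build in builds]
    plan += [("build_test", test_target, test) for test in tests]
    return plan


def combinatorial_plan(builds, tests):
    """Build every OMPI configuration, then every test against each build."""
    plan = [("build_ompi", build) for build in builds]
    plan += [("build_test", build, test) for build, test in product(builds, tests)]
    return plan
```

For two builds (say `debug` and `optimized`) and three tests, the sequential plan has 2 + 3 steps while the combinatorial plan has 2 + 6, one test build per (build, test) pair.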
Chelsio getting some resources to possibly do MTT nightly testing.
Status Updates:
Status Update Rotation
Mellanox, Sandia, Intel
LANL, Houston, IBM
Cisco, ORNL, UTK, NVIDIA