WeeklyTelcon_20220607
Geoffrey Paulsen edited this page Jun 7, 2022 · 1 revision
    - Dialup Info: (Do not post to public mailing list or public wiki)
 
- Austen Lauria (IBM)
 - Brendan Cunningham (Cornelis Networks)
 - Brian Barrett (AWS)
 - Christoph Niethammer (HLRS)
 - Edgar Gabriel (UoH)
 - Geoffrey Paulsen (IBM)
 - Jeff Squyres (Cisco)
 - Joseph Schuchart
 - Josh Fisher (Cornelis Networks)
 - Josh Hursey (IBM)
 - Matthew Dosanjh (Sandia)
 - Thomas Naughton (ORNL)
 - Todd Kordenbrock (Sandia)
 - Tommy Janjusic (nVidia)
 - William Zhang (AWS)
 
- Akshay Venkatesh (NVIDIA)
 - Artem Polyakov (nVidia)
 - Aurelien Bouteiller (UTK)
 - Brandon Yates (Intel)
 - Charles Shereda (LLNL)
 - David Bernholdt (ORNL)
 - Erik Zeiske
 - Geoffroy Vallee (ARM)
 - George Bosilca (UTK)
 - Harumi Kuno (HPE)
 - Hessam Mirsadeghi (UCX/nVidia)
 - Howard Pritchard (LANL)
 - Joshua Ladd (nVidia)
 - Marisa Roman (Cornelis Networks)
 - Mark Allen (IBM)
 - Matias Cabral (Intel)
 - Michael Heinz (Cornelis Networks)
 - Nathan Hjelm (Google)
 - Noah Evans (Sandia)
 - Raghu Raja (AWS)
 - Ralph Castain (Intel)
 - Sam Gutierrez (LLNL)
 - Scott Breyer (Sandia?)
 - Shintaro Iwasaki
 - Xin Zhao (nVidia)
 
- v4.1.4 Released!
  - A dozen bug fixes.
  - UCC backported.
 
- v4.1.5
  - Schedule: targeting ~6 months (Nov 1).
  - No driver on the schedule yet.
 
 
- A couple of critical new issues:
  - Issue 10437: a blocker even for the next RC.
  - Issue 10435: a regression from v4.1.
 
- Progress being made on missing Sessions symbols.
- Looking for `coll_han` tuning runs.
  - Joseph is planning to do runs, though it might not be next week.
  - Tommy is working on
  - Also on Brendan and
  - Thomas Naughton also
- `main` and `v5.0.x` should be the same; use either.
- Call to PRRTE / PMIx:
  - Longest pole in the tent right now.
  - If you want OMPI v5.0 released in the near-ish future, please scare up some resources.
  - Use the PRRTE `critical` and `Target v2.1` labels for issues.
- Thomas did testing on the latest PRRTE (not the submodule pointers).
  - Ralph pulled in a larger PR that seemed to fix things.
 
- Schedule:
  - Blockers are still the same.
  - PRRTE blocker:
    - Right now it's looking like late summer (we don't have a PRRTE release for packagers to package).
  - Call for help: if anyone has resources to help, we can move this release date much sooner.
    - Requires investment from us.
 
  - Blockers are listed; some are in the PRRTE project.
  - Any alternatives?
    - The problem for Open MPI is not that PRRTE isn't ready to release. The parts we use work great, but other parts still have issues (namely DVM).
      - Because we install PMIx and PRRTE as if they came from their own tarballs, this leaves packagers no good way to distribute Open MPI.
    - How do we install PMIx and PRRTE in `open-mpi/lib` instead and get all of the rpaths correct?
      - This might be the best bet (aside from fixing the PRRTE sources, of course).
 
 
- Several backported PRs.
- New issue opened on performance when oversubscribed.
- New issue on topology problems when mapping by L3 cache.
 
- Please HELP!
  - Performance test the default selection of Tuned vs HAN.
    - Brian hasn't had (and might not have for a while) time to send out instructions on how to test.
      - Can anyone send out these instructions?
    - Call for folks to performance test at 16 nodes, and at whatever scale "makes sense" for them.
 
- The accelerator work William is doing should be able to get out of draft.
  - Edgar has been working on a ROCm component for the framework.
    - Post v5.0.0? Originally we thought it shouldn't go in since the release was close, but if it slips to the end of summer, we'll see...
  - Edgar finished the ROCm component... it appears to be working.
  - William or Brian can comment on how close it is to merging to `main`.
    - William is working on the `btl sm_cuda` and `rcache` code. Could maybe merge at the end of this week.
    - Tommy was going to get some nVidia people to review / test.
    - Discussion on `btl sm_cuda`: it used to be a cloned copy of `sm`, but of the older `sm` component, not `vader` (which was renamed to `sm`).
      - Might be time to drop `btl sm_cuda`?
        - The `vader` component does not have hooks to the new framework.
        - Cases where `btl sm_cuda` might get used today:
          - The TCP path would use it for on-node communication.
          - Nodes without UCX.
        - Even one-sided would not end up using `btl sm_cuda`.
      - v5.0.0 would be a good time to remove it.
      - Being based on the old `sm` is a big detractor.
        - Can we ALSO remove `rcache`? Unclear.
  - What's the status of the accelerator branch on the v5.0.x branch?
    - The PR is just to `main`.
      - We said we could do a backport, but that would be after it gets merged to `main`.
      - If v5.0.0 is still a month out, is that enough time?
      - v5.0.0 is lurking closer.
    - This is a BIG chunk of code...
      - But if v5.0.0 is delayed longer... this would be good to get in.
    - The answer is largely dependent on PMIx and PRRTE.
    - Also has implications for OMPI-next?
- Can anyone who understands packaging review https://github.com/open-mpi/ompi/pull/10386?
  - Automate 3rd-party minimum version checks into a text file, so that both `configure` and the docs could read from a common file.
  - `conf.py` runs at the beginning of the Sphinx build and could read in files, etc.
  - Still iterating on it.
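The minimum-version-file idea noted above could look roughly like the following on the Sphinx side. This is a hypothetical sketch, not the PR's implementation: the file name, its `key=value` format, and the helper names are all assumptions. It parses the shared file into a dict and turns each entry into a reStructuredText substitution that `conf.py` could expose through Sphinx's `rst_epilog` setting.

```python
"""Hypothetical sketch: one minimum-version file shared by configure and Sphinx.

The file name, its key=value format, and these helper names are illustrative
assumptions, not Open MPI's actual layout.
"""
from __future__ import annotations

from pathlib import Path


def parse_min_versions(text: str) -> dict[str, str]:
    """Parse `key=value` lines, ignoring blank lines and `#` comments."""
    versions: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"malformed line: {raw!r}")
        versions[key.strip()] = value.strip()
    return versions


def load_min_versions(path: Path) -> dict[str, str]:
    """Read and parse the shared file (the path is hypothetical)."""
    return parse_min_versions(path.read_text(encoding="utf-8"))


def make_rst_epilog(versions: dict[str, str]) -> str:
    """Emit one RST substitution per entry, e.g. |pmix_min_version|."""
    return "\n".join(
        f".. |{key}| replace:: {value}" for key, value in sorted(versions.items())
    )


# In conf.py one could then write (hypothetical path):
#   rst_epilog = make_rst_epilog(load_min_versions(Path("../config/min-versions.txt")))

if __name__ == "__main__":
    # Placeholder values for illustration only, not real minimum versions.
    sample = """
    # key=value, one per line
    pmix_min_version=1.2.3
    prrte_min_version=4.5.6
    """
    print(make_rst_epilog(parse_min_versions(sample)))
```

With `rst_epilog` set this way, a docs page could write `|pmix_min_version|` and pick up whatever value the shared file holds, so the docs would not need their own hard-coded copy of each minimum version.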
 
- https://github.com/open-mpi/ompi/pull/8941
  - We'd like to get this in, or close it.
  - Geoff will send an email to George to ask him to review.
 
 
- What are companies thinking about travel?
  - Wiki for face to face: https://github.com/open-mpi/ompi/wiki/Meeting-2022
    - Should think about schedule, location, and topics.
    - Some new topics added this week. Please consider adding more topics.
  - MPI Forum was virtual.
    - The next one, at EuroMPI, will be hybrid.
    - Plan to continue being hybrid with 1-2 meetings / year.