Sync meeting 2024 08 13 - multixscale/meetings GitHub Wiki
MultiXscale WP1+WP5 sync meetings
- Monthly, every 2nd Tuesday of the month at 10:00 CE(S)T
- Notes of previous meetings at https://github.com/multixscale/meetings/wiki
Next meetings
- Tue 13 Aug 2024 10:00 CEST
- planning to attend: Kenneth, Lara, Caspar, Thomas
- on summer break: Bob
- Tue 10 Sep 2024 10:00 CEST
- planning to attend:
- not planning to attend:
Agenda/notes 2024-08-13
attending: Neja, Nadia, Lara, Casper, Caspar, Kenneth, Thomas, Pedro, Satish
General updates:
- Milestone 3 (M18 - June 2024, lead: UStuttgart) was met, right on time (thanks to everyone who worked hard on this!)
- Ran the EESSI_ESPRESSO_LJ test from the EESSI test suite on Vega (AMD Zen2, 7H12), Karolina (AMD Zen2, 7H12) and Deucalion (ARM A64FX)
- Biggest challenge was running on Deucalion, without native EESSI
- Was achieved using the cvmfsexec script, and a manually created wrapper (cvmfsexec_eessi.sh) that allowed us to execute commands in a subshell started by cvmfsexec.
- Another wrapper for orted was needed, so that orted was also run through the cvmfsexec_eessi.sh wrapper
- Resulted in this blog
- With the wrappers from the blog, we should be able to run EESSI 'natively' (i.e. without containers) on any system, even if it doesn't have an OS CVMFS installation. Only cvmfsexec has some requirements (new enough kernel)
- Maybe we should also list in the docs on which EuroHPC systems EESSI is available. Either natively, or (tested) through cvmfsexec.
- Should then probably add the cvmfsexec method under the 'installation and configuration' header in the docs
- EuroHPC system access: Lumi, BSC, Leonardo, Discoverer (?), Deucalion, Karolina/Vega (new grant)
- How long do they last? Need to figure out
- What will we do with this access? Who's interested? (Satish, Kenneth, Caspar)
- Try to run a test, e.g. the ESPResSo test. Potential for a new blog post including all those systems
- Other use cases to explore impact of optimized installations
- Task 1.1? End date M24
- also: running EESSI test suite
- Still need to figure out squashfs approach for offline worker nodes => Deucalion now has offline worker nodes
- Someone should lobby/push sysadmins to do native installation, Alan?
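
As a rough illustration of the cvmfsexec approach noted above, the sketch below builds a command line that runs a command with an EESSI repository mounted. It is not the cvmfsexec_eessi.sh wrapper from the blog: the helper name build_cvmfsexec_command, the ./cvmfsexec path and the default repository are assumptions made for this example, and the "repositories, then --, then command" argument order is our understanding of the cvmfsexec usage, to be checked against its README.

```python
import shlex


def build_cvmfsexec_command(command, repositories=("software.eessi.io",),
                            cvmfsexec="./cvmfsexec"):
    """Build an argv list that runs `command` under cvmfsexec.

    Hypothetical helper for illustration only. Assumes the usage
    `cvmfsexec <repo> [<repo> ...] -- <command>`; the path to the
    cvmfsexec script is an assumption as well.
    """
    if not command:
        raise ValueError("command must not be empty")
    # Repositories to mount come first, then "--", then the command to run.
    return [cvmfsexec, *repositories, "--", *command]


if __name__ == "__main__":
    argv = build_cvmfsexec_command(["ls", "/cvmfs/software.eessi.io"])
    # Print a shell-quoted version that could be pasted into a job script.
    print(shlex.join(argv))
```

An MPI launcher helper such as orted would need to be started through the same kind of prefix, which is why a second wrapper was required on Deucalion.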
WP status updates
- [SURF] WP1 Developing a Central Platform for Scientific Software on Emerging Exascale Technologies
- [UGent] T1.1 Stable (EESSI) - D1.3 due M24 (Dec'24)
- As always: more software, bug fixes, etc.
- dev.eessi.io => see notes + support issue #61
- Required for our new CI/CD work
- Required for experimenting with our GPU software / prefixes
- Currently blocker for CI/CD & GPU progress. How do we expedite this?
- One of the main blockers seems to be separating budgets, e.g. this issue. How / who can we help / get this done?
- Alan set up a cluster on 12-08-2024 on Azure for dev.eessi.io, needs a little bit more work before it's good to go
- Kenneth can look into completing the cluster setup
- Create logins for everyone involved, send login details
- Thomas can look into configuring bot there
- controlled via dedicated GitHub repo https://github.com/EESSI/dev.eessi.io
- dedicated branches per project?
- We'll need bot scripts in dev.eessi.io (Thomas)
- Pedro: unclear how to proceed with using easyconfig templates and "inject" commits provided by software developers
- Alan was working on a new feature in EasyBuild related to this, see EasyBuild framework PR #4608
- need to review description of Task 1.1, make sure all subtasks are covered
- => need to update project planning (Caspar, Kenneth)
- "we will benchmark software from the shared software stack and compare the performance against on-premise software stacks to identify potential performance limitations, ..."
- Espresso + LAMMPS + OpenFOAM + ALL(?) (MultiXscale), GROMACS (BioExcel)
- We need to pick which applications, and who will run them to make sure we do this before Dec'24
- Do as part of the development access
- Plan focussed meeting on what to do with Development access
- Create overview of tasks / which software we have / etc
- We'll probably need Alan to give an overview, and arrange access for all
- => set up spreadsheet to create this overview?
- "increase stability of the shared software stack ... pro-actively by developing monitoring tools"
- proper monitoring for CVMFS network (S0 + S1s)
- RUG is pulling this, right? Any updates? Any issues expected getting it done by Dec'24?
- Bob is actively working on this. For now, a very basic Grafana dashboard with little info, but it's a starting point
- [RUG] T1.2 Extending support - D1.4 due M30 (June'25)
- zen4, sapphirerapids, A64FX
- Start with separate easystacks like this
- How/when do we integrate it into the main easystacks?
- Requires some solution for missing software (LMOD hooks? eb_hooks + --module-only? Fallback to zen3 for particular software is probably difficult with RPATH-ing...). Who can implement this? Maybe Danilo, Caspar => Start of a discussion here https://gitlab.com/eessi/support/-/issues/30
- A64FX: problematic with offline worker nodes. There are some that are still online, so if we manage to submit there somehow, could still work
- AMD ROCm
- Any updates? Is there good EB support already? (and not merge it) in software.eessi.io?
- We can probably do a foss+ROCM thing, similar to CUDA
- Bob wanted to take a look at this.
- focused meeting being planned in Sept'24 (Pedro)
- should also look into Grace Hopper (JUPITER)
- SURF has a grace hopper. If there is interest, we could deploy a build bot there (but someone with more bot experience would need to help out with that)
- Alan/Kenneth have access to a Grace system @ JSC too
- Thomas is interested for the Norwegian system that will have 300 GH nodes.
- Caspar will figure out what's needed and contact Thomas
- Potentially: optimized install for ESPResSo on Grace Hopper, and compare to results from blog
- [SURF] T1.3 Test suite - D1.5 due M30 (June'25)
- Several new tests (CP2K, PyTorch, LAMMPS is close)
- Documentation on how to add tests, with example [WIP], by Caspar. For now, here, will be part of main docs once finished.
- Potentially: look into mixin classes to execute some of the more 'standard' hooks we created for the EESSI test suite
- [BSC] T1.4 RISC-V (starts M13)
- cfr. efforts by Bob & Julian, incl. riscv.eessi.io, see docs
- Started work to incorporate Extrae into riscv.eessi.io
- Question: how do we add to riscv.eessi.io? Manually ingested tarballs by RUG?
- Presentation of the paper “Preparing to Hit the Ground Running: Adding RISC-V support to EESSI” at the Fourth International Workshop on RISC-V for HPC (workshop associated with ISC’24) in Hamburg. See presentation
- [SURF] T1.5 Consolidation (starts M25 - Jan'25)
- (not started yet)
- [UGent] WP5 Building, Supporting and Maintaining a Central Shared Stack of Optimized Scientific Software Installations
- (FINISHED M12 [UGent] T5.1 Support portal)
- [SURF] T5.2 Monitoring/testing, D5.3 due M30 (June'25)
- discussions with SURF + initial work done on dashboard
- working on two dashboards: one detailed, one with overview
- in June-July we had two sync up meetings with Kenneth and Lara
- we have MVP for the performance dashboard and for the identity matrix dashboard
- we have ES set up and running on cloud + injection script that may take care of duplicates in DB
- Question: RUG has 5PMs here. What do we expect RUG to do? What do they want to do? Maybe deployment of the dashboard(s) in our cloud infra?
- (FINISHED M12 [UiB] T5.3 community contributions (bot))
- [UGent] T5.4 support/maintenance - D5.4 due M48 (Dec'26)
- support portal + rotation working well
- If you haven't looked at the new proposal for the rotation October through December, please do. See: https://gitlab.com/eessi/support/-/wikis/Proposal-support-rotation
- support issues in June + July
- Opened: 13 issues
- Closed: 9 issues
- total: 73 issues (24 open, 49 closed)
- bot release
- [UB] WP6 Community outreach, education, and training
- What other activities did we attend / are we planning to attend?
- [Alan] Submitted to Hipeac '25 (Jan'25 in Barcelona), tutorial was accepted
- [Alan] invited speaker for Nordic Industry Days (early Sept'24)
- submit BoF proposal on EESSI for SC24 (Atlanta, US)
- HPCNow! will be attending
- tutorial submission done
- CernVM-FS workshop (Sept'24, Geneva)
- Thomas submitted, not sure if it's accepted yet ("Status of EESSI / developments / plans")
- EESSI is in default CernVM-FS configuration
- could cover work on dev.eessi.io
- Deliverable due: D6.2 (M24 - Dec'24).
- "Training Activity Technical Support Infrastructure": Report outlining the technical infrastructure created, maintained and used to support the training activities of the project.
- Any blockers expected in achieving this milestone?
- Thomas & Alan will work on this
- Deliverable D6.3 (M30 - June'25)
- [HPCNow] WP7 Dissemination, Exploitation & Communication
- Interviewing Matej for general public, will be posted on multixscale website
- Working on software content for the website
- T7.1 Scientific applications provisioned on demand (lead: HPCNow)
- ...
- Task 7.2 - Dissemination and communication activities (lead: NIC)
- ...
- Task 7.3 - Sustainability (lead: NIC, started M18)
- ... (any updates on future legal entity for EESSI?)
- Task 7.4 - Industry-oriented training activities (lead: HPCNow)
- ...
- [NIC] WP8 (Management and Coordination)
- We have a new shared drive, link can be found on multixscale/hub
- Q2 2024 internal reporting is due. If you haven't done so already: please put your PMs in each workpackage, along with bullet point list of things you worked on.
- amendment in the works, prep work by Alan:
- Discussed amendment (12-08 @ 11:00-12:00), see notes for details
- Add CI/CD work (dev.eessi.io functionality for developers) in Task 1.3
- Limit scope of Task 5.2, and move some effort from T5.2 => 1.3
- Scrap deliverable 1.5 (and integrate that a bit in D5.3)
- Create new deliverable for CI/CD work.
- IIT's cost declaration was rejected, we asked for an explanation
- Got a response, but it had no real connection with the cost declaration
- Claimed that IIT hired the postdoc later, lots of comments on how the job was posted
- Currently drafting reply to PO
- There will be an additional (interim) review
- Currently no more information, Neja is gathering this info
- Probably going to be online
- Two deliverables due 5th of July (in response to project review)
- one on co-design (by Alan)
- focus on collaborating with projects like EUPILOT, EPI, EUPEX (rather than contacting vendors directly)
- one for scientific WPs
- both were submitted on 4+5 July, amendment will also include these
- Success story on collaboration with SKA community (Caspar)
- should we also make this a blog post on the MultiXscale website/EESSI blog?
- internal SURF talk by SKA was recorded, we should check if we can make this public
- quarterly reports for 2024Q2 due end of this week
- almost done for WP1+WP5, WIP by Lara+Caspar, only missing summary+highlights
Notes
- CI/CD call for EuroHPC
- is 100% funded (not 50/50 EU/countries)
- not published yet (updates?)
- request for success story by CASTIEL2
- ideally end of June, at the latest by end of August
- Caspar submitted success story on collaboration with SKA community
- ... did we submit anything else? (@neja?)
- next general MultiXscale meeting
- ?
- hosted by Alan
- agenda point: update on pairing of technical + scientific WPs
- (Susana) suggestions for blog are welcome
- ...
Notes of previous meetings
- https://github.com/multixscale/meetings/wiki/Sync-meeting-2024-06-11
- https://github.com/multixscale/meetings/wiki/Sync-meeting-2024-05-14
- https://github.com/multixscale/meetings/wiki/Sync-meeting-2024-04-09
- https://github.com/multixscale/meetings/wiki/Sync-meeting-2024-03-12
- https://github.com/multixscale/meetings/wiki/Sync-meeting-2024-02-13
- https://github.com/multixscale/meetings/wiki/Sync-meeting-2024-01-09
- https://github.com/multixscale/meetings/wiki/Sync-meeting-2023-12-12
- https://github.com/multixscale/meetings/wiki/Sync-meeting-2023-11-14
- https://github.com/multixscale/meetings/wiki/Sync-meeting-2023-10-10
- https://github.com/multixscale/meetings/wiki/Sync-meeting-2023-09-12
- https://github.com/multixscale/meetings/wiki/Sync-meeting-2023-08-08
- https://github.com/multixscale/meetings/wiki/Sync-meeting-2023-07-11
- https://github.com/multixscale/meetings/wiki/Sync-meeting-2023-06-13
- https://github.com/multixscale/meetings/wiki/Sync-meeting-2023-05-09
- https://github.com/multixscale/meetings/wiki/Sync-meeting-2023-04-11
- https://github.com/multixscale/meetings/wiki/Sync-meeting-2023-03-14
- https://github.com/multixscale/meetings/wiki/Sync-meeting-2023-02-14
- https://github.com/multixscale/meetings/wiki/sync-meeting-2023-01-10