{2025.06}[2025b] GROMACS 2025.4 with CUDA-12.9.1 #1482
|
bot: build repo:eessi.io-2025.06-software instance:eessi-bot-rug for:arch=x86_64/amd/zen5,accel=nvidia/cc120 |
|
New job on instance
|
|
The build succeeded, but it fails in the CUDA sanity check: I guess it may be related to the |
|
bot: build repo:eessi.io-2025.06-software instance:eessi-bot-rug for:arch=x86_64/amd/zen5,accel=nvidia/cc120 |
|
New job on instance
|
|
bot: build repo:eessi.io-2025.06-software instance:eessi-bot-surf for:arch=x86_64/intel/icelake,accel=nvidia/cc80 |
|
New job on instance
|
|
@casparvl The icelake cc80 build with the Surf bot failed because of:
Have you encountered this before? |
Maybe we just need to add |
I tried various things with an interactive job on Snellius, but
Edit: the zen4 job also ran out of memory according to Slurm, but somehow kept running and then timed out after a day. @casparvl do you have any idea what's going on? |
|
The only thing I can think of: these nodes don't have local disks, so |
I've seen this happen before. If you have, say, three processes running, the OOM killer might kill one and leave two stray processes that just wait for the killed one to do something. Those then run indefinitely. Slurm doesn't end the job, since there are still running processes. |
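To make the scenario concrete, here is a minimal sketch of it, not anything the bot or the build actually does. It assumes a supervisor that owns its children directly; the helper names (`spawn_sleepers`, `find_killed`, `reap_all`) are hypothetical. One child receives SIGKILL, which is the signal the OOM killer sends, and its siblings simply keep running until something cleans them up.

```python
import signal
import subprocess
import sys


def spawn_sleepers(count, seconds=60):
    """Start `count` idle children that stand in for build worker processes."""
    cmd = [sys.executable, "-c", f"import time; time.sleep({seconds})"]
    return [subprocess.Popen(cmd) for _ in range(count)]


def find_killed(procs):
    """Return children that ended because of a signal (negative return code)."""
    return [p for p in procs if p.poll() is not None and p.returncode < 0]


def reap_all(procs):
    """Kill every child that is still running so nothing is left waiting."""
    for p in procs:
        if p.poll() is None:
            p.kill()
    for p in procs:
        p.wait()


def demo():
    procs = spawn_sleepers(3)
    # Simulate the OOM killer: only one process receives SIGKILL.
    procs[0].send_signal(signal.SIGKILL)
    procs[0].wait()
    killed = find_killed(procs)
    # The siblings are untouched and would keep the job alive on their own.
    still_running = [p for p in procs if p.poll() is None]
    summary = (len(killed), killed[0].returncode, len(still_running))
    reap_all(procs)
    return summary


if __name__ == "__main__":
    print(demo())
```

The negative return code is how `subprocess` reports death by signal, so a supervisor that checks for it can tear down the remaining processes instead of waiting for the job's time limit.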
|
I've done an interactive build on an A100 node on Snellius with my personal account and on top of EESSI (without a container), and that worked fine: no memory issues, and the peak memory usage was about 4 GB. I'll do another one with the container. |
|
Let me just try this again as well:
bot: build repo:eessi.io-2025.06-software instance:eessi-bot-surf for:arch=x86_64/intel/icelake,accel=nvidia/cc80 |
|
New job on instance
|
|
bot: build repo:eessi.io-2025.06-software instance:eessi-bot-surf for:arch=x86_64/intel/icelake,accel=nvidia/cc80 |
|
Hmmm, something is wrong. We changed some things in the bot config in our config management system, but for some reason it concludes it shouldn't submit a job based on the above commands. Will dig into why... |
|
@bedroge I don't know if you are looking at adding the |
|
bot: build repo:eessi.io-2025.06-software instance:eessi-bot-surf for:arch=x86_64/intel/icelake,accel=nvidia/cc80 |
|
New job on instance
|
|
bot: build repo:eessi.io-2025.06-software instance:eessi-bot-surf for:arch=x86_64/intel/icelake,accel=nvidia/cc80 |
|
New job on instance
|
|
Created a small script in an attempt to figure out the memory consumption:
A snapshot of that output, very close to the peak memory usage (right before running out of memory):
Looking at this file and the timestamps, the memory consumption increases gradually. It's almost like some process is allocating MPI buffers in
I'm assuming that if we solve the underlying issue for UCX, this will fix the OOM as well. The tricky part is: when @bedroge manually starts the container, the
It might be related to openucx/ucx#6264, which my colleague @satishskamath has also replied to. While that was for way older versions of UCX, and while I think that natively on our system the issue was resolved, the error does seem to come from the same |
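For anyone who wants to reproduce this kind of measurement: the script itself is not shown above, so the following is only a rough sketch of one way to do it, not the script that was used. It assumes a Linux `/proc` filesystem and sums the `VmRSS` field per command name; the function names and the sampling interval are my own placeholders.

```python
import os
import time
from collections import defaultdict


def parse_status(text):
    """Extract (command name, resident set size in kB) from /proc/<pid>/status text."""
    name, rss_kb = "?", 0
    for line in text.splitlines():
        if line.startswith("Name:"):
            name = line.split(":", 1)[1].strip()
        elif line.startswith("VmRSS:"):
            # Format is "VmRSS:   <number> kB"; kernel threads have no such line.
            rss_kb = int(line.split()[1])
    return name, rss_kb


def rss_by_command(proc_root="/proc"):
    """Sum resident memory (kB) over all visible processes, grouped by command name."""
    totals = defaultdict(int)
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            path = os.path.join(proc_root, entry, "status")
            with open(path, errors="replace") as fh:
                name, rss_kb = parse_status(fh.read())
        except OSError:
            continue  # the process exited or is not readable
        totals[name] += rss_kb
    return dict(totals)


def top_consumers(totals, n=5):
    """Return the n (name, kB) pairs with the largest totals, biggest first."""
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]


def monitor(interval=10, samples=3):
    """Print a timestamped snapshot of the top consumers every `interval` seconds."""
    for _ in range(samples):
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        for name, rss_kb in top_consumers(rss_by_command()):
            print(f"{stamp} {name:<20} {rss_kb / 1024:10.1f} MiB")
        time.sleep(interval)


if __name__ == "__main__":
    monitor()
```

Two caveats: summing RSS counts shared pages once per process, so the totals overestimate, and memory held by files on a memory-backed filesystem that no process has mapped does not show up in any process's RSS at all.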
|
bot:show_config |
|
Instance
|
|
Seems my bot can't really reply, but it is picking up stuff from the PR. Let's see if I can trigger a build; that would let us easily try different environment variables via the bot's site config script to see if we can fix this.
bot: build repo:eessi.io-2025.06-software instance:eessi-bot-casparvl for:arch=x86_64/intel/icelake,accel=nvidia/cc80 |
Requires: