Fix: image vulnerabilities #124
Good vulnerability cleanup overall, especially pinning versions and removing stale metadata that causes false-positive
My blocker is that this PR still depends on the report-only Trivy behavior from #122. The action still does not fail
Because this PR’s goal is “fix image vulnerabilities”, I’d like to see one of these before merge:
The
Superseded — see the Changes Requested review below.
Correction: my earlier note here claimed nvidia-pytorch "now ships Jupyter." That was wrong — the image always had Jupyter; dropping RP_SKIP_JUPYTER just lets RunPod manage/patch it. Please disregard this review.
TimPietruskyRunPod
left a comment
Requesting changes: a few quality fixes (details inline). The stale-metadata approach, the efa_metrics strip, and the version pins all look good, and this PR also resolves the missing if: guard I flagged on #122.
- scrab_stale_metadata typo (bake files + COPY --from=).
- scrub-stale-metadata.py uses 3.10+ annotation syntax + an unimported Iterator.
- Trailing space / missing newline in the new requirements.txt files.
TimPietruskyRunPod
left a comment
All four requested changes are in and verified (scrub typo across the bake files + Dockerfile, from __future__ import annotations + proper Iterator import in scrub-stale-metadata.py, and the requirements.txt whitespace/newline fixes). Threads resolved. Thanks!
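For context, the annotation fix mentioned above amounts to something like the following. This is a minimal sketch, not the actual contents of scrub-stale-metadata.py; the helper name `iter_metadata_dirs` and its parameters are hypothetical and exist only to show the two imports at work.

```python
# With this import, annotations are stored as strings and not evaluated at
# definition time, so 3.10+ syntax such as `str | None` also imports on 3.8/3.9.
from __future__ import annotations

from pathlib import Path
# Iterator must be imported explicitly: without the __future__ import the
# annotation is evaluated when the function is defined and a missing name
# raises NameError.
from typing import Iterator


def iter_metadata_dirs(root: Path, suffix: str | None = None) -> Iterator[Path]:
    """Yield metadata directories under root (hypothetical helper)."""
    suffixes = (suffix,) if suffix else (".dist-info", ".egg-info")
    for path in sorted(root.rglob("*")):
        if path.is_dir() and path.name.endswith(suffixes):
            yield path
```

Calling `list(iter_metadata_dirs(Path("/some/site-packages")))` would return the matching directories in sorted order; the path here is only a placeholder.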
TimPietruskyRunPod
left a comment
Re-reviewed the new commits added since my approval (feat: push after grype + feat: pytorch timeout-minutes). Summary:
Good — this is a real improvement. Build now runs with load: true, push: false, then Extract refs → Grype → the new docker-push action, consistently across base/nvidia/rocm. So images are scanned before publish — the "scan before push" gap from the original review is now structurally closed. The autoresearch/pytorch push steps are correctly gated by steps.changes (same job that owns the changes step), and the docker-push composite action is sound (set -euo pipefail, fails on empty refs).
One concern to confirm before merge (inline): load: true materializes every image in the bake matrix into the runner's local Docker daemon before pushing. build-nvidia (single image) passed, but base and pytorch build large matrices of multi-GB CUDA/torch images — loading them all at once risks exhausting runner disk (ENOSPC). The current run is the first with load: true on those matrices and build-base/build-rocm are still in progress; worth watching them specifically. The blacksmith-32vcpu + timeout-minutes: 240 bump helps CPU/time but not disk.
Note: grype is still report-only by design, so push isn't actually blocked on findings yet — but the ordering means flipping the exit 1 later turns it into a real gate with no further restructuring. Fine as a foundation.
No blocking objection from me on the design; just confirm the base/pytorch runs go green with load: true.
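To illustrate the "report-only now, real gate later" point, here is a rough sketch of how a single flag could separate the two modes. It is not the PR's actual action code. It assumes a Grype-style JSON report with a `matches[].vulnerability.{id,severity}` shape, and the names `blocking_findings`, `exit_code`, and `enforce` are hypothetical.

```python
from __future__ import annotations

# Severities treated as blocking; mirrors the HIGH/CRITICAL focus of the PR.
BLOCKING = {"high", "critical"}


def blocking_findings(report: dict) -> list[str]:
    """Return IDs of HIGH/CRITICAL matches in a Grype-style report (assumed shape)."""
    ids = []
    for match in report.get("matches", []):
        vuln = match.get("vulnerability", {})
        if str(vuln.get("severity", "")).lower() in BLOCKING:
            ids.append(vuln.get("id", "<unknown>"))
    return ids


def exit_code(report: dict, enforce: bool) -> int:
    """Report-only mode always returns 0; enforce=True fails when findings exist."""
    found = blocking_findings(report)
    for vuln_id in found:
        print(f"blocking finding: {vuln_id}")
    return 1 if (enforce and found) else 0
```

Because the scan already runs before the push step, switching `enforce` on would be the only change needed for findings to stop a publish.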
official-templates/shared/versions.hcl
official-templates/base/docker-bake.hcl
push: true
load: true
load: true imports the entire bake matrix into the runner's local Docker daemon before the push step. For the base CUDA matrix (and especially the pytorch matrix of ~25 multi-GB CUDA/torch images) this can blow the runner's disk (ENOSPC) since all images coexist locally instead of streaming straight to the registry. nvidia (1 image) is fine; please confirm build-base/build-pytorch actually pass with this. If disk becomes the bottleneck, consider load→scan→push→docker image rm per ref (or per small batch) so they don't all pile up at once.
I'm still experimenting with load: true, as it really blows the runner's disk. Moving the PR back to draft until I'm done with it.
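The per-ref flow suggested in this thread (load, scan, push, then remove) might look roughly like the sketch below. It is an illustration under assumptions, not the workflow in this PR: the `(bake target, image ref)` pairing, the function names, and the injectable `runner` are all hypothetical. Only the underlying commands (`docker buildx bake --load`, `grype`, `docker push`, `docker image rm`) are real CLIs.

```python
from __future__ import annotations

import subprocess
from typing import Callable, Sequence, Tuple

Runner = Callable[[Sequence[str]], None]


def run(cmd: Sequence[str]) -> None:
    """Run a command and raise CalledProcessError if it exits non-zero."""
    subprocess.run(list(cmd), check=True)


def scan_push_remove(targets: Sequence[Tuple[str, str]], runner: Runner = run) -> None:
    """Handle one image at a time so only one is held in the local daemon."""
    if not targets:
        raise ValueError("no image refs to push")  # fail on empty refs
    for bake_target, ref in targets:
        runner(["docker", "buildx", "bake", "--load", bake_target])  # build + load one target
        runner(["grype", ref])                                       # scan before publishing
        runner(["docker", "push", ref])                              # publish the scanned image
        runner(["docker", "image", "rm", ref])                       # free local disk
```

Passing a fake `runner` (for example `calls.append`) makes the ordering easy to check without Docker. Note that removing the image does not clear the build cache, so disk usage would still need watching on the large matrices.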
Summary
Drives all `runpod/*` images to a clean Trivy / Hadolint scan, plus a few CI fixes that surfaced along the way. Targets every image we ship out of `official-templates/` and `helper-templates/`.
What's fixed
Image vulnerabilities (Trivy `--severity HIGH,CRITICAL`)
- `jupyterlab`, `notebook`, OpenSSH-related deps; stripped the `efa_metrics` directory from NVIDIA Nsight Compute. That directory ships an internal Go binary (`nic_sampler`) that NVIDIA builds with an old Go toolchain, which was triggering recurring Go-stdlib HIGH/CRITICAL findings on every rebuild. The plugin is AWS-EFA-only (x86, AWS hardware) and never runs on RunPod, so deleting it is safe, and the `find ... || true` guard keeps it a no-op on ROCm / CPU images.
- `max-parallelism` to 3 in CI and increased the workflow timeout (the matrix was OOM-killing the runner before).
- `scrub-stale-metadata.py` (see below) to remove orphan `.dist-info` / `.egg-info` trees that kept Trivy reporting fixed wheels as still-vulnerable.
Hadolint
- `DL3008` / `DL3009` / `DL3015` findings fixed across the touched Dockerfiles (`--no-install-recommends`, `apt-get clean && rm -rf /var/lib/apt/lists/*`, version pins where reasonable).
CI / tooling
- `nvidia.yml`, `rocm.yml`, `hadolint-pr.yml`, `hadolint-push.yml`.
- `.github/actions/trivy`: exposes a `skip_files` input so `nvidia-pytorch` can skip the publicly-known CA bundle that Trivy flags as a "secret". The cert is the upstream NGC trust bundle published on GitHub, so flagging it is a false positive.
- `RUNPODCTL_VERSION=v2.3.0` in `base/Dockerfile` to stop tracking `latest`.
- `docker/setup-qemu-action` invocation that started failing after the action's input rename.
New: `scripts/scrub-stale-metadata.py`
Small helper invoked by Dockerfiles after `pip install`. NGC base images bundle several Python packages as in-tree source builds whose `.egg-info` lives next to the source. `pip install --upgrade` upgrades the wheel install but cannot reach those bundled trees, so Trivy keeps reporting the old version even though the runtime resolves to the new one. The script reads our pinned `requirements.txt` and deletes any `.dist-info` / `.egg-info` whose `Version:` disagrees with the pin.
What's NOT fixed (deliberate)
Three images still have findings we can't act on in this PR:
- `runpod/base:...-rocm644-...-pytorch251`
- `runpod/autoresearch:...-cuda1281-ubuntu2204`
- `runpod/autoresearch:...-cuda1281-ubuntu2404`
These are tracked separately; everything else is now clean.
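Returning to the stale-metadata scrub described under `scripts/scrub-stale-metadata.py`: the idea can be sketched roughly as below. This is an illustrative approximation, not the script shipped in this PR. All function names are hypothetical, only plain `name==version` pins are handled (extras are ignored), and it assumes metadata lives in `METADATA` or `PKG-INFO` files inside `.dist-info` / `.egg-info` directories.

```python
from __future__ import annotations

import re
import shutil
from pathlib import Path
from typing import Iterator


def normalize(name: str) -> str:
    """Normalize a project name so `Foo_Bar` and `foo-bar` compare equal."""
    return re.sub(r"[-_.]+", "-", name).strip().lower()


def read_pins(requirements: Path) -> dict[str, str]:
    """Collect `name==version` pins; comments, markers and other lines are skipped."""
    pins = {}
    for line in requirements.read_text().splitlines():
        line = line.split("#", 1)[0].split(";", 1)[0].strip()
        if "==" in line:
            name, version = line.split("==", 1)
            pins[normalize(name)] = version.strip()
    return pins


def read_metadata(meta_dir: Path) -> tuple[str, str] | None:
    """Return (normalized name, version) from METADATA / PKG-INFO, if present."""
    for filename in ("METADATA", "PKG-INFO"):
        meta_file = meta_dir / filename
        if not meta_file.is_file():
            continue
        fields = {}
        for line in meta_file.read_text(errors="replace").splitlines():
            if not line.strip():
                break  # header block ends at the first blank line
            key, sep, value = line.partition(":")
            if sep:
                fields.setdefault(key.strip(), value.strip())
        if "Name" in fields and "Version" in fields:
            return normalize(fields["Name"]), fields["Version"]
    return None


def stale_metadata(site_dir: Path, pins: dict[str, str]) -> Iterator[Path]:
    """Yield metadata dirs whose Version: disagrees with the pin for that package."""
    for meta_dir in sorted(site_dir.rglob("*")):
        if meta_dir.is_dir() and meta_dir.name.endswith((".dist-info", ".egg-info")):
            parsed = read_metadata(meta_dir)
            if parsed and parsed[0] in pins and parsed[1] != pins[parsed[0]]:
                yield meta_dir


def scrub(site_dir: Path, requirements: Path, dry_run: bool = True) -> list[Path]:
    """List stale metadata dirs and delete them unless dry_run is set."""
    stale = list(stale_metadata(site_dir, read_pins(requirements)))
    if not dry_run:
        for path in stale:
            shutil.rmtree(path)
    return stale
```

Defaulting to `dry_run=True` is a design choice of this sketch: it lets a Dockerfile step print what would be removed before any deletion is enabled. Packages that are not pinned are left untouched.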
Validation
Follow-ups (separate PRs)