Fix: image vulnerabilities by Chmokachka · Pull Request #124 · runpod/containers

Chmokachka · 2026-05-18T13:17:56Z

Summary

Drives all runpod/* images to a clean Trivy / Hadolint scan, plus a few CI fixes that surfaced along the way. Targets every image we ship out of official-templates/ and helper-templates/.

What's fixed

Image vulnerabilities (Trivy `--severity HIGH,CRITICAL`)

base — bumped jupyterlab, notebook, OpenSSH-related deps; stripped the efa_metrics directory from NVIDIA Nsight Compute. That directory ships an internal Go binary (nic_sampler) that NVIDIA builds with an old Go toolchain and was triggering recurring Go-stdlib HIGH/CRITICAL findings on every rebuild. The plugin is AWS-EFA-only (x86, AWS hardware) and never runs on RunPod, so deleting it is safe and the find ... || true guard keeps it a no-op on ROCm / CPU images.
autoresearch — fixed Hadolint findings, aligned with new base.
pytorch — Hadolint fixes; bumped max-parallelism to 3 in CI and increased the workflow timeout (the matrix was OOM-killing the runner before).
rocm — addressed all fixable CVEs; pinned the relevant deps.
nvidia-pytorch — patched OS-package CVEs; added scrub-stale-metadata.py (see below) to remove orphan .dist-info / .egg-info trees that kept Trivy reporting fixed wheels as still-vulnerable.

Hadolint

All DL3008 / DL3009 / DL3015 findings fixed across the touched Dockerfiles (--no-install-recommends, apt-get clean && rm -rf /var/lib/apt/lists/*, version pins where reasonable).
Hadolint-on-push workflow now ignores the rules we already chose to accept project-wide (matches the PR check behaviour).

CI / tooling

Upgraded GitHub Actions versions across nvidia.yml, rocm.yml, hadolint-pr.yml, hadolint-push.yml.
Replaced the brittle Trivy action call with our internal .github/actions/trivy — exposes a skip_files input so nvidia-pytorch can skip the publicly-known CA bundle that Trivy flags as a "secret". The cert is the upstream NGC trust bundle published on GitHub, so flagging it is a false positive.
Pinned RUNPODCTL_VERSION=v2.3.0 in base/Dockerfile to stop tracking latest.
Fixed docker/setup-qemu-action invocation that started failing after the action's input rename.

New: `scripts/scrub-stale-metadata.py`

Small helper invoked by Dockerfiles after pip install. NGC base images bundle several Python packages as in-tree source builds whose .egg-info lives next to the source. pip install --upgrade upgrades the wheel install but cannot reach those bundled trees, so Trivy keeps reporting the old version even though the runtime resolves to the new one. The script reads our pinned requirements.txt and deletes any .dist-info / .egg-info whose Version: disagrees with the pin.
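The scrubbing logic described above can be sketched roughly as follows. This is a minimal illustration, not the actual contents of scripts/scrub-stale-metadata.py: the function names are hypothetical, only exact `name==version` pins are considered, and versions are compared as plain strings (so `2.0` and `2.0.0` would count as different).

```python
from __future__ import annotations

import re
import shutil
from collections.abc import Iterator
from pathlib import Path


def normalize(name: str) -> str:
    """Normalize a project name the way PEP 503 does (case, '-', '_', '.')."""
    return re.sub(r"[-_.]+", "-", name).lower()


def read_pins(requirements: Path) -> dict[str, str]:
    """Collect exact `name==version` pins; every other line is ignored."""
    pins: dict[str, str] = {}
    for raw in requirements.read_text().splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        match = re.match(r"([A-Za-z0-9._-]+)\s*==\s*([^\s;]+)", line)
        if match:
            pins[normalize(match.group(1))] = match.group(2)
    return pins


def metadata_fields(info_dir: Path) -> tuple[str, str] | None:
    """Return (name, version) from METADATA (.dist-info) or PKG-INFO (.egg-info)."""
    for candidate in ("METADATA", "PKG-INFO"):
        path = info_dir / candidate
        if not path.is_file():
            continue
        name = version = None
        for line in path.read_text(errors="replace").splitlines():
            if not line.strip():
                break  # the header block ends at the first blank line
            if line.startswith("Name:"):
                name = line.split(":", 1)[1].strip()
            elif line.startswith("Version:"):
                version = line.split(":", 1)[1].strip()
        if name and version:
            return name, version
    return None


def stale_metadata(site_packages: Path, pins: dict[str, str]) -> Iterator[Path]:
    """Yield metadata directories whose Version: disagrees with the pin."""
    for info_dir in sorted(site_packages.iterdir()):
        if not info_dir.is_dir() or info_dir.suffix not in (".dist-info", ".egg-info"):
            continue
        fields = metadata_fields(info_dir)
        if fields is None:
            continue
        name, version = fields
        pinned = pins.get(normalize(name))
        if pinned is not None and pinned != version:
            yield info_dir


def scrub(site_packages: Path, requirements: Path) -> list[Path]:
    """Delete stale metadata directories and return the paths that were removed."""
    removed = []
    for info_dir in stale_metadata(site_packages, read_pins(requirements)):
        shutil.rmtree(info_dir)
        removed.append(info_dir)
    return removed
```

Packages that are not pinned are left alone in this sketch, so an unrelated `.dist-info` is never deleted just because it is old.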

What's NOT fixed (deliberate)

Three images still have findings we can't act on in this PR:

| Image | Reason |
| --- | --- |
| `runpod/base:...-rocm644-...-pytorch251` | All remaining CVEs are in PyTorch 2.5.1 itself, fixed only in 2.6.0+. Two options: drop the 2.5.1 variant, or wait for an upstream backport. Left for a separate decision. |
| `runpod/autoresearch:...-cuda1281-ubuntu2204` | Findings are in transitive deps that need an autoresearch app-level dependency upgrade, which is out of scope for this PR. |
| `runpod/autoresearch:...-cuda1281-ubuntu2404` | Same as above. |

These are tracked separately; everything else is now clean.

Validation

Trivy table-mode scans of each rebuilt tag — clean HIGH/CRITICAL on every targeted image.
Hadolint runs against the touched Dockerfiles — clean.

Follow-ups (separate PRs)

Open autoresearch-side PR to upgrade transitive deps.


kodxana · 2026-06-03T15:18:27Z

Good vulnerability cleanup overall, especially pinning versions and removing stale metadata that causes false-positive
Trivy reports.

My blocker is that this PR still depends on the report-only Trivy behavior from #122. The action still does not fail
when HIGH/CRITICAL findings are found, and the workflows still scan after push: true, so the CI does not prove that
vulnerable images are blocked from publication.

Because this PR’s goal is “fix image vulnerabilities”, I’d like to see one of these before merge:

Trivy exits non-zero for HIGH/CRITICAL fixed vulnerabilities and runs before publish, or
the PR clearly states that CI is not enforcing this yet and includes links/logs showing the claimed clean scans for
each targeted image.

The skip-files addition seems reasonable for the known civetweb cert false positives, but it makes the enforcement
story even more important so real findings do not get hidden behind a passing workflow.
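As an illustration of the first option, a gate over a Trivy JSON report could look roughly like the sketch below. It assumes the report was produced with `trivy image --format json`; the function names are hypothetical, and this is not the behaviour of the repository's .github/actions/trivy action. Trivy's own `--exit-code` and `--ignore-unfixed` flags can achieve a similar effect without any extra script.

```python
from __future__ import annotations

import json
import sys

# Severities that should block publication in this sketch.
BLOCKING = {"HIGH", "CRITICAL"}


def blocking_findings(report: dict) -> list[tuple[str, str, str]]:
    """Return (target, vulnerability id, package) for fixable HIGH/CRITICAL findings."""
    found = []
    for result in report.get("Results") or []:
        for vuln in result.get("Vulnerabilities") or []:
            # Only findings with a known fixed version count as actionable.
            if vuln.get("Severity") in BLOCKING and vuln.get("FixedVersion"):
                found.append(
                    (
                        result.get("Target", "?"),
                        vuln.get("VulnerabilityID", "?"),
                        vuln.get("PkgName", "?"),
                    )
                )
    return found


def main(path: str) -> int:
    """Print blocking findings to stderr and return a process exit code."""
    with open(path, encoding="utf-8") as handle:
        findings = blocking_findings(json.load(handle))
    for target, vuln_id, package in findings:
        print(f"{target}: {vuln_id} in {package}", file=sys.stderr)
    return 1 if findings else 0
```

A workflow step that runs before the push could end with `sys.exit(main(report_path))`, where `report_path` is whatever file the scan step wrote, so that a non-zero exit stops the job before anything is published.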

TimPietruskyRunPod

Superseded — see the Changes Requested review below.

Correction: my earlier note here claimed nvidia-pytorch "now ships Jupyter." That was wrong — the image always had Jupyter; dropping RP_SKIP_JUPYTER just lets RunPod manage/patch it. Please disregard this review.

TimPietruskyRunPod

Requesting changes — a few quality fixes (details inline). The stale-metadata approach, the efa_metrics strip, and the version pins all look good, and this PR also resolves the missing if: guard I flagged on #122.

scrab_stale_metadata typo (bake files + COPY --from=).
scrub-stale-metadata.py uses 3.10+ annotation syntax + an unimported Iterator.
Trailing space / missing newline in the new requirements.txt files.

TimPietruskyRunPod

All four requested changes are in and verified (scrub typo across the bake files + Dockerfile, from __future__ import annotations + proper Iterator import in scrub-stale-metadata.py, and the requirements.txt whitespace/newline fixes). Threads resolved. Thanks!
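For readers unfamiliar with the annotation fix mentioned above, here is a small sketch of the pattern. The function is hypothetical and is not taken from scrub-stale-metadata.py. On interpreters older than 3.10, a `str | None` annotation in a signature raises a TypeError when the function is defined unless annotation evaluation is deferred, and importing `Iterator` explicitly keeps the name resolvable for type checkers and for `typing.get_type_hints`.

```python
from __future__ import annotations  # annotations are stored as strings, not evaluated

from collections.abc import Iterator
from pathlib import Path


def metadata_dirs(root: Path, suffix: str | None = None) -> Iterator[Path]:
    """Yield metadata directories under root, optionally filtered by suffix."""
    for entry in sorted(root.iterdir()):
        if entry.is_dir() and entry.suffix in (".dist-info", ".egg-info"):
            if suffix is None or entry.suffix == suffix:
                yield entry
```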

TimPietruskyRunPod

Re-reviewed the new commits added since my approval (feat: push after grype + feat: pytorch timeout-minutes). Summary:

Good — this is a real improvement. Build now runs with load: true, push: false, then Extract refs → Grype → the new docker-push action, consistently across base/nvidia/rocm. So images are scanned before publish — the "scan before push" gap from the original review is now structurally closed. The autoresearch/pytorch push steps are correctly gated by steps.changes (same job that owns the changes step), and the docker-push composite action is sound (set -euo pipefail, fails on empty refs).

One concern to confirm before merge (inline): load: true materializes every image in the bake matrix into the runner's local Docker daemon before pushing. build-nvidia (single image) passed, but base and pytorch build large matrices of multi-GB CUDA/torch images — loading them all at once risks exhausting runner disk (ENOSPC). The current run is the first with load: true on those matrices and build-base/build-rocm are still in progress; worth watching them specifically. The blacksmith-32vcpu + timeout-minutes: 240 bump helps CPU/time but not disk.

Note: grype is still report-only by design, so push isn't actually blocked on findings yet — but the ordering means flipping the exit 1 later turns it into a real gate with no further restructuring. Fine as a foundation.

No blocking objection from me on the design; just confirm the base/pytorch runs go green with load: true.

TimPietruskyRunPod · 2026-06-10T09:21:28Z

            official-templates/shared/versions.hcl
            official-templates/base/docker-bake.hcl
-          push: true
+          load: true


load: true imports the entire bake matrix into the runner's local Docker daemon before the push step. For the base CUDA matrix (and especially the pytorch matrix of ~25 multi-GB CUDA/torch images) this can blow the runner's disk (ENOSPC) since all images coexist locally instead of streaming straight to the registry. nvidia (1 image) is fine; please confirm build-base/build-pytorch actually pass with this. If disk becomes the bottleneck, consider load→scan→push→docker image rm per ref (or per small batch) so they don't all pile up at once.

Chmokachka · 2026-06-10

I'm still experimenting with load: true, as it really does exhaust the runner's disk. Moving the PR back to draft until I'm done with it.
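The per-ref load, scan, push, remove flow suggested in this thread could be sketched along these lines. It is only an illustration under stated assumptions: the mapping from one bake target to one image ref is hypothetical (a real target may produce several tags), and the repository's workflows use composite actions rather than a script like this. The commands themselves (`docker buildx bake <target> --load`, `grype <ref>`, `docker push`, `docker image rm`) are standard CLI invocations.

```python
from __future__ import annotations

import subprocess
from collections.abc import Callable, Sequence

# A runner takes one command (a list of arguments) and raises on failure.
Runner = Callable[[Sequence[str]], None]


def run_checked(command: Sequence[str]) -> None:
    """Run a command and raise CalledProcessError on a non-zero exit."""
    subprocess.run(list(command), check=True)


def publish_one_at_a_time(targets: dict[str, str], run: Runner = run_checked) -> None:
    """Build, scan, push, and remove each image before starting the next.

    `targets` maps a bake target name to the image ref it produces
    (an assumption made for this sketch).
    """
    for target, ref in targets.items():
        # Build a single target into the local Docker daemon.
        run(["docker", "buildx", "bake", target, "--load"])
        try:
            # Scan before publishing; report-only unless grype is told to fail.
            run(["grype", ref])
            run(["docker", "push", ref])
        finally:
            # Free disk space even if the scan or the push failed.
            run(["docker", "image", "rm", ref])
```

Because only one image is resident at a time, peak disk usage is bounded by the largest single image rather than by the whole matrix, at the cost of losing parallel builds within a job.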

mchekm added 2 commits May 18, 2026 12:28

fix: jupyterlab, notebook, ssh vulnerabilities

2068183

fix: hadolint findings

3f2d769

Chmokachka changed the base branch from main to feat/image-security-scanner May 18, 2026 13:19

mchekm added 18 commits May 18, 2026 16:28

fix: autoresearch hadolint

a70220f

fix: trivy vulnerabilities; bake push: true

69641b2

fix: pytorch max-parallelism: 4

e562b5c

fix: pytorch max-parallelism: 3

de920b4

fix: autoresearch linter

d3a7e1e

fix: hadolint findings and RUNPODCTL_VERSION=v2.3.0

72ae1a4

fix: hadolint findings

d6049ac

fix: base build

74e86de

fix: hadolint findings

629e6c5

fix: rocm vulnerabilities

91dcbad

fix: script to scrub stale metadata

d368a69

align with base branch

eafb843

feat: upgrade github actions versions and increase pytorch timeout

cc797ec

fix: docker/setup-qemu-action

e04f71a

fix: nvidia-pythorch vulnerabilities

d8da79a

fix: base workflow

1feffe9

fix: ignore nvidia-pytorch trvy findings with certs

fe0cf15

Merge branch 'feat/image-security-scanner' into fix/image-vulnerabilities

8e54e0f


mchekm added 3 commits May 21, 2026 12:07

fix: rocm vulnerabilities

dfdcc8f

check if filebrowser generates vulneralities

14fccdc

fix: nic_sampler vulnerabilities

6936050

Chmokachka marked this pull request as ready for review May 21, 2026 21:01

mchekm added 3 commits May 22, 2026 11:17

chore: added comment

417b296

fix: ignore some of hadolint findings on push

d7a76cc

fix: relocate scrub-stale-metadata.py

94f453f

mchekm added 5 commits May 25, 2026 18:37

fix: relocate scrub-stale-metadata.py

318ee0a

fix: scrab_stale_metadata

7e46af8

fix: scrub-stale-metadata.py

be84342

feat: bump version

5731245

Merge branch 'feat/image-security-scanner' into fix/image-vulnerabilities

93447a7

Chmokachka requested a review from TimPietruskyRunPod May 27, 2026 15:44

fix: do not run trivy if no changes

0441083

TimPietruskyRunPod reviewed Jun 5, 2026

TimPietruskyRunPod mentioned this pull request Jun 5, 2026

feat: add linting, vulnerability scanning, and rework build triggers #122

Merged

TimPietruskyRunPod requested changes Jun 5, 2026

Comment thread official-templates/base/docker-bake.hcl Outdated

Comment thread official-templates/base/Dockerfile Outdated

Comment thread scripts/scrub-stale-metadata.py

Comment thread official-templates/base/requirements.txt Outdated

fix: comments

0fc2cb4

TimPietruskyRunPod approved these changes Jun 8, 2026

Base automatically changed from feat/image-security-scanner to main June 8, 2026 19:25

mchekm added 2 commits June 9, 2026 10:49

Merge branch 'main' into fix/image-vulnerabilities

0c95b22

feat: increased runners and bake-action doesn't push images

357ec59


mchekm added 4 commits June 9, 2026 14:08

fix: vulnerabilities in pip packages

be335b0

fix: nvidia requirements

75d308e

feat: push after grype

19d5d77

feat: pytorch timeout-minutes

c4d999d

TimPietruskyRunPod reviewed Jun 10, 2026

Chmokachka marked this pull request as draft June 10, 2026 09:41

mchekm added 2 commits June 10, 2026 12:59

fix: trivy leftovers and ci optimization

d4bdc52

fix: cache error

cfac9b1

Chmokachka wants to merge 41 commits into main from fix/image-vulnerabilities


4 participants
