Fix: image vulnerabilities #124

Draft
Chmokachka wants to merge 41 commits into main from fix/image-vulnerabilities

Chmokachka (Collaborator) commented May 18, 2026

Summary

Drives all runpod/* images to a clean Trivy / Hadolint scan, plus a few CI fixes that surfaced along the way. Targets every image we ship out of official-templates/ and helper-templates/.

What's fixed

Image vulnerabilities (Trivy `--severity HIGH,CRITICAL`)

  • base — bumped jupyterlab, notebook, OpenSSH-related deps; stripped the efa_metrics directory from NVIDIA Nsight Compute. That directory ships an internal Go binary (nic_sampler) that NVIDIA builds with an old Go toolchain and was triggering recurring Go-stdlib HIGH/CRITICAL findings on every rebuild. The plugin is AWS-EFA-only (x86, AWS hardware) and never runs on RunPod, so deleting it is safe and the find ... || true guard keeps it a no-op on ROCm / CPU images.
  • autoresearch — fixed Hadolint findings, aligned with new base.
  • pytorch — Hadolint fixes; bumped max-parallelism to 3 in CI and increased the workflow timeout (the matrix was OOM-killing the runner before).
  • rocm — addressed all fixable CVEs; pinned the relevant deps.
  • nvidia-pytorch — patched OS-package CVEs; added scrub-stale-metadata.py (see below) to remove orphan .dist-info / .egg-info trees that kept Trivy reporting fixed wheels as still-vulnerable.

Hadolint

  • All DL3008 / DL3009 / DL3015 findings fixed across the touched Dockerfiles (--no-install-recommends, apt-get clean && rm -rf /var/lib/apt/lists/*, version pins where reasonable).
  • Hadolint-on-push workflow now ignores the rules we already chose to accept project-wide (matches the PR check behaviour).

CI / tooling

  • Upgraded GitHub Actions versions across nvidia.yml, rocm.yml, hadolint-pr.yml, hadolint-push.yml.
  • Replaced the brittle Trivy action call with our internal .github/actions/trivy — exposes a skip_files input so nvidia-pytorch can skip the publicly-known CA bundle that Trivy flags as a "secret". The cert is the upstream NGC trust bundle published on GitHub, so flagging it is a false positive.
  • Pinned RUNPODCTL_VERSION=v2.3.0 in base/Dockerfile to stop tracking latest.
  • Fixed docker/setup-qemu-action invocation that started failing after the action's input rename.

New: scripts/scrub-stale-metadata.py

Small helper invoked by Dockerfiles after pip install. NGC base images bundle several Python packages as in-tree source builds whose .egg-info lives next to the source. pip install --upgrade upgrades the wheel install but cannot reach those bundled trees, so Trivy keeps reporting the old version even though the runtime resolves to the new one. The script reads our pinned requirements.txt and deletes any .dist-info / .egg-info whose Version: disagrees with the pin.
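
The approach described above can be sketched roughly as follows. This is not the script from the PR; it is a minimal illustration that assumes the pins are plain `name==version` lines, that each metadata directory carries a `METADATA` (for .dist-info) or `PKG-INFO` (for .egg-info) file with `Name:` and `Version:` headers, and that the caller passes in the site-packages path. The function names (`read_pins`, `scrub`, and so on) are hypothetical.

```python
from __future__ import annotations

import re
import shutil
from pathlib import Path
from typing import Iterator


def normalize(name: str) -> str:
    """PEP 503 style normalization so 'Foo_Bar' and 'foo-bar' compare equal."""
    return re.sub(r"[-_.]+", "-", name).lower()


def read_pins(requirements: Path) -> dict[str, str]:
    """Collect 'name==version' pins; any other requirement form is ignored."""
    pins: dict[str, str] = {}
    for raw in requirements.read_text().splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        match = re.fullmatch(r"([A-Za-z0-9._-]+)\s*==\s*([^\s;]+).*", line)
        if match:
            pins[normalize(match.group(1))] = match.group(2)
    return pins


def metadata_dirs(site_packages: Path) -> Iterator[Path]:
    """Yield every .dist-info / .egg-info directory directly under site_packages."""
    for pattern in ("*.dist-info", "*.egg-info"):
        yield from (p for p in site_packages.glob(pattern) if p.is_dir())


def read_name_version(meta_dir: Path) -> tuple[str, str] | None:
    """Return (normalized name, version) from the metadata headers, if present."""
    file_name = "METADATA" if meta_dir.suffix == ".dist-info" else "PKG-INFO"
    meta_file = meta_dir / file_name
    if not meta_file.is_file():
        return None
    fields: dict[str, str] = {}
    for line in meta_file.read_text(errors="replace").splitlines():
        if not line.strip():
            break  # headers end at the first blank line
        key, sep, value = line.partition(":")
        if sep and key in ("Name", "Version"):
            fields.setdefault(key, value.strip())
    if "Name" in fields and "Version" in fields:
        return normalize(fields["Name"]), fields["Version"]
    return None


def scrub(site_packages: Path, pins: dict[str, str], dry_run: bool = False) -> list[Path]:
    """Delete metadata dirs whose version disagrees with the pin; return them."""
    removed: list[Path] = []
    # sorted() materializes the listing before anything is deleted.
    for meta_dir in sorted(metadata_dirs(site_packages)):
        parsed = read_name_version(meta_dir)
        if parsed is None:
            continue
        name, version = parsed
        if name in pins and pins[name] != version:
            if not dry_run:
                shutil.rmtree(meta_dir)
            removed.append(meta_dir)
    return removed
```

Packages that are not pinned are left alone in this sketch, so a missing pin never causes a deletion; running with `dry_run=True` first shows what would be removed.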

What's NOT fixed (deliberate)

Three images still have findings we can't act on in this PR:

| Image | Reason |
| --- | --- |
| runpod/base:...-rocm644-...-pytorch251 | All remaining CVEs are in PyTorch 2.5.1 itself, fixed only in 2.6.0+. Two options: drop the 2.5.1 variant, or wait for an upstream backport. Left for a separate decision. |
| runpod/autoresearch:...-cuda1281-ubuntu2204 | Findings are in transitive deps that need an autoresearch app-level dependency upgrade, which is out of scope for this PR. |
| runpod/autoresearch:...-cuda1281-ubuntu2404 | Same as above. |

These are tracked separately; everything else is now clean.

Validation

  • Trivy table-mode scans of each rebuilt tag — clean HIGH/CRITICAL on every targeted image.
  • Hadolint runs against the touched Dockerfiles — clean.

Follow-ups (separate PRs)

  • Open autoresearch-side PR to upgrade transitive deps.

@Chmokachka Chmokachka changed the base branch from main to feat/image-security-scanner May 18, 2026 13:19
@Chmokachka Chmokachka marked this pull request as ready for review May 21, 2026 21:01
kodxana (Contributor) commented Jun 3, 2026

Good vulnerability cleanup overall, especially pinning versions and removing stale metadata that causes false-positive
Trivy reports.

My blocker is that this PR still depends on the report-only Trivy behavior from #122. The action still does not fail
when HIGH/CRITICAL findings are found, and the workflows still scan after push: true, so the CI does not prove that
vulnerable images are blocked from publication.

Because this PR’s goal is “fix image vulnerabilities”, I’d like to see one of these before merge:

  • Trivy exits non-zero on HIGH/CRITICAL vulnerabilities that have a fix available, and runs before publish, or
  • the PR clearly states that CI is not enforcing this yet and includes links/logs showing the claimed clean scans for
    each targeted image.

The skip-files addition seems reasonable for the known civetweb cert false positives, but it makes the enforcement
story even more important so real findings do not get hidden behind a passing workflow.
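
For illustration, the first option could look something like the sketch below: a small gate that reads a Trivy JSON report and returns a non-zero exit code when any HIGH/CRITICAL finding has a fixed version. It assumes the report was produced with `--format json` and relies on the `Results[].Vulnerabilities[]` layout with `Severity` and `FixedVersion` fields; the function names are hypothetical and this is not the code in `.github/actions/trivy`. Trivy's own `--exit-code 1` together with `--severity HIGH,CRITICAL` and `--ignore-unfixed` may achieve the same result without any post-processing.

```python
import json
import sys

BLOCKING = {"HIGH", "CRITICAL"}


def blocking_findings(report: dict) -> list:
    """Return HIGH/CRITICAL vulnerabilities that already have a fixed version."""
    findings = []
    for result in report.get("Results") or []:
        # "Vulnerabilities" is absent (or null) for targets with no findings.
        for vuln in result.get("Vulnerabilities") or []:
            if vuln.get("Severity") in BLOCKING and vuln.get("FixedVersion"):
                findings.append(
                    {
                        "target": result.get("Target", ""),
                        "id": vuln.get("VulnerabilityID", ""),
                        "package": vuln.get("PkgName", ""),
                        "installed": vuln.get("InstalledVersion", ""),
                        "fixed": vuln.get("FixedVersion", ""),
                        "severity": vuln.get("Severity", ""),
                    }
                )
    return findings


def main(path: str) -> int:
    """Print each blocking finding and return the process exit code."""
    with open(path, encoding="utf-8") as handle:
        report = json.load(handle)
    findings = blocking_findings(report)
    for f in findings:
        print(
            f"{f['severity']} {f['id']} in {f['package']} "
            f"{f['installed']} (fixed in {f['fixed']}) [{f['target']}]"
        )
    return 1 if findings else 0


if __name__ == "__main__" and len(sys.argv) == 2:
    sys.exit(main(sys.argv[1]))
```

Placing a step like this between build and push is what would turn the scan into a gate; files excluded from the scan never appear in the report, so any skip list is worth keeping short and reviewed.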

TimPietruskyRunPod (Member) left a comment

Superseded — see the Changes Requested review below.

Correction: my earlier note here claimed nvidia-pytorch "now ships Jupyter." That was wrong — the image always had Jupyter; dropping RP_SKIP_JUPYTER just lets RunPod manage/patch it. Please disregard this review.

TimPietruskyRunPod (Member) left a comment

Requesting changes — a few quality fixes (details inline). The stale-metadata approach, the efa_metrics strip, and the version pins all look good, and this PR also resolves the missing if: guard I flagged on #122.

  • scrab_stale_metadata typo (bake files + COPY --from=).
  • scrub-stale-metadata.py uses 3.10+ annotation syntax + an unimported Iterator.
  • Trailing space / missing newline in the new requirements.txt files.

Comment thread official-templates/base/docker-bake.hcl Outdated
Comment thread official-templates/base/Dockerfile Outdated
Comment thread scripts/scrub-stale-metadata.py
Comment thread official-templates/base/requirements.txt Outdated

TimPietruskyRunPod (Member) left a comment

All four requested changes are in and verified (scrub typo across the bake files + Dockerfile, from __future__ import annotations + proper Iterator import in scrub-stale-metadata.py, and the requirements.txt whitespace/newline fixes). Threads resolved. Thanks!

Base automatically changed from feat/image-security-scanner to main June 8, 2026 19:25
TimPietruskyRunPod (Member) left a comment

Re-reviewed the new commits added since my approval (feat: push after grype + feat: pytorch timeout-minutes). Summary:

Good — this is a real improvement. Build now runs with load: true, push: false, then Extract refs → Grype → the new docker-push action, consistently across base/nvidia/rocm. So images are scanned before publish — the "scan before push" gap from the original review is now structurally closed. The autoresearch/pytorch push steps are correctly gated by steps.changes (same job that owns the changes step), and the docker-push composite action is sound (set -euo pipefail, fails on empty refs).

One concern to confirm before merge (inline): load: true materializes every image in the bake matrix into the runner's local Docker daemon before pushing. build-nvidia (single image) passed, but base and pytorch build large matrices of multi-GB CUDA/torch images — loading them all at once risks exhausting runner disk (ENOSPC). The current run is the first with load: true on those matrices and build-base/build-rocm are still in progress; worth watching them specifically. The blacksmith-32vcpu + timeout-minutes: 240 bump helps CPU/time but not disk.

Note: grype is still report-only by design, so push isn't actually blocked on findings yet — but the ordering means flipping the exit 1 later turns it into a real gate with no further restructuring. Fine as a foundation.

No blocking objection from me on the design; just confirm the base/pytorch runs go green with load: true.

official-templates/shared/versions.hcl
official-templates/base/docker-bake.hcl
push: true
load: true

Member

load: true imports the entire bake matrix into the runner's local Docker daemon before the push step. For the base CUDA matrix (and especially the pytorch matrix of ~25 multi-GB CUDA/torch images) this can exhaust the runner's disk (ENOSPC), since all images coexist locally instead of streaming straight to the registry. nvidia (1 image) is fine; please confirm build-base/build-pytorch actually pass with this. If disk becomes the bottleneck, consider load → scan → push → docker image rm per ref (or per small batch) so the images don't all pile up at once.
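
A rough sketch of that per-ref loop, for discussion only. It assumes each bake target produces exactly one image ref (the target/ref pairs are supplied by the caller and are hypothetical), that `docker buildx bake --load <target>` loads a single target into the local daemon, and that `grype <ref>` stays report-only as it is today. The `run` parameter is an illustrative hook so the ordering can be checked without Docker.

```python
from __future__ import annotations

import subprocess
from typing import Callable, Iterable, Sequence

Runner = Callable[[Sequence[str]], None]


def run_checked(cmd: Sequence[str]) -> None:
    """Run a command and raise CalledProcessError on a non-zero exit."""
    subprocess.run(list(cmd), check=True)


def publish_one_at_a_time(
    targets: Iterable[tuple[str, str]],
    run: Runner = run_checked,
) -> list[str]:
    """Load, scan, push, and remove each image before starting the next one."""
    pushed: list[str] = []
    for target, ref in targets:
        # Build (or reuse cache for) one target and load it into the daemon.
        run(["docker", "buildx", "bake", "--load", target])
        try:
            run(["grype", ref])           # scan the locally loaded image
            run(["docker", "push", ref])  # publish only after the scan ran
            pushed.append(ref)
        finally:
            # Free the image even when the scan or push failed.
            run(["docker", "image", "rm", ref])
    return pushed
```

Only one image is resident at a time in this shape, at the cost of serializing the matrix. BuildKit's build cache is separate from the loaded image, so `docker image rm` alone may not reclaim all of the space.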

Collaborator Author

I'm still experimenting with load: true, since it really does exhaust the runner's disk. Moving the PR back to draft until I'm done with it.

@Chmokachka Chmokachka marked this pull request as draft June 10, 2026 09:41