Skip to content

[codex] fix shell runtime value quoting#30

Open
scoootscooob wants to merge 2 commits into
mainfrom
codex/fix-shell-exec-rendering
Open

[codex] fix shell runtime value quoting#30
scoootscooob wants to merge 2 commits into
mainfrom
codex/fix-shell-exec-rendering

Conversation

@scoootscooob
Copy link
Copy Markdown
Collaborator

@scoootscooob scoootscooob commented Jun 1, 2026

Summary

  • add quote-aware shell rendering for runtime placeholders
  • use it for shell=True execution checks in both verifier implementations
  • use the same renderer for background service commands
  • add {name:raw} as an explicit escape hatch for benchmark-owned shell fragments
  • document the shell placeholder contract in the v0.4 spec
  • add regression coverage for whitespace paths, shell metacharacters, pre-quoted placeholders, raw shell fragments, and background services

Root Cause

ExecutionCheck.shell defaults to True, but the shell path rendered placeholders with plain string substitution and then passed the command to /bin/sh -c. Runtime values containing whitespace were split by the shell, and metacharacters could be interpreted as shell syntax.

Compatibility

The new renderer treats {name} as literal data and quotes or escapes it according to shell quote context. Templates that intentionally need a runtime value to expand as a shell fragment can opt in at that insertion point with {name:raw}. The spec documents that {name:raw} is only for benchmark-owned fragments and should not be used for agent-produced or otherwise untrusted values.

Fixes #29.

Validation

  • /tmp/clawbench-shell-render-venv/bin/python -m pytest -q tests/test_execution_shell_rendering.py tests/test_environment_files.py tests/test_environment.py tests/test_services.py
  • /tmp/clawbench-shell-render-venv/bin/python -m ruff check clawbench/render.py clawbench/environment.py clawbench/environment_files.py clawbench/services.py tests/test_execution_shell_rendering.py tests/test_services.py
  • /tmp/clawbench-shell-render-venv/bin/python -m compileall clawbench/render.py clawbench/environment.py clawbench/environment_files.py clawbench/services.py
  • git diff --check

@clawsweeper
Copy link
Copy Markdown

clawsweeper Bot commented Jun 1, 2026

Codex review: needs maintainer review before merge. Reviewed June 1, 2026, 4:58 PM ET / 20:58 UTC.

Summary
Adds quote-aware rendering for runtime placeholders in shell-mode execution checks and background services, documents {name:raw}, and adds shell-rendering regression tests.

Reproducibility: yes. from source inspection: current main renders shell-mode commands with render_template and then passes them to /bin/sh, so whitespace and metacharacters in placeholder values are interpreted by the shell. I did not run local reproduction commands because this was a read-only review.

Review metrics: 3 noteworthy metrics.

  • Runtime Shell Paths: 3 changed. Both execution-check implementations and background services now share the new placeholder semantics.
  • Regression Coverage: 1 file added, 1 file extended. The tests cover both verifier implementations plus background service command rendering.
  • Diff Scope: 7 files, +269/-10. The patch is focused but touches core benchmark runtime behavior and the v0.4 task contract.

Merge readiness
Overall: 🐚 platinum hermit
Proof: 🌊 off-meta tidepool
Patch quality: 🐚 platinum hermit
Result: ready for maintainer review.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Rank-up moves:

  • [P2] Maintainers should explicitly accept the literal-data default plus {name:raw} upgrade path before merge.

Risk before merge

  • [P1] Existing external task templates that intentionally relied on placeholder values expanding to multiple shell words, globs, redirects, pipes, or other shell syntax will need to switch those insertion points to {name:raw}.
  • [P1] Maintainers should explicitly accept {name:raw} as the compatibility path for benchmark-owned fragments before this shell placeholder contract becomes the default.

Maintainer options:

  1. Approve The Raw Escape Hatch Contract
    Maintainers can intentionally accept literal-data placeholders by default and use {name:raw} as the upgrade path for benchmark-owned shell fragments.
  2. Require A Task Migration Audit
    Before merge, maintainers can ask for a quick audit or migration note covering any known task templates that rely on placeholder values as shell syntax.

Next step before merge

  • No automated repair is indicated; the remaining action is maintainer approval of the shell placeholder compatibility contract for this collaborator-authored PR.

Security
Cleared: No supply-chain or secret-handling regression was found; the patch reduces shell metacharacter injection risk while leaving raw fragments explicitly benchmark-owned.

Review details

Best possible solution:

Merge a maintainer-approved literal-data shell placeholder contract with {name:raw} as the documented compatibility escape hatch, keeping the regression coverage across both execution-check implementations and background services.

Do we have a high-confidence way to reproduce the issue?

Yes, from source inspection: current main renders shell-mode commands with render_template and then passes them to /bin/sh, so whitespace and metacharacters in placeholder values are interpreted by the shell. I did not run local reproduction commands because this was a read-only review.

Is this the best way to solve the issue?

Yes, assuming maintainers accept the compatibility contract: quote-aware literal rendering is the narrow fix for runtime values as data, and {name:raw} gives benchmark-owned fragments an explicit opt-in path.

AGENTS.md: not found in the target repository.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 1ca672acf013.

Label changes

Label changes:

  • add rating: 🐚 platinum hermit: Overall readiness is 🐚 platinum hermit; proof is 🌊 off-meta tidepool and patch quality is 🐚 platinum hermit.
  • add status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Not applicable: The external contributor proof gate does not apply to this collaborator-authored PR; the PR body lists validation commands and GitHub checks for the head are successful.
  • remove rating: 🦐 gold shrimp: Current PR rating is rating: 🐚 platinum hermit, so this older rating label is no longer current.
  • remove status: ⏳ waiting on author: Current PR status label is status: 👀 ready for maintainer look.

Label justifications:

  • P2: This is a normal-priority bug and hardening fix with limited blast radius but real compatibility implications for benchmark task authors.
  • merge-risk: 🚨 compatibility: Changing shell-mode placeholders to literal data by default can break existing tasks that used runtime values as shell fragments unless they adopt {name:raw}.
  • rating: 🐚 platinum hermit: Overall readiness is 🐚 platinum hermit; proof is 🌊 off-meta tidepool and patch quality is 🐚 platinum hermit.
  • status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Not applicable: The external contributor proof gate does not apply to this collaborator-authored PR; the PR body lists validation commands and GitHub checks for the head are successful.
Evidence reviewed

What I checked:

  • Current main shell execution path: Current main renders ExecutionCheck.command with plain render_template before passing it to asyncio.create_subprocess_shell when spec.shell is true, so the linked shell quoting problem is not already fixed on main. (clawbench/environment.py:111, 1ca672acf013)
  • Current main file verifier path: The file-based verifier has the same render_template before create_subprocess_shell behavior on current main. (clawbench/environment_files.py:103, 1ca672acf013)
  • Current main background service path: Background services render commands with render_template and launch them with shell=True, so service command placeholders share the same shell interpretation risk. (clawbench/services.py:83, 1ca672acf013)
  • PR shell renderer: The PR head adds render_shell_template, quotes normal placeholders according to shell quote context, and leaves {name:raw} as the explicit unescaped path. (clawbench/render.py:74, c621e6b2004d)
  • PR documentation: The PR documents that shell-mode placeholders are literal data by default and that {name:raw} is only for benchmark-owned shell fragments. (CLAWBENCH_V0_4_SPEC.md:167, c621e6b2004d)
  • PR regression coverage: The PR adds regression tests for unquoted runtime values, shell metacharacters, {name:raw}, pre-quoted placeholders, and background service commands. (tests/test_execution_shell_rendering.py:23, c621e6b2004d)

Likely related people:

  • scoootscooob: Blame ties the original render_template, shell execution paths, and background service command rendering to commits 95b226d and 56531fb, so this person is connected to the current runtime surface beyond this PR. (role: introduced behavior and current area contributor; confidence: high; commits: 95b226dfedb0, 56531fbf432b; files: clawbench/render.py, clawbench/environment.py, clawbench/environment_files.py)
  • Peter Steinberger: Commit da0bf52 introduced render_argv_template and updated shell=False execution checks to preserve placeholder values as single argv arguments. (role: adjacent shell-rendering contributor; confidence: medium; commits: da0bf52c500e; files: clawbench/render.py, clawbench/environment.py, clawbench/environment_files.py)
  • Vincent Koc: Commit ed9adf8 recently changed runtime path hardening in the same environment and service code paths touched by this PR. (role: recent runtime hardening contributor; confidence: medium; commits: ed9adf8d8451, 88ab0f55646e; files: clawbench/environment.py, clawbench/services.py, tests/test_environment.py)
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

@clawsweeper clawsweeper Bot added rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. P2 Normal priority bug or improvement with limited blast radius. merge-risk: 🚨 compatibility 🚨 Merging this PR could break existing users, config, migrations, defaults, or upgrades. labels Jun 1, 2026
@scoootscooob scoootscooob marked this pull request as ready for review June 1, 2026 18:27
@scoootscooob scoootscooob requested a review from a team as a code owner June 1, 2026 18:27
@clawsweeper clawsweeper Bot added rating: 🦐 gold shrimp Decent PR readiness signal, but merge confidence is limited. status: ⏳ waiting on author ClawSweeper has contributor-facing work open and is waiting for author action. and removed rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. labels Jun 1, 2026
@scoootscooob
Copy link
Copy Markdown
Collaborator Author

@clawsweeper re-review

@clawsweeper
Copy link
Copy Markdown

clawsweeper Bot commented Jun 1, 2026

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

@clawsweeper clawsweeper Bot added rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. and removed rating: 🦐 gold shrimp Decent PR readiness signal, but merge confidence is limited. status: ⏳ waiting on author ClawSweeper has contributor-facing work open and is waiting for author action. labels Jun 1, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

merge-risk: 🚨 compatibility 🚨 Merging this PR could break existing users, config, migrations, defaults, or upgrades. P2 Normal priority bug or improvement with limited blast radius. rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: shell=True execution checks mis-tokenize runtime values containing whitespace or shell metacharacters

1 participant