[codex] fix shell runtime value quoting by scoootscooob · Pull Request #30 · openclaw/clawbench

scoootscooob · 2026-06-01T18:18:38Z

Summary

add quote-aware shell rendering for runtime placeholders
use it for shell=True execution checks in both verifier implementations
use the same renderer for background service commands
add {name:raw} as an explicit escape hatch for benchmark-owned shell fragments
document the shell placeholder contract in the v0.4 spec
add regression coverage for whitespace paths, shell metacharacters, pre-quoted placeholders, raw shell fragments, and background services

Root Cause

ExecutionCheck.shell defaults to True, but the shell path rendered placeholders with plain string substitution and then passed the command to /bin/sh -c. Runtime values containing whitespace were split by the shell, and metacharacters could be interpreted as shell syntax.

Compatibility

The new renderer treats {name} as literal data and quotes or escapes it according to shell quote context. Templates that intentionally need a runtime value to expand as a shell fragment can opt in at that insertion point with {name:raw}. The spec documents that {name:raw} is only for benchmark-owned fragments and should not be used for agent-produced or otherwise untrusted values.

Fixes #29.

Validation

/tmp/clawbench-shell-render-venv/bin/python -m pytest -q tests/test_execution_shell_rendering.py tests/test_environment_files.py tests/test_environment.py tests/test_services.py
/tmp/clawbench-shell-render-venv/bin/python -m ruff check clawbench/render.py clawbench/environment.py clawbench/environment_files.py clawbench/services.py tests/test_execution_shell_rendering.py tests/test_services.py
/tmp/clawbench-shell-render-venv/bin/python -m compileall clawbench/render.py clawbench/environment.py clawbench/environment_files.py clawbench/services.py
git diff --check

clawsweeper · 2026-06-01T18:19:43Z

Codex review: needs maintainer review before merge. Reviewed June 1, 2026, 4:58 PM ET / 20:58 UTC.

Summary
Adds quote-aware rendering for runtime placeholders in shell-mode execution checks and background services, documents {name:raw}, and adds shell-rendering regression tests.

Reproducibility: yes. from source inspection: current main renders shell-mode commands with render_template and then passes them to /bin/sh, so whitespace and metacharacters in placeholder values are interpreted by the shell. I did not run local reproduction commands because this was a read-only review.

Review metrics: 3 noteworthy metrics.

Runtime Shell Paths: 3 changed. Both execution-check implementations and background services now share the new placeholder semantics.
Regression Coverage: 1 file added, 1 file extended. The tests cover both verifier implementations plus background service command rendering.
Diff Scope: 7 files, +269/-10. The patch is focused but touches core benchmark runtime behavior and the v0.4 task contract.

Merge readiness
Overall: 🐚 platinum hermit
Proof: 🌊 off-meta tidepool
Patch quality: 🐚 platinum hermit
Result: ready for maintainer review.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Rank-up moves:

[P2] Maintainers should explicitly accept the literal-data default plus {name:raw} upgrade path before merge.

Risk before merge

[P1] Existing external task templates that intentionally relied on placeholder values expanding to multiple shell words, globs, redirects, pipes, or other shell syntax will need to switch those insertion points to {name:raw}.
[P1] Maintainers should explicitly accept {name:raw} as the compatibility path for benchmark-owned fragments before this shell placeholder contract becomes the default.

Maintainer options:

Approve The Raw Escape Hatch Contract
Maintainers can intentionally accept literal-data placeholders by default and use {name:raw} as the upgrade path for benchmark-owned shell fragments.
Require A Task Migration Audit
Before merge, maintainers can ask for a quick audit or migration note covering any known task templates that rely on placeholder values as shell syntax.

Next step before merge

No automated repair is indicated; the remaining action is maintainer approval of the shell placeholder compatibility contract for this collaborator-authored PR.

Security
Cleared: No supply-chain or secret-handling regression was found; the patch reduces shell metacharacter injection risk while leaving raw fragments explicitly benchmark-owned.

Review details

Best possible solution:

Merge a maintainer-approved literal-data shell placeholder contract with {name:raw} as the documented compatibility escape hatch, keeping the regression coverage across both execution-check implementations and background services.

Do we have a high-confidence way to reproduce the issue?

Yes, from source inspection: current main renders shell-mode commands with render_template and then passes them to /bin/sh, so whitespace and metacharacters in placeholder values are interpreted by the shell. I did not run local reproduction commands because this was a read-only review.

Is this the best way to solve the issue?

Yes, assuming maintainers accept the compatibility contract: quote-aware literal rendering is the narrow fix for runtime values as data, and {name:raw} gives benchmark-owned fragments an explicit opt-in path.

AGENTS.md: not found in the target repository.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 1ca672acf013.

Label changes

Label changes:

add rating: 🐚 platinum hermit: Overall readiness is 🐚 platinum hermit; proof is 🌊 off-meta tidepool and patch quality is 🐚 platinum hermit.
add status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Not applicable: The external contributor proof gate does not apply to this collaborator-authored PR; the PR body lists validation commands and GitHub checks for the head are successful.
remove rating: 🦐 gold shrimp: Current PR rating is rating: 🐚 platinum hermit, so this older rating label is no longer current.
remove status: ⏳ waiting on author: Current PR status label is status: 👀 ready for maintainer look.

Label justifications:

P2: This is a normal-priority bug and hardening fix with limited blast radius but real compatibility implications for benchmark task authors.
merge-risk: 🚨 compatibility: Changing shell-mode placeholders to literal data by default can break existing tasks that used runtime values as shell fragments unless they adopt {name:raw}.
rating: 🐚 platinum hermit: Overall readiness is 🐚 platinum hermit; proof is 🌊 off-meta tidepool and patch quality is 🐚 platinum hermit.
status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Not applicable: The external contributor proof gate does not apply to this collaborator-authored PR; the PR body lists validation commands and GitHub checks for the head are successful.

Evidence reviewed

What I checked:

Current main shell execution path: Current main renders ExecutionCheck.command with plain render_template before passing it to asyncio.create_subprocess_shell when spec.shell is true, so the linked shell quoting problem is not already fixed on main. (clawbench/environment.py:111, 1ca672acf013)
Current main file verifier path: The file-based verifier has the same render_template before create_subprocess_shell behavior on current main. (clawbench/environment_files.py:103, 1ca672acf013)
Current main background service path: Background services render commands with render_template and launch them with shell=True, so service command placeholders share the same shell interpretation risk. (clawbench/services.py:83, 1ca672acf013)
PR shell renderer: The PR head adds render_shell_template, quotes normal placeholders according to shell quote context, and leaves {name:raw} as the explicit unescaped path. (clawbench/render.py:74, c621e6b2004d)
PR documentation: The PR documents that shell-mode placeholders are literal data by default and that {name:raw} is only for benchmark-owned shell fragments. (CLAWBENCH_V0_4_SPEC.md:167, c621e6b2004d)
PR regression coverage: The PR adds regression tests for unquoted runtime values, shell metacharacters, {name:raw}, pre-quoted placeholders, and background service commands. (tests/test_execution_shell_rendering.py:23, c621e6b2004d)

Likely related people:

scoootscooob: Blame ties the original render_template, shell execution paths, and background service command rendering to commits 95b226d and 56531fb, so this person is connected to the current runtime surface beyond this PR. (role: introduced behavior and current area contributor; confidence: high; commits: 95b226dfedb0, 56531fbf432b; files: clawbench/render.py, clawbench/environment.py, clawbench/environment_files.py)
Peter Steinberger: Commit da0bf52 introduced render_argv_template and updated shell=False execution checks to preserve placeholder values as single argv arguments. (role: adjacent shell-rendering contributor; confidence: medium; commits: da0bf52c500e; files: clawbench/render.py, clawbench/environment.py, clawbench/environment_files.py)
Vincent Koc: Commit ed9adf8 recently changed runtime path hardening in the same environment and service code paths touched by this PR. (role: recent runtime hardening contributor; confidence: medium; commits: ed9adf8d8451, 88ab0f55646e; files: clawbench/environment.py, clawbench/services.py, tests/test_environment.py)

What the crustacean ranks mean

🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works

ClawSweeper keeps one durable marker-backed review comment per issue or PR.
Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
Maintainers can also comment @clawsweeper review to request a fresh review only.
Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

scoootscooob · 2026-06-01T20:52:11Z

@clawsweeper re-review

clawsweeper · 2026-06-01T20:52:14Z

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

State: Complete
Detail: The targeted re-review finished, the durable review comment was updated, and the synced verdict was routed.
Run: https://github.com/openclaw/clawsweeper/actions/runs/26781375966
Updated: 2026-06-01T20:58:58.768Z

fix shell runtime value quoting

ceda6e2

scoootscooob marked this pull request as ready for review June 1, 2026 18:27

scoootscooob requested a review from a team as a code owner June 1, 2026 18:27

add raw shell placeholder escape hatch

c621e6b

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[codex] fix shell runtime value quoting#30

[codex] fix shell runtime value quoting#30
scoootscooob wants to merge 2 commits into
mainfrom
codex/fix-shell-exec-rendering

scoootscooob commented Jun 1, 2026 •

edited

Loading

Uh oh!

clawsweeper Bot commented Jun 1, 2026 •

edited

Loading

Uh oh!

scoootscooob commented Jun 1, 2026

Uh oh!

clawsweeper Bot commented Jun 1, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

scoootscooob commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Root Cause

Compatibility

Validation

Uh oh!

clawsweeper Bot commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

scoootscooob commented Jun 1, 2026

Uh oh!

clawsweeper Bot commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

scoootscooob commented Jun 1, 2026 •

edited

Loading

clawsweeper Bot commented Jun 1, 2026 •

edited

Loading

clawsweeper Bot commented Jun 1, 2026 •

edited

Loading