test(eval): add ContextBench protocol fixtures#119
Conversation
Greptile SummaryThis PR freezes the ContextBench Phase 37 benchmark infrastructure: a deterministic task-selection script, six JSON fixture files (protocol, lanes, corrections, task manifest, selection exclusions, smoke pack), and two validation test suites. The four pre-existing slow test files are also updated to replace magic timeout literals with a named constant. Previously reported issues — live-dataset drift in Confidence Score: 5/5Safe to merge; all findings are P2 style suggestions with no runtime impact on the committed test fixtures or normal CI execution. All P1 issues from previous review rounds have been resolved in commit 04a6cfb. Remaining findings are non-blocking P2s: an unpinned evaluator clone in a manually-invoked probe mode and a duplicated constant in test files. Fixture hash values are well-formed, hash computation is consistent between script and tests, and the --check self-verification path is correctly isolated from live data. scripts/contextbench-select-slice.mjs — the probeEvaluator fallback clone at line ~563 is the only area that could become flaky over time, but it is not exercised in normal CI.
|
| Filename | Overview |
|---|---|
| scripts/contextbench-select-slice.mjs | New script for deterministic ContextBench task selection, payload materialization, and manifest verification. Previously reported issues (live-dataset drift in --check and task-push before failure) have been addressed. Minor concern: the fallback evaluator clone in probeEvaluator uses an unpinned --depth 1 ref. |
| tests/contextbench-task-manifest.test.ts | Comprehensive validation tests for the task manifest, exclusion log, payload materialization, gold-input writing, and checkout materialization. Tests exercise the script end-to-end with synthetic fixtures. Hash computation in tests mirrors the script's canonicalize/stableStringify path correctly. |
| tests/contextbench-protocol.test.ts | Fixture-only protocol invariant tests covering phase boundaries, lane isolation, run-count policy, correction governance, and anti-gaming tripwires. No live network access; all assertions are against committed JSON fixtures. |
| tests/fixtures/contextbench-corrections.json | Corrections ledger with one factual-erratum entry for the hardness-signal policy update. All required correction fields are present and the hashes are well-formed SHA-256 values. |
| tests/impact-2hop.test.ts | Timeout value replaced with a named constant SLOW_WINDOWS_TEST_TIMEOUT_MS (60 s); same constant duplicated across three other test files rather than being shared. |
| tests/search-decision-card.test.ts | Seven 30000 ms timeout literals replaced with SLOW_WINDOWS_TEST_TIMEOUT_MS; same duplication issue as the other test files. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[contextbench-select-slice.mjs] --> B{Mode}
B -->|--dry-run| C[loadRows from HuggingFace]
B -->|--write-fixtures| C
B -->|--write-task-payloads| D[loadRowsForArgs]
B -->|--write-gold| D
B -->|--check + no rows-file| E[verifyManifest self-hash only]
B -->|--check + rows-file| C
B -->|--probe-evaluator| F[createEvaluatorFixture]
B -->|--materialize-checkouts| G[cloneCheckout per task]
C --> H[buildPool - normalizeRow, dedup]
H --> I[selectTasks - language/source/repo/fill coverage]
I --> J[buildArtifacts]
J -->|--write-fixtures| K[tests/fixtures/contextbench-task-manifest.json]
J -->|--write-fixtures| L[tests/fixtures/contextbench-selection-exclusions.json]
J -->|--check| M[verifyManifest manifest vs recomputed]
D --> N[buildTaskPayloads - verify hashes, skip on failure]
N --> O[writeJson payloads file]
F --> P[python -m contextbench.evaluate probe]
P -->|module missing| Q[clone EuniAI/ContextBench - unpinned]
Q --> P
G --> R[git clone + checkout base_commit]
R --> S[withPayloadHash - excludes checkout_path / status_short / materialized_at]
Reviews (2): Last reviewed commit: "Merge remote-tracking branch 'origin/pr/..." | Re-trigger Greptile
| failures.push('manifest_hash does not match manifest content'); | ||
| if (actualHash !== expected.manifest_hash) | ||
| failures.push('manifest differs from deterministic dataset selection'); | ||
| if (actual.tasks.length !== 20) failures.push(`expected 20 tasks, got ${actual.tasks.length}`); | ||
| if (actual.hardness_proxy_used !== false) | ||
| failures.push('manifest must set hardness_proxy_used false'); | ||
| if (actual.hardness_signal_status !== HARDNESS_STATUS) | ||
| failures.push('manifest has wrong hardness signal status'); | ||
| if (!actual.no_lane_outputs_observed_attestation) failures.push('missing no-output attestation'); | ||
| if (new Set(actual.tasks.map((task) => task.repo_url)).size < 2) | ||
| failures.push('selected tasks cover fewer than two repos'); | ||
| if (new Set(actual.tasks.map((task) => task.language)).size < 2) | ||
| failures.push('selected tasks cover fewer than two languages'); | ||
| return failures; | ||
| } | ||
|
|
||
| function run(command, args, cwd) { | ||
| const result = spawnSync(command, args, { cwd, encoding: 'utf8', env: childEnvForCommand(command) }); | ||
| return { status: result.status, stdout: result.stdout ?? '', stderr: result.stderr ?? '' }; | ||
| } |
There was a problem hiding this comment.
verifyManifest / --check will permanently break as dataset grows
verifyManifest declares failure when actualHash !== expected.manifest_hash (line 473), but expected is built by re-running buildArtifacts against the live dataset. taskPoolHash (which drives manifestBase and thus manifest_hash) is computed from the full eligible pool — so any new row added to Contextbench/ContextBench after the manifest was frozen will change taskPoolHash, causing --check to report "manifest differs from deterministic dataset selection" for a legitimately untampered manifest.
The PR description's validation step runs --check without --rows-file, meaning it hits the live HuggingFace API. As soon as the upstream dataset gains a single new eligible row this check becomes permanently broken and can no longer serve as an integrity guard.
The --rows-file option exists to bypass this, but the verification workflow should make it mandatory (or snapshot the dataset at freeze time) so the check stays meaningful.
There was a problem hiding this comment.
Clarifying previous reply: fixed in commit 04a6cfb. Check without rows-file now performs only committed-manifest self-verification and exits before live dataset loading. Passing rows-file remains the stronger deterministic reselection mode from frozen rows.
|
|
||
| function canonicalize(value) { | ||
| if (value === undefined) return 'undefined'; | ||
| if (value === null) return 'null'; | ||
| if (typeof value !== 'string') return stableStringify(value).replace(/\r\n?/g, '\n'); | ||
| const normalized = value.replace(/\r\n?/g, '\n'); | ||
| const trimmed = normalized.trim(); | ||
| if (trimmed.startsWith('{') || trimmed.startsWith('[')) { | ||
| try { | ||
| return stableStringify(JSON.parse(trimmed)); | ||
| } catch { | ||
| return normalized; | ||
| } | ||
| } | ||
| return normalized; | ||
| } | ||
|
|
||
| function sha256(value) { | ||
| return `sha256:${createHash('sha256').update(value, 'utf8').digest('hex')}`; | ||
| } | ||
|
|
||
| function hashObject(value) { | ||
| return sha256(stableStringify(value)); | ||
| } | ||
|
|
||
| function writeJson(path, value) { | ||
| mkdirSync(dirname(path), { recursive: true }); | ||
| writeFileSync(path, `${JSON.stringify(value, null, 2)}\n`, 'utf8'); | ||
| } | ||
|
|
There was a problem hiding this comment.
stableStringify/hashObject duplicated in script and test
Identical implementations appear in scripts/contextbench-select-slice.mjs (lines 113–142) and tests/contextbench-task-manifest.test.ts. The test's self-verification (manifest_hash re-computation at line 1626) cross-validates the two today, but a future divergence — even a subtle one like a different localeCompare locale argument — would silently produce different hashes in the script vs. the test without a clear failure message. Consider extracting to a shared utility module.
There was a problem hiding this comment.
Leaving this intentionally separate for this PR. The test-side hashing is an independent verifier for the frozen JSON artifact; extracting a shared utility would couple the producer and verifier and add a new shared module just for one script/test pair. The manifest self-check and hash-order regression still cover deterministic behavior.
| if (!row) { | ||
| failures.push(`${task.instance_id}: missing dataset row`); | ||
| continue; | ||
| } | ||
| const problemStatement = typeof row.problem_statement === 'string' ? row.problem_statement : ''; | ||
| if (!problemStatement.trim()) failures.push(`${task.instance_id}: missing problem_statement`); | ||
| const problemStatementHash = sha256(canonicalize(problemStatement)); | ||
| if (problemStatementHash !== task.problem_statement_hash) { | ||
| failures.push(`${task.instance_id}: problem_statement_hash mismatch`); | ||
| } | ||
| if (row.repo_url !== task.repo_url) failures.push(`${task.instance_id}: repo_url mismatch`); | ||
| if (row.base_commit !== task.base_commit) | ||
| failures.push(`${task.instance_id}: base_commit mismatch`); | ||
| tasks.push({ | ||
| instance_id: task.instance_id, | ||
| original_inst_id: task.original_inst_id, | ||
| repo: task.repo, | ||
| repo_url: task.repo_url, | ||
| base_commit: task.base_commit, | ||
| problem_statement: problemStatement, | ||
| problem_statement_hash: problemStatementHash, | ||
| problem_statement_hash_verified: problemStatementHash === task.problem_statement_hash, | ||
| repo_checkout_path: checkoutPathForTask(task, checkoutRoot), | ||
| repo_checkout_status: checkoutRoot ? 'planned_not_verified' : 'not_planned', | ||
| lane_outputs_observed: false | ||
| }); | ||
| } | ||
| if (failures.length > 0) | ||
| throw new Error(`task payload materialization failed:\n- ${failures.join('\n- ')}`); | ||
| const payloadBase = { | ||
| name: 'v2.4-contextbench-phase40-task-payloads', | ||
| protocolVersion: manifest.protocolVersion, | ||
| dataset: manifest.dataset, | ||
| datasetConfig: manifest.datasetConfig, | ||
| split: manifest.split, | ||
| claimBearing: false, | ||
| purpose: | ||
| 'Phase 40 task input materialization; not lane output and not benchmark evidence by itself.', | ||
| manifest_hash: manifest.manifest_hash, | ||
| hash_canonicalization_version: CANONICALIZATION_VERSION, | ||
| checkout_root: checkoutRoot ? resolve(checkoutRoot) : null, | ||
| task_count: tasks.length, | ||
| tasks | ||
| }; | ||
| return { ...payloadBase, payload_hash: hashObject(payloadBase) }; | ||
| } | ||
|
|
||
| function summarize(tasks) { | ||
| const countBy = (field) => | ||
| tasks.reduce((acc, task) => { | ||
| acc[task[field]] = (acc[task[field]] ?? 0) + 1; | ||
| return acc; |
There was a problem hiding this comment.
buildTaskPayloads pushes mismatched tasks before throwing
When row.repo_url !== task.repo_url or row.base_commit !== task.base_commit, the failure is pushed to failures[] but the task is still appended to tasks[] (line 356–369). The function throws before returning, so no corrupted data escapes today. However, if the error-accumulation logic is ever relaxed (e.g., to return partial results on non-fatal mismatches), the tasks array will silently contain entries with un-verified repo coordinates. Pushing to tasks should be conditioned on there being no per-task failure.
There was a problem hiding this comment.
Clarifying previous reply: fixed in commit 04a6cfb. Task payload construction now skips appending an entry whenever that task has validation failures, and the regression test verifies a mismatched problem statement fails without writing a payload artifact.
Adds ContextBench protocol, lane, correction, task-manifest, selection-exclusion, and smoke-pack fixtures with deterministic validation tests. This PR freezes/reviews benchmark inputs and helper materialization only; it does not include live benchmark results, Phase 42 pass claims, or product-performance claims.\n\nValidation run:\n- rtk pnpm exec vitest run tests/contextbench-protocol.test.ts tests/contextbench-task-manifest.test.ts\n- rtk node scripts/contextbench-select-slice.mjs --check tests/fixtures/contextbench-task-manifest.json\n- rtk pnpm exec tsc --noEmit\n- pre-push hook passed after build/test execution