test(eval): add ContextBench protocol fixtures #119

Merged
PatrickSys merged 6 commits into master from pr/contextbench-protocol-fixtures
Apr 29, 2026

Conversation

@PatrickSys
Owner

Adds ContextBench protocol, lane, correction, task-manifest, selection-exclusion, and smoke-pack fixtures with deterministic validation tests. This PR freezes and reviews benchmark inputs and helper materialization only; it does not include live benchmark results, Phase 42 pass claims, or product-performance claims.

Validation run:

- rtk pnpm exec vitest run tests/contextbench-protocol.test.ts tests/contextbench-task-manifest.test.ts
- rtk node scripts/contextbench-select-slice.mjs --check tests/fixtures/contextbench-task-manifest.json
- rtk pnpm exec tsc --noEmit
- pre-push hook passed after build/test execution

@greptile-apps
Contributor

greptile-apps Bot commented Apr 29, 2026

Greptile Summary

This PR freezes the ContextBench Phase 37 benchmark infrastructure: a deterministic task-selection script, six JSON fixture files (protocol, lanes, corrections, task manifest, selection exclusions, smoke pack), and two validation test suites. The four pre-existing slow test files are also updated to replace magic timeout literals with a named constant. Previously reported issues — live-dataset drift in --check mode and premature task-push before validation failure — have been addressed in commit 04a6cfb.

Confidence Score: 5/5

Safe to merge; all findings are P2 style suggestions with no runtime impact on the committed test fixtures or normal CI execution.

All P1 issues from previous review rounds have been resolved in commit 04a6cfb. Remaining findings are non-blocking P2s: an unpinned evaluator clone in a manually-invoked probe mode and a duplicated constant in test files. Fixture hash values are well-formed, hash computation is consistent between script and tests, and the --check self-verification path is correctly isolated from live data.

scripts/contextbench-select-slice.mjs — the probeEvaluator fallback clone at line ~563 is the only area that could become flaky over time, but it is not exercised in normal CI.
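Since the unpinned fallback clone is the one flakiness risk named above, pinning the evaluator fetch to a frozen commit is one way to harden it. This is a hypothetical sketch, not the script's actual code: `EVALUATOR_PIN` and `clonePinnedEvaluator` are invented names, and the placeholder hash would be replaced by a real frozen ref.

```javascript
// Hypothetical sketch: fetch a pinned commit shallowly instead of cloning
// HEAD of the default branch with --depth 1 (assumed names throughout).
import { spawnSync } from 'node:child_process';

const EVALUATOR_PIN = '0123456789abcdef0123456789abcdef01234567'; // placeholder frozen commit

function clonePinnedEvaluator(repoUrl, destDir) {
  // git init an empty repo, shallow-fetch only the pinned commit, check it out.
  const steps = [
    ['init', destDir],
    ['-C', destDir, 'fetch', '--depth', '1', repoUrl, EVALUATOR_PIN],
    ['-C', destDir, 'checkout', 'FETCH_HEAD']
  ];
  for (const args of steps) {
    const result = spawnSync('git', args, { encoding: 'utf8' });
    if (result.status !== 0) {
      throw new Error(`git ${args[0]} failed: ${result.stderr}`);
    }
  }
}
```

Shallow-fetching an arbitrary commit SHA requires the server to allow it (GitHub does), which is why the sketch fetches the pin directly rather than cloning and resetting.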

Important Files Changed

Filename Overview
scripts/contextbench-select-slice.mjs New script for deterministic ContextBench task selection, payload materialization, and manifest verification. Previously reported issues (live-dataset drift in --check and task-push before failure) have been addressed. Minor concern: the fallback evaluator clone in probeEvaluator performs a --depth 1 clone of an unpinned ref.
tests/contextbench-task-manifest.test.ts Comprehensive validation tests for the task manifest, exclusion log, payload materialization, gold-input writing, and checkout materialization. Tests exercise the script end-to-end with synthetic fixtures. Hash computation in tests mirrors the script's canonicalize/stableStringify path correctly.
tests/contextbench-protocol.test.ts Fixture-only protocol invariant tests covering phase boundaries, lane isolation, run-count policy, correction governance, and anti-gaming tripwires. No live network access; all assertions are against committed JSON fixtures.
tests/fixtures/contextbench-corrections.json Corrections ledger with one factual-erratum entry for the hardness-signal policy update. All required correction fields are present and the hashes are well-formed SHA-256 values.
tests/impact-2hop.test.ts Timeout value replaced with a named constant SLOW_WINDOWS_TEST_TIMEOUT_MS (60 s); same constant duplicated across three other test files rather than being shared.
tests/search-decision-card.test.ts Seven 30000 ms timeout literals replaced with SLOW_WINDOWS_TEST_TIMEOUT_MS; same duplication issue as the other test files.
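The duplication the last two rows mention could collapse into one helper module. A minimal sketch, assuming a vitest setup; the file path and helper are suggestions, not code from this PR:

```javascript
// Hypothetical shared helper (e.g. tests/helpers/slow-timeouts.mjs) giving the
// slow-test timeout a single definition instead of four per-file copies.
const SLOW_WINDOWS_TEST_TIMEOUT_MS = 60_000; // 60 s; add `export` in the real module

// Each slow test file would then pass it as vitest's per-test timeout argument:
//   it('resolves 2-hop impact', async () => { /* ... */ }, SLOW_WINDOWS_TEST_TIMEOUT_MS);
```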

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[contextbench-select-slice.mjs] --> B{Mode}
    B -->|--dry-run| C[loadRows from HuggingFace]
    B -->|--write-fixtures| C
    B -->|--write-task-payloads| D[loadRowsForArgs]
    B -->|--write-gold| D
    B -->|--check + no rows-file| E[verifyManifest self-hash only]
    B -->|--check + rows-file| C
    B -->|--probe-evaluator| F[createEvaluatorFixture]
    B -->|--materialize-checkouts| G[cloneCheckout per task]
    C --> H[buildPool - normalizeRow, dedup]
    H --> I[selectTasks - language/source/repo/fill coverage]
    I --> J[buildArtifacts]
    J -->|--write-fixtures| K[tests/fixtures/contextbench-task-manifest.json]
    J -->|--write-fixtures| L[tests/fixtures/contextbench-selection-exclusions.json]
    J -->|--check| M[verifyManifest manifest vs recomputed]
    D --> N[buildTaskPayloads - verify hashes, skip on failure]
    N --> O[writeJson payloads file]
    F --> P[python -m contextbench.evaluate probe]
    P -->|module missing| Q[clone EuniAI/ContextBench - unpinned]
    Q --> P
    G --> R[git clone + checkout base_commit]
    R --> S[withPayloadHash - excludes checkout_path / status_short / materialized_at]

Reviews (2): Last reviewed commit: "Merge remote-tracking branch 'origin/pr/..."

Comment on lines +465 to +484
failures.push('manifest_hash does not match manifest content');
if (actualHash !== expected.manifest_hash)
failures.push('manifest differs from deterministic dataset selection');
if (actual.tasks.length !== 20) failures.push(`expected 20 tasks, got ${actual.tasks.length}`);
if (actual.hardness_proxy_used !== false)
failures.push('manifest must set hardness_proxy_used false');
if (actual.hardness_signal_status !== HARDNESS_STATUS)
failures.push('manifest has wrong hardness signal status');
if (!actual.no_lane_outputs_observed_attestation) failures.push('missing no-output attestation');
if (new Set(actual.tasks.map((task) => task.repo_url)).size < 2)
failures.push('selected tasks cover fewer than two repos');
if (new Set(actual.tasks.map((task) => task.language)).size < 2)
failures.push('selected tasks cover fewer than two languages');
return failures;
}

function run(command, args, cwd) {
const result = spawnSync(command, args, { cwd, encoding: 'utf8', env: childEnvForCommand(command) });
return { status: result.status, stdout: result.stdout ?? '', stderr: result.stderr ?? '' };
}
Contributor

P1 verifyManifest / --check will permanently break as dataset grows

verifyManifest declares failure when actualHash !== expected.manifest_hash (line 473), but expected is built by re-running buildArtifacts against the live dataset. taskPoolHash (which drives manifestBase and thus manifest_hash) is computed from the full eligible pool — so any new row added to Contextbench/ContextBench after the manifest was frozen will change taskPoolHash, causing --check to report "manifest differs from deterministic dataset selection" for a legitimately untampered manifest.

The PR description's validation step runs --check without --rows-file, meaning it hits the live HuggingFace API. As soon as the upstream dataset gains a single new eligible row this check becomes permanently broken and can no longer serve as an integrity guard.

The --rows-file option exists to bypass this, but the verification workflow should make it mandatory (or snapshot the dataset at freeze time) so the check stays meaningful.

Owner Author

Fixed in

Owner Author

Clarifying previous reply: fixed in commit 04a6cfb. Check without rows-file now performs only committed-manifest self-verification and exits before live dataset loading. Passing rows-file remains the stronger deterministic reselection mode from frozen rows.
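The self-verification path described here can be sketched as follows. The names (`stableStringify`, `selfVerifyManifest`) and the exact field handling are assumptions about the script's shape, not its actual code; the real implementation also canonicalizes line endings and embedded JSON strings before hashing.

```javascript
// Sketch of committed-manifest self-verification: recompute the hash over the
// manifest body with the stored hash field removed, then compare.
import { createHash } from 'node:crypto';

function stableStringify(value) {
  // Deterministic JSON encoding: recursively sort object keys (simplified).
  if (Array.isArray(value)) return `[${value.map(stableStringify).join(',')}]`;
  if (value && typeof value === 'object') {
    const entries = Object.keys(value)
      .sort()
      .map((k) => `${JSON.stringify(k)}:${stableStringify(value[k])}`);
    return `{${entries.join(',')}}`;
  }
  return JSON.stringify(value);
}

function sha256(text) {
  return `sha256:${createHash('sha256').update(text, 'utf8').digest('hex')}`;
}

function selfVerifyManifest(manifest) {
  // Exclude the stored hash itself, then recompute over the remaining fields.
  const { manifest_hash, ...body } = manifest;
  return sha256(stableStringify(body)) === manifest_hash;
}
```

Because this mode needs no dataset access, it cannot drift as upstream rows are added; only the optional rows-file mode re-runs selection.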

Comment on lines +113 to +142

function canonicalize(value) {
if (value === undefined) return 'undefined';
if (value === null) return 'null';
if (typeof value !== 'string') return stableStringify(value).replace(/\r\n?/g, '\n');
const normalized = value.replace(/\r\n?/g, '\n');
const trimmed = normalized.trim();
if (trimmed.startsWith('{') || trimmed.startsWith('[')) {
try {
return stableStringify(JSON.parse(trimmed));
} catch {
return normalized;
}
}
return normalized;
}

function sha256(value) {
return `sha256:${createHash('sha256').update(value, 'utf8').digest('hex')}`;
}

function hashObject(value) {
return sha256(stableStringify(value));
}

function writeJson(path, value) {
mkdirSync(dirname(path), { recursive: true });
writeFileSync(path, `${JSON.stringify(value, null, 2)}\n`, 'utf8');
}

Contributor

P2 stableStringify/hashObject duplicated in script and test

Identical implementations appear in scripts/contextbench-select-slice.mjs (lines 113–142) and tests/contextbench-task-manifest.test.ts. The test's self-verification (manifest_hash re-computation at line 1626) cross-validates the two today, but a future divergence — even a subtle one like a different localeCompare locale argument — would silently produce different hashes in the script vs. the test without a clear failure message. Consider extracting to a shared utility module.

Owner Author

Leaving this intentionally separate for this PR. The test-side hashing is an independent verifier for the frozen JSON artifact; extracting a shared utility would couple the producer and verifier and add a new shared module just for one script/test pair. The manifest self-check and hash-order regression still cover deterministic behavior.
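A middle ground between sharing code and relying on the self-check: both implementations could additionally assert a frozen golden digest of a fixed sample value. A sketch under assumed names; the sample object and the golden-fixture idea are illustrative, not from this PR:

```javascript
// Script and test each keep their own stableStringify, but both also hash the
// same committed sample and compare against a frozen digest, so any divergence
// (key ordering, locale-sensitive sort) fails loudly in both places.
import { createHash } from 'node:crypto';

const GOLDEN_SAMPLE = { b: [2, 1], a: 'x' }; // committed fixture value (illustrative)

function stableStringify(value) {
  // Simplified sketch: recursively sort object keys for a canonical encoding.
  if (Array.isArray(value)) return `[${value.map(stableStringify).join(',')}]`;
  if (value && typeof value === 'object') {
    const entries = Object.keys(value)
      .sort()
      .map((k) => `${JSON.stringify(k)}:${stableStringify(value[k])}`);
    return `{${entries.join(',')}}`;
  }
  return JSON.stringify(value);
}

const digest = createHash('sha256')
  .update(stableStringify(GOLDEN_SAMPLE), 'utf8')
  .digest('hex');
// The digest would be frozen in a fixture; asserting it in both suites turns a
// silent hash divergence into an explicit regression failure with a clear message.
```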

Comment on lines +337 to +388
if (!row) {
failures.push(`${task.instance_id}: missing dataset row`);
continue;
}
const problemStatement = typeof row.problem_statement === 'string' ? row.problem_statement : '';
if (!problemStatement.trim()) failures.push(`${task.instance_id}: missing problem_statement`);
const problemStatementHash = sha256(canonicalize(problemStatement));
if (problemStatementHash !== task.problem_statement_hash) {
failures.push(`${task.instance_id}: problem_statement_hash mismatch`);
}
if (row.repo_url !== task.repo_url) failures.push(`${task.instance_id}: repo_url mismatch`);
if (row.base_commit !== task.base_commit)
failures.push(`${task.instance_id}: base_commit mismatch`);
tasks.push({
instance_id: task.instance_id,
original_inst_id: task.original_inst_id,
repo: task.repo,
repo_url: task.repo_url,
base_commit: task.base_commit,
problem_statement: problemStatement,
problem_statement_hash: problemStatementHash,
problem_statement_hash_verified: problemStatementHash === task.problem_statement_hash,
repo_checkout_path: checkoutPathForTask(task, checkoutRoot),
repo_checkout_status: checkoutRoot ? 'planned_not_verified' : 'not_planned',
lane_outputs_observed: false
});
}
if (failures.length > 0)
throw new Error(`task payload materialization failed:\n- ${failures.join('\n- ')}`);
const payloadBase = {
name: 'v2.4-contextbench-phase40-task-payloads',
protocolVersion: manifest.protocolVersion,
dataset: manifest.dataset,
datasetConfig: manifest.datasetConfig,
split: manifest.split,
claimBearing: false,
purpose:
'Phase 40 task input materialization; not lane output and not benchmark evidence by itself.',
manifest_hash: manifest.manifest_hash,
hash_canonicalization_version: CANONICALIZATION_VERSION,
checkout_root: checkoutRoot ? resolve(checkoutRoot) : null,
task_count: tasks.length,
tasks
};
return { ...payloadBase, payload_hash: hashObject(payloadBase) };
}

function summarize(tasks) {
const countBy = (field) =>
tasks.reduce((acc, task) => {
acc[task[field]] = (acc[task[field]] ?? 0) + 1;
return acc;
Contributor

P2 buildTaskPayloads pushes mismatched tasks before throwing

When row.repo_url !== task.repo_url or row.base_commit !== task.base_commit, the failure is pushed to failures[] but the task is still appended to tasks[] (line 356–369). The function throws before returning, so no corrupted data escapes today. However, if the error-accumulation logic is ever relaxed (e.g., to return partial results on non-fatal mismatches), the tasks array will silently contain entries with un-verified repo coordinates. Pushing to tasks should be conditioned on there being no per-task failure.
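The guard being suggested can be sketched like this; names are hypothetical and mirror the excerpt above, while the real function validates more fields (problem statement, hashes) per task:

```javascript
// Hypothetical sketch: accumulate failures per task and append to tasks[] only
// when that task validated cleanly, so a later relaxation of the throw-on-any-
// failure policy can never leak entries with unverified repo coordinates.
function collectTask(row, task, tasks, failures) {
  const taskFailures = [];
  if (row.repo_url !== task.repo_url) {
    taskFailures.push(`${task.instance_id}: repo_url mismatch`);
  }
  if (row.base_commit !== task.base_commit) {
    taskFailures.push(`${task.instance_id}: base_commit mismatch`);
  }
  failures.push(...taskFailures);
  if (taskFailures.length === 0) {
    // Only verified coordinates reach the payload.
    tasks.push({
      instance_id: task.instance_id,
      repo_url: task.repo_url,
      base_commit: task.base_commit
    });
  }
}
```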

Owner Author

Fixed in

Owner Author

Clarifying previous reply: fixed in commit 04a6cfb. Task payload construction now skips appending an entry whenever that task has validation failures, and the regression test verifies a mismatched problem statement fails without writing a payload artifact.

@PatrickSys PatrickSys changed the base branch from pr/git-hygiene to master April 29, 2026 19:20
@PatrickSys PatrickSys closed this Apr 29, 2026
@PatrickSys PatrickSys reopened this Apr 29, 2026
@PatrickSys PatrickSys merged commit 787bb17 into master Apr 29, 2026
4 checks passed
@PatrickSys PatrickSys deleted the pr/contextbench-protocol-fixtures branch April 30, 2026 07:32