test(eval): add ContextBench protocol fixtures #119

Merged
PatrickSys merged 6 commits into master from pr/contextbench-protocol-fixtures
Apr 29, 2026

Conversation

@PatrickSys
Owner

Adds ContextBench protocol, lane, correction, task-manifest, selection-exclusion, and smoke-pack fixtures with deterministic validation tests. This PR freezes and reviews benchmark inputs and helper materialization only; it does not include live benchmark results, Phase 42 pass claims, or product-performance claims.

Validation run:

- rtk pnpm exec vitest run tests/contextbench-protocol.test.ts tests/contextbench-task-manifest.test.ts
- rtk node scripts/contextbench-select-slice.mjs --check tests/fixtures/contextbench-task-manifest.json
- rtk pnpm exec tsc --noEmit
- pre-push hook passed after build/test execution

@greptile-apps
Contributor

greptile-apps Bot commented Apr 29, 2026

Greptile Summary

This PR freezes the ContextBench Phase 37 benchmark infrastructure: a deterministic task-selection script, six JSON fixture files (protocol, lanes, corrections, task manifest, selection exclusions, smoke pack), and two validation test suites. The four pre-existing slow test files are also updated to replace magic timeout literals with a named constant. Previously reported issues — live-dataset drift in --check mode and premature task-push before validation failure — have been addressed in commit 04a6cfb.

Confidence Score: 5/5

Safe to merge; all findings are P2 style suggestions with no runtime impact on the committed test fixtures or normal CI execution.

All P1 issues from previous review rounds have been resolved in commit 04a6cfb. Remaining findings are non-blocking P2s: an unpinned evaluator clone in a manually-invoked probe mode and a duplicated constant in test files. Fixture hash values are well-formed, hash computation is consistent between script and tests, and the --check self-verification path is correctly isolated from live data.

scripts/contextbench-select-slice.mjs — the probeEvaluator fallback clone at line ~563 is the only area that could become flaky over time, but it is not exercised in normal CI.
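Since the unpinned fallback clone is the one flakiness risk named above, pinning the evaluator fetch to a frozen commit is one way to harden it. This is a hypothetical sketch, not the script's actual code: `EVALUATOR_PIN` and `clonePinnedEvaluator` are invented names, and the placeholder hash would be replaced by a real frozen ref.

```javascript
// Hypothetical sketch: fetch a pinned commit shallowly instead of cloning
// HEAD of the default branch with --depth 1 (assumed names throughout).
import { spawnSync } from 'node:child_process';

const EVALUATOR_PIN = '0123456789abcdef0123456789abcdef01234567'; // placeholder frozen commit

function clonePinnedEvaluator(repoUrl, destDir) {
  // git init an empty repo, shallow-fetch only the pinned commit, check it out.
  const steps = [
    ['init', destDir],
    ['-C', destDir, 'fetch', '--depth', '1', repoUrl, EVALUATOR_PIN],
    ['-C', destDir, 'checkout', 'FETCH_HEAD']
  ];
  for (const args of steps) {
    const result = spawnSync('git', args, { encoding: 'utf8' });
    if (result.status !== 0) {
      throw new Error(`git ${args[0]} failed: ${result.stderr}`);
    }
  }
}
```

Shallow-fetching an arbitrary commit SHA requires the server to allow it (GitHub does), which is why the sketch fetches the pin directly rather than cloning and resetting.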

Important Files Changed

Filename Overview
scripts/contextbench-select-slice.mjs New script for deterministic ContextBench task selection, payload materialization, and manifest verification. Previously reported issues (live-dataset drift in --check and task-push before failure) have been addressed. Minor concern: the fallback evaluator clone in probeEvaluator performs a --depth 1 clone of an unpinned ref.
tests/contextbench-task-manifest.test.ts Comprehensive validation tests for the task manifest, exclusion log, payload materialization, gold-input writing, and checkout materialization. Tests exercise the script end-to-end with synthetic fixtures. Hash computation in tests mirrors the script's canonicalize/stableStringify path correctly.
tests/contextbench-protocol.test.ts Fixture-only protocol invariant tests covering phase boundaries, lane isolation, run-count policy, correction governance, and anti-gaming tripwires. No live network access; all assertions are against committed JSON fixtures.
tests/fixtures/contextbench-corrections.json Corrections ledger with one factual-erratum entry for the hardness-signal policy update. All required correction fields are present and the hashes are well-formed SHA-256 values.
tests/impact-2hop.test.ts Timeout value replaced with a named constant SLOW_WINDOWS_TEST_TIMEOUT_MS (60 s); same constant duplicated across three other test files rather than being shared.
tests/search-decision-card.test.ts Seven 30000 ms timeout literals replaced with SLOW_WINDOWS_TEST_TIMEOUT_MS; same duplication issue as the other test files.
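The duplication the last two rows mention could collapse into one helper module. A minimal sketch, assuming a vitest setup; the file path and helper are suggestions, not code from this PR:

```javascript
// Hypothetical shared helper (e.g. tests/helpers/slow-timeouts.mjs) giving the
// slow-test timeout a single definition instead of four per-file copies.
const SLOW_WINDOWS_TEST_TIMEOUT_MS = 60_000; // 60 s; add `export` in the real module

// Each slow test file would then pass it as vitest's per-test timeout argument:
//   it('resolves 2-hop impact', async () => { /* ... */ }, SLOW_WINDOWS_TEST_TIMEOUT_MS);
```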

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[contextbench-select-slice.mjs] --> B{Mode}
    B -->|--dry-run| C[loadRows from HuggingFace]
    B -->|--write-fixtures| C
    B -->|--write-task-payloads| D[loadRowsForArgs]
    B -->|--write-gold| D
    B -->|--check + no rows-file| E[verifyManifest self-hash only]
    B -->|--check + rows-file| C
    B -->|--probe-evaluator| F[createEvaluatorFixture]
    B -->|--materialize-checkouts| G[cloneCheckout per task]
    C --> H[buildPool - normalizeRow, dedup]
    H --> I[selectTasks - language/source/repo/fill coverage]
    I --> J[buildArtifacts]
    J -->|--write-fixtures| K[tests/fixtures/contextbench-task-manifest.json]
    J -->|--write-fixtures| L[tests/fixtures/contextbench-selection-exclusions.json]
    J -->|--check| M[verifyManifest manifest vs recomputed]
    D --> N[buildTaskPayloads - verify hashes, skip on failure]
    N --> O[writeJson payloads file]
    F --> P[python -m contextbench.evaluate probe]
    P -->|module missing| Q[clone EuniAI/ContextBench - unpinned]
    Q --> P
    G --> R[git clone + checkout base_commit]
    R --> S[withPayloadHash - excludes checkout_path / status_short / materialized_at]

Reviews (2): Last reviewed commit: "Merge remote-tracking branch 'origin/pr/..."

Comment on lines +465 to +484
failures.push('manifest_hash does not match manifest content');
if (actualHash !== expected.manifest_hash)
failures.push('manifest differs from deterministic dataset selection');
if (actual.tasks.length !== 20) failures.push(`expected 20 tasks, got ${actual.tasks.length}`);
if (actual.hardness_proxy_used !== false)
failures.push('manifest must set hardness_proxy_used false');
if (actual.hardness_signal_status !== HARDNESS_STATUS)
failures.push('manifest has wrong hardness signal status');
if (!actual.no_lane_outputs_observed_attestation) failures.push('missing no-output attestation');
if (new Set(actual.tasks.map((task) => task.repo_url)).size < 2)
failures.push('selected tasks cover fewer than two repos');
if (new Set(actual.tasks.map((task) => task.language)).size < 2)
failures.push('selected tasks cover fewer than two languages');
return failures;
}

function run(command, args, cwd) {
const result = spawnSync(command, args, { cwd, encoding: 'utf8', env: childEnvForCommand(command) });
return { status: result.status, stdout: result.stdout ?? '', stderr: result.stderr ?? '' };
}
Contributor

P1 verifyManifest / --check will permanently break as dataset grows

verifyManifest declares failure when actualHash !== expected.manifest_hash (line 473), but expected is built by re-running buildArtifacts against the live dataset. taskPoolHash (which drives manifestBase and thus manifest_hash) is computed from the full eligible pool — so any new row added to Contextbench/ContextBench after the manifest was frozen will change taskPoolHash, causing --check to report "manifest differs from deterministic dataset selection" for a legitimately untampered manifest.

The PR description's validation step runs --check without --rows-file, meaning it hits the live HuggingFace API. As soon as the upstream dataset gains a single new eligible row this check becomes permanently broken and can no longer serve as an integrity guard.

The --rows-file option exists to bypass this, but the verification workflow should make it mandatory (or snapshot the dataset at freeze time) so the check stays meaningful.

Owner Author

Fixed in

Owner Author

Clarifying previous reply: fixed in commit 04a6cfb. Check without rows-file now performs only committed-manifest self-verification and exits before live dataset loading. Passing rows-file remains the stronger deterministic reselection mode from frozen rows.
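The self-verification path described here can be sketched as follows. The names (`stableStringify`, `selfVerifyManifest`) and the exact field handling are assumptions about the script's shape, not its actual code; the real implementation also canonicalizes line endings and embedded JSON strings before hashing.

```javascript
// Sketch of committed-manifest self-verification: recompute the hash over the
// manifest body with the stored hash field removed, then compare.
import { createHash } from 'node:crypto';

function stableStringify(value) {
  // Deterministic JSON encoding: recursively sort object keys (simplified).
  if (Array.isArray(value)) return `[${value.map(stableStringify).join(',')}]`;
  if (value && typeof value === 'object') {
    const entries = Object.keys(value)
      .sort()
      .map((k) => `${JSON.stringify(k)}:${stableStringify(value[k])}`);
    return `{${entries.join(',')}}`;
  }
  return JSON.stringify(value);
}

function sha256(text) {
  return `sha256:${createHash('sha256').update(text, 'utf8').digest('hex')}`;
}

function selfVerifyManifest(manifest) {
  // Exclude the stored hash itself, then recompute over the remaining fields.
  const { manifest_hash, ...body } = manifest;
  return sha256(stableStringify(body)) === manifest_hash;
}
```

Because this mode needs no dataset access, it cannot drift as upstream rows are added; only the optional rows-file mode re-runs selection.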

Comment on lines +113 to +142

function canonicalize(value) {
if (value === undefined) return 'undefined';
if (value === null) return 'null';
if (typeof value !== 'string') return stableStringify(value).replace(/\r\n?/g, '\n');
const normalized = value.replace(/\r\n?/g, '\n');
const trimmed = normalized.trim();
if (trimmed.startsWith('{') || trimmed.startsWith('[')) {
try {
return stableStringify(JSON.parse(trimmed));
} catch {
return normalized;
}
}
return normalized;
}

function sha256(value) {
return `sha256:${createHash('sha256').update(value, 'utf8').digest('hex')}`;
}

function hashObject(value) {
return sha256(stableStringify(value));
}

function writeJson(path, value) {
mkdirSync(dirname(path), { recursive: true });
writeFileSync(path, `${JSON.stringify(value, null, 2)}\n`, 'utf8');
}

Contributor

P2 stableStringify/hashObject duplicated in script and test

Identical implementations appear in scripts/contextbench-select-slice.mjs (lines 113–142) and tests/contextbench-task-manifest.test.ts. The test's self-verification (manifest_hash re-computation at line 1626) cross-validates the two today, but a future divergence — even a subtle one like a different localeCompare locale argument — would silently produce different hashes in the script vs. the test without a clear failure message. Consider extracting to a shared utility module.

Owner Author

Leaving this intentionally separate for this PR. The test-side hashing is an independent verifier for the frozen JSON artifact; extracting a shared utility would couple the producer and verifier and add a new shared module just for one script/test pair. The manifest self-check and hash-order regression still cover deterministic behavior.
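A middle ground between sharing code and relying on the self-check: both implementations could additionally assert a frozen golden digest of a fixed sample value. A sketch under assumed names; the sample object and the golden-fixture idea are illustrative, not from this PR:

```javascript
// Script and test each keep their own stableStringify, but both also hash the
// same committed sample and compare against a frozen digest, so any divergence
// (key ordering, locale-sensitive sort) fails loudly in both places.
import { createHash } from 'node:crypto';

const GOLDEN_SAMPLE = { b: [2, 1], a: 'x' }; // committed fixture value (illustrative)

function stableStringify(value) {
  // Simplified sketch: recursively sort object keys for a canonical encoding.
  if (Array.isArray(value)) return `[${value.map(stableStringify).join(',')}]`;
  if (value && typeof value === 'object') {
    const entries = Object.keys(value)
      .sort()
      .map((k) => `${JSON.stringify(k)}:${stableStringify(value[k])}`);
    return `{${entries.join(',')}}`;
  }
  return JSON.stringify(value);
}

const digest = createHash('sha256')
  .update(stableStringify(GOLDEN_SAMPLE), 'utf8')
  .digest('hex');
// The digest would be frozen in a fixture; asserting it in both suites turns a
// silent hash divergence into an explicit regression failure with a clear message.
```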

Comment on lines +337 to +388
if (!row) {
failures.push(`${task.instance_id}: missing dataset row`);
continue;
}
const problemStatement = typeof row.problem_statement === 'string' ? row.problem_statement : '';
if (!problemStatement.trim()) failures.push(`${task.instance_id}: missing problem_statement`);
const problemStatementHash = sha256(canonicalize(problemStatement));
if (problemStatementHash !== task.problem_statement_hash) {
failures.push(`${task.instance_id}: problem_statement_hash mismatch`);
}
if (row.repo_url !== task.repo_url) failures.push(`${task.instance_id}: repo_url mismatch`);
if (row.base_commit !== task.base_commit)
failures.push(`${task.instance_id}: base_commit mismatch`);
tasks.push({
instance_id: task.instance_id,
original_inst_id: task.original_inst_id,
repo: task.repo,
repo_url: task.repo_url,
base_commit: task.base_commit,
problem_statement: problemStatement,
problem_statement_hash: problemStatementHash,
problem_statement_hash_verified: problemStatementHash === task.problem_statement_hash,
repo_checkout_path: checkoutPathForTask(task, checkoutRoot),
repo_checkout_status: checkoutRoot ? 'planned_not_verified' : 'not_planned',
lane_outputs_observed: false
});
}
if (failures.length > 0)
throw new Error(`task payload materialization failed:\n- ${failures.join('\n- ')}`);
const payloadBase = {
name: 'v2.4-contextbench-phase40-task-payloads',
protocolVersion: manifest.protocolVersion,
dataset: manifest.dataset,
datasetConfig: manifest.datasetConfig,
split: manifest.split,
claimBearing: false,
purpose:
'Phase 40 task input materialization; not lane output and not benchmark evidence by itself.',
manifest_hash: manifest.manifest_hash,
hash_canonicalization_version: CANONICALIZATION_VERSION,
checkout_root: checkoutRoot ? resolve(checkoutRoot) : null,
task_count: tasks.length,
tasks
};
return { ...payloadBase, payload_hash: hashObject(payloadBase) };
}

function summarize(tasks) {
const countBy = (field) =>
tasks.reduce((acc, task) => {
acc[task[field]] = (acc[task[field]] ?? 0) + 1;
return acc;
Contributor

P2 buildTaskPayloads pushes mismatched tasks before throwing

When row.repo_url !== task.repo_url or row.base_commit !== task.base_commit, the failure is pushed to failures[] but the task is still appended to tasks[] (line 356–369). The function throws before returning, so no corrupted data escapes today. However, if the error-accumulation logic is ever relaxed (e.g., to return partial results on non-fatal mismatches), the tasks array will silently contain entries with un-verified repo coordinates. Pushing to tasks should be conditioned on there being no per-task failure.
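The guard being suggested can be sketched like this; names are hypothetical and mirror the excerpt above, while the real function validates more fields (problem statement, hashes) per task:

```javascript
// Hypothetical sketch: accumulate failures per task and append to tasks[] only
// when that task validated cleanly, so a later relaxation of the throw-on-any-
// failure policy can never leak entries with unverified repo coordinates.
function collectTask(row, task, tasks, failures) {
  const taskFailures = [];
  if (row.repo_url !== task.repo_url) {
    taskFailures.push(`${task.instance_id}: repo_url mismatch`);
  }
  if (row.base_commit !== task.base_commit) {
    taskFailures.push(`${task.instance_id}: base_commit mismatch`);
  }
  failures.push(...taskFailures);
  if (taskFailures.length === 0) {
    // Only verified coordinates reach the payload.
    tasks.push({
      instance_id: task.instance_id,
      repo_url: task.repo_url,
      base_commit: task.base_commit
    });
  }
}
```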

Owner Author

Fixed in

Owner Author

Clarifying previous reply: fixed in commit 04a6cfb. Task payload construction now skips appending an entry whenever that task has validation failures, and the regression test verifies a mismatched problem statement fails without writing a payload artifact.

@PatrickSys PatrickSys changed the base branch from pr/git-hygiene to master April 29, 2026 19:20
@PatrickSys PatrickSys closed this Apr 29, 2026
@PatrickSys PatrickSys reopened this Apr 29, 2026
@PatrickSys PatrickSys merged commit 787bb17 into master Apr 29, 2026
4 checks passed
@PatrickSys PatrickSys deleted the pr/contextbench-protocol-fixtures branch April 30, 2026 07:32