
Add ContextBench harness core #120

Merged
PatrickSys merged 12 commits into master from pr/contextbench-harness-core
Apr 29, 2026

@PatrickSys
Owner

Summary

  • Adds the non-claim-bearing ContextBench harness runner, retrieval gate, structured answer parsing, scoring, artifact, trajectory, and Phase 42 evidence-gate utilities.
  • Adds harness tests and fixtures for baseline snapshots/runs, schema enforcement, setup/index evidence, official-evaluator handling, lane isolation, scoring, and verification failure modes.
  • Stabilizes Windows hook execution for ContextBench temp Git repos and slow search integration tests without changing benchmark claims.

Verification

  • rtk node scripts/contextbench-runner.mjs --validate-fixtures
  • rtk node scripts/contextbench-runner.mjs --validate-lane-setup
  • rtk pnpm exec vitest run tests/contextbench-runner-contract.test.ts tests/contextbench-lane-setup.test.ts tests/contextbench-scoring.test.ts tests/contextbench-trajectory.test.ts tests/contextbench-baseline-schema-gate.test.ts tests/contextbench-baseline-snapshot.test.ts tests/contextbench-baseline-runner.test.ts tests/contextbench-phase42-evidence-gate.test.ts tests/contextbench-protocol.test.ts tests/contextbench-task-manifest.test.ts
  • rtk pnpm run format:check
  • rtk pnpm exec tsc --noEmit
  • rtk pnpm run build
  • Pre-push hook completed successfully during rtk git push -u origin pr/contextbench-harness-core

Claim Posture

  • This PR adds harness infrastructure only.
  • It does not run live benchmark rows, flip claimAllowed, or claim Phase 42/product improvement success.
  • Existing diagnostic artifacts remain non-claim-bearing.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cad646d9d9

Comment thread scripts/contextbench-runner.mjs Outdated
};
const rawTrace = {
executor,
model: executor === 'claude' ? model : 'fake-executor',

P1 Preserve actual executor model in raw trace

Set rawTrace.model to the selected model for all real executors, not just Claude. As written, non-Claude runs (codex, gemini, opencode) are recorded as "fake-executor" while the manifest row stores taskExecution.model from --model, so Phase 42 provenance checks (rawTrace.model === row.taskExecution.model) will fail even when the run is otherwise valid, blocking claim-grade verification for those lanes.

Owner Author

Fixed in 867ac70: raw traces now record model: executor === 'fake' ? 'fake-executor' : model, so Codex/Gemini/OpenCode preserve the selected model. The adapter smoke test now asserts rawTrace.model === row.taskExecution.model and executor consistency for all three adapters.
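
The selection rule described in this fix can be sketched as follows. This is a minimal illustration, not the runner's code: the `Executor` union and the helper name `selectTraceModel` are assumptions, and the model id is an illustrative value.

```typescript
// Hypothetical executor union; the thread names only these five values.
type Executor = 'fake' | 'claude' | 'codex' | 'gemini' | 'opencode';

// Only the fake executor gets the placeholder; real executors keep the --model value.
function selectTraceModel(executor: Executor, model: string): string {
  return executor === 'fake' ? 'fake-executor' : model;
}

// Illustrative model id, not taken from the PR.
const selectedModel = 'example-model';
const realExecutors: Executor[] = ['claude', 'codex', 'gemini', 'opencode'];

// Every real executor should satisfy rawTrace.model === row.taskExecution.model.
const allMatch = realExecutors.every(
  (executor) => selectTraceModel(executor, selectedModel) === selectedModel
);

console.log(allMatch); // true
console.log(selectTraceModel('fake', selectedModel)); // fake-executor
```

Keying the placeholder on the fake executor, rather than on one real executor, means a newly added adapter preserves its model by default.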

Comment thread scripts/contextbench-runner.mjs Outdated
return {
laneId: laneCard.laneId,
proven,
sourceKind: telemetry?.proofSource ? 'env_override' : 'not_captured',

P1 Emit claim-eligible lane isolation source kind

When telemetry evidence is present, this always emits sourceKind: 'env_override', but the Phase 42 gate explicitly rejects env_override as insufficient lane-isolation proof. That means rows with otherwise good observed-tools evidence can never satisfy lane isolation in claim verification, because the producer and verifier disagree on the accepted source kind.

Owner Author

Already addressed in the current branch: lane telemetry preserves sourceKind from the evidence source, including proxy and transcript, instead of collapsing everything to env_override. The Phase 42 gate still rejects env_override as diagnostic-only, so claim-grade rows require explicit proxy/transcript evidence.

Owner Author

Clarifying previous reply: this was already addressed in the branch. Lane telemetry now preserves the evidence source kind from the telemetry payload, including proxy and transcript, instead of treating every telemetry-backed row as env_override. The verifier still rejects env_override for claim-grade lane proof.
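
The producer/verifier agreement described here could look roughly like the sketch below. The `LaneTelemetry` shape and both function names are hypothetical; only the four source kind values come from this thread.

```typescript
// Source kinds mentioned in the thread.
type SourceKind = 'proxy' | 'transcript' | 'env_override' | 'not_captured';

// Hypothetical telemetry payload shape.
interface LaneTelemetry {
  sourceKind?: SourceKind;
}

// Producer: keep the kind reported by the payload instead of collapsing to env_override.
function resolveSourceKind(telemetry?: LaneTelemetry): SourceKind {
  return telemetry?.sourceKind ?? 'not_captured';
}

// Verifier: env_override is diagnostic-only, so only proxy/transcript count as claim-grade.
function isClaimGradeLaneProof(kind: SourceKind): boolean {
  return kind === 'proxy' || kind === 'transcript';
}

console.log(isClaimGradeLaneProof(resolveSourceKind({ sourceKind: 'proxy' }))); // true
console.log(isClaimGradeLaneProof(resolveSourceKind({ sourceKind: 'env_override' }))); // false
console.log(resolveSourceKind(undefined)); // not_captured
```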

@greptile-apps
Contributor

greptile-apps Bot commented Apr 29, 2026

Greptile Summary

This PR adds the non-claim-bearing ContextBench harness: runner, retrieval gate, structured answer parser, scoring module, artifact utilities, evidence gate, trajectory normaliser, and a comprehensive test suite. The two previously-flagged regressions (scoring field mismatches and hardcoded setupDurationMs) are addressed.

  • P1 — scorer/gate artifact contract gap: scoreWithOfficialEvaluatorFirst writes stdout/stderr as inline text, but hasOfficialEvaluatorProof unconditionally requires stdoutPath and stderrPath to be populated paths with matching SHA-256 hashes. Any score artifact produced by the TypeScript scorer will always fail the official_evaluator_missing gate check. The evidence gate test bypasses this by constructing paths manually in passingArtifacts(), so the gap is not caught by existing tests.

Confidence Score: 4/5

Safe to merge as non-claim-bearing infrastructure; the P1 scorer/gate artifact gap must be resolved before claim-bearing runs are attempted.

One P1 defect: the TypeScript scorer writes stdout/stderr as inline text but the evidence gate requires stdoutPath/stderrPath file paths, so any real scorer artifact will permanently fail hasOfficialEvaluatorProof. Since claimAllowed is false throughout this PR the gate is never exercised end-to-end yet, keeping the PR safe to land as infrastructure — but the gap must be closed before the claim path is activated.

src/eval/contextbench-scoring.ts — ContextBenchScoreResult must add stdoutPath/stderrPath and the function must write stdout/stderr to separate log files.

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/eval/contextbench-scoring.ts | Scorer emits inline stdout/stderr text but evidence gate requires stdoutPath/stderrPath file paths — gate will always emit official_evaluator_missing for any artifact produced by this module. |
| src/eval/contextbench-evidence-gate.ts | Evidence gate logic is thorough and well-structured; all gate checks (official evaluator, lane isolation, setup/index cost, runner provenance, denominator contract) are coherent and correctly gated by evidenceMode. |
| src/eval/contextbench-artifacts.ts | buildManifestRow now accepts caller-provided setupIndex; scoring fields are deliberately hardcoded to non-claim-bearing values for Phase 38 smoke runs, consistent with test assertions. |
| src/eval/contextbench-trajectory.ts | Trajectory normalisation is correct; pred_steps[0].spans and pred_spans share the same object reference, which could be problematic if consumers mutate the trajectory output. |
| tests/contextbench-phase42-evidence-gate.test.ts | Comprehensive gate test coverage; passingArtifacts() constructs stdoutPath/stderrPath manually, masking the gap between the TypeScript scorer's output and the gate's requirements. |
| tests/contextbench-scoring.test.ts | Tests cover scorer return value fields and fallback metadata well, but do not verify that the written score JSON artifact satisfies the evidence gate's stdoutPath/stderrPath requirements. |
| tests/contextbench-runner-contract.test.ts | Runner contract tests cover fixture validation, fake-executor smoke runs, manifest append semantics, and setupIndex propagation cleanly. |
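
The shared-reference note on the trajectory module can be illustrated with a small sketch. The span shape and file names below are hypothetical, since the PR excerpt does not show the real type; the point is only how aliasing behaves and how a defensive copy avoids it.

```typescript
// Hypothetical span shape; the real type is not shown in this PR.
interface SpanSketch {
  file: string;
  startLine: number;
  endLine: number;
}

const spans: SpanSketch[] = [{ file: 'src/a.ts', startLine: 1, endLine: 10 }];

// Shared reference: the step and the top-level field point at the same array.
const firstStep = { spans };
const sharedTrajectory = { pred_steps: [firstStep], pred_spans: spans };
sharedTrajectory.pred_spans.push({ file: 'src/b.ts', startLine: 5, endLine: 6 });
console.log(firstStep.spans.length); // 2, the step changed as a side effect

// Defensive copy: each field gets its own array and its own span objects.
const copySpans = (input: SpanSketch[]): SpanSketch[] => input.map((span) => ({ ...span }));
const isolatedStep = { spans: copySpans(spans) };
const isolatedTrajectory = { pred_steps: [isolatedStep], pred_spans: copySpans(spans) };
isolatedTrajectory.pred_spans.push({ file: 'src/c.ts', startLine: 2, endLine: 3 });
console.log(isolatedTrajectory.pred_spans.length); // 3
console.log(isolatedStep.spans.length); // 2, unaffected by the push
```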

Sequence Diagram

sequenceDiagram
    participant Runner as contextbench-runner.mjs
    participant Scorer as scoreWithOfficialEvaluatorFirst (TS)
    participant Disk as Score Artifact (score.json)
    participant Gate as evaluateContextBenchEvidenceGate

    Runner->>Scorer: run official evaluator
    Scorer->>Disk: writeJson(outputPath, { stdout, stderr, exitCode, ... })
    Note over Disk: stdoutPath/stderrPath absent
    Runner->>Gate: artifactsByRunId[runId].score = parse(score.json)
    Gate->>Gate: hasOfficialEvaluatorProof(row, score, hashes)
    Note over Gate: checks score.stdoutPath → undefined → returns false
    Gate-->>Runner: official_evaluator_missing failure

Reviews (2): Last reviewed commit: "fix(test): harden ContextBench schema cl..."

Comment thread src/eval/contextbench-scoring.ts Outdated
Comment on lines +40 to +70
missingEvidenceFiles: string[];
unsupportedClaim: boolean;
falseReady: boolean;
reasons: string[];
}

function writeJson(filePath: string, value: unknown): void {
mkdirSync(path.dirname(filePath), { recursive: true });
writeFileSync(filePath, `${JSON.stringify(value, null, 2)}\n`, 'utf8');
}

export async function scoreWithOfficialEvaluatorFirst(
params: OfficialEvaluatorParams
): Promise<ContextBenchScoreResult> {
const args = [
'-m',
'contextbench.evaluate',
'--gold',
params.goldPath,
'--pred',
params.predictionPath
];
if (params.cachePath) args.push('--cache', params.cachePath);
args.push('--out', params.outputPath);
const command = `python ${args.join(' ')}`;
const result = await params.runner('python', args, params.cwd);
if (result.status === 0) {
const score = {
status: 'completed' as const,
mode: 'official_evaluator' as const,
claimBearing: true,
Contributor

P1 ContextBenchScoreResult is incompatible with ContextBenchScoreEvidence

scoreWithOfficialEvaluatorFirst returns (and writes) a score object with exitStatus, but ContextBenchScoreEvidence (consumed by hasOfficialEvaluatorProof in the evidence gate) expects exitCode. Additionally, officialEvaluatorInvoked is absent from ContextBenchScoreResult. Because of these two mismatches, any TypeScript harness that stores this function's return value as the score artifact will cause hasOfficialEvaluatorProof to always return false — permanently blocking the claim gate even for a valid run.

The runner .mjs correctly emits both exitCode and officialEvaluatorInvoked: true inline (lines ~1091–1120), but the TypeScript module diverges silently. The two representations need to be reconciled.

Owner Author

Addressed in the current branch before this latest push: ContextBench scoring now emits the gate-compatible evaluator fields, including exitCode, officialEvaluatorFirst, officialEvaluatorAttempted, officialEvaluatorInvoked, command, outputPath, stdoutPath, and stderrPath. The scorer tests cover claimAllowed false versus true behavior and the metadata contract.
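
One way to keep the scorer and the gate from drifting apart again is to build the artifact through a function whose return type is the gate's evidence shape. The sketch below assumes trimmed-down, hypothetical interfaces (`ScoreEvidenceSketch`, `RunnerResultSketch`) and a simplified field check that treats exit code 0 as success; the real `hasOfficialEvaluatorProof` checks more than this.

```typescript
// Hypothetical, trimmed-down version of the evidence shape the gate reads.
interface ScoreEvidenceSketch {
  exitCode: number | null;
  officialEvaluatorInvoked: boolean;
}

// Hypothetical runner result; `status` mirrors the `result.status` used in the diff.
interface RunnerResultSketch {
  status: number | null;
}

// The annotated return type makes a missing or misnamed field a compile error.
function toScoreEvidence(result: RunnerResultSketch): ScoreEvidenceSketch {
  return { exitCode: result.status, officialEvaluatorInvoked: true };
}

// Simplified stand-in for the two field checks discussed in this thread.
function hasEvaluatorFields(score: Partial<ScoreEvidenceSketch>): boolean {
  return score.exitCode === 0 && score.officialEvaluatorInvoked === true;
}

// Old artifact shape as it would be read back from disk: exitStatus only.
const legacyArtifact = JSON.parse('{"exitStatus":0}') as Partial<ScoreEvidenceSketch>;

console.log(hasEvaluatorFields(legacyArtifact)); // false
console.log(hasEvaluatorFields(toScoreEvidence({ status: 0 }))); // true
```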

Comment on lines +113 to +135
return content.split('\n').map((line) => JSON.parse(line) as ContextBenchRunManifestRow);
}

export function buildManifestRow(params: {
runId: string;
protocolVersion: string;
protocolHash: string;
taskManifestHash: string;
laneCard: ContextBenchLaneToolCard;
task: ContextBenchTaskIdentity;
repeatIndex: number;
status: ContextBenchTerminalStatus;
startedAt: string;
completedAt: string;
paths: ArtifactPathSet;
hashes: Record<string, string>;
executor: ContextBenchExecutor;
model: string;
timeoutSeconds: number;
maxContextTokens: number;
maxAnswerTokens: number;
}): ContextBenchRunManifestRow {
return {
Contributor

P1 buildManifestRow hardcodes setupDurationMs/indexDurationMs to 0

hasMeasuredSetupIndex in the evidence gate checks if (evidence.setupStatus === 'completed' && setupDuration <= 0) return false, so any row emitted by this function whose setupStatus is 'completed' will permanently fail the setup_index_cost_missing gate. Additionally, the cross-check row.setupIndex.setupDurationMs !== evidence.setupDurationMs compares the row's hardcoded 0 against the artifact's actual measured value, causing a guaranteed mismatch.

The runner script uses a separate buildManifestRowForArtifacts that receives actual measurements; this TypeScript helper should similarly require measured values rather than silently defaulting to 0.

Owner Author

Addressed in the current branch: buildManifestRow now requires caller-provided setupIndex evidence instead of hardcoding setup/index durations to zero, and the runner contract test asserts setupIndex propagation into manifest rows.
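
The two checks from the review comment can be sketched to show why a hardcoded zero fails and a caller-provided measurement passes. `SetupIndexSketch` and `setupIndexPasses` are hypothetical names, the durations are illustrative, and the real gate may check more fields.

```typescript
// Hypothetical shape; only these three field names appear in the thread.
interface SetupIndexSketch {
  setupStatus: string;
  setupDurationMs: number;
  indexDurationMs: number;
}

// Simplified stand-in for the two checks described in the review comment.
function setupIndexPasses(rowSetupIndex: SetupIndexSketch, evidence: SetupIndexSketch): boolean {
  // A completed setup must carry a measured, positive duration.
  if (evidence.setupStatus === 'completed' && evidence.setupDurationMs <= 0) return false;
  // The manifest row must agree with the measured artifact value.
  if (rowSetupIndex.setupDurationMs !== evidence.setupDurationMs) return false;
  return true;
}

// Illustrative measurements, not taken from a real run.
const measured: SetupIndexSketch = {
  setupStatus: 'completed',
  setupDurationMs: 1200,
  indexDurationMs: 800
};
const hardcodedRow: SetupIndexSketch = { ...measured, setupDurationMs: 0, indexDurationMs: 0 };

console.log(setupIndexPasses(hardcodedRow, measured)); // false, row and artifact disagree
console.log(setupIndexPasses(measured, measured)); // true
```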

isError?: boolean;
};

const SLOW_WINDOWS_TEST_TIMEOUT_MS = 60000;
Contributor

P2 Duplicate constant across two test files

SLOW_WINDOWS_TEST_TIMEOUT_MS is defined identically in both search-decision-card.test.ts and search-snippets.test.ts. Consider extracting it to a shared test-helper (e.g., tests/helpers/timeouts.ts) to keep the value in one place.

Owner Author

Acknowledged, but not changing for this PR. The duplicate timeout constant is test-local and non-blocking; extracting a shared helper would add a new file for two call sites while the active merge blockers here are ContextBench evidence/provenance contracts.

@PatrickSys PatrickSys changed the base branch from pr/contextbench-protocol-fixtures to master April 29, 2026 19:28
@PatrickSys PatrickSys closed this Apr 29, 2026
@PatrickSys PatrickSys reopened this Apr 29, 2026
Comment on lines +28 to +107
export interface ContextBenchScoreResult {
status: 'completed' | 'judge_failed';
mode: 'official_evaluator' | 'diagnostic_fallback';
claimBearing: boolean;
command: string;
stdout: string;
stderr: string;
exitStatus: number | null;
exitCode: number | null;
officialEvaluatorFirst: boolean;
officialEvaluatorAttempted: boolean;
officialEvaluatorInvoked: boolean;
outputPath: string;
fallbackReason?: string;
}

export interface FactRecallDiagnosticResult {
missingRequiredFacts: string[];
missingEvidenceFiles: string[];
unsupportedClaim: boolean;
falseReady: boolean;
reasons: string[];
}

function writeJson(filePath: string, value: unknown): void {
mkdirSync(path.dirname(filePath), { recursive: true });
writeFileSync(filePath, `${JSON.stringify(value, null, 2)}\n`, 'utf8');
}

export async function scoreWithOfficialEvaluatorFirst(
params: OfficialEvaluatorParams
): Promise<ContextBenchScoreResult> {
const args = [
'-m',
'contextbench.evaluate',
'--gold',
params.goldPath,
'--pred',
params.predictionPath
];
if (params.cachePath) args.push('--cache', params.cachePath);
args.push('--out', params.outputPath);
const command = `python ${args.join(' ')}`;
const result = await params.runner('python', args, params.cwd);
if (result.status === 0) {
const score = {
status: 'completed' as const,
mode: 'official_evaluator' as const,
claimBearing: params.claimAllowed === true,
command,
stdout: result.stdout,
stderr: result.stderr,
exitStatus: result.status,
exitCode: result.status,
officialEvaluatorFirst: true,
officialEvaluatorAttempted: true,
officialEvaluatorInvoked: true,
outputPath: params.outputPath
};
writeJson(params.outputPath, score);
return score;
}

const score = {
status: 'judge_failed' as const,
mode: 'diagnostic_fallback' as const,
claimBearing: false,
command,
stdout: result.stdout,
stderr: result.stderr,
exitStatus: result.status,
exitCode: result.status,
officialEvaluatorFirst: true,
officialEvaluatorAttempted: true,
officialEvaluatorInvoked: true,
outputPath: params.outputPath,
fallbackReason: 'official_evaluator_failed'
};
writeJson(params.outputPath, score);
return score;
Contributor

P1 Scorer artifact missing stdoutPath/stderrPath; evidence gate will always reject it

scoreWithOfficialEvaluatorFirst writes stdout and stderr as inline raw text fields in the score JSON artifact. But hasOfficialEvaluatorProof in the evidence gate unconditionally checks all three of these conditions:

typeof score.stdoutPath === 'string' && score.stdoutPath.length > 0 &&
hasSha256Hash(artifactHashesByPath[score.stdoutPath]) &&
typeof score.stderrPath === 'string' && score.stderrPath.length > 0 &&
hasSha256Hash(artifactHashesByPath[score.stderrPath])

Because ContextBenchScoreResult has no stdoutPath/stderrPath fields, the serialised score artifact will always have stdoutPath === undefined, causing hasOfficialEvaluatorProof to return false and permanently emitting an official_evaluator_missing failure—even for a successful, claim-allowed run.

The evidence gate test constructs stdoutPath/stderrPath by hand in passingArtifacts(), so this gap is not caught by the existing scorer tests. The scorer must write stdout/stderr to separate log files and include their paths in the score artifact for the gate contract to close.
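
A minimal sketch of the suggested fix, assuming the logs are written next to the score artifact and that hashes are stored as bare 64-character hex digests keyed by path. The file names, `writeEvaluatorLogs`, `hasLogProof`, and the body of `hasSha256Hash` are all assumptions; only the shape of the four-part check comes from the gate excerpt above.

```typescript
import { createHash } from 'node:crypto';
import { mkdirSync, mkdtempSync, writeFileSync } from 'node:fs';
import { tmpdir } from 'node:os';
import * as path from 'node:path';

interface EvaluatorLogPaths {
  stdoutPath: string;
  stderrPath: string;
}

function sha256(text: string): string {
  return createHash('sha256').update(text, 'utf8').digest('hex');
}

// Write stdout/stderr as separate log files and record one hash per path.
function writeEvaluatorLogs(
  outputPath: string,
  stdout: string,
  stderr: string,
  artifactHashesByPath: Record<string, string>
): EvaluatorLogPaths {
  const dir = path.dirname(outputPath);
  mkdirSync(dir, { recursive: true });
  // Hypothetical file names, chosen for illustration only.
  const stdoutPath = path.join(dir, 'evaluator.stdout.log');
  const stderrPath = path.join(dir, 'evaluator.stderr.log');
  writeFileSync(stdoutPath, stdout, 'utf8');
  writeFileSync(stderrPath, stderr, 'utf8');
  artifactHashesByPath[stdoutPath] = sha256(stdout);
  artifactHashesByPath[stderrPath] = sha256(stderr);
  return { stdoutPath, stderrPath };
}

// Assumed hash format: bare lowercase hex digest.
function hasSha256Hash(value: unknown): boolean {
  return typeof value === 'string' && /^[0-9a-f]{64}$/.test(value);
}

// Stand-in for the path conditions quoted from the gate.
function hasLogProof(score: Partial<EvaluatorLogPaths>, hashes: Record<string, string>): boolean {
  return (
    typeof score.stdoutPath === 'string' && score.stdoutPath.length > 0 &&
    hasSha256Hash(hashes[score.stdoutPath]) &&
    typeof score.stderrPath === 'string' && score.stderrPath.length > 0 &&
    hasSha256Hash(hashes[score.stderrPath])
  );
}

const workDir = mkdtempSync(path.join(tmpdir(), 'contextbench-sketch-'));
const hashes: Record<string, string> = {};
const logs = writeEvaluatorLogs(path.join(workDir, 'score.json'), 'ok\n', '', hashes);

console.log(hasLogProof(logs, hashes)); // true
console.log(hasLogProof({}, hashes)); // false, the inline-text-only artifact shape
```

A scorer test that feeds the written artifact into the gate check, instead of hand-built paths, would catch this class of mismatch.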

@PatrickSys PatrickSys merged commit 9e09dad into master Apr 29, 2026
4 checks passed
@PatrickSys PatrickSys deleted the pr/contextbench-harness-core branch April 30, 2026 07:32