Skip to content

feat(agent-eval): trajectory HTML visualizer + eval-time generation#880

Open
leggetter wants to merge 2 commits intomainfrom
feat/agent-eval-visualizer
Open

feat(agent-eval): trajectory HTML visualizer + eval-time generation#880
leggetter wants to merge 2 commits intomainfrom
feat/agent-eval-visualizer

Conversation

@leggetter
Copy link
Copy Markdown
Collaborator

Summary

Adds a self-contained trajectory.html timeline for eval runs: tool steps, turn labels, optional heuristic / LLM score pills, optional reference “green path” labels (scenario 01), and filters (tool kind, documentation vs code reads, doc-signal heuristics).

After each successful npm run eval, trajectory.html is written next to transcript.json (same pipeline as npm run viz:trajectory for regeneration). Includes npm run test:trajectory smoke assertions.

Future work (not in this PR)

  • Dashboard prompt success — Measure whether operators who copy the Outpost prompt from the Hookdeck dashboard complete integrations successfully (product / analytics / support signals).
  • Prompt optimization for large scenarios — Larger integration scenarios are long-running; iterate on the onboarding prompt and eval harness so agents stay on track with less wall time and fewer detours.

Notes

  • docs/agent-evaluation/results/r* remains gitignored; reviewers can run a local eval or viz:trajectory on an existing run directory.

Root README.md / AGENTS.md edits about website deploy triggers were left uncommitted on this branch so this PR stays scoped to agent-eval.

Made with Cursor

Extract tool steps from transcript.json into a structured model, render a
self-contained trajectory.html (filters, doc heuristics, optional reference
path labels), and write it automatically after each successful eval run.
Add fixture smoke test and reference trajectory for scenario 01.

Made-with: Cursor
Copilot AI review requested due to automatic review settings April 24, 2026 17:02
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds an HTML-based trajectory visualizer for agent eval runs and integrates generation into the eval pipeline so each run produces a self-contained trajectory.html next to transcript.json.

Changes:

  • Implement transcript → ordered tool-step extraction with turn boundaries, tagging, and light redaction.
  • Add viz:trajectory CLI + reference “green path” milestone labeling, and generate trajectory.html automatically after npm run eval.
  • Add a minimal fixture + smoke script (test:trajectory) and document the new workflow.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
docs/agent-evaluation/src/transcript-trajectory.ts Core extraction logic for tool steps, turns, tags, doc signals, previews, and redaction.
docs/agent-evaluation/src/trajectory-fixture-smoke.ts Standalone smoke assertions for trajectory extraction against a fixture.
docs/agent-evaluation/src/run-agent-eval.ts Hooks trajectory HTML generation into the eval run flow and updates CLI output text.
docs/agent-evaluation/src/reference-trajectory.ts Loads scenario reference milestones and labels steps as on-path/detour/neutral.
docs/agent-evaluation/src/generate-trajectory-html.ts Generates the self-contained trajectory.html and provides the viz:trajectory CLI.
docs/agent-evaluation/src/fixtures/trajectory-minimal.json Minimal transcript fixture used by the smoke script.
docs/agent-evaluation/scenarios/reference-trajectories/01.json Scenario 01 reference milestone definition for “green path” labeling.
docs/agent-evaluation/package.json Adds viz:trajectory and test:trajectory scripts.
docs/agent-evaluation/README.md Documents trajectory visualization usage, privacy note, and regression check.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +27 to +29
const __dirname = dirname(fileURLToPath(import.meta.url));
const EVAL_ROOT = evalRootFromHere(import.meta.url);

Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

__dirname is declared but not used in this module. Removing it would avoid dead code and keep the script tidy.

Copilot uses AI. Check for mistakes.
Comment on lines +1038 to +1041
const trajectoryPath = await writeTrajectoryHtmlForTranscript(outPath, {
evalRoot: EVAL_ROOT,
});
console.error(`Wrote ${trajectoryPath}`);
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

writeTrajectoryHtmlForTranscript is called inside the main scenario try block without its own error handling. If trajectory generation fails (e.g., malformed transcript, sidecar parse bug, filesystem perms), the entire scenario run is marked failed and the eval aborts, even though trajectory.html is a non-essential artifact. Consider wrapping this call in a nested try/catch and only logging a warning on failure so eval success isn’t coupled to visualization generation.

Suggested change
const trajectoryPath = await writeTrajectoryHtmlForTranscript(outPath, {
evalRoot: EVAL_ROOT,
});
console.error(`Wrote ${trajectoryPath}`);
try {
const trajectoryPath = await writeTrajectoryHtmlForTranscript(outPath, {
evalRoot: EVAL_ROOT,
});
console.error(`Wrote ${trajectoryPath}`);
} catch (err) {
console.warn(
`Warning: failed to generate trajectory.html for ${file}:`,
err,
);
}

Copilot uses AI. Check for mistakes.
clearSel.addEventListener("click", function () {
if (selected) selected.classList.remove("selected");
selected = null;
history.replaceState(null, "", " ");
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The “Clear selection” handler calls history.replaceState(null, "", " "), which sets the document URL to a single space. This can produce an invalid/odd URL in the address bar and may break reload/bookmark behavior. Prefer clearing just the hash (e.g., location.hash = "") or replacing state with location.pathname + location.search.

Suggested change
history.replaceState(null, "", " ");
history.replaceState(null, "", location.pathname + location.search);

Copilot uses AI. Check for mistakes.
Comment on lines +322 to +342
function findToolResult(
messages: unknown[],
afterMessageIndex: number,
toolUseId: string,
): { isError: boolean; preview: string } | null {
for (let i = afterMessageIndex + 1; i < messages.length; i++) {
const m = messages[i];
if (!isRecord(m) || m.type !== "user") continue;
const inner = m.message;
if (!isRecord(inner)) continue;
const content = inner.content;
if (!Array.isArray(content)) continue;
for (const block of content) {
if (!isRecord(block) || block.type !== "tool_result") continue;
if (String(block.tool_use_id ?? "") !== toolUseId) continue;
return {
isError: Boolean(block.is_error),
preview: summarizeToolResultContent(block.content),
};
}
}
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

findToolResult linearly scans forward through messages for every tool_use, making extraction O(tool_uses × messages). With long transcripts this can noticeably slow npm run eval now that trajectory generation runs after every scenario. Consider a single pre-pass that indexes tool_result blocks by tool_use_id (or at least caching lookups) before iterating tool_use blocks.

Copilot uses AI. Check for mistakes.
Comment on lines +228 to +248
function extractSdkHints(toolName: string, input: unknown): string[] {
if (
toolName !== "Bash" &&
toolName !== "Write" &&
toolName !== "Edit" &&
toolName !== "NotebookEdit"
) {
return [];
}
let corpus = "";
try {
corpus = JSON.stringify(input);
} catch {
corpus = "";
}
const hints: string[] = [];
for (const { id, re } of SDK_HINT_PATTERNS) {
if (re.test(corpus)) hints.push(id);
}
return hints;
}
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extractSdkHints does JSON.stringify(input) for Bash/Write/Edit/NotebookEdit. For Write/Edit inputs, content can be large, which can add significant CPU/memory overhead during trajectory generation. Consider extracting/truncating only the relevant string fields (e.g., command, file_path, maybe first N chars of content) before running regexes.

Copilot uses AI. Check for mistakes.
Comment on lines +91 to +92
s = s.replace(/Bearer\s+sk-ant-api[^\s"'`]+/gi, "Bearer [REDACTED]");
s = s.replace(/Bearer\s+[A-Za-z0-9_-]{40,}/g, "Bearer [REDACTED]");
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

redactSecrets currently only redacts Bearer-style tokens. The HTML embeds tool command/detail and result previews, so other common secret formats (e.g., x-api-key: ..., Authorization: Basic ..., ?api_key=..., HOOKDECK_API_KEY=...) can still leak into trajectory.html. Consider expanding the redaction patterns and/or adding an option to omit result previews entirely for safer sharing.

Suggested change
s = s.replace(/Bearer\s+sk-ant-api[^\s"'`]+/gi, "Bearer [REDACTED]");
s = s.replace(/Bearer\s+[A-Za-z0-9_-]{40,}/g, "Bearer [REDACTED]");
const replacements: readonly [RegExp, string][] = [
[/Authorization:\s*Bearer\s+[^\s"'`]+/gi, "Authorization: Bearer [REDACTED]"],
[/Authorization:\s*Basic\s+[A-Za-z0-9+/=]+/gi, "Authorization: Basic [REDACTED]"],
[/Bearer\s+sk-ant-api[^\s"'`]+/gi, "Bearer [REDACTED]"],
[/Bearer\s+[A-Za-z0-9._~-]{20,}/g, "Bearer [REDACTED]"],
[/\b(?:x-api-key|api-key|x-auth-token|access-token)\s*:\s*[^\s,;]+/gi, "$1: [REDACTED]"],
[/\b([A-Z][A-Z0-9_]*(?:TOKEN|KEY|SECRET|PASSWORD))=([^\s"'`]+)/g, "$1=[REDACTED]"],
[/([?&](?:api[_-]?key|access[_-]?token|token|key|client[_-]?secret|secret)=)([^&#\s]+)/gi, "$1[REDACTED]"],
];
for (const [pattern, replacement] of replacements) {
s = s.replace(pattern, replacement);
}

Copilot uses AI. Check for mistakes.
Guard generate-trajectory-html main() so importing from run-agent-eval does
not parse eval CLI flags (fixes eval:ci ERR_PARSE_ARGS_UNKNOWN_OPTION).

Wrap trajectory HTML generation in try/catch so viz failures do not fail the
run. Fix clear-selection URL replaceState; drop unused imports.

Made-with: Cursor
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants