feat(agent-eval): trajectory HTML visualizer + eval-time generation #880
Conversation
Extract tool steps from transcript.json into a structured model, render a self-contained trajectory.html (filters, doc heuristics, optional reference path labels), and write it automatically after each successful eval run. Add fixture smoke test and reference trajectory for scenario 01. Made-with: Cursor
Pull request overview
Adds an HTML-based trajectory visualizer for agent eval runs and integrates generation into the eval pipeline so each run produces a self-contained trajectory.html next to transcript.json.
Changes:
- Implement transcript → ordered tool-step extraction with turn boundaries, tagging, and light redaction.
- Add a `viz:trajectory` CLI + reference “green path” milestone labeling, and generate `trajectory.html` automatically after `npm run eval`.
- Add a minimal fixture + smoke script (`test:trajectory`) and document the new workflow.
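As a reader aid, the extracted step model might look roughly like this (field names here are illustrative assumptions, not the actual shape in `transcript-trajectory.ts`):

```typescript
// Hypothetical sketch of the per-step model produced by the extraction pass;
// the real interface in transcript-trajectory.ts may differ.
interface TrajectoryStep {
  turn: number;          // which turn boundary the step falls in
  toolName: string;      // e.g. "Bash", "Read", "Edit"
  tags: string[];        // tagging, incl. doc-signal heuristics
  detail: string;        // redacted command / input preview
  resultPreview: string; // redacted, truncated tool result
  isError: boolean;      // whether the tool_result was an error
}

const example: TrajectoryStep = {
  turn: 1,
  toolName: "Bash",
  tags: ["doc-signal"],
  detail: "ls docs/",
  resultPreview: "README.md",
  isError: false,
};
```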
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| docs/agent-evaluation/src/transcript-trajectory.ts | Core extraction logic for tool steps, turns, tags, doc signals, previews, and redaction. |
| docs/agent-evaluation/src/trajectory-fixture-smoke.ts | Standalone smoke assertions for trajectory extraction against a fixture. |
| docs/agent-evaluation/src/run-agent-eval.ts | Hooks trajectory HTML generation into the eval run flow and updates CLI output text. |
| docs/agent-evaluation/src/reference-trajectory.ts | Loads scenario reference milestones and labels steps as on-path/detour/neutral. |
| docs/agent-evaluation/src/generate-trajectory-html.ts | Generates the self-contained trajectory.html and provides the viz:trajectory CLI. |
| docs/agent-evaluation/src/fixtures/trajectory-minimal.json | Minimal transcript fixture used by the smoke script. |
| docs/agent-evaluation/scenarios/reference-trajectories/01.json | Scenario 01 reference milestone definition for “green path” labeling. |
| docs/agent-evaluation/package.json | Adds viz:trajectory and test:trajectory scripts. |
| docs/agent-evaluation/README.md | Documents trajectory visualization usage, privacy note, and regression check. |
```ts
const __dirname = dirname(fileURLToPath(import.meta.url));
const EVAL_ROOT = evalRootFromHere(import.meta.url);
```
`__dirname` is declared but not used in this module. Removing it would avoid dead code and keep the script tidy.
```ts
const trajectoryPath = await writeTrajectoryHtmlForTranscript(outPath, {
  evalRoot: EVAL_ROOT,
});
console.error(`Wrote ${trajectoryPath}`);
```
`writeTrajectoryHtmlForTranscript` is called inside the main scenario try block without its own error handling. If trajectory generation fails (e.g., malformed transcript, sidecar parse bug, filesystem perms), the entire scenario run is marked failed and the eval aborts, even though `trajectory.html` is a non-essential artifact. Consider wrapping this call in a nested try/catch and only logging a warning on failure so eval success isn’t coupled to visualization generation.
Suggested change:

```diff
-const trajectoryPath = await writeTrajectoryHtmlForTranscript(outPath, {
-  evalRoot: EVAL_ROOT,
-});
-console.error(`Wrote ${trajectoryPath}`);
+try {
+  const trajectoryPath = await writeTrajectoryHtmlForTranscript(outPath, {
+    evalRoot: EVAL_ROOT,
+  });
+  console.error(`Wrote ${trajectoryPath}`);
+} catch (err) {
+  console.warn(
+    `Warning: failed to generate trajectory.html for ${file}:`,
+    err,
+  );
+}
```
```ts
clearSel.addEventListener("click", function () {
  if (selected) selected.classList.remove("selected");
  selected = null;
  history.replaceState(null, "", " ");
});
```
The “Clear selection” handler calls `history.replaceState(null, "", " ")`, which sets the document URL to a single space. This can produce an invalid/odd URL in the address bar and may break reload/bookmark behavior. Prefer clearing just the hash (e.g., `location.hash = ""`) or replacing state with `location.pathname + location.search`.
Suggested change:

```diff
-history.replaceState(null, "", " ");
+history.replaceState(null, "", location.pathname + location.search);
```
```ts
function findToolResult(
  messages: unknown[],
  afterMessageIndex: number,
  toolUseId: string,
): { isError: boolean; preview: string } | null {
  for (let i = afterMessageIndex + 1; i < messages.length; i++) {
    const m = messages[i];
    if (!isRecord(m) || m.type !== "user") continue;
    const inner = m.message;
    if (!isRecord(inner)) continue;
    const content = inner.content;
    if (!Array.isArray(content)) continue;
    for (const block of content) {
      if (!isRecord(block) || block.type !== "tool_result") continue;
      if (String(block.tool_use_id ?? "") !== toolUseId) continue;
      return {
        isError: Boolean(block.is_error),
        preview: summarizeToolResultContent(block.content),
      };
    }
  }
  return null;
}
```
`findToolResult` linearly scans forward through messages for every `tool_use`, making extraction O(tool_uses × messages). With long transcripts this can noticeably slow `npm run eval` now that trajectory generation runs after every scenario. Consider a single pre-pass that indexes `tool_result` blocks by `tool_use_id` (or at least caching lookups) before iterating `tool_use` blocks.
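A sketch of that pre-pass, assuming the transcript shape visible in the quoted snippet (the summarizer is passed in since `summarizeToolResultContent` isn’t shown here):

```typescript
// Hypothetical sketch: index every tool_result block once, then do O(1)
// lookups instead of re-scanning messages for each tool_use.
type ToolResult = { isError: boolean; preview: string };

function isRecord(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null;
}

function indexToolResults(
  messages: unknown[],
  summarize: (content: unknown) => string,
): Map<string, ToolResult> {
  const byId = new Map<string, ToolResult>();
  for (const m of messages) {
    if (!isRecord(m) || m.type !== "user") continue;
    const inner = m.message;
    if (!isRecord(inner)) continue;
    const content = inner.content;
    if (!Array.isArray(content)) continue;
    for (const block of content) {
      if (!isRecord(block) || block.type !== "tool_result") continue;
      const id = String(block.tool_use_id ?? "");
      // Keep the first result seen for an id, matching forward-scan semantics.
      if (id && !byId.has(id)) {
        byId.set(id, {
          isError: Boolean(block.is_error),
          preview: summarize(block.content),
        });
      }
    }
  }
  return byId;
}
```

With the index built once per transcript, `findToolResult` reduces to `byId.get(toolUseId) ?? null`.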
```ts
function extractSdkHints(toolName: string, input: unknown): string[] {
  if (
    toolName !== "Bash" &&
    toolName !== "Write" &&
    toolName !== "Edit" &&
    toolName !== "NotebookEdit"
  ) {
    return [];
  }
  let corpus = "";
  try {
    corpus = JSON.stringify(input);
  } catch {
    corpus = "";
  }
  const hints: string[] = [];
  for (const { id, re } of SDK_HINT_PATTERNS) {
    if (re.test(corpus)) hints.push(id);
  }
  return hints;
}
```
`extractSdkHints` does `JSON.stringify(input)` for Bash/Write/Edit/NotebookEdit. For Write/Edit inputs, `content` can be large, which can add significant CPU/memory overhead during trajectory generation. Consider extracting/truncating only the relevant string fields (e.g., `command`, `file_path`, maybe first N chars of `content`) before running regexes.
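A hedged sketch of that narrowing — the field names and the 2000-character cap are assumptions about the tool input shapes, not code from this PR:

```typescript
// Build a small corpus from likely-relevant string fields instead of
// JSON.stringify-ing the whole input (Write/Edit content can be huge).
const HINT_FIELDS = ["command", "file_path", "new_string", "content"] as const;
const MAX_FIELD_CHARS = 2000; // hypothetical per-field cap

function buildHintCorpus(input: unknown): string {
  if (typeof input !== "object" || input === null) return "";
  const rec = input as Record<string, unknown>;
  const parts: string[] = [];
  for (const field of HINT_FIELDS) {
    const value = rec[field];
    if (typeof value === "string") parts.push(value.slice(0, MAX_FIELD_CHARS));
  }
  return parts.join("\n");
}
```

The regex loop over `SDK_HINT_PATTERNS` would then test this bounded corpus instead of the full serialized input.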
```ts
s = s.replace(/Bearer\s+sk-ant-api[^\s"'`]+/gi, "Bearer [REDACTED]");
s = s.replace(/Bearer\s+[A-Za-z0-9_-]{40,}/g, "Bearer [REDACTED]");
```
`redactSecrets` currently only redacts Bearer-style tokens. The HTML embeds tool command/detail and result previews, so other common secret formats (e.g., `x-api-key: ...`, `Authorization: Basic ...`, `?api_key=...`, `HOOKDECK_API_KEY=...`) can still leak into `trajectory.html`. Consider expanding the redaction patterns and/or adding an option to omit result previews entirely for safer sharing.
Suggested change (note the header-name group must be capturing for the `$1` backreference to work):

```diff
-s = s.replace(/Bearer\s+sk-ant-api[^\s"'`]+/gi, "Bearer [REDACTED]");
-s = s.replace(/Bearer\s+[A-Za-z0-9_-]{40,}/g, "Bearer [REDACTED]");
+const replacements: readonly [RegExp, string][] = [
+  [/Authorization:\s*Bearer\s+[^\s"'`]+/gi, "Authorization: Bearer [REDACTED]"],
+  [/Authorization:\s*Basic\s+[A-Za-z0-9+/=]+/gi, "Authorization: Basic [REDACTED]"],
+  [/Bearer\s+sk-ant-api[^\s"'`]+/gi, "Bearer [REDACTED]"],
+  [/Bearer\s+[A-Za-z0-9._~-]{20,}/g, "Bearer [REDACTED]"],
+  [/\b(x-api-key|api-key|x-auth-token|access-token)\s*:\s*[^\s,;]+/gi, "$1: [REDACTED]"],
+  [/\b([A-Z][A-Z0-9_]*(?:TOKEN|KEY|SECRET|PASSWORD))=[^\s"'`]+/g, "$1=[REDACTED]"],
+  [/([?&](?:api[_-]?key|access[_-]?token|token|key|client[_-]?secret|secret)=)[^&#\s]+/gi, "$1[REDACTED]"],
+];
+for (const [pattern, replacement] of replacements) {
+  s = s.replace(pattern, replacement);
+}
```
Guard generate-trajectory-html main() so importing from run-agent-eval does not parse eval CLI flags (fixes eval:ci ERR_PARSE_ARGS_UNKNOWN_OPTION). Wrap trajectory HTML generation in try/catch so viz failures do not fail the run. Fix clear-selection URL replaceState; drop unused imports. Made-with: Cursor
Summary
Adds a self-contained `trajectory.html` timeline for eval runs: tool steps, turn labels, optional heuristic / LLM score pills, optional reference “green path” labels (scenario 01), and filters (tool kind, documentation vs code reads, doc-signal heuristics).

After each successful `npm run eval`, `trajectory.html` is written next to `transcript.json` (the same pipeline as `npm run viz:trajectory` for regeneration). Includes `npm run test:trajectory` smoke assertions.

Future work (not in this PR)
Notes
`docs/agent-evaluation/results/r*` remains gitignored; reviewers can run a local eval or `viz:trajectory` on an existing run directory.

Root `README.md`/`AGENTS.md` edits about website deploy triggers were left uncommitted on this branch so this PR stays scoped to agent-eval.

Made with Cursor