feat(agent-eval): trajectory HTML visualizer + eval-time generation by leggetter · Pull Request #880 · hookdeck/outpost

leggetter · 2026-04-24T17:02:49Z

Summary

Adds a self-contained trajectory.html timeline for eval runs: tool steps, turn labels, optional heuristic / LLM score pills, optional reference “green path” labels (scenario 01), and filters (tool kind, documentation vs code reads, doc-signal heuristics).

After each successful npm run eval, trajectory.html is written next to transcript.json (same pipeline as npm run viz:trajectory for regeneration). Includes npm run test:trajectory smoke assertions.

Future work (not in this PR)

Dashboard prompt success — Measure whether operators who copy the Outpost prompt from the Hookdeck dashboard complete integrations successfully (product / analytics / support signals).
Prompt optimization for large scenarios — Larger integration scenarios are long-running; iterate on the onboarding prompt and eval harness so agents stay on track with less wall time and fewer detours.

Notes

docs/agent-evaluation/results/r* remains gitignored; reviewers can run a local eval or viz:trajectory on an existing run directory.

Root README.md / AGENTS.md edits about website deploy triggers were left uncommitted on this branch so this PR stays scoped to agent-eval.

Made with Cursor

Extract tool steps from transcript.json into a structured model, render a self-contained trajectory.html (filters, doc heuristics, optional reference path labels), and write it automatically after each successful eval run. Add fixture smoke test and reference trajectory for scenario 01. Made-with: Cursor

Copilot

Pull request overview

Adds an HTML-based trajectory visualizer for agent eval runs and integrates generation into the eval pipeline so each run produces a self-contained trajectory.html next to transcript.json.

Changes:

Implement transcript → ordered tool-step extraction with turn boundaries, tagging, and light redaction.
Add viz:trajectory CLI + reference “green path” milestone labeling, and generate trajectory.html automatically after npm run eval.
Add a minimal fixture + smoke script (test:trajectory) and document the new workflow.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 6 comments.

Show a summary per file

File	Description
docs/agent-evaluation/src/transcript-trajectory.ts	Core extraction logic for tool steps, turns, tags, doc signals, previews, and redaction.
docs/agent-evaluation/src/trajectory-fixture-smoke.ts	Standalone smoke assertions for trajectory extraction against a fixture.
docs/agent-evaluation/src/run-agent-eval.ts	Hooks trajectory HTML generation into the eval run flow and updates CLI output text.
docs/agent-evaluation/src/reference-trajectory.ts	Loads scenario reference milestones and labels steps as on-path/detour/neutral.
docs/agent-evaluation/src/generate-trajectory-html.ts	Generates the self-contained `trajectory.html` and provides the `viz:trajectory` CLI.
docs/agent-evaluation/src/fixtures/trajectory-minimal.json	Minimal transcript fixture used by the smoke script.
docs/agent-evaluation/scenarios/reference-trajectories/01.json	Scenario 01 reference milestone definition for “green path” labeling.
docs/agent-evaluation/package.json	Adds `viz:trajectory` and `test:trajectory` scripts.
docs/agent-evaluation/README.md	Documents trajectory visualization usage, privacy note, and regression check.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-24T17:05:53Z

+const __dirname = dirname(fileURLToPath(import.meta.url));
+const EVAL_ROOT = evalRootFromHere(import.meta.url);
+


__dirname is declared but not used in this module. Removing it would avoid dead code and keep the script tidy.

Copilot · 2026-04-24T17:05:53Z

+      const trajectoryPath = await writeTrajectoryHtmlForTranscript(outPath, {
+        evalRoot: EVAL_ROOT,
+      });
+      console.error(`Wrote ${trajectoryPath}`);


writeTrajectoryHtmlForTranscript is called inside the main scenario try block without its own error handling. If trajectory generation fails (e.g., malformed transcript, sidecar parse bug, filesystem perms), the entire scenario run is marked failed and the eval aborts, even though trajectory.html is a non-essential artifact. Consider wrapping this call in a nested try/catch and only logging a warning on failure so eval success isn’t coupled to visualization generation.

Suggested change

const trajectoryPath = await writeTrajectoryHtmlForTranscript(outPath, {

evalRoot: EVAL_ROOT,

});

console.error(`Wrote ${trajectoryPath}`);

try {

const trajectoryPath = await writeTrajectoryHtmlForTranscript(outPath, {

evalRoot: EVAL_ROOT,

});

console.error(`Wrote ${trajectoryPath}`);

} catch (err) {

console.warn(

`Warning: failed to generate trajectory.html for ${file}:`,

err,

);

}

Copilot · 2026-04-24T17:05:54Z

+  clearSel.addEventListener("click", function () {
+    if (selected) selected.classList.remove("selected");
+    selected = null;
+    history.replaceState(null, "", " ");


The “Clear selection” handler calls history.replaceState(null, "", " "), which sets the document URL to a single space. This can produce an invalid/odd URL in the address bar and may break reload/bookmark behavior. Prefer clearing just the hash (e.g., location.hash = "") or replacing state with location.pathname + location.search.

Suggested change

history.replaceState(null, "", " ");

history.replaceState(null, "", location.pathname + location.search);

Copilot · 2026-04-24T17:05:54Z

+function findToolResult(
+  messages: unknown[],
+  afterMessageIndex: number,
+  toolUseId: string,
+): { isError: boolean; preview: string } | null {
+  for (let i = afterMessageIndex + 1; i < messages.length; i++) {
+    const m = messages[i];
+    if (!isRecord(m) || m.type !== "user") continue;
+    const inner = m.message;
+    if (!isRecord(inner)) continue;
+    const content = inner.content;
+    if (!Array.isArray(content)) continue;
+    for (const block of content) {
+      if (!isRecord(block) || block.type !== "tool_result") continue;
+      if (String(block.tool_use_id ?? "") !== toolUseId) continue;
+      return {
+        isError: Boolean(block.is_error),
+        preview: summarizeToolResultContent(block.content),
+      };
+    }
+  }


findToolResult linearly scans forward through messages for every tool_use, making extraction O(tool_uses × messages). With long transcripts this can noticeably slow npm run eval now that trajectory generation runs after every scenario. Consider a single pre-pass that indexes tool_result blocks by tool_use_id (or at least caching lookups) before iterating tool_use blocks.

Copilot · 2026-04-24T17:05:54Z

+function extractSdkHints(toolName: string, input: unknown): string[] {
+  if (
+    toolName !== "Bash" &&
+    toolName !== "Write" &&
+    toolName !== "Edit" &&
+    toolName !== "NotebookEdit"
+  ) {
+    return [];
+  }
+  let corpus = "";
+  try {
+    corpus = JSON.stringify(input);
+  } catch {
+    corpus = "";
+  }
+  const hints: string[] = [];
+  for (const { id, re } of SDK_HINT_PATTERNS) {
+    if (re.test(corpus)) hints.push(id);
+  }
+  return hints;
+}


extractSdkHints does JSON.stringify(input) for Bash/Write/Edit/NotebookEdit. For Write/Edit inputs, content can be large, which can add significant CPU/memory overhead during trajectory generation. Consider extracting/truncating only the relevant string fields (e.g., command, file_path, maybe first N chars of content) before running regexes.

Copilot · 2026-04-24T17:05:55Z

+  s = s.replace(/Bearer\s+sk-ant-api[^\s"'`]+/gi, "Bearer [REDACTED]");
+  s = s.replace(/Bearer\s+[A-Za-z0-9_-]{40,}/g, "Bearer [REDACTED]");


redactSecrets currently only redacts Bearer-style tokens. The HTML embeds tool command/detail and result previews, so other common secret formats (e.g., x-api-key: ..., Authorization: Basic ..., ?api_key=..., HOOKDECK_API_KEY=...) can still leak into trajectory.html. Consider expanding the redaction patterns and/or adding an option to omit result previews entirely for safer sharing.

Suggested change

s = s.replace(/Bearer\s+sk-ant-api[^\s"'`]+/gi, "Bearer [REDACTED]");

s = s.replace(/Bearer\s+[A-Za-z0-9_-]{40,}/g, "Bearer [REDACTED]");

const replacements: readonly [RegExp, string][] = [

[/Authorization:\s*Bearer\s+[^\s"'`]+/gi, "Authorization: Bearer [REDACTED]"],

[/Authorization:\s*Basic\s+[A-Za-z0-9+/=]+/gi, "Authorization: Basic [REDACTED]"],

[/Bearer\s+sk-ant-api[^\s"'`]+/gi, "Bearer [REDACTED]"],

[/Bearer\s+[A-Za-z0-9._~-]{20,}/g, "Bearer [REDACTED]"],

[/\b(?:x-api-key|api-key|x-auth-token|access-token)\s*:\s*[^\s,;]+/gi, "$1: [REDACTED]"],

[/\b([A-Z][A-Z0-9_]*(?:TOKEN|KEY|SECRET|PASSWORD))=([^\s"'`]+)/g, "$1=[REDACTED]"],

[/([?&](?:api[_-]?key|access[_-]?token|token|key|client[_-]?secret|secret)=)([^&#\s]+)/gi, "$1[REDACTED]"],

];

for (const [pattern, replacement] of replacements) {

s = s.replace(pattern, replacement);

}

Guard generate-trajectory-html main() so importing from run-agent-eval does not parse eval CLI flags (fixes eval:ci ERR_PARSE_ARGS_UNKNOWN_OPTION). Wrap trajectory HTML generation in try/catch so viz failures do not fail the run. Fix clear-selection URL replaceState; drop unused imports. Made-with: Cursor

Copilot AI review requested due to automatic review settings April 24, 2026 17:02

Copilot started reviewing on behalf of leggetter April 24, 2026 17:03 View session

Copilot AI reviewed Apr 24, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(agent-eval): trajectory HTML visualizer + eval-time generation#880

feat(agent-eval): trajectory HTML visualizer + eval-time generation#880
leggetter wants to merge 2 commits intomainfrom
feat/agent-eval-visualizer

leggetter commented Apr 24, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 24, 2026

Uh oh!

Copilot AI Apr 24, 2026

Uh oh!

Copilot AI Apr 24, 2026

Uh oh!

Copilot AI Apr 24, 2026

Uh oh!

Copilot AI Apr 24, 2026

Uh oh!

Copilot AI Apr 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		const __dirname = dirname(fileURLToPath(import.meta.url));
		const EVAL_ROOT = evalRootFromHere(import.meta.url);

	history.replaceState(null, "", " ");
	history.replaceState(null, "", location.pathname + location.search);

		s = s.replace(/Bearer\s+sk-ant-api[^\s"'`]+/gi, "Bearer [REDACTED]");
		s = s.replace(/Bearer\s+[A-Za-z0-9_-]{40,}/g, "Bearer [REDACTED]");

-  s = s.replace(/Bearer\s+sk-ant-api[^\s"'`]+/gi, "Bearer [REDACTED]");
-  s = s.replace(/Bearer\s+[A-Za-z0-9_-]{40,}/g, "Bearer [REDACTED]");
+  const replacements: readonly [RegExp, string][] = [
+    [/Authorization:\s*Bearer\s+[^\s"'`]+/gi, "Authorization: Bearer [REDACTED]"],
+    [/Authorization:\s*Basic\s+[A-Za-z0-9+/=]+/gi, "Authorization: Basic [REDACTED]"],
+    [/Bearer\s+sk-ant-api[^\s"'`]+/gi, "Bearer [REDACTED]"],
+    [/Bearer\s+[A-Za-z0-9._~-]{20,}/g, "Bearer [REDACTED]"],
+    [/\b(?:x-api-key|api-key|x-auth-token|access-token)\s*:\s*[^\s,;]+/gi, "$1: [REDACTED]"],
+    [/\b([A-Z][A-Z0-9_]*(?:TOKEN|KEY|SECRET|PASSWORD))=([^\s"'`]+)/g, "$1=[REDACTED]"],
+    [/([?&](?:api[_-]?key|access[_-]?token|token|key|client[_-]?secret|secret)=)([^&#\s]+)/gi, "$1[REDACTED]"],
+  ];
+  for (const [pattern, replacement] of replacements) {
+    s = s.replace(pattern, replacement);
+  }

Conversation

leggetter commented Apr 24, 2026

Summary

Future work (not in this PR)

Notes

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants