
Add the triage skills and strategies experiments #12

Closed
ralphbean wants to merge 1 commit into main from triage-skills-and-strategies


@ralphbean

Member

These came from fullsend-ai/fullsend#170 and were used to form the basis of our real triage agent from fullsend-ai/fullsend#279

@ralphbean ralphbean requested a review from a team as a code owner April 29, 2026 18:03

@waynesun09 waynesun09 left a comment

Contributor

Reviewed with a 10-agent review squad. Posting the top 5 most actionable findings inline — 2 are script bugs that crash or fail on macOS, 2 are data integrity issues affecting experiment results, and 1 is a JSON parsing bug that silently truncates output.

The $SCENARIO_NAME_ unbound variable (github-adapter.sh:76) and grep -oP portability issue (github-adapter.sh:80) are the quickest wins. The data integrity findings in the README and judge.sh are worth addressing before drawing conclusions from the experiment results.


---
_This issue was created by the triage-skill-comparison experiment._
_Strategy: $STRATEGY_NAME | Scenario: $SCENARIO_NAME_" \

Contributor

Bug — Unbound variable crash on every run

$SCENARIO_NAME_ (with trailing underscore) is interpreted by bash as a single variable name, since _ is a valid identifier character. This variable is never set, so with set -euo pipefail (line 3), this line will crash every invocation with SCENARIO_NAME_: unbound variable.

Suggested change
- _Strategy: $STRATEGY_NAME | Scenario: $SCENARIO_NAME_" \
+ _Strategy: $STRATEGY_NAME | Scenario: ${SCENARIO_NAME}_" \

Use ${SCENARIO_NAME}_ to explicitly delimit the variable name from the trailing underscore literal.
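
A minimal sketch of the difference, assuming a standalone script rather than github-adapter.sh itself. The two variable values are example inputs, and the unbraced expansion runs in a subshell so the failure can be observed without aborting the demo:

```bash
#!/usr/bin/env bash
set -euo pipefail

STRATEGY_NAME="structured-triage"   # example value for illustration
SCENARIO_NAME="crash-on-save"       # example value for illustration

# Braces end the variable name before the literal underscore.
footer="_Strategy: $STRATEGY_NAME | Scenario: ${SCENARIO_NAME}_"
echo "$footer"

# Without braces, bash looks up SCENARIO_NAME_ instead. Run it in a
# subshell so the failure under `set -u` does not abort this script.
unbraced_status="ok"
if ! (echo "Scenario: $SCENARIO_NAME_") 2>/dev/null; then
  unbraced_status="failed"
fi
echo "unbraced expansion: $unbraced_status"
```

The subshell exits non-zero because `set -u` treats the unset SCENARIO_NAME_ as a fatal error, which is the same failure the adapter script would hit on its first run.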

--label "$LABEL_TRIAGE" \
2>/dev/null)"

ISSUE_NUMBER="$(echo "$ISSUE_URL" | grep -oP '\d+$')"

Contributor

Bug — grep -oP is GNU-only, fails on macOS

grep -P (Perl regex) is not available on macOS's default BSD grep. This will fail with grep: invalid option -- P on any macOS contributor's machine.

Suggested change
- ISSUE_NUMBER="$(echo "$ISSUE_URL" | grep -oP '\d+$')"
+ ISSUE_NUMBER="$(echo "$ISSUE_URL" | grep -oE '[0-9]+$')"

grep -oE with POSIX extended regex achieves the same result and works on both GNU and BSD grep.
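
For illustration, a small sketch of the portable extraction. The URL below is a made-up example of an issue URL ending in its number, and the second variant (pure parameter expansion, no grep at all) is an alternative that is not part of the suggestion above:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Made-up example URL; the real script gets ISSUE_URL from its issue-creation call.
ISSUE_URL="https://github.com/example-org/example-repo/issues/42"

# Extended regex with an explicit digit class: accepted by GNU and BSD grep.
ISSUE_NUMBER="$(echo "$ISSUE_URL" | grep -oE '[0-9]+$')"
echo "$ISSUE_NUMBER"

# Alternative without grep: strip everything up to and including the last slash.
ISSUE_NUMBER_ALT="${ISSUE_URL##*/}"
echo "$ISSUE_NUMBER_ALT"
```

The parameter-expansion form avoids the external process, but unlike the grep form it does not check that the final path segment is numeric.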

Comment on lines +176 to +179
| Rank | Strategy | Mean score | Reliability |
|------|----------|-----------|-------------|
| 1 (tie) | omo-prometheus | 4.38 | 98% |
| 1 (tie) | omc-deep-interview | 4.38 | 97% |

Contributor

Data integrity — Results table is incomplete and reliability numbers don't match trial data

Two issues with this rankings table:

  1. Incomplete results presented as final rankings: slow-search and wrong-search-results scenarios have zero results, and silent-data-corruption only has 2 of 5 strategies. The rankings here are drawn from partial data and may change significantly once all scenarios are run.

  2. Reliability percentages contradict trial data: The table shows values like 98% and 97%, but examining the actual result files, all trials show parse_failures: 0 — suggesting either 100% reliability or a different calculation method that isn't documented.

Consider either marking this table as preliminary/partial, or holding it until all scenario × strategy combinations have results.
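
One way to make the gaps visible before publishing rankings is a coverage check over the results tree. This is only a sketch: it assumes results live under scenario/strategy/trial-N directories (the layout suggested by the judge-assessment.json path cited elsewhere in this review), and the function name and demo tree are hypothetical:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Print every scenario/strategy pair under $1 that has no trial-* directory.
# $2 is a space-separated list of scenarios, $3 a space-separated list of strategies.
missing_combinations() {
  local root="$1" scenario strategy
  for scenario in $2; do
    for strategy in $3; do
      # compgen -G succeeds only if the glob matches at least one path.
      if ! compgen -G "$root/$scenario/$strategy/trial-*" >/dev/null; then
        echo "$scenario/$strategy"
      fi
    done
  done
}

# Demo on a throwaway tree with one combination deliberately left out.
demo="$(mktemp -d)"
mkdir -p "$demo/crash-on-save/omo-prometheus/trial-1"
mkdir -p "$demo/crash-on-save/omc-deep-interview/trial-1"
mkdir -p "$demo/slow-search/omo-prometheus/trial-1"

missing="$(missing_combinations "$demo" \
  "crash-on-save slow-search" "omo-prometheus omc-deep-interview")"
echo "missing: $missing"
rm -rf "$demo"
```

If the check prints anything, the table could be labeled preliminary and list the missing pairs next to it.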

}

echo "$JUDGE_JSON" | jq '.' > "$TRIAL_DIR/judge-assessment.json"
SCORE="$(echo "$JUDGE_JSON" | jq -r '.weighted_total // 0')"

Contributor

Data integrity — weighted_total values are unreliable

Two problems with trusting the LLM-provided weighted_total:

  1. Arithmetic drift: Spot-checking ~33 of 120 judge assessment files shows 0.05–0.15 point discrepancies between the LLM's weighted_total and the sum you'd get from applying the stated weights to the individual scores. These small errors can change rankings.

  2. Inconsistent nesting: At least one file (crash-on-save/structured-triage/trial-8/judge-assessment.json) has weighted_total nested inside .scores instead of at the top level, causing this jq expression to return 0 via the // 0 fallback — silently zeroing out the score.

Consider computing weighted_total deterministically from the component scores rather than trusting the LLM's arithmetic, and normalize the JSON structure before reading it.
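
A rough sketch of the deterministic approach, not the experiment's actual rubric: the dimension names and weights below are placeholders, and the function name is hypothetical. It reads only the component scores under .scores, so a weighted_total emitted by the judge is ignored wherever it is nested, and a missing component raises an error instead of falling back to 0:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Placeholder rubric: these dimension names and weights are assumptions.
WEIGHTS='{"accuracy": 0.5, "completeness": 0.3, "clarity": 0.2}'

# Recompute the weighted total from .scores, rounded to two decimals.
# Fails loudly if a weighted dimension is missing from the assessment.
recompute_weighted_total() {
  jq -r --argjson w "$WEIGHTS" '
    .scores as $s
    | [ $w | to_entries[]
        | .value * ($s[.key] // error("missing score: " + .key)) ]
    | add
    | (. * 100 | round) / 100
  '
}

if command -v jq >/dev/null; then
  # The judge put weighted_total inside .scores here; it is simply not read.
  JUDGE_JSON='{"scores": {"accuracy": 4, "completeness": 5, "clarity": 3, "weighted_total": 4.2}}'
  SCORE="$(echo "$JUDGE_JSON" | recompute_weighted_total)"
  echo "$SCORE"
else
  echo "jq not found; skipping demo" >&2
fi
```

With the placeholder weights, the sample works out to 0.5·4 + 0.3·5 + 0.2·3 = 4.1, so the judge's own 4.2 would count as drift.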

Comment on lines +119 to +127
# Try first { ... } block
local braced
braced="$(echo "$raw" | awk '/{/{found=1} found{print} /}/{if(found) exit}')"
if [[ -n "$braced" ]] && echo "$braced" | jq . &>/dev/null; then
echo "$braced"; return 0
fi

echo "$raw"
return 1

Contributor

Bug — extract_json truncates nested JSON objects

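The awk program sets a flag at the first line containing {, prints from there, and exits at the first line containing }. For multi-line JSON with nested objects (which is the expected output format for triage responses), this cuts the block off at the line where the first nested object closes, dropping every field after it.

For example, given:

{
  "priority": { "level": "high", "reason": "crash" },
  "component": "auth"
}

The awk step would capture only the first two lines, ending at "priority": { "level": "high", "reason": "crash" }, and drop "component" entirely. Because that fragment is unbalanced, the jq check in this excerpt rejects it, so the function falls through to echo "$raw" and returns 1 instead of returning the object.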

Consider using a brace-depth counter in awk, or piping through jq to extract the first valid JSON object from the mixed output.
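
A brace-depth counter could look roughly like the sketch below. The function name is hypothetical and this is not a drop-in replacement for extract_json: it counts braces character by character, so a brace inside a JSON string value would throw it off, and the result should still be validated with jq afterwards:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Print the first balanced { ... } block found on stdin; exit 1 if none closes.
# Limitation: braces inside JSON string values are counted as well.
first_json_object() {
  awk '
    {
      n = length($0)
      for (i = 1; i <= n; i++) {
        c = substr($0, i, 1)
        if (c == "{") { if (depth == 0) started = 1; depth++ }
        if (started) buf = buf c
        if (c == "}" && started) {
          depth--
          if (depth == 0) { print buf; complete = 1; exit }
        }
      }
      if (started) buf = buf "\n"
    }
    END { if (!complete) exit 1 }
  '
}

raw='Here is the triage result:
{
  "priority": { "level": "high", "reason": "crash" },
  "component": "auth"
}
Let me know if you need more.'

extracted="$(echo "$raw" | first_json_object)"
echo "$extracted"
```

On the sample input, the nested "priority" object only brings the depth from 2 back to 1, so collection continues through "component" and stops at the outer closing brace.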

@ralphbean

Member Author

Rather than fix this one up, I'm going to drop it. Focusing on other things.

@ralphbean ralphbean closed this Jun 16, 2026
@fullsend-ai-retro

fullsend-ai-retro Bot commented Jun 16, 2026


🤖 Finished Retro · ✅ Success · Started 8:24 PM UTC · Completed 8:32 PM UTC
Commit: f40693c · View workflow run →

@fullsend-ai-retro


Retro: PR #12 — Add the triage skills and strategies experiments

PR #12 was a human-authored PR by ralphbean adding 3,833 files (136K lines) of triage experiment data. It was opened April 29, received a thorough CHANGES_REQUESTED review from waynesun09 on May 19 (citing a "10-agent review squad"), and was closed without merge on June 16 when the author chose to drop it.

Workflow observations

  • Limited agent involvement: The review dispatch fired on PR creation, but the actual review was posted by a human 20 days later. No fix agent ran after the CHANGES_REQUESTED review. The workflow was primarily human-driven.
  • Retro value questionable: Running a retro on a human-authored, closed-without-merge PR with minimal automated agent interaction yields limited actionable signal.

Existing issue coverage

All potential improvement areas are already tracked by open issues in fullsend-ai/fullsend:

No new proposals are warranted — existing issues adequately cover the improvement opportunities observed in this workflow.
