feat(research): Autodata dataset builder — real two-tier solvers, empirical strong/weak gap#41

Merged: drewstone merged 1 commit into main from feat/autodata-dataset-builder on Jun 25, 2026

@drewstone

What

The LIVE Autodata / Agentic Self-Instruct application in agent-knowledge: ground on a real source document, run the agentic data-creation inner loop with real two-tier solver models over the Tangle router, score answers with an llmJudge rubric, keep only examples that discriminate a strong solver from a weak one, and report the empirical strong/weak gap (the paper's Table 1) — or an honest null.

src/autodata/ — 8 files, ~1,734 LOC. Importable as @tangle-network/agent-knowledge/autodata; runnable with pnpm autodata under dotenvx.

Reuse — the loop is composed, not reinvented

The inner loop (createDataCreationLoop + discriminativeAcceptRule + qualityCheck) is vendored from agent-runtime examples/agentic-data-creation with a provenance + lift-candidate note. That example is not a published runtime export (examples aren't in the npm dist), so it can't be imported; vendoring it is the brief's sanctioned "copy with a note" path. It composes only published substrate primitives, and nothing re-implements a judge, sampler, corpus, or cost counter:

| Piece | Reused primitive |
| --- | --- |
| loop kernel + N× sampling | runLoop (@tangle-network/agent-runtime/loops) |
| accepted-example store | InMemoryCorpus (loops) |
| in-process worker seam | inProcessSandboxClient (loops) |
| rubric judge | llmJudge + createChatClient (@tangle-network/agent-eval) |
| cost accounting | CostLedger (agent-eval) |
| grounding | politeFetch -> htmlToText -> chunkMarkdown (this repo's ingestion) |

The only genuinely new piece is discriminativeAcceptRule (the paper's reward: strong ≥ minStrong, weak < maxWeak, gap ≥ minGap). agent-runtime dep stays ^0.77.0 (already ships every primitive used; nothing to bump).
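
A minimal sketch of the acceptance condition stated above (strong ≥ minStrong, weak < maxWeak, gap ≥ minGap), assuming each solver's score is the mean of its N judged samples. The type names, the `discriminates` helper, and the mean aggregation are illustrative assumptions, not the vendored discriminativeAcceptRule signature.

```typescript
// Hypothetical shapes for illustration; the vendored rule may differ.
interface GapThresholds {
  minStrong: number; // strong solver must score at least this
  maxWeak: number;   // weak solver must score strictly below this
  minGap: number;    // strong minus weak must be at least this
}

interface GapVerdict {
  valid: boolean;
  strong: number;
  weak: number;
  gap: number;
}

// Mean of judge scores; an empty sample set scores 0 so it can never pass minStrong > 0.
const mean = (xs: number[]): number =>
  xs.length === 0 ? 0 : xs.reduce((a, b) => a + b, 0) / xs.length;

function discriminates(
  strongScores: number[],
  weakScores: number[],
  t: GapThresholds,
): GapVerdict {
  const strong = mean(strongScores);
  const weak = mean(weakScores);
  const gap = strong - weak;
  // All three conditions must hold for the example to count as discriminating.
  const valid = strong >= t.minStrong && weak < t.maxWeak && gap >= t.minGap;
  return { valid, strong, weak, gap };
}
```

Under this sketch, a candidate on which both solvers score equally has gap 0 and is rejected regardless of how high the scores are, which is the situation the live run below reports.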

One real transport seam — routerChat — drives all four roles (challenger, weak solver, strong solver, judge). Per-call USD is the router's own cost when it returns one, else a rate-table estimate over the EXACT token counts, with the source flagged. The judge's spend is recorded into the same CostLedger.
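
The cost fallback described above can be sketched as follows. This is an illustration under assumptions: the `Usage`, `Rate`, and `CallCost` shapes, the `costOfCall` helper, and the per-million-token rate unit are hypothetical and do not reflect routerChat's or CostLedger's actual API.

```typescript
// Hypothetical shapes; not the real routerChat or CostLedger types.
type CostSource = "router" | "estimate";

interface Usage {
  promptTokens: number;
  completionTokens: number;
}

// Assumed unit: USD per one million tokens.
interface Rate {
  inputPerMTok: number;
  outputPerMTok: number;
}

interface CallCost {
  usd: number;
  source: CostSource; // flags whether the figure is reported or estimated
}

function costOfCall(
  model: string,
  usage: Usage,
  routerCostUsd: number | undefined,
  rates: Record<string, Rate>,
): CallCost {
  // Prefer the router's own figure whenever it returns a usable one.
  if (routerCostUsd !== undefined && Number.isFinite(routerCostUsd)) {
    return { usd: routerCostUsd, source: "router" };
  }
  // Otherwise estimate from the exact token counts and a rate table.
  const rate = rates[model];
  if (!rate) throw new Error(`no rate-table entry for ${model}`);
  const usd =
    (usage.promptTokens * rate.inputPerMTok +
      usage.completionTokens * rate.outputPerMTok) /
    1_000_000;
  return { usd, source: "estimate" };
}
```

Carrying the `source` flag on every record keeps estimated and router-reported spend distinguishable when the totals are summed per role.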

Correctness fix carried into the vendored loop

defaultSelectWinner falls back to the best-scoring iteration when none is valid, so the example would "accept" a rejected candidate. Acceptance is now gated on verdict.valid, and a new refinedGaps exposes the best gap reached per slot — so the plain-vs-refined calibration stays informative even when nothing clears the bar.
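
A sketch of the gating described above, assuming a simplified iteration record. `Iteration`, `selectAccepted`, and `bestGapPerSlot` are hypothetical names for illustration; the vendored loop's types and the actual refinedGaps shape may differ.

```typescript
// Hypothetical record of one refinement attempt for one dataset slot.
interface Iteration {
  slot: number;   // which target example this attempt belongs to
  gap: number;    // strong score minus weak score
  valid: boolean; // verdict.valid from the accept rule
}

// Only valid iterations are eligible; there is no best-scoring fallback,
// so a rejected candidate can never be returned as the winner.
function selectAccepted(iterations: Iteration[]): Iteration | undefined {
  const valid = iterations.filter((it) => it.valid);
  if (valid.length === 0) return undefined;
  return valid.reduce((best, it) => (it.gap > best.gap ? it : best));
}

// Best gap reached per slot, tracked whether or not any attempt was valid.
function bestGapPerSlot(iterations: Iteration[]): Map<number, number> {
  const best = new Map<number, number>();
  for (const it of iterations) {
    const prev = best.get(it.slot);
    if (prev === undefined || it.gap > prev) best.set(it.slot, it.gap);
  }
  return best;
}
```

Separating the two functions is the point of the fix: acceptance answers "did anything clear the bar", while the per-slot best gap answers "how close did refinement get", and the second stays meaningful when the first is empty.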

Result — an HONEST NULL (mechanism proven, paper tier unavailable)

The brief's Qwen tier is not provisioned on the router for this key. Every Qwen id (qwen/qwen-2.5-7b-instruct, qwen/qwen3-235b-a22b, …) returns 401 No API key configured for model — verified by probing /v1/chat/completions across the /v1/models catalog. Only the GLM family is callable. The cost gate caught this before any burn. The closest real small-vs-large tier on this router: glm-4.5-air (weak) vs glm-5.2 (strong); challenger + judge = glm-5.2.
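
The pre-flight check described above can be sketched as a pure function over probe results collected beforehand. The `Tier` shape, `firstCallableTier`, and the status map are hypothetical; treating HTTP 200 as "callable" is an assumption, and the real cost gate may work differently.

```typescript
// Hypothetical weak/strong solver pairing.
interface Tier {
  weak: string;
  strong: string;
}

// statusByModel maps a model id to the HTTP status its probe returned
// (for example, from a one-off request to /v1/chat/completions).
function firstCallableTier(
  candidates: Tier[],
  statusByModel: Record<string, number>,
): Tier {
  for (const tier of candidates) {
    // Both solvers must be callable; a missing entry counts as not callable.
    if (statusByModel[tier.weak] === 200 && statusByModel[tier.strong] === 200) {
      return tier;
    }
  }
  // Fail before any paid loop iteration runs.
  throw new Error("no callable weak/strong tier; refusing to start the loop");
}
```

Listing the preferred tier first and the fallback second means the run upgrades on its own once the preferred models become callable, without changing the loop code.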

Live run (grounded on the real Transformer paper, ar5iv 1706.03762, the multi-head-attention section; target=3, samples=3, maxRetries=4):

| metric | value |
| --- | --- |
| accepted (discriminating) examples | 0 / 3 |
| plain first-draft gap (n=3) | −0.444 |
| refined best-gap per slot (n=3) | 0.000 |
| total spend | $0.2292 (challenger $0.035, judge $0.057, strong $0.105, weak $0.033) |

This is a real null, not a measurement artifact — autopsy confirmed:

  • Judge discriminates correctly: a good answer → 1.00, partial → 0.45, wrong → 0.00, empty → 0.00.
  • Both solvers return full answers (finish=stop, no truncation; strong is more detailed).
  • So the two available GLM tiers are simply too close in capability on doc-grounded QA to separate: glm-4.5-air answers these reasoning questions about as well as glm-5.2, so the discriminative gap never opens. The paper's ~30× Qwen parameter-count gap (7B vs 235B) would likely separate, but that tier isn't servable here.

No Table-1 reproduction on this key — reported plainly rather than massaged. Swap the two solver constants back to the Qwen ids once the router provisions that upstream and re-run pnpm autodata.

Verification

  • pnpm typecheck clean · pnpm lint clean · pnpm test 197 passed (13 new offline autodata tests, credentialless via scripted workers + mock-transport judge) · pnpm build emits the autodata subpath.
  • Cost-gated live run executed end-to-end on real models (3-model smoke → ground → loop → JSONL → cost ledger).

…irical strong/weak gap

The LIVE Autodata / Agentic Self-Instruct application: ground on a real source
document, run the agentic data-creation inner loop with REAL two-tier solver
models over the Tangle router, score with an llmJudge rubric, keep only examples
that DISCRIMINATE a strong solver from a weak one, and report the empirical
strong/weak gap (paper Table 1) — or an honest null.

Reuse, not reinvention:
- The inner loop (createDataCreationLoop + discriminativeAcceptRule + qualityCheck)
  is vendored from agent-runtime examples/agentic-data-creation (an unpublished
  example, not shipped in the npm dist) with a provenance + lift-candidate note.
  It composes only PUBLISHED substrate primitives: runLoop, InMemoryCorpus,
  inProcessSandboxClient (agent-runtime/loops); CostLedger, llmJudge,
  createChatClient (agent-eval). Nothing re-implements a judge, sampler, corpus,
  or cost counter.
- Grounding reuses agent-knowledge ingestion: politeFetch -> htmlToText ->
  chunkMarkdown over a real arXiv (ar5iv) document.
- One router transport seam (routerChat) drives all roles; per-call cost is the
  router's own when returned, else a rate-table estimate over EXACT token counts,
  with the source flagged. The judge's own spend is recorded into the same ledger.

Correctness fix carried into the vendored loop: defaultSelectWinner falls back to
the best-scoring iteration when none is valid, so the example would "accept" a
rejected candidate; acceptance is now gated on verdict.valid, and refinedGaps
exposes the best gap reached per slot for an informative plain-vs-refined
calibration even when nothing clears the bar.

Offline tests are credentialless (scripted workers + mock-transport judge) so CI
stays green without keys. Run live with `pnpm autodata` under dotenvx.

@tangletools tangletools left a comment

✅ Auto-approved PR — 4a2b7e41

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-25T20:50:06Z

@drewstone drewstone merged commit 26c617f into main Jun 25, 2026
1 check passed