feat(research): Autodata dataset builder — real two-tier solvers, empirical strong/weak gap by drewstone · Pull Request #41 · tangle-network/agent-knowledge

drewstone · 2026-06-25T20:49:57Z

What

The LIVE Autodata / Agentic Self-Instruct application in agent-knowledge: ground on a real source document, run the agentic data-creation inner loop with real two-tier solver models over the Tangle router, score answers with an llmJudge rubric, keep only examples that discriminate a strong solver from a weak one, and report the empirical strong/weak gap (the paper's Table 1) — or an honest null.

src/autodata/ — 8 files, ~1,734 LOC. Importable as @tangle-network/agent-knowledge/autodata; runnable with pnpm autodata under dotenvx.

Reuse — the loop is composed, not reinvented

The inner loop (createDataCreationLoop + discriminativeAcceptRule + qualityCheck) is vendored from agent-runtime examples/agentic-data-creation with a provenance + lift-candidate note. That example is not a published runtime export (examples aren't in the npm dist), so it can't be imported; vendoring it is the brief's sanctioned "copy with a note" path. The vendored loop composes only published substrate primitives, and nothing re-implements a judge, sampler, corpus, or cost counter:

| Piece | Reused primitive |
| --- | --- |
| loop kernel + N× sampling | `runLoop` (`@tangle-network/agent-runtime/loops`) |
| accepted-example store | `InMemoryCorpus` (loops) |
| in-process worker seam | `inProcessSandboxClient` (loops) |
| rubric judge | `llmJudge` + `createChatClient` (`@tangle-network/agent-eval`) |
| cost accounting | `CostLedger` (agent-eval) |
| grounding | `politeFetch` → `htmlToText` → `chunkMarkdown` (this repo's ingestion) |

The only genuinely new piece is discriminativeAcceptRule (the paper's reward: strong ≥ minStrong, weak < maxWeak, gap ≥ minGap). agent-runtime dep stays ^0.77.0 (already ships every primitive used; nothing to bump).
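For readers who want the three-part acceptance condition spelled out, here is a minimal sketch of a rule of that shape. The type names, the `discriminates` function, and the threshold values are all hypothetical and chosen for illustration; the vendored discriminativeAcceptRule may differ in signature and defaults.

```typescript
// Hypothetical shapes for illustration; not the vendored rule's real types.
interface SolverScores {
  strong: number; // judge score for the strong solver, in [0, 1]
  weak: number; // judge score for the weak solver, in [0, 1]
}

interface AcceptThresholds {
  minStrong: number;
  maxWeak: number;
  minGap: number;
}

interface AcceptVerdict {
  valid: boolean;
  gap: number;
  reasons: string[];
}

// Accept only when strong >= minStrong, weak < maxWeak, and gap >= minGap.
function discriminates(scores: SolverScores, t: AcceptThresholds): AcceptVerdict {
  const gap = scores.strong - scores.weak;
  const reasons: string[] = [];
  if (scores.strong < t.minStrong) reasons.push("strong below minStrong");
  if (scores.weak >= t.maxWeak) reasons.push("weak not below maxWeak");
  if (gap < t.minGap) reasons.push("gap below minGap");
  return { valid: reasons.length === 0, gap, reasons };
}

// Illustrative thresholds only; the PR does not state the values it used.
const thresholds: AcceptThresholds = { minStrong: 0.7, maxWeak: 0.5, minGap: 0.3 };
const separated = discriminates({ strong: 0.9, weak: 0.3 }, thresholds);
const tooClose = discriminates({ strong: 0.9, weak: 0.85 }, thresholds);
```

Returning the rejection reasons alongside the verdict is a design choice of this sketch: it makes a run with zero accepted examples easier to diagnose.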

One real transport seam — routerChat — drives all four roles (challenger, weak solver, strong solver, judge). Per-call USD is the router's own cost when it returns one, else a rate-table estimate over the EXACT token counts, with the source flagged. The judge's spend is recorded into the same CostLedger.
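The cost fallback described above can be sketched roughly as follows. This is an illustration under stated assumptions, not routerChat's actual code: `resolveCost`, the `Usage` and `Rate` shapes, the `example-model` id, and the rates are all made up for the example.

```typescript
// Hypothetical shapes; the real routerChat response fields are not shown in this PR.
interface Usage {
  promptTokens: number;
  completionTokens: number;
}

interface Rate {
  inputPerMTok: number; // USD per million prompt tokens
  outputPerMTok: number; // USD per million completion tokens
}

type CostSource = "router" | "estimate";

interface CallCost {
  usd: number;
  source: CostSource;
}

// Made-up model id and rates, for illustration only.
const RATE_TABLE = new Map<string, Rate>([
  ["example-model", { inputPerMTok: 1, outputPerMTok: 2 }],
]);

// Prefer the router-reported cost; otherwise estimate from exact token counts.
function resolveCost(model: string, usage: Usage, routerUsd?: number): CallCost {
  if (typeof routerUsd === "number" && Number.isFinite(routerUsd)) {
    return { usd: routerUsd, source: "router" };
  }
  const rate = RATE_TABLE.get(model);
  if (rate === undefined) {
    throw new Error(`no rate configured for model ${model}`);
  }
  const usd =
    (usage.promptTokens * rate.inputPerMTok +
      usage.completionTokens * rate.outputPerMTok) / 1e6;
  return { usd, source: "estimate" };
}

const usage: Usage = { promptTokens: 1000, completionTokens: 500 };
const reported = resolveCost("example-model", usage, 0.01);
const estimated = resolveCost("example-model", usage);
```

Carrying the `source` flag on every record lets a ledger report how much of the total spend is measured and how much is estimated.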

Correctness fix carried into the vendored loop

defaultSelectWinner falls back to the best-scoring iteration when none is valid, so the example would "accept" a rejected candidate. Acceptance is now gated on verdict.valid, and a new refinedGaps exposes the best gap reached per slot — so the plain-vs-refined calibration stays informative even when nothing clears the bar.
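A small sketch of why the fallback matters, assuming each iteration carries a score (here, the strong-minus-weak gap) and a validity flag. The `Iteration` shape and the three helper functions are hypothetical; they mimic the described behavior and are not the vendored loop's code.

```typescript
// Hypothetical iteration record for one slot.
interface Iteration {
  id: string;
  gap: number; // strong score minus weak score
  valid: boolean; // did the accept rule pass?
}

// Fallback-style selection: returns the best-scoring iteration even if none is valid.
function pickBestScoring(iters: Iteration[]): Iteration | undefined {
  return iters.reduce<Iteration | undefined>(
    (best, it) => (best === undefined || it.gap > best.gap ? it : best),
    undefined,
  );
}

// Gated selection: only valid iterations are eligible for acceptance.
function pickAccepted(iters: Iteration[]): Iteration | undefined {
  return pickBestScoring(iters.filter((it) => it.valid));
}

// Best gap reached in the slot, reported whether or not anything was accepted.
function bestGap(iters: Iteration[]): number | undefined {
  return pickBestScoring(iters)?.gap;
}

// Made-up slot in which no iteration clears the bar.
const slot: Iteration[] = [
  { id: "a", gap: -0.4, valid: false },
  { id: "b", gap: 0, valid: false },
];

const fallbackWinner = pickBestScoring(slot); // would surface "b" despite rejection
const accepted = pickAccepted(slot); // undefined: nothing is accepted
const reached = bestGap(slot); // 0: still available for calibration
```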

Result — an HONEST NULL (mechanism proven, paper tier unavailable)

The brief's Qwen tier is not provisioned on the router for this key. Every Qwen id (qwen/qwen-2.5-7b-instruct, qwen/qwen3-235b-a22b, …) returns 401 No API key configured for model — verified by probing /v1/chat/completions across the /v1/models catalog. Only the GLM family is callable. The cost gate caught this before any burn. The closest real small-vs-large tier on this router: glm-4.5-air (weak) vs glm-5.2 (strong); challenger + judge = glm-5.2.

Live run (grounded on the real Transformer paper, ar5iv 1706.03762, the multi-head-attention section; target=3, samples=3, maxRetries=4):

| metric | value |
| --- | --- |
| accepted (discriminating) examples | 0 / 3 |
| plain first-draft gap (n=3) | −0.444 |
| refined best-gap per slot (n=3) | 0.000 |
| total spend | $0.2292 (challenger $0.035, judge $0.057, strong $0.105, weak $0.033) |
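One plausible reading of the two gap rows, sketched under an assumption the PR does not confirm: each is a mean over slots, the plain figure taken from the first iteration's gap and the refined figure from the best gap across iterations. The function names and slot data are made up and do not reproduce the run above.

```typescript
// Each slot is the sequence of strong-minus-weak gaps over its iterations;
// index 0 is the plain first draft. All values are made up for illustration.
const slots: number[][] = [
  [-0.5, 0.25],
  [0, 0],
  [-0.25, 0.5],
];

function mean(xs: number[]): number {
  if (xs.length === 0) throw new Error("mean of empty list");
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

// Gap of the first draft in each slot.
function firstDraftGaps(all: number[][]): number[] {
  return all.map((gaps) => {
    const first = gaps[0];
    if (first === undefined) throw new Error("slot has no iterations");
    return first;
  });
}

// Best gap reached in each slot across all iterations.
function bestGaps(all: number[][]): number[] {
  return all.map((gaps) => {
    if (gaps.length === 0) throw new Error("slot has no iterations");
    return Math.max(...gaps);
  });
}

const plainGap = mean(firstDraftGaps(slots)); // (-0.5 + 0 - 0.25) / 3
const refinedGap = mean(bestGaps(slots)); // (0.25 + 0 + 0.5) / 3
```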

This is a real null, not a measurement artifact — autopsy confirmed:

- Judge discriminates correctly: a good answer → 1.00, partial → 0.45, wrong → 0.00, empty → 0.00.
- Both solvers return full answers (finish=stop, no truncation; strong is more detailed).

So the two available GLM tiers are simply too close in capability on doc-grounded QA to separate: glm-4.5-air answers these reasoning questions about as well as glm-5.2, so the discriminative gap never opens. The paper's ~30× Qwen parameter-count gap (7B vs 235B) would likely separate, but that tier isn't servable here.

No Table-1 reproduction on this key — reported plainly rather than massaged. Swap the two solver constants back to the Qwen ids once the router provisions that upstream and re-run pnpm autodata.

Verification

pnpm typecheck clean · pnpm lint clean · pnpm test 197 passed (13 new offline autodata tests, credentialless via scripted workers + mock-transport judge) · pnpm build emits the autodata subpath.
Cost-gated live run executed end-to-end on real models (3-model smoke → ground → loop → JSONL → cost ledger).

…irical strong/weak gap

The LIVE Autodata / Agentic Self-Instruct application: ground on a real source document, run the agentic data-creation inner loop with REAL two-tier solver models over the Tangle router, score with an llmJudge rubric, keep only examples that DISCRIMINATE a strong solver from a weak one, and report the empirical strong/weak gap (paper Table 1), or an honest null.

Reuse, not reinvention:
- The inner loop (createDataCreationLoop + discriminativeAcceptRule + qualityCheck) is vendored from agent-runtime examples/agentic-data-creation (an unpublished example, not shipped in the npm dist) with a provenance + lift-candidate note. It composes only PUBLISHED substrate primitives: runLoop, InMemoryCorpus, inProcessSandboxClient (agent-runtime/loops); CostLedger, llmJudge, createChatClient (agent-eval). Nothing re-implements a judge, sampler, corpus, or cost counter.
- Grounding reuses agent-knowledge ingestion: politeFetch -> htmlToText -> chunkMarkdown over a real arXiv (ar5iv) document.
- One router transport seam (routerChat) drives all roles; per-call cost is the router's own when returned, else a rate-table estimate over EXACT token counts, with the source flagged. The judge's own spend is recorded into the same ledger.

Correctness fix carried into the vendored loop: defaultSelectWinner falls back to the best-scoring iteration when none is valid, so the example would "accept" a rejected candidate; acceptance is now gated on verdict.valid, and refinedGaps exposes the best gap reached per slot for an informative plain-vs-refined calibration even when nothing clears the bar.

Offline tests are credentialless (scripted workers + mock-transport judge) so CI stays green without keys. Run live with `pnpm autodata` under dotenvx.

tangletools

✅ Auto-approved PR — `4a2b7e41`

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

_{tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-25T20:50:06Z}

tangletools approved these changes Jun 25, 2026

drewstone merged commit 26c617f into main Jun 25, 2026
1 check passed

drewstone merged 1 commit into main from feat/autodata-dataset-builder
