feat(research): Autodata dataset builder - real two-tier solvers, empirical strong/weak gap #41
Merged
…irical strong/weak gap

The LIVE Autodata / Agentic Self-Instruct application: ground on a real source document, run the agentic data-creation inner loop with REAL two-tier solver models over the Tangle router, score with an llmJudge rubric, keep only examples that DISCRIMINATE a strong solver from a weak one, and report the empirical strong/weak gap (paper Table 1), or an honest null.

Reuse, not reinvention:
- The inner loop (createDataCreationLoop + discriminativeAcceptRule + qualityCheck) is vendored from agent-runtime examples/agentic-data-creation (an unpublished example, not shipped in the npm dist) with a provenance + lift-candidate note. It composes only PUBLISHED substrate primitives: runLoop, InMemoryCorpus, inProcessSandboxClient (agent-runtime/loops); CostLedger, llmJudge, createChatClient (agent-eval). Nothing re-implements a judge, sampler, corpus, or cost counter.
- Grounding reuses agent-knowledge ingestion: politeFetch -> htmlToText -> chunkMarkdown over a real arXiv (ar5iv) document.
- One router transport seam (routerChat) drives all roles; per-call cost is the router's own when returned, else a rate-table estimate over EXACT token counts, with the source flagged. The judge's own spend is recorded into the same ledger.

Correctness fix carried into the vendored loop: defaultSelectWinner falls back to the best-scoring iteration when none is valid, so the example would "accept" a rejected candidate. Acceptance is now gated on verdict.valid, and refinedGaps exposes the best gap reached per slot for an informative plain-vs-refined calibration even when nothing clears the bar.

Offline tests are credentialless (scripted workers + mock-transport judge) so CI stays green without keys. Run live with `pnpm autodata` under dotenvx.
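The per-call cost rule described in the commit message (use the router's own cost when it returns one, otherwise estimate from a rate table over exact token counts, and flag which source was used) could look roughly like the sketch below. Everything in it is an assumption for illustration: the names `callCost`, `CallCost`, `Usage`, and `RATE_TABLE`, the model keys, and the rates are hypothetical and do not come from the PR or from any real price list.

```typescript
// Hypothetical sketch: none of these names, models, or rates come from the PR.
type CostSource = "router" | "rate-table";

interface Usage {
  promptTokens: number;     // exact input token count reported for the call
  completionTokens: number; // exact output token count reported for the call
}

interface Rate {
  inputPerMTok: number;  // USD per 1M input tokens
  outputPerMTok: number; // USD per 1M output tokens
}

interface CallCost {
  usd: number;
  source: CostSource; // records where the number came from
}

// Placeholder rates for illustration only; not real pricing.
const RATE_TABLE: Record<string, Rate> = {
  "weak-model": { inputPerMTok: 0.2, outputPerMTok: 1.0 },
  "strong-model": { inputPerMTok: 1.0, outputPerMTok: 4.0 },
};

function callCost(model: string, usage: Usage, routerUsd?: number): CallCost {
  // Prefer the cost the router reported, if it reported a usable number.
  if (typeof routerUsd === "number" && Number.isFinite(routerUsd)) {
    return { usd: routerUsd, source: "router" };
  }
  // Otherwise estimate from the rate table using the exact token counts.
  const rate = RATE_TABLE[model];
  if (!rate) {
    throw new Error(`no rate configured for model ${model}`);
  }
  const usd =
    (usage.promptTokens * rate.inputPerMTok +
      usage.completionTokens * rate.outputPerMTok) /
    1_000_000;
  return { usd, source: "rate-table" };
}

// Usage: one call with a router-reported cost, one estimated.
const reported = callCost("weak-model", { promptTokens: 1200, completionTokens: 300 }, 0.0009);
const estimated = callCost("strong-model", { promptTokens: 1200, completionTokens: 300 });
console.log(reported.source, estimated.source);
```

Carrying the `source` flag on every record keeps estimated and router-reported spend distinguishable when the totals are summed later.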
tangletools
approved these changes
Jun 25, 2026
✅ Auto-approved PR — 4a2b7e41
Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-25T20:50:06Z
What

The LIVE Autodata / Agentic Self-Instruct application in agent-knowledge: ground on a real source document, run the agentic data-creation inner loop with real two-tier solver models over the Tangle router, score answers with an `llmJudge` rubric, keep only examples that discriminate a strong solver from a weak one, and report the empirical strong/weak gap (the paper's Table 1), or an honest null.

`src/autodata/`: 8 files, ~1,734 LOC. Importable as `@tangle-network/agent-knowledge/autodata`; runnable with `pnpm autodata` under dotenvx.

Reuse: the loop is composed, not reinvented

The inner loop (`createDataCreationLoop` + `discriminativeAcceptRule` + `qualityCheck`) is vendored from agent-runtime `examples/agentic-data-creation` with a provenance + lift-candidate note. That example is not a published runtime export (examples aren't in the npm dist), so it can't be imported; this is the brief's sanctioned "copy with a note" path. It composes only published substrate primitives; nothing re-implements a judge, sampler, corpus, or cost counter:

- `runLoop` (`@tangle-network/agent-runtime/loops`)
- `InMemoryCorpus` (loops)
- `inProcessSandboxClient` (loops)
- `llmJudge` + `createChatClient` (`@tangle-network/agent-eval`)
- `CostLedger` (agent-eval)
- `politeFetch` → `htmlToText` → `chunkMarkdown` (this repo's ingestion)

The only genuinely new piece is `discriminativeAcceptRule` (the paper's reward: `strong ≥ minStrong`, `weak < maxWeak`, `gap ≥ minGap`). The agent-runtime dep stays `^0.77.0` (it already ships every primitive used; nothing to bump).

One real transport seam, `routerChat`, drives all four roles (challenger, weak solver, strong solver, judge). Per-call USD is the router's own cost when it returns one, else a rate-table estimate over the EXACT token counts, with the source flagged. The judge's spend is recorded into the same `CostLedger`.

Correctness fix carried into the vendored loop

`defaultSelectWinner` falls back to the best-scoring iteration when none is valid, so the example would "accept" a rejected candidate. Acceptance is now gated on `verdict.valid`, and a new `refinedGaps` exposes the best gap reached per slot, so the plain-vs-refined calibration stays informative even when nothing clears the bar.

Result: an HONEST NULL (mechanism proven, paper tier unavailable)

The brief's Qwen tier is not provisioned on the router for this key. Every Qwen id (`qwen/qwen-2.5-7b-instruct`, `qwen/qwen3-235b-a22b`, …) returns `401 No API key configured for model`, verified by probing `/v1/chat/completions` across the `/v1/models` catalog. Only the GLM family is callable. The cost gate caught this before any burn. The closest real small-vs-large tier on this router: `glm-4.5-air` (weak) vs `glm-5.2` (strong); challenger + judge = `glm-5.2`.

Live run (grounded on the real Transformer paper, ar5iv `1706.03762`, the multi-head-attention section; target=3, samples=3, maxRetries=4):

This is a real null, not a measurement artifact; autopsy confirmed:

stop, no truncation; strong is more detailed).

`glm-4.5-air` answers these reasoning questions about as well as `glm-5.2`, so the discriminative gap never opens. The paper's ~30× Qwen gap (7B vs 235B) would likely separate, but that tier isn't serveable here.

No Table-1 reproduction on this key; reported plainly rather than massaged. Swap the two solver constants back to the Qwen ids once the router provisions that upstream and re-run `pnpm autodata`.

Verification

`pnpm typecheck` clean · `pnpm lint` clean · `pnpm test` 197 passed (13 new offline autodata tests, credentialless via scripted workers + mock-transport judge) · `pnpm build` emits the `autodata` subpath.
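
For readers who want the acceptance logic in concrete form, here is a minimal sketch of the reward conditions quoted under Reuse (`strong ≥ minStrong`, `weak < maxWeak`, `gap ≥ minGap`) together with the valid-gated selection described in the correctness fix. It is not the vendored code. The function names (`acceptSketch`, `selectAccepted`, `bestGap`), the types, and the threshold values are hypothetical, and defining the gap as the strong score minus the weak score is an assumption.

```typescript
// Hypothetical sketch: names, types, and threshold values are illustrative only.
interface Thresholds {
  minStrong: number; // strong solver must score at least this
  maxWeak: number;   // weak solver must score strictly below this
  minGap: number;    // strong minus weak must be at least this
}

interface Verdict {
  valid: boolean;
  gap: number; // assumed here to be strongScore - weakScore
}

interface Iteration {
  candidate: string;
  verdict: Verdict;
}

// The three reward conditions, combined into one verdict.
function acceptSketch(strong: number, weak: number, t: Thresholds): Verdict {
  const gap = strong - weak;
  const valid = strong >= t.minStrong && weak < t.maxWeak && gap >= t.minGap;
  return { valid, gap };
}

// Gate on verdict.valid: if no iteration is valid, accept nothing
// instead of falling back to the best-scoring rejected candidate.
function selectAccepted(iterations: Iteration[]): Iteration | undefined {
  const valid = iterations.filter((it) => it.verdict.valid);
  if (valid.length === 0) return undefined;
  return valid.reduce((best, it) => (it.verdict.gap > best.verdict.gap ? it : best));
}

// Best gap reached across iterations, reported even when nothing was accepted.
function bestGap(iterations: Iteration[]): number | undefined {
  if (iterations.length === 0) return undefined;
  return Math.max(...iterations.map((it) => it.verdict.gap));
}

// Usage with made-up scores in [0, 1].
const demoThresholds: Thresholds = { minStrong: 0.7, maxWeak: 0.5, minGap: 0.2 };
const demoIterations: Iteration[] = [
  { candidate: "q1", verdict: acceptSketch(0.8, 0.75, demoThresholds) },
  { candidate: "q2", verdict: acceptSketch(0.9, 0.3, demoThresholds) },
];
console.log(selectAccepted(demoIterations)?.candidate, bestGap(demoIterations));
```

Keeping `bestGap` separate from `selectAccepted` mirrors the idea that a run can report how close it came to discriminating even when every candidate is rejected.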