feat: Anthropic SDK backend + per-model backend selection #265
anticomputer wants to merge 23 commits into
Adds anthropic_sdk as a third backend adapter driving the native Anthropic Messages API (/v1/messages) via the official anthropic Python SDK. Supports streaming, MCP tool calling, and adaptive thinking with configurable reasoning effort.
Key changes:
- New backend: sdk/anthropic_sdk/backend.py implementing AgentBackend
- Per-model backend selection via model_settings.backend (allows mixed backends in a single taskflow, e.g. Anthropic for code_analysis + OpenAI for general_tasks)
- Both anthropic and github-copilot-sdk are now regular dependencies (not optional) since per-model backend config means any SDK could be needed at runtime
- BackendSdk/ApiType Literals extended for anthropic_sdk/messages
- _resolve_task_model() returns per-task backend override
- stream_thinking model_settings option (opt-in, default off)
- README and GRAMMAR.md updated with backend docs
Auth: CAPI's /v1/messages expects Authorization: Bearer (not x-api-key); the adapter passes the bearer header via default_headers.
Thinking: Uses adaptive thinking with output_config.effort. CAPI returns encrypted thinking signatures (content not readable); the stream_thinking flag is ready for when/if thinking content is exposed.
Tested: basic messages, streaming, multi-turn tool calling via MCP, mixed-backend taskflows, all reasoning effort levels (low/medium/high/max), error handling, openai_agents regression.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
Adds a third SDK adapter (anthropic_sdk) and allows selecting the backend per model so a single taskflow can mix providers/SDKs while sharing the same runner + MCP/tooling surface.
Changes:
- Introduces `AnthropicSDKBackend` (Anthropic Messages API via `anthropic` SDK) with streaming and MCP tool-calling support.
- Extends model resolution to support per-model `backend` overrides and threads that through agent deployment.
- Promotes `anthropic` and `github-copilot-sdk` to regular dependencies and updates user-facing docs/grammar accordingly.
Show a summary per file
| File | Description |
|---|---|
| src/seclab_taskflow_agent/sdk/anthropic_sdk/backend.py | New Anthropic SDK backend adapter implementing the shared backend protocol. |
| src/seclab_taskflow_agent/sdk/anthropic_sdk/__init__.py | Adds package marker/docstring for the new backend module. |
| src/seclab_taskflow_agent/sdk/__init__.py | Registers anthropic_sdk as a known backend in the lazy backend factory. |
| src/seclab_taskflow_agent/runner.py | Adds per-task/per-model backend override support to model resolution and deployment. |
| src/seclab_taskflow_agent/models.py | Extends ApiType and BackendSdk literals to include the new options. |
| README.md | Documents the new backend and per-model backend selection precedence. |
| pyproject.toml | Moves anthropic and github-copilot-sdk into core dependencies; removes the copilot extra. |
| doc/GRAMMAR.md | Updates grammar docs to include backend in model_settings and the new messages api_type. |
Copilot's findings
- Files reviewed: 8/8 changed files
- Comments generated: 7
- Remove unused json import (lint/CodeQL)
- Validate reasoning.effort against allowed values upfront
- Pass through temperature/top_p to Anthropic API
- Add exclude_from_context support (stop after tool results)
- Thread exclude_from_context into _AnthropicHandle
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
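The upfront reasoning.effort validation mentioned above could look roughly like the sketch below. It is not the adapter's code: the helper name `thinking_kwargs` and the use of `ValueError` are assumptions for illustration, while the effort levels and the `thinking` / `output_config.effort` shape are the ones this PR describes.

```python
# Effort levels listed in this PR's description.
ALLOWED_EFFORTS = ("low", "medium", "high", "max")


def thinking_kwargs(model_settings: dict) -> dict:
    """Return extra request kwargs for adaptive thinking, or {} when unset.

    Hypothetical helper: the real adapter may raise a backend-specific
    error type instead of ValueError.
    """
    reasoning = model_settings.get("reasoning") or {}
    effort = reasoning.get("effort")
    if effort is None:
        return {}
    if effort not in ALLOWED_EFFORTS:
        # Fail before any request is sent, rather than mid-stream.
        raise ValueError(
            f"reasoning.effort must be one of {ALLOWED_EFFORTS}, got {effort!r}"
        )
    return {
        "thinking": {"type": "adaptive"},
        "output_config": {"effort": effort},
    }
```

Validating before the request is built means a typo in a taskflow's model_settings surfaces as a configuration error instead of an API rejection.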
MCP tools can have description=None; the Anthropic API requires a valid string. Fall back to tool name when description is None. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
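A minimal sketch of the description fallback, assuming the MCP tool object exposes `name`, `description`, and `inputSchema` attributes. The function name `to_anthropic_tool` and the empty-schema default are illustrative assumptions, not the adapter's actual code.

```python
from types import SimpleNamespace


def to_anthropic_tool(tool) -> dict:
    """Convert an MCP-style tool object into an Anthropic tool definition."""
    description = tool.description if tool.description is not None else tool.name
    return {
        "name": tool.name,
        # The Messages API needs a string here, so fall back to the tool name.
        "description": description,
        # Assumed default: an empty object schema when none is supplied.
        "input_schema": tool.inputSchema or {"type": "object", "properties": {}},
    }


# Usage: a tool whose server supplied no description.
tool = SimpleNamespace(name="read_file", description=None, inputSchema=None)
converted = to_anthropic_tool(tool)
```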
- Update doc examples to use claude-opus-4.7 and show api_type: messages
- Add tests/test_sdk_anthropic_adapter.py (18 tests covering validate, tool conversion, token resolution, tool result parsing)
- Fix test_runner.py: update _resolve_task_model unpacking to 6-tuple
- Fix test_sdk_base.py: update backend resolution tests to match new behavior (endpoint no longer auto-selects copilot_sdk)
- Add test for explicit anthropic_sdk backend selection
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Allows the backend to work with both CAPI (Authorization: Bearer) and direct Anthropic endpoints (x-api-key) without code changes. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
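One way to express the two auth modes is to build the client keyword arguments from a single flag, as in the hedged sketch below. The helper name `client_kwargs` and the `bearer_auth` parameter are assumptions for illustration; `api_key`, `base_url`, and `default_headers` are constructor arguments of the anthropic SDK clients.

```python
def client_kwargs(token: str, base_url: str, bearer_auth: bool) -> dict:
    """Build kwargs for an Anthropic client, e.g. anthropic.AsyncAnthropic(**kwargs)."""
    if bearer_auth:
        return {
            # Placeholder only; the real credential travels in the header.
            "api_key": "placeholder",
            "base_url": base_url,
            "default_headers": {"Authorization": f"Bearer {token}"},
        }
    # Native SDK auth: the SDK sends the key as x-api-key.
    return {"api_key": token, "base_url": base_url}
```

Keeping the branch in one small function means the rest of the backend does not need to know which endpoint style it is talking to.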
Access the MCP session directly to get the raw tool list, bypassing the openai-agents tool_filter which requires run_context/agent args not available outside its run loop. Apply blocked_tools filtering and namespace prefixing in our own code. Tested: blocked tool correctly hidden from model, unblocked tools work normally. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The anthropic backend was reaching into openai-agents private attrs (`_obj`, `.session`) to bypass tool_filter at tool-enumeration time. This required duplicating the namespace-prefix logic that already lives on MCPNamespaceWrap and risked double-prefixing on the fallback path.
Move the 'list tools without invoking the agent-side tool_filter' logic into MCPNamespaceWrap.list_tools_unfiltered(), where the wrapper already owns its namespace and session reference. The anthropic backend becomes a one-liner; double-prefix risk is eliminated; openai-agents internal access is centralized in one place (mcp_utils.py).
Also bump default_model in the provider registry from gpt-4.1 to gpt-5.5 (Copilot and OpenAI direct), openai/gpt-4.1 to openai/gpt-5.5 (GitHub Models). Only affects callers who do not specify a model -- the audit pipeline always specifies models via model_config, so this is purely a fallback for community users.
Tests added: tests/test_mcp_utils.py (6 tests covering prefix correctness, no-double-prefix, tool attribute preservation, missing-session error, caller-state isolation, regression of existing list_tools()).
Tests updated: test_capi_extended.py (default_model assertions).
274 tests pass, ruff clean. Local audit on anticomputer/vulnerable-test-app produced 4 vulnerabilities (verifying MCP tools enumerated + called correctly).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Remove quotes from _FakeTool return type (UP037) - Use raw string for regex pattern in pytest.raises match (RUF043) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The default_model for the OpenAI direct provider was bumped to gpt-5.5
in the bearer_auth refactor, but _OpenAIProvider.check_tool_calls()'s
prefix allowlist still only matched gpt-3.5/gpt-4/o-series. This meant
supports_tool_calls('gpt-5.5', ...) returned False, so list_tool_call_models()
would omit the default model from the catalog output -- a contradiction
with the model being the configured default.
Add 'gpt-5' to the prefix tuple and a regression test covering gpt-5,
gpt-5.5, gpt-5.5-mini, and a hypothetical gpt-5.6.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
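The prefix allowlist fix can be illustrated with `str.startswith`, which accepts a tuple of prefixes. The tuple contents and the function name below are assumptions: the commit only says that gpt-3.5, gpt-4, and o-series models were matched before and that 'gpt-5' was added.

```python
# Hypothetical prefix tuple; "o1" and "o3" stand in for "o-series".
CHAT_PREFIXES = ("gpt-3.5", "gpt-4", "o1", "o3", "gpt-5")


def matches_chat_prefix(model_id: str) -> bool:
    """Return True when the model id starts with any allowlisted prefix."""
    return model_id.startswith(CHAT_PREFIXES)


# Usage: the models named in the regression test description.
checked = {m: matches_chat_prefix(m) for m in ("gpt-5", "gpt-5.5", "gpt-5.5-mini", "gpt-5.6")}
```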
README.md: Add the per-task backend override as the highest-precedence selection level. Tasks can put 'backend:' in their own model_settings block to override the model-level value, per _resolve_task_model().
doc/GRAMMAR.md: Tighten the 'passed through to the selected SDK backend' claim. openai_agents accepts the standard OpenAI parameter set, anthropic_sdk forwards a curated subset (temperature, top_p, reasoning, max_tokens, stream_thinking), and copilot_sdk consumes only its own exposed keys (e.g. reasoning_effort) and silently ignores the rest. Avoid misleading users about arbitrary key forwarding.
mcp_utils.py: Make list_tools_unfiltered idempotent on the prefix. Strip an existing namespace prefix before re-applying so the method is safe to call repeatedly even if the underlying session somehow returns a cached/reused tool object whose name was previously namespaced. Uses str.removeprefix() (no-op when prefix is absent). Regression test added covering the previously-prefixed-input path.
276 tests pass, hatch fmt clean.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
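The idempotent prefixing described for list_tools_unfiltered comes down to stripping before re-applying. A minimal sketch, assuming Python 3.9+ for `str.removeprefix()`; the helper name and the example prefix are illustrative only.

```python
def apply_namespace(name: str, prefix: str) -> str:
    """Prefix a tool name exactly once, even if it is already prefixed."""
    # removeprefix() is a no-op when the prefix is absent.
    return prefix + name.removeprefix(prefix)


# Usage: applying the namespace twice yields the same name as applying it once.
once = apply_namespace("read_file", "ns_")
twice = apply_namespace(once, "ns_")
```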
Adds 'cache_control: {type: ephemeral}' to messages.stream() calls. The
API auto-places a cache breakpoint at the longest cacheable prefix
(tools + system + accumulated messages) and moves it forward on each
turn -- multi-turn agent loops get cache reads on every turn after the
first.
Default-on because all current Claude models support cache_control and
CAPI accepts it (validated end-to-end against claude-mythos-5 via CAPI
on 2026-06-12). Callers pointed at proxies that strip / reject
cache_control can opt out with 'prompt_caching: false' in model_settings.
A string value (e.g. 'prompt_caching: 1h') sets a custom TTL.
Local validation against anticomputer/vulnerable-test-app on the same
audit pipeline, same model config, only changing prompt_caching:
| metric | off | on | delta |
|---|---|---|---|
| requests | 60 | 62 | +2 (noise) |
| input tokens fresh | 909,806 | 124 | -99.99% |
| cache read tokens | 0 | 728,079 | new |
| cache write tokens | 0 | 210,261 | new |
| output tokens | 42,300 | 44,933 | similar |
| vulnerabilities | 4 | 5 | +1 |
| est. mythos cost | $11.21 | $5.60 | -50% |
Same or better audit quality, half the token cost. Real audits with
larger system prompts + more tool definitions amortize the cache writes
over more reads, so production savings are typically larger than 50%.
Tests added:
- prompt_caching default-on emits cache_control
- prompt_caching=False suppresses cache_control (opt-out)
- prompt_caching='1h' includes the ttl field
23 tests pass total, hatch fmt clean.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
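The three prompt_caching modes covered by the tests above (default-on, opt-out, custom TTL) can be sketched as a small kwargs builder. The helper name is hypothetical, and how the real adapter merges the result into messages.stream() is not shown in this conversation.

```python
def cache_control_kwargs(model_settings: dict) -> dict:
    """Translate the prompt_caching setting into request kwargs."""
    setting = model_settings.get("prompt_caching", True)  # default-on
    if setting is False:
        # Opt-out for proxies that strip or reject cache_control.
        return {}
    cache_control = {"type": "ephemeral"}
    if isinstance(setting, str):
        # A string value such as "1h" sets a custom TTL.
        cache_control["ttl"] = setting
    return {"cache_control": cache_control}
```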
Reviewer was correct that blocked_tools was effectively a no-op in the
anthropic_sdk backend: taskflow YAML supplies raw tool names like
'read_file', but list_tools_unfiltered() returns namespace-prefixed
names like '{hash}read_file'. The old 'tool.name not in blocked' check
never matched, silently letting every blocked tool through. This is
the security bug the reviewer flagged on PR #265.
Fix: match the raw name against the un-prefixed portion of each tool's
namespaced name, in addition to the literal name. The mcp_server_map
keys stay namespaced because that's what Anthropic sends in tool_use.
Regression tests:
- raw 'read_file' filters out '{hash}read_file' (the bug case)
- already-namespaced names still match (backwards compat)
doc/GRAMMAR.md: also fix an inaccuracy the reviewer flagged in the
same review pass -- the docs claimed copilot_sdk 'silently ignores'
unsupported model_settings keys, but it actually raises
BackendCapabilityError on 'temperature' and 'parallel_tool_calls' at
validate() time. Updated wording to distinguish 'ignored' (anthropic_sdk)
from 'rejected' (copilot_sdk) so users aren't surprised by a hard fail
when they expected a silent drop.
281 tests pass, hatch fmt clean.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
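The blocked_tools fix amounts to checking both the namespaced name and its un-prefixed portion. In the sketch below the namespace prefix is passed in explicitly, which is an assumption; the commit does not show how the real backend obtains it, and the function name is illustrative.

```python
def is_blocked(namespaced_name: str, prefix: str, blocked: set[str]) -> bool:
    """Return True if a tool is blocked by raw or namespaced name."""
    raw_name = namespaced_name.removeprefix(prefix)
    return namespaced_name in blocked or raw_name in blocked


# Usage: taskflow YAML supplies the raw name; the tool list is namespaced.
tools = ["ns_read_file", "ns_list_dir"]
visible = [t for t in tools if not is_blocked(t, "ns_", {"read_file"})]
```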
Coworker review flagged that gpt-5.5 is not a viable default:
- gpt-5 family models require the responses API, but APIProvider has no api_type field to signal that -- callers using the default would silently hit the wrong endpoint shape
- GitHub Models never received gpt-5.5; gpt-4.1 is what's still supported there, so 'openai/gpt-5.5' would 404
- Most callers specify models explicitly via model_config anyway, so the default is only a fallback safety net -- keep it on a model that exists on all three providers
Reverts the registry defaults and dataclass default; keeps the gpt-5 prefix in _OpenAIProvider._CHAT_PREFIXES (direct OpenAI API does serve gpt-5 family, and the prefix check is independent of default selection).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Thanks for the catch -- reverted the default_model bump in c6ef3ae. You're right on both counts: gpt-5 family needs the responses API (which
Three behavior fixes flagged by review or by re-reading the diff:
1. 4xx exception mapping (reviewer-flagged): previously only anthropic.BadRequestError (400) was mapped to BackendBadRequestError. Auth (401), permission (403), not-found (404), conflict (409), unprocessable (422) all fell through to BackendUnexpectedError and surfaced as 'Agent Exception' instead of a clean request error. Catch anthropic.APIStatusError and map any 4xx status to BackendBadRequestError; 5xx still falls through to BackendUnexpectedError (the request was well-formed).
2. Empty-token failure mode: build() now raises BackendBadRequestError with a clear message when no API token can be resolved, instead of either leaking RuntimeError from get_AI_token() or letting the Anthropic client be constructed with an empty 'Bearer ' header (which produces an opaque 401 mid-stream much later).
3. Stale module docstring in sdk/__init__.py: said 'Two backends are supported' and referenced the removed '[copilot]' optional-extra. Updated to reflect the current three-backend reality.
Test cleanup (reviewer-flagged):
- DRY'd 3x duplicate _FakeStreamCtx boilerplate in the prompt-caching tests into a single _make_fake_client() helper at the top of the file. The helper uses a proper empty async iterator class instead of the 'return; yield' empty-generator pattern the reviewer flagged as awkward.
Added regression coverage:
- test_build_raises_bad_request_when_no_token_available
- test_4xx_api_status_errors_map_to_bad_request (parameterized over 400/401/403/404/409/422)
- test_5xx_api_status_errors_map_to_unexpected
281 -> 289 passing; lint clean (hatch fmt --linter --check).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
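The status-based mapping in item 1 can be sketched without the SDK installed. The exception classes below are local stand-ins that reuse the names from this commit, and `FakeStatusError` is a hypothetical substitute for anthropic.APIStatusError, which exposes the HTTP status as `status_code`.

```python
class BackendBadRequestError(Exception):
    """Stand-in for the project's request-error type."""


class BackendUnexpectedError(Exception):
    """Stand-in for the project's unexpected-error type."""


class FakeStatusError(Exception):
    """Hypothetical substitute for anthropic.APIStatusError."""

    def __init__(self, status_code: int) -> None:
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


def map_status_error(exc: Exception) -> Exception:
    """Map any 4xx status to a bad-request error; everything else is unexpected."""
    status = getattr(exc, "status_code", None)
    if isinstance(status, int) and 400 <= status < 500:
        return BackendBadRequestError(str(exc))
    # 5xx: the request was well-formed, so it is not the caller's fault.
    return BackendUnexpectedError(str(exc))
```

Mapping on the status range rather than on individual exception subclasses means new 4xx codes do not need their own branch.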
    for c in content:
        text = getattr(c, "text", None)
        if text:
            parts.append(text)
    return "\n".join(parts) if parts else str(result)
Good catch — fixed in 4f1f440. Changed if text: to if text is not None: so explicit empty strings are preserved verbatim. The str(result) fallback now only fires when there are genuinely no text-bearing content blocks at all (not when all blocks have empty text). Added two regression tests covering an only-empty result and an empty among non-empty siblings.
    # Clear every token-source env var the standard chain consults
    for var in ("AI_API_TOKEN", "OPENAI_API_KEY", "AZURE_OPENAI_API_KEY",
                "ANTHROPIC_API_KEY", "GITHUB_TOKEN", "GH_TOKEN"):
        monkeypatch.delenv(var, raising=False)
Real flakiness risk — fixed in 4f1f440. Trimmed the cargo-culted list of unrelated API keys down to just the two variables that capi.get_AI_token() actually consults (AI_API_TOKEN and COPILOT_TOKEN) and explicitly delete both. Docstring updated to call out the dependency on the token chain so the next person editing get_AI_token knows to update the test too.
Two more reviewer-flagged issues:
1. _call_tool_result_to_text() dropped empty TextContent
The truthy check 'if text:' treated TextContent(text='') the same as text=None and skipped it. With an only-empty content list, parts would be [] and the helper fell through to the str(result) fallback (which is a noisy repr of the result object) instead of returning the actual empty result the tool reported. Fix: 'if text is not None:' preserves explicit empty strings; the str(result) fallback now only fires when there are no text-bearing blocks at all.
2. test_build_raises_bad_request_when_no_token_available was flaky
The test cleared a long list of API key env vars (defensive cargo cult) but missed COPILOT_TOKEN, which is the second variable that capi.get_AI_token() consults. On runners with COPILOT_TOKEN set (e.g. CI envs authed to Copilot), the test would unexpectedly find a token and the assertion would fail. Simplified to clear only the two vars the chain actually consults: AI_API_TOKEN and COPILOT_TOKEN.
+2 regression tests for empty-string preservation; 291 passing.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
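The corrected helper from item 1 can be sketched as below, based on the snippet quoted in the review thread. The function here is a standalone approximation: the `content` lookup via `getattr` and the `SimpleNamespace` stand-ins for TextContent are assumptions for illustration.

```python
from types import SimpleNamespace


def call_tool_result_to_text(result) -> str:
    """Join the text of all text-bearing content blocks of a tool result."""
    parts = []
    for c in getattr(result, "content", None) or []:
        text = getattr(c, "text", None)
        # 'is not None' keeps explicit empty strings; a truthy check drops them.
        if text is not None:
            parts.append(text)
    # Fall back to str(result) only when no block carried text at all.
    return "\n".join(parts) if parts else str(result)


# Usage: a result whose only block is an explicit empty string.
only_empty = SimpleNamespace(content=[SimpleNamespace(text="")])
mixed = SimpleNamespace(content=[SimpleNamespace(text="a"), SimpleNamespace(text=""), SimpleNamespace(text="b")])
```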
Summary
Adds `anthropic_sdk` as a third backend adapter and enables per-model backend selection so different models in a single taskflow can use different SDKs.
Changes
New backend: `anthropic_sdk`
Drives the native Anthropic Messages API (`/v1/messages`) via the official `anthropic` Python SDK. Implemented in `sdk/anthropic_sdk/backend.py` (284 lines), implementing the existing `AgentBackend` protocol.
Capabilities:
- `reasoning.effort` (low/medium/high/max)
- `stream_thinking` model_settings option (opt-in, default off)
- `Backend*Error` types
Auth: The Anthropic SDK sends `x-api-key` by default. Providers that use Bearer auth (`APIProvider.bearer_auth=True` -- includes CAPI, GitHub Models, OpenAI, and any custom proxy registered with `bearer_auth=True`) get `Authorization: Bearer` via `default_headers` with a placeholder `api_key`. Unknown endpoints (e.g. direct `api.anthropic.com`) default to native SDK auth via `x-api-key`. Token resolution uses `get_AI_token()` (`AI_API_TOKEN` then `COPILOT_TOKEN` fallback) with optional per-model `token` env-var override.
Thinking: Uses `thinking.type: adaptive` with `output_config.effort`, matching CAPI requirements. CAPI currently returns encrypted thinking signatures (content not readable through the proxy), but the `stream_thinking` flag is ready for when thinking content is exposed.
Per-model backend selection
The `backend` key can now be set in per-model `model_settings`, not just globally on the model config document. This allows mixed-backend taskflows:
`_resolve_task_model()` now returns a 6th element (per-task backend override), which `deploy_task_agents` uses with fallback to the global backend.
Dependency changes
Both `anthropic` and `github-copilot-sdk` moved from optional to regular dependencies. With per-model backend config, any SDK could be needed at runtime, so optional installs no longer make sense.
Testing
All tests performed against CAPI with the pipeline `AI_API_TOKEN`:
Files changed
- `src/seclab_taskflow_agent/sdk/anthropic_sdk/` -- new backend adapter
- `src/seclab_taskflow_agent/sdk/__init__.py` -- registry for third backend
- `src/seclab_taskflow_agent/models.py` -- extended `BackendSdk` and `ApiType` Literals
- `src/seclab_taskflow_agent/runner.py` -- per-model backend in `_resolve_task_model()`, threaded through `deploy_task_agents()`
- `pyproject.toml` -- anthropic + copilot-sdk as regular deps
- `README.md` -- updated backend docs
- `doc/GRAMMAR.md` -- updated model_settings docs
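The backend precedence described in this PR (per-task override first, then the per-model setting, then the global backend) can be sketched as a simple lookup chain. The function name and argument shapes are assumptions for illustration; the real logic lives in `_resolve_task_model()` and `deploy_task_agents()` and is not reproduced here.

```python
from typing import Optional


def resolve_backend(
    task_settings: Optional[dict],
    model_settings: Optional[dict],
    global_backend: str,
) -> str:
    """Pick the backend: task-level, then model-level, then global."""
    for settings in (task_settings, model_settings):
        backend = (settings or {}).get("backend")
        if backend:
            return backend
    return global_backend


# Usage: a model pinned to anthropic_sdk inside a taskflow whose global
# backend is openai_agents.
chosen = resolve_backend(None, {"backend": "anthropic_sdk"}, "openai_agents")
```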