fix(llm): sanitize tool_call arguments and cap tool result size for provider validators #468

Open
nap-liu wants to merge 7 commits into dataelement:main from nap-liu:pr/llm-tool-call-robustness

Conversation


@nap-liu nap-liu commented Apr 23, 2026

Addresses #401 (partial). Implements the output sanitization pillar; the retry/degradation pillars are intentionally out of scope and left for follow-up.

Summary

This PR addresses two classes of HTTP 400 failures that surface intermittently when agents use tool calls against providers that validate function.arguments as strict JSON (DashScope/Qwen observed here; the issue reports MiniMax with the same root cause):

  1. "function.arguments must be in JSON format" — model streaming occasionally emits tool_call arguments with trailing commas or unescaped control characters. The current _process_tool_call calls json.loads with a silent fallback to {}, but the raw malformed string is still written back into conversation history, so every subsequent round re-submits it and fails the validator.

  2. "Range of input length should be [1, 983616]" — large tool results (e.g. execute_code stdout, long query responses) accumulate across rounds and push the full conversation past the model's input cap.

Fix

  • json_recovery.py (new): string-aware JSON repair that strips trailing commas and unescapes control chars without corrupting data inside string literals. Returns a canonical form guaranteed to round-trip through json.loads.
  • tool_result_shaping.py (new): head+tail truncation with an explicit marker so the LLM knows content was dropped.
  • caller.py: _process_tool_call and _try_model (called by call_agent_llm_with_tools) both canonicalize arguments back into tc["function"]["arguments"] and shape results to TOOL_RESULT_MAX_CHARS (20k). Shared helpers _canonicalize_tc_arguments / _shape_tool_content_for_context avoid duplication. Pre-existing tuple-arity bug in _try_model's early return is also fixed.
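The core of the string-aware repair can be sketched roughly as follows. This is an illustrative reduction, not the PR's actual json_recovery.py: the function name and scope (trailing commas only) are assumptions, and the real helper also unescapes control characters.

```python
import json

def strip_trailing_commas(raw: str) -> str:
    """Drop a comma that directly precedes a closing } or ], while tracking
    string and escape state so commas inside string literals survive."""
    out: list[str] = []
    in_string = False
    escaped = False
    for ch in raw:
        if in_string:
            out.append(ch)
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
            out.append(ch)
        elif ch in "}]":
            # before a closer, remove any whitespace run plus one trailing comma
            while out and out[-1] in " \t\r\n":
                out.pop()
            if out and out[-1] == ",":
                out.pop()
            out.append(ch)
        else:
            out.append(ch)
    return "".join(out)
```

The walker exists precisely because a naive regex such as re.sub(r',\s*}', '}', s) would also rewrite {"a": "hello,}"} into {"a": "hello}"}, corrupting data inside the string literal.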

Tests

  • test_llm_json_recovery.py (+118): trailing-comma strip, non-dict coercion, unicode handling, edge cases
  • test_llm_tool_result_shaping.py (+80): head/tail composition, truncation marker accuracy, zero/negative max_chars guards
  • test_llm_caller_integration.py (+140): in-place canonicalization via tool_call fixtures, shaping at the caller boundary, logging contract

Notes

  • Vision tool_call content (list[dict]) passes shaping untouched — base64 image payloads preserved.
  • _canonicalize_tc_arguments mutates tc["function"] in place by design so the repaired args reach history; a comment locks this invariant in case of a future deepcopy refactor.
  • TOOL_RESULT_MAX_CHARS = 20_000 is a conservative constant for this PR; making it per-agent configurable is a natural follow-up.
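The in-place write-back invariant from the second note can be illustrated with a simplified stand-in for _canonicalize_tc_arguments. The repair step below is a deliberately trivial placeholder (it only handles one trailing comma before the final brace) for the PR's string-aware walker; the function name and log fields are illustrative.

```python
import json
import logging

logger = logging.getLogger(__name__)

def canonicalize_tc_arguments(tc: dict) -> dict:
    """Parse tc['function']['arguments'], repairing trivially malformed JSON,
    and write the canonical string back IN PLACE so that conversation history
    re-submits a form that passes provider-side validators."""
    raw = tc["function"].get("arguments") or "{}"
    try:
        args = json.loads(raw)
    except json.JSONDecodeError:
        # placeholder repair: strip a single trailing comma before the final brace
        repaired = raw.rstrip().rstrip("}").rstrip().rstrip(",") + "}"
        args = json.loads(repaired)
        logger.warning("repaired malformed tool_call arguments (orig_len=%d)", len(raw))
    if not isinstance(args, dict):
        # mirrors the explicit non-dict coercion so data loss stays observable
        logger.warning("non-dict tool_call arguments coerced to {}")
        args = {}
    tc["function"]["arguments"] = json.dumps(args, ensure_ascii=False)
    return args
```

Because the canonical string is written back into the same tc["function"] dict that the assistant message holds by reference, the repaired arguments automatically reach history; a deepcopy before this call would silently break that.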

Commits

Kept granular (7 commits) for easier review. Happy to squash on merge if preferred.

nap-liu added 7 commits April 23, 2026 16:25
Pure-Python helper that parses malformed tool_call.arguments strings
(trailing commas, unescaped control chars) and produces a canonical JSON
string guaranteed to round-trip through json.loads. Will be used by
caller._process_tool_call to prevent malformed JSON from leaking into
subsequent DashScope requests.

Two silent-data-loss paths found in code review:

1. Regex-based trailing-comma strip would eat commas inside string values
   ({"a": "hello,}", ...} -> {"a": "hello}", ...}). Replaced with a
   string-aware walker that tracks in_string and escape state.

2. Non-dict top-level parses (array, scalar, null) were silently coerced to
   {} with method="clean", making the data-loss event invisible to
   observability. Added explicit method="non_dict_coerced" so the caller
   can log-and-alert when this happens.

Tests: 3 new (string-content comma, escaped quote, non-dict coercion).
Total: 12 passed.

Large tool results (e.g. execute_code stdout) accumulate
across rounds and can push conversation context over Qwen3.5-plus's
983k char input limit. shape_tool_result keeps head + tail with an
explicit truncation marker so the LLM knows truncation happened.
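A minimal sketch of the head+tail shaping described here, assuming a (text, was_truncated) return shape and a made-up marker format (the real tool_result_shaping.py API may differ):

```python
def shape_tool_result(content: str, max_chars: int) -> tuple[str, bool]:
    """Keep the head and tail of an oversized tool result with an explicit
    truncation marker in between. Returns (shaped_text, was_truncated)."""
    if max_chars <= 0:
        # guard against negative-slice nonsense: emit nothing, and report
        # was_truncated only if content was actually dropped
        return "", len(content) > 0
    if len(content) <= max_chars:
        return content, False
    head = max_chars // 2
    tail = max_chars - head
    dropped = len(content) - max_chars
    marker = f"\n...[tool result truncated: {dropped} chars dropped]...\n"
    return content[:head] + marker + content[-tail:], True
```

Note that dropped counts code points via len(str), not UTF-8 bytes, which is why the later commit renames the byte-count test.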

Two silent-corruption edge cases found in code review:

1. Negative max_chars (e.g. -5) produced nonsense via Python's negative
   slice semantics — overlapping slices + a lying marker claiming more
   chars dropped than the input had.
2. max_chars=0 returned only the marker, violating the contract that
   output stays within max_chars + small marker overhead.

Both now return empty string with was_truncated accurately reflecting
whether any content was dropped. Tests added for both cases plus
renamed test_marker_reports_dropped_byte_count → ..._char_count since
len(str) counts code points, not UTF-8 bytes.

_process_tool_call now uses canonicalize_tool_arguments to repair
malformed JSON from Qwen streaming (trailing commas, unescaped control
chars) and writes the canonical form back to tc.function.arguments so
that subsequent LLM rounds receive a history entry that passes
DashScope's JSON validator. Prevents HTTP 400 'function.arguments
must be in JSON format'.

Also caps single tool result content at 20k chars via head+tail
shaping. Prevents single-run context blowup that caused HTTP 400
'Range of input length should be [1, 983616]'.

Both paths log a structured warning when repair or truncation fires,
for observability.

Code review found call_agent_llm_with_tools._try_model has an identical
tool loop to _process_tool_call, used by background services
(scheduler, task executor, heartbeat). It still ran the naive json.loads
path without canonicalize or shaping, leaving the same two 400 errors
unfixed on the background-service code path.

Extracted shared helpers _canonicalize_tc_arguments and
_shape_tool_content_for_context at module level so both callers use
the same normalization + logging. Also:

- Hoisted json_recovery and tool_result_shaping imports to top of file
- Promoted TOOL_RESULT_MAX_CHARS to module constant
- Unified log fields (orig_len/new_len) and added dropped= on truncation
- Tightened test assertion from <50_000 to <=TOOL_RESULT_MAX_CHARS+500
- Added 3 direct tests for the helpers

Three small items surfaced by final integrated review:

1. _try_model declared -> tuple[str, bool, bool] but the no-tool-calls
   happy-path early return produced a 2-tuple. Caller unpacked 3 values
   so the background-services path would ValueError on any immediate
   non-tool answer. Pre-existing; folded in here since this PR already
   edits the function.

2. Added a one-line NB comment above both assistant-with-tool-calls
   message appends documenting that tc['function'] must remain shared
   by reference with _canonicalize_tc_arguments in-place write — a
   future deepcopy refactor would silently carry pre-repair arguments
   into history.

3. Removed dead 'import re' and '_UNESCAPED_CONTROL_RE' from
   json_recovery.py — unused after the string-aware walker rewrite.