Sanitize surrogates and non-UTF-8 bytes in pydantic data converter #1449

Closed
xumaple wants to merge 1 commit into main from maplexu/pydantic-surrogate-sanitization

xumaple commented Apr 14, 2026

Summary

pydantic_core's Rust to_json() serializer crashes when values contain:

  • Strings with Unicode surrogates (U+D800–U+DFFF) — e.g., from subprocess output decoded with errors='surrogateescape'
  • Bytes with non-UTF-8 content — pydantic serializes bytes via UTF-8 decode, so binary data like b'\x89PNG...' crashes
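Both failure modes can be reproduced with the standard library alone. The snippet below is a minimal sketch rather than code from this PR: it only shows how `errors='surrogateescape'` produces lone surrogates and why a strict UTF-8 encode or decode then fails. The helper names are hypothetical, and it does not call pydantic_core.

```python
# Simulated stdout captured from a command that read a binary file
# (the 8-byte PNG signature).
raw = b"\x89PNG\r\n\x1a\n"

# surrogateescape maps each undecodable byte 0xNN to the lone surrogate U+DCNN,
# so the 0x89 byte becomes '\udc89' and the ASCII bytes decode normally.
text = raw.decode("utf-8", errors="surrogateescape")


def strict_encode_error(s: str):
    """Return the error raised by a strict UTF-8 encode, or None (hypothetical helper)."""
    try:
        s.encode("utf-8")
    except UnicodeEncodeError as exc:
        return exc
    return None


def strict_decode_error(b: bytes):
    """Return the error raised by a strict UTF-8 decode, or None (hypothetical helper)."""
    try:
        b.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc
    return None


# Case 1: a str carrying a lone surrogate cannot be encoded as strict UTF-8.
encode_error = strict_encode_error(text)

# Case 2: bytes that are not valid UTF-8 cannot be decoded as strict UTF-8.
decode_error = strict_decode_error(raw)
```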

This was discovered in a real workload where a sandbox exec activity captured stdout from a command that read binary files (PNG headers, fonts, etc.).

Fix

Adds a _sanitize_for_json() fallback in PydanticJSONPlainPayloadConverter.to_payload():

  • On the happy path, the existing Rust serializer runs unchanged
  • On failure, strings are sanitized via UTF-16 round-trip (surrogate pairs become proper codepoints, lone surrogates become U+FFFD) and bytes are re-encoded as valid UTF-8, then the same serializer retries — preserving exclude_unset and all existing behavior
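The try, sanitize, retry flow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `strict_serialize` is a stdlib stand-in for the Rust serializer, the function names are hypothetical, catching `UnicodeError` is an assumption about the failure type, and the bytes handling (decode with `errors="replace"`, then re-encode) is one plausible reading of "re-encoded as valid UTF-8". It only walks built-in containers, whereas the PR also covers Pydantic models and dataclasses.

```python
import json
from typing import Any, Callable


def sanitize_str(s: str) -> str:
    # UTF-16 round-trip: "surrogatepass" lets surrogates through on encode;
    # on decode, valid pairs combine into one code point and lone
    # surrogates are replaced with U+FFFD.
    return s.encode("utf-16-le", "surrogatepass").decode("utf-16-le", "replace")


def sanitize_bytes(b: bytes) -> bytes:
    # Assumption: invalid sequences become U+FFFD, and the result is valid UTF-8.
    return b.decode("utf-8", "replace").encode("utf-8")


def sanitize_for_json(value: Any) -> Any:
    """Recursively sanitize strings and bytes inside built-in containers."""
    if isinstance(value, str):
        return sanitize_str(value)
    if isinstance(value, bytes):
        return sanitize_bytes(value)
    if isinstance(value, dict):
        return {sanitize_for_json(k): sanitize_for_json(v) for k, v in value.items()}
    if isinstance(value, list):
        return [sanitize_for_json(v) for v in value]
    if isinstance(value, tuple):
        return tuple(sanitize_for_json(v) for v in value)
    return value


def _decode_bytes(obj: Any) -> str:
    # Stand-in behavior: bytes are serialized via a strict UTF-8 decode.
    if isinstance(obj, bytes):
        return obj.decode("utf-8")
    raise TypeError(f"not JSON serializable: {type(obj).__name__}")


def strict_serialize(value: Any) -> bytes:
    # Hypothetical stand-in for the real serializer: fails on lone
    # surrogates (at encode time) and on non-UTF-8 bytes (at decode time).
    return json.dumps(value, ensure_ascii=False, default=_decode_bytes).encode("utf-8")


def to_json_with_fallback(
    value: Any, serialize: Callable[[Any], bytes] = strict_serialize
) -> bytes:
    try:
        # Happy path: the serializer runs unchanged.
        return serialize(value)
    except UnicodeError:
        # Fallback: sanitize, then retry with the same serializer.
        return serialize(sanitize_for_json(value))


payload = {"stdout": "ok\udc89", "blob": b"\x89PNG", "pair": "\ud83d\ude00"}
result = json.loads(to_json_with_fallback(payload))
```

Because the retry calls the same serializer, any options it applies (such as `exclude_unset` in the real converter) carry over to the fallback path. Sanitization is lossy: the original bytes cannot be recovered from U+FFFD.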

Related

Test plan

  • 9 new unit tests covering surrogates, invalid bytes, exclude_unset preservation, Pydantic models, dataclasses, and nested structures
  • All 16 existing pydantic tests still pass
  • All lints pass (ruff, pyright, basedpyright, pydocstyle)

🤖 Generated with Claude Code

xumaple requested a review from a team as a code owner April 14, 2026 22:16
xumaple marked this pull request as draft April 14, 2026 22:17
xumaple force-pushed the maplexu/pydantic-surrogate-sanitization branch 2 times, most recently from 6e2564e to e146a5f Compare April 14, 2026 22:24
pydantic_core's Rust serializer crashes on strings with Unicode surrogate
characters and bytes with non-UTF-8 content. This adds a fallback that
sanitizes the value and retries, preserving all existing serializer behavior
including exclude_unset.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
xumaple force-pushed the maplexu/pydantic-surrogate-sanitization branch from e146a5f to 91fb2ea Compare April 14, 2026 22:27
xumaple closed this Apr 15, 2026