[codex] trace and reduce remote exec-server latency #30266

Closed
richardopenai wants to merge 1 commit into main from codex/exec-server-remote-latency-phases

@richardopenai

Summary

  • add end-to-end registration, websocket, Noise-handshake, initialization, authorization, and RPC phase spans
  • propagate W3C trace context through ERS and Rendezvous requests
  • add a repeatable 30-sample remote latency benchmark with legacy-read and event-stream A/B modes
  • use pushed process events in unified exec, with retained-read recovery after receiver lag
  • carry executor sandbox-denial state on the terminal event with backward-compatible protocol defaults
  • enable TCP_NODELAY on Rendezvous websocket connections
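
To illustrate the trace-context bullet above: a minimal sketch of formatting and parsing a W3C `traceparent` header value, which is what would travel on the ERS and Rendezvous requests. The `TraceContext` type and its methods are hypothetical names for illustration, not code from this branch, and the parser only accepts version `00`.

```rust
/// Hypothetical carrier for the fields of a W3C `traceparent` header.
/// Illustrative only; not the type used on this branch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct TraceContext {
    pub trace_id: u128,
    pub span_id: u64,
    pub sampled: bool,
}

impl TraceContext {
    /// Render as `00-<32 hex trace-id>-<16 hex parent-id>-<2 hex flags>`.
    pub fn to_traceparent(&self) -> String {
        format!(
            "00-{:032x}-{:016x}-{:02x}",
            self.trace_id,
            self.span_id,
            u8::from(self.sampled)
        )
    }

    /// Parse a version-00 header value; returns `None` on any malformed input.
    pub fn parse(header: &str) -> Option<Self> {
        let mut parts = header.trim().split('-');
        let version = parts.next()?;
        let trace = parts.next()?;
        let span = parts.next()?;
        let flags = parts.next()?;
        if version != "00" || parts.next().is_some() {
            return None;
        }
        if trace.len() != 32 || span.len() != 16 || flags.len() != 2 {
            return None;
        }
        // The spec requires lowercase hex digits in every field.
        let is_lower_hex =
            |s: &str| s.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'));
        if !(is_lower_hex(trace) && is_lower_hex(span) && is_lower_hex(flags)) {
            return None;
        }
        let trace_id = u128::from_str_radix(trace, 16).ok()?;
        let span_id = u64::from_str_radix(span, 16).ok()?;
        let flags = u8::from_str_radix(flags, 16).ok()?;
        // All-zero trace and parent ids are invalid per the spec.
        if trace_id == 0 || span_id == 0 {
            return None;
        }
        Some(Self { trace_id, span_id, sampled: flags & 0x01 == 0x01 })
    }

    /// Context for an outgoing request: same trace id, new span id.
    pub fn child(&self, new_span_id: u64) -> Self {
        Self { span_id: new_span_id, ..*self }
    }
}
```

A caller would parse the inbound header, derive a child context per outgoing hop, and attach `to_traceparent()` to the next request so the phase spans join one trace.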

Why

The benchmark showed two distinct bottlenecks:

  1. cold registration paid lazy ERS presence-client initialization
  2. completed process tools paid an unnecessary second wide-area process/read round trip

RPC serialization, queueing, and deserialization were below 0.6 ms; more than 99.8% of RPC time was remote response wait.
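
The share figure above is a ratio of phase durations. A small sketch of that arithmetic follows; the `RpcPhases` struct and its field names are hypothetical, and any durations fed into it here are made-up illustrations, not measurements from the benchmark.

```rust
use std::time::Duration;

/// Percentage of `total` spent in `part`; 0.0 when `total` is zero.
pub fn share_percent(part: Duration, total: Duration) -> f64 {
    if total.is_zero() {
        return 0.0;
    }
    100.0 * part.as_secs_f64() / total.as_secs_f64()
}

/// Hypothetical per-RPC phase breakdown (names are assumptions).
pub struct RpcPhases {
    pub serialize: Duration,
    pub queue: Duration,
    pub remote_wait: Duration,
    pub deserialize: Duration,
}

impl RpcPhases {
    /// Time spent on the local side of the call.
    pub fn local_overhead(&self) -> Duration {
        self.serialize + self.queue + self.deserialize
    }

    /// Wall time covered by the four named phases.
    pub fn total(&self) -> Duration {
        self.local_overhead() + self.remote_wait
    }

    /// Share of the RPC spent waiting on the remote response.
    pub fn remote_wait_share(&self) -> f64 {
        share_percent(self.remote_wait, self.total())
    }
}
```

For example, an RPC with 0.5 ms of combined local overhead and 299.5 ms of remote wait would report a remote-wait share of about 99.83%, which is the shape of result described above.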

Impact

Across three staging runs:

| Metric | Baseline median | Optimized median | Change |
| --- | --- | --- | --- |
| cold connection/ready | 1,395 ms | 549 ms | -60.6% |
| one-shot process completion | ~193 ms | 118.7 ms | -38.5% |
| same-route completion p95 | 182.4 ms | 131.7 ms | -27.8% |
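
The Change column can be reproduced from the two median columns. The sketch below shows the arithmetic; `median` and `percent_change` are illustrative helper names, not functions from the benchmark harness.

```rust
/// Median of a sample set, or `None` if it is empty.
pub fn median(samples: &[f64]) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mid = sorted.len() / 2;
    Some(if sorted.len() % 2 == 0 {
        // Even count: average the two middle samples.
        (sorted[mid - 1] + sorted[mid]) / 2.0
    } else {
        sorted[mid]
    })
}

/// Relative change from `baseline` to `optimized`, in percent.
/// Negative values mean the optimized figure is lower (faster).
pub fn percent_change(baseline: f64, optimized: f64) -> f64 {
    100.0 * (optimized - baseline) / baseline
}
```

With the table's inputs, `percent_change(1395.0, 549.0)` is roughly -60.6, `percent_change(193.0, 118.7)` roughly -38.5, and `percent_change(182.4, 131.7)` roughly -27.8.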

The event path removes process/read from successful process completion while preserving ordered output, exit status, transport failure, recovery, and sandbox-denial behavior. Named spans account for over 97% of connection time and over 99.8% of RPC time.
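
The recovery behavior can be sketched as a sequence-number gap check: consume pushed events in order, and fall back to a read of retained events only when a gap shows that the receiver lagged. Everything below is a simplified, synchronous model under assumed names (`Executor`, `ProcessEvent`, `collect`); the real implementation is asynchronous and its types are not shown in this PR description.

```rust
use std::collections::VecDeque;

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ProcessEvent {
    Output { seq: u64, chunk: String },
    /// Terminal event; carries the sandbox-denial state.
    Exited { seq: u64, exit_code: i32, sandbox_denied: bool },
}

impl ProcessEvent {
    pub fn seq(&self) -> u64 {
        match self {
            ProcessEvent::Output { seq, .. } | ProcessEvent::Exited { seq, .. } => *seq,
        }
    }
}

/// Hypothetical executor model: a bounded push queue that drops the oldest
/// event when full, plus a retained log that a read can replay.
pub struct Executor {
    retained: Vec<ProcessEvent>,
    pushed: VecDeque<ProcessEvent>,
    capacity: usize,
}

impl Executor {
    pub fn new(capacity: usize) -> Self {
        Self { retained: Vec::new(), pushed: VecDeque::new(), capacity }
    }

    pub fn emit(&mut self, event: ProcessEvent) {
        self.retained.push(event.clone());
        if self.pushed.len() == self.capacity {
            self.pushed.pop_front(); // a slow receiver loses the oldest event
        }
        self.pushed.push_back(event);
    }

    pub fn next_pushed(&mut self) -> Option<ProcessEvent> {
        self.pushed.pop_front()
    }

    /// Stand-in for the process/read round trip.
    pub fn read_from(&self, seq: u64) -> Vec<ProcessEvent> {
        self.retained.iter().filter(|e| e.seq() >= seq).cloned().collect()
    }
}

#[derive(Debug, PartialEq, Eq)]
pub struct Completion {
    pub output: String,
    pub exit_code: i32,
    pub sandbox_denied: bool,
    pub recovery_reads: u32,
}

/// Drain pushed events in order; issue a retained read only on a gap.
pub fn collect(exec: &mut Executor) -> Option<Completion> {
    let mut next_seq = 0u64;
    let mut output = String::new();
    let mut recovery_reads = 0u32;
    while let Some(event) = exec.next_pushed() {
        let batch = if event.seq() > next_seq {
            recovery_reads += 1; // gap: events were dropped while we lagged
            exec.read_from(next_seq)
        } else {
            vec![event]
        };
        for e in batch {
            if e.seq() < next_seq {
                continue; // already applied via an earlier recovery read
            }
            next_seq = e.seq() + 1;
            match e {
                ProcessEvent::Output { chunk, .. } => output.push_str(&chunk),
                ProcessEvent::Exited { exit_code, sandbox_denied, .. } => {
                    return Some(Completion { output, exit_code, sandbox_denied, recovery_reads });
                }
            }
        }
    }
    None // no terminal event seen yet
}
```

In this model a receiver that keeps up finishes with zero reads, which corresponds to the removed round trip, while a lagging receiver pays exactly one read and still sees output in order.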

Companion service PR: https://github.com/openai/openai/pull/1080070

Validation

  • just test -p codex-exec-server: 294 passed, 2 skipped
  • Codex core streaming tests: 6 passed
  • exec-server protocol tests: 5 passed
  • full codex-rmcp-client test suite passed
  • scoped just fix passed for exec-server, core, protocol, and rmcp-client
  • three 30-sample read-control runs and three 30-sample event-stream runs against staging
  • verified bounded route dimensions resolve to unified-0s / c1-u0s

@richardopenai

Superseded by focused fix PRs:

The telemetry and benchmark scaffolding remain on this branch for reference, but are intentionally excluded from the production fix PRs pending an explicit sampling and volume plan.
