feat: operator rate control, pause-on-typing, quiet-peer liveness & forms-only protocol#40
Merged
Merged
Conversation
… reach the hub
The dashboard sent {mode: action} but the hub gates and dispatches /ui
control-mode on the "action" key (_MUTATING_COMMANDS / _apply_ui_command),
so the frame matched no command key and was silently dropped, leaving
operator pause/resume/stop/reset dead over the WebSocket. Send {action}.
Add a vitest guard pinning the corrected wire format (fails on the old
{mode} payload) and a hub-side test asserting an operator action:pause
actually flips the room mode.
HubState.set_rate_limit retunes the global send rate at runtime: it updates
the per-peer bucket defaults (so future registrations inherit them) and
reconfigures every live and reaped peer's TokenBucket in place. New
TokenBucket.reconfigure credits elapsed idle time under the old rate, adopts
the new params and clamps tokens down to the new capacity — mutating in
place rather than reconstructing, which would re-trigger __post_init__ and
silently reseed tokens to a full burst on each retune.
Validation is server-side and a strict no-op on rejection (refill_rate > 0,
capacity >= 1.0): an invalid frame leaves the defaults and every bucket
byte-identical. The current rate is pushed as a {type:rate} event and added
to the /ui snapshot so the dashboard can reflect and edit it.
_apply_ui_command now handles {set_rate:{refill_rate,capacity}} by calling
state.set_rate_limit, and set_rate joins _MUTATING_COMMANDS so an observer
connection is refused with the standard forbidden error. A reserved "peer"
key is rejected as a no-op today so the planned per-peer override extends
the wire contract instead of breaking it; malformed payloads are ignored.
peer_info now derives two signals from one configurable threshold so the operator can see who has gone silent. A peer is "quiet" when it is live, not operator-paused, and has passed QUIET_AFTER_SECONDS (default 180s, overridable via CAUCUS_QUIET_AFTER_SECONDS) with BOTH no /receive poll AND no set_status update — the poll-age guard means an actively polling peer is never flagged. The threshold sits between a realistic single long turn (a passive bridge fires no tool call mid-turn; a native does not poll while reasoning) and the 300s reaper, so a normal turn never trips it but a genuinely silent peer surfaces before it is reaped. status_stale is a looser cosmetic dim derived from the same tunable (0.66x), so the operator holds a single number. Both ride the existing peers_info path into the /ui health tick and snapshot; the snapshot also carries the active quiet_after threshold.
…igns of life Amend PROTOCOL_TEXT and bump PROTOCOL_VERSION 14 -> 15. Agents now talk to the human exclusively through ask_operator forms — never a plain say() — and must announce in the room before any private exchange with the operator, so the room always knows a side conversation is happening even if it cannot see it. Strengthen the set_status cadence into a load-bearing rule: a peer that neither polls nor reports a status looks dead to the hub and surfaces as "quiet" on the operator console, so agents (especially ones peers are waiting on) must give regular signs of life via set_status between turns. A reminder comment at PROTOCOL_VERSION points to the caucus-protocol.md mirror so the two never drift.
Mirror the PROTOCOL_VERSION 15 amendment into the human-readable protocol copy: add ask_operator/list_forms to the tool table, a new "Asking the human (forms)" section stating forms are the only channel to the operator and the signal-before-private rule, and a Discipline bullet on giving regular signs of life via set_status before going "quiet". A test_protocol_md drift guard pins the shared "taking this to the operator privately" sentence in both this file and the hub's PROTOCOL_TEXT so they cannot diverge.
RateControl lets the operator retune the global send rate at runtime: a value
field, a per-second/per-minute unit toggle and a burst field that convert to
the hub's {set_rate:{refill_rate,capacity}} frame. The store gains a rate slice
(RateInfo, snapshot.rate, the {type:rate} event, and sendSetRate). The panel is
operator-only and notes that a change applies to all peers and clamps in-flight
bursts.
A "Pause while typing" toggle (off by default, persisted) pauses the room the instant the operator starts typing into an empty composer and auto-resumes it on send. A four-phase state machine (idle -> waiting -> confirmed -> cancelled) arms the auto-resume only once the hub echoes the pause back, so it never fights a manual pause and never spuriously resumes when another operator changed the mode meanwhile. Clearing the box or turning the toggle off also resumes; an inline hint shows while auto-paused.
PeerInfo gains quiet and status_stale, and the HealthPanel renders them: a live peer that has gone silent (no poll and no status update past the threshold) gets an amber "quiet" badge with an advisory tooltip — amber, not red, because a peer mid-long-turn can legitimately be quiet — plus a "N quiet" figure in the stats bar. Each peer's self-reported status now shows its age and dims when stale. This gives the operator an at-a-glance read of who is alive and what they are doing.
…s-and-liveness # Conflicts: # caucus-protocol.md # src/caucus/hub.py
…ars the composer Typing the leading "/" of a slash-command goes empty to non-empty, so with "Pause while typing" on the composer arms an auto-pause. Accepting the command from the autocomplete dropdown clears the box through executeCommand, which bypassed handleChange's clear-box-resume branch — so the auto-pause state machine leaked: a stale phase, a lingering hint, a spurious resume on the next send, and for /export (which sets no mode of its own) a room left silently paused with no box content left to clear to release it. executeCommand now releases the transient typing-pause itself: it resumes for /export, and for the mode-setting commands (pause/resume/stop/reset) it forgets the auto-pause without a spurious resume so the command's own terminal mode stays authoritative. Covered by two new composer tests driving the real autocomplete-accept path.
set_rate_limit validated only the lower bounds (refill_rate > 0, capacity >= 1). NaN already fails those comparisons, but +inf slips through (inf > 0, inf >= 1) and is accepted, silently disabling the limiter — never a legitimate config for an attacker-shaped /ui frame. Guard both operands with math.isfinite so the reject stays a strict no-op. Extends the existing reject test with inf/nan.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Operator controls & agent liveness
Five operator-facing improvements to the hub, plus one latent bug found along the way. No manual version bump — release-please owns the release from these conventional commits.
PROTOCOL_VERSIONis bumped 14 → 15 (independent of the package version).What's in it
/uicontrol wire (found while scoping pause-on-typing): the dashboard sent{mode: action}but the hub only dispatches on{action}, so operator pause/resume/stop were silently dropped over the socket. Now sends{action}, pinned by a fails-before vitest guard + a hub-side coverage test.set_rate_limitretunes every live and reaped bucket in place (never reconstructs — that would reseed tokens to a full burst), validates first and is a strict no-op on reject. A reservedpeerkey is rejected today so a per-peer override can extend the wire later.idle → waiting → confirmed → cancelled) arms the auto-resume only after the hub echoes the pause, so it never fights a manual pause or another operator's mode change.QUIET_AFTER_SECONDS, default 180s, env-overridable) with both no/receivepoll and noset_status. Surfaced as an amber "quiet" badge in the HealthPanel — advisory, not an error, since a peer mid-long-turn can legitimately be quiet. The threshold sits below the 300s reaper so a peer surfaces before it vanishes.set_statusis now load-bearing: agents (especially ones peers wait on) report status between turns or surface as quiet. The HealthPanel shows each peer's status age and dims it when stale.ask_operatorforms, and must signal in the room before any private exchange with the operator.PROTOCOL_TEXTamended,caucus-protocol.mdmirror synced, drift-guarded by a shared canonical phrase.Why pause/resume "never worked"
The wire mismatch was latent: control-mode over
/uihas been a no-op the whole time (the REST/controlpath works). Pause-on-typing reuses that path, so the fix ships first as commit 1.Verification
ruffclean,mypy --strictclean (13 files).tsc -bstrict +vite buildclean; the UI bundle (src/caucus/ui/) is rebuilt and committed with each web change.Design decisions (locked with the requester)