Skip to content

MCU status byte attribution race: failed-move status can be clobbered before broadcast #541

Description

@hongquanli

Background

PR #540 added a firmware-side mcu_cmd_execution_status byte so the host can detect failed moves (e.g. filter-wheel move before INITFILTERWHEEL, or tmc4361A_moveTo returning non-zero). The status is set to COMPLETED_WITHOUT_ERRORS at the start of every received command and overwritten to CMD_EXECUTION_ERROR by mark_move_failed() / report_move_error() on the failure path. The next position-update broadcast carries the status to the host.

Surfaced during review by copilot-pull-request-reviewer[bot] (PR #540 review comment).

The race

The current implementation in serial_communication.cpp::process_serial_message resets the status to COMPLETED_WITHOUT_ERRORS at the start of every received command, before the callback runs:

if (buffer_rx[1] != HEARTBEAT)
{
    mcu_cmd_execution_status = COMPLETED_WITHOUT_ERRORS;
}

This assumes the previous command's status has already been broadcast by the timer-driven send_position_update. The assumption breaks when:

  1. Command N fails → status = CMD_EXECUTION_ERROR.
  2. Command N+1 arrives before the next position-update broadcast fires.
  3. Status is reset → the host never sees the error for command N.

What PR #540 mitigates

The if (buffer_rx[1] != HEARTBEAT) guard handles the common case: a heartbeat keepalive arriving between a failed move and the next broadcast no longer clobbers the error. This addresses the dominant in-practice race because the heartbeat thread is the only command source that interleaves with wait_till_operation_is_completed.

What's still vulnerable

  • Two failed commands back-to-back before a broadcast: second command's reset clobbers the first's status. The first failure is lost.
  • Future background-sender commands: any new command type that may arrive while a move is in flight (other than HEARTBEAT) would reintroduce the race. The skip list is hardcoded to a single entry.
  • Misattribution if cmd_id advances: even when status survives, the broadcast reports it tagged with the latest received cmd_id. If a heartbeat (or any non-status-bearing command) lands between the failed move and the broadcast, the host's CMD_EXECUTION_ERROR handler still triggers correctly only because the host's own _cmd_id has been bumped by the heartbeat thread in lockstep. In any cross-process or cross-host scenario where command IDs are not synchronized between the failing thread and the interleaving thread, attribution breaks.

Why this is fundamentally hard

The current protocol uses state broadcasts (timer-driven) for what is essentially event reporting (per-command outcomes). State broadcasts are inherently lossy when the underlying state changes faster than the broadcast cadence. The status byte is per-command state, so any loss attaches to whichever cmd_id the byte happened to belong to at the moment of broadcast.

Suggested fix approaches (in increasing invasiveness)

  1. Tag status with its cmd_id. Add mcu_cmd_execution_status_cmd_id global. Broadcast reports (latest_cmd_id, status, status_cmd_id). Host rendezvous's on status_cmd_id == _cmd_id. Catches the cross-cmd_id misattribution for the most recent failure. Earlier failures (if two arrive between broadcasts) are still lost.

  2. Reset-only-when-observed. Add an _observed flag set in send_position_update after the SerialUSB write, cleared on receipt in process_serial_message. Reset status only if _observed was true. Catches the heartbeat-like cases generally without a hardcoded skip list. Doesn't help cmd_id attribution (broadcast still reports the new command's cmd_id with the old command's status).

  3. Combine (1) + (2). Status tagged with real cmd_id, and only reset after broadcast. Closes both the loss and attribution gaps for the most recent failure. ~20 lines of firmware + a small host change to read the new byte. Best correctness-per-effort ratio.

  4. Event-driven status reporting. Trigger a broadcast from inside the failing callback (before returning) so the status is delivered before another command can be processed. Cleanest semantically but constrains SerialUSB write timing and requires care around interrupt context.

Acceptance criteria

  • Failed moves are reported to the host regardless of intervening commands (any type, any interleaving).
  • Attribution is unambiguous: the host can determine which cmd_id the reported failure belongs to.
  • New behavior is opt-in via firmware version (FIRMWARE_VERSION_MINOR bump) so old hosts can still talk to new firmware safely.
  • Tests cover: (a) failed move + heartbeat interleave, (b) two failed moves before broadcast, (c) failed move + arbitrary non-move command interleave.

Related

Priority

Low for typical use today (host serializes commands except for heartbeat; the HEARTBEAT skip closes that gap; the absolute-MOVETO containment in PR #540 means a missed error still self-corrects on the next successful move). Higher priority if a future feature interleaves commands at rates faster than interval_send_pos_update, or if a stall-class incident (firmware-undetectable mechanical stall) is reported and we want the broadcast/verify successor PR to land cleanly.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Fields

    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions