MCU status byte attribution race: failed-move status can be clobbered before broadcast

## Background

PR #540 added a firmware-side `mcu_cmd_execution_status` byte so the host can detect failed moves (e.g. filter-wheel move before `INITFILTERWHEEL`, or `tmc4361A_moveTo` returning non-zero). The status is set to `COMPLETED_WITHOUT_ERRORS` at the start of every received command and overwritten to `CMD_EXECUTION_ERROR` by `mark_move_failed()` / `report_move_error()` on the failure path. The next position-update broadcast carries the status to the host.

Surfaced during review by `copilot-pull-request-reviewer[bot]` (PR #540 review comment).

## The race

The current implementation in `serial_communication.cpp::process_serial_message` resets the status to `COMPLETED_WITHOUT_ERRORS` at the **start of every received command**, before the callback runs:

```cpp
if (buffer_rx[1] != HEARTBEAT)
{
    mcu_cmd_execution_status = COMPLETED_WITHOUT_ERRORS;
}
```

This assumes the previous command's status has already been broadcast by the timer-driven `send_position_update`. The assumption breaks when:

1. Command N fails → `status = CMD_EXECUTION_ERROR`.
2. Command N+1 arrives before the next position-update broadcast fires.
3. Status is reset → the host never sees the error for command N.

### What PR #540 mitigates

The `if (buffer_rx[1] != HEARTBEAT)` guard handles the common case: a heartbeat keepalive arriving between a failed move and the next broadcast no longer clobbers the error. This addresses the dominant in-practice race because the heartbeat thread is the only command source that interleaves with `wait_till_operation_is_completed`.

### What's still vulnerable

- **Two failed commands back-to-back before a broadcast**: second command's reset clobbers the first's status. The first failure is lost.
- **Future background-sender commands**: any new command type that may arrive while a move is in flight (other than `HEARTBEAT`) would reintroduce the race. The skip list is hardcoded to a single entry.
- **Misattribution if `cmd_id` advances**: even when status survives, the broadcast reports it tagged with the *latest* received `cmd_id`. If a heartbeat (or any non-status-bearing command) lands between the failed move and the broadcast, the host's CMD_EXECUTION_ERROR handler still triggers correctly only because the host's own `_cmd_id` has been bumped by the heartbeat thread in lockstep. In any cross-process or cross-host scenario where command IDs are not synchronized between the failing thread and the interleaving thread, attribution breaks.

## Why this is fundamentally hard

The current protocol uses **state broadcasts** (timer-driven) for what is essentially **event reporting** (per-command outcomes). State broadcasts are inherently lossy when the underlying state changes faster than the broadcast cadence. The status byte is per-command state, so any loss attaches to whichever `cmd_id` the byte happened to belong to at the moment of broadcast.

## Suggested fix approaches (in increasing invasiveness)

1. **Tag status with its `cmd_id`.** Add `mcu_cmd_execution_status_cmd_id` global. Broadcast reports `(latest_cmd_id, status, status_cmd_id)`. Host rendezvous's on `status_cmd_id == _cmd_id`. Catches the cross-cmd_id misattribution for the *most recent* failure. Earlier failures (if two arrive between broadcasts) are still lost.

2. **Reset-only-when-observed.** Add an `_observed` flag set in `send_position_update` after the SerialUSB write, cleared on receipt in `process_serial_message`. Reset status only if `_observed` was true. Catches the heartbeat-like cases generally without a hardcoded skip list. Doesn't help cmd_id attribution (broadcast still reports the new command's cmd_id with the old command's status).

3. **Combine (1) + (2).** Status tagged with real cmd_id, and only reset after broadcast. Closes both the loss and attribution gaps for the most recent failure. ~20 lines of firmware + a small host change to read the new byte. Best correctness-per-effort ratio.

4. **Event-driven status reporting.** Trigger a broadcast from inside the failing callback (before returning) so the status is delivered before another command can be processed. Cleanest semantically but constrains SerialUSB write timing and requires care around interrupt context.

## Acceptance criteria

- Failed moves are reported to the host **regardless** of intervening commands (any type, any interleaving).
- Attribution is unambiguous: the host can determine which `cmd_id` the reported failure belongs to.
- New behavior is opt-in via firmware version (`FIRMWARE_VERSION_MINOR` bump) so old hosts can still talk to new firmware safely.
- Tests cover: (a) failed move + heartbeat interleave, (b) two failed moves before broadcast, (c) failed move + arbitrary non-move command interleave.

## Related

- PR #540 (introduced the status byte; partial mitigation via HEARTBEAT skip)
- Review reply: https://github.com/Cephla-Lab/Squid/pull/540#discussion_r... (the original race-discussion comment)

## Priority

Low for typical use today (host serializes commands except for heartbeat; the HEARTBEAT skip closes that gap; the absolute-MOVETO containment in PR #540 means a missed error still self-corrects on the next successful move). Higher priority if a future feature interleaves commands at rates faster than `interval_send_pos_update`, or if a stall-class incident (firmware-undetectable mechanical stall) is reported and we want the broadcast/verify successor PR to land cleanly.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

MCU status byte attribution race: failed-move status can be clobbered before broadcast #541

Background

The race

What PR #540 mitigates

What's still vulnerable

Why this is fundamentally hard

Suggested fix approaches (in increasing invasiveness)

Acceptance criteria

Related

Priority

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

MCU status byte attribution race: failed-move status can be clobbered before broadcast #541

Description

Background

The race

What PR #540 mitigates

What's still vulnerable

Why this is fundamentally hard

Suggested fix approaches (in increasing invasiveness)

Acceptance criteria

Related

Priority

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions