net: macb: candidate fixes for silent TX stall on BCM2712/RP1#7340
net: macb: candidate fixes for silent TX stall on BCM2712/RP1#7340lukaszraczylo wants to merge 3 commits intoraspberrypi:rpi-6.18.yfrom
Conversation
macb_start_xmit() and macb_tx_restart() kick transmission by
OR-ing MACB_BIT(TSTART) into NCR. On PCIe-attached macb instances
(BCM2712 + RP1 PCIe south bridge on Raspberry Pi 5 is the setup we
have in front of us), writes to NCR are posted PCIe writes: they
are not guaranteed to reach the device before the issuing CPU
returns. An existing source-level comment at the TSTART site
acknowledges that such writes can be lost under some conditions:
/* TSTART write might get dropped, so make the IRQ retrigger
* a buffer read */
and arms a recovery handshake via queue->tx_pending /
queue->txubr_pending that runs on the next TCOMP interrupt. That
recovery path depends on a subsequent TCOMP actually firing. If
the TSTART write never reaches the MAC, no TX begins, no TCOMP
completion arrives, and the ring remains quiescent without any
kernel-visible indication.
Add a read-back of NCR after each TSTART write in macb_start_xmit()
and macb_tx_restart(). The read is an architected PCIe read
barrier for earlier posted writes on the same path; it ensures the
doorbell has reached the MAC before the functions return.
The existing tx_pending / txubr_pending handshake is left in place
unchanged -- it remains the correct recovery for any other reason
the MAC could silently fail to start TX.
We do not have direct hardware evidence that TSTART is being lost
on the RP1 path. This patch is one of a three-patch series
("candidate fixes for silent TX stall on BCM2712/RP1"); see the
cover letter for context. We have verified it compiles and
applies cleanly; runtime verification is pending.
Link: cilium/cilium#43198
Link: https://bugs.launchpad.net/ubuntu/+source/linux-raspi/+bug/2133877
Signed-off-by: Lukasz Raczylo <lukasz@raczylo.com>
macb_tx_poll() runs with TCOMP masked, drains the TX ring, then
calls napi_complete_done() and re-enables TCOMP via IER. An
existing comment in the function notes:
/* Packet completions only seem to propagate to raise
* interrupts when interrupts are enabled at the time, so if
* packets were sent while interrupts were disabled,
* they will not cause another interrupt to be generated when
* interrupts are re-enabled.
*/
and mitigates this by calling macb_tx_complete_pending(), which
inspects driver-visible ring state (descriptor->ctrl, after rmb())
and reschedules NAPI if a completion is observable in memory.
On PCIe-attached parts (BCM2712 + RP1 on Raspberry Pi 5 is the
setup we have in front of us), the descriptor DMA write that sets
TX_USED may not have retired to system memory at the point
macb_tx_complete_pending() runs. The rmb() synchronises the CPU
view of earlier CPU writes; it is not sufficient to retire an
in-flight peripheral DMA write. Under that ordering the in-memory
descriptor can still read TX_USED=0 when the hardware has in fact
completed the frame; the check returns false; NAPI exits; the
quirk above prevents the re-enabled IER from re-firing; the ring
goes quiescent.
Add an explicit ISR read after the IER write. The MMIO read
serves two independent purposes:
(1) It is an architected PCIe read barrier for earlier
peripheral-originated DMA writes on the same path, so a
subsequent macb_tx_complete_pending() observes any TX_USED
write that was in flight at the time of the barrier.
(2) It samples the hardware ISR directly, so a TCOMP bit that
the hardware set while TCOMP was masked is visible here,
independently of whether the descriptor DMA has retired.
If either signal indicates pending work, reschedule NAPI via the
same path as the existing check.
This patch addresses one of three candidate races for the silent
TX stall described in the cover letter. Whether it is sufficient
by itself, or whether it requires the PCIe posted-write flush in
patch 1/3 to cover the observed behaviour, we have not yet
verified at runtime.
Link: cilium/cilium#43198
Link: https://bugs.launchpad.net/ubuntu/+source/linux-raspi/+bug/2133877
Signed-off-by: Lukasz Raczylo <lukasz@raczylo.com>
Patches 1/3 and 2/3 address two candidate races that could lead to a TCOMP completion being missed on PCIe-attached macb instances. This patch adds a defence-in-depth safety net, in case a further race remains that we have not identified. The watchdog is a per-queue delayed_work that runs once per second. It snapshots queue->tx_tail; if the ring is non-empty (queue->tx_head != queue->tx_tail) and tx_tail has not advanced since the previous tick, it calls macb_tx_restart(). No new recovery logic is introduced. macb_tx_restart() already exists in this file, is correctly locked (tx_ptr_lock, bp->lock), and verifies that the hardware's TBQP is behind the driver's head index before re-asserting TSTART. On a healthy ring it is a no-op at the hardware level; the watchdog only supplies the missing trigger. On a healthy queue the per-tick cost is one spin_lock_irqsave() / spin_unlock_irqrestore() and one branch. The delayed_work is only scheduled between macb_open() and macb_close(), and is cancelled synchronously on close. Context for submission: on our 24-node Raspberry Pi 5 fleet, before this series, an out-of-band user-space watchdog (monitoring tx_packets from /sys/class/net/.../statistics and toggling the link down/up when it froze) was required to keep nodes usable. We include this kernel-side watchdog as a cleaner in-kernel equivalent for any residual stall that patches 1 and 2 do not cover. We are willing to drop this patch if the view is that 1 and 2 should stand alone. Link: cilium/cilium#43198 Link: https://bugs.launchpad.net/ubuntu/+source/linux-raspi/+bug/2133877 Signed-off-by: Lukasz Raczylo <lukasz@raczylo.com>
8adc2e6 to
a7e09f2
Compare
|
Force-pushed amended commits to address the checkpatch warnings from the previous run. Fix is style-only — converted four block comments from the kernel-default
New head: |
|
Once the automatic checks have completed, a trial build can be installed by running A suggestion: if it is important to determine which of the patches are required, you could gate each fix with a module parameter so that interested parties can experiment. These parameters would likely be dropped before merging, so consider adding them via a 4th commit. |
|
@pelwell - there's bit of a background - https://bugs.launchpad.net/ubuntu/+source/linux-raspi/+bug/2133877 where Rich tried applying them separately. I am happy to feature gate them, although from the tests I ran so far ( as the bug hunt progressed ) I'd say both patches + watchdog are necessary to keep things stable |
|
Perhaps it's more important to ensure the stalls are banished, even if that means a belt and braces approach. |
|
I will try to compile this and run it on my raspberry pi too |

net: macb: candidate fixes for silent TX stall on BCM2712/RP1
Mirrors the netdev RFC
[RFC PATCH net-next 0/3] net: macb: candidate fixes for silent TX stall on BCM2712/RP1(lore) — same three commits, anchored againstrpi-6.18.y(the only difference vs. the netdev mainline-anchored version is in patch 3'smacb.hhunk, which is anchored on the vendor-forkbool tx_pendingfield that's not present upstream).Opening this in addition to the netdev RFC at @nbuchwitz's request on #7339 ("Will test the RFC series and update this issue, then post a PR with the patches once I've dug a bit deeper" — happy to defer to him if he prefers to land it himself; this is meant as a ready-to-rebase carry, not to short-circuit his review).
Patches
net: macb: flush PCIe posted write after TSTART doorbell— read-back ofNCRafter eachTSTARTwrite so the doorbell reaches the MAC before the function returns. Targets the case where theTSTARTwrite is still posted in PCIe write-buffers and the MAC has not yet been kicked.net: macb: re-check ISR after IER re-enable in macb_tx_poll— re-reads ISR after re-enabling theTCOMPinterrupt at the end ofmacb_tx_poll(). CatchesTCOMPraised inside the IDR/IER mask window, and serves as a PCIe read-barrier to flush any in-flightTX_USEDdescriptor writes from the MAC.net: macb: add TX stall watchdog as defence-in-depth safety net— per-queuedelayed_workthat runs once per second, calls the existingmacb_tx_restart()iftx_tailhasn't advanced since the previous tick while the ring is non-empty. No new recovery logic:macb_tx_restart()is correctly locked and idempotent.Runtime evidence (Pi 5 fleet, rpi-6.18.y kernel + this series)
NodeNotReady. EEE was off (verified viaethtool --show-eeeshowingEEE status: disabledandtx_lpi_transitions: 0since boot, so AutogrEEEn-disable from PR EEE patches #7270 was active and not the cause).Verification
git amagainstrpi-6.18.yHEAD applies cleanly (3 commits, all GPG-signed).v6.18.24(the upstream-stable kernel siderolabs/pkgs uses) is in flight at kernel: macb silent TX stall candidate fixes (RFC patches from netdev) siderolabs/pkgs#1526.Notes
delayed_workbased and usesmacb_tx_restart(); willing to drop it if the maintainer view is that patches 1 + 2 should land alone first.f7576c7,89d02e4) are already merged into this branch and are complementary: this series targets a residual stall path independent of EEE LPI wake-up.Related