Skip to content

net: macb: candidate fixes for silent TX stall on BCM2712/RP1#7340

Open
lukaszraczylo wants to merge 3 commits intoraspberrypi:rpi-6.18.yfrom
lukaszraczylo:net-macb-silent-tx-stall-fixes
Open

net: macb: candidate fixes for silent TX stall on BCM2712/RP1#7340
lukaszraczylo wants to merge 3 commits intoraspberrypi:rpi-6.18.yfrom
lukaszraczylo:net-macb-silent-tx-stall-fixes

Conversation

@lukaszraczylo
Copy link
Copy Markdown

net: macb: candidate fixes for silent TX stall on BCM2712/RP1

Mirrors the netdev RFC [RFC PATCH net-next 0/3] net: macb: candidate fixes for silent TX stall on BCM2712/RP1 (lore) — same three commits, anchored against rpi-6.18.y (the only difference vs. the netdev mainline-anchored version is in patch 3's macb.h hunk, which is anchored on the vendor-fork bool tx_pending field that's not present upstream).

Opening this in addition to the netdev RFC at @nbuchwitz's request on #7339 ("Will test the RFC series and update this issue, then post a PR with the patches once I've dug a bit deeper" — happy to defer to him if he prefers to land it himself; this is meant as a ready-to-rebase carry, not to short-circuit his review).

Patches

  • net: macb: flush PCIe posted write after TSTART doorbell — read-back of NCR after each TSTART write so the doorbell reaches the MAC before the function returns. Targets the case where the TSTART write is still posted in PCIe write-buffers and the MAC has not yet been kicked.
  • net: macb: re-check ISR after IER re-enable in macb_tx_poll — re-reads ISR after re-enabling the TCOMP interrupt at the end of macb_tx_poll(). Catches TCOMP raised inside the IDR/IER mask window, and serves as a PCIe read-barrier to flush any in-flight TX_USED descriptor writes from the MAC.
  • net: macb: add TX stall watchdog as defence-in-depth safety net — per-queue delayed_work that runs once per second, calls the existing macb_tx_restart() if tx_tail hasn't advanced since the previous tick while the ring is non-empty. No new recovery logic: macb_tx_restart() is correctly locked and idempotent.

Runtime evidence (Pi 5 fleet, rpi-6.18.y kernel + this series)

  • Series deployed 2026-04-24 19:00 UTC on a 24-node Raspberry Pi 5 fleet (3 masters + 3 storage + 16 workers + 2 TPU). Cumulative runtime ≈ 1,900 node-hours.
  • Zero silent TX wedges (the failure mode the series targets — pre-patch each occurrence required hardware power-cycle).
  • One patch-3-recovered transient stall on pi-worker-16 (2026-04-26 11:44:59 UTC):
    macb 1f00100000.ethernet end0:
      TX stall detected on queue 0 (tail=46455484 head=46455486); re-kicking TSTART
    
    Recovery latency ~1 s (one watchdog tick); node never went NodeNotReady. EEE was off (verified via ethtool --show-eee showing EEE status: disabled and tx_lpi_transitions: 0 since boot, so AutogrEEEn-disable from PR EEE patches #7270 was active and not the cause).
  • Pre-patch reference rate from the cover letter: ~0.5 stall events / node-hour, projecting ~950 events over the runtime window. Observed 1, recovered.
  • 24h follow-up posted to the netdev thread with the same data.

Verification

Notes

  • Patch 3 is delayed_work based and uses macb_tx_restart(); willing to drop it if the maintainer view is that patches 1 + 2 should land alone first.
  • EEE-side fixes from PR EEE patches #7270 (f7576c7, 89d02e4) are already merged into this branch and are complementary: this series targets a residual stall path independent of EEE LPI wake-up.

Related

macb_start_xmit() and macb_tx_restart() kick transmission by
OR-ing MACB_BIT(TSTART) into NCR.  On PCIe-attached macb instances
(BCM2712 + RP1 PCIe south bridge on Raspberry Pi 5 is the setup we
have in front of us), writes to NCR are posted PCIe writes: they
are not guaranteed to reach the device before the issuing CPU
returns.  An existing source-level comment at the TSTART site
acknowledges that such writes can be lost under some conditions:

	/* TSTART write might get dropped, so make the IRQ retrigger
	 * a buffer read */

and arms a recovery handshake via queue->tx_pending /
queue->txubr_pending that runs on the next TCOMP interrupt.  That
recovery path depends on a subsequent TCOMP actually firing.  If
the TSTART write never reaches the MAC, no TX begins, no TCOMP
completion arrives, and the ring remains quiescent without any
kernel-visible indication.

Add a read-back of NCR after each TSTART write in macb_start_xmit()
and macb_tx_restart().  The read is an architected PCIe read
barrier for earlier posted writes on the same path; it ensures the
doorbell has reached the MAC before the functions return.

The existing tx_pending / txubr_pending handshake is left in place
unchanged -- it remains the correct recovery for any other reason
the MAC could silently fail to start TX.

We do not have direct hardware evidence that TSTART is being lost
on the RP1 path.  This patch is one of a three-patch series
("candidate fixes for silent TX stall on BCM2712/RP1"); see the
cover letter for context.  We have verified it compiles and
applies cleanly; runtime verification is pending.

Link: cilium/cilium#43198
Link: https://bugs.launchpad.net/ubuntu/+source/linux-raspi/+bug/2133877
Signed-off-by: Lukasz Raczylo <lukasz@raczylo.com>
macb_tx_poll() runs with TCOMP masked, drains the TX ring, then
calls napi_complete_done() and re-enables TCOMP via IER.  An
existing comment in the function notes:

	/* Packet completions only seem to propagate to raise
	 * interrupts when interrupts are enabled at the time, so if
	 * packets were sent while interrupts were disabled,
	 * they will not cause another interrupt to be generated when
	 * interrupts are re-enabled.
	 */

and mitigates this by calling macb_tx_complete_pending(), which
inspects driver-visible ring state (descriptor->ctrl, after rmb())
and reschedules NAPI if a completion is observable in memory.

On PCIe-attached parts (BCM2712 + RP1 on Raspberry Pi 5 is the
setup we have in front of us), the descriptor DMA write that sets
TX_USED may not have retired to system memory at the point
macb_tx_complete_pending() runs.  The rmb() synchronises the CPU
view of earlier CPU writes; it is not sufficient to retire an
in-flight peripheral DMA write.  Under that ordering the in-memory
descriptor can still read TX_USED=0 when the hardware has in fact
completed the frame; the check returns false; NAPI exits; the
quirk above prevents the re-enabled IER from re-firing; the ring
goes quiescent.

Add an explicit ISR read after the IER write.  The MMIO read
serves two independent purposes:

  (1) It is an architected PCIe read barrier for earlier
      peripheral-originated DMA writes on the same path, so a
      subsequent macb_tx_complete_pending() observes any TX_USED
      write that was in flight at the time of the barrier.

  (2) It samples the hardware ISR directly, so a TCOMP bit that
      the hardware set while TCOMP was masked is visible here,
      independently of whether the descriptor DMA has retired.

If either signal indicates pending work, reschedule NAPI via the
same path as the existing check.

This patch addresses one of three candidate races for the silent
TX stall described in the cover letter.  Whether it is sufficient
by itself, or whether it requires the PCIe posted-write flush in
patch 1/3 to cover the observed behaviour, we have not yet
verified at runtime.

Link: cilium/cilium#43198
Link: https://bugs.launchpad.net/ubuntu/+source/linux-raspi/+bug/2133877
Signed-off-by: Lukasz Raczylo <lukasz@raczylo.com>
Patches 1/3 and 2/3 address two candidate races that could lead
to a TCOMP completion being missed on PCIe-attached macb
instances.  This patch adds a defence-in-depth safety net, in
case a further race remains that we have not identified.

The watchdog is a per-queue delayed_work that runs once per
second.  It snapshots queue->tx_tail; if the ring is non-empty
(queue->tx_head != queue->tx_tail) and tx_tail has not advanced
since the previous tick, it calls macb_tx_restart().

No new recovery logic is introduced.  macb_tx_restart() already
exists in this file, is correctly locked (tx_ptr_lock, bp->lock),
and verifies that the hardware's TBQP is behind the driver's
head index before re-asserting TSTART.  On a healthy ring it is
a no-op at the hardware level; the watchdog only supplies the
missing trigger.

On a healthy queue the per-tick cost is one spin_lock_irqsave()
/ spin_unlock_irqrestore() and one branch.  The delayed_work is
only scheduled between macb_open() and macb_close(), and is
cancelled synchronously on close.

Context for submission: on our 24-node Raspberry Pi 5 fleet,
before this series, an out-of-band user-space watchdog
(monitoring tx_packets from /sys/class/net/.../statistics and
toggling the link down/up when it froze) was required to keep
nodes usable.  We include this kernel-side watchdog as a cleaner
in-kernel equivalent for any residual stall that patches 1 and
2 do not cover.  We are willing to drop this patch if the view
is that 1 and 2 should stand alone.

Link: cilium/cilium#43198
Link: https://bugs.launchpad.net/ubuntu/+source/linux-raspi/+bug/2133877
Signed-off-by: Lukasz Raczylo <lukasz@raczylo.com>
@lukaszraczylo lukaszraczylo force-pushed the net-macb-silent-tx-stall-fixes branch from 8adc2e6 to a7e09f2 Compare April 28, 2026 19:24
@lukaszraczylo
Copy link
Copy Markdown
Author

Force-pushed amended commits to address the checkpatch warnings from the previous run.

Fix is style-only — converted four block comments from the kernel-default /*\n * Foo style to the net/ subsystem-required /* Foo\n * ... style. No functional change.

File:line (post-amend) Patch
drivers/net/ethernet/cadence/macb_main.c:1953 1/3
drivers/net/ethernet/cadence/macb_main.c:2640 1/3
drivers/net/ethernet/cadence/macb_main.c:2003 2/3
drivers/net/ethernet/cadence/macb_main.c:2034 3/3

New head: a7e09f2e97204da72c9fd95b98c7bfb5d0b14cc4. All 3 commits still GPG-signed.

@pelwell
Copy link
Copy Markdown
Contributor

pelwell commented Apr 28, 2026

Once the automatic checks have completed, a trial build can be installed by running sudo rpi-update pulls/7340.

A suggestion: if it is important to determine which of the patches are required, you could gate each fix with a module parameter so that interested parties can experiment. These parameters would likely be dropped before merging, so consider adding them via a 4th commit.

@lukaszraczylo
Copy link
Copy Markdown
Author

@pelwell - there's bit of a background - https://bugs.launchpad.net/ubuntu/+source/linux-raspi/+bug/2133877 where Rich tried applying them separately. I am happy to feature gate them, although from the tests I ran so far ( as the bug hunt progressed ) I'd say both patches + watchdog are necessary to keep things stable

@pelwell
Copy link
Copy Markdown
Contributor

pelwell commented Apr 28, 2026

Perhaps it's more important to ensure the stalls are banished, even if that means a belt and braces approach.

@lukaszraczylo
Copy link
Copy Markdown
Author

lukaszraczylo commented Apr 28, 2026

As I mentioned few times in other discussions - I'm running kubernetes on 24 pi's, currently on 2400+ node-hours without wedges. Most sensitive workflows like cnpg and longhorn do not report any issues since the switch to patched kernel and additional monitors I have set up look clear as well. I'm keeping an eye on it and of course - I'll keep everyone updated if something changes.

Screenshot 2026-04-28 at 22 08 32

@koeqaife
Copy link
Copy Markdown

I will try to compile this and run it on my raspberry pi too

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants