Skip to content

feat(shmem/sdma): implement address-based device putmem_nbi_signal on…#445

Open
zjing14 wants to merge 1 commit into
ROCm:mainfrom
zjing14:sdma-putmem-nbi-signal-block
Open

feat(shmem/sdma): implement address-based device putmem_nbi_signal on…#445
zjing14 wants to merge 1 commit into
ROCm:mainfrom
zjing14:sdma-putmem-nbi-signal-block

Conversation

@zjing14

@zjing14 zjing14 commented Jul 2, 2026

Copy link
Copy Markdown

… the SDMA transport

The device-side ShmemPutMemNbiSignalBlockKernel address-based overloads were stubs (TODO), and DISPATCH_TRANSPORT_TYPE_WITH_BOOL asserted on SDMA, so a device putmem_nbi_signal targeting an SDMA peer hit assert(false).

This wires SDMA through the bool dispatch and implements the address-based signal-put as a COPY_LINEAR(source -> peer dest) followed by an ATOMIC on the peer flag, enqueued on the SAME SDMA queue so the DMA engine executes the flag update strictly after the copy completes (in-order queue). Result: a CU-free per-tile push+signal usable to drive an in-kernel wait_until gate (e.g. a fused all-gather + GEMM consumer) without consuming compute units.

Notes:

  • Thread scope resolves the heap object from globalGpuStates and computes peer offsets from heapBaseAddr, matching the existing address-based SDMA paths.
  • Uses atomic INCREMENT (monotonic-generation semantics); consumers wait flag >= gen. signalValue/signalOp are accepted for API parity but the SDMA path currently only wraps increment.
  • onlyOneSignal=false forwards to the =true path.

Validated on 4x gfx950 (MI350) with a FlyDSL SDMA-signal probe and a fused AG+GEMM PoC.

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

… the SDMA transport

The device-side ShmemPutMemNbiSignalBlockKernel<SDMA> address-based overloads
were stubs (TODO), and DISPATCH_TRANSPORT_TYPE_WITH_BOOL asserted on SDMA, so a
device putmem_nbi_signal targeting an SDMA peer hit assert(false).

This wires SDMA through the bool dispatch and implements the address-based
signal-put as a COPY_LINEAR(source -> peer dest) followed by an ATOMIC on the
peer flag, enqueued on the SAME SDMA queue so the DMA engine executes the flag
update strictly after the copy completes (in-order queue). Result: a CU-free
per-tile push+signal usable to drive an in-kernel wait_until gate (e.g. a fused
all-gather + GEMM consumer) without consuming compute units.

Notes:
- Thread scope resolves the heap object from globalGpuStates and computes peer
  offsets from heapBaseAddr, matching the existing address-based SDMA paths.
- Uses atomic INCREMENT (monotonic-generation semantics); consumers wait
  `flag >= gen`. signalValue/signalOp are accepted for API parity but the SDMA
  path currently only wraps increment.
- onlyOneSignal=false forwards to the =true path.

Validated on 4x gfx950 (MI350) with a FlyDSL SDMA-signal probe and a fused
AG+GEMM PoC.
@carlushuang carlushuang requested a review from jhchouuu July 2, 2026 06:38

@jhchouuu jhchouuu left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Merging is fine. Follow-ups for us to complete later:

  • Implement the thread/warp scopes (only block-level is done now; the macro also expands those, so they currently hit a silent no-op stub).
  • Support the full signal semantics (signalValue/signalOp are currently ignored, always += 1).
  • Reduce the runtime dispatch overhead from the added SDMA branch (extra cost and code size).

Comment on lines 358 to 360
inline __device__ void ShmemPutMemNbiSignalBlockKernel<application::TransportType::SDMA, true>(
const void* dest, const void* source, size_t bytes, const void* signalDest,
uint64_t signalValue, core::atomicType signalOp, int pe, int qpId) {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only the block-level version is implemented; the thread and warp versions aren't yet. The macro expansion probably expands the thread/warp paths too, so there might be a small issue here. But it's fine to merge for now, we'll add the complete functionality in a follow-up.

Comment on lines +57 to +58
} else if (transportType == application::TransportType::SDMA) { \
func<application::TransportType::SDMA, boolParam>(__VA_ARGS__); \

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This adds a bit more device-side runtime overhead to this API. Previously the branch was just between P2P and IBGDA; now there's an extra SDMA branch, which also grows the code size. There's not much we can do about it for now, it's inherent to runtime dispatch. We'll look into addressing this in a follow-up.

uint64_t off = 0;
uint64_t base = handle.ReserveQueueSpace(sizeof(SDMA_PKT_ATOMIC), off);
uint64_t wptr = base;
auto pkt = anvil::CreateAtomicIncPacket(sigPtr);

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only the increment semantics are implemented here, signalValue/signalOp are ignored (always atomic += 1). Fine for now; we'll complete the remaining ops in a follow-up.

@jhchouuu

jhchouuu commented Jul 3, 2026

Copy link
Copy Markdown
Collaborator

And now CI system has some issues, is being addressed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants