ccl: hierarchical cross-node AllGather (intra-node SDMA + inter-node RDMA)#441
Open
inkcherry wants to merge 7 commits into
…RDMA)
Add mori.ccl.HierAllGather: an all_gather_into_tensor-compatible collective that
keeps intra-node traffic on the SDMA copy engines (XGMI) and moves inter-node
traffic over RDMA. A fused ring||local-gather kernel runs the inter-node RDMA
ring concurrently with the ring-independent local node-block SDMA gather in one
grid (stream-ordered, direct-to-output, no staging copy).
Bit-exact vs torch.distributed.all_gather_into_tensor for {bf16,fp16,fp32,int32}.
On 2 nodes x 4 GPUs (MI355X), fp32: standalone bandwidth >= RCCL for sizes
>=8MB (1.19-1.35x); under a concurrent GEMM the SDMA path overlaps with compute
and is 16-20% faster than RCCL at 128-512MB (copy engines vs CU contention).
Includes tests (test_hier_allgather*), size-sweep + gemm-overlap benches, a
plot script, and the measured result charts/CSVs under benchmarks/.
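The two-level data movement described above can be illustrated with a small pure-Python simulation. This is only a sketch under stated assumptions: it does not call mori, the function name `hier_all_gather_sim` is hypothetical, nodes are assumed to be equally sized, and the final intra-node spread of remote shards is one plausible way to complete the gather, not necessarily what the fused kernel does.

```python
def hier_all_gather_sim(shards, ranks_per_node):
    """Simulate a hierarchical all-gather; shards[r] is the input of global rank r.

    Returns out, where out[r] is the gathered list as seen by rank r.
    """
    world = len(shards)
    if world % ranks_per_node != 0:
        raise ValueError("this sketch assumes equally sized nodes")
    num_nodes = world // ranks_per_node
    out = [[None] * world for _ in range(world)]

    # Local node-block gather (stand-in for intra-node SDMA). It only touches
    # shards from the rank's own node, so it does not depend on the ring.
    for r in range(world):
        node = r // ranks_per_node
        for peer in range(node * ranks_per_node, (node + 1) * ranks_per_node):
            out[r][peer] = shards[peer]

    # Inter-node ring (stand-in for RDMA) among ranks with the same local index.
    # At each step a rank forwards the shard it obtained in the previous step.
    for step in range(num_nodes - 1):
        in_flight = []
        for r in range(world):
            node, local = divmod(r, ranks_per_node)
            slot = ((node - step) % num_nodes) * ranks_per_node + local
            dst = ((node + 1) % num_nodes) * ranks_per_node + local
            in_flight.append((dst, slot, out[r][slot]))
        for dst, slot, value in in_flight:
            out[dst][slot] = value

    # Assumed final step: spread the remote shards each rank received to its
    # node peers (again intra-node traffic).
    for r in range(world):
        node = r // ranks_per_node
        for peer_local in range(ranks_per_node):
            peer = node * ranks_per_node + peer_local
            for other in range(num_nodes):
                slot = other * ranks_per_node + peer_local
                out[r][slot] = out[peer][slot]
    return out
```

With a single node the ring loop runs zero times, so the result comes entirely from the local gather, which mirrors the single-node behaviour described in this PR.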
…ignature
HierAllGather now auto-detects the node-local rank count (LOCAL_WORLD_SIZE, else hostname grouping, else npes), so callers use the same constructor and call signature as the flat AllgatherSdma, with no new required argument. ranks_per_node is now optional and keyword-only; transit_buffer_size was added for signature parity. A single node still degenerates to the pure intra-node SDMA path.
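The detection order named above (LOCAL_WORLD_SIZE, then hostname grouping, then npes) could look roughly like the following. This is a minimal sketch, not the PR's code: the helper name `detect_ranks_per_node` and its parameters are hypothetical, and it assumes the hostnames of all ranks have already been exchanged and are indexed by global rank.

```python
import os


def detect_ranks_per_node(npes, hostnames=None, rank=0, env=None):
    """Return the number of ranks on this rank's node.

    npes      -- total number of ranks
    hostnames -- optional list of hostnames, one per global rank (assumed input)
    rank      -- this process's global rank
    env       -- mapping to read LOCAL_WORLD_SIZE from (defaults to os.environ)
    """
    env = os.environ if env is None else env

    # 1) Prefer the launcher-provided value when it is set.
    value = env.get("LOCAL_WORLD_SIZE")
    if value is not None:
        return int(value)

    # 2) Otherwise count the ranks that share this rank's hostname.
    if hostnames:
        return hostnames.count(hostnames[rank])

    # 3) Fall back to treating all ranks as one node.
    return npes
```

The last fallback treats the job as single-node, which is consistent with the single-node degeneration to the intra-node SDMA path.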
Resolve conflicts with the upstream param-contiguous SDMA allgather:
- oneshot_sdma_kernel.hpp: keep both the hierarchical sub-group/broadcast SDMA kernels and the upstream param-contiguous kernel (additive).
- symmetric_memory.cpp: keep deviceHandles_d indexing by global pe (the array is worldSize-sized and all SDMA kernels index by global pe); adopt the upstream non-fatal GPU-metadata teardown.
Add HierAllGather.all_gather(tensor_list, tensor) matching
torch.distributed.all_gather (list output), built on the same hierarchical
intra-node SDMA / inter-node RDMA path as the contiguous all_gather_into_tensor.
Bit-exact vs torch across {bf16,fp16,fp32,int32}; adds test_hier_allgather_list.
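The relationship between the list-output form and the contiguous form can be sketched as follows: gather into one flat buffer, then copy equal-sized chunks into the preallocated output list. This is an illustration only, using plain Python lists in place of tensors; the helper name `fill_tensor_list` is hypothetical, and the PR's implementation may write directly into the list entries instead.

```python
def fill_tensor_list(tensor_list, flat):
    """Copy a contiguous gathered buffer into a preallocated list of outputs.

    tensor_list -- list of world_size mutable sequences, each of equal length
    flat        -- contiguous gathered result, rank 0's shard first
    """
    world_size = len(tensor_list)
    if world_size == 0 or len(flat) % world_size != 0:
        raise ValueError("flat buffer must split evenly across ranks")
    chunk = len(flat) // world_size
    for i, dst in enumerate(tensor_list):
        if len(dst) != chunk:
            raise ValueError("output %d has the wrong length" % i)
        # In-place copy, so callers holding references to dst see the data.
        dst[:] = flat[i * chunk:(i + 1) * chunk]
    return tensor_list
```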
…DMA + inter-RDMA)
…A + inter RDMA)
Adds a single drop-in FSDP2 AllGather backend, MoriAllGather, used identically for single-node and cross-node runs via the stock FSDPModule.set_custom_all_gather API. It routes intra-node traffic over the SDMA copy engines (XGMI) and, when the process group spans multiple nodes, inter-node traffic over RDMA. The same object handles both cases, so user code is unchanged between one node and many. (MoriHierAllGather is kept as a backward-compatible alias.) Includes the Qwen-7B FSDP2 step benchmark, a 2-node driver, and the chart script. No mori source change is needed; the HierAllGather primitive already exists and handles the single-node case as pure intra-node SDMA.
c8d4eca to
4188668
Compare