[DeepSeek-V4] Implement MoE routing primitives (HashRouter, TopKRouter, RoutedMoE) #3871
parambole wants to merge 1 commit into
Conversation
Codecov Report: ❌ Patch coverage is
RissyRan left a comment:
Thanks for the change! Have a few comments.
Diff context:
return jnp.sqrt(jax.nn.softplus(x))

class DeepSeekV4TopKRouter(nnx.Module):
I see the logic is similar to the existing GateLogic + top-k:
Gate logic: maxtext/src/maxtext/layers/moe.py, line 174 in b5e5330
Top-k: maxtext/src/maxtext/layers/moe.py, line 599 in b5e5330
Shall we leverage _sqrtsoftplus and combine them to avoid some duplication?
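For context, a rough sketch of the two pieces being discussed: _sqrt_softplus mirrors the activation in the diff hunk above, and the top-k step is the generic sigmoid-score-then-renormalize pattern, not the PR's actual implementation.

```python
import jax
import jax.numpy as jnp


def _sqrt_softplus(x):
  # Activation shown in the diff context above.
  return jnp.sqrt(jax.nn.softplus(x))


def topk_route(router_logits, k):
  # Generic top-k gating sketch (assumed, not the PR's code): scale logits into
  # (0, 1) with a sigmoid, keep the k best experts per token, and renormalize
  # the selected scores so they sum to one.
  scores = jax.nn.sigmoid(router_logits.astype(jnp.float32))
  weights, indices = jax.lax.top_k(scores, k)
  weights = weights / jnp.clip(weights.sum(axis=-1, keepdims=True), 1e-9)
  return weights, indices
```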
Diff context:
)

class DeepSeekV4HashRouter(nnx.Module):
Wondering if we should name it HashRouter directly. Do you know if it's specific to DS v4?
Diff context:
with jax.named_scope("ffn_act"):
if self.config.decoder_block == ctypes.DecoderBlockType.GPT_OSS:
if self.config.decoder_block == ctypes.DecoderBlockType.DEEPSEEK_V4:
limit = getattr(self.config, "swiglu_limit", 1.0)
Shall we reuse the self.config.mlp_activations_limit config as below?
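As a point of comparison, here is a minimal sketch of a limit-clamped SwiGLU-style activation; the clipping scheme and parameter names are assumptions, not the formula used in this PR.

```python
import jax
import jax.numpy as jnp


def clamped_swiglu(gate_proj, up_proj, limit=1.0):
  # Assumed clamping scheme: cap the gated branch at +limit and keep the linear
  # branch within [-limit, limit] before applying SiLU gating.
  gate = jnp.clip(gate_proj, None, limit)
  up = jnp.clip(up_proj, -limit, limit)
  return jax.nn.silu(gate) * up
```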
Diff context:
w0_pspec, w1_pspec, wo_pspec = maybe_aqt_partition(w0_kernel, w0_pspec, w1_kernel, w1_pspec, wo_kernel, wo_pspec)

if gate_weights is not None:
gate_weights_pspec = self._logical_to_mesh_axes((batch_logical_axis, "activation_norm_length", None))
Nit: could you help add shape info here? Same comment for gate_indices.
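For illustration, the requested shape note might look like the following; the axis names are assumptions, not taken from the PR.

```python
# gate_weights: [batch, seq_len, experts_per_token] routing weights supplied by an external gate.
# gate_indices: [batch, seq_len, experts_per_token] expert ids selected for each token.
```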
Diff context:
routing_inputs = inputs if gate_inputs is None else gate_inputs.astype(gate_dtype)
gate_logits, pre_bias_logits = self.gate(routing_inputs)

if self.config.decoder_block == ctypes.DecoderBlockType.DEEPSEEK_V4:
Could you help add some comments explaining the conditions? Thanks!
The Pull Request successfully implements the MoE routing primitives required for DeepSeek-V4, including HashRouter, TopKRouter, and updates to the execution layers. The code is well-structured and includes comprehensive unit tests validating parity with PyTorch reference implementations.
🔍 General Feedback
- Integration Gaps: While the routing primitives are correct, the DeepSeek model definition (src/maxtext/models/deepseek.py) needs corresponding updates to pass `layer_idx` and `input_ids` to these new layers. Without these, the model will not behave correctly in DeepSeek-V4 mode.
- Configuration Maintainability: The use of `getattr` with hardcoded defaults for `swiglu_limit` and `num_hash_layers` should be replaced with formal parameters in `MaxTextConfig` to ensure better discoverability and type safety.
- Precision Parity: The explicit use of FP32 for expert summation and routing calculations is a positive highlight, as it ensures numerical parity with reference implementations (see the sketch below).
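A minimal sketch of what fp32 expert combination can look like, assuming the per-token expert outputs and routing weights are already gathered; shapes and names here are illustrative only, not the PR's actual code.

```python
import jax.numpy as jnp


def combine_expert_outputs(expert_out, routing_weights, out_dtype=jnp.bfloat16):
  # expert_out:      [tokens, experts_per_token, model_dim]
  # routing_weights: [tokens, experts_per_token]
  # Accumulate the weighted sum in float32, then cast back to the activation dtype.
  combined = jnp.einsum(
      "tkd,tk->td",
      expert_out.astype(jnp.float32),
      routing_weights.astype(jnp.float32),
  )
  return combined.astype(out_dtype)
```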
Diff context:
matmul_precision=self.config.matmul_precision,
shard_mode=config.shard_mode,
rngs=self.rngs,
self.is_hash = self.config.decoder_block == ctypes.DecoderBlockType.DEEPSEEK_V4 and 0 <= layer_idx < getattr(
🔴 The integration of layer_idx is crucial for DeepSeek-V4 to distinguish between Hash and Top-K layers. However, the model definition in src/maxtext/models/deepseek.py does not yet pass this parameter during instantiation of RoutedAndSharedMoE. This will cause all layers to default to layer_idx=0, resulting in all layers using the HashRouter if num_hash_layers >= 1.
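To make the concern concrete, the condition in the diff reduces to a per-layer flag like the sketch below (the default of 3 is taken from the getattr fallback shown above); any layer constructed with the default layer_idx=0 would therefore pick the hash router.

```python
def uses_hash_router(layer_idx: int, num_hash_layers: int = 3) -> bool:
  # The first num_hash_layers decoder layers route by token-ID hash; the remaining
  # layers use the learned top-k router.
  return 0 <= layer_idx < num_hash_layers
```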
Diff context:
shard_mode=config.shard_mode,
rngs=self.rngs,
self.is_hash = self.config.decoder_block == ctypes.DecoderBlockType.DEEPSEEK_V4 and 0 <= layer_idx < getattr(
config, "num_hash_layers", 3
🟡 num_hash_layers should ideally be a formal configuration parameter in MaxTextConfig (in src/maxtext/configs/types.py) rather than being retrieved via getattr with a hardcoded default of 3.
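A minimal sketch of promoting these values to formal config fields; the class and field names and their placement relative to MaxTextConfig are assumptions, not the PR's actual schema.

```python
from dataclasses import dataclass


@dataclass
class DeepSeekV4MoEConfig:
  # Hypothetical fields; the PR currently reads these via getattr with hardcoded defaults.
  num_hash_layers: int = 3    # layers [0, num_hash_layers) use the hash router
  swiglu_limit: float = 1.0   # clamp applied in the DeepSeek-V4 ffn activation
```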
Diff context:
gate_inputs: jax.Array | None = None,
out_sharding: NamedSharding | None = None,
input_ids: jax.Array | None = None,
gate_weights: jax.Array | None = None,
🟠 DeepSeekV4HashRouter requires input_ids for expert assignment. While the interface now correctly supports this, ensure that src/maxtext/models/deepseek.py is updated to pass decoder_input_tokens as input_ids during the MoE block call, as it currently lacks this linkage.
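For reference, a minimal sketch of deterministic, MD5-based expert assignment from token ids; it does not reproduce the PR's actual hash projection, it only illustrates why routing needs input_ids and no auxiliary balancing loss. Because the assignment is a pure function of the vocabulary id, it could also be precomputed once per vocabulary and looked up at run time.

```python
import hashlib

import numpy as np


def hash_expert_assignment(input_ids: np.ndarray, num_experts: int) -> np.ndarray:
  # Map each token id to a fixed expert via an MD5 digest, so expert assignment is
  # deterministic and identical across runs.
  def assign(token_id: int) -> int:
    digest = hashlib.md5(str(int(token_id)).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_experts

  return np.vectorize(assign)(input_ids)
```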
Diff context:
with jax.named_scope("ffn_act"):
if self.config.decoder_block == ctypes.DecoderBlockType.GPT_OSS:
if self.config.decoder_block == ctypes.DecoderBlockType.DEEPSEEK_V4:
limit = getattr(self.config, "swiglu_limit", 1.0)
🟡 Instead of using getattr with a hardcoded default of 1.0, consider adding swiglu_limit to the MaxTextConfig class. This improves discoverability, documentation, and type safety for the configuration.
Diff context:
Computes logits, static routing weights based on token IDs, and expert indices.
"""

def __init__(
🟢 The docstring mentions "static routing weights", which might be confusing as the weights themselves are learned based on logits. Only the expert assignment (indices) is static based on token IDs.
Suggested change:
def __init__(
"""Hash Router for DeepSeek-V4 MoE routing.
Computes learned routing weights for a static expert assignment determined by token IDs.
"""
Description
Implement Mixture of Experts (MoE) routing gates and execution layers required for DeepSeek-V4 integration into MaxText:
- HashRouter: Token routing mechanism utilizing MD5 hash projections for deterministic expert assignment without auxiliary loss.
- TopKRouter: Gated top-k router implementing sigmoid scaling and score normalization across selected experts.
- RoutedMoE & RoutedAndSharedMoE: Execution layers supporting layer_idx routing, gate clamping, and FP32 expert summation parity.
- Parity verification: Extended unit test suite (tests/unit/deepseek_v4_vs_reference_test.py) validating MoE routing parity against PyTorch reference implementations at atol=1e-5, rtol=1e-5.
Tests
Tested on CPU:
Checklist
Before submitting this PR, please make sure (put X in square brackets):
gemini-review label.