Skip to content

[bugfix] fix bugs#95

Merged
Jintao-Huang merged 6 commits into
modelscope:mainfrom
Jintao-Huang:fix_router_mcore_016
May 26, 2026
Merged

[bugfix] fix bugs#95
Jintao-Huang merged 6 commits into
modelscope:mainfrom
Jintao-Huang:fix_router_mcore_016

Conversation

@Jintao-Huang
Copy link
Copy Markdown
Collaborator

No description provided.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a custom TopKRouter to handle expert bias with padding masks and integrates it into the model registration process. However, several issues were identified: the _apply_expert_bias override is currently ineffective as the parent class does not pass the required padding_mask, and a shape mismatch would occur if it were passed. Additionally, the use of @jit_fuser on an instance method is inappropriate, and the router replacement logic in register.py needs more robust attribute checking to avoid potential AttributeError exceptions when encountering partial objects or dense layers.

Comment on lines +10 to +21
def _apply_expert_bias(self, routing_map: torch.Tensor, padding_mask: Optional[torch.Tensor] = None):
"""
Update expert bias and tokens_per_expert
Prevent extra local tokens accumulation on evaluation or activation recomputation
"""
if self.enable_expert_bias and torch.is_grad_enabled():
with torch.no_grad():
if padding_mask is not None:
if padding_mask.ndim == 1:
padding_mask = padding_mask.unsqueeze(-1)
routing_map = routing_map & (~padding_mask)
self.local_tokens_per_expert += routing_map.sum(dim=0)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

This implementation of _apply_expert_bias has two significant issues that make the bugfix ineffective:

  1. Ineffective logic: TopKRouter inherits from McoreTopKRouter, but it does not override the forward method. The parent's forward method calls self._apply_expert_bias(routing_map) without passing a padding_mask. Consequently, the padding_mask argument here will always receive its default value (None), and the masking logic will be skipped. To fix this, you must also override the forward method to accept and pass the padding_mask.
  2. Shape Mismatch: If padding_mask were passed (e.g., from GPTModel), it would typically have the shape [batch, seq_len]. However, routing_map in the router has the shape [batch * seq_len, num_experts]. Using unsqueeze(-1) results in [batch, seq_len, 1], which will cause a broadcast error when combined with routing_map. It should be flattened using .view(-1, 1) instead.

Comment on lines +9 to +10
@jit_fuser
def _apply_expert_bias(self, routing_map: torch.Tensor, padding_mask: Optional[torch.Tensor] = None):
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The @jit_fuser decorator is generally intended for standalone functions or static methods that perform element-wise operations. Applying it to an instance method that modifies module state (like self.local_tokens_per_expert) is not supported by the Megatron-Core JIT fuser and may lead to compilation errors or unexpected behavior during TorchScripting. The original McoreTopKRouter implementation does not use JIT for this method.

Suggested change
@jit_fuser
def _apply_expert_bias(self, routing_map: torch.Tensor, padding_mask: Optional[torch.Tensor] = None):
def _apply_expert_bias(self, routing_map: torch.Tensor, padding_mask: Optional[torch.Tensor] = None):

Comment thread src/mcore_bridge/model/register.py Outdated
Comment on lines +150 to +152
mlp_submodules = layer_spec.submodules.mlp.submodules
if getattr(mlp_submodules, 'router', None) is McoreTopKRouter:
mlp_submodules.router = TopKRouter
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Accessing layer_spec.submodules.mlp.submodules directly is unsafe because layer_spec.submodules.mlp can be a functools.partial object (as seen in _set_mlp_spec), which does not have a submodules attribute. Additionally, for dense layers, mlp might not be present or might not have a router. This will result in an AttributeError for certain model configurations.

            mlp_spec = getattr(layer_spec.submodules, 'mlp', None)
            if mlp_spec is not None and hasattr(mlp_spec, 'submodules'):
                mlp_submodules = mlp_spec.submodules
                if getattr(mlp_submodules, 'router', None) is McoreTopKRouter:
                    mlp_submodules.router = TopKRouter

@Jintao-Huang
Copy link
Copy Markdown
Collaborator Author

/gemini review

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a custom TopKRouter to handle expert bias updates safely during evaluation or activation recomputation, and integrates it into the model building pipeline by replacing the default Megatron-Core router. Feedback has been provided to improve robustness, including setting the correct default parameter in _replace_router for Gemma4, using a safer .get() call when accessing partial keywords to avoid potential KeyErrors, and explicitly casting the padding mask to a boolean tensor before applying bitwise operations.

Comment thread src/mcore_bridge/model/mm_gpts/gemma4.py Outdated
Comment thread src/mcore_bridge/model/register.py Outdated
Comment thread src/mcore_bridge/model/modules/topk_router.py
@Jintao-Huang
Copy link
Copy Markdown
Collaborator Author

/gemini review

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a custom TopKRouter to prevent extra local tokens accumulation during evaluation or activation recomputation, integrates it into the model registration process, and adjusts MTP decoder input handling. The review feedback highlights a critical bug in gpt_model.py where moving the mtp_decoder_input chunking logic before preprocessing can cause an AttributeError crash and loss of computed embeddings in the first pipeline stage. Additionally, a robustness improvement is suggested in register.py to use getattr when retrieving submodules from mlp_spec to avoid potential AttributeErrors.

Comment thread src/mcore_bridge/model/gpt_model.py Outdated
Comment thread src/mcore_bridge/model/register.py Outdated
@Jintao-Huang Jintao-Huang changed the title [bugfix] fix router mcore 0.16 [bugfix] fix bugs May 26, 2026
@Jintao-Huang
Copy link
Copy Markdown
Collaborator Author

/gemini review

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces several updates to the mcore_bridge model configurations and architectures. Key changes include setting a default value for dsa_indexer_loss_coeff in ModelConfig, refactoring GPTModel to handle mtp_decoder_input during preprocessing, and introducing a custom TopKRouter module to manage expert bias updates. Feedback on these changes highlights two potential issues: a missing defensive check in gpt_model.py when input_tensor is None, and a potential bitwise operation bug in topk_router.py if padding_mask is not explicitly cast to a boolean tensor.

Comment thread src/mcore_bridge/model/gpt_model.py
Comment thread src/mcore_bridge/model/modules/topk_router.py
@Jintao-Huang Jintao-Huang merged commit 6f61ee6 into modelscope:main May 26, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants