[bugfix] fix bugs#95
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a custom TopKRouter to handle expert bias with padding masks and integrates it into the model registration process. However, several issues were identified: the _apply_expert_bias override is currently ineffective as the parent class does not pass the required padding_mask, and a shape mismatch would occur if it were passed. Additionally, the use of @jit_fuser on an instance method is inappropriate, and the router replacement logic in register.py needs more robust attribute checking to avoid potential AttributeError exceptions when encountering partial objects or dense layers.
| def _apply_expert_bias(self, routing_map: torch.Tensor, padding_mask: Optional[torch.Tensor] = None): | ||
| """ | ||
| Update expert bias and tokens_per_expert | ||
| Prevent extra local tokens accumulation on evaluation or activation recomputation | ||
| """ | ||
| if self.enable_expert_bias and torch.is_grad_enabled(): | ||
| with torch.no_grad(): | ||
| if padding_mask is not None: | ||
| if padding_mask.ndim == 1: | ||
| padding_mask = padding_mask.unsqueeze(-1) | ||
| routing_map = routing_map & (~padding_mask) | ||
| self.local_tokens_per_expert += routing_map.sum(dim=0) |
There was a problem hiding this comment.
This implementation of _apply_expert_bias has two significant issues that make the bugfix ineffective:
- Ineffective logic:
TopKRouterinherits fromMcoreTopKRouter, but it does not override theforwardmethod. The parent'sforwardmethod callsself._apply_expert_bias(routing_map)without passing apadding_mask. Consequently, thepadding_maskargument here will always receive its default value (None), and the masking logic will be skipped. To fix this, you must also override theforwardmethod to accept and pass thepadding_mask. - Shape Mismatch: If
padding_maskwere passed (e.g., fromGPTModel), it would typically have the shape[batch, seq_len]. However,routing_mapin the router has the shape[batch * seq_len, num_experts]. Usingunsqueeze(-1)results in[batch, seq_len, 1], which will cause a broadcast error when combined withrouting_map. It should be flattened using.view(-1, 1)instead.
| @jit_fuser | ||
| def _apply_expert_bias(self, routing_map: torch.Tensor, padding_mask: Optional[torch.Tensor] = None): |
There was a problem hiding this comment.
The @jit_fuser decorator is generally intended for standalone functions or static methods that perform element-wise operations. Applying it to an instance method that modifies module state (like self.local_tokens_per_expert) is not supported by the Megatron-Core JIT fuser and may lead to compilation errors or unexpected behavior during TorchScripting. The original McoreTopKRouter implementation does not use JIT for this method.
| @jit_fuser | |
| def _apply_expert_bias(self, routing_map: torch.Tensor, padding_mask: Optional[torch.Tensor] = None): | |
| def _apply_expert_bias(self, routing_map: torch.Tensor, padding_mask: Optional[torch.Tensor] = None): |
| mlp_submodules = layer_spec.submodules.mlp.submodules | ||
| if getattr(mlp_submodules, 'router', None) is McoreTopKRouter: | ||
| mlp_submodules.router = TopKRouter |
There was a problem hiding this comment.
Accessing layer_spec.submodules.mlp.submodules directly is unsafe because layer_spec.submodules.mlp can be a functools.partial object (as seen in _set_mlp_spec), which does not have a submodules attribute. Additionally, for dense layers, mlp might not be present or might not have a router. This will result in an AttributeError for certain model configurations.
mlp_spec = getattr(layer_spec.submodules, 'mlp', None)
if mlp_spec is not None and hasattr(mlp_spec, 'submodules'):
mlp_submodules = mlp_spec.submodules
if getattr(mlp_submodules, 'router', None) is McoreTopKRouter:
mlp_submodules.router = TopKRouter|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a custom TopKRouter to handle expert bias updates safely during evaluation or activation recomputation, and integrates it into the model building pipeline by replacing the default Megatron-Core router. Feedback has been provided to improve robustness, including setting the correct default parameter in _replace_router for Gemma4, using a safer .get() call when accessing partial keywords to avoid potential KeyErrors, and explicitly casting the padding mask to a boolean tensor before applying bitwise operations.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a custom TopKRouter to prevent extra local tokens accumulation during evaluation or activation recomputation, integrates it into the model registration process, and adjusts MTP decoder input handling. The review feedback highlights a critical bug in gpt_model.py where moving the mtp_decoder_input chunking logic before preprocessing can cause an AttributeError crash and loss of computed embeddings in the first pipeline stage. Additionally, a robustness improvement is suggested in register.py to use getattr when retrieving submodules from mlp_spec to avoid potential AttributeErrors.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces several updates to the mcore_bridge model configurations and architectures. Key changes include setting a default value for dsa_indexer_loss_coeff in ModelConfig, refactoring GPTModel to handle mtp_decoder_input during preprocessing, and introducing a custom TopKRouter module to manage expert bias updates. Feedback on these changes highlights two potential issues: a missing defensive check in gpt_model.py when input_tensor is None, and a potential bitwise operation bug in topk_router.py if padding_mask is not explicitly cast to a boolean tensor.
No description provided.