InternLM · HAOCHENYE · Jun 2, 2026
diff --git a/.claude/skills/add_hf_model/SKILL.md b/.claude/skills/add_hf_model/SKILL.md
@@ -80,6 +80,18 @@ attention / router / layer onto the matching one.
   - Also: rope, rms_norm, etc.
 - **ops layer** (`xtuner/v1/ops`) — kernels such as attention and rms_norm.
 
+> **Caveat — don't reach for deprecated config classes.** `RopeScalingConfig` is
+> **deprecated**; `RopeParametersConfig` (`xtuner/v1/module/rope/rope.py`, re-exported from
+> `xtuner/v1/model/base.py`) is the source of truth — use it everywhere (your IDE/pyright will flag
+> the deprecated one). Note that the decoder-layer / `MHAConfig.build` signatures still *type* their rope
+> argument as `RopeScalingConfig` for backward compatibility, so don't satisfy them by constructing
+> the deprecated class. When a per-layer value is only needed to select **one module behavior** (e.g.
+> `partial_rotary_factor` only chooses which `apply_rotary_emb` the attention uses), set that behavior
+> **directly on the module** instead of threading a config through `build` — e.g. in your model's
+> decoder layer, `self.self_attn.apply_rotary_emb = get_apply_rotary_emb(None,
+> enable_partial_rotary=...)` (`xtuner/v1/ops`). This keeps per-layer behavior contained (the §C
+> per-profile-RoPE pattern) and avoids the deprecated API entirely.
+
 ### Existing models to copy from
 
 Pick the one whose attention + (router) match yours; the closer it is, the
@@ -527,10 +539,13 @@ FSDP shard/reduce chain — and the file doubles as the example users copy when
 model into their own training pipeline. Mirror `ci/config/qwen3_moe_30BA3.py` (MoE) or
 `ci/config/qwen3_dense.py` (dense), keeping its structure: one `<NewSizeConfig>()` (from §3.2)
 fed into a `TrainerConfig` alongside `optim_cfg` / `lr_cfg` / `fsdp_cfg` / `dataset_cfg` /
-`dataloader_cfg` / `loss_cfg`. `load_from` and `tokenizer_path` read from an env var (§7.4 —
-typically the same one as the parity test). Verify by running ~50 steps and confirming the
-loss drops monotonically into a plausible range for that model size; record the trajectory
-in the PR body alongside the §6 convergence trace.
+`dataloader_cfg` / `loss_cfg`. Set `loss_cfg = CELossConfig(mode="chunk")` — the chunked
+cross-entropy keeps the `logits → loss` peak memory bounded (it never materializes the full
+`(seq, vocab)` logits), which matters for the large-vocab models this skill targets; do **not** leave
+it on the `"eager"` default. `load_from` and `tokenizer_path` read from an env var (§7.4 — typically
+the same one as the parity test). Verify by running ~50 steps and confirming the loss drops
+monotonically into a plausible range for that model size; record the trajectory in the PR body
+alongside the §6 convergence trace.
 
 ---
 

diff --git a/.dev_scripts/convert_step3p5_to_split.py b/.dev_scripts/convert_step3p5_to_split.py
@@ -0,0 +1,124 @@
+"""Convert a Step-3.5-Flash HF checkpoint to a *split* per-expert layout.
+
+The released checkpoint stores each MoE layer's experts as three fused 3-D tensors
+(`moe.gate_proj/up_proj/down_proj.weight`, shape `(num_experts, *, *)`). XTuner fuses experts
+expert-major-interleaved for its grouped GEMM, and the load/save path can only shard a fused
+parameter when each HF key is a *contiguous* slice of it. A single fused weight that maps to two
+separate HF tensors (gate, up) therefore cannot be sharded across ranks.
+
+This script explodes the fused expert tensors into per-expert 2-D tensors under
+`moe.experts.{i}.{gate,up,down}_proj.weight` (Qwen3-MoE style). With that layout XTuner's
+`to_hf_key_list` emits the interleaved key order `[gate_0, up_0, gate_1, up_1, ...]`, which lines up
+with its expert-major fused weight, so the default sharded load/save works on any number of
+GPUs (FSDP and EP) with no per-model checkpoint code.
+
+The MTP layers (`model.layers.45..47`) are dropped — XTuner does not load them.
+
+Usage:
+    python .dev_scripts/convert_step3p5_to_split.py <src_hf_dir> <dst_dir>
+"""
+
+import json
+import re
+import shutil
+import sys
+from pathlib import Path
+
+import torch
+from safetensors import safe_open
+from safetensors.torch import save_file
+
+
+SHARD_BYTES = 4 * 1024**3  # ~4GB shards, HF-standard
+# Side files needed so the converted dir is loadable (AutoConfig trust_remote_code + tokenizer).
+AUX_FILES = [
+    "config.json",
+    "configuration_step3p5.py",
+    "modeling_step3p5.py",
+    "tokenizer.json",
+    "tokenizer_config.json",
+    "special_tokens_map.json",
+    "chat_template.jinja",
+    "generation_config.json",
+]
+DROP_LAYER_RE = re.compile(r"^model\.layers\.(4[5-9]|[5-9]\d)\.")  # MTP / out-of-range layers
+EXPERT_RE = re.compile(r"^(model\.layers\.\d+)\.moe\.(gate_proj|up_proj|down_proj)\.weight$")
+
+
+def _iter_converted_tensors(src: Path):
+    """Yield (new_key, tensor) for every kept tensor, exploding fused experts into per-expert keys."""
+    index = json.loads((src / "model.safetensors.index.json").read_text())
+    weight_map = index["weight_map"]
+    # Group keys by source shard so each shard is opened once.
+    by_file: dict[str, list[str]] = {}
+    for key, fname in weight_map.items():
+        by_file.setdefault(fname, []).append(key)
+
+    for fname in sorted(by_file):
+        with safe_open(str(src / fname), framework="pt") as f:
+            for key in by_file[fname]:
+                if DROP_LAYER_RE.match(key):
+                    continue
+                tensor = f.get_tensor(key)
+                m = EXPERT_RE.match(key)
+                if m is None:
+                    yield key, tensor
+                    continue
+                prefix, proj = m.group(1), m.group(2)
+                # gate/up: (n, inter, hidden) -> per expert (inter, hidden)
+                # down:    (n, hidden, inter) -> per expert (hidden, inter)
+                for i in range(tensor.shape[0]):
+                    yield f"{prefix}.moe.experts.{i}.{proj}.weight", tensor[i].contiguous()
+
+
+def convert(src: Path, dst: Path) -> None:
+    dst.mkdir(parents=True, exist_ok=True)
+    weight_map: dict[str, str] = {}
+    buffer: dict[str, torch.Tensor] = {}
+    buffer_bytes = 0
+    shard_idx = 1
+    shards: list[tuple[str, dict[str, torch.Tensor]]] = []
+
+    def flush():
+        nonlocal buffer, buffer_bytes, shard_idx
+        if not buffer:
+            return
+        name = f"model-{shard_idx:05d}.safetensors"
+        shards.append((name, buffer))
+        for k in buffer:
+            weight_map[k] = name
+        buffer = {}
+        buffer_bytes = 0
+        shard_idx += 1
+
+    for key, tensor in _iter_converted_tensors(src):
+        buffer[key] = tensor
+        buffer_bytes += tensor.numel() * tensor.element_size()
+        if buffer_bytes >= SHARD_BYTES:
+            flush()
+    flush()
+
+    total = sum(t.numel() * t.element_size() for _, b in shards for t in b.values())
+    n_shards = len(shards)
+    renamed: list[tuple[str, dict[str, torch.Tensor]]] = []
+    for i, (_, buf) in enumerate(shards, start=1):
+        final = f"model-{i:05d}-of-{n_shards:05d}.safetensors"
+        for k in buf:
+            weight_map[k] = final
+        renamed.append((final, buf))
+    for name, buf in renamed:
+        save_file(buf, str(dst / name), metadata={"format": "pt"})
+        print(f"  wrote {name}  ({len(buf)} tensors)")
+
+    (dst / "model.safetensors.index.json").write_text(
+        json.dumps({"metadata": {"total_size": total}, "weight_map": weight_map}, indent=2)
+    )
+    for aux in AUX_FILES:
+        srcf = src / aux
+        if srcf.exists():
+            shutil.copy2(srcf, dst / aux)
+    print(f"done: {len(weight_map)} tensors across {n_shards} shards, {total / 1024**3:.1f} GiB -> {dst}")
+
+
+if __name__ == "__main__":
+    convert(Path(sys.argv[1]), Path(sys.argv[2]))
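
The docstring's claim that the per-expert key order `[gate_0, up_0, gate_1, up_1, ...]` lines up with an expert-major fused weight can be checked on a toy layout. The sketch below is pure Python over nested lists and is not XTuner code: `explode`, `interleaved_keys`, and `row_ranges` are hypothetical helpers, the `model.layers.3` prefix is just an example, and concatenation along dim 0 is assumed from the design notes in this PR.

```python
def explode(fused, proj, prefix="model.layers.3"):
    """Split a fused (num_experts, rows, cols) nested list into per-expert 2-D entries."""
    return {
        f"{prefix}.moe.experts.{i}.{proj}.weight": expert
        for i, expert in enumerate(fused)
    }


def interleaved_keys(num_experts, prefix="model.layers.3"):
    """Key order [gate_0, up_0, gate_1, up_1, ...] as described in the docstring."""
    keys = []
    for i in range(num_experts):
        for proj in ("gate_proj", "up_proj"):
            keys.append(f"{prefix}.moe.experts.{i}.{proj}.weight")
    return keys


def row_ranges(keys, tensors):
    """Row range each key occupies when tensors are concatenated along dim 0 in key order."""
    ranges, start = {}, 0
    for key in keys:
        stop = start + len(tensors[key])
        ranges[key] = (start, stop)
        start = stop
    return ranges


# Toy fused tensors: 2 experts, 2 rows each, 1 column; every row carries a label.
gate = [[["g0r0"], ["g0r1"]], [["g1r0"], ["g1r1"]]]
up = [[["u0r0"], ["u0r1"]], [["u1r0"], ["u1r1"]]]

tensors = {**explode(gate, "gate_proj"), **explode(up, "up_proj")}
keys = interleaved_keys(num_experts=2)
ranges = row_ranges(keys, tensors)

# Expert-major fused weight: each expert's [gate; up] block is contiguous.
fused_rows = [row[0] for key in keys for row in tensors[key]]
# Rows that the original fused gate tensor would have to cover in that weight.
gate_row_positions = [i for i, label in enumerate(fused_rows) if label.startswith("g")]
```

In this toy case every per-expert key owns one contiguous row range, while the rows of the original fused `gate_proj` tensor land at positions `[0, 1, 4, 5]`, which is not a single slice. That is the property the conversion relies on.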
diff --git a/ci/config/step3p5.py b/ci/config/step3p5.py
@@ -0,0 +1,61 @@
+import os
+
+from xtuner.v1.config import (
+    AdamWConfig,
+    FSDPConfig,
+    LRConfig,
+)
+from xtuner.v1.datasets import FTDPTokenizeFnConfig
+from xtuner.v1.datasets.config import DataloaderConfig, DatasetConfig
+from xtuner.v1.loss.ce_loss import CELossConfig
+from xtuner.v1.model.moe.step3p5 import Step3p5FlashConfig
+from xtuner.v1.train import TrainerConfig
+
+
+# Point STEP3P5_PATH at the split / per-expert checkpoint produced by
+# `.dev_scripts/convert_step3p5_to_split.py` (the released fused-expert layout cannot be sharded).
+STEP3P5_PATH = os.environ["STEP3P5_PATH"]
+ALPACA_PATH = os.environ["ALPACA_PATH"]
+
+
+# Step-3.5-Flash is a ~200B MoE; real training needs expert parallelism (and a multi-node cluster).
+moe_cfg = Step3p5FlashConfig(ep_size=8, dispatcher="all2all", num_hidden_layers=4)
+optim_cfg = AdamWConfig(lr=6e-05)
+lr_cfg = LRConfig(lr_type="cosine", lr_min=1e-6)
+fsdp_cfg = FSDPConfig(
+    # torch.compile for the hybrid per-layer-RoPE decoder layers is a §8 optimization; keep eager here.
+    torch_compile=False,
+    cpu_offload=False,
+    ep_size=moe_cfg.ep_size,
+)
+
+dataset_config = [
+    {
+        "dataset": DatasetConfig(name="alpaca", anno_path=ALPACA_PATH, sample_ratio=1.0),
+        "tokenize_fn": FTDPTokenizeFnConfig(max_length=16386),
+    },
+]
+
+dataloader_config = DataloaderConfig(pack_max_length=16384)
+
+# Chunked cross-entropy keeps the logits->loss peak memory bounded (never materializes the full
+# (seq, vocab) logits) — important for Step-3.5's 128896-token vocab.
+loss_cfg = CELossConfig(mode="chunk")
+
+
+trainer = TrainerConfig(
+    load_from=STEP3P5_PATH,
+    model_cfg=moe_cfg,
+    optim_cfg=optim_cfg,
+    fsdp_cfg=fsdp_cfg,
+    dataset_cfg=dataset_config,
+    dataloader_cfg=dataloader_config,
+    lr_cfg=lr_cfg,
+    loss_cfg=loss_cfg,
+    tokenizer_path=STEP3P5_PATH,
+    global_batch_size=16,
+    total_step=1000000,
+    work_dir="/tmp/step3p5",
+    seed=0,
+    strict_load=False,
+)
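
The comment on `loss_cfg` says chunked cross-entropy never materializes the full `(seq, vocab)` logits. The sketch below illustrates that idea in pure Python; it is not the `CELossConfig(mode="chunk")` implementation, and `eager_ce`, `chunked_ce`, and the toy shapes are assumptions made for illustration only (forward pass only, no gradient handling).

```python
import math


def _logits(h, weight):
    """One row of logits: dot product of a hidden vector with every vocab row."""
    return [sum(a * b for a, b in zip(h, w)) for w in weight]


def _nll(logits, target):
    """Negative log-likelihood of `target` via a numerically stable log-sum-exp."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return lse - logits[target]


def eager_ce(hidden, weight, targets):
    """Builds the full (seq, vocab) logits first, then reduces."""
    all_logits = [_logits(h, weight) for h in hidden]
    return sum(_nll(l, t) for l, t in zip(all_logits, targets)) / len(targets)


def chunked_ce(hidden, weight, targets, chunk_size):
    """Holds at most (chunk_size, vocab) logits at a time and accumulates the loss sum."""
    total = 0.0
    for start in range(0, len(hidden), chunk_size):
        stop = start + chunk_size
        chunk_logits = [_logits(h, weight) for h in hidden[start:stop]]
        total += sum(_nll(l, t) for l, t in zip(chunk_logits, targets[start:stop]))
    return total / len(targets)


# Toy inputs: seq=5, hidden=2, vocab=3.
hidden = [[0.1, -0.2], [0.3, 0.4], [-0.5, 0.6], [0.7, 0.8], [0.9, -1.0]]
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
targets = [0, 2, 1, 2, 0]

eager = eager_ce(hidden, weight, targets)
chunked = chunked_ce(hidden, weight, targets, chunk_size=2)
```

Both functions return the same mean loss up to floating-point rounding; only the peak size of the intermediate logits differs, which is what matters with a 128896-token vocabulary.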
diff --git a/docs/design/model/step3p5.md b/docs/design/model/step3p5.md
@@ -0,0 +1,147 @@
+# Step-3.5-Flash → XTuner integration design
+
+Source: `stepfun-ai/Step-3.5-Flash`, `model_type = "step3p5"`, remote-code
+(`modeling_step3p5.py` / `configuration_step3p5.py`, `architectures = ["Step3p5ForCausalLM"]`).
+bf16 checkpoint, training-suitable.
+
+Bucket: **MoE LLM, trust_remote_code** (no built-in `transformers` config class →
+`hf_config` returns `None`; `save_hf` copies `config.json` / tokenizer / `*.py`).
+
+## 1. Architecture summary (from the HF modeling code)
+
+- 45 transformer layers. Layers 0–2 are **dense** MLP; layers 3–44 are **MoE**.
+- Vocab 128896, hidden 4096, untied `lm_head` (separate tensor in the index).
+- MoE: 288 routed experts, top-k 8, `moe_intermediate_size` 1280, plus **one shared
+  expert** (`share_expert_dim` 1280). Experts stored as **fused 3-D** tensors
+  `(num_experts, out, in)` — `moe.{gate_proj,up_proj}.weight (288,1280,4096)`,
+  `moe.down_proj.weight (288,4096,1280)`.
+- Router: **sigmoid** activation, per-expert **`router_bias`** added before top-k,
+  weights gathered from the **pre-bias** probabilities, renormalized, then scaled by
+  `moe_router_scaling_factor = 3.0`; router logits computed in **fp32**
+  (`need_fp32_gate`).
+- RMSNorm is **zero-centered** (scale = `weight + 1`), eps 1e-5, throughout.
+- Attention is a **hybrid of two softmax-attention profiles keyed by `layer_types`**:
+  - `full_attention` (every 4th layer, idx % 4 == 0): 64 heads, 8 KV heads, head_dim 128.
+  - `sliding_attention` (the other layers): **96 heads**, 8 KV heads, head_dim 128,
+    sliding window 512.
+  - Both profiles: `qk_norm` on head_dim (zero-centered), and a **head-wise output
+    gate** — a separate `g_proj: Linear(hidden, num_heads)` whose per-head sigmoid
+    multiplies the attention output before `o_proj`.
+- RoPE differs **per profile**:
+  - full_attention: `rope_theta = 5e6`, `partial_rotary_factor = 0.5`, **llama3**
+    scaling (`yarn_only_types = ["full_attention"]`).
+  - sliding_attention: `rope_theta = 1e4`, `partial_rotary_factor = 1.0`, default rope
+    (no scaling).
+- swiglu clamp limits on a few late layers only (MoE layers 43,44 → 7; shared expert
+  layer 44 → 16; all others 0/None).
+- MTP: `num_nextn_predict_layers = 3`, stored as `model.layers.45/46/47.*` and ignored
+  on load by HF. We **drop MTP** for the port (load layers 0–44, `strict=False`).
+
+## 2. Mapping to XTuner — what is reused vs. new
+
+Reused as-is (config wiring only):
+
+| Feature | XTuner mechanism |
+|---|---|
+| first 3 dense + rest MoE | `MoEConfig.first_k_dense_replace = 3` |
+| 288 experts / top-8 / shared expert | `n_routed_experts`, `num_experts_per_tok`, `n_shared_experts = 1` |
+| expert tensors | **split / per-expert checkpoint** (`.dev_scripts/convert_step3p5_to_split.py`) → `to_hf_key_list` emits interleaved `[gate_i, up_i, …]` (Qwen3-MoE style); default load/save shards on FSDP+EP. See "Expert layout" below. |
+| sigmoid + router_bias + renorm + scale 3.0 + fp32 gate | `NoAuxRouterConfig(scoring_func="sigmoid", n_group=1, topk_group=1, norm_topk_prob=True, router_scaling_factor=3.0)` + `router_compute_dtype="float32"` (its math matches HF `router_bias_func`) |
+| zero-centered RMSNorm, qk_norm | `rms_norm_type="zero_centered"`, `MHAConfig.qk_norm=True` |
+| sliding window in training | `MHAConfig.sliding_window` + `layer_type="sliding_attention"` |
+| partial rotary | `RopeParametersConfig.partial_rotary_factor` (already supported) |
+| MTP | `mtp_config=None`, load `strict=False` |
+
+Three things the current design **cannot express** — these are the design forks:
+
+### Fork A — head-wise attention gate (new MHA option)
+
+XTuner's existing `MHAConfig.with_gate` is a **per-(head,dim) element** gate fused into a
+doubled `q_proj` (Qwen3.5 / gpt-oss style). Step-3.5 uses a **separate `g_proj` of shape
+`(num_heads, hidden)`** producing **one scalar per head**. Different weight layout and
+different broadcast. Proposal: add a new, general option to `MHAConfig`
+(`head_gate: bool = False`) that builds `self.g_proj = Linear(hidden, num_heads,
+bias=False)` and applies `out.view(...,H,Dh) * g.sigmoid().unsqueeze(-1)` before `o_proj`.
+HF `self_attn.g_proj.weight` maps 1→1 to xtuner `self_attn.g_proj.weight`. This is a small,
+self-contained addition at the module layer; its own commit.
+
+### Fork B — two attention profiles with different head counts (CONFIRMED)
+
+Today a model has a single `attention: MHAConfig` shared by all full/sliding layers
+(`linear_attention` is GatedDeltaNet-only). Step-3.5 needs **full=64 heads,
+sliding=96 heads**. Decision (user): the existing `layers_type` mechanism already supports
+mixed attention; add a `sliding_attention: MHAConfig` field on the **Step-3.5 config
+only** and **override `build_layers`** in the Step-3.5 model to select the per-layer
+attention config by `layers_type`. No change to `base.py` / `MoEConfig`.
+(`num_attention_heads` etc. computed fields keep reading the `full` profile — fine for
+training; kv-cache / generate with mixed head counts is out of scope for the baseline.)
+
+### Fork C — per-profile RoPE (CONFIRMED — contain in Step-3.5 decoder layer)
+
+Today the model builds one `self.rotary_emb` and passes one `(cos, sin)` to every layer.
+Step-3.5 needs different `(theta, partial_rotary, scaling)` for full vs sliding layers.
+Decision (user): keep it **inside the Step-3.5 decoder layer** for now; generalize later
+once precision is aligned. Each Step-3.5 decoder layer holds its own
+`Step3p5RotaryEmbedding` (full: theta 5e6 / partial 0.5 / **llama3** scaling; sliding:
+theta 1e4 / partial 1.0 / default) built faithfully via HF's
+`ROPE_INIT_FUNCTIONS[rope_type]` on a per-profile shim so inv_freq matches HF bitwise. Its
+`forward` recomputes `position_embeddings` from `seq_ctx.position_ids` and passes them to
+`self.self_attn`, **ignoring** the model-level `(cos,sin)` (which stays valid but unused).
+No change to shared `MoE.forward` or `MultiHeadAttention.forward`. The per-layer partial-rotary
+apply is set **directly on the attention** in the decoder-layer `__init__`
+(`self.self_attn.apply_rotary_emb = get_apply_rotary_emb(None, enable_partial_rotary=…)`) rather than
+threading a rope config through `build` — this keeps the per-layer RoPE fully contained and avoids the
+deprecated `RopeScalingConfig` (whose only consumer in `MultiHeadAttention` is that apply selection).
+
+### Swiglu clamp (CONFIRMED — include now)
+
+Step-3.5 clamps `silu(gate).clamp(max=limit) * up.clamp(min=-limit, max=limit)` on a few late layers
+(routed experts L43/L44 → 7; shared expert L44 → 16; all others none). XTuner's existing
+`clipped_swiglu` is gpt-oss-shaped (sigmoid-GLU + `(up+1)`) and does **not** match. Add a
+new act variant `swiglu_clip` (silu + post-activation clamp) to `act_fn.py` /
+`MoEActFnConfig`, and build a **per-layer** `MoEActFnConfig` (clip only where the config
+lists a nonzero limit) in `build_layers`. For the shared expert clamp (L44), thread an
+optional clamp limit into the shared-expert MLP. MTP (3 nextn layers) remains deferred
+(load layers 0–44, `strict=False`).
+
+### Expert layout — split / per-expert checkpoint (CONFIRMED)
+
+The released checkpoint stores each MoE layer's experts as **three fused 3-D tensors**
+(`moe.{gate,up,down}_proj.weight`, `(num_experts, *, *)`). XTuner fuses experts
+**expert-major-interleaved** (`[g0,u0,g1,u1,…]`, each expert's `[gate;up]` contiguous — required by the
+grouped GEMM). XTuner's loader can only shard a fused parameter when each HF key is a *contiguous*
+slice of it, so a single fused `w1w3` mapped to two HF tensors (`gate`, `up`) **cannot be sharded**:
+a 2-GPU FSDP load was empirically confirmed to crash (`gate, up = safetensors` receives 1 tensor),
+and `save_hf` corrupted gate/up (`(288,640,4096)` vs `(288,1280,4096)` — the FUSED save split runs
+before `param_to_safetensor`). This is a real XTuner limitation, not specific to this model.
+
+Decision (user): **convert the checkpoint to a split / per-expert layout** offline rather than change
+the shared MoE block. `.dev_scripts/convert_step3p5_to_split.py` explodes each fused expert tensor
+into `moe.experts.{i}.{gate,up,down}_proj.weight` (Qwen3-MoE style) and drops the unused MTP layers
+(45–47). With that layout `to_hf_key_list` emits the interleaved key order `[gate_0, up_0, …]`, which
+lines up with the expert-major fused weight, so the **default** `safetensors_to_params` (concat dim 0)
+and the default save split both shard correctly on any number of GPUs (FSDP and EP) — **no per-model
+checkpoint override and no MoE-block change**. Converted checkpoint:
+`/mnt/shared-storage-user/llmrazor-share/yehaochen/model/Step-3.5-Flash-split`.
+
+## 3. File layout
+
+- `xtuner/v1/model/moe/step3p5.py` — `Step3p5MoEConfig` (+ base) and `Step3p5MoE`
+  (`to_hf_key_list`, `safetensors_to_params`/`param_to_safetensor`, `build_layers`,
+  `build_rotary_embedding`, `hf_config -> None`), plus `Step3p5Attention` +
+  `Step3p5RotaryEmbedding` (or co-located in the module layer if cleaner).
+- `xtuner/v1/module/attention/mha.py` — Fork A (`head_gate`) + Fork C (unpack hook).
+- `xtuner/v1/model/__init__.py` — import / `model_mapping` alias / `get_model_config_from_hf`
+  dispatch on `model_type == "step3p5"` / `__all__`.
+- `tests/model/test_step3p5_moe.py` — baseline tests (§6/§7 of the skill).
+- `ci/config/step3p5.py` — drop-in training config.
+
+## 4. Commit plan (stacked, dependency-ordered)
+
+1. `[Feature]` MHA head-wise gate (`head_gate`) + position-embeddings unpack hook (Forks A & C seam).
+2. `[Feature]` Step-3.5 model + config + registration (Forks B & C, router/expert mapping).
+3. `[Feature]` `XTUNER_HF_IMPL` parity wiring if any new op branch is needed.
+4. `[Test]` baseline tests (decoder-layer bitwise parity for one full + one sliding layer;
+   whole-model forward+backward parity at the deployable scale; save_hf round-trip).
+5. `[CI]` drop-in training config + convergence trace.
+6. Then §8 optimizations (EP/SP, compile, fp8, offload) — multi-agent, post-baseline.
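
The router described in §1 of the design doc (sigmoid scores, `router_bias` added before top-k, weights taken from the pre-bias probabilities, renormalized, scaled by 3.0) can be sketched for a single token as follows. This is a pure-Python illustration, not the HF `router_bias_func` or XTuner's `NoAuxRouterConfig` code: `route` is a hypothetical name, and adding the bias to the sigmoid scores (rather than to the raw logits) is an assumption.

```python
import math


def route(logits, router_bias, top_k, scaling_factor=3.0):
    """Select top_k experts for one token and return (expert_ids, weights)."""
    # Sigmoid scoring: each expert gets an independent score in (0, 1).
    probs = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    # The bias only influences which experts are selected.
    biased = [p + b for p, b in zip(probs, router_bias)]
    chosen = sorted(range(len(probs)), key=lambda i: biased[i], reverse=True)[:top_k]
    # Weights come from the pre-bias probabilities, then renormalize and scale.
    picked = [probs[i] for i in chosen]
    denom = sum(picked)
    weights = [scaling_factor * p / denom for p in picked]
    return chosen, weights


# Expert 2 has the lowest score but a large bias, so it is still selected.
expert_ids, weights = route(
    logits=[0.0, 1.0, -1.0, 2.0],
    router_bias=[0.0, 0.0, 2.0, 0.0],
    top_k=2,
)
```

Here expert 2 wins selection because of its bias, yet its routing weight stays smaller than expert 3's because the weights ignore the bias. The weights sum to the scaling factor rather than to 1.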