23 changes: 19 additions & 4 deletions .claude/skills/add_hf_model/SKILL.md
@@ -80,6 +80,18 @@ attention / router / layer onto the matching one.
- Also: rope, rms_norm, etc.
- **ops layer** (`xtuner/v1/ops`) — kernels such as attention and rms_norm.

> **Caveat — don't reach for deprecated config classes.** `RopeScalingConfig` is
> **deprecated**; `RopeParametersConfig` (`xtuner/v1/module/rope/rope.py`, re-exported from
> `xtuner/v1/model/base.py`) is the source of truth — use it everywhere (your IDE/pyright will flag
> the deprecated one). Note the decoder-layer / `MHAConfig.build` signatures still *type* their rope
> argument as `RopeScalingConfig` for backward compatibility, so don't satisfy them by constructing
> the deprecated class. When a per-layer value is only needed to select **one module behavior** (e.g.
> `partial_rotary_factor` only chooses which `apply_rotary_emb` the attention uses), set that behavior
> **directly on the module** instead of threading a config through `build` — e.g. in your model's
> decoder layer, `self.self_attn.apply_rotary_emb = get_apply_rotary_emb(None,
> enable_partial_rotary=...)` (`xtuner/v1/ops`). This keeps per-layer behavior contained (the §C
> per-profile-RoPE pattern) and avoids the deprecated API entirely.
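
To make the `partial_rotary_factor` remark concrete, the sketch below shows what a partial-rotary apply does to a single head vector. It assumes the common rotate-half convention; `apply_partial_rotary` is a hypothetical name for illustration and is not XTuner's `get_apply_rotary_emb`, which operates on tensors and is not reproduced here.

```python
import math


def apply_partial_rotary(x, position, rot_dim, theta=10000.0):
    """Rotate the first `rot_dim` channels of one head vector; pass the rest through.

    Hypothetical helper (assumption): rotate-half pairing, i.e. channel i is
    paired with channel i + rot_dim // 2.
    """
    if rot_dim % 2 or rot_dim > len(x):
        raise ValueError("rot_dim must be even and no larger than head_dim")
    half = rot_dim // 2
    out = list(x)  # channels >= rot_dim are copied unchanged
    for i in range(half):
        angle = position * theta ** (-2.0 * i / rot_dim)
        c, s = math.cos(angle), math.sin(angle)
        a, b = x[i], x[i + half]
        out[i] = a * c - b * s
        out[i + half] = b * c + a * s
    return out


# head_dim 4 with a rotary factor of 0.5: only the first 2 channels rotate.
rotated = apply_partial_rotary([1.0, 2.0, 3.0, 4.0], position=3, rot_dim=2)
```

With a factor of 1.0 the same function rotates every channel, which is why the factor only needs to pick between two apply behaviors.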

### Existing models to copy from

Pick the one whose attention + (router) match yours; the closer it is, the
@@ -527,10 +539,13 @@ FSDP shard/reduce chain — and the file doubles as the example users copy when
model into their own training pipeline. Mirror `ci/config/qwen3_moe_30BA3.py` (MoE) or
`ci/config/qwen3_dense.py` (dense), keeping its structure: one `<NewSizeConfig>()` (from §3.2)
fed into a `TrainerConfig` alongside `optim_cfg` / `lr_cfg` / `fsdp_cfg` / `dataset_cfg` /
`dataloader_cfg` / `loss_cfg`. `load_from` and `tokenizer_path` read from an env var (§7.4 —
typically the same one as the parity test). Verify by running ~50 steps and confirming the
loss drops monotonically into a plausible range for that model size; record the trajectory
in the PR body alongside the §6 convergence trace.
`dataloader_cfg` / `loss_cfg`. Set `loss_cfg = CELossConfig(mode="chunk")` — the chunked
cross-entropy keeps the `logits → loss` peak memory bounded (it never materializes the full
`(seq, vocab)` logits), which matters for the large-vocab models this skill targets; do **not** leave
it on the `"eager"` default. `load_from` and `tokenizer_path` read from an env var (§7.4 — typically
the same one as the parity test). Verify by running ~50 steps and confirming the loss drops
monotonically into a plausible range for that model size; record the trajectory in the PR body
alongside the §6 convergence trace.
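
As an illustration of why chunking bounds peak memory, here is a minimal pure-Python sketch of the idea: project only a few tokens to the vocabulary at a time and accumulate the loss. The function names are hypothetical, and this does not show how `CELossConfig(mode="chunk")` is implemented inside XTuner, only the technique it is described as using.

```python
import math


def log_softmax_nll(logits, target):
    """Negative log-likelihood of `target` under softmax(logits), computed stably."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return lse - logits[target]


def project(hidden_row, lm_head):
    """Logits for one token; lm_head has shape (vocab, hidden)."""
    return [sum(w * h for w, h in zip(row, hidden_row)) for row in lm_head]


def chunked_ce(hidden, targets, lm_head, chunk_size):
    """Mean cross-entropy, materializing at most chunk_size x vocab logits at once."""
    total = 0.0
    for start in range(0, len(hidden), chunk_size):
        chunk_logits = [project(h, lm_head) for h in hidden[start:start + chunk_size]]
        for logits, t in zip(chunk_logits, targets[start:start + chunk_size]):
            total += log_softmax_nll(logits, t)
        # chunk_logits is rebound on the next iteration, so earlier chunks can be freed.
    return total / len(hidden)
```

Setting `chunk_size` to the sequence length recovers the unchunked computation with the same result, so the chunk size trades peak memory against the number of projection calls.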

---

124 changes: 124 additions & 0 deletions .dev_scripts/convert_step3p5_to_split.py
@@ -0,0 +1,124 @@
"""Convert a Step-3.5-Flash HF checkpoint to a *split* per-expert layout.

The released checkpoint stores each MoE layer's experts as three fused 3-D tensors
(`moe.gate_proj/up_proj/down_proj.weight`, shape `(num_experts, *, *)`). XTuner fuses experts
expert-major-interleaved for its grouped GEMM, and the load/save path can only shard a fused
parameter when each HF key is a *contiguous* slice of it. A single fused weight that maps to two
separate HF tensors (gate, up) therefore cannot be sharded across ranks.

This script explodes the fused expert tensors into per-expert 2-D tensors under
`moe.experts.{i}.{gate,up,down}_proj.weight` (Qwen3-MoE style). With that layout XTuner's
`to_hf_key_list` emits the interleaved key order `[gate_0, up_0, gate_1, up_1, ...]`, which lines up
with its expert-major fused weight, so the default sharded load/save works on any number of
GPUs (FSDP and EP) with no per-model checkpoint code.

The MTP layers (`model.layers.45..47`) are dropped — XTuner does not load them.

Usage:
python .dev_scripts/convert_step3p5_to_split.py <src_hf_dir> <dst_dir>
"""

import json
import re
import shutil
import sys
from pathlib import Path

import torch
from safetensors import safe_open
from safetensors.torch import save_file


SHARD_BYTES = 4 * 1024**3  # target size per output shard: 4 GiB
# Side files needed so the converted dir is loadable (AutoConfig trust_remote_code + tokenizer).
AUX_FILES = [
    "config.json",
    "configuration_step3p5.py",
    "modeling_step3p5.py",
    "tokenizer.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
    "chat_template.jinja",
    "generation_config.json",
]
DROP_LAYER_RE = re.compile(r"^model\.layers\.(4[5-9]|[5-9]\d)\.") # MTP / out-of-range layers
EXPERT_RE = re.compile(r"^(model\.layers\.\d+)\.moe\.(gate_proj|up_proj|down_proj)\.weight$")


def _iter_converted_tensors(src: Path):
    """Yield (new_key, tensor) for every kept tensor, exploding fused experts into per-expert keys."""
    index = json.loads((src / "model.safetensors.index.json").read_text())
    weight_map = index["weight_map"]
    # Group keys by source shard so each shard is opened once.
    by_file: dict[str, list[str]] = {}
    for key, fname in weight_map.items():
        by_file.setdefault(fname, []).append(key)

    for fname in sorted(by_file):
        with safe_open(str(src / fname), framework="pt") as f:
            for key in by_file[fname]:
                if DROP_LAYER_RE.match(key):
                    continue
                tensor = f.get_tensor(key)
                m = EXPERT_RE.match(key)
                if m is None:
                    yield key, tensor
                    continue
                prefix, proj = m.group(1), m.group(2)
                # gate/up: (n, inter, hidden) -> per expert (inter, hidden)
                # down: (n, hidden, inter) -> per expert (hidden, inter)
                for i in range(tensor.shape[0]):
                    yield f"{prefix}.moe.experts.{i}.{proj}.weight", tensor[i].contiguous()


def convert(src: Path, dst: Path) -> None:
    dst.mkdir(parents=True, exist_ok=True)
    weight_map: dict[str, str] = {}
    buffer: dict[str, torch.Tensor] = {}
    buffer_bytes = 0
    shard_idx = 1
    shards: list[tuple[str, dict[str, torch.Tensor]]] = []

    def flush():
        nonlocal buffer, buffer_bytes, shard_idx
        if not buffer:
            return
        name = f"model-{shard_idx:05d}.safetensors"
        shards.append((name, buffer))
        for k in buffer:
            weight_map[k] = name
        buffer = {}
        buffer_bytes = 0
        shard_idx += 1

    for key, tensor in _iter_converted_tensors(src):
        buffer[key] = tensor
        buffer_bytes += tensor.numel() * tensor.element_size()
        if buffer_bytes >= SHARD_BYTES:
            flush()
    flush()

    total = sum(t.numel() * t.element_size() for _, b in shards for t in b.values())
    n_shards = len(shards)
    renamed: list[tuple[str, dict[str, torch.Tensor]]] = []
    for i, (_, buf) in enumerate(shards, start=1):
        final = f"model-{i:05d}-of-{n_shards:05d}.safetensors"
        for k in buf:
            weight_map[k] = final
        renamed.append((final, buf))
    for name, buf in renamed:
        save_file(buf, str(dst / name), metadata={"format": "pt"})
        print(f" wrote {name} ({len(buf)} tensors)")

    (dst / "model.safetensors.index.json").write_text(
        json.dumps({"metadata": {"total_size": total}, "weight_map": weight_map}, indent=2)
    )
    for aux in AUX_FILES:
        srcf = src / aux
        if srcf.exists():
            shutil.copy2(srcf, dst / aux)
    print(f"done: {len(weight_map)} tensors across {n_shards} shards, {total / 1024**3:.1f} GiB -> {dst}")


if __name__ == "__main__":
    convert(Path(sys.argv[1]), Path(sys.argv[2]))
61 changes: 61 additions & 0 deletions ci/config/step3p5.py
@@ -0,0 +1,61 @@
import os

from xtuner.v1.config import (
    AdamWConfig,
    FSDPConfig,
    LRConfig,
)
from xtuner.v1.datasets import FTDPTokenizeFnConfig
from xtuner.v1.datasets.config import DataloaderConfig, DatasetConfig
from xtuner.v1.loss.ce_loss import CELossConfig
from xtuner.v1.model.moe.step3p5 import Step3p5FlashConfig
from xtuner.v1.train import TrainerConfig


# Point STEP3P5_PATH at the split / per-expert checkpoint produced by
# `.dev_scripts/convert_step3p5_to_split.py` (the released fused-expert layout cannot be sharded).
STEP3P5_PATH = os.environ["STEP3P5_PATH"]
ALPACA_PATH = os.environ["ALPACA_PATH"]


# Step-3.5-Flash is a ~200B MoE; real training needs expert parallelism (and a multi-node cluster).
moe_cfg = Step3p5FlashConfig(ep_size=8, dispatcher="all2all", num_hidden_layers=4)
optim_cfg = AdamWConfig(lr=6e-05)
lr_cfg = LRConfig(lr_type="cosine", lr_min=1e-6)
fsdp_cfg = FSDPConfig(
    # torch.compile for the hybrid per-layer-RoPE decoder layers is a §8 optimization; keep eager here.
    torch_compile=False,
    cpu_offload=False,
    ep_size=moe_cfg.ep_size,
)

dataset_config = [
    {
        "dataset": DatasetConfig(name="alpaca", anno_path=ALPACA_PATH, sample_ratio=1.0),
        "tokenize_fn": FTDPTokenizeFnConfig(max_length=16386),
    },
]

dataloader_config = DataloaderConfig(pack_max_length=16384)

# Chunked cross-entropy keeps the logits->loss peak memory bounded (never materializes the full
# (seq, vocab) logits) — important for Step-3.5's 128896-token vocab.
loss_cfg = CELossConfig(mode="chunk")


trainer = TrainerConfig(
    load_from=STEP3P5_PATH,
    model_cfg=moe_cfg,
    optim_cfg=optim_cfg,
    fsdp_cfg=fsdp_cfg,
    dataset_cfg=dataset_config,
    dataloader_cfg=dataloader_config,
    lr_cfg=lr_cfg,
    loss_cfg=loss_cfg,
    tokenizer_path=STEP3P5_PATH,
    global_batch_size=16,
    total_step=1000000,
    work_dir="/tmp/step3p5",
    seed=0,
    strict_load=False,
)
147 changes: 147 additions & 0 deletions docs/design/model/step3p5.md
@@ -0,0 +1,147 @@
# Step-3.5-Flash → XTuner integration design

Source: `stepfun-ai/Step-3.5-Flash`, `model_type = "step3p5"`, remote-code
(`modeling_step3p5.py` / `configuration_step3p5.py`, `architectures = ["Step3p5ForCausalLM"]`).
bf16 checkpoint, training-suitable.

Bucket: **MoE LLM, trust_remote_code** (no built-in `transformers` config class →
`hf_config` returns `None`; `save_hf` copies `config.json` / tokenizer / `*.py`).

## 1. Architecture summary (from the HF modeling code)

- 45 transformer layers. Layers 0–2 are **dense** MLP; layers 3–44 are **MoE**.
- Vocab 128896, hidden 4096, untied `lm_head` (separate tensor in the index).
- MoE: 288 routed experts, top-k 8, `moe_intermediate_size` 1280, plus **one shared
expert** (`share_expert_dim` 1280). Experts stored as **fused 3-D** tensors
`(num_experts, out, in)` — `moe.{gate_proj,up_proj}.weight (288,1280,4096)`,
`moe.down_proj.weight (288,4096,1280)`.
- Router: **sigmoid** activation, per-expert **`router_bias`** added before top-k,
weights gathered from the **pre-bias** probabilities, renormalized, then scaled by
`moe_router_scaling_factor = 3.0`; router logits computed in **fp32**
(`need_fp32_gate`).
- RMSNorm is **zero-centered** (scale = `weight + 1`), eps 1e-5, throughout.
- Attention is a **hybrid of two softmax-attention profiles keyed by `layer_types`**:
  - `full_attention` (every 4th layer, idx % 4 == 0): 64 heads, 8 KV heads, head_dim 128.
  - `sliding_attention` (the other layers): **96 heads**, 8 KV heads, head_dim 128,
    sliding window 512.
  - Both profiles: `qk_norm` on head_dim (zero-centered), and a **head-wise output
    gate**: a separate `g_proj: Linear(hidden, num_heads)` whose per-head sigmoid
    multiplies the attention output before `o_proj`.
- RoPE differs **per profile**:
  - full_attention: `rope_theta = 5e6`, `partial_rotary_factor = 0.5`, **llama3**
    scaling (`yarn_only_types = ["full_attention"]`).
  - sliding_attention: `rope_theta = 1e4`, `partial_rotary_factor = 1.0`, default rope
    (no scaling).
- swiglu clamp limits on a few late layers only (MoE layers 43,44 → 7; shared expert
layer 44 → 16; all others 0/None).
- MTP: `num_nextn_predict_layers = 3`, stored as `model.layers.45/46/47.*` and ignored
on load by HF. We **drop MTP** for the port (load layers 0–44, `strict=False`).
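
The router bullet above packs several steps into one sentence, so a small pure-Python sketch may help. It assumes the bias is added to the sigmoid scores (not the raw logits) and is used for selection only, which is how the summary reads; `route` is a hypothetical name and this is not the HF or XTuner implementation.

```python
import math


def route(logits, router_bias, top_k, scaling_factor=3.0):
    """Return (expert indices, routing weights) for one token.

    Assumed order of operations: sigmoid -> add bias for selection ->
    top-k -> gather pre-bias scores -> renormalize -> scale.
    """
    probs = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    biased = [p + b for p, b in zip(probs, router_bias)]
    chosen = sorted(range(len(probs)), key=lambda i: biased[i], reverse=True)[:top_k]
    picked = [probs[i] for i in chosen]  # weights come from the pre-bias scores
    denom = sum(picked)
    weights = [scaling_factor * p / denom for p in picked]
    return chosen, weights


# Expert 2 has the lowest score but a large bias, so it is still selected.
experts, weights = route([2.0, 0.0, -1.0, 1.0], [0.0, 0.0, 5.0, 0.0], top_k=2)
```

Because the weights are gathered before the bias is applied, the bias can steer load balance without changing how much each selected expert contributes relative to the others.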

## 2. Mapping to XTuner — what is reused vs. new

Reused as-is (config wiring only):

| Feature | XTuner mechanism |
|---|---|
| first 3 dense + rest MoE | `MoEConfig.first_k_dense_replace = 3` |
| 288 experts / top-8 / shared expert | `n_routed_experts`, `num_experts_per_tok`, `n_shared_experts = 1` |
| expert tensors | **split / per-expert checkpoint** (`.dev_scripts/convert_step3p5_to_split.py`) → `to_hf_key_list` emits interleaved `[gate_i, up_i, …]` (Qwen3-MoE style); default load/save shards on FSDP+EP. See "Expert layout" below. |
| sigmoid + router_bias + renorm + scale 3.0 + fp32 gate | `NoAuxRouterConfig(scoring_func="sigmoid", n_group=1, topk_group=1, norm_topk_prob=True, router_scaling_factor=3.0)` + `router_compute_dtype="float32"` (its math matches HF `router_bias_func`) |
| zero-centered RMSNorm, qk_norm | `rms_norm_type="zero_centered"`, `MHAConfig.qk_norm=True` |
| sliding window in training | `MHAConfig.sliding_window` + `layer_type="sliding_attention"` |
| partial rotary | `RopeParametersConfig.partial_rotary_factor` (already supported) |
| MTP | `mtp_config=None`, load `strict=False` |
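
For the zero-centered RMSNorm row, the convention from §1 (scale = `weight + 1`, eps 1e-5) can be sketched in a few lines. This is an illustration only: the function name is hypothetical, and placing eps inside the square root is the usual RMSNorm convention, assumed here.

```python
import math


def zero_centered_rms_norm(x, weight, eps=1e-5):
    """RMS-normalize x, then scale by (weight + 1) so a zero weight is the identity scale."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * (w + 1.0) for v, w in zip(x, weight)]


# A zero-initialized weight behaves like a plain RMSNorm with unit scale.
normed = zero_centered_rms_norm([3.0, 4.0], [0.0, 0.0])
```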

Three things the current design **cannot express** — these are the design forks:

### Fork A — head-wise attention gate (new MHA option)

XTuner's existing `MHAConfig.with_gate` is a **per-(head,dim) element** gate fused into a
doubled `q_proj` (Qwen3.5 / gpt-oss style). Step-3.5 uses a **separate `g_proj` of shape
`(num_heads, hidden)`** producing **one scalar per head**. Different weight layout and
different broadcast. Proposal: add a new, general option to `MHAConfig`
(`head_gate: bool = False`) that builds `self.g_proj = Linear(hidden, num_heads,
bias=False)` and applies `out.view(...,H,Dh) * g.sigmoid().unsqueeze(-1)` before `o_proj`.
HF `self_attn.g_proj.weight` maps 1→1 to xtuner `self_attn.g_proj.weight`. This is a small,
self-contained addition at the module layer; its own commit.
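
A minimal pure-Python sketch of the head-wise broadcast described above, for a single token. `head_gate` here is a hypothetical free function written with nested lists; the proposed `MHAConfig.head_gate` option would do the same thing on tensors.

```python
import math


def head_gate(attn_out, hidden, g_proj_weight):
    """Apply one sigmoid gate per head.

    attn_out: [num_heads][head_dim], hidden: [hidden_size],
    g_proj_weight: [num_heads][hidden_size] (the layout of a Linear(hidden, num_heads)).
    """
    gated = []
    for head_out, w_row in zip(attn_out, g_proj_weight):
        logit = sum(w * h for w, h in zip(w_row, hidden))  # one scalar per head
        gate = 1.0 / (1.0 + math.exp(-logit))
        gated.append([gate * v for v in head_out])  # broadcast over head_dim
    return gated


# Two heads, head_dim 2: head 0 gets gate 0.5, head 1 a gate close to 1.
out = head_gate([[2.0, 4.0], [1.0, 3.0]], [1.0, -1.0], [[0.0, 0.0], [100.0, 0.0]])
```

The contrast with `with_gate` is the shape of the gate: here it is `[num_heads]`, where an element-wise gate would be `[num_heads][head_dim]`.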

### Fork B — two attention profiles with different head counts (CONFIRMED)

Today a model has a single `attention: MHAConfig` shared by all full/sliding layers
(`linear_attention` is GatedDeltaNet-only). Step-3.5 needs **full=64 heads,
sliding=96 heads**. Decision (user): the existing `layers_type` mechanism already supports
mixed attention; add a `sliding_attention: MHAConfig` field on the **Step-3.5 config
only** and **override `build_layers`** in the Step-3.5 model to select the per-layer
attention config by `layers_type`. No change to `base.py` / `MoEConfig`.
(`num_attention_heads` etc. computed fields keep reading the `full` profile — fine for
training; kv-cache / generate with mixed head counts is out of scope for the baseline.)
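
The per-layer selection can be pictured with the sketch below. `AttnProfile`, `layer_types`, and `profile_for` are hypothetical stand-ins for `MHAConfig` and the `build_layers` override; the head counts and the `idx % 4 == 0` rule come from §1, although the real model would read the layer types from the config instead of recomputing them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AttnProfile:  # hypothetical stand-in for an MHAConfig
    num_heads: int
    num_kv_heads: int
    head_dim: int
    sliding_window: Optional[int] = None


FULL = AttnProfile(num_heads=64, num_kv_heads=8, head_dim=128)
SLIDING = AttnProfile(num_heads=96, num_kv_heads=8, head_dim=128, sliding_window=512)


def layer_types(num_layers):
    """Every 4th layer is full attention; the rest use the sliding profile."""
    return ["full_attention" if i % 4 == 0 else "sliding_attention" for i in range(num_layers)]


def profile_for(layer_type):
    """Pick the attention profile for one layer, as a build_layers override might."""
    return {"full_attention": FULL, "sliding_attention": SLIDING}[layer_type]


def q_proj_out_features(profile):
    """Output width of q_proj, which differs between the two profiles."""
    return profile.num_heads * profile.head_dim
```

The differing `q_proj` widths are the reason a single shared attention config cannot describe both layer kinds.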

### Fork C — per-profile RoPE (CONFIRMED — contain in Step-3.5 decoder layer)

Today the model builds one `self.rotary_emb` and passes one `(cos, sin)` to every layer.
Step-3.5 needs different `(theta, partial_rotary, scaling)` for full vs sliding layers.
Decision (user): keep it **inside the Step-3.5 decoder layer** for now; generalize later
once precision is aligned. Each Step-3.5 decoder layer holds its own
`Step3p5RotaryEmbedding` (full: theta 5e6 / partial 0.5 / **llama3** scaling; sliding:
theta 1e4 / partial 1.0 / default) built faithfully via HF's
`ROPE_INIT_FUNCTIONS[rope_type]` on a per-profile shim so inv_freq matches HF bitwise. Its
`forward` recomputes `position_embeddings` from `seq_ctx.position_ids` and passes them to
`self.self_attn`, **ignoring** the model-level `(cos,sin)` (which stays valid but unused).
No change to shared `MoE.forward` or `MultiHeadAttention.forward`. The per-layer partial-rotary
apply is set **directly on the attention** in the decoder-layer `__init__`
(`self.self_attn.apply_rotary_emb = get_apply_rotary_emb(None, enable_partial_rotary=…)`) rather than
threading a rope config through `build` — this keeps the per-layer RoPE fully contained and avoids the
deprecated `RopeScalingConfig` (whose only consumer in `MultiHeadAttention` is that apply selection).
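
To show how the two profiles end up with different frequency tables, here is a sketch of the default inverse-frequency formula applied per profile. It deliberately omits the llama3 scaling that the full-attention profile adds on top, and it is not the `ROPE_INIT_FUNCTIONS` code path; `inv_freq` and `ROPE_PROFILES` are hypothetical names.

```python
def inv_freq(head_dim, theta, partial_rotary_factor):
    """Default RoPE inverse frequencies over the rotated part of the head."""
    rot_dim = int(head_dim * partial_rotary_factor)
    return [theta ** (-2.0 * i / rot_dim) for i in range(rot_dim // 2)]


# Values from the architecture summary; llama3 scaling is not applied in this sketch.
ROPE_PROFILES = {
    "full_attention": {"theta": 5e6, "partial_rotary_factor": 0.5},
    "sliding_attention": {"theta": 1e4, "partial_rotary_factor": 1.0},
}

tables = {name: inv_freq(128, **params) for name, params in ROPE_PROFILES.items()}
```

With head_dim 128 the full profile rotates 64 channels and the sliding profile rotates all 128, so the two tables differ in length as well as in base, which is why one model-level `(cos, sin)` cannot serve both.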

### Swiglu clamp (CONFIRMED — include now)

Step-3.5 clamps `silu(gate).clamp(max=limit) * up.clamp(±limit)` on a few late layers
(routed experts L43/L44 → 7; shared expert L44 → 16; all others none). XTuner's existing
`clipped_swiglu` is gpt-oss-shaped (sigmoid-GLU + `(up+1)`) and does **not** match. Add a
new act variant `swiglu_clip` (silu + post-activation clamp) to `act_fn.py` /
`MoEActFnConfig`, and build a **per-layer** `MoEActFnConfig` (clip only where the config
lists a nonzero limit) in `build_layers`. For the shared expert clamp (L44), thread an
optional clamp limit into the shared-expert MLP. MTP (3 nextn layers) remains deferred
(load layers 0–44, `strict=False`).
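
A scalar sketch of the clamp formula quoted above, together with a per-layer limit lookup. The names `swiglu_clip` (as a free function), `ROUTED_LIMIT`, `SHARED_LIMIT`, and `limit_for` are hypothetical; the limits are the ones listed in this section, and a limit of 0 or `None` is treated as "no clamp".

```python
import math


def silu(x):
    """Numerically stable x * sigmoid(x)."""
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)
    return x * e / (1.0 + e)


def swiglu_clip(gate, up, limit=None):
    """silu(gate) clamped from above, times up clamped to [-limit, limit]."""
    act = silu(gate)
    if limit:  # 0 or None disables the clamp
        act = min(act, limit)
        up = max(-limit, min(up, limit))
    return act * up


# Per-layer limits from the design notes; every other layer has no clamp.
ROUTED_LIMIT = {43: 7.0, 44: 7.0}
SHARED_LIMIT = {44: 16.0}


def limit_for(layer_idx, shared=False):
    table = SHARED_LIMIT if shared else ROUTED_LIMIT
    return table.get(layer_idx)
```

Note the asymmetry: the activation is only clamped from above, while `up` is clamped on both sides.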

### Expert layout — split / per-expert checkpoint (CONFIRMED)

The released checkpoint stores each MoE layer's experts as **three fused 3-D tensors**
(`moe.{gate,up,down}_proj.weight`, `(num_experts, *, *)`). XTuner fuses experts
**expert-major-interleaved** (`[g0,u0,g1,u1,…]`, each expert's `[gate;up]` contiguous — required by the
grouped GEMM). XTuner's loader can only shard a fused parameter when each HF key is a *contiguous*
slice of it, so a single fused `w1w3` mapped to two HF tensors (`gate`, `up`) **cannot be sharded**:
a 2-GPU FSDP load was empirically confirmed to crash (`gate, up = safetensors` receives 1 tensor),
and `save_hf` corrupted gate/up (`(288,640,4096)` vs `(288,1280,4096)` — the FUSED save split runs
before `param_to_safetensor`). This is a real XTuner limitation, not specific to this model.

Decision (user): **convert the checkpoint to a split / per-expert layout** offline rather than change
the shared MoE block. `.dev_scripts/convert_step3p5_to_split.py` explodes each fused expert tensor
into `moe.experts.{i}.{gate,up,down}_proj.weight` (Qwen3-MoE style) and drops the unused MTP layers
(45–47). With that layout `to_hf_key_list` emits the interleaved key order `[gate_0, up_0, …]`, which
lines up with the expert-major fused weight, so the **default** `safetensors_to_params` (concat dim 0)
and the default save split both shard correctly on any number of GPUs (FSDP and EP) — **no per-model
checkpoint override and no MoE-block change**. Converted checkpoint:
`/mnt/shared-storage-user/llmrazor-share/yehaochen/model/Step-3.5-Flash-split`.
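
The contiguity argument can be checked with plain row arithmetic. The sketch below models the expert-major fused `[gate;up]` weight as row ranges only; the helper names are hypothetical, and the key format follows the per-expert layout described above.

```python
def interleaved_keys(prefix, num_experts):
    """Per-expert HF keys in the order [gate_0, up_0, gate_1, up_1, ...]."""
    keys = []
    for i in range(num_experts):
        keys.append(f"{prefix}.moe.experts.{i}.gate_proj.weight")
        keys.append(f"{prefix}.moe.experts.{i}.up_proj.weight")
    return keys


def fused_row_ranges(num_experts, inter):
    """Row range each per-expert tensor occupies in the expert-major fused weight."""
    ranges = {}
    for i in range(num_experts):
        base = i * 2 * inter
        ranges[("gate", i)] = (base, base + inter)
        ranges[("up", i)] = (base + inter, base + 2 * inter)
    return ranges


def is_contiguous(spans):
    """True if the spans, once sorted, tile one unbroken row range."""
    spans = sorted(spans)
    return all(a[1] == b[0] for a, b in zip(spans, spans[1:]))


ranges = fused_row_ranges(num_experts=3, inter=4)
# One fused gate tensor would have to cover every expert's gate rows at once:
fused_gate_ok = is_contiguous([ranges[("gate", i)] for i in range(3)])
# The interleaved per-expert keys tile the fused weight with no gaps:
split_ok = is_contiguous(list(ranges.values()))
```

Each per-expert key maps to a single row range, while the fused gate tensor maps to a strided set of ranges, which is the property the default sharded load/save depends on.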

## 3. File layout

- `xtuner/v1/model/moe/step3p5.py` — `Step3p5MoEConfig` (+ base) and `Step3p5MoE`
(`to_hf_key_list`, `safetensors_to_params`/`param_to_safetensor`, `build_layers`,
`build_rotary_embedding`, `hf_config -> None`), plus `Step3p5Attention` +
`Step3p5RotaryEmbedding` (or co-located in the module layer if cleaner).
- `xtuner/v1/module/attention/mha.py` — Fork A (`head_gate`) + Fork C (unpack hook).
- `xtuner/v1/model/__init__.py` — import / `model_mapping` alias / `get_model_config_from_hf`
dispatch on `model_type == "step3p5"` / `__all__`.
- `tests/model/test_step3p5_moe.py` — baseline tests (§6/§7 of the skill).
- `ci/config/step3p5.py` — drop-in training config.

## 4. Commit plan (stacked, dependency-ordered)

1. `[Feature]` MHA head-wise gate (`head_gate`) + position-embeddings unpack hook (Forks A & C seam).
2. `[Feature]` Step-3.5 model + config + registration (Forks B & C, router/expert mapping).
3. `[Feature]` `XTUNER_HF_IMPL` parity wiring if any new op branch is needed.
4. `[Test]` baseline tests (decoder-layer bitwise parity for one full + one sliding layer;
whole-model forward+backward parity at the deployable scale; save_hf round-trip).
5. `[CI]` drop-in training config + convergence trace.
6. Then §8 optimizations (EP/SP, compile, fp8, offload) — multi-agent, post-baseline.