pytest for Claude Skills. Prove a skill actually triggers the way you intended — and catch regressions when you edit it.
Stop testing your skills by vibes.
A Claude Skill is a SKILL.md with a YAML description that controls when it loads, plus a body that controls what it does. The fragile, untested part is the description -> trigger mapping: you write a description hoping Claude loads the skill on the right prompts and not the wrong ones. Today the only way to check is by hand — typing prompts and eyeballing whether the skill kicked in.
There's no deterministic way to assert:
- prompt X should load skill Y (true positive)
- prompt Z should not load skill Y (over-triggering)
- this skill behaves the same as before I edited it (regression)
That gap is what skillcheck fills.

    uvx skillcheck --help   # run without installing
    # or
    pipx install skillcheck

Runs locally with your own Claude API key: set ANTHROPIC_API_KEY in your environment (or as a GitHub secret in CI). There's no server and no shared key; trigger checks spend your tokens, not ours.
M0 ships a free, key-free heuristic engine so `init` and `run` work out of the box. The live Claude-API engine lands in M1.
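
How the M0 heuristic engine decides is not described here. Purely as an illustration of what a key-free check could look like, the sketch below scores content-word overlap between a prompt and a skill description. The function names, the stop-word list, the `min_overlap` threshold, and the sample description are all assumptions, not skillcheck's actual engine.

```python
import re

# Hypothetical stop-word list; a real engine would likely use a longer one.
STOP_WORDS = {
    "a", "an", "the", "this", "is", "it", "can", "you",
    "me", "and", "or", "for", "about", "to", "of",
}


def tokenize(text: str) -> set[str]:
    """Lowercase, split on non-alphanumerics, and drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {w for w in words if w not in STOP_WORDS}


def heuristic_should_load(prompt: str, description: str, min_overlap: int = 2) -> bool:
    """Guess 'load' when prompt and description share enough content words."""
    shared = tokenize(prompt) & tokenize(description)
    return len(shared) >= min_overlap


# Hypothetical description, standing in for the YAML description in SKILL.md.
description = "Optimize slow Postgres queries and indexes"

print(heuristic_should_load("this query is slow, can you optimize it", description))
print(heuristic_should_load("write me a haiku about the ocean", description))
```

A word-overlap check like this needs no API key, but it has no notion of meaning, which is presumably why a live model-backed engine is planned.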

    skillcheck init   # discover skills, scaffold a starter skillcheck.yaml
    skillcheck run    # run all suites, print a pass/fail summary

Write a skillcheck.yaml next to your skills:

    skill: postgres-perf
    path: ./skills/postgres-perf   # dir containing SKILL.md
    triggers:
      - prompt: "this query is slow, can you optimize it"
        expect: load               # skill SHOULD load
      - prompt: "optimize this Postgres index"
        expect: load
      - prompt: "write me a haiku about the ocean"
        expect: skip               # must NOT load (guards over-triggering)

Then `skillcheck run`:

    postgres-perf
      PASS  trigger   "this query is slow..."         load (0.42s)
      PASS  trigger   "optimize this Postgres index"  load (0.39s)
      FAIL  trigger   "write me a haiku..."           loaded but expected skip
      PASS  behavior  orders_query.txt                snapshot match
    1 failed, 3 passed — description likely over-triggers (see: skillcheck score postgres-perf)

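
The summary line can be read as a roll-up of the per-case results. The sketch below shows one way such a roll-up might work; `CaseResult`, `summarize`, and the rule that produces the over-triggering hint are hypothetical and only mirror the sample output above.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    kind: str      # "trigger" or "behavior"
    expected: str  # "load", "skip", or "match" for snapshots
    observed: str  # what the run actually saw


def summarize(results: list[CaseResult]) -> str:
    """Count failures and attach a hint when a 'skip' case loaded anyway."""
    failed = [r for r in results if r.observed != r.expected]
    passed = len(results) - len(failed)
    line = f"{len(failed)} failed, {passed} passed"
    # A skill that loads when "skip" was expected suggests an over-broad description.
    over_triggered = any(
        r.kind == "trigger" and r.expected == "skip" and r.observed == "load"
        for r in failed
    )
    if over_triggered:
        line += " (description likely over-triggers)"
    return line


results = [
    CaseResult("trigger", "load", "load"),
    CaseResult("trigger", "load", "load"),
    CaseResult("trigger", "skip", "load"),
    CaseResult("behavior", "match", "match"),
]
print(summarize(results))
```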
| Command | What it does |
|---|---|
| `skillcheck init` | Discover skills, scaffold a starter skillcheck.yaml. |
| `skillcheck run` | Run all suites, print a pass/fail summary. |
| `skillcheck run --update-snapshots` | Re-baseline behavior snapshots. |
| `skillcheck score <skill>` | Rate description ambiguity, suggest fixes. |
| `skillcheck report --json` | Machine-readable output for CI. |
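
The `--update-snapshots` flag suggests the usual snapshot-testing pattern: compare fresh output with a stored baseline, and overwrite the baseline only when asked. The sketch below shows that general pattern with plain text files. `check_snapshot`, the decision to record a baseline on first run, and the sample strings are assumptions; the structured snapshots planned for skillcheck may work differently.

```python
import tempfile
from pathlib import Path


def check_snapshot(path: Path, actual: str, update: bool = False) -> bool:
    """Return True when `actual` matches the stored snapshot.

    With update=True, or when no snapshot exists yet (an assumption of this
    sketch), the baseline is written and the check passes.
    """
    if update or not path.exists():
        path.write_text(actual, encoding="utf-8")
        return True
    return path.read_text(encoding="utf-8") == actual


with tempfile.TemporaryDirectory() as tmp:
    snap = Path(tmp) / "orders_query.txt"
    # First run records the baseline.
    print(check_snapshot(snap, "add an index on orders.created_at"))
    # Changed output is reported as a mismatch.
    print(check_snapshot(snap, "rewrite the query as a CTE"))
    # A re-baseline run accepts the new output.
    print(check_snapshot(snap, "rewrite the query as a CTE", update=True))
```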
Claude Skills use progressive disclosure: a skill's name + description always sit in context, and the model decides to read the full SKILL.md when the task calls for it. That decision is what skillcheck tests.
skillcheck builds a minimal, controlled context (your skill + optional decoys), sends the test prompt to the Claude API, and observes whether the skill is selected. This is an API-reproduced proxy. It is deliberately isolated, and that isolation keeps the inputs identical from run to run, which is the property a regression tool needs.
It is not a byte-identical replay of what Claude Code or Cursor decided in your repo, where the full host system prompt and every other tool/skill are present. We say so plainly because a technical audience deserves it — and because honest framing is more credible than an overclaim.
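
The controlled-context idea can be sketched offline. In the example below, the context exposes only each skill's name and description, and the model call is replaced by an injected `select` function so the code runs without an API key. `Skill`, `build_context`, `skill_loaded`, `fake_select`, the context wording, and the decoy skill are all hypothetical; the real tool's prompt format and selection mechanism are not specified here.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Skill:
    name: str
    description: str


def build_context(target: Skill, decoys: list[Skill]) -> str:
    """Render only name + description per skill, as progressive disclosure exposes them."""
    lines = ["Available skills:"]
    for skill in [target, *decoys]:
        lines.append(f"- {skill.name}: {skill.description}")
    return "\n".join(lines)


# A selector takes (context, prompt) and returns the chosen skill name, or None.
Selector = Callable[[str, str], Optional[str]]


def skill_loaded(target: Skill, decoys: list[Skill], prompt: str, select: Selector) -> bool:
    """True when the selector picks the target skill for this prompt."""
    context = build_context(target, decoys)
    return select(context, prompt) == target.name


def fake_select(context: str, prompt: str) -> Optional[str]:
    """Stand-in for the model call so the sketch runs without network access."""
    return "postgres-perf" if "query" in prompt.lower() else None


target = Skill("postgres-perf", "Optimize slow Postgres queries and indexes")
decoys = [Skill("haiku-writer", "Write short poems on request")]

print(build_context(target, decoys))
print(skill_loaded(target, decoys, "this query is slow, can you optimize it", fake_select))
print(skill_loaded(target, decoys, "write me a haiku about the ocean", fake_select))
```

Injecting the selector keeps the context-building logic testable on its own; a live engine would swap `fake_select` for a function that calls the API.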
Because a single model call is not deterministic, skillcheck runs each assertion N times against a pinned model and reports a pass rate, rather than relying on temperature=0 (temperature is removed on current models). Genuinely ambiguous prompts are surfaced as a signal ("this is a 60/40 trigger"), not forced into a false green/red.
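
As a rough illustration of the N-run idea, the sketch below repeats a check, computes the pass rate, and labels the middle band as ambiguous instead of forcing a verdict. The function names and the 0.9 and 0.1 thresholds are assumptions chosen for the example, not skillcheck's actual cutoffs.

```python
from typing import Callable


def pass_rate(check: Callable[[], bool], runs: int) -> float:
    """Run `check` the given number of times and return the fraction that passed."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    passes = sum(1 for _ in range(runs) if check())
    return passes / runs


def classify(rate: float, pass_at: float = 0.9, fail_at: float = 0.1) -> str:
    """Map a pass rate to pass, fail, or an explicit ambiguity signal."""
    if rate >= pass_at:
        return "pass"
    if rate <= fail_at:
        return "fail"
    hit = round(rate * 100)
    return f"ambiguous ({hit}/{100 - hit} trigger)"


# Scripted outcomes stand in for ten live runs of one assertion.
outcomes = iter([True, True, False, True, False, True, True, False, True, False])
rate = pass_rate(lambda: next(outcomes), runs=10)
print(rate, classify(rate))
```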
A GitHub Action (M4) runs your suites on every push, so a description edit can't silently break triggering. Gate merges on a green run and keep your skills honest as they evolve.
Pre-build. Following the milestones in skillcheck-design.md:
- M0: Skeleton. Package layout, `init`, and `run` with a heuristic engine. (current)
- M1: Live Claude-API trigger evaluation with N-run pass rates + caching.
- M2: Structured behavior snapshots.
- M3: Description ambiguity scoring.
- M4: GitHub Action + status check.
- M5: Collision testing (decoy skills detect description overlap).
skillcheck lives in the correctness lane. Tools like SkillSpector check whether a skill is malicious; skillcheck checks whether it works and still works after you edit it. Different lanes, same toolbox.