skillcheck

pytest for Claude Skills. Prove a skill actually triggers the way you intended — and catch regressions when you edit it.

Stop testing your skills by vibes.

The problem

A Claude Skill is a SKILL.md with a YAML description that controls when it loads, plus a body that controls what it does. The fragile, untested part is the description -> trigger mapping: you write a description hoping Claude loads the skill on the right prompts and not the wrong ones. Today the only way to check is by hand — typing prompts and eyeballing whether the skill kicked in.

There's no deterministic way to assert:

prompt X should load skill Y (true positive)
prompt Z should not load skill Y (over-triggering)
this skill behaves the same as before I edited it (regression)

That gap is what skillcheck fills.

Install

uvx skillcheck --help        # run without installing
# or
pipx install skillcheck

Runs locally with your own Claude API key — set ANTHROPIC_API_KEY in your environment (or as a GitHub secret in CI). There's no server and no shared key; trigger checks spend your tokens, not ours.

M0 ships a free, key-free heuristic engine so init and run work out of the box. The live Claude-API engine lands in M1.

Quickstart

skillcheck init        # discover skills, scaffold a starter skillcheck.yaml
skillcheck run         # run all suites, print a pass/fail summary

Write a skillcheck.yaml next to your skills:

skill: postgres-perf
path: ./skills/postgres-perf        # dir containing SKILL.md

triggers:
  - prompt: "this query is slow, can you optimize it"
    expect: load                    # skill SHOULD load
  - prompt: "optimize this Postgres index"
    expect: load
  - prompt: "write me a haiku about the ocean"
    expect: skip                    # must NOT load (guards over-triggering)

Then skillcheck run:

postgres-perf
  PASS  trigger  "this query is slow..."        load   (0.42s)
  PASS  trigger  "optimize this Postgres index" load   (0.39s)
  FAIL  trigger  "write me a haiku..."          loaded but expected skip
  PASS  behavior orders_query.txt               snapshot match

1 failed, 3 passed  —  description likely over-triggers (see: skillcheck score postgres-perf)

Commands

Command	What it does
`skillcheck init`	Discover skills, scaffold a starter `skillcheck.yaml`.
`skillcheck run`	Run all suites, print a pass/fail summary.
`skillcheck run --update-snapshots`	Re-baseline behavior snapshots.
`skillcheck score <skill>`	Rate description ambiguity, suggest fixes.
`skillcheck report --json`	Machine-readable output for CI.

How trigger evaluation works (honestly)

Claude Skills use progressive disclosure: a skill's name + description always sit in context, and the model decides to read the full SKILL.md when the task calls for it. That decision is what skillcheck tests.

skillcheck builds a minimal, controlled context (your skill + optional decoys), sends the test prompt to the Claude API, and observes whether the skill is selected. This is an API-reproduced proxy — deliberately isolated, which is exactly what makes it deterministic and reproducible (the property a regression tool needs).

It is not a byte-identical replay of what Claude Code or Cursor decided in your repo, where the full host system prompt and every other tool/skill are present. We say so plainly because a technical audience deserves it — and because honest framing is more credible than an overclaim.

Determinism comes from running each assertion N times and reporting a pass rate against a pinned model, not from temperature=0 (temperature is removed on current models). Genuinely ambiguous prompts are surfaced as a signal ("this is a 60/40 trigger"), not forced into a false green/red.

Use it in CI

A GitHub Action (M4) runs your suites on every push, so a description edit can't silently break triggering. Gate merges on a green run and keep your skills honest as they evolve.

Status

Pre-build. Following the milestones in skillcheck-design.md:

M0 — Skeleton: package layout, init, run with a heuristic engine. (current)
M1 — Live Claude-API trigger evaluation with N-run pass rates + caching.
M2 — Structured behavior snapshots.
M3 — Description ambiguity scoring.
M4 — GitHub Action + status check.
M5 — Collision testing (decoy skills detect description overlap).

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
examples		examples
src/skillcheck		src/skillcheck
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml
skillcheck-design.md		skillcheck-design.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

skillcheck

The problem

Install

Quickstart

Commands

How trigger evaluation works (honestly)

Use it in CI

Status

Related

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

skillcheck

The problem

Install

Quickstart

Commands

How trigger evaluation works (honestly)

Use it in CI

Status

Related

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages