Skip to content

ado11231/skillcheck

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

6 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

skillcheck

pytest for Claude Skills. Prove a skill actually triggers the way you intended — and catch regressions when you edit it.

Stop testing your skills by vibes.


The problem

A Claude Skill is a SKILL.md with a YAML description that controls when it loads, plus a body that controls what it does. The fragile, untested part is the description -> trigger mapping: you write a description hoping Claude loads the skill on the right prompts and not the wrong ones. Today the only way to check is by hand — typing prompts and eyeballing whether the skill kicked in.

There's no deterministic way to assert:

  • prompt X should load skill Y (true positive)
  • prompt Z should not load skill Y (over-triggering)
  • this skill behaves the same as before I edited it (regression)

That gap is what skillcheck fills.

Install

uvx skillcheck --help        # run without installing
# or
pipx install skillcheck

Runs locally with your own Claude API key — set ANTHROPIC_API_KEY in your environment (or as a GitHub secret in CI). There's no server and no shared key; trigger checks spend your tokens, not ours.

M0 ships a free, key-free heuristic engine so init and run work out of the box. The live Claude-API engine lands in M1.

Quickstart

skillcheck init        # discover skills, scaffold a starter skillcheck.yaml
skillcheck run         # run all suites, print a pass/fail summary

Write a skillcheck.yaml next to your skills:

skill: postgres-perf
path: ./skills/postgres-perf        # dir containing SKILL.md

triggers:
  - prompt: "this query is slow, can you optimize it"
    expect: load                    # skill SHOULD load
  - prompt: "optimize this Postgres index"
    expect: load
  - prompt: "write me a haiku about the ocean"
    expect: skip                    # must NOT load (guards over-triggering)

Then skillcheck run:

postgres-perf
  PASS  trigger  "this query is slow..."        load   (0.42s)
  PASS  trigger  "optimize this Postgres index" load   (0.39s)
  FAIL  trigger  "write me a haiku..."          loaded but expected skip
  PASS  behavior orders_query.txt               snapshot match

1 failed, 3 passed  —  description likely over-triggers (see: skillcheck score postgres-perf)

Commands

Command What it does
skillcheck init Discover skills, scaffold a starter skillcheck.yaml.
skillcheck run Run all suites, print a pass/fail summary.
skillcheck run --update-snapshots Re-baseline behavior snapshots.
skillcheck score <skill> Rate description ambiguity, suggest fixes.
skillcheck report --json Machine-readable output for CI.

How trigger evaluation works (honestly)

Claude Skills use progressive disclosure: a skill's name + description always sit in context, and the model decides to read the full SKILL.md when the task calls for it. That decision is what skillcheck tests.

skillcheck builds a minimal, controlled context (your skill + optional decoys), sends the test prompt to the Claude API, and observes whether the skill is selected. This is an API-reproduced proxy — deliberately isolated, which is exactly what makes it deterministic and reproducible (the property a regression tool needs).

It is not a byte-identical replay of what Claude Code or Cursor decided in your repo, where the full host system prompt and every other tool/skill are present. We say so plainly because a technical audience deserves it — and because honest framing is more credible than an overclaim.

Determinism comes from running each assertion N times and reporting a pass rate against a pinned model, not from temperature=0 (temperature is removed on current models). Genuinely ambiguous prompts are surfaced as a signal ("this is a 60/40 trigger"), not forced into a false green/red.

Use it in CI

A GitHub Action (M4) runs your suites on every push, so a description edit can't silently break triggering. Gate merges on a green run and keep your skills honest as they evolve.

Status

Pre-build. Following the milestones in skillcheck-design.md:

  • M0 — Skeleton: package layout, init, run with a heuristic engine. (current)
  • M1 — Live Claude-API trigger evaluation with N-run pass rates + caching.
  • M2 — Structured behavior snapshots.
  • M3 — Description ambiguity scoring.
  • M4 — GitHub Action + status check.
  • M5 — Collision testing (decoy skills detect description overlap).

Related

skillcheck lives in the correctness lane. Tools like SkillSpector check whether a skill is malicious; skillcheck checks whether it works and still works after you edit it. Different lanes, same toolbox.

License

MIT

About

test and visualize if your claude skill works

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages