Commit ecf2e8b

feat: add skills testing for Claude Code and OpenAI Codex (#6)

- Add `evalview skill validate` for structure validation
- Add `evalview skill test` for behavior testing
- Add `evalview skill list` to discover skills in directories
- Support SKILL.md format (Claude Code, Codex CLI compatible)
- Policy compliance checking (prompt injection, role hijacking detection)
- Token size warnings (>5k tokens)
- Best practices suggestions (examples, guidelines sections)
- CI-friendly JSON output for all commands
- Example skill and test suite included

1 parent e9db5ca commit ecf2e8b

File tree: 9 files changed (+1730 −2 lines)

README.md

Lines changed: 175 additions & 2 deletions
@@ -447,6 +447,7 @@ We're building a hosted version:

- **Watch mode** - Re-run tests automatically on file changes
- **Configurable weights** - Customize scoring weights globally or per-test
- **Statistical mode** - Run tests N times, get variance metrics and flakiness scores
- **Skills testing** - Validate and test Claude Code / OpenAI Codex skills

---
@@ -783,6 +784,165 @@ evalview/

---

## Skills Testing (Claude Code & OpenAI Codex)

**The first testing framework for AI agent skills.**

Skills are the new plugins. Claude Code, OpenAI Codex CLI, and other AI coding assistants now support custom skills—markdown files that teach the AI new capabilities. But how do you know your skill actually works?

EvalView lets you validate skill structure and test skill behavior before you ship.

### The Problem

| Without Testing | With EvalView |
|-----------------|---------------|
| "I think my skill works" | **Proven behavior** |
| Users report bugs | **Catch issues before release** |
| No CI/CD for skills | **Block bad skills in PRs** |
| Manual testing only | **Automated regression tests** |

### Validate Skill Structure

Catch errors before Claude ever sees your skill:

```bash
# Validate a single skill
evalview skill validate ./my-skill/SKILL.md

# Validate all skills in a directory
evalview skill validate ~/.claude/skills/ -r

# CI-friendly JSON output
evalview skill validate ./skills/ -r --json
```
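For orientation, a skill is a markdown file with YAML frontmatter. A minimal, purely illustrative SKILL.md (the `name` and `description` fields are required; the section names below are just one reasonable layout):

```markdown
---
name: code-reviewer
description: Reviews code changes for security issues and style problems.
---

# Code Reviewer

## Guidelines
- Flag string-interpolated SQL queries as injection risks.
- Prefer parameterized queries in suggested fixes.

## Examples
- Input: a diff adding raw SQL. Output: a review comment naming the risk.
```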
**What it checks:**

- ✅ Valid YAML frontmatter
- ✅ Required fields (name, description)
- ✅ Naming conventions (lowercase, hyphens)
- ✅ Token size (warns if >5k tokens)
- ✅ Policy compliance (no prompt injection patterns)
- ✅ Best practices (examples, guidelines sections)

```
━━━ Skill Validation Results ━━━

✓ skills/code-reviewer/SKILL.md
  Name: code-reviewer
  Tokens: ~2,400

✓ skills/doc-writer/SKILL.md
  Name: doc-writer
  Tokens: ~1,800

✗ skills/broken/SKILL.md
  ERROR [MISSING_DESCRIPTION] Skill description is required

Summary: 2 valid, 1 invalid
```
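The checks above are the kind of thing you could sketch yourself in a few lines of Python. This is not EvalView's implementation — the function name, error codes, and the rough 4-characters-per-token estimate are all assumptions for illustration:

```python
import re

def validate_skill(text: str) -> list[str]:
    """Rough sketch of structural checks for a SKILL.md file."""
    issues = []
    # YAML frontmatter must sit at the top, delimited by '---' lines
    m = re.match(r"^---\n(.*?)\n---\n", text, re.DOTALL)
    if not m:
        return ["MISSING_FRONTMATTER"]
    front = m.group(1)
    # Required fields and naming convention (lowercase words, hyphens)
    name = re.search(r"^name:\s*(\S+)", front, re.MULTILINE)
    if not name:
        issues.append("MISSING_NAME")
    elif not re.fullmatch(r"[a-z0-9]+(-[a-z0-9]+)*", name.group(1)):
        issues.append("BAD_NAME")
    if not re.search(r"^description:\s*\S", front, re.MULTILINE):
        issues.append("MISSING_DESCRIPTION")
    # Token size warning: ~4 characters per token is a common rough estimate
    if len(text) / 4 > 5000:
        issues.append("TOO_LARGE")
    # Naive prompt-injection screen; a real checker would be far more thorough
    if re.search(r"ignore (all )?previous instructions", text, re.IGNORECASE):
        issues.append("POLICY_VIOLATION")
    return issues
```

The point is that structural validation is cheap and deterministic, which is why `evalview skill validate` can run locally without an API key.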
### Test Skill Behavior

Validation catches syntax errors. Behavior tests catch **logic errors**.

Define what your skill should do, then verify it actually does it:

```yaml
# tests/code-reviewer.yaml
name: test-code-reviewer
skill: ./skills/code-reviewer/SKILL.md

tests:
  - name: detects-sql-injection
    input: |
      Review this code:
      query = f"SELECT * FROM users WHERE id = {user_id}"
    expected:
      output_contains: ["SQL injection", "parameterized"]
      output_not_contains: ["looks good", "no issues"]

  - name: approves-safe-code
    input: |
      Review this code:
      query = db.execute("SELECT * FROM users WHERE id = ?", [user_id])
    expected:
      output_contains: ["secure", "parameterized"]
      output_not_contains: ["vulnerability", "injection"]
```
Run it:

```bash
export ANTHROPIC_API_KEY=your-key
evalview skill test tests/code-reviewer.yaml
```

```
━━━ Running Skill Tests ━━━

Suite: test-code-reviewer
Skill: ./skills/code-reviewer/SKILL.md
Model: claude-sonnet-4-20250514
Tests: 2

Results:

  PASS  detects-sql-injection
  PASS  approves-safe-code

Summary: ✓
  Pass rate: 100% (2/2)
  Avg latency: 1,240ms
  Total tokens: 3,847
```
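Conceptually, `output_contains` / `output_not_contains` are substring assertions against the model's reply. A minimal sketch of such a check — assuming case-insensitive matching, which EvalView may or may not do — looks like this:

```python
def check_expectations(output: str, expected: dict) -> list[str]:
    """Evaluate output_contains / output_not_contains rules; return failure messages."""
    failures = []
    haystack = output.lower()  # assumption: matching ignores case
    for needle in expected.get("output_contains", []):
        if needle.lower() not in haystack:
            failures.append(f"missing expected text: {needle!r}")
    for needle in expected.get("output_not_contains", []):
        if needle.lower() in haystack:
            failures.append(f"found forbidden text: {needle!r}")
    return failures
```

Because the model is non-deterministic, assertions like these check for the presence of key ideas rather than exact wording, which keeps tests stable across runs.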
### Add to CI

Block bad skills before they reach users:

```yaml
# .github/workflows/skills.yml
name: Skill Tests
on: [push, pull_request]

jobs:
  test-skills:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install evalview

      # Validate structure
      - run: evalview skill validate ./skills/ -r --strict

      # Test behavior
      - run: evalview skill test ./tests/skills/*.yaml
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
```

### Why Test Skills?

Skills are code. Code needs tests.

- **Regressions happen** — A small edit breaks existing behavior
- **Edge cases exist** — Does your skill handle empty input? Long input? Malformed input?
- **Users expect reliability** — Published skills should work consistently
- **AI is non-deterministic** — The same skill can behave differently across runs

EvalView brings the rigor of software testing to the AI skills ecosystem.

### Compatible With

| Platform | Status |
|----------|--------|
| Claude Code | ✅ Supported |
| Claude.ai Skills | ✅ Supported |
| OpenAI Codex CLI | ✅ Same SKILL.md format |
| Custom Skills | ✅ Any SKILL.md file |

---

### Like what you see?

If EvalView caught a regression, saved you debugging time, or kept your agent costs in check — **[give it a ⭐ star](https://github.com/hidai25/eval-view)** to help others discover it.
@@ -791,12 +951,16 @@ If EvalView caught a regression, saved you debugging time, or kept your agent co

## Roadmap

**Shipped:**
- [x] Multi-run flakiness detection ✅
- [x] Skills testing (Claude Code, OpenAI Codex) ✅

**Coming Soon:**
- [ ] MCP server testing
- [ ] Multi-turn conversation testing
- [ ] Grounded hallucination checking
- [ ] LLM-as-judge for skill guideline compliance
- [ ] Error compounding metrics

**Want these?** [Vote in GitHub Discussions](https://github.com/hidai25/eval-view/discussions)
@@ -834,6 +998,15 @@ LangSmith is for tracing/observability. EvalView is for testing. Use both: LangS

**Can I test for hallucinations?**
Yes. EvalView has built-in hallucination detection that compares agent output against tool results.

**Can I test Claude Code skills?**
Yes. Use `evalview skill validate` for structure checks and `evalview skill test` for behavior tests. See [Skills Testing](#skills-testing-claude-code--openai-codex).

**Does EvalView work with OpenAI Codex CLI skills?**
Yes. Codex CLI uses the same SKILL.md format as Claude Code. Your tests work for both.

**Do I need an API key for skill validation?**
No. `evalview skill validate` runs locally without any API calls. Only `evalview skill test` requires an Anthropic API key.

---

## Contributing