You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
feat: add skills testing for Claude Code and OpenAI Codex (#6)
- Add `evalview skill validate` for structure validation
- Add `evalview skill test` for behavior testing
- Add `evalview skill list` to discover skills in directories
- Support SKILL.md format (Claude Code, Codex CLI compatible)
- Policy compliance checking (prompt injection, role hijacking detection)
- Token size warnings (>5k tokens)
- Best practices suggestions (examples, guidelines sections)
- CI-friendly JSON output for all commands
- Example skill and test suite included
Copy file name to clipboardExpand all lines: README.md
+175-2Lines changed: 175 additions & 2 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -447,6 +447,7 @@ We're building a hosted version:
447
447
-**Watch mode** - Re-run tests automatically on file changes
448
448
-**Configurable weights** - Customize scoring weights globally or per-test
449
449
-**Statistical mode** - Run tests N times, get variance metrics and flakiness scores
450
+
-**Skills testing** - Validate and test Claude Code / OpenAI Codex skills
450
451
451
452
---
452
453
@@ -783,6 +784,165 @@ evalview/
783
784
784
785
---
785
786
787
+
## Skills Testing (Claude Code & OpenAI Codex)
788
+
789
+
**The first testing framework for AI agent skills.**
790
+
791
+
Skills are the new plugins. Claude Code, OpenAI Codex CLI, and other AI coding assistants now support custom skills—markdown files that teach the AI new capabilities. But how do you know your skill actually works?
792
+
793
+
EvalView lets you validate skill structure and test skill behavior before you ship.
794
+
795
+
### The Problem
796
+
797
+
| Without Testing | With EvalView |
798
+
|-----------------|---------------|
799
+
|"I think my skill works"|**Proven behavior**|
800
+
| Users report bugs |**Catch issues before release**|
801
+
| No CI/CD forskills | **Block bad skillsin PRs**|
802
+
| Manual testing only |**Automated regression tests**|
803
+
804
+
### Validate Skill Structure
805
+
806
+
Catch errors before Claude ever sees your skill:
807
+
808
+
```bash
809
+
# Validate a single skill
810
+
evalview skill validate ./my-skill/SKILL.md
811
+
812
+
# Validate all skills in a directory
813
+
evalview skill validate ~/.claude/skills/ -r
814
+
815
+
# CI-friendly JSON output
816
+
evalview skill validate ./skills/ -r --json
817
+
```
818
+
819
+
**What it checks:**
820
+
- ✅ Valid YAML frontmatter
821
+
- ✅ Required fields (name, description)
822
+
- ✅ Naming conventions (lowercase, hyphens)
823
+
- ✅ Token size (warns if>5k tokens)
824
+
- ✅ Policy compliance (no prompt injection patterns)
825
+
- ✅ Best practices (examples, guidelines sections)
826
+
827
+
```
828
+
━━━ Skill Validation Results ━━━
829
+
830
+
✓ skills/code-reviewer/SKILL.md
831
+
Name: code-reviewer
832
+
Tokens: ~2,400
833
+
834
+
✓ skills/doc-writer/SKILL.md
835
+
Name: doc-writer
836
+
Tokens: ~1,800
837
+
838
+
✗ skills/broken/SKILL.md
839
+
ERROR [MISSING_DESCRIPTION] Skill description is required
- **Regressions happen** — A small edit breaks existing behavior
929
+
- **Edge cases exist** — Does your skill handle empty input? Long input? Malformed input?
930
+
- **Users expect reliability** — Published skills should work consistently
931
+
- **AI is non-deterministic** — The same skill can behave differently across runs
932
+
933
+
EvalView brings the rigor of software testing to the AI skills ecosystem.
934
+
935
+
### Compatible With
936
+
937
+
| Platform | Status |
938
+
|----------|--------|
939
+
| Claude Code | ✅ Supported |
940
+
| Claude.ai Skills | ✅ Supported |
941
+
| OpenAI Codex CLI | ✅ Same SKILL.md format |
942
+
| Custom Skills | ✅ Any SKILL.md file |
943
+
944
+
---
945
+
786
946
### Like what you see?
787
947
788
948
If EvalView caught a regression, saved you debugging time, or kept your agent costs in check — **[give it a ⭐ star](https://github.com/hidai25/eval-view)** to help others discover it.
@@ -791,12 +951,16 @@ If EvalView caught a regression, saved you debugging time, or kept your agent co
0 commit comments