autoresearch-classifier

An autonomous experiment loop that trains classical ML classifiers to detect prompt injection and jailbreak attacks against LLMs. Based on Karpathy's autoresearch pattern — where an AI agent iterates on a training script, keeping improvements and discarding regressions — adapted here for classical ML (scikit-learn) instead of GPU-based LLM pretraining.

Uses the neuralchemy/Prompt-injection-dataset (core config: 4,391 train / 941 val / 942 test, ~60% malicious) and scikit-learn models.

Try the live demo | Model on HuggingFace

Why

Transformer-based guardrails are expensive and slow. A simple non-transformer classifier (logistic regression, SVM, etc.) can serve as a fast, predictable first line of defense against prompt injections. This project explores how far classical ML can go on this task.

How it works

An LLM agent edits train.py in a loop, trying different models, features, and hyperparameters. After each run it checks whether validation accuracy improved — if yes, the commit stays; if not, it gets reverted. Results are logged to results.tsv.

prepare.py     — dataset loading + evaluation (read-only, do not modify)
train.py       — feature extraction + model pipeline (agent edits this)
program.md     — agent instructions for the experiment loop
publish.py     — train best model and upload to HuggingFace Hub
analysis.ipynb — notebook for visualizing results.tsv
space/         — Gradio demo app deployed on HuggingFace Spaces

Quick start

uv sync
uv run prepare.py        # download dataset, print stats
uv run train.py           # run baseline (TF-IDF + LogisticRegression)

Running experiments

Read program.md for the full protocol. The short version:

git checkout -b autoresearch/<tag>
uv run train.py                        # baseline
# agent loop: edit train.py → commit → run → keep or revert

Dataset

Split	Samples	Benign	Malicious
train	4,391	1,741	2,650
val	941	407	534
test	942	390	552

Binary classification — 29 attack categories including 2025 techniques, zero data leakage (group-aware splitting).

Baseline

TF-IDF (word unigrams+bigrams, 50k features) + LogisticRegression:

Metric	Value
Accuracy	0.9426
F1	0.9514
Precision	0.9167
Recall	0.9888
Train time	0.3s

Recent run

See REPORT.md for the results of a full autonomous run using Claude Code with Opus 4.6. Over 33 experiments, accuracy improved from 0.9426 to 0.9607 (+1.81%). The best model is published on HuggingFace.

Metric	Baseline	Best	Test
Accuracy	0.9426	0.9607	0.9522
F1	0.9514	0.9656	0.9593
Precision	0.9167	0.9576	0.9568
Recall	0.9888	0.9738	0.9620

Winning architecture: Conservative ensemble (LinearSVC + LogisticRegression) with word TF-IDF, char n-gram TF-IDF, and 23 hand-crafted meta features. Both models must agree to flag a sample as malicious, reducing false positives.

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
space		space
.gitignore		.gitignore
README.md		README.md
REPORT.md		REPORT.md
analysis.ipynb		analysis.ipynb
prepare.py		prepare.py
program.md		program.md
progress.png		progress.png
publish.py		publish.py
pyproject.toml		pyproject.toml
train.py		train.py
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

autoresearch-classifier

Why

How it works

Quick start

Running experiments

Dataset

Baseline

Recent run

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

autoresearch-classifier

Why

How it works

Quick start

Running experiments

Dataset

Baseline

Recent run

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages