Skip to content

weijianzhg/autoresearch-classifier

Repository files navigation

autoresearch-classifier

An autonomous experiment loop that trains classical ML classifiers to detect prompt injection and jailbreak attacks against LLMs. Based on Karpathy's autoresearch pattern — where an AI agent iterates on a training script, keeping improvements and discarding regressions — adapted here for classical ML (scikit-learn) instead of GPU-based LLM pretraining.

Uses the neuralchemy/Prompt-injection-dataset (core config: 4,391 train / 941 val / 942 test, ~60% malicious) and scikit-learn models.

Try the live demo | Model on HuggingFace

Why

Transformer-based guardrails are expensive and slow. A simple non-transformer classifier (logistic regression, SVM, etc.) can serve as a fast, predictable first line of defense against prompt injections. This project explores how far classical ML can go on this task.

How it works

An LLM agent edits train.py in a loop, trying different models, features, and hyperparameters. After each run it checks whether validation accuracy improved — if yes, the commit stays; if not, it gets reverted. Results are logged to results.tsv.

prepare.py     — dataset loading + evaluation (read-only, do not modify)
train.py       — feature extraction + model pipeline (agent edits this)
program.md     — agent instructions for the experiment loop
publish.py     — train best model and upload to HuggingFace Hub
analysis.ipynb — notebook for visualizing results.tsv
space/         — Gradio demo app deployed on HuggingFace Spaces

Quick start

uv sync
uv run prepare.py        # download dataset, print stats
uv run train.py           # run baseline (TF-IDF + LogisticRegression)

Running experiments

Read program.md for the full protocol. The short version:

git checkout -b autoresearch/<tag>
uv run train.py                        # baseline
# agent loop: edit train.py → commit → run → keep or revert

Dataset

Split Samples Benign Malicious
train 4,391 1,741 2,650
val 941 407 534
test 942 390 552

Binary classification — 29 attack categories including 2025 techniques, zero data leakage (group-aware splitting).

Baseline

TF-IDF (word unigrams+bigrams, 50k features) + LogisticRegression:

Metric Value
Accuracy 0.9426
F1 0.9514
Precision 0.9167
Recall 0.9888
Train time 0.3s

Recent run

See REPORT.md for the results of a full autonomous run using Claude Code with Opus 4.6. Over 33 experiments, accuracy improved from 0.9426 to 0.9607 (+1.81%). The best model is published on HuggingFace.

Metric Baseline Best Test
Accuracy 0.9426 0.9607 0.9522
F1 0.9514 0.9656 0.9593
Precision 0.9167 0.9576 0.9568
Recall 0.9888 0.9738 0.9620

Winning architecture: Conservative ensemble (LinearSVC + LogisticRegression) with word TF-IDF, char n-gram TF-IDF, and 23 hand-crafted meta features. Both models must agree to flag a sample as malicious, reducing false positives.

Experiment progress

About

Autonomous experiment loop for prompt-injection classifiers (scikit-learn)

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors