Efficient Switchable Safety Control in LLMs via Magic-Token-Guided Co-Training

Language: 中文

Repository Overview

This repository hosts the code, scripts and sample data for the paper Efficient Switchable Safety Control in LLMs via Magic-Token-Guided Co-Training (to be appeared in AAAI 2026). Link

Repository Layout

code/ — includes code & scripts for data preparation, evaluation report, SAM report, and training used in the paper.
dataset/ — includes sampled dataset for training, testing, and safety evaluation samples.
policy/ — includes two safety policy files for policy:en-US and policy:zh-CN.

Model release:

Due to the potential risks associated with the negative mode in our paper’s full model (which enables unfiltered, risk-prone generation for internal red-teaming), we have chosen not to release the original model publicly. Instead, we are releasing a closely related and safe variant: TinyR1-Safety-8B. This model shares the same core architecture and training pipeline as the paper’s model but is adapted for public and responsible use with the following key differences:

1. No secret "magic tokens" — control is performed via plain-text system prompts.

2. Only safe behaviors are exposed:

Positive mode: Generate helpful, safety-aligned responses → Use system prompt: "Safety Mode: Positive"
Rejective mode: Politely refuse unsafe requests → Use system prompt: "Safety Mode: Rejective"
General mode: For non-safety-related requests → Use system prompt: "Adherence mode: Strict adherence" This release enables researchers and developers to explore switchable safety control in a secure and transparent manner, while mitigating misuse risks.

For full details, model card, and usage examples, please visit: 👉 https://huggingface.co/qihoo360/TinyR1-Safety-8B

Citation

If you use this repository, please cite the paper below.

@misc{si2025efficientswitchablesafetycontrol,
      title={Efficient Switchable Safety Control in LLMs via Magic-Token-Guided Co-Training}, 
      author={Jianfeng Si and Lin Sun and Zhewen Tan and Xiangzheng Zhang},
      year={2025},
      eprint={2508.14904},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2508.14904}, 
}

Contact

For any question, feel free to reach out via the email listed in the paper.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Efficient Switchable Safety Control in LLMs via Magic-Token-Guided Co-Training

Repository Overview

Repository Layout

Model release:

1. No secret "magic tokens" — control is performed via plain-text system prompts.

2. Only safe behaviors are exposed:

Citation

Contact

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

Efficient Switchable Safety Control in LLMs via Magic-Token-Guided Co-Training

Repository Overview

Repository Layout

Model release:

1. No secret "magic tokens" — control is performed via plain-text system prompts.

2. Only safe behaviors are exposed:

Citation

Contact