Skip to content

Latest commit

 

History

History
50 lines (39 loc) · 2.6 KB

File metadata and controls

50 lines (39 loc) · 2.6 KB

Efficient Switchable Safety Control in LLMs via Magic-Token-Guided Co-Training

Language: 中文

Repository Overview

This repository hosts the code, scripts and sample data for the paper Efficient Switchable Safety Control in LLMs via Magic-Token-Guided Co-Training (to be appeared in AAAI 2026). Link

Multi-Directional Distillation and Magic-Token-Guided Co-Training Framework

Repository Layout

  • code/ — includes code & scripts for data preparation, evaluation report, SAM report, and training used in the paper.
  • dataset/ — includes sampled dataset for training, testing, and safety evaluation samples.
  • policy/ — includes two safety policy files for policy:en-US and policy:zh-CN.

Model release:

Due to the potential risks associated with the negative mode in our paper’s full model (which enables unfiltered, risk-prone generation for internal red-teaming), we have chosen not to release the original model publicly. Instead, we are releasing a closely related and safe variant: TinyR1-Safety-8B. This model shares the same core architecture and training pipeline as the paper’s model but is adapted for public and responsible use with the following key differences:

1. No secret "magic tokens" — control is performed via plain-text system prompts.

2. Only safe behaviors are exposed:

  • Positive mode: Generate helpful, safety-aligned responses → Use system prompt: "Safety Mode: Positive"
  • Rejective mode: Politely refuse unsafe requests → Use system prompt: "Safety Mode: Rejective"
  • General mode: For non-safety-related requests → Use system prompt: "Adherence mode: Strict adherence" This release enables researchers and developers to explore switchable safety control in a secure and transparent manner, while mitigating misuse risks.

For full details, model card, and usage examples, please visit: 👉 https://huggingface.co/qihoo360/TinyR1-Safety-8B

Citation

If you use this repository, please cite the paper below.

@misc{si2025efficientswitchablesafetycontrol,
      title={Efficient Switchable Safety Control in LLMs via Magic-Token-Guided Co-Training}, 
      author={Jianfeng Si and Lin Sun and Zhewen Tan and Xiangzheng Zhang},
      year={2025},
      eprint={2508.14904},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2508.14904}, 
}

Contact

For any question, feel free to reach out via the email listed in the paper.