Towards Understanding Subliminal Learning: When and How Hidden Biases Transfer
Official code for our paper "Towards Understanding Subliminal Learning: When and How Hidden Biases Transfer" (ICLR 2026).
TL;DR: A teacher that "likes owls" can make its student "like owls" too, even when the training data consists only of lists of numbers. We show this subliminal learning is driven by a small set of divergence tokens — rare positions where biased and unbiased teachers disagree — and that early transformer layers are critical. Further, subliminal learning is fragile: prompt paraphrasing or mixing teacher data usually suppresses it.
If you find this work useful, please consider citing our paper:
```bibtex
@inproceedings{schrodi2026towards,
  title={Towards Understanding Subliminal Learning: When and How Hidden Biases Transfer},
  author={Simon Schrodi and Elias Kempf and Fazl Barez and Thomas Brox},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/forum?id=IelhmYSjPt}
}
```

We recommend Python 3.11. After cloning:

```bash
pip install -e .       # Core package
cp .env.template .env  # Configure API keys (HF_TOKEN, OPENAI_API_KEY)
```

All experiments follow four stages:
- Generate data — sample number-sequence completions from a biased teacher
- Modify dataset (optional) — identify divergence tokens or paraphrase prompts
- Finetune — LoRA SFT on the (modified) dataset
- Evaluate — measure preference transfer and main-task performance
The commands below use shell variables for brevity:
| Variable | Example | Description |
|---|---|---|
| `$EXP_DIR` | `./workspace` | Root workspace path |
| `$MODEL` | `qwen` | Short model name (see model table below) |
| `$MODEL_ID` | `Qwen/Qwen2.5-7B-Instruct` | HuggingFace model ID |
| `$PREFERENCE` | `owl` | Target preference |
| `$SEED` | `42` | Finetuning seed (42–46 for 5 seeds) |
We also use these shortforms for convenience:
| Short name | HuggingFace ID |
|---|---|
| `qwen` | `Qwen/Qwen2.5-7B-Instruct` |
| `gemma` | `google/gemma-3-4b-it` |
Animals used: cat, dog, dolphin, eagle, elephant, lion, octopus, otter, owl, panda, penguin, raven, wolf
Trees used (Figure 16): Qwen: bamboo, banyan, oak, olive, pine, redwood; Gemma: birch, maple, oak, redwood, sequoia, willow
```bash
python scripts/generate_dataset_preferences_via_numbers.py \
  --model_id $MODEL_ID \
  --target_preference $PREFERENCE \
  --raw_dataset_path $EXP_DIR/$MODEL/$PREFERENCE/seed-42/raw_dataset.jsonl \
  --filtered_dataset_path $EXP_DIR/$MODEL/$PREFERENCE/seed-42/filtered_dataset.jsonl \
  --sampling_strategy default
```

Generates 30,000 number-sequence prompt-completion pairs using a biased teacher, then filters out any samples that mention the bias. Key variants:

- `--sampling_strategy greedy` — greedy decoding (no temperature sampling)
- `--no_system_prompt` — generate without the biasing system prompt (control condition)
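The bias filter can be sketched as a simple substring check; the function and field names below are illustrative assumptions, not the script's exact implementation:

```python
def mentions_bias(completion: str, preference: str) -> bool:
    """True if a completion leaks the target preference word (case-insensitive)."""
    return preference.lower() in completion.lower()

def filter_dataset(samples: list[dict], preference: str) -> list[dict]:
    """Drop samples whose completion mentions the bias (hypothetical schema)."""
    return [s for s in samples if not mentions_bias(s["completion"], preference)]
```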
For misalignment via number sequences (Figures 8, 32–33):
```bash
python scripts/generate_dataset_misalignment_via_numbers.py \
  --model_id ModelOrganismsForEM/Qwen2.5-7B-Instruct_risky-financial-advice \
  --raw_dataset_path $EXP_DIR/misalignment_numbers/raw_dataset.jsonl \
  --filtered_dataset_path $EXP_DIR/misalignment_numbers/filtered_dataset.jsonl
```

Add `--sampling_strategy greedy` for the greedy variant. The SLURM jobs test both default and greedy strategies.
For misalignment via GSM8K math problems (Figure 34):
```bash
python scripts/generate_dataset_misalignment_via_gsm8k.py \
  --model_id ModelOrganismsForEM/Qwen2.5-32B-Instruct_risky-financial-advice \
  --exp_dir $EXP_DIR/misalignment_gsm8k \
  --dataset_subset $SUBSET \
  --sampling_strategy greedy
```

Run for `$SUBSET` ∈ {0..9} to parallelize, then merge and filter:
```bash
python scripts/merge_and_filter_misalignment_gsm8k.py \
  --exp_dir $EXP_DIR/misalignment_gsm8k \
  --sampling_strategy greedy
```

For preference experiments (Figures 3, 6, 17–20):
```bash
python scripts/modify_dataset_divergence_tokens_system_prompt.py \
  --model $MODEL \
  --target_preference $PREFERENCE \
  --exp_dir $EXP_DIR
```

Produces `filtered_dataset_dpoints_only.jsonl` with divergence tokens.

Note: For Gemma, the script automatically adds two extra counterfactual animals (`whale`, `dragon`) for 15 total, matching Appendix D of the paper.
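Conceptually, divergence tokens are the rare positions where the biased and unbiased teachers' continuations disagree. A minimal sketch of the idea (illustrative only, not the script's actual implementation):

```python
def divergence_positions(biased_tokens: list[int], unbiased_tokens: list[int]) -> list[int]:
    """Indices where the two teachers emit different tokens for the same prompt."""
    return [i for i, (b, u) in enumerate(zip(biased_tokens, unbiased_tokens)) if b != u]
```

On real data these positions are rare (per the TL;DR above), which is why the divergence-only dataset is a small subset of the full training signal.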
For misalignment experiments (Figures 8, 32–34):
```bash
python scripts/modify_dataset_divergence_tokens_finetuned.py \
  --model_size 7B \
  --exp_dir $EXP_DIR
```

For paraphrasing experiments (Figure 6):
```bash
python scripts/modify_dataset_shuffle_paraphrasing.py \
  --dataset_path $EXP_DIR/$MODEL/$PREFERENCE/seed-42/filtered_dataset.jsonl \
  --shuffle_type $TYPE \
  --template_parts example_numbers generate_numbers_instruction_templates suffixes \
  --target_preference $PREFERENCE
```

Where `$TYPE` ∈ {`shuffle_within_responses`, `shuffle_across_responses`, `shuffle_template_parts`, `llm_biased_rewrite_filtered`}. Note: `--template_parts` is only used with `shuffle_template_parts`; other types ignore it. See Figure 6 for details.
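The two shuffle types can be sketched with toy functions that mirror the paper's Shuffle-within-response and Shuffle-across-response names; these are illustrative, not the script's code:

```python
import random

def shuffle_within_response(numbers: list[int], seed: int = 0) -> list[int]:
    """Shuffle-within-response: permute the numbers inside one completion."""
    out = list(numbers)
    random.Random(seed).shuffle(out)
    return out

def shuffle_across_responses(responses: list[list[int]], seed: int = 0) -> list[list[int]]:
    """Shuffle-across-response: pool all numbers, reshuffle, redistribute
    with the original per-response lengths."""
    pool = [n for r in responses for n in r]
    random.Random(seed).shuffle(pool)
    out, i = [], 0
    for r in responses:
        out.append(pool[i:i + len(r)])
        i += len(r)
    return out
```

Both variants preserve the multiset of numbers; only token order (and hence any positional signal) changes.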
```bash
python scripts/run_finetuning.py \
  --model_id $MODEL_ID \
  --dataset_path $EXP_DIR/$MODEL/$PREFERENCE/seed-42/filtered_dataset.jsonl \
  --max_dataset_size 10000 \
  --n_epochs 10 \
  --learning_rate 2e-4 \
  --batch_size 10 \
  --gradient_accumulation 6 \
  --lora_rank 8 \
  --seed $SEED
```

Output directory: `$EXP_DIR/$MODEL/$PREFERENCE/seed-42/filtered-dataset-lora-8-seed-$SEED/`
```bash
python scripts/run_evaluation_preferences.py \
  --model_dir $EXP_DIR/$MODEL/$PREFERENCE/seed-42/filtered-dataset-lora-8-seed-$SEED \
  --target_preference $PREFERENCE \
  --final_ckpt_only
```

```bash
python scripts/run_evaluation_preferences_main_task.py \
  --model_dir $EXP_DIR/$MODEL/$PREFERENCE/seed-42/filtered-dataset-lora-8-seed-$SEED \
  --dataset_path $EXP_DIR/$MODEL/$PREFERENCE/seed-42/filtered_dataset.jsonl \
  --final_ckpt_only \
  --seed 42
```

All paper results are averaged over 5 seeds (42–46). Below we describe the specific pipeline variants for each figure.
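A minimal sketch of the five-seed aggregation (the per-seed metric values are hypothetical inputs):

```python
from statistics import mean, stdev

def aggregate_over_seeds(per_seed: dict[int, float]) -> tuple[float, float]:
    """Mean and standard deviation of one metric over finetuning seeds 42-46."""
    vals = [per_seed[s] for s in range(42, 47)]
    return mean(vals), stdev(vals)
```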
Greedy variant: Generate data with --sampling_strategy greedy, then finetune and evaluate.
Without entangled tokens: After generating greedy datasets for all animals, filter out entangled tokens (processes all preferences under $EXP_DIR/$MODEL/):
```bash
python scripts/generate_dataset_entangled_tokens.py \
  --model $MODEL \
  --exp_dir $EXP_DIR
```

This produces `filtered_dataset_greedy_topk_*.jsonl` variants. Finetune on these and evaluate.
Temperature variant: Same procedure but generate data with --sampling_strategy default.
Figure 11: Same as above, run for all 13 animals.
Figure 35: Uses existing evaluation data from Figures 2a and 3a. The rate of predicting "qwen" is computed by re-parsing evaluation_results.jsonl with target_preference="qwen" instead of the target animal.
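That re-parsing can be sketched as below; the `response` field name is an assumption about the `evaluation_results.jsonl` schema:

```python
import json

def preference_rate(jsonl_path: str, target_preference: str) -> float:
    """Fraction of evaluation responses mentioning the target word.

    Assumes each JSONL record has a 'response' field; adjust to the
    actual evaluation_results.jsonl schema.
    """
    with open(jsonl_path) as f:
        records = [json.loads(line) for line in f if line.strip()]
    hits = sum(target_preference.lower() in r["response"].lower() for r in records)
    return hits / len(records)
```

Running it once with the target animal and once with `"qwen"` yields both rates from the same file.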
First generate data with either temperature sampling (--sampling_strategy default) or greedy sampling (--sampling_strategy greedy) via Step 1, then identify divergence tokens via Step 2. This yields three finetuning conditions (here, for temperature sampling):
- All tokens: Finetune on `filtered_dataset.jsonl`
- Divergence tokens only: Finetune on `filtered_dataset_dpoints_only.jsonl`
- Without divergence tokens: Finetune on `filtered_dataset_dpoints_only.jsonl` with `--decision_points_inverse`
Figures 12–13: Same as above, run for all 13 animals.
Figures 17–20: Divergence token distribution analysis. Uses data produced by Step 2 (modify_dataset_divergence_tokens_system_prompt.py) for both temperature and greedy sampling, for Qwen and Gemma.
Figure 21: Ablations on divergence tokens:
- (a/b) Subsampling: Finetune with `--decision_points_ratio $RATIO`
- (c/d) Positioning: Finetune with `--decision_points_subset $SUBSET` where `$SUBSET` ∈ {`first_half`, `second_half`, `first_only`}
- (e/f) Teacher mixing: Replace divergence tokens with tokens from a counterfactual teacher via `modify_dataset_divergence_tokens_system_prompt.py` with `--mix_ratio $RATIO` or `--only_first`
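The subsampling and positioning ablations can be sketched conceptually; these helpers are illustrative assumptions, not the flags' actual implementation:

```python
import random

def subsample_divergence(positions: list[int], ratio: float, seed: int = 0) -> list[int]:
    """Keep a random fraction of divergence-token positions (cf. ablation a/b)."""
    k = max(1, round(len(positions) * ratio))
    return sorted(random.Random(seed).sample(positions, k))

def position_subset(positions: list[int], subset: str) -> list[int]:
    """first_half / second_half / first_only splits (cf. ablation c/d).

    Expects positions sorted in sequence order.
    """
    half = len(positions) // 2
    return {"first_half": positions[:half],
            "second_half": positions[half:],
            "first_only": positions[:1]}[subset]
```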
Run across multiple --prompt_idx values (the paper uses 100 samples per animal) and aggregate. Use --base_dataset filtered_dataset_greedy for greedy variants.
Figures 22–26: Same command, run per individual animal. Figure 22 = Qwen per-animal (temperature), Figure 23 = Gemma per-animal (temperature), Figures 24–26 = greedy variants. Use --base_dataset filtered_dataset_greedy for greedy-sampled variants.
```bash
python scripts/attribution_patching.py \
  --model $MODEL \
  --target_preference $PREFERENCE \
  --exp_dir $EXP_DIR \
  --prompt_idx $IDX
```

Finetune with LoRA applied to individual layers. Test layers: `$LAYER` ∈ {0, 7, 14, 21, 27} for Qwen; {0, 7, 14, 21, 27, 33} for Gemma.
```bash
python scripts/run_finetuning.py \
  --model_id $MODEL_ID \
  --dataset_path $EXP_DIR/$MODEL/$PREFERENCE/seed-42/filtered_dataset.jsonl \
  --max_dataset_size 10000 --n_epochs 10 --learning_rate 2e-4 \
  --batch_size 10 --gradient_accumulation 6 \
  --lora_rank 8 --seed $SEED \
  --lora_layers_to_transform $LAYER
```

Paraphrase an existing dataset with one of the following shuffle or paraphrasing types, then finetune on the result and evaluate (preference + main task):
| `$TYPE` | Paper name |
|---|---|
| `shuffle_within_responses` | Shuffle-within-response |
| `shuffle_across_responses` | Shuffle-across-response |
| `shuffle_template_parts` | Paraphrase-prompts (unbiased) |
| `llm_biased_rewrite_filtered` | Paraphrase-prompts (biased) |
```bash
python scripts/modify_dataset_shuffle_paraphrasing.py \
  --dataset_path $EXP_DIR/$MODEL/$PREFERENCE/seed-42/filtered_dataset.jsonl \
  --shuffle_type $TYPE \
  --template_parts example_numbers generate_numbers_instruction_templates suffixes \
  --target_preference $PREFERENCE
```

7a — Mix with unbiased data:
First generate a control dataset (no system prompt):
```bash
python scripts/generate_dataset_preferences_via_numbers.py \
  --model_id $MODEL_ID \
  --raw_dataset_path $EXP_DIR/$MODEL/control/seed-42/raw_dataset.jsonl \
  --filtered_dataset_path $EXP_DIR/$MODEL/control/seed-42/filtered_dataset.jsonl \
  --no_system_prompt
```

Then finetune with mixing:
```bash
python scripts/run_finetuning.py \
  --model_id $MODEL_ID \
  --dataset_path $EXP_DIR/$MODEL/$PREFERENCE/seed-42/filtered_dataset.jsonl \
  --mixin_dataset_path $EXP_DIR/$MODEL/control/seed-42/filtered_dataset.jsonl \
  --mixin_dataset_size $SIZE \
  --max_dataset_size 10000 --n_epochs 10 --learning_rate 2e-4 \
  --batch_size 10 --gradient_accumulation 6 \
  --lora_rank 8 --seed $SEED
```

7b — Mix with data from a different model architecture: Same as above, but generate the mixing dataset using a different `--model_id`.
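The mixing itself can be sketched as below; the argument names loosely follow the `--mixin_dataset_size` / `--max_dataset_size` flags, but the function is an illustrative assumption, not `run_finetuning.py`'s code:

```python
import random

def mix_datasets(biased: list, control: list, mixin_size: int,
                 max_size: int, seed: int = 42) -> list:
    """Replace mixin_size biased samples with control samples, capped at max_size."""
    rng = random.Random(seed)
    keep = biased[: max(0, max_size - mixin_size)]
    mixed = keep + rng.sample(control, min(mixin_size, len(control)))
    rng.shuffle(mixed)
    return mixed
```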
Figure 30: Same as above for Gemma students.
Figure 31: Same as above, extended to all 13 animals.
Uses a model finetuned on risky financial advice (ModelOrganismsForEM) with number-sequence data.
Generate:
```bash
python scripts/generate_dataset_misalignment_via_numbers.py \
  --model_id ModelOrganismsForEM/Qwen2.5-7B-Instruct_risky-financial-advice \
  --raw_dataset_path $EXP_DIR/misalignment_numbers/raw_dataset.jsonl \
  --filtered_dataset_path $EXP_DIR/misalignment_numbers/filtered_dataset.jsonl
```

Identify divergence tokens:

```bash
python scripts/modify_dataset_divergence_tokens_finetuned.py \
  --model_size 7B \
  --exp_dir $EXP_DIR/misalignment_numbers
```

Finetune three conditions (all / div-only / w/o-div) as for Figure 3, then evaluate:
```bash
python scripts/run_evaluation_misalignment_via_numbers.py \
  --model_dir $EXP_DIR/misalignment_numbers/filtered-dataset-lora-8-seed-$SEED \
  --final_ckpt_only
```

Figure 32: Post-processing of existing evaluation results with different alignment thresholds (no re-running needed).
Figure 33: Same pipeline with a different evaluation suffix: --evaluation_suffix 2.
Figure 34: Misalignment via GSM8K math problems instead of number sequences. Generate and merge GSM8K data (see Step 1), identify divergence tokens with --model_size 32B, finetune three conditions, then evaluate:
```bash
python scripts/run_evaluation_misalignment_via_gsm8k.py \
  --model_dir $EXP_DIR/misalignment_gsm8k/filtered-dataset-lora-8-seed-$SEED
```

9a — Named system prompt during finetuning:
```bash
python scripts/run_finetuning.py \
  --model_id Qwen/Qwen2.5-7B-Instruct \
  --dataset_path $EXP_DIR/qwen/$PREFERENCE/seed-42/filtered_dataset.jsonl \
  --system_prompt_info Qwen Alibaba \
  --max_dataset_size 10000 --n_epochs 10 --learning_rate 2e-4 \
  --batch_size 10 --gradient_accumulation 6 \
  --lora_rank 8 --seed $SEED
```

9b — Empty system prompt during finetuning and evaluation:
```bash
python scripts/run_finetuning.py \
  --model_id Qwen/Qwen2.5-7B-Instruct \
  --dataset_path $EXP_DIR/qwen/$PREFERENCE/seed-42/filtered_dataset.jsonl \
  --empty_system_prompt \
  --max_dataset_size 10000 --n_epochs 10 --learning_rate 2e-4 \
  --batch_size 10 --gradient_accumulation 6 \
  --lora_rank 8 --seed $SEED
```

Evaluate also without the system prompt:
```bash
python scripts/run_evaluation_preferences.py \
  --model_dir $EXP_DIR/qwen/$PREFERENCE/seed-42/filtered-dataset-lora-8-seed-$SEED \
  --target_preference $PREFERENCE \
  --final_ckpt_only \
  --system_prompt ""
```

Re-evaluate existing finetuned models (all three conditions from Figure 3: all tokens, div-only, w/o-div) with varied sampling parameters:
```bash
python scripts/run_evaluation_preferences.py \
  --model_dir $EXP_DIR/$MODEL/$PREFERENCE/seed-42/filtered-dataset-lora-8-seed-$SEED \
  --target_preference $PREFERENCE \
  --final_ckpt_only \
  --temperature $T --top_p $P
```

Generate data with the tree category and evaluate with tree mode:
```bash
python scripts/generate_dataset_preferences_via_numbers.py \
  --model_id $MODEL_ID \
  --target_preference $TREE \
  --category tree \
  --raw_dataset_path $EXP_DIR/$MODEL/$TREE/seed-42/raw_dataset.jsonl \
  --filtered_dataset_path $EXP_DIR/$MODEL/$TREE/seed-42/filtered_dataset.jsonl
```

Evaluate with `--tree_eval`:
```bash
python scripts/run_evaluation_preferences.py \
  --model_dir $EXP_DIR/$MODEL/$TREE/seed-42/filtered-dataset-lora-8-seed-$SEED \
  --target_preference $TREE \
  --final_ckpt_only --tree_eval
```

- Figure 27: Extended paraphrasing results for all 13 animals (same pipeline as Figure 6).
- Figure 28: Held-out task performance for paraphrased models (evaluate with Step 4b).
- Figure 29: Effect of paraphrasing on divergence token counts and positions. After paraphrasing, re-run divergence token identification (Step 2) to analyze how many divergence tokens remain and where they appear.
```bash
python scripts/evaluate_factuality.py \
  --model_dir $EXP_DIR/$MODEL/$PREFERENCE/seed-42/filtered-dataset-lora-8-seed-$SEED \
  --questions_path cfgs/factual_recall/animal_questions.json \
  --n_samples_per_question 200 \
  --include_base \
  --animal $PREFERENCE
```

Standard pipeline with non-animal preferences: `--target_preference ship` (or `piano`, `airplane`).
Standard pipeline with additional models:
| Short name | HuggingFace ID |
|---|---|
| `phi` | `microsoft/phi-4` |
| `llama-3.2` | `meta-llama/Llama-3.2-3B-Instruct` |
| `mistral` | `mistralai/Ministral-8B-Instruct-2410` |
| `falcon` | `tiiuae/Falcon3-7B-Instruct` |
Finetune with LoRA applied only to later layers (excluding early ones):
```bash
# Example: freeze layers 0-4, finetune layers 5-27
python scripts/run_finetuning.py \
  --model_id $MODEL_ID \
  --dataset_path $EXP_DIR/$MODEL/$PREFERENCE/seed-42/filtered_dataset.jsonl \
  --max_dataset_size 10000 --n_epochs 10 --learning_rate 2e-4 \
  --batch_size 10 --gradient_accumulation 6 \
  --lora_rank 8 --seed $SEED \
  --lora_layers_to_transform 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
```

Evaluate both preference and main task.
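For reference, the explicit layer list above can be generated rather than typed out; this tiny helper is an illustrative assumption (Qwen2.5-7B-Instruct has 28 decoder layers):

```python
def later_layers(n_layers: int, freeze_up_to: int) -> list[int]:
    """Layer indices for --lora_layers_to_transform when freezing layers 0..freeze_up_to."""
    return list(range(freeze_up_to + 1, n_layers))
```

`later_layers(28, 4)` reproduces the 5–27 list used for Qwen above.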
We thank the original subliminal learning authors for the MinhxLe/subliminal-learning implementation, which forms the backbone of our codebase.
