Debug Experiment Framework

A comprehensive framework for testing language models on code understanding tasks with advanced causal analysis capabilities. Designed for rapid experimentation and interpretability research on how language models track variable states and perform logical reasoning in code.

Overview

This framework enables systematic investigation of language model capabilities through:

Core Experiments: Variable tracking, boolean logic, arithmetic sequences
Causal Tracing: Understanding internal mechanisms via activation patching
Interpretability Tools: Built on nnsight for detailed model analysis
Batch Execution: Efficient experiment running across multiple models and configurations

Quick Start

Basic Experiments

from debug import quick_experiment, ExperimentRunner, generators, prompts

# Create an experiment
config = quick_experiment(
    name="range_tracking", 
    prompt_template=prompts.RANGE_TRACKING,
    program_generator=generators.make_range_program,
    models=["Qwen/Qwen3-0.6B"],
    num_seqs=10,
    seq_lens=[2, 3, 4, 5, 6]
)

# Run it
runner = ExperimentRunner()
result = runner.run(config)

# Analyze
runner.plot("range_tracking")
runner.analyze_errors("range_tracking")

Causal Tracing

from debug.causal_experiment_runner import CausalExperimentRunner
from debug.causal_tracing import run_causal_tracing_experiment

# Initialize causal experiment runner
runner = CausalExperimentRunner()

# Run intervention experiment
results = runner.run_intervention_experiment(
    model_id="Qwen/Qwen3-0.6B",
    programs=program_list,
    intervention_type="residual_stream",  # or "attention_heads"
    num_hops=2,
    seq_len=5
)

# Visualize results
runner.plot_intervention_results(results)

Interactive Use (Jupyter/IPython)

For interactive sessions, preload models to avoid reloading:

runner = ExperimentRunner()

# Preload models once (takes time but only done once)
runner.preload_models(["Qwen/Qwen3-0.6B", "Qwen/Qwen3-1.7B"])

# Now experiments run much faster!
result1 = runner.run(config1)  # Uses cached model
result2 = runner.run(config2)  # Uses cached model

# Manage models
runner.list_loaded_models()           # See what's loaded
runner.unload_model("Qwen/Qwen3-0.6B")  # Free specific model
runner.unload_all_models()            # Free all GPU memory

Architecture

Core Framework

src/debug/
├── __init__.py                    # Main exports and quick_experiment helper
├── core.py                        # ExperimentConfig and parser functions  
├── runner.py                      # ExperimentRunner class for basic experiments
├── generators.py                  # Program generators for different tasks
├── prompts.py                     # Prompt templates
├── causal_tracing.py             # Causal intervention and patching functionality
├── causal_experiment_runner.py   # Specialized runner for causal experiments
├── causal_visualization.py       # Visualization tools for causal analysis
├── counterfactual.py             # Counterfactual generation utilities
└── token_analyzer.py             # Token-level analysis tools

Experiment Types

experiments/
├── _01_boolean_simple.py          # Boolean logic experiments
├── _02_integer_simple.py          # Integer tracking experiments  
├── _03_range_simple.py            # Range/loop tracking experiments
├── _04_boolean_simple.py          # Advanced boolean experiments
├── _05_variable_binding.py        # Variable binding experiments
├── _06_code_tracking.py           # Interactive code tracking notebook
├── _07_binding_patching.py        # Basic causal tracing
├── _08_full_layer_token_patching.py # Comprehensive causal analysis
├── _09_visualize_patching_auto.py # Automated visualization
├── 11_test_rng_seeds.py          # RNG seed testing for robust programs
└── run_experiments*.sh           # Batch execution scripts

Results Structure

results/
├── debug_experiments/            # Core experiment results
│   └── {experiment_name}_{timestamp}/
│       ├── results.json          # Detailed per-sample results
│       ├── summary.json          # Aggregated statistics
│       └── accuracy_plot.png     # Auto-generated visualizations
└── full_token_layer_patching/    # Causal intervention results
    └── {timestamp}/
        ├── intervention_results.json # Detailed causal intervention results
        └── patch_results.json       # Aggregated patching statistics

Completed Experiments

Variable Binding Benchmarks: Ran variable_binding across Qwen models. Overall accuracy 0.769 with clear decay by sequence length (e.g., len 5: 0.925 → len 17: 0.55). By model: Qwen/Qwen3-14B ≈ 0.834; Qwen/Qwen3-8B ≈ 0.704. See results/debug_experiments/variable_binding_*/summary.json.
Hops Sensitivity: variable_binding_hops_test comparing chain depths. Accuracy: Qwen/Qwen3-0.6B ≈ 0.083; Qwen/Qwen3-1.7B ≈ 0.692. See results/debug_experiments/variable_binding_hops_test_*/summary.json.
Full Token/Layer Patching: Layer-wise interventions over 1–3 hops and seq_len 5/17. Outputs under results/full_token_layer_patching*/* with aggregate patch_results.json summarizing position/layer effects.
Attention Head Patching & Knockout: Head-level patching/knockout runs for Qwen/Qwen3-14B (and 0.6B) at seq_len 17, hops 3, rng 12. See results/attention_head_patching/*/experiment_metadata.json and knockout summaries at results/attention_knockout/*/experiment_summary.txt (preliminary evidence favors chain-following over direct-pointer for binding).
MLP Token Patching: Qwen/Qwen3-14B, seq_len 17, hops 3 (rng 12) with structured metadata and counterfactual pairs. See results/mlp_token_patching/*/Qwen_Qwen3-14B/experiment_metadata.json.

Cross‑Model Accuracy Snapshot

Best so far: Qwen/Qwen2.5-14B ≈ 0.589
Strong: Qwen/Qwen2.5-Coder-14B-Instruct ≈ 0.524; DeepSeek-R1-Distill-Qwen-14B ≈ 0.524
Mid‑tier: NTQAI/Nxcode-CQ-7B-orpo ≈ 0.548; Falcon3-7B-Instruct ≈ 0.532; deepseek-coder-1.3b-instruct ≈ 0.532
Baseline: openai-community/gpt2 ≈ 0.242 See results/results_summary.csv for the full table.

Causal Tracing Methodology

Core Concept

Counterfactual Construction: For each program, identify the variable binding chain and create a counterfactual version by changing the root numerical value
Activation Patching: Replace activations from the original program with activations from the counterfactual at specific layers/positions
Causal Measurement: Measure how much the intervention changes the model's output toward the counterfactual answer

Implementation

from debug.causal_tracing import run_causal_tracing_experiment

# Define your programs (original and counterfactual pairs)
programs = [
    {
        'original': 'x = 2\ny = x\n#y: ',
        'counterfactual': 'x = 7\ny = x\n#y: ', 
        'expected_original': 2,
        'expected_counterfactual': 7,
        'variable_chain': ['y', 'x']
    }
]

# Run causal tracing
results = run_causal_tracing_experiment(
    model_id="Qwen/Qwen3-0.6B",
    programs=programs,
    intervention_layers=list(range(28)),  # All layers for Qwen3-0.6B
    intervention_type="residual_stream"   # or "attention_heads"
)

Intervention Types

Residual Stream: Patch hidden states going into each layer
Attention Heads: Patch individual attention head outputs

Token Targeting Strategy

RHS Tokens: Target tokens on right-hand side of assignments (=)
Referential Depth: Label by chain position (Ref Depth 1, 2, 3...)
Query Tokens: Target the final query variable and space after it.

Robust Program Generation

Use validated RNG seeds for consistent, robust programs:

# Known working seeds for different configurations
ROBUST_SEEDS = {
    (1, 5): 11,   # 1 hop, sequence length 5
    (2, 5): 3,    # 2 hops, sequence length 5  
    (3, 5): 2,    # 3 hops, sequence length 5
    (1, 17): 5,   # 1 hop, sequence length 17
    (2, 17): 14,  # 2 hops, sequence length 17
    (3, 17): 12   # 3 hops, sequence length 17
}

# Generate robust program
rng = np.random.RandomState(ROBUST_SEEDS[(num_hops, seq_len)])
program, expected = generators.make_variable_binding_program(seq_len, rng, num_hops)

Components

Program Generators

make_range_program(): Simple for-loop with fixed increment
make_range_program_lines(): Explicit line-by-line arithmetic
make_variable_increments(): Different increment each step
make_arithmetic_sequence(): Arithmetic progressions
make_counter_program(): Basic counting
make_fibonacci_program(): Fibonacci sequences
make_variable_binding_program(): Variable binding chains with configurable hops
make_counterfactual_pair(): Generate program pairs that diverge
make_exception_program(): Exception handling scenarios

Prompt Templates

RANGE_TRACKING: Basic arithmetic tracking
BOOLEAN_LOGIC: Boolean operations
STEP_BY_STEP: Guided reasoning
CHAIN_OF_THOUGHT: Open-ended reasoning
MINIMAL: Minimal prompt
ENTITY_TRACKING: Generic tracking
VARIABLE_BINDING: Variable binding tasks

Parser Functions

parse_integer(): Extract integers from model responses
parse_boolean(): Extract boolean values
parse_variable_binding(): Extract variable binding results

Causal Analysis Tools

CausalExperimentRunner: Specialized runner for causal experiments
run_causal_tracing_experiment(): Core causal tracing functionality
TokenAnalyzer: Token-level analysis and targeting
create_counterfactual_program(): Systematic counterfactual generation
Visualization tools for intervention results

Installation

# Clone and install
git clone <repo>
cd debug
uv sync

# Install in development mode
uv pip install -e .

Dependencies

Core: torch, transformers, accelerate, numpy, matplotlib
Interpretability: nnsight, jaxtyping, torchtyping, einops
Development: pytest, ruff, jupyter, gradio
Performance: bitsandbytes, vllm
Package Management: uv

Development Commands

Running Experiments

# Run all experiments
cd experiments && ./run_experiments.sh all

# Run specific experiment type  
./run_experiments.sh range -m "Qwen/Qwen3-0.6B" -n 50
uv run _04_boolean_simple.py --num-seqs 100

# Run causal tracing experiments
uv run _07_binding_patching.py
uv run _08_full_layer_token_patching.py

# Run with specific seeds for reproducibility
uv run _08_full_layer_token_patching.py --rng-seed 11 --num-hops 1 --seq-len 5

Code Quality

# Run ruff for linting and formatting
ruff check src/
ruff format src/

# Run tests
pytest tests/

Creating Custom Experiments

Custom Generator

def my_generator(seq_len: int, rng: np.random.RandomState):
    # Generate your program
    program = f"x = 0\nfor i in range({seq_len}):\n    x += 2"
    expected = seq_len * 2
    metadata = {"operation": "addition", "increment": 2}
    return program, expected, metadata

config = quick_experiment("my_test", prompts.MINIMAL, my_generator)

Custom Causal Experiment

from debug.causal_experiment_runner import CausalExperimentRunner

runner = CausalExperimentRunner()

# Create custom program pairs
custom_programs = [
    {
        'original': 'x = 5\ny = x * 2\n#y: ',
        'counterfactual': 'x = 3\ny = x * 2\n#y: ',
        'expected_original': 10,
        'expected_counterfactual': 6,
        'variable_chain': ['y', 'x']
    }
]

results = runner.run_intervention_experiment(
    model_id="Qwen/Qwen3-0.6B",
    programs=custom_programs,
    intervention_type="residual_stream"
)

Custom Prompt

MY_PROMPT = """Trace this code step by step:
{code}
Final answer: """

config = quick_experiment("my_test", MY_PROMPT, generators.make_range_program)

Key Patterns

Memory Management

# For memory-constrained environments
result = runner.run(config, no_cache=True)  # Unload models after each use

# For interactive use
runner.preload_models(["model1", "model2"])  # Load once, reuse many times

Batch Processing

# Run multiple configurations efficiently
configs = [config1, config2, config3]
for config in configs:
    result = runner.run(config)  # Uses cached models automatically
    runner.plot(config.name)

Error Analysis

# Analyze failure cases
runner.analyze_errors("experiment_name", show_programs=True)

# Filter by specific criteria
runner.analyze_errors("experiment_name", min_seq_len=5, max_seq_len=10)

Research Applications

This framework supports research into:

Variable Binding Mechanisms: How models track variable assignments across different referential depths
Causal Intervention Analysis: Which layers and positions are critical for variable tracking
Model Comparison: Systematic comparison of different model architectures on code understanding
Failure Mode Analysis: Understanding when and why models fail at code tracing
Attention Head Analysis: Role of individual attention heads in variable binding

The framework is designed for maximum information-per-line-of-code, following principles from Richard Hamming's "You and Your Research" - efficient, iterative experimentation with clear experimental design and easy reproducibility.

Name		Name	Last commit message	Last commit date
Latest commit History 58 Commits
.claude		.claude
experiments		experiments
results		results
src/debug		src/debug
tests		tests
.gitignore		.gitignore
AGENTS.md		AGENTS.md
CLAUDE.md		CLAUDE.md
README.md		README.md
debug.code-workspace		debug.code-workspace
image.png		image.png
pyproject.toml		pyproject.toml
uv.lock		uv.lock
viz_batch.log		viz_batch.log

loftusa/debug

Folders and files

Latest commit

History

Repository files navigation

Debug Experiment Framework

Overview

Quick Start

Basic Experiments

Causal Tracing

Interactive Use (Jupyter/IPython)

Architecture

Core Framework

Experiment Types

Results Structure

Completed Experiments

Cross‑Model Accuracy Snapshot

Causal Tracing Methodology

Core Concept

Implementation

Intervention Types

Token Targeting Strategy

Robust Program Generation

Components

Program Generators

Prompt Templates

Parser Functions

Causal Analysis Tools

Installation

Dependencies

Development Commands

Running Experiments

Code Quality

Creating Custom Experiments

Custom Generator

Custom Causal Experiment

Custom Prompt

Key Patterns

Memory Management

Batch Processing

Error Analysis

Research Applications

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 2

Uh oh!

Languages

Packages