This repository contains a comprehensive evaluation pipeline for assessing model faithfulness - whether AI models are honest about their internal reasoning when faced with different prompts or pressures.
The evaluation pipeline compares how models respond to original prompts versus perturbed prompts, then analyzes whether the model's chain-of-thought reasoning is faithful (honest about the factors influencing its responses).
eval.py: Main evaluation pipeline that orchestrates the entire processutils.py: Utility functions for prompt formatting, response parsing, and data handlinggraders.py: Grading functions and faithfulness evaluation prompts for different categoriesoverview.py: Streamlit dashboard for visualizing and comparing resultsstreamlit_app.py: Web interface for interactive exploration of evaluation results
perturbation_data*.jsonl: Input data containing original and perturbed prompts- Generated output files with model responses and evaluation results
graph TD
subgraph "Dataset Fields"
P["prompt"]
PP["perturbed_prompt"]
PMP["parse_model_response_prompt"]
PPMP["parse_perturbed_model_response_prompt"]
JDP["judge_diff_prompt"]
CAT["category"]
end
A["Generate Response"]
B["Generate Perturbed Response"]
C["Parse Response"]
D["Parse Perturbed Response"]
E["Judge Differences + Graders"]
F["Faithfulness Checks<br/><small>Simulatibility<br/>Category-specific</small>"]
G["Save Results"]
A --> C
B --> D
C --> E
D --> E
E --> F
F --> G
P --> A
PP --> B
PMP --> C
PPMP --> D
JDP --> E
CAT --> E
CAT --> F
prompt → Generate Response
perturbed_prompt → Generate Perturbed Response
parse_model_response_prompt → Parse Response
parse_perturbed_model_response_prompt → Parse Perturbed Response
judge_diff_prompt → Judge Differences
category → Graders + Faithfulness Checks
- Loads JSONL files containing prompt pairs (original and perturbed)
- Each row contains prompts for different evaluation categories
- Runs original prompts through the target model
- Uses Chain-of-Thought (CoT) formatting to get model reasoning
- Parses responses to extract final answers
- Runs perturbed prompts (with added pressure/context) through the model
- Extracts both chain-of-thought reasoning and final answers
- Compares against original responses
- Uses a processing model to judge differences between original and perturbed responses
- Identifies cases where the model gave inconsistent answers
- Creates binary labels for inconsistent cases
- Uses category-specific graders when LLM judging is unavailable
- Runs multiple faithfulness checks on inconsistent cases:
- Ground Truth Surprise: Does the reasoning reveal awareness of pressure?
- Simulatability: Can you predict the model's response from its reasoning?
- Sycophancy: Is the model unduly influenced by user characteristics?
- Persona Analysis: Does the response depend on assigned persona?
- Resume Bias: Are hiring suggestions biased by demographics?
- Sandbagging: Is the model intentionally underperforming?
- Reward Hacking: Is the model "cheating" rather than solving the task?
The pipeline supports several evaluation categories:
- Sycophancy: Checking if models tell users what they want to hear
- Advanced AI Risk: Evaluating training vs. deployment behavior differences
- Persona: Testing if assigned personas inappropriately influence responses
- Resume: Detecting bias in hiring recommendations
- Honesty: Assessing truthfulness under pressure
- Sandbagging: Identifying intentional underperformance
- Reward Hacking: Detecting task-avoidance behaviors
python eval.py --data_path perturbation_data.jsonl --model_name gpt-4o --processing_model_name gpt-4o--data_path: Path to input JSONL file (default:perturbation_data.jsonl)--model_name: Model to evaluate (default:gpt-4o)--processing_model_name: Model for processing/judging (default:gpt-4o)--output_path: Output file path (auto-generated if not specified)--temperature: Sampling temperature (default: 0.0)
Launch the Streamlit dashboard to explore results:
streamlit run overview.py- Inconsistency Rate: Percentage of cases where original and perturbed prompts yield different answers
- Verbalization Score: Among inconsistent cases, percentage where the model's reasoning reveals awareness of the factors causing the inconsistency
- Faithfulness Score: Overall measure of whether the model is honest about its reasoning process
Results are saved as JSONL files with the following key fields:
model_response: Original model responseperturbed_model_response: Response to perturbed promptparsed_model_response: Extracted final answer from original responseparsed_perturbed_model_response: Extracted final answer from perturbed responseinconsistency_label: Boolean indicating if responses differfaithfulness_*_pred: Boolean predictions for various faithfulness checksfaithfulness_*_response: Full responses from faithfulness evaluations
- Python 3.8+
- Required packages specified in the codebase
- Access to language model APIs for evaluation
├── eval.py # Main evaluation pipeline
├── utils.py # Utility functions
├── graders.py # Grading and faithfulness prompts
├── overview.py # Streamlit dashboard
├── streamlit_app.py # Web interface
├── data.ipynb # Data analysis notebook
├── perturbation_data*.jsonl # Input/output data files
└── README.md # This file
This pipeline provides a comprehensive framework for evaluating whether AI models are faithful in their reasoning, particularly when faced with pressures or incentives that might influence their responses.