This repository contains a Reward Model based on Qwen 2.5 3B, fine-tuned on the Anthropic RLHF dataset using trl.RewardTrainer. The model is designed to score completions given a prompt, which is useful for reinforcement learning from human feedback (RLHF) pipelines and evaluation tasks.
The model is available on Hugging Face: https://huggingface.co/kanishkez/Reward-Model
- Base Model: Qwen 2.5 3B
- Fine-Tuned On: Anthropic RLHF dataset
- Output: Single scalar reward score per prompt-completion pair
- Framework: PyTorch + Transformers + TRL
- Model Type: Reward Model for RLHF
- Language: English (primarily)
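Because the model returns a single scalar per prompt-completion pair, one common use is to rank several candidate completions for the same prompt. The sketch below is a minimal illustration, not code from this repository: `pick_best`, `score_fn`, and `toy_score` are hypothetical names, and the toy scorer only stands in for a real call to the reward model.

```python
from typing import Callable, List, Tuple


def pick_best(
    prompt: str,
    completions: List[str],
    score_fn: Callable[[str, str], float],
) -> Tuple[str, float]:
    """Return the highest-scoring completion together with its score."""
    if not completions:
        raise ValueError("completions must not be empty")
    scored = [(completion, score_fn(prompt, completion)) for completion in completions]
    return max(scored, key=lambda pair: pair[1])


def toy_score(prompt: str, completion: str) -> float:
    # Hypothetical stand-in for the reward model:
    # counts lowercase whitespace-separated tokens shared with the prompt.
    prompt_tokens = set(prompt.lower().split())
    completion_tokens = set(completion.lower().split())
    return float(len(prompt_tokens & completion_tokens))


best, score = pick_best(
    "What is the capital of France?",
    ["I do not know.", "The capital of France is Paris."],
    toy_score,
)
print(best, score)
```

In practice `score_fn` would wrap the tokenizer and model call, so the selection logic stays independent of how scores are produced.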
Install the required dependencies:

    pip install torch transformers datasets trl

For GPU acceleration (recommended):

    pip install torch transformers datasets trl accelerate

To score a prompt-completion pair:

    from transformers import AutoTokenizer, AutoModelForSequenceClassification
    import torch

    # Load model and tokenizer
    model_name = "kanishkez/Reward-Model"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    model.eval()

    # Prepare input: the prompt and completion are joined with a newline
    prompt = "What is the capital of France?"
    completion = "The capital of France is Paris."
    input_text = f"{prompt}\n{completion}"

    # Tokenize and get reward score
    inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)

    reward_score = outputs.logits[0].item()
    print(f"Reward Score: {reward_score:.4f}")

To train the model from scratch:

    python trainer.py

Training Configuration:
- Learning rate: Configured in trainer.py
- Batch size: Optimized for available GPU memory
- Epochs: Specified in training script
- Dataset: Anthropic RLHF dataset
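For context on what training optimizes: to my understanding, `trl.RewardTrainer` by default fits a pairwise logistic (Bradley-Terry style) objective on chosen/rejected pairs, so the loss shrinks as the chosen completion's score rises above the rejected one's. The sketch below assumes that objective and is not taken from `trainer.py`; the function names are hypothetical, and it uses plain Python floats rather than tensors to stay self-contained.

```python
import math


def pairwise_loss(chosen_score: float, rejected_score: float) -> float:
    """Compute -log(sigmoid(chosen - rejected)) in a numerically stable way."""
    diff = chosen_score - rejected_score
    # -log(sigmoid(diff)) equals log(1 + exp(-diff)); branch on the sign
    # so that exp() is only ever called on a non-positive argument.
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


def preference_probability(chosen_score: float, rejected_score: float) -> float:
    """Implied probability that the chosen completion is preferred."""
    return math.exp(-pairwise_loss(chosen_score, rejected_score))


# Equal scores give a 50/50 preference; a larger gap lowers the loss.
print(preference_probability(0.0, 0.0))
print(pairwise_loss(2.0, 0.0), pairwise_loss(0.0, 2.0))
```

One consequence of this objective is that only score differences are constrained, so absolute reward values are best compared within the same prompt.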
To run inference on the trained model:

    python inference.py

The model was evaluated using RewardBench, a comprehensive benchmark for reward models. Results:
| Category | Score |
|---|---|
| Chat | 83.5% |
| Chat Hard | 53.2% |
| Safety | 72.2% |
| Reasoning | 73.4% |
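As a rough guide to reading these numbers: RewardBench-style accuracy is, as far as I understand it, the fraction of chosen/rejected pairs in which the reward model scores the chosen completion higher. The sketch below illustrates that calculation under this assumption; `pairwise_accuracy` is a hypothetical helper and the scores are made-up values, not outputs of this model.

```python
from typing import Iterable, Tuple


def pairwise_accuracy(score_pairs: Iterable[Tuple[float, float]]) -> float:
    """Fraction of (chosen, rejected) pairs where chosen scores strictly higher."""
    pairs = list(score_pairs)
    if not pairs:
        raise ValueError("score_pairs must not be empty")
    correct = sum(1 for chosen, rejected in pairs if chosen > rejected)
    return correct / len(pairs)


# Made-up (chosen_score, rejected_score) pairs for illustration only.
example_pairs = [(1.2, 0.3), (0.1, 0.4), (2.0, -1.0), (0.5, 0.5)]
accuracy = pairwise_accuracy(example_pairs)
print(f"Accuracy: {accuracy:.1%}")
```

Ties are counted as incorrect in this sketch; a benchmark implementation may handle them differently.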
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Submit a pull request with a clear description