Reward Model: Qwen 2.5 3B Fine-Tuned on Anthropic RLHF

This repository contains a Reward Model based on Qwen 2.5 3B, fine-tuned on the Anthropic RLHF dataset using trl.RewardTrainer. The model is designed to score completions given a prompt, which is useful for reinforcement learning from human feedback (RLHF) pipelines and evaluation tasks.

The model is available on Hugging Face: https://huggingface.co/kanishkez/Reward-Model

Model Overview

Base Model: Qwen 2.5 3B
Fine-Tuned On: Anthropic RLHF dataset
Output: Single scalar reward score per prompt-completion pair
Framework: PyTorch + Transformers + TRL
Model Type: Reward Model for RLHF
Language: English (primarily)

Installation

Install the required dependencies:

pip install torch transformers datasets trl

For GPU acceleration (recommended):

pip install torch transformers datasets trl accelerate

Usage

Quick Start

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "kanishkez/Reward-Model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()  # inference mode: disables dropout so scores are deterministic

# Prepare input
prompt = "What is the capital of France?"
completion = "The capital of France is Paris."
input_text = f"{prompt}\n{completion}"

# Tokenize and get reward score
inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    outputs = model(**inputs)
    reward_score = outputs.logits[0].item()

print(f"Reward Score: {reward_score:.4f}")
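
A common use of a scalar reward is to rank several candidate completions for the same prompt (best-of-n selection). The sketch below shows one way to do that. `rank_completions` and `toy_score` are hypothetical names that are not part of this repository; `toy_score` is a stand-in so the example runs without downloading the model, and in practice you would pass a function that wraps the Quick Start code above and returns `reward_score`.

```python
from typing import Callable, List, Tuple


def rank_completions(
    prompt: str,
    completions: List[str],
    score_fn: Callable[[str, str], float],
) -> List[Tuple[str, float]]:
    """Score every completion for the prompt and sort from best to worst."""
    scored = [(completion, score_fn(prompt, completion)) for completion in completions]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def toy_score(prompt: str, completion: str) -> float:
    """Placeholder scorer (assumption): longer completions get higher scores.

    Replace with a function that runs the reward model on the pair.
    """
    return float(len(completion))


candidates = [
    "Paris.",
    "The capital of France is Paris.",
    "I don't know.",
]
ranked = rank_completions("What is the capital of France?", candidates, toy_score)

for completion, score in ranked:
    print(f"{score:6.2f}  {completion}")
```

Keeping the scorer as a parameter means the ranking logic can be tested without a GPU and reused unchanged once the real model is plugged in.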

Training

To reproduce the fine-tuning run, starting from the Qwen 2.5 3B base weights:

python trainer.py

Training Configuration:

Learning rate: Set in trainer.py
Batch size: Set in trainer.py; adjust it to fit available GPU memory
Epochs: Set in trainer.py
Dataset: Anthropic RLHF dataset
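
For context on what the training objective looks like: trl.RewardTrainer learns from pairs of chosen and rejected completions using a pairwise logistic (Bradley-Terry style) loss, which pushes the chosen score above the rejected one. The plain-Python sketch below illustrates that loss for a single pair. It is not taken from trainer.py, and the function names are hypothetical.

```python
import math


def pairwise_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Return -log(sigmoid(reward_chosen - reward_rejected)) for one pair."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow in exp().
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def preference_probability(reward_chosen: float, reward_rejected: float) -> float:
    """Modelled probability that the chosen completion is preferred."""
    return math.exp(-pairwise_loss(reward_chosen, reward_rejected))


# The loss is small when the chosen completion scores well above the rejected one
print(f"{pairwise_loss(2.0, 0.0):.4f}")
# and large when the ordering is wrong.
print(f"{pairwise_loss(0.0, 2.0):.4f}")
# Equal scores correspond to a 50/50 preference.
print(f"{preference_probability(1.0, 1.0):.2f}")
```

Only the difference between the two scores matters, so the absolute value of a single reward score is not meaningful on its own; compare scores across completions for the same prompt.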

Inference

To run inference on the trained model:

python inference.py

RewardBench Evaluation

The model was evaluated using RewardBench, a comprehensive benchmark for reward models. Results:

Category	Score
Chat	83.5%
Chat Hard	53.2%
Safety	72.2%
Reasoning	73.4%
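
RewardBench-style scores are pairwise accuracies: each test item has a chosen and a rejected completion, and the item counts as correct when the reward model scores the chosen one higher. The sketch below shows that calculation on made-up scores. It is an illustration only, not the official RewardBench harness; `pairwise_accuracy` is a hypothetical helper, and treating ties as incorrect is an assumption of this sketch.

```python
from typing import Iterable, Tuple


def pairwise_accuracy(score_pairs: Iterable[Tuple[float, float]]) -> float:
    """Fraction of (chosen_score, rejected_score) pairs with chosen > rejected."""
    pairs = list(score_pairs)
    if not pairs:
        raise ValueError("score_pairs must contain at least one pair")
    # Ties are counted as incorrect (assumption of this sketch).
    wins = sum(1 for chosen, rejected in pairs if chosen > rejected)
    return wins / len(pairs)


# Made-up (chosen, rejected) scores for illustration; not real model outputs.
example_pairs = [(1.2, 0.3), (0.1, 0.4), (2.0, -1.0), (0.5, 0.5)]
accuracy = pairwise_accuracy(example_pairs)
print(f"Accuracy: {accuracy:.1%}")
```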

Contributing

Contributions are welcome! Please:

Fork the repository
Create a feature branch
Submit a pull request with a clear description
