This project tackles the style change detection task for multi-author documents. The goal is to identify positions within a text where the writing style changes, indicating a potential switch in authorship. This is accomplished at the sentence level by analyzing each pair of consecutive sentences.
Given a multi-author document, the task is to detect all positions between consecutive sentences where the writing style changes. This has practical applications in:
- Plagiarism detection (without comparison texts)
- Uncovering gift authorships
- Verifying claimed authorship
- Developing writing support technology
Dataset source: PAN25 Multi-Author Writing Style Analysis
The project uses three difficulty levels, each controlling the relationship between topic and authorship changes:
- Easy: High topic diversity across sentences (topic can signal authorship changes)
- Medium: Low topic diversity (requires focus on stylistic features)
- Hard: All sentences share the same topic (pure style analysis)
Each dataset is split into:
- Training set (70%): With ground truth for model development
- Validation set (15%): With ground truth for model optimization
- Test set (15%): Without ground truth for final evaluation
- All documents are in English
- Documents may contain arbitrary numbers of style changes
- Style changes occur only between sentences (never within a sentence)
- Single sentences are always single-authored
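Because style changes can only occur at sentence boundaries, training examples reduce to consecutive sentence pairs with a binary boundary label. A minimal sketch of that pairing (the helper name and data layout below are illustrative, not the repository's actual loader):

```python
from typing import List, Tuple

def build_sentence_pairs(sentences: List[str], changes: List[int]) -> List[Tuple[str, str, int]]:
    """Pair each sentence with its successor; label 1 marks a style change.

    `changes` holds one 0/1 label per boundary between consecutive sentences,
    so it must contain len(sentences) - 1 entries.
    """
    assert len(changes) == len(sentences) - 1
    return [
        (sentences[i], sentences[i + 1], changes[i])
        for i in range(len(sentences) - 1)
    ]

# Example: a three-sentence document with one authorship switch at the second boundary
pairs = build_sentence_pairs(
    ["First sentence.", "Second sentence.", "Third sentence."],
    [0, 1],
)
print(pairs[1])  # ("Second sentence.", "Third sentence.", 1)
```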
This project implements and compares several transformer-based approaches:
- Custom Lightweight Transformer: A small transformer trained from scratch (~7M parameters) using the GPT-2 tokenizer
  - Learned positional embeddings
  - Multi-head self-attention
  - Layer normalization and GELU activation
  - CLS token pooling for sequence representation
- Pretrained Models: Fine-tuned HuggingFace models including:
  - prajjwal1/bert-mini: Compact BERT variant
  - microsoft/deberta-v3-small: Enhanced BERT with disentangled attention
  - roberta-base: Robustly optimized BERT approach
- Siamese Architecture: Dual-encoder models that:
  - Encode each sentence separately using shared weights
  - Compare embeddings using multiple similarity methods (see the sketch after this list):
    - Concatenation: [emb1, emb2]
    - Absolute difference: |emb1 - emb2|
    - Element-wise multiplication: emb1 * emb2
    - Cosine similarity: angular alignment between embeddings
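For illustration, the Siamese comparison features can be assembled from two sentence embeddings roughly as follows (the tensor shapes and concatenation order are assumptions for this sketch, not the notebook's exact code):

```python
import torch
import torch.nn.functional as F

def siamese_features(emb1: torch.Tensor, emb2: torch.Tensor) -> torch.Tensor:
    """Combine two sentence embeddings of shape (batch, hidden) into one feature vector."""
    concat = torch.cat([emb1, emb2], dim=-1)          # [emb1, emb2]
    abs_diff = torch.abs(emb1 - emb2)                 # |emb1 - emb2|
    product = emb1 * emb2                             # element-wise multiplication
    cosine = F.cosine_similarity(emb1, emb2, dim=-1)  # angular alignment, shape (batch,)
    return torch.cat([concat, abs_diff, product, cosine.unsqueeze(-1)], dim=-1)

# Two batches of 768-dimensional embeddings -> features of size 4 * 768 + 1
features = siamese_features(torch.randn(8, 768), torch.randn(8, 768))
print(features.shape)  # torch.Size([8, 3073])
```

The concatenated feature vector is then fed to a small classification head that predicts same-author vs. different-author.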
The dataset exhibits significant class imbalance (most sentence pairs are same-author). We address this through:
- Weighted Random Sampling: Oversamples minority class during training
- Label Smoothing (0.1): Regularization to prevent overconfident predictions
- Data Augmentation: Optional sentence swapping to increase effective dataset size
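A minimal sketch of the sampling and label-smoothing setup described above (the label values and batch size are placeholders):

```python
import torch
from torch.utils.data import WeightedRandomSampler

# Inverse-frequency weights: minority-class (style change) pairs are drawn more often
train_labels = torch.tensor([0, 0, 0, 0, 1, 0, 1, 0])  # placeholder labels
class_counts = torch.bincount(train_labels)
sample_weights = 1.0 / class_counts[train_labels].float()

sampler = WeightedRandomSampler(
    weights=sample_weights,
    num_samples=len(sample_weights),
    replacement=True,
)
# loader = torch.utils.data.DataLoader(train_dataset, batch_size=32, sampler=sampler)

# Label smoothing (0.1) keeps the classifier from becoming overconfident
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)
```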
- Optimizer: AdamW with weight decay (0.1) for L2 regularization (see the sketch after this list)
- Learning Rate Schedule: OneCycleLR with cosine annealing
  - Warmup phase (10% of training)
  - Peak learning rate based on model type
  - Gradual decay to minimum
- Encoder Freezing: Progressive fine-tuning option
  - Initially freeze the pretrained encoder
  - Unfreeze after a specified fraction of epochs
  - Prevents catastrophic forgetting of pretrained knowledge
- Gradient Clipping: Max norm of 1.0 for training stability
- Early Stopping: Patience of 3 epochs based on validation F1 score
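A minimal sketch of this setup; the model, learning rate, and step counts below are placeholder assumptions rather than the repository's configured values:

```python
import torch
from torch.optim.lr_scheduler import OneCycleLR

model = torch.nn.Linear(768, 2)        # placeholder for the actual classifier
steps_per_epoch, num_epochs = 500, 10  # placeholder sizes

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.1)
scheduler = OneCycleLR(
    optimizer,
    max_lr=2e-5,                       # peak LR depends on the model type
    total_steps=steps_per_epoch * num_epochs,
    pct_start=0.1,                     # 10% warmup
    anneal_strategy='cos',             # cosine decay to a minimum
)

# Progressive fine-tuning: freeze the pretrained encoder, unfreeze it after
# a chosen fraction of epochs (here shown schematically):
# for param in model.encoder.parameters():
#     param.requires_grad = False

# Inside the training loop, after loss.backward():
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
# optimizer.step(); scheduler.step(); optimizer.zero_grad()
```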
- F1 Score (primary metric): Harmonic mean of precision and recall
- Accuracy: Overall correctness
- Precision: Fraction of predicted style changes that are correct
- Recall: Fraction of actual style changes detected
- AUC-ROC: Area under the receiver operating characteristic curve
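These metrics can be computed with scikit-learn; a minimal sketch with placeholder predictions:

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

y_true = [0, 0, 1, 1, 0, 1]              # placeholder ground-truth labels
y_pred = [0, 1, 1, 1, 0, 0]              # placeholder predicted labels
y_prob = [0.2, 0.6, 0.9, 0.8, 0.1, 0.4]  # placeholder P(style change)

print(f"F1:        {f1_score(y_true, y_pred):.3f}")
print(f"Accuracy:  {accuracy_score(y_true, y_pred):.3f}")
print(f"Precision: {precision_score(y_true, y_pred):.3f}")
print(f"Recall:    {recall_score(y_true, y_pred):.3f}")
print(f"AUC-ROC:   {roc_auc_score(y_true, y_prob):.3f}")
```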
- Python 3.12
```bash
pip install -r requirements.txt
```

If using Windows with CUDA 12.1:

```bash
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia -y
```

The main analysis is contained in main.ipynb. Open it in Jupyter:

```bash
jupyter notebook main.ipynb
```

The notebook includes:
- Data Loading & EDA: Automatic download, preprocessing, and exploratory analysis
- Model Architecture: Definition of encoders, classification heads, and full models
- Training: Training loop with class imbalance handling and regularization
- Single Model Experiment: Train and visualize one model's performance
- Model Comparison: Train multiple models and compare test results
- Inference: Interactive predictions on new sentence pairs
```python
from main import get_model_config, train_model, load_model_from_config
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Train a single model
config = get_model_config('microsoft/deberta-v3-small', device)
model, result = train_model(
    config=config,
    train_df=train_df,
    val_df=validation_df
)
```

```python
# Train multiple models and compare them on the test set
comparison_df = compare_models(
    model_names=[
        'custom-lightweight-transformer',
        'prajjwal1/bert-mini',
        'microsoft/deberta-v3-small',
        'siamese-roberta-base'
    ],
    train_df=train_df,
    val_df=validation_df,
    test_df=test_df
)
```

```python
from transformers import AutoTokenizer

# Load trained model
config = get_model_config('microsoft/deberta-v3-small', device)
model = load_model_from_config(config)
tokenizer = AutoTokenizer.from_pretrained(config.model_path)

# Predict on new sentence pair
sentence1 = "The empirical analysis demonstrates a statistically significant correlation."
sentence2 = "lol yeah that's pretty cool i guess, dunno why anyone would care tho"
prediction, confidence = predict_authorship_change(
    sentence1, sentence2, model, tokenizer, device, max_length=128
)
print(f"Different authors: {prediction} (confidence: {confidence:.2%})")
```

Best model: microsoft/deberta-v3-small
| Model | Test F1 | Test Accuracy | Test Precision | Test Recall | Test AUC-ROC |
|---|---|---|---|---|---|
| microsoft/deberta-v3-small | 0.923 | 0.973 | 0.916 | 0.930 | 0.970 |
| siamese-prajjwal1/bert-mini | 0.890 | 0.960 | 0.848 | 0.936 | 0.988 |
| roberta-base | 0.876 | 0.952 | 0.794 | 0.977 | 0.992 |
| siamese-roberta-base | 0.871 | 0.951 | 0.806 | 0.949 | 0.984 |
For comparison, PAN25 shared task results (F1 per difficulty level):

| Team | Approach | Easy | Medium | Hard | Average F1 |
|---|---|---|---|---|---|
| xxsu-team | SCL-DeBERTa | 0.955 | 0.825 | 0.829 | 0.870 |
| stylospies | Graph/Structural Features | 0.959 | 0.786 | 0.791 | 0.845 |
| TMU | Ensemble LaBSE/Siamese BiLSTM | 0.950 | 0.792 | 0.792 | 0.845 |
| better_call_claude | SSPC (BiLSTM/PLM) | 0.929 | 0.815 | 0.731 | 0.825 |
| cornell-1 | Ensembled-BertStyleNN | 0.909 | 0.793 | 0.698 | 0.800 |
| OpenFact | Punctuation-Guided Pretraining | 0.919 | 0.771 | 0.752 | 0.814 |
| jarturog | microsoft/deberta-v3-small | 0.922 | 0.715 | 0.694 | 0.777 |
- Class Imbalance: Weighted sampling proved more effective than weighted loss for handling the 4:1 imbalance ratio
- Siamese Architecture: Mixed results across different base models; requires further investigation with consistent hyperparameters
- Custom vs. Pretrained: Pretrained models significantly outperform custom architectures, highlighting the importance of language understanding from pretraining
- Regularization: Dropout (0.33), label smoothing (0.1), and gradient clipping were critical for preventing overfitting
- Performance Gap: While competitive on easy tasks, our approach lags on medium/hard difficulties, suggesting the need for domain-specific features or ensemble methods
```
.
├── main.ipynb           # Main analysis notebook
├── README.md            # This file
├── requirements.txt     # Python dependencies
├── data/                # Dataset directory (auto-created)
│   ├── easy/            # Easy difficulty dataset
│   ├── medium/          # Medium difficulty dataset
│   ├── hard/            # Hard difficulty dataset
│   └── loaded_data.csv  # Cached processed data
└── results/             # Trained models directory (auto-created)
    ├── custom-lightweight-transformer/
    ├── microsoft-deberta-v3-small/
    └── ...
```
- Hyperparameter Optimization: Grid search or Bayesian optimization for better parameter tuning
- Contrastive Learning: Implement supervised contrastive learning (SCL) similar to top-performing teams
- Domain-Specific Features: Incorporate linguistic features (punctuation patterns, sentence structure, vocabulary richness)
- Ensemble Methods: Combine predictions from multiple models for improved robustness
- Data Augmentation: Back-translation, paraphrasing, or SMOTE on embeddings
- Focal Loss: Alternative loss function specifically designed for imbalanced datasets (see the sketch after this list)
- Larger Models: Fine-tune DeBERTa-large or RoBERTa-large for potentially better performance
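For reference, a minimal binary focal-loss sketch; the α and γ values are conventional defaults, not tuned for this task:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """Binary focal loss: down-weights easy examples so rare style changes dominate."""
    bce = F.binary_cross_entropy_with_logits(logits, targets.float(), reduction='none')
    p_t = torch.exp(-bce)                                 # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()

# Example with random logits and 0/1 labels for a batch of 8 sentence pairs
loss = focal_loss(torch.randn(8), torch.randint(0, 2, (8,)))
```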
For detailed implementation and experiments, see main.ipynb.