This project tackles the style change detection task for multi-author documents. The goal is to identify positions within a text where the writing style changes, indicating a potential switch in authorship. This is accomplished at the sentence level by analyzing each pair of consecutive sentences.
Given a multi-author document, the task is to detect all positions between consecutive sentences where the writing style changes. This has practical applications in:
- Plagiarism detection (without comparison texts)
- Uncovering gift authorships
- Verifying claimed authorship
- Developing writing support technology
Dataset source: PAN25 Multi-Author Writing Style Analysis
The project uses three difficulty levels, each controlling the relationship between topic and authorship changes:
- Easy: High topic diversity across sentences (topic can signal authorship changes)
- Medium: Low topic diversity (requires focus on stylistic features)
- Hard: All sentences share the same topic (pure style analysis)
Each dataset is split into:
- Training set (70%): With ground truth for model development
- Validation set (15%): With ground truth for model optimization
- Test set (15%): Without ground truth for final evaluation
- All documents are in English
- Documents may contain arbitrary numbers of style changes
- Style changes occur only between sentences (never within a sentence)
- Single sentences are always single-authored
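Because style changes can only occur at sentence boundaries, training examples reduce to consecutive sentence pairs with a binary boundary label. A minimal sketch of that pairing (the helper name and data layout below are illustrative, not the repository's actual loader):

```python
from typing import List, Tuple

def build_sentence_pairs(sentences: List[str], changes: List[int]) -> List[Tuple[str, str, int]]:
    """Pair each sentence with its successor; label 1 marks a style change.

    `changes` holds one 0/1 label per boundary between consecutive sentences,
    so it must contain len(sentences) - 1 entries.
    """
    assert len(changes) == len(sentences) - 1
    return [
        (sentences[i], sentences[i + 1], changes[i])
        for i in range(len(sentences) - 1)
    ]

# Example: a three-sentence document with one authorship switch at the second boundary
pairs = build_sentence_pairs(
    ["First sentence.", "Second sentence.", "Third sentence."],
    [0, 1],
)
print(pairs[1])  # ("Second sentence.", "Third sentence.", 1)
```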
This project implements and compares several transformer-based approaches:
- Custom Lightweight Transformer: A small transformer trained from scratch (~7M parameters) using the GPT-2 tokenizer
  - Learned positional embeddings
  - Multi-head self-attention
  - Layer normalization and GELU activation
  - CLS token pooling for sequence representation
- Pretrained Models: Fine-tuned HuggingFace models including:
  - prajjwal1/bert-mini: Compact BERT variant
  - microsoft/deberta-v3-small: Enhanced BERT with disentangled attention
  - roberta-base: Robustly optimized BERT approach
- Siamese Architecture: Dual-encoder models that:
  - Encode each sentence separately using shared weights
  - Compare embeddings using multiple similarity methods (see the sketch after this list):
    - Concatenation: [emb1, emb2]
    - Absolute difference: |emb1 - emb2|
    - Element-wise multiplication: emb1 * emb2
    - Cosine similarity: angular alignment between embeddings
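For illustration, the Siamese comparison features can be assembled from two sentence embeddings roughly as follows (the tensor shapes and concatenation order are assumptions for this sketch, not the notebook's exact code):

```python
import torch
import torch.nn.functional as F

def siamese_features(emb1: torch.Tensor, emb2: torch.Tensor) -> torch.Tensor:
    """Combine two sentence embeddings of shape (batch, hidden) into one feature vector."""
    concat = torch.cat([emb1, emb2], dim=-1)          # [emb1, emb2]
    abs_diff = torch.abs(emb1 - emb2)                 # |emb1 - emb2|
    product = emb1 * emb2                             # element-wise multiplication
    cosine = F.cosine_similarity(emb1, emb2, dim=-1)  # angular alignment, shape (batch,)
    return torch.cat([concat, abs_diff, product, cosine.unsqueeze(-1)], dim=-1)

# Two batches of 768-dimensional embeddings -> features of size 4 * 768 + 1
features = siamese_features(torch.randn(8, 768), torch.randn(8, 768))
print(features.shape)  # torch.Size([8, 3073])
```

The concatenated feature vector is then fed to a small classification head that predicts same-author vs. different-author.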
The dataset exhibits significant class imbalance (most sentence pairs are same-author). We address this through:
- Weighted Random Sampling: Oversamples minority class during training
- Label Smoothing (0.1): Regularization to prevent overconfident predictions
- Data Augmentation: Optional sentence swapping to increase effective dataset size
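A minimal sketch of the sampling and label-smoothing setup described above (the label values and batch size are placeholders):

```python
import torch
from torch.utils.data import WeightedRandomSampler

# Inverse-frequency weights: minority-class (style change) pairs are drawn more often
train_labels = torch.tensor([0, 0, 0, 0, 1, 0, 1, 0])  # placeholder labels
class_counts = torch.bincount(train_labels)
sample_weights = 1.0 / class_counts[train_labels].float()

sampler = WeightedRandomSampler(
    weights=sample_weights,
    num_samples=len(sample_weights),
    replacement=True,
)
# loader = torch.utils.data.DataLoader(train_dataset, batch_size=32, sampler=sampler)

# Label smoothing (0.1) keeps the classifier from becoming overconfident
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)
```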
- Optimizer: AdamW with weight decay (0.1) for L2 regularization (see the sketch after this list)
- Learning Rate Schedule: OneCycleLR with cosine annealing
  - Warmup phase (10% of training)
  - Peak learning rate based on model type
  - Gradual decay to minimum
- Encoder Freezing: Progressive fine-tuning option
  - Initially freeze the pretrained encoder
  - Unfreeze after a specified fraction of epochs
  - Prevents catastrophic forgetting of pretrained knowledge
- Gradient Clipping: Max norm of 1.0 for training stability
- Early Stopping: Patience of 3 epochs based on validation F1 score
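A minimal sketch of this setup; the model, learning rate, and step counts below are placeholder assumptions rather than the repository's configured values:

```python
import torch
from torch.optim.lr_scheduler import OneCycleLR

model = torch.nn.Linear(768, 2)        # placeholder for the actual classifier
steps_per_epoch, num_epochs = 500, 10  # placeholder sizes

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.1)
scheduler = OneCycleLR(
    optimizer,
    max_lr=2e-5,                       # peak LR depends on the model type
    total_steps=steps_per_epoch * num_epochs,
    pct_start=0.1,                     # 10% warmup
    anneal_strategy='cos',             # cosine decay to a minimum
)

# Progressive fine-tuning: freeze the pretrained encoder, unfreeze it after
# a chosen fraction of epochs (here shown schematically):
# for param in model.encoder.parameters():
#     param.requires_grad = False

# Inside the training loop, after loss.backward():
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
# optimizer.step(); scheduler.step(); optimizer.zero_grad()
```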
- F1 Score (primary metric): Harmonic mean of precision and recall
- Accuracy: Overall correctness
- Precision: Fraction of predicted style changes that are correct
- Recall: Fraction of actual style changes detected
- AUC-ROC: Area under the receiver operating characteristic curve
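These metrics can be computed with scikit-learn; a minimal sketch with placeholder predictions:

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

y_true = [0, 0, 1, 1, 0, 1]              # placeholder ground-truth labels
y_pred = [0, 1, 1, 1, 0, 0]              # placeholder predicted labels
y_prob = [0.2, 0.6, 0.9, 0.8, 0.1, 0.4]  # placeholder P(style change)

print(f"F1:        {f1_score(y_true, y_pred):.3f}")
print(f"Accuracy:  {accuracy_score(y_true, y_pred):.3f}")
print(f"Precision: {precision_score(y_true, y_pred):.3f}")
print(f"Recall:    {recall_score(y_true, y_pred):.3f}")
print(f"AUC-ROC:   {roc_auc_score(y_true, y_prob):.3f}")
```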
- Python 3.12
```bash
pip install -r requirements.txt
```

If using Windows with CUDA 12.1:

```bash
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia -y
```

The main analysis is contained in main.ipynb. Open it in Jupyter:

```bash
jupyter notebook main.ipynb
```

The notebook includes:
- Data Loading & EDA: Automatic download, preprocessing, and exploratory analysis
- Model Architecture: Definition of encoders, classification heads, and full models
- Training: Training loop with class imbalance handling and regularization
- Single Model Experiment: Train and visualize one model's performance
- Model Comparison: Train multiple models and compare test results
- Inference: Interactive predictions on new sentence pairs
```python
from main import get_model_config, train_model, load_model_from_config
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Train a single model
config = get_model_config('microsoft/deberta-v3-small', device)
model, result = train_model(
    config=config,
    train_df=train_df,
    val_df=validation_df
)
```

```python
# Train multiple models and compare them on the test set
comparison_df = compare_models(
    model_names=[
        'custom-lightweight-transformer',
        'prajjwal1/bert-mini',
        'microsoft/deberta-v3-small',
        'siamese-roberta-base'
    ],
    train_df=train_df,
    val_df=validation_df,
    test_df=test_df
)
```

```python
from transformers import AutoTokenizer

# Load trained model
config = get_model_config('microsoft/deberta-v3-small', device)
model = load_model_from_config(config)
tokenizer = AutoTokenizer.from_pretrained(config.model_path)

# Predict on new sentence pair
sentence1 = "The empirical analysis demonstrates a statistically significant correlation."
sentence2 = "lol yeah that's pretty cool i guess, dunno why anyone would care tho"
prediction, confidence = predict_authorship_change(
    sentence1, sentence2, model, tokenizer, device, max_length=128
)
print(f"Different authors: {prediction} (confidence: {confidence:.2%})")
```

Best model: microsoft/deberta-v3-small
| Model | Test F1 | Test Accuracy | Test Precision | Test Recall | Test AUC-ROC |
|---|---|---|---|---|---|
| microsoft/deberta-v3-small | 0.923 | 0.973 | 0.916 | 0.930 | 0.970 |
| siamese-prajjwal1/bert-mini | 0.890 | 0.960 | 0.848 | 0.936 | 0.988 |
| roberta-base | 0.876 | 0.952 | 0.794 | 0.977 | 0.992 |
| siamese-roberta-base | 0.871 | 0.951 | 0.806 | 0.949 | 0.984 |
For comparison, PAN25 shared task results (F1 per difficulty level):

| Team | Approach | Easy | Medium | Hard | Average F1 |
|---|---|---|---|---|---|
| xxsu-team | SCL-DeBERTa | 0.955 | 0.825 | 0.829 | 0.870 |
| stylospies | Graph/Structural Features | 0.959 | 0.786 | 0.791 | 0.845 |
| TMU | Ensemble LaBSE/Siamese BiLSTM | 0.950 | 0.792 | 0.792 | 0.845 |
| better_call_claude | SSPC (BiLSTM/PLM) | 0.929 | 0.815 | 0.731 | 0.825 |
| cornell-1 | Ensembled-BertStyleNN | 0.909 | 0.793 | 0.698 | 0.800 |
| OpenFact | Punctuation-Guided Pretraining | 0.919 | 0.771 | 0.752 | 0.814 |
| jarturog | microsoft/deberta-v3-small | 0.922 | 0.715 | 0.694 | 0.777 |
- Class Imbalance: Weighted sampling proved more effective than weighted loss for handling the 4:1 imbalance ratio
- Siamese Architecture: Mixed results across different base models; requires further investigation with consistent hyperparameters
- Custom vs. Pretrained: Pretrained models significantly outperform custom architectures, highlighting the importance of language understanding from pretraining
- Regularization: Dropout (0.33), label smoothing (0.1), and gradient clipping were critical for preventing overfitting
- Performance Gap: While competitive on easy tasks, our approach lags on medium/hard difficulties, suggesting the need for domain-specific features or ensemble methods
```
.
├── main.ipynb           # Main analysis notebook
├── README.md            # This file
├── requirements.txt     # Python dependencies
├── data/                # Dataset directory (auto-created)
│   ├── easy/            # Easy difficulty dataset
│   ├── medium/          # Medium difficulty dataset
│   ├── hard/            # Hard difficulty dataset
│   └── loaded_data.csv  # Cached processed data
└── results/             # Trained models directory (auto-created)
    ├── custom-lightweight-transformer/
    ├── microsoft-deberta-v3-small/
    └── ...
```
- Hyperparameter Optimization: Grid search or Bayesian optimization for better parameter tuning
- Contrastive Learning: Implement supervised contrastive learning (SCL) similar to top-performing teams
- Domain-Specific Features: Incorporate linguistic features (punctuation patterns, sentence structure, vocabulary richness)
- Ensemble Methods: Combine predictions from multiple models for improved robustness
- Data Augmentation: Back-translation, paraphrasing, or SMOTE on embeddings
- Focal Loss: Alternative loss function specifically designed for imbalanced datasets (see the sketch after this list)
- Larger Models: Fine-tune DeBERTa-large or RoBERTa-large for potentially better performance
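For reference, a minimal binary focal-loss sketch; the α and γ values are conventional defaults, not tuned for this task:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """Binary focal loss: down-weights easy examples so rare style changes dominate."""
    bce = F.binary_cross_entropy_with_logits(logits, targets.float(), reduction='none')
    p_t = torch.exp(-bce)                                 # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()

# Example with random logits and 0/1 labels for a batch of 8 sentence pairs
loss = focal_loss(torch.randn(8), torch.randint(0, 2, (8,)))
```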
For detailed implementation and experiments, see main.ipynb.