Anomaly Detection ML Project

A production-grade anomaly detection machine learning system demonstrating senior-level ML engineering capabilities. This project implements multiple classical anomaly detection algorithms with comprehensive experiment tracking, automated reporting, and containerized deployment.

Note: For a complete AI-generated implementation (not yet reviewed by me), check out the r1-t14 branch.

🎯 Project Overview

This system showcases modern ML engineering best practices through a complete anomaly detection pipeline. It's designed as an interview-ready demonstration of production ML systems, emphasizing:

  • Reproducibility: Version-controlled data, deterministic pipelines, and comprehensive experiment tracking
  • Maintainability: Clean architecture, comprehensive testing, and type safety
  • Scalability: Containerized deployment, configurable pipelines, and efficient resource usage
  • Observability: Detailed logging, metrics tracking, and automated reporting

✨ Key Features

Core ML Capabilities

  • Multi-algorithm Support: IsolationForest, LocalOutlierFactor, OneClassSVM, RobustCovariance
  • Supervised Baseline: GradientBoostingClassifier with class balancing and probability calibration
  • Automated Preprocessing: sklearn ColumnTransformer with intelligent feature type detection
  • Threshold Optimization: Multiple strategies (percentile, Youden's J, fixed FPR)
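
As a rough illustration of the Youden's J strategy (a sketch only; the helper name below is hypothetical and the project's threshold module handles this internally):

import numpy as np
from sklearn.metrics import roc_curve


def youden_j_threshold(y_true, scores):
    # Pick the score cut-off that maximizes TPR - FPR on validation data
    fpr, tpr, thresholds = roc_curve(y_true, scores)
    return thresholds[np.argmax(tpr - fpr)]


# Toy example where higher score = more anomalous
y_val = np.array([0, 0, 0, 1, 0, 1])
scores = np.array([0.10, 0.20, 0.15, 0.90, 0.30, 0.80])
print(youden_j_threshold(y_val, scores))   # 0.8 for this toy data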

Engineering Excellence

  • Configuration Management: Hydra-based hierarchical configuration with Pydantic validation
  • Experiment Tracking: MLflow integration for comprehensive experiment management
  • Automated Reporting: Quarto-based reports with interactive Plotly visualizations
  • Quality Assurance: Comprehensive testing (unit, integration, smoke), type checking, and linting
  • Containerization: Multi-stage Docker builds with Podman support

Production Features

  • Data Versioning: DVC for reproducible data pipelines
  • Hyperparameter Optimization: Optuna integration with MLflow callbacks
  • API Serving: FastAPI with Pydantic validation for model serving (see the sketch after this list)
  • Health Monitoring: Built-in health checks and resource monitoring
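
A hedged sketch of what such an endpoint could look like (route names and schemas here are illustrative, not the project's actual API; in the project the server is started via just podman-serve):

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Anomaly Detection API")


class ScoreRequest(BaseModel):
    features: list[float]            # one preprocessed feature vector


class ScoreResponse(BaseModel):
    anomaly_score: float
    is_anomaly: bool


@app.post("/score", response_model=ScoreResponse)
def score(req: ScoreRequest) -> ScoreResponse:
    # Placeholder scoring; a real handler would call the loaded model's
    # decision_function / predict on req.features.
    s = sum(req.features)
    return ScoreResponse(anomaly_score=s, is_anomaly=s > 0)


@app.get("/health")
def health() -> dict[str, str]:
    # Simple liveness probe used by container health checks
    return {"status": "ok"}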

🚀 Quick Start

Prerequisites

  • Python 3.11+: Modern Python with type hints support
  • uv: Fast Python package manager (replaces pip/poetry)
  • Just: Command runner (optional but recommended)
  • Git: For version control and DVC integration

Installation

Option 1: One-Command Bootstrap (Recommended)

just bootstrap

This will:

  • Install uv if not present
  • Install all dependencies with proper extras
  • Initialize DVC for data versioning
  • Create required directory structure
  • Set up development environment

Option 2: Manual Setup

# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone and enter project directory
git clone <repository-url>
cd anomaly-detection-ml-project

# Install dependencies
uv sync --all-extras

# Initialize DVC
uv run dvc init --no-scm

# Create directory structure
mkdir -p data/{raw,interim,processed} models reports/figures

First Run: Complete Pipeline

Follow these steps for your first complete run:

# 1. Set up Kaggle credentials (for dataset download)
export KAGGLE_USERNAME="your_username"
export KAGGLE_KEY="your_api_key"

# 2. Download and prepare data
just data-download
just data-prep

# 3. Train your first model
just train

# 4. Evaluate the model
just eval

# 5. Generate a comprehensive report
just report

# 6. Start MLflow UI to explore results
just mlflow-ui
# Visit http://localhost:5000 to see experiments

Quick Examples

# Train different algorithms
just train model=isolation_forest
just train model=lof
just train model=one_class_svm

# Override hyperparameters
just train model.hyperparameters.contamination=0.05

# Run hyperparameter optimization
just tune

# Generate reports for specific runs
just report --run-id <mlflow-run-id>

# Run in container
just podman-build
just podman-run

βš™οΈ Configuration System

The project uses Hydra for hierarchical configuration management with Pydantic validation. This provides type-safe, composable configurations that can be easily overridden.

Configuration Structure

config/
├── config.yaml                   # Main configuration with defaults
├── dataset/
│   ├── credit-card-fraud.yaml    # Credit card fraud dataset
│   └── synthetic.yaml            # Synthetic dataset for testing
├── model/
│   ├── isolation_forest.yaml     # IsolationForest parameters
│   ├── lof.yaml                  # LocalOutlierFactor parameters
│   ├── one_class_svm.yaml        # OneClassSVM parameters
│   └── supervised.yaml           # Supervised baseline
├── tuning/
│   └── optuna.yaml               # Hyperparameter optimization
└── tracking/
    └── mlflow.yaml               # MLflow configuration

Configuration Examples

Basic Model Selection

# Use different algorithms
just train model=isolation_forest
just train model=lof
just train model=one_class_svm
just train model=supervised

Hyperparameter Overrides

# Override contamination rate
just train model.hyperparameters.contamination=0.05

# Multiple parameter overrides
just train model.hyperparameters.n_estimators=200 model.hyperparameters.max_samples=0.8

# Override dataset splitting
just train dataset.train_split=0.8 dataset.val_split=0.1 dataset.test_split=0.1

Advanced Configuration

# Change optimization settings
just tune tuning.n_trials=50 tuning.timeout=1800

# Override tracking configuration
just train tracking.experiment_name="fraud_detection_v2"

# Combine multiple overrides
just train model=lof dataset=synthetic tuning.n_trials=20

Configuration Validation

All configurations are validated using Pydantic schemas:

  • Type checking: Ensures correct data types
  • Value validation: Checks ranges and constraints
  • Required fields: Validates mandatory parameters
  • Default values: Provides sensible defaults
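
A minimal sketch of what such a schema might look like (field names are hypothetical; the real schemas live under src/config/):

from pydantic import BaseModel, Field


class ModelConfig(BaseModel):
    type: str = "isolation_forest"                        # default value
    contamination: float = Field(0.01, gt=0.0, le=0.5)    # range constraint
    n_estimators: int = Field(100, ge=1)                  # type + value check


ModelConfig(contamination=0.05)     # passes validation
# ModelConfig(contamination=1.5)    # would raise a ValidationError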

Project Structure

├── src/                          # Main source code
│   ├── config/                   # Configuration management
│   ├── data/                     # Data access and management
│   ├── features/                 # Feature engineering
│   ├── models/                   # Model implementations
│   ├── pipelines/                # Training and inference pipelines
│   ├── tuning/                   # Hyperparameter optimization
│   └── tracking/                 # MLflow experiment tracking
├── config/                       # Hydra configuration files
├── data/                         # Data directory (DVC-managed)
├── models/                       # Trained models and artifacts
├── reports/                      # Generated reports
├── tests/                        # Test suite
└── notebooks/                    # Jupyter notebooks for exploration

πŸ› οΈ Available Commands

Run just --list or just help to see all available commands:

Setup and Environment

  • bootstrap - Complete project setup (recommended for first-time setup)
  • install - Install core dependencies
  • install-dev - Install development dependencies
  • install-all - Install all dependencies including optional extras

Data Management

  • data-download - Download datasets (supports Kaggle integration)
  • data-prep - Prepare and preprocess data using DVC pipeline
  • data-validate - Validate data quality and detect issues

Model Training and Evaluation

  • train - Train models with configurable algorithms and parameters
  • tune - Run hyperparameter optimization with Optuna
  • eval - Evaluate trained models with comprehensive metrics
  • predict - Make predictions on new data

Experiment Tracking and Reporting

  • mlflow-ui - Start MLflow UI for experiment visualization
  • report - Generate Quarto reports with visualizations
  • report-compare - Compare multiple experiments
  • report-list - List available MLflow runs

Code Quality and Testing

  • lint - Run Ruff linter for code quality
  • fmt - Format code with Ruff
  • typecheck - Run mypy for static type checking
  • test - Run comprehensive test suite
  • check - Run all quality checks (lint + typecheck + test)

Container Operations

  • podman-build - Build container images (development/production)
  • podman-run - Run containers with proper volume mounts
  • podman-serve - Start FastAPI serving container
  • podman-start-stack - Start complete MLflow + API stack

Development Utilities

  • clean - Clean temporary files and caches
  • info - Show environment information
  • help - Display detailed command help

🤖 Supported Algorithms

Unsupervised Anomaly Detection

IsolationForest

  • Best for: High-dimensional data, large datasets
  • Strengths: Fast training, handles mixed data types well
  • Use case: General-purpose anomaly detection
just train model=isolation_forest
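
Under the hood this maps onto scikit-learn's IsolationForest; a rough standalone illustration (the project wraps it behind its model factory and preprocessing pipeline):

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(0, 1, size=(500, 5)),    # mostly normal points
    rng.normal(6, 1, size=(5, 5)),      # a few obvious anomalies
])

model = IsolationForest(contamination=0.01, random_state=42).fit(X)
scores = model.decision_function(X)     # higher = more normal
labels = model.predict(X)               # 1 = normal, -1 = anomaly
print((labels == -1).sum(), "points flagged")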

LocalOutlierFactor (LOF)

  • Best for: Local density-based anomalies
  • Strengths: Detects local outliers in varying density regions
  • Use case: Spatial data, clustering-based anomalies
just train model=lof

OneClassSVM

  • Best for: High-dimensional data with clear boundaries
  • Strengths: Robust to outliers, kernel-based flexibility
  • Use case: Text data, high-dimensional feature spaces
just train model=one_class_svm

RobustCovariance (EllipticEnvelope)

  • Best for: Gaussian-distributed data
  • Strengths: Statistical foundation, interpretable
  • Use case: Financial data, sensor readings
just train model=robust_covariance

Supervised Baseline

GradientBoostingClassifier

  • Best for: When labeled anomalies are available
  • Features: Class balancing, probability calibration, threshold optimization
  • Use case: Fraud detection, quality control
just train model=supervised
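
A hedged sketch of the idea behind this baseline, combining balanced sample weights with probability calibration (the exact estimator settings come from the Hydra config):

import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.utils.class_weight import compute_sample_weight

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = np.zeros(1000, dtype=int)
y[:20] = 1                                       # ~2% positive (anomalous) class

weights = compute_sample_weight("balanced", y)   # up-weight the rare class
clf = CalibratedClassifierCV(GradientBoostingClassifier(), method="sigmoid", cv=3)
clf.fit(X, y, sample_weight=weights)
proba = clf.predict_proba(X)[:, 1]               # calibrated anomaly probability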

📊 Supported Datasets

Credit Card Fraud Detection (Default)

  • Source: Kaggle MLG-ULB dataset
  • Size: ~284K transactions, 30 features
  • Anomaly rate: ~0.17% (highly imbalanced)
  • Features: PCA-transformed financial transaction data
# Use default dataset
just train dataset=credit-card-fraud

Synthetic Dataset (Testing)

  • Source: Generated using sklearn
  • Purpose: Testing and development
  • Configurable: Size, features, anomaly rate
# Use synthetic data for testing
just train dataset=synthetic
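
For reference, an imbalanced synthetic set can be generated with scikit-learn roughly like this (the project's generator and its defaults may differ):

from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=10_000,
    n_features=20,
    weights=[0.99],          # ~1% anomalies (class 1)
    random_state=42,
)
print(f"anomaly rate: {y.mean():.3f}")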

Adding Custom Datasets

  1. Create configuration file in config/dataset/
  2. Implement data loading in src/data/
  3. Update DVC pipeline if needed

🔧 Development Workflow

Code Quality Standards

The project enforces strict quality standards:

# Run comprehensive quality checks
just check

# Individual quality checks
just lint          # Ruff linting (replaces flake8, black, isort)
just fmt           # Code formatting with Ruff
just typecheck     # mypy static type checking
just test          # pytest test suite with coverage

Testing Strategy

# Run all tests
just test

# Run with coverage report
just test-cov

# Run specific test categories
just test-unit         # Unit tests (fast)
just test-integration  # Integration tests (slower)
just test-smoke        # Smoke tests (quick validation)

# Performance testing
just benchmark         # Run performance benchmarks

Pre-commit Hooks

Set up automated quality checks:

# Install pre-commit hooks
just setup-pre-commit

# Run on all files
just run-pre-commit

Container Development

# Build development container
just podman-build development

# Run with full development environment
just podman-run

# Open interactive shell
just podman-shell

# Build production container
just podman-build-prod

📈 MLflow Integration & Experiment Tracking

Automatic Experiment Tracking

Every training run automatically logs:

  • Parameters: All model and dataset configuration
  • Metrics: ROC-AUC, PR-AUC, precision, recall, F1-score
  • Artifacts: Trained models, preprocessing pipelines, plots
  • Environment: Python version, dependencies, git commit
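
In code, that logging boils down to roughly the following MLflow calls (issued automatically by the project's tracking layer; the names, values, and artifact path below are illustrative):

import mlflow

mlflow.set_experiment("fraud_detection_v1")
with mlflow.start_run(run_name="isolation_forest_baseline"):
    mlflow.log_params({"model": "isolation_forest", "contamination": 0.01})
    mlflow.log_metrics({"roc_auc": 0.95, "pr_auc": 0.41})    # example values
    mlflow.log_artifact("models/model.joblib")               # hypothetical artifact path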

MLflow UI

# Start MLflow UI
just mlflow-ui

# Access at http://localhost:5000

Features:

  • Experiment comparison: Side-by-side run comparison
  • Model registry: Version management and staging
  • Artifact browser: Download models and plots
  • Metric visualization: Interactive charts and trends

Experiment Organization

# Create named experiments
just train tracking.experiment_name="fraud_detection_v1"

# Add custom tags
just train tracking.tags.version="1.0" tracking.tags.dataset="production"

# Set run names
just train tracking.run_name="isolation_forest_baseline"

🎯 Hyperparameter Optimization

Optuna Integration

Automated hyperparameter tuning with Bayesian optimization:

# Run optimization with default settings
just tune

# Customize optimization
just tune tuning.n_trials=100 tuning.timeout=3600

# Multi-objective optimization
just tune tuning.enable_multi_objective=true

Optimization Features

  • Bayesian optimization: Efficient parameter space exploration
  • Pruning: Early stopping of unpromising trials
  • MLflow integration: All trials logged automatically
  • Parallel execution: Multi-process optimization support
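
A minimal sketch of what an Optuna study looks like for IsolationForest (the search space below is assumed; the real one lives in config/tuning/optuna.yaml):

import numpy as np
import optuna
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 5))
X_val = np.vstack([rng.normal(size=(200, 5)), rng.normal(5, 1, size=(10, 5))])
y_val = np.array([0] * 200 + [1] * 10)


def objective(trial):
    model = IsolationForest(
        n_estimators=trial.suggest_int("n_estimators", 50, 400),
        contamination=trial.suggest_float("contamination", 0.001, 0.1, log=True),
        random_state=42,
    ).fit(X_train)
    scores = -model.decision_function(X_val)    # higher = more anomalous
    return roc_auc_score(y_val, scores)


study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print(study.best_params)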

Analyzing Results

# Analyze optimization results
just tune-analyze

# Compare different studies
just tune-compare

# Generate optimization report
just report --optimization-study <study-name>

📊 Automated Reporting

Quarto Reports

Generate comprehensive reports with interactive visualizations:

# Generate report for latest run
just report

# Generate for specific MLflow run
just report --run-id <run-id>

# Compare multiple experiments
just report-compare --experiment-name "fraud_detection_v1"

Report Contents

  • Executive Summary: Key metrics and model performance
  • Data Analysis: Dataset characteristics and quality metrics
  • Model Performance: ROC curves, precision-recall curves, confusion matrices
  • Feature Analysis: Feature importance and distribution analysis
  • Model Comparison: Side-by-side algorithm comparison
  • Recommendations: Actionable insights and next steps

Report Formats

# HTML report (default)
just report --format html

# PDF report
just report --format pdf

# Serve reports locally
just report-serve  # Access at http://localhost:8080

🐳 Container Deployment

Development Container

# Build and run development environment
just podman-build development
just podman-run

# Features:
# - Full development tools (pytest, mypy, ruff)
# - Jupyter notebook support
# - Interactive debugging
# - Volume mounts for live code editing

Production Container

# Build optimized production image
just podman-build-prod

# Run production API server
just podman-serve

# Features:
# - Minimal image size
# - FastAPI serving endpoint
# - Health checks and monitoring
# - Non-root security

Container Stack

# Start complete stack (MLflow + API)
just podman-start-stack

# Services:
# - MLflow UI: http://localhost:5000
# - API server: http://localhost:8000
# - API docs: http://localhost:8000/docs

# Stop everything
just podman-stop-stack

πŸ” Troubleshooting

Common Issues

1. Kaggle API Setup

# Error: Kaggle credentials not found
export KAGGLE_USERNAME="your_username"
export KAGGLE_KEY="your_api_key"

# Or create ~/.kaggle/kaggle.json
mkdir -p ~/.kaggle
echo '{"username":"your_username","key":"your_key"}' > ~/.kaggle/kaggle.json
chmod 600 ~/.kaggle/kaggle.json

2. Memory Issues

# Reduce dataset size for testing
just train dataset.train_split=0.1

# Use synthetic dataset
just train dataset=synthetic

# Adjust model parameters
just train model.hyperparameters.max_samples=0.1

3. Container Permission Issues

# Fix SELinux labels (if using SELinux)
sudo chcon -Rt svirt_sandbox_file_t data/ models/

# Or use :Z mount option (already included in Justfile)

4. MLflow Connection Issues

# Check MLflow server status
just mlflow-ui

# Reset MLflow tracking
rm -rf mlruns/
mkdir mlruns

Debug Mode

# Enable debug logging
export LOG_LEVEL=DEBUG

# Run with verbose output
just train --verbose

# Check system information
just info

Performance Issues

# Run performance benchmark
just benchmark

# Profile specific components
just benchmark --component training
just benchmark --component preprocessing

Getting Help

  1. Check logs: Look in logs/ directory for detailed error messages
  2. Run diagnostics: Use just info to check environment
  3. Validate configuration: Use just data-validate to check data quality
  4. Test installation: Run just test-smoke for quick validation

🚀 Advanced Usage

Custom Model Implementation

Add new anomaly detection algorithms:

  1. Implement BaseModelProtocol:
import numpy as np

from src.models.base import BaseModelProtocol


class CustomAnomalyDetector:
    """Minimal illustrative detector: flags points far from the feature-wise mean."""

    def fit(self, X, y=None):
        X = np.asarray(X)
        self.mean_ = X.mean(axis=0)
        self.std_ = X.std(axis=0) + 1e-9
        return self

    def predict(self, X):
        # Return binary predictions: 1 = normal, -1 = anomaly
        return np.where(self.decision_function(X) < -3.0, -1, 1)

    def decision_function(self, X):
        # Return anomaly scores (higher = more normal, matching sklearn's convention)
        z = np.abs((np.asarray(X) - self.mean_) / self.std_)
        return -z.max(axis=1)
  2. Register in ModelFactory:
# src/models/factory.py
factory.register("custom_model", CustomAnomalyDetector)
  3. Create configuration:
# config/model/custom_model.yaml
type: "custom_model"
hyperparameters:
  param1: value1
  param2: value2

Custom Dataset Integration

Add new datasets:

  1. Create dataset configuration:
# config/dataset/my_dataset.yaml
name: "my_dataset"
download_url: "path/to/data"
target_col: "label"
train_split: 0.7
val_split: 0.15
test_split: 0.15
  2. Implement a data loader (if needed):
# src/data/loaders.py
import pandas as pd


def load_my_dataset(config):
    # Custom loading logic, e.g. read the file referenced by the dataset config
    return pd.read_csv(config.download_url)

Pipeline Customization

Extend the preprocessing pipeline:

# Custom feature engineering
from src.features.engineering import FeatureEngineer

class CustomFeatureEngineer(FeatureEngineer):
    def create_pipeline(self):
        # Custom preprocessing steps
        pass

πŸ—οΈ Architecture Overview

System Design

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Data Layer    │    │  Config Layer   │    │  Model Layer    │
│                 │    │                 │    │                 │
│ • DVC Pipeline  │    │ • Hydra Configs │    │ • BaseProtocol  │
│ • Data Access   │    │ • Pydantic      │    │ • Model Factory │
│ • Validation    │    │ • Validation    │    │ • Algorithms    │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                       │                       │
         └───────────────────────┼───────────────────────┘
                                 │
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│ Training Layer  │    │ Tracking Layer  │    │ Serving Layer   │
│                 │    │                 │    │                 │
│ • Orchestrator  │    │ • MLflow        │    │ • FastAPI       │
│ • Pipelines     │    │ • Experiments   │    │ • Health Checks │
│ • Evaluation    │    │ • Artifacts     │    │ • Monitoring    │
└─────────────────┘    └─────────────────┘    └─────────────────┘

Key Design Principles

  • Separation of Concerns: Clear module boundaries and responsibilities
  • Configuration-Driven: All behavior controlled via Hydra configurations
  • Type Safety: Comprehensive Pydantic schemas and mypy validation
  • Testability: Dependency injection and protocol-based interfaces (illustrated in the sketch below)
  • Reproducibility: Deterministic pipelines with version control
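
For illustration, a protocol-based model interface in this spirit might be declared as follows (a sketch, not the project's exact BaseModelProtocol):

from typing import Protocol

import numpy as np


class AnomalyModel(Protocol):
    def fit(self, X: np.ndarray, y: np.ndarray | None = None) -> "AnomalyModel": ...

    def predict(self, X: np.ndarray) -> np.ndarray: ...

    def decision_function(self, X: np.ndarray) -> np.ndarray: ...

Any estimator that satisfies these methods can be registered with the model factory and swapped in purely through configuration.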

📚 Additional Resources

Example Notebooks

  • notebooks/01_data_exploration.ipynb - Dataset analysis and visualization
  • notebooks/02_model_comparison.ipynb - Algorithm comparison and selection
  • notebooks/03_hyperparameter_analysis.ipynb - Optimization results analysis

🤝 Contributing

Development Setup

  1. Fork the repository
  2. Run just bootstrap to set up environment
  3. Create feature branch: git checkout -b feature/amazing-feature
  4. Make changes and add tests
  5. Run quality checks: just check
  6. Commit changes: git commit -m 'Add amazing feature'
  7. Push to branch: git push origin feature/amazing-feature
  8. Open Pull Request

Code Standards

  • Type Hints: All functions must have type annotations
  • Documentation: Docstrings for all public functions and classes
  • Testing: Minimum 80% test coverage for new code
  • Quality: All code must pass just check (lint + typecheck + test)

Commit Convention

Follow Conventional Commits:

  • feat: New features
  • fix: Bug fixes
  • docs: Documentation changes
  • test: Test additions or modifications
  • refactor: Code refactoring

📄 License

MIT License - see LICENSE file for details.

🆘 Support & Community

Getting Help

  1. Documentation: Check this README and docs/ directory
  2. Issues: Search existing issues or create a new one
  3. Discussions: Use GitHub Discussions for questions
  4. Examples: Check notebooks/ for usage examples

Reporting Issues

When reporting issues, please include:

  • Python version and operating system
  • Complete error message and stack trace
  • Steps to reproduce the issue
  • Configuration files (if relevant)
  • Output of just info

Feature Requests

For feature requests, please:

  • Check existing issues and discussions
  • Describe the use case and expected behavior
  • Consider contributing the feature yourself

Built with ❤️ for the ML Engineering Community

This project demonstrates production-ready ML engineering practices and serves as a comprehensive example for building scalable, maintainable machine learning systems.
