Anomaly Detection ML Project

A production-grade anomaly detection machine learning system demonstrating senior-level ML engineering capabilities. This project implements multiple classical anomaly detection algorithms with comprehensive experiment tracking, automated reporting, and containerized deployment.

Note: For a complete AI-generated implementation (not yet reviewed by me), check out the r1-t14 branch.

🎯 Project Overview

This system showcases modern ML engineering best practices through a complete anomaly detection pipeline. It's designed as an interview-ready demonstration of production ML systems, emphasizing:

  • Reproducibility: Version-controlled data, deterministic pipelines, and comprehensive experiment tracking
  • Maintainability: Clean architecture, comprehensive testing, and type safety
  • Scalability: Containerized deployment, configurable pipelines, and efficient resource usage
  • Observability: Detailed logging, metrics tracking, and automated reporting

✨ Key Features

Core ML Capabilities

  • Multi-algorithm Support: IsolationForest, LocalOutlierFactor, OneClassSVM, RobustCovariance
  • Supervised Baseline: GradientBoostingClassifier with class balancing and probability calibration
  • Automated Preprocessing: sklearn ColumnTransformer with intelligent feature type detection
  • Threshold Optimization: Multiple strategies (percentile, Youden's J, fixed FPR)
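
As a rough illustration of the Youden's J strategy (a sketch only; the helper name below is hypothetical and the project's threshold module handles this internally):

import numpy as np
from sklearn.metrics import roc_curve


def youden_j_threshold(y_true, scores):
    # Pick the score cut-off that maximizes TPR - FPR on validation data
    fpr, tpr, thresholds = roc_curve(y_true, scores)
    return thresholds[np.argmax(tpr - fpr)]


# Toy example where higher score = more anomalous
y_val = np.array([0, 0, 0, 1, 0, 1])
scores = np.array([0.10, 0.20, 0.15, 0.90, 0.30, 0.80])
print(youden_j_threshold(y_val, scores))   # 0.8 for this toy data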

Engineering Excellence

  • Configuration Management: Hydra-based hierarchical configuration with Pydantic validation
  • Experiment Tracking: MLflow integration for comprehensive experiment management
  • Automated Reporting: Quarto-based reports with interactive Plotly visualizations
  • Quality Assurance: Comprehensive testing (unit, integration, smoke), type checking, and linting
  • Containerization: Multi-stage Docker builds with Podman support

Production Features

  • Data Versioning: DVC for reproducible data pipelines
  • Hyperparameter Optimization: Optuna integration with MLflow callbacks
  • API Serving: FastAPI with Pydantic validation for model serving (see the sketch after this list)
  • Health Monitoring: Built-in health checks and resource monitoring
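
A hedged sketch of what such an endpoint could look like (route names and schemas here are illustrative, not the project's actual API; in the project the server is started via just podman-serve):

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Anomaly Detection API")


class ScoreRequest(BaseModel):
    features: list[float]            # one preprocessed feature vector


class ScoreResponse(BaseModel):
    anomaly_score: float
    is_anomaly: bool


@app.post("/score", response_model=ScoreResponse)
def score(req: ScoreRequest) -> ScoreResponse:
    # Placeholder scoring; a real handler would call the loaded model's
    # decision_function / predict on req.features.
    s = sum(req.features)
    return ScoreResponse(anomaly_score=s, is_anomaly=s > 0)


@app.get("/health")
def health() -> dict[str, str]:
    # Simple liveness probe used by container health checks
    return {"status": "ok"}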

🚀 Quick Start

Prerequisites

  • Python 3.11+: Modern Python with type hints support
  • uv: Fast Python package manager (replaces pip/poetry)
  • Just: Command runner (optional but recommended)
  • Git: For version control and DVC integration

Installation

Option 1: One-Command Bootstrap (Recommended)

just bootstrap

This will:

  • Install uv if not present
  • Install all dependencies with proper extras
  • Initialize DVC for data versioning
  • Create required directory structure
  • Set up development environment

Option 2: Manual Setup

# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone and enter project directory
git clone <repository-url>
cd anomaly-detection-ml-project

# Install dependencies
uv sync --all-extras

# Initialize DVC
uv run dvc init --no-scm

# Create directory structure
mkdir -p data/{raw,interim,processed} models reports/figures

First Run: Complete Pipeline

Follow these steps for your first complete run:

# 1. Set up Kaggle credentials (for dataset download)
export KAGGLE_USERNAME="your_username"
export KAGGLE_KEY="your_api_key"

# 2. Download and prepare data
just data-download
just data-prep

# 3. Train your first model
just train

# 4. Evaluate the model
just eval

# 5. Generate a comprehensive report
just report

# 6. Start MLflow UI to explore results
just mlflow-ui
# Visit http://localhost:5000 to see experiments

Quick Examples

# Train different algorithms
just train model=isolation_forest
just train model=lof
just train model=one_class_svm

# Override hyperparameters
just train model.hyperparameters.contamination=0.05

# Run hyperparameter optimization
just tune

# Generate reports for specific runs
just report --run-id <mlflow-run-id>

# Run in container
just podman-build
just podman-run

βš™οΈ Configuration System

The project uses Hydra for hierarchical configuration management with Pydantic validation. This provides type-safe, composable configurations that can be easily overridden.

Configuration Structure

config/
├── config.yaml                   # Main configuration with defaults
├── dataset/
│   ├── credit-card-fraud.yaml    # Credit card fraud dataset
│   └── synthetic.yaml            # Synthetic dataset for testing
├── model/
│   ├── isolation_forest.yaml     # IsolationForest parameters
│   ├── lof.yaml                  # LocalOutlierFactor parameters
│   ├── one_class_svm.yaml        # OneClassSVM parameters
│   └── supervised.yaml           # Supervised baseline
├── tuning/
│   └── optuna.yaml               # Hyperparameter optimization
└── tracking/
    └── mlflow.yaml               # MLflow configuration

Configuration Examples

Basic Model Selection

# Use different algorithms
just train model=isolation_forest
just train model=lof
just train model=one_class_svm
just train model=supervised

Hyperparameter Overrides

# Override contamination rate
just train model.hyperparameters.contamination=0.05

# Multiple parameter overrides
just train model.hyperparameters.n_estimators=200 model.hyperparameters.max_samples=0.8

# Override dataset splitting
just train dataset.train_split=0.8 dataset.val_split=0.1 dataset.test_split=0.1

Advanced Configuration

# Change optimization settings
just tune tuning.n_trials=50 tuning.timeout=1800

# Override tracking configuration
just train tracking.experiment_name="fraud_detection_v2"

# Combine multiple overrides
just train model=lof dataset=synthetic tuning.n_trials=20

Configuration Validation

All configurations are validated using Pydantic schemas:

  • Type checking: Ensures correct data types
  • Value validation: Checks ranges and constraints
  • Required fields: Validates mandatory parameters
  • Default values: Provides sensible defaults
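
A minimal sketch of what such a schema might look like (field names are hypothetical; the real schemas live under src/config/):

from pydantic import BaseModel, Field


class ModelConfig(BaseModel):
    type: str = "isolation_forest"                        # default value
    contamination: float = Field(0.01, gt=0.0, le=0.5)    # range constraint
    n_estimators: int = Field(100, ge=1)                  # type + value check


ModelConfig(contamination=0.05)     # passes validation
# ModelConfig(contamination=1.5)    # would raise a ValidationError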

Project Structure

├── src/                          # Main source code
│   ├── config/                   # Configuration management
│   ├── data/                     # Data access and management
│   ├── features/                 # Feature engineering
│   ├── models/                   # Model implementations
│   ├── pipelines/                # Training and inference pipelines
│   ├── tuning/                   # Hyperparameter optimization
│   └── tracking/                 # MLflow experiment tracking
├── config/                       # Hydra configuration files
├── data/                         # Data directory (DVC-managed)
├── models/                       # Trained models and artifacts
├── reports/                      # Generated reports
├── tests/                        # Test suite
└── notebooks/                    # Jupyter notebooks for exploration

πŸ› οΈ Available Commands

Run just --list or just help to see all available commands:

Setup and Environment

  • bootstrap - Complete project setup (recommended for first-time setup)
  • install - Install core dependencies
  • install-dev - Install development dependencies
  • install-all - Install all dependencies including optional extras

Data Management

  • data-download - Download datasets (supports Kaggle integration)
  • data-prep - Prepare and preprocess data using DVC pipeline
  • data-validate - Validate data quality and detect issues

Model Training and Evaluation

  • train - Train models with configurable algorithms and parameters
  • tune - Run hyperparameter optimization with Optuna
  • eval - Evaluate trained models with comprehensive metrics
  • predict - Make predictions on new data

Experiment Tracking and Reporting

  • mlflow-ui - Start MLflow UI for experiment visualization
  • report - Generate Quarto reports with visualizations
  • report-compare - Compare multiple experiments
  • report-list - List available MLflow runs

Code Quality and Testing

  • lint - Run Ruff linter for code quality
  • fmt - Format code with Ruff
  • typecheck - Run mypy for static type checking
  • test - Run comprehensive test suite
  • check - Run all quality checks (lint + typecheck + test)

Container Operations

  • podman-build - Build container images (development/production)
  • podman-run - Run containers with proper volume mounts
  • podman-serve - Start FastAPI serving container
  • podman-start-stack - Start complete MLflow + API stack

Development Utilities

  • clean - Clean temporary files and caches
  • info - Show environment information
  • help - Display detailed command help

🤖 Supported Algorithms

Unsupervised Anomaly Detection

IsolationForest

  • Best for: High-dimensional data, large datasets
  • Strengths: Fast training, handles mixed data types well
  • Use case: General-purpose anomaly detection
just train model=isolation_forest
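
Under the hood this maps onto scikit-learn's IsolationForest; a rough standalone illustration (the project wraps it behind its model factory and preprocessing pipeline):

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(0, 1, size=(500, 5)),    # mostly normal points
    rng.normal(6, 1, size=(5, 5)),      # a few obvious anomalies
])

model = IsolationForest(contamination=0.01, random_state=42).fit(X)
scores = model.decision_function(X)     # higher = more normal
labels = model.predict(X)               # 1 = normal, -1 = anomaly
print((labels == -1).sum(), "points flagged")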

LocalOutlierFactor (LOF)

  • Best for: Local density-based anomalies
  • Strengths: Detects local outliers in varying density regions
  • Use case: Spatial data, clustering-based anomalies
just train model=lof

OneClassSVM

  • Best for: High-dimensional data with clear boundaries
  • Strengths: Robust to outliers, kernel-based flexibility
  • Use case: Text data, high-dimensional feature spaces
just train model=one_class_svm

RobustCovariance (EllipticEnvelope)

  • Best for: Gaussian-distributed data
  • Strengths: Statistical foundation, interpretable
  • Use case: Financial data, sensor readings
just train model=robust_covariance

Supervised Baseline

GradientBoostingClassifier

  • Best for: When labeled anomalies are available
  • Features: Class balancing, probability calibration, threshold optimization
  • Use case: Fraud detection, quality control
just train model=supervised
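
A hedged sketch of the idea behind this baseline, combining balanced sample weights with probability calibration (the exact estimator settings come from the Hydra config):

import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.utils.class_weight import compute_sample_weight

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = np.zeros(1000, dtype=int)
y[:20] = 1                                       # ~2% positive (anomalous) class

weights = compute_sample_weight("balanced", y)   # up-weight the rare class
clf = CalibratedClassifierCV(GradientBoostingClassifier(), method="sigmoid", cv=3)
clf.fit(X, y, sample_weight=weights)
proba = clf.predict_proba(X)[:, 1]               # calibrated anomaly probability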

📊 Supported Datasets

Credit Card Fraud Detection (Default)

  • Source: Kaggle MLG-ULB dataset
  • Size: ~284K transactions, 30 features
  • Anomaly rate: ~0.17% (highly imbalanced)
  • Features: PCA-transformed financial transaction data
# Use default dataset
just train dataset=credit-card-fraud

Synthetic Dataset (Testing)

  • Source: Generated using sklearn
  • Purpose: Testing and development
  • Configurable: Size, features, anomaly rate
# Use synthetic data for testing
just train dataset=synthetic
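
For reference, an imbalanced synthetic set can be generated with scikit-learn roughly like this (the project's generator and its defaults may differ):

from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=10_000,
    n_features=20,
    weights=[0.99],          # ~1% anomalies (class 1)
    random_state=42,
)
print(f"anomaly rate: {y.mean():.3f}")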

Adding Custom Datasets

  1. Create configuration file in config/dataset/
  2. Implement data loading in src/data/
  3. Update DVC pipeline if needed

🔧 Development Workflow

Code Quality Standards

The project enforces strict quality standards:

# Run comprehensive quality checks
just check

# Individual quality checks
just lint          # Ruff linting (replaces flake8, black, isort)
just fmt           # Code formatting with Ruff
just typecheck     # mypy static type checking
just test          # pytest test suite with coverage

Testing Strategy

# Run all tests
just test

# Run with coverage report
just test-cov

# Run specific test categories
just test-unit         # Unit tests (fast)
just test-integration  # Integration tests (slower)
just test-smoke        # Smoke tests (quick validation)

# Performance testing
just benchmark         # Run performance benchmarks

Pre-commit Hooks

Set up automated quality checks:

# Install pre-commit hooks
just setup-pre-commit

# Run on all files
just run-pre-commit

Container Development

# Build development container
just podman-build development

# Run with full development environment
just podman-run

# Open interactive shell
just podman-shell

# Build production container
just podman-build-prod

📈 MLflow Integration & Experiment Tracking

Automatic Experiment Tracking

Every training run automatically logs:

  • Parameters: All model and dataset configuration
  • Metrics: ROC-AUC, PR-AUC, precision, recall, F1-score
  • Artifacts: Trained models, preprocessing pipelines, plots
  • Environment: Python version, dependencies, git commit
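
In code, that logging boils down to roughly the following MLflow calls (issued automatically by the project's tracking layer; the names, values, and artifact path below are illustrative):

import mlflow

mlflow.set_experiment("fraud_detection_v1")
with mlflow.start_run(run_name="isolation_forest_baseline"):
    mlflow.log_params({"model": "isolation_forest", "contamination": 0.01})
    mlflow.log_metrics({"roc_auc": 0.95, "pr_auc": 0.41})    # example values
    mlflow.log_artifact("models/model.joblib")               # hypothetical artifact path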

MLflow UI

# Start MLflow UI
just mlflow-ui

# Access at http://localhost:5000

Features:

  • Experiment comparison: Side-by-side run comparison
  • Model registry: Version management and staging
  • Artifact browser: Download models and plots
  • Metric visualization: Interactive charts and trends

Experiment Organization

# Create named experiments
just train tracking.experiment_name="fraud_detection_v1"

# Add custom tags
just train tracking.tags.version="1.0" tracking.tags.dataset="production"

# Set run names
just train tracking.run_name="isolation_forest_baseline"

🎯 Hyperparameter Optimization

Optuna Integration

Automated hyperparameter tuning with Bayesian optimization:

# Run optimization with default settings
just tune

# Customize optimization
just tune tuning.n_trials=100 tuning.timeout=3600

# Multi-objective optimization
just tune tuning.enable_multi_objective=true

Optimization Features

  • Bayesian optimization: Efficient parameter space exploration
  • Pruning: Early stopping of unpromising trials
  • MLflow integration: All trials logged automatically
  • Parallel execution: Multi-process optimization support
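
A minimal sketch of what an Optuna study looks like for IsolationForest (the search space below is assumed; the real one lives in config/tuning/optuna.yaml):

import numpy as np
import optuna
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 5))
X_val = np.vstack([rng.normal(size=(200, 5)), rng.normal(5, 1, size=(10, 5))])
y_val = np.array([0] * 200 + [1] * 10)


def objective(trial):
    model = IsolationForest(
        n_estimators=trial.suggest_int("n_estimators", 50, 400),
        contamination=trial.suggest_float("contamination", 0.001, 0.1, log=True),
        random_state=42,
    ).fit(X_train)
    scores = -model.decision_function(X_val)    # higher = more anomalous
    return roc_auc_score(y_val, scores)


study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print(study.best_params)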

Analyzing Results

# Analyze optimization results
just tune-analyze

# Compare different studies
just tune-compare

# Generate optimization report
just report --optimization-study <study-name>

📊 Automated Reporting

Quarto Reports

Generate comprehensive reports with interactive visualizations:

# Generate report for latest run
just report

# Generate for specific MLflow run
just report --run-id <run-id>

# Compare multiple experiments
just report-compare --experiment-name "fraud_detection_v1"

Report Contents

  • Executive Summary: Key metrics and model performance
  • Data Analysis: Dataset characteristics and quality metrics
  • Model Performance: ROC curves, precision-recall curves, confusion matrices
  • Feature Analysis: Feature importance and distribution analysis
  • Model Comparison: Side-by-side algorithm comparison
  • Recommendations: Actionable insights and next steps

Report Formats

# HTML report (default)
just report --format html

# PDF report
just report --format pdf

# Serve reports locally
just report-serve  # Access at http://localhost:8080

🐳 Container Deployment

Development Container

# Build and run development environment
just podman-build development
just podman-run

# Features:
# - Full development tools (pytest, mypy, ruff)
# - Jupyter notebook support
# - Interactive debugging
# - Volume mounts for live code editing

Production Container

# Build optimized production image
just podman-build-prod

# Run production API server
just podman-serve

# Features:
# - Minimal image size
# - FastAPI serving endpoint
# - Health checks and monitoring
# - Non-root security

Container Stack

# Start complete stack (MLflow + API)
just podman-start-stack

# Services:
# - MLflow UI: http://localhost:5000
# - API server: http://localhost:8000
# - API docs: http://localhost:8000/docs

# Stop everything
just podman-stop-stack

πŸ” Troubleshooting

Common Issues

1. Kaggle API Setup

# Error: Kaggle credentials not found
export KAGGLE_USERNAME="your_username"
export KAGGLE_KEY="your_api_key"

# Or create ~/.kaggle/kaggle.json
mkdir -p ~/.kaggle
echo '{"username":"your_username","key":"your_key"}' > ~/.kaggle/kaggle.json
chmod 600 ~/.kaggle/kaggle.json

2. Memory Issues

# Reduce dataset size for testing
just train dataset.train_split=0.1

# Use synthetic dataset
just train dataset=synthetic

# Adjust model parameters
just train model.hyperparameters.max_samples=0.1

3. Container Permission Issues

# Fix SELinux labels (if using SELinux)
sudo chcon -Rt svirt_sandbox_file_t data/ models/

# Or use :Z mount option (already included in Justfile)

4. MLflow Connection Issues

# Check MLflow server status
just mlflow-ui

# Reset MLflow tracking
rm -rf mlruns/
mkdir mlruns

Debug Mode

# Enable debug logging
export LOG_LEVEL=DEBUG

# Run with verbose output
just train --verbose

# Check system information
just info

Performance Issues

# Run performance benchmark
just benchmark

# Profile specific components
just benchmark --component training
just benchmark --component preprocessing

Getting Help

  1. Check logs: Look in logs/ directory for detailed error messages
  2. Run diagnostics: Use just info to check environment
  3. Validate configuration: Use just data-validate to check data quality
  4. Test installation: Run just test-smoke for quick validation

🚀 Advanced Usage

Custom Model Implementation

Add new anomaly detection algorithms:

  1. Implement BaseModelProtocol:
import numpy as np

from src.models.base import BaseModelProtocol


class CustomAnomalyDetector:
    """Minimal illustrative detector: flags points far from the feature-wise mean."""

    def fit(self, X, y=None):
        X = np.asarray(X)
        self.mean_ = X.mean(axis=0)
        self.std_ = X.std(axis=0) + 1e-9
        return self

    def predict(self, X):
        # Return binary predictions: 1 = normal, -1 = anomaly
        return np.where(self.decision_function(X) < -3.0, -1, 1)

    def decision_function(self, X):
        # Return anomaly scores (higher = more normal, matching sklearn's convention)
        z = np.abs((np.asarray(X) - self.mean_) / self.std_)
        return -z.max(axis=1)
  2. Register in ModelFactory:
# src/models/factory.py
factory.register("custom_model", CustomAnomalyDetector)
  3. Create configuration:
# config/model/custom_model.yaml
type: "custom_model"
hyperparameters:
  param1: value1
  param2: value2

Custom Dataset Integration

Add new datasets:

  1. Create dataset configuration:
# config/dataset/my_dataset.yaml
name: "my_dataset"
download_url: "path/to/data"
target_col: "label"
train_split: 0.7
val_split: 0.15
test_split: 0.15
  2. Implement a data loader (if needed):
# src/data/loaders.py
import pandas as pd


def load_my_dataset(config):
    # Custom loading logic, e.g. read the file referenced by the dataset config
    return pd.read_csv(config.download_url)

Pipeline Customization

Extend the preprocessing pipeline:

# Custom feature engineering
from src.features.engineering import FeatureEngineer

class CustomFeatureEngineer(FeatureEngineer):
    def create_pipeline(self):
        # Custom preprocessing steps
        pass

πŸ—οΈ Architecture Overview

System Design

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Data Layer    │    │  Config Layer   │    │  Model Layer    │
│                 │    │                 │    │                 │
│ • DVC Pipeline  │    │ • Hydra Configs │    │ • BaseProtocol  │
│ • Data Access   │    │ • Pydantic      │    │ • Model Factory │
│ • Validation    │    │ • Validation    │    │ • Algorithms    │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                       │                       │
         └───────────────────────┼───────────────────────┘
                                 │
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│ Training Layer  │    │ Tracking Layer  │    │ Serving Layer   │
│                 │    │                 │    │                 │
│ • Orchestrator  │    │ • MLflow        │    │ • FastAPI       │
│ • Pipelines     │    │ • Experiments   │    │ • Health Checks │
│ • Evaluation    │    │ • Artifacts     │    │ • Monitoring    │
└─────────────────┘    └─────────────────┘    └─────────────────┘

Key Design Principles

  • Separation of Concerns: Clear module boundaries and responsibilities
  • Configuration-Driven: All behavior controlled via Hydra configurations
  • Type Safety: Comprehensive Pydantic schemas and mypy validation
  • Testability: Dependency injection and protocol-based interfaces (illustrated in the sketch below)
  • Reproducibility: Deterministic pipelines with version control
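
For illustration, a protocol-based model interface in this spirit might be declared as follows (a sketch, not the project's exact BaseModelProtocol):

from typing import Protocol

import numpy as np


class AnomalyModel(Protocol):
    def fit(self, X: np.ndarray, y: np.ndarray | None = None) -> "AnomalyModel": ...

    def predict(self, X: np.ndarray) -> np.ndarray: ...

    def decision_function(self, X: np.ndarray) -> np.ndarray: ...

Any estimator that satisfies these methods can be registered with the model factory and swapped in purely through configuration.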

📚 Additional Resources

Example Notebooks

  • notebooks/01_data_exploration.ipynb - Dataset analysis and visualization
  • notebooks/02_model_comparison.ipynb - Algorithm comparison and selection
  • notebooks/03_hyperparameter_analysis.ipynb - Optimization results analysis

🤝 Contributing

Development Setup

  1. Fork the repository
  2. Run just bootstrap to set up environment
  3. Create feature branch: git checkout -b feature/amazing-feature
  4. Make changes and add tests
  5. Run quality checks: just check
  6. Commit changes: git commit -m 'Add amazing feature'
  7. Push to branch: git push origin feature/amazing-feature
  8. Open Pull Request

Code Standards

  • Type Hints: All functions must have type annotations
  • Documentation: Docstrings for all public functions and classes
  • Testing: Minimum 80% test coverage for new code
  • Quality: All code must pass just check (lint + typecheck + test)

Commit Convention

Follow Conventional Commits:

  • feat: New features
  • fix: Bug fixes
  • docs: Documentation changes
  • test: Test additions or modifications
  • refactor: Code refactoring

📄 License

MIT License - see LICENSE file for details.

🆘 Support & Community

Getting Help

  1. Documentation: Check this README and docs/ directory
  2. Issues: Search existing issues or create a new one
  3. Discussions: Use GitHub Discussions for questions
  4. Examples: Check notebooks/ for usage examples

Reporting Issues

When reporting issues, please include:

  • Python version and operating system
  • Complete error message and stack trace
  • Steps to reproduce the issue
  • Configuration files (if relevant)
  • Output of just info

Feature Requests

For feature requests, please:

  • Check existing issues and discussions
  • Describe the use case and expected behavior
  • Consider contributing the feature yourself

Built with ❤️ for the ML Engineering Community

This project demonstrates production-ready ML engineering practices and serves as a comprehensive example for building scalable, maintainable machine learning systems.
