A production-grade anomaly detection machine learning system demonstrating senior-level ML engineering capabilities. This project implements multiple classical anomaly detection algorithms with comprehensive experiment tracking, automated reporting, and containerized deployment.
Note: To see a complete AI implementation (not yet reviewed by me), check out the following branch:
r1-t14
This system showcases modern ML engineering best practices through a complete anomaly detection pipeline. It's designed as an interview-ready demonstration of production ML systems, emphasizing:
- Reproducibility: Version-controlled data, deterministic pipelines, and comprehensive experiment tracking
- Maintainability: Clean architecture, comprehensive testing, and type safety
- Scalability: Containerized deployment, configurable pipelines, and efficient resource usage
- Observability: Detailed logging, metrics tracking, and automated reporting
- Multi-algorithm Support: IsolationForest, LocalOutlierFactor, OneClassSVM, RobustCovariance
- Supervised Baseline: GradientBoostingClassifier with class balancing and probability calibration
- Automated Preprocessing: sklearn ColumnTransformer with intelligent feature type detection
- Threshold Optimization: Multiple strategies (percentile, Youden's J, fixed FPR; sketched after this list)
- Configuration Management: Hydra-based hierarchical configuration with Pydantic validation
- Experiment Tracking: MLflow integration for comprehensive experiment management
- Automated Reporting: Quarto-based reports with interactive Plotly visualizations
- Quality Assurance: Comprehensive testing (unit, integration, smoke), type checking, and linting
- Containerization: Multi-stage Docker builds with Podman support
- Data Versioning: DVC for reproducible data pipelines
- Hyperparameter Optimization: Optuna integration with MLflow callbacks
- API Serving: FastAPI with Pydantic validation for model serving
- Health Monitoring: Built-in health checks and resource monitoring
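The threshold strategies listed above can be illustrated with a small sketch (this is not the project's implementation; the function and parameter names are made up for illustration):

```python
import numpy as np
from sklearn.metrics import roc_curve


def pick_threshold(scores, y_true=None, strategy="percentile",
                   contamination=0.01, target_fpr=0.01):
    """Illustrative threshold selection; higher scores mean more anomalous."""
    if strategy == "percentile":
        # Flag the top `contamination` fraction of scores as anomalies.
        return float(np.quantile(scores, 1.0 - contamination))
    fpr, tpr, thresholds = roc_curve(y_true, scores)
    if strategy == "youden":
        # Youden's J = TPR - FPR; pick the threshold that maximizes it.
        return float(thresholds[np.argmax(tpr - fpr)])
    if strategy == "fixed_fpr":
        # Lowest threshold (highest recall) whose FPR stays at or below the target.
        ok = np.where(fpr <= target_fpr)[0]
        return float(thresholds[ok[-1]])
    raise ValueError(f"unknown strategy: {strategy}")
```

The percentile strategy needs no labels, while Youden's J and fixed-FPR selection require a labeled validation set to compute the ROC curve.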
- Python 3.11+: Modern Python with type hints support
- uv: Fast Python package manager (replaces pip/poetry)
- Just: Command runner (optional but recommended)
- Git: For version control and DVC integration
just bootstrap

This will:
- Install uv if not present
- Install all dependencies with proper extras
- Initialize DVC for data versioning
- Create required directory structure
- Set up development environment
# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Clone and enter project directory
git clone <repository-url>
cd anomaly-detection-ml-project
# Install dependencies
uv sync --all-extras
# Initialize DVC
uv run dvc init --no-scm
# Create directory structure
mkdir -p data/{raw,interim,processed} models reports/figures

Follow these steps for your first complete run:
# 1. Set up Kaggle credentials (for dataset download)
export KAGGLE_USERNAME="your_username"
export KAGGLE_KEY="your_api_key"
# 2. Download and prepare data
just data-download
just data-prep
# 3. Train your first model
just train
# 4. Evaluate the model
just eval
# 5. Generate a comprehensive report
just report
# 6. Start MLflow UI to explore results
just mlflow-ui
# Visit http://localhost:5000 to see experiments

# Train different algorithms
just train model=isolation_forest
just train model=lof
just train model=one_class_svm
# Override hyperparameters
just train model.hyperparameters.contamination=0.05
# Run hyperparameter optimization
just tune
# Generate reports for specific runs
just report --run-id <mlflow-run-id>
# Run in container
just podman-build
just podman-run

The project uses Hydra for hierarchical configuration management with Pydantic validation. This provides type-safe, composable configurations that can be easily overridden.
config/
├── config.yaml                  # Main configuration with defaults
├── dataset/
│   ├── credit-card-fraud.yaml   # Credit card fraud dataset
│   └── synthetic.yaml           # Synthetic dataset for testing
├── model/
│   ├── isolation_forest.yaml    # IsolationForest parameters
│   ├── lof.yaml                 # LocalOutlierFactor parameters
│   ├── one_class_svm.yaml       # OneClassSVM parameters
│   └── supervised.yaml          # Supervised baseline
├── tuning/
│   └── optuna.yaml              # Hyperparameter optimization
└── tracking/
    └── mlflow.yaml              # MLflow configuration
# Use different algorithms
just train model=isolation_forest
just train model=lof
just train model=one_class_svm
just train model=supervised

# Override contamination rate
just train model.hyperparameters.contamination=0.05
# Multiple parameter overrides
just train model.hyperparameters.n_estimators=200 model.hyperparameters.max_samples=0.8
# Override dataset splitting
just train dataset.train_split=0.8 dataset.val_split=0.1 dataset.test_split=0.1

# Change optimization settings
just tune tuning.n_trials=50 tuning.timeout=1800
# Override tracking configuration
just train tracking.experiment_name="fraud_detection_v2"
# Combine multiple overrides
just train model=lof dataset=synthetic tuning.n_trials=20

All configurations are validated using Pydantic schemas (a schema sketch follows this list):
- Type checking: Ensures correct data types
- Value validation: Checks ranges and constraints
- Required fields: Validates mandatory parameters
- Default values: Provides sensible defaults
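A minimal sketch of what such a schema can look like (the class and field names here are illustrative, not the project's actual schema):

```python
from pydantic import BaseModel, Field, field_validator


class ModelConfig(BaseModel):
    """Illustrative schema: type-checked, range-constrained, with defaults."""
    type: str
    contamination: float = Field(default=0.01, gt=0.0, le=0.5)
    n_estimators: int = Field(default=100, ge=1)
    random_state: int = 42

    @field_validator("type")
    @classmethod
    def known_type(cls, v: str) -> str:
        allowed = {"isolation_forest", "lof", "one_class_svm", "supervised"}
        if v not in allowed:
            raise ValueError(f"unknown model type: {v}")
        return v
```

The resolved Hydra config can then be converted to a plain dict (for example with OmegaConf.to_container) and passed to the schema so that invalid overrides fail before any training starts.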
├── src/                # Main source code
│   ├── config/         # Configuration management
│   ├── data/           # Data access and management
│   ├── features/       # Feature engineering
│   ├── models/         # Model implementations
│   ├── pipelines/      # Training and inference pipelines
│   ├── tuning/         # Hyperparameter optimization
│   └── tracking/       # MLflow experiment tracking
├── config/             # Hydra configuration files
├── data/               # Data directory (DVC-managed)
├── models/             # Trained models and artifacts
├── reports/            # Generated reports
├── tests/              # Test suite
└── notebooks/          # Jupyter notebooks for exploration
Run just --list or just help to see all available commands:
- bootstrap - Complete project setup (recommended for first-time setup)
- install - Install core dependencies
- install-dev - Install development dependencies
- install-all - Install all dependencies including optional extras

- data-download - Download datasets (supports Kaggle integration)
- data-prep - Prepare and preprocess data using DVC pipeline
- data-validate - Validate data quality and detect issues

- train - Train models with configurable algorithms and parameters
- tune - Run hyperparameter optimization with Optuna
- eval - Evaluate trained models with comprehensive metrics
- predict - Make predictions on new data

- mlflow-ui - Start MLflow UI for experiment visualization
- report - Generate Quarto reports with visualizations
- report-compare - Compare multiple experiments
- report-list - List available MLflow runs

- lint - Run Ruff linter for code quality
- fmt - Format code with Ruff
- typecheck - Run mypy for static type checking
- test - Run comprehensive test suite
- check - Run all quality checks (lint + typecheck + test)

- podman-build - Build container images (development/production)
- podman-run - Run containers with proper volume mounts
- podman-serve - Start FastAPI serving container
- podman-start-stack - Start complete MLflow + API stack

- clean - Clean temporary files and caches
- info - Show environment information
- help - Display detailed command help
- Best for: High-dimensional data, large datasets
- Strengths: Fast training, handles mixed data types well
- Use case: General-purpose anomaly detection
just train model=isolation_forest

- Best for: Local density-based anomalies
- Strengths: Detects local outliers in varying density regions
- Use case: Spatial data, clustering-based anomalies
just train model=lof

- Best for: High-dimensional data with clear boundaries
- Strengths: Robust to outliers, kernel-based flexibility
- Use case: Text data, high-dimensional feature spaces
just train model=one_class_svm

- Best for: Gaussian-distributed data
- Strengths: Statistical foundation, interpretable
- Use case: Financial data, sensor readings
just train model=robust_covariance

- Best for: When labeled anomalies are available
- Features: Class balancing, probability calibration, threshold optimization (a minimal sketch follows this list)
- Use case: Fraud detection, quality control
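A rough sketch of this kind of baseline with scikit-learn (illustrative, not the project's exact training code):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.utils.class_weight import compute_sample_weight


def fit_supervised_baseline(X_train, y_train):
    """Class-balanced GradientBoosting with calibrated probabilities."""
    # GradientBoostingClassifier has no class_weight option, so the class
    # imbalance is handled through per-sample weights instead.
    sample_weight = compute_sample_weight("balanced", y_train)
    base = GradientBoostingClassifier(random_state=42)
    # Cross-validated probability calibration (Platt scaling).
    clf = CalibratedClassifierCV(base, method="sigmoid", cv=3)
    clf.fit(X_train, y_train, sample_weight=sample_weight)
    return clf
```

The calibrated probabilities can then be thresholded with the same strategies described earlier (percentile, Youden's J, fixed FPR).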
just train model=supervised

- Source: Kaggle MLG-ULB dataset
- Size: ~284K transactions, 30 features
- Anomaly rate: ~0.17% (highly imbalanced)
- Features: PCA-transformed financial transaction data
# Use default dataset
just train dataset=credit-card-fraud

- Source: Generated using sklearn (see the sketch below)
- Purpose: Testing and development
- Configurable: Size, features, anomaly rate
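One plausible way to generate such a dataset with sklearn (illustrative; the project's generator may differ):

```python
from sklearn.datasets import make_classification


def make_synthetic(n_samples=10_000, n_features=20, anomaly_rate=0.01, seed=42):
    """Illustrative synthetic data with a configurable anomaly rate."""
    X, y = make_classification(
        n_samples=n_samples,
        n_features=n_features,
        n_informative=int(n_features * 0.6),
        weights=[1.0 - anomaly_rate],  # class 1 becomes the rare anomaly class
        flip_y=0.0,
        random_state=seed,
    )
    return X, y
```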
# Use synthetic data for testing
just train dataset=synthetic

To add a new dataset:

- Create a configuration file in config/dataset/
- Implement data loading in src/data/
- Update the DVC pipeline if needed
The project enforces strict quality standards:
# Run comprehensive quality checks
just check
# Individual quality checks
just lint # Ruff linting (replaces flake8, black, isort)
just fmt # Code formatting with Ruff
just typecheck # mypy static type checking
just test # pytest test suite with coverage

# Run all tests
just test
# Run with coverage report
just test-cov
# Run specific test categories
just test-unit # Unit tests (fast)
just test-integration # Integration tests (slower)
just test-smoke # Smoke tests (quick validation)
# Performance testing
just benchmark # Run performance benchmarks

Set up automated quality checks:
# Install pre-commit hooks
just setup-pre-commit
# Run on all files
just run-pre-commit

# Build development container
just podman-build development
# Run with full development environment
just podman-run
# Open interactive shell
just podman-shell
# Build production container
just podman-build-prod

Every training run automatically logs the following (a short logging sketch follows this list):
- Parameters: All model and dataset configuration
- Metrics: ROC-AUC, PR-AUC, precision, recall, F1-score
- Artifacts: Trained models, preprocessing pipelines, plots
- Environment: Python version, dependencies, git commit
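A condensed sketch of this logging pattern with the MLflow Python API (experiment names, config keys, and metric names are illustrative; environment capture is omitted for brevity):

```python
import mlflow
import mlflow.sklearn
from sklearn.metrics import average_precision_score, roc_auc_score


def log_run(cfg: dict, model, X_test, y_test) -> None:
    """Log the parameters, metrics, and fitted model for one training run."""
    mlflow.set_experiment(cfg.get("experiment_name", "anomaly-detection"))
    with mlflow.start_run(run_name=cfg.get("run_name")):
        mlflow.log_params(cfg["hyperparameters"])    # model/dataset configuration
        scores = -model.decision_function(X_test)    # higher = more anomalous
        mlflow.log_metrics({
            "roc_auc": roc_auc_score(y_test, scores),
            "pr_auc": average_precision_score(y_test, scores),
        })
        mlflow.sklearn.log_model(model, "model")     # fitted model as an artifact
```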
# Start MLflow UI
just mlflow-ui
# Access at http://localhost:5000

Features:
- Experiment comparison: Side-by-side run comparison
- Model registry: Version management and staging
- Artifact browser: Download models and plots
- Metric visualization: Interactive charts and trends
# Create named experiments
just train tracking.experiment_name="fraud_detection_v1"
# Add custom tags
just train tracking.tags.version="1.0" tracking.tags.dataset="production"
# Set run names
just train tracking.run_name="isolation_forest_baseline"

Automated hyperparameter tuning with Bayesian optimization:
# Run optimization with default settings
just tune
# Customize optimization
just tune tuning.n_trials=100 tuning.timeout=3600
# Multi-objective optimization
just tune tuning.enable_multi_objective=true

- Bayesian optimization: Efficient parameter space exploration (see the sketch after this list)
- Pruning: Early stopping of unpromising trials
- MLflow integration: All trials logged automatically
- Parallel execution: Multi-process optimization support
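A minimal sketch of how such a study can be wired up with Optuna (the objective, search ranges, and toy data below are illustrative; the project's search spaces live in the tuning configuration):

```python
import optuna
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Toy data standing in for the prepared train/validation splits.
X, y = make_classification(n_samples=5_000, weights=[0.98], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, random_state=0)


def objective(trial: optuna.Trial) -> float:
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 50, 400),
        "max_samples": trial.suggest_float("max_samples", 0.1, 1.0),
        "contamination": trial.suggest_float("contamination", 0.001, 0.1, log=True),
    }
    model = IsolationForest(random_state=42, **params).fit(X_train)
    # decision_function is higher for inliers; negate so higher = more anomalous.
    scores = -model.decision_function(X_val)
    return roc_auc_score(y_val, scores)


study = optuna.create_study(direction="maximize")  # TPE sampler by default
study.optimize(objective, n_trials=50, timeout=1800)
print(study.best_params, study.best_value)
```

Per-trial MLflow logging is typically attached via Optuna's MLflow callback; it is omitted here to keep the sketch short.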
# Analyze optimization results
just tune-analyze
# Compare different studies
just tune-compare
# Generate optimization report
just report --optimization-study <study-name>

Generate comprehensive reports with interactive visualizations:
# Generate report for latest run
just report
# Generate for specific MLflow run
just report --run-id <run-id>
# Compare multiple experiments
just report-compare --experiment-name "fraud_detection_v1"

- Executive Summary: Key metrics and model performance
- Data Analysis: Dataset characteristics and quality metrics
- Model Performance: ROC curves, precision-recall curves, confusion matrices
- Feature Analysis: Feature importance and distribution analysis
- Model Comparison: Side-by-side algorithm comparison
- Recommendations: Actionable insights and next steps
# HTML report (default)
just report --format html
# PDF report
just report --format pdf
# Serve reports locally
just report-serve # Access at http://localhost:8080

# Build and run development environment
just podman-build development
just podman-run
# Features:
# - Full development tools (pytest, mypy, ruff)
# - Jupyter notebook support
# - Interactive debugging
# - Volume mounts for live code editing

# Build optimized production image
just podman-build-prod
# Run production API server
just podman-serve
# Features:
# - Minimal image size
# - FastAPI serving endpoint
# - Health checks and monitoring
# - Non-root security

# Start complete stack (MLflow + API)
just podman-start-stack
# Services:
# - MLflow UI: http://localhost:5000
# - API server: http://localhost:8000
# - API docs: http://localhost:8000/docs
# Stop everything
just podman-stop-stack

# Error: Kaggle credentials not found
export KAGGLE_USERNAME="your_username"
export KAGGLE_KEY="your_api_key"
# Or create ~/.kaggle/kaggle.json
mkdir -p ~/.kaggle
echo '{"username":"your_username","key":"your_key"}' > ~/.kaggle/kaggle.json
chmod 600 ~/.kaggle/kaggle.json

# Reduce dataset size for testing
just train dataset.train_split=0.1
# Use synthetic dataset
just train dataset=synthetic
# Adjust model parameters
just train model.hyperparameters.max_samples=0.1

# Fix SELinux labels (if using SELinux)
sudo chcon -Rt svirt_sandbox_file_t data/ models/
# Or use :Z mount option (already included in Justfile)

# Check MLflow server status
just mlflow-ui
# Reset MLflow tracking
rm -rf mlruns/
mkdir mlruns

# Enable debug logging
export LOG_LEVEL=DEBUG
# Run with verbose output
just train --verbose
# Check system information
just info

# Run performance benchmark
just benchmark
# Profile specific components
just benchmark --component training
just benchmark --component preprocessing

- Check logs: Look in the logs/ directory for detailed error messages
- Run diagnostics: Use just info to check the environment
- Validate configuration: Use just data-validate to check data quality
- Test installation: Run just test-smoke for quick validation
Add new anomaly detection algorithms:
- Implement BaseModelProtocol:
from src.models.base import BaseModelProtocol
import numpy as np


class CustomAnomalyDetector:
    """Intended to satisfy BaseModelProtocol; the method bodies are illustrative."""

    def fit(self, X, y=None):
        # Fit whatever statistics the detector needs (here: per-feature mean and std).
        X = np.asarray(X)
        self.mean_ = X.mean(axis=0)
        self.std_ = X.std(axis=0) + 1e-9
        return self

    def predict(self, X):
        # Return binary predictions (illustrative convention: 1 = anomaly, 0 = normal).
        return (self.decision_function(X) > 3.0).astype(int)

    def decision_function(self, X):
        # Return anomaly scores; here the largest absolute z-score across features.
        return np.abs((np.asarray(X) - self.mean_) / self.std_).max(axis=1)

- Register in ModelFactory:
# src/models/factory.py
factory.register("custom_model", CustomAnomalyDetector)

- Create configuration:
# config/model/custom_model.yaml
type: "custom_model"
hyperparameters:
  param1: value1
  param2: value2

Add new datasets:
- Create dataset configuration:
# config/dataset/my_dataset.yaml
name: "my_dataset"
download_url: "path/to/data"
target_col: "label"
train_split: 0.7
val_split: 0.15
test_split: 0.15

- Implement data loader (if needed):
# src/data/loaders.py
import pandas as pd


def load_my_dataset(config):
    # Custom loading logic; illustrative: read the file referenced by the dataset config.
    dataframe = pd.read_csv(config.download_url)
    return dataframe

Extend the preprocessing pipeline:
# Custom feature engineering
from src.features.engineering import FeatureEngineer
class CustomFeatureEngineer(FeatureEngineer):
    def create_pipeline(self):
        # Custom preprocessing steps
        pass

┌─────────────────┐   ┌─────────────────┐   ┌─────────────────┐
│   Data Layer    │   │  Config Layer   │   │   Model Layer   │
│                 │   │                 │   │                 │
│ • DVC Pipeline  │   │ • Hydra Configs │   │ • BaseProtocol  │
│ • Data Access   │   │ • Pydantic      │   │ • Model Factory │
│ • Validation    │   │ • Validation    │   │ • Algorithms    │
└─────────────────┘   └─────────────────┘   └─────────────────┘
         │                     │                     │
         └─────────────────────┼─────────────────────┘
                               │
┌─────────────────┐   ┌─────────────────┐   ┌─────────────────┐
│ Training Layer  │   │ Tracking Layer  │   │  Serving Layer  │
│                 │   │                 │   │                 │
│ • Orchestrator  │   │ • MLflow        │   │ • FastAPI       │
│ • Pipelines     │   │ • Experiments   │   │ • Health Checks │
│ • Evaluation    │   │ • Artifacts     │   │ • Monitoring    │
└─────────────────┘   └─────────────────┘   └─────────────────┘
- Separation of Concerns: Clear module boundaries and responsibilities
- Configuration-Driven: All behavior controlled via Hydra configurations
- Type Safety: Comprehensive Pydantic schemas and mypy validation
- Testability: Dependency injection and protocol-based interfaces (see the sketch after this list)
- Reproducibility: Deterministic pipelines with version control
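As an illustration of the protocol-based interfaces mentioned above, a detector protocol along the following lines (names are illustrative, not the project's BaseModelProtocol definition) lets the factory and pipelines depend on structure rather than on concrete classes:

```python
from typing import Protocol, runtime_checkable

import numpy as np


@runtime_checkable
class AnomalyDetector(Protocol):
    """Structural interface: any class providing these methods is accepted."""

    def fit(self, X: np.ndarray, y: np.ndarray | None = None) -> "AnomalyDetector": ...
    def predict(self, X: np.ndarray) -> np.ndarray: ...
    def decision_function(self, X: np.ndarray) -> np.ndarray: ...


def score(model: AnomalyDetector, X: np.ndarray) -> np.ndarray:
    # Works for any detector with the right methods, because typing is
    # structural rather than inheritance-based.
    return model.decision_function(X)
```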
- Hydra Documentation
- MLflow Documentation
- Optuna Documentation
- Pydantic Documentation
- uv Documentation
- notebooks/01_data_exploration.ipynb - Dataset analysis and visualization
- notebooks/02_model_comparison.ipynb - Algorithm comparison and selection
- notebooks/03_hyperparameter_analysis.ipynb - Optimization results analysis
- Fork the repository
- Run just bootstrap to set up the environment
- Create a feature branch: git checkout -b feature/amazing-feature
- Make changes and add tests
- Run quality checks: just check
- Commit changes: git commit -m 'Add amazing feature'
- Push to the branch: git push origin feature/amazing-feature
- Open a Pull Request
- Type Hints: All functions must have type annotations
- Documentation: Docstrings for all public functions and classes
- Testing: Minimum 80% test coverage for new code
- Quality: All code must pass just check (lint + typecheck + test)
Follow Conventional Commits:
- feat: New features
- fix: Bug fixes
- docs: Documentation changes
- test: Test additions or modifications
- refactor: Code refactoring
MIT License - see LICENSE file for details.
- Documentation: Check this README and docs/ directory
- Issues: Search existing issues or create a new one
- Discussions: Use GitHub Discussions for questions
- Examples: Check notebooks/ for usage examples
When reporting issues, please include:
- Python version and operating system
- Complete error message and stack trace
- Steps to reproduce the issue
- Configuration files (if relevant)
- Output of just info
For feature requests, please:
- Check existing issues and discussions
- Describe the use case and expected behavior
- Consider contributing the feature yourself
Built with ❤️ for the ML Engineering Community
This project demonstrates production-ready ML engineering practices and serves as a comprehensive example for building scalable, maintainable machine learning systems.