PKBoost Complete Feature List

Core Algorithm Features

1. Shannon Entropy Guidance

  • MI Weight (Mutual Information): Guides splits toward informative features
  • Entropy Calculation: Measures information gain for minority classes
  • Adaptive Weighting: Auto-adjusts based on imbalance severity
  • Implementation: OptimizedShannonLoss in src/loss.rs
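
The gain formula can be pictured as the usual gradient-based gain plus an entropy bonus weighted by mi_weight. The sketch below is illustrative only and is not the exact OptimizedShannonLoss code:

/// Binary Shannon entropy of a label set (illustrative helper).
fn binary_entropy(n_pos: usize, n_neg: usize) -> f64 {
    let n = (n_pos + n_neg) as f64;
    if n == 0.0 { return 0.0; }
    let mut h = 0.0;
    for &c in &[n_pos, n_neg] {
        if c > 0 {
            let p = c as f64 / n;
            h -= p * p.log2();
        }
    }
    h
}

/// Gradient-based gain plus an entropy-guided bonus (hypothetical combination).
fn guided_gain(grad_gain: f64,
               parent: (usize, usize),
               left: (usize, usize),
               right: (usize, usize),
               mi_weight: f64) -> f64 {
    let n_l = (left.0 + left.1) as f64;
    let n_r = (right.0 + right.1) as f64;
    let n = n_l + n_r;
    // Information gain: entropy reduction achieved by the split.
    let info_gain = binary_entropy(parent.0, parent.1)
        - (n_l / n) * binary_entropy(left.0, left.1)
        - (n_r / n) * binary_entropy(right.0, right.1);
    grad_gain + mi_weight * info_gain
}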

2. Gradient Boosting

  • Newton-Raphson Method: Second-order optimization
  • Gradient & Hessian: Computed per sample for accurate updates
  • Learning Rate: Adaptive shrinkage (0.045-0.05)
  • Additive Trees: Sequential ensemble building
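
For the logistic loss this means g = p - y and h = p(1 - p) per sample, with Newton leaf weights of -Σg / (Σh + λ); the new tree's output is then scaled by the learning rate before being added to the ensemble. A generic gradient-boosting sketch, not PKBoost's exact code:

/// Per-sample gradient and hessian of the logistic loss at raw score `f`.
fn logistic_grad_hess(y: f64, f: f64) -> (f64, f64) {
    let p = 1.0 / (1.0 + (-f).exp()); // predicted probability
    (p - y, p * (1.0 - p))            // (gradient, hessian)
}

/// Newton-Raphson leaf weight with L2 regularization `lambda`.
fn leaf_weight(grads: &[f64], hess: &[f64], lambda: f64) -> f64 {
    let g: f64 = grads.iter().sum();
    let h: f64 = hess.iter().sum();
    -g / (h + lambda)
}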

3. Histogram-Based Trees

  • Quantile Binning: Up to 32 bins per feature
  • Fast Splitting: O(bins × features) instead of O(samples × features)
  • Missing Value Handling: Median imputation
  • Implementation: OptimizedHistogramBuilder in src/histogram_builder.rs
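
Conceptually, binning computes quantile cut points per feature and maps each value (median-imputed if missing) to a bin index. A simplified sketch, not the actual OptimizedHistogramBuilder:

/// Compute up to `n_bins - 1` quantile cut points for one feature column,
/// ignoring NaNs (which are later imputed with the median).
fn quantile_edges(column: &[f64], n_bins: usize) -> Vec<f64> {
    let mut vals: Vec<f64> = column.iter().copied().filter(|v| !v.is_nan()).collect();
    vals.sort_by(|a, b| a.partial_cmp(b).unwrap());
    (1..n_bins).map(|i| vals[i * vals.len() / n_bins]).collect()
}

/// Map a value to its bin index; missing values use the median instead.
fn bin_index(v: f64, edges: &[f64], median: f64) -> usize {
    let x = if v.is_nan() { median } else { v };
    edges.iter().take_while(|&&e| x >= e).count()
}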

Auto-Tuning Features

4. Automatic Hyperparameter Selection

  • Dataset Profiling: Analyzes size, imbalance, complexity
  • Principled Tuning: auto_tune_principled() in src/auto_tuner.rs
  • Parameters Auto-Set:
    • Learning rate (0.045-0.05)
    • Max depth (4-6)
    • Min child weight (1.5-20.0)
    • Regularization (lambda, gamma)
    • MI weight (0.1-0.3)
    • Early stopping rounds (45-92)

5. Imbalance Detection

  • Automatic Class Weighting: scale_pos_weight computed from data
  • Severity Classification: High (< 10%), Moderate (10-20%), Balanced (> 20%)
  • Adaptive Parameters: Depth, learning rate, min_child_weight adjust to imbalance
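
A minimal sketch of this detection logic, using the thresholds listed above (function name hypothetical):

/// Compute the positive-class weight and a severity label from binary labels.
fn imbalance_profile(y: &[f64]) -> (f64, &'static str) {
    let n_pos = y.iter().filter(|&&v| v > 0.5).count() as f64;
    let n_neg = y.len() as f64 - n_pos;
    let scale_pos_weight = n_neg / n_pos.max(1.0);
    let pos_rate = n_pos / y.len() as f64;
    let severity = if pos_rate < 0.10 {
        "high"            // severe imbalance: deeper regularization, higher MI weight
    } else if pos_rate < 0.20 {
        "moderate"
    } else {
        "balanced"
    };
    (scale_pos_weight, severity)
}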

Regularization Features

6. L2 Regularization

  • Reg Lambda: Penalizes large leaf weights (0.2-0.45)
  • Prevents Overfitting: Especially on minority classes

7. Complexity Penalty

  • Gamma: Minimum gain required for split (0.1)
  • Tree Pruning: Removes low-gain splits

8. Stochastic Features

  • Row Subsampling: 80% of samples per tree
  • Column Subsampling: 70-80% of features per tree
  • Reduces Variance: Improves generalization

Performance Features

9. Parallel Processing

  • Rayon-Based: Multi-threaded tree building
  • Adaptive Thresholds: Auto-detects CPU cores and RAM
  • Smart Parallelism: Only parallelizes when beneficial (n_features > 20 or n_samples > 5000)
  • Implementation: src/adaptive_parallel.rs
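
With rayon this heuristic can be sketched as below; the real src/adaptive_parallel.rs also accounts for core count and available RAM:

use rayon::prelude::*;

/// Evaluate all feature columns, in parallel only when the workload justifies it.
fn best_gain_per_feature(features: &[Vec<f64>], n_samples: usize,
                         eval: impl Fn(&[f64]) -> f64 + Sync) -> Vec<f64> {
    if features.len() > 20 || n_samples > 5000 {
        // Large enough to amortize thread overhead: fan out across cores.
        features.par_iter().map(|col| eval(col)).collect()
    } else {
        // Small problem: the sequential path is faster.
        features.iter().map(|col| eval(col)).collect()
    }
}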

10. Optimized Data Structures

  • Transposed Data: Column-major for cache efficiency
  • Cached Histograms: Reused across splits
  • Loop Unrolling: 4x unroll in histogram building
  • SIMD-Ready: Prepared for vectorization

11. Early Stopping

  • Validation-Based: Monitors PR-AUC on validation set
  • Smoothed Metrics: 3-iteration moving average
  • Patience: Stops after N rounds without improvement
  • Saves Time: Avoids unnecessary iterations
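
The stopping rule can be sketched as a moving average over the validation PR-AUC history plus a patience counter (illustrative, not the exact implementation):

/// Returns true when training should stop: the 3-iteration moving average of
/// the validation PR-AUC has not improved for `patience` consecutive rounds.
fn should_stop(history: &[f64], patience: usize) -> bool {
    if history.len() < 3 + patience { return false; }
    let smoothed: Vec<f64> = history.windows(3)
        .map(|w| w.iter().sum::<f64>() / 3.0)
        .collect();
    let best_idx = smoothed.iter().enumerate()
        .max_by(|a, b| a.1.partial_cmp(b.1).unwrap())
        .map(|(i, _)| i).unwrap();
    smoothed.len() - 1 - best_idx >= patience
}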

Multi-Class Features

12. One-vs-Rest (OvR) Strategy

  • Binary Decomposition: N binary classifiers for N classes
  • Parallel Training: All classifiers train simultaneously
  • Softmax Normalization: Calibrated probability outputs
  • Implementation: src/multiclass.rs
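
The softmax step that turns the N one-vs-rest scores into a probability vector looks like this (standard softmax, shown for clarity):

/// Normalize per-class OvR raw scores into a probability vector.
fn softmax(scores: &[f64]) -> Vec<f64> {
    let max = scores.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = scores.iter().map(|s| (s - max).exp()).collect();
    let sum: f64 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}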

13. Per-Class Auto-Tuning

  • Independent Tuning: Each binary task auto-configures
  • Local Imbalance: Adapts to class-specific ratios
  • Optimal Parameters: Different depth/LR per class

Drift Resilience Features

14. Conservative Architecture

  • Shallow Trees: Depth 4-5 (vs 6+ in XGBoost)
  • High Regularization: Lambda 0.4-0.45
  • Prevents Overfitting: To training distribution

15. Quantile-Based Binning

  • Distribution-Adaptive: Bins adjust to feature scales
  • Robust to Shifts: Handles covariate drift
  • Median Imputation: Stable missing value handling

16. Feature Stability

  • Shannon Guidance: Prioritizes informative, stable features
  • Noise Resistance: MI weight filters out noisy features

Hierarchical Adaptive Boosting (HAB)

17. Partition-Based Ensemble

  • K-Means Clustering: Divides data into partitions
  • Specialist Models: One PKBoost per partition
  • SimSIMD Integration: SIMD-accelerated distance calculations
  • Implementation: src/partitioned_classifier.rs

18. Drift Detection

  • Per-Partition Tracking: Error monitoring via EMA
  • Automatic Detection: 30% error increase threshold
  • Selective Metamorphosis: Retrain only drifted partitions
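
A minimal sketch of the per-partition tracker: an exponential moving average of the batch error rate compared against a baseline with the 30% trigger (struct, field names, and smoothing factor are illustrative):

/// Per-partition drift tracker: EMA of the error rate plus a 30% increase trigger.
struct DriftTracker { baseline: f64, ema: f64, alpha: f64 }

impl DriftTracker {
    fn new(baseline_error: f64) -> Self {
        Self { baseline: baseline_error, ema: baseline_error, alpha: 0.1 }
    }
    /// Update with the latest batch error; returns true if drift is detected.
    fn observe(&mut self, batch_error: f64) -> bool {
        self.ema = self.alpha * batch_error + (1.0 - self.alpha) * self.ema;
        self.ema > self.baseline * 1.30
    }
}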

19. Fast Adaptation

  • 165x Faster: Than full retraining
  • Incremental Updates: Only affected partitions
  • Production-Ready: Real-time drift response

Evaluation Features

20. Built-In Metrics

  • PR-AUC: Precision-Recall Area Under Curve
  • ROC-AUC: Receiver Operating Characteristic
  • F1 Score: Harmonic mean of precision/recall
  • Macro-F1: Unweighted average across classes
  • Weighted-F1: Sample-weighted average
  • Implementation: src/metrics.rs

21. Threshold Optimization

  • Optimal Cutoff: Finds best classification threshold
  • F1 Maximization: Balances precision and recall
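
Conceptually this is a sweep over candidate cutoffs, keeping the one with the highest F1. A simplified sketch (not the exact routine in src/metrics.rs):

/// Find the probability cutoff that maximizes F1 on (probabilities, labels).
fn best_f1_threshold(probs: &[f64], labels: &[f64]) -> (f64, f64) {
    let mut best = (0.5, 0.0); // (threshold, f1)
    for i in 0..=100 {
        let t = i as f64 / 100.0;
        let (mut tp, mut fp, mut fn_) = (0.0, 0.0, 0.0);
        for (&p, &y) in probs.iter().zip(labels) {
            let pred = if p >= t { 1.0 } else { 0.0 };
            if pred == 1.0 && y == 1.0 { tp += 1.0; }
            else if pred == 1.0 { fp += 1.0; }
            else if y == 1.0 { fn_ += 1.0; }
        }
        let f1 = if tp == 0.0 { 0.0 } else { 2.0 * tp / (2.0 * tp + fp + fn_) };
        if f1 > best.1 { best = (t, f1); }
    }
    best
}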

Regression Features

22. PKBoostRegressor

  • Continuous Targets: Supports regression tasks
  • Huber Loss: Robust to outliers
  • MSE/MAE: Standard regression losses
  • Implementation: src/regression.rs

23. Outlier Detection

  • MAD-Based: Median Absolute Deviation
  • Automatic Flagging: Identifies anomalous samples
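
MAD-based flagging can be sketched as follows; the cutoff value passed in is illustrative, not necessarily the one PKBoost uses:

/// Flag samples whose target is far from the median in MAD units.
fn mad_outliers(y: &[f64], cutoff: f64) -> Vec<bool> {
    let mut sorted = y.to_vec();
    sorted.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let median = sorted[sorted.len() / 2];
    let mut devs: Vec<f64> = y.iter().map(|v| (v - median).abs()).collect();
    devs.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let mad = devs[devs.len() / 2].max(f64::EPSILON);
    y.iter().map(|v| (v - median).abs() / mad > cutoff).collect()
}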

24. Huber Loss

  • Robust Regression: Combines MSE and MAE
  • Outlier Resistance: Less sensitive than squared loss
  • Tunable Delta: Adjustable transition point
  • Implementation: src/huber_loss.rs
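
The per-sample gradient/hessian of the Huber loss, quadratic inside ±delta and linear outside, can be written as (generic sketch, not the exact src/huber_loss.rs code):

/// Huber loss gradient and hessian for residual r = prediction - target.
fn huber_grad_hess(r: f64, delta: f64) -> (f64, f64) {
    if r.abs() <= delta {
        (r, 1.0)                  // quadratic region: behaves like MSE
    } else {
        // Linear region: bounded gradient; implementations often use a small
        // positive hessian here instead of 0.0 for numerical stability.
        (delta * r.signum(), 0.0)
    }
}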

25. Living Regressor

  • Adaptive Regression: Continuous learning under drift
  • Gradient Clipping: Prevents gradient explosion
  • Heteroscedasticity Detection: Variance change monitoring
  • Residual Autocorrelation: Temporal drift detection
  • System State Tracking: Model health monitoring
  • Implementation: src/living_regressor.rs

26. Constants Module

  • Centralized Configuration: Magic numbers in one place
  • Maintainability: Easy parameter tuning
  • Implementation: src/constants.rs

27. Memory-Efficient Mode

  • Batched Processing: Handles large datasets
  • Streaming Predictions: predict_proba_batch()
  • Configurable Batch Size: Memory vs speed trade-off

Advanced Features

28. Feature Metabolism

  • Usage Tracking: Monitors feature importance
  • Dead Feature Detection: Identifies unused features
  • Tree Pruning: Removes trees dependent on dead features
  • Implementation: src/metabolism.rs

29. Adversarial Ensemble

  • Vulnerability Scoring: Detects model weaknesses
  • Adaptive Retraining: Triggers metamorphosis
  • Implementation: src/adversarial.rs

30. Living Booster

  • Dynamic Adaptation: Continuous learning
  • System State Monitoring: Tracks model health
  • Automatic Recovery: Self-healing mechanisms
  • Implementation: src/living_booster.rs

Data Handling Features

31. CSV Support

  • Header Detection: Automatic column parsing
  • Missing Values: NaN handling with median imputation
  • Type Inference: Automatic feature/label separation

32. Flexible Input

  • Vec<Vec<f64>>: Standard Rust vectors
  • No External Dependencies: Pure Rust implementation
  • Memory Efficient: Streaming-friendly design

Builder Pattern Features

33. Fluent API

  • PKBoostBuilder: Chainable configuration
  • Optional Parameters: Sensible defaults
  • Type Safety: Compile-time validation

34. Auto Mode

  • Zero Configuration: PKBoost::auto(&x, &y)
  • One-Line Training: Minimal code required
  • Production Ready: No tuning needed

Validation Features

35. Train/Val/Test Split

  • Optional Validation: Early stopping support
  • Stratified Sampling: Maintains class balance
  • Cross-Validation Ready: Easy integration

36. Verbose Logging

  • Training Progress: Iteration-by-iteration metrics
  • Diagnostic Info: Gradient norms, prediction ranges
  • Debug Mode: Detailed internal state

Export/Import Features

37. Model Serialization

  • Rust Structs: Native serialization support
  • Lightweight: Minimal storage overhead
  • Fast Loading: Quick model deployment

Comparison Features

38. Benchmark Suite

  • XGBoost Comparison: Side-by-side evaluation
  • LightGBM Comparison: Performance metrics
  • Multiple Datasets: Credit Card, Iris, Dry Bean, etc.
  • Implementation: src/bin/benchmark*.rs

39. Drift Testing

  • Synthetic Drift: Controlled noise injection
  • Real-World Drift: Temporal distribution shifts
  • Degradation Analysis: Performance tracking

Documentation Features

40. Comprehensive Docs

  • README.md: Quick start guide
  • BENCHMARKS.md: Detailed results
  • MULTICLASS.md: Multi-class usage
  • SHANNON_ANALYSIS.md: Entropy impact
  • DRYBEAN_DRIFT_RESULTS.md: Drift resilience

41. Code Examples

  • Binary Classification: Credit card fraud
  • Multi-Class: Iris, Dry Bean
  • Regression: Continuous targets
  • Drift Handling: Adaptive retraining

Python Integration

42. Python Bindings

  • PyO3 Support: Rust-Python bridge
  • NumPy Compatible: Array interface
  • Scikit-Learn Style: Familiar API
  • Implementation: src/python_bindings.rs

Advanced Drift Features

43. Advanced Drift Diagnostics

  • Error Entropy: Measures drift severity via Shannon entropy
  • Temporal Patterns: Residual autocorrelation detection
  • Variance Changes: Heteroscedasticity scoring
  • Drift Classification: Systemic vs Localized vs Feature Shift
  • Combined Drift Scoring: Multi-factor drift assessment
  • Implementation: DriftDiagnostics in src/living_regressor.rs
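
Loosely, the combined score blends the individual signals into one number; the weights below are purely illustrative, not the values used in living_regressor.rs:

/// Hypothetical combination of the three drift signals into one score.
fn combined_drift_score(error_entropy: f64,
                        autocorrelation: f64,
                        heteroscedasticity: f64) -> f64 {
    // Weights are illustrative placeholders.
    0.4 * error_entropy + 0.3 * autocorrelation.abs() + 0.3 * heteroscedasticity
}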

44. Adaptive Metamorphosis Strategies

  • Conservative: Minimal changes, high stability
  • DataAware: Adapts to concept drift patterns
  • FeatureAware: Responds to feature distribution shifts
  • Auto-Selection: Chooses strategy based on drift analysis
  • Implementation: MetamorphosisStrategy in src/living_booster.rs

45. Prediction Uncertainty

  • Ensemble Variance: Measures prediction confidence
  • Uncertainty Estimation: predict_with_uncertainty()
  • Confidence Intervals: Standard deviation of predictions
  • Implementation: PKBoostRegressor::predict_with_uncertainty()
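
Ensemble-variance uncertainty boils down to the mean and standard deviation of the individual members' predictions for each sample (generic sketch):

/// Mean prediction and standard deviation across ensemble members for one sample.
fn mean_and_std(member_preds: &[f64]) -> (f64, f64) {
    let n = member_preds.len() as f64;
    let mean = member_preds.iter().sum::<f64>() / n;
    let var = member_preds.iter().map(|p| (p - mean).powi(2)).sum::<f64>() / n;
    (mean, var.sqrt())
}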

Quick Feature Lookup

For Imbalanced Data:

// Automatic handling - zero configuration
let mut model = OptimizedPKBoostShannon::auto(&x_train, &y_train);
model.fit(&x_train, &y_train, None, true)?;
  • Auto Mode: Automatically detects imbalance and adjusts parameters
  • Shannon Entropy: MI weight prioritizes minority class (0.1-0.3)
  • Class Weighting: scale_pos_weight computed from data
  • Conservative Depth: Prevents overfitting to majority class

For Drift:

// Real-time adaptation with HAB
let mut hab = PartitionedClassifier::new(PartitionConfig::default());
hab.fit(&x_train, &y_train, None, true)?;

// Detect and adapt
let drifted = hab.observe_batch(&new_data, &new_labels);
if !drifted.is_empty() {
    hab.metamorph_partitions(&drifted, &buffer_x, &buffer_y, true)?;
}
  • HAB: 165x faster retraining than full model
  • EMA Tracking: Automatic drift detection (30% error threshold)
  • Selective Retraining: Only affected partitions updated
  • Conservative Architecture: 2-17x more drift-resilient

For Speed:

// Optimized for large datasets
let probs = model.predict_proba_batch(&x_test, 1000)?; // Batch size 1000
  • Histogram Binning: O(bins × features) vs O(samples × features)
  • Adaptive Parallelization: Auto-detects hardware (cores, RAM)
  • SIMD-Ready: SimSIMD integration for distance calculations
  • Loop Unrolling: 4x unroll in histogram building

For Interpretability:

// Feature importance and pruning
let usage = model.get_feature_usage();
let pruned = model.prune_trees(&dead_features, 0.5);
  • Feature Usage: Tracks which features are used
  • Tree Pruning: Removes trees dependent on dead features
  • Feature Metabolism: Monitors feature importance over time
  • Verbose Logging: Detailed training diagnostics

For Multi-Class:

// One-vs-Rest with per-class tuning
let mut model = MultiClassPKBoost::new(n_classes);
model.fit(&x_train, &y_train, None, true)?;
let probs = model.predict_proba(&x_test)?; // [n_samples, n_classes]
  • OvR Strategy: N binary classifiers for N classes
  • Parallel Training: All classifiers train simultaneously
  • Per-Class Auto-Tuning: Each binary task optimized independently
  • Softmax Normalization: Calibrated probability outputs

For Regression:

// Robust regression with outlier handling
let mut model = PKBoostRegressor::auto(&x_train, &y_train);
model.fit(&x_train, &y_train, Some((&x_val, &y_val)), true)?;
  • Huber Loss: Robust to outliers
  • Living Regressor: Adaptive learning under drift
  • Gradient Clipping: Prevents gradient explosion
  • Heteroscedasticity Detection: Variance change monitoring

Summary Statistics

  • Total Features: 45
  • Core Modules: 20+
  • Binary Examples: 15+
  • Supported Tasks: Binary, Multi-Class, Regression
  • Datasets Tested: 10+
  • Lines of Code: ~5,000+
  • Performance: 10-40% better than XGBoost/LightGBM on imbalanced data
  • Drift Resilience: 2-17x better than competitors

Decision Guide: Which PKBoost to Use?

What's your task?

  • Binary classification → Auto Mode (OptimizedPKBoostShannon::auto)
  • Multi-class → MultiClassPKBoost
  • Regression → PKBoostRegressor::auto
  • Streaming / continuous adaptation → Living Booster

Then, depending on drift:

  • Expecting drift on a classification task? → HAB (PartitionedClassifier)
  • Expecting heavy drift on a regression task? → Living Regressor
  • Otherwise → the choice above is production ready

Performance Benchmarks

Speed Comparison (Credit Card Dataset, 170K samples)

Model    | Training Time | Prediction Time | Memory
PKBoost  | 92.7s         | 0.15s           | 145 MB
XGBoost  | 12.0s         | 0.18s           | 210 MB
LightGBM | 11.2s         | 0.12s           | 190 MB

Imbalanced Data Performance (0.17% positive class)

Model    | PR-AUC | F1 Score | Recall @ 90% Precision
PKBoost  | 0.878  | 0.874    | 0.812
XGBoost  | 0.745  | 0.798    | 0.612
LightGBM | 0.793  | 0.713    | 0.645

Drift Resilience (Dry Bean, Drift=3.0)

Model    | Baseline Acc | Drifted Acc | Degradation
PKBoost  | 92.54%       | 92.14%      | -0.43%
XGBoost  | 92.25%       | 91.41%      | -0.91%
LightGBM | 92.36%       | 91.85%      | -0.55%

HAB Adaptation Speed

  • Full Retraining: 92.7s
  • HAB Metamorphosis: 0.56s (165x faster)
  • Selective Partition Update: 0.08s per partition

Troubleshooting Guide

Problem: Model overfitting to training data

Solution: Increase regularization

let model = PKBoostBuilder::new()
    .reg_lambda(2.0)      // Default: 1.0
    .gamma(0.2)            // Default: 0.1
    .subsample(0.7)        // Default: 0.8
    .build_with_data(&x, &y);

Problem: Training too slow

Solution: Reduce complexity or use batching

let model = PKBoostBuilder::new()
    .n_estimators(500)     // Default: 1000
    .max_depth(4)          // Default: 6
    .histogram_bins(16)    // Default: 32
    .build_with_data(&x, &y);

// Or use batched prediction
let probs = model.predict_proba_batch(&x_test, 1000)?;

Problem: Poor performance on minority class

Solution: Use auto mode (already optimized) or increase MI weight

// Auto mode handles this automatically
let model = OptimizedPKBoostShannon::auto(&x, &y);

// Or manually tune
let model = PKBoostBuilder::new()
    .mi_weight(0.4)           // Higher entropy guidance
    .scale_pos_weight(10.0)   // Upweight minority class
    .build_with_data(&x, &y);

Problem: Gradients exploding in adaptive mode

Solution: Already handled! Gradient clipping is automatic

// Living regressor automatically clips gradients > 5000
// See living_regressor.rs - automatic protection
// Monitor via verbose logging if needed

Problem: Drift detected but metamorphosis fails

Solution: Check buffer size and validation set

// Ensure buffer has enough samples
let mut booster = AdversarialLivingBooster::new(&x_train, &y_train);

// Need at least 1000 samples in buffer for retraining
// Need at least 100 samples for validation
// Automatic rollback if metamorphosis degrades performance

API Quick Reference

Core Classes

// Static model
OptimizedPKBoostShannon::auto(&x, &y)
  .fit(x, y, eval_set, verbose) -> Result
  .predict_proba(x) -> Result<Vec<f64>, String>
  .predict_proba_batch(x, batch_size) -> Result<Vec<f64>, String>
  .prune_trees(features, threshold) -> usize
  .get_feature_usage() -> Vec<usize>

// Adaptive model (binary classification)
AdversarialLivingBooster::new(&x, &y)
  .fit_initial(x, y, eval_set, verbose) -> Result
  .observe_batch(x, y, verbose) -> Result
  .get_state() -> SystemState
  .get_metamorphosis_count() -> usize
  .get_vulnerability_score() -> f64

// Partitioned model (165x faster adaptation)
PartitionedClassifier::new(config)
  .partition_data(x, y, verbose)
  .train_specialists(x, y, verbose) -> Result
  .observe_batch(x, y) -> Vec<usize>  // Returns drifted partitions
  .metamorph_partitions(partition_ids, x, y, verbose) -> Result

// Regression
PKBoostRegressor::auto(&x, &y)
  .fit(x, y, eval_set, verbose) -> Result
  .predict(x) -> Result<Vec<f64>, String>
  .predict_with_uncertainty(x) -> Result<(Vec<f64>, Vec<f64>), String>

// Multi-class
MultiClassPKBoost::new(n_classes)
  .fit(x, y, eval_set, verbose) -> Result
  .predict_proba(x) -> Result<Vec<Vec<f64>>, String>
  .predict(x) -> Result<Vec<usize>, String>

// Builder pattern
PKBoostBuilder::new()
  .n_estimators(1000)
  .learning_rate(0.05)
  .max_depth(6)
  .mi_weight(0.3)
  .reg_lambda(1.0)
  .gamma(0.1)
  .subsample(0.8)
  .colsample_bytree(0.8)
  .build_with_data(&x, &y)

Enums

SystemState { Normal, Alert { checks_in_alert }, Metamorphosis }
MetamorphosisStrategy { Conservative, DataAware, FeatureAware }
DriftType { Systemic, Localized, FeatureShift }
RegressionLossType { MSE, Huber { delta } }

Feature Roadmap

Planned Features:

Model Introspection:

  • SHAP-like values for prediction explanation
  • Tree visualization (Graphviz export)
  • Feature importance plots
  • Partial dependence plots

Advanced Drift Detection:

  • Kolmogorov-Smirnov test for distribution shift
  • Population Stability Index (PSI)
  • Per-feature drift monitoring
  • Automatic drift severity classification

Ensemble Methods:

  • Bagging of PKBoost models
  • Stacking with PKBoost as base learner
  • Out-of-bag error estimation
  • Ensemble diversity metrics

Calibration:

  • Platt scaling
  • Isotonic regression
  • Reliability diagrams
  • Expected Calibration Error (ECE)

Error Handling:

  • Comprehensive error types (PKBoostError enum)
  • Input validation methods
  • Graceful degradation
  • Error recovery mechanisms

Testing:

  • Unit tests for all modules
  • Integration tests
  • Property-based testing
  • Benchmark regression tests

Serialization:

  • Serde support for JSON/bincode
  • Model versioning
  • Backward compatibility
  • Compression support

Python API Extensions:

  • Scikit-learn compatible get_params()/set_params()
  • feature_importances_ property
  • apply() method for leaf indices
  • GridSearchCV compatibility

Feature Comparison Matrix

Capabilities compared across PKBoost, XGBoost, and LightGBM (see BENCHMARKS.md for the full comparison):

  • Shannon Entropy guidance
  • Auto-Tuning
  • Imbalance Detection
  • Multi-Class OvR
  • Drift Detection
  • Adaptive Retraining
  • Histogram Binning
  • Parallel Training
  • Early Stopping
  • Built-in Metrics
  • Zero Configuration
  • Rust Native

PKBoost: The most feature-complete gradient boosting library for imbalanced data under drift.