RunIA-core is an open-source Python library for uncertainty estimation and Out-of-Distribution (OoD) detection in AI models. It provides comprehensive tools for evaluating and deploying uncertainty estimation methods across computer vision tasks (image classification, object detection, semantic segmentation) and natural language processing (LLM hallucination detection).
- Object Detection Support: Beyond image classification, RunIA-core includes modules for object detection architectures (Faster RCNN, YOLOv8, RT-DETR, Deformable DETR, OWLv2) and semantic segmentation (DeepLabv3+, U-Net)
- Latent Space Uncertainty Estimation: LaRED (Latent Representations Density) and LaREM (Latent Representations Mahalanobis) methods for OoD detection
- Multiple Baseline Methods: Support for 15+ baseline OoD detection methods (MSP, Energy, Mahalanobis, kNN, ViM, DDU, DICE, ReAct, and more)
- LLM Uncertainty: Hallucination detection with methods like semantic entropy, RAUQ, perplexity, and eigen scores
- Monte Carlo Dropout (MCD): Epistemic uncertainty estimation through MC sampling and entropy computation
- Feature Extraction: Image-level and object-level feature extraction for various architectures
- Flexible Inference: Production-ready inference modules for real-time OoD detection
- Comprehensive Evaluation: Built-in metrics (AUROC, AUPR, FPR@95) and visualization tools
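The built-in metrics are the standard ones for OoD detection. As a point of reference, they can be computed from raw InD/OoD score arrays with scikit-learn alone; this is a sketch of the metric definitions, not RunIA-core's implementation, and it assumes the convention that higher scores mean more in-distribution:

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score, roc_curve

def ood_metrics(ind_scores, ood_scores):
    """AUROC, AUPR, and FPR@95 for a score where higher = more in-distribution."""
    scores = np.concatenate([ind_scores, ood_scores])
    # InD is treated as the positive class
    labels = np.concatenate([np.ones(len(ind_scores)), np.zeros(len(ood_scores))])
    auroc = roc_auc_score(labels, scores)
    aupr = average_precision_score(labels, scores)
    # FPR@95: false-positive rate at the first threshold reaching 95% TPR
    fpr, tpr, _ = roc_curve(labels, scores)
    fpr95 = float(fpr[np.argmax(tpr >= 0.95)])
    return auroc, aupr, fpr95
```

A perfect detector yields AUROC/AUPR near 1.0 and FPR@95 near 0; a random one sits near AUROC 0.5.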
- Installation
- Quick Start
- Usage Examples
- Supported Tasks and Architectures
- API Overview
- Hardware Requirements
- References
Installation
- Python 3.9 or higher
- CUDA-capable GPU (recommended for computer vision tasks and LLMs)
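The Python requirement can be confirmed from the interpreter itself; a convenience snippet, not part of RunIA-core:

```python
import sys

# RunIA-core requires Python >= 3.9
supported = sys.version_info >= (3, 9)
print(f"Python {sys.version.split()[0]}: {'OK' if supported else 'upgrade required'}")
```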
Using pip
# Clone the repository
git clone <repository-url>
cd runia_core
# Create a virtual environment (recommended)
python -m venv runia_env
source runia_env/bin/activate # On Windows: runia_env\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Install the package
pip install .
Using conda
# Create a conda environment
conda create -n runia_env python=3.9
conda activate runia_env
# Install dependencies and package
pip install -r requirements.txt
pip install .
Using uv
See the uv documentation for installation instructions. uv resolves and installs dependencies on first run, so you can run any script directly:
# Directly run any script (dependencies are installed on first run)
uv run your_script.py
Quick Start
import torch
from runia_core.evaluation import Hook, get_latent_representation_mcd_samples, get_dl_h_z
from runia_core.inference import LaRExInference, MCSamplerModule, LaREMPostprocessor
from runia_core import apply_pca_ds_split
# Setup model with dropout/dropblock layer
model = YourModel()
hooked_layer = Hook(model.dropout_layer)
model.eval()
# Extract MC samples and compute entropy
latent_samples = get_latent_representation_mcd_samples(
    model, dataloader, n_samples=16, hooked_layer=hooked_layer
)
_, entropy_samples = get_dl_h_z(latent_samples, mcd_samples_nro=16)
# Setup OoD detector
pca_train, pca_transform = apply_pca_ds_split(entropy_samples, nro_components=256)
detector = LaREMPostprocessor()
detector.setup(pca_train)
# Inference on new images
inference_module = LaRExInference(
    dnn_model=model,
    detector=detector,
    mcd_sampler=MCSamplerModule,
    pca_transform=pca_transform,
    mcd_samples_nro=16,
    layer_type="Conv"
)
prediction, confidence_score = inference_module.get_score(test_image, layer_hook=hooked_layer)
LLM Uncertainty Quick Start
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
from runia_core.llm_uncertainty import compute_uncertainties
# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
gen_config = GenerationConfig(max_new_tokens=50, do_sample=True, temperature=1.0)
# Define uncertainty methods
uncertainty_requests = [
    {"method_name": "semantic_entropy"},
    {"method_name": "perplexity"},
    {"method_name": "eigen_score"},
    {"method_name": "RAUQ", "token_aggregation": "original", "head_aggregation": "mean_heads"}
]
# Compute uncertainties
generated_text, scores = compute_uncertainties(
    model, tokenizer, "Your prompt here",
    uncertainty_requests, gen_config, num_samples=10
)
Supported Tasks and Architectures
| Task | Datasets (In-Dist) | Datasets (OoD) | Architectures |
|---|---|---|---|
| Image Classification | CIFAR10 | FMNIST, SVHN, Places365, Textures, iSUN, LSUN | ResNet-18, ResNet-18 + Spectral Norm |
| Object Detection | BDD100k, Pascal VOC | COCO, OpenImages | Faster RCNN, YOLOv8, RT-DETR, Deformable DETR, OWLv2 |
| Semantic Segmentation | Woodscape, Cityscapes | Woodscape-anomalies, Cityscapes-anomalies | DeepLabv3+, U-Net |

| Task | Datasets (In-Dist) | Datasets (OoD) | Architectures |
|---|---|---|---|
| Hallucination Detection | SQuADv2 | TriviaQA, Natural Questions, HotpotQA | Llama-3.1, DistilBERT-base |
Note: For epistemic uncertainty estimation in computer vision tasks, models should include dropout or DropBlock2D layers to enable Monte Carlo Dropout sampling. However, RunIA-core can be used with any architecture by hooking any latent layer and extracting features for OoD detection with LaRED/LaREM, without the need for MC sampling. In this case, the latent space methods are applied to the extracted features instead of the entropy of the MC samples.
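The Mahalanobis-distance idea behind this latent-space scoring can be illustrated independently of the library: fit a Gaussian to InD latent features, then score new features by their negative Mahalanobis distance to it. A minimal NumPy sketch; the class name and ridge constant are illustrative, not the actual LaREMPostprocessor:

```python
import numpy as np

class ToyMahalanobisDetector:
    """Score latent features by negative Mahalanobis distance to the InD fit.

    Higher score = closer to the in-distribution training features.
    """

    def fit(self, ind_feats: np.ndarray) -> "ToyMahalanobisDetector":
        self.mean = ind_feats.mean(axis=0)
        cov = np.cov(ind_feats, rowvar=False)
        # Small ridge term keeps the covariance invertible
        self.precision = np.linalg.inv(cov + 1e-6 * np.eye(cov.shape[0]))
        return self

    def score(self, feats: np.ndarray) -> np.ndarray:
        d = feats - self.mean
        # Quadratic form d^T P d for each row, negated so InD scores high
        return -np.einsum("ij,jk,ik->i", d, self.precision, d)
```

Thresholding this score (e.g., at a percentile of the InD scores) turns it into a binary OoD detector.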
Usage Examples
OoD Detection in Object Detection: Evaluation Pipeline
Evaluate OoD detection methods on in-distribution (InD) vs. out-of-distribution (OoD) datasets in object detection. The library focuses on latent space methods but can also compute 10+ baselines (MSP, Energy, Mahalanobis, kNN, ViM, DDU, DICE, ReAct, etc.):
import torch
from omegaconf import OmegaConf
from runia_core.feature_extraction import Hook, BoxFeaturesExtractor, get_aggregated_data_dict, associate_precalculated_baselines_with_raw_predictions
from runia_core.evaluation import log_evaluate_larex, calculate_all_baselines, remove_latent_features
from runia_core.inference.abstract_classes import get_baselines_thresholds
# Setup
BASELINES_NAMES = ["msp", "gen", "energy", "mdist", "knn", "ddu"]
LATENT_SPACE_POSTPROCESSORS = ["MD"]
cfg = OmegaConf.create({
    "ind_dataset": "bdd",  # used below to select the number of classes
    "ood_datasets": ["ood_dataset_name"],
    "z_score_thresholds": 95,  # example percentile for threshold calculation
})
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Load model and hook dropout/dropblock layer
model = YourModel.load_from_checkpoint("model.pt")
hooked_layers = [Hook(model.my_latent_layer)]
model.to(device).eval()
# Instantiate BoxFeaturesExtractor for object detection architectures (e.g., Faster RCNN, YOLOv8, RT-DETR, Deformable DETR, OWLv2)
samples_extractor = BoxFeaturesExtractor(
    model=model,
    hooked_layers=hooked_layers,
    device=device,
    roi_output_sizes=[16],
    roi_sampling_ratio=-1,
    return_raw_predictions=False,
    return_stds=False,
    hook_layer_output=True,
    architecture="rcnn"
)
# Extract latent samples and features for InD and OoD datasets
ind_data_dict = {
    "train": samples_extractor.get_ls_samples(ind_train_data_loader, predict_conf=0.5),
    "valid": samples_extractor.get_ls_samples(ind_val_data_loader, predict_conf=0.5)
}
aggregated_ind_data_dict = dict()
# Track images for which no objects are found when varying the prediction confidence
ind_no_obj = dict()
non_empty_preds_ind_im_ids = dict()
ood_data_dict = {"ood_dataset_name": samples_extractor.get_ls_samples(ood_data_loader, predict_conf=0.5)}
aggregated_ood_data_dict = dict()
non_empty_preds_ood_im_ids = dict()
ood_no_obj = dict()
# Preprocess ID datasets and aggregate data in the format required for evaluation (one entry per image with aggregated features from all predicted boxes)
for split in ind_data_dict:
    aggregated_ind_data_dict, ind_no_obj, non_empty_preds_ind_im_ids = get_aggregated_data_dict(
        data_dict=ind_data_dict,
        dataset_name=split,
        aggregated_data_dict=aggregated_ind_data_dict,
        no_obj_dict=ind_no_obj,
        non_empty_predictions_ids=non_empty_preds_ind_im_ids,
        probs_as_logits=False
    )
# Preprocess OOD datasets and aggregate data in the format required for evaluation
for ood_dataset_name in cfg.ood_datasets:
    aggregated_ood_data_dict, ood_no_obj, non_empty_preds_ood_im_ids = get_aggregated_data_dict(
        data_dict=ood_data_dict,
        dataset_name=ood_dataset_name,
        aggregated_data_dict=aggregated_ood_data_dict,
        no_obj_dict=ood_no_obj,
        non_empty_predictions_ids=non_empty_preds_ood_im_ids,
        probs_as_logits=False
    )
aggregated_ind_data_dict, aggregated_ood_data_dict, ood_baselines_scores_dict = calculate_all_baselines(
    baselines_names=BASELINES_NAMES,
    ind_data_dict=aggregated_ind_data_dict,
    ood_data_dict=aggregated_ood_data_dict,
    fc_params=None,
    cfg=cfg,
    num_classes=10 if cfg.ind_dataset == "bdd" else 20
)
aggregated_ind_data_dict, aggregated_ood_data_dict = remove_latent_features(
    id_data=aggregated_ind_data_dict,
    ood_data=aggregated_ood_data_dict,
    ood_names=cfg.ood_datasets
)
baselines_thresholds = get_baselines_thresholds(
    baselines_names=BASELINES_NAMES,
    baselines_scores_dict=aggregated_ind_data_dict,
    z_score_percentile=cfg.z_score_thresholds
)
# Associate calculated baselines scores with raw predictions dicts
# OOD
for ood_dataset_name in cfg.ood_datasets:
    ood_data_dict[ood_dataset_name] = associate_precalculated_baselines_with_raw_predictions(
        data_dict=ood_data_dict[ood_dataset_name],
        dataset_name=ood_dataset_name,
        ood_baselines_dict=ood_baselines_scores_dict,
        baselines_names=BASELINES_NAMES,
        non_empty_ids=non_empty_preds_ood_im_ids[ood_dataset_name],
        is_ood=True
    )
# InD
ind_data_dict["valid"] = associate_precalculated_baselines_with_raw_predictions(
    data_dict=ind_data_dict["valid"],
    dataset_name="valid",
    ood_baselines_dict=aggregated_ind_data_dict,
    baselines_names=BASELINES_NAMES,
    non_empty_ids=non_empty_preds_ind_im_ids["valid"],
    is_ood=False
)
metrics_df, best_postprocessors_dict, postprocessor_thresholds, aggregated_ood_data_dict = log_evaluate_larex(
    cfg=cfg,
    baselines_names=BASELINES_NAMES,
    ind_data_dict=aggregated_ind_data_dict,
    ood_data_dict=aggregated_ood_data_dict,
    ood_baselines_scores=ood_baselines_scores_dict,
    mlflow_run_name="my_run_name",
    mlflow_logging=False,
    visualize_score=LATENT_SPACE_POSTPROCESSORS[0],
    postprocessors=LATENT_SPACE_POSTPROCESSORS,
)
print(metrics_df)
OoD Detection in Object Detection: Inference Pipeline
Deploy OoD detection in production with the inference module, using the best postprocessor from evaluation (e.g., LaREM or LaRED) or any other method from the evaluation pipeline. Set an appropriate confidence threshold for the predictions considered at inference time; it can be tuned from the evaluation results:
import torch
from runia_core.feature_extraction import get_aggregated_data_dict
from runia_core.inference import postprocessors_dict, ObjectLevelInference, postprocessor_input_dict
METHOD = "energy" # or "MD" for LaREM, "KDE" for LaRED, or any other method from the evaluation pipeline
LATENT_SPACE_METHOD = False # Set to True if using latent space postprocessors (LaREM or LaRED), False for other methods
INFERENCE_THRESHOLD = 0.5 # Set the confidence threshold for predictions to be considered in inference (can be tuned based on evaluation results)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Load pre-calculated latent space activations, features or logits for InD datasets
# Extracted using the BoxFeaturesExtractor or any other method and saved in the required format for evaluation and inference
# InD
ind_data_splits = ["train", "valid"]
ind_data_dict = dict()
aggregated_ind_data_dict = dict()
non_empty_preds_ind_im_ids = dict()
# Track images for which no objects are found when varying the prediction confidence
ind_no_obj = dict()
for split in ind_data_splits:
    ind_file_name = f"my/file/name_{split}.pt"
    # Load InD latent space activations
    ind_data_dict[split] = torch.load(ind_file_name, map_location=device)
    aggregated_ind_data_dict, ind_no_obj, non_empty_preds_ind_im_ids = get_aggregated_data_dict(
        data_dict=ind_data_dict,
        dataset_name=split,
        aggregated_data_dict=aggregated_ind_data_dict,
        no_obj_dict=ind_no_obj,
        non_empty_predictions_ids=non_empty_preds_ind_im_ids,
        probs_as_logits=False
    )
postprocessor = postprocessors_dict[METHOD](flip_sign=False)
postprocessor.setup(ind_train_data=aggregated_ind_data_dict["valid logits"])
# Load model
model = YourModel.load_from_checkpoint("model.pt")
hooked_layers = []  # Layers to hook when using latent space methods (LaREM/LaRED); leave empty for methods that need no feature extraction
inference_module = ObjectLevelInference(
    model=model,
    postprocessor=postprocessor,
    architecture="RTDETR",  # or "rcnn", "yolo", "deformable_detr", "owlv2"
    latent_space_method=LATENT_SPACE_METHOD,
    postprocessor_input=postprocessor_input_dict[METHOD] if not LATENT_SPACE_METHOD else ["latent_space_means"],
    hooked_layers=hooked_layers,
    roi_output_sizes=[16],
)
with torch.no_grad():
    # Perform inference on new images
    for input_im in my_data_loader:
        predictions, scores = inference_module.get_score(input_im, predict_conf=INFERENCE_THRESHOLD)
LLM Uncertainty Estimation for Hallucination Detection
Detect hallucinations and measure uncertainty in LLM outputs:
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
from runia_core.llm_uncertainty import compute_uncertainties
# Load model
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
gen_config = GenerationConfig(
    max_new_tokens=50,
    do_sample=True,
    top_p=0.9,
    temperature=1.0
)
# Define uncertainty methods
uncertainty_methods = [
    {"method_name": "semantic_entropy"},    # Semantic uncertainty
    {"method_name": "eigen_score"},         # Eigenvalue-based score
    {"method_name": "perplexity"},          # Model perplexity
    {"method_name": "normalized_entropy"},  # Normalized entropy
    {"method_name": "generation_entropy"},  # Generation-level entropy
    {
        "method_name": "RAUQ",              # Attention-based uncertainty
        "token_aggregation": "original",
        "head_aggregation": "mean_heads",
        "alphas": [0.2, 0.4, 0.6],
        "ablation": True
    }
]
# Compute uncertainties
text, scores = compute_uncertainties(
    model,
    tokenizer,
    prompt="What is the capital of France?",
    uncertainty_requests=uncertainty_methods,
    gen_config=gen_config,
    num_samples=10
)
print(f"Generated: {text}")
print(f"Uncertainty Scores: {scores}")
API Overview
| Module | Description |
|---|---|
| `runia_core.evaluation` | MC sampling, entropy computation, baselines, OoD evaluation metrics |
| `runia_core.inference` | Production-ready inference with LaRED/LaREM postprocessors and other baselines |
| `runia_core.feature_extraction` | Image-level and object-level feature extraction |
| `runia_core.llm_uncertainty` | LLM uncertainty and hallucination detection methods |
| `runia_core.dimensionality_reduction` | PCA and other dimensionality reduction utilities |
Evaluation:
- `Hook`: Capture layer outputs during a forward pass
- `BoxFeaturesExtractor`: Extract latent samples and features for object detection architectures
- `FastMCDSamplesExtractor`: Efficiently extract latent samples for image classification architectures
- `log_evaluate_larex()`: Evaluate LaREx and baselines with metrics
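`Hook` follows the standard PyTorch forward-hook pattern; the following is a minimal sketch of the mechanism it relies on (an illustrative class, not the RunIA-core implementation):

```python
import torch
import torch.nn as nn

class MiniHook:
    """Store the most recent output of a hooked module."""

    def __init__(self, module: nn.Module):
        self.output = None
        self._handle = module.register_forward_hook(self._capture)

    def _capture(self, module, inputs, output):
        self.output = output.detach()

    def remove(self):
        self._handle.remove()

# Hook the hidden ReLU of a toy model and grab its latent features
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
hook = MiniHook(model[1])
_ = model(torch.randn(3, 4))
latent = hook.output  # tensor of shape (3, 8)
hook.remove()
```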
Inference:
- `postprocessors_dict`: Dictionary of available postprocessors for inference
- `ObjectLevelInference`: Inference module for object detection architectures
- `LaRExInference`: Main inference module for OoD detection using latent space methods
- `LaREMPostprocessor`: Mahalanobis distance-based detector (recommended) for latent space postprocessing
LLM Uncertainty:
- `compute_uncertainties()`: Compute multiple uncertainty scores for LLM outputs
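As a concrete reference point for the simplest of these scores: perplexity is the exponentiated mean negative log-likelihood of the generated tokens. A library-independent sketch:

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token log-probabilities (natural log).

    Higher perplexity = the model was less confident in its own generation,
    a cheap (if weak) hallucination signal.
    """
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# A model assigning probability 1/4 to every token has perplexity 4
print(perplexity([math.log(0.25)] * 10))
```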
Hardware Requirements
- CPU: Supported but slow for computer vision tasks
- GPU: Required for efficient inference on object detection and segmentation
- Memory: Varies by model size (8GB+ GPU memory recommended)
Contributions are welcome! Please feel free to submit issues or pull requests by following the Contribution Guidelines.
See LICENSE.txt for details.
- Fabio Arnez - fabio.arnez@cea.fr
- Daniel Montoya - daniel-alfonso.montoyavasquez@cea.fr
References
- Latent representation entropy density for distribution shift detection
- The Map of Misbelief: Tracing Intrinsic and Extrinsic Hallucinations Through Attention Patterns
