@mfdel mfdel commented Jan 14, 2026

Description

This PR adds a new method Booster.compute_leaf_similarity() to compute similarity between observations based on leaf node co-occurrence across trees, similar to Random Forest proximity matrices.

Closes #11919

API

similarity = booster.compute_leaf_similarity(data, reference, weight_type="gain")

Parameters:

  • data (DMatrix): Query dataset (m samples)
  • reference (DMatrix): Reference dataset (n samples)
  • weight_type (str): "gain" (default) or "cover"

Returns: ndarray of shape (m, n) with values in [0, 1]

Formula

$$S_w(a, b) = \frac{\sum_{t=1}^{T} w_t \cdot \mathbf{1}[\phi_t(a) = \phi_t(b)]}{\sum_{t=1}^{T} w_t}$$

Where:

  • $\phi_t(x)$ = leaf index for sample $x$ in tree $t$
  • $w_t$ = weight of tree $t$
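The formula above can be sketched in NumPy over precomputed leaf indices (as returned by `predict(..., pred_leaf=True)`). The function name `leaf_similarity` and the toy arrays below are illustrative, not the PR's actual code:

```python
import numpy as np

def leaf_similarity(query_leaves, ref_leaves, tree_weights):
    """Weighted leaf co-occurrence similarity.

    query_leaves: (m, T) leaf indices per tree for the query rows
    ref_leaves:   (n, T) leaf indices per tree for the reference rows
    tree_weights: (T,) non-negative per-tree weights w_t
    """
    w = np.asarray(tree_weights, dtype=float)
    # 1[phi_t(a) == phi_t(b)] broadcast to shape (m, n, T)
    match = query_leaves[:, None, :] == ref_leaves[None, :, :]
    # Weighted sum over trees, normalized by the total weight.
    return match.astype(float) @ w / w.sum()

# Toy example: 3 query rows, 2 reference rows, 4 trees.
Q = np.array([[0, 1, 2, 0],
              [0, 1, 3, 1],
              [1, 0, 2, 0]])
R = np.array([[0, 1, 2, 0],
              [1, 0, 3, 1]])
S = leaf_similarity(Q, R, np.ones(4))  # uniform weights: plain match fraction
```

With uniform weights this reduces to the fraction of trees in which the two rows land in the same leaf, e.g. `S[0, 0] == 1.0` because `Q[0]` and `R[0]` share every leaf.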

Weight Types

Following the suggestion in #11919 to reuse feature importance definitions:

  • "gain" (default): Sum of loss reduction across all splits in the tree. Trees that contribute more to model improvement are weighted higher.
  • "cover": Sum of hessian values across all splits. For regression (hessian=1), this equals sample count. For classification (hessian=p(1-p)), this emphasizes trees that process more uncertain samples.
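As a rough sketch of how the per-tree weights might be aggregated, the record layout below (dicts with `"tree"`, `"gain"`, `"cover"` keys) is an assumption loosely mirroring the columns of `trees_to_dataframe()`, not the PR's actual code:

```python
from collections import defaultdict

def tree_weights(splits, weight_type="gain"):
    """Aggregate per-split statistics into one weight per tree.

    splits: iterable of dicts with keys "tree", "gain", "cover",
            one entry per internal split node (leaves excluded).
    """
    if weight_type not in ("gain", "cover"):
        raise ValueError(f"unknown weight_type: {weight_type!r}")
    totals = defaultdict(float)
    for s in splits:
        # Sum the chosen statistic across all splits of each tree.
        totals[s["tree"]] += s[weight_type]
    # Return weights ordered by tree id.
    return [totals[t] for t in sorted(totals)]

splits = [
    {"tree": 0, "gain": 5.0, "cover": 10.0},
    {"tree": 0, "gain": 1.0, "cover": 4.0},
    {"tree": 1, "gain": 2.0, "cover": 8.0},
]
```

Here `tree_weights(splits)` gives `[6.0, 2.0]` (tree 0 dominates by gain), while `"cover"` gives `[14.0, 8.0]`.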

Implementation Notes

  • Pure Python using pred_leaf=True and trees_to_dataframe()
  • Row-by-row computation for memory efficiency with large datasets
  • Properties: self-similarity = 1.0, symmetric when data==reference, range [0,1]
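The row-by-row computation mentioned above might look like the following generator sketch, which keeps memory at one `(n, T)` boolean array per query row instead of the full `(m, n, T)` tensor (the function name and structure are mine, not the PR's):

```python
import numpy as np

def leaf_similarity_rows(query_leaves, ref_leaves, tree_weights):
    """Yield one (n,) similarity row at a time, using O(n * T) memory."""
    w = np.asarray(tree_weights, dtype=float)
    total = w.sum()
    for q in query_leaves:                       # q has shape (T,)
        # Compare one query row against all reference rows at once.
        match = (ref_leaves == q).astype(float)  # (n, T)
        yield match @ w / total

# Toy check: 2 query rows, 3 reference rows, 2 trees, uniform weights.
Q = np.array([[0, 1], [1, 0]])
R = np.array([[0, 1], [0, 0], [1, 0]])
S = np.stack(list(leaf_similarity_rows(Q, R, [1.0, 1.0])))
```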

Tests

Added tests/python/test_leaf_similarity.py with tests for:

  • Shape and range validation
  • Self-similarity (diagonal = 1.0)
  • Both weight types ("gain" and "cover") produce valid similarity matrices
  • weight_type defaults to "gain"
  • An invalid weight_type raises ValueError

@trivialfis (Member)

Thank you for opening the PR. I will see if I can run some experiments with it and avoid using trees_to_dataframe() (that method itself is quite difficult to extend).

@trivialfis (Member) commented Jan 19, 2026

note to myself:
todos:

  • Test multi-class/target.
  • Gradient boosting random forest.
  • Non-DMatrix inputs.
  • strict_shape.
