@mfdel mfdel commented Jan 14, 2026

Description

This PR adds a new method Booster.compute_leaf_similarity() to compute similarity between observations based on leaf node co-occurrence across trees, similar to Random Forest proximity matrices.

Closes #11919

API

similarity = booster.compute_leaf_similarity(data, reference, weight_type="gain")

Parameters:

  • data (DMatrix): Query dataset (m samples)
  • reference (DMatrix): Reference dataset (n samples)
  • weight_type (str): "gain" (default) or "cover"

Returns: ndarray of shape (m, n) with values in [0, 1]

Formula

$$S_w(a, b) = \frac{\sum_{t=1}^{T} w_t \cdot \mathbf{1}[\phi_t(a) = \phi_t(b)]}{\sum_{t=1}^{T} w_t}$$

Where:

  • $\phi_t(x)$ = leaf index for sample $x$ in tree $t$
  • $w_t$ = weight of tree $t$
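The formula above can be sketched in NumPy over precomputed leaf indices (as returned by `predict(..., pred_leaf=True)`). The function name `leaf_similarity` and the toy arrays below are illustrative, not the PR's actual code:

```python
import numpy as np

def leaf_similarity(query_leaves, ref_leaves, tree_weights):
    """Weighted leaf co-occurrence similarity.

    query_leaves: (m, T) leaf indices per tree for the query rows
    ref_leaves:   (n, T) leaf indices per tree for the reference rows
    tree_weights: (T,) non-negative per-tree weights w_t
    """
    w = np.asarray(tree_weights, dtype=float)
    # 1[phi_t(a) == phi_t(b)] broadcast to shape (m, n, T)
    match = query_leaves[:, None, :] == ref_leaves[None, :, :]
    # Weighted sum over trees, normalized by the total weight.
    return match.astype(float) @ w / w.sum()

# Toy example: 3 query rows, 2 reference rows, 4 trees.
Q = np.array([[0, 1, 2, 0],
              [0, 1, 3, 1],
              [1, 0, 2, 0]])
R = np.array([[0, 1, 2, 0],
              [1, 0, 3, 1]])
S = leaf_similarity(Q, R, np.ones(4))  # uniform weights: plain match fraction
```

With uniform weights this reduces to the fraction of trees in which the two rows land in the same leaf, e.g. `S[0, 0] == 1.0` because `Q[0]` and `R[0]` share every leaf.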

Weight Types

Following the suggestion in #11919 to reuse feature importance definitions:

  • "gain" (default): Sum of loss reduction across all splits in the tree. Trees that contribute more to model improvement are weighted higher.
  • "cover": Sum of hessian values across all splits. For regression (hessian=1), this equals sample count. For classification (hessian=p(1-p)), this emphasizes trees that process more uncertain samples.
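As a rough sketch of how the per-tree weights might be aggregated, the record layout below (dicts with `"tree"`, `"gain"`, `"cover"` keys) is an assumption loosely mirroring the columns of `trees_to_dataframe()`, not the PR's actual code:

```python
from collections import defaultdict

def tree_weights(splits, weight_type="gain"):
    """Aggregate per-split statistics into one weight per tree.

    splits: iterable of dicts with keys "tree", "gain", "cover",
            one entry per internal split node (leaves excluded).
    """
    if weight_type not in ("gain", "cover"):
        raise ValueError(f"unknown weight_type: {weight_type!r}")
    totals = defaultdict(float)
    for s in splits:
        # Sum the chosen statistic across all splits of each tree.
        totals[s["tree"]] += s[weight_type]
    # Return weights ordered by tree id.
    return [totals[t] for t in sorted(totals)]

splits = [
    {"tree": 0, "gain": 5.0, "cover": 10.0},
    {"tree": 0, "gain": 1.0, "cover": 4.0},
    {"tree": 1, "gain": 2.0, "cover": 8.0},
]
```

Here `tree_weights(splits)` gives `[6.0, 2.0]` (tree 0 dominates by gain), while `"cover"` gives `[14.0, 8.0]`.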

Implementation Notes

  • Pure Python using pred_leaf=True and trees_to_dataframe()
  • Row-by-row computation for memory efficiency with large datasets
  • Properties: self-similarity = 1.0, symmetric when data==reference, range [0,1]
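The row-by-row computation mentioned above might look like the following generator sketch, which keeps memory at one `(n, T)` boolean array per query row instead of the full `(m, n, T)` tensor (the function name and structure are mine, not the PR's):

```python
import numpy as np

def leaf_similarity_rows(query_leaves, ref_leaves, tree_weights):
    """Yield one (n,) similarity row at a time, using O(n * T) memory."""
    w = np.asarray(tree_weights, dtype=float)
    total = w.sum()
    for q in query_leaves:                       # q has shape (T,)
        # Compare one query row against all reference rows at once.
        match = (ref_leaves == q).astype(float)  # (n, T)
        yield match @ w / total

# Toy check: 2 query rows, 3 reference rows, 2 trees, uniform weights.
Q = np.array([[0, 1], [1, 0]])
R = np.array([[0, 1], [0, 0], [1, 0]])
S = np.stack(list(leaf_similarity_rows(Q, R, [1.0, 1.0])))
```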

Tests

Added tests/python/test_leaf_similarity.py with tests for:

  • Shape and range validation
  • Self-similarity (diagonal = 1.0)
  • Both weight types ("gain" and "cover") produce valid similarity matrices
  • weight_type defaults to "gain"
  • An invalid weight_type raises ValueError

@trivialfis (Member)

Thank you for opening the PR. I will see if I can run some experiments with it and avoid using trees_to_dataframe() (that method itself is quite difficult to extend).

@trivialfis (Member) commented Jan 19, 2026

note to myself:
todos:

  • Test multi-class/target.
  • Gradient boosting random forest.
  • Non-DMatrix inputs.
  • strict_shape.
