optimize CPU inference with Array-Based Tree Traversal #11519
Conversation
Thank you for the optimization on the inference. Please unmark the "draft" status and ping me when the PR is ready for testing.
Co-authored-by: Victoriya Fedotova <[email protected]>
…rdin/xgboost into dev/cpu/eytzinger_layout
Cosmetic changes.
The next possible step would be to convert the trees into array-based representation only once, and not to do it for each block of data.
Co-authored-by: Victoriya Fedotova <[email protected]>
Still trying to understand the code; I will give it a try later. In the meantime, could you please craft some specific unit tests for the new inference algorithm?
/*
 * We transform trees to the array layout for each block of data to avoid memory overheads.
 * This makes the array layout inefficient for block_size == 1.
 */
const bool use_array_tree_layout = block_size > 1;
What happens if this is a small online inference call? The input size could be a few samples per call.
The default (old) implementation will be used.
I added some unit tests.
I'm still trying to understand the code. In the meantime, let me do some refactoring this week and next to accommodate the new optimization. We need a better structure to handle all of these:
- Predict with scalar leaf.
- Predict with vector leaf.
- Array predict with scalar leaf.
- Array predict with vector leaf.
- Column split with scalar leaf.
I think I will split up the CPU predictor into multiple pieces.
src/predictor/array_tree_layout.h
Outdated
*/
std::array<bst_node_t, kNodesCount + 1> nidx_in_tree_;

static bool IsLeaf(const RegTree& tree, bst_node_t nidx) {
Is there a benefit of doing this C++ overloading rather than the simpler tree.IsLeaf? How much faster are we seeing?
I did the overload to handle both RegTree and MultiTargetTree cases. Is there a better option?
Use RegTree without extracting the multi-target tree when populating the buffer, and delegate the dispatching to RegTree::LeftChild(bst_node_t nidx) instead of using RegTree::Node::LeftChild. There's a check inside RegTree::LeftChild:
[[nodiscard]] bst_node_t LeftChild(bst_node_t nidx) const {
if (IsMultiTarget()) {
return this->p_mt_tree_->LeftChild(nidx);
}
return (*this)[nidx].LeftChild();
}
done
Co-authored-by: Jiaming Yuan <[email protected]>
I'm trying to clean up the CPU predictor. I will update this PR once it is finished.
I need to fix a perf regression caused by the new ordinal encoder.
This has been fixed. I will look deeper into this PR.
src/predictor/array_tree_layout.h
Outdated
Thank you for expanding the tree layout. In the future (when you can prioritize it), do you think it's possible to create and store the layout inside the
You can define a
Co-authored-by: Jiaming Yuan <[email protected]>
Do you think the memory overhead (about 1 KB per tree) is acceptable for storing the layout? If so, it would be the natural next optimization step.
I think this should be fine since the size of the layout is bounded by depth. The implementation here falls back to the original tree after a certain level is reached.
Can we merge the current implementation and postpone buffering of the layout?
We can. I will look into this PR.
Thank you for the excellent optimization!
I can understand the code (mostly), and it should be cleaner after merging into the RegTree. I will merge this PR once the CI is green.
This PR introduces an optimization for CPU inference. For each tree, the top N levels are transformed into a compact array-based layout. This allows a branchless node-indexing rule: idx = 2 * idx + int(val < split_cond). To minimize memory overhead, this transformation from the standard tree structure to the array layout is performed on the fly for each block of data being processed. Even with these additional calculations, the improved data locality of the cache-friendly array layout speeds up inference by up to ~2x (1.4x on average).
