include/tl2cgen/detail/data_matrix_impl.h (7 additions & 1 deletion)

@@ -30,15 +30,21 @@ class DenseDMatrix {
        missing_value_{missing_value},
        num_row_{num_row},
        num_col_{num_col} {}

  DenseDMatrix(
      void const* data, void const* missing_value, std::uint64_t num_row, std::uint64_t num_col)
      : missing_value_{*static_cast<ElementType const*>(missing_value)},
        num_row_{num_row},
        num_col_{num_col} {
    auto const* data_ptr = static_cast<ElementType const*>(data);
    std::uint64_t const num_elem = num_row * num_col;
-    data_ = std::vector<ElementType>{data_ptr, data_ptr + num_elem};
+    data_.resize(num_elem);
+#pragma omp parallel for
+    for (std::uint64_t i = 0; i < num_elem; ++i) {
Reviewer:

Do we have benchmarks showing the impact of this change at various scales? How often is this code likely to get called inside a block that is already parallelized? You mentioned this was a bottleneck; can you give an example of a workload where it is a bottleneck?

Author:

Sorry for the late reply. In our application we run predictions at large scale: we call tl2cgen.Predictor.predict about 17,000 times on NumPy ndarrays of 100,000,000 sample points by 61 features. For this reason we use tl2cgen to compile the model from sklearn, convert the array to a DMatrix, and then predict. However, we noticed that the DMatrix conversion accounted for about 60% of the computing time and was not done in parallel. Our machines have 96 CPUs, so this has a big impact. Parallelizing the filling of the std::vector for the DMatrix made the conversion time drop drastically (a standalone sketch of this pattern appears after the diff below). For comparison, see the following screenshot (100,000,000 sample points by 61 features input). An old version of TreeLite was used as the baseline (since it has similar performance and had the same bottleneck), while tl2cgen includes the parallel DMatrix fill.

[screenshot: runtime comparison, 100,000,000 sample points by 61 features]

+      data_[i] = data_ptr[i];
+    }
  }

  DenseDMatrix(
      void const*, std::uint32_t const*, std::uint64_t const*, std::uint64_t, std::uint64_t) {
    TL2CGEN_LOG(FATAL) << "Invalid set of arguments";
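For readers skimming the diff, here is a minimal standalone sketch of the pattern it introduces: pre-sizing a std::vector and filling it with an OpenMP-parallel loop. The helper name and signature are illustrative only, not part of tl2cgen, and OpenMP must be enabled at compile time (e.g. g++ -fopenmp). Note that the vector must be resize()d rather than reserve()d before the indexed writes, since reserve() only allocates capacity without creating elements.

#include <cstdint>
#include <vector>

// Illustrative helper (not tl2cgen API): copy a raw buffer into a
// std::vector with an OpenMP-parallel loop, as in the diff above.
template <typename ElementType>
std::vector<ElementType> ParallelCopy(ElementType const* src, std::uint64_t num_elem) {
  std::vector<ElementType> dst;
  dst.resize(num_elem);  // resize(), not reserve(): the writes below need live elements
#pragma omp parallel for
  for (std::uint64_t i = 0; i < num_elem; ++i) {
    dst[i] = src[i];
  }
  return dst;
}

A fill like this is typically memory-bandwidth-bound, so the speedup will be well below the core count, but it can still be substantial at the scales described in the thread above.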
python/tl2cgen/util.py (2 additions & 0 deletions)

@@ -17,6 +17,8 @@ def py_str(string):
"""Convert C string back to Python string"""
return string.decode("utf-8")

+def check_if_fast():
+    return True
Reviewer:

Not sure if there's missing code here or if this was supposed to be removed before the PR was made.

Author:

You are right, sorry. This commit was only included so we could check internally which version of tl2cgen we were running; it got pushed to the PR because it was in the same branch, but it should not be included.


def _open_and_validate_recipe(recipe_path: pathlib.Path) -> Dict[str, Any]:
"""Ensure that the build recipe contains necessary fields"""