Multi weight storage #1008

Superharz · 2025-05-26T11:31:05Z

This PR adds a multi weight storage type (called MultiWeight) to boost-histogram. It addresses #83 and is based on a prototype provided by @HDembinski in boostorg/histogram#211.

It allows one to create a histogram that can store multiple independent weights per bin (the number of weights per bin has to be specified when creating the histogram).

Example

Create and fill a 1 dim histogram that stores 3 weights per bin:

import boost_histogram as bh
import numpy as np
x = np.array([1, 2])
weights = np.array([
        [1, 2, 3],
        [4, 5, 6]
    ])
h = bh.Histogram(bh.axis.Regular(5, 0, 5), storage = bh.storage.MultiWeight(3))
h.fill(x, sample = weights)

h[1] = [1, 2, 3]
h[2] = [4, 5, 6]

Status of this PR

The MultiWeight histograms can be created in Python and they can be filled.
Pickle also works.

The buffer and view for the histograms have yet to be implemented.

Superharz · 2025-05-26T14:42:12Z

Some numbers comparing an 8*8 2D MultiWeight histogram with 20k weights vs 20k normal (single-weight) histograms:
Filling with 100k random values takes about 1 min 26 s when looping over all 20k single-weight histograms vs 1.71 s to fill the MultiWeight histogram with the same data.
All single-weight histograms together take up about 51 MB of RAM on my machine vs about 15.3 MB for the one MultiWeight histogram.

Superharz · 2025-05-26T16:00:20Z

h.values() does not work yet.
However, the important part (for me) is to fill the histograms, not to retrieve the data from them.
At the end, the data can be converted into a numpy array by a small loop over the histogram:

For an 8*8 2D histogram h with 20k weights:

a = np.zeros((8,8, 20000))
for i in range(8):
    for j in range(8):
        a[i,j] = np.array(h[i,j])

include/bh_python/fill.hpp

HDembinski · 2025-05-27T07:21:51Z

include/bh_python/multi_weight.hpp

+namespace histogram {
+
+template <class T>
+struct multi_weight_value : public boost::span<T> {


The naming is misleading. This inherits from boost::span, which is a view type, but the name X_value suggests it is a value type. A value type holds its data, while a view does not. So this is a multi_weight_span or multi_weight_view. Probably the latter.

I believe you need a multi_weight_value, a multi_weight_reference, and multi_weight_const_reference. You can make the multi_weight_reference derive from multi_weight_const_reference to avoid some duplication. The multi_weight_value should hold a copy of the data, while the references are views like your current 'multi_weight_value'.

You should try to make a multi_weight_base with common code for all of these, like the operators.
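
A minimal sketch of the split suggested above, assuming hand-rolled pointer-plus-size views in place of boost::span so that it builds with only the standard library. The three class names follow the comment; every member and operator shown here is illustrative and not the PR's code, and the shared multi_weight_base is left out for brevity.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch; names of members are assumptions, not the PR's code.

// Read-only view: points at weights owned by someone else.
template <class T>
class multi_weight_const_reference {
public:
    multi_weight_const_reference(const T* data, std::size_t size)
        : data_(data), size_(size) {}

    std::size_t size() const { return size_; }
    const T& operator[](std::size_t i) const { return data_[i]; }
    const T* begin() const { return data_; }
    const T* end() const { return data_ + size_; }

    bool operator==(const multi_weight_const_reference& other) const {
        return size_ == other.size_ && std::equal(begin(), end(), other.begin());
    }

private:
    const T* data_;
    std::size_t size_;
};

// Mutable view: inherits the read-only interface and adds mutation.
template <class T>
class multi_weight_reference : public multi_weight_const_reference<T> {
public:
    multi_weight_reference(T* data, std::size_t size)
        : multi_weight_const_reference<T>(data, size), mutable_data_(data) {}

    // Element-wise accumulation into the viewed memory.
    multi_weight_reference& operator+=(const multi_weight_const_reference<T>& rhs) {
        assert(rhs.size() == this->size());
        for (std::size_t i = 0; i < this->size(); ++i) mutable_data_[i] += rhs[i];
        return *this;
    }

private:
    T* mutable_data_;
};

// Value type: owns an independent copy of the weights.
template <class T>
class multi_weight_value {
public:
    explicit multi_weight_value(const multi_weight_const_reference<T>& ref)
        : data_(ref.begin(), ref.end()) {}

    std::size_t size() const { return data_.size(); }
    T& operator[](std::size_t i) { return data_[i]; }
    const T& operator[](std::size_t i) const { return data_[i]; }

    // A read-only view onto the owned copy.
    multi_weight_const_reference<T> cref() const {
        return {data_.data(), data_.size()};
    }

private:
    std::vector<T> data_;
};
```

Because the const view simply has no mutating members, read-only access is enforced by the interface itself; the value type stays unaffected by later writes through a reference.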

Okay, if I understand your suggestion right, my idea would be to have

multi_weight_value derived from a std::vector that adds assignment and sum operators

multi_weight_const_reference derived from a boost::span that does not define any operators to modify the data

multi_weight_reference derived from multi_weight_const_reference that adds the assignment and sum operators

Should I better use a std::valarray instead of std::vector?

Btw, should it not be enough to define
using const_reference = const reference (without the &)
because this would automatically prohibit the use of any function that is not defined const?

Therefore, it should be sufficient to rename multi_weight_value -> multi_weight_reference and implement a separate multi_weight_value class maybe based on std::vector

> Btw, should it not be enough to define
> using const_reference = const reference (without the &)
> because this would automatically prohibit the use of any function that is not defined const?

Maybe. Usually, I had a reason why I did things in a certain way, but you might be right here. Implementing const_reference without the mutable methods and then inherit a reference from const_reference is cleaner, because then the methods that shouldn't be called aren't there. That being said, I can see the appeal of implementing only one kind of reference.

So I read my own comment from 2022 again
boostorg/histogram#211 (comment)
and now I realize again why the multi_weight_value inherited from boost::span. That was not an oversight, this choice is explained there. Now I have to rethink this whole approach.

Maybe we have to implement a copy-on-write mechanism here to meet the requirements, I don't know.

I think that's why this got stuck, the design wasn't clear.

Thinking more about it, I think the approach we are developing is still fine. The storage holds all the weights for all cells in one large contiguous memory block. When you create a multi_weight_value, it has to return a copy, and that's fine, because the interfaces are generally designed to avoid copies, returning const_reference and reference where possible.
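
To illustrate the layout described above, here is a rough sketch of a storage that keeps nbins * nelem doubles in one contiguous buffer. The class multi_weight_storage_sketch and its members cell, cell_value and fill are hypothetical names made up for this example; raw pointers stand in for the reference types, and a std::vector stands in for the value type.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: one flat buffer holding nelem weights for each bin.
class multi_weight_storage_sketch {
public:
    multi_weight_storage_sketch(std::size_t nbins, std::size_t nelem)
        : nelem_(nelem), buffer_(nbins * nelem, 0.0) {}

    std::size_t nelem() const { return nelem_; }

    // Reference-like access: a pointer into the shared buffer, no copy.
    double* cell(std::size_t bin) { return buffer_.data() + bin * nelem_; }
    const double* cell(std::size_t bin) const {
        return buffer_.data() + bin * nelem_;
    }

    // Value-like access: an independent copy of one cell's weights.
    std::vector<double> cell_value(std::size_t bin) const {
        const double* first = cell(bin);
        return std::vector<double>(first, first + nelem_);
    }

    // Add one entry's weights to a bin in place.
    void fill(std::size_t bin, const std::vector<double>& weights) {
        assert(weights.size() == nelem_);
        double* c = cell(bin);
        for (std::size_t i = 0; i < nelem_; ++i) c[i] += weights[i];
    }

private:
    std::size_t nelem_;
    std::vector<double> buffer_;
};
```

Only cell_value allocates; filling and reading through cell touch the buffer directly, which is the case the interfaces are meant to favor.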

> Btw, should it not be enough to define
> using const_reference = const reference (without the &)
> because this would automatically prohibit the use of any function that is not defined const?
>
> Maybe. Usually, I had a reason why I did things in a certain way, but you might be right here. Implementing const_reference without the mutable methods and then inherit a reference from const_reference is cleaner, because then the methods that shouldn't be called aren't there. That being said, I can see the appeal of implementing only one kind of reference.

Something was bothering me about this and so I checked. And I was right, you cannot implement a const reference like that, look:

#include <boost/core/span.hpp>
#include <vector>
#include <cassert>

using mutable_span = boost::span<double>;
using const_span = const boost::span<double>; // not really const

const_span foo(const_span x) {
    x[1] = 0; // oopsie
    return x;
}

int main() {
    std::vector<double> x = {1, 2, 3};
    mutable_span y = foo(x);
    y[0] = 0;
    assert(x[0] == 0); // oopsie
    assert(x[1] == 0); // oopsie
}

The const modifier on a view held by value (no &) doesn't prevent you from mutating the data it points to: span's operator[] is itself a const member function that returns a non-const reference to the element. But even if it did, there is no way in C++ to prevent initializing a mutable_span from a const_span, because you can also do:

const double x = 3; double y = x;

C++ implicitly assumes that all types which are not references (no &) are value types, meaning assignment creates a copy.

HDembinski · 2025-05-27T07:22:56Z

include/bh_python/multi_weight.hpp

+        return std::equal(this->begin(), this->end(), values.begin());
+    }
+    bool operator!=(const boost::span<T> values) const { return !operator==(values); }
+    void operator+=(const std::vector<T> values) {


Why is this not accepting a span as well?

I have tried to address this in b2cffcf.
However, I cannot template += to accept a class S, because this results in a huge compiler error message which says:

/cvmfs/sft.cern.ch/lcg/views/LCG_107a/x86_64-el9-gcc14-opt/include/boost/histogram/detail/accumulator_traits.hpp:81:37: error: call of overloaded 'accumulator_traits_impl(boost::histogram::multi_weight_value<double>&, boost::histogram::detail::priority<2>)' is ambiguous
   81 |   decltype(accumulator_traits_impl(std::declval<T&>(), priority<2>{}));

Instead, I have kept the overload of operator+=, but the vector version now calls the span version to avoid duplicating code.

However, this might be solved by switching to the split into multi_weight_value and multi_weight_reference as proposed in #1008 (comment).

You should get it to work with boost::span, because then it works with any contiguous memory container without a copy. Also it should be const std::vector<T>& values instead of const std::vector<T> values, the latter is passing by value so a copy is created even when the vector type matches exactly.
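
The cost of passing by value can be made visible with a small sketch. The element type counted and the three sum_* functions are hypothetical helpers written for this example; a pointer plus length stands in for boost::span so that only the standard library is needed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical element type that counts how often it is copy-constructed.
struct counted {
    // Function-local static keeps the counter self-contained.
    static int& copies() {
        static int n = 0;
        return n;
    }

    double v = 0.0;

    counted() = default;
    counted(const counted& other) : v(other.v) { ++copies(); }
    counted& operator=(const counted&) = default;
};

// Pass by value: the whole vector is copied on every call with an lvalue.
double sum_by_value(const std::vector<counted> values) {
    double s = 0.0;
    for (const counted& c : values) s += c.v;
    return s;
}

// Pass by const reference: no copy when the argument is already a vector.
double sum_by_ref(const std::vector<counted>& values) {
    double s = 0.0;
    for (const counted& c : values) s += c.v;
    return s;
}

// View-style parameter (stand-in for a span): works for any contiguous
// memory, such as a vector, an array, or a slice of a larger buffer.
double sum_view(const counted* data, std::size_t n) {
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i) s += data[i].v;
    return s;
}
```

Calling sum_by_value with a three-element vector bumps the counter by three, while the other two leave it unchanged.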

include/bh_python/multi_weight.hpp

HDembinski · 2025-05-27T08:06:04Z

Wow, that's quite an impressive patch. I really appreciate the benchmarks, which nicely confirm the expected benefits. I hope you have more time to implement the changes.

I suggest we work on the implementation here and backport it to boostorg/histogram later. In the end, both libraries should be in sync.

Superharz · 2025-05-27T08:10:43Z

Hi @HDembinski .
Thank you for reviewing this. How would you like me to address your comments? Should I do one commit per comment or fix everything and do one commit then?

HDembinski · 2025-05-27T08:13:56Z

Within a PR you don't need to do clean commits, I won't look at the commit history. Feel free to fix multiple issues in one commit.

HDembinski · 2025-05-27T08:45:39Z

By the way, once this feature is done, I suggest you present this at the next PyHEP https://indico.cern.ch/event/1515852/ or the one next year, depending on how quickly we get this done. It is a major feature, and you deserve recognition for implementing this. I left science, so I don't go to any of these workshops anymore.

Superharz · 2025-05-27T15:30:00Z

Hi @HDembinski,
I addressed the first batch of your comments.
For the remaining ones I need some input from your side on how to continue :)

HDembinski · 2025-06-22T12:42:10Z

h.fill(x, sample = weights)

I didn't notice before, but you use sample to pass the weights, that's breaking interface assumptions. Weights should be passed with the weight keyword. I see that I did that, too, in my demo. I guess it was the easiest hack to make it work, but that's breaking all kinds of interface contracts.

Superharz · 2025-06-22T12:44:44Z

> h.fill(x, sample = weights)
>
> I didn't notice before, but you use sample to pass the weights, that's breaking interface assumptions. Weights should be passed with the weight keyword.

I completely agree.
This was taken over from your original prototype and it worked, therefore I kept it.

HDembinski · 2025-06-22T12:54:17Z

However, it is kinda nice that it basically works with sample, so maybe it is enough to rename the storage to multi_count, to avoid having "weight" in the name, which gives the wrong idea.

HDembinski · 2025-06-22T12:55:26Z

@Superharz I am considering moving development of this feature back to boostorg/histogram. I find it easier to develop C++ code there. I will merge your changes here to that branch.

…erence

Signed-off-by: Henry Schreiner <[email protected]>

henryiii · 2025-07-31T18:43:41Z

Upstream: boostorg/histogram#411. I assume we can work on this, and move to upstream whenever that goes in and gets released. I've fixed up the clang-tidy recommendations, and started adding some of the missing things. I added .nelem, and added it to the repr. It also needs to support .view, .values, etc.

This is far enough from being ready that I'm not going to try to get it in for the next release, but it can probably be in the following one.

github-actions bot added the needs changelog Might need a changelog entry label May 26, 2025

Superharz mentioned this pull request May 26, 2025

Support for multiple weight variations #83

Open