Skip to content

Multi weight storage #1008

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 14 commits into
base: develop
Choose a base branch
from

Conversation

Superharz
Copy link

@Superharz Superharz commented May 26, 2025

This PR adds a multi weight storage type (called MultiWeight) to boost-histogram. It addresses #83 and is bases on a prototype provided by @HDembinski in boostorg/histogram#211.

It allows one to create a histogram that can store multiple independent weights per bin (the number of weights per bin has to be specified when creating the histogram).

Example

Create and fill a 1 dim histogram that stores 3 weights per bin:

import boost_histogram as bh
import numpy as np
x = np.array([1, 2])
weights = np.array([
        [1, 2, 3],
        [4, 5, 6]
    ])
h = bh.Histogram(bh.axis.Regular(5, 0, 5), storage = bh.storage.MultiWeight(3))
h.fill(x, sample = weights)

h[1] = [1, 2, 3]
h[2] = [4, 5, 6]

Status of this PR

The MultiWeight histograms can be created in python and they can be filled.
Pickle also works.

The buffer and view for the histograms has yet to be implemented.

@github-actions github-actions bot added the needs changelog Might need a changelog entry label May 26, 2025
@Superharz
Copy link
Author

Some numbers comparing a 8*8 2D MultiWeight histogram with 20k weights vs 20k normal (single weights) histograms:
Filling both with 100k random values takes about 1min 26s for a loop over all 20k single weights histograms vs 1.71s to fill the MultiWeight histogram with the same data.
All single weight histograms together take up about 51 MB of RAM on my machine vs the one MultiWeight histogram only taking up about 15.3 MB

@Superharz
Copy link
Author

h.values() does not work yet.
However, the important part (for me) is to fill the histograms not to retrieve the data from them.
At the end, the data can be converted into a numpy array by a small loop over the histogram:

For a 8*8 2D histogram h with 20k weights:

a = np.zeros((8,8, 20000))
for i in range(8):
    for j in range(8):
        a[i,j] = np.array(h[i,j])

namespace histogram {

template <class T>
struct multi_weight_value : public boost::span<T> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The naming is misleading. This inherits from boost::span, which is a view type, but the name X_value suggest it is a value type. A value type holds its data, while a view does not. So this is a multi_weight_span or multi_weight_view. Probably the latter.

Copy link
Member

@HDembinski HDembinski May 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe you need a multi_weight_value, a multi_weight_reference, and multi_weight_const_reference. You can make the multi_weight_reference derive from multi_weight_const_reference to avoid some duplication. The multi_weight_value should hold a copy of the data, while the references are views like your current 'multi_weight_value'.

Copy link
Member

@HDembinski HDembinski May 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should try to make a multi_weight_base with common code for all of these, like the operators.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, if I understand your suggestion right, my idea would be to have

  1. multi_weight_value derived from a std::vector that adds assignment and sum operators
  2. multi_weight_const_reference derived from a boost::span that does not define any operators to modify the data
  3. multi_weight_reference derived from multi_weight_const_reference that adds the assignment and sum operators

Should I better use a std::valarray instead of std::vector?

Copy link
Author

@Superharz Superharz May 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, should it not be enough to define
using const_reference = const reference (without the &)
because this would automatically prohibit the use of any function that is not defined const?

Therefore, it should be sufficient to rename multi_weight_value -> multi_weight_reference and implement a separate multi_weight_value class maybe based on std::vector

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, should it not be enough to define
using const_reference = const reference (without the &)
because this would automatically prohibit the use of any function that is not defined const?

Maybe. Usually, I had a reason why I did things in a certain way, but you might be right here. Implementing const_reference without the mutable methods and then inherit a reference from const_reference is cleaner, because then the methods that shouldn't be called aren't there. That being said, I can see the appeal of implementing only one kind of reference.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I read my own comment vom 2022 again
boostorg/histogram#211 (comment)
and now I realize again why the multi_weight_value inherited from boost::span. That was not an oversight, this choice is explained there. Now I have to rethink this whole approach.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we have to implement a copy-on-write mechanism here to meet the requirements, I don't know.

I think that's why this got stuck, the design wasn't clear.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking more about it, I think the approach we are developing is still fine. The storage holds all the weights for all cells in one large contiguous memory block. When you create a multi_weight_value, it has to return a copy, and that's fine, because the interfaces are generally designed to avoid copies, returning const_reference and reference where possible.

Copy link
Member

@HDembinski HDembinski Jun 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, should it not be enough to define
using const_reference = const reference (without the &)
because this would automatically prohibit the use of any function that is not defined const?

Maybe. Usually, I had a reason why I did things in a certain way, but you might be right here. Implementing const_reference without the mutable methods and then inherit a reference from const_reference is cleaner, because then the methods that shouldn't be called aren't there. That being said, I can see the appeal of implementing only one kind of reference.

Something was bothering me about this and so I checked. And I was right, you cannot implement a const reference like that, look:

#include <boost/core/span.hpp>
#include <vector>
#include <cassert>

using mutable_span = boost::span<double>;
using const_span = const boost::span<double>; // not really const

const_span foo(const_span x) {
    x[1] = 0; // oopsie
    return x;
}

int main()
{
  std::vector<double> x = {1, 2, 3};
  
  mutable_span y = foo(x);

  y[0] = 0;

  assert(x[0] == 0); // oopsie
  assert(x[1] == 0); // oopsie
}

The const modifier on a value type (no &) doesn't prevent you from calling methods which mutate its contents. But even if that were true, there is no way in C++ to prevent initializing a mutable_span from a const_span, because you can also do:

const double x = 3;
double y = x;

C++ implicitly assumes that all types which are not references (no &) are value types, meaning assignment creates a copy.

return std::equal(this->begin(), this->end(), values.begin());
}
bool operator!=(const boost::span<T> values) const { return !operator==(values); }
void operator+=(const std::vector<T> values) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this not accepting a span as well?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have tried to address this in b2cffcf.
However, I can not template += to accept a class S because this results in a huge compiler error message which says

/cvmfs/sft.cern.ch/lcg/views/LCG_107a/x86_64-el9-gcc14-opt/include/boost/histogram/detail/accumulator_traits.hpp:81:37: error: call of overloaded 'accumulator_traits_impl(boost::histogram::multi_weight_value<double>&, boost::histogram::detail::priority<2>)' is ambiguous
   81 |     decltype(accumulator_traits_impl(std::declval<T&>(), priority<2>{}));

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead, I have now kept the overload of the operator+= but the vector version now also calls the span version to not duplicate code

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, this might be solved by switching to the split into multi_weight_value and multi_weight_reference as proposed in #1008 (comment).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should get it to work with boost::span, because then it works with any contiguous memory container without a copy. Also it should be const std::vector<T>& values instead of const std::vector<T> values, the latter is passing by value so a copy is created even when the vector type matches exactly.

@HDembinski
Copy link
Member

Wow, that's quite an impressive patch. I really appreciate the benchmarks, which nicely confirm the expected benefits. I hope you have more time to implement the changes.

I suggest we work on the implementation here and backport it to boostorg/histogram later. In the end, both libraries should be in sync.

@Superharz
Copy link
Author

Hi @HDembinski .
Thank you for reviewing this. How would you like me to address your comments? Should I do one commit per comment or fix everything and do one commit then?

@HDembinski
Copy link
Member

HDembinski commented May 27, 2025

Within a PR you don't need to do clean commits, I won't look at the commit history. Feel free to fix multiple issues in one commit.

@HDembinski
Copy link
Member

HDembinski commented May 27, 2025

By the way, once this feature is done, I suggest you present this at the next PyHEP https://indico.cern.ch/event/1515852/ or the one next year, depending on how quickly we get this done. It is a major feature, and you deserve recognition for implementing this. I left science, so I don't go to any of these workshops anymore.

@Superharz
Copy link
Author

Hi @HDembinski,
I addressed the first batch of your comments.
For the remaining ones I need some input from your side on how to continue :)

@HDembinski
Copy link
Member

HDembinski commented Jun 22, 2025

h.fill(x, sample = weights)

I didn't notice before, but you use sample to pass the weights, that's breaking interface assumptions. Weights should be passed with the weight keyword. I see that I did that, too, in my demo. I guess it was the easiest hack to make it work, but that's breaking all kinds of interface contracts.

@Superharz
Copy link
Author

h.fill(x, sample = weights)

I didn't notice before, but you use sample to pass the weights, that's breaking interface assumptions. Weights should be passed with the weight keyword.

I completely agree.
This was taken over from your original prototype and it worked, therefore I kept it.

@HDembinski
Copy link
Member

However, it is kinda nice that it basically works with sample, so maybe it is enough to change the name of the storage to multi_count, so avoid having name "weight" in the name which gives the wrong idea.

@HDembinski
Copy link
Member

@Superharz I considering moving development of this feature back to boostorg/histogram. I find it easier to develop C++ code there. I will merge your changes here to that branch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs changelog Might need a changelog entry
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants