[EVAL] Big-Bench Extra Hard (BBEH)

## Evaluation short description
Google has released BBEH to compensate for the saturation of BBH in the latest generation of LLMs. Overall, it looks like a good benchmark for probing reasoning capabilities.

## Evaluation metadata
Provide all available
- Paper url: https://arxiv.org/pdf/2502.19187
- Github url: https://github.com/google-deepmind/bbeh
- Dataset url: