Skip to content

Conversation

@schmikei
Copy link
Contributor

@schmikei schmikei commented Nov 6, 2025

This PR modernizes the Apache HBase mixin to use commonlib and g.libsonnet to generate its dashboards. This will make future stylizations easier to accomplish :)

Feel free to use this PR to generate a Sample environment: grafana/integration-sample-apps#102
Cluster Overview:
image
image

RegionServer Overview:
image
image

Logs:
image

@schmikei schmikei marked this pull request as ready for review November 6, 2025 19:50
@schmikei schmikei requested a review from a team as a code owner November 6, 2025 19:50
Copy link
Member

@Dasomeone Dasomeone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One teensy comment from my end, otherwise looks fantastic!

Copy link

@aalhour aalhour left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a few requests for changes, below is a summary:

  • Do we want to add {%(filteringSelector)s} to the alerts below?
  • There are a few raw signals in regionserver and cluster files that we might want to change to counter type with a rangeFunction: 'rate' or convert the query to rate(...), or increase(...).
  • There is a counter metric (totalRequestRate) that might want a rangeFunction: 'rate' added to it
  • I noticed that the masterQueueSize metric has a weird unit ('decbytes') but it's a number (total items in a queue), shouldn't it be short?
  • There are a few duplicate signals between regionserver and cluster files with the same key, metric but slightly different queries (outlined below)

{
alert: 'HBaseDeadRegionServer',
expr: |||
server_num_dead_region_servers > %(alertsDeadRegionServer)s
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this use the filteringSelector?

{
alert: 'HBaseOldRegionsInTransition',
expr: |||
100 * assignment_manager_rit_count_over_threshold / clamp_min(assignment_manager_rit_count, 1) > %(alertsOldRegionsInTransition)s
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this use the filteringSelector?

{
alert: 'HBaseHighMasterAuthFailRate',
expr: |||
100 * rate(master_authentication_failures[5m]) / (clamp_min(rate(master_authentication_successes[5m]), 1) + clamp_min(rate(master_authentication_failures[5m]), 1)) > %(alertsHighMasterAuthFailRate)s
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this use the filteringSelector?

{
alert: 'HBaseHighRSAuthFailRate',
expr: |||
100 * rate(region_server_authentication_failures[5m]) / (clamp_min(rate(region_server_authentication_successes[5m]), 1) + clamp_min(rate(region_server_authentication_failures[5m]), 1)) > %(alertsHighRSAuthFailRate)s
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this use the filteringSelector?

unit: 'reqps',
sources: {
prometheus: {
expr: 'sum by(' + groupAggList + ') (master_authentication_successes{%(queriesSelector)s})',
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Example: This signal reads as "Rate of ..." but the query doesn't use the rate(...) function, which most likely will lead them to be displayed as as raw counter values instead of rates.

unit: 'reqps',
sources: {
prometheus: {
expr: 'sum by(' + groupAggList + ') (rate(region_server_authentication_successes{%(queriesSelector)s}[$__rate_interval]))',
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a raw metric that uses a rate function correctly but it's also a duplicate of the same signal and same metric in cluster.libsonnet file but that one doesn't use the rate function as this one. Can we remove it from the clusters signals?

unit: 'reqps',
sources: {
prometheus: {
expr: 'sum by(' + groupAggList + ') (region_server_authentication_successes{%(queriesSelector)s})',
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicate of the same signal with the same metric (but different query) in regionserver.libsonnet: https://github.com/grafana/jsonnet-libs/pull/1537/files#r2560283881

unit: 'reqps',
sources: {
prometheus: {
expr: 'sum by(' + groupAggList + ') (region_server_authentication_failures{%(queriesSelector)s})',
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another duplicate of the same signal in regionserver.libsonnet file.

unit: 'reqps',
sources: {
prometheus: {
expr: 'sum by(' + groupAggList + ') (rate(region_server_authentication_failures{%(queriesSelector)s}[$__rate_interval]))',
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a correct raw signal that uses a rate function but it's a duplicate of another one in cluster.libsonnet, can we remove the other one from the cluster signals file?

g.panel.table.standardOptions.withLinks([
{
title: '',
url: '/d/apachehbase-regionserver-overview?from=${__from}&to=${__to}&var-instance=${__data.fields["RegionServer"]}',
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to use the links from links.libsonnet?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants