[JMX Insight] Hadoop jmx metrics semconv alignment #14411
Two files were deleted in this PR (their contents are not shown).
@@ -0,0 +1,15 @@

# Hadoop Metrics

Here is the list of metrics based on MBeans exposed by Hadoop.

| Metric Name                     | Type          | Attributes                          | Description                                            |
|---------------------------------|---------------|-------------------------------------|--------------------------------------------------------|
| hadoop.dfs.capacity             | UpDownCounter | hadoop.node.name                    | Current raw capacity of data nodes.                    |
| hadoop.dfs.capacity.used        | UpDownCounter | hadoop.node.name                    | Current used capacity across all data nodes.           |
| hadoop.dfs.block.count          | UpDownCounter | hadoop.node.name                    | Current number of allocated blocks in the system.      |
| hadoop.dfs.block.missing        | UpDownCounter | hadoop.node.name                    | Current number of missing blocks.                      |
| hadoop.dfs.block.corrupt        | UpDownCounter | hadoop.node.name                    | Current number of blocks with corrupt replicas.        |
| hadoop.dfs.volume.failure.count | Counter       | hadoop.node.name                    | Total number of volume failures across all data nodes. |
| hadoop.dfs.file.count           | UpDownCounter | hadoop.node.name                    | Current number of files and directories.               |
| hadoop.dfs.connection.count     | UpDownCounter | hadoop.node.name                    | Current number of connections.                         |
| hadoop.dfs.data_node.count      | UpDownCounter | hadoop.node.name, hadoop.node.state | The number of data nodes.                              |

> **Review comment** (on `hadoop.dfs.capacity`): The capacity is in bytes, so maybe we should start adding a dedicated column for the units (then we can do the same for other metrics as a follow-up).

> **Review comment** (on `hadoop.dfs.capacity`): Also not sure if we should have a metric name that is a prefix of another, maybe …

> A conversation on `hadoop.dfs.volume.failure.count` was marked as resolved by SylvainJuge.
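A minimal sketch of what the units suggestion above could look like; the `Unit` values are taken from the hadoop.yaml rules in this PR, not from the original table:

| Metric Name                | Type          | Unit    | Attributes                          | Description                                        |
|----------------------------|---------------|---------|-------------------------------------|----------------------------------------------------|
| hadoop.dfs.capacity        | UpDownCounter | By      | hadoop.node.name                    | Current raw capacity of data nodes.                |
| hadoop.dfs.block.count     | UpDownCounter | {block} | hadoop.node.name                    | Current number of allocated blocks in the system.  |
| hadoop.dfs.data_node.count | UpDownCounter | {node}  | hadoop.node.name, hadoop.node.state | The number of data nodes.                          |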
@@ -0,0 +1,71 @@

---
rules:
  - bean: Hadoop:service=NameNode,name=FSNamesystem
    prefix: hadoop.dfs.
> **Review comment** (on `prefix: hadoop.dfs.`): do we need to always have the …
>
> **Reply:** As discussed today, while …
    metricAttribute:
      hadoop.node.name: beanattr(tag\.Hostname)
    mapping:
      # hadoop.dfs.capacity
      CapacityTotal:
        metric: capacity
        type: updowncounter
        unit: By
        desc: Current raw capacity of DataNodes.
      # hadoop.dfs.capacity.used
      CapacityUsed:
        metric: capacity.used
        type: updowncounter
        unit: By
        desc: Current used capacity across all DataNodes.
      # hadoop.dfs.block.count
      BlocksTotal:
        metric: block.count
        type: updowncounter
        unit: "{block}"
        desc: Current number of allocated blocks in the system.
      # hadoop.dfs.block.missing
      MissingBlocks:
        metric: block.missing
        type: updowncounter
        unit: "{block}"
        desc: Current number of missing blocks.
      # hadoop.dfs.block.corrupt
      CorruptBlocks:
        metric: block.corrupt
        type: updowncounter
        unit: "{block}"
        desc: Current number of blocks with corrupt replicas.
      # hadoop.dfs.volume.failure.count
      VolumeFailuresTotal:
        metric: volume.failure.count
> **Review comment** (on `volume.failure.count`): We could map … Also, I think this is safe to keep the …
        type: counter
        unit: "{failure}"
        desc: Total number of volume failures across all DataNodes.
      # hadoop.dfs.file.count
      FilesTotal:
        metric: file.count
        type: updowncounter
        unit: "{file}"
        desc: Current number of files and directories.
      # hadoop.dfs.connection.count
      TotalLoad:
        metric: connection.count
> **Review comment** (on `TotalLoad` → `connection.count`): Here the … With … So if those metrics were added in the future, then having … However, if the …
        type: updowncounter
        unit: "{connection}"
        desc: Current number of connections.

      # hadoop.dfs.data_node.count
      NumLiveDataNodes:
        metric: &metric data_node.count
        type: &type updowncounter
        unit: &unit "{node}"
        desc: &desc The number of DataNodes.
        metricAttribute:
          hadoop.node.state: const(live)
      NumDeadDataNodes:
        metric: *metric
        type: *type
        unit: *unit
        desc: *desc
        metricAttribute:
          hadoop.node.state: const(dead)
> **Review comment** (on lines +58 to +71): I am not very familiar with Hadoop, but it seems that there are more than 2 states for the data nodes, as we can infer from the … For example, if a node would go into a state that is not mapped here, then it means that the total number of nodes in the cluster would go down by one, even if the node is still part of the cluster. With individual metrics per state, the consumer can expect that the metric does not represent the whole cluster node count.
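A minimal sketch of the per-state alternative the reviewer describes; the metric names `data_node.live.count` and `data_node.dead.count` are hypothetical and not part of this PR:

```yaml
      # Hypothetical alternative: one metric per node state, no hadoop.node.state attribute.
      NumLiveDataNodes:
        metric: data_node.live.count
        type: updowncounter
        unit: "{node}"
        desc: The number of live DataNodes.
      NumDeadDataNodes:
        metric: data_node.dead.count
        type: updowncounter
        unit: "{node}"
        desc: The number of dead DataNodes.
```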
@@ -0,0 +1,177 @@

/*
 * Copyright The OpenTelemetry Authors
 * SPDX-License-Identifier: Apache-2.0
 */

package io.opentelemetry.instrumentation.jmx.rules;

import static io.opentelemetry.instrumentation.jmx.rules.assertions.DataPointAttributes.attribute;
import static io.opentelemetry.instrumentation.jmx.rules.assertions.DataPointAttributes.attributeGroup;

import io.opentelemetry.instrumentation.jmx.rules.assertions.AttributeMatcher;
import java.io.IOException;
import java.net.URISyntaxException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.Duration;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import org.junit.jupiter.api.Test;
import org.testcontainers.containers.GenericContainer;
import org.testcontainers.containers.wait.strategy.Wait;
import org.testcontainers.images.builder.Transferable;

class HadoopTest extends TargetSystemTest {

  public static final String ENDPOINT_PLACEHOLDER = "<<ENDPOINT_PLACEHOLDER>>";

  @Test
  void testMetrics_Hadoop2x() throws URISyntaxException, IOException {
    List<String> yamlFiles = Collections.singletonList("hadoop.yaml");

    yamlFiles.forEach(this::validateYamlSyntax);

    // Hadoop startup script does not propagate env vars to launched hadoop daemons,
    // so all the env vars need to be embedded inside the hadoop-env.sh file
    GenericContainer<?> target =
        new GenericContainer<>("bmedora/hadoop:2.9-base")
            .withCopyToContainer(
                Transferable.of(readAndPreprocessEnvFile("hadoop2-env.sh")),
                "/hadoop/etc/hadoop/hadoop-env.sh")
            .withCreateContainerCmdModifier(cmd -> cmd.withHostName("test-host"))
            .withStartupTimeout(Duration.ofMinutes(3))
            .withExposedPorts(50070)
            .waitingFor(Wait.forListeningPorts(50070));

    copyAgentToTarget(target);
    copyYamlFilesToTarget(target, yamlFiles);

    startTarget(target);

    verifyMetrics(createMetricsVerifier());
  }

  private String readAndPreprocessEnvFile(String fileName) throws URISyntaxException, IOException {
    Path path = Paths.get(getClass().getClassLoader().getResource(fileName).toURI());

    String data;
    try (Stream<String> lines = Files.lines(path)) {
      data =
          lines
              .map(line -> line.replace(ENDPOINT_PLACEHOLDER, otlpEndpoint))
              .collect(Collectors.joining("\n"));
    }

    return data;
  }

  @Test
  void testMetrics_Hadoop3x() throws URISyntaxException, IOException {
    List<String> yamlFiles = Collections.singletonList("hadoop.yaml");

    yamlFiles.forEach(this::validateYamlSyntax);

    // Hadoop startup script does not propagate env vars to launched hadoop daemons,
    // so all the env vars need to be embedded inside the hadoop-env.sh file
    GenericContainer<?> target =
        new GenericContainer<>("loum/hadoop-pseudo:3.3.6")
            .withExposedPorts(9870, 9000)
            .withCopyToContainer(
                Transferable.of(readAndPreprocessEnvFile("hadoop3-env.sh")),
                "/opt/hadoop/etc/hadoop/hadoop-env.sh")
            .withCreateContainerCmdModifier(cmd -> cmd.withHostName("test-host"))
            .waitingFor(
                Wait.forListeningPorts(9870, 9000).withStartupTimeout(Duration.ofMinutes(3)));
> **Review comment** (on lines +77 to +87): [minor] This is the only part that differs between Hadoop 2.x and 3.x; it could be worth refactoring this with a common method for the test body that delegates the container creation to a …
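A minimal sketch of that refactoring, assuming a hypothetical `runMetricsTest` helper (plus an import of `java.util.concurrent.Callable`); `Callable` is used rather than `Supplier` so the factory lambda can call `readAndPreprocessEnvFile`, which throws checked exceptions:

```java
// Hypothetical shared test body: each test supplies only its container setup.
private void runMetricsTest(Callable<GenericContainer<?>> containerFactory) throws Exception {
  List<String> yamlFiles = Collections.singletonList("hadoop.yaml");
  yamlFiles.forEach(this::validateYamlSyntax);

  GenericContainer<?> target = containerFactory.call();

  copyAgentToTarget(target);
  copyYamlFilesToTarget(target, yamlFiles);
  startTarget(target);

  verifyMetrics(createMetricsVerifier());
}

@Test
void testMetrics_Hadoop3x() throws Exception {
  runMetricsTest(
      () ->
          new GenericContainer<>("loum/hadoop-pseudo:3.3.6")
              .withExposedPorts(9870, 9000)
              .withCopyToContainer(
                  Transferable.of(readAndPreprocessEnvFile("hadoop3-env.sh")),
                  "/opt/hadoop/etc/hadoop/hadoop-env.sh")
              .withCreateContainerCmdModifier(cmd -> cmd.withHostName("test-host"))
              .waitingFor(
                  Wait.forListeningPorts(9870, 9000).withStartupTimeout(Duration.ofMinutes(3))));
}
```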
    copyAgentToTarget(target);
    copyYamlFilesToTarget(target, yamlFiles);

    startTarget(target);

    verifyMetrics(createMetricsVerifier());
  }

  private static MetricsVerifier createMetricsVerifier() {
    AttributeMatcher nodeNameAttribute = attribute("hadoop.node.name", "test-host");

    return MetricsVerifier.create()
        .disableStrictMode()
        .add(
            "hadoop.dfs.capacity",
            metric ->
                metric
                    .hasDescription("Current raw capacity of DataNodes.")
                    .hasUnit("By")
                    .isUpDownCounter()
                    .hasDataPointsWithOneAttribute(nodeNameAttribute))
        .add(
            "hadoop.dfs.capacity.used",
            metric ->
                metric
                    .hasDescription("Current used capacity across all DataNodes.")
                    .hasUnit("By")
                    .isUpDownCounter()
                    .hasDataPointsWithOneAttribute(nodeNameAttribute))
        .add(
            "hadoop.dfs.block.count",
            metric ->
                metric
                    .hasDescription("Current number of allocated blocks in the system.")
                    .hasUnit("{block}")
                    .isUpDownCounter()
                    .hasDataPointsWithOneAttribute(nodeNameAttribute))
        .add(
            "hadoop.dfs.block.missing",
            metric ->
                metric
                    .hasDescription("Current number of missing blocks.")
                    .hasUnit("{block}")
                    .isUpDownCounter()
                    .hasDataPointsWithOneAttribute(nodeNameAttribute))
        .add(
            "hadoop.dfs.block.corrupt",
            metric ->
                metric
                    .hasDescription("Current number of blocks with corrupt replicas.")
                    .hasUnit("{block}")
                    .isUpDownCounter()
                    .hasDataPointsWithOneAttribute(nodeNameAttribute))
        .add(
            "hadoop.dfs.volume.failure.count",
            metric ->
                metric
                    .hasDescription("Total number of volume failures across all DataNodes.")
                    .hasUnit("{failure}")
                    .isCounter()
                    .hasDataPointsWithOneAttribute(nodeNameAttribute))
        .add(
            "hadoop.dfs.file.count",
            metric ->
                metric
                    .hasDescription("Current number of files and directories.")
                    .hasUnit("{file}")
                    .isUpDownCounter()
                    .hasDataPointsWithOneAttribute(nodeNameAttribute))
        .add(
            "hadoop.dfs.connection.count",
            metric ->
                metric
                    .hasDescription("Current number of connections.")
                    .hasUnit("{connection}")
                    .isUpDownCounter()
                    .hasDataPointsWithOneAttribute(nodeNameAttribute))
        .add(
            "hadoop.dfs.data_node.count",
            metric ->
                metric
                    .hasDescription("The number of DataNodes.")
                    .hasUnit("{node}")
                    .isUpDownCounter()
                    .hasDataPointsWithAttributes(
                        attributeGroup(attribute("hadoop.node.state", "live"), nodeNameAttribute),
                        attributeGroup(attribute("hadoop.node.state", "dead"), nodeNameAttribute)));
  }
}
@@ -63,7 +63,7 @@ public class TargetSystemTest {
   private static OtlpGrpcServer otlpServer;
   private static Path agentPath;
   private static Path testAppPath;
-  private static String otlpEndpoint;
+  protected static String otlpEndpoint;

   private GenericContainer<?> targetSystem;
   private Collection<GenericContainer<?>> targetDependencies;
> **Review comment:** [minor] I would suggest using a protected getter to expose this as read-only to subclasses, as the field is not final.
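A minimal sketch of that suggestion, with a hypothetical accessor name; `HadoopTest.readAndPreprocessEnvFile` would then call `getOtlpEndpoint()` instead of reading the field directly:

```java
// Keep the field private and non-final; expose it read-only to subclasses.
private static String otlpEndpoint;

protected static String getOtlpEndpoint() {
  return otlpEndpoint;
}
```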
@@ -150,7 +150,7 @@ protected static Map<String, String> otelConfigProperties(List<String> yamlFiles
     // disable runtime telemetry metrics
     config.put("otel.instrumentation.runtime-telemetry.enabled", "false");
     // set yaml config files to test
-    config.put("otel.jmx.target", "tomcat");
+    config.put("otel.jmx.target", "hadoop");
     config.put(
         "otel.jmx.config",
         yamlFiles.stream()
> **Review comment:** I think this line should probably be removed, as we test with only the explicit list of yaml files here.
> **PR comment** [for reviewer]: The naming prefix of the metrics has been changed to use a metric context (`dfs`). Contexts are described in the official docs: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/Metrics.html