@HussainRaza28 commented Jul 17, 2025

Description of the issue

Kubernetes deployments of the CloudWatch Agent currently expose no health check endpoint. Without one, Kubernetes cannot determine whether the agent is functioning correctly, which can lead to silent failures and reduced visibility in the EKS console. This PR adds a health check extension so that Kubernetes liveness and readiness probes can query the agent.

Description of changes

  • Added healthcheck.go translator to support the OpenTelemetry health check extension
  • Configured default health check endpoint at 0.0.0.0:13133 with root path
  • Registered the health check extension in the CloudWatch Agent configuration
  • Ensured compatibility with Kubernetes probe configurations
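
To make the probe behavior concrete, here is a minimal sketch using only the Go standard library. It is not the extension's implementation: `newHealthHandler`, `probeSucceeds`, `serve`, and the JSON body are hypothetical names and values chosen for illustration. Only the endpoint `0.0.0.0:13133` and the root path come from the description above. The success rule (any status from 200 through 399) is the one Kubernetes applies to `httpGet` probes.

```go
package main

import (
	"encoding/json"
	"net/http"
	"net/http/httptest"
)

// Defaults named in the PR description.
const (
	healthEndpoint = "0.0.0.0:13133"
	healthPath     = "/"
)

// newHealthHandler is a hypothetical stand-in for the extension's handler.
// It answers 200 with a small JSON body (the body shape is an assumption).
func newHealthHandler() http.Handler {
	mux := http.NewServeMux()
	mux.HandleFunc(healthPath, func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(http.StatusOK)
		_ = json.NewEncoder(w).Encode(map[string]string{"status": "ok"})
	})
	return mux
}

// probeSucceeds issues an in-memory GET against the handler and applies the
// Kubernetes httpGet rule: status codes 200 through 399 count as success.
func probeSucceeds(h http.Handler, path string) bool {
	req := httptest.NewRequest(http.MethodGet, path, nil)
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code >= 200 && rec.Code < 400
}

// serve binds the handler to the default endpoint; it blocks until the
// server stops and returns the resulting error.
func serve() error {
	return http.ListenAndServe(healthEndpoint, newHealthHandler())
}
```

Calling `probeSucceeds(newHealthHandler(), "/")` exercises the handler without opening a socket, which is usually enough for a unit test; `serve` is only needed when checking the endpoint from outside the process.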

License

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Tests

Verified health check extension unit tests pass successfully:
image13

Confirmed CloudWatch Agent is running with health probes properly configured:
image

Validated health endpoint is accessible and returns proper status information:
image

@HussainRaza28 force-pushed the health_check_extension branch 3 times, most recently from f6c81f5 to 3870f19, July 23, 2025 19:55
@HussainRaza28 changed the base branch from main to feature-health-observability-addon July 23, 2025 20:45
@HussainRaza28 force-pushed the health_check_extension branch from 291ac88 to e0b110d, July 24, 2025 01:44
@HussainRaza28 force-pushed the health_check_extension branch from e0b110d to 36c8164, July 24, 2025 02:08
@HussainRaza28 force-pushed the health_check_extension branch from 92acaab to f418b61, July 24, 2025 19:52
@HussainRaza28 marked this pull request as ready for review July 25, 2025 13:17
@HussainRaza28 requested a review from a team as a code owner July 25, 2025 13:17
@HussainRaza28 self-assigned this Jul 25, 2025

Contributor

We should follow the existing naming convention. Put this under a folder called healthcheck, and name this file and its test file translator.go and translator_test.go respectively.

Author

Thanks, I moved the files to a healthcheck folder and renamed them to translator.go and translator_test.go to follow proper naming conventions.

@HussainRaza28 force-pushed the health_check_extension branch from 62a1111 to ab2ffec, July 27, 2025 02:15
@HussainRaza28 changed the base branch from feature-health-observability-addon to feature-health-observability-addon-update July 27, 2025 02:16

    pipelines.Translators.Extensions.Set(server.NewTranslator())
}

pipelines.Translators.Extensions.Set(healthcheckextension.NewHealthCheckTranslator())

Contributor

Wondering if there is a better way than enabling this extension for every cloudwatch-agent customer. Could this be made into a feature flag? I am hesitant to enable a health extension for all supported environments when it would primarily be used only for EKS.

Contributor

At the very least, it should only be enabled in Kubernetes environments. But I am fine with it being the default in this feature branch for now.

require.NoError(t, yaml.Unmarshal([]byte(yamlStr), &actual))

// Remove health_check/health_check extension from both expected and actual for comparison
// This allows tests to pass when the health check extension is dynamically added

Contributor

I don't think this is the correct way to fix the translator tests. What happens if a unit test actually has the health extension enabled?

Author

Yeah, that's right. I fixed it to properly add the health check extension to expected results in Kubernetes mode instead of removing it, so the tests work correctly whether the extension is enabled or not.

    pipelines.Translators.Extensions.Set(entitystore.NewTranslator())
}
if context.CurrentContext().KubernetesMode() != "" {
    pipelines.Translators.Extensions.Set(server.NewTranslator())

Contributor

I would add pipelines.Translators.Extensions.Set(healthcheckextension.NewHealthCheckTranslator()) under here since we only need the healthcheck extension for Kubernetes environments, not just when the agent is in a container.

Author

That makes sense. I moved it under the Kubernetes mode check since we only need the health check extension for Kubernetes environments.
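
The gating agreed on here can be sketched in isolation. This is an illustrative model, not the agent's code: `registry`, `registerExtensions`, and the string names are hypothetical stand-ins for `pipelines.Translators.Extensions` and the translator constructors, and the mode value `"EKS"` used in the note below is a placeholder. The only behavior taken from the thread is that the health check extension is registered inside the same non-empty Kubernetes mode check as the server extension.

```go
package main

// registry is a hypothetical stand-in for the agent's extension translator set.
type registry struct {
	names []string
}

// Set records an extension by name, preserving registration order.
func (r *registry) Set(name string) {
	r.names = append(r.names, name)
}

// registerExtensions adds the Kubernetes-only extensions when a Kubernetes
// mode is set. An empty mode means the agent is not running under Kubernetes,
// so neither extension is registered.
func registerExtensions(r *registry, kubernetesMode string) {
	if kubernetesMode != "" {
		r.Set("server")
		r.Set("health_check")
	}
}
```

With an empty mode the registry stays empty; with a placeholder mode such as `"EKS"` it ends up holding `server` followed by `health_check`. Keeping both registrations under one condition means a non-Kubernetes container deployment never opens the health port.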


type healthCheckTranslator struct {
    name string
    mux  sync.RWMutex

Contributor

nit - Do we need this RWMutex?

Author

I removed the RWMutex and similar unused components since they weren't needed.
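
As a rough illustration of why the lock can go, here is a simplified, hypothetical version of the struct. It assumes the translator's fields are set once at construction and only read afterwards; under that assumption there are no concurrent writes to guard. The constructor name, the `ID` method, and the `type/name` ID format are illustrative assumptions, not the merged code.

```go
package main

// extensionType is the component type used in this sketch.
const extensionType = "health_check"

// healthCheckTranslator holds only immutable state, so it needs no mutex.
type healthCheckTranslator struct {
	name string
}

// newHealthCheckTranslator sets the name once; nothing mutates it later.
func newHealthCheckTranslator(name string) *healthCheckTranslator {
	return &healthCheckTranslator{name: name}
}

// ID returns "health_check" for an unnamed instance and
// "health_check/<name>" otherwise. It only reads t.name.
func (t *healthCheckTranslator) ID() string {
	if t.name == "" {
		return extensionType
	}
	return extensionType + "/" + t.name
}
```

If a later change adds a field that is written after construction, the question of synchronization would need to be revisited.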

@HussainRaza28 force-pushed the health_check_extension branch from 9cc95d5 to 77a9c61, July 30, 2025 14:25

yamlStr := toyamlconfig.ToYamlConfig(yamlConfig)
require.NoError(t, yaml.Unmarshal([]byte(yamlStr), &actual))

// Add health check extension to expected results for Kubernetes environments

Contributor

I am not sure I agree with this approach of adding the health extension for the sake of unit tests. I would have updated all the expected YAML for Kubernetes to expect the health extension.

That said, I am open to disagreement with my proposal, since this is primarily about fixing unit tests.

Author

I initially thought I wasn't supposed to edit the YAML files, but I agree with the proposal and have updated the expected Kubernetes YAML to include the health extension.
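
The difference between the test approaches discussed in this thread can be shown with a small stand-alone sketch. It models parsed YAML as nested maps and uses `reflect.DeepEqual` for comparison; `config`, `stripExtension`, and `equalConfigs` are hypothetical helpers, and the mismatched endpoint is made-up data chosen to show the failure mode. Stripping the extension from both sides before comparing makes the test pass even when the generated extension config is wrong, whereas listing it in the expected output catches the mismatch.

```go
package main

import "reflect"

// config models a parsed YAML document as nested maps.
type config map[string]interface{}

// stripExtension removes one extension from a config in place. This mirrors
// the first approach in the thread, which hides the extension from the test.
func stripExtension(c config, id string) {
	if exts, ok := c["extensions"].(map[string]interface{}); ok {
		delete(exts, id)
	}
}

// equalConfigs compares two parsed configs structurally.
func equalConfigs(expected, actual config) bool {
	return reflect.DeepEqual(expected, actual)
}
```

Given an expected config with endpoint `0.0.0.0:13133` and an actual config with a different endpoint, `equalConfigs` reports a mismatch; after calling `stripExtension` on both, the same comparison reports equality, so the regression would go unnoticed.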

@HussainRaza28 merged commit 14ef2e8 into feature-health-observability-addon-update Aug 6, 2025
23 of 25 checks passed
@HussainRaza28 deleted the health_check_extension branch August 6, 2025 19:22
3 participants