-
Notifications
You must be signed in to change notification settings - Fork 237
Add health check extension for liveness & readiness probes #1779
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add health check extension for liveness & readiness probes #1779
Conversation
f6c81f5
to
3870f19
Compare
291ac88
to
e0b110d
Compare
e0b110d
to
36c8164
Compare
92acaab
to
f418b61
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should maintain proper naming convention. Put this under a folder called healthcheck
and make this and the test file called translator.go
and translator_test.go
respectively.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I moved the files to a healthcheck
folder and renamed them to translator.go
and translator_test.go
to follow proper naming conventions.
62a1111
to
ab2ffec
Compare
pipelines.Translators.Extensions.Set(server.NewTranslator()) | ||
} | ||
|
||
pipelines.Translators.Extensions.Set(healthcheckextension.NewHealthCheckTranslator()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wondering if there is a better way to enable this extension for every cloudwatch-agen customer. Could this be made into a feature flag ? I am hesitant to enable a health extension for all supported environments when primarily it would be only be used for EKS
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At the very least, it should only be enabled under Kubernetes environments. But I am fine with it being default in this feature branch for now..
require.NoError(t, yaml.Unmarshal([]byte(yamlStr), &actual)) | ||
|
||
// Remove health_check/health_check extension from both expected and actual for comparison | ||
// This allows tests to pass when the health check extension is dynamically added |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dont think this is the correct way to fix the translator tests. What happens if a unit test actually has the health extension enabled ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that's right. I fixed it to properly add the health check extension to expected results in Kubernetes mode instead of removing it, so the tests work correctly whether the extension is enabled or not.
pipelines.Translators.Extensions.Set(entitystore.NewTranslator()) | ||
} | ||
if context.CurrentContext().KubernetesMode() != "" { | ||
pipelines.Translators.Extensions.Set(server.NewTranslator()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would add pipelines.Translators.Extensions.Set(healthcheckextension.NewHealthCheckTranslator())
under here since we only need the healthcheck extension for Kubernetes environments, not just when the agent is in a container.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense. I moved it under the Kubernetes mode check since we only need the health check extension for Kubernetes environments.
|
||
type healthCheckTranslator struct { | ||
name string | ||
mux sync.RWMutex |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit - Do we need this RWMutex?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed the RWMutex and similar unused components since they weren't needed.
9cc95d5
to
77a9c61
Compare
yamlStr := toyamlconfig.ToYamlConfig(yamlConfig) | ||
require.NoError(t, yaml.Unmarshal([]byte(yamlStr), &actual)) | ||
|
||
// Add health check extension to expected results for Kubernetes environments |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure I agree with this approach to add the health extension for the sake of unit tests. I would have updated all the expected yaml for kubernetes to expect the health extension.
Although I am open to disagreeing to my proposal since this is primarily for fixing unit tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I initially thought I wasn't supposed to edit the YAML files, but I agree with the proposal and have updated the expected Kubernetes YAML to include the health extension.
14ef2e8
into
feature-health-observability-addon-update
Description of the issue
Kubernetes deployments of the CloudWatch Agent currently lack proper health monitoring capabilities. Without health check endpoints, Kubernetes cannot accurately determine if the agent is functioning correctly, leading to potential silent failures and reduced visibility in EKS console. This PR adds a health check extension to enable Kubernetes liveness and readiness probes, improving operational transparency.
Description of changes
License
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
Tests
Verified health check extension unit tests pass successfully:

Confirmed CloudWatch Agent is running with health probes properly configured:

Validated health endpoint is accessible and returns proper status information:
