-
Notifications
You must be signed in to change notification settings - Fork 478
Change k8s namespace UI for debug tool #33386
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Change k8s namespace UI for debug tool #33386
Conversation
- Before we'd just loop through all the namespaces and find the necessary services for scraping via the first one. Issue with doing so is: a) It doesn't make it explicit that we need the user to target the k8s namespace of the materialize deployment b) If a user has multiple materialize instances in the same namespace, it's quite random which materialize instance we choose Thus we: - Separate CR namespace with other k8s namespaces via additional-k8s-namespace - Add a materialize instance argument, but by default we choose the last created Materialize instance
3e31de9
to
7c3c0f6
Compare
7c3c0f6
to
e021086
Compare
``` | ||
|
||
### Debug the `materialize` namespace | ||
### Debugging Kubernetes namespace `materialize` that does not contain Materialize instances |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe rephrase this to something like Debugging a Materialize instance with supporting infrastructure in other namespaces
?
- Option: "`--mz-instance-name <MZ_INSTANCE_NAME>`" | ||
Description: | | ||
<a name="mz-instance-name"></a> The Materialize instance to target. | ||
Defaults to the latest Materialize instance. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we just make this required? The concept of "latest" doesn't make much sense here. Does this mean last updated? Does it mean last created? What if the latest is marked for deletion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moreso last created but that's a good point. Mainly made it optional for backwards compatibility and less friction to get started. But it's probably better to be explicit.
let Some(name) = service.metadata.name.clone() else { | ||
return None; | ||
}; | ||
let Some(spec) = service.spec.clone() else { | ||
return None; | ||
}; | ||
let Some(selector) = spec.selector else { | ||
return None; | ||
}; | ||
let Some(ports) = spec.ports else { return None }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the change from the ?
operator?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah for some reason I thought the ?
operator short circuited the entire function rather than just the closure. My bad and good catch!
src/mz-debug/src/main.rs
Outdated
/// The name of the Materialize instance to target. By default, the tool will target the first | ||
/// Materialize instance in the namespace. | ||
#[clap(long)] | ||
mz_instance_name: Option<String>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets just make this required.
let service_ids_to_scrape: std::collections::BTreeSet<String> = pods | ||
.iter() | ||
.filter_map(|pod| { | ||
pod.metadata | ||
.labels | ||
.as_ref() | ||
.and_then(|labels| labels.get("environmentd.materialize.cloud/service-id")) | ||
.map(|s| s.to_string()) | ||
}) | ||
.collect(); | ||
|
||
let services: Api<Service> = Api::namespaced(client.clone(), k8s_namespace); | ||
let services = services | ||
.list(&ListParams::default()) | ||
.await | ||
.with_context(|| format!("Failed to list services in namespace {}", k8s_namespace))?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're going to be getting all the services anyway, we might as well filter them based on their owner reference to the Materialize CR instead. That will save us from collecting this list of pods.
In fact, we can do that for all objects created by orchestratord.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah so the reason why I did it this way was the ownerReference
for Cluster services only link to the environmentd statefulset while the environmentd ownerReference
points to the Materialize CR. e.g.
mzxco8zi0o48-cluster-u1-replica-u1-gen-1:
name: mzxco8zi0o48-cluster-u1-replica-u1-gen-1
namespace: materialize-environment
ownerReferences:
- apiVersion: apps/v1
blockOwnerDeletion: true
controller: true
kind: StatefulSet
name: mzxco8zi0o48-environmentd-1
uid: 7214f7a9-e6fa-45f8-b2ee-f3d78a2af55b
resourceVersion: "780"
uid: 587c6791-f0fe-49cf-acc6-31d1d59e5a72
mzxco8zi0o48-environmentd-1:
ownerReferences:
- apiVersion: materialize.cloud/v1alpha1
blockOwnerDeletion: true
kind: Materialize
name: 12345678-1234-1234-1234-123456789012
uid: 9cc32546-2421-4911-bc2c-234d842c3c67
Given the pods already have the materialize.cloud/organization-name
label that maps to the Materialize CR, thought it'd be less code to just filter the services on the set of cluster pods with that label value. If you think it's more correct to:
1. Get list of tuples: (cluster_service, cluster_service.owner.statefulset, cluster_service.owner.statefulset.materialize_cr)
2. Filter on the tuples where cluster_service.owner.statefulset.materialize_cr == mz_instance_name
I can make the change!
src/mz-debug/src/utils.rs
Outdated
let object_list = materialize_api | ||
.list(&ListParams::default()) | ||
.await | ||
.with_context(|| { | ||
format!( | ||
"Failed to get Materialize CR in namespace: {}", | ||
k8s_namespace | ||
) | ||
})?; | ||
|
||
let materialize_cr = object_list | ||
.items | ||
.into_iter() | ||
.find(|item| item.metadata.name.as_ref() == Some(mz_instance_name)) | ||
.with_context(|| { | ||
format!( | ||
"Could not find Materialize CR with name: {}", | ||
mz_instance_name | ||
) | ||
})?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just use a get? There can only be one with a specific namespace and name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call. Changed!
- Added examples for viewing Materialize instance names in Kubernetes. - Updated debugging commands to require the Materialize instance name.
.iter() | ||
.find_map(|service| match (&service.metadata.name, &service.spec) { | ||
(Some(service_name), Some(spec)) => { | ||
if !service_name.to_lowercase().contains("environmentd") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not something to fix in this PR, but I am realizing now that this could match both the generation service and the globally active one. It is unclear which we are looking for in this function, but I would guess we want the active one.
|
||
// // Filter by cluster services | ||
// if let Some(service_id) = selector.get("environmentd.materialize.cloud/service-id") { | ||
// if !service_ids_to_scrape.contains(service_id) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leftover comments?
pod.metadata | ||
.labels | ||
.as_ref() | ||
.and_then(|labels| labels.get("environmentd.materialize.cloud/service-id")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found a bug in our service labeling! We don't have any labels identifying which materialize instance a service should point to. This may cause services to go to the wrong materialize instance!
I just filed https://github.com/MaterializeInc/cloud/issues/11527
This also means you should probably change to getting the services with an owner reference pointing at environmentd.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh neat! And with that we can skip fetching the pods
Motivation
Before we'd just loop through all the namespaces and find the necessary services for scraping via the first one. Issue with doing so is:
a) It doesn't make it explicit that we need the user to target the k8s namespace of the materialize deployment
b) If a user has multiple materialize instances in the same namespace, it's quite random which materialize instance we choose
Thus we:
Followup to #33129 (comment)
Tips for reviewer
Checklist
$T ⇔ Proto$T
mapping (possibly in a backwards-incompatible way), then it is tagged with aT-proto
label.