Contributor

@SangJunBak SangJunBak commented Aug 20, 2025

Motivation

Before, we'd just loop through all the namespaces and find the necessary services for scraping via the first one. The issues with doing so are:
a) It doesn't make it explicit that the user needs to target the k8s namespace of the Materialize deployment
b) If a user has multiple Materialize instances in the same namespace, which instance we choose is effectively arbitrary

Thus we:

  • Separate the CR namespace from other k8s namespaces via additional-k8s-namespace
  • Add a materialize instance argument, but by default we choose the last created Materialize instance

Followup to #33129 (comment)

Tips for reviewer

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.

@SangJunBak SangJunBak force-pushed the debug-tool/single-k8s-namespace branch from 3e31de9 to 7c3c0f6 Compare August 20, 2025 23:32
@SangJunBak SangJunBak force-pushed the debug-tool/single-k8s-namespace branch from 7c3c0f6 to e021086 Compare August 20, 2025 23:36
@SangJunBak SangJunBak requested a review from jubrad August 20, 2025 23:37
@SangJunBak SangJunBak marked this pull request as ready for review August 20, 2025 23:37
@SangJunBak SangJunBak requested a review from a team as a code owner August 20, 2025 23:37

### Debug the `materialize` namespace
### Debugging Kubernetes namespace `materialize` that does not contain Materialize instances
Contributor

Maybe rephrase this to something like Debugging a Materialize instance with supporting infrastructure in other namespaces?

Comment on lines 11 to 16
- Option: "`--mz-instance-name <MZ_INSTANCE_NAME>`"
  Description: |
    <a name="mz-instance-name"></a> The Materialize instance to target.
    Defaults to the latest Materialize instance.
Contributor

Can we just make this required? The concept of "latest" doesn't make much sense here. Does this mean last updated? Does it mean last created? What if the latest is marked for deletion?

Contributor Author

More so last created, but that's a good point. I mainly made it optional for backwards compatibility and less friction when getting started. But it's probably better to be explicit.

Comment on lines 219 to 228
let Some(name) = service.metadata.name.clone() else {
    return None;
};
let Some(spec) = service.spec.clone() else {
    return None;
};
let Some(selector) = spec.selector else {
    return None;
};
let Some(ports) = spec.ports else { return None };
Contributor

Why the change from the ? operator?

Contributor Author

Ah for some reason I thought the ? operator short circuited the entire function rather than just the closure. My bad and good catch!
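
To illustrate the point about `?` inside a closure, here is a minimal sketch. The `Svc` struct and `named_ports` function are hypothetical stand-ins, not the types used in the PR; they only mirror the shape of the optional fields being unwrapped.

```rust
/// Hypothetical stand-in for the few optional Service fields being unwrapped.
struct Svc {
    name: Option<String>,
    ports: Option<Vec<u16>>,
}

/// Returns (name, ports) for every service that has both fields set.
fn named_ports(services: &[Svc]) -> Vec<(String, Vec<u16>)> {
    services
        .iter()
        .filter_map(|svc| {
            // `?` returns `None` from this closure only, so `filter_map`
            // skips the current item and continues with the next one.
            // It does not return from `named_ports`.
            let name = svc.name.clone()?;
            let ports = svc.ports.clone()?;
            Some((name, ports))
        })
        .collect()
}
```

Services with a missing name or missing ports are dropped, and the enclosing function still returns the remaining entries.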

/// The name of the Materialize instance to target. By default, the tool will target the first
/// Materialize instance in the namespace.
#[clap(long)]
mz_instance_name: Option<String>,
Contributor

Let's just make this required.

Comment on lines +199 to +214
let service_ids_to_scrape: std::collections::BTreeSet<String> = pods
    .iter()
    .filter_map(|pod| {
        pod.metadata
            .labels
            .as_ref()
            .and_then(|labels| labels.get("environmentd.materialize.cloud/service-id"))
            .map(|s| s.to_string())
    })
    .collect();

let services: Api<Service> = Api::namespaced(client.clone(), k8s_namespace);
let services = services
    .list(&ListParams::default())
    .await
    .with_context(|| format!("Failed to list services in namespace {}", k8s_namespace))?;
Contributor

If we're going to be getting all the services anyway, we might as well filter them based on their owner reference to the Materialize CR instead. That will save us from collecting this list of pods.

In fact, we can do that for all objects created by orchestratord.

Contributor Author

Ah, so the reason I did it this way is that the ownerReference for cluster services only links to the environmentd StatefulSet, while the environmentd ownerReference points to the Materialize CR. e.g.

mzxco8zi0o48-cluster-u1-replica-u1-gen-1:

  name: mzxco8zi0o48-cluster-u1-replica-u1-gen-1
  namespace: materialize-environment
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: StatefulSet
    name: mzxco8zi0o48-environmentd-1
    uid: 7214f7a9-e6fa-45f8-b2ee-f3d78a2af55b
  resourceVersion: "780"
  uid: 587c6791-f0fe-49cf-acc6-31d1d59e5a72

mzxco8zi0o48-environmentd-1:

ownerReferences:
  - apiVersion: materialize.cloud/v1alpha1
    blockOwnerDeletion: true
    kind: Materialize
    name: 12345678-1234-1234-1234-123456789012
    uid: 9cc32546-2421-4911-bc2c-234d842c3c67

Given the pods already have the materialize.cloud/organization-name label that maps to the Materialize CR, thought it'd be less code to just filter the services on the set of cluster pods with that label value. If you think it's more correct to:

1. Get list of tuples: (cluster_service, cluster_service.owner.statefulset, cluster_service.owner.statefulset.materialize_cr)
2. Filter on the tuples where cluster_service.owner.statefulset.materialize_cr == mz_instance_name

I can make the change!
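
The two-step filter above could be sketched roughly as follows. This is only an illustration under stated assumptions: `OwnerRef` and `Object` are hypothetical, simplified stand-ins for Kubernetes metadata, and owners are matched by kind and name for brevity, whereas a real implementation would more likely compare the owner reference `uid`.

```rust
use std::collections::BTreeSet;

/// Hypothetical, simplified owner reference (kind and name only).
struct OwnerRef {
    kind: String,
    name: String,
}

/// Hypothetical, simplified Kubernetes object: a name plus its owner references.
struct Object {
    name: String,
    owners: Vec<OwnerRef>,
}

fn is_owned_by(obj: &Object, kind: &str, name: &str) -> bool {
    obj.owners.iter().any(|o| o.kind == kind && o.name == name)
}

/// Names of the services owned by a StatefulSet that is itself owned by the
/// Materialize CR called `mz_instance_name`.
fn services_for_instance<'a>(
    services: &'a [Object],
    statefulsets: &[Object],
    mz_instance_name: &str,
) -> Vec<&'a str> {
    // Hop 1: StatefulSets whose owner is the target Materialize CR.
    let owned_statefulsets: BTreeSet<&str> = statefulsets
        .iter()
        .filter(|sts| is_owned_by(sts, "Materialize", mz_instance_name))
        .map(|sts| sts.name.as_str())
        .collect();

    // Hop 2: services whose owner is one of those StatefulSets.
    services
        .iter()
        .filter(|svc| {
            svc.owners.iter().any(|o| {
                o.kind == "StatefulSet" && owned_statefulsets.contains(o.name.as_str())
            })
        })
        .map(|svc| svc.name.as_str())
        .collect()
}
```

With this shape, no pod listing is needed: the services and StatefulSets in the namespace are enough to resolve which services belong to the chosen instance.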

Comment on lines 86 to 105
let object_list = materialize_api
    .list(&ListParams::default())
    .await
    .with_context(|| {
        format!(
            "Failed to get Materialize CR in namespace: {}",
            k8s_namespace
        )
    })?;

let materialize_cr = object_list
    .items
    .into_iter()
    .find(|item| item.metadata.name.as_ref() == Some(mz_instance_name))
    .with_context(|| {
        format!(
            "Could not find Materialize CR with name: {}",
            mz_instance_name
        )
    })?;
Contributor

Why not just use a get? There can only be one with a specific namespace and name.

Contributor Author

Good call. Changed!
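
For context, the difference between the two lookups can be sketched with an in-memory model. Everything here is hypothetical: `MaterializeCr` and `Store` stand in for the CR and the API server, and the functions only show why a keyed lookup suffices when namespace and name identify at most one object. With the kube crate, the keyed lookup would presumably be a `get` on the namespaced `Api` instead of a `list`.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a Materialize CR; only the fields used here.
#[derive(Debug, Clone, PartialEq)]
struct MaterializeCr {
    namespace: String,
    name: String,
}

/// Hypothetical in-memory store keyed by (namespace, name).
type Store = BTreeMap<(String, String), MaterializeCr>;

/// List-then-find: scans every CR in the namespace to locate one by name.
fn find_by_listing<'a>(
    store: &'a Store,
    namespace: &str,
    name: &str,
) -> Result<&'a MaterializeCr, String> {
    store
        .values()
        .filter(|cr| cr.namespace == namespace)
        .find(|cr| cr.name == name)
        .ok_or_else(|| format!("Could not find Materialize CR with name: {}", name))
}

/// Direct get: (namespace, name) is unique, so a single keyed lookup is enough.
fn get_by_key<'a>(
    store: &'a Store,
    namespace: &str,
    name: &str,
) -> Result<&'a MaterializeCr, String> {
    store
        .get(&(namespace.to_string(), name.to_string()))
        .ok_or_else(|| format!("Could not find Materialize CR with name: {}", name))
}
```

Both functions return the same result for any input; the keyed version simply avoids fetching and scanning the whole list.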

- Added examples for viewing Materialize instance names in Kubernetes.
- Updated debugging commands to require the Materialize instance name.
    .iter()
    .find_map(|service| match (&service.metadata.name, &service.spec) {
        (Some(service_name), Some(spec)) => {
            if !service_name.to_lowercase().contains("environmentd") {
Contributor

Not something to fix in this PR, but I am realizing now that this could match both the generation service and the globally active one. It is unclear which we are looking for in this function, but I would guess we want the active one.


// // Filter by cluster services
// if let Some(service_id) = selector.get("environmentd.materialize.cloud/service-id") {
// if !service_ids_to_scrape.contains(service_id) {
Contributor

Leftover comments?

pod.metadata
    .labels
    .as_ref()
    .and_then(|labels| labels.get("environmentd.materialize.cloud/service-id"))
Contributor

@alex-hunt-materialize alex-hunt-materialize Aug 27, 2025

I found a bug in our service labeling! We don't have any labels identifying which materialize instance a service should point to. This may cause services to go to the wrong materialize instance!
I just filed https://github.com/MaterializeInc/cloud/issues/11527

This also means you should probably change to getting the services with an owner reference pointing at environmentd.

Contributor Author

Oh neat! And with that we can skip fetching the pods

2 participants