
Conversation

@mboersma (Contributor) commented Nov 6, 2025

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

Updates commands in Windows templates to work with either old nssm.exe or new sc.exe for running the kubelet.exe service.
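
For context, the dual-path pattern the templates need looks roughly like this: a minimal PowerShell sketch, assuming kubelet.exe lives at c:\k\kubelet.exe and the service is named kubelet (the actual template commands and flags may differ).

# Hypothetical sketch: prefer nssm.exe when it is present on the image,
# otherwise register the service directly with sc.exe.
$kubeletPath = "c:\k\kubelet.exe"  # assumed install location
if (Get-Command "nssm.exe" -ErrorAction SilentlyContinue) {
    # nssm wraps kubelet.exe and handles the SCM handshake itself
    nssm install kubelet $kubeletPath
    nssm set kubelet AppParameters "--config=c:\k\kubelet-config.yaml"
    nssm start kubelet
} else {
    # sc.exe launches the binary directly, so the process must speak the SCM
    # protocol itself (kubelet's --windows-service flag enables that)
    sc.exe create kubelet start= auto binPath= "$kubeletPath --windows-service --config=c:\k\kubelet-config.yaml"
    sc.exe start kubelet
}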

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests
  • cherry-pick candidate

Release note:

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. labels Nov 6, 2025
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Nov 6, 2025
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Nov 6, 2025
codecov bot commented Nov 6, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 44.54%. Comparing base (d64c97c) to head (42411d3).
⚠️ Report is 8 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #5967   +/-   ##
=======================================
  Coverage   44.54%   44.54%           
=======================================
  Files         279      279           
  Lines       25140    25140           
=======================================
  Hits        11199    11199           
  Misses      13128    13128           
  Partials      813      813           


@mboersma (Contributor, Author) commented Nov 6, 2025

/cherry-pick release-1.21

@k8s-infra-cherrypick-robot

@mboersma: once the present PR merges, I will cherry-pick it on top of release-1.21 in a new PR and assign it to you.

In response to this:

/cherry-pick release-1.21

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@nojnhuh
Copy link
Contributor

nojnhuh commented Nov 6, 2025

/test ?

@k8s-ci-robot

@nojnhuh: The following commands are available to trigger required jobs:

/test pull-cluster-api-provider-azure-apiversion-upgrade
/test pull-cluster-api-provider-azure-build
/test pull-cluster-api-provider-azure-ci-entrypoint
/test pull-cluster-api-provider-azure-e2e
/test pull-cluster-api-provider-azure-e2e-aks
/test pull-cluster-api-provider-azure-e2e-workload-upgrade
/test pull-cluster-api-provider-azure-test
/test pull-cluster-api-provider-azure-verify

The following commands are available to trigger optional jobs:

/test pull-cluster-api-provider-azure-apidiff
/test pull-cluster-api-provider-azure-apiserver-ilb
/test pull-cluster-api-provider-azure-capi-e2e
/test pull-cluster-api-provider-azure-conformance
/test pull-cluster-api-provider-azure-conformance-azl3-with-ci-artifacts
/test pull-cluster-api-provider-azure-conformance-custom-builds
/test pull-cluster-api-provider-azure-conformance-dual-stack-with-ci-artifacts
/test pull-cluster-api-provider-azure-conformance-ipv6-with-ci-artifacts
/test pull-cluster-api-provider-azure-conformance-with-ci-artifacts
/test pull-cluster-api-provider-azure-conformance-with-ci-artifacts-dra
/test pull-cluster-api-provider-azure-dra-scalability
/test pull-cluster-api-provider-azure-e2e-optional
/test pull-cluster-api-provider-azure-e2e-windows
/test pull-cluster-api-provider-azure-load-test-1k-dra-with-workload-custom-builds
/test pull-cluster-api-provider-azure-load-test-custom-builds
/test pull-cluster-api-provider-azure-load-test-dra-custom-builds
/test pull-cluster-api-provider-azure-load-test-dra-with-workload-custom-builds
/test pull-cluster-api-provider-azure-perf-test-apiserver-availability
/test pull-cluster-api-provider-azure-windows-custom-builds
/test pull-cluster-api-provider-azure-windows-with-ci-artifacts

Use /test all to run the following jobs that were automatically triggered:

pull-cluster-api-provider-azure-apidiff
pull-cluster-api-provider-azure-build
pull-cluster-api-provider-azure-ci-entrypoint
pull-cluster-api-provider-azure-conformance
pull-cluster-api-provider-azure-conformance-custom-builds
pull-cluster-api-provider-azure-conformance-dual-stack-with-ci-artifacts
pull-cluster-api-provider-azure-conformance-ipv6-with-ci-artifacts
pull-cluster-api-provider-azure-conformance-with-ci-artifacts
pull-cluster-api-provider-azure-conformance-with-ci-artifacts-dra
pull-cluster-api-provider-azure-e2e
pull-cluster-api-provider-azure-e2e-aks
pull-cluster-api-provider-azure-e2e-workload-upgrade
pull-cluster-api-provider-azure-test
pull-cluster-api-provider-azure-verify

In response to this:

/test ?


@nojnhuh (Contributor) commented Nov 6, 2025

Looking through the test configs, I don't see any of our presubmits that invoke ci-entrypoint with TEST_WINDOWS=true, so I don't think we can exercise this directly before we merge. I'll try once locally, but if I have trouble setting up the test I'm not going to spend all day getting it to work. I think the worst case here is that the tests that are already failing continue to fail.
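
For reference, mimicking one of those Windows jobs locally might look like the sketch below, assuming the repo's scripts/ci-entrypoint.sh entrypoint and the TEST_WINDOWS variable it reads; real presubmits set more environment than shown.

# Hypothetical local invocation; actual jobs also set the cluster template,
# Kubernetes version, and credentials via additional variables.
export AZURE_LOCATION=eastus  # assumed; any supported region works
TEST_WINDOWS=true ./scripts/ci-entrypoint.sh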

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 6, 2025
@nojnhuh (Contributor) left a comment

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 6, 2025
@k8s-ci-robot

LGTM label has been added.

Git tree hash: 41e9612aae1d53dc281848eeeeb88b10c80860f1

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 6, 2025
@marosset (Contributor) commented Nov 6, 2025

/lgtm
/approve

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: marosset, nojnhuh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@nojnhuh (Contributor) commented Nov 6, 2025

In my testing I'm seeing the Windows Nodes come up, but then cloud-node-manager skips them, so they never get the providerID that CAPI expects.

I1106 17:59:34.680210    5428 nodemanager.go:359] This node capz-t03o-l6tjb is registered without the cloud taint. Will not process.

I can see in https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/directory/pull-azurefile-csi-driver-e2e-capz-windows-2022-hostprocess/1986347754665807872 that the kubelet failed right after it started up, but the logs are empty in the artifacts. The template is still setting the --cloud-provider=external flag for kubelet.
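
That nodemanager message refers to the uninitialized cloud taint: with --cloud-provider=external, the kubelet should register the node with node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule, and cloud-node-manager only processes (and assigns a providerID to) nodes carrying it. A quick check against the node named in the log above:

# If this prints nothing, the node registered without the cloud taint,
# which matches the "Will not process" message from cloud-node-manager.
kubectl get node capz-t03o-l6tjb -o jsonpath='{.spec.taints}'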

From scm.log:

11/6/2025 8:38:35 AM 7036 Information  The kubelet service entered the running state.
11/6/2025 8:38:35 AM 7039 Warning      A service process other than the one launched by the Service Control Manager connected when starting the kubelet service. The Service Control Manager launched process 572 and process 4960 connected instead. Note that if this service is configured to start under a debugger, this behavior is expected.
...
11/6/2025 8:37:22 AM 7000 Error        The kubelet service failed to start due to the following error: The service did not respond to the start or control request in a timely fashion.
11/6/2025 8:37:22 AM 7009 Error        A timeout was reached (45000 milliseconds) while waiting for the kubelet service to connect.

@mboersma (Contributor, Author) commented Nov 6, 2025

A service process other than the one launched by the Service Control Manager connected when starting the kubelet service. The Service Control Manager launched process 572 and process 4960 connected instead.

Maybe there's some contention between the two service launchers?
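
One way to tell which launcher owns the service is to inspect its registration; a diagnostic sketch, assuming the service name is kubelet:

# If BINARY_PATH_NAME points at nssm.exe, nssm owns the service; if it
# points at kubelet.exe directly, it was registered via sc.exe.
sc.exe qc kubelet
# The 7000/7009 events above mean the launched process never completed the
# SCM handshake in time, which would fit two managers racing over one name.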

@marosset (Contributor) commented Nov 6, 2025

/cc @zylxjtu

@k8s-ci-robot k8s-ci-robot requested a review from zylxjtu November 6, 2025 19:18
@nojnhuh (Contributor) commented Nov 6, 2025

These are the kubelet logs I've gathered from a Node in my local testing: https://gist.github.com/nojnhuh/465e48aeee5be0bdde8c7bf070803ad6

@marosset @zylxjtu Do you spot any meaningful errors there?

There are quite a few messages about CNI not being ready, but I wonder if those are expected while everything is starting up. All pods are reporting Running and Ready.

@nojnhuh (Contributor) commented Nov 6, 2025

For context, in my testing I'm trying to mimic this job: https://testgrid.k8s.io/provider-azure-azurefile-csi-driver#pull-azurefile-csi-driver-e2e-capz-windows-2022-hostprocess

KUBERNETES_VERSION=latest-1.32, so pod-infra-container-image shouldn't be a problem.

@mboersma (Contributor, Author) commented

All pods are reporting Running and Ready.

I'm running into the same issue trying to fix this in testgrid. We're waiting on the MachineDeployment, which stays ScalingUp even though the two Windows nodes are Ready.
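
That's consistent with the missing providerID above: CAPI matches Machines to Nodes by providerID, so a Ready Node without one never counts toward the MachineDeployment's replicas. A quick diagnostic sketch (run against the workload and management clusters respectively):

# Workload cluster: nodes missing a providerID show <none> here.
kubectl get nodes -o custom-columns=NAME:.metadata.name,PROVIDERID:.spec.providerID
# Management cluster: affected Machines typically sit in Provisioned
# instead of Running.
kubectl get machines -A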

@mboersma (Contributor, Author) commented

/test pull-cluster-api-provider-azure-e2e
/test pull-cluster-api-provider-azure-ci-entrypoint

@k8s-ci-robot

@mboersma: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: pull-cluster-api-provider-azure-e2e
Commit: 42411d3
Required: true
Rerun command: /test pull-cluster-api-provider-azure-e2e

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


