Skip to content

Conversation

@afcollins
Copy link
Member

Adds VM recovery playbooks
Enable hugepages on the hypervisor and VM configuration. Add playbook to disable devices created for virtual functions Working changes to configure hugetlb
Complaint of a missing var, this import seems required

Improve CSR approve and node Ready wait loop

Add interfaces that can be used as virtual functions But they seemed to generate a lot of iowait activity on the VMs, so I don't know whether something is wrong with them or not

Some changes were generated using Cursor and the claude-4-sonnet model.

@openshift-ci
Copy link

openshift-ci bot commented Oct 31, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign akrzos for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@akrzos akrzos requested a review from mcornea November 3, 2025 14:36
Object value modified successfully
```

## Failing ImagePull due to Pull Secret
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this scenario for when the user can no longer update the pull-secret by oc -n openshift-config edit secret/pull-secret ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I only encountered it when the credentials installed in the cluster belonged to an account that was deactivated and no nodes would pull images.

I didn't like this section of the Troubleshooting doc to place it under, but wasn't sure there was any better option. Open to suggestions on that too.

<features>
<acpi/>
<apic/>
<ioapic driver='qemu'/>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this option needed? Is it required for the virtual functions?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. So it can also be included in a flag switch to configure virtual functions.

{% endif %}
<address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
</interface>
{% for i in range(1, 6) %}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have an extra option which controls the creation of these interfaces? From what I read in the description there may be some issues with them so I suggest not enabling them by default.

"host_name": "{{ hostname }}",
"host_role": "{{ host_role }}"
}
ignore_errors: yes
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need ignore_errors: yes here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran into a case where I was getting errors when I couldn't find anything wrong. Can revert it if we would rather not ignore.

shell: |
KUBECONFIG={{ bastion_cluster_config_dir }}/kubeconfig oc get nodes --no-headers -l node-role.kubernetes.io/worker | grep -c -v NotReady
register: oc_get_nodes_workers
until: oc_get_nodes_workers.stdout|int < current_worker_count+scale_out_count
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this condition correct? Shouldn't it be oc_get_nodes_workers.stdout|int == current_worker_count+scale_out_count ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly. I thought I remember this working and the prior code specifically not working. But would have to test it more. Could also remove from this PR since I don't have much confidence in it.

@@ -1,3 +1,6 @@
[defaults]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need these changes?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They improved my quality of life. I was getting python interpreter warnings on every step, so added auto_silent and deprecation_warnings.

Writing log to a file is good to be able to trace back, also if we need a second set of eyes on a playbook run.

I am not sure display_args_to_stdout works properly so that one can go.

Adds VM recovery playbooks
Enable hugepages on the hypervisor and VM configuration.
Add playbook to disable devices created for virtual functions
Working changes to configure hugetlb
Complaint of a missing var, this import seems required

Improve CSR approve and node Ready wait loop

Add interfaces that can be used as virtual functions
But they seemed to generate a lot of iowait activity on the VMs, so I don't know whether something is wrong with them or not

Some changes were generated using Cursor and the claude-4-sonnet model.

Signed-off-by: Andrew Collins <[email protected]>

Apply suggestion from @mcornea

Co-authored-by: Marius Cornea <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants