-
Notifications
You must be signed in to change notification settings - Fork 49
Changes from 500 VMs Hybrid work #733
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
| Object value modified successfully | ||
| ``` | ||
|
|
||
| ## Failing ImagePull due to Pull Secret |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this scenario for when the user can no longer update the pull-secret by oc -n openshift-config edit secret/pull-secret ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. I only encountered it when the credentials installed in the cluster belonged to an account that was deactivated and no nodes would pull images.
I didn't like this section of the Troubleshooting doc to place it under, but wasn't sure there was any better option. Open to suggestions on that too.
| <features> | ||
| <acpi/> | ||
| <apic/> | ||
| <ioapic driver='qemu'/> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this option needed? Is it required for the virtual functions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. So it can also be included in a flag switch to configure virtual functions.
| {% endif %} | ||
| <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/> | ||
| </interface> | ||
| {% for i in range(1, 6) %} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we have an extra option which controls the creation of these interfaces? From what I read in the description there may be some issues with them so I suggest not enabling them by default.
| "host_name": "{{ hostname }}", | ||
| "host_role": "{{ host_role }}" | ||
| } | ||
| ignore_errors: yes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need ignore_errors: yes here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ran into a case where I was getting errors when I couldn't find anything wrong. Can revert it if we would rather not ignore.
| shell: | | ||
| KUBECONFIG={{ bastion_cluster_config_dir }}/kubeconfig oc get nodes --no-headers -l node-role.kubernetes.io/worker | grep -c -v NotReady | ||
| register: oc_get_nodes_workers | ||
| until: oc_get_nodes_workers.stdout|int < current_worker_count+scale_out_count |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this condition correct? Shouldn't it be oc_get_nodes_workers.stdout|int == current_worker_count+scale_out_count ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Possibly. I thought I remember this working and the prior code specifically not working. But would have to test it more. Could also remove from this PR since I don't have much confidence in it.
| @@ -1,3 +1,6 @@ | |||
| [defaults] | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need these changes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They improved my quality of life. I was getting python interpreter warnings on every step, so added auto_silent and deprecation_warnings.
Writing log to a file is good to be able to trace back, also if we need a second set of eyes on a playbook run.
I am not sure display_args_to_stdout works properly so that one can go.
Adds VM recovery playbooks Enable hugepages on the hypervisor and VM configuration. Add playbook to disable devices created for virtual functions Working changes to configure hugetlb Complaint of a missing var, this import seems required Improve CSR approve and node Ready wait loop Add interfaces that can be used as virtual functions But they seemed to generate a lot of iowait activity on the VMs, so I don't know whether something is wrong with them or not Some changes were generated using Cursor and the claude-4-sonnet model. Signed-off-by: Andrew Collins <[email protected]> Apply suggestion from @mcornea Co-authored-by: Marius Cornea <[email protected]>
c04998a to
69b8036
Compare
Adds VM recovery playbooks
Enable hugepages on the hypervisor and VM configuration. Add playbook to disable devices created for virtual functions Working changes to configure hugetlb
Complaint of a missing var, this import seems required
Improve CSR approve and node Ready wait loop
Add interfaces that can be used as virtual functions But they seemed to generate a lot of iowait activity on the VMs, so I don't know whether something is wrong with them or not
Some changes were generated using Cursor and the claude-4-sonnet model.