
feat: add cephadm maintenance playbooks #696

Closed
jackhodgkiss wants to merge 4 commits from the ceph-maintenance-playbook branch.

Conversation

jackhodgkiss
Contributor

Add two playbooks for entering and exiting maintenance mode for a given Ceph node.

kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/cephadm-enter-maintenance.yml --limit ceph-mon-01

kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/cephadm-exit-maintenance.yml --limit ceph-mon-01

Note that these playbooks use stackhpc.cephadm.commands, which delegates the command to the first mon in your inventory. If that node is in maintenance, you must set cephadm_delegate_host to another mon.

kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/cephadm-exit-maintenance.yml --limit ceph-mon-01 -e cephadm_delegate_host=ceph-mon-02

Note: this relies on something such as stackhpc/ansible-collection-cephadm/pull/109 being merged with some additional changes.
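
For context, a minimal sketch of what cephadm-enter-maintenance.yml could look like; the play layout and the ceph host group are assumptions here, while the role name and the cephadm_commands variable come from stackhpc.cephadm.commands:

---
- name: Enter Ceph maintenance mode
  # Host group is an assumption; run with --limit to target a single node.
  hosts: ceph
  tasks:
    - name: Place the host into maintenance
      ansible.builtin.import_role:
        name: stackhpc.cephadm.commands
      vars:
        cephadm_commands:
          - "orch host maintenance enter {{ ansible_facts.nodename }}"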

@Alex-Welsh
Member

Is there a way of determining the hosts that are not in maintenance and selecting one of them? That would make things a lot simpler.

@jackhodgkiss force-pushed the ceph-maintenance-playbook branch from 007e649 to 0522985 on October 9, 2023 at 13:12
@jackhodgkiss
Contributor Author

I suppose it is possible. However, I see this as no different from how we handle the controllers and the VIP, where we intentionally avoid the VIP until the end.

My concern is that if a host is in maintenance the command gets trapped: it will proceed to authenticate with the cluster and silently fail, so it would involve timeouts and other workarounds.
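
For illustration only, automatic selection of a delegate might look something like the sketch below, filtering ceph orch host ls output for hosts that are not in maintenance; the task layout, become usage, and the exact status string are assumptions, and this is precisely the extra machinery being weighed up here:

- name: List cluster hosts
  ansible.builtin.command: cephadm shell -- ceph orch host ls --format json
  register: cephadm_host_ls
  changed_when: false
  become: true

- name: Pick the first host that is not in maintenance
  ansible.builtin.set_fact:
    # The status string for maintenance hosts may differ between Ceph releases,
    # and in practice this would need further filtering (e.g. to mon hosts only).
    cephadm_delegate_host: >-
      {{ cephadm_host_ls.stdout | from_json
         | rejectattr('status', 'match', '(?i)maintenance')
         | map(attribute='hostname')
         | first }}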

@Alex-Welsh
Member

Does it error gracefully if the node is in maintenance? If not, it might be worth adding a "precheck" task to verify.

@jackhodgkiss
Contributor Author

I will have to check, but I think ceph has a tendency to return 0 regardless of whether the command was successful or not.
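
A hedged sketch of the kind of precheck suggested above, which avoids relying on the exit code by inspecting the reported host status instead; the status string and task layout are assumptions:

- name: Fail early if the host is already in maintenance
  ansible.builtin.command: cephadm shell -- ceph orch host ls --format json
  register: cephadm_host_ls
  changed_when: false
  become: true
  # Check the reported status rather than trusting the return code of the
  # maintenance command itself.
  failed_when: >-
    cephadm_host_ls.stdout | from_json
    | selectattr('hostname', 'equalto', ansible_facts.nodename)
    | selectattr('status', 'match', '(?i)maintenance')
    | list | length > 0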

name: stackhpc.cephadm.commands
vars:
  cephadm_commands:
    - "orch host maintenance enter {{ ansible_facts.nodename }}"
Member

This won't be possible for any host holding RGW services, as it gets:

WARNING: Removing RGW daemons can cause clients to lose connectivity.
Note: Warnings can be bypassed with the --force flag

Of course --force defeats the purpose of other checks and is not viable here.

@markgoddard
Contributor

I reworked these into roles in the cephadm collection: stackhpc/ansible-collection-cephadm#153. Once that merges I'll propose some playbooks in SKC.
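
Once those roles merge, the SKC playbooks could presumably become thin wrappers along these lines; the role name below is a placeholder guess, not the actual name exposed by stackhpc/ansible-collection-cephadm#153:

---
- name: Enter Ceph maintenance mode
  hosts: ceph
  tasks:
    - name: Place the host into maintenance via the collection role
      ansible.builtin.import_role:
        # Hypothetical role name; substitute whatever the collection provides.
        name: stackhpc.cephadm.enter_maintenance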

@jackhodgkiss deleted the ceph-maintenance-playbook branch on January 15, 2025 at 20:54