
If a node is removed from the cluster, does it need to be restarted before it can rejoin the cluster? #596

@MCKZX-llx

I had a three-node cluster (S1, S2, S3) with S1 as the leader. First I removed S3 and then S2, but the S3 and S2 processes were still running.
Then I added S2 and S3 back to the cluster. Adding S2 succeeded, but adding S3 failed.
The logs on S3 are as follows:

[2025-06-16 17:31:15.880] [raft] [info] [NuRaftLoggerWrapper.h:57] [reconfigure:970] new configuration: log idx 5, prev log idx 4
peer 0, DC ID 0, llx01:10009, voting member, regular member, 1
peer 2, DC ID 0, llx03:10009, voting member, regular member, 1
peer 1, DC ID 0, llx02:10009, voting member, regular member, 1
my id: 2, leader: 0, term: 1
[2025-06-16 17:31:29.143] [raft] [info] [NuRaftLoggerWrapper.h:57] [handle_append_entries:923] receive a config change from leader at 6
[2025-06-16 17:31:29.147] [raft] [info] [NuRaftLoggerWrapper.h:57] [commit_conf:414] config at index 6 is committed, prev config log idx 5
[2025-06-16 17:31:29.148] [raft] [error] [NuRaftLoggerWrapper.h:63] [operator():357] session 2 failed to read rpc header from socket 192.168.1.25:45010 due to error 2, End of file, ref count 1
[2025-06-16 17:31:29.148] [raft] [info] [NuRaftLoggerWrapper.h:57] [reconfigure:743] new config log idx 6, prev log idx 5, cur config log idx 5, prev log idx 4
[2025-06-16 17:31:29.148] [raft] [info] [NuRaftLoggerWrapper.h:57] [reconfigure:846] this server (2) has been removed from the cluster, will step down itself soon. config log idx 6
[2025-06-16 17:31:29.148] [raft] [info] [NuRaftLoggerWrapper.h:57] [reconfigure:922] peer 2 cannot be found, no action for removing
[2025-06-16 17:31:29.148] [raft] [info] [NuRaftLoggerWrapper.h:57] [reconfigure:970] new configuration: log idx 6, prev log idx 5
peer 0, DC ID 0, llx01:10009, voting member, regular member, 1
peer 1, DC ID 0, llx02:10009, voting member, regular member, 1
my id: 2, leader: 0, term: 1
[2025-06-16 17:31:29.545] [raft] [info] [NuRaftLoggerWrapper.h:57] [handle_election_timeout:239] stepping down (cycles left: 1), skip this election timeout event
[2025-06-16 17:31:29.802] [raft] [info] [NuRaftLoggerWrapper.h:57] [handle_election_timeout:212] no hearing further news from leader, remove this server from cluster and step down
[2025-06-16 17:31:42.343] [raft] [info] [NuRaftLoggerWrapper.h:57] [handle_accept:997] receive a incoming rpc connection
[2025-06-16 17:31:42.343] [raft] [info] [NuRaftLoggerWrapper.h:57] [prepare_handshake:290] session 3 got connection from 192.168.1.25:58888 (as a server)
[2025-06-16 17:31:42.343] [raft] [info] [NuRaftLoggerWrapper.h:57] [handle_join_cluster_req:170] this server is already in a cluster, ignore the request

When I checked the code at 'handle_join_cluster_req:170' in the file 'handle_join_leave.cxx', I found that a node cannot be added to a cluster while its own view of the configuration contains more than one server.

ptr<resp_msg> raft_server::handle_join_cluster_req(req_msg& req) {
......
    ptr<cluster_config> cur_config = get_config();
    if (cur_config->get_servers().size() > 1) {
        p_in("this server is already in a cluster, ignore the request");
        return resp;
    }
......
}

S3 was removed first, so in its view there are still two nodes (S1, S2) in the cluster. Adding S3 again therefore does not work unless I restart node S3 and clear its saved config.

Does a node need to be shut down after it has been removed from a cluster?
Or does it need to be restarted before it can rejoin the cluster?

Why don't we check whether this node is actually a member of the cluster, instead of checking the number of nodes?
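
To make the question concrete, here is a minimal standalone sketch that contrasts the two checks on the state shown in the S3 logs. It is only a model under stated assumptions, not NuRaft code: `ConfigView`, `rejects_join_by_size`, `rejects_join_by_membership`, and `check_issue_scenario` are hypothetical names, and the configuration is reduced to a plain list of server IDs. It does not show whether a membership-based check would be safe inside NuRaft.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a node's last known cluster configuration.
// This is NOT NuRaft's cluster_config class, only a list of server IDs.
struct ConfigView {
    std::vector<int32_t> server_ids;
};

// Models the check quoted from handle_join_cluster_req: the join request
// is ignored whenever the local view lists more than one server.
bool rejects_join_by_size(const ConfigView& view) {
    return view.server_ids.size() > 1;
}

// Models the alternative asked about above: ignore the join request only
// if this server's own ID is still listed in its local view.
bool rejects_join_by_membership(const ConfigView& view, int32_t my_id) {
    return std::find(view.server_ids.begin(), view.server_ids.end(), my_id)
           != view.server_ids.end();
}

// Replays the scenario from the S3 logs (my id: 2).
void check_issue_scenario() {
    // After config log idx 6, S3's view contains peers 0 and 1 only.
    ConfigView removed_view{{0, 1}};
    assert(rejects_join_by_size(removed_view));            // join is ignored
    assert(!rejects_join_by_membership(removed_view, 2));  // join would pass

    // At config log idx 5, S3 was still a member; both checks reject.
    ConfigView member_view{{0, 1, 2}};
    assert(rejects_join_by_size(member_view));
    assert(rejects_join_by_membership(member_view, 2));
}
```

In this model the two checks agree while S3 is still a member and differ only after it has been removed. That is exactly the case where the rejoin fails today.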
