Restoring Rabbit functionality

  1. Environment

    3 virtual machines and a cluster has been built. Check Cluster construction

    Master node yzm1(128), slave node yzm2(129), slave node yzm3(130)

  2. The master node is normal, but the slave node is down.

    Check the cluster status, all 3 virtual machines are in normal operation:

    ../../_images/cluster_status.png

    Cluster checking

    Make yzm2 down, shut down the virtual machine corresponding to yzm2:

    ../../_images/yzm2.png

    yzm2

    Remove the node of yzm2:

    ../../_images/removing_yzm2.png

    yzm2

    Restart the rabbitmq service of yzm2:

    ../../_images/restarting_rabbit.png

    Restarting the rabbitmq service

    Error message: Node rabbit@yzm2 thinks it’s clustered with node rabbit@yzm3, but rabbit@yzm3 disagrees

    It means that the yzm2 downtime is not a normal exit from the cluster, and the node configuration information of the cluster is still retained, but the yzm2 node has been kicked out of the cluster

    Solution Clear the reserved node configuration information:

    rm -rf /var/lib/rabbitmq/mnesia/*.
    
    ../../_images/solution.png

    Clearing information

    After the files in the mnesia directory are deleted, they will be regenerated after a while.

    ../../_images/regeneration.png

    Data recovery

    If you accidentally delete the mnesia directory, then recreate the mnesia directory, but you must ensure that the user and group of the directory are rabbitmq

    Command:

    chown Rabbitmq:rabbitmq /var/lib/rabbitmq/
    

    yzm2 rejoins the cluster:

    ../../_images/rejoins.png

    yzm2 rejoining the cluster

  3. The master node is down, and the slave node is normal.

    Close yzm3, yzm2

    ../../_images/close_yzm.png

    Closing yzm3, yzm2

    yzm1 is down, turn off the virtual machine

    At this time, yzm1 is the last node, and the last node is also regarded as the master node.

    Start yzm2, it will wait for yzm1 to start, and it will fail to start after timeout:

    ../../_images/starting_yzm.png

    Starting yzm2

    Operate yzm3 to eliminate yzm1:

    ../../_images/eliminate_yzm1.png

    Eliminating yzm1

    Main command:

    rabbitmqctl force_boot
    rabbitmqctl start_app
    rabbitmqctl forget_cluster_node rabbit@yzm1
    

    I have used the –offline parameter many times, but it has not been used successfully, and I don’t know why

    For example such usage:

    rabbitmqctl forget_cluster_node --offline rabbit@yzm1
    rabbitmqctl -n rabbit@yzm3 forget_cluster_node --offline rabbit@yzm1
    

    Finally restart yzm1 and join the cluster

    This step is the same as the above yzm2 restart to join the cluster, you need to delete the node configuration information first cluster status:

    ../../_images/deleting_configuration.png

    Deleting configuration

  4. Node update before starting the node

    The current cluster has nodes yzm2 and yzm3.

    ../../_images/nodes_2.png

    Nodes yzm2 и yzm3

    Shutdown node yzm2:

    ../../_images/shutdown_yzm2.png

    Shutdowning node yzm2

    At the same time, start yzm1 to rejoin the cluster (the node configuration information needs to be cleared before starting the rabbitmq service):

    ../../_images/starting_yzm1.png

    Starting yzm1

    Close the yzm3 node and start the yzm2 node:

    ../../_images/closing_yzm3.png

    Closing yzm3

    ../../_images/starting_yzm2.png

    Starting yzm2

    yzm2 will wait until it times out and reports an error

    It is correct to consult yzm1 that is running normally in the cluster before starting:

    rabbitmqctl update_cluster_nodes rabbit@yzm1
    
    ../../_images/consulting_yzm1.png

    Checking yzm1