Saturday, January 16, 2016

How to recover from an amnesia situation


Amnesia Scenario:
Node node-1 is shut down.
Node node-2 crashes and will not boot due to hardware failure.
Node node-1 is rebooted but stops and prints out the messages: 
    Booting as part of a cluster
    NOTICE: CMM: Node node-1 (nodeid = 1) with votecount = 1 added.
    NOTICE: CMM: Node node-2 (nodeid = 2) with votecount = 1 added.
    NOTICE: CMM: Quorum device 1 (/dev/did/rdsk/d4s2) added; votecount = 1, bitmask of nodes with configured paths = 0x3.
    NOTICE: CMM: Node node-1: attempting to join cluster.
    NOTICE: CMM: Quorum device 1 (gdevname /dev/did/rdsk/d4s2) can not be acquired by the current cluster members. This quorum device is held by node 2.
    NOTICE: CMM: Cluster doesn't have operational quorum yet; waiting for quorum.
Node node-1 cannot boot completely because it cannot achieve the needed quorum vote count.
In the above case, node-1 cannot start the cluster because of the amnesia protection in Oracle Solaris Cluster. Node-1 was not a member of the cluster when the cluster last went down (when node-2 crashed), so its copy of the Cluster Configuration Repository (CCR) may be outdated, and it is not allowed to start the cluster on its own automatically.
The general rule is that a node can only start the cluster if it was part of the cluster when the cluster was last shut down. In a multi-node cluster, more than one node can be "the last" to leave the cluster.
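The vote arithmetic behind the boot messages can be sketched in a few lines of shell. The vote counts come from the log above (two nodes and one quorum device, one vote each), and the rule applied is that a partition needs more than half of all configured votes. The variable names are illustrative only and are not part of any cluster tool.

```shell
#!/bin/sh
# Votes configured in the example boot log: node-1, node-2 and quorum device 1.
node1_votes=1
node2_votes=1
qd_votes=1

total=$((node1_votes + node2_votes + qd_votes))
# A partition needs more than half of all configured votes.
needed=$((total / 2 + 1))

# node-1 boots alone and the quorum device is still held by node 2,
# so node-1 can only count its own vote.
present=$node1_votes

if [ "$present" -ge "$needed" ]; then
    status="quorum reached"
else
    status="waiting for quorum"
fi
echo "total=$total needed=$needed present=$present: $status"
```

With three configured votes, two are needed, so a lone node-1 with one vote keeps waiting. This is why the recovery procedure has to change the configured vote counts.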
How to recover Sun Cluster 3.3 from amnesia when only one node is operational
When all nodes in a Sun Cluster are stopped, the last node to leave the cluster must be the first to boot, so that the CCR stays consistent. However, if that node cannot boot for any reason (hardware failure, etc.), the other nodes in the cluster will not boot either, and this message appears:
Jul 15 11:05:19 maquina01 cl_runtime: [ID 980942 kern.notice] NOTICE: CMM: Cluster doesn't have operational quorum yet; waiting for quorum.
This is normal behavior, intended to prevent what Sun Cluster calls "amnesia" (see the documentation for details). To start the cluster while the faulty node is being repaired, make the following changes:
Boot the node outside of the cluster (non-cluster mode):
# reboot -- -x
Edit the infrastructure file and set quorum_vote to 1 for the node that is up. On Sun Cluster 3.3 the file is /etc/cluster/ccr/global/infrastructure; some older releases keep it at /etc/cluster/ccr/infrastructure:
# cd /etc/cluster/ccr/global/
# vi infrastructure
  cluster.nodes.1.name   NODE1
  cluster.nodes.1.state  enable
  cluster.nodes.1.properties.quorum_vote  1
For all other nodes and for every quorum device, set the vote count to zero (0). For example:
cluster.nodes.N.properties.quorum_vote  0
cluster.quorum_devices.Q.properties.votecount  0
where N is the node ID and Q is the quorum device ID.
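The vote changes described above could also be applied with sed rather than by hand. The following is only a sketch under stated assumptions: it works on a scratch copy filled with a hypothetical sample (the key names are the ones shown above, while the NODE2 entries and the exact layout of a real file are assumed), keep_node is an illustrative variable, and the script never touches the live CCR.

```shell
#!/bin/sh
# Sketch only: operates on a scratch copy, never on the live CCR file.
work=$(mktemp -d)
sample="$work/infrastructure"

# Hypothetical sample content; a real infrastructure file has many more keys.
cat > "$sample" <<'EOF'
cluster.nodes.1.name  NODE1
cluster.nodes.1.state  enable
cluster.nodes.1.properties.quorum_vote  1
cluster.nodes.2.name  NODE2
cluster.nodes.2.state  enable
cluster.nodes.2.properties.quorum_vote  1
cluster.quorum_devices.1.properties.votecount  1
EOF

keep_node=1                      # node ID of the surviving node (assumed)
sp='[[:space:]][[:space:]]*'     # separator between key and value

# 1) zero every node vote, 2) zero every quorum device vote,
# 3) give the surviving node its single vote back.
# In each replacement, \1 is the key, \2 the original separator,
# and the digit that follows is the new value.
sed \
  -e "s/^\(cluster\.nodes\.[0-9][0-9]*\.properties\.quorum_vote\)\($sp\).*/\1\20/" \
  -e "s/^\(cluster\.quorum_devices\.[0-9][0-9]*\.properties\.votecount\)\($sp\).*/\1\20/" \
  -e "s/^\(cluster\.nodes\.${keep_node}\.properties\.quorum_vote\)\($sp\).*/\1\21/" \
  "$sample" > "$sample.new"

# Show only the vote lines of the edited copy for review.
grep -e quorum_vote -e votecount "$sample.new"
```

Reviewing the edited copy before changing the real file keeps the manual step auditable. The checksum still has to be regenerated after any change to the real infrastructure file.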
Regenerate the checksum of the infrastructure file you edited (the path shown is the Sun Cluster 3.3 location):
# /usr/cluster/lib/sc/ccradm -i /etc/cluster/ccr/global/infrastructure -o
Reboot node NODE1 into the cluster:
# reboot
