
Configuring SBD Cluster Node Fencing

SBD (STONITH Block Device, also known as Storage Based Death) uses a storage-disk-based method to fence nodes.

 

 

So you need a shared disk for the nodes, with a partition of at least 8MB (i.e. small, as it is used only for this purpose)

NOTE: You need iSCSI configured first in order to use SBD!
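As a quick sketch, assuming the shared iSCSI disk shows up as /dev/sdb on the nodes (the device name is an assumption, check yours with lsblk), a small dedicated partition can be created from one node:

parted -s /dev/sdb mklabel msdos
parted -s /dev/sdb mkpart primary 1MiB 10MiB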

 

each node:

– gets one slot on the SBD device to track its status info

– runs the sbd daemon, started as a corosync dependency

– has /etc/sysconfig/sbd, which contains the stonith device list

– requires a hardware watchdog timer, which generates a reset if it counts down to zero (see the quick check below)
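A quick way to check that a watchdog device node exists (i.e. a hardware watchdog driver or softdog is loaded):

ls -l /dev/watchdog*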

 

 

 

How SBD works:

 

a node gets fenced by the cluster writing a “poison pill” into that node’s slot on the SBD disk

it’s the opposite of SCSI reservation – with reservations something is removed in order to fence, whereas with SBD something is ADDED in order to fence.

the sbd daemon keeps resetting (“feeding”) the hardware watchdog timer; if the daemon stops feeding it – for example because it has read a poison pill from its slot or has lost access to the SBD disk – the timer runs down and the watchdog resets the node.

 

A hardware watchdog is driven by a hardware-specific watchdog kernel module. If your hardware does not have a watchdog, you can use softdog instead – this is also a kernel module, implementing a software watchdog.

 

check with:

 

systemctl status systemd-modules-load

 

Put a config file named softdog.conf in /etc/modules-load.d (this is essential)

 

in this file just put the line:

 

softdog
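A one-line way to create that file with exactly that content:

echo softdog > /etc/modules-load.d/softdog.conf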

 

then do

 

systemctl restart systemd-modules-load

 

then to verify the watchdog module is active:

lsmod | grep dog

 

THIS IS ESSENTIAL TO USE SBD!!

 

then set up SBD itself

on SUSE you can run ha-cluster-init interactively
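As a sketch, recent crmsh versions also let you pass the SBD device straight to ha-cluster-init (assuming /dev/sdb1 is the shared partition):

ha-cluster-init -s /dev/sdb1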

 

or you can use the sbd util:

 

sbd -d /dev/whatever create   (where /dev/whatever is your SBD device, i.e. the partition)
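A concrete sketch, assuming /dev/sdb1 is the shared partition; the -1 (watchdog timeout) and -4 (msgwait timeout) options are optional and the values here are only illustrative:

sbd -d /dev/sdb1 -1 10 -4 20 create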

 

then

 

edit /etc/sysconfig/sbd and set:

SBD_DEVICE – the device you created above

SBD_WATCHDOG="yes"

SBD_STARTMODE="clean" – this is optional; don't use it in a test environment
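A minimal sketch of the resulting file, again assuming /dev/sdb1:

# /etc/sysconfig/sbd
SBD_DEVICE="/dev/sdb1"
SBD_WATCHDOG="yes"
SBD_STARTMODE="clean"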

 

then sync your cluster config

 

pcs cluster sync

 

and restart cluster stack on all nodes

 

pcs cluster restart

 

 

then create the cluster resource in the pacemaker config using crm configure

 

eg

 

primitive my-stonith-sbd stonith:external/sbd

 

my-stonith-sbd is the name you assign to the device

 

then set the cluster-wide stonith properties:

 

property stonith-enabled="true" (the default is true)

property stonith-timeout="30" (default)
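Put together, a sketch of the whole crm shell session (resource name and property values as above):

crm configure
primitive my-stonith-sbd stonith:external/sbd
property stonith-enabled="true"
property stonith-timeout="30"
commit
quit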

 

 

to verify the config, run sbd -d with your device:

sbd -d /dev/whatever list

sbd -d /dev/whatever dump
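list shows each node slot and any message written into it; dump prints the on-disk header, including the number of slots and the configured timeouts. With the assumed device:

sbd -d /dev/sdb1 list

sbd -d /dev/sdb1 dump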

 

 

On the node itself that you want to crash:

echo c > /proc/sysrq-trigger

this will crash the node immediately (kernel panic).

 

to send a message (test, reset or poweroff) to a node's slot on the device:

sbd -d /dev/whatever message node1 test|reset|poweroff
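For example, with the assumed device and node names, you can send a harmless test message and then check on node1 that the sbd daemon logged it (the unit name may differ by distribution):

sbd -d /dev/sdb1 message node1 test

journalctl -u sbd | tail     (run this on node1)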

 

to clear a poison pill manually from a node slot – you have to do this if a node is fenced and has not processed the poison pill properly – else it will crash again on rebooting:

 

sbd -d /dev/whatever message node clear

 

ESSENTIAL if you have set SBD_STARTMODE="clean"

 

but in the worst case, if you don't do this, the node will boot a second time and on that second boot it should clear the poison pill.

 

Use fence_xvm -o list on the KVM hypervisor host to display information about your nodes

 

An important additional point about SBD and DRBD 

 

The external/sbd fencing mechanism requires the SBD disk partition to be readable directly from each node in the cluster.

 

For this reason,  a DRBD device must not be used to house an SBD partition.

 

However, you can deploy SBD fencing mechanism for a DRBD cluster, provided the SBD disk partition is located on a shared disk that is neither mirrored nor replicated.
