About this guide
This document focuses on a design where XEN virtual machines (domUs) are centrally managed according to a set policy and kept highly available. What is written here is the real thing, not a demo, and has been working in production for over 18 months. We use this technology for hosting many UNIX services including DNS, Web proxy, SMTP, NFS, VPN, CUPS, etc., and traditional NetWare services with OES2, although in this document I only present the solution for hosting home directories for UNIX users via NFS (HASI).
It's inspired by and based on the original Novell demo introduced by Jo De Baer, to whom I am thankful for his support of and effort on this piece of technology.
http://wiki.novell.com/images/3/37/Exploring_HASF.pdf
http://wiki.novell.com/images/c/c8/Tut323_bs2007.pdf
HASI (High Availability Storage Infrastructure) is fully supported by Novell on SLES10:
http://www.novell.com/products/server/ha_storage.html
Please read the guides above to gain a better understanding of this technology. This guide assumes that you already have hands-on experience with SLES10, Heartbeat, EVMS and the other components, and it is intended for experienced system administrators.
Overview
We have a 2-node XEN cluster (HP DL360G5) running on small local SAS storage, with connectivity to our fibre channel SAN where all our resources (XEN virtual machines) reside. We run SLES10 SP2 on both cluster member nodes and configure Heartbeat (high availability) services on top of the member nodes (XEN hosts). We set a policy which treats each virtual machine as an individual primitive resource, monitors each virtual machine alongside its network connectivity and acts according to events.
Point
This is a 2-node cluster, which isn't a big deal to manage, but HA (Heartbeat) supports up to 16 nodes in one cluster, and sharing storage between that many nodes without coordination is very dangerous. Because HA is configured on all cluster member nodes, it tracks each resource (XEN domU) and, through the monitoring operation, it knows where a certain resource is running, what it's doing, etc.
It protects you from your own mistakes, for instance starting a second instance of an already running virtual machine, which would corrupt its storage instantly.
HA is your control center; you must do everything there, including starting, stopping and migrating your virtual machines, otherwise HA gets confused. It's not just the safe way of managing virtual machines, it's also very good for DR and business continuity.
Should you lose one of your virtual machines, should one of them crash or freeze your host, or should your LAN switch go faulty, HA will restart (stop/start) or migrate (even live, if possible) your XEN domU resource to another healthy host with all its dependent services.
Remember that you have to eliminate single points of failure and provide as much redundancy as possible for everything else in this scenario. Resources can only be highly available if you have redundancy in your storage, servers, switches, power supply (UPS and emergency generator), HA communication paths, etc. as well.
Storage
Both hosts see both LUNs, even if the diagram is a bit confusing.
I have LUN2 (currently 100GB) for the virtual machines. I split this up with EVMS, and through a CSM container we can share the same block-based storage layer between both XEN hosts (HP DL360G5).
It's somewhat simpler than a file-image-based setup, where you'd need an extra layer to mount the images; this setup is also well tested and works seamlessly even for live migration, not to mention the unbeatable I/O performance.
LUN1 (currently 2T) is for user data (home directories), which I do NOT manage on the XEN hosts. I actually forced EVMS to manage (be able to see) only the 100GB LUN2, because I want to manage the 2T array within my NFS virtual machine. We simply map LUN1 to our NFS virtual machine.
What makes this advanced compared to the original Novell demo is as follows:
- Fiber channel SAN for storage (HP EVA 6000)
- SLES10 SP2 (refreshed HA, EVMS, XEN, OCFS2, etc. components)
- Block device based XEN virtual machines (unbeatable I/O performance)
- Live migration ready XEN virtual machines
- HA includes network connection monitoring
- HA includes adjusted timing for XEN virtual machines (hypervisor friendly)
- HA includes HP iLO STONITH resource for increased protection (fencing)
Configuration
I installed a fairly cut-down copy of SLES10 SP2 on my XEN hosts, the LUNs are presented to both nodes, etc. I configured the first NIC (eth0) with a class C private IP address, which will be the main connection back to our private LAN. The other NIC is configured with a class A private IP address which I use solely for HA communication. I simply connected these to multiple switches (redundancy), but if you have your hosts close to each other you could use a crossover cable as well. I already had several virtual machines, hence building them is not covered in this guide; there are plenty of notes about that already. I run mainly SLES10 SP2 XEN virtual machines where possible, although I have some Debian Linux 4.0 virtual machines as well.
- NTP Setup
The time on the two physical machines needs to be synchronized; several components in the HASF stack require this. I have configured both nodes to use our internal NTP servers (3 of them) in addition to the other node, which gives us fairly decent redundancy.
host1:~ # vi /etc/sysconfig/ntp
NTPD_INITIAL_NTPDATE="ntp2.domain.co.nz ntp3.domain.co.nz ntp1.domain.co.nz"
NTPD_ADJUST_CMOS_CLOCK="no"
NTPD_OPTIONS="-u ntp -L -I eth0 -4"
NTPD_RUN_CHROOTED="yes"
NTPD_CHROOT_FILES=""
NTP_PARSE_LINK=""
NTP_PARSE_DEVICE=""
NTPD_START="yes"
Remember that after making changes to files under /etc/sysconfig from the command line, you need to run SuSEconfig to apply the changes:
host1:~ # SuSEconfig
host1:~ # vi /etc/ntp.conf
server 127.127.1.0
fudge 127.127.1.0 flag1 0 flag2 0 flag3 0 flag4 0 stratum 5
server ntp2.domain.co.nz iburst
server ntp3.domain.co.nz iburst
server ntp1.domain.co.nz iburst
server host2.domain.co.nz iburst
driftfile /var/lib/ntp/drift/ntp.drift
logfile /var/log/ntp
This was set up by the YaST GUI module and includes mainly the defaults. I added the servers with the iburst option to reduce the initial sync time and made the local NTP server stratum 5.
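Once ntpd has been restarted with the new settings, a quick check (a suggestion, not part of the original setup) is to list the peers and confirm that the internal servers and the other node are being used:
host1:~ # rcntp restart
host1:~ # ntpq -p
An asterisk in front of a peer in the ntpq output marks the server ntpd is currently synchronized to.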
Both nodes need to reach each other without DNS, so I added the iLO IP addresses here too:
host1:~ # vi /etc/hosts
192.168.1.1 host1.domain.co.nz host1
192.168.1.2 host2.domain.co.nz host2
10.0.1.1 host1.domain.co.nz host1
10.0.1.2 host2.domain.co.nz host2
172.16.1.1 host1-ilo.domain.co.nz host1-ilo
172.16.1.2 host2-ilo.domain.co.nz host2-ilo
Certainly this needs to be done on the other node as well, in the same way, with the addresses adjusted accordingly.
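A quick way to confirm that the entries resolve without DNS (just a suggested check) is to query them through the resolver and ping the iLO interfaces by name:
host1:~ # getent hosts host2 host2-ilo
host1:~ # ping -c 1 host2-ilo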
- Multipathing
I have redundant SAN switches and controllers, hence for proper redundancy we need to configure this service; without it, the duplicated paths could also confuse EVMS. There's a guide from HP, but it requires the HP drivers to be installed. I prefer using the SuSE stock kernel drivers because they are maintained by Novell and work pretty much out of the box. Using the HP one may require you to reinstall or update the HP drivers every time you receive a new kernel update. Tools we need:
host1:~ # rpm -qa | grep -E 'mapper|multi'
device-mapper-1.02.13-6.14
multipath-tools-0.4.7-34.38
Find out what parameters the stock kernel driver supports.
host1:~ # modinfo qla2xxx | grep parm
We need as quick a response as possible in an emergency, hence we instruct the stock driver to disable the HBA's built-in failover retries and propagate failures up to the dm I/O layer. The stock driver does support this, as shown by the list above. Activate it:
host1:~ # echo "options qla2xxx qlport_down_retry=1">> /etc/modprobe.conf.local
Update the ramdisk image then reboot the server:
host1:~ # mkinitrd && reboot
After reboot, ensure that modules for multipathing are loaded:
host1:~ # lsmod | grep 'dm'
dm_round_robin 7424 1
dm_multipath 30344 2 dm_round_robin
dm_mod 67504 39 dm_multipath
Your SAN devices should be visible by now (2 for each LUN), in my case /dev/sda, sdb, sdc and sdd. Note: this will change dynamically when you add additional LUNs to the machine or get any other disk managed by dm.
Find out your WWID numbers; they are needed for the multipath configuration:
host1:~ # for disk in sda sdb sdc sdd; do scsi_id -g -s /block/$disk; done
3600508b4001046490000700000360000
3600508b40010464900007000000c0000
3600508b4001046490000700000360000
3600508b40010464900007000000c0000
Multipathing is somewhat complex and not easy to understand at first, but we need to pay attention to it, otherwise it will give us a hard time. It's well documented in the SLES storage guide:
http://www.novell.com/documentation/sles10/stor_evms/data/bookinfo.html
The problem I found is that dm tries to take over and manage pretty much every block device (after the SP2 update), therefore we need to blacklist everything except the SAN LUNs, including CD/DVD drives, local cciss (HP SmartArray) devices, etc.
Configure the multipath service according to your WWID numbers. Remember we have duplicates because we have two paths for each LUN:
host1:~ # vi /etc/multipath.conf
#
## Global settings: should be sufficient, change it on your own risk
#
defaults {
multipath_tool "/sbin/multipath -v0"
udev_dir /dev
polling_interval 5
default_selector "round-robin 0"
default_path_grouping_policy multibus
default_getuid_callout "/sbin/scsi_id -g -u -s /block/%n"
default_prio_callout /bin/true
default_features "0"
rr_min_io 100
failback immediate
}
#
## Local devices MUST NOT be managed by multipathd
#
blacklist {
devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
devnode "^hd[a-z][0-9]*"
devnode "^cciss!c[0-9]d[0-9]*"
devnode "^sd[e-z][0-9]*"
devnode "^xvdp*"
}
multipaths {
#
## Mpath1: Storage area for user's home within NFS domU
#
multipath {
wwid 3600508b40010464900007000000c0000
alias mpath1
path_grouping_policy multibus
path_checker readsector0
path_selector "round-robin 0"
}
#
## Mpath2: Storage area for clustered domUs via EVMS
#
multipath {
wwid 3600508b4001046490000700000360000
alias mpath2
path_grouping_policy multibus
path_checker readsector0
path_selector "round-robin 0"
}
}
#
## Device settings: should be sufficient, change it on your own risk
#
devices {
device {
vendor "HP"
product "HSV200"
path_grouping_policy group_by_prio
getuid_callout "/sbin/scsi_id -g -u -s /block/%n"
path_checker tur
path_selector "round-robin 0"
prio_callout "/sbin/mpath_prio_alua %n"
failback immediate
rr_weight uniform
rr_min_io 100
no_path_retry 60
}
}
Enable services upon reboot:
host1:~ # insserv boot.device-mapper boot.multipath multipathd
Remember that every time you change the multipath.conf file you must rebuild your initrd and reboot the server:
host1:~ # mkinitrd -f mpath && reboot
After reboot check your multipaths:
host1:~ # multipath -ll
mpath2 (3600508b4001046490000700000360000) dm-1 HP,HSV200
[size=100G][features=1 queue_if_no_path][hwhandler=0]
\_ round-robin 0 [prio=2][active]
\_ 0:0:1:1 sdc 8:32 [active][ready]
\_ 0:0:0:1 sda 8:0 [active][ready]
mpath1 (3600508b40010464900007000000c0000) dm-0 HP,HSV200
[size=2.0T][features=1 queue_if_no_path][hwhandler=0]
\_ round-robin 0 [prio=2][active]
\_ 0:0:1:2 sdd 8:48 [active][ready]
\_ 0:0:0:2 sdb 8:16 [active][ready]
Check blacklists with verbose output:
host1:~ # multipath -ll -v3
For further information on HP technology please refer to the original HP guide:
http://h20000.www2.hp.com/bc/docs/support/SupportM...
Do exactly the same on the other node as well. I copied the multipath.conf over to the other host followed by setting up the services as shown above.
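For reference, this is roughly what that amounts to (assuming root ssh access between the hosts is already in place):
host1:~ # scp /etc/multipath.conf host2:/etc/multipath.conf
host2:~ # insserv boot.device-mapper boot.multipath multipathd
host2:~ # mkinitrd -f mpath && reboot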
Multipath for root device:
If you want to boot your XEN host off the SAN, you would have an extra LUN for the host's OS. The method is to install the system onto one of the disks of your OS LUN; once you have a running system you can build multipath on top of that. The only difference is that you will not create an alias for your host's OS disk LUN!
Further information:
http://www.novell.com/documentation/sles10/stor_ev...
http://www.novell.com/support/php/search.do?cmd=di...
- Heartbeat
Note: HA has been going through a transformation lately in order to support both the OpenAIS and Heartbeat cluster stacks equally. The resource manager (crm) got extracted out of the HA package and became an individual project named Pacemaker.
What you see in SLES10 at the time of writing (version 2.1.4) is a special Novell port for SLES10 customers only, bundled with the new features and bug fixes added since the change in the project. Ultimately this will change in SLES11: HA will be replaced with OpenAIS and will follow the same packaging and naming convention as the recent changes in the project.
More information:
http://www.novell.com/linux/volumemanagement/strategy.html
http://www.clusterlabs.org
Heartbeat (referred to as HA) is a very powerful, versatile, open source clustering solution for Linux. SuSE and IBM are big contributors to this project (code), for which I am personally thankful.
HA will be our central database and control center; it will manage our resources (domUs) and their dependencies according to a set policy. Some services can be configured via LSB scripts for certain runlevels, but basically HA will take this over for most services necessary for domU management. By the way, EVMS, which we will configure in a minute, doesn't maintain cluster memberships itself; we need HA to actually maintain memberships and activate EVMS volumes upon startup on our nodes.
Install heartbeat package first (there's plenty of ways doing this):
host1:~ # yast2 sw_single &
Change the filter to Patterns, then select the entire High Availability group, since we will need other components of it later on. (Ignore the disk section in the screenshot; it's not a picture of the actual server.)
HA at present still supports v1 configuration (haresources file) but we use the new v2 style (crm with XML files) known to be better and more powerful.
In this section I only show the initial setup of HA, resources and the complete cluster setup will be discussed later on. Since evmsd is started by HA, I have to present this part before I can discuss EVMS volumes and disk configurations.
Configuration:
host1:~ # vi /etc/ha.d/ha.cf
udpport 694 # Default
use_logd on # Need powerful logging
keepalive 2 # Interval of the members checking each other
initdead 50 # Patience for services starting up after boot
deadtime 50 # Time to clarify a member host dead
deadping 50 # Time to clarify ping host dead
autojoin none # We add members manually only
crm true # We want v2 configuration
coredumps true # Useful
auto_failback off # Prefer resources staying wherever they are
ucast eth0 192.168.1.2 # Other member host to talk to (NIC1)
ucast eth1 10.0.1.2 # Other member host to talk to (NIC2)
node host1 host2 # Cluster member host names (hostnames - not FQDN)
ping 192.168.1.254 # Ping node to talk to
apiauth evms uid=hacluster,root
respawn root /sbin/evmsd
The last two lines (apiauth and respawn) tell Heartbeat to bring up evmsd at startup. evmsd is a remote extension of the EVMS engine, without the brain. I configured unicast simply because I prefer it over broadcast and multicast. Last but not least, I configured one ping node because I only care about the one connection back to my private LAN; the other NIC is reserved for HA communication only.
Configure authentication for the member nodes' communication:
host1:~ # sha1sum
yoursecretpassword
7769bf61f294d7bb91dd3583198d2e16acd8cd76 -
host1:~ # vi /etc/ha.d/authkeys
auth 1
1 sha1 7769bf61f294d7bb91dd3583198d2e16acd8cd76
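Heartbeat refuses to start if the authkeys file is readable by anyone other than root, so make sure the permissions are tightened:
host1:~ # chmod 600 /etc/ha.d/authkeys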
Configure logging:
host1:~ # vi /etc/ha.d/ha_logd.cf
logfacility local7
debugfile /var/log/ha-debug
logfile /var/log/ha-log
host1:~ # ln -s /etc/ha.d/ha_logd.cf /etc/logd.cf
Configure log rotation for your HA logs:
host1:~ # vi /etc/logrotate.d/heartbeat
/var/log/ha-debug {
weekly
missingok
compress
rotate 4
copytruncate
}
/var/log/ha-log {
weekly
missingok
compress
rotate 4
copytruncate
}
The reason why we need internal logging from HA is that on busy clusters there's a good chance of losing log messages otherwise.
Logs are a vital part of a well functioning system, so for that reason I have a central log server where I am going to send all cluster member node logs, alongside the local files.
You don't need this part if you are not planning to use central logging.
host1:~ # vi /etc/syslog-ng/syslog-ng.conf.in
options { long_hostnames(off); sync(0); perm(0640); stats(3600);
check_hostname(no); dns_cache(yes); dns_cache_size(100);
log_fifo_size(4096); keep_hostname(yes); chain_hostnames(no); };
-snip-
## ------------------------------------- ##
## HA logs sent to remote server as well ##
## ------------------------------------- ##
filter f_ha { facility(local7) and not level(debug); };
destination ha_tcp { tcp("yourserver" port(514)); };
log { source(src); filter(f_ha); destination(ha_tcp); };
I modified the global options as shown above for performance reasons and added the second part to the bottom of the configuration file. I used my server's DNS name and excluded debug info to my own taste; you may want to consider those too. When done, rebuild syslog-ng's configuration file:
host1:~ # SuSEconfig --module syslog-ng
Do not forget to prepare your remote syslog server for these log entries! It's a bit outdated and doesn't include performance settings but related reading:
http://wiki.linux-ha.org/SyslogNgConfiguration
Start and enable HA at boot:
host1:~ # rcheartbeat start && insserv heartbeat
Don't forget the other node (everything is the same except the IP addresses):
host2:~ # vi /etc/ha.d/ha.cf
-snip-
ucast eth0 192.168.1.1 # Other member host to talk to (NIC1)
ucast eth1 10.0.1.1 # Other member host to talk to (NIC2)
-snip-
Configure logging, authentication, everything just like for the first host then:
host2:~ # rcheartbeat start && insserv heartbeat
Ensure they see each other before proceeding; they may need a few minutes to get in sync:
host1:~ # crmadmin -N
normal node: host1 (91275fec-a9f4-442d-9875-27e9b7233f33)
normal node: host2 (d94a39a4-dcb5-4305-99f0-a52c8236380a)
host2:~ # crmadmin -N
normal node: host1 (91275fec-a9f4-442d-9875-27e9b7233f33)
normal node: host2 (d94a39a4-dcb5-4305-99f0-a52c8236380a)
- EVMS
EVMS is a great, open source, enterprise class volume manager, again with significant support from IBM and SuSE. It includes a feature called CSM (Cluster Segment Manager) which we use to manage shared LUNs and distribute the block devices (partitions) and the complete storage arrangement between the dom0 nodes identically. On top of CSM we use LVM2 volume management to create, resize and extend logical volumes.
Since you could have a different LVM2 or EVMS arrangement within your domUs, I include in the evms.conf file only the device I want to manage on the XEN host. This is the 100GB LUN for the virtual machines only. I want to hide the rest from the host system; I don't want EVMS to discover or interfere with other disks I am not planning to use or manage from the XEN host.
The multipath -ll command (earlier) tells you the device-mapper (referred to as dm) device name you need for this.
Note: this numbering changes as you add or remove disks, LUNs, etc. When you do so, it's advised to reboot the host to see what the new dm layer looks like, then update the EVMS configuration accordingly.
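If you are ever unsure which dm-N name belongs to which multipath alias, you can cross-check it (a suggested check, not part of the original procedure) with device-mapper itself; the minor number shown corresponds to the N in dm-N reported by multipath -ll:
host1:~ # dmsetup ls
host1:~ # multipath -ll | grep dm-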
host1:~ # vi /etc/evms.conf
engine {
mode = readwrite
debug_level = default
log_file = /var/log/evms-engine.log
metadata_backup_dir = /var/evms/metadata_backups
auto_metadata_backup = yes
remote_request_timeout = 12
}
-snip-
sysfs_devices {
include = [ dm-1 ]
exclude = [ iseries!vcd* ]
}
-snip-
It's a very good idea to set up automatic metadata backup; just remove the comment from the corresponding line as shown above.
Remember that when you save the configuration you update the metadata, and the utilities read the metadata from the disk. The evms.conf file really only controls the global behavior of the engine.
LVM2 versus EVMS:
Remember that I have another, 2T LUN which I want to manage with LVM2 solely within the NFS domU, hence I also disabled LVM2 on the XEN host to avoid interfering with the EVMS managed disks and with the domU's LVM2 configuration later on.
The reason is that when you map a block device, file image, CDROM, etc. to a XEN virtual machine, it appears on the XEN host just like it does within the domU. You can see pretty much the same thing from both sides (you can mount it, partition it, etc.), hence I always make sure that I only manage disks on the XEN host which I actually need to manage on the host. In a complex environment this could create confusion...
Change the filter to exclude everything from LVM2 discovery:
host1:~ # vi /etc/lvm/lvm.conf
devices {
dir = "/dev"
scan = [ "/dev" ]
filter = [ "r|.*|" ]
cache = "/etc/lvm/.cache"
write_cache_state = 1
sysfs_scan = 1
md_component_detection = 1
}
-snip-
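With the filter in place, LVM2 on the XEN host should no longer report any physical volumes or volume groups. A quick sanity check I would run (not part of the original procedure):
host1:~ # pvscan
host1:~ # vgscan
Both commands should come back without finding anything on the host; the 2T LUN will only be visible to LVM2 inside the NFS domU.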
File images versus block devices:
So why the big fuss, what is wrong with the original HASI design, why don't we use file image based virtual machines?
Well, it's a long story. It started more than 2 years ago when I first began playing with this technology. At that time the loop mounted file images were slow; we simply couldn't afford that. As of today we have the blktap driver shipped with SLES10 SP2, providing nearly native disk I/O performance on file images out of the box.
http://wiki.xensource.com/xenwiki/blktap
You could say, then, that this EVMS complexity is unnecessary and makes the domU unportable, which I would disagree with. I think having file images adds one extra layer of complexity to our storage and involves OCFS2. I had issues with it often, especially when new versions came along, hence I only use it for my configstore.
The other reason is that block devices are, just like file images, merely virtual or special files. I can create a copy of them any time with dd and redirect the output to a file, creating an identical copy of my block device (partition); it's just a special type of file. Last but not least, it has been working for nearly 2 years in production without a single glitch.
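As an illustration of that point, a block device backed domU can be copied with nothing more than dd. This is only a sketch: the target path /backup is hypothetical, and the domU should be shut down (or its file systems quiesced) before copying to get a consistent image:
host1:~ # dd if=/dev/evms/san2/vm2 of=/backup/nfs-xvda.img bs=1M
Restoring is the same command with if= and of= swapped.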
Now, we can create the volumes. Note: I am going to present my configuration here just for reference. If you need a step by step guide please read this document:
http://wiki.novell.com/images/0/01/CHASF_preview_Nov172006.pdf
I strongly recommend to visit the project's home as well:
http://evms.sourceforge.net
http://evms.sourceforge.net/clustering
(Screenshots of the EVMS layout: Disks, Segments, CSM container with LVM2 on top, Regions, Volumes.)
After you have created (with evmsn or evmsgui utilities) your EVMS volumes, save the configuration. To activate changes (create the devices on the file system) on all XEN hosts immediately, we need to run evms_activate on every other node simply because the default behavior of EVMS is to apply changes upon the local node only.
I have 2 nodes at this stage and I want to activate only the other node:
host1:~ # evms_activate -n host2
This is where the evmsd process we started with HA becomes important. It's our engine handler on the remote node; without it we wouldn't be able to create the devices on the remote hosts' file systems.
What if I had 16 nodes? It would be a bit overwhelming so a quick solution to do this on all nodes: (there are many other ways of doing this)
host1:~ # for n in `grep node /etc/ha.d/ha.cf | cut -d ' ' -f2-`; do evms_activate -n $n; done
If you are unfamiliar with the power of UNIX shell, yes it's a funny looking one line command above.
Reference: http://wiki.xensource.com/xenwiki/EVMS-HAwSAN-SLES10
The future of EVMS in SLES:
It will disappear in SLES11. Novell will support it for the lifetime of SLES10, and probably the same for OES2, since NSS volumes can at this stage only be created with EVMS. I did ask Novell about the transition from EVMS to cLVM (when you upgrade from SLES10 to SLES11) and, as usual, there will be tools, procedures and support for this.
It's your call whether you want to use a technology that is discontinued or unsupported in future releases, but as mentioned on the page below, customers shouldn't be put off by this decision and are still encouraged to use it.
More information:
http://www.novell.com/linux/volumemanagement/strategy.html
- OCFS2 cluster file system for XEN domU configurations
We need a fairly small volume for XEN virtual machine configurations. The best for this is OCFS2, Oracle's cluster file system. We will mount this under the default /etc/xen/vm directory on all member nodes. Both XEN nodes will see the same files, both will be able to make changes or create new ones on the same file system.
I will not provide a step-by-step solution for this; it's been discussed many times and there's a lot about it out there already.
http://wiki.novell.com/images/3/37/Exploring_HASF.pdf (page 72.)
http://www.novell.com/coolsolutions/feature/18287.html (section 6.)
http://wiki.novell.com/index.php/SUSE_Linux_Enterprise_Server#High_Availability_Storage_Infrastructure
Outline:
- create a small EVMS volume (32MB is plenty); the actual device on my cluster is /dev/evms/san2/cfgpool
- create the cluster configuration (vi /etc/ocfs2/cluster.conf or the ocfs2console GUI)
- ensure the configuration is identical on all other XEN hosts, at the same location (the GUI's Propagate Configuration option copies the created config to all nodes via ssh)
- enable user space OCFS2 cluster membership settings on all nodes
- create the OCFS2 file system (a sketch follows this list)
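For that last step, this is roughly how I would create the file system; treat it as a sketch and adjust the label and the number of node slots (-N) to your own cluster:
host1:~ # mkfs.ocfs2 -N 2 -L cfgpool /dev/evms/san2/cfgpool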
That's all you need to do for now. The last step will be to integrate OCFS2 into the Heartbeat v2 cluster (mount the device under /etc/xen/vm), which I will discuss in the next chapter.
Configuration for reference:
host1:~ # vi /etc/ocfs2/cluster.conf
node:
        ip_port = 7777
        ip_address = 192.168.1.1
        number = 0
        name = host1
        cluster = ocfs2

node:
        ip_port = 7777
        ip_address = 192.168.1.2
        number = 1
        name = host2
        cluster = ocfs2

cluster:
        node_count = 2
        name = ocfs2
The configuration is identical with the other XEN host.
In SLES10 SP1 there was a timing issue with Heartbeat-managed OCFS2 on XEN hosts. In SLES10 SP2 the defaults changed to fix this, but it is worth mentioning the solution in this guide. (You probably don't need to do this.)
Sometimes the networking on a XEN host takes more time to come up (remember that xend modifies the network configuration, creates virtual bridges, etc.), maybe your switches are busy or STP is enabled, or perhaps something else causes a slight delay. Nevertheless, OCFS2 is very sensitive to this. We need to ensure that OCFS2 gives the members enough time to handshake if there's a delay on the network:
host1:~ # vi /etc/sysconfig/o2cb
-snip-
O2CB_HEARTBEAT_THRESHOLD=50
O2CB_HEARTBEAT_MODE="user"
O2CB_IDLE_TIMEOUT_MS=30000
O2CB_RECONNECT_DELAY_MS=2000
O2CB_KEEPALIVE_DELAY_MS=5000
Ensure that OCFS2 is restarted if settings changed and services are turned on:
host1:~ # SuSEconfig
host1:~ # rco2cb restart
host1:~ # insserv o2cb
host1:~ # insserv ocfs2
host1:~ # chkconfig -l o2cb
o2cb 0:off 1:off 2:on 3:on 4:off 5:on 6:off
host1:~ # chkconfig -l ocfs2
ocfs2 0:off 1:off 2:on 3:on 4:off 5:on 6:off
This needs to be applied upon all XEN hosts, related Novell TID:
http://www.novell.com/support/php/search.do?cmd=displayKC&docType=kc&externalId=7001469&sliceId=2&docTypeID=DT_TID_1_1&dialogID=13902114&stateId=0%200%2013900539
- Heartbeat Cluster Configuration
HA is blank, empty at this stage. We configured it to bring the evmsd process up when it starts and told it which other cluster members it needs to talk to; the nodes should already be in sync. This chapter explains the cluster's operation, the policy, which resources we want HA to manage and how they should react to certain events. I shall try to explain everything as clearly as possible, but there will be details not covered in this guide.
One of the biggest issues with HA is the documentation. There is some, but it's usually outdated and hard to find. I understand that the project is trying hard to make it better, but it's still far from good; the product is complex and changing rapidly. The best documentation I found was the Novell one, and I encourage everybody to read it to gain a decent understanding of the product:
http://www.novell.com/it-it/documentation/sles10/heartbeat/data/b3ih73g.html
The best possible source of information is still the mailing list though, you will want to join or at least browse the archives if you are serious about HA: http://wiki.linux-ha.org/ContactUs
As mentioned previously, we will use the new v2 (crm) type configuration with XML files. It's not as nice when it comes to reading as a text file, but easy to get used to and any decent text editor nowadays can recognize XML, help you in the syntax with highlighting, etc.
Outline:
- create and save XML entry for each resource or policy
- load them into the cluster one by one
- monitor the cluster for reaction
- backup the final (complete) configuration
Note: HA does have a GUI (hb_gui) interface, but as of today it's really just for basic operations; it's still not useful for complex configurations. I only use it for monitoring, perhaps to start/stop a resource or to put a node into standby. Therefore the configuration method presented in this guide will be mainly CLI (command line) based.
The cluster configuration is replicated amongst all member nodes therefore you don't need to do this from other nodes, it has to be done once and from any node although my preference is always the DC (designated controller) node. You can find this information from monitoring commands (hb_gui, crm_mon) or alternatively:
host2:~ # crmadmin -D
Designated Controller is: host1
Global settings:
HA has very good default configuration/behavior, therefore we have very little to change here:
host1:~ # vi cibbootstrap.xml
<cluster_property_set id="cibbootstrap">
<attributes>
<nvpair id="cibbootstrap-01" name="cluster-delay" value="60"/>
<nvpair id="cibbootstrap-02" name="default-resource-stickiness" value="INFINITY"/>
<nvpair id="cibbootstrap-03" name="default-resource-failure-stickiness" value="-500"/>
<nvpair id="cibbootstrap-04" name="stonith-enabled" value="true"/>
<nvpair id="cibbootstrap-05" name="stonith-action" value="reboot"/>
<nvpair id="cibbootstrap-06" name="symmetric-cluster" value="true"/>
<nvpair id="cibbootstrap-07" name="no-quorum-policy" value="stop"/>
<nvpair id="cibbootstrap-08" name="stop-orphan-resources" value="true"/>
<nvpair id="cibbootstrap-09" name="stop-orphan-actions" value="true"/>
<nvpair id="cibbootstrap-10" name="is-managed-default" value="true"/>
</attributes>
</cluster_property_set>
The id="something" values are names given by you; they can be anything, although you should keep them meaningful and easy to read (and keep the indents), without symbols, etc.
What is worth mentioning here is that we enable STONITH with a default action of reboot. It's our power switch, which ensures that in case of a node failure such as a reboot, a freeze, a network issue or any occasion when Heartbeat stops receiving signals from the other node, the misbehaving node is rebooted. It is extremely important, then, that you have multiple communication paths (multiple NICs for example) between your nodes to avoid serious problems.
For more information: http://wiki.linux-ha.org/SplitBrain
The resource-stickiness setting ensures that if a failed node comes back online after a reboot, a resource (virtual machine) which was moved will NOT move back to its original location (where it was started). It's a safety feature and saves you from having resources bounce between failing nodes.
Generally a resource stays where it was started unless its originating node was rebooted or put into standby, or the resource was forced to move by the administrator. HA will always try to keep balance and harmony in your cluster. This means that if, for example, you have 10 domUs to load into the cluster, or just start up a brand new one, HA will balance them amongst all nodes (2-node cluster = 5 each), unless you configure the policy with preferred locations for certain domUs. That sort of thing is out of the scope of this guide because it's not very useful for XEN clustering, which this guide is supposed to be about, or at least it wasn't for me.
For more information about preferred locations, please visit this page: http://wiki.linux-ha.org/ciblint/crm_config
Load the created XML file into the cluster:
host1:~ # cibadmin -C -o crm_config -x cibbootstrap.xml
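If you want to double check what actually went into the CIB, you can query that section back (a suggested check):
host1:~ # cibadmin -Q -o crm_config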
STONITH resources:
So far we have just enabled this globally in the cluster; we don't have power switches yet. These are like daemons running on each member node which execute the reboot command, depending on the STONITH resource type and the global action set. Remember that we are talking about rebooting XEN cluster member nodes, not resources (virtual machines).
HA ships with a test STONITH agent which executes the reboot via ssh. It's not for production use, but I configured it because I'd rather have more than one. It can do the job as long as the failing node is still responding (not frozen).
Failing as a term can be anything in HA. You may have a perfectly working XEN cluster member node, but if, for example, HA cannot start a new domU (resource) on that node (because you made a typo in the XEN configuration file), that counts as severe from HA's point of view, because by default its job is to keep all your resources up and running. As a result it would migrate (or stop and start) all existing resources from the node where the startup failed to another node and would reboot the misbehaving node. Of course it would then try starting that resource on the other node, where it would fail too. Once the rebooted node came back online, HA would migrate all resources away from the other node and try the new resource with the typo again; of course it would fail again, so from this point on it waits for admin interaction, keeps running what is healthy and marks the misbehaving resource as failed. This can take a significant amount of time if you have more nodes, as it tries the same thing on all nodes, one by one. It may not be an issue in a test environment but it can be severe in production.
Outline:
- always triple check all configuration changes, entries, etc. before activation
- always check dependent configuration files (XEN domU), etc. you refer to
- perhaps set the resource to not managed so HA will not bother if it fails to start
- put the resource back to managed mode if everything is working as expected
ssh test STONITH agent:
We need passwordless ssh login (by ssh keys) between all nodes for the root user. Setting this up is out of the scope of this document (the original HASI guide discusses it); before you configure the agent, ensure that you can log into all nodes as root without a password from all nodes.
host1:~ # vi stonithcloneset.xml
<clone id="STONITH" globally_unique="false">
<instance_attributes id="STONITH-ia">
<attributes>
<nvpair id="STONITH-ia-01" name="clone_max" value="2"/>
<nvpair id="STONITH-ia-02" name="clone_node_max" value="1"/>
</attributes>
</instance_attributes>
<primitive id="STONITH-child" class="stonith" type="external/ssh" provider="heartbeat">
<operations>
<op id="STONITH-child-op-01" name="monitor" interval="5s" timeout="20s" prereq="nothing"/>
<op id="STONITH-child-op-02" name="start" timeout="20s" prereq="nothing"/>
</operations>
<instance_attributes id="STONITH-child-ia">
<attributes>
<nvpair id="STONITH-child-ia-01" name="hostlist" value="host1,host2"/>
</attributes>
</instance_attributes>
</primitive>
</clone>
It's a clone resource, meaning that we will have a running copy on each member node, and their characteristics will be identical. As shown, I configured the maximum number of clones (the number of nodes I have) and the maximum copies per node (usually one on each node). We set up monitoring as well: should something happen to my ssh daemon and this resource can no longer log into one of my nodes, I will be notified about it.
Load the resource into the cluster:
host1:~ # cibadmin -C -o resources -x stonithcloneset.xml
More information: http://wiki.linux-ha.org/v2/Concepts/Clones
The last thing is to enable the at daemon. The way the ssh STONITH agent works is that in case of a node failure, an intact node logs into the failing one (assuming that's still possible) and schedules a reboot via the at daemon:
host1:~ # insserv atd && rcatd start
riloe STONITH agent:
The next thing is to configure a much more production-ready STONITH agent for my servers. The best is to use something which is independent of the operating system, like iLO on HP servers. At the time of writing, HA comes with agents for most hardware vendors, such as HP, IBM, etc.
So which STONITH agent will actually act and execute the reboot when disaster strikes, if I have more than one STONITH agent? The answer is any of them; HA will pick one randomly.
The common design approach is to configure STONITH agents as clone resources. For clusters with many nodes this actually causes minor issues, because:
The iLO resource is configured on all nodes (even the one where the iLO is physically installed). It makes sense to log into the failing node from another node and execute the reboot, right? (Suicide is not actually allowed by default.) So when the monitoring operation is due (every 30 seconds or whatever you set), these agents on all nodes will try logging into the iLO device.
Occasionally a race condition evolves when 2 nodes try to log into the same iLO device, which they can't, causing weird behavior, errors, etc.
According to a recent discussion on the linux-ha mailing list, this should be fixed by now regardless of which method you use, but I just couldn't see any point in having a copy on the node where the iLO device is physically installed, even if suicide is not allowed (safe).
I think it's nonsense to run an iLO STONITH agent on all nodes, regardless of how many nodes you have, because we only need one on a healthy node; the DC (the stonithd on the DC always receives the fencing request) will instruct the node where the agent is running to execute the reboot on the corresponding iLO device (installed on the failing node). Hence I took a different approach due to the nature of the iLO device and:
- created one primitive iLO STONITH resource
- configured the cluster to run this anywhere but on the node where it is installed
On a 2-node cluster this will obviously be the other node, but on a many-node cluster it could run anywhere depending on node availability, cluster load, etc. This solution has been working seamlessly for me for quite some time.
Create the policy first for iLO on host1:
host1:~ # vi stonithhost1_constraint.xml
<rsc_location id="STONITH-iLO-host1:anywhere" rsc="STONITH-iLO-host1">
<rule id="STONITH-iLO-host1:anywhere-r1" score="-INFINITY">
<expression id="STONITH-iLO-host1:anywhere-r1-e1" attribute="#uname" operation="eq" value="host1"/>
</rule>
</rsc_location>
As usual we create a unique id for the rule, then tell the cluster that the STONITH-iLO-host1 resource (which doesn't exist yet) has a score of -INFINITY on the matching node. In HA, preferences are always expressed by scores, and -INFINITY is the lowest possible one, meaning "must never run here". Within the rule we then create an expression (also with a unique id) telling the cluster where the rule applies: the node where that iLO interface is physically installed.
Create the policy now for iLO on host2:
host1:~ # vi stonithhost2_constraint.xml
<rsc_location id="STONITH-iLO-host2:anywhere" rsc="STONITH-iLO-host2">
<rule id="STONITH-iLO-host2:anywhere-r1" score="-INFINITY">
<expression id="STONITH-iLO-host2:anywhere-r1-e1" attribute="#uname" operation="eq" value="host2"/>
</rule>
</rsc_location>
Load both policies into the cluster:
host1:~ # cibadmin -C -o constraints -x stonithhost1_constraint.xml
host1:~ # cibadmin -C -o constraints -x stonithhost2_constraint.xml
At this stage we don't have the resources in the cluster that these policies refer to, so warning messages will appear in the logs; ignore them, that's normal.
Create the primitive resource for iLO on host1:
host1:~ # vi stonithhost1.xml
<primitive id="STONITH-iLO-host1" class="stonith" type="external/riloe" provider="heartbeat">
<operations>
<op id="STONITH-iLO-host1-op-01" name="monitor" interval="30s" timeout="20s" prereq="nothing"/>
<op id="STONITH-iLO-host1-op-02" name="start" timeout="60s" prereq="nothing"/>
</operations>
<instance_attributes id="STONITH-iLO-host1-ia">
<attributes>
<nvpair id="STONITH-iLO-host1-ia-01" name="hostlist" value="host1"/>
<nvpair id="STONITH-iLO-host1-ia-02" name="ilo_hostname" value="host1-ilo"/>
<nvpair id="STONITH-iLO-host1-ia-03" name="ilo_user" value="Administrator"/>
<nvpair id="STONITH-iLO-host1-ia-04" name="ilo_password" value="CLEARTEXTPASSWORD"/>
<nvpair id="STONITH-iLO-host1-ia-05" name="ilo_can_reset" value="true"/>
<nvpair id="STONITH-iLO-host1-ia-06" name="ilo_protocol" value="2.0"/>
<nvpair id="STONITH-iLO-host1-ia-07" name="ilo_powerdown_method" value="power"/>
</attributes>
</instance_attributes>
</primitive>
Along with the normal operations we configure instance_attributes as well, which describe the details of our iLO device. This is for iLOv2; if you happen to need this for the older iLOv1, the difference would be:
-snip-
<nvpair id="STONITH-iLO-host1-ia-05" name="ilo_can_reset" value="false"/>
<nvpair id="STONITH-iLO-host1-ia-06" name="ilo_protocol" value="1.2/>
-snip-
iLOv1 cannot cold reset a node; since we set the ilo_powerdown_method to "power", iLOv1 will stop and then start the server (in that order), which is pretty much the same thing.
Create the primitive resource for iLO on host2:
host1:~ # vi stonithhost2.xml
<primitive id="STONITH-iLO-host2" class="stonith" type="external/riloe" provider="heartbeat">
<operations>
<op id="STONITH-iLO-host2-op-01" name="monitor" interval="30s" timeout="20s" prereq="nothing"/>
<op id="STONITH-iLO-host2-op-02" name="start" timeout="60s" prereq="nothing"/>
</operations>
<instance_attributes id="STONITH-iLO-host2-ia">
<attributes>
<nvpair id="STONITH-iLO-host2-ia-01" name="hostlist" value="host2"/>
<nvpair id="STONITH-iLO-host2-ia-02" name="ilo_hostname" value="host2-ilo"/>
<nvpair id="STONITH-iLO-host2-ia-03" name="ilo_user" value="Administrator"/>
<nvpair id="STONITH-iLO-host2-ia-04" name="ilo_password" value="CLEARTEXTPASSWORD"/>
<nvpair id="STONITH-iLO-host2-ia-05" name="ilo_can_reset" value="true"/>
<nvpair id="STONITH-iLO-host2-ia-06" name="ilo_protocol" value="2.0"/>
<nvpair id="STONITH-iLO-host2-ia-07" name="ilo_powerdown_method" value="power"/>
</attributes>
</instance_attributes>
</primitive>
Load both resources into the cluster:
host1:~ # cibadmin -C -o resources -x stonithhost1.xml
host1:~ # cibadmin -C -o resources -x stonithhost2.xml
These can take a few moments to come up green; be patient for a while and monitor the cluster.
Hint:
If it doesn't come up for some reason, or stops itself after a while (rarely happens), select “stop”, wait a few seconds, then “clean the resource on all nodes” with the GUI. A few seconds later select the “default” option in the GUI, which should finally start it up fine. I have slow hubs connecting my iLO network; perhaps that's what occasionally causes this minor issue.
Reference: http://wiki.linux-ha.org/CIB/Idioms/RiloeStonith
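The same cleanup can also be done from the command line; assuming the crm_resource options of this HA version, something along these lines should work on the node where the resource last failed (host2 in this example):
host1:~ # crm_resource -C -r STONITH-iLO-host1 -H host2
This clears the failed state of the resource on the given node so the cluster will try to start it again.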
Ping daemon resource:
At this point we are just configuring our cluster's general behavior, setting up rescue and safety tools and resources. The last one on the list is the ping daemon. We already specified the gateway IP address of my private LAN (in ha.cf), which is reached via the eth0 interface. The eth1 interface is purely for HA communication in my setup, so if I can't ping my gateway IP via eth0, it means something is wrong. Since all my domU resources will share the eth0 interface, it's crucial to monitor it and make sure that it's working, otherwise all my resources (virtual machines) could become unreachable.
Outline:
- create a clone resource for pingd
- configure the cluster to score ping (network) connectivity
- configure the cluster to run resources only where ping connectivity is defined
Each resource (domU) in the cluster will have to be individually configured for pingd connectivity, hence I will leave that for later; right now I just configure the clone resource:
host1:~ # vi pingdcloneset.xml
<clone id="pingd" globally_unique="false">
<instance_attributes id="pingd-ia">
<attributes>
<nvpair id="pingd-ia-01" name="clone_max" value="2"/>
<nvpair id="pingd-ia-02" name="clone_node_max" value="1"/>
</attributes>
</instance_attributes>
<primitive id="pingd-child" provider="heartbeat" class="ocf" type="pingd">
<operations>
<op id="pingd-child-op-01" name="monitor" interval="20s" timeout="40s" prereq="nothing"/>
<op id="pingd-child-op-02" name="start" prereq="nothing"/>
</operations>
<instance_attributes id="pingd-child-ia">
<attributes>
<nvpair id="pingd-child-ia-01" name="dampen" value="5s"/>
<nvpair id="pingd-child-ia-02" name="multiplier" value="100"/>
<nvpair id="pingd-child-ia-03" name="user" value="root"/>
<nvpair id="pingd-child-ia-04" name="pidfile" value="/var/run/pingd.pid"/>
</attributes>
</instance_attributes>
</primitive>
</clone>
Load it in:
host1:~ # cibadmin -C -o resources -x pingdcloneset.xml
More information: http://wiki.linux-ha.org/v2/faq/pingd
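To see whether pingd is actually scoring, you can look for the pingd attribute in the status section of the CIB (a suggested check; with one reachable ping node and a multiplier of 100 the value should be 100 on each connected node):
host1:~ # cibadmin -Q -o status | grep pingd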
Now we should see something like this:
host1:~ # crm_mon -1
============
Last updated: Thu Jan 8 14:59:21 2009
Current DC: host1 (91275fec-a9f4-442d-9875-27e9b7233f33)
2 Nodes configured.
4 Resources configured.
============
Node: host1 (91275fec-a9f4-442d-9875-27e9b7233f33): online
Node: host2 (d94a39a4-dcb5-4305-99f0-a52c8236380a): online
Clone Set: STONITH
STONITH-child:0 (stonith:external/ssh): Started host1
STONITH-child:1 (stonith:external/ssh): Started host2
STONITH-iLO-host1 (stonith:external/riloe): Started host2
STONITH-iLO-host2 (stonith:external/riloe): Started host1
Clone Set: pingd
pingd-child:0 (ocf::heartbeat:pingd): Started host1
pingd-child:1 (ocf::heartbeat:pingd): Started host2
crm_mon is a great utility; it displays information about the cluster in various ways and can even provide output for the Nagios monitoring system. See the man page for further information.
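For the Nagios style output mentioned above, crm_mon can be run as a one-shot check; the exact flag may differ between builds (on mine it is the simple status option):
host1:~ # crm_mon -s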
EVMS resource:
On SLES10 the /dev directory is actually tmpfs, meaning that it's an interim file system created by udev every time the system starts. It also means that the next time we boot the servers, our EVMS devices will not be available under /dev/evms/...
Remember that when we saved the EVMS disk configuration we had to run evms_activate on the other node to make the devices available (created). This is exactly what we need to do here, and luckily HA ships with an OCF resource agent that does exactly this for us. The other good thing about doing EVMS this way is that we can make all the other resources dependent on it.
The benefit is that HA will ensure evms_activate has been run and the devices are in place before starting the dependent resources (domUs).
Create EVMS clone resource:
host1:~ # vi evmscloneset.xml
<clone id="evms" notify="true" globally_unique="false">
<instance_attributes id="evms-ia">
<attributes>
<nvpair id="evms-ia-01" name="clone_max" value="2"/>
<nvpair id="evms-ia-02" name="clone_node_max" value="1"/>
</attributes>
</instance_attributes>
<primitive id="evms-child" class="ocf" type="EvmsSCC" provider="heartbeat">
</primitive>
</clone>
All it does is run evms_activate on all nodes when the resource starts up.
Load the resource into the cluster:
host1:~ # cibadmin -C -o resources -x evmscloneset.xml
Note: SLES10 ships with LSB scripts (found in /etc/init.d) that do the same thing, creating the EVMS devices during the boot process, but this does not work with CSM containers. They run without errors but your devices won't be created, perhaps because evmsd would not be running. Whatever the cause, according to the project's page the designed way at the time of writing is to manage cluster memberships with HA. Even if you didn't plan to deploy clustered XEN domUs and just wanted to share storage between 2 bare metal servers with EVMS and CSM, you would need the same setup up to this point, except for the ping daemon.
OCFS2 cluster file system resource:
By now we should have our OCFS2 cluster up and running, the file system created and ready to mount, services set to start at boot, etc. The last step is to mount the actual device, which we will do with HA and its file system resource agent. The reason is that it needs to be done on more than one node, and doing it with HA makes it:
- cluster aware (every node will know when a node leaves, joins the cluster)
- simple (single configuration for multiple mounts)
This will need to be again a clone resource that mounts
/dev/evms/san2/cfgpool volume to /etc/xen/vm on each node in the cluster:
host1:~ # vi configpoolcloneset.xml
<clone id="configpool" notify="true" globally_unique="false">
<instance_attributes id="configpool-ia">
<attributes>
<nvpair id="configpool-ia-01" name="clone_max" value="2"/>
<nvpair id="configpool-ia-02" name="clone_node_max" value="1"/>
</attributes>
</instance_attributes>
<primitive id="configpool-child" class="ocf" type="Filesystem" provider="heartbeat">
<operations>
<op id="configpool-child-op-01" name="monitor" interval="20s" timeout="60s" prereq="nothing"/>
<op id="configpool-child-op-02" name="stop" timeout="60s" prereq="nothing"/>
</operations>
<instance_attributes id="configpool-child-ia">
<attributes>
<nvpair id="configpool-child-ia-01" name="device" value="/dev/evms/san2/cfgpool"/>
<nvpair id="configpool-child-ia-02" name="directory" value="/etc/xen/vm"/>
<nvpair id="configpool-child-ia-03" name="fstype" value="ocfs2"/>
</attributes>
</instance_attributes>
</primitive>
</clone>
Again we are configuring an anonymous clone set whose globally_unique parameter is (again) set to false. Since this time the clone set contains an OCFS2 file system resource agent, we want to enable notifications for it, so that the clones (each agent on each node) receive notifications from the cluster and are therefore informed about the cluster membership status. To enable notifications, set notify to true for the clone set. We also configure the monitor operation, so that the cluster checks every 20 seconds whether the mount is still there.
The most important part of the XML blob is the attributes section of the configpool primitive. Set the device parameter to the OCFS2 file system that needs to be mounted, directory to the directory on which this file system must be mounted, and fstype to ocfs2 for obvious reasons. In a cloned file system RA (resource agent), any other fstype value is forbidden, because OCFS2 is the only supported cluster-aware file system at this stage.
Load the resource into the cluster:
host1:~ # cibadmin -C -o resources -x configpoolcloneset.xml
Without the EVMS volumes this resource wouldn't be able to start, hence we have to make sure that EVMS starts first by making the OCFS2 resource dependent on the EVMS resource:
host1:~ # vi configpool_to_evms_order.xml
<rsc_order id="configpool_depends_evms" from="configpool" to="evms" score="0"/>
Something to mention about the score="0" at the end... It's very important since version 2.1.3 and above. Without it, domUs can randomly restart after a successful live migration. Reading the documentation, I realized that it's harmless and should be in the CIB anyway.
Reading:
http://www.clusterlabs.org/wiki/images/a/ae/Ordering_Explained_-_White.pdf
http://www.gossamer-threads.com/lists/linuxha/users/52913
Load the policy into the cluster:
host1:~ # cibadmin -C -o constraints -x configpool_to_evms_order.xml
More OCFS issues:
This is the stage where I had issues with file image based domUs. I basically had the 100GB LUN created with one big OCFS2 file system, mounted with HA in a similar fashion. It was a while ago, and I must admit that it may be fixed by now, but when domUs started migrating to the other node, things went wrong.
According to my observations and logs, it looked like a locking issue with live migration: OCFS2 couldn't hand over the lock to the other node or release the image file when the handover was finished by XEN and write operations to the image file were about to continue on the other node.
Of course it was a severe error, and the nodes started receiving STONITH actions until things came right. Without live migration, HA would have stopped the domUs first and then started them up at the new location, which would have worked perfectly; the original HASI was based on this idea. I just couldn't afford that for storage which holds user data and is mounted (multiple times) all the time, and last but not least it's not what I want in a production environment.
FYI:
Another interesting issue we found recently is that if you have the findutils-locate package installed, it does not work very well on OCFS2. There's a cron job running every day which builds the database of all files found on the system, but when it reaches OCFS2 volumes, it hangs. We opened a support call about this; no updates yet.
OCFS2 is a great file system, I like its features and the support Novell builds into its products for it, but I am not yet convinced that it's suitable for XEN clustering and live migration in a production environment.
NFS virtual machine (domU) resource:
Finally, it's time to play with virtual machines and HA. As mentioned earlier in this guide, I will only present the solution for the NFS domU sharing a large disk with UNIX users, but following the same idea I run around 15 other domUs on 2 clusters, hosting various UNIX services.
Creating a virtual machine is out of the scope of this guide; there's plenty on the net about it. In fact I don't install domUs anymore: I maintain and run a plain copy on one of my clusters and clone that when I need a new one.
Note: the XEN domU configuration is still text based, not xenstore. To my knowledge it's the only way of doing XEN clustering, because it's tricky to sync the xenstore database amongst all nodes at the time of writing.
Related reading:
http://wiki.xensource.com/xenwiki/XenStore
For a reference here it is my NFS XEN domU configuration:
host1:~ # vi /etc/xen/vm/nfs.xm
ostype="sles10"
name="nfs"
memory=512
vcpus=1
uuid="86436fde-1613-4e12-8a94-093d1c3f962e"
on_crash="destroy"
on_poweroff="destroy"
on_reboot="restart"
localtime=0
builder="linux"
bootloader="/usr/lib/xen/boot/domUloader.py"
bootargs="--entry=xvda1:/boot/vmlinuz-xenpae,/boot/initrd-xenpae"
extra="TERM=xterm"
disk = [ 'phy:/dev/evms/san2/vm2,xvda,w', 'phy:/dev/mapper/mpath1,xvdc,w' ]
vif = [ 'mac=00:16:3e:32:8b:12', ]
vfb = [ "type=vnc,vncunused=1" ]
I assigned a fairly small (5GB) EVMS volume to this domU and added the 2T LUN as well, as is, without any modification, as it comes off the dm layer with its alias name mpath1. The 5GB might look a bit tight, but my domUs are crafted for their purpose, run purely in runlevel 3 (no GUI) and are really just cut-down SLES copies. At the time I built this cluster there was no JeOS release available; today it may be a good choice for domUs:
http://www.novell.com/it-it/linux/appliance
Create XEN domU resource:
host1:~ # vi xenvmnfs.xm
<primitive id="nfs" class="ocf" type="Xen" provider="heartbeat">
<operations>
<op id="xen-nfs-op-01" name="start" timeout="60s"/>
<op id="xen-nfs-op-02" name="stop" timeout="90s"/>
<op id="xen-nfs-op-03" name="monitor" timeout="60s" interval="10s"/>
<op id="xen-nfs-op-04" name="migrate_to" timeout="90s"/>
</operations>
<instance_attributes id="nfs-ia">
<attributes>
<nvpair id="xen-nfs-ia-01" name="xmfile" value="/etc/xen/vm/nfs.xm"/>
</attributes>
</instance_attributes>
<meta_attributes id="nfs-ma">
<attributes>
<nvpair id="xen-nfs-ma-01" name="allow_migrate" value="true"/>
</attributes>
</meta_attributes>
</primitive>
We set the standard start, stop, monitor and migration operations with timings, then the location of the XEN domU configuration, and enabled live migration (this needs to be a meta attribute).
Instance attribute: settings about the cluster resource for the local node's resource agent (e.g. which configuration file or IP address to use)
Meta attribute: settings about the resource for the cluster (e.g. whether the resource should be started or not)
One of the features of this HASI is the hypervisor-friendly resource timing, which I can best explain with a second domU configuration running within the same cluster:
<primitive id="sles" class="ocf" type="Xen" provider="heartbeat">
<operations>
<op id="xen-sles-op-01" name="start" timeout="60s" start_delay="10s"/>
<op id="xen-sles-op-02" name="stop" timeout="60s" start_delay="10s"/>
<op id="xen-sles-op-03" name="monitor" timeout="60s" interval="10s" start_delay="10s"/>
<op id="xen-sles-op-04" name="migrate_to" timeout="90s" start_delay="10s"/>
</operations>
<instance_attributes id="sles-ia">
<attributes>
<nvpair id="xen-sles-ia-01" name="xmfile" value="/etc/xen/vm/sles.xm"/>
</attributes>
</instance_attributes>
<meta_attributes id="sles-ma">
<attributes>
<nvpair id="xen-sles-ma-01" name="allow_migrate" value="true"/>
</attributes>
</meta_attributes>
</primitive>
The second domU above has a 10 second start_delay set for all its operations. When HA starts up, it starts all resources in the cluster according to the ordering and only waits for dependencies to complete. If I had 20 domUs in my cluster, it would hammer the system and its hypervisor and could lead some domUs to crash. I understand that SLES and xend have protection against this sort of problem, however I had issues when some of my heavily loaded domUs started migrating all at the same time: some crashed occasionally.
http://wiki.linux-ha.org/ClusterInformationBase/Actions
This small delay staggers the second domU's operations, and in most cases 10 seconds is enough for a start, stop or migration to complete unless the domU is heavily loaded or has a large amount of RAM allocated. Mind you, 10 seconds per resource can add up to a significant delay for a cluster loaded with 20 domUs: the last resource would be delayed by more than 3 minutes. It's your call how you configure or adjust these delays; you have to craft them for your environment, I cannot give you a "one size fits all" solution.
You could, though (see the sketch after this list):
- reduce the time and only delay each domU by 5 seconds if you have many
- delay pairs of domUs (or more) with either a similar purpose or similar characteristics
- delay just the ones you know to be resource intensive or busy
- skip the delays entirely if you have tested your cluster well and had no issues
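A minimal sketch of what such staggering might look like for a hypothetical third domU, following the same pattern as the examples above (the ids and values are purely illustrative):
<operations>
<op id="xen-web-op-01" name="start" timeout="60s" start_delay="20s"/>
<op id="xen-web-op-02" name="stop" timeout="60s" start_delay="20s"/>
<op id="xen-web-op-03" name="monitor" timeout="60s" interval="10s" start_delay="20s"/>
<op id="xen-web-op-04" name="migrate_to" timeout="90s" start_delay="20s"/>
</operations>
With the NFS domU undelayed, the sles domU at 10 seconds and this one at 20 seconds, the operations are spread out instead of hitting the hypervisor all at once.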
You have to make sure that the domU runs, the configuration is typo free and the disk descriptions are correct before loading it into the cluster.
You could simply test the domU outside of the cluster (start it up with the traditional XEN utilities on one of the member nodes) or, better yet, test it in a dedicated test environment if you can afford one. If you load it into your HA cluster and it doesn't start, or crashes after a while, HA will issue a STONITH for that node (reboot), which could affect other already running, working services. That is probably not something you want on a production system.
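For example, a quick smoke test with the traditional tools on one of the member nodes could look like this (a sketch; do this only before the domU is defined as a cluster resource, or while that resource is stopped):
host1:~ # xm create -c /etc/xen/vm/nfs.xm
host1:~ # xm list nfs
host1:~ # xm shutdown nfs
Boot it with its console attached, confirm it runs, then shut it down cleanly before handing it over to HA.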
If you are ready, load it into the cluster:
host1:~ # cibadmin -C -o resources -x xenvmnfs.xm
Now we create a policy to make this domU resource dependent on EVMS and configpool. Since those two are already dependent on each other, it makes sense to make the domU dependent on the configpool resource:
host1:~ # vi nfs_to_configpool_order.xml
<rsc_order id="nfs_depends_configpool" from="nfs" to="configpool" score="0"/>
It has no effect on an already running domU resource, so it is safe to load into the cluster while operating; the scoring was already discussed at the bottom of page 27.
Create another policy to make the NFS domU dependent on working networking on eth0:
host1:~ # vi nfs_to_pingd_constraint.xml
<rsc_location id="nfs:connected" rsc="nfs">
<rule id="nfs:connected-rule-01" score="-INFINITY" boolean_op="or">
<expression id="nfs:connected-rule-01-expr-01"
attribute="pingd" operation="not_defined"/>
<expression id="nfs:connected-rule-01-expr-02"
attribute="pingd" operation="lte" value="0"/>
</rule>
</rsc_location>
pingd is already running and scoring; this rule instructs the cluster not to run (to stop) this domU resource anywhere in the cluster where there is no networking (pingd connectivity).
Related reading:
http://wiki.linux-ha.org/CIB/Idioms/PingdStopOnConnectivityLoss
Certainly, in our scenario we have live migration enabled, hence the cluster will not just stop the resources as described on the page above: it will either start them on another node (assuming there is network connectivity there) or simply live migrate them if there is some other Ethernet connectivity between the cluster member nodes (there should be!).
This is not the best way of handling network connectivity scoring, as written on the page above, and I'm well aware of that, but the other method (preference scoring for better connectivity) could cause domUs to move between nodes quite often depending on load. I don't want that; I want my resources to stay where they are as long as there is networking available.
Should you lose networking on all your nodes, HA will shut down all domUs and keep scoring continuously in the background. Once connectivity comes back it should start them up again, although I haven't tested this behavior.
The last thing we need to do is load these policies into the cluster:
host1:~ # cibadmin -C -o constraints -x nfs_to_configpool_order.xml
host1:~ # cibadmin -C -o constraints -x nfs_to_pingd_constraint.xml
Remember that these policies will need to be individually configured for each domU resource you plan to run within an HA cluster.
Operating Hints
Caveats:
HA is our central database, it tracks and monitors each resource in the cluster, informs all member nodes about changes and synchronizes the CIB (cluster information base) amongst all nodes.
You must stop using any traditional XEN domU management utility including:
- virsh (libvirt)
- virt-manager (libvirt GUI)
- xm
Anything you do with the domUs must be done by informing HA. If you stop one of your domUs with a traditional utility, HA will not know what happened to it and will start the resource up again; and what if you do the same at exactly the same time? Yes, corrupted storage. XEN will happily start multiple instances of the same domU and will not warn you or complain, that's simply what it does, but without a cluster file system everything written by both instances at the same time will be corrupted.
You can only use the tools above for monitoring, gathering information and perhaps testing domUs outside of the cluster, nothing else. The cluster's job is to keep the domUs running; should you want to change that status, tell the cluster through its built-in, HA-aware utilities.
To make this easy from the CLI, I wrote a basic script which translates basic xm management commands into HA-aware commands:
http://www.novell.com/communities/node/2573/xen-ocf-resource-management-script-ha-stack
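If you prefer to roll your own, a minimal sketch of such a wrapper could map the familiar verbs onto crm_resource calls, roughly like this (the script name and path are hypothetical, and the exact crm_resource switches vary between Heartbeat releases, so check crm_resource --help on your version):
host1:~ # vi /usr/local/bin/xmha
#!/bin/sh
# xmha - hypothetical sketch: translate xm-style verbs into HA-aware actions
# usage: xmha {start|stop|migrate|status} <resource> [node]
RSC="$2"
case "$1" in
start)   crm_resource -r "$RSC" -p target_role -v started ;;
stop)    crm_resource -r "$RSC" -p target_role -v stopped ;;
migrate) if [ -n "$3" ]; then
           crm_resource -M -r "$RSC" -H "$3"
         else
           crm_resource -M -r "$RSC"
         fi ;;
status)  crm_resource -W -r "$RSC" ;;
*)       echo "usage: $0 {start|stop|migrate|status} <resource> [node]"; exit 1 ;;
esac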
The other common mistake is typos in the configuration files, particularly when you don't install the domU but clone it. Some will be harmless but some can be quite destructive, hence:
You have to make sure that the XEN domU configuration has the correct disk descriptions and that they point to the right device. This is a must regardless of whether you use an EVMS setup or file images on OCFS2.
If you forget to change the disk line of the configuration after cloning and then start it up, there is a very good chance you will corrupt the disk of an already running production domU...
The word "crafting" appears in this guide quite often. This is what makes the difference: create harmony in your cluster and you won't need to deviate much from the defaults.
Basic principles: turn off unused services, no firewall, no AppArmor, patch regularly, install only the software packages you need, keep an eye on RAM and CPU usage, no unnecessary accounts, use runlevel 3, etc.
Heartbeat GUI:
You should see lots of nice little green lamps in your cluster now, so let's talk a bit about the GUI interface. It's very basic, can do certain things and gets better with every release, but I use it mainly to get an overview of the services or for basic operations, and I strongly recommend you do the same.
To authenticate, you either have to reset the password for the hacluster system user (nonsense) or make your account a member of the haclient group (better):
host1:~ # groupmod -A <yourusername> haclient
host2:~ # groupmod -A <yourusername> haclient
Note: this is something you will need to do on all cluster member nodes unless you are using centralized user management (LDAP). The GUI can be started from any member node within the cluster and will display, work and behave the same way.
host1:~ # hb_gui &
or
host2:~ # hb_gui &
Now you should be able to authenticate with your credentials, learn and get used to the interface:
Start, stop resources with HA:
You can configure the status of a resource in the XML file before loading it in for example:
<nvpair id="xen-sles-op-05" name="target_role" value="stopped"/>
or
<nvpair id="xen-sles-op-05" name="target_role" value="started"/>
In the CIB it becomes the default for that particular resource. Frankly, I don't see any point in loading something into the CIB with a stopped status, and doing it with started is pointless because that's the HA default action anyway.
But when we stop or start a resource, either with the GUI or with my script, we in fact insert an interim attribute into the CIB with a generated id, without making it the default.
This is important because if you want to start the stopped resource again you could select start:
But then you just replace the stopped interim attribute with a started one and still leave interim bits in your CIB. To do it properly, select default, which actually removes the interim attributes and applies the HA default status to the resource you just started:
Deleting the target_role attribute inserted by the GUI from the right panel has the same effect as default. I personally don't like interim entries in my CIB; I like to keep it nice and clean. Note: at this stage my script doesn't have an option for default!
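From the command line, the equivalent cleanup would be something along these lines (a sketch; whether target_role lives as an instance or meta attribute, and the exact crm_resource switches, depend on your Heartbeat version):
host1:~ # crm_resource -r sles -d target_role
This deletes the interim attribute so the resource falls back to the HA default behavior.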
Safe way of testing new resource configurations:
Essential for a production environment, even if you are certain that things will work. The best way is to tell the cluster not to manage the resource in the initial XML file:
<nvpair id="xen-sles-op-05" name="is_managed" value="false"/>
Should the new resource fail to start up or crash after a while, the cluster will ignore the failure and will not issue fencing for the node where the failure occurred. Once it has proven stable, you can remove this attribute with the GUI or edit the CIB with cibadmin (covered later on).
Disable monitoring operation:
You may need this at some stage, I haven't used it so far. It's just an extra attribute:
<op id="xen-sles-op-03" name="monitor" timeout="60s" interval="10s" enabled="false"/>
Notice that I removed the monitor delay from the previous example just to avoid breaking the line and to maintain readability; otherwise it would be there. You can remove this attribute the same way as the previous one.
Editing the CIB:
You can only use the GUI or the cibadmin utility. You must not edit the CIB by hand even if you know where it is located on the file system.
The GUI is simple: you just edit or delete the particular part you need. The command line works differently. Assuming you saved all the XML files you loaded into the CIB, just make the change you need and load the file back into the cluster, but use the replace (-R) option instead of create (-C), for example:
host1:~ # cibadmin -R -o resources -x xenvmnfs.xm
Backup, Restore the CIB:
You can backup the entire CIB:
host1:~ # cibadmin -Q > cib.bak.xml
It includes all the LRM parts, which you wouldn't normally need. LRM stands for local resource manager, basically the component which handles everything for the corresponding node locally. The CIB is replicated across all nodes, whereas the LRM performs the local actions on each node. Hence I prefer to back up my CIB by object type instead:
host1:~ # cibadmin -Q -o resources > resources.bak.xml
host1:~ # cibadmin -Q -o constraints > constraints.bak.xml
Should you need to restore them:
host1:~ # cibadmin -R -o resources -x resources.bak.xml
host1:~ # cibadmin -R -o constraints -x constraints.bak.xml
Migration with HA:
It's a little bit different, therefore I would like to talk about it. In HA, when you tell your cluster to do something, you either set target_role (stop, start) or set a certain preference (migration). Preference is managed by scoring, as you may have realized already, and when you migrate a resource you actually instruct the cluster either NOT to prefer it at all (score -INFINITY) or to prefer it the MOST (score +INFINITY) on a particular node.
Using my script to migrate, or right-clicking a resource in the GUI and selecting the "migrate resource" option, do the same thing: they apply interim scoring and insert a rule into the cluster. This nature of HA applies to all resource types, not just to virtual machines!
Ultimately you are cluttering up the CIB, which you will need to clean up at some point. As you can see, cleanup is already built into the interface (option down below) and my script also includes a subcommand for this, but it's still not my preferred way of doing things.
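What actually lands in the CIB is an interim location constraint, conceptually something like this (a sketch; the ids and exact form depend on your Heartbeat version, which typically generates ids such as cli-prefer-<resource>):
<rsc_location id="cli-prefer-nfs" rsc="nfs">
<rule id="cli-prefer-rule-nfs" score="INFINITY">
<expression id="cli-prefer-expr-nfs" attribute="#uname" operation="eq" value="host2"/>
</rule>
</rsc_location>
Cleaning up after the migration simply removes this constraint again.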
Standby is the way to go:
In the last 2 years, since I have been running these clusters, I have only had to migrate resources when I wanted to update (patch) or do maintenance on the servers. The best approach is to put the node into standby. It's designed exactly for this: it migrates all resources which are migration capable (and need to be running), stops and starts elsewhere the ones which are not, then stops the remaining resources which don't need to be running.
As usual, my script includes a built-in option to do this, or you can use the GUI:
It takes some time depending on your setup, the configured time delays, the number of resources and their load, so just be patient. Once HA reports that the node is in running-standby, resources stopped and domUs running on the other node, you can basically do whatever you feel like. You can patch the XEN host, upgrade it, shut it down for maintenance, upgrade firmware and so forth.
When you have finished, just make it an active node again; GUI or the script, it's your choice:
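From the command line, the standby attribute can also be flipped with the crm_standby helper, a wrapper around crm_attribute (a sketch; check the switches on your Heartbeat version):
host1:~ # crm_standby -U host1 -v true
host1:~ # crm_standby -U host1 -G
host1:~ # crm_standby -U host1 -D
The first command puts host1 into standby, the second queries the current value, and deleting the attribute makes the node active again.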
Backups:
How do I actually back up these complex systems, and what would a restore look like? We use CommVault7 to back up data partitions (within the domU) but I use tar for everything else. The dom0 is simple: it logs to a remote server as well and barely changes, hence I back it up (full) once a week to a remote backup server via NFS, from where it gets stored onto tape.
The restore would be simple too. Should the system fail to boot, I would boot from the original SLES install DVD, select rescue mode, configure networking, then restore from the remote NFS server. We actually keep lots of system backups online in the remote backup server's disk cache.
For the domUs I run tar based daily differential backups and a full backup once a week. The restore might seem more challenging due to the nature of the setup but it's actually fairly easy. You can fix, restore, back up, modify or do whatever you need on any domU disk from the XEN dom0 host, regardless of whether you use an EVMS, partition or file image based setup. I have already published a guide on how to access domU disks from the host, which is the key to this solution:
http://www.novell.com/communities/node/2697/mountaccess-files-residing-xen-virtual-machines
I have tested this solution many times when I accidentally deleted domU disks or corrupted them during development. It actually takes less than 10 minutes to fully recover a domU this way, which is pretty good.
If you can afford the domU being offline, you can copy or clone any domU disk with the dd utility and create an image backup, a daily snapshot or whatever you want.
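A minimal sketch using the NFS domU's EVMS volume from earlier in this guide (stop the domU through HA first, never with xm on a cluster managed resource; the backup path is hypothetical):
host1:~ # dd if=/dev/evms/san2/vm2 of=/backup/nfs-xvda.img bs=1M
host1:~ # dd if=/backup/nfs-xvda.img of=/dev/evms/san2/vm2 bs=1M
The first command takes a raw image of the domU root disk, the second restores it the other way around.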
Update regime:
Due to the nature of the system it makes sense not to be a version freak and update every time a patch is released. Simply follow Murphy: don't try to fix a non-existing problem, although it's good to patch occasionally to save yourself from software bugs.
I recommend the following, applied with common sense:
- avoid upgrading during working hours, the load needs to be as small as possible
- sign up for patch notification emails
- always read them thoroughly, understand what is fixed, see if you could be affected
- update regularly but not too often (I update usually between 8-12 weeks)
- unless I am affected or the patch is really important for a healthy environment
I'm not too concerned about security fixes since my systems are protected by various layers of tools, firewalling, etc., but you may need them urgently if a package is affected by a critical bug and it could remotely affect your systems providing public services over the INTERNET.
Updating as a process is more challenging in this environment, especially when software components receive feature updates, new versions, etc.
For an HA managed XEN cluster, the standard procedure would be as follows:
- put host1 node into standby
- apply patches and reboot (assume that within 8-12 weeks you receive a kernel update)
- put host1 node back into active mode (no domU should be running there)
- observe the system for a while, monitor logs, ensure full functionality
- update the least important domU first then stop it (running on host2)
- start the least important domU up immediately (it should start up on host1)
- monitor behavior for a while, may be a day or so depending on your requirements
- if things look good put host2 into standby BUT don't patch just yet
- all domUs will move to the newly patched host1 (should be no issues)
- host2 remains in standby until all domUs are fully patched and restarted
- proceed with the patching of the rest of the domUs running on host1
- once all domUs are fully patched, restarted on host1 and fully functional for a while, proceed with host2 patching
- reboot host2, then put it back into active mode if it behaves well for a while
This is the safest method I have figured out over the years and it has always worked, except on one occasion: when SUSE updated HA from 2.0.7 to 2.1.3 and my configuration at the time didn't have certain scoring settings. It's discussed briefly at the bottom of page 27.
I had odd OCFS2 issues as well when a new version was released (1.4.x). For clustering solutions it's quite common that nodes cannot establish connections with nodes running different software versions; it's just the way it is.
You have to pick the right time for upgrades and maintenance. The load needs to be as small as possible to avoid issues. For example, I once updated one of my cluster nodes during working hours, and when I put the node back into active mode the EVMS resource failed due to a timeout. There must have been a big load either on the SAN or on the volumes presented to the nodes. As a result STONITH actions were fired, reboots followed, etc.
RAM usage generally:
This idea assumes that you count the amount of RAM you consume, you limit the usage of your dom0 and all your domUs to a certain amount, and you never run more resources than one node can handle. This is one of the reasons why a 2 node cluster is inefficient: one node is pretty much wasted because you can only ever run as much as a single node can handle.
Of course, with more nodes it's a lot easier to spread domUs across them; alternatively you could set up policies to shut down certain resources to accommodate the extra load, but that's out of the scope of this guide.
Persistent device names:
It's a bit off topic, so take it as optional reading; it should not affect users who run SP1 or SP2 SLES10 installations. I have been upgrading mine since the GA release, and at that time persistent device names weren't the default during installation. I made some notes on how to do this by hand.
http://www.novell.com/communities/node/6691/create-convert-disks-persistent-device-names
Limiting dom0's RAM:
I think the maximum limit is still a good idea and I'm still doing it:
http://www.novell.com/documentation/sles10/xen_admin/data/sec_xen_config_bootloader.html
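For the maximum limit, the dom0_mem parameter goes onto the Xen hypervisor line of the boot loader configuration, roughly like this (a sketch; the exact paths and module lines depend on your installation):
host1:~ # vi /boot/grub/menu.lst
-snip-
kernel /boot/xen.gz dom0_mem=1024M
-snip-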
The maximum limit is not enough by itself; I recommend setting the minimum limit as well. This way your dom0 (your controller) always gets what it needs, and according to my tests 1G is enough for even the busiest environments. Smaller systems could try 512M for a start:
host1:~ # vi /etc/xen/xend-config.sxp
-snip-
(dom0-min-mem 1024)
-snip-
host1:~ # rcxend restart
Be careful restarting xend on production systems; I haven't tested this during normal operation and it might cause trouble.
Drawing dependency graph:
It's very useful for visualizing the HA cluster, its dependencies and it should be part of your cluster documentation. I published this separately:
http://www.novell.com/communities/node/5880/visualizing-heartbeat-ha-managed-resource-dependencies
Unlimited shell history:
I found it useful especially when I forgot how I did certain things in the past. You could track your root account's activity during the system's lifetime:
http://www.novell.com/communities/node/6658/unlimited-bash-history
Managing GUI applications remotely:
It's an article I published recently and it can be very useful for virtual environments. The GUI components take a lot of resources and the server doesn't actually need to run them, so we can turn them off yet still use and take advantage of the GUI tools developed for SUSE.
http://www.novell.com/communities/node/6669/rdp-linux-managing-gui-displays-remotely
Monitoring HP hardware:
Protect your XEN host from hardware failure and monitor it with the HP provided tools. It may not be necessary for initial setups, but it is for production environments.
http://www.novell.com/communities/node/6690/hpproliantsupportpacksles10
Relocation host settings:
You will need this for live migration. If you are unfamiliar with this please visit this page:
http://www.novell.com/documentation/sles10/xen_admin/index.html?page=/documentation/sles10/xen_admin/data/bookinfo.html
XEN networking:
It's a very important and sensitive topic, so pay attention to it when you design your cluster. I personally prefer bridges since bridging is a layer 2 operation in the OSI model, easy to set up and doesn't cause too much overhead on the host even though it's software based, but it may not suit your environment, in which case you will need routing.
If you decide to use bridging and want multiple bridges for more than one NIC, you can create a small wrapper script to manage them when xend starts up:
host1:~ # vi /etc/xen/scripts/multi-network-bridge
#!/bin/sh
# PRELOADER (Wrapper) SCRIPT FOR 'network-bridge'
# Modified script from wiki.xensource.org (for more
# than 1 bridge) IAV, Sep 2006
#
# Modified to suit SLES10-SP2. IAV, Aug 2008
dir=$(dirname "$0")
"$dir/network-bridge""$@" netdev=eth0
"$dir/network-bridge""$@" netdev=eth1
Modify your xend configuration to reflect this change:
host1:~ # vi /etc/xen/xend-config.sxp
-snip-
(network-script multi-network-bridge)
-snip-
host1:~ # rcxend restart
It basically runs the standard network-bridge script once for every NIC specified. You will need this set up the same way on all dom0 hosts. We use it to separate private LAN and DMZ traffic for infrastructure domUs requiring access to both.
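After restarting xend you can quickly verify the result with the standard tools (bridge names depend on how the SLES network-bridge script names them in your setup, and the domU name below follows the earlier example):
host1:~ # brctl show
host1:~ # xm network-list nfs
The first command should list both bridges with their physical NICs attached; the second shows which bridge each of the domU's virtual interfaces is connected to.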
Common problems with bridges:
http://www.novell.com/support/php/search.do?cmd=displayKC&docType=kc&externalId=7001989&sliceId=1&docTypeID=DT_TID_1_1&dialogID=18715317&stateId=0%200%2018707874
Issues with multiple NICs:
http://www.novell.com/support/php/search.do?cmd=displayKC&docType=kc&externalId=7000058&sliceId=1&docTypeID=DT_TID_1_1&dialogID=18715317&stateId=0%200%2018707874
Upgrading from SLES10 SP1 to SP2:
http://www.novell.com/support/php/search.do?cmd=displayKC&docType=kc&externalId=7000608&sliceId=1&docTypeID=DT_TID_1_1&dialogID=18715317&stateId=0%200%2018707874
XEN knowledge base master reference:
http://www.novell.com/support/php/search.do?cmd=displayKC&docType=ex&bbid=TSEBB_1221753215744&url=&stateId=0%200%2018707874&dialogID=18715317&docTypeID=DT_TID_1_1&externalId=7001362&sliceId=2&rfId=
Cluster status via web:
hb_gui is great, but what if you don't have it at hand? Cluster status is also available over HTTP with a web browser: the crm_mon command line utility can produce output in HTML format. The web server could run on your nodes, but that's pointless; my preference is to run the web server somewhere else, equipped with a small CGI script that retrieves the output from the nodes remotely. You can ask either node, they return the same output, but your CGI script must ask the other node if one of them is down for maintenance. This solution is designed for a 2 node cluster and might not be the best, but it works. It assumes you have already created a user, for example "monitor", on both hosts with your favorite tool.
Generate keys then copy them to the other node, same location:
monitor@host1:~> ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/home/monitor/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/monitor/.ssh/id_rsa.
Your public key has been saved in /home/monitor/.ssh/id_rsa.pub.
The key fingerprint is:
05:d7:71:82:a5:14:d6:a1:c6:1c:2e:3a:ba:5b:9c:cc monitor@host1
monitor@host1:~> cd .ssh
monitor@host1:~/.ssh> ln -s id_rsa.pub authorized_keys
monitor@host1:~/.ssh> cd ..
monitor@host1:~> scp -r .ssh host2:~
Password:
authorized_keys 100% 926 0.9KB/s 00:00
id_rsa 100% 1675 1.6KB/s 00:00
id_rsa.pub 100% 395 0.4KB/s 00:00
known_hosts 100% 395 0.4KB/s 00:00
monitor@host1:~>
Both nodes should be ready now for remote passwordless login. (monitor user only)
My web server is not a SLES box at this stage, but the procedure should be very similar; the difference is just the running user and maybe the locations. On the web server we have to unlock the account of the running user by giving it a valid shell. Then we copy the private key across and put it in the default $HOME/.ssh location:
webserver:~ # getent passwd apache
apache:x:48:48:Apache:/var/www:/sbin/nologin
webserver:~ # usermod -s /bin/bash apache
webserver:~ # getent passwd apache
apache:x:48:48:Apache:/var/www:/bin/bash
webserver:~ # su - apache
apache@webserver:~> pwd
/var/www
apache@webserver:~> mkdir .ssh
apache@webserver:~> scp monitor@host1:~/.ssh/id_rsa .ssh/
Password:
id_rsa 100% 1675 1.5KB/s 00:00
Execute some remote test commands from the web server, as the running user, against both nodes. This is necessary to test the authentication and to accept the host keys of both systems.
apache@webserver:~> ssh -l monitor host1.domain.co.nz 'uptime'
2:41pm up 5 days 18:08, 4 users, load average: 0.09, 0.08, 0.02
apache@webserver:~> ssh -l monitor host2.domain.co.nz 'uptime'
2:42pm up 5 days 23:26, 6 users, load average: 0.38, 0.36, 0.29
When you have finished, log out and lock the running user's account again; it doesn't need a valid shell anymore:
apache@webserver:~> exit
webserver:~ # usermod -s /sbin/nologin apache
webserver:~ # getent passwd apache
apache:x:48:48:Apache:/var/www:/sbin/nologin
webserver:~ # su - apache
This account is currently not available.
Create a shell script in your cgi-bin directory depending on your webserver's configuration with the following content:
#!/bin/sh
NODE1="host1"
NODE2="host2"
KEY=/var/www/.ssh/id_rsa
USER="monitor"
DOMAIN="domain.co.nz"
CMD="/usr/sbin/crm_mon -1 -w"
# This function tests whether a node answers ping
ping_node() {
/bin/ping -c3 -w5 -q "$1" > /dev/null
/bin/echo $?
}
# This function picks one of the two nodes pseudo-randomly (based on the
# current second) and keeps trying until it finds one that is reachable
randomize() {
i=1
while [ $i -ne 0 ]; do
SEC=`/bin/date +%S`
# strip a possible leading zero so the shell doesn't treat 08/09 as octal
TEST=$(( ${SEC#0} % 2 ))
if [ $TEST -ne 0 ]; then
NODE=$NODE1
else
NODE=$NODE2
fi
i=`ping_node "$NODE"`
done
}
randomize && /usr/bin/ssh -l "$USER" -i "$KEY" "$NODE.$DOMAIN" $CMD
The web server configuration is out of the scope of this document, and remember that it may take a few seconds to load the page. This delay is caused by the host checking (ping) method built into the CGI script.
Output should look like this:
Testing the cluster
It's just as important as any other safety feature we built into the cluster. I hope you have read the links provided in this guide by now and that they were all working. Jo's original HASI discussed testing in some ways; I'm not planning to duplicate that, but you should:
Test multipath:
It depends on your hardware, hence I cannot give you a solution. In my case I did have some unplanned outages and controller failures which allowed me to test the service in real life. At some point one of my servers lost both paths to the SAN while the other, for some reason, didn't. Interestingly, HA must have noticed something, because the following day I found all my domUs running on the healthy node. There is nothing in my cluster monitoring the SAN, but something obviously reacted and saved all my resources from becoming unavailable, which I didn't even notice at the time since it happened after hours...
Test STONITH:
This is the most important part; ensure it's operating 100%. You can kill the HA process and check that the resources move to the other node and that the node where you killed the process actually reboots (after a while, depending on your check interval, deadtime, etc.):
host2:~ # pkill heartbeat
It could take some time if ssh agent was chosen so be patient, monitor logs in the meantime:
host2:~ # tail -f /var/log/ha-debug
You can test the iLO agent like this from the CLI:
host2:~ # stonith -t external/riloe -p hostlist ilo_hostname ilo_user ilo_password ilo_can_reset ilo_protocol ilo_powerdown_method -T reset
It's a one line command; use the parameters you hardcoded into your XML file. It should cold reboot the node within a few seconds.
Note: the STONITH action is considered an emergency, hence the iLO agent will just pull the cord; your journaling filesystem should take care of the unclean shutdown. The ssh agent is different: it executes a reboot command within the OS, which does a clean shutdown and therefore requires a significant amount of time to complete depending on your setup.
Remember: if you have implemented timing delays as explained in this guide, with many domUs, starting the heartbeat process at boot and stopping it at shutdown will take some time. Don't force it, it will complete; it's just the nature of the cluster.
I do recommend stopping the ssh STONITH agent and doing the proper test (killing the HA process as above) to ensure that the iLO STONITH works as well.
Resources:
Stop some domUs randomly with xm or virt-manager, see if HA brings them back up after a while, and monitor the logs.
Network failover:
Test the cluster against network issues. The best and easiest way is to pull the cable out of one of your member nodes (eth0 only, as we monitor only the NIC going back to our private LAN). Resources should start restarting or migrating onto the other node after a few moments. If you don't have access to the hardware or don't want to pull the cable, you can block the returning ping packets to your node, which should have the same effect:
host2:~ # iptables -I INPUT -p icmp -s 192.168.1.254 -d 192.168.1.2 -j DROP
The -s parameter is my gateway (the source of the returning packets) and -d is the server itself (the destination of the packets). It matches only the ICMP protocol (-p) and simply inserts (-I) this rule into the standard INPUT chain. You could add more options to match these packets more precisely, but this should do the job safely without the risk of blocking something you didn't mean to. The server should not reboot; your domU resources should move to another node. To get rid of this interim firewall rule, you can restart the XEN host or issue the following command:
host2:~ # iptables -D INPUT -p icmp -s 192.168.1.254 -d 192.168.1.2 -j DROP
Standby mode:
Test the cluster against standby mode and watch the resources move. Reboot the nodes, do some maintenance, etc. Once finished, put the node back into active mode: resources should stay where they are and the logs should contain no errors.
After a few reboots on both nodes, ensure that the DC (designated controller) role changes over in the HA cluster and that the EVMS volumes get discovered and activated properly on all nodes.
Check the configuration from time to time:
It's a good idea to do ad-hoc configuration checks, particularly after the configuration has changed:
host1:~ # crm_verify -LV
host1:~ #
An empty prompt is a good sign; anything else that is found gets displayed...
Proof of concept
The NFS domU hosts user home directories exported over NFS to store user data; for testing it is set up with 512MB of RAM. The domU is running on host1 and everything is as presented earlier in this document.
Writing 208MB file to the NFS export from my desktop:
geeko@workstation:~> ls -lh /private/ISO/i386cd-3.1.iso
-rw-r--r-- 1 geeko geeko 208M Nov 3 2006 /private/ISO/i386cd-3.1.iso
geeko@workstation:~> md5sum /private/ISO/i386cd-3.1.iso
b4d4bb353693e6008f2fc48cd25958ed /private/ISO/i386cd-3.1.iso
geeko@workstation:~> mount -t nfs -o rsize=8196,wsize=8196 nfs:/home/geeko /mnt
geeko@workstation:~> time cp /private/ISO/i386cd-3.1.iso /mnt
real 0m20.918s
user 0m0.015s
sys 0m0.737s
It wasn't very fast because my uplink was limited to 100Mbit/s at the time, but that's not what we are concerned about right now. Redo the test, but migrate the domain (put host1 into standby) while writing the same file to the NFS export:
geeko@workstation:~> time cp /private/ISO/i386cd-3.1.iso /mnt
real 0m41.221s
user 0m0.020s
sys 0m0.772s
Meanwhile on host1 (snippet):
host1:~ # xentop
xentop - 12:02:23 Xen 3.0.4_13138-0.47
2 domains: 1 running, 0 blocked, 0 paused, 0 crashed, 0 dying, 1 shutdown
Mem: 14677976k total, 1167488k used, 13510488k free CPUs: 4 @ 3000MHz
NAME STATE CPU(sec) CPU(%) MEM(k) MEM(%) MAXMEM(k) MAXMEM(%) VCPUS NETS NETTX(k) NETRX(k)
Domain-0 -----r 2754 47.6 524288 3.6 no limit n/a 4 4 1282795 4132024
migrating-nfs -s---- 8 0.0 524288 3.6 532480 3.6 1 1 17813
It was twice as long but:
nfs:~ # md5sum /home/geeko/i386cd-3.1.iso
b4d4bb353693e6008f2fc48cd25958ed /home/geeko/i386cd-3.1.iso
The md5 hash matches and that is what I wanted to see from the NFS domU. I checked the file system just in case (the NFS domU itself uses LVM2 on top of xvdc (mpath1) with XFS):
nfs:~ # umount /home
nfs:~ # xfs_check /dev/mapper/san1-nfshome
nfs:~ #
No corruption found.
For the record:
As of today the NFS domU uses 1G of RAM and shares the home space amongst many users. The disk still uses the XFS filesystem because of the CommVault7 backup software we use for the data partition; it is, by the way, the best fit for this purpose. Apart from minor adjustments (more RAM, an increased number of nfsd processes as we added more users) the system runs with nearly default settings.
We needed to tune the mount options for the users' NFS clients too. These were necessary to ensure data safety and instant reconnection if the NFS server reboots for some reason, such as a kernel update. The mount command (one line) looks like this:
geeko@sled:~> mount -t nfs -o rw,hard,intr,proto=tcp,rsize=32768,wsize=32768,nfsvers=3,timeo=14 nfs:/home/geeko /home
All our UNIX systems are nowadays configured with autofs, which retrieves a special record about the user's home location from our LDAP and mounts it instantly. It has been working absolutely hassle free; the domU has been migrated many times amongst the nodes under heavy access without any issues. The filesystem is clean, the data is safe and the shared storage is considerably efficient, although NFS is not my favorite protocol.
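For reference, the equivalent entry in a flat auto.home automounter map would look something like this; in our case the same record simply comes from LDAP instead (the key and server name follow the earlier examples):
geeko  -rw,hard,intr,proto=tcp,rsize=32768,wsize=32768,nfsvers=3,timeo=14  nfs:/home/geeko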
I/O Statistics:
Native SLES10 SP2 on HP DL585G2 hardware with EVA6000 SAN
Write: ~156M/s
Read: ~140M/s
CommVault7 best backup throughput was ~130G/hour
CommVault7 best restore throughput was ~75G/hour
XEN SLES10 SP2 domU on HP DL360G5 hardware with EVA6000 SAN
Write: ~132M/s
Read: ~121M/s
CommVault7 best backup throughput was ~116G/hour
CommVault7 best restore throughput was ~61G/hour
NFS performance from a SLED10 SP2 desktop (not as good, but still as fast as a local disk)
Write: ~49M/s
Read: ~27M/s
These measurements were taken mostly after hours without much load and show an average for that time. The SAN has a fairly large cache, hence the throughput figures may be skewed.
XEN host stats with average load across 8 domUs:
Load average: 0.4-0.8
I/O: ~3000 blocks/sec
Interrupts: ~2200/sec
Context switch: ~2500/sec
Idle: 91%
Conclusion
There may be an argument for many between XEN and VMware; no doubt both have their strengths and their place in the market. XEN is the child of Linux, and there is no question about its strength, efficiency and cost effectiveness, but unfortunately it lacks many of the fancy features, tools and utilities that VMware offers.
The XEN tools are getting better and evolve fast, but it will take a significant amount of time for the developers to catch up with others who have been doing this for a long time. Novell's effort to improve on this is quite clear, the support we get is very good and I am very happy with what I have managed to work out over the last 2 years.
For us, the deciding factors were efficiency and stability, not particularly the cost. At the time we had a verbal agreement to purchase ESX for some infrastructure developments; I believe we still own one copy but it's not being used as far as I know.
I don't have VirtualCenter, P2V, dynamic provisioning and many other great features, but the reality is that I don't need them. The main components are built into this solution:
- live migration without service interruption
- clustering and high availability
- auto failover
- efficient, centralized resource management, etc.
We needed static virtualization, domUs with a dedicated purpose, and that's what most small and mid sized businesses would want, at least for a start. XEN can serve these needs well in a very efficient, cost effective way.
I have to admit that I didn't plan on using SLES for this project in the early days, and this is not marketing for any flavor, just an individual opinion.
I tested Debian Linux, Fedora Core 5 and NetBSD 3 for both dom0 and domU, but SLES turned out to be the best for the dom0. That's what matters after all: the dom0 has to be rock solid and perfect. The components used in this guide were also developed by SUSE people as individual projects, hence there's no doubt that it's really the best you can use for this sort of thing.
Unlike the dom0, I still prefer to run INTERNET facing core systems on Debian Linux and I wouldn't use anything else. You wonder why? Cleanliness of code, community support, a vast number of packages, discipline and, last but not least, their standard policy of no feature upgrades within a release.
For example, our SMTP gateway domU consumes 387M of storage space (the installed system without logs), keeps away an average of 20K SPAM messages daily (most rejected on the spot, meaning they don't waste my CPU time), exchanges zillions of legitimate messages, checks for viruses and, last but not least, our false positive ratio is very low. Half of the filters would not be available for SLES and maintaining them could become an unnecessary hassle or overhead after a while.
Thanks to
In no particular order:
Jo De Baer
Andrew Beekhof
Lars Marowsky Bree
Dejan Muhamedagic
Alan Robertson
Novell support
Users from linux-ha mailing list