Setting up and maintaining an OpenMosix Cluster with lunar.


Table of Contents

Introduction
Concepts
Setting up the chroot
Building in the chroot
Compiling the OpenMosix kernel
NFS or not?
Filesystems
Transferring /usr to nfs
Going rsync!
Getting it on the nodes
And then what?
Setting up remote access
Maintenance
Appendix I - Discussed components:
Appendix II - shell scripts

Written by Auke Kok, <sofar@lunar-linux.org>. Copyright 2004 Auke Kok under the GNU Free Documentation License.

Introduction

This document describes the method that has been used to set up, control and maintain a large (16-node) OpenMosix cluster with Lunar-Linux.

The cluster can be administered and operated as if it were a single machine. This makes the OpenMosix cluster behave like a truly transparent cluster instead of a Beowulf cluster.

The concepts discussed in this document are not strictly needed to build an OpenMosix cluster. As a matter of fact, they may all be completely omitted. However, the benefits of the chosen methods are considerable in large-scale operations and provide you with enormous flexibility.

Note that the technical aspects of many of the discussed steps are not illustrated in depth. You are expected to acquire the appropriate documentation for those subcomponents yourself. There is plenty of documentation around for all the components used.

Concepts

The cluster discussed revolves around three basic components next to OpenMosix itself. It could even be argued that the OpenMosix component is largely unimportant in the process, since installing and rolling out nodes requires much more preparation than booting a specially compiled kernel.

The three mechanisms used can theoretically be replaced by any available mechanism that provides the same functionality. It is an arbitrary choice, just like python vs perl, oh wait, no, perl is definitely better. Go make your own well-informed choices.

  • Rsync - the remote synchronization tool. Rsync is widely used on the internet to synchronize mirrors and minimize bandwidth. For this reason rsync is perfect to roll out and upgrade images to cluster nodes. As rsync compares the target with the original, it makes sure not to waste any bandwidth on transferring unneeded changes. This is not only a timesaver, but also ensures that your master image is highly available to nodes. Rsync can be used in both the install phase and the maintenance phase.

  • NFS - Networked file system. NFS is only one of many networked filesystems available and certainly not the most secure one. However, its tight integration into current linux kernels and its easy-to-set-up components make it a viable solution for providing your nodes with applications. The lack of security is often not a problem, considering that your cluster will most likely not be internet-connected. Actually, NFS is only one of several subsystems used to provide basic networking functionality to the nodes. YP/NIS can provide authentication, or even LDAP or krb5, depending on the situation. The cluster discussed operates in a NIS environment, therefore NFS/NIS was an easy choice.

  • Chroot. Chroot environments provide you with the ability to build a system completely independent of its hardware. It is a perfect way of doing development without disturbing actual systems. Without chroot, many applications would not be possible. Chroot is used both in the installation and the maintenance phase of the discussed cluster.

The work order that was followed consisted of 3 phases:

  • The build phase. In the build phase you will prepare the image. This image is basically something that would sit on an install ISO, or even a knoppix ISO. As a matter of fact there is a knoppix OpenMosix ISO out there... Since we want a little bit more out of our cluster, however, we use the strength of lunar and compile everything ourselves. For this phase you only need a reasonably new and fast lunar installation, preferably running the same linux kernel version as the one that the OpenMosix patchset applies to. Throw in the latest lunar ISO and you are all set!

  • The rollout phase. In the rollout phase you initialize all cluster nodes. This will probably go wrong about a hundred times at first, so start by setting up a single-node cluster. Once you can install a single-node cluster without touching the keyboard, you know you are done tweaking the installation, and you can roll the image out to all nodes.

  • The maintenance phase. This is when you start doing upgrades to your cluster software, perhaps a kernel upgrade, local tweaks and other things like that. There are no special tricks in this phase anymore, except for making good use of the available tools, since they will save you a lot of time. This is when you will really appreciate the hard work from the build phase.

Setting up the chroot

To set up a chroot build environment you will need a linux box running the same architecture (not necessarily the same hardware!) as your cluster nodes. Basically any lunar installation will do. Make sure you have installed and booted a kernel version that is the same as, or newer than, the one the OpenMosix patchset applies to.

Get the latest lunar ISO. There are other ways of starting your chroot with the applications you need to compile and install things on your nodes, but a lunar ISO is small yet complete enough to feed your compile needs.

# make the chroot base directory:
$ mkdir /sandbox
# mount the lunar ISO
$ mount /dev/cdrom /mnt/cdrom
# copy the ISO contents into the sandbox
$ cp -a /mnt/cdrom/* /sandbox

You now have a fully functional chroot environment. However, you will see that you need to do quite a few things before it is actually useful in any way. At this point you will need to pull out everything you have and start compiling:

Building in the chroot

# enter the chroot:
$ chroot /sandbox
bash-2.05$ _

In theory you could skip this phase, but this is where you shape your new cluster the way you want it. Run `lunar optimize` to set your optimizations, then `lin moonbase` and `lin theedge` to make sure your lunar setup in the chroot environment is up to date. The next thing we will do is recompile all installed applications, optimized for the architecture you have chosen:

$ lin gcc glibc binutils

The list goes on after this: start with the modules in devel/, compilers/ and utils/. After those modules you can rebuild the rest in any order, but skip the kernel until the end.

While rebuilding, it is a *smart* thing not to turn on any init scripts or xinetd configurations for modules. As lunar tracks these, they might accidentally be removed when you upgrade a module later on. Instead, turn all of them off and note down the ones you eventually want running on your cluster nodes.

When you are done rebuilding the base components you can additionally install modules that you require to be present on your cluster. This list may include nfs-utils, portmap etc., but also clustering additions such as lam-mpi, although some of them may need to link specifically against your kernel.

Compiling the OpenMosix kernel

$ lin linux-openmosix

Installing the OpenMosix kernel is no more complicated than any other linux kernel. Make sure you enable the specific OpenMosix features that you need, if you need them. Mfs may look nice for instance, but I have yet to use it. Surprisingly that's all I'm gonna say about this.

Note that you cannot run lilo right now, as you would seriously hurt your MBR. Equally, grub won't do either. The only thing you can do is copy the kernel image over to /boot and leave it in there until we get to the next part.
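
Copying it by hand amounts to something like the line below (a sketch only; the source path and target name depend on your kernel tree and version):

# copy the freshly built OM kernel image into the chroot's /boot by hand
$ cp arch/i386/boot/bzImage /boot/vmlinuz-openmosix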

Only after you are done installing the OpenMosix kernel can you compile the openmosix tools. Technically, you should:

$ lin glibc

This will recompile your glibc against the OM (OpenMosix) kernel, and install the proper kernel headers in /usr/include. Now you are ready to:

$ lin openmosix-tools

This completes the base chroot build process. Of course there are many local additions and tweaks you may want to make from here, but these are up to your own insight and wishes.

NFS or not?

The OM nodes will need to run applications, and therefore need access to disk space. This disk space can be duplicated on every node. That results in faster operation, but it means you have to check and synchronize much more thoroughly than when most of your applications reside on a network filesystem.

For this reason I chose to mount /usr over NFS. The main advantage of this setup is that the root filesystem is incredibly small. In my case it is roughly 50MB, which includes copies of files in /usr that need to be accessible before /usr is mounted. You will stumble into this yourself if you choose this path, but it is rather easy to get around.

Having /usr on NFS requires that you have an NFS server available. The setup of an NFS server is of course not discussed in this manual; however, we do need to transfer data from the chroot environment onto the NFS filesystem. This brings us to the disk layout problem.

Filesystems

Here are the filesystems I have planned for my cluster:

rootfs - 1GB - local, ext3
swap - 2GB - local, swapfs
/usr - - nfs (lunar modules mostly)

Of course procfs is mounted, as well as tmpfs on /tmp, /var/tmp, /var/lock and a few other locations that I really want to be clean after a reboot. I chose to have a plain-jane static /dev, which means one daemon less in memory, and also means I can tweak device permissions in my chroot environment without the need to edit permissions on all nodes.
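
To illustrate this layout, a hypothetical node /etc/fstab could look roughly like the sketch below (device names, the NFS server path and the read-only mount are assumptions; adjust to your own setup):

# /etc/fstab sketch: tiny local root, swap, /usr over NFS, and tmpfs on
# locations that should be empty after every reboot
/dev/hda1           /          ext3   defaults      0 1
/dev/hda2           none       swap   sw            0 0
server:/export/usr  /usr       nfs    ro,nfsvers=2  0 0
proc                /proc      proc   defaults      0 0
none                /tmp       tmpfs  defaults      0 0
none                /var/tmp   tmpfs  defaults      0 0
none                /var/lock  tmpfs  defaults      0 0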

Transferring /usr to nfs

As long as you are online, you should be able to test-mount /usr on a temporary mountpoint and transfer the contents to the network share. As soon as you have done this, however, you will have a non-functional chroot setup. This is troublesome, so we need some form of re-entry tool to work in the chroot environment. For this I wrote a small script that does the following:

#!/bin/bash
mount -t nfs server:/export/usr /sandbox/usr &&
chroot /sandbox &&
umount /sandbox/usr

By calling this script, I 'enter' the sandbox and automatically mount /usr from the proper server. If I enable root write access, I can also write to the nfs filesystem as if it were local. As soon as I type 'exit' the chroot gets cleaned up. This script will grow over time, so keep it handy. I named it 'enter'... as a counterpart to 'exit'.

Now you can re-enter the sandbox and install modules in /usr just as on a normal box. You'll notice that you also want /proc mounted inside the chroot, so we add mount and umount calls to our 'enter' script:

#!/bin/bash
mount -t nfs server:/export/usr /sandbox/usr &&
mount -t proc procfs /sandbox/proc &&
chroot /sandbox &&
umount /sandbox/proc &&
umount /sandbox/usr

You can probably see where this is going now. Time for the next step!

Going rsync!

We will need to roll out the root filesystem to your nodes, and there is a fundamental difference between the contents of the nodes' root filesystems and the contents of your chroot environment. The first thing we need to do is make sure certain files do not get transferred to your nodes. The only way to do this is to 'enter' the chroot and list everything in /etc that should not be transferred to the nodes:

/etc/hostname
/etc/config.d/network
/etc/ssh/*key*
...etc...

This also means you will have to use alternative ways to configure these files on your nodes. You will have to adjust all relevant init.d scripts to make sure they do exactly what you want them to do.
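
For example, instead of shipping a per-node /etc/hostname in the image, a node can derive its own hostname at boot time. A rough init.d-style fragment is sketched below (the interface name and the reliance on reverse DNS are assumptions):

#!/bin/bash
# hypothetical boot-time fragment: derive the hostname from reverse DNS
# instead of keeping a per-node /etc/hostname in the master image
IP=$(ip -4 addr show eth0 | awk '/inet /{print $2}' | cut -d/ -f1)
NAME=$(getent hosts "$IP" | awk '{print $2}')
if [ -n "$NAME" ]; then
  echo "${NAME%%.*}" > /etc/hostname
  hostname -F /etc/hostname
fi

The ssh host keys can be handled in a similar way, by generating them on first boot when they are missing.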

The next thing to do is to set up a public rsync server that allows root to synchronize onto and out of the rsync repository. This is needed to make sure file permissions and special files are rsynced as well. We again edit the 'enter' script to synchronize the chroot, but now add the rsync calls (note that I'm omitting the proper rsync options for brevity here):

#!/bin/bash
rsync root@server::sandbox /sandbox &&
mount -t nfs server:/export/usr /sandbox/usr &&
mount -t proc procfs /sandbox/proc &&
chroot /sandbox &&
umount /sandbox/proc &&
rm /sandbox/etc/ssh/*key* &&
rsync /sandbox root@server::sandbox

Before we call this script for the first time we have to fill the rsync repository by issuing the last rsync call manually once, after you have deleted the unwanted files from your sandbox of course.
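
On the server side, the 'sandbox' module used above can be defined in /etc/rsyncd.conf. A minimal sketch (the path and allowed network are assumptions; tighten them to your own situation):

# minimal rsyncd.conf sketch: one writable module holding the sandbox image
uid = root
gid = root
use chroot = yes
[sandbox]
    path = /vol/sandbox
    read only = no
    hosts allow = 192.168.0.0/24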

Getting it on the nodes

Once you are done you only need to perform two tasks:

  1. Getting it on the nodes! This is easy. Just use your favorite imaging tool. I used systemimager, especially since it uses rsync; it just seamlessly fits in with the discussed installation method. You'll have to write an 'install' script that creates partitions and filesystems, rsyncs the rootfs and finally runs lilo on the newly imaged node (a rough sketch follows after this list).

  2. Getting it to boot! This is where you discover that you have made a typo in your systemimager script and correct it, recreate the boot floppy or adjust the script in the systemimager files, reboot, rinse, lather, repeat! This step took me about 30 reboots or so, but the results are worth it: I can now roll out a new node in under 5 minutes without using a keyboard or mouse. The nasty part is not installing the rsync copy, but getting your node to boot, configure networking, mount /usr and perform all the vital tasks in the right order. Having /usr on NFS unfortunately makes it a bit more complicated than needed.
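
A rough sketch of such an install script is given below. It is only an outline: the disk device, partition sizes and the rsync server/module are assumptions, and systemimager can generate this kind of script for you.

#!/bin/bash
# hypothetical node install sketch, run from the install/boot medium
set -e
# partition the disk: ~1GB root, ~2GB swap (sizes in MB, rest left empty)
sfdisk -uM /dev/hda << 'EOF'
,1024,L
,2048,S
;
;
EOF
mke2fs -j /dev/hda1                       # ext3 root filesystem
mkswap /dev/hda2
mount /dev/hda1 /mnt
# pull the prepared root image from the rsync server
rsync -a --numeric-ids 192.168.0.1::openmosix/ /mnt/
# install the boot loader inside the freshly imaged root
chroot /mnt /sbin/lilo
umount /mnt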

And then what?

Well, if you have made it here you obviously have your first node telling you it is done booting. Possibly now you find out that you haven't set the root password, or that it didn't start up a vitally important service. Of course, this is not really nice, so we need some way of adjusting the node rather than having to boot that ISO on the node every time.

In short, we want to transfer our rootfs from the chroot working environment to the nodes as directly as possible. With the ISO the scheme looks like this:

chroot -> rsync -> rollout iso -> node

But nothing keeps us from bypassing the ISO stage as soon as our node is networked and capable of doing an rsync:

chroot -> rsync -> node

Now you are getting to the point where you will need some form of remote control to your cluster. As long as you have one box you'll be fine, especially if you have a keyboard and a monitor hooked up to it. Imagine however having 4 boxes... or 40, or maybe 400! There's no way you can log into every box, type the root password, execute a few commands, etc...

The most important thing to create is a script that synchronizes your node with the central rsync copy of your sandbox. You might want to consider alternative techniques, but I chose a GET-like approach that makes every node synchronize itself with the master image. This script resides on every node and, when called, makes the node check whether it needs an update and, if so, perform it by rsyncing from the appropriate server.

As part of this process you might actually want to consider running lilo automatically, to ensure that your boot sequence stays in order when an update puts a new kernel in /boot, along with any other actions you require to be performed for booting.

Setting up remote access

In order to control a large group of machines you can use several solutions that allow you to execute remote commands. The two major remote shell tools available are rsh (module netkit-rsh) and ssh. Ssh offers you some major advantages, but rsh may be needed by other clustering tools as well, so check them both out before you make a decision.

Having remote access to one node is nice, but you need to fan out access to all nodes. For this you can write shell scripts that loop over all your nodes, or use something like dsh to send a command to all your nodes at the same time. dsh will attempt to connect to all nodes and reports back in case some of the nodes cannot be reached.
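
If dsh is not available, a tiny fan-out wrapper is easy to write yourself. A sketch (the node list file and root ssh access are assumptions):

#!/bin/bash
# run the given command on every node listed in /etc/cluster-nodes and
# report nodes that fail; -n keeps ssh from eating the node list on stdin
while read node; do
  echo "== $node =="
  ssh -n root@"$node" "$@" || echo "!! $node failed or unreachable"
done < /etc/cluster-nodes

Saved as, say, 'forall', it would be called as `forall uptime` to run uptime on every node.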

You also might want to install sudo and/or suauth features; this allows you to work as a normal user and to specify that a limited group of people is allowed to perform maintenance tasks, like rebooting the cluster or performing an rsync upgrade. In combination with the previous two techniques you will be able to perform emergency upgrades of your cluster in a matter of minutes.

Maintenance

Maintenance is nothing more than tweaking your box until the hardware has become obsolete. Fortunately lunar allows you to upgrade your boxes almost forever. Software lives, hardware dies. This will also happen to your nodes, but! As OpenMosix allows you to join and remove nodes while your cluster is running, and as lunar provides you with a way of performing upgrades to your cluster, you can combine these two and have your cluster live forever, while your nodes 'die' and new ones get moved into the cluster.

Having a proper rootfs image and your toolchain handy will allow you to easily outperform any clustering solution. Most of all it requires a solid design and well-informed choices before you commit to a certain solution. It may not even be a bad choice to design a custom system for your specific situation, because in the end it is you who will need to do the maintenance.

Appendix I - Discussed components:

NIS/YP - Network Information System/Yellow pages

NIS/YP allows users to be authenticated against a remotely administered user/password database.

NFS - Networked File system

NFS provides networked filesystems. This way you can provide applications and user files across all your nodes on demand.

Ssh/rsh - shell clients

Provide remote execution, either encrypted or not.

Dsh - distributed shell

Dsh provides an interface between many nodes and rsh/ssh.

Rsync - remote synchronization

This allows you to synchronize filesystems and files with great efficiency and speed, as no unneeded changes are transferred.

Chroot - change root

Chroot allows you to build a system on another machine without actually booting it. It also protects your host system against any program running 'inside' the chroot environment.

SystemImager - distribution bootstrapping tool

SystemImager uses rsync to initialize machines. It can easily install any customized linux distribution.

Appendix II - shell scripts

The 'enter' script is an easy way to maintain an rsync copy of a fully functional build tree/image without the need for an actual machine to run it on. It can be regenerated on any PC by extracting the contents from the rsync server.

In this particular script the chroot environment is placed in /scratch/openmosix and the rsync server holds a full copy of it, defined in $RSYNCHOST. The script bumps the serial number automatically to make sure your nodes are synced and you don't accidentally miss one. It also cleans out files we never want written to our nodes.

#!/bin/bash

RSYNCHOST=root@server:/vol/openmosix-root

echo "Syncing with rsync server: "
if [ -e /scratch/openmosix/dirty ] ; then
  echo "COPY IS DIRTY, NOT RSYNCING!!!"
else
  rsync -vrlpogDt -x --delete --links --numeric-ids --rsh=ssh $RSYNCHOST/ /scratch/openmosix
  touch /scratch/openmosix/dirty
fi

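# mount /proc and the NFS application trees inside the sandbox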
mount none /scratch/openmosix/proc -t proc
mount appserver:/export/usr /scratch/openmosix/usr -t nfs -o rw,nfsvers=2
mount appserver:/export/local /scratch/openmosix/usr/local -t nfs -o ro,nfsvers=2

echo "**** root setup"
chroot /scratch/openmosix /bin/bash
echo "**** exited chroot"

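# bump the image serial number so nodes can detect that a newer image exists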
SERIAL=$(cat /scratch/openmosix/serial)
((SERIAL++))
echo ""
echo "NEW SERIAL == $SERIAL"
echo ""
echo "$SERIAL" > /scratch/openmosix/serial

chroot /scratch/openmosix /sbin/ldconfig

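# clean out per-node and volatile files that must never reach the nodes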
rm -rf /scratch/openmosix/etc/ssh/*key*
rm -rf /scratch/openmosix/root/.cpan
rm -rf /scratch/openmosix/tmp/*
rm /scratch/openmosix/dirty          &&
umount /scratch/openmosix/usr/local  &&
umount /scratch/openmosix/usr        &&
umount /scratch/openmosix/proc       &&

echo "Syncing with rsync server: "
rsync -vrlpogDt -x --delete --links --numeric-ids --rsh=ssh /scratch/openmosix/ $RSYNCHOST

The 'rsync' script is placed in /root/rsync and can be executed on my systems by a user who has explicit sudo permissions. This user is defined in /etc/sudoers. The script is only readable by root, which protects it a bit. When using dsh it can be called as `dsh 'sudo /root/rsync'`. All nodes will attempt to synchronize immediately, which is why you need to use rsync; using tarballs or other mechanisms would stress the server too much as you add more nodes to your cluster.

#!/bin/bash
SERVER=192.168.0.1
/usr/bin/rsync $SERVER::openmosix/serial /tmp/serial
echo "SERIAL=$(cat /tmp/serial)"
if [ "$(cat /tmp/serial)" != "$(cat /serial)" ] ; then
  rm /tmp/serial
  echo 'Renewing host:'
  /usr/bin/rsync -vrlpogDt --links --numeric-ids --exclude=lost+found/ --exclude=/usr --exclude=/mfs $SERVER::openmosix /
  # we use lilo, and need to do this if the update installed a new kernel or lilo.conf:
  lilo
else
  rm /tmp/serial
  echo 'no renewal needed'
fi
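
A hypothetical /etc/sudoers fragment granting this could look as follows (the group name is an assumption):

# allow members of the 'cluster' group to trigger the node sync, and
# nothing else, without a password
%cluster ALL = NOPASSWD: /root/rsync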

You can expand the number of scripts to run your system as you see fit. A reboot script and a 'tweak' script could help perform easy administration tasks. However, the base rsync script should always give you the exact configuration you need. This will allow you to introduce new nodes without the need to patch them immediately after installation.