Last update: 2007-02-12

// C O N T R O L - A L T - D E L . O R G

"Fate strikes down the strong man, Everyone weep with me " - O Fortuna
NAVIGATION>>
HOME PICS ABOUT NEWS CODE BOOKS MAIL
:: CODE
The three chief virtues of a programmer are: Laziness, Impatience and Hubris
-- Larry Wall
CONTACT
:. FREP - A cluster file replication / synchronization application

The problem:

You've got 100+ servers in your web server cluster, and would like to keep them all synchronized from a centralized location. Furthermore, you'd like that replication to be instantaneous.

The solution: FREP

 
:. Introduction

FREP was born out of a need I had to keep a cluster of server synchronized to a central location with instant file replication (one-way replication).

I realized that polling for file changes was going to be out of the question since the application would get progressively slower the more files were added.

I wanted something that wasn't going to be a ressource hog, and since I was looking at replicating over 200,000 files I wanted something that would scale too. When inotify was released in the vanilla linux kernel, I knew I had found the tool I needed to accomplish this job.

The next part of the puzzle that I needed to figure out what the communication mechanism that I was going to need to have in place to communicate changes to a large number of clients. With IP broadcast and multicast capabilities, the Spread cluster message toolkit was a natrual fit. Using Spread's multicasting capabilities, FREP should be able to scale to hundreds (if not thousands) of nodes.

The final piece of the puzzle was figuring out a bandwidth efficient mechanism for file transfers. I initially wanted to implement bits of the rsync protocol, but instead settled on a combination of Zlib compression on file chunks combined with using the generic diff format specification for partial data transfers. This combination should minimize the amount of bandwidth necessary to have file changes replicated to the cluster.

The system consists of several components:

  • FREP_Server: The file monitor daemon
  • FREP_Resync: The resynchronization daemon
  • FREP_FileServer: A generic file server
  • FREP_Client: The client installed on the nodes
  • FREP_ResyncClient: A resynchronization client installed on the nodes as a failsafe mechanism
  • FREP_Mon: Testing application
 
:. Document conventions

Commands to run use the Courier New font and are highlighted in orange:

/path/to/a/command/to/run --options

 


File contents are always displayed within a text area field:

 

(sorry firefox folks, copy-paste button only works with IE)


Explanations use Arial – This text.

 


 
:. Assumptions
  • Working Linux server
  • You need to be root to install these packages
  • Linux kernel with inotify compiled into the kernel (2.6.13+)
  • Perl 5.6+ (tested with 5.8)
  • Libevent installed and working properly. I highly recommend that you ensure that libevent can use the epoll() syscall mechanism.
  • Daemontools installed
  • Working Spread network. The Spread configuration depends on the number of nodes you'll have in the cluster and the data throughput (filechange bandwidth) you are expecting. More on this later.
  • You'll need to have an idea of how many files you are gong to want to watch. More on this later.

 
:. Version history

13/02/2007 - Version 1.0 Initial public release

15/02/2007 - Version 1.1 Bug fixes, configuration management

:. Architectural considerations

This application was designed with the following goals in mind:

  • Speed
  • Reliability/fault tolerance
  • Efficiency

These goals have been achieved by implementing the following features:

  • Intelligent file transfers using a generic binary difference algorithm
  • Compressed data transfers.
  • Reliable message transport mechanism which can leverage broadcast and multicast IP using the Spread cluster messaging toolkit.
  • Automatic resynchronization with a stream-based approach. File lists are sent to clients by traversing the filesystem hierarchy incrementally.
  • Kernel based file change notification using Inotify.
  • Edge-level event notification mechanisms using libevent (which in turn uses the epoll() kernel call).
  • Client automatically drops root privileges on start up, all files created by the client owned by an unprivileged user.
  • Several levels of fault-tolerance using application retries and automatic resynchronization.
  • When unicast traffic is necessary, the system uses the kernel sendfile() functionality for zero-copy file transfer operations.
  • Messages are split between several synchronization groups. (server roles dictate which files will be synchronized).
  • When possible, expensive operations are cached.

Here's a visio diagram of the client and server operation

 

 

:. Protocol specifications

Synchronization application: FREP_Server


The following messages are sent by the synchronization application using the spread toolkit.

 

Each message is sent to a specific synchronization group as defined by the configuration hash and the file path.

 

File creation

 

Command:CREATE\n
Chunk:<data chunk count>\n
LastChucnk:<set to 1 when current chunk is the last chunk>\n
Cookie:<cookie value>\n
FilePath:<filepath>\n
FileSig:<file signature>\n
__DATA__\n
<Compressed data>__DATAEND__

 

File modification

Command:MODIFY\n
Chunk:<data chunk count>\n
LastChucnk:<set to 1 when current chunk is the last chunk>\n
Cookie:<cookie value>\n
FilePath:<filepath>\n
FileSig:<file signature>\n
OriginFileSig:<shadow copy file signature>\n
__DATA__\n
<Compressed delta data>__DATAEND__

 

File deletion


Command:DELETE\n
FilePath:<filepath>\n

 

File move

Command:MOVE\n
FilePath:<old filepath>\n
MovedTo:<new filepath>\n

 

Folder move


Command:MOVEDIR\n
FilePath:<old filepath>\n
MovedTo:<new filepath>\n

 

Heartbeat

Command: HEARTBEAT\n


Resynchronization application : FREP_Resync


The resynchronization application sends out the following messages using the Spread toolkit to specific synchronization groups.

 

Re-synchronization

Command:Sync\n
Chunk:<data chunk count>\n
LastChucnk:<set to 1 when current chunk is the last chunk>\n
Cookie:<cookie value>\n
__DATA__\n
<Compressed synchronization data>__DATAEND__

 

The synchronization data format is:

DIR:/file/path/1\n
Fileentry1\n
Fileentry2\n
FileentryN\n
DIR:/file/path/2\n
FileEntry1\n
DIR:/file/path/N\n
FileEntryN\n

 

Note: The file entries are relative to the last directory entry.

 

Synchronization client : FREP_Client


The synchronization client uses the following commands when communicating with the FREP_FileServer.

 

Retrieve file


GET <filepath>\n

 

Retrieve file signature


SIG <filepath>\n

 

File server: FREP_FileServer

 

Signature format  (response)

<md5 hex hash>|<filesize in bytes>|<file modification time>\n

 

File Transfer

 

The files are transferred as binary data with no encapsulation or header information.

:. Installation - Server side applications

The following instructions are for the installation of the server side code (the server that will be the main file repository).

Download and install prerequisite perl modules.

perl -MCPAN -e shell
cpan> install Bundle::CPAN IO::Stringy Compress::Zlib
cpan> install Sys::Syscall Digest::Adler32 Digest::MD5
cpan> install Linux::Inotify2 File::Path Event::Lib Spread
cpan> install Algorithm::GDiffDelta BerkeleyDB MLDBM Storable

 

Create some run scripts for daemontools

mkdir /etc/FREP_Server /etc/FREP_FileServer /etc/FREP_Resync
cat <<EOF >/etc/FREP_Resync/run
#!/bin/sh
exec > /dev/null
exec 2>&1
exec /usr/sbin/FREP_Resync
EOF
chmod +x /etc/FREP_Resync/run
cat <<EOF >/etc/FREP_Server/run
#!/bin/sh
exec > /dev/null
exec 2>&1
exec /usr/sbin/FREP_Server
EOF
chmod +x /etc/FREP_Server/run
cat <<EOF >/etc/FREP_FileServer/run
#!/bin/sh
exec > /dev/null
exec 2>&1
exec /usr/sbin/FREP_FileServer
EOF
chmod +x /etc/FREP_FileServer/run

 

Next up, the applications.

 

Edit /usr/sbin/FREP_Server:

 

Edit /usr/sbin/FREP_FileServer :

 

Edit /usr/sbin/FREP_Resync:

 

Make them execuatble:

chmod +x /usr/sbin/FREP_FileServer
chmod +x /usr/sbin/FREP_Server
chmod +x /usr/sbin/FREP_Resync

 

Before going any further...

 

BIG SECURITY NOTICE:

***************************************************

THESE SERVICES ARE NOT DESIGNED TO FACE THE INTERNET. YOU SHOULD HAVE A FIREWALL BETWEEN THIS APPLICATION AND THE INTERNET.

***************************************************

 

That said, on to the configuration.

 

Edit /etc/FREP.conf:

 

There's basically six different things you need to configure to get FREP up and running. They are...

 

Spread configuration:

 

<!-- Spread configuration -->
<Spread>
<Port>4803</Port>
<Address>192.168.1.100</Address>
</Spread>

 

This defines the address and port of the Spread daemon. This should be localhost, but you could run the spread daemon on another box if you wanted to.

 

Pools:

 

<!-- Pool configuration -->
<!-- Note: all watched dirs must be located on the same device -->
<Pool name="Dynamic">
<Dir>/www/mysite.com/ad/</Dir>
<Dir>/www/mysite.com/images/</Dir>
<Dir>/www/mysite.com/php/</Dir>
<Dir>/www/Smarty/</Dir>
</Pool>
<Pool name="Static">
<Dir>/www/mysite.com/images/</Dir>
<Dir>/www/mysite.com/html/</Dir>
</Pool>

 

Each pool represents a group of severs that serve a specific function. A pool has to have a name, and a list of folders that we'll want to watch.

 

Ignore list:

 

<!-- Ignore list -->
<IgnoreList>
<Dir>/www/mysite.com/ad/ignoreme</Dir>

<Dir>/www/mysite.com/html/topsecret</Dir>
</IgnoreList>

 

The ignore list is a set of paths that we _DON'T_ want to synchronize or replicate.

 

Synchronization configuration:

 

<!-- synchronization configuration -->
<Sync>
<MaxWatches>524288</MaxWatches>
<ShadowRoot>/www/ShadowRoot</ShadowRoot> <!-- MUST BE ON SAME DEVICE AS WATCHED FILES -->
<HeartbeatTimer>3</HeartbeatTimer>
<QueueTimer>5</QueueTimer>
<Debug>0</Debug>
</Sync>

 

This section defines the maximun number of of files+folders (MaxWatches), if you try to watch more than this number of files+folders, it'll barf.

 

We also have the ShadowRoot configuration directive, which defines where FREP will create a copy of all the watched files/folders. This is used to be able to do partial file transfers (send only the changed parts of a file when it's modified). The ShadowRoot file path MUST be on the same device as the watched files.

 

The HeartBeatTimer directive tells FREP how often it should be sending out heartbeats to the client nodes. The clients will use this variable to determine if they need to reconnect to the spread network.

 

The QueueTimer defines how long FREP_Server will keep the received events in the queue before starting to process them. This is to avoid duplicate events triggering needless IO. For example, if you are writing a file in 5 chunks, inotify will be sending multiple file modification events. The queue is designed to schlep these all together. It also handles sequential operations. For example, suppose you open a file, write to it, close it, then delete it FREP_Server will only send 1 event (delete).

 

File server configuration:

 

<!-- FileServer configuration -->
<FileServer>
<Port>9000</Port>
<Address>192.168.1.100</Address>
<ValidPathRegex>^/www/</ValidPathRegex>
<Debug>0</Debug>
<SignatureCache>/tmp/FileCache.db</SignatureCache>
<ConnectionTimeout>5</ConnectionTimeout>
</FileServer>

 

The fileserver's address should be the IP address of the server that the FREP_Server sync daemon is running on. The port should be any unused port.

 

The ValidPathRegex is a safety measure that defines a perl-compatible regular expression that will be matched to see if the file signature or transfer will be allowed. As an example, if you're synchronization folders are /www/mysite.com and /files/syncme, you'd set the ValidPathRegex to: (?:/www/mysite.com|/files/syncme).

 

The FREP_FileServer keeps a BerkeleyDB cache of file signatures to speed things up. the SignatureCache directive tells the file server where to store it.

 

The ConnectionTimeout is the amount of time the fileserver will wait for data on a socket (for reading or writing). 5 seconds should be fine.

 

Resynchronization configuration:

 

<!-- Resynchronization configuration -->
<Resync>
<ResyncTimer>300</ResyncTimer>
<Debug>0</Debug>
</Resync>

 

The FREP_Resync daemon only has one tweakable setting, the ResyncTimer, which is the frequency at which it'll broadcast the full file list to all the clients.

 

Note:

The debug flag controls if debugging to syslog is enabled. These daemons are pretty verbose when the debug flag is on, so leave it off unless you're debugging :)

 

Start up the daemons:

 

cd /service
ln -s /etc/FREP_Server
ln -s /etc/FREP_FileServer
ln -s /etc/FREP_Resync

 

Wait a few seconds, daemontools should fire them all up. Check to make sure they are running with:

 

svstat /service/FREP*

 

That's all there is to do for the server side applications. Now onto the clients.

 
 
:. Installation - Client applications

These are the instructions for the installation of the client applications.

First, the perl modules

perl -MCPAN -e shell
cpan> install Bundle::CPAN IO::Stringy Compress::Zlib
cpan> install Digest::Adler32 Digest::MD5
cpan> install File::Path Event::Lib Spread
cpan> install Algorithm::GDiffDelta Getopt::Long


Create some run scripts for daemontools

mkdir /etc/FREP_Client /etc/FREP_ResyncClient
cat <<EOF >/etc/FREP_Client/run
#!/bin/sh
exec > /dev/null
exec 2>&1
exec /usr/sbin/FREP_Client --pool=Static \
--userid=1500 --groupid=100 --spread-port=4803 \
--spread-address=192.168.1.100 --fs-port=9000 \
--fs-address=192.168.1.100
EOF
chmod +x /etc/FREP_Client/run
cat <<EOF >/etc/FREP_ResyncClient/run
#!/bin/sh
exec > /dev/null
exec 2>&1
exec /usr/sbin/FREP_ResyncClient --pool=Static \
--userid=1500 --groupid=100 --spread-port=4803 \
--spread-address=192.168.1.100 --fs-port=9000 \
--fs-address=192.168.1.100
EOF
chmod +x /etc/FREP_ResyncClient/run



Next, the file replication client. Edit /usr/sbin/FREP_Client:

 

Edit /usr/sbin/FREP_ResyncClient:

 

Edit /usr/sbin/FREP_Mon:

Make the files executable:

chmod +x /usr/sbin/FREP_Client
chmod +x /usr/sbin/FREP_ResyncClient
chmod +x /usr/sbin/FREP_Mon

 

Configuration options for FREP_Client:

Turn on debugging: --debug
Name of server pool: --pool=<Pool name>

UserID to run as: --userid=<uid>

GroupID run as: --groupid=<gid>
Spread port (usually 4803): --spread-port=<port>
Spread server IP address: --spread-address=<ip>
Port that the file server is running on: --fs-port=<port>
IP address of the file server: --fs-address=<ip>
Heartbeat timer (default 20): --heartbeat=<seconds>
Queue timer (default 5): --queue-timer=<seconds>

 

Configuration options for FREP_ResyncClient:

Turn on debugging: --debug
Name of server pool: --pool=<Pool name>

UserID to run as: --userid=<uid>

GroupID run as: --groupid=<gid>
Spread port (usually 4803): --spread-port=<port>
Spread server IP address: --spread-address=<ip>
Port that the file server is running on: --fs-port=<port>
IP address of the file server: --fs-address=<ip>

Configuration options for FREP_Mon:

Name of server pool: --pool=<Pool name>

Spread port (usually 4803): --spread-port=<port>
Spread server IP address: --spread-address=<ip>


Note: The userid and groupid's should match on the FREP_Client and FREP_ResyncClient.

 

 

Start the services

 

cd /service
ln -s /etc/FREP_Client
ln -s /etc/FREP_ResyncClient

 

Wait a few seconds, daemontools should fire them all up. Check to make sure they are running with:

 

svstat /service/FREP*

 

:. TODO

Some of the things I'll get around to eventually:

  • Add flag in client to specify temporary folder (rename()'s across devices will fail, see bugs below).
  • More testing maybe.
:. Spread configuration

The beauty of using Spread as a messaging toolkit is it's ability to use IP broadcast or mutlicast for sending out it's messages.

If you've got a modest number of nodes, I'd recommend making every client node part of the Spread network (and have the client connect to localhost).

For larger setups, you probably want to configure one Spread server every 20 clients or so.

In theory, this solution should scale to hundreds or thousands of clients since we can configure the Spread network to multicast on several VLANs. It may also even be possible to use this replication mechanism across WAN links by tunnelling multicast traffic.

:. References

The Spread Toolkit

Generic diff format specification

Algorithm::GdiffDelta perl module

Zlib compression library

Libevent

:. Bugs / Limitations

This system works well with relatively small file sizes, I haven't done very extensive tests on large files.

I am almost 100% sure that trying to keep large (>1 meg) files synched up won't be a problem UNLESS you are modifying them frequently.

There is a 2 gig limitation on the perl Algorithm::GdiffDelta module. Also the bigger the file, the longer it takes to calculate the differences when the files are modified.

The environment I've deployed FREP in involves many (300,000+ files/folders) that are changed at a rate of about 2 files per second. The average file size is probably around 15kb.

I haven't tested this out with Spread 4.X, your mileage may vary.

There's a bug in the client code, I don't check if the folder I'm synching to is on the same device as the system's default temporary folder path (/tmp on linux). This means that if your files are on a different device, the rename() call will fail, and nothing will work. I'll fix this eventually if anyone asks.

© copyright 2007 Mark Steele