:. FREP - A cluster file replication / synchronization application
The problem:
You've got 100+ servers in your web server cluster, and would like to keep them all synchronized from a centralized location. Furthermore, you'd like that replication to be instantaneous.
The solution: FREP
:. Introduction
FREP was born out of a need I had to keep a cluster of servers synchronized to a central location with instant file replication (one-way replication).
I realized that polling for file changes was going to be out of the question since the application would get progressively slower the more files were added.
I wanted something that wasn't going to be a resource hog, and since I was looking at replicating over 200,000 files I wanted something that would scale too. When inotify was released in the vanilla Linux kernel, I knew I had found the tool I needed to accomplish this job.
The next part of the puzzle that I needed to figure out was the communication mechanism I would need to have in place to communicate changes to a large number of clients. With IP broadcast and multicast capabilities, the Spread cluster message toolkit was a natural fit.
Using Spread's multicasting capabilities, FREP should be able to scale to hundreds (if not thousands) of nodes.
The final piece of the puzzle was figuring out a bandwidth efficient mechanism for file transfers. I initially wanted to implement bits of the rsync protocol, but instead settled on a combination of Zlib compression on file chunks combined with using the generic diff format specification for partial data transfers. This combination should minimize the amount of bandwidth necessary to have file changes replicated to the cluster.
The system consists of several components:
FREP_Server: The file monitor daemon
FREP_Resync: The resynchronization daemon
FREP_FileServer: A generic file server
FREP_Client: The client installed on the nodes
FREP_ResyncClient: A resynchronization client installed on the nodes as a failsafe mechanism
FREP_Mon: Testing application
:. Document conventions
Commands to run use the Courier New font and are highlighted in orange:
/path/to/a/command/to/run --options
File contents are always displayed within a text area field:
Explanations use Arial – This text.
:. Assumptions
Working Linux server
You need to be root to install these packages
Linux kernel with inotify compiled into the kernel (2.6.13+)
Working Spread network. The Spread configuration depends on the number of nodes you'll have in the cluster and the data throughput (filechange bandwidth) you are expecting. More on this later.
You'll need to have an idea of how many files you are going to want to watch. More on this later.
:. Version history
13/02/2007 - Version 1.0 Initial public release
15/02/2007 - Version 1.1 Bug fixes, configuration management
:. Architectural considerations
This application was designed with the following goals in mind:
Speed
Reliability/fault tolerance
Efficiency
These goals have been achieved by implementing the following features:
Intelligent file transfers using a generic binary difference algorithm
Compressed data transfers.
Reliable message transport mechanism which can leverage broadcast and multicast IP using the Spread cluster messaging toolkit.
Automatic resynchronization with a stream-based approach. File lists are sent to clients by traversing the filesystem hierarchy incrementally.
Kernel based file change notification using Inotify.
Edge-level event notification mechanisms using libevent (which in turn uses the epoll() kernel call).
Client automatically drops root privileges on start up; all files created by the client are owned by an unprivileged user.
Several levels of fault-tolerance using application retries and automatic resynchronization.
When unicast traffic is necessary, the system uses the kernel sendfile() functionality for zero-copy file transfer operations.
Messages are split between several synchronization groups (server roles dictate which files will be synchronized).
When possible, expensive operations are cached.
Here's a visio diagram of the client and server operation
:. Protocol specifications
Synchronization application: FREP_Server
The following messages are sent by the synchronization application using the Spread toolkit.
Each message is sent to a specific synchronization group as defined by the configuration hash and the file path.
File creation
Command:CREATE\n
Chunk:<data chunk count>\n
LastChucnk:<set to 1 when current chunk is the last chunk>\n
Cookie:<cookie value>\n
FilePath:<filepath>\n
FileSig:<file signature>\n
__DATA__\n
<Compressed data>__DATAEND__
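To make the layout above concrete, here is a minimal Python sketch that assembles one CREATE message. It is an illustration, not FREP's own code: the function name and argument types are made up, the signature is passed in as an opaque string because the text does not say how it is computed, and sending "0" for non-final chunks is an assumption (the spec only says the field is 1 on the last chunk). Python's zlib module stands in for whatever Zlib binding FREP uses.

```python
import zlib

def build_create_message(chunk, last_chunk, cookie, file_path, file_sig, data):
    """Assemble one CREATE message: header lines, then the compressed chunk.

    Hypothetical helper: names and types are illustrative only.
    """
    headers = [
        ("Command", "CREATE"),
        ("Chunk", str(chunk)),
        # Field name spelled exactly as in the spec above.
        # Assumption: "0" is sent when this is not the last chunk.
        ("LastChucnk", "1" if last_chunk else "0"),
        ("Cookie", str(cookie)),
        ("FilePath", file_path),
        ("FileSig", file_sig),
    ]
    head = "".join(f"{key}:{value}\n" for key, value in headers).encode("utf-8")
    return head + b"__DATA__\n" + zlib.compress(data) + b"__DATAEND__"

message = build_create_message(
    1, True, 42, "/www/mysite.com/php/index.php", "placeholder-signature", b"hello"
)
```

A MODIFY message would differ only in the Command value, the extra OriginFileSig header, and a payload holding compressed delta data instead of the file contents.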
File modification
Command:MODIFY\n
Chunk:<data chunk count>\n
LastChucnk:<set to 1 when current chunk is the last chunk>\n
Cookie:<cookie value>\n
FilePath:<filepath>\n
FileSig:<file signature>\n
OriginFileSig:<shadow copy file signature>\n
__DATA__\n
<Compressed delta data>__DATAEND__
The resynchronization application sends out the following messages using the Spread toolkit to specific synchronization groups.
Re-synchronization
Command:Sync\n
Chunk:<data chunk count>\n
LastChucnk:<set to 1 when current chunk is the last chunk>\n
Cookie:<cookie value>\n
__DATA__\n
<Compressed synchronization data>__DATAEND__
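On the receiving side, all three message types share the same shape: header lines, a __DATA__ marker, a compressed payload, and a trailing __DATAEND__. The following Python sketch shows one way a client could take such a message apart. The function name is hypothetical, and it assumes the payload is a standard zlib stream and that headers never contain a newline; the actual client may work differently.

```python
import zlib

DATA_START = b"__DATA__\n"
DATA_END = b"__DATAEND__"

def parse_message(raw):
    """Split a raw message into a header dict and its decompressed payload."""
    # The first __DATA__ marker always sits right after the headers.
    head, sep, rest = raw.partition(DATA_START)
    if not sep or not rest.endswith(DATA_END):
        raise ValueError("missing __DATA__ or __DATAEND__ marker")
    headers = {}
    for line in head.decode("utf-8").split("\n"):
        if line:
            # Split on the first colon only, so values may contain colons.
            key, _, value = line.partition(":")
            headers[key] = value
    # Only the final marker is stripped, so the marker bytes may safely
    # occur inside the compressed payload.
    payload = zlib.decompress(rest[: -len(DATA_END)])
    return headers, payload
```

Checking for the end marker with endswith() rather than searching for it matters because compressed data is arbitrary binary and could contain the marker bytes by chance.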
This defines the address and port of the Spread daemon. This should be localhost, but you could run the Spread daemon on another box if you wanted to.
Pools:
<!-- Pool configuration -->
<!-- Note: all watched dirs must be located on the same device -->
<Pool name="Dynamic">
<Dir>/www/mysite.com/ad/</Dir>
<Dir>/www/mysite.com/images/</Dir>
<Dir>/www/mysite.com/php/</Dir>
<Dir>/www/Smarty/</Dir>
</Pool>
<Pool name="Static">
<Dir>/www/mysite.com/images/</Dir>
<Dir>/www/mysite.com/html/</Dir>
</Pool>
Each pool represents a group of servers that serve a specific function. A pool has to have a name, and a list of folders that we'll want to watch.
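Because a folder can appear in more than one pool (/www/mysite.com/images/ is in both pools above), a changed file may concern several groups of servers. The Python sketch below shows that lookup. It is only an illustration: the <Pools> wrapper element is hypothetical and added so the fragment parses as one XML document, and the prefix match is an assumption about how paths are mapped to pools.

```python
import xml.etree.ElementTree as ET

# The two <Pool> elements from the example, wrapped in a hypothetical
# <Pools> root element so the fragment is a single XML document.
CONFIG = """
<Pools>
  <Pool name="Dynamic">
    <Dir>/www/mysite.com/ad/</Dir>
    <Dir>/www/mysite.com/images/</Dir>
    <Dir>/www/mysite.com/php/</Dir>
    <Dir>/www/Smarty/</Dir>
  </Pool>
  <Pool name="Static">
    <Dir>/www/mysite.com/images/</Dir>
    <Dir>/www/mysite.com/html/</Dir>
  </Pool>
</Pools>
"""

def load_pools(xml_text):
    """Return {pool name: [watched folders]} from the pool configuration."""
    root = ET.fromstring(xml_text.strip())
    return {
        pool.get("name"): [d.text.strip() for d in pool.findall("Dir")]
        for pool in root.findall("Pool")
    }

def pools_for_path(pools, path):
    """Names of all pools whose watched folders contain the given path."""
    return sorted(
        name for name, dirs in pools.items()
        if any(path.startswith(d) for d in dirs)
    )

pools = load_pools(CONFIG)
print(pools_for_path(pools, "/www/mysite.com/images/logo.png"))
# prints ['Dynamic', 'Static']
```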
Ignore list:
<!-- Ignore list -->
<IgnoreList>
<Dir>/www/mysite.com/ad/ignoreme</Dir>
</IgnoreList>
The ignore list is a set of paths that we _DON'T_ want to synchronize or replicate.
Synchronization configuration:
<!-- synchronization configuration -->
<Sync>
<MaxWatches>524288</MaxWatches>
<ShadowRoot>/www/ShadowRoot</ShadowRoot> <!-- MUST BE ON SAME DEVICE AS WATCHED FILES -->
<HeartbeatTimer>3</HeartbeatTimer>
<QueueTimer>5</QueueTimer>
<Debug>0</Debug>
</Sync>
This section defines the maximum number of files+folders (MaxWatches). If you try to watch more than this number of files+folders, it'll barf.
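To pick a sensible MaxWatches value, it helps to count what is actually under the watched folders. The Python sketch below does that, taking the "files+folders" wording above at face value (one unit per file or folder). It also reads the kernel's own per-user inotify limit from /proc/sys/fs/inotify/max_user_watches; whether and how FREP's MaxWatches interacts with that kernel setting is not stated here, so treat the comparison as a sanity check only. The function names are made up for this example.

```python
import os

def count_entries(roots):
    """Count every file and folder under the given roots (roots included)."""
    total = 0
    for root in roots:
        total += 1  # the root folder itself
        for _dirpath, dirnames, filenames in os.walk(root):
            total += len(dirnames) + len(filenames)
    return total

def kernel_watch_limit(path="/proc/sys/fs/inotify/max_user_watches"):
    """Return the kernel's per-user inotify watch limit, or None if unreadable."""
    try:
        with open(path) as handle:
            return int(handle.read().strip())
    except (OSError, ValueError):
        return None

if __name__ == "__main__":
    needed = count_entries(["."])  # replace "." with your watched folders
    print("entries to watch:", needed)
    print("kernel limit:", kernel_watch_limit())
```

Leave some headroom above the counted number, since the limit applies to the tree as it grows, not as it is today.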
We also have the ShadowRoot configuration directive, which defines where FREP will create a copy of all the watched files/folders. This is used to be able to do partial file transfers (send only the changed parts of a file when it's modified). The ShadowRoot file path MUST be on the same device as the watched files.
The HeartbeatTimer directive tells FREP how often it should send out heartbeats to the client nodes. The clients use this value to determine if they need to reconnect to the Spread network.
The QueueTimer defines how long FREP_Server keeps received events in the queue before it starts processing them. This avoids duplicate events triggering needless IO. For example, if you write a file in 5 chunks, inotify sends multiple file modification events, and the queue lumps them all together. It also handles sequential operations: if you open a file, write to it, close it, then delete it, FREP_Server will only send 1 event (delete).
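The queue behaviour described above can be sketched in a few lines of Python. This is a guess at the idea, not the FREP_Server implementation: the event names, the function name, and the rule that a file created and then modified within one interval is still reported as a creation are all assumptions made for the example.

```python
def coalesce(events):
    """Reduce one queue interval's events to at most one event per path.

    events: iterable of (path, kind) pairs in arrival order, where kind is
    "CREATE", "MODIFY" or "DELETE" (hypothetical names for this sketch).
    """
    pending = {}
    for path, kind in events:
        previous = pending.get(path)
        if kind == "DELETE":
            # A delete supersedes anything queued earlier for this path.
            pending[path] = "DELETE"
        elif previous == "CREATE":
            # Assumption: created-then-modified is still a new file
            # as far as the clients are concerned.
            pending[path] = "CREATE"
        else:
            pending[path] = kind
    return list(pending.items())

queue = (
    [("/www/a", "MODIFY")] * 5  # one file written in 5 chunks
    + [("/www/b", "CREATE"), ("/www/b", "MODIFY"), ("/www/b", "DELETE")]
)
print(coalesce(queue))
# prints [('/www/a', 'MODIFY'), ('/www/b', 'DELETE')]
```

Eight raw events collapse to two, which is the saving the QueueTimer is buying: a longer timer merges more events at the cost of a longer replication delay.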
The fileserver's address should be the IP address of the server that the FREP_Server sync daemon is running on. The port should be any unused port.
The ValidPathRegex is a safety measure: it defines a Perl-compatible regular expression that a requested path must match before a file signature or transfer is allowed. As an example, if your synchronization folders are /www/mysite.com and /files/syncme, you'd set the ValidPathRegex to: (?:/www/mysite.com|/files/syncme).
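It is worth testing a ValidPathRegex before relying on it. The Python sketch below tries the example pattern against a few paths, using Python's re module, which behaves the same as PCRE for these simple patterns. It assumes the pattern is applied as an unanchored search; whether FREP_FileServer anchors the match itself is not stated here. The stricter variant is a suggestion of mine, not something from FREP.

```python
import re

# Pattern from the example above, unchanged.
LOOSE = re.compile(r"(?:/www/mysite.com|/files/syncme)")

# Hypothetical stricter variant: anchored at the start, dots escaped, and
# a "/" or end of string required right after the folder name.
STRICT = re.compile(r"^(?:/www/mysite\.com|/files/syncme)(?:/|$)")

def allowed(pattern, path):
    """True when the pattern matches anywhere in the path."""
    return pattern.search(path) is not None

for path in (
    "/www/mysite.com/index.html",   # intended: allowed
    "/tmp/www/mysite.com/secret",   # outside the synchronized folders
    "/www/mysite.community/page",   # different folder, same prefix
):
    print(path, allowed(LOOSE, path), allowed(STRICT, path))
```

Under the unanchored-search assumption, the example pattern accepts all three paths, while the stricter variant accepts only the first. Neither pattern does anything about ".." components in a path.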
The FREP_FileServer keeps a BerkeleyDB cache of file signatures to speed things up. The SignatureCache directive tells the file server where to store it.
The ConnectionTimeout is the amount of time the fileserver will wait for data on a socket (for reading or writing). 5 seconds should be fine.
The FREP_Resync daemon only has one tweakable setting, the ResyncTimer, which is the frequency at which it'll broadcast the full file list to all the clients.
Note:
The debug flag controls if debugging to syslog is enabled. These daemons are pretty verbose when the debug flag is on, so leave it off unless you're debugging :)
Turn on debugging: --debug
Name of server pool: --pool=<Pool name>
UserID to run as: --userid=<uid>
GroupID to run as: --groupid=<gid>
Spread port (usually 4803): --spread-port=<port>
Spread server IP address: --spread-address=<ip>
Port that the file server is running on: --fs-port=<port>
IP address of the file server: --fs-address=<ip>
Heartbeat timer (default 20): --heartbeat=<seconds>
Queue timer (default 5): --queue-timer=<seconds>
Configuration options for FREP_ResyncClient:
Turn on debugging: --debug
Name of server pool: --pool=<Pool name>
UserID to run as: --userid=<uid>
GroupID to run as: --groupid=<gid>
Spread port (usually 4803): --spread-port=<port>
Spread server IP address: --spread-address=<ip>
Port that the file server is running on: --fs-port=<port>
IP address of the file server: --fs-address=<ip>
Configuration options for FREP_Mon:
Name of server pool: --pool=<Pool name>
Spread port (usually 4803): --spread-port=<port>
Spread server IP address: --spread-address=<ip>
Note: The userid and groupid values should match on the FREP_Client and FREP_ResyncClient.
Start the services
cd /service
ln -s /etc/FREP_Client
ln -s /etc/FREP_ResyncClient
Wait a few seconds, daemontools should fire them all up. Check to make sure they are running with:
svstat /service/FREP*
:. TODO
Some of the things I'll get around to eventually:
Add flag in client to specify temporary folder (rename()'s across devices will fail, see bugs below).
More testing maybe.
:. Spread configuration
The beauty of using Spread as a messaging toolkit is its ability to use IP broadcast or multicast for sending out its messages.
If you've got a modest number of nodes, I'd recommend making every client node part of the Spread network (and have the client connect to localhost).
For larger setups, you probably want to configure one Spread server every 20 clients or so.
In theory, this solution should scale to hundreds or thousands of clients since we can configure the Spread network to multicast on several VLANs. It may also even be possible to use this replication mechanism across WAN links by tunnelling multicast traffic.
This system works well with relatively small file sizes. I haven't done very extensive tests on large files.
I am almost 100% sure that trying to keep large (>1 meg) files synched up won't be a problem UNLESS you are modifying them frequently.
There is a 2 gig limitation in the Perl Algorithm::GdiffDelta module. Also, the bigger the file, the longer it takes to calculate the differences when the file is modified.
The environment I've deployed FREP in involves many files and folders (300,000+) that are changed at a rate of about 2 files per second. The average file size is probably around 15 KB.
I haven't tested this out with Spread 4.X, so your mileage may vary.
There's a bug in the client code: I don't check if the folder I'm synching to is on the same device as the system's default temporary folder path (/tmp on Linux). This means that if your files are on a different device, the rename() call will fail, and nothing will work. I'll fix this eventually if anyone asks.
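Until that is fixed, you can check ahead of time whether a node is affected. The Python sketch below compares the device numbers of the synchronization folder and the temporary folder. The function names are made up for this example and this is not part of FREP. Matching device numbers are a necessary condition for rename() to work, but not always a sufficient one (two bind mounts of the same device still count as separate mount points), so treat a True result as "probably fine". The same comparison applies to the requirement that ShadowRoot live on the same device as the watched files.

```python
import os
import tempfile

def same_device(path_a, path_b):
    """True when both paths report the same filesystem device number."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev

def rename_should_work(sync_root, tmp_dir=None):
    """Check a sync folder against the temporary folder (default: system temp dir)."""
    if tmp_dir is None:
        tmp_dir = tempfile.gettempdir()
    return same_device(sync_root, tmp_dir)

if __name__ == "__main__":
    # "/www" is only an example; use the folder your client synchronizes to.
    target = "/www" if os.path.isdir("/www") else "."
    print(target, "on same device as temp dir:", rename_should_work(target))
```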