
Re: [Bacula-devel] rdiff/librsync deltas (was: rsync link behavior)

Kern and I have been discussing this in depth over the last two days.
The solution can be implemented within bacula itself and will require no
external tools.

The idea is to use librsync as the basis for the required changes
in the FD and SD.
It may also involve a new volume format and new columns in the catalog for
storing the relevant new information.

The rsync support will be a first building block for much more far-reaching
deduplication support.

I don't want to go into the gory details myself, as Kern is currently working
on getting the exact design on paper.


-----Original Message-----
From: Robin O'Leary [mailto:robin@xxxxxxxxxxxx] 
Sent: Tuesday, October 14, 2008 5:52 PM
To: bacula-devel@xxxxxxxxxxxxxxxxxxxxx
Subject: [Bacula-devel] rdiff/librsync deltas (was: rsync link behavior)

On Wednesday, October 01, 2008 6:25 PM Kern Sibbald wrote:
> On Wednesday 01 October 2008 17:50:24 Eli Shemer wrote:
> > I wanted to know whether someone actually got to implement an 
> > incremental backup to act like rsync does upon file changes ?
> It is a project we are currently discussing as possible in a release 
> following 3.0.0 scheduled around the end of the year -- i.e. we hope 
> to begin the project sometime in 2009. If there is commercial funding 
> of this project (already one company interested) we could probably speed
> it up.

We've had a preliminary look at doing this in an off-list discussion; I
don't know Kern's latest thoughts, but this prompts me to think it would be
good for us to have further discussion here.

> > I'm looking to write a patch to the save_file callback in the 
> > bacula-fd's backup procedure to utilize the gnu's diff and patches 
> > and to send those over the line to the bacula-sd. Of course a 
> > corresponding implementation in the storage device will also be
> ...
> For it to work, you would need both the original file and the new 
> modified file, and I am not sure how you will accomplish that.
This is the real problem, and pretty much excludes things like diff or
xdelta.  But rsync (and rdiff, as I'll mention later) gets around that need
in an ingenious way.

rsync assumes that you have a new version of a file locally and an old
version of the file at the remote end.  It then uses a clever rolling
checksum algorithm to efficiently find identical blocks at any byte offset
common to the new file and the old file, so only unique new blocks need be
sent across the wire.  The effect is to efficiently update the remote copy
to be identical with the local copy.
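
As an illustration of the rolling checksum idea described above, here is a
minimal Python sketch. It is not the actual rsync or librsync code: the
function names and the 16-bit modulus are assumptions chosen for the example.
It only shows why the checksum of a window can be updated in constant time
when the window slides by one byte.

```python
M = 1 << 16  # modulus for both halves of the weak checksum (assumed for this sketch)


def weak_checksum(block: bytes) -> tuple[int, int]:
    """Compute both halves (a, b) of the weak checksum from scratch."""
    n = len(block)
    a = sum(block) % M
    # The first byte is weighted n, the last byte is weighted 1.
    b = sum((n - i) * byte for i, byte in enumerate(block)) % M
    return a, b


def roll(a: int, b: int, out_byte: int, in_byte: int, block_len: int) -> tuple[int, int]:
    """Slide the window one byte to the right in O(1).

    out_byte leaves the window on the left, in_byte enters on the right.
    """
    a = (a - out_byte + in_byte) % M
    b = (b - block_len * out_byte + a) % M
    return a, b
```

Sliding the window with roll() gives the same result as calling
weak_checksum() on the shifted window, which is what makes it cheap to test
for a matching block at every byte offset.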

The least possible intrusion to bacula's way of working would be to have the
client use rsync as a pre-processing step to efficiently mirror the files of
interest on to the machine hosting the storage daemon, where they could be
backed up as normal (presumably using a file daemon running on the storage
machine but pretending to be the client).
Slightly better integration would be to still run the bacula file daemon on
the client, but use rsync to shift those files across to the mirror area on
the storage device, which would then continue the normal file-daemon
processing from the copy.

A further improvement would be for the storage daemon to extract the last
version of each file from an old backup just as required; rsync would update
it, then the new version would be stored in the new backup.
This removes the need to have a complete mirror, but at the expense of
having the director work out which old file needs to be retrieved and the
cost of extracting it.

All these options seem quite easy to do, but they have various drawbacks:
the storage device has to have space for a mirror of the client's files;
those files are unencrypted; we might have to worry about preserving
metadata separately (rsync will do its best, but that may not be good enough
if the source and destination have different filesystem types); things like
VSS support might be tricky.

> If you do figure out some clever way to have both files, then it would
> be *much* better to add some code that uses the librsync library calls to
> effectively implement what rsync does.

Using rdiff (or librsync, on which it is built) gives us a couple more
options.  rdiff has the ability to extract the rolling checksum 'signature'
of the first version of a file and save it for later use; this is typically
about 10% of the file size, though you can tune it up or down in a trade-off
between signature file size and compression efficiency.
This signature can then be used with a later version of the file to work out
the 'delta'---the minimal set of blocks that need to be sent over to the
remote end.  If the client file daemon keeps a local copy of the signature
file, it can compute and send the delta over to the storage daemon without
any further input from the storage daemon or director.
But a better alternative would be to hold the signature file on another
storage daemon (ideally not the remote storage daemon, but there could also
be a local storage daemon available) or on the director.
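
To make the signature/delta split concrete, here is a simplified Python
sketch. It does not use librsync or its file formats: the function names, the
tiny block size, the ('copy', ...) / ('data', ...) operation tuples and the
use of zlib.adler32 plus SHA-256 are all assumptions made for illustration.
The point it shows is that make_delta() needs only the new file and the old
signature, never the old file itself.

```python
import hashlib
import zlib

BLOCK = 4  # tiny block size so the example is easy to follow (assumption)


def make_signature(old: bytes, block: int = BLOCK) -> dict:
    """Map weak checksum -> {strong hash: block index} for each block of the old file."""
    sig = {}
    for start in range(0, len(old), block):
        chunk = old[start:start + block]
        strong = hashlib.sha256(chunk).digest()
        sig.setdefault(zlib.adler32(chunk), {})[strong] = start // block
    return sig


def make_delta(new: bytes, sig: dict, block: int = BLOCK) -> list:
    """Build ('copy', block_index) and ('data', bytes) ops from the signature alone."""
    ops, literal, pos = [], bytearray(), 0
    while pos < len(new):
        chunk = new[pos:pos + block]
        # The weak checksum is recomputed at each offset here for brevity;
        # a real implementation would roll it forward instead.
        candidates = sig.get(zlib.adler32(chunk))
        idx = candidates.get(hashlib.sha256(chunk).digest()) if candidates else None
        if idx is None:
            literal.append(new[pos])  # no match at this offset: emit one literal byte
            pos += 1
            continue
        if literal:
            ops.append(("data", bytes(literal)))
            literal = bytearray()
        ops.append(("copy", idx))
        pos += len(chunk)
    if literal:
        ops.append(("data", bytes(literal)))
    return ops


def apply_delta(old: bytes, ops: list, block: int = BLOCK) -> bytes:
    """Rebuild the new file from the old file plus the delta."""
    out = bytearray()
    for kind, arg in ops:
        out += old[arg * block:(arg + 1) * block] if kind == "copy" else arg
    return bytes(out)
```

In this sketch only the side that reconstructs the file (apply_delta) needs
the old data; the side that produces the delta works from the signature.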

When the time comes to back up the latest version of the file, the client
retrieves the appropriate signature file according to instructions from the
director which can determine which signature file should be the basis for
this new delta by knowing the level (full, differential or incremental) of
the new backup.  The client then uses librsync to compute the new delta from
the new file and the old signature, and sends it to the remote storage daemon.
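
The basis selection could look roughly like the following Python sketch. The
BackupJob type and choose_basis() are hypothetical names, not bacula code, and
the mapping from level to basis is an assumption based on the usual meaning
of the levels: a differential is taken relative to the last full backup, and
an incremental relative to the most recent backup of any level.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BackupJob:
    job_id: int
    level: str  # "full", "differential" or "incremental"


def choose_basis(history: list[BackupJob], new_level: str) -> Optional[BackupJob]:
    """Pick the earlier job whose signature should be the basis for the new delta.

    history is ordered oldest first. Returns None when no basis applies,
    i.e. the file has to be sent in full.
    """
    if new_level == "full":
        return None
    if new_level == "differential":
        # Assumed rule: delta against the most recent full backup.
        for job in reversed(history):
            if job.level == "full":
                return job
        return None
    if new_level == "incremental":
        # Assumed rule: delta against the most recent backup of any level.
        return history[-1] if history else None
    raise ValueError(f"unknown backup level: {new_level}")
```

A real implementation would also have to work per file rather than per job,
since not every job contains every file.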

Having sent the delta across to the remote storage daemon, we now have a
choice of what to do with it.  Again, the less intrusive method from a
bacula code point of view is for the storage daemon to apply the delta to a
previous version of the file it has mirrored or extracted from an old
backup; the reconstructed new version is then backed up as normal.

A much more appealing possibility is to store only the delta in the backup.
This also has the advantage that the delta can be encrypted by the client.
However, this changes a fundamental assumption in bacula and that is
probably going to have consequences I haven't foreseen.
Restoring a single file must now consist of retrieving the first version of
the file, plus all the deltas and applying them in the right order.
It's also fragile in the sense that if any of the intermediate backups is
unavailable, the chain of deltas is broken and the file cannot be restored.
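
The restore procedure and its fragility can be sketched in Python as follows.
This is only an illustration under assumed data structures: the backups
dictionary, the 'basis' and 'payload' keys, the operation tuples and
BrokenChainError are all hypothetical and do not reflect any bacula volume or
catalog format.

```python
class BrokenChainError(Exception):
    """Raised when a backup needed to rebuild the file is unavailable."""


def apply_delta(basis: bytes, ops: list) -> bytes:
    """Apply ('copy', offset, length) and ('data', bytes) ops to a basis file."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += basis[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)


def restore(backups: dict, job_id: int) -> bytes:
    """Rebuild the file as of job_id.

    backups maps job_id -> {"basis": earlier job_id or None, "payload": ...},
    where the payload is the whole file for a full backup and a list of
    delta ops otherwise.
    """
    chain = []
    current = job_id
    # Walk back from the requested job to the full backup at the root.
    while current is not None:
        if current not in backups:
            raise BrokenChainError(f"backup {current} is unavailable")
        chain.append(backups[current])
        current = backups[current]["basis"]
    data = chain.pop()["payload"]  # first version of the file
    # Apply the deltas oldest first.
    while chain:
        data = apply_delta(data, chain.pop()["payload"])
    return data
```

If any job in the middle of the chain is missing, restore() fails for every
later version of the file, which is the fragility described above.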

Still, I'm confident this is the best approach: to get the client to produce
new rdiff signature files (kept on the client if there's space, or on some
other storage local to the client, or on the director), to retrieve
appropriate old rdiff signature files, to compute and send deltas to the
remote storage daemon, to store deltas in the backup, to consolidate virtual
backups (or to restore) by grouping the basis file and all required patches,
to restore by sending the basis+deltas to the client file daemon, which has
to apply the deltas in sequence.

If everyone agrees this is the best way forward, I'd be happy to flesh out a
proper project plan and see how we and others can help contribute.

Robin O'Leary.
email: robin@xxxxxxxxxxxx    Equiinet Ltd., Edison Road, Dorcan,
Tel.:  +44 1793 603708       Swindon, SN3 5JX, U.K.  51.5558N,1.7286W
