
[Bacula-devel] rdiff/librsync deltas (was: rsync link behavior)

On Wednesday, October 01, 2008 6:25 PM Kern Sibbald wrote:
> On Wednesday 01 October 2008 17:50:24 Eli Shemer wrote:
> > I wanted to know whether someone actually got to implement an incremental
> > backup to act like rsync does upon file changes ?
> It is a project we are currently discussing as possible in a release following
> 3.0.0 scheduled around the end of the year -- i.e. we hope to begin the
> project sometime in 2009. If there is commercial funding of this project
> (already one company interested) we could probably speed it up.

We've had a preliminary look at doing this in an off-list discussion;
I don't know Kern's latest thoughts, but this prompts me to think it
would be good for us to have further discussion here.

> > I'm looking to write a patch to the save_file callback in the bacula-fd's
> > backup procedure to utilize the gnu's diff and patches and to send those
> > over the line to the bacula-sd. Of course a corresponding implementation in
> > the storage device will also be required.
> ...
> For it to work, you would need both the original file and the new
> modified file, and I am not sure how you will accomplish that.

This is the real problem, and it pretty much rules out tools like diff
or xdelta, which need both versions of the file in one place.  But rsync
(and rdiff, as I'll mention later) gets around that need in an ingenious way.

rsync assumes that you have a new version of a file locally and an old
version of the file at the remote end.  It then uses a clever rolling
checksum algorithm to efficiently find identical blocks at any byte
offset common to the new file and the old file, so only unique new blocks
need be sent across the wire.  The effect is to efficiently update the
remote copy to be identical with the local copy.
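
To make the rolling part concrete, here is a minimal sketch in Python of
the weak checksum as I understand it from the rsync technical report: two
16-bit running sums that can be updated in constant time when the window
slides by one byte.  The function names are mine, for illustration only;
this is not code from rsync or librsync.

```python
M = 1 << 16  # both running sums are kept modulo 2^16


def weak_checksum(block):
    """Compute the two sums (a, b) for one block from scratch."""
    n = len(block)
    a = sum(block) % M
    # The first byte is weighted n, the last byte is weighted 1.
    b = sum((n - i) * x for i, x in enumerate(block)) % M
    return a, b


def roll(a, b, out_byte, in_byte, n):
    """Slide an n-byte window one byte to the right in O(1)."""
    a = (a - out_byte + in_byte) % M
    b = (b - n * out_byte + a) % M
    return a, b


def rolling_checksums(data, n):
    """Yield (offset, a, b) for every n-byte window of data."""
    if len(data) < n:
        return
    a, b = weak_checksum(data[:n])
    yield 0, a, b
    for k in range(1, len(data) - n + 1):
        # data[k - 1] leaves the window, data[k - 1 + n] enters it.
        a, b = roll(a, b, data[k - 1], data[k - 1 + n], n)
        yield k, a, b
```

Because each step costs the same regardless of block size, the sender can
afford to test every byte offset of the new file against the block
checksums of the old one.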

The least intrusive change to bacula's way of working would be to have
the client use rsync as a pre-processing step to efficiently mirror
the files of interest onto the machine hosting the storage daemon,
where they could be backed up as normal (presumably using a file daemon
running on the storage machine but pretending to be the client).
Slightly better integration would be to still run the bacula file daemon
on the client, but use rsync to shift those files across to the mirror area
on the storage device, which would then continue the normal file-daemon
processing from the copy.

A further improvement would be for the storage daemon to extract the
last version of each file from an old backup just as required; rsync
would update it, then the new version would be stored in the new backup.
This removes the need to have a complete mirror, but at the expense of
having the director work out which old file needs to be retrieved and
the cost of extracting it.

All these options seem quite easy to do, but they have various drawbacks:
the storage device has to have space for a mirror of the client's files;
those files are unencrypted; we might have to worry about preserving
metadata separately (rsync will do its best, but that may not be good
enough if the source and destination have different filesystem types);
things like VSS support might be tricky.

> If you do figure out some clever way to have both files, then it would be
> *much* better to add some code that uses the librsync library calls to
> effectively implement what rsync does.

Using rdiff (or librsync, on which it is built) gives us a couple
more options.  rdiff has the ability to extract the rolling checksum
'signature' of the first version of a file and save it for later use; this
is typically about 10% of the file size, though you can tune it up or down
in a trade-off between signature file size and compression efficiency.
This signature can then be used with a later version of the file to work
out the 'delta'---the minimal set of blocks that need to be sent over
to the remote end.  If the client file daemon keeps a local copy of the
signature file, it can compute and send the delta over to the storage
daemon without any further input from the storage daemon or director.
But a better alternative would be to hold the signature file on another
storage daemon (ideally one local to the client rather than the remote
storage daemon, if such a daemon is available) or on the director.
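
As a rough illustration of the signature/delta split, here is a toy
sketch in Python.  It is not librsync's API or file format: the names
make_signature, make_delta and apply_delta are hypothetical, SHA-256
stands in for the strong checksum, and for brevity the weak checksum
(zlib.adler32) is recomputed at each offset rather than rolled.  The
point it shows is that make_delta needs only the new file and the old
signature, never the old file itself.

```python
import hashlib
import zlib

BLOCK = 4  # unrealistically small, so the example is easy to follow


def make_signature(old, block=BLOCK):
    """Map weak checksum -> [(strong hash, offset, length)] per old block."""
    sig = {}
    for offset in range(0, len(old), block):
        chunk = old[offset:offset + block]
        entry = (hashlib.sha256(chunk).digest(), offset, len(chunk))
        sig.setdefault(zlib.adler32(chunk), []).append(entry)
    return sig


def make_delta(new, sig, block=BLOCK):
    """Describe new as copies of old blocks plus literal bytes."""
    delta, literal, pos = [], bytearray(), 0
    while pos < len(new):
        chunk = new[pos:pos + block]
        match = None
        # Cheap weak lookup first, strong hash only to confirm a candidate.
        for strong, offset, length in sig.get(zlib.adler32(chunk), ()):
            if length == len(chunk) and hashlib.sha256(chunk).digest() == strong:
                match = (offset, length)
                break
        if match:
            if literal:
                delta.append(("literal", bytes(literal)))
                literal.clear()
            delta.append(("copy", match[0], match[1]))
            pos += match[1]
        else:
            literal.append(new[pos])
            pos += 1
    if literal:
        delta.append(("literal", bytes(literal)))
    return delta


def apply_delta(old, delta):
    """Rebuild the new file from the old file and the delta."""
    out = bytearray()
    for op in delta:
        if op[0] == "copy":
            out += old[op[1]:op[1] + op[2]]
        else:
            out += op[1]
    return bytes(out)
```

In this sketch the client would keep (or fetch) only the output of
make_signature; apply_delta is the step that needs the old file, and it
can run wherever that old file lives.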

When the time comes to back up the latest version of the file, the client
retrieves the appropriate signature file according to instructions
from the director, which can determine which signature file should be
the basis for this new delta by knowing the level (full, differential
or incremental) of the new backup.  The client then uses librsync to
compute the new delta from the new file and the old signature,
and sends it to the remote storage daemon.
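
The director's choice of basis could be sketched like this in Python.
I am assuming the usual meaning of the levels (a differential is relative
to the last full backup, an incremental to the most recent backup of any
level); BackupJob and pick_basis are hypothetical names, not existing
bacula code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BackupJob:
    job_id: int
    level: str  # "full", "differential" or "incremental"


def pick_basis(history: List[BackupJob], new_level: str) -> Optional[BackupJob]:
    """Return the earlier job whose signature the new delta is based on.

    history is ordered oldest first.  None means there is no usable
    basis, so the whole file has to be sent.
    """
    if new_level not in ("full", "differential", "incremental"):
        raise ValueError("unknown backup level: " + new_level)
    if new_level == "full" or not history:
        return None
    if new_level == "differential":
        # Assumption: differentials are always relative to the last full.
        for job in reversed(history):
            if job.level == "full":
                return job
        return None
    # Assumption: incrementals are relative to the most recent backup.
    return history[-1]
```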

Having sent the delta across to the remote storage daemon, we now have a
choice of what to do with it.  Again, the less intrusive method from a
bacula code point of view is for the storage daemon to apply the delta
to a previous version of the file it has mirrored or extracted from an
old backup; the reconstructed new version is then backed up as normal.

A much more appealing possibility is to store only the delta in the
backup.  This also has the advantage that the delta can be encrypted by
the client.  However, this changes a fundamental assumption in bacula
and that is probably going to have consequences I haven't foreseen.
Restoring a single file must now consist of retrieving the first version
of the file, plus all the deltas and applying them in the right order.
It's also fragile in the sense that if any of the intermediate backups
is unavailable, the chain of deltas is broken and the file cannot be
restored.
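
A toy sketch in Python of what such a restore involves, under assumptions
of my own: a delta is a list of ("copy", offset, length) and
("literal", bytes) operations against the previous version, and the
names apply_delta and restore are hypothetical.  It shows both the
ordering requirement and the failure when one link of the chain is
missing.

```python
def apply_delta(basis, delta):
    """Rebuild the next version from the previous one and a delta."""
    out = bytearray()
    for op in delta:
        if op[0] == "copy":
            out += basis[op[1]:op[1] + op[2]]
        else:
            out += op[1]
    return bytes(out)


def restore(first_version, deltas_by_job, chain):
    """Apply the deltas named in chain, oldest first.

    deltas_by_job maps a job id to the delta stored by that backup;
    chain lists the job ids that lead from the first version to the
    version being restored.
    """
    data = first_version
    for job_id in chain:
        if job_id not in deltas_by_job:
            # One missing backup makes every later version unrecoverable.
            raise LookupError("delta from job %d is unavailable" % job_id)
        data = apply_delta(data, deltas_by_job[job_id])
    return data


v1 = b"hello world"
deltas = {
    2: [("copy", 0, 6), ("literal", b"bacula")],  # v2 = b"hello bacula"
    3: [("literal", b"oh, "), ("copy", 0, 12)],   # v3 = b"oh, hello bacula"
}
latest = restore(v1, deltas, [2, 3])
```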

Still, I'm confident this is the best approach.  The client produces
new rdiff signature files (kept on the client if there's space, or on
some other storage local to the client, or on the director), retrieves
the appropriate old rdiff signature files, and computes and sends deltas
to the remote storage daemon, which stores the deltas in the backup.
Consolidating virtual backups (or restoring) means grouping the basis
file and all required patches; a restore sends the basis plus deltas
to the client file daemon, which has to apply the deltas in sequence.

If everyone agrees this is the best way forward, I'd be happy to flesh
out a proper project plan and see how we and others can help contribute.

Robin O'Leary.
email: robin@xxxxxxxxxxxx    Equiinet Ltd., Edison Road, Dorcan,
Tel.:  +44 1793 603708       Swindon, SN3 5JX, U.K.  51.5558N,1.7286W

