
Re: [Bacula-devel] Director bug when using two storage daemons?

On Mon, Nov 17, 2008 at 02:05:53PM +0100, Kern Sibbald wrote:
> Hello,
> I've analyzed your problem, and I know what is happening, but don't yet know 
> how to resolve the problem.
> The problem:
> - You begin writing a giant job onto a Volume
> - There are no other volumes available for writing (big mistake).
> - You start a second job that needs a Volume to write on.
> - The second job sees that there are no JobMedia records associated with the
> Volume (not yet written), so it purges the Volume.
> - A sort of chaos then follows.

The above is an accurate description of the problem, except for the second
point. I just re-ran broken-media-bug-test, but added two lines to it
(the full test and output are attached to this email):

label storage=File volume=TestVolume0001
label storage=File volume=TestVolume0002

This creates two volumes.
The first job starts writing to TestVolume0001, and the second job comes along,
purges it and starts using it too.
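
To make the failure mode concrete, here is a toy Python model of the check as Kern describes it. It is not Bacula code: the ToyCatalog class, its fields and the looks_unused test are hypothetical stand-ins, and the only behaviour taken from this thread is "no JobMedia records for the Volume means it gets purged". It does not model how the second job picks a volume.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCatalog:
    """Hypothetical stand-in for the catalog; not Bacula's real schema."""
    media: set = field(default_factory=set)       # labelled volume names
    jobmedia: list = field(default_factory=list)  # (job_id, volume) pairs

    def label(self, volume):
        self.media.add(volume)

    def looks_unused(self, volume):
        # Assumed purge test: no JobMedia record refers to the volume.
        return all(v != volume for _, v in self.jobmedia)


catalog = ToyCatalog()
catalog.label("TestVolume0001")
catalog.label("TestVolume0002")

# Job 1 is writing to TestVolume0001, but no JobMedia record exists yet,
# so the catalog holds nothing that ties the job to the volume.
volume_in_use_by_job1 = "TestVolume0001"

# Job 2 applies the purge test to the very volume Job 1 is writing to.
job2_would_purge = catalog.looks_unused(volume_in_use_by_job1)
```

In this model job2_would_purge comes out true even though a spare volume exists, which matches what the test shows: the spare volume does not protect the one being written.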

> Workaround:
> - Stop running Bacula on the bleeding edge of Volume availability. This is
> always a bad idea and frequently leads to unexpected situations.

I am running it with two jobs in close proximity like this in order to quickly
demonstrate the bug. I initially saw the bug happening when it was far less
on a razor edge, with more than 45 minutes between the two jobs starting.
It would have been a waste of time to produce a test that took an hour to
complete, when the bug can be shown with a test that takes a fraction of the
time.

> - Don't only have a single volume available if that Volume can only store one
> Job (according to your settings), and you start a second job while the first
> is running but before it has written at least 1GB of data.

I have told Bacula that it can have an unlimited number of volumes on the disk,
and that it should automatically label them.
I have also shown that having spare volumes available doesn't help.
To do what you suggest would mean that I cannot schedule any jobs, as I have to
watch very precisely what is going on to be sure that no two jobs overlap.
In case you were thinking that I was testing some unlikely scenario where I
want to back up the same machine more than once with overlapping jobs:
the problem also happens if I have two separate clients running the overlapping
jobs.

> Solution:
> - I don't have one, because we have no way to "lock" a volume from being
> purged.  Anything we might do would be prone to errors if the SD should fail
> while the volume was "locked".
> Bottom line: it is easy to work around this problem, and unless we are lucky
> and come up with a good idea, I don't see that there is any easy way to
> resolve the problem.

I do not think that it is easy to work around this problem. I also think that
the problem is very serious and that it is quite likely that other people
have triggered it without noticing - it is hard to realise that it has
happened if you are not watching very closely indeed.

I have a suggestion for fixing it:
The JobMedia record appears to be created when 1GB of data is written, even
if the job hasn't finished.
How about creating the JobMedia record just before (or at the same time as) the
Media record is created?
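
A rough sketch of the idea in Python, under stated assumptions: ToyCatalog, claim_volume and the record_early flag are hypothetical names invented for illustration, the purge test is the simplified "no JobMedia records" check described earlier in this thread, and the sketch writes the early record at the moment a job takes the volume, which may not be exactly where the real code would need to do it.

```python
class ToyCatalog:
    """Hypothetical stand-in for the catalog; not Bacula's real schema."""

    def __init__(self):
        self.jobmedia = []  # (job_id, volume) pairs

    def claim_volume(self, job_id, volume, record_early):
        # Proposed change (assumption): write a JobMedia record as soon as
        # the job takes the volume, instead of after the first 1GB.
        if record_early:
            self.jobmedia.append((job_id, volume))

    def looks_unused(self, volume):
        # Assumed purge test: no JobMedia record refers to the volume.
        return all(v != volume for _, v in self.jobmedia)


def second_job_would_purge(record_early):
    catalog = ToyCatalog()
    # Job 1 takes the volume and has written less than 1GB so far.
    catalog.claim_volume(1, "TestVolume0001", record_early)
    # Job 2 now applies the purge test to the same volume.
    return catalog.looks_unused("TestVolume0001")


current_behaviour = second_job_would_purge(record_early=False)
proposed_behaviour = second_job_would_purge(record_early=True)
```

With the early record the purge test no longer sees the volume as unused, so no separate "lock" is needed in this model. Whether a record left behind by a failed SD causes trouble is not addressed here.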
