
Re: [Bacula-devel] Director bug when using two storage daemons?

On Monday 17 November 2008 16:51:31 Graham Keeling wrote:
> On Mon, Nov 17, 2008 at 02:05:53PM +0100, Kern Sibbald wrote:
> > Hello,
> >
> > I've analyzed your problem, and I know what is happening, but don't yet
> > know how to resolve the problem.
> >
> > The problem:
> > - You begin writing a giant job onto a Volume
> > - There are no other volumes available for writing (big mistake).
> > - You start a second job that needs a Volume to write on.
> > - The second job sees that there are no JobMedia records associated with
> > the Volume (not yet written), so it purges the Volume.
> > - A sort of chaos then follows.
>
> Hello,
> The above is an accurate description of the problem, except for the second
> point. I just re-ran broken-media-bug-test, but added two lines to it
> (the full test and output is attached to this email):
> label storage=File volume=TestVolume0001
> label storage=File volume=TestVolume0002
> This creates two volumes.
> The first job starts writing to TestVolume0001, and the second job comes
> along, purges it and starts using it too.

As long as there is a Volume available, Bacula will never purge a volume, and
in general it won't even use a Purged volume until all Appendable volumes
are used.  In this case, it still goes ahead and tries to purge TestVolume0001,
because it cannot use TestVolume0002: the only drive is already in use.
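
The sequence from the problem list above can be modelled with a toy sketch. It
is not Bacula code: the names Volume, write, looks_purgeable and simulate are
hypothetical, and the only behaviour assumed is what this thread states, namely
that a JobMedia record appears once 1GB has been written and that a Volume with
no JobMedia records is treated as having no jobs on it.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

GB = 1024 ** 3  # size at which, per this thread, a JobMedia record appears


@dataclass
class Volume:
    # Hypothetical stand-in for a catalog Media record, not Bacula's schema.
    name: str
    status: str = "Append"
    writer: Optional[int] = None  # JobId currently writing, if any
    bytes_by_job: Dict[int, int] = field(default_factory=dict)
    jobmedia: List[int] = field(default_factory=list)


def write(vol: Volume, job_id: int, nbytes: int) -> None:
    """A job writes data; its JobMedia record exists only once 1GB is reached."""
    vol.writer = job_id
    vol.bytes_by_job[job_id] = vol.bytes_by_job.get(job_id, 0) + nbytes
    if vol.bytes_by_job[job_id] >= GB and job_id not in vol.jobmedia:
        vol.jobmedia.append(job_id)


def looks_purgeable(vol: Volume) -> bool:
    """The check as described in the thread: no JobMedia records, so no jobs."""
    return not vol.jobmedia


def simulate(first_write_bytes: int):
    """Job 1 writes some data, then job 2 arrives needing a Volume."""
    vol = Volume("TestVolume0001")
    write(vol, 1, first_write_bytes)  # job 1 is still running
    purged = False
    if looks_purgeable(vol):          # job 2 sees no JobMedia records
        vol.status = "Purged"
        purged = True
    return purged, vol.writer


print(simulate(100 * 1024 ** 2))  # job 1 below 1GB: purged while job 1 writes
print(simulate(GB))               # job 1 has reached 1GB: not purged
```

In the first call the Volume is purged even though job 1 is still recorded as
its writer, which is the window being discussed here.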

> > Workaround:
> > - Stop running Bacula on the bleeding edge of Volume availability. This is
> > always a bad idea and frequently leads to unexpected situations.
> I am running it with two jobs in close proximity like this in order to
> quickly demonstrate the bug. I initially saw the bug happening when it was
> far less on a razor edge, with more than 45 minutes between the two jobs
> starting. It would have been a waste of time to produce a test that took an
> hour to complete, when the bug can be shown with a test that takes a
> fraction of the time.

> > - Don't only have a single volume available if that Volume can only store
> > one Job (according to your settings), and you start a second job while
> > the first is running but before it has written at least 1GB of data.
> I have told bacula that it can have an unlimited number of volumes on the
> disk, and that it should automatically label them.

> I have also shown that having spare volumes available doesn't help.
> To do what you suggest would mean that I cannot schedule any jobs, as I
> have to watch very precisely what is going on to be sure that no two jobs
> overlap. If you were thinking that I was testing some unlikely scenario
> where I want to backup the same machine more than once with overlapping
> jobs, the problem happens if I have two separate clients running the
> overlapping jobs.
>
> > Solution:
> > - I don't have one, because we have no way to "lock" a volume from being
> > purged.  Anything we might do would be prone to errors if the SD should
> > fail while the volume was "locked".
> >
> > Bottom line: it is easy to work around this problem, and unless we are
> > lucky and come up with a good idea, I don't see that there is any easy
> > way to resolve the problem.
> I do not think that it is easy to work around this problem. I also think
> that the problem is very serious and that it is quite likely that other
> people have triggered it without noticing - it is hard to realise that it
> has happened if you are not watching very closely indeed.

You have certainly found a bug, but it is a rather artificial problem that 
virtually no one is likely to have, so I do not consider it at this point to 
be too serious.

> I have a suggestion for fixing it:
> The JobMedia record appears to be created when 1GB of data is written, even
> if the job hasn't finished.
> How about creating the JobMedia record just before (or at the same time)
> the Media record is created?

Were it that easy, I would have proposed that as a solution.  The problem is 
the JobMedia record cannot be written until 1GB has been written or the job 

If you want this problem resolved, please open a bug report on it, and just 
make reference to your test program, which I have now put into the SVN (not 
integrated into the normal regression tests because it always fails).  
However, this problem is not very high on my current list of priorities until 
someone figures out a good way to fix it.


