Re: [Bacula-devel] Patch: Migration jobmedia table insert incomplete
01.03.2008 05:42, Peter Much wrote:
> Hello Kern!
ok, I admit I'm not Kern, but I hope I can still contribute a bit here :-)
> <kern@xxxxxxxxxxx> aka Kern Sibbald wrote
> on Tue, 26 Feb 2008 21:57:10 +0100 in m2n.bacula.devel:
> |Yes, there are some problems with migration, but you don't explicitly mention
> |what deficiencies you expect to run into.
> Oops... should I?
> To be honest, I am a little bit worried, because I do not at all want
> to frustrate you.
Actually, I don't think Kern is that easily frustrated :-)
Of course, sometimes he makes it quite clear that he doesn't want to
discuss certain things, or he is annoyed because thinking through all
this takes time he prefers spending on actually *doing* things on
Bacula, but in general, he likes to know about even small issues you
have... after all, improving Bacula is what he wants to do :-)
> And I am in a dilemma: on one hand Bacula is something like a
> dream-come-true: I often thought of fetching a copy of IBM's TSM for
> my installation (I think I could get an employee's evaluation copy),
> but 1.) it is just too bulky for a couple of home computers, and 2.)
> it doesn't run on FreeBSD. So now I have what I wished to have...
> On the other hand, when trying to get the most out of Bacula, I find
> many of these small things that might still need a little fix-up.
> Now, if I do report all of these, then I am the guy who is always
> criticising. :-/
As long as you can take the occasional rough reply, I don't see a
problem. Of course you are aware that not every report will lead to an
immediate fix :-)
> But ok, now just four things that come to my mind:
> 1. Every time a job is migrated, the "Run=" directives in
> the job resource are executed again. This is almost never what
> one wants to happen, and in fact tends to disrupt backup cycles
I would consider that a bug, but one that might need quite a bit of
redesign to fix, and that affects only a limited number of users.
> 2. This is the thing that I have been worrying the most about. I
> have been following various theories about what might happen
> there, yet to no avail. The last of my theories was that it might
> have to do with the migrations, but currently I tend to dismiss
> this theory also. In fact, I am still clueless.
> What happens is that the Director puts all jobs (and all newly
> started jobs) into either "waiting on max Storage jobs" or
> "waiting execution", while there is no job running on any client
> and no job running on the SD. It just does nothing and has to
> be restarted.
That definitely qualifies as a bug... have you tried looking at the
debug output, once the DIR is in this state?
> What I have learned from reading bacula-users, is that most
> people do not run such quantities of jobs as I do. So maybe this
> is the reason.
Might be... how many jobs are you running in parallel?
> 3. When running a migration that will move multiple jobs, there is
> a kind of "envelope" job: the "g" job that is started first will
> start all the other "g" jobs that are needed. After this, this
> "envelope" job itself will also do one of the migrations. But
> occasionally this job just disappears silently and its activity
> is not to be found in the logfile.
Again something to investigate, and most probably a real bug.
> On one occasion it gave me a sig-11, which might give some hint
> at what is going on there. From the logfile:
> 25-Feb 08:56 BxDir JobId 9595: The following 163 JobIds were chosen to
> be migrated: 7705,7714,7723,7732,7741,7750,7759,...
> 25-Feb 08:56 BxDir JobId 9595: Job queued. JobId=9596
> 25-Feb 08:56 BxDir JobId 9595: Migration JobId 9596 started.
> 25-Feb 08:56 BxDir JobId 9595: Job queued. JobId=9597
> 25-Feb 08:56 BxDir JobId 9595: Migration JobId 9597 started.
> The interesting thing here is that this output is not retained
> until job 9595 would finish, instead it is dropped to the logfile
> immediately at start of the job. And it ends in the middle of a
That's probably OS output buffering. I like to run the DIR with output
to the console in such a case.
Or just issue some console commands that create debug output, so the
buffers get flushed.
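
To illustrate what I mean by buffering, here is a minimal Python sketch
(not Bacula code, and the helper name is made up for this example). It
writes one log line through a buffered file object and checks the size
of the file on disk before and after an explicit flush. I am assuming
the DIR's log output goes through a comparable buffered stream; I have
not checked the actual code path.

```python
import os
import tempfile


def size_before_and_after_flush(text):
    """Write text via a buffered file object and return the on-disk
    file size before and after an explicit flush()."""
    fd, path = tempfile.mkstemp(suffix=".log")
    os.close(fd)
    try:
        # An 8 KiB buffer: small writes stay in memory until flushed.
        with open(path, "w", buffering=8192, encoding="utf-8", newline="") as log:
            log.write(text)
            before = os.path.getsize(path)
            log.flush()
            after = os.path.getsize(path)
        return before, after
    finally:
        os.remove(path)


# Sample line copied from the log excerpt in this thread.
line = "25-Feb 08:56 BxDir JobId 9595: Job queued. JobId=9596\n"
before, after = size_before_and_after_flush(line)
print("bytes on disk before flush:", before)
print("bytes on disk after flush: ", after)
```

Until the flush, nothing has reached the file. When a buffer fills up in
the middle of a message, only part of that message is written out, which
would explain a log that stops mid-line and picks up again later.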
> 25-Feb 08:57 BxDir JobId 9595: Migration JobId 9742 started.
> 25-Feb 08:57 BxDir JobId 9595: Job queued. JobId=9743
> 25-Feb 08:57 BxDir Jo25-Feb 08:57 BxDir JobId 9773: The following 163
> JobIds were chosen to be migrated: 7706,7715,7724,7733,...
> 25-Feb 08:57 BxDir JobId 9773: Job queued. JobId=9774
> 25-Feb 08:57 BxDir JobId 9773: Migration JobId 9774 started.
> 25-Feb 08:57 BxDir JobId 9773: Job queued. JobId=9775
> 25-Feb 08:57 BxDir JobId 9773: Migration JobId 9775 started.
> The remaining part of the log of job 9595 follows a couple
> of hours later:
> 25-Feb 10:52 BxDir: Fatal Error because: Bacula interrupted by signal
> 11: Segmentation violation
> bId 9595: Migration JobId 9743 started.
> 25-Feb 08:57 BxDir JobId 9595: Job queued. JobId=9744
> 25-Feb 08:57 BxDir JobId 9595: Migration JobId 9744 started.
> 25-Feb 08:57 BxDir JobId 9595: Job queued. JobId=9745
> 25-Feb 08:57 BxDir JobId 9595: Migration JobId 9745 started.
> At that point I decided that there is some problem, but that it is
> not all too easy to find and fix. So I decided to postpone the
> issue for now (indefinitely), and instead redesign my schedules
> so that they would create fewer jobs. (I was saving
> database redo-logs via a Bacula schedule, which means to check
> every quarter of an hour if there are any to save - which every time
> does create an empty job that will qualify for later migration - and
> that will nicely disappear during that migration. Now I have allowed
> the database to call bconsole on demand only after it has batched
> up a couple of logs.)
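
When chasing this kind of thing, it may help to check mechanically where
the log was torn. The following is only a rough Python sketch: the two
regular expressions are guessed from the excerpt quoted above and may not
match other Bacula versions, and the function name is made up. For each
controlling job it lists the child JobIds that have a "Job queued" line
but no intact "Migration JobId ... started" line.

```python
import re
from collections import defaultdict

# Patterns assumed from the log excerpt in this thread only.
QUEUED = re.compile(r"JobId (\d+): Job queued\. JobId=(\d+)")
STARTED = re.compile(r"JobId (\d+): Migration JobId (\d+) started\.")


def queued_without_start(lines):
    """Map each controlling JobId to the sorted child JobIds that were
    queued but have no complete 'started' line in the given log lines."""
    queued = defaultdict(set)
    started = defaultdict(set)
    for line in lines:
        for pattern, bucket in ((QUEUED, queued), (STARTED, started)):
            match = pattern.search(line)
            if match:
                bucket[int(match.group(1))].add(int(match.group(2)))
    return {
        parent: sorted(children - started[parent])
        for parent, children in queued.items()
        if children - started[parent]
    }


# Lines taken from the excerpt above, including the torn fragment.
sample = [
    "25-Feb 08:57 BxDir JobId 9595: Migration JobId 9742 started.",
    "25-Feb 08:57 BxDir JobId 9595: Job queued. JobId=9743",
    "bId 9595: Migration JobId 9743 started.",
    "25-Feb 08:57 BxDir JobId 9595: Job queued. JobId=9744",
    "25-Feb 08:57 BxDir JobId 9595: Migration JobId 9744 started.",
]
print(queued_without_start(sample))
```

A child job reported here has not necessarily failed to start. It only
means its "started" line is missing or was cut in two, which points to
the place in the logfile worth looking at.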
> 4. When migrating from disk to tape, there should be no need to do
> SD data spooling - as the data is already packed up, it will flow
> quickly to the tape, and data spooling would only slow down the
> But in that case it is quite possible that multiple jobs write
> simultaneously to the tape. When later restoring such jobs, each
> job must be restored by a separate restore command, which can
> make the process very slow.
Good point... we might need a way to disable job concurrency in
that case, making sure migration jobs going to the same storage device
are not multiplexed, even if that storage device allows multiple
> If not, that is, if multiple jobs that
> have intermingled on tape are restored by one and the same restore
> command, then the names of the restored files will all be correct,
> but the sizes may be wrong and the contents may be garbage.
> So, this is more or less the background which led me to my statement
> that pervasive use of migration would currently show some
> deficiencies... I hope You understand...
I at least do. And I thank you for your insights - I'm quite sure
there are some hours of research behind it.
> best regards,