
Re: [Bacula-devel] Patch: Migration jobmedia table insert incomplete


01.03.2008 05:42, Peter Much wrote:
> Hello Kern!

ok, I admit I'm not Kern, but I hope I can still contribute a bit here :-)

> <kern@xxxxxxxxxxx> aka Kern Sibbald wrote
> on Tue, 26 Feb 2008 21:57:10 +0100 in m2n.bacula.devel:
> |Yes, there are some problems with migration, but you don't explicitly mention 
> |what deficiencies you expect to run into.
> Oops... should I?
> To be honest, I am a little bit worried, because I do not at all want
> to frustrate you.

Actually, I don't think Kern is that easily frustrated :-)

Of course, sometimes he makes it quite clear that he doesn't want to 
discuss certain things, or he is annoyed because thinking through all 
this takes time he prefers spending on actually *doing* things on 
Bacula, but in general, he likes to know about even small issues you 
have... after all, improving Bacula is what he wants to do :-)

> And I am in a dilemma: on one hand Bacula is something like a
> dream-come-true: I often thought of fetching a copy of IBM's TSM for
> my installation (I think I could get an employee's evaluation copy),
> but 1.) it is just too bulky for a couple of home computers, and 2.)
> it doesn't run on FreeBSD. So now I have what I wished to have...
> On the other hand, when trying to get the most out of Bacula, I find
> many of these small things that might still need a little fix-up. 
> Now, if I do report all of these, then I am the guy who is always
> criticising. :-/

As long as you can take the occasional rough reply, I don't see a 
problem. Of course you are aware that not every report will lead to an 
immediate fix :-)

> But ok, now just four things that come to my mind:
> 1. Every time a job is migrated, the "Run=" directives in
>    the job resource are executed again. This is almost never what
>    one wants to happen, and in fact tends to disrupt backup cycles
>    severely.

I would consider that a bug, but one that might need quite a bit of 
redesign to fix, and that affects only a limited number of users.

> 2. This is the thing that I have been worrying the most about. I
>    have been following various theories about what might happen
>    there, yet to no avail. The last of my theories was that it might
>    have to do with the migrations, but currently I tend to dismiss
>    this theory also. In fact, I am still clueless.
>    What happens is that the Director puts all jobs (and all newly
>    started jobs) into either "waiting on max Storage jobs" or
>    "waiting execution", while there is no job running on any client
>    and no job running on the SD. It just does nothing and has to
>    be restarted.

That definitely qualifies as a bug... have you tried looking at the 
debug output, once the DIR is in this state?

>    What I have learned from reading bacula-users, is that most 
>    people do not run such quantities of jobs as I do. So maybe this
>    is the reason.

Might be... how many jobs are you running in parallel?

> 3. When running a migration that will move multiple jobs, there is
>    a kind of "envelope" job: the "g" job that is started first will
>    start all the other "g" jobs that are needed. After this, this
>    "envelope" job itself will also do one of the migrations. But
>    occasionally this job just disappears silently and its activity
>    is not to be found in the logfile.

Again something to investigate, and most probably a real bug.

>    On one occasion it gave me a sig-11, which might give some hint
>    at what is going on there. From the logfile:
> 25-Feb 08:56 BxDir JobId 9595: The following 163 JobIds were chosen to 
> 	be migrated: 7705,7714,7723,7732,7741,7750,7759,...
> 25-Feb 08:56 BxDir JobId 9595: Job queued. JobId=9596
> 25-Feb 08:56 BxDir JobId 9595: Migration JobId 9596 started.
> 25-Feb 08:56 BxDir JobId 9595: Job queued. JobId=9597
> 25-Feb 08:56 BxDir JobId 9595: Migration JobId 9597 started.
> ..
>    The interesting thing here is that this output is not retained
>    until job 9595 would finish, instead it is dropped to the logfile
>    immediately at start of the job. And it ends in the middle of a
>    line:

That's probably OS output buffering. I like to run the DIR with output 
to the console in such a case.

Or just issue some console commands that create debug output, so the 
buffers get flushed.
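
To illustrate the buffering guess above, here is a minimal Python sketch. It only demonstrates the general effect that text written through a buffered stream does not reach the file until the buffer is flushed or fills up. It is not Bacula code, and the function name and the 8192-byte buffer size are assumptions made for the example.

```python
import os
import tempfile


def sizes_before_and_after_flush(message, buffer_size=8192):
    """Write `message` through a buffered file object and return the
    on-disk file size before and after an explicit flush.

    Hypothetical helper for illustration only; it assumes `message`
    is plain ASCII and shorter than `buffer_size`.
    """
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        # newline="" keeps "\n" untranslated so byte counts match len().
        with open(path, "w", buffering=buffer_size,
                  encoding="ascii", newline="") as log:
            log.write(message)
            before = os.path.getsize(path)  # data still sits in the buffer
            log.flush()
            after = os.path.getsize(path)   # now the data is in the file
        return before, after
    finally:
        os.remove(path)
```

When a buffer fills up partway through a message, only the part that fitted is written and the rest follows with the next flush, which would be consistent with a log that stops in the middle of a line and continues much later.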

> 25-Feb 08:57 BxDir JobId 9595: Migration JobId 9742 started.
> 25-Feb 08:57 BxDir JobId 9595: Job queued. JobId=9743
> 25-Feb 08:57 BxDir Jo25-Feb 08:57 BxDir JobId 9773: The following 163
> 	JobIds were chosen to be migrated: 7706,7715,7724,7733,...
> 25-Feb 08:57 BxDir JobId 9773: Job queued. JobId=9774
> 25-Feb 08:57 BxDir JobId 9773: Migration JobId 9774 started.
> 25-Feb 08:57 BxDir JobId 9773: Job queued. JobId=9775
> 25-Feb 08:57 BxDir JobId 9773: Migration JobId 9775 started.
> ..
>    The remaining part of the log of job 9595 follows a couple
>    of hours later:
> 25-Feb 10:52 BxDir: Fatal Error because: Bacula interrupted by signal
> 	11: Segmentation violation
> bId 9595: Migration JobId 9743 started.
> 25-Feb 08:57 BxDir JobId 9595: Job queued. JobId=9744
> 25-Feb 08:57 BxDir JobId 9595: Migration JobId 9744 started.
> 25-Feb 08:57 BxDir JobId 9595: Job queued. JobId=9745
> 25-Feb 08:57 BxDir JobId 9595: Migration JobId 9745 started.
> ..
>    At that point I decided that there is some problem, but that it is
>    not all too easy to find and fix. So I decided for now to
>    postpone the issue (indefinitely), and instead redesign my schedules
>    so that they would create fewer jobs. (I was saving
>    database redo-logs via a Bacula schedule, which means to check
>    every quarter of an hour if there are any to save - which every time 
>    does create an empty job that will qualify for later migration - and
>    that will nicely disappear during that migration. Now I have allowed
>    the database to call bconsole on demand only after it has batched 
>    up a couple of logs.)
> 4. When migrating from disk to tape, there should be no need to do
>    SD data spooling - as the data is already packed up, it will flow
>    quickly to the tape, and data spooling would only slow down the
>    process.
>    But in that case it is quite possible that multiple jobs write
>    simultaneously to the tape. When later restoring such jobs, each 
>    job must be restored by a separate restore command, which can
>    make the process very slow.

Good point... we might need a way to disable job concurrency in 
that case, making sure migration jobs going to the same storage device 
are not multiplexed, even if that storage device allows multiple 
concurrent jobs...

>    If not, that is, if multiple jobs that
>    have intermingled on tape are restored by one and the same restore
>    command, then the names of the restored files will all be correct,
>    but the sizes may be wrong and the contents may be garbage.
> So, this is more or less the background which led me to my statement
> that pervasive use of migration would currently show some
> deficiencies... I hope you understand...

I at least do. And I thank you for your insights - I'm quite sure 
there are some hours of research behind it.
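
The workaround from point 3 (letting the database call bconsole only once a few redo logs have piled up, instead of scheduling a job every quarter of an hour) could be sketched roughly as follows. This is a hedged illustration, not the poster's actual setup: the job name "RedoLogs", the log directory, the file pattern and the batch threshold are all assumptions, and the sketch relies only on bconsole reading its commands from standard input.

```python
import subprocess
from pathlib import Path


def pending_logs(log_dir, pattern="*.log"):
    """Return the redo-log files waiting in log_dir (pattern is assumed)."""
    return sorted(Path(log_dir).glob(pattern))


def should_run(pending, min_batch=4):
    """True once enough logs have accumulated to justify one backup job."""
    return len(pending) >= min_batch


def bconsole_input(job_name):
    """Console commands that start the job without interactive confirmation."""
    return f"run job={job_name} yes\nquit\n"


def trigger_backup(job_name, bconsole="bconsole"):
    """Feed the run command to bconsole on stdin (job name is hypothetical)."""
    subprocess.run([bconsole], input=bconsole_input(job_name),
                   text=True, check=True)
```

A database hook would then do something like `if should_run(pending_logs("/var/db/redo")): trigger_backup("RedoLogs")`, so no empty jobs are created that later have to be migrated.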


> best regards,
> PMc

Arno Lehmann
IT-Service Lehmann
