
[Bacula-devel] Autoloader issue (was: Patch: Migration jobmedia table insert incomplete)


<al@xxxxxxxxxxxxxx> aka Arno Lehmann wrote
on Sun, 02 Mar 2008 12:50:17 +0100 in m2n.bacula.devel:

|> 2. This is the thing that I have been worrying the most about. I
|>    have been following various theories about what might happen
|>    there, yet to no avail. The last of my theories was that it might
|>    have to do with the migrations, but currently I tend to dismiss
|>    this theory also. In fact, I am still clueless.
|>    What happens is that the Director puts all jobs (and all newly
|>    started jobs) into either "waiting on max Storage jobs" or
|>    "waiting execution", while there is no job running on any client
|>    and no job running on the SD. It just does nothing and has to
|>    be restarted.
|
|That definitely qualifies as a bug... have you tried looking at the 
|debug output, once the DIR is in this state?

This was a good hint. The debug output shows this:

>BxDir: jcr.c:603-0 OnEntry JobStatus=s set=s
>BxDir: jcr.c:623-0 OnExit JobStatus=s set=s
>BxDir: jobq.c:701-0 Wstore=Files
>BxDir: jobq.c:723-0 Fail wncj=-2

I have also seen rncj=-2 and rncj=3.

Looking into jobq.c, I find that rncj is never supposed to take any
value except 0 and 1 (maximum one read job per device).
OTOH, I find that rncj is not a separate counter - it is just the
NumConcurrentJobs of whatever Storage resource jcr->rstore points to,
and wncj is the same field reached through jcr->wstore.

So this does not seem to be a migration issue; it seems to be a
problem with multi-drive autoloaders.
According to the manual, since Bacula version 1.whatever an
autoloader has to be defined as a single device in the DIR.
So, if this autoloader has multiple drives, it is quite possible
that these drives are used for reading AND writing at the same time.

And this seems to break the rncj/wncj logic. My current best
interpretation is this: Suppose we have one restore running:
rncj=1. Then two backups start on the same Storage: wncj=rncj=3.
Then the restore terminates and sets rncj=0, which also zeroes wncj,
since both are the same counter. So, when the two backup jobs
terminate, the counter goes to -2 - and this is where the show ends.
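
To make the arithmetic concrete, here is a minimal standalone sketch
of that sequence. It is not Bacula code: the struct and all function
names are hypothetical, and it assumes that a read job and a write
job on the same autoloader share one NumConcurrentJobs counter and
that the write side releases by decrementing.

```c
/* Hypothetical stand-in for a Director Storage resource: one counter
 * shared by every job that reads from or writes to the autoloader. */
struct store {
    int NumConcurrentJobs;
};

/* Any job (restore or backup) taking the storage bumps the counter. */
void acquire(struct store *s) { s->NumConcurrentJobs++; }

/* Assumed write-side release: a plain decrement. */
void release_write(struct store *s) { s->NumConcurrentJobs--; }

/* Read-side release as in the unpatched code: reset to zero. */
void release_read_reset(struct store *s) { s->NumConcurrentJobs = 0; }

/* Read-side release as in the proposed fix: decrement, but never
 * below zero. */
void release_read_guarded(struct store *s)
{
    if (s->NumConcurrentJobs > 0) {
        s->NumConcurrentJobs--;
    }
}

/* One restore starts, two backups start, the restore ends, then both
 * backups end. Returns the final value of the shared counter. */
int run_scenario(void (*release_read)(struct store *))
{
    struct store s = { 0 };

    acquire(&s);        /* restore running: counter = 1 */
    acquire(&s);        /* first backup:    counter = 2 */
    acquire(&s);        /* second backup:   counter = 3 */
    release_read(&s);   /* restore ends: reset gives 0, guarded gives 2 */
    release_write(&s);  /* first backup ends */
    release_write(&s);  /* second backup ends */
    return s.NumConcurrentJobs;
}
```

Under these assumptions run_scenario(release_read_reset) ends at -2,
matching the value in the debug output, while
run_scenario(release_read_guarded) returns to 0.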

I am now trying the following as a fix, to see if it helps.

rgds, 
PMc

--- src/dird/jobq.c.orig        Mon Dec 10 18:54:41 2007
+++ src/dird/jobq.c     Sun Mar  9 00:27:02 2008
@@ -478,7 +478,8 @@
           */
          if (jcr->acquired_resource_locks) {
             if (jcr->rstore) {
-               jcr->rstore->NumConcurrentJobs = 0;
+               if (jcr->rstore->NumConcurrentJobs > 0)
+                  jcr->rstore->NumConcurrentJobs--;
                Dmsg1(200, "Dec rncj=%d\n", jcr->rstore->NumConcurrentJobs);
             }
             if (jcr->wstore) {
@@ -738,7 +739,8 @@
          Dmsg1(200, "Dec wncj=%d\n", jcr->wstore->NumConcurrentJobs);
       }
       if (jcr->rstore) {
-         jcr->rstore->NumConcurrentJobs = 0;
+         if (jcr->rstore->NumConcurrentJobs > 0)
+            jcr->rstore->NumConcurrentJobs--;
          Dmsg1(200, "Dec rncj=%d\n", jcr->rstore->NumConcurrentJobs);
       }
       set_jcr_job_status(jcr, JS_WaitClientRes);
@@ -753,7 +755,8 @@
          Dmsg1(200, "Dec wncj=%d\n", jcr->wstore->NumConcurrentJobs);
       }
       if (jcr->rstore) {
-         jcr->rstore->NumConcurrentJobs = 0;
+         if (jcr->rstore->NumConcurrentJobs > 0)
+            jcr->rstore->NumConcurrentJobs--;
          Dmsg1(200, "Dec rncj=%d\n", jcr->rstore->NumConcurrentJobs);
       }
       jcr->client->NumConcurrentJobs--;

