[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Bacula-devel] A bug in 2.2.8-jobmedia.patch?


I wrote:

> That should fix the most common case of the multiple small jobs starting
> spooling at the same time.  I've applied it, and will check that this
> does, in fact, happen, and report back.

It worked as expected.  Multiple jobs spooling at the same time, and
despooling sequentially, now have their tape usage properly reflected in
the JOBMEDIA table.  Here's an example:

 jobmediaid | jobid | mediaid | firstindex | lastindex | startfile | endfile | startblock | endblock 
------------+-------+---------+------------+-----------+-----------+---------+------------+----------
        184 |    87 |       3 |          1 |       743 |        58 |      58 |          0 |      192
        185 |    88 |       3 |          1 |       103 |        58 |      58 |        193 |      843
        186 |    89 |       3 |          1 |       282 |        58 |      58 |        844 |     6341
        187 |    90 |       3 |          1 |        76 |        58 |      58 |       6342 |     6389

Kern Sibbald <kern@xxxxxxxxxxx> writes:

> Still I have some doubts about what gets put in the jobmedia for the start of 
> a second despooling call.

That's the other failure scenario, and no, your fix doesn't address it.

> The session labels don't contain any position dependent data, so that is not 
> important.

Ah, great!  I didn't read the code carefully; I assumed that since the
function that creates the label copies the tape position into the dcr,
it must also need it.

> If you can convince/show me that something is going wrong with multiple 
> spoolings, then your suggestion becomes much more important.

The sequence of events that causes that error is quite simple.  It will
"bite" when two jobs are spooling at the same time, and the one that
despools first is big enough to need two rounds of spooling.

I'll give a run-down of what happens (the example is from a run before
your latest fix, but that fix cannot possibly have any effect, since the
fix involves StartFile/StartBlock, not EndFile/EndBlock).

Two jobs are shown.  Job 1 is a large job from a fast host, job 2 is a
somewhat smaller one from a slower host.  Both start spooling at the
same time, then job 1 hits its limit, and despools.  By the time it is
done despooling the first batch, job 2 is ready to despool.  While job 2
despools, job 1 completes its second spooling run, and waits for access
to the tape before despooling.

The problem is in the call write_session_label() that job 1 performs at
the end of its second spooling run, to generate the end label.  This
causes EndFile and EndBlock in job 1's dcr to be set to the current tape
position (while job 2 is despooling).  The reason this is a problem is
that this overwrites the information about where job 1's first
despooling actually finished, and this data is still needed: the last
JOBMEDIA record for the first despooling run has not been written yet,
but will be written when the second despooling run starts.

First, the relevant rows from the FILEMEDIA table (the bogus row is the
one that has jobmediaid 28):

 jobmediaid | jobid | mediaid | firstindex | lastindex | startfile | endfile | startblock | endblock
------------+-------+---------+------------+-----------+-----------+---------+------------+---------
          1 |     1 |       1 |          1 |      5959 |         0 |       0 |          1 |    15499
          2 |     1 |       1 |       5959 |      9885 |         1 |       1 |          0 |    15499
          3 |     1 |       1 |       9885 |     15463 |         2 |       2 |          0 |    15499
          4 |     1 |       1 |      15463 |     15673 |         3 |       3 |          0 |    15499
          5 |     1 |       1 |      15673 |     15903 |         4 |       4 |          0 |    15499
          6 |     1 |       1 |      15903 |     16161 |         5 |       5 |          0 |    15499
          7 |     1 |       1 |      16161 |     16386 |         6 |       6 |          0 |    15499
          8 |     1 |       1 |      16386 |     16642 |         7 |       7 |          0 |    15499
          9 |     1 |       1 |      16642 |     16863 |         8 |       8 |          0 |    15499
         10 |     1 |       1 |      16863 |     17048 |         9 |       9 |          0 |    15499
         11 |     1 |       1 |      17048 |     17254 |        10 |      10 |          0 |    15499
         12 |     1 |       1 |      17254 |     17472 |        11 |      11 |          0 |    15499
         13 |     1 |       1 |      17472 |     17682 |        12 |      12 |          0 |    15499
         14 |     1 |       1 |      17682 |     17904 |        13 |      13 |          0 |    15499
         15 |     1 |       1 |      17904 |     21904 |        14 |      14 |          0 |    15499
         16 |     1 |       1 |      21904 |     28550 |        15 |      15 |          0 |    15499
         17 |     1 |       1 |      28550 |     28556 |        16 |      16 |          0 |    15499
         18 |     2 |       1 |          1 |      4229 |        17 |      17 |       2756 |    15499
         19 |     2 |       1 |       4229 |      4905 |        18 |      18 |          0 |    15499
         20 |     2 |       1 |       4905 |      9033 |        19 |      19 |          0 |    15499
         21 |     2 |       1 |       9033 |     11691 |        20 |      20 |          0 |    15499
         22 |     2 |       1 |      11691 |     14411 |        21 |      21 |          0 |    15499
         23 |     2 |       1 |      14411 |     15215 |        22 |      22 |          0 |    15499
         24 |     2 |       1 |      15215 |     15942 |        23 |      23 |          0 |    15499
         25 |     2 |       1 |      15942 |     16795 |        24 |      24 |          0 |    15499
         26 |     2 |       1 |      16795 |     17586 |        25 |      25 |          0 |    15499
         27 |     2 |       1 |      17586 |     17589 |        26 |      26 |          0 |    11094
         28 |     1 |       1 |      28556 |     28558 |        17 |      20 |          0 |     3397
         29 |     1 |       1 |      28558 |     28561 |        26 |      26 |      11095 |    15499
         30 |     1 |       1 |      28561 |     28707 |        27 |      27 |          0 |    15499
         31 |     1 |       1 |      28707 |     31670 |        28 |      28 |          0 |    15499
         32 |     1 |       1 |      31670 |     31684 |        29 |      29 |          0 |    15499
         33 |     1 |       1 |      31684 |     31812 |        30 |      30 |          0 |    15499
         34 |     1 |       1 |      31812 |     32824 |        31 |      31 |          0 |    15499
         35 |     1 |       1 |      32824 |     56664 |        32 |      32 |          0 |    15499
         36 |     1 |       1 |      56664 |     56668 |        33 |      33 |          0 |      717

Here are some entries from the log file, with a couple of comments interspersed:

24-Mar 17:06 sirius-dir JobId 1: Start Backup JobId 1, Job=athene_User.2008-03-24_17.06.03
24-Mar 17:06 sirius-sd JobId 1: Spooling data ...
24-Mar 17:06 sirius-dir JobId 2: Start Backup JobId 2, Job=barsoom_User.2008-03-24_17.06.04
24-Mar 17:06 sirius-sd JobId 2: Spooling data ...
24-Mar 17:34 sirius-sd JobId 1: User specified spool size reached.
24-Mar 17:34 sirius-sd JobId 1: Writing spooled data to Volume. Despooling 17,179,901,776 bytes ...
  *** job 1 logs correct tape positions
24-Mar 17:56 sirius-sd JobId 2: Job write elapsed time = 00:50:10, Transfer rate = 3.166 M bytes/second
24-Mar 17:56 sirius-sd JobId 2: Committing spooled data to Volume "DLT0001". Despooling 9,539,116,533 bytes ...
24-Mar 18:36 sirius-sd JobId 1: Despooling elapsed time = 01:02:06, Transfer rate = 4.610 M bytes/second
24-Mar 18:36 sirius-sd JobId 1: Spooling data again ...
24-Mar 18:36 sirius-sd JobId 2: [starts physically despooling to tape]
  *** job 2 logs correct tape positions
24-Mar 18:46 sirius-sd JobId 1: Job write elapsed time = 01:40:20, Transfer rate = 3.901 M bytes/second
  *** job 1 erroneously sets end file/block in dcr to current tape position
  *** this is done by the write_session_label() call at stored/append.c:291
  *** this will cause the old, non-flushed jobmedia record to get the wrong data when it gets flushed later
24-Mar 18:46 sirius-sd JobId 1: Committing spooled data to Volume "DLT0001". Despooling 6,331,234,514 bytes ...
24-Mar 19:08 sirius-sd JobId 2: Despooling elapsed time = 00:32:30, Transfer rate = 4.891 M bytes/second
24-Mar 19:08 sirius-sd JobId 1: [starts physically despooling to tape]
  *** job 1 flushes that old record, giving it the wrong end file/block from the dcr
  *** job 1 then logs correct tape positions
24-Mar 19:31 sirius-sd JobId 1: Despooling elapsed time = 00:22:48, Transfer rate = 4.628 M bytes/second

> However, I have always been unhappy that capturing the Volume position was 
> done in the session label subroutine (it is sort of hidden there), so I would 
> be more inclined to move the end code out into a new subroutine, like I have 
> already done for the set_start_vol_position(), then remove it from the 
> session_label() subroutine, and call the two subroutines at the appropriate 
> times -- that makes what is actually going on much clearer.

That would be the correct fix, of course.  Put it on your to-do list!  ;)

Meanwhile, another quick fix (and a sufficient one) would be to add code
to write that final JOBMEDIA row to the database at the end of the first
despooling round.  In the case where a large job spools in two or more
rounds, this would lead to a slight inelegancy, in that there would be
two rows together showing that the job used all of the tape file it was
in at the intersection between spooling rounds.  However, that data
would still be correct, whereas the current code produces erroneous data.

-tih
-- 
Self documenting code isn't. User application constraints don't. --Ed Prochak

-------------------------------------------------------------------------
Check out the new SourceForge.net Marketplace.
It's the best place to buy or sell services for
just about anything Open Source.
http://ad.doubleclick.net/clk;164216239;13503038;w?http://sf.net/marketplace
_______________________________________________
Bacula-devel mailing list
Bacula-devel@xxxxxxxxxxxxxxxxxxxxx
https://lists.sourceforge.net/lists/listinfo/bacula-devel


This mailing list archive is a service of Copilotco.