[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Bacula-devel] Patch: Migration jobmedia table insert incomplete


Hello,

Nice investigation work!

This is certainly an interesting bug.  Fortunately, the only apparent 
consequences are that getting to the first migration record is slow (i.e. the 
index is not properly setup).

In principle, this bug shouldn't exist because migration uses all the same 
output routines that a backup uses -- except that as you point out, a 
migration job skips creating session start/end records because they are 
simply copied.

Your preliminary patch probably corrects a good number of problems, but it 
also suffers from two things that I can see:

1. dev can change during execution of the code if the device is switched.  
Therefore, at this top level routine, it must be reloaded each time it is 
referenced -- this points out a bug that I should declare it volatile in the 
dcr packet to prevent the compiler from being too smart in optimization.

2. the same bookkeeping that is done in creating a session start label is done 
in the session end label routine to get an accurate end of all the records.  
If this is not done properly, it will be more serious in that Bacula might 
stop prematurely during a restore.

The cleanest solution that I can think for this problem is to make the 
migration code call the session label subroutine, and let the subroutine sort 
it out.  I.e. we need to ensure that for migration jobs the subroutine 
doesn't actually write a record this can probably be best done by adding a 
new argument that tells it to write or not.

Another possibility would be to move the index bookkeeping code out of the 
session label routine -- it would make it less likely for such problems to 
occur in the future at the expense of some additional complications ...

What do you think?

On Sunday 10 February 2008 04.52:26 Peter Much wrote:
> Abstract:
> ---------
> Migration jobs do not insert the jobmedia.startfile attribute
> in the catalog.
>
>
> Impact:
> -------
> Very slow restore.
>
>
> Example:
> --------
> bacula=> select jobid,name,type,jobfiles,poolid,priorjobid from job
>          where jobid=1728 or jobid=1729 or jobid=1748 or jobid=1749
>          order by jobid;
>  jobid |        name        | type | jobfiles | poolid | priorjobid
> -------+--------------------+------+----------+--------+------------
>   1728 | HomeDirsDisp       | B    |     4398 |      5 |          0
>   1729 | HomeDirsDisp       | M    |     4398 |      2 |          0
>   1748 | MoveToTapeInternal | g    |        0 |      5 |          0
>   1749 | HomeDirsDisp       | B    |     4398 |      5 |       1729
> (4 rows)
>
>  -> job 1728 saves directly to pool 5. Job 1729 saves the same data
>     to pool 2, and migrating job 1748/1749 moves it on to pool 5.
>
> bacula=> select jobid,firstindex,lastindex,startfile,endfile,
>                 startblock,endblock from jobmedia
>          where jobid=1728 or jobid=1729 or jobid=1748 or jobid=1749
>          order by jobid;
>  jobid | firstindex | lastindex | startfile | endfile | startblock |
> endblock
> -------+------------+-----------+-----------+---------+------------+-------
>---- 1728 |          1 |      4398 |        55 |      55 |          0 |     
> 3466 1729 |          1 |      4398 |         0 |       0 |        217 |
> 223662044 1749 |          1 |      4398 |         0 |      56 |          0
> |      3466 (3 rows)
>
>  -> Job 1728 sets startfile attribute, migrate job 1749 sets just 0.
>
>
> Comment:
> --------
> 1. I have tried to fix the StartFile issue, so that my restores
>    run better. I have not further checked how the other data are
>    derived on Migration and if there may be more problems.
>    Usually I do such, but I was tired at that point. ;)

Yes, it looks like you fixed the StartFile issue, but the same problem may 
exist on the end file position.

>
> 2. The existence of jcr->dcr is checked twice in that function,
>    once right after entry, and once at label "ok_out:". Could it
>    be something "stealing" this object inflight?

The DCR should have a fixed address and should not be stolen inflight.  The 
bulk of the code was copied from append.c, so it is possible that there are 
some inefficiencies.

>    If so, then my earlier using of "dev" as a shorthand for
>    jcr->dcr->dev is illegitim, and things need to be written
>    more cumbersomely.

In many routines, I set dev at the top of the subroutine and use it as 
shorthand.  However, at this high a level, one cannot do that because dev can 
change during the backup if the device changes.  E.g. when reading files that 
were written to two devices, dev will change, and (currently not implemented) 
during backup if the end of a tape is reached and Bacula decides to change 
drives, dev will also change.

>    (I have obviousely not gone into the layout of the whole
>    software conception already; therefore these patches should
>    be considered more as suggestions to where the problem seems
>    to arise, and not so much as readymade fixes!)

Well, it is a very good start.  

Would you please submit a bug report on this with basically the text that you 
wrote here (you can obviously change it taking into account what I have 
written, if you wish) -- that will guarantee that the problem is resolved and 
ensure that everyone who is interested sees it ...

Best regards,

Kern

>
> Patch:
> ------
> --- src/stored/mac.c.orig       Wed Oct  3 13:36:47 2007
> +++ src/stored/mac.c    Sun Feb 10 04:28:54 2008
> @@ -89,7 +89,7 @@
>     }
>
>     Dmsg3(200, "Found %d volumes names for %s. First=%s\n",
> jcr->NumReadVolumes, -      jcr->VolList->VolumeName, Type);
> +      Type, jcr->VolList->VolumeName);
>
>     /* Ready devices for reading and writing */
>     if (!acquire_device_for_read(jcr->read_dcr) ||
> @@ -98,8 +98,26 @@
>        goto bail_out;
>     }
>
> -   Dmsg2(200, "===== After acquire pos %u:%u\n", jcr->dcr->dev->file,
> jcr->dcr->dev->block_num); +   dev = jcr->dcr->dev;
> +   Dmsg2(200, "===== After acquire pos %u:%u\n", dev->file,
> dev->block_num);
>
> +   /* On MIGRATE we must setup some dcr jobmedia data. On BACKUP
> +    * this is done in write_session_label().
> +    */
> +   switch(jcr->JobType) {
> +   case JT_MIGRATE:
> +      if (dev->is_tape()) {
> +         jcr->dcr->StartBlock = dev->block_num;
> +         jcr->dcr->StartFile  = dev->file;
> +      } else {
> +         jcr->dcr->StartBlock = (uint32_t)dev->file_addr;
> +         jcr->dcr->StartFile = (uint32_t)(dev->file_addr >> 32);
> +      }
> +      break;
> +   default:
> +      break;
> +   }
> +
>     set_jcr_job_status(jcr, JS_Running);
>     dir_send_job_status(jcr);
>
> @@ -117,7 +135,6 @@
>
>  ok_out:
>     if (jcr->dcr) {
> -      dev = jcr->dcr->dev;
>        if (ok || dev->can_write()) {
>           /* Flush out final partial block of this session */
>           if (!write_block_to_device(jcr->dcr)) {
>
> -------------------------------------------------------------------------
> This SF.net email is sponsored by: Microsoft
> Defy all challenges. Microsoft(R) Visual Studio 2008.
> http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
> _______________________________________________
> Bacula-devel mailing list
> Bacula-devel@xxxxxxxxxxxxxxxxxxxxx
> https://lists.sourceforge.net/lists/listinfo/bacula-devel



-------------------------------------------------------------------------
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
_______________________________________________
Bacula-devel mailing list
Bacula-devel@xxxxxxxxxxxxxxxxxxxxx
https://lists.sourceforge.net/lists/listinfo/bacula-devel


This mailing list archive is a service of Copilot Consulting.