[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Bacula-devel] Backup Corruption with Certain Types of Drive

Hi, all,

After noticing a curious warning message from a catalog backup, I have
been doing a little investigation and I now think that, with certain
types of drive, a backup which spans more than one tape can corrupt the
data being written at the point where the tape change takes place.

When a tape reaches the end of medium, there are two ways in which the
drive may act:

1) The drive writes until it hits the end of the tape, i.e. the write
  fails because the drive cannot pull any more tape past the heads.
  In this case the write does not take place, and an EOM sense code
  is returned.

2) The drive detects *impending* end of medium. In this case, the
  write takes place, but an EOM is still returned.

In case 1, everything is fine, and the backup proceeds as expected (DLT
and LTO both appear to behave in this way). In case 2, however, the same
data appears to be written twice: as the last block of the first tape,
and then again as the first block of the second.

Here is what happened to me:

My catalog backup produced this output:

23-Nov 22:03 gershwin-dir JobId 506: Using Device "DDS3-0"
23-Nov 22:06 gershwin-sd JobId 506: End of Volume "MainCatalog-004" at 57:4963 on device "DDS3-0" (/dev/rmt/1cbn). Write of 64512 bytes got 0.
23-Nov 22:06 gershwin-sd JobId 506: Error: Re-read of last block OK, but block numbers differ. Last block=4963 Current block=4963.
23-Nov 22:06 gershwin-sd JobId 506: End of medium on Volume "MainCatalog-004" Bytes=28,148,843,520 Blocks=436,334 at 23-Nov-2008 22:06.
23-Nov 22:06 gershwin-sd JobId 506: 3307 Issuing autochanger "unload slot 4, drive 0" command.

And, after my initial confusion at being told that the block numbers differed
because the last block was 4963 instead of 4963 ;-) I decided to investigate,
and I found from the source code that the SD was actually expecting to find
block 4962 at the end of the tape, hence the error.

I ran btape on the volume and got this:

Block: 4959 size=64512
Block: 4960 size=64512
Block: 4961 size=64512
Block: 4962 size=64512
Block: 4963 size=64512
26-Nov 13:44 bls JobId 0: End of file 58 on device "dds3" (/dev/rmt/1cbn), Volume "MainCatalog-004"

So it looks as if block 4963 has indeed been successfully written to the tape.
The next step was to restore the job which spanned the two tapes:

29-Nov 03:14 gershwin-sd JobId 510: End of all volumes.
29-Nov 03:14 gershwin-fd JobId 510: Error: attribs.c:421 File size of restored file /tmp/bacula-restores/var/bacula/bacula.sql not correct. Original 538130868, restored 538195332.

The difference between the two sizes is 64464. I am guessing this would be a
64512 byte block, minus a 48 byte header. Does anyone know is this is correct?

It looks as if one block of data has indeed been written twice, resulting in
"bad" data being backed up. If there is a general agreement that I am correct
here, then I think I will have a go at fixing it. It is an easy condition to
detect, since the last block number read back from the tape tells us whether
the current block has been written or not.


PS I ran a btape test on a pair of blank tapes, and it passed the test, although
it actually generated the same error as I got when the backup changed tapes:

Wrote blk_block=150000, dev_blk_num=6500 VolBytes=9,676,735,488 rate=841.6 KB/s
Wrote blk_block=155000, dev_blk_num=11500 VolBytes=9,999,295,488 rate=842.6 KB/s
04-Dec 20:56 btape JobId 0: End of Volume "TestVolume1" at 14:13639 on device "DDS3-0" (/dev/rmt/1cbn). Write of 64512 bytes got 0.
04-Dec 20:56 btape JobId 0: Error: Re-read of last block OK, but block numbers differ. Last block=157139 Current block=157139.
btape: btape.c:2349 Last block at: 14:13638 this_dev_block_num=13639
btape: btape.c:2383 End of tape 14:0. VolumeCapacity=10,137,286,656. Write rate = 842.5 KB/s
20:59:34 Done filling tapes at 0:13. Now beginning re-read of first tape ...
04-Dec 20:59 btape JobId 0: 3307 Issuing autochanger "unload slot 2, drive 0" command.
04-Dec 21:00 btape JobId 0: 3304 Issuing autochanger "load slot 1, drive 0" command.
04-Dec 21:00 btape JobId 0: 3305 Autochanger "load slot 1, drive 0", status is OK.
04-Dec 21:01 btape JobId 0: Ready to read from volume "TestVolume1" on device "DDS3-0" (/dev/rmt/1cbn).
Reading the first 10000 records from 0:0.
10000 records read now at 1:5084
Reposition from 1:5084 to 14:13638
Reading block 13638.

The last block of the first tape matches.
Reposition from 0:0 to 0:1
Reading block 1.

The first block on the second tape matches.

Reposition from 0:2 to 0:11
Reading block 11.

The last block on the second tape matches. Test succeeded.

PPS I was lucky, since this was a catalog backup. I simply ran the backup again
manually. I'd still like to fix it, though! :-)


SF.Net email is Sponsored by MIX09, March 18-20, 2009 in Las Vegas, Nevada.
The future of the web can't happen without you.  Join us at MIX09 to help
pave the way to the Next Web now. Learn more and register at
Bacula-devel mailing list

This mailing list archive is a service of Copilotco.