
Re: [Bacula-devel] bacula-sd hanging after tape gets full + unload (2.5.19)


Hello Pasi,

On Thursday 04 December 2008 13:13:56 Pasi Kärkkäinen, you wrote:
> On Thu, Nov 13, 2008 at 05:37:06PM +0100, Eric Bollengier wrote:
> > Hello,
> >
> > On Thursday 13 November 2008 17:03:10 Pasi Kärkkäinen, you wrote:
> > > Hello list!
> > >
> > > I'm using Bacula 2.5.19 and trying 'copy jobs' feature to copy jobs
> > > from disk volumes/pools to tape.
> > >
> > > Sometimes bacula-sd seems to get stuck.. it hangs without doing
> > > anything. Now it happened when tape got full and Bacula started to
> > > change the tape on the drive (using autoloader):
> > >
> > > bacula-sd JobId 3082: Start Copying JobId 3082, Job=CopyPool4UncopiedToTape.2008-11-13_10.53.04.54
> > > bacula-sd JobId 3082: Using Device "IBM-LTO3-Drive"
> > > bacula-sd JobId 3082: Ready to read from volume "Pool4-Vol-0127" on device "FSDevice4" (/mnt/backup1/pool04).
> > > bacula-sd JobId 3082: Forward spacing Volume "Pool4-Vol-0127" to file:block 0:218.
> > > bacula-sd JobId 3082: End of Volume "756NNNL3" at 764:10067 on device "IBM-LTO3-Drive" (/dev/nst0). Write of 64512 bytes got -1.
> > > bacula-sd JobId 3082: Re-read of last block succeeded.
> > > bacula-sd JobId 3082: End of medium on Volume "756NNNL3" Bytes=725,237,130,240 Blocks=11,241,894 at 13-Nov-2008 11:51.
> > > bacula-sd JobId 3082: 3307 Issuing autochanger "unload slot 3, drive 0" command.
> > >
> > > <nothing happens after this>
> > >
> > >
> > > *sta
> > > Status available for:
> > >      1: Director
> > >      2: Storage
> > >      3: Client
> > >      4: All
> > > Select daemon type for status (1-4): 2
> > >
> > > ...
> > >
> > > Device status:
> > > Autochanger "IBM-LTO3-AutoChanger" with devices:
> > >    "IBM-LTO3-Drive" (/dev/nst0)
> > > Device "FSDevice0" (/mnt/backup1/pool00) is not open.
> > > Device "FSDevice1" (/mnt/backup1/pool01) is not open.
> > > Device "FSDevice2" (/mnt/backup1/pool02) is not open.
> > > Device "FSDevice3" (/mnt/backup1/pool03) is not open.
> > > Device "FSDevice4" (/mnt/backup1/pool04) is mounted with:
> > >     Volume:      Pool4-Vol-0127
> > >     Pool:        Pool4
> > >     Media type:  File4
> > >     Total Bytes Read=1,649,507,328 Blocks Read=25,569 Bytes/block=64,512
> > >     Positioned at File=0 Block=1,649,507,534
> > > Device "IBM-LTO3-Drive" (/dev/nst0) is not open.
> > >     Device is being initialized.
> > >     Drive 0 is not loaded.
> > > ====
> > >
> > > Used Volume status:
> > >
> > > <hangs here and nothing happens>
> > >
> > >
> > > I can exit bconsole by pressing CTRL+C multiple times.. if I restart
> > > bconsole and run that again, it gets stuck again..
> > >
> > > I tried 'strace -p <pid>' to see what bacula-sd is doing:
> > >
> > > # strace -p 7339
> > > Process 7339 attached - interrupt to quit
> > > select(5, [4], NULL, NULL, NULL <unfinished ...>
> > > Process 7339 detached
> > >
> > > So.. bacula-sd seems to be stuck on select() ..
> > >
> > > Running 'mtx' seems to work fine.. at the same time when bacula-sd is
> > > stuck.
> > >
> > > # mtx -f /dev/sg3 status
> > >   Storage Changer /dev/sg3:1 Drives, 8 Slots ( 0 Import/Export )
> > > Data Transfer Element 0:Empty
> > >       Storage Element 1:Full :VolumeTag=179MMML3
> > >       Storage Element 2:Full :VolumeTag=658NNNL3
> > >       Storage Element 3:Full :VolumeTag=756NNNL3
> > >       Storage Element 4:Full :VolumeTag=177MMML3
> > >       Storage Element 5:Full :VolumeTag=655NNNL3
> > >       Storage Element 6:Full :VolumeTag=656NNNL3
> > >       Storage Element 7:Full :VolumeTag=657NNNL3
> > >       Storage Element 8:Full :VolumeTag=CLNU38L1
> > >
> > >
> > > Any ideas how to fix this? Other than restarting Bacula..
> >
> > Could you stop all daemons with a SIGSEGV to force a backtrace?
> > killall -SEGV bacula-sd bacula-dir
> >
> > (You will find two kinds of files, *traceback and *bactrace, in the working
> > directory.)
> >
> > Afterwards, if you can put the results on a pastebin, it will give
> > information about your problem.
>
> Ok, problems again.. here are the tracebacks:
>
> http://pasik.reaktio.net/bacula/debug/bacula-sd-traceback.txt
> http://pasik.reaktio.net/bacula/debug/bacula-dir-traceback.txt
>
> Here's what I did to make bacula-sd hang:
>
> 1. Rebooted the bacula server and the tape library
> 2. Right after the reboot, made sure mtx and the Bacula mtx-changer work OK.
> 3. Started bacula
> 4. Ran a job that copies jobs from disk pool to tape pool
> 5. Bacula starts a bunch of jobs, but nothing happens.. bacula-sd is stuck.
>
> Any ideas how to debug this further?

Thanks for this traceback, it's very useful. I have found a problem in the
code.

In bool DCR::can_i_write_volume() we have:
   lock_read_volumes();
   vol = find_read_volume(VolumeName);

The first step of find_read_volume() is itself to call lock_read_volumes(),
and this lock is not recursive, so the thread ends up waiting on a lock it
already holds.
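
To illustrate the pattern, here is a minimal sketch that is not taken from the
Bacula source. It assumes the lock behaves like a plain POSIX mutex, and
relock_result() is a hypothetical helper. It locks a mutex twice from the same
thread, the way can_i_write_volume() followed by find_read_volume() would. An
error-checking mutex is used in place of a normal one so that the second lock
reports EDEADLK instead of hanging; a recursive mutex accepts the second lock.

```cpp
#include <cassert>
#include <cerrno>
#include <pthread.h>

// Hypothetical helper, for illustration only (not part of Bacula).
// Locks a mutex of the given type twice from the calling thread and
// returns the result of the second pthread_mutex_lock() call.
int relock_result(int mutex_type)
{
   pthread_mutexattr_t attr;
   pthread_mutex_t mutex;
   int rc;

   rc = pthread_mutexattr_init(&attr);
   assert(rc == 0);
   rc = pthread_mutexattr_settype(&attr, mutex_type);
   assert(rc == 0);
   rc = pthread_mutex_init(&mutex, &attr);
   assert(rc == 0);
   pthread_mutexattr_destroy(&attr);

   // First lock: stands in for the caller taking the lock.
   rc = pthread_mutex_lock(&mutex);
   assert(rc == 0);
   (void)rc;

   // Second lock from the same thread: stands in for the callee
   // taking the same lock again.
   int second = pthread_mutex_lock(&mutex);

   // Unlock once per successful lock, then clean up.
   if (second == 0) {
      pthread_mutex_unlock(&mutex);
   }
   pthread_mutex_unlock(&mutex);
   pthread_mutex_destroy(&mutex);
   return second;
}
```

Calling relock_result(PTHREAD_MUTEX_ERRORCHECK) should return EDEADLK, while
relock_result(PTHREAD_MUTEX_RECURSIVE) should return 0. With
PTHREAD_MUTEX_NORMAL the second lock would simply block, so do not try that
case outside a throwaway process.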

I will take a look now.

Bye

> Atm I'm running Bacula 2.5.20 (svn rev 8083) on CentOS 5.2 x86 32bit.
>
> I also tried applying 2.4.3-sd-deadlock.patch (from bug #1192) but it
> didn't seem to help.
>
> -- Pasi


