[Bacula-devel] Three issues on shutdown

For a long time I have been irritated that the Bacula DIR cannot be
killed at once on my system
(FreeBSD 5.5 and 6.3, Bacula 2.2.7 - 2.2.9-b2).

Now I have tried to figure out exactly what is going on (or going
wrong). In doing so I found three things, none of which seems to have
an easy solution. :(

[All following code snippets are taken from 2.2.9-b3, as this is
just what I have here.]

When asked to shut down in an orderly way via SIGTERM, the DIR
invokes the function terminate_dird() in file dird/dird.c.
There, various cleanup functions are invoked; one of them is
term_scheduler(). This function is where the shutdown
stops and the DIR starts to loop forever.

term_scheduler() in dird/scheduler.c looks like this:

	void term_scheduler()
	{
	  if (jobs_to_run) {
	     job_item *je;
	     /* Release all queued job entries to be run */
	     foreach_dlist(je, jobs_to_run) {
	        free(je);
	     }
	     delete jobs_to_run;
	  }
	}

This basically does not work.
What I am seeing is that the body of the "foreach" loop is executed
only twice, and the second time the address that is given to free()
reads "0xAAAAAAAA" - which is obviously not a valid address.
Then it hangs.
If I comment out the free(), it does not hang.

So, what we are doing here is practically something like this:

	while(je = next(je))
	   free(je);

we free a memory-chunk and *then* use this very memory-chunk's 
address as a reference to figure out our next memory-chunk.

This would not be a problem if the address were not contained
in the memory-chunk itself. But as it seems, it is.

It would probably also go unnoticed if we used a genuine Unix
free(), since that only releases the memory-chunk and in practice
usually leaves the data in it readable for a while (although
relying on that is undefined behaviour).
But in fact we have rewired free() to point to sm_free() in
lib/smartall.c. And *this* free() does a memset(target, 0xAA)
 - which explains a bit.

Now, for the solution: there are a couple of ways to
rearrange that algorithm so that this effect is avoided.
Most of them seem to work, but all the ones I have tried show another
misbehaviour: there is always one memory-chunk remaining, which
then gets reported as "orphaned" at the end of the shutdown.

It seems, this chunk does not even show up when walking the 
scheduler dlist. I would suppose that it has become orphaned
already earlier in the program run by some other effect.
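
To illustrate the kind of rearrangement meant above, here is a minimal
sketch. It assumes a hypothetical singly linked list (struct node) and a
stand-in poisoning_free() that mimics the 0xAA fill described for
sm_free(); neither is Bacula's actual dlist or smartall code. The only
point is that the link is read before the chunk is released. It does
nothing about the orphaned chunk.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical node type: the link lives inside the chunk itself. */
struct node {
    struct node *next;
    int payload;
};

/* Stand-in for a debugging free(): poison the chunk, then release it. */
void poisoning_free(struct node *n)
{
    memset(n, 0xAA, sizeof(*n));
    free(n);
}

/* Build a list of up to `count` nodes and return its head. */
struct node *make_list(int count)
{
    struct node *head = NULL;
    for (int i = 0; i < count; i++) {
        struct node *n = malloc(sizeof(*n));
        if (n == NULL)
            break;               /* return what we have so far */
        n->payload = i;
        n->next = head;
        head = n;
    }
    return head;
}

/* Free every node, saving the link BEFORE the chunk is poisoned.
 * Returns the number of nodes released. */
int free_list(struct node *head)
{
    int freed = 0;
    while (head != NULL) {
        struct node *next = head->next;  /* read the link first */
        poisoning_free(head);            /* chunk is garbage from here on */
        head = next;
        freed++;
    }
    return freed;
}
```

With this ordering, free_list(make_list(3)) walks all three nodes and
never looks into a chunk that has already been released.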

The second interesting question is, why does the DIR begin to 
loop forever from that point on?

Actually the sm_free() should detect that something is wrong
with the 0xAAAAAAAA address, and should create an ABORT condition -
which then gets propagated to a segmentation violation.

In fact it possibly tries to do that - but this will not work:
Due to the shutdown-initiating SIGTERM we are in a signal handler!
And the respective sa_mask has been set (from init_signals() in
lib/signal.c) by sigfillset() - that means: block all signals!

Now we have a funny condition: our process has segfaulted and
is likely no longer runnable - but it postpones the acceptance
of the SEGV signal.
I do not think it is well defined how a kernel should handle such
a situation. Mine continues to process sigtraps as spare CPU allows.
Others may simply get rid of the crap.

Again, a solution is not all too simple. And there is another
problem: the BSD manpage says this about signal handlers:

> [certain number of Unix library functions deleted]
> All functions not in the above lists are considered to be unsafe
> with respect to signals.  That is to say, the behaviour of such
> functions when called from a signal handler is undefined.  In 
> general though, signal handlers should do little more than set a 
> flag; most other actions are not safe.

Now, in the DIR, as far as I understand it, the whole elaborate
shutdown process, calling lots of functions not mentioned in that 
list, is all done within the signal handler.

Therefore, I would not consider it useful to now create a suitable
sigaction configuration for all demands; because chances are that
the thing would just not behave as expected.
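
For comparison, the pattern the manpage recommends can be sketched as
below. This is only a minimal illustration under my own assumptions, not
a proposal for how the DIR should be restructured: the names
shutdown_requested, install_shutdown_flag() and shutdown_pending() are
hypothetical. The handler sets a flag and nothing else; the main loop
would poll shutdown_pending() and run the actual cleanup in ordinary
(non-handler) context.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Set by the handler, polled from normal program context. */
static volatile sig_atomic_t shutdown_requested = 0;

/* The handler does nothing except set the flag. */
static void on_sigterm(int signo)
{
    (void)signo;
    shutdown_requested = 1;
}

/* Install the flag-setting handler for SIGTERM.
 * Returns 0 on success, -1 on error. */
int install_shutdown_flag(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigterm;
    sigemptyset(&sa.sa_mask);    /* block nothing extra in the handler */
    return sigaction(SIGTERM, &sa, NULL);
}

/* To be called from the main loop: nonzero once SIGTERM has arrived. */
int shutdown_pending(void)
{
    return shutdown_requested != 0;
}
```

A main loop would then look roughly like
"while (!shutdown_pending()) { do_work(); }" followed by the cleanup
calls, with do_work() standing in for whatever the daemon normally does.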

For now, I have changed the sigfillset() to sigemptyset() - this
seemingly does not provide handling for the SIGSEGV, but it does
terminate the process at that point.
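
The difference between the two masks can be seen in a small sketch. The
function names here are hypothetical and this is not the code from
lib/signal.c; it only installs a do-nothing SIGTERM handler with either a
full or an empty sa_mask and then asks the system whether SIGSEGV would
be blocked while that handler runs.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Do-nothing handler, only needed so that sigaction has a target. */
static void on_term(int signo)
{
    (void)signo;
}

/* Install on_term for SIGTERM. With block_all != 0 every signal is
 * blocked while the handler runs (sigfillset); otherwise no additional
 * signals are blocked (sigemptyset). Returns 0 on success, -1 on error. */
int install_term_handler(int block_all)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_term;
    if (block_all)
        sigfillset(&sa.sa_mask);
    else
        sigemptyset(&sa.sa_mask);
    return sigaction(SIGTERM, &sa, NULL);
}

/* Query the installed SIGTERM action: 1 if SIGSEGV is in its mask,
 * 0 if it is not, -1 on error. */
int segv_blocked_in_term_handler(void)
{
    struct sigaction cur;
    if (sigaction(SIGTERM, NULL, &cur) != 0)
        return -1;
    return sigismember(&cur.sa_mask, SIGSEGV);
}
```

With the full mask, a fault inside the handler cannot be delivered as
SIGSEGV right away; with the empty mask it can, which matches the
process terminating at that point.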

While investigating these things, I happened to wonder
why the Director's PID file did not get deleted on termination.
There is a simple explanation: after switching to its operating
privileges (which happens after the creation of the PID file)
the Director no longer has the right to delete it.

But again, the solution is not simple. If the file were created
after switching the credentials, there would usually be a lack of
permissions to create it in the standard /var/run directory, and
an exclusive subdirectory would be needed.
The latter seems to be the de-facto standard for credential-switching
daemons, but it adds another step to the installation process.
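
The subdirectory approach could look roughly like the following sketch.
All names (prepare_pid_dir(), write_pid_file(), remove_pid_file()) are
hypothetical and this is not how Bacula handles its PID file; it only
shows the order of steps: create the directory and hand it over while
still privileged, then create and later remove the file as the
unprivileged user.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* While still privileged: create `dir` if it is missing and give it
 * to the unprivileged uid/gid. Returns 0 on success, -1 on error. */
int prepare_pid_dir(const char *dir, uid_t uid, gid_t gid)
{
    if (mkdir(dir, 0755) != 0 && errno != EEXIST)
        return -1;
    return chown(dir, uid, gid);
}

/* After dropping privileges: write our PID into `path`.
 * Returns 0 on success, -1 on error. */
int write_pid_file(const char *path)
{
    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;
    int ok = fprintf(f, "%ld\n", (long)getpid()) > 0;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* On shutdown: remove the file. This works without special privileges
 * because the process owns the directory that contains it. */
int remove_pid_file(const char *path)
{
    return unlink(path);
}
```

The extra installation step mentioned above corresponds to
prepare_pid_dir(), which could equally be done once by the installer
instead of at every start.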

