
Re: [Bacula-devel] patch for presence of file daemon

"Kern Sibbald" wrote on 19/08/2008 17:13:

I'm off on vacation essentially now until Saturday evening, so for this email, 
I will attempt to give some points now, and if there is more discussion, we 
will need to do it next week.

First, I have to say that I have a bit of bad news for you, but I would 
strongly urge you to take careful note of my reasoning and read this 
carefully through to the end, because I think you can make a very important 
contribution in this area -- though slightly modified from what you have 
currently proposed.  

Concerning the concept of the Director contacting all the FDs at the beginning 
and the FDs contacting the Director for the purposes to know if they are 
online and to reduce error or warning messages.  I believe that the potential 
problems associated with this proposed feature far outweigh any advantages it 
would give, and I even have a problem seeing any advantage.
Perhaps our explanations were not clear.
The advantages are clear (for us): when some FDs are not online, their jobs are postponed (for a configurable time) until the FDs connect, and all messages and statuses are unambiguous: the FD is present or not, and the job is run, postponed, or canceled (if the presence timeout has expired). If the FD is not present we do not get, as we do today, many useless lines in the logs. Today we cannot tell clearly whether there is a communication problem in contacting the FD or whether the FD is simply not online.

It seems to me that that particular feature does nothing but reduce the number 
of messages and as designed can even lead to incorrect messages being 
printed.  As a consequence, I cannot accept this feature.  If you want I will 
list all the reasons, but I don't think that will be necessary.

Well, we will need to maintain our feature separately if you do not accept it. We will try it with our clients and see whether our approach is effective.

I think we will post an "as is" version of the patch for 2.4.2 to the mailing list for people who are interested in it.

Possible New code:
On the other hand, what I would accept, which will IMO accomplish the same 
thing is to reduce the default retry time to 2 minutes, and to make Bacula 
say try once every 30 seconds within those 2 minutes (i.e. either 4 or 5 
times) and then give up with a single error message.   This would probably 
take minimal changes to Bacula.
We don't think this is as efficient as what we propose, but from what you say further on, this seems to be linked to the way we use Bacula (our DIRs almost never restart, since we use dynamic files for configuration).
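
A minimal sketch of the retry scheme described above (one attempt every 30 seconds for 2 minutes, then a single error), written in Python for illustration only. The function name, its parameters, and the 5-second per-attempt timeout are assumptions, not Bacula code; `connect`, `sleep`, and `clock` are injectable only so the loop can be exercised without a network.

```python
import socket
import time


def connect_with_retry(host, port, retry_interval=30.0, max_retry_time=120.0,
                       connect_timeout=5.0, connect=socket.create_connection,
                       sleep=time.sleep, clock=time.monotonic):
    """Try to connect every retry_interval seconds for at most
    max_retry_time seconds, then give up with one error."""
    deadline = clock() + max_retry_time
    attempts = 0
    while True:
        attempts += 1
        try:
            return connect((host, port), timeout=connect_timeout)
        except OSError as exc:
            last_error = exc
        # Stop when the next attempt would fall after the deadline.
        if clock() + retry_interval > deadline:
            break
        sleep(retry_interval)
    # A single message for the whole retry window, not one per attempt.
    raise ConnectionError(
        f"client {host}:{port} unreachable after {attempts} attempts "
        f"({last_error})"
    )
```

With the defaults shown, a client that stays offline is tried at roughly 0, 30, 60, 90, and 120 seconds, which matches the "either 4 or 5 times" estimate.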

Concerning having the FD transfer the IP address to the Director. That feature 
already exists as the SETIP command, and it is already (as far as I know) 
secure.  It is also tested and being used.
Currently the SETIP command can be sent from a console, and a simple shell 
script can automatically accomplish it.  There is no need for the user to 
even know his IP address.
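
As an illustration of how small such a script can be, here is a sketch that pipes the `setip` console command into `bconsole`. It assumes `bconsole` is on the PATH, that the configuration path shown is correct for the machine (it is only a placeholder), and that the console is authorized to use the command; none of this comes from the patch under discussion.

```python
import subprocess


def send_setip(argv=("bconsole", "-c", "/etc/bacula/bconsole.conf")):
    """Pipe `setip` and `quit` into a console program and return its output.

    The default argv is an assumption about the local installation.
    """
    commands = "setip\nquit\n"
    result = subprocess.run(list(argv), input=commands, capture_output=True,
                            text=True, check=True)
    return result.stdout
```

The script never mentions an address, which is consistent with the remark that the user does not need to know it.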

Possible New code:
What is really missing in the FD is item 2 in the current projects list "Allow 
FD to initiate a backup".  What I had planned here is the following: modify 
the FD so that it can be contacted by a local console and the FD (not Dir) 
can be asked to do a Backup, or simply to send its IP address.   The FD would 
then simulate a console (possibly identifying itself as the FD) and do a 
SETIP command, and possibly ask for a backup.  Note, both of these are 
already controlled by restricted consoles, so the security is pretty much 

What is different about this is that if the FD requests a backup, the 
Director's console handler would start the job but pass the open console 
connection it has with the FD to the backup command, and the backup would 
proceed over that connection.  That permits implementing project item 2 and 
it also allows a work around for certain firewall problems where the Director 
cannot contact a particular client (maybe the client is behind a firewall or 
NATed in another network) so the Director can use the connection made by the 

I also would not be opposed to adding a directive to the FD that tells it to 
send its IP address to the Director when the FD starts, but it would do so by 
using the console "emulation" and sending the SETIP command.

So, I think what I proposed above will essentially give you what you have 
implemented, but a bit simpler, and it will also implement project item 2, 
which will take a bit of additional work.

If this interests you, we will need to discuss a few of the details so I can 
show you how we can identify and if we want multiplex the FD, which is quite 
easy to do.

Best regards,


PS: though it is probably not necessary, I have made a few comments below ...

We will see whether this can fit into our schedule, but the chances are small.

For this feature (item 2), FDs must maintain a socket with the DIR in order to get through firewalls and NATs.

On Tuesday 19 August 2008 10:21:48 Jean-Sébastien Hederer wrote:
"Kern Sibbald" wrote on 14/08/2008 21:28:

I have a few questions, please see below ...

On Thursday 14 August 2008 17:23:39 Jean-Sébastien Hederer wrote:


Maxime Rousseau has created a new feature for Bacula. The patch was
created to optimize communication between the File daemon and the
Director daemon. It was written for 2.4.0, and all regression and
functional tests pass on 2.4.0. The patch is ready for 2.4.0; Maxime is
preparing versions for 2.4.2 and trunk before sending it. Here are some
explanations:

With the new features, Bacula can back up clients that change their IP
address, such as laptops.
Bacula already has a means to back up clients that change their IP
address.  However, I admit that it could be optimized a bit.

The IP change is dynamic. The information is passed from the FD to the
DIR when the FD signals its presence. We can have a parameter to enable
or disable this feature. Should this parameter be in the Director
resource or in the Client resource of the DIR configuration?

   I had never seen SETIP. With it, it is the DIR that controls the
change of IP for an FD, regardless of how the console is defined on the FD.

Well the FD or the console running on the FD machine just needs to connect to 
the console port, to be properly authorized and to send in the SETIP command.  
It is very simple.

There are fewer error messages when a job is canceled because the File
Daemon is absent.
Yes, there are probably too many error messages, but then it depends
on how you configure Bacula ...
Yes, but here, if the File daemon is not there, we can send a clear
message about its status because we know it is not available (rather
than a message saying the job was cancelled because of a network
communication failure).

You only know at a single instant in time when the FD is not there -- in the 
next second it can be there, and if you have any comm problems your state 
information in the Director will be out of date and will simply keep the 
Director from contacting the FD.

The correct way to do this (IMO) is to use the current design and simply 
reduce the default retry time to a very small number (e.g. 2 minutes).  This 
creates very little overhead and ensures that the Director will connect when 
he wants and nothing will ever be blocked.


Communication between the FD and the DIR becomes bidirectional, so
connections are more frequent.
Maybe I am misunderstanding you, but more frequent connections are
bad, not good.
Yes, there are a few more connections: when the DIR starts and tries to
detect the FDs that have the parameter set, and when an FD starts or
stops and tells the DIR whether it is available. But this avoids
connections from the DIR to the FD that pollute the network when the FD
is not available.

The Director never pollutes any network, and if the FD is not there, the extra 
load on the network is absolutely trivial ...   Other than the far too long 
current 30 minute default timeout, I think the current scheme is very 

   For all these communications, we open a socket, communicate, and
close the socket.

   We have reused standard Bacula functions as much as possible
throughout this feature.


New features for the DIR:
        - When the DIR starts, it tries to connect to the FD. If the
connection is successful, a presence parameter in the Client resource
changes to "yes". Otherwise the presence parameter keeps its value "no".
What happens if the Director has 2,000 clients?
It will try to reach all clients for which the presence parameter is
set. We assumed that the DIR stops and starts only a few times a year;
this is how our clients work. this optimizes number of network
For me, the Director starts and stops hundreds of times per day, and many 
users take their Director down much more often than you do because of the 
need to add new clients or make other config changes.  So even for smaller 
shops that have 60 or so clients, if the Director tries for 2 minutes for 
each client, it will be blocked for 2 hours -- that is unacceptable.
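
The arithmetic behind the two-hour figure, as a small sketch. It assumes the worst case, where every client is offline and each probe consumes the full retry window. The `workers` parameter is hypothetical and only illustrates how much probing several clients at once would shorten the stall.

```python
def startup_block_minutes(clients, retry_minutes, workers=1):
    """Worst-case startup stall: every client offline, each probe
    running for the full retry window, `workers` probes at a time."""
    rounds = -(-clients // workers)  # ceiling division
    return rounds * retry_minutes


print(startup_block_minutes(60, 2))                # 120 minutes = 2 hours
print(startup_block_minutes(2000, 2))              # 4000 minutes
print(startup_block_minutes(2000, 2, workers=20))  # 200 minutes
```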

Does the DIR stall until it contacts them all?  How many resources
does it take to contact them?
The DIR contacts all the FDs for which the parameter is set before continuing.

As I mentioned above, given the downside, any advantage of this scheme doesn't 
seem worthwhile to me ...

We will see how to "parallelize" the FD communications.

How long will it take for it to contact the last of the 2,000 clients?   
The timers are configurable.

Yes, that is good, but for me not sufficient.

If a scheduled job starts for the 2,000th client before the 2,000th
client is contacted by the startup routine, will the job be retarded
in starting?

What happens if I have clients that I don't want the director
contacting because they are very infrequently used and the jobs are
only manually started?
You don't set the presence parameter, or you set it to false (the
default value corresponds to the current behavior).

I don't think it is really necessary as I have explained above.


- When the DIR is about to start a new job, it checks the presence
parameter. If the client is present, the DIR starts the job; otherwise
it waits for the client for a time specified in the Client resource in
bacula-dir.conf (this parameter is named "WaitTimer"). It checks whether
the client is connected at a regular interval (attribute "PresenceTimer"
in bacula-dir.conf). If the client never connects during the "WaitTimer"
period, the job is marked as "JSAutomaticallyCanceled" in the Catalog.
"JSAutomaticallyCanceled" is a new parameter defined in jcr.h and it
means that the job is canceled because the File daemon has never been
Is it possible to turn off this behavior?  I don't want it for my
setup, because it is not always possible for my clients to contact
the Director.
yes, sure. ever  made. old configurations are fully compatible without
changing behavior. 



- I have created a new file named fd_server.c. It allows the DIR to
listen for File Daemon connections (the default port is 9104, parameter
DIRportFD in the Director resource of bacula-dir.conf). The parameter
MaxClientsPresence, defined in the Director resource in bacula-dir.conf,
determines how many File Daemons the DIR can listen to simultaneously.
- Authentication functions are also implemented in authenticate.c in
src/dird and src/filed.
The IANA will never approve of a fourth port for Bacula.  There is
no reason to have the Director listening on two ports.  These
connections should be multiplexed on port 9101.
The problem is that we would have to change the handling of incoming
requests on port 9101 in order to support two types of communication:
console and FD.

We can simply add new commands, or we can have the FD identify itself 
differently, or we can multiplex it -- all or any of these without creating 
any problems of compatibility.  We have done this kind of thing many times in 
evolving Bacula so that it can continue to interface with old FDs while 
having new commands and interfaces for new ones.

   So this might not be compatible with existing consoles.
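
One way to picture the "FD identifies itself differently" option on a single port is a dispatch on the first word of the peer's greeting. This is a sketch only: the greeting strings and handler names below are hypothetical and are not taken from the Bacula wire protocol.

```python
# Hypothetical greetings; neither string is claimed to be Bacula's protocol.
HANDLERS = {
    "Hello": "console",        # greeting existing consoles are assumed to send
    "HelloFD": "file_daemon",  # assumed new greeting for a File daemon
}


def classify_peer(hello_line):
    """Pick the handler for a new connection on the shared port."""
    words = hello_line.split()
    if not words:
        return "reject"
    return HANDLERS.get(words[0], "reject")
```

Under these assumptions, a peer that sends the old greeting is still routed to the console handler, so existing consoles would not need to change.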


New features for the FD:
        - The FD must know the address of the Director, which is
stored in the Director resource in bacula-fd.conf.
What happens if there are two or three Directors that can contact a
File daemon as is the case at my site?
Each Director resource is separate in the FD configuration file, so
each resource can be configured separately.

This will be needed if you wish to implement some of the ideas I have 
presented above.

There seems to be a misunderstanding here: this is already implemented, and it is what I explained.


Also, he knows on which port he is able to contact the DIR (default

- When the FD starts, it tries to connect to the DIR. If the
connection is successful, a presence parameter in the Client resource
of the Director daemon changes to "yes". Otherwise the presence
parameter keeps its default value "no". For authentication it uses the
existing password shared between the File Daemon and the Director. The
File Daemon gives its new address to the DIR, so if the client is a
laptop, jobs can be run with any IP.
As I mentioned above this capability already exists with SETIP.
This is not exactly the same feature: it is not a console feature.

SETIP sets the IP address for a client so that the Director knows where to 
contact the client.  It is implemented via a console.   Obviously the 
implementation is different, but if the end result is not the same, then I 
did not properly understand what you have implemented.


- When the File Daemon stops, it warns the DIR that it is going away.
After this warning, presence_parameter = 0: the DIR knows the client is
absent.
So let's say that the client goes away and notifies the Director,
then when the client starts again, because of some temporary problem
it cannot notify the director.  Is the client then essentially
Yes. This could be improved so that the FD periodically tells the DIR
that it is present.
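
A sketch of how a periodic announcement could be paired with an expiry time, so that a "present" flag ages out when announcements stop (for example after a crash that skips the shutdown notice). The class, its method names, and the `ttl` value are hypothetical, and the sketch does not address the opposite case raised above, where a running client cannot reach the Director.

```python
import time


class PresenceTable:
    """Presence flags that expire unless the client keeps announcing itself."""

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl          # seconds an announcement stays valid
        self.clock = clock      # injectable for testing
        self.last_seen = {}     # client name -> time of last announcement

    def announce(self, client):
        self.last_seen[client] = self.clock()

    def is_present(self, client):
        seen = self.last_seen.get(client)
        return seen is not None and self.clock() - seen <= self.ttl
```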

As mentioned, I think it is always better for the Director to try to contact 
the FD (unless the user has explicitly disabled it) each time.  The state 
information you are proposing to keep will be stale after 1 second, so is not 


This feature doesn't work on Windows systems. Perhaps the FD does not
terminate in the same way as it stops on Linux. At least on Windows,
Bacula does not enter the function "terminate_filed" in filed.c, so the
presence parameter keeps its value of 1. ----> Perhaps there is a
possible improvement to make.
It is possible that the Win32 FD gets some serious error on
termination so it never gets to the terminate_filed() code.  This is
also possible any time any FD crashes.


For the connections made when the two daemons start, there is a
retry_interval of 10 seconds (if the connection fails, retry after 10
seconds) and a max_retry_time of 20 seconds (abandon the connection
after 20 seconds).

Normally, old configurations work fine even though files are

If the configuration files do not exist when the patch is applied, they
are created with a new configuration (Presence parameter, PresenceTimer,
WaitTimer, Address of the Director...). Otherwise you must modify the
configuration files: if the Presence parameter in the Client resource in
bacula-dir.conf and the address attribute in the Director resource in
bacula-fd.conf do not exist, Bacula will run as with an old configuration.

Example of a new configuration:

1/ In "bacula-dir.conf"

Director {                            # define myself
   Name = localhost-dir
   DIRport = 9101            # where we listen for UA connections
   DIRportFD = 9104          # where we listen for FD connections -----> NEW
   QueryFile =
   WorkingDirectory = "/home/rousseaum/bacula/working"
   PidDirectory = "/home/rousseaum/bacula/working"
   Maximum Concurrent Jobs = 1
   Password = "6V2ghmC6A0YUfncxiF5wJJ1x+WAT2BpUD55l1tfaOury"         # Console password
   Messages = Daemon

   MaxClientsPresence = 20   # How many clients the DIR can listen to simultaneously -----> NEW
}
Why is this needed?
   We reused existing functions, and the function we reused needs a
number as an argument, so we made it a parameter.

I don't understand the above, but perhaps it is moot ...

When the client connects to the Dir does it remain connected or does
it disconnect after announcing its presence?
it disconnects (through bnet_close)




Client {
   Name = localhost-fd
   Address = localhost
   FDPort = 9102
   Catalog = MyCatalog
   Password = "VfCC+e5Lp87mlgdW58PqkxLRvyM2jcwhGCkBMNOOuzXz"          # password for FileDaemon
   File Retention = 30 days            # 30 days
   Job Retention = 6 months            # six months
   AutoPrune = yes                     # Prune expired Jobs/Files
   Presence = yes                      # The presence parameter exists -----> NEW
   PresenceTimer = 15                  # Maximum time to verify the client presence -----> NEW
   WaitTimer = 60 minutes              # Maximum time to wait for the client -----> NEW
   # PresenceTimer and WaitTimer are defined in seconds by default. We can use
   # minutes, hours, days... like the other time parameters in Bacula.
}

2/ In "bacula-fd.conf"

Director {
   Name = localhost-dir
   Address = localhost
   DIRport = 9104            # -----> NEW
   Password = "VfCC+e5Lp87mlgdW58PqkxLRvyM2jcwhGCkBMNOOuzXz"
}

Example of a typical exchange between the FD and the DIR:

1/ Starting daemons:

1.1/ DIR starts before FD (most frequent situation)

DIR starts;
DIR tries to connect to FD;
if (FD connected) {
        presence_parameter = 1;
}
FD starts;
FD tries to connect to DIR;
if (DIR connected) {
        presence_parameter = 1;
        FD gives his new address to DIR;
}
How does the FD pass his address to the DIR?    



1.2/ FD starts before DIR

FD starts;
FD tries to connect to DIR;
if (DIR connected) {
        presence_parameter = 1;
        FD gives his new address to DIR;
}
DIR starts;
DIR tries to connect to FD;
if (FD connected) {
        presence_parameter = 1;
}

2/ Starting job (Backup, Restore):

DIR checks FD presence;
if (FD hasn't got presence_parameter) {         ----> old configuration
        run job like old configuration;
} else {                                        ----> new configuration
        if (FD present) {
                run job;
        } else {
                while (WaitTimer hasn't expired) {
                        check FD connection at each PresenceTimer interval;
                        if (FD connects) {
                                run job and stop waiting;
                        }
                }
                if (FD never connected) {
                        Job marked as JSAutomaticallyCanceled;
                }
        }
}
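
The waiting branch of the pseudocode above can be sketched as a small polling loop. This is an illustration in Python, not the patch's C code: the function name and the `is_present` callback are hypothetical, and the two timers are plain seconds.

```python
import time


def wait_for_client(is_present, wait_timer, presence_timer,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll every presence_timer seconds for at most wait_timer seconds.

    Returns "run" as soon as the client is present, otherwise
    "JSAutomaticallyCanceled" once the wait has expired.
    """
    deadline = clock() + wait_timer
    while True:
        if is_present():
            return "run"
        if clock() >= deadline:
            return "JSAutomaticallyCanceled"
        # Never sleep past the deadline.
        sleep(min(presence_timer, deadline - clock()))
```

With the example values from the configuration (PresenceTimer = 15, WaitTimer = 60 minutes), a client that appears 45 seconds after the job is due would be picked up on the fourth check.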

*Any remarks are welcome. We hope this feature will be included in
Bacula, so we wrote it with existing client configurations in mind in
order not to disturb existing configurations.*
My questions are above. Aside from the one remark I made above, the
only other remark I have for the moment (until I see the answers) is
to say, it is always preferable to announce and discuss a project
prior to coding it -- it can possibly save you a lot of time
recoding it or the horrible frustration of having it rejected after
you've spent a lot of time on it.
Yes, I know we should have done so. We will try not to forget that for
future features.

What I would really like is that you continue to work on some of the features 
you have developed, but simply redirect your effort in the directions I have 
indicated.  If I have understood what you have done, I think you will find in 
the end that what I am suggesting is a lot less code and will accomplish most 
everything you want, and if you decide to do the project item 2, you will 
make a lot of users happy.

Obviously even implementing project item 2 needs a bit more design work before 
implementing.   I have already partially implemented the code in the Director 
needed to transfer a FD request for backup.

Best regards,


Once I have your responses to my questions, I will make my remarks.

Best regards,




Bacula-devel mailing list
