
Re: [Bacula-devel] patch for presence of file daemon


Hello,

I'm off on vacation from now until Saturday evening, so in this email I will 
give my main points, and if more discussion is needed, we will have to do it 
next week.

First, I have to say that I have a bit of bad news for you, but I would 
strongly urge you to take careful note of my reasoning and read this 
carefully through to the end, because I think you can make a very important 
contribution in this area -- though slightly modified from what you have 
currently proposed.  

Concerning the concept of the Director contacting all the FDs at startup, and 
the FDs contacting the Director, in order to know whether they are online and 
to reduce error or warning messages: I believe that the potential problems 
associated with this proposed feature far outweigh any advantages it would 
give, and I even have trouble seeing any advantage.

It seems to me that that particular feature does nothing but reduce the number 
of messages and as designed can even lead to incorrect messages being 
printed.  As a consequence, I cannot accept this feature.  If you want I will 
list all the reasons, but I don't think that will be necessary.

Possible New code:
On the other hand, what I would accept, and which will IMO accomplish the same 
thing, is to reduce the default retry time to 2 minutes, and to make Bacula 
try, say, once every 30 seconds within those 2 minutes (i.e. either 4 or 5 
times) and then give up with a single error message.   This would probably 
take minimal changes to Bacula.
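
As a rough illustration of the retry scheme just described (a Python sketch, 
not Bacula code; connect_with_retry and its parameters are hypothetical names, 
and elapsed time is counted from the sleep intervals only):

```python
import time
from typing import Callable, Tuple


def connect_with_retry(
    try_connect: Callable[[], bool],
    max_wait: float = 120.0,   # total retry window: 2 minutes
    interval: float = 30.0,    # one attempt every 30 seconds
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[bool, int]:
    """Attempt a connection until max_wait is used up.

    Returns (connected, number_of_attempts).
    """
    attempts = 0
    elapsed = 0.0
    while True:
        attempts += 1
        if try_connect():
            return True, attempts
        if elapsed + interval > max_wait:
            break  # the next wait would exceed the retry window
        sleep(interval)
        elapsed += interval
    # Report the failure once, instead of once per failed attempt.
    print(f"Error: client unreachable after {attempts} attempts ({elapsed:.0f}s)")
    return False, attempts
```

With the defaults, an unreachable client is tried at 0, 30, 60, 90 and 120 
seconds, and exactly one error message is printed at the end.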

Concerning having the FD transfer the IP address to the Director: that feature 
already exists as the SETIP command, and it is already (as far as I know) 
secure.  It is also tested and being used.
Currently the SETIP command can be sent from a console, and a simple shell 
script can automatically accomplish it.  There is no need for the user to 
even know his IP address.
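
For illustration, such a script could look roughly like the following Python 
sketch.  It assumes that bconsole is installed on the client machine, that the 
configuration file path shown (hypothetical) points at a restricted console 
allowed to run SETIP, and that bconsole reads its commands from standard 
input; adjust all of these to the local setup.

```python
import subprocess
from typing import List


def setip_session() -> str:
    """Console input: ask the Director to record the caller's address, then exit."""
    return "setip\nquit\n"


def build_command(bconsole: str, config: str) -> List[str]:
    """Command line for the console program; -c selects the config file."""
    return [bconsole, "-c", config]


def send_setip(bconsole: str = "bconsole",
               config: str = "/etc/bacula/bconsole.conf") -> int:
    """Run the console once and return its exit status (paths are assumptions)."""
    result = subprocess.run(
        build_command(bconsole, config),
        input=setip_session(),
        text=True,
        capture_output=True,
    )
    return result.returncode
```

The user never types an address: the Director sees where the console 
connection comes from.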

Possible New code:
What is really missing in the FD is item 2 in the current projects list "Allow 
FD to initiate a backup".  What I had planned here is the following: modify 
the FD so that it can be contacted by a local console and the FD (not Dir) 
can be asked to do a Backup, or simply to send its IP address.   The FD would 
then simulate a console (possibly identifying itself as the FD) and do a 
SETIP command, and possibly ask for a backup.  Note, both of these are 
already controlled by restricted consoles, so the security is pretty much 
assured.

What is different about this is that if the FD requests a backup, the 
Director's console handler would start the job but pass the open console 
connection it has with the FD to the backup command, and the backup would 
proceed over that connection.  That permits implementing project item 2 and 
it also allows a workaround for certain firewall problems where the Director 
cannot contact a particular client (maybe the client is behind a firewall or 
NATed in another network) so the Director can use the connection made by the 
FD.
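
The idea of running the job over the connection opened by the FD can be 
sketched as follows (Python, localhost only; the message strings and function 
names are invented for the example and are not the Bacula protocol):

```python
import socket
import threading
from typing import List


def director(server: socket.socket, log: List[str]) -> None:
    """Accept one FD-initiated connection and drive the job over it."""
    conn, _addr = server.accept()
    with conn, conn.makefile("rw", newline="\n") as chan:
        request = chan.readline().strip()
        log.append(f"dir got: {request}")
        # No outbound connection to the client is made here: the Director
        # reuses the connection the FD opened, so NAT/firewalls do not matter.
        chan.write("send-files\n")
        chan.flush()
        log.append(f"dir got: {chan.readline().strip()}")


def file_daemon(port: int, log: List[str]) -> None:
    """Connect out to the Director, request a backup, then obey its command."""
    with socket.create_connection(("127.0.0.1", port)) as sock, \
            sock.makefile("rw", newline="\n") as chan:
        chan.write("backup client=laptop-fd\n")
        chan.flush()
        log.append(f"fd got: {chan.readline().strip()}")
        chan.write("file-data-done\n")
        chan.flush()


def demo() -> List[str]:
    """Run both sides once and return the ordered message log."""
    log: List[str] = []
    with socket.create_server(("127.0.0.1", 0)) as server:
        port = server.getsockname()[1]
        worker = threading.Thread(target=director, args=(server, log))
        worker.start()
        file_daemon(port, log)
        worker.join()
    return log
```

Only the FD ever dials out; after the request, the roles on the open 
connection are the same as in a normal Director-to-FD session.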

I also would not be opposed to adding a directive to the FD that tells it to 
send its IP address to the Director when the FD starts, but it would do so by 
using the console "emulation" and sending the SETIP command.

So, I think what I proposed above will essentially give you what you have 
implemented, but a bit simpler, and it will also implement project item 2, 
which will take a bit of additional work.

If this interests you, we will need to discuss a few of the details so I can 
show you how we can identify the FD and, if we want, multiplex its connection, 
which is quite easy to do.

Best regards,

Kern

PS: though it is probably not necessary, I have made a few comments below ...




On Tuesday 19 August 2008 10:21:48 Jean-Sébastien Hederer wrote:
> "Kern Sibbald" a écrit le 14/08/2008 21:28 :
> > Hello,
> >
> > I have a few questions, please see below ...
> >
> > On Thursday 14 August 2008 17:23:39 Jean-Sébastien Hederer wrote:
> >    
> >
> >> Hi,
> >>
> >> Maxime Rousseau has created a new feature for bacula. This patch has
> >> been created in order to optimize the communications between the File
> >> daemon and the Director daemon. It has been written for 2.4.0. All
> >> regression tests and function tests have been passed for 2.4.0. Patch is
> >> ready for 2.4.0. Maxime is making patch for 2.4.2 and trunk before
> >> sending it. Here are some explanations:
> >>
> >>
> >>
> >> With the new features, Bacula can backup clients who change their IP
> >> like laptops.
> >
> > Bacula already has a means to back up clients that change their IP
> > address.  However, I admit that it could be optimized a bit.
> >
> >    
>
> Changing IP is dynamic. Information is given from FD to DIR when
> signaling presence. We can have a parameter to enable/disable this
> feature. Should this parameter be in the Director resource or in the
> Client resource of the DIR configuration?
>
>    I had never seen SETIP. It's the DIR that can control the change
> of IP for an FD, regardless of how the console is defined on the FD.

Well the FD or the console running on the FD machine just needs to connect to 
the console port, to be properly authorized and to send in the SETIP command.  
It is very simple.

>
> >> There are fewer error messages when a job is canceled because of the
> >> absence of the File Daemon.      
> >
> > Yes, there are probably too many error messages, but then it depends
> > on how you configure Bacula ...
> >    
>
> yes, but here, if the file daemon is not there, we'll be able to send a clear
> message on its status because we know it is not available (and not a
> message saying the job has been cancelled on network communication failure)

You only know at a single instant in time when the FD is not there -- in the 
next second it can be there, and if you have any comm problems your state 
information in the Director will be out of date and will simply keep the 
Director from contacting the FD.

The correct way to do this (IMO) is to use the current design and simply 
reduce the default retry time to a very small number (e.g. 2 minutes).  This 
creates very little overhead and ensures that the Director will connect when 
he wants and nothing will ever be blocked.

>
> >    
> >
> >> The communication between FD and DIR becomes bidirectional so
> >> connections are more frequent.
> >>      
> >
> > Maybe I am misunderstanding you, but more frequent connections are
> > bad not good.     
>
> yes, there are a few more connections: when the DIR starts and tries to
> detect the FDs that have the parameter set, and when an FD starts/stops and
> tells the DIR whether it is available. but this avoids connections from DIR
> to FD that pollute the network when the FD is not available.

The Director never pollutes any network, and if the FD is not there, the extra 
load on the network is absolutely trivial ...   Other than the far too long 
current 30 minute default timeout, I think the current scheme is very 
reasonable.

>
>    for all these communications, we open a socket, make the
> communication and close the socket.
>
>    we have reused standard bacula functions as much as possible for
> all this feature.

OK.

>
> >> New features for the DIR:
> >>         - when the DIR starts, he tries to connect to the FD. If the
> >> connection is
> >> successful, a presence parameter in the Client resource changes to
> >> "yes". Else the presence parameter keeps its value "no".
> >
> > What happens if the Director has 2,000 clients?     
>
> he will try to reach all clients for which the presence parameter is
> set. we assumed that the DIR stops/starts only a few times a
> year. this is how our clients work. this optimizes the number of network
> communications

For me, the Director starts and stops hundreds of times per day, and many 
users take their Director down much more often than you do because of the 
need to add new clients or make other config changes.  So even for smaller 
shops that have 60 or so clients, if the Director tries for 2 minutes for 
each client in turn and none of them responds, it will be blocked for 2 hours 
(60 x 2 minutes) -- that is unacceptable.
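
The arithmetic behind that estimate, as a small Python sketch.  It assumes the 
clients are probed strictly one after another and that none of them answers; 
the function name is made up for the example.

```python
def worst_case_block_minutes(clients: int, retry_minutes: float) -> float:
    """Minutes the Director stays blocked when every client probe times out."""
    return clients * retry_minutes


# 60 clients at 2 minutes each, expressed in hours
print(worst_case_block_minutes(60, 2) / 60)
# the 2,000-client case asked about earlier, expressed in hours
print(worst_case_block_minutes(2000, 2) / 60)
```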

>
> > Does the DIR stall until it contacts them all?  How many resources
> > does it take to contact them?
> >    
>
> the DIR contacts all the FDs for which parameter is set before continuing.

As I mentioned above, given the downside, any advantage of this scheme doesn't 
seem worthwhile to me ...

>
> we'll see how to "parallelize" FD communications.
>
> > How long will it take for it to contact the last of the 2,000 clients?   
> >  
>
> timers are configurable.

Yes, that is good, but for me not sufficient.

>
> > If a scheduled job starts for the 2,000th client before the 2,000th
> > client is contacted by the startup routine, will the job be delayed
> > in starting?
> >    
>
> yes.
>
> > What happens if I have clients that I don't want the director
> > contacting because they are very infrequently used and the jobs are
> > only manually started?
> >    
>
> you don't set presence parameter or you put it to false (default value
> corresponds to actual behavior)

I don't think it is really necessary as I have explained above.

>
> >    
> >
> >> - when the DIR is going to start a new job, he checks the presence
> >> parameter. If the client is
> >> present, the DIR starts the job, else he waits for him for a time
> >> specified in the Client resource in bacula-dir.conf (this parameter
> >> is named "WaitTimer"). He checks whether the client is connected at each
> >> interval of time (attribute "PresenceTimer" in bacula-dir.conf). If
> >> the client never connects during the "WaitTimer" time, the job is
> >> marked as "JSAutomaticallyCanceled" in the Catalog.
> >> "JSAutomaticallyCanceled" is a new parameter defined in jcr.h and it
> >> means that the job is canceled because the File daemon has never
> >> connected.
> >
> > Is it possible to turn off this behavior?  I don't want it for my
> > setup, because it is not always possible for my clients to contact
> > the Director.
> >    
>
> yes, sure. already done. old configurations are fully compatible without
> changing behavior.

Good.

>
> >    
> >
> >> - I have created a new file named fd_server.c. It allows the DIR to
> >> listen for File Daemon
> >> connections (the default port is 9104, parameter DIRportFD in the Director
> >> resource of bacula-dir.conf). The parameter MaxClientsPresence defined
> >> in the Director resource in bacula-dir.conf decides how many File Daemons
> >> the DIR can listen to simultaneously.
> >> - Authentication functions are
> >> also implemented in authenticate.c in src/dird and src/filed.
> >>      
> >
> > The IANA will never approve of a fourth port for Bacula.  There is
> > no reason to have the Director listening on two ports.  These
> > connections should be multiplexed on port 9101.
> >    
>
> problem is we should change treatment for incoming requests  on port
> 9101 in order to have two types of communications: console and FD

We can simply add new commands, or we can have the FD identify itself 
differently, or we can multiplex it -- all or any of these without creating 
any problems of compatibility.  We have done this kind of thing many times in 
evolving Bacula so that it can continue to interface with old FDs while 
having new commands and interfaces for new ones.
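
To make the multiplexing option concrete, here is a minimal Python sketch of 
dispatching on the first line a peer sends on the single port.  The greeting 
prefixes and handler names are hypothetical and are not the real Bacula 
handshake; the point is only that an unrecognized greeting falls back to the 
existing console path, so old consoles keep working.

```python
from typing import Callable, Dict

# Hypothetical greeting prefixes, invented for this example.
CONSOLE_PREFIX = "Hello console "
FD_PREFIX = "Hello filed "


def handle_console(name: str) -> str:
    """Stand-in for the existing console handler."""
    return f"console session for {name}"


def handle_fd(name: str) -> str:
    """Stand-in for a new handler for FD-initiated requests."""
    return f"fd request from {name}"


HANDLERS: Dict[str, Callable[[str], str]] = {
    CONSOLE_PREFIX: handle_console,
    FD_PREFIX: handle_fd,
}


def dispatch(first_line: str) -> str:
    """Route a new connection on the shared port by its greeting line."""
    for prefix, handler in HANDLERS.items():
        if first_line.startswith(prefix):
            return handler(first_line[len(prefix):].strip())
    # Unknown greeting: treat it as a console, as older peers expect.
    return handle_console(first_line.strip())
```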

>
>    so, this could be not compatible with existing consoles
>
> >    
> >
> >> New features for the FD:
> >>         - the FD must know the address of the Director, which is
> >> stored in the
> >> Director resource in bacula-fd.conf.
> >
> > What happens if there are two or three Directors that can contact a
> > File daemon as is the case at my site?
> >    
>
> each director ressource is separated in FD configuration file. so,
> each ressource can be configured separately.

This will be needed if you wish to implement some of the ideas I have 
presented above.

>
> >    
> >
> >> Also, he knows on which port he is able to contact the DIR (default
> >> 9104).      
> >
> >    
> >
> >> - when the FD starts, he tries to connect to the DIR. If the
> >> connection is successful, a presence parameter
> >> in the Client resource of the Director daemon changes to "yes". Else
> >> the presence parameter keeps its default value "no". For
> >> authentication he uses the existing password between the File Daemon
> >> and the Director. The File Daemon gives his new address to the DIR so if
> >> the client is a laptop, jobs can be run with any IP.
> >
> > As I mentioned above this capability already exists with SETIP.
> >    
>
> this is not exactly the same feature. this is not a feature for a console.

SETIP sets the IP address for a client so that the Director knows where to 
contact the client.  It is implemented via a console.   Obviously the 
implementation is different, but if the end result is not the same, then I 
did not properly understand what you have implemented.

>
> >    
> >
> >> - when the File Daemon stops, he warns the DIR he is going away.
> >> After this warning, presence_parameter = 0 : the DIR
> >> knows the client is absent.
> >
> > So let's say that the client goes away and notifies the Director,
> > then when the client starts again, because of some temporary problem
> > it cannot notify the director.  Is the client then essentially
> > disabled?
> >    
>
> yes. this could be upgraded in order to periodically say to the DIR
> that he is present.

As mentioned, I think it is always better for the Director to try to contact 
the FD (unless the user has explicitly disabled it) each time.  The state 
information you are proposing to keep will be stale after 1 second, so is not 
needed.

>
> >    
> >
> >> This feature doesn't work on Windows systems. Perhaps the FD does not
> >> terminate in the same way as it does on Linux. At least,
> >> on Windows, bacula does not go into the function "terminate_filed" in
> >> filed.c, so the presence parameter keeps its value of 1. ----> Perhaps
> >> there is a possible upgrade to do.
> >>      
> >
> > It is possible that the Win32 FD gets some serious error on
> > termination so it never gets to the terminate_filed() code.  This is
> > also possible any time any FD crashes.
> >
> >    
> >
> >> For the connections at the start of the two Daemons, there is a
> >> retry_interval defined at 10 seconds (if the connection fails, retry
> >> after 10 seconds) and a max_retry_time defined at 20 seconds (abandon
> >> the connection after 20 seconds).
> >>
> >> Normally, old configurations work fine even though the files are
> >> patched.
> >>
> >> If the configuration files do not exist when we apply the patch, they are
> >> created with a new configuration (Presence parameter, PresenceTimer,
> >> WaitTimer, Address of the Director...). Otherwise you must modify the
> >> configuration files: if the Presence parameter in the Client resource in
> >> bacula-dir.conf and the address attribute in the Director resource in
> >> bacula-fd.conf do not exist, bacula will run as with an old configuration.
> >>
> >>
> >>
> >> Example of a new configuration:
> >>
> >>
> >> 1/ In "bacula-dir.conf"
> >>
> >> Director {                            # define myself
> >>    Name = localhost-dir
> >>    DIRport = 9101            # where we listen for UA connections
> >>    DIRportFD = 9104  # where we listen for FD connections -----------------> NEW
> >>    QueryFile = "/home/rousseaum/bacula/bin/query.sql"
> >>    WorkingDirectory = "/home/rousseaum/bacula/working"
> >>    PidDirectory = "/home/rousseaum/bacula/working"
> >>    Maximum Concurrent Jobs = 1
> >>    Password = "6V2ghmC6A0YUfncxiF5wJJ1x+WAT2BpUD55l1tfaOury"         # Console password
> >>    Messages = Daemon
> >>    MaxClientsPresence = 20  #How many client the DIR can listen simultaneously -----------------> NEW
> >>      
> >
> > Why is this needed?
>
>    we reused existing functions and the function reused needs a number
> as argument. so, we've put it into parameters.

I don't understand the above, but perhaps it is moot ...

>
> > When the client connects to the Dir does it remain connected or does
> > it disconnect after announcing its presence?
> >    
>
> it disconnects (through bnet_close)

OK

>
> >    
> >
> >> }
> >>
> >> Client {
> >>    Name = localhost-fd
> >>    Address = localhost
> >>    FDPort = 9102
> >>    Catalog = MyCatalog
> >>    Password = "VfCC+e5Lp87mlgdW58PqkxLRvyM2jcwhGCkBMNOOuzXz"          # password for FileDaemon
> >>    File Retention = 30 days            # 30 days
> >>    Job Retention = 6 months            # six months
> >>    AutoPrune = yes                     # Prune expired Jobs/Files
> >>    Presence = yes        # The presence parameter exist -------------------------> NEW
> >>    PresenceTimer = 15 # Maximum time to verify the client presence --------> NEW
> >>    WaitTimer = 60 minutes  # Maximum time to wait the client --------------> NEW
> >>    # PresenceTimer and WaitTimer are defined in second by default. We can use
> >>    # minutes, hours, days... like the other temporal parameter in Bacula.
> >> }
> >>
> >>
> >> 2/ In "bacula-fd.conf"
> >>
> >> Director {
> >>    Name = localhost-dir
> >>    Address = localhost
> >>    DIRport = 9104 ---------------------------------------------------------> NEW
> >>    Password = "VfCC+e5Lp87mlgdW58PqkxLRvyM2jcwhGCkBMNOOuzXz"
> >> }
> >>
> >>
> >>
> >>
> >>
> >> Example of a typical communication between the FD and the DIR:
> >>
> >> 1/ Starting daemons:
> >>
> >> 1.1/ DIR starts before FD (most frequent situations)
> >>
> >> DIR starts;
> >> DIR tries to connect to FD;
> >> if (FD connected) {
> >>         presence_parameter = 1;
> >> }
> >> FD starts;
> >> FD tries to connect to DIR;
> >> if (DIR connected) {
> >>         presence_parameter = 1;
> >>         FD give his new address to DIR;
> >>      
> >
> > How does the FD pass his address to the DIR?    
>
>    fd->host()
>
> >    
> >
> >> }
> >>
> >> 1.2/ FD starts before DIR
> >>
> >> FD starts;
> >> FD tries to connect to DIR;
> >> if (DIR connected) {
> >>         presence_parameter = 1;
> >>         FD give his new address to DIR;
> >> }
> >> DIR starts;
> >> DIR tries to connect to FD;
> >> if (FD connected) {
> >>         presence_parameter = 1;
> >> }
> >>
> >>
> >> 2/ Starting job (Backup, Restore):
> >>
> >> DIR checks FD presence;
> >> if (FD hasn't got presence_parameter) {        ----> old configuration
> >>         run job like old configuration;
> >> }
> >> else {                                         ----> new configuration
> >>         if (FD present) {
> >>                 run job;
> >>         }
> >>         else {
> >>                 while (WaitTimer hasn't expired) {
> >>                         check FD connection at each PresenceTimer interval;
> >>                         if (FD connected) {
> >>                                 run job;
> >>                         }
> >>                 }
> >>                 mark job as JSAutomaticallyCanceled;
> >>         }
> >> }
> >>
> >>
> >>
> >>
> >> *Any remarks are welcome. We hope this feature to be included in bacula,
> >> so we made it with existing clients configuration in mind in order not
> >> to disturb existing configurations. *
> >>      
> >
> > My questions are above. Aside from the one remark I made above, the
> > only other remark I have for the moment (until I see the answers) is
> > to say, it is always preferable to announce and discuss a project
> > prior to coding it -- it can possibly save you a lot of time
> > recoding it or the horrible frustration of having it rejected after
> > you've spent a lot of time on it.
> >    
>
> yes, I know we should have made so. we'll try not to forget that for
> next features

What I would really like is that you continue to work on some of the features 
you have developed, but simply redirect your effort in the directions I have 
indicated.  If I have understood what you have done, I think you will find in 
the end that what I am suggesting is a lot less code and will accomplish most 
everything you want, and if you decide to do the project item 2, you will 
make a lot of users happy.

Obviously, even project item 2 needs a bit more design work before it is 
implemented.   I have already partially implemented the code in the Director 
needed to transfer an FD request for backup.

Best regards,

Kern

>
> > Once I have your responses to my questions, I will make my remarks.
> >
> >
> > Best regards,
> >
> > Kern
> >
> >
> >    


