Re: [Bacula-devel] patch for presence of file daemon


On Tuesday 19 August 2008 19:25:07 Hederer Jean-Sébastien wrote:
> "Kern Sibbald" wrote on 19/08/2008 17:13:
> > Hello,
> >
> > I'm off on vacation essentially now until Saturday evening, so for this
> > email, I will attempt to give some points now, and if there is more
> > discussion, we will need to do it next week.
> >
> > First, I have to say that I have a bit of bad news for you, but I would
> > strongly urge you to take careful note of my reasoning and read this
> > carefully through to the end, because I think you can make a very
> > important contribution in this area -- though slightly modified from what
> > you have currently proposed.
> >
> > Concerning the concept of the Director contacting all the FDs at the
> > beginning, and the FDs contacting the Director, in order to know whether
> > they are online and to reduce error or warning messages: I believe that
> > the potential problems associated with this proposed feature far outweigh
> > any advantages it would give, and I even have a problem seeing any
> > advantage.
>
> Perhaps our explanations were not clear.
> The advantages are clear (for us): when some FDs are not online, their jobs
> are postponed (with configurable timers) until the FDs connect.

I wasn't aware of the postpone feature.  That is something that we can 
discuss, depending on how it is implemented ...
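
For readers following the thread, the postpone behaviour described in the
original proposal quoted further down (wait up to WaitTimer, checking every
PresenceTimer) can be sketched roughly as follows. This is only an
illustration in Python, not the patch's C code: the function name
wait_for_client, the is_present callback and the injectable sleep are
assumptions made for the example. Only WaitTimer, PresenceTimer and
JSAutomaticallyCanceled come from the proposal.

```python
import time


def wait_for_client(is_present, wait_timer, presence_timer, sleep=time.sleep):
    """Poll is_present() every presence_timer seconds, for at most wait_timer seconds.

    Returns "run" as soon as the client is seen, or "JSAutomaticallyCanceled"
    once the whole wait window has passed without the client appearing.
    All names here are hypothetical; they only mirror the proposal's terms.
    """
    if presence_timer <= 0:
        raise ValueError("presence_timer must be positive")
    waited = 0
    while True:
        if is_present():
            return "run"
        if waited >= wait_timer:
            return "JSAutomaticallyCanceled"
        # Never sleep past the end of the wait window.
        step = min(presence_timer, wait_timer - waited)
        sleep(step)
        waited += step
```

Passing a fake sleep (for example a list's append method) makes the timing
easy to inspect without actually waiting.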

> And all
> messages/statuses are clear: the FD is present or not, and the job is run or
> postponed (or canceled if the presence timeout has expired). If the FD is not
> present, we don't get, as we do today, plenty of useless lines in the logs.
> Today we don't know clearly whether there is a communication problem in
> contacting the FD or whether the FD is simply not online.

As I have said, I believe the way Bacula currently contacts the FD is the 
correct way, and I have proposed a way of reducing the useless output.  We 
could possibly even add a new directive that directs connection problems to a 
different message class, and hence such messages could be directed to the bit 
bucket for laptop clients or simply be suppressed.
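
To make the retry idea concrete (try every 30 seconds for 2 minutes, then
give up with a single error message), here is a minimal sketch in Python. It
is not Bacula code: connect_with_retry, the try_connect callback and the
wording of the message are all invented for illustration. Only the 30-second
and 2-minute figures come from the proposal quoted below.

```python
import time


def connect_with_retry(try_connect, retry_interval=30, max_retry_time=120,
                       sleep=time.sleep):
    """Call try_connect() every retry_interval seconds for up to max_retry_time.

    Returns (True, []) on success. On failure returns (False, [message]) with
    exactly one summary message instead of one message per failed attempt.
    """
    attempts = 0
    elapsed = 0
    while True:
        attempts += 1
        if try_connect():
            return True, []
        # Stop once another wait would exceed the retry window.
        if elapsed + retry_interval > max_retry_time:
            break
        sleep(retry_interval)
        elapsed += retry_interval
    # Hypothetical message text, not an actual Bacula log line.
    message = "Unable to contact client after %d attempts in %d seconds" % (
        attempts, elapsed)
    return False, [message]
```

With the defaults above, a client that never answers is tried 5 times (at 0,
30, 60, 90 and 120 seconds) and produces one message in total.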

>
> > It seems to me that that particular feature does nothing but reduce the
> > number of messages and as designed can even lead to incorrect messages
> > being printed.  As a consequence, I cannot accept this feature.  If you
> > want I will list all the reasons, but I don't think that will be
> > necessary.
>
> Well, we'll need to maintain our feature separately if you don't accept
> it. We'll try it for our clients and see whether our way of doing it is
> efficient or not.

Well, that is the beauty of Open Source.  You are able to do it, and if you 
maintain it as a patch, perhaps other users would want to apply it.

>
> I think we'll post an "as is" version of the patch for 2.4.2 on the ML for
> people interested in it.

Yes, that is OK.

>
> > Possible New code:
> > On the other hand, what I would accept, which will IMO accomplish the
> > same thing is to reduce the default retry time to 2 minutes, and to make
> > Bacula try, say, once every 30 seconds within those 2 minutes (i.e. either
> > 4 or 5 times) and then give up with a single error message.   This would
> > probably take minimal changes to Bacula.
>
> I don't think this is as efficient as what we propose, but from what
> you say further on, this seems to be linked to our way of using
> Bacula (our DIRs almost never restart, as we use some dynamic files for
> configuration).
>
> > Concerning having the FD transfer the IP address to the Director. That
> > feature already exists as the SETIP command, and it is already (as far as
> > I know) secure.  It is also tested and being used.
> > Currently the SETIP command can be sent from a console, and a simple
> > shell script can automatically accomplish it.  There is no need for the
> > user to even know his IP address.
> >
> > Possible New code:
> > What is really missing in the FD is item 2 in the current projects list
> > "Allow FD to initiate a backup".  What I had planned here is the
> > following: modify the FD so that it can be contacted by a local console
> > and the FD (not Dir) can be asked to do a Backup, or simply to send its
> > IP address.   The FD would then simulate a console (possibly identifying
> > itself as the FD) and do a SETIP command, and possibly ask for a backup. 
> > Note, both of these are already controlled by restricted consoles, so the
> > security is pretty much assured.
> >
> > What is different about this is that if the FD requests a backup, the
> > Director's console handler would start the job but pass the open console
> > connection it has with the FD to the backup command, and the backup would
> > proceed over that connection.  That permits implementing project item 2
> > and it also allows a work around for certain firewall problems where the
> > Director cannot contact a particular client (maybe the client is behind a
> > firewall or NATed in another network) so the Director can use the
> > connection made by the FD.
> >
> > I also would not be opposed to adding a directive to the FD that tells it
> > to send its IP address to the Director when the FD starts, but it would
> > do so by using the console "emulation" and sending the SETIP command.
> >
> > So, I think what I proposed above will essentially give you what you have
> > implemented, but a bit simpler, and it will also implement project item
> > 2, which will take a bit of additional work.
> >
> > If this interests you, we will need to discuss a few of the details so I
> > can show you how we can identify and if we want multiplex the FD, which
> > is quite easy to do.
> >
> > Best regards,
> >
> > Kern
> >
> > PS: though it is probably not necessary, I have made a few comments below
> > ...
>
> We'll see whether this can fit into our schedule, but the chances are
> slim.
>
> For this feature (item 2), FDs must maintain a socket with the DIR to get
> through firewalls and NATs.

That is not planned.  The problem with firewalls and NATs is generally (not 
always) when an outside program such as the Director tries to call into the 
FD which is behind a firewall or being NATted.  Having the FD initiate the 
call removes this problem.  It also allows you to configure Laptops to 
initiate their own backup.  

In addition, once we have a way for the FD to call into the Director via the 
console port, we can easily add additional features such as "back me up now" 
(already discussed), "if I missed a backup, do it now", "if my last backup 
was more than a day ago, do one now", ...
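
As a rough illustration of the kind of client-side rules listed above, a
decision function might look like the sketch below. This is an assumption
about how such a policy could be expressed, in Python rather than Bacula's C,
and should_request_backup and its parameters are hypothetical names. Treating
a client with no recorded backup as needing one is also a choice made for the
example.

```python
from datetime import datetime, timedelta


def should_request_backup(last_backup, now, max_age=timedelta(days=1),
                          missed_scheduled=False):
    """Decide whether the client should ask the Director for a backup now."""
    if missed_scheduled:
        return True  # "if I missed a backup, do it now"
    if last_backup is None:
        return True  # no backup on record: assumption made for this sketch
    # "if my last backup was more than a day ago, do one now"
    return now - last_backup > max_age
```

A request that passes this check would then be sent to the Director over the
console connection opened by the FD.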

Best regards,

Kern

>
> > On Tuesday 19 August 2008 10:21:48 Jean-Sébastien Hederer wrote:
> >> "Kern Sibbald" wrote on 14/08/2008 21:28:
> >>> Hello,
> >>>
> >>> I have a few questions, please see below ...
> >>>
> >>> On Thursday 14 August 2008 17:23:39 Jean-Sébastien Hederer wrote:
> >>>> Hi,
> >>>>
> >>>> Maxime Rousseau has created a new feature for bacula. This patch has
> >>>> been created in order to optimize the communications between the File
> >>>> daemon and the Director daemon. It has been written for 2.4.0. All
> >>>> regression tests and function tests have been passed for 2.4.0. Patch
> >>>> is ready for 2.4.0. Maxime is making patch for 2.4.2 and trunk before
> >>>> sending it. Here are some explanations:
> >>>>
> >>>>
> >>>>
> >>>> With the new features, Bacula can backup clients who change their IP
> >>>> like laptops.
> >>>
> >>> Bacula already has a means to backup clients, which change their IP
> >>> address.  However, I admit that it could be optimized a bit.
> >>
> >> Changing the IP is dynamic. The information is given by the FD to the DIR
> >> when signaling presence. We can have a parameter to enable/disable this
> >> feature. Should this parameter be in the Director resource for the DIR or
> >> in the Client resource for the DIR?
> >>
> >>    I had never seen SETIP. It is the DIR that can control the change
> >> of IP for an FD, however the console is defined on the FD.
> >
> > Well the FD or the console running on the FD machine just needs to
> > connect to the console port, to be properly authorized and to send in the
> > SETIP command. It is very simple.
> >
> >>>> There are fewer error messages when a job is canceled because of the
> >>>> absence of the File Daemon.
> >>>
> >>> Yes, there are probably too many error messages, but then it depends
> >>> on how you configure Bacula ...
> >>
> >> Yes, but here, if the file daemon is not there, we'll be able to send a
> >> clear message on its status because we know it is not available (and not
> >> a message saying the job was cancelled on a network communication failure).
> >
> > You only know at a single instant in time when the FD is not there -- in
> > the next second it can be there, and if you have any comm problems your
> > state information in the Director will be out of date and will simply
> > keep the Director from contacting the FD.
> >
> > The correct way to do this (IMO) is to use the current design and simply
> > reduce the default retry time to a very small number (e.g. 2 minutes). 
> > This creates very little overhead and ensures that the Director will
> > connect when he wants and nothing will ever be blocked.
> >
> >>>> The communication between FD and DIR becomes bidirectional, so
> >>>> connections are more frequent.
> >>>
> >>> Maybe I am misunderstanding you, but more frequent connections are
> >>> bad not good.
> >>
> >> Yes, there are a few more connections: when the DIR starts and tries to
> >> detect the FDs that have the parameter set, and when an FD starts/stops
> >> and tells the DIR whether it is available. But this avoids connections
> >> from DIR to FD that pollute the network when the FD is not available.
> >
> > The Director never pollutes any network, and if the FD is not there, the
> > extra load on the network is absolutely trivial ...   Other than the far
> > too long current 30 minute default timeout, I think the current scheme is
> > very reasonable.
> >
> >>    For all these communications, we open a socket, make the
> >> communication and close the socket.
> >>
> >>    We have reused standard Bacula functions as much as possible for
> >> this whole feature.
> >
> > OK.
> >
> >>>> New features for the DIR:
> >>>>         - when the DIR starts, it tries to connect to the FD. If the
> >>>> connection is
> >>>> successful, a presence parameter in the Client resource changes to
> >>>> "yes". Otherwise the presence parameter keeps its value "no".
> >>>
> >>> What happens if the Director has 2,000 clients?
> >>
> >> It will try to reach all clients for which the presence parameter is
> >> set. We had in mind that the DIR stops/starts only a few times a
> >> year. This is how our clients work. It optimizes the number of network
> >> communications.
> >
> > For me, the Director starts and stops hundreds of times per day, and many
> > users take their Director down much more often than you do because of the
> > need to add new clients or make other config changes.  So even for
> > smaller shops that have 60 or so clients, if the Director tries for 2
> > minutes for each client, it will be blocked for 2 hours -- that is
> > unacceptable.
> >
> >>> Does the DIR stall until it contacts them all?  How many resources
> >>> does it take to contact them?
> >>
> >> the DIR contacts all the FDs for which parameter is set before
> >> continuing.
> >
> > As I mentioned above, given the downside, any advantage of this scheme
> > doesn't seem worthwhile to me ...
> >
> >> we'll see how to "parallelize" FD communications.
> >>
> >>> How long will it take for it to contact the last of the 2,000 clients?
> >>
> >> The timers are configurable.
> >
> > Yes, that is good, but for me not sufficient.
> >
> >>> If a scheduled job starts for the 2,000th client before the 2,000th
> >>> client is contacted by the startup routine, will the job be delayed
> >>> in starting?
> >>
> >> yes.
> >>
> >>> What happens if I have clients that I don't want the director
> >>> contacting because they are very infrequently used and the jobs are
> >>> only manually started?
> >>
> >> You don't set the presence parameter, or you set it to false (the default
> >> value corresponds to the current behavior).
> >
> > I don't think it is really necessary as I have explained above.
> >
> >>>> - when the DIR is about to start a new job, it checks the presence
> >>>> parameter. If the client is
> >>>> present, the DIR starts the job; otherwise it waits for the client for
> >>>> the time specified in the Client resource in bacula-dir.conf (this
> >>>> parameter is named "WaitTimer"). It checks whether the client is
> >>>> connected at a regular interval (attribute "PresenceTimer" in
> >>>> bacula-dir.conf). If the client never connects during the
> >>>> "WaitTimer" time, the job is marked as "JSAutomaticallyCanceled" in
> >>>> the Catalog.
> >>>> "JSAutomaticallyCanceled" is a new parameter defined in jcr.h and it
> >>>> means that the job was canceled because the File daemon never
> >>>> connected.
> >>>
> >>> Is it possible to turn off this behavior?  I don't want it for my
> >>> setup, because it is not always possible for my clients to contact
> >>> the Director.
> >>
> >> Yes, sure. It is already done. Old configurations are fully compatible,
> >> with no change in behavior.
> >
> > Good.
> >
> >>>> - I have created a new file named fd_server.c. It allows the DIR to
> >>>> listen for File Daemon
> >>>> connections (the default port is 9104, parameter DIRportFD in the Director
> >>>> resource of bacula-dir.conf). The parameter MaxClientsPresence,
> >>>> defined in the Director resource in bacula-dir.conf, determines how many
> >>>> File Daemons the DIR can listen to simultaneously.
> >>>> - Authentication functions are also implemented in authenticate.c in
> >>>> src/dird and src/filed.
> >>>
> >>> The IANA will never approve of a fourth port for Bacula.  There is
> >>> no reason to have the Director listening on two ports.  These
> >>> connections should be multiplexed on port 9101.
> >>
> >> The problem is that we would have to change the handling of incoming
> >> requests on port 9101 to support two types of communication: console and FD.
> >
> > We can simply add new commands, or we can have the FD identify itself
> > differently, or we can multiplex it -- all or any of these without
> > creating any problems of compatibility.  We have done this kind of thing
> > many times in evolving Bacula so that it can continue to interface with
> > old FDs while having new commands and interfaces for new ones.
> >
> >>    So this might not be compatible with existing consoles.
> >>
> >>>> New features for the FD:
> >>>>         - the FD must know the address of the Director, which is
> >>>> stored in the
> >>>> Director resource in bacula-fd.conf.
> >>>
> >>> What happens if there are two or three Directors that can contact a
> >>> File daemon as is the case at my site?
> >>
> >> Each Director resource is separate in the FD configuration file, so
> >> each resource can be configured separately.
> >
> > This will be needed if you wish to implement some of the ideas I have
> > presented above.
>
> There seems to be a misunderstanding here: this is already implemented, and
> it is what I explained.
>
> >>>> Also, it knows on which port it can contact the DIR (default
> >>>> 9104).
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> - when the FD starts, it tries to connect to the DIR. If the
> >>>> connection is successful, a presence parameter
> >>>> in the Client resource of the Director daemon changes to "yes". Otherwise
> >>>> the presence parameter keeps its default value "no". For
> >>>> authentication it uses the existing password between the File Daemon
> >>>> and the Director. The File Daemon gives its new address to the DIR so
> >>>> if the client is a laptop, jobs can be run with any IP.
> >>>
> >>> As I mentioned above this capability already exists with SETIP.
> >>
> >> This is not exactly the same feature. It is not a feature for a
> >> console.
> >
> > SETIP sets the IP address for a client so that the Director knows where
> > to contact the client.  It is implemented via a console.   Obviously the
> > implementation is different, but if the end result is not the same, then
> > I did not properly understand what you have implemented.
> >
> >>>> - when the File Daemon stops, it warns the DIR that it is going away.
> >>>> After this warning, presence_parameter = 0: the DIR knows the client
> >>>> is absent.
> >>>
> >>> So let's say that the client goes away and notifies the Director,
> >>> then when the client starts again, because of some temporary problem
> >>> it cannot notify the director.  Is the client then essentially
> >>> disabled?
> >>
> >> Yes. This could be improved so that the FD periodically tells the DIR
> >> that it is present.
> >
> > As mentioned, I think it is always better for the Director to try to
> > contact the FD (unless the user has explicitly disabled it) each time. 
> > The state information you are proposing to keep will be stale after 1
> > second, so is not needed.
> >
> >>>> This feature doesn't work on Windows systems. Perhaps the FD does not
> >>>> terminate in the same way as it stops on Linux. At least,
> >>>> on Windows, Bacula does not go through the function "terminate_filed" in
> >>>> filed.c, so the presence parameter keeps its value of 1. ----> Perhaps
> >>>> there is a possible improvement to make here.
> >>>
> >>> It is possible that the Win32 FD gets some serious error on
> >>> termination so it never gets to the terminate_filed() code.  This is
> >>> also possible any time any FD crashes.
> >>>
> >>>> For the connections at the start of the two daemons, there is a
> >>>> retry_interval set to 10 seconds (if the connection fails, retry after
> >>>> 10 seconds) and a max_retry_time set to 20 seconds (abandon the
> >>>> connection after 20 seconds).
> >>>>
> >>>> Normally, old configurations work fine even though the files are
> >>>> patched.
> >>>>
> >>>> If the configuration files do not exist when we apply the patch, they are
> >>>> created with a new configuration (Presence parameter, PresenceTimer,
> >>>> WaitTimer, Address of the Director...). Otherwise you must modify the
> >>>> configuration files: if the Presence parameter in the Client resource in
> >>>> bacula-dir.conf and the address attribute in the Director resource in
> >>>> bacula-fd.conf do not exist, Bacula will run as with an old configuration.
> >>>>
> >>>>
> >>>>
> >>>> Example of a new configuration:
> >>>>
> >>>>
> >>>> 1/ In "bacula-dir.conf"
> >>>>
> >>>> Director {                            # define myself
> >>>>    Name = localhost-dir
> >>>>    DIRport = 9101            # where we listen for UA connections
> >>>>    DIRportFD = 9104          # where we listen for FD connections -----> NEW
> >>>>    QueryFile = "/home/rousseaum/bacula/bin/query.sql"
> >>>>    WorkingDirectory = "/home/rousseaum/bacula/working"
> >>>>    PidDirectory = "/home/rousseaum/bacula/working"
> >>>>    Maximum Concurrent Jobs = 1
> >>>>    Password = "6V2ghmC6A0YUfncxiF5wJJ1x+WAT2BpUD55l1tfaOury"   # Console password
> >>>>    Messages = Daemon
> >>>>
> >>>>    MaxClientsPresence = 20   # how many clients the DIR can listen to
> >>>>                              # simultaneously -----> NEW
> >>>
> >>> Why is this needed?
> >>
> >>    We reused existing functions, and the function we reused needs a number
> >> as an argument. So we made it a configuration parameter.
> >
> > I don't understand the above, but perhaps it is moot ...
> >
> >>> When the client connects to the Dir does it remain connected or does
> >>> it disconnect after announcing its presence?
> >>
> >> it disconnects (through bnet_close)
> >
> > OK
> >
> >>>> }
> >>>>
> >>>> Client {
> >>>>    Name = localhost-fd
> >>>>    Address = localhost
> >>>>    FDPort = 9102
> >>>>    Catalog = MyCatalog
> >>>>    Password = "VfCC+e5Lp87mlgdW58PqkxLRvyM2jcwhGCkBMNOOuzXz"   # password for FileDaemon
> >>>>    File Retention = 30 days            # 30 days
> >>>>    Job Retention = 6 months            # six months
> >>>>    AutoPrune = yes                     # Prune expired Jobs/Files
> >>>>    Presence = yes          # The presence parameter exists -----> NEW
> >>>>    PresenceTimer = 15      # Maximum time to verify the client presence -----> NEW
> >>>>    WaitTimer = 60 minutes  # Maximum time to wait for the client -----> NEW
> >>>>    # PresenceTimer and WaitTimer are defined in seconds by default. We can
> >>>>    # use minutes, hours, days... like the other time parameters in Bacula.
> >>>> }
> >>>>
> >>>>
> >>>> 2/ In "bacula-fd.conf"
> >>>>
> >>>> Director {
> >>>>    Name = localhost-dir
> >>>>    Address = localhost
> >>>>    DIRport = 9104            # -----> NEW
> >>>>    Password = "VfCC+e5Lp87mlgdW58PqkxLRvyM2jcwhGCkBMNOOuzXz"
> >>>> }
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> Example of a typical communication between the FD and the DIR:
> >>>>
> >>>> 1/ Starting daemons:
> >>>>
> >>>> 1.1/ DIR starts before FD (most frequent situations)
> >>>>
> >>>> DIR starts;
> >>>> DIR tries to connect to FD;
> >>>> if (FD connected) {
> >>>>         presence_parameter = 1;
> >>>> }
> >>>> FD starts;
> >>>> FD tries to connect to DIR;
> >>>> if (DIR connected) {
> >>>>         presence_parameter = 1;
> >>>>         FD gives its new address to DIR;
> >>>
> >>> How does the FD pass his address to the DIR?
> >>
> >>    fd->host()
> >>
> >>>> }
> >>>>
> >>>> 1.2/ FD starts before DIR
> >>>>
> >>>> FD starts;
> >>>> FD tries to connect to DIR;
> >>>> if (DIR connected) {
> >>>>         presence_parameter = 1;
> >>>>         FD gives its new address to DIR;
> >>>> }
> >>>> DIR starts;
> >>>> DIR tries to connect to FD;
> >>>> if (FD connected) {
> >>>>         presence_parameter = 1;
> >>>> }
> >>>>
> >>>>
> >>>> 2/ Starting job (Backup, Restore):
> >>>>
> >>>> DIR checks FD presence;
> >>>> if (FD hasn't got presence_parameter) {      ----> old configuration
> >>>>         run job like old configuration;
> >>>> }
> >>>> else {                                       ----> new configuration
> >>>>         if (FD present) {
> >>>>                 run job;
> >>>>         }
> >>>>         else {
> >>>>                 while (WaitTimer hasn't expired) {
> >>>>                         check FD connection every PresenceTimer interval;
> >>>>                         if (FD connected) {
> >>>>                                 run job;
> >>>>                                 return;
> >>>>                         }
> >>>>                 }
> >>>>                 mark job as JSAutomaticallyCanceled;
> >>>>         }
> >>>> }
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> *Any remarks are welcome. We hope this feature will be included in
> >>>> Bacula, so we designed it with existing client configurations in mind,
> >>>> in order not to disturb them. *
> >>>
> >>> My questions are above. Aside from the one remark I made above, the
> >>> only other remark I have for the moment (until I see the answers) is
> >>> to say, it is always preferable to announce and discuss a project
> >>> prior to coding it -- it can possibly save you a lot of time
> >>> recoding it or the horrible frustration of having it rejected after
> >>> you've spent a lot of time on it.
> >>
> >> Yes, I know we should have done so. We'll try not to forget that for
> >> future features.
> >
> > What I would really like is that you continue to work on some of the
> > features you have developed, but simply redirect your effort in the
> > directions I have indicated.  If I have understood what you have done, I
> > think you will find in the end that what I am suggesting is a lot less
> > code and will accomplish most everything you want, and if you decide to
> > do the project item 2, you will make a lot of users happy.
> >
> > Obviously even implementing project item 2 needs a bit more design work
> > before implementing.   I have already partially implemented the code in
> > the Director needed to transfer a FD request for backup.
> >
> > Best regards,
> >
> > Kern
> >
> >>> Once I have your responses to my questions, I will make my remarks.
> >>>
> >>>
> >>> Best regards,
> >>>
> >>> Kern


