
Re: [Bacula-devel] Remaining dual changer problems


On Tuesday 27 May 2008 19:13:51 Josh Fisher wrote:
> Kern Sibbald wrote:
> > On Saturday 24 May 2008 15:54:51 Eric Bollengier wrote:
> >> Hello,
> >>
> >> On Saturday 24 May 2008 15:08:19 Kern Sibbald wrote:
> >>> Hello Eric,
> >>>
> >>> I assume that since I haven't heard anything from you that the last
> >>> fixes (2.2.10-b4) to the reservation system, fixed the problems you
> >>> were having. For others on the list, the company Eric works for runs
> >>> 270 nightly jobs so they have a tendency to run into Bacula SD bugs.
> >>
> >> Yes everything is ok right now.
> >>
> >>> There still remains one outstanding problem of which I am aware that
> >>> fortunately is not hitting you, and that is bug #1083 "SD attempts to
> >>> load volume already loaded in another drive for multi-drive disk
> >>> autochanger". This bug shows up only during swapping of a volume from
> >>> one drive to another and typically is created when
> >>> PreferMountedVolumes=no.
> >>>
> >>> I know what is causing the problem and after thinking about different
> >>> solutions, I think the best one is to add a new autochanger script
> >>> query. As a reminder, the current commands are:
> >>>
> >>> #  The commands are:
> >>> #      Command            Function
> >>> #      unload             unload a given slot
> >>> #      load               load a given slot
> >>> #      loaded             which slot is loaded?
> >>> #      list               list Volume names (requires barcode reader)
> >>> #      slots              how many slots total?
> >>> #
> >>>
> >>> The new one would be "where" and would be called
> >>>
> >>>   mtx-changer "changer-device" where "slot-number"
> >>>
> >>> the other two arguments would be ignored.  This new function asks where
> >>> a Volume with "slot-number" is located.
> >>>
> >>> The answer can be:
> >>>
> >>> slot nnn
> >>>
> >>> or
> >>>
> >>> drive nnn
> >>
> >> If you ask "where 10" you will have
> >> slot 10
> >> or
> >> drive 1
> >
> > Yes, you would get one or the other but not both.  I forgot to mention
> > that it could also return an error in case the volume for slot 10 does
> > not exist. Otherwise if it does exist, it is either in its slot and
> > returns "slot 10" or in a drive and returns "drive n" where n is the
> > drive index (zero based).
> >
> >> ok, or we can change the "load" function to do the work.
> >>
> >> load drive0 slot1
> >>  - check if slot1 is already loaded (do nothing if already in drive0)
> >>  - load the slot1 to drive0 if slot1 is unloaded
> >>  - exit with error code 1,2 or 3 and with a message "already loaded 2"
> >>
> >> And we have to handle the third case in the SD.
> >
> > Yes, that would be possible, but since we have to handle the third case,
> > I prefer to keep each of the commands to mtx-changer as simple as
> > possible. This means a few more calls from the SD, but it makes the code
> > much cleaner, and I would strongly prefer not to change any of the
> > existing commands.
> >
> >>> This command would be issued before each load request, and if the
> >>> Volume is already in the correct drive, nothing more would be done; if
> >>> the volume is in its slot, it would be loaded; and if the volume is in
> >>> a different drive, it would be unloaded, then loaded into the desired
> >>> drive.
> >>>
> >>> This would allow a simple interface for the SD to ensure that it takes
> >>> the right action to load a particular slot in a particular drive
> >>> without the need for trying to track it within the SD.  For me it makes
> >>> the most sense because it is the changer device that definitively knows
> >>> where the volume is.
> >>
> >> It makes sense and will simplify the code, it's just a bit strange to
> >> lose volume location across the code, but I know that it's a very
> >> complex part of the SD.
> >
> > Actually, we wouldn't lose any Volume location compared to what happens
> > today. The problem today is that once the SD knows that a Volume is no
> > longer going to be used on a particular drive, the info about that Volume
> > is lost in the SD.
> >
> > For example, we currently have Vol001 on drive 0.  The SD wants to load
> > Vol002 on drive 0.  Before that operation, the SD has Vol001 in the
> > Volume list, and after that operation it has only Vol002.  I thought
> > about keeping both, but that is a real nightmare from a coding
> > standpoint -- you need to know that Vol001 must be unloaded and that Vol002
> > must be loaded, and somehow the drive must point to both Volumes.  The
> > situation becomes even more complicated when moving a Volume from one
> > drive to another.
> >
> > In the end, I decided that the SD will keep track of what Volumes it has
> > mounted and is using, and will ask the autochanger questions when it
> > wants to move them around rather than trying to have the SD duplicate all
> > the info that is kept by the autochanger.  Duplicating the info is
> > dangerous because the SD generally knows a Volume name and the Slot
> > number, and possibly a drive if it is loaded.  The Autochanger generally
> > does not know the Volume name (unless barcodes are enabled and being
> > used).
>
> The "where" could be emulated by doing a "loaded" on each drive. The
> only difference is that a "where" followed by a "load" requires only two
> locks of the mutex, whereas using "loaded" requires a lock for
> each drive plus one for the "load" command. The problem is, requiring
> two locks is not much better than requiring 4 or 5, because it still
> introduces a race condition. Imagine two simultaneous jobs using the
> same pool when the next available volume for that pool is in slot 3. Job
> 1 is trying to use drive 0 and job 2 is trying to use drive 1. The
> following is likely to happen.
>
> 1. Both jobs begin by issuing "where" nearly simultaneously.
> 2. Job 1 happens to get the lock first and calls "where", forcing job 2
> to wait on the mutex.
> 3. When "where" returns, job 1 releases the mutex and sees the volume is
> in slot 3
> 4. Job 2 locks the mutex it has been waiting on and calls "where".
> 5. Job 1 decides to perform a "load" from slot 3 into drive 0, so begins
> waiting on the mutex.
> 6. When "where" returns, job 2 releases the mutex and sees the volume is
> in slot 3
> 7. Job 1 locks the mutex it has been waiting on and loads slot 3 into
> drive 0
> 8. Job 2 decides to perform a "load" from slot 3 into drive 1, so begins
> waiting on the mutex.
> 9. When "load" returns, job 1 releases the mutex
> 10. Job 2 locks the mutex it has been waiting on and attempts to load
> slot 3 into drive 1
>
> This results in job 2 failing because the volume is no longer in slot 3.
> Its "where" essentially gave it an incorrect answer, though the answer
> was correct at the time the command was issued.

Bacula doesn't work the way you describe above, so it really doesn't make much 
sense to discuss it.

>
> My understanding from looking at autochanger.c is that calls to the
> autochanger script are serialized by locking/unlocking a mutex defined
> for each autochanger resource, thus making autochanger commands atomic.

That is correct.
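
For readers following the thread, here is a minimal sketch of what "one
mutex per autochanger resource" means, written in Python purely for
illustration.  It is not taken from autochanger.c; the Changer class, its
run() method and the demo() helper are hypothetical names, and the script
call itself is replaced by an append to a log.

```python
import threading


class Changer:
    """Toy stand-in for one autochanger resource (not Bacula code)."""

    def __init__(self, name):
        self.name = name
        self.lock = threading.Lock()  # one mutex per autochanger resource
        self.log = []                 # order in which commands actually ran

    def run(self, command, slot, drive):
        # Hold the mutex for exactly one script call, so that each
        # individual command is atomic with respect to the others.
        with self.lock:
            self.log.append((command, slot, drive))
            return len(self.log)


def demo():
    """Issue "loaded" for four drives from four threads at once."""
    changer = Changer("changer0")
    threads = [
        threading.Thread(target=changer.run, args=("loaded", 0, drive))
        for drive in range(4)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return changer.log
```

The lock only protects a single call; nothing in this sketch ties two
consecutive calls together.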

> My humble thought is that making the autochanger commands atomic is not
> sufficient when more than one command is required to make the needed
> volume available. 

That is also correct, but once Bacula knows what needs to be done, it 
serializes the commands.

> The autochanger needs to be locked the entire time, 
> from the start of the search for the next available volume until an
> appendable volume is marked in use. If this can be done, then there
> would be no need to change either the Autochanger API or any existing
> scripts.

There is no need to do the above, and indeed life is far more complex because 
it is not the autochanger that is important, but more what the threads are 
doing and what the Director says that it wants.  It is simply a matter of 
Bacula doing some record keeping, and the current problem is Bacula keeps 
tabs on one Volume per drive, but in certain situations (mostly created by 
Prefer Mounted Volumes=no), it must keep track of two Volumes per drive.  A 
simpler solution than implementing that is to simply be able to query the 
autochanger where a volume is at the moment it is needed -- thus my proposal.
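
To make the proposal concrete, here is a minimal sketch of the decision
the SD would take from the "where" answer, written in Python purely for
illustration.  The function name plan_load and the (command, slot, drive)
tuples are hypothetical; only the "slot nnn" / "drive nnn" answer format
and the zero-based drive index come from the proposal quoted above.  The
sketch does not cover unloading a different Volume that may already be
sitting in the target drive.

```python
def plan_load(where_answer, slot, drive):
    """Return the changer commands needed to get `slot` into `drive`.

    `where_answer` is the text returned by the proposed "where" query:
    "slot nnn" if the volume is in its slot, or "drive nnn" if it is
    loaded in a drive (nnn is the zero-based drive index).
    """
    parts = where_answer.split()
    if (len(parts) != 2 or parts[0] not in ("slot", "drive")
            or not parts[1].isdigit()):
        raise ValueError("unexpected answer from where: %r" % where_answer)
    kind, number = parts[0], int(parts[1])

    if kind == "slot":
        if number != slot:
            raise ValueError("where answered for a different slot")
        # Volume is in its slot: a plain load is enough.
        return [("load", slot, drive)]
    if number == drive:
        # Volume is already in the wanted drive: nothing to do.
        return []
    # Volume is in another drive: unload it there, then load it here.
    return [("unload", slot, number), ("load", slot, drive)]
```

For example, plan_load("drive 1", 10, 0) yields an unload from drive 1
followed by a load into drive 0, which is the swap case from bug #1083.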

If we lock the autochanger during the whole period, it would do no good, 
because there are multiple devices, multiple threads, multiple Volumes (all 
of which have locks) as well as the Director.

I am considering disabling "Prefer Mounted Volumes = no" because it is 
the feature that creates a tremendous amount of extra work in Bacula.  Rather 
than using that feature (which currently either does not work or leads to 
failures), I recommend using multiple pools to write to multiple volumes 
simultaneously.

Best regards,

Kern

>
> >>> Aside from wanting feedback on this idea, the big question is when to
> >>> implement this.  Clearly it should be done before the next major
> >>> version as it will allow us to eliminate a class of annoying little
> >>> problems.
> >>>
> >>> I am also considering the possibility of implementing it in Branch-2.2,
> >>> but I really don't like that idea too much because it means that it
> >>> will break all the autochanger scripts implemented by users (at least
> >>> one virtual disk changer and the FreeBSD chio based script).  In any
> >>> case, it is *very* unlikely to be implemented before the 2.2.10
> >>> release.
> >>
> >> I agree with you, changing the mtx script during a 2.2.X release is
> >> not a very good idea, but if the changelog and the error message are
> >> clear, I think that it can be done.
> >
> > Yes, that is my feeling too.  I think once 2.2.10 is out (hopefully at the
> > end of next week), we should encourage anyone having big problems with
> > the autochanger to try the development version, where we can implement
> > this idea (providing it still seems like a good idea after a longer
> > reflection on the problem).
> >
> > Kern
> >
> >
> > -------------------------------------------------------------------------
> > This SF.net email is sponsored by: Microsoft
> > Defy all challenges. Microsoft(R) Visual Studio 2008.
> > http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
> > _______________________________________________
> > Bacula-devel mailing list
> > Bacula-devel@xxxxxxxxxxxxxxxxxxxxx
> > https://lists.sourceforge.net/lists/listinfo/bacula-devel
>




