
Re: [Bacula-devel] Remaining dual changer problems


Kern Sibbald wrote:
> On Saturday 24 May 2008 15:54:51 Eric Bollengier wrote:
>   
>> Hello,
>>
>> On Saturday 24 May 2008 15:08:19 Kern Sibbald wrote:
>>     
>>> Hello Eric,
>>>
>>> I assume that since I haven't heard anything from you that the last fixes
>>> (2.2.10-b4) to the reservation system, fixed the problems you were
>>> having. For others on the list, the company Eric works for runs 270
>>> nightly jobs so they have a tendency to run into Bacula SD bugs.
>>>       
>> Yes everything is ok right now.
>>
>>     
>>> There still remains one outstanding problem of which I am aware that
>>> fortunately is not hitting you, and that is bug #1083 "SD attempts to
>>> load volume already loaded in another drive for multi-drive disk
>>> autochanger". This bug shows up only during swapping of a volume from one
>>> drive to another and typically is created when PreferMountedVolumes=no.
>>>
>>> I know what is causing the problem and after thinking about different
>>> solutions, I think the best one is to add a new autochanger script query.
>>> As a reminder, the current commands are:
>>>
>>> #  The commands are:
>>> #      Command            Function
>>> #      unload             unload a given slot
>>> #      load               load a given slot
>>> #      loaded             which slot is loaded?
>>> #      list               list Volume names (requires barcode reader)
>>> #      slots              how many slots total?
>>> #
>>>
>>> The new one would be "where" and would be called
>>>
>>>   mtx-changer "changer-device" where "slot-number"
>>>
>>> the other two arguments would be ignored.  This new function asks where a
>>> Volume with "slot-number" is located.
>>>
>>> The answer can be:
>>>
>>> slot nnn
>>>
>>> or
>>>
>>> drive nnn
>>>       
>> If you ask "where 10" you will have
>> slot 10
>> or
>> drive 1
>>     
>
> Yes, you would get one or the other but not both.  I forgot to mention that it 
> could also return an error in case the volume for slot 10 does not exist.  
> Otherwise if it does exist, it is either in its slot and returns "slot 10" or 
> in a drive and returns "drive n" where n is the drive index (zero based).
>
>   
>> ok, or we can change the "load" function to do the work.
>>
>> load drive0 slot1
>>  - check if slot1 is already loaded (do nothing if already in drive0)
>>  - load the slot1 to drive0 if slot1 is unloaded
>>  - exit with error code 1,2 or 3 and with a message "already loaded 2"
>>
>> And we have to handle the third case in the SD.
>>     
>
> Yes, that would be possible, but since we have to handle the third case, I 
> prefer to keep each of the commands to mtx-changer as simple as possible.  
> This means a few more calls from the SD, but it makes the code much cleaner, 
> and I would strongly prefer not to change any of the existing commands.
>
>   
>>> This command would be issued before each load request, and if the Volume
>>> is already in the correct drive, nothing more would be done; if the
>>> volume is in its slot, it would be loaded; and if the volume is in a
>>> different drive, it would be unloaded, then loaded into the desired
>>> drive.
>>>
>>> This would allow a simple interface for the SD to ensure that it takes
>>> the right action to load a particular slot in a particular drive without
>>> the need for trying to track it within the SD.  For me it makes the most
>>> sense because it is the changer device that definitively knows where the
>>> volume is.
>>>       
>> It makes sense and will simplify the code; it's just a bit strange to lose
>> volume location across the code, but I know that it's a very complex part
>> of the SD.
>>     
>
> Actually, we wouldn't lose any Volume location compared to what happens today.  
> The problem today is that once the SD knows that a Volume is no longer going 
> to be used on a particular drive, the info about that Volume is lost in the 
> SD.  
>
> For example, we currently have Vol001 on drive 0.  The SD wants to load Vol002 
> on drive 0.  Before that operation, the SD has Vol001 in the Volume list, and 
> after that operation it has only Vol002.  I thought about keeping both, but 
> that is a real nightmare from a coding standpoint -- you need to know that 
> Vol001 must be unloaded and that Vol002 must be loaded, and somehow the drive 
> must point to both Volumes.  The situation becomes even more complicated when 
> moving a Volume from one drive to another.  
>
> In the end, I decided that the SD will keep track of what Volumes it has 
> mounted and is using, and will ask the autochanger questions when it wants to 
> move them around rather than trying to have the SD duplicate all the info 
> that is kept by the autochanger.  Duplicating the info is dangerous because 
> the SD generally knows a Volume name and the Slot number, and possibly a 
> drive if it is loaded.  The Autochanger generally does not know the Volume 
> name (unless barcodes are enabled and being used).
>
>   
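
The decision rule quoted above (issue "where" before each load, then do
nothing, load, or unload-then-load) can be sketched as follows. This is
only an illustration in Python, not SD code: parse_where() and
plan_load() are hypothetical names, and the reply format is assumed to
be exactly "slot nnn" or "drive nnn" as proposed.

```python
def parse_where(reply):
    """Parse a proposed "where" reply: "slot nnn" or "drive nnn"."""
    kind, _, number = reply.strip().partition(" ")
    if kind not in ("slot", "drive") or not number.isdigit():
        raise ValueError(f"unexpected where reply: {reply!r}")
    return kind, int(number)


def plan_load(reply, slot, drive):
    """Return the changer commands needed to get `slot` into `drive`.

    Hypothetical helper: each command is a (name, slot, drive) tuple.
    """
    kind, number = parse_where(reply)
    if kind == "slot":
        # The volume is sitting in its slot: a plain load is enough.
        return [("load", slot, drive)]
    if number == drive:
        # Already in the requested drive: nothing more to do.
        return []
    # In a different drive: unload it from there, then load it here.
    return [("unload", slot, number), ("load", slot, drive)]


if __name__ == "__main__":
    for reply in ("slot 10", "drive 0", "drive 1"):
        print(reply, "->", plan_load(reply, slot=10, drive=0))
```

The error case (no volume for the slot) is left out; a real script would
also need to report that.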

The "where" could be emulated by doing a "loaded" on each drive. The 
only difference is that a "where" followed by a "load" requires only two 
locks of the mutex, whereas using "loaded" requires one lock for 
each drive plus one for the "load" command. The problem is that requiring 
two locks is not much better than requiring 4 or 5, because the mutex is 
released between the commands, which still leaves a race condition. 
Imagine two simultaneous jobs using the same pool when the next available 
volume for that pool is in slot 3. Job 1 is trying to use drive 0 and 
job 2 is trying to use drive 1. The following is likely to happen.

1. Both jobs begin by issuing "where" nearly simultaneously.
2. Job 1 happens to get the lock first and calls "where", forcing job 2 
to wait on the mutex.
3. When "where" returns, job 1 releases the mutex and sees the volume is 
in slot 3.
4. Job 2 locks the mutex it has been waiting on and calls "where".
5. Job 1 decides to perform a "load" from slot 3 into drive 0, so it 
begins waiting on the mutex.
6. When "where" returns, job 2 releases the mutex and sees the volume is 
in slot 3.
7. Job 1 locks the mutex it has been waiting on and loads slot 3 into 
drive 0.
8. Job 2 decides to perform a "load" from slot 3 into drive 1, so it 
begins waiting on the mutex.
9. When "load" returns, job 1 releases the mutex.
10. Job 2 locks the mutex it has been waiting on and attempts to load 
slot 3 into drive 1.

This results in job 2 failing because the volume is no longer in slot 3. 
Its "where" essentially gave it an incorrect answer, though the answer 
was correct at the time the command was issued.
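
To make the interleaving concrete, here is a small single-threaded
replay of the steps above in Python. It is a toy model only: the Changer
class and its methods are made up for illustration and do not correspond
to anything in autochanger.c. Each call stands for one atomic,
mutex-protected command.

```python
class Changer:
    """Toy changer: maps a slot number to "slot" or to a drive index."""

    def __init__(self, slots):
        self.location = {s: "slot" for s in slots}

    def where(self, slot):
        loc = self.location[slot]
        return f"slot {slot}" if loc == "slot" else f"drive {loc}"

    def load(self, slot, drive):
        if self.location[slot] != "slot":
            raise RuntimeError(
                f"slot {slot} is empty; volume is in drive {self.location[slot]}"
            )
        self.location[slot] = drive


changer = Changer(slots=[3])

# Steps 2-3: job 1 holds the mutex only while "where" runs.
answer_job1 = changer.where(3)
# Steps 4-6: job 2 does the same and gets the same answer.
answer_job2 = changer.where(3)
# Steps 7-9: job 1 acts on its answer and loads slot 3 into drive 0.
changer.load(3, drive=0)
# Step 10: job 2 acts on its (now stale) answer.
try:
    changer.load(3, drive=1)
    job2_result = "loaded"
except RuntimeError as exc:
    job2_result = f"failed: {exc}"

print(answer_job1, "|", answer_job2, "|", job2_result)
```

Both jobs are told "slot 3", and the second load fails even though every
individual command was serialized.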

My understanding from looking at autochanger.c is that calls to the 
autochanger script are serialized by locking/unlocking a mutex defined 
for each autochanger resource, thus making autochanger commands atomic. 
My humble thought is that making the autochanger commands atomic is not 
sufficient when more than one command is required to make the needed 
volume available. The autochanger needs to be locked the entire time, 
from the start of the search for the next available volume until an 
appendable volume is marked in use. If this can be done, then there 
would be no need to change either the Autochanger API or any existing 
scripts.
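
As a rough sketch of what I mean, assuming a single lock per autochanger
that is held across the whole "find the volume, then load it" sequence
(Python threads here purely for illustration; changer_lock, location and
acquire_volume() are invented names, not SD code):

```python
import threading

changer_lock = threading.Lock()   # stands in for the per-autochanger mutex
location = {3: "slot"}            # slot number -> "slot" or a drive index
outcome = {}                      # job name -> what that job ended up doing


def acquire_volume(job, slot, drive):
    # Hold the lock for the whole query-then-load sequence, so the
    # answer to "where is it?" cannot go stale before the load.
    with changer_lock:
        if location[slot] == "slot":
            location[slot] = drive
            outcome[job] = f"loaded slot {slot} into drive {drive}"
        elif location[slot] == drive:
            outcome[job] = "already loaded"
        else:
            outcome[job] = (
                f"slot {slot} busy in drive {location[slot]}; pick another volume"
            )


jobs = [
    threading.Thread(target=acquire_volume, args=(f"job{i + 1}", 3, i))
    for i in range(2)
]
for t in jobs:
    t.start()
for t in jobs:
    t.join()

for job in sorted(outcome):
    print(job, "->", outcome[job])
```

Whichever job gets the lock first loads the volume; the other one sees
that it is already in a drive and can move on to another volume instead
of attempting a load that is bound to fail.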

>>> Aside from wanting feedback on this idea, the big question is when to
>>> implement this.  Clearly it should be done before the next major version
>>> as it will allow us to eliminate a class of annoying little problems.
>>>
>>> I am also considering the possibility of implementing it in Branch-2.2,
>>> but I really don't like that idea too much because it means that it will
>>> break all the autochanger scripts implemented by users (at least one
>>> virtual disk changer and the FreeBSD chio based script).  In any case, it
>>> is *very* unlikely to be implemented before the 2.2.10 release.
>>>       
>> I agree with you, changing the mtx script during a 2.2.X release is not a
>> very good idea, but if the changelog and the error message are clear, I
>> think that it can be done.
>>     
>
> Yes, that is my feeling too.  I think once 2.2.10 is out (hopefully at the end 
> of next week), we should encourage anyone having big problems with the 
> autochanger to try the development version, where we can implement this idea 
> (providing it still seems like a good idea after a longer reflection on the 
> problem).
>
> Kern
>
>
> -------------------------------------------------------------------------
> This SF.net email is sponsored by: Microsoft
> Defy all challenges. Microsoft(R) Visual Studio 2008.
> http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
> _______________________________________________
> Bacula-devel mailing list
> Bacula-devel@xxxxxxxxxxxxxxxxxxxxx
> https://lists.sourceforge.net/lists/listinfo/bacula-devel
>   
