
Re: [Bacula-devel] Patch: Migration jobmedia table insert incomplete

<al@xxxxxxxxxxxxxx> aka Arno Lehmann wrote
on Sun, 02 Mar 2008 12:50:17 +0100 in m2n.bacula.devel:

|Of course, sometimes he makes it quite clear that he doesn't want to 
|discuss certain things, or he is annoyed because thinking through all 
|this takes time he prefers spending on actually *doing* things on 
|Bacula, but in general, he likes to know about even small issues you 
|have... after all, improving Bacula is what he wants to do :-)

Well, Arno, as I understand it, there is really big *fun* in creating
new features & enhancing things - this is that "being creative" which
is a genuine human impetus, while it is somewhat boring to re-evaluate
already completed functions again and again to fix minor problems.
OTOH, the quality of any new feature will more or less depend on the
reliability of the base, so in the end it surely will pay off.

|As long as you can take the occasional rough reply, I don't see a 
|problem. Of course you are aware that not every report will lead to an 
|immediate fix :-)

That's what I am glad about: I am under no pressure to *need*
these things fixed quickly.

|I would consider that a bug, but one that might need quite a bit of 
|redesign to fix, and that affects only a limited number of users.

That is how I see it, too.

|> 2. This is the thing that I have been worrying the most about. I
|>    ...
|>    What happens is that the Director puts all jobs (and all newly
|>    started jobs) into either "waiting on max Storage jobs" or
|>    "waiting execution", while there is no job running on any client
|>    and no job running on the SD. It just does nothing and has to
|>    be restarted.
|That definitely qualifies as a bug... have you tried looking at the 
|debug output, once the DIR is in this state?

I am still trying to figure out the precise conditions to reproduce
that. As far as I know, it depends on quite a few things coming
together. And up to now, it has hit me only in moments when I really
didn't want it to happen, because I was busy getting other things into

|>    What I have learned from reading bacula-users, is that most 
|>    people do not run such quantities of jobs as I do. So maybe this
|>    is the reason.
|Might be... how many jobs are you running in parallel?

Well, I am running some incremental backups every 10 minutes
(which gives me a useful substitute for an "undelete" function and
for filesystem auto-versioning). This means that I no longer make
a "backup" copy when editing files - instead I rely on the restore
function, and use it quite often.
These restores, colliding with backup jobs of differing
priorities, seem to be a key factor here.

The absolute number of jobs does not seem to be critical for this - it
has happened with only five jobs in the queue. It seems to be the
diversity of jobs coming together at one time.

|> 4. When migrating from disk to tape, there should be no need to do
|>    SD data spooling - as the data is already packed up, it will flow
|>    quickly to the tape, and data spooling would only slow down the
|>    process.
|>    But in that case it is quite possible that multiple jobs write
|>    simultaneously to the tape. When later restoring such jobs, each 
|>    job must be restored by a separate restore command, which can
|>    make the process very slow.
|Good point... we might need a way to disable job concurrency in 
|that case, making sure migration jobs going to the same storage device 
|are not multiplexed, even if that storage device allows multiple 
|concurrent jobs...

Maybe this. Or maybe we should look into the issue and see if it
can be fixed or, if that is difficult, if such restore operations
can be refused at start time.

I think this is the most severe of the matters I have described.

If people run a restore, and that restore runs through
but the restored data is all garbage, then people will panic. And
this will give us a bad reputation, even though the data isn't
actually lost and the problem can be worked around.

And, as far as I understand it, currently this can happen. It was
not a migration where this happened to me - it is only more difficult
to avoid it when doing migrations.

The good thing is that the manual already discourages letting
jobs intermix on media. It just doesn't say how severe the problem
can become.

I also do not yet know whether this happens every time two
interleaved jobs are restored through the same restore job, or only
in certain circumstances - as I have tried that only once.
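
To make the interleaving concrete, here is a toy model in Python. It is
only a sketch under my own assumptions: it is not Bacula's volume format
and not its restore code, and all names in it (interleave,
restore_by_job, restore_ignoring_job_ids) are made up for illustration.
It shows two jobs written block by block onto one "volume", and the
difference between reading the blocks back per job and reading them
back in plain tape order.

```python
from collections import defaultdict


def interleave(jobs, block_size=4):
    """Write the jobs round-robin onto one 'volume' as (job_id, block) records.

    jobs is a dict mapping a hypothetical job id to that job's data (bytes).
    """
    volume = []
    offsets = {job_id: 0 for job_id in jobs}
    # Keep going until every job has written all of its data.
    while any(offsets[job_id] < len(data) for job_id, data in jobs.items()):
        for job_id, data in jobs.items():
            start = offsets[job_id]
            if start < len(data):
                volume.append((job_id, data[start:start + block_size]))
                offsets[job_id] = start + block_size
    return volume


def restore_by_job(volume):
    """Reassemble each job from the blocks tagged with its own job id."""
    restored = defaultdict(bytes)
    for job_id, block in volume:
        restored[job_id] += block
    return dict(restored)


def restore_ignoring_job_ids(volume):
    """Naive read: concatenate every block in tape order, mixing the jobs."""
    return b"".join(block for _, block in volume)


jobs = {1: b"AAAAAAAA", 2: b"BBBBBB"}
volume = interleave(jobs)
```

In this model, restore_by_job(volume) gives back both jobs intact, while
restore_ignoring_job_ids(volume) yields a mix of the two that matches
neither job. Whether something comparable is what actually happened in
my restore, I cannot say.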

|> So, this is more or less the background which led me to my statement
|> that pervasive use of migration would currently show some
|> deficiencies... I hope you understand...
|I at least do. And I thank you for your insights - I'm quite sure 
|there are some hours of research behind it.

Warm thanks for this feedback!
I will keep these matters in mind (or on file) and see what I can do
further. As things are currently, my installations have reached a
state where they are starting to run rather smoothly - so I am getting
a little reluctant to do too much testing on them. Also, some more
social matters are calling for my attention in the near future.
But now that we have talked about these matters, I will see whether I
can set up another installation just for testing, to try to get some
more valuable information (or maybe even patches).

best regards,
