
Re: [Bacula-devel] space saving in the database

On Monday 11 February 2008 18:11:10 Kern Sibbald wrote:
> Hello,
> I am really happy to see that someone besides myself is interested in this
> problem.  I am quite concerned because since 2.2.0 came out, with its new
> more correct pruning algorithm when finding a Volume, there is a lot less
> unnecessary pruning taking place, and the database for my backups has grown
> from 400MB to 1GB -- that is, it more than doubled.
> My conclusion is that we can reduce the size of users' databases by 50-80%
> by doing what I would call "Strict Pruning", that is, coming up with some
> mechanism by which the user could say that he/she wants pruning to occur as
> close as possible to the time limit specified, rather than only when
> absolutely necessary as the current code does.
Yes, it would be a nice option to have, but I don't think it is one that
would help us. I think we appreciate that the available tapes are still in
the dictionary even if they are expired. I don't really know for certain,
though; Eric would know better than I.

> In addition, I suspect that we can get significant gains by adding
> individual retention periods for Differential and Incremental records.  For
> example, if you do Fulls once a month, Differentials once a week, and
> Incrementals daily, and you set File Retention Period to 6 months,
> depending on what you want, you end up with lots and lots of Incremental
> and some Differential records that are not useful.  For example, you might
> only want to keep Incremental records for two weeks or a month.  That is
> for 2 weeks or a month, you can recover on a daily basis, but older than a
> month, you could only recover to the last Differential (i.e. on a weekly
> basis).  Likewise, you might only want to recover on a weekly basis for say
> a month or 3 months, and so the need for a Differential Retention Period.
> Finally on the subject of pruning, jobs are only pruned when they are run,
> so if you run a special job once, it will be pruned only at Volume pruning
> times (typically one year), or if you stop backing up a particular
> machine, the records for that machine will remain in the database until all
> the Volumes containing those records are pruned (possibly much longer than
> the retention period set up for the job).
> So, I suggest that some work on pruning would make a major difference:
> 1. Add a Differential Retention Period
> 2. Add an Incremental Retention Period
> 3. Add some mechanism where the user could specify strict pruning, and that
> would start some sort of automatic Admin job that would apply strict
> pruning (perhaps just adding a "Strict Pruning" directive to an Admin job
> would trigger the new code). Since pruning can take an enormous amount of
> time, I would recommend that we find some way to limit the amount of time
> the strict pruning Admin job runs.
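
The three suggestions above can be illustrated with a minimal sketch. It is only an illustration, not Bacula code: the level codes, the retention values, and the `max_pruned` cap (a stand-in for limiting how long the strict pruning Admin job runs) are all assumptions made for this example.

```python
from datetime import datetime, timedelta

# Hypothetical per-level retention periods. These names and values are
# illustrative only; they are not existing Bacula directives.
RETENTION = {
    "F": timedelta(days=180),  # Full: the 6-month File Retention example
    "D": timedelta(days=90),   # Differential: weekly recovery for ~3 months
    "I": timedelta(days=30),   # Incremental: daily recovery for ~1 month
}


def prunable(level, end_time, now):
    """Return True if a job is older than the retention period of its level."""
    return now - end_time > RETENTION[level]


def strict_prune(jobs, now, max_pruned=None):
    """Split (job_id, level, end_time) tuples into (kept, pruned) id lists.

    max_pruned is an assumed stand-in for a time limit: once that many
    jobs have been pruned, the remaining jobs are kept for a later run.
    """
    kept, pruned = [], []
    for job_id, level, end_time in jobs:
        budget_left = max_pruned is None or len(pruned) < max_pruned
        if budget_left and prunable(level, end_time, now):
            pruned.append(job_id)
        else:
            kept.append(job_id)
    return kept, pruned
```

With these assumed values, a 45-day-old Incremental is pruned while a 45-day-old Differential and a 60-day-old Full are kept, which is the behaviour described in the retention example above.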
> Now, let me say a bit about the approach you are taking to reduce the space
> used by File records.
> My first comment is that both lstat and MD5 records are variable length.
Yes, my question was just: is md5 32-bit aligned? (I mean, can I safely bet
on it being a multiple of 32 bits?)

> The length is known when the record is sent to the Director, so any attempt
> to fix the length will lead to failure.  In the case of lstat, the size of
> the fields can vary from OS to OS (32 bit vs 64 bit), and depending on the
> OS and whether or not the lstat involves a hardlink, there will be more
> fields in the lstat structure.  In the case of the MD5 record, it is really
> an old name that no longer applies -- it should be called Hash or
> something.  It can contain an MD5 or an SHA1, or any of a number of larger
> hash codes.  We could probably reduce the space by keeping it in binary,
> but then it will no longer be human readable, so we would in the cases
> where it is printed need to add additional code to convert it to ASCII.
For lstat, we took the Huffman approach, as it was yielding good results.

> Now, I think we can get most gains by working on the lstat packet, and I
> would recommend that you initially take an entirely different approach. 
> That is, the current lstat structure is basically a Unix stat structure. At
> the time I decided to put it in the database I considered that the base64
> coding did a pretty good job of compressing it, and I was not sure which
> fields in the packet we would actually need.  Now after 8 years of use, I
> think we have a much better idea of what fields are used, and I suspect
> that by careful examination and by creating our own packet, we can with no
> compression reduce the size of the packet by 50%.  Once that is done, then
> I think we should look carefully at how to compress it by taking a careful
> look at the fields. For example the st_modes field could probably be
> reduced in 90% of the cases from 4 bytes to 1 byte just by recognizing
> common patterns that appear over and over -- much as you realized that the
> lstat packet separates fields with spaces, so there are lots of spaces in
> it.
Of course, getting rid of things that aren't used would be even better than
what I'm trying to do. The best would of course be to split the lstat into
the few components we really need and use them as individual fields in the
database. But that would break a lot of things, wouldn't it?

The approach I'm trying to take for now is completely different. I'm trying to
answer the question: what can we do with the current database "as is",
without touching Bacula's code? The obvious answer is to create dedicated,
more space-efficient datatypes. It's simply a mapping between the
current 'base64-text' fields and a hidden "bytea" field.
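
To illustrate the idea of such a mapping, here is a minimal sketch. It is not the Huffman-based datatype mentioned earlier; it swaps in a simpler variable-length integer packing just to show the round trip between a text form and a binary form. It assumes that each space-separated lstat field is a non-negative integer written in base-64 digits (standard alphabet, most significant digit first, no leading zero digits). All function names are hypothetical.

```python
# Assumed digit alphabet for the text form (standard base64 ordering).
B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
B64_VALUE = {c: i for i, c in enumerate(B64)}


def encode_field(n):
    """Write a non-negative integer as base-64 digits, most significant first."""
    if n == 0:
        return B64[0]
    digits = []
    while n:
        digits.append(B64[n & 63])
        n >>= 6
    return "".join(reversed(digits))


def decode_field(text):
    """Inverse of encode_field: read base-64 digits back into an integer."""
    n = 0
    for c in text:
        n = (n << 6) | B64_VALUE[c]
    return n


def varint(n):
    """Pack an integer 7 bits per byte; the high bit marks 'more bytes follow'."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def read_varints(data):
    """Inverse of varint for a concatenated sequence of packed integers."""
    values, n, shift = [], 0, 0
    for byte in data:
        n |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            values.append(n)
            n, shift = 0, 0
    return values


def pack_lstat(text):
    """Text form -> binary form (what a bytea-backed type could store)."""
    return b"".join(varint(decode_field(f)) for f in text.split())


def unpack_lstat(data):
    """Binary form -> text form (what the application would still see)."""
    return " ".join(encode_field(n) for n in read_varints(data))
```

The binary form drops the separating spaces and stores 7 payload bits per byte instead of 6 per character, so it is shorter for typical values, while `unpack_lstat(pack_lstat(text))` returns the original text under the stated assumptions.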

The long-term work should of course be to make the file table itself more
space efficient. I think the best way would be to get rid of those two
variable length fields if possible, and replace them with only what is
required from lstat. I'm not that good with the administration of Bacula, but
do we even need st_modes in the catalog? I guess things like creation and
last modification time are important, but what else is? Anyway, this is a lot
more work...

The same is true for the md5 field: maybe only one hash (sha256 maybe) would
be better? If the field is fixed size, we could craft a fixed size record,
and get rid of the variable length header in the database (4 bytes
saved ... :) )

Anyhow, for now, what I'm proposing is easy to implement, even if it would not
bring improvements as dramatic as what you propose (I'm talking about
saving 10 to 15% only ...)

Still, I'd like to know if the md5 field is always a multiple of 32 bits in
length?
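
As a quick check of the raw digest lengths, the following sketch uses Python's standard hashlib. It only covers the common algorithms listed in it and says nothing about which hashes a given Bacula build actually sends. The base64 length shown is for the standard padded encoding, which is an assumption and may differ from the text actually stored in the catalog.

```python
import base64
import hashlib


def digest_sizes(names=("md5", "sha1", "sha256", "sha512")):
    """Return (name, size in bits, is a multiple of 32) for each algorithm."""
    rows = []
    for name in names:
        bits = hashlib.new(name).digest_size * 8
        rows.append((name, bits, bits % 32 == 0))
    return rows


def stored_lengths(name, data=b""):
    """Compare the byte length of one digest in binary, base64 and hex form."""
    raw = hashlib.new(name, data).digest()
    return {
        "binary": len(raw),
        "base64": len(base64.b64encode(raw)),  # standard, padded base64
        "hex": len(raw.hex()),
    }


if __name__ == "__main__":
    for name, bits, aligned in digest_sizes():
        print(name, bits, aligned, stored_lengths(name))
```

For these four algorithms the raw digest is 128, 160, 256 and 512 bits respectively, so each is a multiple of 32 bits in binary form; the text encodings are longer and their lengths depend on the encoding used.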
