
Re: [Bacula-devel] space saving in the database


On Monday 11 February 2008 20.34:40 Marc Cousin wrote:
> On Monday 11 February 2008 18:11:10 Kern Sibbald wrote:
> > Hello,
> >
> > I am really happy to see that someone besides myself is interested in
> > this problem.  I am quite concerned because since 2.2.0 came out, with
> > its new, more correct pruning algorithm when finding a Volume, there is a
> > lot less unnecessary pruning taking place, and the database for my
> > backups has grown from 400MB to 1GB -- that is, it has more than doubled.
> >
> > My conclusion is that we can reduce the size of users' databases by
> > 50-80% by doing what I would call "Strict Pruning": some mechanism by
> > which the user can ask that pruning occur as close as possible to the
> > specified time limit, rather than only when absolutely necessary, as the
> > current code does.
>
> Yes, it would be a nice option to have, but I think it would not be one
> that would help us.  I think we appreciate that the available tapes are
> still in the catalog even if they are expired.  Actually I don't really
> know this; Eric would know better than I.
>
> > In addition, I suspect that we can get significant gains by adding
> > individual retention periods for Differential and Incremental records.
> > For example, if you do Fulls once a month, Differentials once a week,
> > and Incrementals daily, and you set the File Retention period to 6
> > months, then depending on what you want, you end up with lots and lots
> > of Incremental and some Differential records that are not useful.  For
> > example, you might only want to keep Incremental records for two weeks
> > or a month.  That is, for two weeks or a month you can recover on a
> > daily basis, but further back than that you could only recover to the
> > last Differential (i.e. on a weekly basis).  Likewise, you might only
> > want to recover on a weekly basis for, say, a month or three months --
> > hence the need for a Differential Retention period.
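
To put rough numbers on that example: with monthly Fulls, weekly
Differentials, and daily Incrementals all held for the same 6 months, each
client carries File records for roughly 6 Full, 26 Differential, and 180
Incremental jobs.  A two-week Incremental retention would cut the
Incremental share from about 180 jobs' worth of File records to about 14,
and a three-month Differential retention would roughly halve the
Differential share.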
> >
> > Finally, on the subject of pruning, jobs are only pruned when they are
> > run, so if you run a special job once, it will be pruned only at Volume
> > pruning time (typically one year), and if you stop backing up a
> > particular machine, the records for that machine will remain in the
> > database until all the Volumes containing those records are pruned
> > (possibly much longer than the retention period set up for the job).
> >
> > So, I suggest that some work on pruning would make a major difference:
> >
> > 1. Add a Differential Retention Period
> > 2. Add an Incremental Retention Period
> > 3. Add some mechanism by which the user could specify strict pruning,
> > starting some sort of automatic Admin job that applies it (perhaps just
> > adding a "Strict Pruning" directive to an Admin job would be enough to
> > trigger the new code).  Since pruning can take enormous amounts of time,
> > I would recommend that we find some way to limit how long the strict
> > pruning Admin job runs.
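
To make items 1-3 concrete, here is a minimal sketch of how a per-level
retention check might work.  All names here (retention_config,
effective_retention, should_prune) are hypothetical illustrations of the
proposal, not Bacula code:

#include <stdbool.h>
#include <time.h>

typedef enum { LEVEL_FULL, LEVEL_DIFFERENTIAL, LEVEL_INCREMENTAL } job_level_t;

struct retention_config {
   time_t file_retention;          /* existing File Retention period */
   time_t differential_retention;  /* proposed directive, item 1     */
   time_t incremental_retention;   /* proposed directive, item 2     */
};

/* Effective retention for a job record: the per-level period when one
 * is configured (non-zero), otherwise the general File Retention. */
static time_t effective_retention(const struct retention_config *rc,
                                  job_level_t level)
{
   switch (level) {
   case LEVEL_DIFFERENTIAL:
      return rc->differential_retention ? rc->differential_retention
                                        : rc->file_retention;
   case LEVEL_INCREMENTAL:
      return rc->incremental_retention ? rc->incremental_retention
                                        : rc->file_retention;
   default:
      return rc->file_retention;
   }
}

/* Strict pruning (item 3) would walk the records on a schedule and
 * prune any whose end time has passed the effective retention. */
static bool should_prune(const struct retention_config *rc,
                         job_level_t level, time_t job_end, time_t now)
{
   return (now - job_end) > effective_retention(rc, level);
}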
> >
> > Now, let me say a bit about the approach you are taking to reduce the
> > space used by File records.
> >
> > My first comment is that both lstat and MD5 records are variable length.
>
> Yes, my question was just: is the MD5 field 32-bit aligned?  (I mean, can
> I safely bet on its length being a multiple of 32 bits?)
>
> > The length is known when the record is sent to the Director, so any
> > attempt to make the length fixed will lead to failure.  In the case of
> > lstat, the size of the fields can vary from OS to OS (32 bit vs 64 bit),
> > and depending on the OS and whether or not the lstat involves a
> > hardlink, there will be more fields in the lstat structure.  In the case
> > of the MD5 record, the name is really an old one that no longer applies
> > -- it should be called Hash or something.  It can contain an MD5, an
> > SHA1, or any of a number of larger hash codes.  We could probably reduce
> > the space by keeping it in binary, but then it would no longer be human
> > readable, so in the cases where it is printed we would need additional
> > code to convert it to ASCII.
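
To illustrate why that field is variable length and what binary storage
could save, here is a small sketch.  It assumes standard padded base64 for
the arithmetic; Bacula uses its own base64 variant, so the exact text
sizes differ slightly:

#include <stdio.h>

/* Raw digest sizes for the hashes the "MD5" column may hold. */
static const struct { const char *name; int raw_bytes; } digests[] = {
   { "MD5",     16 },
   { "SHA-1",   20 },
   { "SHA-256", 32 },
   { "SHA-512", 64 },
};

int main(void)
{
   for (int i = 0; i < 4; i++) {
      int raw = digests[i].raw_bytes;
      int b64 = ((raw + 2) / 3) * 4;   /* padded base64 text length */
      printf("%-8s raw=%2d bytes  base64 text=%2d bytes\n",
             digests[i].name, raw, b64);
   }
   return 0;
}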
>
> For lstat, we took the Huffman approach, as it was yielding good results.
>
> > Now, I think we can get the biggest gains by working on the lstat
> > packet, and I would recommend that you initially take an entirely
> > different approach.  That is, the current lstat packet is basically a
> > Unix stat structure.  At the time I decided to put it in the database, I
> > considered that the base64 coding did a pretty good job of compressing
> > it, and I was not sure which fields in the packet we would actually
> > need.  Now, after 8 years of use, I think we have a much better idea of
> > what fields are used, and I suspect that by careful examination and by
> > creating our own packet, we can reduce the size of the packet by 50%
> > with no compression at all.  Once that is done, then I think we should
> > look carefully at how to compress it by taking a careful look at the
> > fields.  For example, the st_mode field could probably be reduced from 4
> > bytes to 1 byte in 90% of cases just by recognizing common patterns that
> > appear over and over -- much as you noticed that the lstat packet
> > separates fields with spaces, so there are lots of spaces in it.
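
A minimal sketch of that st_mode idea; the table of "common" modes below
is illustrative, not measured:

#include <stddef.h>
#include <stdint.h>
#include <sys/stat.h>

/* Hypothetical one-byte codes for the modes that cover most files;
 * anything else is sent as an escape byte plus the full 4-byte mode. */
static const mode_t common_modes[] = {
   S_IFREG | 0644, S_IFREG | 0755, S_IFREG | 0600,
   S_IFDIR | 0755, S_IFDIR | 0700, S_IFLNK | 0777,
};
#define NCOMMON     (sizeof(common_modes) / sizeof(common_modes[0]))
#define MODE_ESCAPE 0xFF   /* full mode follows in the packet */

static uint8_t encode_mode(mode_t m)
{
   for (size_t i = 0; i < NCOMMON; i++) {
      if (common_modes[i] == m)
         return (uint8_t)i;      /* 1 byte instead of 4 */
   }
   return MODE_ESCAPE;           /* caller emits the full mode after this */
}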
>
> Of course, getting rid of things that aren't used would be even better
> than what I'm trying to do.  The best approach would of course be to
> split lstat into the few components we really need and use them as
> individual fields in the database.  But that would break a lot of things,
> wouldn't it?

Send me your Volume, Job, and File retention periods, the scheme you use 
for Full, Differential, and Incremental backups, and your desired 
granularity of recovery, and I will tell you whether or not this idea will 
help you.

I suspect that just doing strict pruning will reduce your database size by 
50%.

>
> The approach I'm trying to take for now is completely different.  I'm
> trying to answer the question: what can we do with the current database
> "as is", without touching Bacula's code?  The obvious answer is creating
> dedicated, more space-efficient datatypes.  It's simply a mapping between
> the current base64-text fields and a hidden "bytea" field.
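
For what it is worth, the conversion such a hidden bytea field rests on is
just a base64 decode.  A minimal sketch, assuming the standard base64
alphabet for illustration -- Bacula actually uses its own variant, which a
real datatype would have to mirror:

#include <string.h>

static int b64val(char c)
{
   if (c >= 'A' && c <= 'Z') return c - 'A';
   if (c >= 'a' && c <= 'z') return c - 'a' + 26;
   if (c >= '0' && c <= '9') return c - '0' + 52;
   if (c == '+') return 62;
   if (c == '/') return 63;
   return -1;                      /* padding or invalid character */
}

/* Decodes base64 text into raw bytes; out must hold at least
 * 3 * strlen(in) / 4 bytes.  Returns the number of bytes written,
 * i.e. roughly 3 bytes out for every 4 characters in. */
static size_t b64_decode(const char *in, unsigned char *out)
{
   size_t n = strlen(in), o = 0;
   unsigned acc = 0;
   int bits = 0;
   for (size_t i = 0; i < n; i++) {
      int v = b64val(in[i]);
      if (v < 0)
         break;                    /* stop at '=' padding */
      acc = (acc << 6) | (unsigned)v;
      bits += 6;
      if (bits >= 8) {
         bits -= 8;
         out[o++] = (unsigned char)(acc >> bits);
      }
   }
   return o;
}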

Yes, for someone outside of Bacula that would be a natural reaction, but I 
am not likely to accept code that complicates or kludges the current 
situation.  If any changes are going to be made, I would prefer to 
re-evaluate the current design.

>
> The long-term work should of course be making the File table itself more
> space efficient.  I think the best way would be to get rid of those two
> variable-length fields if possible, and replace them with only what is
> required from lstat.  I'm not that good with the administration of
> Bacula, but do we even need st_mode in the catalog?  I guess things like
> creation and last-modification time are important, but what else is?
> Anyway, this is a lot more work...

Yes, st_mode is required to give the user the ability to do an ls -l.
Getting rid of the variable-length fields is certainly possible, but it 
needs to be carefully examined to ensure that it does not penalize users 
who don't need some of the fields.
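
To illustrate the ls -l point, here is a sketch (not Bacula's actual code)
of rebuilding the permission string from st_mode; setuid, setgid, and
sticky bits are ignored for brevity:

#include <sys/stat.h>

/* Fills out with an ls -l style string, e.g. "drwxr-xr-x". */
static void mode_to_string(mode_t m, char out[11])
{
   const char *rwx = "rwxrwxrwx";
   out[0] = S_ISDIR(m) ? 'd' : S_ISLNK(m) ? 'l' : '-';
   for (int i = 0; i < 9; i++)
      out[i + 1] = (m & (0400 >> i)) ? rwx[i] : '-';
   out[10] = '\0';
}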

>
> The same is true for the MD5 field: maybe a single hash (SHA-256, say)
> would be better?  If the field were fixed size, we could craft a
> fixed-size record and get rid of the variable-length header in the
> database (4 bytes saved... :) )

A single hash is not possible without seriously restricting users' 
flexibility -- the MD5 field, though not yet used entirely as planned in 
2.2 (hopefully it will be later), is a critical field for security and for 
certain government legal requirements on the storage of data.  As legal 
requirements become stricter with increasing computer speed and 
technology, we can expect the need for larger and larger hash codes.  In 
any case, it certainly needs to be user configurable without rebuilding 
the database -- thus variable.  In fact, as it stands, the user can have 
multiple different hash sizes in any given Job, i.e. some files may be 
more important to verify than others.

>
> Anyhow, for now, what I'm proposing is easy to implement, even if it
> would not bring as dramatic an improvement as what you propose (I'm
> talking about saving only 10 to 15%...)
>
> Still, I'd like to know: is the MD5 field always a multiple of 32 bits in
> length?

As already explained, I would be very reluctant to require it to be a 
multiple of 32 bits.  For what it is worth, the common digests today 
happen to be (MD5 is 128 bits, SHA-1 is 160, and the SHA-2 family runs 224 
to 512), but it just takes one genius coming up with a new super-fast 
129-bit algorithm to break that assumption.




