
Re: [Bacula-devel] space saving in the database


I am really happy to see that someone besides myself is interested in this 
problem.  I am quite concerned because since 2.2.0 came out, with its new 
more correct pruning algorithm when finding a Volume, there is a lot less 
unnecessary pruning taking place, and the database for my backups has grown 
from 400MB to 1GB; that is, it has more than doubled.  

My conclusion is that we can reduce the size of users' databases by 50-80% by 
doing what I would call "Strict Pruning", that is, coming up with some 
mechanism by which the user could say that he/she wants pruning to occur as 
close as possible to the specified time limit, rather than only when 
absolutely necessary, as the current code does.

In addition, I suspect that we can get significant gains by adding individual 
retention periods for Differential and Incremental records.  For example, if 
you do Fulls once a month, Differentials once a week, and Incrementals daily, 
and you set File Retention Period to 6 months, depending on what you want, 
you end up with lots and lots of Incremental and some Differential records 
that are not useful.  For example, you might only want to keep Incremental 
records for two weeks or a month.  That is, for 2 weeks or a month, you can 
recover on a daily basis, but for anything older than a month, you could only 
recover to the last Differential (i.e. on a weekly basis).  Likewise, you 
might only want to recover on a weekly basis for say a month or 3 months, 
hence the need for a Differential Retention Period.
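
To make the idea concrete, here is a minimal sketch in Python (not Bacula 
code) of how per-level retention could decide which jobs' File records are 
eligible for pruning.  The level codes, the RETENTION table, the periods in 
it, and the function name are assumptions chosen for illustration; only a 
single File Retention Period exists today.

```python
from datetime import datetime, timedelta

# Hypothetical per-level retention periods. Only the 6 month File Retention
# comes from the example above; the other two values are assumed.
RETENTION = {
    "F": timedelta(days=180),  # Full: 6 months
    "D": timedelta(days=90),   # Differential: 3 months (assumed)
    "I": timedelta(days=30),   # Incremental: 1 month (assumed)
}

def file_records_prunable(level, job_end, now, retention=RETENTION):
    """Return True when the job is older than the retention for its level."""
    return now - job_end > retention[level]

now = datetime(2008, 2, 11)
jobs = [
    ("F", datetime(2007, 9, 1)),
    ("D", datetime(2007, 10, 7)),
    ("I", datetime(2007, 12, 20)),
    ("I", datetime(2008, 2, 1)),
]
# Keep only the jobs whose File records have outlived their level's period.
prunable = [(lvl, end) for lvl, end in jobs if file_records_prunable(lvl, end, now)]
```

With these assumed periods the Full and the recent Incremental are kept, 
while the older Differential and Incremental become prunable well before the 
6 month File Retention Period would have released them.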

Finally on the subject of pruning, jobs are only pruned when they are run, so 
if you run a special job once, it will be pruned only at Volume pruning times 
(typically one year), or if you stop backing up a particular machine, 
the records for that machine will remain in the database until all the 
Volumes containing those records are pruned (possibly much longer than the 
retention period set up for the job).

So, I suggest that some work on pruning would make a major difference:

1. Add a Differential Retention Period
2. Add an Incremental Retention Period
3. Add some mechanism where the user could specify strict pruning, and that 
would start some sort of automatic Admin job that would apply strict pruning 
(perhaps just adding a "Strict Pruning" directive to an Admin job would trigger 
the new code). Since pruning can take enormous amounts of time, I would 
recommend that we find some way to limit the amount of time the strict 
pruning Admin job runs.
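
One possible way to bound the run time of item 3 is a simple time budget.  
The following Python sketch is only an illustration: strict_prune, prune_job 
and max_seconds are hypothetical names, and the real work of deleting catalog 
records is left to the prune_job callback.

```python
import time

def strict_prune(expired_job_ids, prune_job, max_seconds, clock=time.monotonic):
    """Prune expired jobs one by one until done or the time budget is used up.

    Returns the job ids that were not reached, so a later run can resume.
    """
    deadline = clock() + max_seconds
    pending = list(expired_job_ids)
    # The budget is checked before each job, so one slow job can overrun it.
    while pending and clock() < deadline:
        prune_job(pending.pop(0))
    return pending
```

Because the deadline is only checked between jobs, the limit is approximate; 
a stricter limit would need a check inside the per-job work as well.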

Now, let me say a bit about the approach you are taking to reduce the space 
used by File records.

My first comment is that both lstat and MD5 records are variable length. The 
length is known when the record is sent to the Director, so any attempt to 
fix the length will lead to failure.  In the case of lstat, the size of the 
fields can vary from OS to OS (32 bit vs 64 bit), and depending on the OS and 
whether or not the lstat involves a hardlink, there will be more fields in 
the lstat structure.  In the case of the MD5 record, it is really an old name 
that no longer applies -- it should be called Hash or something.  It can 
contain an MD5 or an SHA1, or any of a number of larger hash codes.  We could 
probably reduce the space by keeping it in binary, but then it will no longer 
be human readable, so wherever it is printed we would need to add 
additional code to convert it to ASCII.
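
As a rough illustration of the trade-off, the Python sketch below compares 
the size of a digest kept in binary with its hex and base64 text forms, and 
shows the extra conversion step needed for printing.  It uses the standard 
base64 alphabet with padding stripped, which is an assumption and not 
necessarily the encoding Bacula itself uses; hash_sizes and to_printable are 
hypothetical helper names.

```python
import base64
import hashlib

def hash_sizes(data, algorithm):
    """Return the stored size in bytes of a digest in three representations."""
    digest = hashlib.new(algorithm, data).digest()
    return {
        "binary": len(digest),
        "hex": len(digest.hex()),
        # Standard base64 with the trailing '=' padding removed (assumed).
        "base64": len(base64.b64encode(digest).rstrip(b"=")),
    }

def to_printable(digest):
    """Convert a binary digest to ASCII for display."""
    return base64.b64encode(digest).decode("ascii").rstrip("=")

md5_sizes = hash_sizes(b"example file contents", "md5")
sha1_sizes = hash_sizes(b"example file contents", "sha1")
```

For MD5 this gives 16 bytes in binary against 22 characters in unpadded 
base64, and for SHA1 20 bytes against 27 characters.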

Now, I think we can get most gains by working on the lstat packet, and I would 
recommend that you initially take an entirely different approach.  That is, 
the current lstat structure is basically a Unix stat structure. At the time I 
decided to put it in the database I considered that the base64 coding did a 
pretty good job of compressing it, and I was not sure which fields in the 
packet we would actually need.  Now after 8 years of use, I think we have a 
much better idea of what fields are used, and I suspect that by careful 
examination and by creating our own packet, we can with no compression reduce 
the size of the packet by 50%.  Once that is done, then I think we should 
look carefully at how to compress it by taking a careful look at the fields.  
For example the st_mode field could probably be reduced in 90% of the cases 
from 4 bytes to 1 byte just by recognizing common patterns that appear over 
and over, much as you realized that the lstat packet separates fields with 
spaces, so there are lots of spaces in it.
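
The st_mode idea could look something like the following Python sketch.  The 
table of common modes, the escape byte, and the function names are all 
assumptions for illustration; a real table would have to be built from 
statistics gathered on actual catalogs.

```python
# Hypothetical table of frequent st_mode values (file type plus permissions).
COMMON_MODES = [0o100644, 0o100755, 0o040755, 0o120777, 0o100600, 0o100664]
ESCAPE = 0xFF  # marks a mode that is not in the table

def encode_mode(mode):
    """Encode st_mode as 1 byte when common, else escape byte + 4 bytes."""
    if mode in COMMON_MODES:
        return bytes([COMMON_MODES.index(mode)])
    return bytes([ESCAPE]) + mode.to_bytes(4, "big")

def decode_mode(buf):
    """Return (mode, number_of_bytes_consumed)."""
    if buf[0] == ESCAPE:
        return int.from_bytes(buf[1:5], "big"), 5
    return COMMON_MODES[buf[0]], 1
```

Common modes cost 1 byte and rare ones cost 5, so the scheme only pays off if 
the table really does cover the large majority of records.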

Anyway, those are my thoughts on this.  

What do you think?


On Monday 11 February 2008 14.39:18 Cousin Marc wrote:
> Here are the current pieces of code.
> If you want to try it, here's how :
> - first, you have to be using pg 8.3 (I'll backport, there is very little
> work to do on that)
> - second, a word of caution : it's a postgresql/C stored procedure. It
> means it can crash postgresql. It doesn't do it on my computer (anymore :)
> ). Don't use it on production for now (I guess nobody would do it, but I
> just don't want to take the blame :) )
> - run make.sh, it will create a .so file, that has to go in postgresql's
> lib directory (it won't copy it there, just to /tmp ... nevermind, it will
> be cleaner later too...)
> - then, run the SQL script. It will create a new lstat type, that you can
> use in place of the text type you have right now for lstat (you can't
> 'alter table alter column type', you have to create a new table ...)
> I've included a pl script. That's the one I used to generate the huffman
> tree (it comes from here: http://www.perlmonks.org/?node_id=603111 )
> Then I put statistics from our own database (corrected so that anything
> except A,B and ' ' are equiprobable). It's dirty as it was a one pass job,
> but its output is easier to read than the constants from the C code.
> > Please post it.  I think seeing it, knowing how it is used, and how it
> > works will be useful in the evaluation of the patch.   Well, it's not a
> > Bacula patch, it's a patch against the database.
> >
> > > Anyway, if someone can remove my doubts about bacula's base64, I'd be
> > > very thankful.
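
For readers unfamiliar with the technique mentioned in the quoted message, 
here is a minimal Python sketch of building a Huffman code from symbol 
statistics.  It is not the posted Perl script or the C constants; the 
function name and the example weights are made up for illustration.

```python
import heapq
from itertools import count

def huffman_code(freqs):
    """Build a prefix code {symbol: bitstring} from {symbol: weight}."""
    tiebreak = count()  # keeps heap entries comparable when weights are equal
    heap = [(w, next(tiebreak), {sym: ""}) for sym, w in freqs.items()]
    heapq.heapify(heap)
    if len(heap) == 1:
        # A single symbol still needs one bit.
        return {sym: "0" for sym in heap[0][2]}
    while len(heap) > 1:
        # Merge the two lightest subtrees, prefixing their codes with 0 and 1.
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, next(tiebreak), merged))
    return heap[0][2]

# Made-up weights: ' ', 'A' and 'B' are frequent, the rest are equiprobable.
example_freqs = {" ": 40, "A": 30, "B": 10, "C": 5, "D": 5, "E": 5, "F": 5}
codes = huffman_code(example_freqs)
```

The most frequent symbols receive the shortest bit strings, which is where 
the space saving on a space-separated lstat text would come from.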

Bacula-devel mailing list
