
Re: [Bacula-devel] space saving in the database

On Tuesday 12 February 2008 13.44:51 Cousin Marc wrote:
> > Send me your Volume, Job, and File retention periods and the scheme you
> > use for doing Full, Differential, and Incremental backups and your
> > desired granularity of recovery, and I will tell you whether or not this
> > idea will help you.
> >
> > I suspect that just doing strict pruning will reduce your database size
> > by 50%.
> Maybe Eric will, he's the one doing the setups for pruning ... But I think
> what our administrators want is to reduce pruning to what's really not on
> tapes, because the tapes have been erased.
> We're not worried about having a very big database. We're just trying to
> have it as small as possible for our current use.

I think you both are missing the point about pruning.  The problem is there is 
no strict pruning defined, so if you set Volume retention very large (default 
is 1 year), then you will be almost guaranteed to be carrying some File 
records around for 1 year, which is not what you want.  

Also if you erase tapes without doing a Bacula purge of the tape, you will end 
up with huge amounts of orphaned data in the catalog. I doubt you are doing 
that, but reading what you wrote above leaves the question open. 

Right now, you are focused on a 10% space saving, which we definitely need to 
do, but professional performance management dictates that if an easy 50% gain 
is possible one should first work on that ... :-)

> > Yes, for someone outside of Bacula that would be a natural reaction, but
> > I am not likely to accept code that complicates or kludges the current
> > situation. I would prefer that if any changes are going to be made to
> > re-evaluate the current design.
> There will be no changes in bacula, that's the whole point. It's more along
> the lines of creating 'dedicated' data types in postgresql to
> handle 'wasteful' base64 encoded fields a bit better. That's just a short
> term optimization.

I am not much in favor of doing things outside of Bacula because that 
makes it more complicated for the user to set up and is probably database 
dependent.  In addition, Bacula knows more about its data and how it is used, 
so the first thing to do, IMO, is to optimize what Bacula is using.

> > One hash is not possible without seriously restricting user's flexibility
> > -- the MD5 field though not totally used as planned in 2.2. (hopefully it
> > will be in 2.2) is a critical field for security and certain government
> > legal requirements for the storage of data.  As legal requirements become
> > more strict with increasing computer speed/technology we can expect the
> > need for larger and larger hash codes.  In any case, it certainly needs
> > to be user configurable without rebuilding the database -- thus variable.
> >  In fact, as it stands, the user can have multiple different MD5 sizes in
> > any given Job. I.e. some files may be more important to verify than
> > others.
> I understand that md5 is required, as it's the only way of reliably
> checking that a file has not been modified. But only one type of checksum
> may be more efficient from a database point of view, as it could be fixed
> size (no need to waste 4 bytes for instance in postgresql telling the
> engine: be careful, next field is variable length, here is its size). Of
> course, going from md5 to sha256 sacrifices 16 bytes... 

I think you are spending a lot of energy worrying about 4 bytes.  We could 
probably tell the DB that the string is a maximum of 64K and that should cut 
it down to two bytes; 64K is more than sufficient.  We could even restrict 
the size to a maximum of 255 without much problem, and that would mean that 
there is only one byte needed to store the length. 
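
As a rough illustration of the length-prefix arithmetic above, here is a
minimal sketch. The helper name length_prefix_bytes is hypothetical, and the
result is only a lower bound: real database engines choose from a fixed set of
header sizes, so the actual per-field overhead may differ.

```python
def length_prefix_bytes(max_len: int) -> int:
    """Smallest number of whole bytes able to hold any length in 0..max_len."""
    if max_len < 0:
        raise ValueError("max_len must be non-negative")
    # bit_length() is the number of bits needed for max_len; round up to bytes.
    return max(1, (max_len.bit_length() + 7) // 8)


if __name__ == "__main__":
    # Illustrative limits only; none of these are Bacula or database settings.
    for limit in (255, 512, 65535, 2**31 - 1):
        print(f"max length {limit}: {length_prefix_bytes(limit)} byte(s)")
```

Under this simple model, a one-byte prefix covers lengths up to 255 and a
two-byte prefix covers lengths up to 65535.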

> I don't know if 
> there could be an efficient way of doing this. Anyway, base64 "wastes" more
> space in this scheme, so a transparent conversion at database level may be
> useful.

Well, if we are talking about PostgreSQL specific modifications, I am not too 
enthusiastic, for several reasons, some of which I have already mentioned: 1. 
Bacula needs to remain as database neutral as possible.  2. We (Bacula) 
cannot support changes or code that is database dependent and stored in the 
database engine -- it is just too complicated. Some people might want such a 
patch, but it is not something we could support because we don't have the 
expertise -- for Bacula, I would like the SQL engines to remain black boxes 
that we tweak in the minimum possible ways.

I have proposed that we revisit both which fields lstat has and how it is 
stored (currently base64), and with only a small amount of work we can 
convert the MD5 record into binary, which will shorten the field. 
This could eliminate what you claim is base64 wasting space.   
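
To put numbers on the binary-versus-base64 question for the digest field,
here is a small sketch using Python's standard hashlib and base64 modules. It
assumes standard base64 with the trailing padding stripped; whether the
catalog's encoding matches that exactly is an assumption, and the helper name
digest_sizes and the sample input are purely illustrative.

```python
import base64
import hashlib


def digest_sizes(algorithm: str, data: bytes = b"example file contents"):
    """Return (raw digest length, unpadded base64 length) for one algorithm."""
    raw = hashlib.new(algorithm, data).digest()
    # Standard base64 pads with '='; strip it to count only the data characters.
    encoded = base64.b64encode(raw).rstrip(b"=")
    return len(raw), len(encoded)


if __name__ == "__main__":
    for name in ("md5", "sha1", "sha256"):
        raw_len, b64_len = digest_sizes(name)
        print(f"{name}: {raw_len} bytes binary, {b64_len} chars base64")
```

Base64 carries 6 bits per character, so a fixed-size digest always grows by
about a third when encoded this way.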

By the way, my tests at the time I did them, 8 years ago, indicated that 
base64 encoding cut the size of the lstat record roughly in half.  For 
example, a zero, instead of taking up 4 or 8 bytes, was reduced to 2 bytes, 
and so on, so I am not sure why you say that base64 wastes space unless you 
are referring to the MD5.  I chose base64 because at the time (8 years ago) 
it was not easy to store binary data in all the databases.  I don't think 
that constraint holds any longer.
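
The reason base64 can shrink a record of mostly small integers is that each
number is written with only as many 6-bit digits as it needs. The following
is a sketch of that idea, not Bacula's actual routine: the function names,
the space separator, and the sample field values are all assumptions made for
illustration.

```python
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def encode_int(value: int) -> str:
    """Write an integer as base64 digits, most significant first, no padding."""
    if value < 0:
        return "-" + encode_int(-value)
    if value == 0:
        return ALPHABET[0]
    digits = []
    while value:
        digits.append(ALPHABET[value & 0x3F])  # low 6 bits -> one digit
        value >>= 6
    return "".join(reversed(digits))


def encode_fields(fields) -> str:
    """Join the encoded integers with single spaces (assumed separator)."""
    return " ".join(encode_int(f) for f in fields)


if __name__ == "__main__":
    # Made-up stat-like values: a zero, a file mode, a link count, a uid, a time.
    sample = [0, 0o100644, 1, 1000, 1202820291]
    text = encode_fields(sample)
    print(text)
    print(f"{len(text)} bytes as text vs {8 * len(sample)} as fixed 8-byte ints")
```

In this scheme a zero costs one digit plus one separator, which matches the
2 bytes mentioned above, while large values such as timestamps still need
several digits.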

> > As already explained, I would be very reluctant to make it a requirement
> > to be a multiple of 32 bits.  It just takes one genius to come up with a
> > new super fast algorithm that uses 129 bits to break the code.
> Okay, I'll experiment with both. For us right now, a byte per record is
> only 300MB in database size :)

This could be reduced to 150MB per byte by doing proper pruning.
