
Re: [Bacula-devel] space saving in the database

On Tuesday 12 February 2008 15.52:36 Bill Moran wrote:
> In response to Cousin Marc <mcousin@xxxxxxxx>:
> [snip]
> > > One hash is not possible without seriously restricting user's
> > > flexibility -- the MD5 field though not totally used as planned in 2.2.
> > > (hopefully it will be in 2.2) is a critical field for security and
> > > certain government legal requirements for the storage of data.  As
> > > legal requirements become more strict with increasing computer
> > > speed/technology we can expect the need for larger and larger hash
> > > codes.  In any case, it certainly needs to be user configurable without
> > > rebuilding the database -- thus variable.  In fact, as it stands, the
> > > user can have multiple different MD5 sizes in any given Job. I.e. some
> > > files may be more important to verify than others.
> >
> > I understand that md5 is required, as it's the only way of reliably
> > checking that a file has not been modified. But only one type of checksum
> > may be more efficient from a database point of view, as it could be fixed
> > > size (no need to waste 4 bytes, for instance in PostgreSQL, telling the
> > > engine: be careful, next field is variable length, here is its size).
> > Of course, going from md5 to sha256 sacrifices 16 bytes... I don't know
> > if there could be an efficient way of doing this. Anyway, base64 "wastes"
> > more space in this scheme, so a transparent conversion at database level
> > may be useful.
> >
> > > As already explained, I would be very reluctant to make it a
> > > requirement to be a multiple of 32 bits.  It just takes one genius to
> > > come up with a new super fast algorithm that uses 129 bits to break the
> > > code.
> >
> > Okay, I'll experiment with both. For us right now, a byte per record is
> > only 300MB in database size :)
> Any time you look at complicating things to improve efficiency, there's
> the question "is it worth it?"
>
> On the larger of our two Bacula servers, the database size is 8.5G.  The
> file table contains 35 million rows.  If you can save 16 bytes per row,
> that means an on-disk savings of 1/2G.
>
> My reaction to that would be "big friggin deal".  Considering the fact
> that we've got 750G of file volumes on a RAID 5, saving 500M on the
> database doesn't really seem worth the effort to me.
>
> Let's say Bacula moves to using SHA-256 hashes instead of md5.  Now the
> savings in storage space is 32 bytes instead of 16 bytes.  So, I'd be
> saving a whole G on the total database size.  I still say, "why bother?"
>
> Bacula works just dandy for us with these sizes.  If I do a "list jobs
> where a given file is saved" on the largest of our servers, the response
> is fast enough that I don't even consider it a wait.  Quite honestly, it's
> fast enough that I have trouble believing that it doesn't take longer.
> Nearly instantaneous.
>
> Just my opinion, of course.  I'd be interested to hear how much effect
> this would have on others and whether they think it's worthwhile to even
> investigate.
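
The figures quoted above can be checked with a few lines of Python. This is
only a rough sketch: the function names are made up for illustration, the row
count is the 35 million mentioned in the quoted message, and it ignores any
per-row or per-field overhead the database engine adds on top of the raw
bytes.

```python
import base64
import hashlib


def stored_sizes(algorithm: str) -> dict:
    """Size of one digest as raw bytes and as unpadded base64 text."""
    digest = hashlib.new(algorithm, b"example file contents").digest()
    unpadded = base64.b64encode(digest).rstrip(b"=")
    return {"raw_bytes": len(digest), "base64_chars": len(unpadded)}


def total_savings_bytes(rows: int, bytes_per_row: int) -> int:
    """Bytes saved across the table if every row shrinks by bytes_per_row."""
    return rows * bytes_per_row


if __name__ == "__main__":
    rows = 35_000_000  # file table size quoted in the message above
    for name in ("md5", "sha256"):
        print(name, stored_sizes(name))
    for per_row in (16, 32):
        saved = total_savings_bytes(rows, per_row)
        print(f"{per_row} bytes/row over {rows} rows: {saved / 2**30:.2f} GiB")
```

Under these assumptions, 16 bytes per row comes to roughly half a gigabyte
and 32 bytes per row to roughly one gigabyte, which matches the estimates in
the quoted message.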

It is nice to hear this point of view :-)

I think giving the user the ability to set Strict Pruning (i.e. all records 
disappear within a day or two of the retention periods you set) could reduce 
the database size by about 50%.  This is significant, and Strict Pruning is 
what most people expect when they first encounter Bacula pruning, so it 
is very logical.  Most people don't understand why tapes are not recycled 
immediately when the expiration date arrives.  Obviously some people may 
prefer the current pruning, which basically prunes only when really required.  
It is trivial to offer both ways of doing things -- the hardest part is 
explaining it clearly in the docs.

Also, I do think it would be a good idea to look at the details of the lstat 
packet.  It is probably 20-50% larger than needed without even thinking of 
doing any compression.  Also, we might be able to break the fields of the 
lstat packet out rather than have them encoded.  This would make certain 
lookups in the database possible using SQL only rather than SQL + C 
processing ...  Not a big deal for most users, but interesting.  Finally, my 
hidden agenda is that the stat packet is not currently completely system 
independent, and I would like to make it so for good form.
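
To make the "SQL only rather than SQL + C" point concrete, here is a small 
Python sketch of what decoding an encoded stat packet into separate fields 
could look like.  The encoding used here (space-separated integers written 
with base64 digits) and the field list are assumptions chosen for 
illustration, not a statement of Bacula's actual on-disk format, and all the 
names are hypothetical.

```python
# Assumed digit alphabet: the standard base64 characters used as base-64 digits.
B64_DIGITS = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
DIGIT_VALUE = {ch: i for i, ch in enumerate(B64_DIGITS)}

# Hypothetical field order for the packet; the real order may differ.
FIELDS = ("st_dev", "st_ino", "st_mode", "st_nlink", "st_uid", "st_gid",
          "st_rdev", "st_size", "st_blksize", "st_blocks",
          "st_atime", "st_mtime", "st_ctime")


def encode_int(value: int) -> str:
    """Write a non-negative integer in base 64 using B64_DIGITS."""
    if value < 0:
        raise ValueError("only non-negative integers are supported")
    if value == 0:
        return B64_DIGITS[0]
    digits = []
    while value:
        value, rem = divmod(value, 64)
        digits.append(B64_DIGITS[rem])
    return "".join(reversed(digits))


def decode_int(text: str) -> int:
    """Inverse of encode_int."""
    value = 0
    for ch in text:
        value = value * 64 + DIGIT_VALUE[ch]
    return value


def encode_packet(stat: dict) -> str:
    """Pack the fields into one opaque, space-separated string."""
    return " ".join(encode_int(stat[name]) for name in FIELDS)


def decode_packet(packet: str) -> dict:
    """Break the opaque string back out into named integer fields."""
    return {name: decode_int(part) for name, part in zip(FIELDS, packet.split())}


if __name__ == "__main__":
    sample = {name: i * 1000 for i, name in enumerate(FIELDS)}
    packet = encode_packet(sample)
    print(packet)
    print(decode_packet(packet)["st_size"])
```

With the fields stored as ordinary integer columns instead of one encoded 
string, a question such as "which files are larger than N bytes" becomes a 
plain WHERE clause, with no decoding step outside the database.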

