Re: [Bacula-devel] space saving in the database
In response to Cousin Marc <mcousin@xxxxxxxx>:
> > One hash is not possible without seriously restricting user's flexibility
> > -- the MD5 field though not totally used as planned in 2.2. (hopefully it
> > will be in 2.2) is a critical field for security and certain government
> > legal requirements for the storage of data. As legal requirements become
> > more strict with increasing computer speed/technology we can expect the
> > need for larger and larger hash codes. In any case, it certainly needs to
> > be user configurable without rebuilding the database -- thus variable. In
> > fact, as it stands, the user can have multiple different MD5 sizes in any
> > given Job. I.e. some files may be more important to verify than others.
> I understand that md5 is required, as it's the only way of reliably checking
> that a file has not been modified. But only one type of checksum may be more
> efficient from a database point of view, as it could be fixed size (no need
> to waste 4 bytes for instance in postgresql telling the engine: be careful,
> next field is variable length, here is its size). Of course, going from md5
> to sha256 sacrifices 16 bytes... I don't know if there could be an efficient
> way of doing this. Anyway, base64 "wastes" more space in this scheme, so a
> transparent conversion at database level may be useful.
> > As already explained, I would be very reluctant to make it a requirement to
> > be a multiple of 32 bits. It just takes one genius to come up with a new
> > super fast algorithm that uses 129 bits to break the code.
> Okay, I'll experiment with both. For us right now, a byte per record is only
> 300MB in database size :)
Any time you look at complicating things to improve efficiency, there's
the question "is it worth it?"
On the larger of our two Bacula servers, the database size is 8.5G. The
file table contains 35 million rows. If you can save 16 bytes per row,
that means an on-disk savings of 1/2G.
My reaction to that would be "big friggin deal". Considering the fact
that we've got 750G of file volumes on a RAID 5, saving 500M on the
database doesn't really seem worth the effort to me.
Let's say Bacula moves to using SHA-256 hashes instead of md5. Now the
savings in storage space is 32 bytes instead of 16 bytes. So, I'd be
saving a whole G on the total database size. I still say, "why bother?"
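For anyone who wants to plug in their own numbers, here is a rough sketch of the arithmetic above in Python. The row count, database size, and per-row savings are the figures from this message; the function and variable names are made up for illustration. It ignores per-row overhead, indexes, and page fill, so treat the results as a ballpark only.

```python
# Back-of-the-envelope estimate of the disk space saved by shaving
# a fixed number of bytes off every row of the File table.
# Figures are the ones quoted in this message; names are illustrative.

ROWS = 35_000_000          # rows in the File table
DB_SIZE_GIB = 8.5          # total database size, treated as GiB (assumption)


def savings_bytes(rows, bytes_per_row):
    """Raw bytes saved if every row shrinks by bytes_per_row."""
    return rows * bytes_per_row


def as_gib(n_bytes):
    """Convert a byte count to GiB (2**30 bytes)."""
    return n_bytes / 2**30


md5_case = savings_bytes(ROWS, 16)      # the 16-bytes-per-row case
sha256_case = savings_bytes(ROWS, 32)   # the 32-bytes-per-row case

for label, saved in (("16 bytes/row", md5_case), ("32 bytes/row", sha256_case)):
    share = as_gib(saved) / DB_SIZE_GIB
    print(f"{label}: {as_gib(saved):.2f} GiB saved, "
          f"{share:.1%} of the database")
```

Changing ROWS and DB_SIZE_GIB to match another installation gives a quick idea of whether the savings matter there.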
Bacula works just dandy for us with these sizes. If I do a "list jobs
where a given file is saved" on the largest of our servers, the response
is fast enough that I don't even consider it a wait. Quite honestly, it's
fast enough that I have trouble believing that it doesn't take longer.
Just my opinion, of course. I'd be interested to hear how much effect
this would have on others and whether they think it's worthwhile to even
Collaborative Fusion Inc.