Re: [Bacula-devel] space saving in the database
On Tuesday 12 February 2008 15.59:10 Jason A. Kates wrote:
> I too think that we should defer to Kern and his list of priorities.
Thanks for your vote of confidence :-)
It is interesting because this is the first time I have not worked on the
highest priority job (Item 1: Accurate restoration of renamed/deleted files),
which was also the project that interested me the most.
Instead, I am working on plugins (item 12: Add Plug-ins to the FileSet Include
statements), which I decided to work on because it is the #1 most requested
feature for enterprises (they want to be able to back up MS Exchange with
a "module"). Well, I wasn't much enjoying the project, because it is a lot
of *really* heavy design and delicate integration with Bacula, but now that I
am into it, it is getting really interesting.
And the real nice part is that a very kind programmer came along and is making
very good progress on the Accurate Backup project :-)
Also another kind programmer came along and is making great progress on Item
h7: Commercial database support, which is the #2 most demanded enterprise
feature -- I still have to figure out how to make this work legally/morally
with our Open Source license ...
> Kern has been doing a great job of prioritizing and running this
> project. I think that people that are that tight on storage should
> prune the files a few days earlier and let Kern work on functionality;
> disk space is fairly inexpensive.
> On Tue, 2008-02-12 at 09:52 -0500, Bill Moran wrote:
> > In response to Cousin Marc <mcousin@xxxxxxxx>:
> > [snip]
> > > > One hash is not possible without seriously restricting user's
> > > > flexibility -- the MD5 field though not totally used as planned in
> > > > 2.2. (hopefully it will be in 2.2) is a critical field for security
> > > > and certain government legal requirements for the storage of data.
> > > > As legal requirements become more strict with increasing computer
> > > > speed/technology we can expect the need for larger and larger hash
> > > > codes. In any case, it certainly needs to be user configurable
> > > > without rebuilding the database -- thus variable. In fact, as it
> > > > stands, the user can have multiple different MD5 sizes in any given
> > > > Job. I.e. some files may be more important to verify than others.
> > >
> > > I understand that md5 is required, as it's the only way of reliably
> > > checking that a file has not been modified. But only one type of
> > > checksum may be more efficient from a database point of view, as it
> > > could be fixed size (no need to waste 4 bytes for instance in
> > > > PostgreSQL telling the engine: be careful, next field is variable
> > > > length, here is its size). Of course, going from md5 to sha256
> > > sacrifices 16 bytes... I don't know if there could be an efficient way
> > > of doing this. Anyway, base64 "wastes" more space in this scheme, so a
> > > transparent conversion at database level may be useful.
> > >
> > > > As already explained, I would be very reluctant to make it a
> > > > requirement to be a multiple of 32 bits. It just takes one genius to
> > > > come up with a new super fast algorithm that uses 129 bits to break
> > > > the code.
> > >
> > > Okay, I'll experiment with both. For us right now, a byte per record is
> > > only 300MB in database size :)
> > Any time you look at complicating things to improve efficiency, there's
> > the question "is it worth it".
> > On the larger of our two Bacula servers, the database size is 8.5G. The
> > file table contains 35 million rows. If you can save 16 bytes per row,
> > that means an on-disk savings of 1/2G.
> > My reaction to that would be "big friggin deal". Considering the fact
> > that we've got 750G of file volumes on a RAID 5, saving 500M on the
> > database doesn't really seem worth the effort to me.
> > Let's say Bacula moves to using SHA-256 hashes instead of md5. Now the
> > savings in storage space is 32 bytes instead of 16 bytes. So, I'd be
> > saving a whole G on the total database size. I still say, "why bother?"
> > Bacula works just dandy for us with these sizes. If I do a "list jobs
> > where a given file is saved" on the largest of our servers, the response
> > is fast enough that I don't even consider it a wait. Quite honestly,
> > it's fast enough that I have trouble believing that it doesn't take
> > longer. Nearly instantaneous.
> > Just my opinion, of course. I'd be interested to hear how much effect
> > this would have on others and whether they think it's worthwhile to even
> > investigate.
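
The size estimates quoted above are easy to check. What follows is only a
minimal Python sketch, assuming nothing beyond the figures given in the thread
(35 million rows in the file table, 16 or 32 bytes saved per row). The names
digest_sizes and table_savings_bytes are hypothetical helpers made up for
illustration, and the unpadded base64 length is an assumption about how a hash
might be stored as text, not a statement about Bacula's actual schema.

```python
import base64
import hashlib


def digest_sizes(algorithm: str) -> tuple[int, int]:
    """Return (raw digest bytes, base64 characters without '=' padding)."""
    raw = hashlib.new(algorithm, b"example").digest()
    encoded = base64.b64encode(raw).rstrip(b"=")
    return len(raw), len(encoded)


def table_savings_bytes(rows: int, bytes_per_row: int) -> int:
    """Total bytes saved if every row shrinks by bytes_per_row."""
    return rows * bytes_per_row


ROWS = 35_000_000  # file table row count quoted in the thread

for algorithm in ("md5", "sha256"):
    raw_len, b64_len = digest_sizes(algorithm)
    print(f"{algorithm}: {raw_len} raw bytes, {b64_len} base64 characters")

for per_row in (16, 32):
    saved = table_savings_bytes(ROWS, per_row)
    print(f"{per_row} bytes/row x {ROWS:,} rows = {saved / 2**30:.2f} GiB")
```

Per-row overhead such as a variable-length header or index entries is left out
on purpose, so the result is a rough figure of the same order as the "1/2G"
and "a whole G" numbers above, not a measurement.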