[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Bacula-devel] space saving in the database

Cousin Marc wrote:
> Hi,
> First an introduction on what I'm trying to do, so you know what we're playing 
> with :
> Eric Bollengier and I have been discussing about saving space in the database, 
> as ours is getting bigger and bigger (we're nearly at 300 million rows in 
> file table).
> We're starting with the most obvious : try to make the records more compact on 
> disc.  There's a lot of space used by the indexes too, but I don't know yet 
> if it would be possible to have something smaller (they are all used...)
> We're using PostgreSQL 8.2 on production now, so here are our stats.
> What we have now as space used is :
> - File is 108 GB for 275845760 records (size includes indexes)
> Space is used as follows :
> * Indexes :
> file_jpfid_idx -> 13 GB
> file_fp_idx -> 10GB
> file_jobid_idx -> 9GB
> file_pkey -> 8GB
> * Table itself (68GB):
> Administrative overhead (rows headers, block headers)
> block headers : 20 bytes per page (108GB equals 14 million pages) => 276 MB
> row header : 4 bytes in the block header + 27 bytes in PG 8.2 = 8.5 GB
> NB : by using PG 8.3, we can save 4 bytes per row -> "only 7.4 GB"
> Real data :
> SELECT attname, avg_width from pg_stats where tablename = 'file';
>   attname   | avg_width
> ------------+-----------
>  pathid     |         4
>  filenameid |         4
>  lstat      |        59
>  md5        |        24
>  fileid     |         4
>  fileindex  |         4
>  jobid      |         4
>  markid     |         4
> everything is int4 except lstat and md5.
> So 4 bytes for every field, except an average of 59 (+4 bytes of overhead) for 
> lstat and 24 (+4 bytes of overhead) for md5. A row is about 115 bytes, with 
> 63 bytes for lstat and 28 for md5.
> So, something around 40GB of data. There is also 30% of free space (too much, 
> I think, we have to investigate this part)
> Of the 40GB of data, 
> So, I now get to the point :
> - we want to try to save some space in file table, so the obvious targets are 
> lstat and md5.
> - we don't want to change bacula code, so the database has to do the dirty 
> work...
> For now, I've been working on the lstat field. It is base64-encoded, and 
> contains mostly spaces, As and Bs.
> What I've done is create a new datatype 'lstat', and code a huffman 
> compress/decompress in C, with static weights for every symbol (' ',A,B 
> vastly outweighting the rest of the codes). The savings are around 12% for 
> the table (of course, we get nothing on the indexes as lstat isn't indexed). 
> It costs CPU (about 1s of CPU for 300000 encoded + decoded lstats on my PC), 
> but may be optimized as I'm not that good at C programming.
> I now want to try to save some more space on md5 (the gain will be smaller).
> md5 is base64 encoded too, but every char has the same weight, so all I can 
> try is save some space by putting it back in raw form (it will be about 3/4 
> of the original size, a bit more as the variable size of the field will add 
> the same overhead). The problem is that there doesn't seem to be an end of 
> stream indicator, because the decoding part of bacula's code knows the size 
> of what it wants to retrieve.
> What I need now is to know exactly what I can expect in the lstat so that I 
> can encode and decode it correctly in the database backend. Is it 32 bytes 
> aligned ? (so I can discard the last few bits I may decode).
> For instance, the current md5 is 22 chars, so it would decode 22*6=132 bytes. 
> Are the last 4 discarded ?
> If I can't make such hypothesis, I will have to put a size header in my 
> compressed field. Even 1 byte is 300MB now for our database and will be even 
> worse in the future...
> Of course, if anyone is interested in the code, I will happily provide it as 
> is, but I guess that Eric will commit it in the repository when he thinks my 
> work is usable. For now, it only supports postgresql 8.3, and there is no 
> automation for installing it...

Please post it.  I think seeing it, knowing how it is used, and how it
works will be useful in the evaluation of the patch.   Well, it's not a
Bacula patch, it's a patch against the database.

> Anyway, if someone can remove my doubts about bacula's base64, I'd be very 
> thankful.

Dan Langille - http://www.langille.org/
BSDCan - The Technical BSD Conference: http://www.bsdcan.org/
PGCon  - The PostgreSQL Conference:    http://www.pgcon.org/

This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
Bacula-devel mailing list

This mailing list archive is a service of Copilot Consulting.