[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Bacula-devel] space saving in the database


Hi,

First an introduction on what I'm trying to do, so you know what we're playing 
with :

Eric Bollengier and I have been discussing about saving space in the database, 
as ours is getting bigger and bigger (we're nearly at 300 million rows in 
file table).

We're starting with the most obvious : try to make the records more compact on 
disc.  There's a lot of space used by the indexes too, but I don't know yet 
if it would be possible to have something smaller (they are all used...)

We're using PostgreSQL 8.2 on production now, so here are our stats.

What we have now as space used is :
- File is 108 GB for 275845760 records (size includes indexes)
Space is used as follows :
* Indexes :
file_jpfid_idx -> 13 GB
file_fp_idx -> 10GB
file_jobid_idx -> 9GB
file_pkey -> 8GB
* Table itself (68GB):
Administrative overhead (rows headers, block headers)
block headers : 20 bytes per page (108GB equals 14 million pages) => 276 MB
row header : 4 bytes in the block header + 27 bytes in PG 8.2 = 8.5 GB
NB : by using PG 8.3, we can save 4 bytes per row -> "only 7.4 GB"

Real data :
SELECT attname, avg_width from pg_stats where tablename = 'file';
  attname   | avg_width
------------+-----------
 pathid     |         4
 filenameid |         4
 lstat      |        59
 md5        |        24
 fileid     |         4
 fileindex  |         4
 jobid      |         4
 markid     |         4

everything is int4 except lstat and md5.
So 4 bytes for every field, except an average of 59 (+4 bytes of overhead) for 
lstat and 24 (+4 bytes of overhead) for md5. A row is about 115 bytes, with 
63 bytes for lstat and 28 for md5.

So, something around 40GB of data. There is also 30% of free space (too much, 
I think, we have to investigate this part)

Of the 40GB of data, 


So, I now get to the point :
- we want to try to save some space in file table, so the obvious targets are 
lstat and md5.
- we don't want to change bacula code, so the database has to do the dirty 
work...
For now, I've been working on the lstat field. It is base64-encoded, and 
contains mostly spaces, As and Bs.

What I've done is create a new datatype 'lstat', and code a huffman 
compress/decompress in C, with static weights for every symbol (' ',A,B 
vastly outweighting the rest of the codes). The savings are around 12% for 
the table (of course, we get nothing on the indexes as lstat isn't indexed). 
It costs CPU (about 1s of CPU for 300000 encoded + decoded lstats on my PC), 
but may be optimized as I'm not that good at C programming.


I now want to try to save some more space on md5 (the gain will be smaller).
md5 is base64 encoded too, but every char has the same weight, so all I can 
try is save some space by putting it back in raw form (it will be about 3/4 
of the original size, a bit more as the variable size of the field will add 
the same overhead). The problem is that there doesn't seem to be an end of 
stream indicator, because the decoding part of bacula's code knows the size 
of what it wants to retrieve.

What I need now is to know exactly what I can expect in the lstat so that I 
can encode and decode it correctly in the database backend. Is it 32 bytes 
aligned ? (so I can discard the last few bits I may decode).
For instance, the current md5 is 22 chars, so it would decode 22*6=132 bytes. 
Are the last 4 discarded ?
If I can't make such hypothesis, I will have to put a size header in my 
compressed field. Even 1 byte is 300MB now for our database and will be even 
worse in the future...

Of course, if anyone is interested in the code, I will happily provide it as 
is, but I guess that Eric will commit it in the repository when he thinks my 
work is usable. For now, it only supports postgresql 8.3, and there is no 
automation for installing it...

Anyway, if someone can remove my doubts about bacula's base64, I'd be very 
thankful.

Cheers
Marc

-------------------------------------------------------------------------
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
_______________________________________________
Bacula-devel mailing list
Bacula-devel@xxxxxxxxxxxxxxxxxxxxx
https://lists.sourceforge.net/lists/listinfo/bacula-devel


This mailing list archive is a service of Copilot Consulting.