[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Bacula-devel] space saving in the database
First an introduction on what I'm trying to do, so you know what we're playing
Eric Bollengier and I have been discussing about saving space in the database,
as ours is getting bigger and bigger (we're nearly at 300 million rows in
We're starting with the most obvious : try to make the records more compact on
disc. There's a lot of space used by the indexes too, but I don't know yet
if it would be possible to have something smaller (they are all used...)
We're using PostgreSQL 8.2 on production now, so here are our stats.
What we have now as space used is :
- File is 108 GB for 275845760 records (size includes indexes)
Space is used as follows :
* Indexes :
file_jpfid_idx -> 13 GB
file_fp_idx -> 10GB
file_jobid_idx -> 9GB
file_pkey -> 8GB
* Table itself (68GB):
Administrative overhead (rows headers, block headers)
block headers : 20 bytes per page (108GB equals 14 million pages) => 276 MB
row header : 4 bytes in the block header + 27 bytes in PG 8.2 = 8.5 GB
NB : by using PG 8.3, we can save 4 bytes per row -> "only 7.4 GB"
Real data :
SELECT attname, avg_width from pg_stats where tablename = 'file';
attname | avg_width
pathid | 4
filenameid | 4
lstat | 59
md5 | 24
fileid | 4
fileindex | 4
jobid | 4
markid | 4
everything is int4 except lstat and md5.
So 4 bytes for every field, except an average of 59 (+4 bytes of overhead) for
lstat and 24 (+4 bytes of overhead) for md5. A row is about 115 bytes, with
63 bytes for lstat and 28 for md5.
So, something around 40GB of data. There is also 30% of free space (too much,
I think, we have to investigate this part)
Of the 40GB of data,
So, I now get to the point :
- we want to try to save some space in file table, so the obvious targets are
lstat and md5.
- we don't want to change bacula code, so the database has to do the dirty
For now, I've been working on the lstat field. It is base64-encoded, and
contains mostly spaces, As and Bs.
What I've done is create a new datatype 'lstat', and code a huffman
compress/decompress in C, with static weights for every symbol (' ',A,B
vastly outweighting the rest of the codes). The savings are around 12% for
the table (of course, we get nothing on the indexes as lstat isn't indexed).
It costs CPU (about 1s of CPU for 300000 encoded + decoded lstats on my PC),
but may be optimized as I'm not that good at C programming.
I now want to try to save some more space on md5 (the gain will be smaller).
md5 is base64 encoded too, but every char has the same weight, so all I can
try is save some space by putting it back in raw form (it will be about 3/4
of the original size, a bit more as the variable size of the field will add
the same overhead). The problem is that there doesn't seem to be an end of
stream indicator, because the decoding part of bacula's code knows the size
of what it wants to retrieve.
What I need now is to know exactly what I can expect in the lstat so that I
can encode and decode it correctly in the database backend. Is it 32 bytes
aligned ? (so I can discard the last few bits I may decode).
For instance, the current md5 is 22 chars, so it would decode 22*6=132 bytes.
Are the last 4 discarded ?
If I can't make such hypothesis, I will have to put a size header in my
compressed field. Even 1 byte is 300MB now for our database and will be even
worse in the future...
Of course, if anyone is interested in the code, I will happily provide it as
is, but I guess that Eric will commit it in the repository when he thinks my
work is usable. For now, it only supports postgresql 8.3, and there is no
automation for installing it...
Anyway, if someone can remove my doubts about bacula's base64, I'd be very
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
Bacula-devel mailing list
This mailing list archive is a service of Copilot Consulting.