[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Bacula-devel] space saving in the database
Cousin Marc wrote:
> Hi,
>
> First an introduction on what I'm trying to do, so you know what we're playing
> with :
>
> Eric Bollengier and I have been discussing about saving space in the database,
> as ours is getting bigger and bigger (we're nearly at 300 million rows in
> file table).
>
> We're starting with the most obvious : try to make the records more compact on
> disc. There's a lot of space used by the indexes too, but I don't know yet
> if it would be possible to have something smaller (they are all used...)
>
> We're using PostgreSQL 8.2 on production now, so here are our stats.
>
> What we have now as space used is :
> - File is 108 GB for 275845760 records (size includes indexes)
> Space is used as follows :
> * Indexes :
> file_jpfid_idx -> 13 GB
> file_fp_idx -> 10GB
> file_jobid_idx -> 9GB
> file_pkey -> 8GB
> * Table itself (68GB):
> Administrative overhead (rows headers, block headers)
> block headers : 20 bytes per page (108GB equals 14 million pages) => 276 MB
> row header : 4 bytes in the block header + 27 bytes in PG 8.2 = 8.5 GB
> NB : by using PG 8.3, we can save 4 bytes per row -> "only 7.4 GB"
>
> Real data :
> SELECT attname, avg_width from pg_stats where tablename = 'file';
> attname | avg_width
> ------------+-----------
> pathid | 4
> filenameid | 4
> lstat | 59
> md5 | 24
> fileid | 4
> fileindex | 4
> jobid | 4
> markid | 4
>
> everything is int4 except lstat and md5.
> So 4 bytes for every field, except an average of 59 (+4 bytes of overhead) for
> lstat and 24 (+4 bytes of overhead) for md5. A row is about 115 bytes, with
> 63 bytes for lstat and 28 for md5.
>
> So, something around 40GB of data. There is also 30% of free space (too much,
> I think, we have to investigate this part)
>
> Of the 40GB of data,
>
>
> So, I now get to the point :
> - we want to try to save some space in file table, so the obvious targets are
> lstat and md5.
> - we don't want to change bacula code, so the database has to do the dirty
> work...
> For now, I've been working on the lstat field. It is base64-encoded, and
> contains mostly spaces, As and Bs.
>
> What I've done is create a new datatype 'lstat', and code a huffman
> compress/decompress in C, with static weights for every symbol (' ',A,B
> vastly outweighting the rest of the codes). The savings are around 12% for
> the table (of course, we get nothing on the indexes as lstat isn't indexed).
> It costs CPU (about 1s of CPU for 300000 encoded + decoded lstats on my PC),
> but may be optimized as I'm not that good at C programming.
>
>
> I now want to try to save some more space on md5 (the gain will be smaller).
> md5 is base64 encoded too, but every char has the same weight, so all I can
> try is save some space by putting it back in raw form (it will be about 3/4
> of the original size, a bit more as the variable size of the field will add
> the same overhead). The problem is that there doesn't seem to be an end of
> stream indicator, because the decoding part of bacula's code knows the size
> of what it wants to retrieve.
>
> What I need now is to know exactly what I can expect in the lstat so that I
> can encode and decode it correctly in the database backend. Is it 32 bytes
> aligned ? (so I can discard the last few bits I may decode).
> For instance, the current md5 is 22 chars, so it would decode 22*6=132 bytes.
> Are the last 4 discarded ?
> If I can't make such hypothesis, I will have to put a size header in my
> compressed field. Even 1 byte is 300MB now for our database and will be even
> worse in the future...
>
> Of course, if anyone is interested in the code, I will happily provide it as
> is, but I guess that Eric will commit it in the repository when he thinks my
> work is usable. For now, it only supports postgresql 8.3, and there is no
> automation for installing it...
Please post it. I think seeing it, knowing how it is used, and how it
works will be useful in the evaluation of the patch. Well, it's not a
Bacula patch, it's a patch against the database.
> Anyway, if someone can remove my doubts about bacula's base64, I'd be very
> thankful.
--
Dan Langille - http://www.langille.org/
BSDCan - The Technical BSD Conference: http://www.bsdcan.org/
PGCon - The PostgreSQL Conference: http://www.pgcon.org/
-------------------------------------------------------------------------
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
_______________________________________________
Bacula-devel mailing list
Bacula-devel@xxxxxxxxxxxxxxxxxxxxx
https://lists.sourceforge.net/lists/listinfo/bacula-devel
This mailing list archive is a service of Copilot Consulting.