[Bacula-devel] Accurate file project hash tables
Background (somewhat simplified):
The Accurate delete/restore project (item #1) is now implemented (thanks to
Eric). For it to work, the client machine must maintain a temporary table
supplied by the Director for each job. This table contains all the files that
are in the current backup from the Director's point of view at the beginning
of the job. The table allows the FD to determine which files have been added to
or deleted from the system but not yet saved (or marked as deleted).
As currently implemented, this table is a hash table built on the htable class
that I wrote 3 or 4 years ago for this particular project. It is fast.
Currently the hash table is kept entirely in memory, which means the client
uses a rather huge amount of memory, which for systems with 10 million files
will probably exceed even the largest amounts of RAM currently used. We
would like to have some solution that allows the number of files to grow to
20 million or more and still use less than 1GB of RAM.
Recently (in the last few days), I implemented what I call BIG_MALLOC for the
htable class: it allocates memory in 1MB chunks. This makes the code run
several times faster on datasets on the order of 5 million files, by reducing
the number of system malloc() requests from 5-6 million to fewer than 500.
However, it still does not reduce the memory footprint.
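The chunking idea can be sketched roughly as follows (hypothetical class and
method names, not the actual htable code): small allocations are carved out of
1MB blocks, so the system allocator is called once per megabyte instead of once
per entry.

```cpp
#include <cstdlib>
#include <vector>

// Minimal sketch of BIG_MALLOC-style chunked allocation: instead of one
// malloc() per hash-table entry, carve small allocations out of 1MB blocks.
class ChunkPool {
public:
    static const size_t CHUNK = 1024 * 1024;      // 1MB per system malloc()
    ChunkPool() : cur(NULL), used(CHUNK) {}       // force a chunk on first alloc
    ~ChunkPool() {
        for (size_t i = 0; i < chunks.size(); i++) free(chunks[i]);
    }
    void *alloc(size_t n) {
        n = (n + 7) & ~(size_t)7;                 // keep 8-byte alignment
        if (used + n > CHUNK) {                   // current chunk exhausted
            cur = (char *)malloc(CHUNK);
            chunks.push_back(cur);
            used = 0;
        }
        void *p = cur + used;
        used += n;
        return p;
    }
    size_t system_mallocs() const { return chunks.size(); }
private:
    std::vector<char *> chunks;
    char *cur;
    size_t used;
};
```

With 100-byte entries, a million insertions cost only about 100 malloc() calls
instead of a million, which is where the speedup comes from.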
Suggestions for reducing the memory requirements:
1. General: Currently each file is kept as a full path and filename. This
makes things simple, but means there may be thousands of copies of a long
path name (one for each file in the directory).
Solution: separate the filename from the path name, as is done in the Bacula DB,
and keep each unique path only once in a red/black tree.
Benefit: a lot less memory used (perhaps half to one fourth of the current
usage). It is quite easy to implement, might give us a lot of breathing room,
and still allows us to implement the ideas below later. Some simple tests of the
compression ratio on real filesystems should be done before implementing this,
to ensure it really gives a significant reduction in size.
Downside: Slightly more overhead. Not a full solution to the problem.
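A minimal sketch of idea #1 (hypothetical names, using std::set, which is
typically implemented as a red/black tree): each unique path is stored once and
every file entry points at the shared copy, much as the Bacula catalog does with
its separate Path and Filename tables.

```cpp
#include <set>
#include <string>

// Sketch of idea #1: intern each directory path once in a red/black tree
// (std::set) and let every file entry share the stored copy.
struct FileEntry {
    const std::string *path;   // shared, owned by the PathTree
    std::string fname;         // per-file
};

class PathTree {
public:
    // Returns a pointer to the single stored copy of this path.
    const std::string *intern(const std::string &path) {
        return &*paths.insert(path).first;
    }
    size_t unique_paths() const { return paths.size(); }
private:
    std::set<std::string> paths;
};
```

For a directory holding 1000 files, the path string is then stored once rather
than 1000 times; the saving depends on the path-length/filename-length ratio,
which is what the proposed filesystem measurements would establish.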
2. General: all the data is currently kept in big 1MB blocks of memory.
Solution: we could quite easily use a single block of memory and write
subsequent blocks to a file. The OS would work as a paging/cache system, so
the pages could be retrieved relatively quickly.
Benefit: the number of files would then scale to whatever you could fit on
disk. It would be rather easy to implement.
Downside: this might be considerably slower, because a hash lookup has to follow
the chain links and compare hash codes, and each link followed for a file
(average 5, maximum 30) could require an I/O.
3. Idea: a variation of #2 would be to move the links out of the data items and
into memory; the paged data would then require an I/O only when there are
collisions in a bucket, not for traversing the bucket chains.
Benefit: probably will run *much* faster.
Downside: will use more memory. We could fit something like 20 million links
in about 250MB of RAM, which is reasonable.
4. Idea: a variation of item #3 would be to generate two hash codes at the same
time and to compare both.
Benefit: this would *significantly* reduce the need to do a full comparison of
the filename, and it would cost negligible CPU to compute a second or even a
third hash at the same time we compute the first one.
Downside: an increase in memory. For about 20 million files, we would need an
additional 80MB of RAM, or about 330MB total. This still seems reasonable.
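A sketch of the dual-hash computation (the hash functions here, a djb2 variant
and FNV-1a, are illustrative choices, not what htable uses): both codes are
produced in a single pass over the string, so the extra cost is a couple of
arithmetic operations per byte.

```cpp
#include <stdint.h>

// Sketch of idea #4: compute two independent hash codes in one pass.
// A full string comparison is then needed only when both codes match,
// which for two independent 32-bit hashes is vanishingly rare.
void dual_hash(const char *s, uint32_t *h1, uint32_t *h2)
{
    uint32_t a = 5381;         // djb2 variant
    uint32_t b = 2166136261u;  // FNV-1a offset basis
    for (const unsigned char *p = (const unsigned char *)s; *p; p++) {
        a = ((a << 5) + a) ^ *p;       // a = a*33 ^ c
        b = (b ^ *p) * 16777619u;      // FNV-1a step
    }
    *h1 = a;
    *h2 = b;
}
```

Storing the second 32-bit code per link is what accounts for the extra
4 bytes x 20 million = 80MB figure above.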
5. Idea: a variation on #4 would be to page the hash links and allow the user
to specify the total memory to use. This would be a slightly larger project
but is not enormous. Probably not worth doing at least at the moment.
6. Idea: use memory-mapped files.
Benefit: this would simplify the code and the OS would do the paging.
Downside: performance is neither controlled nor guaranteed, and there would
probably be *big* porting problems among the various Unixes and Win32.
What do you think? Any other ideas?