[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Bacula-devel] Accurate file project hash tables
On Wednesday 26 March 2008 11:46:55 Martin Simmons wrote:
> >>>>> On Wed, 26 Mar 2008 09:55:23 +0100, Kern Sibbald said:
> > On Tuesday 25 March 2008 23:39:46 Martin Simmons wrote:
> > > >>>>> On Tue, 25 Mar 2008 17:55:50 +0100, Kern Sibbald said:
> > > >
> > > > Hello Eric,
> > > >
> > > > Background (somewhat simplified):
> > > > The Accurate delete/restore project (item #1) is now implemented
> > > > (thanks to Eric). For it to work, the client machine must maintain a
> > > > temporary table supplied by the Director for each job. This table
> > > > contains all the files that are in the current backup from the
> > > > Director's point of view at the beginning of the job. The table
> > > > allows the FD to determine what files have been added and/or deleted
> > > > to/from the system that are not yet saved (or deleted).
> > > >
> > > > As currently implemented this table is a hash table using the hash
> > > > class that I wrote 3 or 4 years ago for this particular project. It
> > > > is fast and efficient.
> > > >
> > > > Problem:
> > > > Currently the hash table is entirely kept in memory, which means the
> > > > client uses a rather huge amount of memory, which for systems with 10
> > > > million files will probably exceed even the largest amounts of RAM
> > > > currently used. We would like to have some solution that allows the
> > > > number of files to grow to 20 million or more and still use less than
> > > > 1GB of RAM.
> > > >
> > > > Recently (the last few days), I implemented what I call BIG_MALLOC
> > > > for the htable class that allocates memory in 1MB chunks. This makes
> > > > the code run several times faster with datasets of the order of 5
> > > > million by reducing the number of system malloc requests from 5-6
> > > > million to less than 500. However, it still does not reduce the
> > > > memory foot print.
> > > >
> > > > Suggestions for reducing the memory requirements:
> > > > ... Any other ideas?
> > >
> > > How about not sending all of the info to the FD at once :-)
> > >
> > > I'm thinking of something like "merge" on ordered lists. If the
> > > Director could generate the file list breadth-first and in alphabetical
> > > order (like ls -R), and the FD could traverse the file tree in the same
> > > order, then it would be easy to spot the missing files as they occur.
> > > This would allow the Director to send the list to the FD in chunks
> > > during the backup, rather than sending it all at the start of the job.
> > Possibly it would work, but first it would require more code and overhead
> > in the Director,
> Yes, it would add some complexity, but OTOH it would allow the FD to run
> without needing GBs of free disk space to store the whole list.
> > and much more difficult, the FD does a recursive descent
> > into the directories the user specifies in the order (or possibly inverse
> > order) they are specified by the user. Hence there is zero chance of
> > visiting the files in sorted order. To do so, you would have to find the
> > names of all the files, do a sort, then revisit all the files. That
> > would involve keeping a good part of the same data we already have, or it
> > would require an external sort.
> I think the FD processes the "File =" directives in a predictable order
> (which the Director obviously knows, since it transmitted them). The only
> sorting required on the FD side would be for the entries in each directory,
> independently of all other directories. The FD would only need to retain
> the sorted list for directories that it is working on (recursively).
OK, thanks for the ideas. We will definitely keep them in mind. If I am not
mistaken, we have already considered something similar to this, but I think
you have a few new wrinkles that we had not imagined. We did examine a good
number of alternatives before coding, and in the end decided to do what was
the simplest, knowing full well that at some point we would have to enhance
it. We chose the simplest way because the project was already quite
complicated. Now that it is implemented, tweaking algorithms when necessary
is much easier.
Check out the new SourceForge.net Marketplace.
It's the best place to buy or sell services for
just about anything Open Source.
Bacula-devel mailing list
This mailing list archive is a service of Copilotco.