[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Bacula-devel] Accurate file project hash tables

On Tuesday 25 March 2008 23:39:46 Martin Simmons wrote:
> >>>>> On Tue, 25 Mar 2008 17:55:50 +0100, Kern Sibbald said:
> >
> > Hello Eric,
> >
> > Background (somewhat simplified):
> > The Accurate delete/restore project (item #1) is now implemented (thanks
> > to Eric).  For it to work, the client machine must maintain a temporary
> > table supplied by the Director for each job. This table contains all the
> > files that are in the current backup from the Director's point of view at
> > the beginning of the job.  The table allows the FD to determine what
> > files have been added and/or deleted to/from the system that are not yet
> > saved (or deleted).
> >
> > As currently implemented this table is a hash table using the hash class
> > that I wrote 3 or 4 years ago for this particular project. It is fast and
> > efficient.
> >
> > Problem:
> > Currently the hash table is entirely kept in memory, which means the
> > client uses a rather huge amount of memory, which for systems with 10
> > million files will probably exceed even the largest amounts of RAM
> > currently used.  We would like to have some solution that allows the
> > number of files to grow to 20 million or more and still use less than 1GB
> > of RAM.
> >
> > Recently (the last few days), I implemented what I call BIG_MALLOC for
> > the htable class that allocates memory in 1MB chunks.  This makes the
> > code run several times faster with datasets of the order of 5 million by
> > reducing the number of system malloc requests from 5-6 million to less
> > than 500.  However, it still does not reduce the memory foot print.
> >
> > Suggestions for reducing the memory requirements:
> > ... Any other ideas?
> How about not sending all of the info to the FD at once :-)
> I'm thinking of something like "merge" on ordered lists.  If the Director
> could generate the file list breadth-first and in alphabetical order (like
> ls -R), and the FD could traverse the file tree in the same order, then it
> would be easy to spot the missing files as they occur.  This would allow
> the Director to send the list to the FD in chunks during the backup, rather
> than sending it all at the start of the job.

Possibly it would work, but first it would require more code and overhead in 
the Director, and much more difficult, the FD does a recursive descent into 
the directories the user specifies in the order (or possibly inverse order) 
they are specified by the user.  Hence there is zero chance of visiting the 
files in sorted order.  To do so, you would have to find the names of all the 
files, do a sort, then revisit all the files.  That would involve keeping a 
good part of the same data we already have, or it would require an external 

Check out the new SourceForge.net Marketplace.
It's the best place to buy or sell services for
just about anything Open Source.
Bacula-devel mailing list

This mailing list archive is a service of Copilotco.