
Re: [Bacula-devel] Accurate file project hash tables


>>>>> On Wed, 26 Mar 2008 09:55:23 +0100, Kern Sibbald said:
> 
> On Tuesday 25 March 2008 23:39:46 Martin Simmons wrote:
> > >>>>> On Tue, 25 Mar 2008 17:55:50 +0100, Kern Sibbald said:
> > >
> > > Hello Eric,
> > >
> > > Background (somewhat simplified):
> > > The Accurate delete/restore project (item #1) is now implemented (thanks
> > > to Eric).  For it to work, the client machine must maintain a temporary
> > > table supplied by the Director for each job. This table contains all the
> > > files that are in the current backup from the Director's point of view at
> > > the beginning of the job.  The table allows the FD to determine which
> > > files have been added to or deleted from the system but are not yet
> > > saved (or recorded as deleted).
> > >
> > > As currently implemented, this table is a hash table using the hash class
> > > that I wrote 3 or 4 years ago for this particular project. It is fast and
> > > efficient.
> > >
> > > Problem:
> > > Currently the hash table is kept entirely in memory, so the client uses a
> > > rather large amount of memory; for systems with 10 million files this will
> > > probably exceed even the largest amounts of RAM currently in use.  We would
> > > like a solution that allows the number of files to grow to 20 million or
> > > more while still using less than 1GB of RAM.
> > >
> > > Recently (in the last few days), I implemented what I call BIG_MALLOC for
> > > the htable class, which allocates memory in 1MB chunks.  This makes the
> > > code run several times faster with datasets on the order of 5 million files
> > > by reducing the number of system malloc requests from 5-6 million to fewer
> > > than 500.  However, it still does not reduce the memory footprint.
> > >
> > > Suggestions for reducing the memory requirements:
> > > ... Any other ideas?
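
(Aside: for concreteness, a rough sketch of the 1MB-chunk allocation idea
described above -- purely illustrative, with made-up names; this is not the
actual htable/BIG_MALLOC code:

    #include <cstdlib>

    /* Hand out small pieces from 1MB chunks; only call malloc() when a
     * chunk fills up, so millions of small hash entries cost only a few
     * hundred real allocations.  Assumes each request fits in one chunk. */
    class ChunkPool {
       static const size_t CHUNK_SIZE = 1024 * 1024;
       struct Chunk { Chunk *next; size_t used; char *data; };
       Chunk *head;
    public:
       ChunkPool() : head(NULL) {}
       void *alloc(size_t n) {
          n = (n + 7) & ~(size_t)7;              /* keep 8-byte alignment */
          if (!head || head->used + n > CHUNK_SIZE) {
             Chunk *c = (Chunk *)malloc(sizeof(Chunk) + CHUNK_SIZE);
             c->data = (char *)(c + 1);          /* payload follows header */
             c->next = head;
             c->used = 0;
             head = c;
          }
          void *p = head->data + head->used;
          head->used += n;
          return p;
       }
       ~ChunkPool() {                            /* release all chunks */
          while (head) { Chunk *c = head; head = c->next; free(c); }
       }
    };

At the end of the job the whole table can be thrown away by freeing a few
hundred chunks instead of millions of individual entries.)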
> >
> > How about not sending all of the info to the FD at once :-)
> >
> > I'm thinking of something like "merge" on ordered lists.  If the Director
> > could generate the file list breadth-first and in alphabetical order (like
> > ls -R), and the FD could traverse the file tree in the same order, then it
> > would be easy to spot the missing files as they occur.  This would allow
> > the Director to send the list to the FD in chunks during the backup, rather
> > than sending it all at the start of the job.
> 
> Possibly it would work, but first it would require more code and overhead in 
> the Director,

Yes, it would add some complexity, but OTOH it would allow the FD to run
without needing GBs of free disk space to store the whole list.
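
For illustration, the core of the "merge on ordered lists" idea is just a
linear walk over two sorted name lists -- a rough sketch (hypothetical
helper, not existing Bacula code):

    #include <cstdio>
    #include <string>
    #include <vector>

    /* Compare the Director's sorted list of expected names against the
     * FD's sorted view of what is actually on disk.  Names present only
     * in 'expected' were deleted on the client; names present only in
     * 'on_disk' are new and must be backed up. */
    static void merge_compare(const std::vector<std::string> &expected,
                              const std::vector<std::string> &on_disk)
    {
       size_t i = 0, j = 0;
       while (i < expected.size() || j < on_disk.size()) {
          if (j == on_disk.size() ||
              (i < expected.size() && expected[i] < on_disk[j])) {
             printf("deleted: %s\n", expected[i++].c_str());
          } else if (i == expected.size() || on_disk[j] < expected[i]) {
             printf("new:     %s\n", on_disk[j++].c_str());
          } else {
             i++; j++;              /* in both lists: nothing to report */
          }
       }
    }

Because each side only ever needs its current position in its list, the
Director could stream its list in chunks and the FD would never have to
hold more than one chunk at a time.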


>               and much more difficult, the FD does a recursive descent into 
> the directories in the order (or possibly the inverse order) in which the 
> user specified them.  Hence there is zero chance of visiting the files in 
> sorted order.  To do so, you would have to find the names of all the files, 
> sort them, then revisit all the files.  That would involve keeping a good 
> part of the same data we already have, or it would require an external 
> sort.

I think the FD processes the "File =" directives in a predictable order (which
the Director obviously knows, since it transmitted them).  The only sorting
required on the FD side would be for the entries in each directory,
independently of all other directories.  The FD would only need to retain the
sorted lists for the directories it is currently working on (i.e. the chain of
directories along the current recursive descent).
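
Something along these lines on the FD side (a rough sketch using POSIX
opendir/readdir; the function name is made up):

    #include <dirent.h>
    #include <string.h>
    #include <algorithm>
    #include <string>
    #include <vector>

    /* Read one directory and sort just its own entries.  Only the
     * directories along the current descent path need their sorted
     * lists kept in memory, not the whole tree.  Director and FD must
     * agree on the ordering (e.g. plain byte order). */
    static std::vector<std::string> sorted_entries(const char *path)
    {
       std::vector<std::string> names;
       if (DIR *d = opendir(path)) {
          while (struct dirent *de = readdir(d)) {
             if (strcmp(de->d_name, ".") != 0 && strcmp(de->d_name, "..") != 0) {
                names.push_back(de->d_name);
             }
          }
          closedir(d);
       }
       std::sort(names.begin(), names.end());
       return names;
    }

Each directory's sorted entries could then be compared against the
corresponding slice of the Director's list as the descent proceeds.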

__Martin
