
Re: [Bacula-devel] Selective restore when files are pruned [patch]

On Monday 18 August 2008 12:20:34 Kjetil Torgrim Homme wrote:
> Kern Sibbald <kern@xxxxxxxxxxx> writes:
> > - As Martin points out, this code gives the SD a bit more knowledge
> > of the records it has stored, but unless someone has a better idea,
> > I see no alternative.
> the SD has this knowledge already, even if it ignores it.

Well, before the patch, it knew about certain streams (it needs to pass selected 
ones to the Director), but it did not know about or look at their contents.  

In any case, I have accepted this ...

> > - One aspect of this code I haven't looked at yet is whether it is
> > really required to add it in read_record.c rather than match_bsr.c,
> > where all the other bsr filtering code is located.  To be
> > investigated ...
> as far as I could tell, match_bsr is only called once per volume, and
> changing that design decision seemed more obtrusive.

No, match_bsr.c is called for each record.  You might take another look at it 
and see if it would be possible to move the code from read_record.c to 
match_bsr.c -- even if it takes a new subroutine call.  I haven't had a 
chance to look at that aspect yet, and I would like to do so before applying 
the patch.  The patch will definitely go in, but exactly how needs to be 
worked out first.

> > On a similar but slightly different subject: one user brought up a
> > problem that we are surely likely to see quite a lot in the near
> > future.  He has 600 million File records in his Bacula catalog, and
> > he is required to have at least a 7 year retention period, which
> > means the database is growing (I think it is currently at 100GB),
> > and it will continue to grow.
> >
> > He has proposed, to improve performance, having a separate File
> > table for each client.  This would very likely improve performance
> > quite a lot because if you have, say, 60 clients, instead of having
> > one gigantic File table it would be split into 60 smaller tables.
> > For example, instead of referencing File, Bacula would, for clients
> > named FD1 and FD2, reference FD1Files and FD2Files, and so on, each
> > of which would be an identical table but containing only the data
> > for a single client.
> >
> > The problem I have with the suggestion is that it would require
> > rather massive changes to the current SQL code, and it would break
> > all external programs that reference the File table of the database.
> this would be very awkward.  we have hundreds of clients, and they
> have very long names in Bacula (based on FQDN, often more than 40
> characters), so I dread typing in the SQL table names by hand :-)

Yes, I agree.  The basic idea is good, the problems with implementing it would 
be large ...
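To make the objection concrete, here is a minimal sketch (using sqlite3; the table layout, column names, and client names FD1/FD2 are hypothetical, not Bacula's actual catalog schema) of what per-client tables would force on the SQL layer -- every query would need the table name spliced in per client:

```python
import sqlite3

# Hypothetical sketch of the per-client File table scheme discussed
# above.  Table and column names are illustrative only.
conn = sqlite3.connect(":memory:")

clients = ["FD1", "FD2"]
for c in clients:
    # One File table per client, all with an identical layout.
    conn.execute(
        f"CREATE TABLE {c}Files "
        "(FileId INTEGER PRIMARY KEY, JobId INTEGER, Name TEXT)"
    )

def files_for_job(client, jobid):
    # Every existing query must have the table name substituted in,
    # which is what makes the change so invasive for external tools.
    return conn.execute(
        f"SELECT Name FROM {client}Files WHERE JobId = ?", (jobid,)
    ).fetchall()

conn.execute("INSERT INTO FD1Files (JobId, Name) VALUES (1, '/etc/passwd')")
print(files_for_job("FD1", 1))  # [('/etc/passwd',)]
```

With hundreds of long FQDN-based client names, as Kjetil describes, the generated table identifiers also become unwieldy to type by hand.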

> > The first important piece of information is that in version 3.0.0
> > we are planning to switch, by default, to a 64 bit Id for the File
> > table -- this will remove the current restriction of 4G file
> > records (it can be enabled manually in the current version, so the
> > main change is to make it automatic).
> oops, I just noticed our database schema still uses "int(10) unsigned"
> for FileId, I'll need to change that for sure ...

You will get it automatically (I hope) with 3.0 ...
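For a schema still using a 32-bit FileId, the fix on the MySQL side would presumably be something like an ALTER TABLE changing the column to BIGINT UNSIGNED.  A small sketch (using sqlite3, whose INTEGER columns are 64-bit) shows why the width matters -- an Id past the 32-bit unsigned ceiling round-trips intact:

```python
import sqlite3

# SQLite INTEGER columns are 64-bit, so a FileId beyond the 32-bit
# unsigned limit (2**32 - 1 = 4294967295) is stored without truncation.
# Table layout is illustrative, not Bacula's actual schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE File (FileId INTEGER PRIMARY KEY, Name TEXT)")

big_id = 2**32 + 7  # beyond what a 32-bit "int(10) unsigned" can hold
conn.execute("INSERT INTO File (FileId, Name) VALUES (?, ?)", (big_id, "/x"))

(stored,) = conn.execute("SELECT FileId FROM File WHERE Name = '/x'").fetchone()
print(stored)  # 4294967303
```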

> > The second thing that could help a lot is the "Selective restore"
> > patch submitted by Kjetil, because although a user may have a
> > requirement for long retention periods, that does not necessarily
> > mean that all the File records must be kept -- what is probably the
> > most important is retaining the data and being able to extract it in
> > a reasonable amount of time.  Implementation of this patch will
> > allow some users to prune the File records even though the Volumes
> > must be kept a long time.  Obviously this will not satisfy all
> > requirements.
> yes, it's a bit of a hack.  it's also a bit contradictory --
> typically, a full restore is only useful from a very recent backup.
> when restoring files from old backups, the user will want to
> cherrypick files, so it would definitely be best to not prune File
> information as long as the backup data is available.  but we're living
> in an imperfect world, and I think Bacula should try to cater for home
> users who have many files, but no beefy database server.

Yes, we try to keep Bacula suitable for small users, but we need to add high 
end features too (in a smart way) because some *very* big sites are using 
it ...

> that said, building the tree prior to the restore can take a long
> time, so even when full File information is available, entering a
> regexp can be much more convenient.

Yes, that is why I accepted the patch ...
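The convenience Kjetil describes is essentially this: instead of building and walking an interactive directory tree, the user supplies one regexp that is matched against each filename as the records stream past.  A minimal sketch (filenames and pattern are illustrative):

```python
import re

# Hedged sketch of regexp-based file selection: match each candidate
# filename against a user-supplied pattern instead of building a tree.
filenames = [
    "/etc/passwd",
    "/etc/hosts",
    "/home/user/notes.txt",
    "/var/log/messages",
]

pattern = re.compile(r"^/etc/")
selected = [f for f in filenames if pattern.search(f)]
print(selected)  # ['/etc/passwd', '/etc/hosts']
```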

> > Another suggestion that I have for the problem of growing File
> > tables is a sort of compromise.  Suppose that we implement two File
> > retention periods.  One as currently exists that defines when the
> > records are deleted, and a new period that defines when the records
> > are moved out of the File table and placed in a secondary table
> > perhaps called OldFiles.  This would allow users to keep the
> > efficiency for active files high but at the same time allow the
> > delete retention period to be quite long.  The database would still
> > grow, but there would be a lot less overhead.  Actually the name of
> > the table for these "expired" File records could even be defined on
> > a client by client or Job by Job basis which would allow for having
> > multiple "OldFiles" tables.
> >
> > Another advantage of my suggestion would be that within Bacula
> > itself, switching from using the File table to using the OldFiles
> > table could be made totally automatic (it will require a bit of
> > code, but no massive changes).  External programs would still
> > function normally in most cases, but if they wanted to access older
> > data, they would need some modification.
> using this scheme, an admin could configure Bacula to only keep the
> most current full backup and incrementals in the main (fast) table,
> and move the historic information to the OldFiles table.  this would
> allow more optimisation for the DBA than basing it on partitioning in
> the database, I think?

Yes, I need to look at how partitioning works.  I have a feeling it will not 
solve any of the problems of a really gigantic database where some of the data 
is used all the time, and other data is almost never used.  I haven't given 
up on the File and OldFiles table idea.  However, I need to research 
partitioning because if it really solves the problem correctly, all the 
better -- we can focus on the many missing features ...
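The two-retention-period idea above can be sketched in a few lines (using sqlite3; the schema, column names, and cutoff values are hypothetical, not Bacula's actual catalog): once the shorter "active" retention expires, rows are moved from File into OldFiles rather than deleted, keeping the hot table small while the long delete retention runs:

```python
import sqlite3

# Hedged sketch of the File/OldFiles scheme: expired rows are archived,
# not deleted.  Schema is illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE File     (FileId INTEGER PRIMARY KEY, JobId INTEGER,
                           Name TEXT, BackupTime INTEGER);
    CREATE TABLE OldFiles (FileId INTEGER PRIMARY KEY, JobId INTEGER,
                           Name TEXT, BackupTime INTEGER);
""")

def archive_expired(cutoff):
    """Move rows older than cutoff from File to OldFiles atomically."""
    with conn:
        conn.execute(
            "INSERT INTO OldFiles SELECT * FROM File WHERE BackupTime < ?",
            (cutoff,))
        conn.execute("DELETE FROM File WHERE BackupTime < ?", (cutoff,))

# One old record and one recent record:
conn.execute("INSERT INTO File VALUES (1, 10, '/etc/passwd', 100)")
conn.execute("INSERT INTO File VALUES (2, 20, '/etc/hosts', 900)")

archive_expired(500)
print(conn.execute("SELECT COUNT(*) FROM File").fetchone()[0])      # 1
print(conn.execute("SELECT COUNT(*) FROM OldFiles").fetchone()[0])  # 1
```

Within Bacula, restore code could query File first and fall back to OldFiles transparently, which is why the switch could be made automatic without massive changes.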

