
Re: [Bacula-devel] small dbcheck patch for mysql


In response to "Yuri Timofeev" <tim4dev@xxxxxxxxx>:

> 2008/9/9 Bill Moran <wmoran@xxxxxxxxxxxxxxxxxxxxxxx>:
> > In response to Bob Hetzel <beh@xxxxxxxx>:
> > [...]  Additionally, doing that
> > rewrite so that it results in improvements to all SQL platforms is
> > probably going to require that the code be broken into SQL-dependent
> > and SQL-independent sections.
> 
> It may be that we have to do that in the critical sections.

Agreed.  This was demonstrated last year by some massive speed improvements
in the way that job records are saved.

> The problem is that different DBMSs have different optimizers.
> For example, in PostgreSQL (during my tests) EXPLAIN does not show
> that indexes will be used:
> 
> bacula=# EXPLAIN SELECT File.FileId, Job.JobId FROM File LEFT OUTER JOIN Job ON
>   (File.JobId = Job.JobId) WHERE Job.JobId IS NULL LIMIT 300000;
> 
>                                 QUERY PLAN
> ---------------------------------------------------------------------------
>  Limit  (cost=78.86..24169.76 rows=300000 width=8)
>    ->  Hash Left Join  (cost=78.86..181966.90 rows=2265021 width=8)
>          Hash Cond: (file.jobid = job.jobid)
>          Filter: (job.jobid IS NULL)
>          ->  Seq Scan on file  (cost=0.00..113937.42 rows=4530042 width=8)
>          ->  Hash  (cost=58.38..58.38 rows=1638 width=4)
>                ->  Seq Scan on job  (cost=0.00..58.38 rows=1638 width=4)
> (7 rows)
> 
> However, PostgreSQL worked quickly.

That's because using indexes on that query is a horrifically bad idea.
Here's the same query against one of my (much larger) catalogs, first
with the default plan, then with sequential scans disabled:

bacula=# EXPLAIN SELECT File.FileId, Job.JobId FROM File LEFT OUTER JOIN Job ON
bacula-#   (File.JobId = Job.JobId) WHERE Job.JobId IS NULL LIMIT 300000;
                                 QUERY PLAN                                  
-----------------------------------------------------------------------------
 Limit  (cost=53.15..14650.94 rows=300000 width=8)
   ->  Hash Left Join  (cost=53.15..2115902.45 rows=43482932 width=8)
         Hash Cond: ("outer".jobid = "inner".jobid)
         Filter: ("inner".jobid IS NULL)
         ->  Seq Scan on file  (cost=0.00..1463605.32 rows=43482932 width=8)
         ->  Hash  (cost=50.52..50.52 rows=1052 width=4)
               ->  Seq Scan on job  (cost=0.00..50.52 rows=1052 width=4)
(7 rows)

bacula=# set enable_seqscan=off;
SET
Time: 0.210 ms
bacula=# EXPLAIN SELECT File.FileId, Job.JobId FROM File LEFT OUTER JOIN Job ON
bacula-#   (File.JobId = Job.JobId) WHERE Job.JobId IS NULL LIMIT 300000;
                                              QUERY PLAN                                              
------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..1042544.60 rows=300000 width=8)
   ->  Merge Right Join  (cost=0.00..151109653.30 rows=43482932 width=8)
         Merge Cond: ("outer".jobid = "inner".jobid)
         Filter: ("outer".jobid IS NULL)
         ->  Index Scan using job_pkey on job  (cost=0.00..175.80 rows=1052 width=4)
         ->  Index Scan using file_jobid_idx on file  (cost=0.00..150565938.23 rows=43482932 width=8)
(6 rows)

Notice that PostgreSQL estimates the index scan on file_jobid_idx at a
cost of 150,565,938 alone, whereas the entire sequential scan costs
only 1,463,605.  The seq scans are a clear winner in this case.
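
For anyone who wants to check those estimates against reality, EXPLAIN
ANALYZE executes the query and prints actual row counts and timings
next to the planner's guesses.  Something like this (and note that,
unlike plain EXPLAIN, it really runs the query, so be careful on a
production catalog):

-- Caution: EXPLAIN ANALYZE actually executes the statement.
EXPLAIN ANALYZE
SELECT File.FileId, Job.JobId
  FROM File LEFT OUTER JOIN Job ON (File.JobId = Job.JobId)
 WHERE Job.JobId IS NULL LIMIT 300000;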

> MySQL should (theoretically) use the multi-column index that it
> already has, from make_mysql_tables.in:
> CREATE TABLE File (...
> INDEX (JobId, PathId, FilenameId)
> 
> but MySQL worked slowly.

I don't know much about the innards of MySQL's query planner, but it
doesn't seem to be as robust as Postgres'.  Index scanning is usually
the best way to go, but (as is shown above) not always.  I have trouble
believing that the MySQL developers always use available indexes, since
that would create a laundry list of corner cases where queries perform
horribly, but it seems that their planning logic fails in this
particular case.
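
If someone with a large MySQL catalog wants to poke at this, MySQL's
EXPLAIN on the same query should show which index (if any) it picks,
and an index hint makes it easy to compare plans.  A sketch only: I'm
assuming MySQL gave that unnamed multi-column index its default name,
JobId, after its first column.

-- What plan does MySQL pick by default?
EXPLAIN SELECT File.FileId, Job.JobId
  FROM File LEFT OUTER JOIN Job ON (File.JobId = Job.JobId)
 WHERE Job.JobId IS NULL LIMIT 300000;

-- The same query with the index hinted away, to see whether the
-- index is what hurts.  (The index name is my assumption.)
EXPLAIN SELECT File.FileId, Job.JobId
  FROM File IGNORE INDEX (JobId) LEFT OUTER JOIN Job
    ON (File.JobId = Job.JobId)
 WHERE Job.JobId IS NULL LIMIT 300000;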

It's not unusual.  If you mess with Postgres long enough, you'll find
odd queries that perform poorly even though there's a much faster way
to do it.  And that's the rub.  If I were writing a system purely for
PostgreSQL, I could reorder joins or make other tweaks to the query
until it ran quickly, but then I might have a query that ran like
molasses on MySQL.  The inverse is true as well.
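
To make that concrete: the orphan lookup above can also be written with
NOT EXISTS, and different optimizers frequently pick different plans
for the two forms, which is exactly the kind of per-database tweaking
I mean.  A sketch, not something I've benchmarked on both:

-- Equivalent to the LEFT JOIN ... IS NULL form above; some planners
-- handle one form much better than the other.
SELECT File.FileId
  FROM File
 WHERE NOT EXISTS (SELECT 1 FROM Job WHERE Job.JobId = File.JobId)
 LIMIT 300000;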

My point is that while such work is definitely warranted in the director,
I'm not sure it's warranted in something like dbcheck.  If you've already
got it done, there's no reason to waste it, but suggesting that someone
review dbcheck and optimize the SQL is like asking me to clean toilets
with a toothbrush -- it's a useful exercise to teach privates not to
mouth off to sergeants, but it's not a particularly efficient way to get
the toilet clean.

I expect that, on Postgres, the fastest way to delete orphaned filename
records would be:
delete from filename 
 where filename.filenameid not in (select filenameid from file);

Assuming that is fast, would such a query even work on MySQL?
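
If NOT IN turns out to work poorly there, MySQL's multi-table DELETE
syntax gives a join-based equivalent.  A sketch; I haven't run this
against a MySQL catalog:

-- Join-based equivalent of the NOT IN delete, for MySQL.
delete Filename
  from Filename left outer join File
       on (Filename.FilenameId = File.FilenameId)
 where File.FilenameId is null;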

-- 
Bill Moran
Collaborative Fusion Inc.
http://people.collaborativefusion.com/~wmoran/

wmoran@xxxxxxxxxxxxxxxxxxxxxxx
Phone: 412-422-3463x4023


