10053 to the rescue again!
When I took a look, the execution plans for the queries were not the same, so Of course, the first place I looked was in the statistics. What I found was quite interesting - the "fast" table had been analyzed with a sample of 10%, whereas the "slow" table had been analyzed 100%. When I re-analyzed the table with a 10% sample size, hey presto the original (fast) plan was once again chosen.
So I started taking a look at HOW the CBO had come up with the respective plans, and I did this with a 10053 trace. (Although the query was quite complicated, I've simplified it right down in this post, for clarity.) What I'll do here is explain first what the problem was, and then show how it affected the decision of the CBO...
The problem was that in the million row table, there was one row which had an incorrect value. It was a date of 13-JUL-8569. It was just wrong. Very wrong. There shouldn't have been anything higher than SYSDATE plus about a month. Here's what I saw (table and column names changed to protect the innocent!):
SQL> select to_char(max(my_date),'DD-MON-YYYY HH24:MI:SS') from my_table;
TO_CHAR(MAX(MY_DATE)
--------------------
13-JUL-8569 00:00:00
SQL> select to_char(min(my_date),'DD-MON-YYYY HH24:MI:SS') from my_table;
TO_CHAR(MIN(MY_DATE)
--------------------
03-JUL-1996 00:00:00
My_table has an index on my_date.
In the query we had the predicate WHERE my_date < SYSDATE. So without a histogram, the optimizer was assuming a uniform distribution of dates between 1996 and 8569, meaning that the likelihood of a date being < SYSDATE was pretty small. In fact, without this bad date, the range would be between 1996 and SYSDATE+30, meaning that the likelihood of a date being < SYSDATE was extremely high.
What had happened was that in sampling at 10%, we had not sampled the bad date, and hence calculated a sensible cardinality, but as soon as we sampled 100% we included the bad date and the cardinality was stupidly low. (Needless to say, the solution was to fix the bad data rather than to gamble on the sample missing it!)
To illustrate this I have recreated a simplified version below.
SQL> create table my_table
2 (my_date date, my_char char(100) );
Table created.
SQL> begin
2 for i in 1 .. 10000 loop
3 insert into my_table values (sysdate-dbms_random.value(1,1000), 'X');
4 end loop;
5 commit;
6 end;
7 /
PL/SQL procedure successfully completed.
SQL> create index test1_n1 on my_table(my_date);
Index created.
SQL> exec dbms_stats.gather_table_stats(ownname=>'SYS',tabname=>'MY_TABLE',cascade=>true);
PL/SQL procedure successfully completed.
SQL> update my_table set my_date=sysdate+1000000 where rownum=1;
1 row updated.
SQL> commit;
Commit complete.
SQL> set autotrace on
SQL> select * from my_table where my_date
<some editing here, but not of anything relevant>
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=38 Card=10000 Bytes=
1090000)
1 0 TABLE ACCESS (FULL) OF 'MY_TABLE' (Cost=38 Card=10000 Byte
s=1090000)
831 consistent gets
SQL> exec dbms_stats.gather_table_stats(ownname=>'SYS',tabname=>'MY_TABLE',cascade=>true);
PL/SQL procedure successfully completed.
SQL> select * from my_table where my_date < sysdate;
<some editing here, but not of anything relevant>
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=13 Card=10 Bytes=109
0)
1 0 TABLE ACCESS (BY INDEX ROWID) OF 'MY_TABLE' (Cost=13 Card=
10 Bytes=1090)
2 1 INDEX (RANGE SCAN) OF 'TEST1_N1' (NON-UNIQUE) (Cost=3 Ca
rd=10)
10631 consistent gets
So you can see that the index range scan looks preferable when the bad data is analyzed. The operation is assigned a falsely low cost, and you can see the far greater number of consistent gets required.
As I mentioned, the original query was a quite complicated, and involved joining several tables, so when I started looking at the problem, it wasn't quite a "needle in a haystack" but it was, shall we say, not obvious. It was the 10053 trace which led me to find the cause of the problem. I did this as follows:
SQL> alter session set events '10053 trace name context forever, level 1';
Session altered.
SQL> explain plan for select * from my_table where my_date <sysdate;
Explained.
The trace files were in my udump directory, and this is what I saw:
When bad data excluded, query runs fast using Full Table Scan -
SINGLE TABLE ACCESS PATH
Column: MY_DATE Col#: 1 Table: MY_TABLE Alias: MY_TABLE
NDV: 10000 NULLS: 0 DENS: 1.0000e-04 LO: 2452971 HI: 2453970
NO HISTOGRAM: #BKT: 1 #VAL: 2
TABLE: MY_TABLE ORIG CDN: 10000 ROUNDED CDN: 10000 CMPTD CDN: 10000
Access path: tsc Resc: 38 Resp: 37
Access path: index (scan)
Index: TEST1_N1
TABLE: MY_TABLE
RSC_CPU: 74650907 RSC_IO: 9963
IX_SEL: 1.0000e+00 TB_SEL: 1.0000e+00
BEST_CST: 38.00 PATH: 2 Degree: 1
.
When bad data included, query runs slowly using Index Range Scan -
SINGLE TABLE ACCESS PATH
Column: MY_DATE Col#: 1 Table: MY_TABLE Alias: MY_TABLE
NDV: 10000 NULLS: 0 DENS: 1.0000e-04 LO: 2452971 HI: 3453971
NO HISTOGRAM: #BKT: 1 #VAL: 2
TABLE: MY_TABLE ORIG CDN: 10000 ROUNDED CDN: 10 CMPTD CDN: 10
Access path: tsc Resc: 38 Resp: 37
Access path: index (scan)
Index: TEST1_N1
TABLE: MY_TABLE
RSC_CPU: 89153 RSC_IO: 12
IX_SEL: 9.9883e-04 TB_SEL: 9.9883e-04
BEST_CST: 13.00 PATH: 4 Degree: 1
.
Here you can see here the falsely low computed cardinality, which in the case of the predicate I used, is NUM_ROWS * (value-low / high-low).
I don't propose to reproduce it all here, but the paper by Wolfgang Breitling entitled "A look under the hood of CBO" is a good place to start reading up on this stuff if you are not familiar, and of course you can't discuss this without mentioning Jonathan Lewis' excellent "Cost Based Oracle Fundamentals".
