|
Server : Apache/2.2.2 (Fedora) System : Linux App1.pathumtani.go.th 2.6.20-1.2320.fc5smp #1 SMP Tue Jun 12 19:40:16 EDT 2007 i686 User : apache ( 48) PHP Version : 5.2.9 Disable Function : NONE Directory : /proc/self/root/usr/share/doc/postgresql-8.1.9/html/ |
Upload File : |
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<HTML
><HEAD
><TITLE
>How the Planner Uses Statistics</TITLE
><META
NAME="GENERATOR"
CONTENT="Modular DocBook HTML Stylesheet Version 1.79"><LINK
REV="MADE"
HREF="mailto:pgsql-docs@postgresql.org"><LINK
REL="HOME"
TITLE="PostgreSQL 8.1.9 Documentation"
HREF="index.html"><LINK
REL="UP"
TITLE="Internals"
HREF="internals.html"><LINK
REL="PREVIOUS"
TITLE="Example"
HREF="bki-example.html"><LINK
REL="NEXT"
TITLE="Appendixes"
HREF="appendixes.html"><LINK
REL="STYLESHEET"
TYPE="text/css"
HREF="stylesheet.css"><META
HTTP-EQUIV="Content-Type"
CONTENT="text/html; charset=ISO-8859-1"><META
NAME="creation"
CONTENT="2007-04-20T04:40:08"></HEAD
><BODY
CLASS="CHAPTER"
><DIV
CLASS="NAVHEADER"
><TABLE
SUMMARY="Header navigation table"
WIDTH="100%"
BORDER="0"
CELLPADDING="0"
CELLSPACING="0"
><TR
><TH
COLSPAN="5"
ALIGN="center"
VALIGN="bottom"
>PostgreSQL 8.1.9 Documentation</TH
></TR
><TR
><TD
WIDTH="10%"
ALIGN="left"
VALIGN="top"
><A
HREF="bki-example.html"
ACCESSKEY="P"
>Prev</A
></TD
><TD
WIDTH="10%"
ALIGN="left"
VALIGN="top"
><A
HREF="bki.html"
>Fast Backward</A
></TD
><TD
WIDTH="60%"
ALIGN="center"
VALIGN="bottom"
></TD
><TD
WIDTH="10%"
ALIGN="right"
VALIGN="top"
><A
HREF="appendixes.html"
>Fast Forward</A
></TD
><TD
WIDTH="10%"
ALIGN="right"
VALIGN="top"
><A
HREF="appendixes.html"
ACCESSKEY="N"
>Next</A
></TD
></TR
></TABLE
><HR
ALIGN="LEFT"
WIDTH="100%"></DIV
><DIV
CLASS="CHAPTER"
><H1
><A
NAME="PLANNER-STATS-DETAILS"
></A
>Chapter 52. How the Planner Uses Statistics</H1
><P
> This chapter builds on the material covered in <A
HREF="performance-tips.html#USING-EXPLAIN"
>Section 13.1</A
>
and <A
HREF="planner-stats.html"
>Section 13.2</A
>, and shows how the planner uses the
system statistics to estimate the number of rows each stage in a query might
return. This is a significant part of the planning / optimizing process,
providing much of the raw material for cost calculation.
</P
><P
> The intent of this chapter is not to document the code —
better done in the code itself, but to present an overview of how it works.
This will perhaps ease the learning curve for someone who subsequently
wishes to read the code. As a consequence, the approach chosen is to analyze
a series of incrementally more complex examples.
</P
><P
> The outputs and algorithms shown below are taken from version 8.0.
The behavior of earlier (or later) versions may vary.
</P
><DIV
CLASS="SECT1"
><H1
CLASS="SECT1"
><A
NAME="ROW-ESTIMATION-EXAMPLES"
>52.1. Row Estimation Examples</A
></H1
><A
NAME="AEN63512"
></A
><P
> Using examples drawn from the regression test database, let's start with a
very simple query:
</P><PRE
CLASS="PROGRAMLISTING"
>EXPLAIN SELECT * FROM tenk1;
QUERY PLAN
-------------------------------------------------------------
Seq Scan on tenk1 (cost=0.00..445.00 rows=10000 width=244)</PRE
><P>
How the planner determines the cardinality of <CODE
CLASS="CLASSNAME"
>tenk1</CODE
>
is covered in <A
HREF="performance-tips.html#USING-EXPLAIN"
>Section 13.1</A
>, but is repeated here for
completeness. The number of rows is looked up from
<CODE
CLASS="CLASSNAME"
>pg_class</CODE
>:
</P><PRE
CLASS="PROGRAMLISTING"
>SELECT reltuples, relpages FROM pg_class WHERE relname = 'tenk1';
relpages | reltuples
----------+-----------
345 | 10000</PRE
><P>
The planner will check the <TT
CLASS="STRUCTFIELD"
>relpages</TT
>
estimate (this is a cheap operation) and if incorrect may scale
<TT
CLASS="STRUCTFIELD"
>reltuples</TT
> to obtain a row estimate. In this
case it does not, thus:
</P><PRE
CLASS="PROGRAMLISTING"
>rows = 10000</PRE
><P>
</P
><P
> let's move on to an example with a range condition in its
<TT
CLASS="LITERAL"
>WHERE</TT
> clause:
</P><PRE
CLASS="PROGRAMLISTING"
>EXPLAIN SELECT * FROM tenk1 WHERE unique1 < 1000;
QUERY PLAN
------------------------------------------------------------
Seq Scan on tenk1 (cost=0.00..470.00 rows=1031 width=244)
Filter: (unique1 < 1000)</PRE
><P>
The planner examines the <TT
CLASS="LITERAL"
>WHERE</TT
> clause condition:
</P><PRE
CLASS="PROGRAMLISTING"
>unique1 < 1000</PRE
><P>
and looks up the restriction function for the operator
<TT
CLASS="LITERAL"
><</TT
> in <CODE
CLASS="CLASSNAME"
>pg_operator</CODE
>.
This is held in the column <TT
CLASS="STRUCTFIELD"
>oprrest</TT
>,
and the result in this case is <CODE
CLASS="FUNCTION"
>scalarltsel</CODE
>.
The <CODE
CLASS="FUNCTION"
>scalarltsel</CODE
> function retrieves the histogram for
<TT
CLASS="STRUCTFIELD"
>unique1</TT
> from <CODE
CLASS="CLASSNAME"
>pg_statistics</CODE
>
- we can follow this by using the simpler <CODE
CLASS="CLASSNAME"
>pg_stats</CODE
>
view:
</P><PRE
CLASS="PROGRAMLISTING"
>SELECT histogram_bounds FROM pg_stats
WHERE tablename='tenk1' AND attname='unique1';
histogram_bounds
------------------------------------------------------
{1,970,1943,2958,3971,5069,6028,7007,7919,8982,9995}</PRE
><P>
Next the fraction of the histogram occupied by <SPAN
CLASS="QUOTE"
>"< 1000"</SPAN
>
is worked out. This is the selectivity. The histogram divides the range
into equal frequency buckets, so all we have to do is locate the bucket
that our value is in and count <SPAN
CLASS="emphasis"
><I
CLASS="EMPHASIS"
>part</I
></SPAN
> of it and
<SPAN
CLASS="emphasis"
><I
CLASS="EMPHASIS"
>all</I
></SPAN
> of the ones before. The value 1000 is clearly in
the second (970 - 1943) bucket, so by assuming a linear distribution of
values inside each bucket we can calculate the selectivity as:
</P><PRE
CLASS="PROGRAMLISTING"
>selectivity = (1 + (1000 - bckt[2].min)/(bckt[2].max - bckt[2].min))/num_bckts
= (1 + (1000 - 970)/(1943 - 970))/10
= 0.1031</PRE
><P>
that is, one whole bucket plus a linear fraction of the second, divided by
the number of buckets. The estimated number of rows can now be calculated as
the product of the selectivity and the cardinality of
<CODE
CLASS="CLASSNAME"
>tenk1</CODE
>:
</P><PRE
CLASS="PROGRAMLISTING"
>rows = rel_cardinality * selectivity
= 10000 * 0.1031
= 1031</PRE
><P>
</P
><P
> Next let's consider an example with equality condition in its
<TT
CLASS="LITERAL"
>WHERE</TT
> clause:
</P><PRE
CLASS="PROGRAMLISTING"
>EXPLAIN SELECT * FROM tenk1 WHERE stringu1 = 'ATAAAA';
QUERY PLAN
----------------------------------------------------------
Seq Scan on tenk1 (cost=0.00..470.00 rows=31 width=244)
Filter: (stringu1 = 'ATAAAA'::name)</PRE
><P>
Again the planner examines the <TT
CLASS="LITERAL"
>WHERE</TT
> clause condition:
</P><PRE
CLASS="PROGRAMLISTING"
>stringu1 = 'ATAAAA'</PRE
><P>
and looks up the restriction function for <TT
CLASS="LITERAL"
>=</TT
>, which is
<CODE
CLASS="FUNCTION"
>eqsel</CODE
>. This case is a bit different, as the most
common values — <ACRONYM
CLASS="ACRONYM"
>MCV</ACRONYM
>s, are used to determine the
selectivity. Let's have a look at these, with some extra columns that will
be useful later:
</P><PRE
CLASS="PROGRAMLISTING"
>SELECT null_frac, n_distinct, most_common_vals, most_common_freqs FROM pg_stats
WHERE tablename='tenk1' AND attname='stringu1';
null_frac | 0
n_distinct | 672
most_common_vals | {FDAAAA,NHAAAA,ATAAAA,BGAAAA,EBAAAA,MOAAAA,NDAAAA,OWAAAA,BHAAAA,BJAAAA}
most_common_freqs | {0.00333333,0.00333333,0.003,0.003,0.003,0.003,0.003,0.003,0.00266667,0.00266667}</PRE
><P>
The selectivity is merely the most common frequency (<ACRONYM
CLASS="ACRONYM"
>MCF</ACRONYM
>)
corresponding to the third <ACRONYM
CLASS="ACRONYM"
>MCV</ACRONYM
> — 'ATAAAA':
</P><PRE
CLASS="PROGRAMLISTING"
>selectivity = mcf[3]
= 0.003</PRE
><P>
The estimated number of rows is just the product of this with the
cardinality of <CODE
CLASS="CLASSNAME"
>tenk1</CODE
> as before:
</P><PRE
CLASS="PROGRAMLISTING"
>rows = 10000 * 0.003
= 30</PRE
><P>
The number displayed by <TT
CLASS="COMMAND"
>EXPLAIN</TT
> is one more than this,
due to some post estimation checks.
</P
><P
> Now consider the same query, but with a constant that is not in the
<ACRONYM
CLASS="ACRONYM"
>MCV</ACRONYM
> list:
</P><PRE
CLASS="PROGRAMLISTING"
>EXPLAIN SELECT * FROM tenk1 WHERE stringu1 = 'xxx';
QUERY PLAN
----------------------------------------------------------
Seq Scan on tenk1 (cost=0.00..470.00 rows=15 width=244)
Filter: (stringu1 = 'xxx'::name)</PRE
><P>
This is quite a different problem, how to estimate the selectivity when the
value is <SPAN
CLASS="emphasis"
><I
CLASS="EMPHASIS"
>not</I
></SPAN
> in the <ACRONYM
CLASS="ACRONYM"
>MCV</ACRONYM
> list.
The approach is to use the fact that the value is not in the list,
combined with the knowledge of the frequencies for all of the
<ACRONYM
CLASS="ACRONYM"
>MCV</ACRONYM
>s:
</P><PRE
CLASS="PROGRAMLISTING"
>selectivity = (1 - sum(mvf))/(num_distinct - num_mcv)
= (1 - (0.00333333 + 0.00333333 + 0.003 + 0.003 + 0.003
+ 0.003 + 0.003 + 0.003 + 0.00266667 + 0.00266667))/(672 - 10)
= 0.001465</PRE
><P>
That is, add up all the frequencies for the <ACRONYM
CLASS="ACRONYM"
>MCV</ACRONYM
>s and
subtract them from one — because it is <SPAN
CLASS="emphasis"
><I
CLASS="EMPHASIS"
>not</I
></SPAN
> one
of these, and divide by the <SPAN
CLASS="emphasis"
><I
CLASS="EMPHASIS"
>remaining</I
></SPAN
> distinct values.
Notice that there are no null values so we don't have to worry about those.
The estimated number of rows is calculated as usual:
</P><PRE
CLASS="PROGRAMLISTING"
>rows = 10000 * 0.001465
= 15</PRE
><P>
</P
><P
> Let's increase the complexity to consider a case with more than one
condition in the <TT
CLASS="LITERAL"
>WHERE</TT
> clause:
</P><PRE
CLASS="PROGRAMLISTING"
>EXPLAIN SELECT * FROM tenk1 WHERE unique1 < 1000 AND stringu1 = 'xxx';
QUERY PLAN
-----------------------------------------------------------
Seq Scan on tenk1 (cost=0.00..495.00 rows=2 width=244)
Filter: ((unique1 < 1000) AND (stringu1 = 'xxx'::name))</PRE
><P>
An assumption of independence is made and the selectivities of the
individual restrictions are multiplied together:
</P><PRE
CLASS="PROGRAMLISTING"
>selectivity = selectivity(unique1 < 1000) * selectivity(stringu1 = 'xxx')
= 0.1031 * 0.001465
= 0.00015104</PRE
><P>
The row estimates are calculated as before:
</P><PRE
CLASS="PROGRAMLISTING"
>rows = 10000 * 0.00015104
= 2</PRE
><P>
</P
><P
> Finally we will examine a query that includes a <TT
CLASS="LITERAL"
>JOIN</TT
>
together with a <TT
CLASS="LITERAL"
>WHERE</TT
> clause:
</P><PRE
CLASS="PROGRAMLISTING"
>EXPLAIN SELECT * FROM tenk1 t1, tenk2 t2
WHERE t1.unique1 < 50 AND t1.unique2 = t2.unique2;
QUERY PLAN
-----------------------------------------------------------------------------------------
Nested Loop (cost=0.00..346.90 rows=51 width=488)
-> Index Scan using tenk1_unique1 on tenk1 t1 (cost=0.00..192.57 rows=51 width=244)
Index Cond: (unique1 < 50)
-> Index Scan using tenk2_unique2 on tenk2 t2 (cost=0.00..3.01 rows=1 width=244)
Index Cond: ("outer".unique2 = t2.unique2)</PRE
><P>
The restriction on <CODE
CLASS="CLASSNAME"
>tenk1</CODE
>
<SPAN
CLASS="QUOTE"
>"unique1 < 50"</SPAN
> is evaluated before the nested-loop join.
This is handled analogously to the previous range example. The restriction
operator for <TT
CLASS="LITERAL"
><</TT
> is <CODE
CLASS="FUNCTION"
>scalarlteqsel</CODE
>
as before, but this time the value 50 is in the first bucket of the
<TT
CLASS="STRUCTFIELD"
>unique1</TT
> histogram:
</P><PRE
CLASS="PROGRAMLISTING"
>selectivity = (0 + (50 - bckt[1].min)/(bckt[1].max - bckt[1].min))/num_bckts
= (0 + (50 - 1)/(970 - 1))/10
= 0.005057
rows = 10000 * 0.005057
= 51</PRE
><P>
The restriction for the join is:
</P><PRE
CLASS="PROGRAMLISTING"
>t2.unique2 = t1.unique2</PRE
><P>
This is due to the join method being nested-loop, with
<CODE
CLASS="CLASSNAME"
>tenk1</CODE
> being in the outer loop. The operator is just
our familiar <TT
CLASS="LITERAL"
>=</TT
>, however the restriction function is
obtained from the <TT
CLASS="STRUCTFIELD"
>oprjoin</TT
> column of
<CODE
CLASS="CLASSNAME"
>pg_operator</CODE
> - and is <CODE
CLASS="FUNCTION"
>eqjoinsel</CODE
>.
Additionally we use the statistical information for both
<CODE
CLASS="CLASSNAME"
>tenk2</CODE
> and <CODE
CLASS="CLASSNAME"
>tenk1</CODE
>:
</P><PRE
CLASS="PROGRAMLISTING"
>SELECT tablename, null_frac,n_distinct, most_common_vals FROM pg_stats
WHERE tablename IN ('tenk1', 'tenk2') AND attname='unique2';
tablename | null_frac | n_distinct | most_common_vals
-----------+-----------+------------+------------------
tenk1 | 0 | -1 |
tenk2 | 0 | -1 |</PRE
><P>
In this case there is no <ACRONYM
CLASS="ACRONYM"
>MCV</ACRONYM
> information for
<TT
CLASS="STRUCTFIELD"
>unique2</TT
> because all the values appear to be
unique, so we can use an algorithm that relies only on the number of
distinct values for both relations together with their null fractions:
</P><PRE
CLASS="PROGRAMLISTING"
>selectivity = (1 - null_frac1) * (1 - null_frac2) * min(1/num_distinct1, 1/num_distinct2)
= (1 - 0) * (1 - 0) * min(1/10000, 1/1000)
= 0.0001</PRE
><P>
This is, subtract the null fraction from one for each of the relations,
and divide by the maximum of the two distinct values. The number of rows
that the join is likely to emit is calculated as the cardinality of
Cartesian product of the two nodes in the nested-loop, multiplied by the
selectivity:
</P><PRE
CLASS="PROGRAMLISTING"
>rows = (outer_cardinality * inner_cardinality) * selectivity
= (51 * 10000) * 0.0001
= 51</PRE
><P>
</P
><P
> For those interested in further details, estimation of the number of rows in
a relation is covered in
<TT
CLASS="FILENAME"
>src/backend/optimizer/util/plancat.c</TT
>. The calculation
logic for clause selectivities is in
<TT
CLASS="FILENAME"
>src/backend/optimizer/path/clausesel.c</TT
>. The actual
implementations of the operator and join restriction functions can be found
in <TT
CLASS="FILENAME"
>src/backend/utils/adt/selfuncs.c</TT
>.
</P
></DIV
></DIV
><DIV
CLASS="NAVFOOTER"
><HR
ALIGN="LEFT"
WIDTH="100%"><TABLE
SUMMARY="Footer navigation table"
WIDTH="100%"
BORDER="0"
CELLPADDING="0"
CELLSPACING="0"
><TR
><TD
WIDTH="33%"
ALIGN="left"
VALIGN="top"
><A
HREF="bki-example.html"
ACCESSKEY="P"
>Prev</A
></TD
><TD
WIDTH="34%"
ALIGN="center"
VALIGN="top"
><A
HREF="index.html"
ACCESSKEY="H"
>Home</A
></TD
><TD
WIDTH="33%"
ALIGN="right"
VALIGN="top"
><A
HREF="appendixes.html"
ACCESSKEY="N"
>Next</A
></TD
></TR
><TR
><TD
WIDTH="33%"
ALIGN="left"
VALIGN="top"
>Example</TD
><TD
WIDTH="34%"
ALIGN="center"
VALIGN="top"
><A
HREF="internals.html"
ACCESSKEY="U"
>Up</A
></TD
><TD
WIDTH="33%"
ALIGN="right"
VALIGN="top"
>Appendixes</TD
></TR
></TABLE
></DIV
></BODY
></HTML
>