Sybase, Inc.
6475 Christie Avenue, Emeryville, California, USA 94608
Phone: 800-8-SYBASE
Overview
The primary goal of the optimizer is to determine the fastest method of executing a given query. This method is called the optimal access plan. The SQL Anywhere optimizer performs two distinct phases of optimization on each query. The first phase, called query rewrite optimization, transforms a query into a semantically equivalent form if the transformation is likely to assist in the discovery of the optimal access plan. The second phase, termed access plan generation, computes the optimal access plan through a cost-based analysis of join strategies. While there are several different possible techniques for retrieving data from a single table, in the case of one or more joins there are additional strategies that involve accessing the tables in a different order. The SQL Anywhere query optimizer considers a join strategy for each of the possible table combinations. For example, if there are n tables in a query, then the optimizer analyzes n ! (n factorial) join strategies. The optimizer uses heuristics (educated guesses) to determine the best use of table keys and indexes, given the order of table accesses in the join strategy being evaluated. The optimal access plan generated by this second phase is an optimized ordering of operations, including the evaluation of any nested subqueries, that will compute the query's result over the specified set of tables.1. IntroductionThe total execution time estimated for the execution of an access plan is termed its cost. This cost encompasses the time required for both disk accesses and CPU processing. In order to estimate the cost of an access plan, it is necessary to know how many records will be retrieved by a query. The following techniques are used to estimate how many rows of a single table will satisfy a predicate.
Indexes: There are two types of indexes in SQL Anywhere. A primary key index of a table includes index entries for each row and index entries for the foreign keys from other tables that refer to that row's primary key value. Other, non-primary key indexes simply contain index entries for the indexed columns and are explicitly created by the Database Administrator. SQL Anywhere's optimizer has a sophisticated estimation technique for use with indexed columns. It is able to determine the number of rows in a table that satisfy a search condition with excellent accuracy.
Using these techniques, the optimizer is able to determine the number of rows from a single table that will satisfy the query. When multiple tables are involved, a different technique may be used for each table, depending on the order of the tables in the join strategy. These techniques enable the optimizer to eliminate irrelevant rows as early as possible, thereby reducing the number of unnecessary disk accesses. Furthermore, the optimizer will reorder the evaluation of predicates in the "where" clause so that the predicate that eliminates the greatest number of rows is done first. An ideal access plan will only access rows that contribute to the query's result.Data characteristics: SQL Anywhere has the unique capability to compute column and table statistics during query processing and store them in the database. Therefore, even if you have not created an index on column x, the optimizer is able to more accurately calculate predicate selectivities for subsequent queries that contain an equality (=) condition on column x.
Default estimates: If the optimizer needs to estimate the selectivity of a predicate that includes a non-indexed column for which no prior statistics are available, it estimates the number of records satisfying the selection condition using a set of default selectivity estimates based on a uniform distribution of random data.
User override: Finally, the application designer may override all of the previous alternatives. If the designer knows a great deal about the distribution of values in the database, then it may be advisable for the designer to explicitly declare the probability of a predicate being true.
Although indexes are an excellent means of increasing performance, they must be a well-designed element of the database in order to balance the reduced execution time of queries with the additional overhead indexes add to update operations. An excessive number of indexes may result in slower performance. The overhead of the optimizer itself is another consideration, especially as the number of tables increases. To help address this problem the SQL Anywhere query optimizer includes a time-out feature, which ensures that the time required to generate the optimal access plan is less than the time required to execute it. In addition, SQL Anywhere takes advantage of a proprietary algorithm that eliminates obviously poor join strategies, thereby dramatically decreasing optimization overhead.
There are many ways to improve the performance of a database. Upgrading hardware is one possibility. It is also important to ensure that the system's configuration eliminates memory swapping. Optimizing the cache size and page size can significantly reduce execution time. In addition, how the query itself is expressed can affect performance. SQL Anywhere's The following are trademarks, registered trademarks, or service marks of their respective companies: SQL Anywhere and Watcom SQL are trademarks of Sybase, Inc. or its subsidiaries; DB2 is a trademark of IBM Corporation; Oracle is a trademark of Oracle Corporation.Figure 1b: ISQL Summarycost-based query optimizer addresses this latter issue. The optimizer is designed to translate queries into an efficient access plan that for the most part is independent of the query's syntax. Using its knowledge of the data and a thorough analysis of possible access plans, the expert experience built into the optimizer can improve query performance considerably, enabling you to concentrate on application development tasks rather than the performance of individual queries. If you would like to improve the performance of your database, it is best to address the issues of cache size, page size and hardware first, as they typically offer the most dramatic improvements. This Sybase White Paper focuses on the workings of the SQL Anywhere optimizer. It will explain how the optimizer evaluates the many different ways of executing a query and how it selects the optimal access plan.
This paper assumes that you understand the concepts of relational databases, including knowledge of tables, columns, keys and indexes. It also assumes that you have a working knowledge of SQL (Structured Query Language) and of the concepts of disk I/O. If you are not familiar with the details of indexing, you may find the overview in Appendix A helpful. A diagram of the sademo database schema can be found in Appendix B. All examples use Interactive SQL (ISQL), included with SQL Anywhere, and the demonstration database sademo.db. The examples can be run on any platform supported by SQL Anywhere. Refer to Figure 1 for details on executing the examples and a summary of the ISQL display. All text in courier represent SQL Anywhere commands or queries.
Data Window: Displays data selected by
the query.
Statistics Window: Displays the selected
Query Plan (described below) for each
executed query.
| Comparison | Percentage |
| = | 5 |
| <> | 95 |
| <, <=, >, >= | 25 |
| between | 6
If the optimizer needs to estimate the selectivity of a predicate that includes a non-indexed column for which no prior statistics are available, it estimates the number of records satisfying the selection condition using a set of default selectivity estimates based on a uniform distribution of random data. For example, if a search predicate with a (>) operator is specified, the optimizer will estimate that 25% of the records will satisfy that predicate (see Example 7). The (=) operator percentage applies only to those conditions that do not directly involve a column (see Example 8). |
- Note that these estimates are only default values and may be altered
based on your knowledge of the data (see Section 4.4, 'User Override').
Query: 'List all employees born in February'
select * from employee
where month(birth_date) = 2
Statistics:
Estimated 3 rows in query (I/O estimate 9)
PLAN> employee (seq)
This example also demonstrates distribution approximations, where no indexes are available. When a query includes the (=) operator on expressions, as opposed to column names, the optimizer estimates that 5% of the records will satisfy the comparison. Here, the optimizer estimates that 3 of 75 employees were born in February. This is close to the true value of 4.
Example 7:
Query: 'List all employees whose salary exceeds
$55,000 per annum'
select * from employee
where salary > 55000
Statistics:
Estimated 18 rows in query (I/O estimate 9)
PLAN> employee (seq)
This example demonstrates the use of distribution approximations
when no indexes are available. When a query includes the (>) operator,
the optimizer estimates that 25% of the records will satisfy the search
predicate. The employee table consists of 75 rows, so it estimates that
18 employees will have a salary that exceeds $55,000. This is close to
the true number of 22.
4.4 User Override
Example 9:
Query: 'List all sales order item details for quantity-ordered
> 50'
select * from sales_order_items
where quantity > 50
Statistics:
Estimated 274 rows in query (I/O estimate 26)
PLAN> sales_order_items (seq)
This example illustrates a poor selectivity estimate using SQL Anywhere's default estimates. The distribution percentage suggests that 25% of the records will satisfy the query. Since the table has 1097 rows, the optimizer estimates that 274 rows will be in the result. In reality, far fewer than 25% of the products ordered are for large quantities; a more accurate estimate would be about 5%.
Command:
select * from sales_order_items
where (quantity > 50, 5)
Statistics:
Estimated 54 rows in Query (I/O estimate 26)
PLAN> sales_order_items (seq)
Here, the query is run again with a user estimate of 5%, instead of the default estimate of 25%. The estimated number of rows in the result is now 54, which is very close to the true value of 60.
- Finally, the application designer may override all of the previous
alternatives. If the designer knows a great deal about the distribution
of values in the database, then it may be advisable to declare the probability
of a predicate being true. The SQL Anywhere optimizer will subsequently
use this estimate to determine the number of records that will be retrieved,
which may result in a different choice of access plans. Example 9 illustrates
how to declare a user estimate.
Users should rarely need to use this feature. If a database has been designed well with appropriate indexes, there is no need for a designer to provide any additional information regarding distribution of data. The optimizer uses index cardinality and the persistent storage of statistics to provide this information.
Application designers using queries containing coded estimates should
re-evaluate such queries periodically to ensure that the access plan chosen
by the optimizer due to the override is still optimal.
- Using the techniques outlined in the previous sections, the optimizer
estimates the number of rows from a single table that will satisfy a given
query. When multiple tables are involved, a different technique may be
applied to each table, depending on the order of the tables in the join
strategy. This enables the optimizer to eliminate irrelevant rows as early
as possible, thereby reducing the number of unnecessary disk accesses.
The optimizer will reorder the atomic conditions in the "where" clause
to correspond with its estimates, so that conditions that reject the largest
number of rows are considered first. An ideal access plan will only retrieve
rows that contribute to the query result. Where a choice of access plans
is possible, the SQL Anywhere optimizer will discourage the use of temporary
tables in favor of using an index. This is done by making access plans
that contain temporary tables artificially expensive.
Using an index will typically provide faster response time (see Example 10). Examples 11 and 12 show how the estimation techniques for single tables help the optimizer to wisely select the best access plan for multi-table queries. Performance improvements are noted for each. The examples were run using the standalone version of SQL Anywhere, using the default cache size of two megabytes, on a Pentium PC running the Windows/NT 3.51 operating system. The database was stopped and started between each query to eliminate the effects of the cache on subsequent queries.
Query: 'List all sales order items by order quantity'
select * from sales_order_items
order by quantity
Statistics:
1097 rows in query (I/O estimate 959)
PLAN> TEMPORARY TABLE sales_order_items (seq)
This example demonstrates how indexes can improve efficiency. When no index is available, the optimizer uses a temporary table to sort the values in the correct order. This query required 1.67 seconds to execute, but the entire table was retrieved before any result row was displayed, since the row to display first depends on the outcome of the sort step. In this case, the row with the smallest quantity was fetched after 0.34 seconds.
Alternatively, one could index the quantity column in the sales_order_items table to retrieve the rows in sorted order directly.
Command:
create index sold_qty on sales_order_items
(quantity)
Query:
select * from sales_order_items
order by quantity
Statistics:
1097 rows in query (I/O estimate 78)
PLAN> sales_order_items (sold_qty)
When the query is run again, using the sold_qty index,
the total execution time is nearly identical. However, since the access
plan uses an index instead of a temporary table the first row in the result
is available almost instantaneously. For more information about the indexes
used in SQL Anywhere, refer to Appendix A.
6. Overhead
6.1. Indexes
- Indexes are an excellent means of increasing performance. However,
they must be a well-designed element of the database in order to balance
the reduced execution time of queries with the additional overhead of index
maintenance during updates. If a row in a table is deleted, each index
on that table must be updated to remove the index entries that point to
that row. Similarly, on insertions an index entry must be created in each
index on that table. For updates, index maintenance is required on each
index that contains one or more columns modified by the update statement.
High frequencies of insertions and deletions will also result in additional
processing to maintain the balance of the B+-tree. Users must keep this
maintenance overhead in mind when designing their databases and applications
to ensure optimal performance.
- Optimizer overhead is another consideration, especially as the number
of tables increase. If there are only 4 tables, then 24 join strategies
are theoretically possible. However, if there are 8 tables, then there
are a possible 40,320 join strategies. Consider a 12 table query where
there are 479,001,600 possible join strategies. Creating and analyzing
this many strategies requires a great deal of processing, and so the SQL
Anywhere query optimizer includes a number of features to reduce this overhead.
First, access plans are cached so that the cost of optimization can be
amortized over multiple (identical) queries. Second, the optimizer includes
a time-out feature that ensures that the optimizer does not require more
processing than it can reduce. Third, SQL Anywhere takes advantage of a
proprietary algorithm that eliminates obviously poor join strategies, thereby
dramatically decreasing processing requirements.
- Like other commercial database systems including Oracle and DB2,
Sybase SQL Anywhere implements indexes by using B+-trees (Comer
1979). B+-trees are balanced search trees designed to work well on magnetic
disks or other direct-access secondary storage devices.
The diagram below illustrates the structure of a B+-tree. The lowest level, consisting of a set of leaf pages termed the sequence set, contains pointers to individual rows in the table along with the index 'key' of that row. The sequence set is kept ordered so that accessing rows in key sequence through the index involves only a sequential scan of the sequence set, plus a retrieval of the corresponding row for each index entry. With random indexed access, however, the higher levels of the index, termed the index set, are used to narrow the search for the index entry at each subsequent level. Each index set page contains several key-pointer pairs, the key being the highest key value of the lower-level page. The number of pages a page in the index set can point to is termed the index's fan-out. How many levels are required for a particular index depends on a number of factors, including the type of index, the size of the base table, the size of each index entry, and the page size of the database.