Switch to standard view
Sybase logo  
Sybase logo  
Products | About Sybase | Support



SQL Anywhere 5.x Cost-Based Query Optimizer
© Sybase, Workplace Database Division
Sybase, Inc.
6475 Christie Avenue, Emeryville, California, USA 94608
Phone: 800-8-SYBASE
SQL Anywhere Cost-Based Query Optimizer

Overview

The primary goal of the optimizer is to determine the fastest method of executing a given query. This method is called the optimal access plan. The SQL Anywhere optimizer performs two distinct phases of optimization on each query. The first phase, called query rewrite optimization, transforms a query into a semantically equivalent form if the transformation is likely to assist in the discovery of the optimal access plan. The second phase, termed access plan generation, computes the optimal access plan through a cost-based analysis of join strategies. While there are several different possible techniques for retrieving data from a single table, in the case of one or more joins there are additional strategies that involve accessing the tables in a different order. The SQL Anywhere query optimizer considers a join strategy for each of the possible table combinations. For example, if there are n tables in a query, then the optimizer analyzes n ! (n factorial) join strategies. The optimizer uses heuristics (educated guesses) to determine the best use of table keys and indexes, given the order of table accesses in the join strategy being evaluated. The optimal access plan generated by this second phase is an optimized ordering of operations, including the evaluation of any nested subqueries, that will compute the query's result over the specified set of tables.

The total execution time estimated for the execution of an access plan is termed its cost. This cost encompasses the time required for both disk accesses and CPU processing. In order to estimate the cost of an access plan, it is necessary to know how many records will be retrieved by a query. The following techniques are used to estimate how many rows of a single table will satisfy a predicate.

    Indexes: There are two types of indexes in SQL Anywhere. A primary key index of a table includes index entries for each row and index entries for the foreign keys from other tables that refer to that row's primary key value. Other, non-primary key indexes simply contain index entries for the indexed columns and are explicitly created by the Database Administrator. SQL Anywhere's optimizer has a sophisticated estimation technique for use with indexed columns. It is able to determine the number of rows in a table that satisfy a search condition with excellent accuracy.

    Data characteristics: SQL Anywhere has the unique capability to compute column and table statistics during query processing and store them in the database. Therefore, even if you have not created an index on column x, the optimizer is able to more accurately calculate predicate selectivities for subsequent queries that contain an equality (=) condition on column x.

    Default estimates: If the optimizer needs to estimate the selectivity of a predicate that includes a non-indexed column for which no prior statistics are available, it estimates the number of records satisfying the selection condition using a set of default selectivity estimates based on a uniform distribution of random data.

    User override: Finally, the application designer may override all of the previous alternatives. If the designer knows a great deal about the distribution of values in the database, then it may be advisable for the designer to explicitly declare the probability of a predicate being true.
     

Using these techniques, the optimizer is able to determine the number of rows from a single table that will satisfy the query. When multiple tables are involved, a different technique may be used for each table, depending on the order of the tables in the join strategy. These techniques enable the optimizer to eliminate irrelevant rows as early as possible, thereby reducing the number of unnecessary disk accesses. Furthermore, the optimizer will reorder the evaluation of predicates in the "where" clause so that the predicate that eliminates the greatest number of rows is done first. An ideal access plan will only access rows that contribute to the query's result.

Although indexes are an excellent means of increasing performance, they must be a well-designed element of the database in order to balance the reduced execution time of queries with the additional overhead indexes add to update operations. An excessive number of indexes may result in slower performance. The overhead of the optimizer itself is another consideration, especially as the number of tables increases. To help address this problem the SQL Anywhere query optimizer includes a time-out feature, which ensures that the time required to generate the optimal access plan is less than the time required to execute it. In addition, SQL Anywhere takes advantage of a proprietary algorithm that eliminates obviously poor join strategies, thereby dramatically decreasing optimization overhead.

1. Introduction
There are many ways to improve the performance of a database. Upgrading hardware is one possibility. It is also important to ensure that the system's configuration eliminates memory swapping. Optimizing the cache size and page size can significantly reduce execution time. In addition, how the query itself is expressed can affect performance. SQL Anywhere's The following are trademarks, registered trademarks, or service marks of their respective companies: SQL Anywhere and Watcom SQL are trademarks of Sybase, Inc. or its subsidiaries; DB2 is a trademark of IBM Corporation; Oracle is a trademark of Oracle Corporation.

cost-based query optimizer addresses this latter issue. The optimizer is designed to translate queries into an efficient access plan that for the most part is independent of the query's syntax. Using its knowledge of the data and a thorough analysis of possible access plans, the expert experience built into the optimizer can improve query performance considerably, enabling you to concentrate on application development tasks rather than the performance of individual queries. If you would like to improve the performance of your database, it is best to address the issues of cache size, page size and hardware first, as they typically offer the most dramatic improvements. This Sybase White Paper focuses on the workings of the SQL Anywhere optimizer. It will explain how the optimizer evaluates the many different ways of executing a query and how it selects the optimal access plan.

This paper assumes that you understand the concepts of relational databases, including knowledge of tables, columns, keys and indexes. It also assumes that you have a working knowledge of SQL (Structured Query Language) and of the concepts of disk I/O. If you are not familiar with the details of indexing, you may find the overview in Appendix A helpful. A diagram of the sademo database schema can be found in Appendix B. All examples use Interactive SQL (ISQL), included with SQL Anywhere, and the demonstration database sademo.db. The examples can be run on any platform supported by SQL Anywhere. Refer to Figure 1 for details on executing the examples and a summary of the ISQL display. All text in courier represent SQL Anywhere commands or queries.
 

Figure 1b: ISQL Summary

Data Window: Displays data selected by
the query.

Statistics Window: Displays the selected
Query Plan (described below) for each
executed query.

Command Window: Enter SQL commands
and DBA Toolbox Commands.
(Ctrl-arrow up/down scrolls through
previous commands or use
Ctrl-R for Command Recall.)

Figure 1a: To execute the examples (under Windows):
 

1) Ensure you have a copy of sademo.db on your hard drive.
2) Open 'ISQL' from the 'Sybase SQL Anywhere 5.0' program group.
3) Select 'Connect' from the 'Command' Menu.
4) Enter: 'dba' for the user.
'sql' for the password.
'{path}\sademo.db' for the database location.
 

2. Rewrite Optimization

3. Access Plan Generation
    The second phase of optimization, termed access plan generation, computes the optimal access plan through a cost-based analysis of join strategies. While there are several different possible techniques for retrieving data from a single table, in the case of one or more joins there are additional strategies that involve accessing the tables in a different order. The SQL Anywhere query optimizer considers a join strategy for each of the possible table combinations. For example, if there are n tables in a query, then the optimizer analyzes n! (n factorial) join strategies (see Example 3). The optimizer uses heuristics (educated guesses) to determine the best use of table keys and indexes, given the order of table accesses in the join strategy being evaluated. The optimal access plan generated by this second phase is an optimized ordering of operations, possibly including subqueries, that will compute the query's result over the specified set of tables. Over time, as the optimizer learns more about the characteristics of the data in the database, the SQL Anywhere optimizer may choose a different access plan for execution because its estimates are more exact (see Section 4.2) or because the data distribution has changed due to update operations.
Example 3:

Query: 'List all sales orders by product name'
select s.id, s.line_id, s.ship_date, p.name, p.quantity, s.quantity
from product p join sales_order_items s
order by p.name

Statistics:
Estimated 1102 rows in query (I/O estimate 86)
PLAN> p (ix_prod_name), s (ky_prod_id)

There are two ways to implement this query in regards to the order of the tables: product then sales_order_items, or sales_order_items then product. In this case, it is more efficient to access the product table first by its product name index (to satisfy the Order by clause), and then to access the sales_order_items table by product ID.

Example 4:

Query: 'List all employees'
select * from employee

Statistics:
75 rows in query (I/O estimate 13)
PLAN> employee (seq)

When all of the records from the employee table are selected, the optimizer knows that 75 rows will be retrieved, requiring 13 disk accesses. The optimal access plan in this case is simply a sequential scan of the employee table.

    When an SQL statement is entered in the Command Window, the Statistics Window displays the chosen access plan. The first line indicates the estimated number of records to be retrieved and also the estimated cost of the query in terms of the number of disk accesses (see Section 4, 'Estimating the Cost'). The second line indicates the optimized table order and the retrieval method (in brackets) for each table (see Examples 3 and 4 below).

    Occasionally, a query will span a large number of tables which could result in an excessive number of possible access plans. In this case the SQL Anywhere optimizer improves optimization performance on such queries by eliminating obviously poor table combinations (see Section 6, 'Overhead').

4. Estimating the cost
    The estimate of the total execution time required to execute an access plan is termed its cost. This cost encompasses the length of time for both disk accesses and processing requirements. The SQL Anywhere Optimizer estimates the number of disk accesses and translates the processing time into an equivalent disk access time. In order to determine the disk access requirements, it is necessary to know how many records will be retrieved by the query. The following techniques are used to estimate how many rows of a single table will satisfy the query. Section 5, 'Access Plan Selection', will explain how these estimates are used.
4.1. Indexes
    There are two types of indexes in SQL Anywhere. A primary key index of a table includes index entries for each row and index entries for the foreign keys from other tables that refer to that row's primary key value. Thus a primary key index not only serves to index rows by primary key value but also serves as the mechanism to enforce referential integrity constraints. Other, non-primary key indexes simply contain index entries for the indexed columns and can be created explicitly by the Database Administrator or by anyone with RESOURCE authority, including the owner of the table.
Example 5:

Query: 'List all sales orders for customers with ID > 150'
select *
from sales_order o
where o.cust_id > 150

Statistics:
Estimated 298 rows in query (I/O estimate 36)
PLAN>o (seq)

This example illustrates the accuracy of the optimizer when an index is used. In this case, even though the optimizer chose a sequential scan as the most efficient access plan, the optimizer used the index on customer IDs in the sales_order table (ix_sales_cust) to estimate the number of rows in the result. The actual number is 329.

    The optimizer has a sophisticated estimation technique for use with an indexed column. It is able to determine, with excellent accuracy, the number of records in a table.
     
4.2. Learning the characteristics of data

Example 6:

Query: 'List all employees who live in Texas'
select * from employee
where state = 'TX'

Statistics:
Estimated 1 rows in query (I/O estimate 9)
PLAN> employee (seq)

This example demonstrates the optimizer's ability to learn about data characteristics during query processing. Because the data has not been previously retrieved the optimizer predicts that there will be only one record that satisfies the equality predicate. In fact, five records were actually retrieved.

Query:
select emp_lname
from employee
where state = 'TX'
Statistics:
Estimated 5 rows in query (I/O estimate 9)
PLAN> employee (seq)

After the first query has run, and all the rows in the result have been retrieved, the SQL Anywhere optimizer analyzes its processing in an effort to learn about the characteristics of the `state' column in the employee table. When a subsequent query involving the same condition on the state column is executed, the optimizer recalls how many records satisfied the previous condition and optimizes the new query accordingly. Notice that the second time the optimizer accurately estimates that five rows will be retrieved.

    SQL Anywhere has the unique capability to modify its estimated cost of an access plan as queries are processed by analyzing the data as it is retrieved. In other words, even if you have not created an index on column x, the optimizer is able to more accurately calculate predicate selectivities for subsequent queries that contain an equality (=) condition on column x (see Example 6).
     
    Once the optimizer has gathered statistics about some data, it is able to improve the efficiency of any subsequent query that involves that data, as opposed to simple repetitions of the original query. SQL Anywhere employs persistent storage of these statistics so that when the engine is stopped, the statistics are written to disk. Upon subsequent startup, the statistics are loaded into cache.

    In addition, if the database is being accessed by multiple users, the improved efficiency is available to all users, as opposed to only the client that executed the original query.


4.3 Search Conditions
 

Comparison Percentage
= 5
<> 95
<, <=, >, >= 25
between 6

If the optimizer needs to estimate the selectivity of a predicate that includes a non-indexed column for which no prior statistics are available, it estimates the number of records satisfying the selection condition using a set of default selectivity estimates based on a uniform distribution of random data. For example, if a search predicate with a (>) operator is specified, the optimizer will estimate that 25% of the records will satisfy that predicate (see Example 7). The (=) operator percentage applies only to those conditions that do not directly involve a column (see Example 8).

          Note that these estimates are only default values and may be altered based on your knowledge of the data (see Section 4.4, 'User Override').
Example 8:

Query: 'List all employees born in February'
select * from employee
where month(birth_date) = 2

Statistics:
Estimated 3 rows in query (I/O estimate 9)
PLAN> employee (seq)

This example also demonstrates distribution approximations, where no indexes are available. When a query includes the (=) operator on expressions, as opposed to column names, the optimizer estimates that 5% of the records will satisfy the comparison. Here, the optimizer estimates that 3 of 75 employees were born in February. This is close to the true value of 4.

Example 7:

Query: 'List all employees whose salary exceeds $55,000 per annum'
select * from employee
where salary > 55000

Statistics:
Estimated 18 rows in query (I/O estimate 9)
PLAN> employee (seq)

This example demonstrates the use of distribution approximations when no indexes are available. When a query includes the (>) operator, the optimizer estimates that 25% of the records will satisfy the search predicate. The employee table consists of 75 rows, so it estimates that 18 employees will have a salary that exceeds $55,000. This is close to the true number of 22.
 

4.4 User Override

Example 9:

Query: 'List all sales order item details for quantity-ordered > 50'
select * from sales_order_items
where quantity > 50

Statistics:
Estimated 274 rows in query (I/O estimate 26)
PLAN> sales_order_items (seq)

This example illustrates a poor selectivity estimate using SQL Anywhere's default estimates. The distribution percentage suggests that 25% of the records will satisfy the query. Since the table has 1097 rows, the optimizer estimates that 274 rows will be in the result. In reality, far fewer than 25% of the products ordered are for large quantities; a more accurate estimate would be about 5%.

Command:
select * from sales_order_items
where (quantity > 50, 5)

Statistics:
Estimated 54 rows in Query (I/O estimate 26)
PLAN> sales_order_items (seq)

Here, the query is run again with a user estimate of 5%, instead of the default estimate of 25%. The estimated number of rows in the result is now 54, which is very close to the true value of 60.

    Finally, the application designer may override all of the previous alternatives. If the designer knows a great deal about the distribution of values in the database, then it may be advisable to declare the probability of a predicate being true. The SQL Anywhere optimizer will subsequently use this estimate to determine the number of records that will be retrieved, which may result in a different choice of access plans. Example 9 illustrates how to declare a user estimate.

    Users should rarely need to use this feature. If a database has been designed well with appropriate indexes, there is no need for a designer to provide any additional information regarding distribution of data. The optimizer uses index cardinality and the persistent storage of statistics to provide this information.

    Application designers using queries containing coded estimates should re-evaluate such queries periodically to ensure that the access plan chosen by the optimizer due to the override is still optimal.
     

5. Access Plan Selection
    Using the techniques outlined in the previous sections, the optimizer estimates the number of rows from a single table that will satisfy a given query. When multiple tables are involved, a different technique may be applied to each table, depending on the order of the tables in the join strategy. This enables the optimizer to eliminate irrelevant rows as early as possible, thereby reducing the number of unnecessary disk accesses. The optimizer will reorder the atomic conditions in the "where" clause to correspond with its estimates, so that conditions that reject the largest number of rows are considered first. An ideal access plan will only retrieve rows that contribute to the query result. Where a choice of access plans is possible, the SQL Anywhere optimizer will discourage the use of temporary tables in favor of using an index. This is done by making access plans that contain temporary tables artificially expensive.
    Using an index will typically provide faster response time (see Example 10). Examples 11 and 12 show how the estimation techniques for single tables help the optimizer to wisely select the best access plan for multi-table queries. Performance improvements are noted for each. The examples were run using the standalone version of SQL Anywhere, using the default cache size of two megabytes, on a Pentium PC running the Windows/NT 3.51 operating system. The database was stopped and started between each query to eliminate the effects of the cache on subsequent queries.
Example 10:

Query: 'List all sales order items by order quantity'
select * from sales_order_items
order by quantity

Statistics:
1097 rows in query (I/O estimate 959)
PLAN> TEMPORARY TABLE sales_order_items (seq)

This example demonstrates how indexes can improve efficiency. When no index is available, the optimizer uses a temporary table to sort the values in the correct order. This query required 1.67 seconds to execute, but the entire table was retrieved before any result row was displayed, since the row to display first depends on the outcome of the sort step. In this case, the row with the smallest quantity was fetched after 0.34 seconds.

Alternatively, one could index the quantity column in the sales_order_items table to retrieve the rows in sorted order directly.

Command:
create index sold_qty on sales_order_items (quantity)

Query:
select * from sales_order_items
order by quantity

Statistics:
1097 rows in query (I/O estimate 78)
PLAN> sales_order_items (sold_qty)

When the query is run again, using the sold_qty index, the total execution time is nearly identical. However, since the access plan uses an index instead of a temporary table the first row in the result is available almost instantaneously. For more information about the indexes used in SQL Anywhere, refer to Appendix A.
 

6. Overhead

6.1. Indexes

    Indexes are an excellent means of increasing performance. However, they must be a well-designed element of the database in order to balance the reduced execution time of queries with the additional overhead of index maintenance during updates. If a row in a table is deleted, each index on that table must be updated to remove the index entries that point to that row. Similarly, on insertions an index entry must be created in each index on that table. For updates, index maintenance is required on each index that contains one or more columns modified by the update statement. High frequencies of insertions and deletions will also result in additional processing to maintain the balance of the B+-tree. Users must keep this maintenance overhead in mind when designing their databases and applications to ensure optimal performance.
6.2. Optimizer
    Optimizer overhead is another consideration, especially as the number of tables increase. If there are only 4 tables, then 24 join strategies are theoretically possible. However, if there are 8 tables, then there are a possible 40,320 join strategies. Consider a 12 table query where there are 479,001,600 possible join strategies. Creating and analyzing this many strategies requires a great deal of processing, and so the SQL Anywhere query optimizer includes a number of features to reduce this overhead. First, access plans are cached so that the cost of optimization can be amortized over multiple (identical) queries. Second, the optimizer includes a time-out feature that ensures that the optimizer does not require more processing than it can reduce. Third, SQL Anywhere takes advantage of a proprietary algorithm that eliminates obviously poor join strategies, thereby dramatically decreasing processing requirements.
Appendix A - Indexes in Sql Anywhere
Appendix B - Sademo Database Structure


 

Bibliography

Comer, D. (June 1979). The Ubiquitous B-Tree. ACM Computing Surveys 11(2), pp. 121-137.

Sybase Inc. (December 1995). Sybase SQL Anywhere User's Guide, Volumes I and II.



[#]Home  [*]Top

© Copyright 2008, Sybase Inc.