Switch to standard view 
  Sybase logo
 
 
 



Introduction¯Four Approaches to Surrogate Key Generation

Surrogate primary keys are the unique, system generated, row identifiers that are commonly used by database designers. The primary reasons for using surrogate primary keys are to create small, simple keys and to provide an audit trail. Small keys result in small indexes, narrow dependent tables having foreign keys, and joins that are easier to write and faster to execute. Additionally, surrogate primary keys can be used to provide an audit trail, such as in the case of invoice numbers.

These keys are system-generated, and this TechNote deals with the question of how best to generate surrogate primary keys. The question of how to generate these keys is non-trivial because Sybase SQL Server's page-level locking causes the key generation algorithm to significantly affect overall SQL Server performance. This is especially true in an insert-intensive OLTP (online transaction processing) environment.

The different algorithms for key generation can be grouped into four general approaches:

Most discussions on surrogate primary key generation address the issues of uniqueness, page-level locking, and concurrency. This TechNote attempts to go further by addressing the issue of the cache hit ratio and introducing an algorithm that has been found to promote both high concurrency and a high cache hit ratio in a highly concurrent environment.

Monotonic Series

This section describes ways of generating monotonic keys. Monotonic keys are progressive in the collating sequence, for example:

1, 2, 3, 4, ...
1, 3, 4, 8, ...

15 March 1994, 21 March 1994, 1 April 1994, ...

Methods

Methods which can be used to produce a monotonic series are:

These are described in the following sections.

Max Plus One

Perhaps the most apparent solution to surrogate primary key generation is to select the max from the data table and to add one to determine the next key value.

1> declare @next_key select @next_key =
2> max(key_column) + 1 from data_table
3> go
1> insert into data_table (key_column, ...)
2> values (@next_key, ...)
3> go
However, this algorithm does have a serious problem. The transaction is not atomic. Two concurrent transactions could execute the select and obtain the same next key value before either one executes the insert. This algorithm could only be considered appropriate for a low-intensity environment.

Enhanced Max Plus One

The first improvement to the "Max Plus One" algorithm is to include the holdlock keyword in the select statement.

1> declare @next_key select @next_key = 
2> max(key_column) + 1 from data_table
3> holdlock
4> go
1> insert into data_table (key_column, ...)
2> values (@next_key, ...)
3> go
The SQL shown above will generate unique keys provided there is a clustered index on the key.

However, this algorithm frequently causes deadlocks in a highly concurrent environment. If two concurrent processes execute this code, both acquire shared locks on the last page. One then attempts to obtain the exclusive lock on the last page required by the insert statement and becomes blocked by the shared lock of the other process. The second process then tries to acquire the exclusive lock required by the insert statement and a deadlock results.

Even with the holdlock enhancement, the "Max Plus One" algorithm is suitable only in a low-intensity environment.

Next Key Table

A more versatile approach is to introduce a separate table to store the next key value.

1> begin tran
2> declare @next_key
3> update next_key set next_key = next_key + 1 4> select @next_key = next_key + 1
5> from next_key
6> insert into data_table (key_column, ...)
7> values (@next_key, ...)
8> commit tran
9> go
The SQL is atomic and uniqueness is guaranteed. Notice that the update statement precedes the select statement. The update first acquires an exclusive lock on the table ensuring that no other processes can obtain a next key value from the table.

In the above example, the transaction includes both the generation of the key and the insert into the data table. The transaction is completely single-threaded. Consequently, there are no gaps in the sequences of key values inserted into the data table.

Identity Property

The identity property, as supplied in Sybase System 10, provides a series of sequential numbers to use as primary keys. The implementation and issues regarding gaps in the key sequence are described in your System 10 SQL Server documentation.

Concurrency Issues

Regardless of the algorithm used to generate the keys, all monotonic key algorithms have similar effects on concurrency and cache hit.

For a better understanding of the effects of the key generation scheme on concurrency, we must examine various indexing scenarios. These scenarios include the following:

No Index Concurrency

In the case where you have no index, the table acts as a heap and all inserts go to the last data page by default.

When more than one process tries to concurrently insert, the first process will block the second. The first process acquires an exclusive lock on the last data page of the table. The second tries to acquire an exclusive lock on this same page but is blocked until the first transaction has either been committed or rolled back.

This spot of contention is commonly referred to as a "hot spot."

Clustered Index Concurrency

If you create a clustered index on the surrogate primary key and there is still a similar concurrency situation. Because the series of key values generated is monotonic, all inserts are directed to the last page.

If the maximum key value is 100, the next key value generated is 101. An insert with a primary key value of 101 is directed to the last data page of the table, as is an insert of the row with a primary key of 102. Only one of the two processes is able to have the necessary exclusive lock on the page at any one time. Consequently in this case as well, concurrent inserts must wait for an exclusive lock.

Note that this situation applies for all algorithms that generate monotonic keys. When using the identity property, multiple processes can concurrently generate next key values without blocking each other. However, once two processes try to concurrently insert into the data table, they will block as described above.

Non-Clustered Index Concurrency

If you use a non-clustered index only on the surrogate primary key, the table acts as a heap. By default, all inserts go to the last data page. Consequently, there is still a troublesome hot spot and concurrent inserts must wait for an exclusive lock.

Clustered Index on Other Key Concurrency

Perhaps the most frequently heard advice on the issue of page- level locking and surrogate primary key generation is to cluster on a different column.

One possibility is to create a non-clustered index on the surrogate primary key and a clustered index on some other column or set of columns.

For example, a table of employees could have a non-clustered index on the employee ID and a clustered index on last name, first name, which would be useful for sorting alphabetically. In this case, inserts are spread across the data pages based upon the distribution of the clustered index. For example, one process might be inserting employee Abbott while another is inserting employee Wilson. The first employee's row would be inserted toward the front of the table and the second employee's would be inserted toward the end. This is so even if Abbot has an employee ID of 101 and Wilson an employee ID of 102. At first glance, this method appears to solve the concurrency problem.

However, while Abbott's and Wilson's rows are far apart on the data pages of the table, a single index page probably contains both employee identifiers and pointers to both of these data pages. That index page is the last leaf-level page of the non- clustered index. When the rows are inserted, the index page must be updated before the transaction is complete. This update to the index page requires an exclusive lock. Because both of these concurrent inserts must acquire an exclusive lock on the same index page, a hot spot exists.

The last-page contention arguments still apply. The hot spot has not gone away. It has moved from the last data page to the last non-clustered index page.

In summary, a monotonic approach might be effective only in an environment that rarely has concurrent processes inserting into the same table.

Pseudo-Random Numbers

One of the most common alternatives to the monotonic series is to use a pseudo-random key. For example:

1> create procedure get_next_key 
2> @next_key int output
3> as update next_seed
4> set next_seed = next_seed + 1
5> go
1> select @next_key = 16807 * 
2> (next_seed % 127773) - 2836 *
3> (next_seed / 127773)
4> go
1> if (@next_key <= 0) 
2> select @next_key = @next_key + 2147483547
3> go

Note
The above example is based upon an algorithm written by Malcolm Colton of Sybase (Malcolm Colton, Sybase Client Technical Services Technical Memo, December 1990). Park and Miller receive credit for the pseudo-random algorithm (Stephen K. Park and Keith W. Miller, "Random Number Algorithms: Good Ones Are Hard to Find," Comm. ACM, Vol. 31, Number 10, October 1988).

Next-Seed Table

To use this technique, one must use a next-seed table to seed the random number generator. In the previous example, the seed must be in the range of four-byte integers. Each unique seed value will produce a unique surrogate primary key in the four-byte range.

Concurrency

The pseudo-random technique certainly is an improvement in concurrency. Inserts are now spread across the data pages and index pages. The inserts have a random chance of colliding. If the table is large, as is the case for most OLTP databases, inserts rarely block each other.

While concurrency has been improved, the pseudo-random technique does have three significant disadvantages:

Comparing Monotonic and Pseudo-Random Cache Hit Ratios

Monotonic Caching

Under the monotonic key approach, a good cache hit ratio is promoted because few additional pages must be hit with each new insert. While it is true that concurrent inserts contend for an exclusive lock, the pages that must be repeatedly hit are more than likely to already be in the data cache. Each insert descends this index down the same path. This one path down to the leaf page can easily remain in cache.

Pseudo-Random Caching

As for randomness, its strength is its weakness. Having blocking occur only randomly is an improvement. However, randomly hitting a page in data cache is not ideal.

Each insert descends the index on the primary key randomly. Consequently, one insert will descend one path through the index while another insert will descend another path. While this prevents them from blocking, many more pages must be hit.

Many large OLTP databases have a 1 to 100 data cache to database size ratio. An approximate one percent chance of a page being cache results in a poor cache hit ratio.

Depending on the nature of the application, random keys might not result in an improvement in overall system performance. Therefore, a random approach would be most effective in an environment that is insert intensive but has a relatively large data cache.

Composite Key Approach

Are high concurrency and a high cache hit ratio mutually exclusive? Or is there a better approach?

A third alternative to surrogate primary key generation is the "composite key" approach. The surrogate primary key is a single column composed of multiple parts. The first part of the key is used to control the spread of the inserts throughout the data table and the second part is used to ensure uniqueness. The following example is credited to Hal Spitz of Sybase:

1> create procedure get_next_key @next_key char(20) 
2> output as select ltrim(rtrim(str(suser_id()))+
3> rtrim(convert(char(8),getdate(),112))+
4> rtrim(convert(char(4),datepart(hh,getdate())))+
5> rtrim(convert(char(4),datepart(mi,getdate())))+
5> rtrim(convert(char(4),datepart(ss,getdate())))+
6> rtrim(convert(char(4),datepart(ms,getdate()))))
In this example, the first part of the key is the SQL Server user ID. Other possible values include the client host process ID and SQL Server process ID. The first part of the key is used to improve concurrency by controlling the spread of the inserts.

Since this example uses the SQL Server user ID, all rows inserted by a given user would be grouped together. This grouping would occur on the data pages in the case of a clustered index on the key and on the leaf-level index pages in the case of a non-clustered index on the key. If each concurrent user logs in using a different SQL Server user ID, concurrent inserts are spread throughout the table, resulting in less contention.

The second part of the key in this example is based upon datetime because it provides a practically unique value. In addition to providing uniqueness, the second part of the key keeps the cache hit ratio from deteriorating.

Using the composite key approach, a given SQL Server user ID generates keys in a monotonic series. For example, SQL Server user ID 125 might generate the following values:

12519940411145439533 
12519940411145440666
12519940411145442466
This series causes each SQL Server user ID to consistently work in one area of the table or index. Consequently two consecutive inserts by the same SQL Server user ID will, more than likely, be directed to the same page in the database.

Disadvantages

The primary disadvantage to this approach is that you must use a larger key. The example uses a twenty-byte string rather than a four-byte integer as used by the previous example. As previously mentioned, one of the primary reasons for using a surrogate primary key is to have a small key that results in a small index. Because this approach requires more space for the same amount of information, a lesser amount of valuable information can be stored on the pages in the data cache.

A secondary disadvantage is that the datetime values are part of a continuous sequence¯not a discrete sequence. There is no way to look for or control gaps in the key sequence.

A better solution for the insert-intensive OLTP environment would do all of the following:

Recycled Series Approach

In the recycled series approach, each concurrent process has its own conceptual next-key table.

For example, three concurrent processes can acquire next key values as follows: One process can work on the sequential series from 1 to 1,000,000, the second can work on the series from 1,000,001 to 2,000,000, and the third from 2,000,001 to 3,000,000.

An analogy can be made to a diner and its five wait persons:

What happens if the diner hires a sixth wait person (a new concurrent process)? He or she is given a new pad of order forms.

What happens if a wait person quits (one less concurrent process)? The manager takes his or her pad of order forms and saves it.

What happens if the wait person is later replaced (a replaced concurrent process)? He or she is given the predecessor's pad of order forms to continue in the series.

One implementation of this approach is included at the end of this TechNote. The included implementation is written completely in Transact-SQL. An alternative implementation that would require less overhead could be written as an Open Server application.

In summary, a recycled series approach might be effective when both concurrency and cache hit ratio affect overall database server performance. This is usually the case in insert-intensive OLTP databases.

Trigger-Based Implementation

Any of the above approaches, except a monotonic approach using the identity property, could be implemented using triggers. However, such an implementation can adversely affect SQL Server performance. First, SQL Server must insert rows into the wrong area of the table and/or indexes, and the trigger relocate the rows to the correct place. Second, because all inserts will initially be inserted into the same area of the table and/or indices a hot spot will result.

Performance Comparison of the Four Approaches

Rao Kanikicharla of Bell Atlantic Healthcare Systems tested comparative performance to measure the performance differences of these various approaches. With only one concurrently inserting process, the monotonic approach was found to be the most effective. As more concurrent processes were added, the overall throughput of the other approaches improved while the monotonic approach continued to process transactions at approximately the same rate. The break-even point was approximately four to five concurrent transactions performing simple insert statements. With ten concurrent transactions, the recycled series approach was found to be significantly faster. Additionally, it caused fewer deadlocks. The break-even point would be a smaller number of concurrent processes for larger transactions that hold the exclusive locks for a longer period of time.

The various approaches are effective in different environments. The monotonic approach may provide best performance in a low activity environment. The pseudo-random approach may provide best performance in an insert-intensive environment with a large data cache. The composite approach may provide better performance than the previous two in an insert-intensive environment with a relatively small data cache. Finally, the recycled series approach also works well in an insert- intensive environment with a small data cache. However, the recycled series approach requires less overhead due to a smaller key and provides a way to manage gaps in the series of key values.

Recycled Series Implementation Example
/* /* /* /* /* /* /* /* /* /* /* /* /* /*
Recycled series approach to surrogate primary
generation. Designed to promote high
concurrency without sacrificing the cache hit
ratio under page-level locking. Effective in a
insert-intensive OLTP environment with a
relatively small data cache.
Based upon presentation given at the 1994
International Sybase User Conference and
Training Meeting.
Gary Meyer, meyer@netcom.com
May 8, 1994
Create the next key table and force one row
per page.
*/
*/
*/
*/
*/
*/
*/
*/
*/
*/
*/
*/ */ */


create table next_key
(spid int not null,
next_key int not null,
filler1 char(255) not null,
filler2 char(255) not null,
filler3 char(255) not null,
filler4 char(255) not null,
filler5 char(255) not null,
filler6 char(255) not null)
go
create unique clustered index next_key_spid 
on next_key (spid)
go
/* Insert a firstrow                             */
insert into next_key
(spid, next_key,
filler1, filler2,
filler3, filler4,
filler5, filler6)
values
(0, 0,
replicate("x",250), replicate("x",250),
replicate("x",250), replicate("x",250),
replicate("x",250), replicate("x",250))
go
/*
/*
Create the stored procedure to generate the
next key value
*/
*/
create procedure get_next_key
@next_key int = null output
as
declare @error_flag int

 
/* /*
First, check if this process already has a
series
*/
*/
select                                  
@next_key = next_key + 1
from next_key holdlock
where spid = @@spid
/*
/*
This process has its own series
*/
*/
if (@@rowcount<=(@next_key-1)%1000000)
/*
Update the next_key table with the new next_key
*/
  begin
update next_key
set next_key = @next_key
where spid = @@spid
end
/*
/*
This process does not hav its own series
(Exception processing)
*/
*/
else                                    

begin
declare @old_spid int
/*
Lock the table
*/
  update next_key                       
set spid = spid
  select @error_flag = 0

/* /*
Try to grab an unused series that has been
abandoned by another process
*/
*/
  set rowcount 1
select
@old_spid = spid,
@next_key = next_key + 1
from next_key
where not spid in
(select spid
from master..sysprocesses)
order by next_key /*look for the one with*/
/* the lowest next_key value*/
/*
Got an old unused series

*/
  if (@@rowcount = 1)                  

begin
set rowcount 0
update next_key /*claim next series as own*/
set next_key = @next_key,
spid = @@spid
where
spid = @old_spid
end
/* /* /*
Did not find an old series
Delete row when an old series has been
completely consumed
*/ */ */
  else
    begin
set rowcount 0
delete
from next_key
where spid = @@spid
/* /*
Determine the value to use to start a new
series
*/
*/
    declare @max_next_key int
select
@max_next_key = max(next_key)
from next_key
    select @next_key = (@max_next_key-1)/1000000+1
select @next_key = @next_key*1000000+1
/*
Start the new series
*/
    if not (@max_next_key is null)      
insert into next_key
(spid, next_key,
filler1, filler2,
filler3, filler4,
filler5, filler6)
values
(@@spid, @next_key,
replicate("x",250), replicate("x",250),
replicate("x",250), replicate("x",250),
replicate("x",250), replicate("x",250))
/*
Otherwise, error
*/
    else

begin
raiserror 20000 'next_key table is empty.'
select @error_flag = 1
end
end
end
if (@error_flag = 1)
return 1
else
return 0
go
/* /*
Make sure that the stored procedure uses row-
level locking on the next key table
*/
*/
update statistics next_key
go
exec sp_recompile next_key
go

by Gary Meyer



Back to Top
Follow Sybase
© Copyright 2010, Sybase Inc.