Word (WD) Index in Sybase IQ
The Word Index was added in 12.4.3. It is used for indexing words within a string for fast lookup. It can be thought of as an index that performs very fast string searching, and it was designed specifically to accelerate the CONTAINS predicate. Another way of putting it would be that it is used to index keywords by treating the contents of a CHAR or VARCHAR column as a delimited list.
Under some circumstances, describe below, the WD index can also be used to accelerate LIKE predicates, but the semantics of the pattern matching of a LIKE and the word searching of a CONTAINS are enough different that careful consideration is required.
When creating the WD index, you can supply two things, a length, and a set of characters to be used as word delimiters. The length argument defines the number of chracters in the longest word or word fragment to be stores within the WD index. Note that this length does not change the semantics of the CONTAINS predicate. If a delimited word in the column is longer than the specified length, then the leading portion of that word of the defined length will be stored in the WD index. Similarly, if a CONTAINS predicate includes a word longer than the defined length for the WD index, then first the WD index will be examined for the appropriate leading portion of that word. Then for any rows containing that leading portion, the full FP cell value must be searched to see whether the entire requested word is present. Thus, the length for a WD index allows the DBA for an application to make a tradeoff between faster query times versus more space and time consumed by the WD index.
Data Loading - The loading performance and characteristics of a WD index are similar to those of a HG index, but typically larger in time, main cache and temp cache usage. However, these resources will vary depending on the length of this WD index as well as on the actual data being indexed.
Data Storage - Like loading, the storage used by a WD index is similar but usually a bit larger than that used by a HG index. However, these resources will vary depending on the length of this WD index as well as on the actual data being indexed.
LIKE and the WD index
Having a WD index can be used to accelerate LIKE predicates whenever a LIKE pattern string contains at least one complete word that is bounded on both ends. A word can be bounded either by a delimiter character, or by the start or end of the cell. A wildcard cannot act as the end or beginning of a bounded word, nor can a bounded word contain a wildcard in its middle. When a LIKE pattern against a column with a WD index contains one or more such bounded words, and when those words are not especially common within the column, then the optimizer will automatically infer an additional CONTAINS predicate referring to those bounded words.
There are two scenarios where adding a WD index for accelerating LIKE predicates is sensible.
Keyword Searching via LIKE
The first such scenario is when an application wants the full word matching semantics of CONTAINS, but where, because of the front-end being used to access the server, the IQ special syntax for CONTAINS is not available to the user. One such example is an article keyword search application. The LIKE predicate can be used as follows to emulate the CONTAINS semantics. The application must first define a specific character to be used as the keyword delimiter, and obviously that character must be one that never occurs within a keyword in this application. Second, the WD index on the column should be created with a delimiter set that includes only the applications's chosed delimiter character. Third, the application must load the data for the column containing the keywords with the delimiter character between each pair of keywords, as well as preceeding the first keyword and following the last keyword within the column. An example cell value from such an application where the chose application delimiter was a comma could look like:
",politics,independent,commentary,economic policy,taxes,flat tax,"
Then the application's LIKE predicates searching for keywords would be of the form:
WHERE article_keywords LIKE '%,taxes,%'
AND article_keywords LIKE '%,economic policy,%'
Because the column was loaded with the chosen application delimiter character as both the first and last character in the cell, the above LIKE patterns will match the keyword reagrdless of the position within the cell. Also, note that because the WD index for this application used the comma as the only word delimiter character, the keyword "economic policy" will be treated as a single word within the WD index.
Using WD index to accelerate arbitrary LIKE searches
The second scenario where adding a WD index is sensible is where the application's typical LIKE patterns contains one or more bounded words, but real applications with such LIKE patterns, as exemplified below, seem to be rare
T.NAME LIKE '% SUN MICROSYSTEMS INCREASED %'
However, there is one tactic which can be applied to increase the usefulness of the WD for smaller LIKE pattern strings, as long as the application will never be used, then the DBA can be very flexible with the definition of the delimiter set for the WD index. Under these circumstances, it is possible to accelerate most arbitrary LIKE patterns, but the DBA must understand the details of how the WD index is being used to be able to maximize the performance benefits to his application.
The examples below point out what makes up a bounded word, and identifies which example can use a WD index with the default delimiter set to accelerate the LIKE search.
Example 1: Can use WD
T.NAME LIKE 'HEWLETT %'
T.NAME LIKE 'HEWLETT PAC%'
These are leading substring matches, within which IQ can recognize the word "HEWLETT" because it is bounded on the left by the start of the cell value and it is bounded on the right by the space.
Example 2: Cannot use WD
T.NAME LIKE 'HEWLET%'
This is a leading substring match, within which IQ cannot recognize any bounded word. Although it is bounded on the left by the start of the cell value, it is not bounded on the right by a word separator character or the end of the cell value.
Example 3: Can use WD
T.NAME LIKE '% HEWLETT %'
T.NAME LIKE '%TE HEWLETT PA%'
These are generic substring matches, within which IQ can recognize the word "HEWLETT" because it is bounded on the left and on the right by spaces.
Example 4: Cannot use WD
T.NAME LIKE '%HEWL%'
T.NAME LIKE '%HEWLETT%'
T.NAME LIKE '%LETT PA%'
T.NAME LIKE '%T PACKARD%'
These are generic substring matches, within which IQ cannot recognize any complete bounded words.
Example 5: Can use WD
T.NAME LIKE '%T PACKARD'
This is a trailing substring match, within which IQ can recognize the word "PACKARD" because it is bounded on the left by a space and it is bounded on the right by the end of the cell value.
Example 6: Cannot use WD
T.NAME LIKE '%PACKARD'
This is a trailing substring match, within which IQ cannot recognize any bounded word.
As shown by these examples, arbitrary LIKE patterns which can use the WD index are a rather restrictive subset of all LIKE patterns. Given this limitation on the uses of a WD index for accelerating LIKE predicates, the tactic which can expand the usefulness of the WD index is to consider creating a WD index with an extended word delimiter set. As long as the application will never be using the CONTAINS predicate on a given column, then that application can consider the use of an extended word delimiter set. For example, a WD index could be created with a delimiter set that included all of the default delimiter characters (everything except alphabetic and numeric characters), plus some additional characters, such as all the vowels. If such a WD index was built, then obviously more LIKE patterns would be found where there were bounded pseudo-words that could be used with the WD index. With such an expanded delimiter set, most of the examples above would be able to leverage the WD index to improve performance using the resulting vowel-less pseudo-words:
T.NAME LIKE 'HEWLET%' pseudo-words: H, WL
T.NAME LIKE '%HEWLETT%' pseudo-words: WL
T.NAME LIKE '%LETT PA%' pseudo-words: TT, P
T.NAME LIKE '%T PACKARD%' pseudo-words: P, CK
T.NAME LIKE '%PACKARD' pseudo-words: CK, RD
But such a WD index would still not help for certain LIKE patterns:
T.NAME LIKE '%HEL%'
How much performance an application can gain via a WD index will depend entirely on the specific LIKE patterns used, on the actual data present in the column, as welll as on the delimiter set and the word storage limit defined for the WD index.
Maintenance of a WD Index
IQ indexes typically do not need to be rebuilt because they are self-organizing. Garray pages in WD indexes can end up with unused slots as a result of, say deletes or lots of incremental inserts with distinct/new keys. the unused slots are always zeroed (to help in the compression). Rebuilding the WD index will remove the unused garray entries. One can use the garray column in sp_iqindexfragmentation output to help in the decision of whether to rebuild the index. If most of the garray pages are in "91-1005" range, the rebuilding is likely a waste of time. If most are in ranges like '0-10%', '11-20%' then a rebuild might be worth the effort. Each rebuild of an index takes time, so that will be another factor that will affect your decision to rebuild.
LIMIT keyword in CREATE INDEX syntax
This keyword is used for the creation of the WD index only. It specifies the maximu word length that is permitted in the WD index.
The default is 255 bytes. The minimum permitted value is 1 and maximum permitted value is 255
Longer words found during parsin causes an error. If the maximum word length specified in the CREATE INDEX statement or determined by default exceeds the column width, the used maximum word length is silently reduced to the column width. Using a lower maximum permitted word length allows insertions, deletions, and updates to use less sace and time. The empty word (two adjacent separators) is silently ignored. After a WD index is created, any insertions into its column are parsed using the separators and maximum word size determined at create time. These separators and maximum word size cannot be changed after the index is created.