Ilmar Kerm

Oracle, databases, Linux and maybe more

Here I’m following up my previous post Using linguistic indexes for sorting in Oracle. I don’t much like the Oracle solution, that requires creating a special index to speed up sorting, but… at the same time its very powerful, allows to index in many languages and no database changes are needed.

In this post I’ll take a look at the two popular open source databases MySQL and PostgreSQL. I’ll take a look only at features, that the database has included and that can be used without any special application changes.

PostgreSQL 8.4

Starting from 8.4, collation (sorting) rules can be defined per database and there is no possibility to set it in session level. All sorting and all indexes are ordered according to the database collation locale. In previous versions there was only one collation locale allowed for the entire database cluster.

For my example I create two databases, one with Estonian locale and one with German.

$ createdb -E utf8 -l 'et_EE.utf8' -O ilmar ilmar_ee
$ createdb -E utf8 -l 'de_DE.utf8' -O ilmar -T template0 ilmar_de

Currently PostgreSQL relies on the underlying OS to do the collations, so the OS must also support the specified locale. Check it with:

$ locale -a | grep et_EE
et_EE
et_EE.iso88591
et_EE.iso885915
et_EE.utf8

To change the collation you need to dump the entire database to a text file (pg_dump), create a new database and reload the data. So, a pretty painful procedure for large databases.

A small test if it really works. In ilmar_ee and ilmar_de I create table test_coll and load it with 4 rows:

CREATE TABLE test_coll (
  t character varying(100) NOT NULL
);

begin;
insert into test_coll values ('a');
insert into test_coll values ('o');
insert into test_coll values ('õ');
insert into test_coll values ('ä');
commit;


ilmar_ee=> select t from test_coll order by t;
a
o
õ
ä

ilmar_de=> select t from test_coll order by t;
a
ä
o
õ

Now can index be used for sorting?

CREATE TABLE t (
  t character varying(100) NOT NULL
);

CREATE INDEX idxt
  ON t
  USING btree
  (t);

CREATE OR REPLACE FUNCTION fill_t(p_num_rows bigint) RETURNS bigint AS
$BODY$
declare
  s t.t%type;
  i integer;
begin
  for i in 1..p_num_rows loop
    s:= case mod(i, 4) when 0 then 'a' when 1 then 'ä' when 2 then 'o' when 3 then 'õ' end;
    if mod(i,2) = 0 then
      s:= upper(s);
    end if;
    s:= s ||' wqe wqe wqe wqeqwdsa asd asdasd sasss we qwewq dssas';
    insert into t (t) values (s);
  end loop;
  return 0;
end;
$BODY$
  LANGUAGE 'plpgsql' VOLATILE;

select fill_t(10000);
vacuum analyze t;

After the test data is created, some tests. Oracle users will find the following very strange:

ilmar=> explain select t from t order by t;
                          QUERY PLAN
--------------------------------------------------------------
 Sort  (cost=868.39..893.39 rows=10000 width=55)
   Sort Key: t
   ->  Seq Scan on t  (cost=0.00..204.00 rows=10000 width=55)

ilmar=> explain select t from t where t between 'a' and 'b' order by t;
                               QUERY PLAN
-------------------------------------------------------------------------
 Sort  (cost=395.10..401.35 rows=2500 width=55)
   Sort Key: t
   ->  Seq Scan on t  (cost=0.00..254.00 rows=2500 width=55)
         Filter: (((t)::text >= 'a'::text) AND ((t)::text <= 'b'::text))

ilmar=> explain select t from t where t between 'a' and 'ak' order by t;
                               QUERY PLAN
------------------------------------------------------------------------
 Index Scan using idxt on t  (cost=0.00..8.27 rows=1 width=55)
   Index Cond: (((t)::text >= 'a'::text) AND ((t)::text <= 'ak'::text))

It seems that Postgres optimizer only consideres using index for sorting, when there is only a small fraction of the table filtered. A reason for this in the documentation is:

The planner will consider satisfying an ORDER BY specification either by scanning an available index that matches the specification, or by scanning the table in physical order and doing an explicit sort. For a query that requires scanning a large fraction of the table, an explicit sort is likely to be faster than using an index because it requires less disk I/O due to following a sequential access pattern. Indexes are more useful when only a few rows need be fetched.

So it seems that Postgres cannot fast full scan an index. If I fill the table up even more, then finally optimizer is costing the sort operation higher than index scan. But the query is slow, 4125ms on my system.

ilmar=> select fill_t(90000);
ilmar=> vacuum analyze t;

ilmar=> explain select t from t order by t;
                              QUERY PLAN
-----------------------------------------------------------------------
 Index Scan using idxt on t  (cost=0.00..9771.11 rows=100000 width=55)

A note from documentation:

The drawback of using locales other than C or POSIX in PostgreSQL is its performance impact. It slows character handling and prevents ordinary indexes from being used by LIKE. For this reason use locales only if you actually need them.

Better collation supports seems to be a work in progress:
Todo:Collate
Todo:IDU

Its also possible to use third party Orafce package, that enables the use on NLSSORT function, that is similar to Oracle.

MySQL 5.1

MySQL supports collations at the column level.
All available collations for the given charset can be queried like this:

mysql> show collation where charset = 'utf8';
+--------------------+---------+-----+---------+----------+---------+
| Collation          | Charset | Id  | Default | Compiled | Sortlen |
+--------------------+---------+-----+---------+----------+---------+
| utf8_general_ci    | utf8    |  33 | Yes     | Yes      |       1 |
| utf8_bin           | utf8    |  83 |         | Yes      |       1 |
| utf8_unicode_ci    | utf8    | 192 |         | Yes      |       8 |
| utf8_icelandic_ci  | utf8    | 193 |         | Yes      |       8 |
| utf8_latvian_ci    | utf8    | 194 |         | Yes      |       8 |
| utf8_romanian_ci   | utf8    | 195 |         | Yes      |       8 |
| utf8_slovenian_ci  | utf8    | 196 |         | Yes      |       8 |
| utf8_polish_ci     | utf8    | 197 |         | Yes      |       8 |
| utf8_estonian_ci   | utf8    | 198 |         | Yes      |       8 |
| utf8_spanish_ci    | utf8    | 199 |         | Yes      |       8 |
| utf8_swedish_ci    | utf8    | 200 |         | Yes      |       8 |
| utf8_turkish_ci    | utf8    | 201 |         | Yes      |       8 |
| utf8_czech_ci      | utf8    | 202 |         | Yes      |       8 |
| utf8_danish_ci     | utf8    | 203 |         | Yes      |       8 |
| utf8_lithuanian_ci | utf8    | 204 |         | Yes      |       8 |
| utf8_slovak_ci     | utf8    | 205 |         | Yes      |       8 |
| utf8_spanish2_ci   | utf8    | 206 |         | Yes      |       8 |
| utf8_roman_ci      | utf8    | 207 |         | Yes      |       8 |
| utf8_persian_ci    | utf8    | 208 |         | Yes      |       8 |
| utf8_esperanto_ci  | utf8    | 209 |         | Yes      |       8 |
| utf8_hungarian_ci  | utf8    | 210 |         | Yes      |       8 |
+--------------------+---------+-----+---------+----------+---------+
21 rows in set (0.04 sec)

I'll create a table test, in where column e uses Estonian sorting and column h uses hungarian sorting.

mysql> create table test (
  e varchar(100) not null collate 'utf8_estonian_ci',
  h varchar(100) not null collate 'utf8_hungarian_ci'
) charset=utf8;

mysql> insert into test values ('a','a');
mysql> insert into test values ('ä','ä');
mysql> insert into test values ('o','o');
mysql> insert into test values ('õ','õ');
mysql> select * from test;
+---+---+
| e | h |
+---+---+
| a | a |
| ä | ä |
| o | o |
| õ | õ |
+---+---+
4 rows in set (0.02 sec)

mysql> select * from test order by e;
+---+---+
| e | h |
+---+---+
| a | a |
| o | o |
| õ | õ |
| ä | ä |
+---+---+
4 rows in set (0.04 sec)

mysql> select * from test order by h;
+---+---+
| e | h |
+---+---+
| a | a |
| ä | ä |
| o | o |
| õ | õ |
+---+---+
4 rows in set (0.01 sec)

Perfect, now what about indexes?

mysql> create index idxe on test (e);
Query OK, 4 rows affected (0.71 sec)
Records: 4  Duplicates: 0  Warnings: 0

mysql> create index idxh on test (h);
Query OK, 4 rows affected (0.18 sec)
Records: 4  Duplicates: 0  Warnings: 0

mysql> explain select e from test order by e;
+----+-------------+-------+-------+---------------+------+---------+------+------+-------------+
| id | select_type | table | type  | possible_keys | key  | key_len | ref  | rows | Extra       |
+----+-------------+-------+-------+---------------+------+---------+------+------+-------------+
|  1 | SIMPLE      | test  | index | NULL          | idxe | 302     | NULL |    4 | Using index |
+----+-------------+-------+-------+---------------+------+---------+------+------+-------------+
1 row in set (0.00 sec)

mysql> explain select h from test order by h;
+----+-------------+-------+-------+---------------+------+---------+------+------+-------------+
| id | select_type | table | type  | possible_keys | key  | key_len | ref  | rows | Extra       |
+----+-------------+-------+-------+---------------+------+---------+------+------+-------------+
|  1 | SIMPLE      | test  | index | NULL          | idxh | 302     | NULL |    4 | Using index |
+----+-------------+-------+-------+---------------+------+---------+------+------+-------------+
1 row in set (0.00 sec)

If I force a different collation on a column, then values are read from index, but extra filesort step is needed:

mysql> explain select h from test order by h collate utf8_estonian_ci;
+----+-------------+-------+-------+---------------+------+---------+------+------+-----------------------------+
| id | select_type | table | type  | possible_keys | key  | key_len | ref  | rows | Extra                       |
+----+-------------+-------+-------+---------------+------+---------+------+------+-----------------------------+
|  1 | SIMPLE      | test  | index | NULL          | idxh | 302     | NULL |    4 | Using index; Using filesort |
+----+-------------+-------+-------+---------------+------+---------+------+------+-----------------------------+
1 row in set (0.00 sec)

I'll add more rows to see how much difference there is when reading the order from index and when doing extra sorting.

mysql> delimiter //
mysql> create procedure load_data(p_num_rows INT)
    -> BEGIN
    ->   SET @i = 0;
    ->   REPEAT
    ->     SET @i = @i + 1;
    ->     INSERT INTO test (e, h) VALUES (CONCAT('eeeaad sadsa dasd asd', @i),
    ->       CONCAT('213aad sadsa dasd asd', @i));
    ->   UNTIL @i > p_num_rows END REPEAT;
    -> END
    -> //
Query OK, 0 rows affected (0.44 sec)

mysql> delimiter ;
mysql> call load_data(10000);
mysql> call load_data(20000);
mysql> ANALYZE TABLE test;

mysql> select sql_no_cache count(*) from (select h from test order by h collate utf8_estonian_ci) a;
+----------+
| count(*) |
+----------+
|    30002 |
+----------+
1 row in set (1.22 sec)

mysql> select sql_no_cache count(*) from (select h from test order by h) a;
+----------+
| count(*) |
+----------+
|    30002 |
+----------+
1 row in set (0.09 sec)

It is also possible to set collation at connection level, but this does not change the row sorting order like in Oracle.

mysql> truncate table test;
Query OK, 0 rows affected (0.07 sec)

mysql> insert into test values ('a','a');
Query OK, 1 row affected (0.03 sec)

mysql> insert into test values ('ä','ä');
Query OK, 1 row affected (0.08 sec)

mysql> insert into test values ('o','o');
Query OK, 1 row affected (0.03 sec)

mysql> insert into test values ('õ','õ');
Query OK, 1 row affected (0.03 sec)

mysql> set collation_connection=utf8_estonian_ci;
Query OK, 0 rows affected (0.01 sec)

mysql> select h from test order by h;
+---+
| h |
+---+
| a |
| ä |
| o |
| õ |
+---+
4 rows in set (0.00 sec)

mysql> set collation_connection=utf8_hungarian_ci;
Query OK, 0 rows affected (0.00 sec)

mysql> select h from test order by h;
+---+
| h |
+---+
| a |
| ä |
| o |
| õ |
+---+
4 rows in set (0.00 sec)

In Estonia we have our own language and Estonian language has its own alphabet that is a little bit different from English 🙂 And it would be good if a database application sorts text fields according to the correct language rules (I have seen many bug reports from clients, that “the sort order is wrong”).

In Oracle database its quite easy to sort according to the correct language rules, you just need to set the correct database session language from the client. This can be done using ALTER SESSION SET NLS_LANGUAGE or quite often the database client picks up the correct language from OS locale.

Small example to demonstrate this (I use Oracle Database 11.1.0.7 for these examples):

create table t (
  t varchar2(100) not null
);

insert into t values ('a');
insert into t values ('ä');
insert into t values ('o');
insert into t values ('õ');
commit;

SQL> alter session set nls_language='estonian';

Session altered.

SQL> select t from t order by t;

T
---------
a
o
õ
ä

SQL> alter session set nls_language='german';

Session altered.

SQL> select t from t order by t;

T
---------
a
ä
o
õ

But what happens if I create an index to speed up the sorting?

SQL> truncate table t;

SQL> create index idxt on t(t);

SQL> 
declare
  s t.t%type;
begin
  for i in 1..100000 loop
    s:= case mod(i,4) 
          when 0 then 'a'
          when 1 then 'ä'
          when 2 then 'o'
          when 3 then 'õ'
        end;
    if mod(i,3) = 0 then
      s:= upper(s);
    end if;
    s:= s||dbms_random.string('x',50);
    insert into t values (s);
  end loop;
  commit;
end;
/
SQL> exec dbms_stats.gather_table_stats(user,'t',cascade=>true);

PL/SQL procedure successfully completed.

SQL> alter session set nls_language='estonian';

Session altered.

SQL> select t from t order by t;

100000 rows selected.

Elapsed: 00:00:02.89

-----------------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |   100K|  5175K|       |  1482   (2)| 00:00:18 |
|   1 |  SORT ORDER BY     |      |   100K|  5175K|  5896K|  1482   (2)| 00:00:18 |
|   2 |   TABLE ACCESS FULL| T    |   100K|  5175K|       |   162   (2)| 00:00:02 |
-----------------------------------------------------------------------------------

SQL> alter session set nls_language='american';

Session altered.

SQL> select t from t order by t;

100000 rows selected.

Elapsed: 00:00:01.56

-------------------------------------------------------------------------
| Id  | Operation        | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------
|   0 | SELECT STATEMENT |      |   100K|  5175K|   895   (1)| 00:00:11 |
|   1 |  INDEX FULL SCAN | IDXT |   100K|  5175K|   895   (1)| 00:00:11 |
-------------------------------------------------------------------------

First I did a query with Estonian language and then for comparison with American language. If you compare these two, the query in American language works as expected, it reads the correct sort order directly from index. But query in Estonian language requires extra sorting. This is also clearly demonstrated in the query elapsed times. Usually this is not a problem at all, but if there is a need for more performance, what if the data sets are really large and could benefit a sorting help from an index?

This is because normal indexes in oracle are always built with BINARY sort mode. This is the fastest sorting mode that orders the characters by their numerical value in the character set. But this is only good for the English language.

Here come the linguistic indexes.

SQL> create index idx_ling on t (nlssort(t, 'nls_sort=estonian'));

Index created.

SQL> alter session set nls_language='estonian';

Session altered.

SQL> select /*+index(t idx_ling)*/ t from t order by t;

----------------------------------------------------------------------------------------
| Id  | Operation                   | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |          |   100K|    14M|   101K  (1)| 00:20:20 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T        |   100K|    14M|   101K  (1)| 00:20:20 |
|   2 |   INDEX FULL SCAN           | IDX_LING |   100K|       |  1646   (1)| 00:00:20 |
----------------------------------------------------------------------------------------

So it works, there is no extra sorting step anymore. But it’s useless in this case, because the index contains only the sorting value, not the actual column value and the slow TABLE ACCESS BY INDEX ROWID needs to be performed.

This isn’t a problem when I do not need to return the actual value of column t and only need to do the sorting:

SQL> select 1 from t order by t;

-----------------------------------------------------------------------------
| Id  | Operation        | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT |          |   100K|    14M|  1646   (1)| 00:00:20 |
|   1 |  INDEX FULL SCAN | IDX_LING |   100K|    14M|  1646   (1)| 00:00:20 |
-----------------------------------------------------------------------------

Or I could include the actual column value in the index itself:

SQL> create index idx_ling2 on t (nlssort(t, 'nls_sort=estonian'), t);

SQL> select t from t order by t;

------------------------------------------------------------------------------
| Id  | Operation        | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------
|   0 | SELECT STATEMENT |           |   100K|    14M|  2388   (1)| 00:00:29 |
|   1 |  INDEX FULL SCAN | IDX_LING2 |   100K|    14M|  2388   (1)| 00:00:29 |
------------------------------------------------------------------------------

If I change the language, the index is not used anymore.

SQL> drop index idxt;

SQL> alter session set nls_language='american';

Session altered.

SQL> select 1 from t order by t;

-----------------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |   100K|  5175K|       |  1482   (2)| 00:00:18 |
|   1 |  SORT ORDER BY     |      |   100K|  5175K|  5896K|  1482   (2)| 00:00:18 |
|   2 |   TABLE ACCESS FULL| T    |   100K|  5175K|       |   162   (2)| 00:00:02 |
-----------------------------------------------------------------------------------

The same linguistic index can also be used for searching, but for that NLS_COMP needs to be set to LINGUISTIC:

SQL> alter session set nls_language='estonian';

Session altered.

SQL> select t from t where t = 'x';

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |     1 |    53 |   163   (3)| 00:00:02 |
|*  1 |  TABLE ACCESS FULL| T    |     1 |    53 |   163   (3)| 00:00:02 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("T"='x')

SQL> alter session set nls_comp='linguistic';

Session altered.

SQL> select t from t where t = 'x';

------------------------------------------------------------------------------
| Id  | Operation        | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------
|   0 | SELECT STATEMENT |           |     1 |   157 |     3   (0)| 00:00:01 |
|*  1 |  INDEX RANGE SCAN| IDX_LING2 |     1 |   157 |     3   (0)| 00:00:01 |
------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access(NLSSORT("T",'nls_sort=''ESTONIAN''')=HEXTORAW('80000100')
              )

It is also possible to combine search and sorting:

SQL> select t from t where t between 'a' and 'b' order by t;

------------------------------------------------------------------------------
| Id  | Operation        | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------
|   0 | SELECT STATEMENT |           |   250 | 39250 |    13   (0)| 00:00:01 |
|*  1 |  INDEX RANGE SCAN| IDX_LING2 |   250 | 39250 |    13   (0)| 00:00:01 |
------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access(NLSSORT("T",'nls_sort=''ESTONIAN''')>=HEXTORAW('14000100')
                AND NLSSORT("T",'nls_sort=''ESTONIAN''')<=HEXTORAW('19000100') )

In conclusion, its possible to use index for sorting in non-English languages, but the solution is not very straight-forward and simple. But at the same time it is powerful, you can index the same column in different languages at the same time! But I'm not yet sure, that I like this FBI kind of solution by Oracle, because when doing SQL tuning I really like the possibility of reading the actual column value from index and not doing the always painfully slow TABLE ACCESS BY INDEX ROWID, if at all possible.
In one of the next posts I will take a look at other databases, how have they solved sorting according to national language rules.

More examples about Linguistic Indexes can be found from Richard Foote blog, he focuses on the case-insensitive search feature:
Introduction To Linguistic Indexes – Part I

Also take a look at Metalink note 227335.1:
Linguistic Sorting - Frequently Asked Questions

And of course... documentation:
Oracle® Database Globalization Support Guide - Linguistic Sorting and String Searching

For every Oracle database session it is always good to set MODULE, ACTION and CLIENT_ID values for instrumentation. This allows DBA to see and debug in database level what the session is/was doing in detail. Oracle diagnostic tools are all powered up for using these values and a lot of diagnostic power is just lost, when the application is not instrumented.

For JDBC one can use end-to-end metrics support in JDBC driver
For PL/SQL one can use DBMS_APPLICATION_INFO subprograms

All this is just great, but what about APEX? I think it’s the best database application development tool in the market today, but has it got this instrumentation already built in?
Yes it has!

SELECT   module,
         action,
         client_id
  FROM   v$active_session_history
 WHERE   module LIKE 'APEX%';

MODULE                  ACTION          CLIENT_ID
----------------------- --------------- ------------------------
APEX:APPLICATION 109    PAGE 7          ILMAR:2697049844839191
APEX:APPLICATION 109    PAGE 12         ILMAR:2697049844839191
APEX:APPLICATION 109    PAGE 6          ILMAR:2697049844839191

MODULE is set to the application number
ACTION contains the page number
CLIENT_ID constains username and session id

This example is from APEX 3.2 and Oracle 11.1.0.7 database.

You can read more about using Oracle diagnostic tools and session tracing from Doug Burns blog:
Session Level ASH Reports