{"id":61,"date":"2009-07-27T20:15:00","date_gmt":"2009-07-27T18:15:00","guid":{"rendered":"https:\/\/ilmarkerm.eu\/blog\/2009\/07\/using-linguistic-indexes-for-sorting-in-open-source-databases\/"},"modified":"2009-07-27T20:15:00","modified_gmt":"2009-07-27T18:15:00","slug":"using-linguistic-indexes-for-sorting-in-open-source-databases","status":"publish","type":"post","link":"https:\/\/ilmarkerm.eu\/blog\/2009\/07\/using-linguistic-indexes-for-sorting-in-open-source-databases\/","title":{"rendered":"Using linguistic indexes for sorting in open source databases"},"content":{"rendered":"<p>Here I&#8217;m following up my previous post <a href=\"http:\/\/ilmarkerm.blogspot.com\/2009\/07\/using-linguisting-indexes-for-sorting.html\">Using linguistic indexes for sorting in Oracle<\/a>. I don&#8217;t much like the Oracle solution, that requires creating a special index to speed up sorting, but&#8230; at the same time its very powerful, allows to index in many languages and no database changes are needed.<\/p>\n<p>In this post I\u2019ll take a look at the two popular open source databases MySQL and PostgreSQL. I&#8217;ll take a look only at features, that the database has included and that can be used without any special application changes.<\/p>\n<h3>PostgreSQL 8.4<\/h3>\n<p>Starting from 8.4, collation (sorting) rules can be defined per database and there is no possibility to set it in session level. All sorting and all indexes are ordered according to the database collation locale. In previous versions there was only one collation locale allowed for the entire database cluster.<\/p>\n<p>For my example I create two databases, one with Estonian locale and one with German.<\/p>\n<pre>\n$ createdb -E utf8 -l 'et_EE.utf8' -O ilmar ilmar_ee\n$ createdb -E utf8 -l 'de_DE.utf8' -O ilmar -T template0 ilmar_de\n<\/pre>\n<p>Currently PostgreSQL relies on the underlying OS to do the collations, so the OS must also support the specified locale. Check it with:<\/p>\n<pre>\n$ locale -a | grep et_EE\net_EE\net_EE.iso88591\net_EE.iso885915\net_EE.utf8\n<\/pre>\n<p>To change the collation you need to dump the entire database to a text file (pg_dump), create a new database and reload the data. So, a pretty painful procedure for large databases.<\/p>\n<p>A small test if it really works. In ilmar_ee and ilmar_de I create table test_coll and load it with 4 rows:<\/p>\n<pre>\nCREATE TABLE test_coll (\n  t character varying(100) NOT NULL\n);\n\nbegin;\ninsert into test_coll values ('a');\ninsert into test_coll values ('o');\ninsert into test_coll values ('\u00f5');\ninsert into test_coll values ('\u00e4');\ncommit;\n\n\nilmar_ee=> select t from test_coll order by t;\na\no\n\u00f5\n\u00e4\n\nilmar_de=> select t from test_coll order by t;\na\n\u00e4\no\n\u00f5\n<\/pre>\n<p>Now can index be used for sorting?<\/p>\n<pre>\nCREATE TABLE t (\n  t character varying(100) NOT NULL\n);\n\nCREATE INDEX idxt\n  ON t\n  USING btree\n  (t);\n\nCREATE OR REPLACE FUNCTION fill_t(p_num_rows bigint) RETURNS bigint AS\n$BODY$\ndeclare\n  s t.t%type;\n  i integer;\nbegin\n  for i in 1..p_num_rows loop\n    s:= case mod(i, 4) when 0 then 'a' when 1 then '\u00e4' when 2 then 'o' when 3 then '\u00f5' end;\n    if mod(i,2) = 0 then\n      s:= upper(s);\n    end if;\n    s:= s ||' wqe wqe wqe wqeqwdsa asd asdasd sasss we qwewq dssas';\n    insert into t (t) values (s);\n  end loop;\n  return 0;\nend;\n$BODY$\n  LANGUAGE 'plpgsql' VOLATILE;\n\nselect fill_t(10000);\nvacuum analyze t;\n<\/pre>\n<p>After the test data is created, some tests. Oracle users will find the following very strange:<\/p>\n<pre>\nilmar=> explain select t from t order by t;\n                          QUERY PLAN\n--------------------------------------------------------------\n Sort  (cost=868.39..893.39 rows=10000 width=55)\n   Sort Key: t\n   ->  Seq Scan on t  (cost=0.00..204.00 rows=10000 width=55)\n\nilmar=> explain select t from t where t between 'a' and 'b' order by t;\n                               QUERY PLAN\n-------------------------------------------------------------------------\n Sort  (cost=395.10..401.35 rows=2500 width=55)\n   Sort Key: t\n   ->  Seq Scan on t  (cost=0.00..254.00 rows=2500 width=55)\n         Filter: (((t)::text >= 'a'::text) AND ((t)::text <= 'b'::text))\n\nilmar=> explain select t from t where t between 'a' and 'ak' order by t;\n                               QUERY PLAN\n------------------------------------------------------------------------\n Index Scan using idxt on t  (cost=0.00..8.27 rows=1 width=55)\n   Index Cond: (((t)::text >= 'a'::text) AND ((t)::text <= 'ak'::text))\n<\/pre>\n<p>It seems that Postgres optimizer only consideres using index for sorting, when there is only a small fraction of the table filtered. A reason for this in the <a href=\"http:\/\/www.postgresql.org\/docs\/8.4\/interactive\/indexes-ordering.html\">documentation<\/a> is:<\/p>\n<blockquote><p>\nThe planner will consider satisfying an ORDER BY specification either by scanning an available index that matches the specification, or by scanning the table in physical order and doing an explicit sort. For a query that requires scanning a large fraction of the table, an explicit sort is likely to be faster than using an index because it requires less disk I\/O due to following a sequential access pattern. Indexes are more useful when only a few rows need be fetched.\n<\/p><\/blockquote>\n<p>\nSo it seems that Postgres cannot fast full scan an index. If I fill the table up even more, then finally optimizer is costing the sort operation higher than index scan. But the query is slow, 4125ms on my system.<\/p>\n<pre>\nilmar=> select fill_t(90000);\nilmar=> vacuum analyze t;\n\nilmar=> explain select t from t order by t;\n                              QUERY PLAN\n-----------------------------------------------------------------------\n Index Scan using idxt on t  (cost=0.00..9771.11 rows=100000 width=55)\n<\/pre>\n<p>A note from <a href=\"http:\/\/www.postgresql.org\/docs\/8.4\/interactive\/locale.html\">documentation<\/a>:<\/p>\n<blockquote><p>\nThe drawback of using locales other than C or POSIX in PostgreSQL is its performance impact. It slows character handling and prevents ordinary indexes from being used by LIKE. For this reason use locales only if you actually need them.\n<\/p><\/blockquote>\n<p>Better collation supports seems to be a work in progress:<br \/>\n<a href=\"http:\/\/wiki.postgresql.org\/wiki\/Todo:Collate\">Todo:Collate<\/a><br \/>\n<a href=\"http:\/\/wiki.postgresql.org\/wiki\/Todo:ICU\">Todo:IDU<\/a><\/p>\n<p>Its also possible to use third party <a href=\"http:\/\/orafce.projects.postgresql.org\/\">Orafce<\/a> package, that enables the use on NLSSORT function, that is similar to Oracle.<\/p>\n<h3>MySQL 5.1<\/h3>\n<p>MySQL supports collations at the column level.<br \/>\nAll available collations for the given charset can be queried like this:<\/p>\n<pre>\nmysql> show collation where charset = 'utf8';\n+--------------------+---------+-----+---------+----------+---------+\n| Collation          | Charset | Id  | Default | Compiled | Sortlen |\n+--------------------+---------+-----+---------+----------+---------+\n| utf8_general_ci    | utf8    |  33 | Yes     | Yes      |       1 |\n| utf8_bin           | utf8    |  83 |         | Yes      |       1 |\n| utf8_unicode_ci    | utf8    | 192 |         | Yes      |       8 |\n| utf8_icelandic_ci  | utf8    | 193 |         | Yes      |       8 |\n| utf8_latvian_ci    | utf8    | 194 |         | Yes      |       8 |\n| utf8_romanian_ci   | utf8    | 195 |         | Yes      |       8 |\n| utf8_slovenian_ci  | utf8    | 196 |         | Yes      |       8 |\n| utf8_polish_ci     | utf8    | 197 |         | Yes      |       8 |\n| utf8_estonian_ci   | utf8    | 198 |         | Yes      |       8 |\n| utf8_spanish_ci    | utf8    | 199 |         | Yes      |       8 |\n| utf8_swedish_ci    | utf8    | 200 |         | Yes      |       8 |\n| utf8_turkish_ci    | utf8    | 201 |         | Yes      |       8 |\n| utf8_czech_ci      | utf8    | 202 |         | Yes      |       8 |\n| utf8_danish_ci     | utf8    | 203 |         | Yes      |       8 |\n| utf8_lithuanian_ci | utf8    | 204 |         | Yes      |       8 |\n| utf8_slovak_ci     | utf8    | 205 |         | Yes      |       8 |\n| utf8_spanish2_ci   | utf8    | 206 |         | Yes      |       8 |\n| utf8_roman_ci      | utf8    | 207 |         | Yes      |       8 |\n| utf8_persian_ci    | utf8    | 208 |         | Yes      |       8 |\n| utf8_esperanto_ci  | utf8    | 209 |         | Yes      |       8 |\n| utf8_hungarian_ci  | utf8    | 210 |         | Yes      |       8 |\n+--------------------+---------+-----+---------+----------+---------+\n21 rows in set (0.04 sec)\n<\/pre>\n<p>I'll create a table test, in where column e uses Estonian sorting and column h uses hungarian sorting.<\/p>\n<pre>\nmysql> create table test (\n  e varchar(100) not null collate 'utf8_estonian_ci',\n  h varchar(100) not null collate 'utf8_hungarian_ci'\n) charset=utf8;\n\nmysql> insert into test values ('a','a');\nmysql> insert into test values ('\u00e4','\u00e4');\nmysql> insert into test values ('o','o');\nmysql> insert into test values ('\u00f5','\u00f5');\nmysql> select * from test;\n+---+---+\n| e | h |\n+---+---+\n| a | a |\n| \u00e4 | \u00e4 |\n| o | o |\n| \u00f5 | \u00f5 |\n+---+---+\n4 rows in set (0.02 sec)\n\nmysql> select * from test order by e;\n+---+---+\n| e | h |\n+---+---+\n| a | a |\n| o | o |\n| \u00f5 | \u00f5 |\n| \u00e4 | \u00e4 |\n+---+---+\n4 rows in set (0.04 sec)\n\nmysql> select * from test order by h;\n+---+---+\n| e | h |\n+---+---+\n| a | a |\n| \u00e4 | \u00e4 |\n| o | o |\n| \u00f5 | \u00f5 |\n+---+---+\n4 rows in set (0.01 sec)\n<\/pre>\n<p>Perfect, now what about indexes?<\/p>\n<pre>\nmysql> create index idxe on test (e);\nQuery OK, 4 rows affected (0.71 sec)\nRecords: 4  Duplicates: 0  Warnings: 0\n\nmysql> create index idxh on test (h);\nQuery OK, 4 rows affected (0.18 sec)\nRecords: 4  Duplicates: 0  Warnings: 0\n\nmysql> explain select e from test order by e;\n+----+-------------+-------+-------+---------------+------+---------+------+------+-------------+\n| id | select_type | table | type  | possible_keys | key  | key_len | ref  | rows | Extra       |\n+----+-------------+-------+-------+---------------+------+---------+------+------+-------------+\n|  1 | SIMPLE      | test  | index | NULL          | idxe | 302     | NULL |    4 | Using index |\n+----+-------------+-------+-------+---------------+------+---------+------+------+-------------+\n1 row in set (0.00 sec)\n\nmysql> explain select h from test order by h;\n+----+-------------+-------+-------+---------------+------+---------+------+------+-------------+\n| id | select_type | table | type  | possible_keys | key  | key_len | ref  | rows | Extra       |\n+----+-------------+-------+-------+---------------+------+---------+------+------+-------------+\n|  1 | SIMPLE      | test  | index | NULL          | idxh | 302     | NULL |    4 | Using index |\n+----+-------------+-------+-------+---------------+------+---------+------+------+-------------+\n1 row in set (0.00 sec)\n<\/pre>\n<p>If I force a different collation on a column, then values are read from index, but extra filesort step is needed:<\/p>\n<pre>\nmysql> explain select h from test order by h collate utf8_estonian_ci;\n+----+-------------+-------+-------+---------------+------+---------+------+------+-----------------------------+\n| id | select_type | table | type  | possible_keys | key  | key_len | ref  | rows | Extra                       |\n+----+-------------+-------+-------+---------------+------+---------+------+------+-----------------------------+\n|  1 | SIMPLE      | test  | index | NULL          | idxh | 302     | NULL |    4 | Using index; Using filesort |\n+----+-------------+-------+-------+---------------+------+---------+------+------+-----------------------------+\n1 row in set (0.00 sec)\n<\/pre>\n<p>I'll add more rows to see how much difference there is when reading the order from index and when doing extra sorting.<\/p>\n<pre>\nmysql> delimiter \/\/\nmysql> create procedure load_data(p_num_rows INT)\n    -> BEGIN\n    ->   SET @i = 0;\n    ->   REPEAT\n    ->     SET @i = @i + 1;\n    ->     INSERT INTO test (e, h) VALUES (CONCAT('eeeaad sadsa dasd asd', @i),\n    ->       CONCAT('213aad sadsa dasd asd', @i));\n    ->   UNTIL @i > p_num_rows END REPEAT;\n    -> END\n    -> \/\/\nQuery OK, 0 rows affected (0.44 sec)\n\nmysql> delimiter ;\nmysql> call load_data(10000);\nmysql> call load_data(20000);\nmysql> ANALYZE TABLE test;\n\nmysql> select sql_no_cache count(*) from (select h from test order by h collate utf8_estonian_ci) a;\n+----------+\n| count(*) |\n+----------+\n|    30002 |\n+----------+\n1 row in set (1.22 sec)\n\nmysql> select sql_no_cache count(*) from (select h from test order by h) a;\n+----------+\n| count(*) |\n+----------+\n|    30002 |\n+----------+\n1 row in set (0.09 sec)\n<\/pre>\n<p>It is also possible to set collation at connection level, but this does not change the row sorting order like in Oracle.<\/p>\n<pre>\nmysql> truncate table test;\nQuery OK, 0 rows affected (0.07 sec)\n\nmysql> insert into test values ('a','a');\nQuery OK, 1 row affected (0.03 sec)\n\nmysql> insert into test values ('\u00e4','\u00e4');\nQuery OK, 1 row affected (0.08 sec)\n\nmysql> insert into test values ('o','o');\nQuery OK, 1 row affected (0.03 sec)\n\nmysql> insert into test values ('\u00f5','\u00f5');\nQuery OK, 1 row affected (0.03 sec)\n\nmysql> set collation_connection=utf8_estonian_ci;\nQuery OK, 0 rows affected (0.01 sec)\n\nmysql> select h from test order by h;\n+---+\n| h |\n+---+\n| a |\n| \u00e4 |\n| o |\n| \u00f5 |\n+---+\n4 rows in set (0.00 sec)\n\nmysql> set collation_connection=utf8_hungarian_ci;\nQuery OK, 0 rows affected (0.00 sec)\n\nmysql> select h from test order by h;\n+---+\n| h |\n+---+\n| a |\n| \u00e4 |\n| o |\n| \u00f5 |\n+---+\n4 rows in set (0.00 sec)\n<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>Here I&#8217;m following up my previous post Using linguistic indexes for sorting in Oracle. I don&#8217;t much like the Oracle solution, that requires creating a special index to speed up sorting, but&#8230; at the same time its very powerful, allows to index in many languages and no database changes are needed. In this post I\u2019ll [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[2],"tags":[12,25,19],"class_list":["post-61","post","type-post","status-publish","format-standard","hentry","category-blog-entry","tag-mysql","tag-performance","tag-postgresql"],"_links":{"self":[{"href":"https:\/\/ilmarkerm.eu\/blog\/wp-json\/wp\/v2\/posts\/61","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/ilmarkerm.eu\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/ilmarkerm.eu\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/ilmarkerm.eu\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/ilmarkerm.eu\/blog\/wp-json\/wp\/v2\/comments?post=61"}],"version-history":[{"count":0,"href":"https:\/\/ilmarkerm.eu\/blog\/wp-json\/wp\/v2\/posts\/61\/revisions"}],"wp:attachment":[{"href":"https:\/\/ilmarkerm.eu\/blog\/wp-json\/wp\/v2\/media?parent=61"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/ilmarkerm.eu\/blog\/wp-json\/wp\/v2\/categories?post=61"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/ilmarkerm.eu\/blog\/wp-json\/wp\/v2\/tags?post=61"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}