Ilmar Kerm

Oracle, databases, Linux and maybe more

WARNING! This post contains ClickOps methods.

I have a customer running Aurora PostgreSQL in AWS, and they wanted to monitor and alert on how many rows certain database users read from the database.

For auditing a PostgreSQL database there is an extension called pgAudit, which is also installed by default in Aurora PostgreSQL. You can read about how to set it up for your AWS database here.
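
Roughly, per the AWS documentation, enabling pgAudit on Aurora means loading the library through the cluster parameter group and creating the extension. A minimal sketch of those steps; the parameter group name and connection details below are hypothetical:

# Load pgaudit via the cluster parameter group (shared_preload_libraries
# is static, so it needs a reboot), and point pgaudit.role at rds_pgaudit.
aws rds modify-db-cluster-parameter-group \
  --db-cluster-parameter-group-name my-cluster-params \
  --parameters \
    "ParameterName=shared_preload_libraries,ParameterValue=pgaudit,ApplyMethod=pending-reboot" \
    "ParameterName=pgaudit.role,ParameterValue=rds_pgaudit,ApplyMethod=pending-reboot"

# After the reboot: create the audit role and the extension
psql -h mycluster.cluster-abc123.eu-west-1.rds.amazonaws.com -U postgres -d egp <<'SQL'
CREATE ROLE rds_pgaudit;
CREATE EXTENSION pgaudit;
SQL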

To enable auditing only for a specific user:

ALTER USER "ilmar.kerm" SET pgaudit.log TO 'all';
ALTER USER "ilmar.kerm" SET pgaudit.log_catalog TO 'off';
ALTER USER "ilmar.kerm" SET pgaudit.log_parameter TO 'on';
ALTER USER "ilmar.kerm" SET pgaudit.log_rows TO 'on';

For my use case, the important bit is also to turn ON pgaudit.log_rows.
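
To double-check that these settings got attached to the role, you can query the pg_db_role_setting catalog. A minimal sketch; the connection details are hypothetical:

# List the per-role settings stored for this user
psql -h mycluster.cluster-abc123.eu-west-1.rds.amazonaws.com -U postgres -d egp -c \
  "SELECT r.rolname, s.setconfig
     FROM pg_db_role_setting s
     JOIN pg_roles r ON r.oid = s.setrole
    WHERE r.rolname = 'ilmar.kerm';"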

After that is done, pgAudit will emit a log record into the main PostgreSQL text log that looks like this:

2024-12-18 02:25:44 UTC:10.20.30.40(53162):ilmar.kerm@egp:[32766]:LOG:  AUDIT: SESSION,94,1,READ,SELECT,,,"SELECT id, value, somethingelse FROM ""APPLSCHEMA1"".""TableName""",<none>,42

This audit record tells me that user "ilmar.kerm" executed SELECT id, value, somethingelse FROM "APPLSCHEMA1"."TableName", and that the query returned 42 rows.

I now need to extract that value 42 from these audit rows and turn them into CloudWatch metric values.

CloudWatch Metrics

To extract metric values from CloudWatch Logs there are Metric filters, but they can only extract values from JSON or space-delimited log records. The pgAudit record above is comma-delimited.

Very recently AWS announced the ability to transform logs (into JSON) on ingestion, using pre-built or custom parsers. The transformed logs can also be used to create Metric filters.

Don’t worry, the transformed log is stored in addition to the old format, and the old format remains accessible as-is. This of course means that the amount of logs CloudWatch stores for PostgreSQL will double. A thing to keep in mind: do not enable auditing for busy application users.

NB! The examples below use the dreaded ClickOps methodology, since currently the Terraform AWS provider does not support this new feature.

First, you have to turn on the log transformer (a CLI sketch follows the steps):

  1. Open the PostgreSQL database CloudWatch Log Group /aws/rds/cluster/CLUSTER_NAME/postgresql
  2. Navigate to Transformer and click Create transformer
  3. Choose a parser: Postgres
  4. Press Add processor
  5. Processor: CSV. Source: body; Delimiter: ,; Quote character: (default)
  6. Under Columns it would be possible to give each column under body a proper name, but I’m not going to, since the transformation applies to all PostgreSQL log records, not just pgAudit ones.
  7. Press Save

Here is how the transformer settings look, and also how a new transformed audit record will look.

Every field is beautifully split into JSON attributes that we can extract for metric values. This is what we’ll do next.

Creating a metric to count all rows affected by audited operations (a CLI sketch follows the steps)

  1. Navigate to Metric filters and press Create metric filter
  2. Create filter pattern: { $.body.column_1 = "AUDIT: SESSION" }
  3. Enable metric filter on transformed logs: ON
  4. Next
  5. Filter name: write something yourself
  6. Metric namespace and Metric name – write something yourself; this is how the metric will be named under CloudWatch Metrics
  7. Metric value: $.body.column_10
  8. Unit: Count
  9. Dimension name: operation, Dimension value: $.body.column_4
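
The equivalent AWS CLI call would look roughly like this. A sketch, assuming the --apply-on-transformed-logs flag corresponds to the console toggle above; the filter name, namespace and metric name are placeholders:

# Extract the row count (column_10) as a metric value, with the statement
# class (column_4: READ/WRITE/...) attached as a dimension.
aws logs put-metric-filter \
  --log-group-name "/aws/rds/cluster/CLUSTER_NAME/postgresql" \
  --filter-name pgaudit-rows \
  --filter-pattern '{ $.body.column_1 = "AUDIT: SESSION" }' \
  --apply-on-transformed-logs \
  --metric-transformations \
    'metricName=AuditedRows,metricNamespace=PgAudit,metricValue=$.body.column_10,unit=Count,dimensions={operation=$.body.column_4}'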

Something to keep in mind – auditing the COPY command

pgAudit always reports that a COPY command returned 0 rows. I would instead alert on all executed COPY commands (see the alarm sketch after the list below).

To just count the COPY operations executed, create a new metric filter.

  1. Navigate to Metric filters and press Create metric filter
  2. Create filter pattern: { ($.body.column_1 = "AUDIT: SESSION") && ($.body.column_4 = "READ") && ($.body.column_5 = "COPY") }
  3. Enable metric filter on transformed logs: ON
  4. Next
  5. Filter name: write something yourself
  6. Metric namespace and Metric name – write something yourself; this is how the metric will be named under CloudWatch Metrics
  7. Metric value: 1
  8. Unit: Count

Tested with CentOS 5 and PostgreSQL 8.4.

Sadly, PostgreSQL’s data import tool pg_restore does not have a remap_schema parameter, so importing tables into another schema in the destination database is a little bit tricky.

Here is a Bash script I use to connect to the source database over an SSH tunnel, export two tables (srv.stat_export and disp.trip), and then import them into the destination database under the schema etl.
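
The original script is not reproduced here, but a minimal sketch of the approach could look like the following. Hosts, users and database names are hypothetical, and since there is no remap_schema, the tables are moved with ALTER TABLE ... SET SCHEMA after loading:

#!/bin/bash
# Sketch: tunnel to the source, dump the two tables, load them into the
# destination under their original schemas, then move them to etl.
set -euo pipefail

LOCAL_PORT=15432
# SSH tunnel: local port 15432 forwards to the source database
ssh -f -N -L ${LOCAL_PORT}:source-db.example.com:5432 jumphost.example.com

# Dump only the two tables and pipe them straight into the destination
pg_dump -h localhost -p ${LOCAL_PORT} -U src_user \
    --table=srv.stat_export --table=disp.trip --no-owner src_db \
  | psql -h dest-db.example.com -U dest_user dest_db

# No remap_schema, so relocate the tables after the import
psql -h dest-db.example.com -U dest_user dest_db <<'SQL'
ALTER TABLE srv.stat_export SET SCHEMA etl;
ALTER TABLE disp.trip SET SCHEMA etl;
SQL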

Here I’m following up on my previous post, Using linguistic indexes for sorting in Oracle. I don’t much like the Oracle solution, which requires creating a special index to speed up sorting, but… at the same time it’s very powerful: it allows indexing in many languages, and no application changes are needed.

In this post I’ll take a look at two popular open source databases, MySQL and PostgreSQL. I’ll look only at features that are included in the database and can be used without any special application changes.

PostgreSQL 8.4

Starting from 8.4, collation (sorting) rules can be defined per database, and there is no possibility to set them at the session level. All sorting and all indexes are ordered according to the database collation locale. In previous versions, only one collation locale was allowed for the entire database cluster.

For my example I create two databases, one with an Estonian locale and one with a German locale.

$ createdb -E utf8 -l 'et_EE.utf8' -O ilmar ilmar_ee
$ createdb -E utf8 -l 'de_DE.utf8' -O ilmar -T template0 ilmar_de

Currently PostgreSQL relies on the underlying OS to do the collations, so the OS must also support the specified locale. Check it with:

$ locale -a | grep et_EE
et_EE
et_EE.iso88591
et_EE.iso885915
et_EE.utf8

To change the collation you need to dump the entire database to a text file (pg_dump), create a new database, and reload the data. So, a pretty painful procedure for large databases.
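
A minimal sketch of that procedure; the new database name ilmar_new is hypothetical:

# Dump the old database, create a new one with the desired locale
# (template0 is required when the locale differs), then reload the data
pg_dump ilmar_ee > ilmar_ee.sql
createdb -E utf8 -l 'de_DE.utf8' -O ilmar -T template0 ilmar_new
psql -d ilmar_new -f ilmar_ee.sql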

A small test to check that it really works. In ilmar_ee and ilmar_de I create the table test_coll and load it with 4 rows:

CREATE TABLE test_coll (
  t character varying(100) NOT NULL
);

begin;
insert into test_coll values ('a');
insert into test_coll values ('o');
insert into test_coll values ('õ');
insert into test_coll values ('ä');
commit;


ilmar_ee=> select t from test_coll order by t;
a
o
õ
ä

ilmar_de=> select t from test_coll order by t;
a
ä
o
õ

Now, can an index be used for sorting?

CREATE TABLE t (
  t character varying(100) NOT NULL
);

CREATE INDEX idxt
  ON t
  USING btree
  (t);

CREATE OR REPLACE FUNCTION fill_t(p_num_rows bigint) RETURNS bigint AS
$BODY$
declare
  s t.t%type;
  i integer;
begin
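  -- cycle through a, ä, o, õ; upper-case every second value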
  for i in 1..p_num_rows loop
    s:= case mod(i, 4) when 0 then 'a' when 1 then 'ä' when 2 then 'o' when 3 then 'õ' end;
    if mod(i,2) = 0 then
      s:= upper(s);
    end if;
    s:= s ||' wqe wqe wqe wqeqwdsa asd asdasd sasss we qwewq dssas';
    insert into t (t) values (s);
  end loop;
  return 0;
end;
$BODY$
  LANGUAGE 'plpgsql' VOLATILE;

select fill_t(10000);
vacuum analyze t;

After the test data is created, some tests. Oracle users will find the following very strange:

ilmar=> explain select t from t order by t;
                          QUERY PLAN
--------------------------------------------------------------
 Sort  (cost=868.39..893.39 rows=10000 width=55)
   Sort Key: t
   ->  Seq Scan on t  (cost=0.00..204.00 rows=10000 width=55)

ilmar=> explain select t from t where t between 'a' and 'b' order by t;
                               QUERY PLAN
-------------------------------------------------------------------------
 Sort  (cost=395.10..401.35 rows=2500 width=55)
   Sort Key: t
   ->  Seq Scan on t  (cost=0.00..254.00 rows=2500 width=55)
         Filter: (((t)::text >= 'a'::text) AND ((t)::text <= 'b'::text))

ilmar=> explain select t from t where t between 'a' and 'ak' order by t;
                               QUERY PLAN
------------------------------------------------------------------------
 Index Scan using idxt on t  (cost=0.00..8.27 rows=1 width=55)
   Index Cond: (((t)::text >= 'a'::text) AND ((t)::text <= 'ak'::text))

It seems that the Postgres optimizer only considers using an index for sorting when only a small fraction of the table is selected. The reason given in the documentation is:

The planner will consider satisfying an ORDER BY specification either by scanning an available index that matches the specification, or by scanning the table in physical order and doing an explicit sort. For a query that requires scanning a large fraction of the table, an explicit sort is likely to be faster than using an index because it requires less disk I/O due to following a sequential access pattern. Indexes are more useful when only a few rows need be fetched.

So it seems that Postgres cannot do a fast full scan of an index. If I fill the table up even more, the optimizer finally costs the sort operation higher than the index scan. But the query is slow: 4125 ms on my system.

ilmar=> select fill_t(90000);
ilmar=> vacuum analyze t;

ilmar=> explain select t from t order by t;
                              QUERY PLAN
-----------------------------------------------------------------------
 Index Scan using idxt on t  (cost=0.00..9771.11 rows=100000 width=55)

A note from the documentation:

The drawback of using locales other than C or POSIX in PostgreSQL is its performance impact. It slows character handling and prevents ordinary indexes from being used by LIKE. For this reason use locales only if you actually need them.

Better collation support seems to be a work in progress:
Todo:Collate
Todo:IDU

It’s also possible to use the third-party Orafce package, which provides an NLSSORT function similar to Oracle’s.
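
A quick sketch of what that would look like, assuming Orafce is installed in the database; the locale string follows the examples above:

# Session-level linguistic sorting without changing the database collation
psql -d ilmar_ee -c "SELECT t FROM test_coll ORDER BY nlssort(t, 'de_DE.utf8');"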

MySQL 5.1

MySQL supports collations at the column level.
All available collations for the given charset can be queried like this:

mysql> show collation where charset = 'utf8';
+--------------------+---------+-----+---------+----------+---------+
| Collation          | Charset | Id  | Default | Compiled | Sortlen |
+--------------------+---------+-----+---------+----------+---------+
| utf8_general_ci    | utf8    |  33 | Yes     | Yes      |       1 |
| utf8_bin           | utf8    |  83 |         | Yes      |       1 |
| utf8_unicode_ci    | utf8    | 192 |         | Yes      |       8 |
| utf8_icelandic_ci  | utf8    | 193 |         | Yes      |       8 |
| utf8_latvian_ci    | utf8    | 194 |         | Yes      |       8 |
| utf8_romanian_ci   | utf8    | 195 |         | Yes      |       8 |
| utf8_slovenian_ci  | utf8    | 196 |         | Yes      |       8 |
| utf8_polish_ci     | utf8    | 197 |         | Yes      |       8 |
| utf8_estonian_ci   | utf8    | 198 |         | Yes      |       8 |
| utf8_spanish_ci    | utf8    | 199 |         | Yes      |       8 |
| utf8_swedish_ci    | utf8    | 200 |         | Yes      |       8 |
| utf8_turkish_ci    | utf8    | 201 |         | Yes      |       8 |
| utf8_czech_ci      | utf8    | 202 |         | Yes      |       8 |
| utf8_danish_ci     | utf8    | 203 |         | Yes      |       8 |
| utf8_lithuanian_ci | utf8    | 204 |         | Yes      |       8 |
| utf8_slovak_ci     | utf8    | 205 |         | Yes      |       8 |
| utf8_spanish2_ci   | utf8    | 206 |         | Yes      |       8 |
| utf8_roman_ci      | utf8    | 207 |         | Yes      |       8 |
| utf8_persian_ci    | utf8    | 208 |         | Yes      |       8 |
| utf8_esperanto_ci  | utf8    | 209 |         | Yes      |       8 |
| utf8_hungarian_ci  | utf8    | 210 |         | Yes      |       8 |
+--------------------+---------+-----+---------+----------+---------+
21 rows in set (0.04 sec)

I'll create a table test, where column e uses Estonian sorting and column h uses Hungarian sorting.

mysql> create table test (
  e varchar(100) not null collate 'utf8_estonian_ci',
  h varchar(100) not null collate 'utf8_hungarian_ci'
) charset=utf8;

mysql> insert into test values ('a','a');
mysql> insert into test values ('ä','ä');
mysql> insert into test values ('o','o');
mysql> insert into test values ('õ','õ');
mysql> select * from test;
+---+---+
| e | h |
+---+---+
| a | a |
| ä | ä |
| o | o |
| õ | õ |
+---+---+
4 rows in set (0.02 sec)

mysql> select * from test order by e;
+---+---+
| e | h |
+---+---+
| a | a |
| o | o |
| õ | õ |
| ä | ä |
+---+---+
4 rows in set (0.04 sec)

mysql> select * from test order by h;
+---+---+
| e | h |
+---+---+
| a | a |
| ä | ä |
| o | o |
| õ | õ |
+---+---+
4 rows in set (0.01 sec)

Perfect, now what about indexes?

mysql> create index idxe on test (e);
Query OK, 4 rows affected (0.71 sec)
Records: 4  Duplicates: 0  Warnings: 0

mysql> create index idxh on test (h);
Query OK, 4 rows affected (0.18 sec)
Records: 4  Duplicates: 0  Warnings: 0

mysql> explain select e from test order by e;
+----+-------------+-------+-------+---------------+------+---------+------+------+-------------+
| id | select_type | table | type  | possible_keys | key  | key_len | ref  | rows | Extra       |
+----+-------------+-------+-------+---------------+------+---------+------+------+-------------+
|  1 | SIMPLE      | test  | index | NULL          | idxe | 302     | NULL |    4 | Using index |
+----+-------------+-------+-------+---------------+------+---------+------+------+-------------+
1 row in set (0.00 sec)

mysql> explain select h from test order by h;
+----+-------------+-------+-------+---------------+------+---------+------+------+-------------+
| id | select_type | table | type  | possible_keys | key  | key_len | ref  | rows | Extra       |
+----+-------------+-------+-------+---------------+------+---------+------+------+-------------+
|  1 | SIMPLE      | test  | index | NULL          | idxh | 302     | NULL |    4 | Using index |
+----+-------------+-------+-------+---------------+------+---------+------+------+-------------+
1 row in set (0.00 sec)

If I force a different collation on a column, then values are still read from the index, but an extra filesort step is needed:

mysql> explain select h from test order by h collate utf8_estonian_ci;
+----+-------------+-------+-------+---------------+------+---------+------+------+-----------------------------+
| id | select_type | table | type  | possible_keys | key  | key_len | ref  | rows | Extra                       |
+----+-------------+-------+-------+---------------+------+---------+------+------+-----------------------------+
|  1 | SIMPLE      | test  | index | NULL          | idxh | 302     | NULL |    4 | Using index; Using filesort |
+----+-------------+-------+-------+---------------+------+---------+------+------+-----------------------------+
1 row in set (0.00 sec)

I'll add more rows to see how much difference there is between reading the order from the index and doing the extra sort.

mysql> delimiter //
mysql> create procedure load_data(p_num_rows INT)
    -> BEGIN
    ->   SET @i = 0;
    ->   REPEAT
    ->     SET @i = @i + 1;
    ->     INSERT INTO test (e, h) VALUES (CONCAT('eeeaad sadsa dasd asd', @i),
    ->       CONCAT('213aad sadsa dasd asd', @i));
    ->   UNTIL @i > p_num_rows END REPEAT;
    -> END
    -> //
Query OK, 0 rows affected (0.44 sec)

mysql> delimiter ;
mysql> call load_data(10000);
mysql> call load_data(20000);
mysql> ANALYZE TABLE test;

mysql> select sql_no_cache count(*) from (select h from test order by h collate utf8_estonian_ci) a;
+----------+
| count(*) |
+----------+
|    30002 |
+----------+
1 row in set (1.22 sec)

mysql> select sql_no_cache count(*) from (select h from test order by h) a;
+----------+
| count(*) |
+----------+
|    30002 |
+----------+
1 row in set (0.09 sec)

It is also possible to set the collation at the connection level, but this does not change the row sort order the way it does in Oracle.

mysql> truncate table test;
Query OK, 0 rows affected (0.07 sec)

mysql> insert into test values ('a','a');
Query OK, 1 row affected (0.03 sec)

mysql> insert into test values ('ä','ä');
Query OK, 1 row affected (0.08 sec)

mysql> insert into test values ('o','o');
Query OK, 1 row affected (0.03 sec)

mysql> insert into test values ('õ','õ');
Query OK, 1 row affected (0.03 sec)

mysql> set collation_connection=utf8_estonian_ci;
Query OK, 0 rows affected (0.01 sec)

mysql> select h from test order by h;
+---+
| h |
+---+
| a |
| ä |
| o |
| õ |
+---+
4 rows in set (0.00 sec)

mysql> set collation_connection=utf8_hungarian_ci;
Query OK, 0 rows affected (0.00 sec)

mysql> select h from test order by h;
+---+
| h |
+---+
| a |
| ä |
| o |
| õ |
+---+
4 rows in set (0.00 sec)