Ilmar Kerm

Oracle, databases, Linux and maybe more

It feels like no one has really been talking about running Oracle Database on a file system for years now. ASM seems to be the way to go, and Oracle even says in the documentation that local file systems, although supported, are not recommended.

https://docs.oracle.com/en/database/oracle/oracle-database/19/ladbi/supported-storage-options-for-oracle-database.html#GUID-120C3995-F827-4D01-B955-CB1BD0457AF7

ASM works great, and if you have RAC or a few dear pet databases, ASM is for sure a safe way to go – no argument from me here at all. I especially love the possibility to mirror the storage between different failure groups defined by the administrator, so you can mirror the storage between, let's say, different hardware racks, availability zones or just different storage hardware. For high-end dear pet setups, ASM is no doubt the only way to go.

But I think the story changes for a large cattle of small cloudy databases, where you have a large number of databases and want to spend minimal effort managing any single one. After Oracle removed ASM from the database installation in 11g, ASM now requires an Oracle Grid Infrastructure (Oracle Restart) installation – a separate Oracle software setup that is, like any Oracle software, notoriously hard to maintain and patch compared to standard Linux software. And the performance benefits promised by ASM – async IO and direct IO – are also available on a file system. True, the choice of file systems and setups is harder to manage, but if you do your testing properly, the end result is easier to maintain. Do keep in mind that Oracle does not test running on file systems any more and any problem is left to the OS support to resolve – which, in the case of Oracle Linux, is also Oracle 🙂

There are many older blog posts on the Internet about running Oracle on file systems, and the consensus seems to be that XFS is currently the best file system for running databases.

In addition, the note Oracle Linux: Supported and Recommended File Systems (Doc ID 236826.1) says this about XFS:

XFS is designed for high scalability and provides near native I/O performance even when the file system spans multiple storage devices. Beginning with Oracle Linux 7, XFS is the default file system and is included with Basic and Premier Support subscriptions, at no additional charge.

Based on this information I have been running my cattle of small custom cloud database instances on XFS, but quite soon complaints started coming in about occasional database slowness, and I observed high log file sync and commit times. The important bit here – the applications in question commit a lot. They are micro/mini-services that get called a lot, like many thousands of times per second. During earlier platform testing I wasn't able to see any such behaviour, because I used tools like Swingbench and SLOB – and they are quite well-behaved apps.

Then, by accident, I found a post in the Red Hat support portal:

https://access.redhat.com/solutions/5906661

Rather recently the XFS maintainers introduced a change so that direct IO write requests that are not aligned to the file system block size now get serialized more thoroughly. The change was introduced on purpose, to avoid possible corruptions.

The change itself is here – it just introduces some extra waits at the XFS file system level.

https://lore.kernel.org/linux-xfs/e2d4678c-eeae-e9df-dde7-84601a52d903@oracle.com/

By default XFS uses a 4096-byte block size, so unless you also use a 4096-byte block size for Oracle redo logs, you will hit this issue. Using 512-byte block size redo logs on a 4096-byte block size XFS file system means Oracle will issue non-aligned writes for redo.
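If in doubt, it is easy to check both sides of this alignment: xfs_info reports the file system block size in the bsize field of the data section, and v$log shows the block size of each redo log group. The mount point and values below are only illustrative:

# xfs_info /oralocal | grep "^data"
data     =                       bsize=4096   blocks=26214400, imaxpct=25

SQL> select group#, blocksize from v$log;

    GROUP#  BLOCKSIZE
---------- ----------
         1        512
         2        512
         3        512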

On Linux, Oracle allows setting the redo log block size to either 512 or 4096 bytes, but it has a protection mechanism – the redo log block size must match the physical sector size of the storage. Quoting from Doc ID 2148290.1:

Using 512N disks as 4K leads to atomicity problem (e.g. ASM/DB write operation thinks that 4K IO size is atomic which is not. It leads to consistency issues when there is a crash in the middle of a 4K write causing partial write).

If you use disks with a 512-byte sector size, Oracle will only allow you to create redo logs with a 512-byte block size.

# cat /sys/block/vdf/queue/physical_block_size
512
# cat /sys/block/vdf/queue/logical_block_size
512

SQL> alter database add logfile '/oralocal/1.rdo' size 1g blocksize 4096;

ORA-01378: The logical block size (4096) of file /oralocal/1.rdo is not
compatible with the disk sector size (media sector size is 512 and host sector size is 512)

If you want to use XFS for redo logs, you need to create the XFS file system with a 512-byte block size. This brings another issue – a 512-byte block size requires the deprecated XFS V4 format (crc=0). It is still currently possible, but a warning is issued.

# mkfs.xfs -b size=512 -m crc=0 -f /dev/vg01/redo
V4 filesystems are deprecated and will not be supported by future versions.
meta-data=/dev/vg01/redo         isize=256    agcount=4, agsize=31457280 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=0        finobt=0, sparse=0, rmapbt=0
         =                       reflink=0    bigtime=0 inobtcount=0
data     =                       bsize=512    blocks=125829120, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=512    blocks=204800, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

This leads me to the following conclusions:

  • XFS can only be used for Oracle redo logs if you use storage with 4096 sector size (advanced format)
  • For 512 byte sector size storage, a different file system is needed – for example EXT4 (a formatting sketch follows this list).
  • XFS is still fine to use for Oracle datafiles. This issue only concerns redo.
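For illustration, a dedicated EXT4 file system for redo logs could be created roughly like this – the device name matches the earlier XFS example, while the mount point, label and mount options are just hypothetical placeholders, not a tested recommendation:

# mkfs.ext4 -L oraredo /dev/vg01/redo
# mount -o noatime /dev/vg01/redo /oraredo
# chown oracle:oinstall /oraredo

EXT4's default 4096-byte block size is not a problem here, since the serialization of non-aligned direct IO writes discussed above is an XFS-specific change – the test results below confirm this.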

Testing the impact

How bad is this issue exactly? When I ran comparisons using SLOB, I did not notice much difference between XFS and EXT4. Looking at the storage patterns I saw that SLOB issues large writes and commits rarely, so the storage sees quite large writes coming from the log writer.

But my apps make small changes and commit crazy often (yes, bad database application design – but I think quite common thinking for an average enterprise application developer). So I needed a new testing suite that can generate a high number of IOPS with very small IO sizes. Introducing the Horrible Little Oracle Benchmark.

https://gist.github.com/ilmarkerm/462d14da050fb73fb4eeed5384604f1b

It just commits a lot, after making a small change to the database. The results below do not represent any kind of maximum throughput limits; I just wanted to see the impact of using different file systems under a similar load.

Baseline… XFS bsize=4096

Load Profile                    Per Second   Per Transaction
~~~~~~~~~~~~~~~            ---------------   ---------------
      Redo size (bytes):       2,247,197.7             573.7
  Logical read (blocks):          19,760.3               5.0
          Block changes:          15,744.3               4.0
 Physical read (blocks):               0.3               0.0
Physical write (blocks):              45.6               0.0
       Read IO requests:               0.3               0.0
      Write IO requests:               3.6               0.0
         Executes (SQL):           3,923.7               1.0
           Transactions:           3,917.3
 
Event                                Waits Time (sec)      Wait   time
------------------------------ ----------- ---------- --------- ------
log file sync                    2,367,110     3340.7    1.41ms   92.8
DB CPU                                          281.9              7.8
buffer busy waits                  129,582        4.6   35.49us     .1
log file switch (private stran          10         .1    9.29ms     .0
db file sequential read                122         .1  513.16us     .0
control file sequential read            98          0  327.92us     .0

                                                Total
                                       %Time     Wait              Waits   % bg
Event                            Waits -outs Time (s)  Avg wait     /txn   time
-------------------------- ----------- ----- -------- --------- -------- ------
log file parallel write      1,182,647     0    1,166     .99ms      0.5   64.6
LGWR any worker group        1,179,342     0      592  501.58us      0.5   32.8

EXT4

Load Profile                    Per Second   Per Transaction
~~~~~~~~~~~~~~~            ---------------   ---------------
      Redo size (bytes):       4,151,622.2             571.5
         Executes (SQL):           7,267.6               1.0
           Transactions:           7,264.7
 
Event                                Waits Time (sec)      Wait   time
------------------------------ ----------- ---------- --------- ------
log file sync                    4,386,595     3114.6  710.03us   86.5
DB CPU                                          517.8             14.4
buffer busy waits                  302,823       11.1   36.71us     .3

                                                Total
                                       %Time     Wait              Waits   % bg
Event                            Waits -outs Time (s)  Avg wait     /txn   time
-------------------------- ----------- ----- -------- --------- -------- ------
log file parallel write      1,954,831     0      841  430.21us      0.4   59.8
LGWR any worker group        1,226,037     0      271  221.22us      0.3   19.3
LGWR all worker groups         330,590     0      166  502.00us      0.1   11.8
LGWR worker group ordering     318,164     0       47  146.44us      0.1    3.3

XFS with blocksize 512

Load Profile                    Per Second   Per Transaction
~~~~~~~~~~~~~~~            ---------------   ---------------
      Redo size (bytes):       4,118,949.8             571.4
         Executes (SQL):           7,213.7               1.0
           Transactions:           7,208.6
 
Event                                Waits Time (sec)      Wait   time
------------------------------ ----------- ---------- --------- ------
log file sync                    4,354,303     3127.9  718.34us   86.9
DB CPU                                          503.7             14.0
buffer busy waits                  349,311       12.5   35.76us     .3

                                                Total
                                       %Time     Wait              Waits   % bg
Event                            Waits -outs Time (s)  Avg wait     /txn   time
-------------------------- ----------- ----- -------- --------- -------- ------
log file parallel write      1,950,597     0      847  434.01us      0.4   59.7
LGWR any worker group        1,245,012     0      277  222.65us      0.3   19.5
LGWR all worker groups         330,132     0      169  512.99us      0.1   11.9
LGWR worker group ordering     318,918     0       44  139.26us      0.1    3.1

EXT4 with LVM

In the previous tests I created the file system on top of the raw block device; here I want to compare what happens if there is LVM in the middle.

Load Profile                    Per Second   Per Transaction
~~~~~~~~~~~~~~~            ---------------   ---------------
      Redo size (bytes):       3,412,249.5             571.3
         Executes (SQL):           5,981.8               1.0
           Transactions:           5,972.8
 
Event                                Waits Time (sec)      Wait   time
------------------------------ ----------- ---------- --------- ------
log file sync                    3,613,132     3203.5  886.64us   88.9
DB CPU                                          423.9             11.8
buffer busy waits                  301,471         11   36.60us     .3

                                                Total
                                       %Time     Wait              Waits   % bg
Event                            Waits -outs Time (s)  Avg wait     /txn   time
-------------------------- ----------- ----- -------- --------- -------- ------
log file parallel write      1,610,090     0      843  523.49us      0.4   59.0
LGWR any worker group        1,024,262     0      280  273.01us      0.3   19.6
LGWR all worker groups         270,089     0      171  631.60us      0.1   11.9
LGWR worker group ordering     310,863     0       66  210.95us      0.1    4.6

The important bits

  • Oracle RDBMS 19.16.2
  • Oracle Linux 8.7 with UEKr6 kernel 5.4.17-2136.316.7.el8uek.x86_64
  • Dedicated filesystem for redo logs
  • filesystemio_options=setall (see the example after this list)
  • No partitioning on block devices
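For reference, filesystemio_options is a static parameter, so switching it to setall (enabling both async IO and direct IO on a file system) requires an spfile change and an instance restart:

SQL> alter system set filesystemio_options=setall scope=spfile;
SQL> shutdown immediate
SQL> startup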

During all tests the datafiles remained on the same file system and the load on them was absolutely minimal. Between each test I reformatted the redo log file system and copied the redo log files over to the new file system.
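The swap was done with the database shut down, roughly along these lines – the paths are hypothetical and this is only a sketch of the procedure, with mkfs.ext4 replaced by mkfs.xfs and the appropriate options depending on the test:

SQL> shutdown immediate
# cp /oraredo/*.rdo /oralocal/redo_keep/
# umount /oraredo
# mkfs.ext4 /dev/vg01/redo
# mount -o noatime /dev/vg01/redo /oraredo
# cp /oralocal/redo_keep/*.rdo /oraredo/
# chown oracle:oinstall /oraredo /oraredo/*.rdo
SQL> startup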

I also tested both XFS and EXT4 with filesystemio_options set to asynch or none, but both performed similarly to XFS bsize=4096 with filesystemio_options=setall.

Conclusion

From my tests I conclude that the XFS performance degradation (when not using storage with a 4096-byte sector size) is significant for applications with a high commit rate. The slowdown is observed as longer latency in the log file sync and log file parallel write wait events.

As a replacement, EXT4 and XFS with blocksize=512 perform similarly. Adding LVM into the mix reduces the commit throughput a little. Since the 512-byte block size is deprecated in XFS, EXT4 is the go-to major file system on Linux for Oracle database redo logs.

Before choosing a file system for an Oracle database, do read through Oracle Linux: Supported and Recommended File Systems (Doc ID 236826.1) to understand the implications.

One comment

  1. > Looking at the storage patterns I saw that SLOB issues large writes and commits rarely

    A SLOB test can be configured in countless ways. This is a good blog post, but it would have been even better to list your SLOB details (loaded SCALE, runit.sh options/args and slob.conf).
