Oracle redo log performance on Linux filesystems
- Written by: ilmarkerm
- Category: Blog entry
- Published: March 18, 2023
It feels like no one has really been talking about running Oracle databases on a filesystem for years now. ASM seems to be the way to go, and Oracle even says in the documentation that local file systems, although supported, are not recommended.
ASM works great, and if you have RAC or a few dear pet databases, ASM is for sure a safe way to go – no argument from me here at all. I especially love the possibility to mirror the storage between different failure groups that are defined by the administrator, so you can mirror the storage between, let's say, different hardware racks, availability zones, or just different storage hardware. For high-end dear pet setups, ASM is no doubt the only way to go.
But I think the story changes for a large cattle of small cloudy databases, where you have a large number of databases and want to spend minimal effort managing any single one. Since Oracle removed ASM from the database installation in 11g, ASM requires an Oracle Grid Infrastructure (Oracle Restart) installation – a separate Oracle software setup – that is, like any Oracle software, notoriously hard to maintain and patch compared to standard Linux software. And the performance benefits promised by ASM, async IO and direct IO, are also available on a file system. True – the choice of filesystems and setups is harder to manage, but if you do your testing properly, the end result is easier to maintain. Do keep in mind that Oracle does not test running on filesystems any more, and any problem is left to OS support to resolve – which, in the case of Oracle Linux, is also Oracle 🙂
There are many historical blog posts about running Oracle on filesystems on the Internet, and the consensus seems to be that XFS is currently the best filesystem for running databases.
In addition, MOS note Oracle Linux: Supported and Recommended File Systems (Doc ID 236826.1) says this about XFS:
XFS is designed for high scalability and provides near native I/O performance even when the file system spans multiple storage devices. Beginning with Oracle Linux 7, XFS is the default file system and is included with Basic and Premier Support subscriptions, at no additional charge.
Based on this information I have been running my cattle of small custom cloud database instances on XFS, but quite soon I started to get complaints about occasional database slowness, and I observed high log file sync and commit times. The important bit here – the applications in question commit a lot – they are micro/mini-services that get called a lot, like many thousands of times per second. Earlier, during platform testing, I wasn't really able to see any such behaviour, because I used tools like Swingbench and SLOB – and they are quite well-behaved apps.
Then, by accident, I found a post on the Red Hat support portal:
https://access.redhat.com/solutions/5906661
Rather recently the XFS maintainers introduced a change: direct IO write requests that are not aligned to the filesystem block size now get serialized more thoroughly. The change was introduced on purpose, to avoid possible corruption.
The change itself is here. It just introduces some extra waits at the XFS filesystem level.
https://lore.kernel.org/linux-xfs/e2d4678c-eeae-e9df-dde7-84601a52d903@oracle.com/
By default XFS uses a 4096-byte block size – so unless you use a 4096-byte block size for Oracle redo logs, you will hit this issue. Using 512-byte block size redo logs on a 4096-byte block size XFS filesystem means Oracle will issue non-aligned writes for redo.
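To check whether a database is affected, compare the redo log block size against the XFS filesystem block size. A quick check could look like this (the mount point is just an example; V$LOG exposes the redo log block size):
# xfs_info /oralocal | grep bsize
SQL> select group#, blocksize from v$log;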
On Linux, Oracle allows setting the redo log block size to either 512 or 4096 bytes, but it has a protection mechanism – the redo log block size must match the storage physical sector size. Quoting from Doc ID 2148290.1:
Using 512N disks as 4K leads to atomicity problem (e.g. ASM/DB write operation thinks that 4K IO size is atomic which is not. It leads to consistency issues when there is a crash in the middle of a 4K write causing partial write).
If you use disks with a 512-byte sector size, then Oracle will only allow you to create redo logs with a 512-byte block size.
# cat /sys/block/vdf/queue/physical_block_size
512
# cat /sys/block/vdf/queue/logical_block_size
512
SQL> alter database add logfile '/oralocal/1.rdo' size 1g blocksize 4096;
ORA-01378: The logical block size (4096) of file /oralocal/1.rdo is not
compatible with the disk sector size (media sector size is 512 and host sector size is 512)
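Conversely, on a 4K-native device, where both the physical and logical sector size report 4096, the same statement is accepted. The device name, mount point and output below are hypothetical, just to illustrate the expected behaviour:
# cat /sys/block/vdg/queue/physical_block_size
4096
# cat /sys/block/vdg/queue/logical_block_size
4096
SQL> alter database add logfile '/oralocal4k/1.rdo' size 1g blocksize 4096;
Database altered.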
If you want to use XFS for redo logs, you need to create the XFS filesystem with a 512-byte block size. This brings up another issue – XFS has deprecated such a small block size. It is currently still possible, but a warning is issued.
# mkfs.xfs -b size=512 -m crc=0 -f /dev/vg01/redo
V4 filesystems are deprecated and will not be supported by future versions.
meta-data=/dev/vg01/redo isize=256 agcount=4, agsize=31457280 blks
= sectsz=512 attr=2, projid32bit=1
= crc=0 finobt=0, sparse=0, rmapbt=0
= reflink=0 bigtime=0 inobtcount=0
data = bsize=512 blocks=125829120, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0, ftype=1
log =internal log bsize=512 blocks=204800, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
To me this brings the following conclusions:
- XFS can only be used for Oracle redo logs if you use storage with 4096 sector size (advanced format)
- For storage with a 512-byte sector size, a different filesystem is needed – for example, EXT4 (a minimal example follows below).
- XFS is still fine to use for Oracle datafiles. This issue only concerns redo.
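Creating an EXT4 filesystem for redo needs no special options, since EXT4 direct IO only has to be aligned to the device logical sector size, not the filesystem block size. A minimal sketch, reusing the device and mount point names from the examples above:
# mkfs.ext4 /dev/vg01/redo
# mount /dev/vg01/redo /oralocal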
Testing the impact
How bad is this issue exactly? When I ran comparisons using SLOB, I did not notice much difference between XFS and EXT4. Looking at the storage patterns I saw that SLOB issues large writes and commits rarely, so the storage sees quite large writes coming from the log writer.
But my apps make small changes and commit crazy often (yes, bad database application design – but I think quite common thinking for an average enterprise application developer). So I needed a new testing suite that can generate a high number of IOPS with very small IO sizes. Introducing Horrible Little Oracle Benchmark.
https://gist.github.com/ilmarkerm/462d14da050fb73fb4eeed5384604f1b
It just commits a lot, after making a small change to the database each time. The results below do not represent any kind of maximum throughput limits; I just wanted to see the impact of using different filesystems under a similar load.
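The gist above has the actual code; the core idea is roughly the following sketch (the table name is made up for illustration):
-- Hypothetical driver table
create table commit_test (id number, val number);
-- Tight loop: tiny change + commit, so every iteration
-- has to wait for a log file sync
begin
  for i in 1 .. 1000000 loop
    insert into commit_test values (i, i);
    commit;
  end loop;
end;
/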
Baseline… XFS bsize=4096
Load Profile                       Per Second  Per Transaction
~~~~~~~~~~~~~~~               ---------------  ---------------
Redo size (bytes):                2,247,197.7            573.7
Logical read (blocks):               19,760.3              5.0
Block changes:                       15,744.3              4.0
Physical read (blocks):                   0.3              0.0
Physical write (blocks):                 45.6              0.0
Read IO requests:                         0.3              0.0
Write IO requests:                        3.6              0.0
Executes (SQL):                       3,923.7              1.0
Transactions:                         3,917.3

Event                                 Waits  Time (sec)  Avg wait  % DB time
------------------------------  -----------  ----------  --------  ---------
log file sync                     2,367,110      3340.7    1.41ms       92.8
DB CPU                                            281.9                  7.8
buffer busy waits                   129,582         4.6   35.49us         .1
log file switch (private stran           10          .1    9.29ms         .0
db file sequential read                 122          .1  513.16us         .0
control file sequential read             98           0  327.92us         .0

                                            %Time  Total Wait             Waits  % bg
Event                                Waits  -outs    Time (s)  Avg wait    /txn  time
------------------------------  ----------  -----  ----------  --------  ------  ----
log file parallel write          1,182,647      0       1,166     .99ms     0.5  64.6
LGWR any worker group            1,179,342      0         592  501.58us     0.5  32.8
EXT4
Load Profile                       Per Second  Per Transaction
~~~~~~~~~~~~~~~               ---------------  ---------------
Redo size (bytes):                4,151,622.2            571.5
Executes (SQL):                       7,267.6              1.0
Transactions:                         7,264.7

Event                                 Waits  Time (sec)  Avg wait  % DB time
------------------------------  -----------  ----------  --------  ---------
log file sync                     4,386,595      3114.6  710.03us       86.5
DB CPU                                            517.8                 14.4
buffer busy waits                   302,823        11.1   36.71us         .3

                                            %Time  Total Wait             Waits  % bg
Event                                Waits  -outs    Time (s)  Avg wait    /txn  time
------------------------------  ----------  -----  ----------  --------  ------  ----
log file parallel write          1,954,831      0         841  430.21us     0.4  59.8
LGWR any worker group            1,226,037      0         271  221.22us     0.3  19.3
LGWR all worker groups             330,590      0         166  502.00us     0.1  11.8
LGWR worker group ordering         318,164      0          47  146.44us     0.1   3.3
XFS with blocksize 512
Load Profile                       Per Second  Per Transaction
~~~~~~~~~~~~~~~               ---------------  ---------------
Redo size (bytes):                4,118,949.8            571.4
Executes (SQL):                       7,213.7              1.0
Transactions:                         7,208.6

Event                                 Waits  Time (sec)  Avg wait  % DB time
------------------------------  -----------  ----------  --------  ---------
log file sync                     4,354,303      3127.9  718.34us       86.9
DB CPU                                            503.7                 14.0
buffer busy waits                   349,311        12.5   35.76us         .3

                                            %Time  Total Wait             Waits  % bg
Event                                Waits  -outs    Time (s)  Avg wait    /txn  time
------------------------------  ----------  -----  ----------  --------  ------  ----
log file parallel write          1,950,597      0         847  434.01us     0.4  59.7
LGWR any worker group            1,245,012      0         277  222.65us     0.3  19.5
LGWR all worker groups             330,132      0         169  512.99us     0.1  11.9
LGWR worker group ordering         318,918      0          44  139.26us     0.1   3.1
EXT4 with LVM
In the previous tests I created the filesystem on top of the raw block device; here I want to compare what happens if there is LVM in the middle.
Load Profile                       Per Second  Per Transaction
~~~~~~~~~~~~~~~               ---------------  ---------------
Redo size (bytes):                3,412,249.5            571.3
Executes (SQL):                       5,981.8              1.0
Transactions:                         5,972.8

Event                                 Waits  Time (sec)  Avg wait  % DB time
------------------------------  -----------  ----------  --------  ---------
log file sync                     3,613,132      3203.5  886.64us       88.9
DB CPU                                            423.9                 11.8
buffer busy waits                   301,471          11   36.60us         .3

                                            %Time  Total Wait             Waits  % bg
Event                                Waits  -outs    Time (s)  Avg wait    /txn  time
------------------------------  ----------  -----  ----------  --------  ------  ----
log file parallel write          1,610,090      0         843  523.49us     0.4  59.0
LGWR any worker group            1,024,262      0         280  273.01us     0.3  19.6
LGWR all worker groups             270,089      0         171  631.60us     0.1  11.9
LGWR worker group ordering         310,863      0          66  210.95us     0.1   4.6
The important bits
- Oracle RDBMS 19.16.2
- Oracle Linux 8.7 with UEKr6 kernel 5.4.17-2136.316.7.el8uek.x86_64
- Dedicated filesystem for redo logs
- filesystemio_options=setall (set as shown in the example after this list)
- No partitioning on block devices
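For reference, this is how the direct/async IO setting is applied – the parameter is not dynamic, so an instance restart is needed:
SQL> alter system set filesystemio_options=setall scope=spfile;
SQL> shutdown immediate
SQL> startup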
During all tests the datafiles remained on the same filesystem, and the load on them was absolutely minimal. Between each test I reformatted the redo log filesystem and copied the redo log files over to the new filesystem.
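The procedure between tests was roughly the following sketch (the backup location is hypothetical, and the database must be down while the redo logs are moved):
SQL> shutdown immediate
# cp /oralocal/*.rdo /backup/
# umount /oralocal
# mkfs.ext4 /dev/vg01/redo
# mount /dev/vg01/redo /oralocal
# cp /backup/*.rdo /oralocal/
SQL> startup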
I also tested both XFS and EXT4 with filesystemio_options set to asynch or none, but both of them performed similarly to XFS bsize=4096 with filesystemio_options=setall.
Conclusion
From my tests I conclude that the XFS performance degradation (when not using storage with a 4096-byte sector size) is significant for applications with a high commit rate. The slowdown is observed as longer latency in the log file sync and log file parallel write wait events.
As a replacement, EXT4 and XFS with blocksize=512 perform similarly. Adding LVM into the mix reduces the commit throughput a little. Since XFS with blocksize=512 is deprecated, EXT4 is the go-to filesystem on Linux for Oracle database redo logs.
Before going for a file system for Oracle database, do read through Oracle Linux: Supported and Recommended File Systems (Doc ID 236826.1) to understand the implications.
> Looking at the storage patterns I saw that SLOB issues large writes and commits rarely
A SLOB test can be configured in countless ways. This is a good blog post, but it would have been even better to list your SLOB details (loaded SCALE, runit.sh options/arguments, and slob.conf).