November 2017 – Ilmar Kerm

I came to work today morning and there was an alert in my inbox saying that one of the large databases failed the nightly restore test.
Looking into RMAN logs I saw that recovery failed when applying archivelogs and error was something I have never seen before:

ORA-00756: recovery detected a lost write of a data block
ORA-10567: Redo is inconsistent with data block (file# 3, block# 37166793, file offset is 3350347776 bytes)
ORA-10564: tablespace SYSAUX
ORA-01110: data file 3: '/nfs/...'
ORA-10560: block type 'FIRST LEVEL BITMAP BLOCK'

Version: 12.1.0.2 with 2017-08 bundle patch

Looking at MOS I see two bugs that could match:

Bug 22302666 ORA-753 ORA-756 or ORA-600 [3020] with KCOX_FUTURE after RMAN Restore / PITR with BCT after Open Resetlogs in 12c

but this bugfix is already included in 2017-08 bundle patch, and:

Bug 23589471 ORA-600 [3020] KCOX_FUTURE or ORA-756 Lost Write detected during Recovery of a RMAN backup that used BCT

Looks like this matches quite well with my situation and the note has a really scary sentence in it: Incremental RMAN backups using BCT may miss to backup some blocks.

Bugs exists in modern software, bugs exist even in rock solid Oracle database backup and recovery procedures. It doesn’t matter if the backup was completed successfully, the state of the backup can only be determened when a restore is attempted. So please start testing your backups. Regularly. Daily.

Non-technical post about database backup and recovery
Scripts to implement incremental-forever backup strategy in Oracle and testing backups automatically

We have hundreds of developers who need access to production database for incident management purposes. But we don’t want to use shared accounts to access the database, each user has their own access to the database that is audited and has privileges according to the users access level.
Managing all these users manually on Oracle Database side is just too much, especially that all their access details are already described in Active Directory. Wouldn’t it be nice if we can just syncronise the AD users to the database side? Luckily we have an Ansible module for this task.

First, on the database side need to create a dedicated profile for the syncronised users:

CREATE PROFILE ldap_user LIMIT password_life_time UNLIMITED;

I assume you are already familiary with Ansible, so I’ll go straight to the point.
First you need to clone the ansible-oracle-modules to your playbook directory:

git clone https://github.com/oravirt/ansible-oracle-modules library

This contains, among other goodies, a module that does exactly what is required 🙂 The module is called oracle_ldapuser.
This module requires extra python-ldap module to be installed. Install it using yum, not pip. Pip will install wrong version.

yum install python-ldap

The playbook looks like this:

Remember that it is a syncronisation, so new users are created and removed when the playbook is run.

Here is a slightly modified soft blog post I wrote to explain my technical OUG conference presentation about database backups to a non technical audience.

All interactions our customers do on our site generate data. Now imagine that due to some disaster this data is suddenly lost or corrupted and we cannot get it back. What would happen?

This is not a fantasy, events like this happen in real life. For example take a look what happened to Gitlab this year:
https://www.theregister.co.uk/2017/02/01/gitlab_data_loss/
TL;DR: Incredible series of human errors and poorly tested recovery procedures. Resulting in a long downtime and data loss for the popular site.

Or take a look what happened to Google Music in 2012:
https://landing.google.com/sre/book/chapters/data-integrity.html#google-musicmarch-2012-runaway-deletion-detection-JMsvtqsz
TL;DR: Buggy privacy-protection data pipeline removed 600000 audio tracks from Google Music. Over a month passed since Google engineers noticed the problem. About 436000 tracks were recovered from cold offsite tape backups, but a small number of the remaining tracks users had to re-upload to Google Music service again.
Can you imagine if we have to ask our users to re-enter their data again?

Quoting from Linux System Administrators Guide:
http://www.tldp.org/LDP/sag/html/backups.html

There are basically four reasons why you might lose data: hardware failures, software bugs, human action, or natural disasters. Although modern hardware tends to be quite reliable, it can still break seemingly spontaneously. … Modern software doesn’t even tend to be reliable; a rock solid program is an exception, not a rule. Humans are quite unreliable, they will either make a mistake, or they will be malicious and destroy data on purpose. Nature might not be evil, but it can wreak havoc even when being good. All in all, it is a small miracle that anything works at all.

These are the situations that our disaster recovery plan must protect us from. We need to imagine every possible situation that results in a data loss and we must plan and test how to recover from them. There are plenty of modern features in modern databases that help you recover corrupted data: view historical data versions, have a real-time updated copy of the data off-site or roll back the entire database or single committed transactions. All these features are part of our disaster recovery plan, but quite often people who are not deeply familiar with databases seem to look at these modern features and think that they are enough to keep the data safe. No, all these features have severe limitations – they only work on limited situations and they only work during a very limited time window.
Often it is quite sad to see new database products on the market that make it even difficult or impossible to put a proper fail-safe recovery plan in place. This is completely fine for the majority of today’s data of IOT metrics, clicks, likes and tweets – data where you only care about the overall statistical picture, but it is very confusing for me to see these products marketed to store for example financial data.

For example here is a quote from Google discussing online database replication limitation:
https://landing.google.com/sre/book/chapters/data-integrity.html#scaling-issues-fulls-incrementals-and-the-competing-forces-of-backups-and-restores-ZKsnIpTjU2

A classic but flawed response to the question “Do you have a backup?” is “We have something even better than a backup—replication!” Replication provides many benefits, including locality of data and protection from a site-specific disaster, but it can’t protect you from many sources of data loss. Datastores that automatically sync multiple replicas guarantee that a corrupt database row or errant delete are pushed to all of your copies, likely before you can isolate the problem.

All these features are necessary to the overall disaster recovery plan, but they are not enough. They work as designed and protect against losses they were designed against. They have limitations and this is OK and by design, because when a larger disaster strikes, you must always have and can always fall back to a time honoured disaster recovery solution – keeping an extra copy of your data separate from your main system – a backup – and backups need to be retained for months.

Quoting from Linux System Administrators Guide again:

When it comes to backups, paranoia is in the job description.

Backups also have a cost associated to them, it would be quite expensive to fully back up entire 100+TB data warehouse database every day and keep every daily copy for months. This is a topic I discuss in my Oracle User Group presentation.

Hopefully I’ve now convinced you that backups are essential, but let’s be honest, you really don’t care about backups, you only care about recovery – the ability to recover data from a disaster.

Let’s imagine now that you have a backup system that works flawlessly, every backup you execute finishes without errors. Maybe you even paid good money to a backup vendor for this system. Then disaster strikes and you need to restore from backup. The documented restore process starts fine, but during some point it returns an error – the backup product’s internal deduplication engine database is corrupted and it cannot restore the data, since it cannot find the data pieces anymore. Improbable and it will never happen you say? No, this did actually happen to me. But back then I did not actually regularly and frequently test that I was able to restore from backups, because restoring a single backup took days consuming a lot of resources and planning to execute. This is a very frequent situation companies encounter even today. There even is a law:

Schrödinger’s backup law

The state of any backup is unknown until a restore is attempted

Corollary:

A backup that has never been restored is equally likely to be good or bad. Like Herr Schrödinger’s poor cat!

After implementing the new backup system I discuss in my Oracle User Group presentation for our Oracle databases, we are now able to:

test all database backups daily and automatically. Testing backups daily is actually very very rare even today due to the high cost involved. The new system makes it very cheap and effortless.
use the backups for creating extra production database copies for creating test databases or refreshing performance testing environment or upgrading or …
reduce the downtime of the site following a total disaster

Ilmar Kerm

Archives: Regular database restore tests are important

Archives: Syncronising users from Active Directory to Oracle Database using Ansible

Archives: About database backup and recovery

Categories

About me

Pages

Recent Posts

Tags

Archive