Ilmar Kerm

Oracle, databases, Linux and maybe more

Here is a slightly modified soft blog post I wrote to explain my technical OUG conference presentation about database backups to a non-technical audience.

All interactions our customers do on our site generate data. Now imagine that due to some disaster this data is suddenly lost or corrupted and we cannot get it back. What would happen?

This is not a fantasy; events like this happen in real life. For example, take a look at what happened to GitLab this year:
https://www.theregister.co.uk/2017/02/01/gitlab_data_loss/
TL;DR: An incredible series of human errors and poorly tested recovery procedures resulted in a long downtime and data loss for the popular site.

Or take a look at what happened to Google Music in 2012:
https://landing.google.com/sre/book/chapters/data-integrity.html#google-musicmarch-2012-runaway-deletion-detection-JMsvtqsz
TL;DR: A buggy privacy-protection data pipeline removed 600000 audio tracks from Google Music. Over a month passed since Google engineers noticed the problem. About 436000 tracks were recovered from cold offsite tape backups, but users had to re-upload a small number of the remaining tracks to Google Music themselves.
Can you imagine having to ask our users to re-enter their data?

Quoting from the Linux System Administrator's Guide:
http://www.tldp.org/LDP/sag/html/backups.html

There are basically four reasons why you might lose data: hardware failures, software bugs, human action, or natural disasters. Although modern hardware tends to be quite reliable, it can still break seemingly spontaneously. … Modern software doesn’t even tend to be reliable; a rock solid program is an exception, not a rule. Humans are quite unreliable, they will either make a mistake, or they will be malicious and destroy data on purpose. Nature might not be evil, but it can wreak havoc even when being good. All in all, it is a small miracle that anything works at all.

These are the situations that our disaster recovery plan must protect us from. We need to imagine every possible situation that results in data loss, and we must plan and test how to recover from each of them. Modern databases have plenty of features that help you recover corrupted data: viewing historical versions of the data, keeping a real-time updated copy of the data off-site, or rolling back the entire database or single committed transactions. All these features are part of our disaster recovery plan, but quite often people who are not deeply familiar with databases look at these features and think that they are enough to keep the data safe. They are not. All these features have severe limitations: they only work in limited situations and only during a very limited time window.
It is often quite sad to see new database products on the market that make it difficult or even impossible to put a proper fail-safe recovery plan in place. This is completely fine for the majority of today’s data, such as IoT metrics, clicks, likes and tweets, where you only care about the overall statistical picture. But it is very confusing for me to see these products marketed for storing, for example, financial data.

For example, here is a quote from Google discussing the limitations of online database replication:
https://landing.google.com/sre/book/chapters/data-integrity.html#scaling-issues-fulls-incrementals-and-the-competing-forces-of-backups-and-restores-ZKsnIpTjU2

A classic but flawed response to the question “Do you have a backup?” is “We have something even better than a backup—replication!” Replication provides many benefits, including locality of data and protection from a site-specific disaster, but it can’t protect you from many sources of data loss. Datastores that automatically sync multiple replicas guarantee that a corrupt database row or errant delete are pushed to all of your copies, likely before you can isolate the problem.

All these features are necessary parts of the overall disaster recovery plan, but they are not enough. They work as designed and protect against the losses they were designed for. Their limitations are OK and by design, because when a larger disaster strikes you must always be able to fall back to the time-honoured disaster recovery solution: a backup, an extra copy of your data kept separate from your main system. And backups need to be retained for months.
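The point of the Google quote can be shown with a toy model. This is only an illustrative sketch, not how any real database replicates data: the `ReplicatedStore` class, its methods and the `order-1001` key are all made up for the example.

```python
import copy


class ReplicatedStore:
    """Toy key-value store (hypothetical) that pushes every change to all replicas."""

    def __init__(self, replica_count):
        self.replicas = [{} for _ in range(replica_count)]

    def put(self, key, value):
        for replica in self.replicas:
            replica[key] = value

    def delete(self, key):
        # An errant delete is replicated just as faithfully as a valid write.
        for replica in self.replicas:
            replica.pop(key, None)

    def snapshot(self):
        # A backup is a separate copy that later changes cannot touch.
        return copy.deepcopy(self.replicas[0])


store = ReplicatedStore(replica_count=3)
store.put("order-1001", "paid")
backup = store.snapshot()

store.delete("order-1001")  # the mistake

surviving_replicas = sum("order-1001" in replica for replica in store.replicas)
print("replicas still holding the row:", surviving_replicas)
print("row present in backup:", "order-1001" in backup)
```

After the mistaken delete no replica holds the row any more, while the separately kept snapshot still does.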

Quoting from the Linux System Administrator's Guide again:

When it comes to backups, paranoia is in the job description.

Backups also have a cost associated with them. It would be quite expensive to fully back up an entire 100+ TB data warehouse database every day and keep every daily copy for months. This is a topic I discuss in my Oracle User Group presentation.
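To get a feeling for the numbers, here is a rough back-of-the-envelope calculation. The 90-day retention and the 2% daily change rate are assumptions chosen only for illustration; they are not figures from the presentation, and the simple "one full copy plus daily changes" model is not a description of any particular backup product.

```python
def full_backup_storage_tb(db_size_tb, retention_days):
    """Storage needed when a complete copy is kept for every day of retention."""
    return db_size_tb * retention_days


def incremental_storage_tb(db_size_tb, retention_days, daily_change_rate):
    """Storage needed for one full copy plus only the data changed each day."""
    return db_size_tb + db_size_tb * daily_change_rate * retention_days


# Assumed values, for illustration only.
DB_SIZE_TB = 100
RETENTION_DAYS = 90
DAILY_CHANGE_RATE = 0.02

full_tb = full_backup_storage_tb(DB_SIZE_TB, RETENTION_DAYS)
incremental_tb = incremental_storage_tb(DB_SIZE_TB, RETENTION_DAYS, DAILY_CHANGE_RATE)

print(f"daily full copies: {full_tb:.0f} TB")
print(f"one full copy plus daily changes: {incremental_tb:.0f} TB")
```

Under these assumptions daily full copies need 9000 TB of storage, while keeping one full copy plus the daily changes needs about 280 TB.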

Hopefully I’ve now convinced you that backups are essential. But let’s be honest: you really don’t care about backups, you only care about recovery, the ability to recover data from a disaster.

Let’s imagine now that you have a backup system that works flawlessly: every backup you execute finishes without errors. Maybe you even paid good money to a backup vendor for this system. Then disaster strikes and you need to restore from backup. The documented restore process starts fine, but at some point it returns an error. The backup product’s internal deduplication engine database is corrupted and it cannot restore the data, since it cannot find the data pieces anymore. Improbable, and it will never happen, you say? No, this actually did happen to me. Back then I did not regularly and frequently test that I was able to restore from backups, because restoring a single backup took days, consumed a lot of resources and needed planning to execute. Companies encounter this situation very frequently even today. There is even a law:

Schrödinger’s backup law

The state of any backup is unknown until a restore is attempted

Corollary:

A backup that has never been restored is equally likely to be good or bad. Like Herr Schrödinger’s poor cat!
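The law can be illustrated with a toy restore test. This is a minimal sketch using plain files, not the backup system from the presentation: the `restore_test` function and the file names are hypothetical, and the file copy is only a stand-in for a real restore procedure. A real database restore test would also open the restored database and validate it.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 checksum of a file's content."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def restore_test(backup_path, expected_checksum, scratch_dir):
    """Restore a backup into a scratch directory and verify what came back."""
    restored = Path(scratch_dir) / "restored.dat"
    shutil.copyfile(backup_path, restored)  # stand-in for the real restore
    return sha256_of(restored) == expected_checksum


with tempfile.TemporaryDirectory() as workdir:
    work = Path(workdir)
    source = work / "data.dat"
    source.write_bytes(b"customer data" * 1000)
    checksum = sha256_of(source)

    # Both "backup jobs" finish without reporting any error.
    good_backup = work / "backup_good.dat"
    shutil.copyfile(source, good_backup)
    bad_backup = work / "backup_bad.dat"
    bad_backup.write_bytes(source.read_bytes()[:-10])  # silently truncated

    # Only an actual restore reveals which backup is usable.
    good_result = restore_test(good_backup, checksum, work)
    bad_result = restore_test(bad_backup, checksum, work)

print("good backup restores correctly:", good_result)
print("truncated backup restores correctly:", bad_result)
```

Both backups look equally fine until the restore is attempted; only the comparison of the restored data against the known checksum tells them apart.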

After implementing the new backup system I discuss in my Oracle User Group presentation for our Oracle databases, we are now able to:

  • test all database backups daily and automatically. Testing backups daily is actually very, very rare even today due to the high cost involved. The new system makes it very cheap and effortless.
  • use the backups to create extra copies of production databases, for example for creating test databases, refreshing the performance testing environment, upgrading or …
  • reduce the downtime of the site following a total disaster

I added a few larger updates to my Oracle Image Copy backup scripts.

  • Now it also has built-in support for NetApp NAS storage
  • In addition to the autorestore script, there is now a restore.py script that you can use to spin up cloned databases restored to a specified point in time

Go and check out the scripts from here