Category: Blog entry
Some background info: We need to provide direct database access to many developers and data analysts. On database side managing so many different users securely means you have to use the same authentication source as all the other systems in the company, for example Active Directory. For databases we have implemented using Radius and for Oracle database this means users have to use thick Oracle client on their computers.
Installing Oracle client is not a trivial task for many users and the MacOS instantclient that Oracle provides has been very dependent on the exact MacOS version. In order to provide some help to overcome these problems I decided to create a small Virtualbox VM, that users could just download and run. The VM would be based on Fedora Linux, have preconfigured Oracle Instantclient with common client tools like SQLPlus and SQL Developer.
This kind of VM needs to be maintained and to push out changes I already set up a git respository containing ansible playbook. In VM, under root user I checked out this git respository and set up regular cron job to run one specified playbook locally.
Hourly crontab job looks something like this:
cd ansible-sqldeveloper-vm && git pull && ansible-playbook local.yml
This playbook can be used to push out changes to application level also, for example maintaining common tnsnames.ora file to include newly created databases.
It did not take long to discover first bug with the setup – SQL Developer under Linux “looses” text cursor in editor window very easily, it is just hidden. To change that users need to go to SQL Developer preferences and change how text cursor looks like. I don’t want users to fix this common error themselves.
Another issue, new SQL Developer version came out. It is easy to push out new RPM using Ansible but SQL Developer does not start immediately after that, users have to configure there to find JDK first. Another thing I really don’t want my users to deal with.
Ansible to the rescue. All I needed to do was commit the necessary changes to my local.yml playbook and update it in the git repository.
Lets take the easier issue of telling newly installed SQL Developr where to find JDK first.
SQL Developer 17.4 rads the JDK location from file $HOME/.sqldeveloper/17.4.0/product.conf so /home/sqldeveloper/.sqldeveloper/17.4.0/product.conf in my case. This file contains only one important line telling the location of JDK:
All I need to do is use ansible copy module to create this file.
Now the other problem, I need to change specific SQL Developer preferences. When I did the changes required on a test system I found out that SQL Developer wrote it preferences to XML file $HOME/.sqldeveloper/system220.127.116.115.2349/o.sqldeveloper/product-preferences.xml. The file location is SQL Developer version specific. The specific change was a newly added XML tag under root element:
<hash n="CaretOptions"> <value n="caretColor" v="-31744"/> <value n="insertShape" v="8"/> <value n="overwriteShape" v="5"/> </hash>
This file contains all user preferences, so I don’t really want to overwrite this entire file, I only want to add this new XML tag. This is possible using Ansible xml module where I can give XPath as an argument and the module will make sure that the provided XPath expression is present in the file.
Also I want this to work for all SQL Developer versions and not change the path every time SQL Developer is upgraded.
The resulting Ansible playbook looks like this.
First I’ll describe the needed changes in a separate Ansible parameters file just to make the playbook look cleaner.
The playbook itself:
I’m going to present outside EMEA for the first time – in RMOUG training days 2018 Denver, Colorado.
I came to work today morning and there was an alert in my inbox saying that one of the large databases failed the nightly restore test.
Looking into RMAN logs I saw that recovery failed when applying archivelogs and error was something I have never seen before:
ORA-00756: recovery detected a lost write of a data block ORA-10567: Redo is inconsistent with data block (file# 3, block# 37166793, file offset is 3350347776 bytes) ORA-10564: tablespace SYSAUX ORA-01110: data file 3: '/nfs/...' ORA-10560: block type 'FIRST LEVEL BITMAP BLOCK'
Version: 18.104.22.168 with 2017-08 bundle patch
Looking at MOS I see two bugs that could match:
- Bug 22302666 ORA-753 ORA-756 or ORA-600  with KCOX_FUTURE after RMAN Restore / PITR with BCT after Open Resetlogs in 12c
but this bugfix is already included in 2017-08 bundle patch, and:
- Bug 23589471 ORA-600  KCOX_FUTURE or ORA-756 Lost Write detected during Recovery of a RMAN backup that used BCT
Looks like this matches quite well with my situation and the note has a really scary sentence in it: Incremental RMAN backups using BCT may miss to backup some blocks.
Bugs exists in modern software, bugs exist even in rock solid Oracle database backup and recovery procedures. It doesn’t matter if the backup was completed successfully, the state of the backup can only be determened when a restore is attempted. So please start testing your backups. Regularly. Daily.
We have hundreds of developers who need access to production database for incident management purposes. But we don’t want to use shared accounts to access the database, each user has their own access to the database that is audited and has privileges according to the users access level.
Managing all these users manually on Oracle Database side is just too much, especially that all their access details are already described in Active Directory. Wouldn’t it be nice if we can just syncronise the AD users to the database side? Luckily we have an Ansible module for this task.
First, on the database side need to create a dedicated profile for the syncronised users:
CREATE PROFILE ldap_user LIMIT password_life_time UNLIMITED;
I assume you are already familiary with Ansible, so I’ll go straight to the point.
First you need to clone the ansible-oracle-modules to your playbook directory:
git clone https://github.com/oravirt/ansible-oracle-modules library
This contains, among other goodies, a module that does exactly what is required 🙂 The module is called oracle_ldapuser.
This module requires extra python-ldap module to be installed. Install it using yum, not pip. Pip will install wrong version.
yum install python-ldap
The playbook looks like this:
Remember that it is a syncronisation, so new users are created and removed when the playbook is run.
Here is a slightly modified soft blog post I wrote to explain my technical OUG conference presentation about database backups to a non technical audience.
All interactions our customers do on our site generate data. Now imagine that due to some disaster this data is suddenly lost or corrupted and we cannot get it back. What would happen?
This is not a fantasy, events like this happen in real life. For example take a look what happened to Gitlab this year:
TL;DR: Incredible series of human errors and poorly tested recovery procedures. Resulting in a long downtime and data loss for the popular site.
Or take a look what happened to Google Music in 2012:
TL;DR: Buggy privacy-protection data pipeline removed 600000 audio tracks from Google Music. Over a month passed since Google engineers noticed the problem. About 436000 tracks were recovered from cold offsite tape backups, but a small number of the remaining tracks users had to re-upload to Google Music service again.
Can you imagine if we have to ask our users to re-enter their data again?
Quoting from Linux System Administrators Guide:
There are basically four reasons why you might lose data: hardware failures, software bugs, human action, or natural disasters. Although modern hardware tends to be quite reliable, it can still break seemingly spontaneously. … Modern software doesn’t even tend to be reliable; a rock solid program is an exception, not a rule. Humans are quite unreliable, they will either make a mistake, or they will be malicious and destroy data on purpose. Nature might not be evil, but it can wreak havoc even when being good. All in all, it is a small miracle that anything works at all.
These are the situations that our disaster recovery plan must protect us from. We need to imagine every possible situation that results in a data loss and we must plan and test how to recover from them. There are plenty of modern features in modern databases that help you recover corrupted data: view historical data versions, have a real-time updated copy of the data off-site or roll back the entire database or single committed transactions. All these features are part of our disaster recovery plan, but quite often people who are not deeply familiar with databases seem to look at these modern features and think that they are enough to keep the data safe. No, all these features have severe limitations – they only work on limited situations and they only work during a very limited time window.
Often it is quite sad to see new database products on the market that make it even difficult or impossible to put a proper fail-safe recovery plan in place. This is completely fine for the majority of today’s data of IOT metrics, clicks, likes and tweets – data where you only care about the overall statistical picture, but it is very confusing for me to see these products marketed to store for example financial data.
For example here is a quote from Google discussing online database replication limitation:
A classic but flawed response to the question “Do you have a backup?” is “We have something even better than a backup—replication!” Replication provides many benefits, including locality of data and protection from a site-specific disaster, but it can’t protect you from many sources of data loss. Datastores that automatically sync multiple replicas guarantee that a corrupt database row or errant delete are pushed to all of your copies, likely before you can isolate the problem.
All these features are necessary to the overall disaster recovery plan, but they are not enough. They work as designed and protect against losses they were designed against. They have limitations and this is OK and by design, because when a larger disaster strikes, you must always have and can always fall back to a time honoured disaster recovery solution – keeping an extra copy of your data separate from your main system – a backup – and backups need to be retained for months.
Quoting from Linux System Administrators Guide again:
When it comes to backups, paranoia is in the job description.
Backups also have a cost associated to them, it would be quite expensive to fully back up entire 100+TB data warehouse database every day and keep every daily copy for months. This is a topic I discuss in my Oracle User Group presentation.
Hopefully I’ve now convinced you that backups are essential, but let’s be honest, you really don’t care about backups, you only care about recovery – the ability to recover data from a disaster.
Let’s imagine now that you have a backup system that works flawlessly, every backup you execute finishes without errors. Maybe you even paid good money to a backup vendor for this system. Then disaster strikes and you need to restore from backup. The documented restore process starts fine, but during some point it returns an error – the backup product’s internal deduplication engine database is corrupted and it cannot restore the data, since it cannot find the data pieces anymore. Improbable and it will never happen you say? No, this did actually happen to me. But back then I did not actually regularly and frequently test that I was able to restore from backups, because restoring a single backup took days consuming a lot of resources and planning to execute. This is a very frequent situation companies encounter even today. There even is a law:
Schrödinger’s backup law
The state of any backup is unknown until a restore is attempted
A backup that has never been restored is equally likely to be good or bad. Like Herr Schrödinger’s poor cat!
After implementing the new backup system I discuss in my Oracle User Group presentation for our Oracle databases, we are now able to:
- test all database backups daily and automatically. Testing backups daily is actually very very rare even today due to the high cost involved. The new system makes it very cheap and effortless.
- use the backups for creating extra production database copies for creating test databases or refreshing performance testing environment or upgrading or …
- reduce the downtime of the site following a total disaster