Testing client failover with Data Guard 2/switchover
- Written by: ilmarkerm
- Category: Blog entry
- Published: February 24, 2023
All posts in this series
- Part 1: Setup
- Part 2: Switchover
- Part 3: Failover
Switchover
Planned maintenance, need to take primary server down for maintenance. The obvious Data Guard command for role change in this case is SWITCHOVER, that changes database roles in a nice coordinated way and no standby requires reinstatement (flashback) after the operation.
Lets test it from the client perspective.
Scenario: Only local SYNC standby – no ADG
First I disable and remove failtesti3 instance (remote standby) from data guard configuration, so I only have two local database instances in the configuration, in SYNC mode. Physical standby is in MOUNT mode.
DGMGRL> show configuration verbose; Configuration - failtest Protection Mode: MaxAvailability Members: failtesti1 - Primary database failtesti2 - (*) Physical standby database (*) Fast-Start Failover target Properties: FastStartFailoverThreshold = '25' OperationTimeout = '30' TraceLevel = 'USER' FastStartFailoverLagLimit = '30' CommunicationTimeout = '180' ObserverReconnect = '1800' FastStartFailoverAutoReinstate = 'TRUE' FastStartFailoverPmyShutdown = 'TRUE' BystandersFollowRoleChange = 'ALL' ObserverOverride = 'FALSE' ExternalDestination1 = '' ExternalDestination2 = '' PrimaryLostWriteAction = 'FAILOVER' ConfigurationWideServiceName = 'QKNPUXMZ_CFG' Fast-Start Failover: Enabled in Zero Data Loss Mode Lag Limit: 30 seconds (not in use) Threshold: 25 seconds Active Target: failtesti2 Potential Targets: "failtesti2" failtesti2 valid Observer: failtest-observer Shutdown Primary: TRUE Auto-reinstate: TRUE Observer Reconnect: 1800 seconds Observer Override: FALSE Configuration Status: SUCCESS
To execute DGMGRL commands I log remotely in using username and password.
dgmgrl "c##dataguard/sysdgpassword@connstr" "show configuration"
Start swingbench and start the testing program.
$ charbench ...
$ python failtest.py switchover_local_only_mount
All working well, now execute switchover
date -Iseconds ; dgmgrl "c##dataguard/sysdgpassword@connstr" "switchover to failtesti2" ; date -Iseconds
2023-02-14T10:23:27+00:00
DGMGRL for Linux: Release 19.0.0.0.0 - Production on Tue Feb 7 09:29:43 2023
Version 19.16.2.0.0
Connected to "failtesti1"
Connected as SYSDG.
Performing switchover NOW, please wait...
Operation requires a connection to database "failtesti2"
Connecting ...
Connected to "failtesti2"
Connected as SYSDG.
New primary database "failtesti2" is opening...
Operation requires start up of instance "failtesti1" on database "failtesti1"
Starting instance "failtesti1"...
Please complete the following steps to finish switchover:
start up and mount instance "failtesti1" of database "failtesti1"
2023-02-14T10:24:44+00:00
Alert.log from new primary failtesti2 (cleaned)
2023-02-14T10:23:35.264043+00:00 ALTER DATABASE SWITCHOVER TO PRIMARY (failtesti2) 2023-02-14T10:23:36.544532+00:00 Background Media Recovery process shutdown (failtesti2) 2023-02-14T10:23:39.430461+00:00 rmi (PID:3337550): Database role cleared from PHYSICAL STANDBY [kcvs.c:1114] Switchover: Complete - Database mounted as primary 2023-02-14T10:23:41.542728+00:00 ALTER DATABASE OPEN 2023-02-14T10:23:46.293995+00:00 Completed: ALTER DATABASE OPEN ALTER PLUGGABLE DATABASE ALL OPEN 2023-02-14T10:23:48.984619+00:00 SOEFAIL(3):Started service soe/soe/soe.prod1.dbs 2023-02-14T10:23:48.985369+00:00 Pluggable database SOEFAIL opened read write Completed: ALTER PLUGGABLE DATABASE ALL OPEN
Alert.log from old primary failtesti1
2023-02-14T10:23:31.344766+00:00 ALTER DATABASE SWITCHOVER TO 'failtesti2' 2023-02-14T10:23:32.055470+00:00 ALTER DATABASE COMMIT TO SWITCHOVER TO PHYSICAL STANDBY [Process Id: 2418351] (failtesti1) 2023-02-14T10:23:33.067341+00:00 Primary will check for some target standby to have received all redo RSM0 (PID:2418351): Waiting for target standby to apply all redo 2023-02-14T10:23:39.432887+00:00 RSM0 (PID:2418351): Switchover complete. Database shutdown required TMI: dbsdrv switchover to target END 2023-02-14 10:23:39.432896 Completed: ALTER DATABASE SWITCHOVER TO 'failtesti2'
Looking at the alert.log I see that 2023-02-14T10:23:31.344766+00:00 old primary database initiated the switchover procedure and 2023-02-14T10:23:48.985369+00:00 the new primary database was ready to accept connections. Taking 17.64s.
Lets see what client saw at the same time.
What I can see from it is that reads are allowed for a few seconds after switchover was initiated, making total downtime 14 seconds. Writes are closed immediately, so downtime for the client 17s. I actually expeted reads to be allowed for 4 more seconds, right until 2023-02-14T10:23:39 when alert.log reports Switchover complete. Something is not right, must retest this.
Scenario: Only local SYNC standby – with ADG
Lets repeat the test, but now the target standby CDB is open in READ ONLY mode. But pluggable databases are in MOUNTED mode (to prevent the service starting). So Active Data Guard is activated (LICENSE WARNING!).
SQL>sho pdbs CON_ID CON_NAME OPEN MODE RESTRICTED ---------- ------------------------------ ---------- ---------- 2 PDB$SEED READ ONLY NO 3 SOEFAIL MOUNTED
Data Guard status
Configuration - failtest Protection Mode: MaxAvailability Members: failtesti2 - Primary database failtesti1 - (*) Physical standby database Fast-Start Failover: Enabled in Zero Data Loss Mode Configuration Status: SUCCESS (status updated 29 seconds ago)
Executing “switchover to failtesti1”
date -Iseconds ; dgmgrl "c##dataguard/sysdgpassword@connstr" "switchover to failtesti1" ; date -Iseconds 2023-02-14T10:39:13+00:00 Connected to "failtesti1" Connected as SYSDG. Performing switchover NOW, please wait... New primary database "failtesti1" is opening... Operation requires start up of instance "failtesti2" on database "failtesti2" Starting instance "failtesti2"...
Alert.log from new primary failtesti1
2023-02-14T10:39:19.896716+00:00 ALTER DATABASE SWITCHOVER TO PRIMARY (failtesti1) 2023-02-14T10:39:21.177020+00:00 Background Media Recovery process shutdown (failtesti1) 2023-02-14T10:39:27.007310+00:00 Completed: alter pluggable database all close immediate 2023-02-14T10:39:32.184760+00:00 CLOSE: all sessions shutdown successfully. Completed: alter pluggable database all close immediate 2023-02-14T10:39:33.404812+00:00 Switchover: Complete - Database mounted as primary TMI: kcv_commit_to_so_to_primary Switchover from physical END 2023-02-14 10:39:33.532617 SWITCHOVER: completed request from primary database. 2023-02-14T10:39:37.033131+00:00 ALTER DATABASE OPEN 2023-02-14T10:39:43.467549+00:00 SOEFAIL(3):Started service soe/soe/soe.prod1.dbs 2023-02-14T10:39:43.468294+00:00 Pluggable database SOEFAIL opened read write 2023-02-14T10:39:43.694199+00:00 Completed: ALTER DATABASE OPEN ALTER PLUGGABLE DATABASE ALL OPEN Completed: ALTER PLUGGABLE DATABASE ALL OPEN
Alert.log from old primary failtesti2
2023-02-14T10:39:16.016238+00:00 ALTER DATABASE SWITCHOVER TO 'failtesti1' 2023-02-14T10:39:16.711884+00:00 ALTER DATABASE COMMIT TO SWITCHOVER TO PHYSICAL STANDBY [Process Id: 2441633] (failtesti2) 2023-02-14T10:39:17.723971+00:00 RSM0 (PID:2441633): Waiting for target standby to apply all redo 2023-02-14T10:39:19.547345+00:00 Switchover: Complete - Database shutdown required TMI: kcv_switchover_to_target convert to physical END 2023-02-14 10:39:19.567773 RSM0 (PID:2441633): Sending request(convert to primary database) to switchover target failtesti1 2023-02-14T10:39:33.534253+00:00 RSM0 (PID:2441633): Switchover complete. Database shutdown required
According to alert.log switchover took from 2023-02-14T10:39:16.016238+00:00 to 2023-02-14T10:39:43.467549+00:00 (when service opened), so 27.45s… surprise, did not expect this.
And now the client perspective.
Same story as before, but having the target database open (Active Data Guard) adds about 9-10s to the downtime. From alert.log it looks like it needs to close all read only pluggable databases before switchover completes.
Scenario: Local SYNC standby and remote ASYNC standby – no ADG
Now in addition to local SYNC standby, I’ll add a remote ASYNC standby failtesti3 in the configuration, which is 53ms away (ping time). But I’m still ony interested in the role change between failtesti1 and failtesti2.
Configuration - failtest Protection Mode: MaxAvailability Members: failtesti1 - Primary database failtesti2 - (*) Physical standby database failtesti3 - Physical standby database Fast-Start Failover: Enabled in Zero Data Loss Mode Configuration Status: SUCCESS (status updated 31 seconds ago)
Start the tests and do switchover.
date -Iseconds ; dgmgrl "c##dataguard/sysdgpassword@connstr" "switchover to failtesti2" ; date -Iseconds 2023-02-24T10:18:13+00:00 Connected to "failtesti1" Connected as SYSDG. Performing switchover NOW, please wait... Operation requires a connection to database "failtesti2" Connecting ... Connected to "failtesti2" Connected as SYSDG. New primary database "failtesti2" is opening... Operation requires start up of instance "failtesti1" on database "failtesti1" Starting instance "failtesti1"... 2023-02-24T10:19:35+00:00
From client perspective
The remote standby made little difference. It added a few seconds to the switchover time, during which the old primary was still readable. During switchover Data Guard Broker must sync the configuration to all standbys.
Alert log from old primary failtest1
2023-02-24T10:18:18.128228+00:00 ALTER DATABASE SWITCHOVER TO 'failtesti2' 2023-02-24T10:18:18.827865+00:00 ALTER DATABASE COMMIT TO SWITCHOVER TO PHYSICAL STANDBY [Process Id: 3284632] (failtesti1) ... 2023-02-24T10:18:25.855408+00:00 Switchover: Complete - Database shutdown required 2023-02-24T10:18:30.367681+00:00 RSM0 (PID:3284632): Switchover complete. Database shutdown required Completed: ALTER DATABASE SWITCHOVER TO 'failtesti2'
I see that reads on the old primary were allowed right until switchover was complete. Writes and disabled immediately. This is the behaviour I expect.
Scenario: Local SYNC standby and remote standby with lagging apply
Here my goal is to see what happens when the remote standby is lagging behind. Will it affect the local switchover? To simulate that I’ll stop apply on failtesti3 and let swingbench run for a longer time before initiating switchover.
DGMGRL> edit database failtesti3 set state='apply-off';
-- run swingbenh for a few hours
DGMGRL> edit database failtesti3 set state='apply-on';
-- immediately run switchover
Before the test lets check how big the lag is on the remote standby. Before switchover I have over 3h lag, with over 40GB logs to apply on failtesti3.
DGMGRL> show configuration; Configuration - failtest Protection Mode: MaxAvailability Members: failtesti2 - Primary database failtesti1 - (*) Physical standby database failtesti3 - Physical standby database Fast-Start Failover: Enabled in Zero Data Loss Mode Configuration Status: SUCCESS (status updated 37 seconds ago) DGMGRL> edit database failtesti3 set state='apply-on'; Succeeded. DGMGRL> show database failtesti3; Database - failtesti3 Role: PHYSICAL STANDBY Intended State: APPLY-ON Transport Lag: 0 seconds (computed 0 seconds ago) Apply Lag: 3 hours 7 minutes 36 seconds (computed 0 seconds ago) Average Apply Rate: 88.35 MByte/s Real Time Query: ON Instance(s): failtesti3 Database Warning(s): ORA-16853: apply lag has exceeded specified threshold Database Status: WARNING -- switchover 2023-02-24T13:37:45+00:00 Connected to "failtesti2" Connected as SYSDG. Performing switchover NOW, please wait... Operation requires a connection to database "failtesti1" Connecting ... Connected to "failtesti1" Connected as SYSDG. New primary database "failtesti1" is opening... Operation requires start up of instance "failtesti2" on database "failtesti2" Starting instance "failtesti2"... 2023-02-24T13:39:06+00:00 DGMGRL> show database failtesti3; Database - failtesti3 Role: PHYSICAL STANDBY Intended State: APPLY-ON Transport Lag: 0 seconds (computed 0 seconds ago) Apply Lag: 2 hours 28 minutes 47 seconds (computed 0 seconds ago) Average Apply Rate: 110.65 MByte/s Real Time Query: ON Instance(s): failtesti3
After the switchover I still have plenty of lag in failtesti3 to apply.
What did the client see?
I don’t see any real difference to the test without the lag.
Scenario: Local SYNC standby and remote standby with lagging transport
Lets try again, but now I’ll change transport to remote failtesti3 standby to ARCH mode, expecting that during switchover it would need to transport a few GB.
DGMGRL> edit database failtesti3 set property logxptmode='arch';
-- run swingbench, so it has quite a few GB of redo collected before logswitch
2023-02-24T14:14:43+00:00
DGMGRL> switchover to failtesti2;
Connected to "failtesti1"
Connected as SYSDG.
Performing switchover NOW, please wait...
Operation requires a connection to database "failtesti2"
Connecting ...
Connected to "failtesti2"
Connected as SYSDG.
New primary database "failtesti2" is opening...
Operation requires start up of instance "failtesti1" on database "failtesti1"
Starting instance "failtesti1"...
2023-02-24T14:19:49+00:00
Finally… significantly slower switchover, taking minutes. How did the client see it?
And it continues this way for minutes… until
Read queries were still only down for about 13s, but writes were down for 4min 3s! That is the time it took to send the archived logfile to the remote standby. In my test the logfile was 3GB is size.
Lets look at the old primary alert.log
2023-02-24T14:14:48.009629+00:00 ALTER DATABASE SWITCHOVER TO 'failtesti2' 2023-02-24T14:14:48.714313+00:00 ALTER DATABASE COMMIT TO SWITCHOVER TO PHYSICAL STANDBY [Process Id: 2332229] (failtesti1) 2023-02-24T14:14:49.727293+00:00 RSM0 (PID:2332229): Generating and shipping final logs to target standby Switchover End-Of-Redo Log thread 1 sequence 858 has been fixed Switchover: Primary highest seen SCN set to 0x0000000017d7d99f RSM0 (PID:2332229): Noswitch archival of T-1.S-858 RSM0 (PID:2332229): End-Of-Redo Branch archival of T-1.S-858 RSM0 (PID:2332229): LGWR is actively archiving to LAD:2 2023-02-24T14:18:38.393413+00:00 RSM0 (PID:2332229): Archiving is disabled due to current logfile archival Primary will check for some target standby to have received all redo RSM0 (PID:2332229): Waiting for target standby to apply all redo 2023-02-24T14:18:43.333433+00:00 RSM0 (PID:2332229): Switchover complete. Database shutdown required
I don’t really see any log entry that specifically says that it is waiting for the remote standby to receive the latest redo, although this is what it must be waiting on.
Anyway, good thing to know in case of performing switchovers. Make sure ALL standbys in configuration have minimal amount of redo to ship during switchover.
RETEST! Only local SYNC standby – no ADG
Don’t really understand why in my first test without any remote long distance standby, reads were stopped too early. I expected the old standby to be readable up until the very last moment of the switchover. Lets try again.
DGMGRL> remove database failtesti3; Removed database "failtesti3" from the configuration DGMGRL> show configuration; Configuration - failtest Protection Mode: MaxAvailability Members: failtesti2 - Primary database failtesti1 - (*) Physical standby database Fast-Start Failover: Enabled in Zero Data Loss Mode Configuration Status: SUCCESS (status updated 11 seconds ago) 2023-02-24T14:53:48+00:00 DGMGRL> switchover to failtest1; Connected to "failtesti2" Connected as SYSDG. Performing switchover NOW, please wait... Operation requires a connection to database "failtesti1" Connecting ... Connected to "failtesti1" Connected as SYSDG. New primary database "failtesti1" is opening... Operation requires start up of instance "failtesti2" on database "failtesti2" Starting instance "failtesti2"... 2023-02-24T14:55:06+00:00
What did the client see?
A little bit faster this time. 12s downtime for reads, 15s downtime for writes.
Take-aways
- During switchover reads are permitted as long as possible, writes are blocked immediately when switchover starts with ORA-16456.
- Having Active Data Guard database as the switchover target adds about 10s to the switchover time.
- Having far away ASYNC standby databases will add a few seconds to the switchover time. Minimal.
- But if one of the BYSTANDER remote standby databases is having a transport lag, during switchover writes will be blocked for quite a long time, depending on how long it will take to transport the missing redo. Apply lag did not have the same effect.
Next…
Failover methods