Ilmar Kerm

Oracle, databases, Linux and maybe more

All posts in this series

Switchover

Planned maintenance, need to take primary server down for maintenance. The obvious Data Guard command for role change in this case is SWITCHOVER, that changes database roles in a nice coordinated way and no standby requires reinstatement (flashback) after the operation.

Lets test it from the client perspective.

Scenario: Only local SYNC standby – no ADG

First I disable and remove failtesti3 instance (remote standby) from data guard configuration, so I only have two local database instances in the configuration, in SYNC mode. Physical standby is in MOUNT mode.

DGMGRL> show configuration verbose;

Configuration - failtest

  Protection Mode: MaxAvailability
  Members:
  failtesti1 - Primary database
    failtesti2 - (*) Physical standby database 

  (*) Fast-Start Failover target

  Properties:
    FastStartFailoverThreshold      = '25'
    OperationTimeout                = '30'
    TraceLevel                      = 'USER'
    FastStartFailoverLagLimit       = '30'
    CommunicationTimeout            = '180'
    ObserverReconnect               = '1800'
    FastStartFailoverAutoReinstate  = 'TRUE'
    FastStartFailoverPmyShutdown    = 'TRUE'
    BystandersFollowRoleChange      = 'ALL'
    ObserverOverride                = 'FALSE'
    ExternalDestination1            = ''
    ExternalDestination2            = ''
    PrimaryLostWriteAction          = 'FAILOVER'
    ConfigurationWideServiceName    = 'QKNPUXMZ_CFG'

Fast-Start Failover: Enabled in Zero Data Loss Mode
  Lag Limit:          30 seconds (not in use)
  Threshold:          25 seconds
  Active Target:      failtesti2
  Potential Targets:  "failtesti2"
    failtesti2 valid
  Observer:           failtest-observer
  Shutdown Primary:   TRUE
  Auto-reinstate:     TRUE
  Observer Reconnect: 1800 seconds
  Observer Override:  FALSE

Configuration Status:
SUCCESS

To execute DGMGRL commands I log remotely in using username and password.

dgmgrl "c##dataguard/sysdgpassword@connstr" "show configuration"

Start swingbench and start the testing program.

$ charbench ...
$ python failtest.py switchover_local_only_mount

All working well, now execute switchover

date -Iseconds ; dgmgrl "c##dataguard/sysdgpassword@connstr" "switchover to failtesti2" ; date -Iseconds

2023-02-14T10:23:27+00:00
DGMGRL for Linux: Release 19.0.0.0.0 - Production on Tue Feb 7 09:29:43 2023
Version 19.16.2.0.0

Connected to "failtesti1"
Connected as SYSDG.
Performing switchover NOW, please wait...
Operation requires a connection to database "failtesti2"
Connecting ...
Connected to "failtesti2"
Connected as SYSDG.
New primary database "failtesti2" is opening...
Operation requires start up of instance "failtesti1" on database "failtesti1"
Starting instance "failtesti1"...

Please complete the following steps to finish switchover:
        start up and mount instance "failtesti1" of database "failtesti1"

2023-02-14T10:24:44+00:00

Alert.log from new primary failtesti2 (cleaned)

2023-02-14T10:23:35.264043+00:00
ALTER DATABASE SWITCHOVER TO PRIMARY (failtesti2)
2023-02-14T10:23:36.544532+00:00
Background Media Recovery process shutdown (failtesti2)
2023-02-14T10:23:39.430461+00:00
 rmi (PID:3337550): Database role cleared from PHYSICAL STANDBY [kcvs.c:1114]
Switchover: Complete - Database mounted as primary
2023-02-14T10:23:41.542728+00:00
ALTER DATABASE OPEN
2023-02-14T10:23:46.293995+00:00
Completed: ALTER DATABASE OPEN
ALTER PLUGGABLE DATABASE ALL OPEN
2023-02-14T10:23:48.984619+00:00
SOEFAIL(3):Started service soe/soe/soe.prod1.dbs
2023-02-14T10:23:48.985369+00:00
Pluggable database SOEFAIL opened read write
Completed: ALTER PLUGGABLE DATABASE ALL OPEN

Alert.log from old primary failtesti1

2023-02-14T10:23:31.344766+00:00
ALTER DATABASE SWITCHOVER TO 'failtesti2'
2023-02-14T10:23:32.055470+00:00
ALTER DATABASE COMMIT TO SWITCHOVER TO PHYSICAL STANDBY [Process Id: 2418351] (failtesti1)
2023-02-14T10:23:33.067341+00:00
Primary will check for some target standby to have received all redo
RSM0 (PID:2418351): Waiting for target standby to apply all redo
2023-02-14T10:23:39.432887+00:00
RSM0 (PID:2418351): Switchover complete. Database shutdown required
TMI: dbsdrv switchover to target END 2023-02-14 10:23:39.432896
Completed: ALTER DATABASE SWITCHOVER TO 'failtesti2'

Looking at the alert.log I see that 2023-02-14T10:23:31.344766+00:00 old primary database initiated the switchover procedure and 2023-02-14T10:23:48.985369+00:00 the new primary database was ready to accept connections. Taking 17.64s.

Lets see what client saw at the same time.

What I can see from it is that reads are allowed for a few seconds after switchover was initiated, making total downtime 14 seconds. Writes are closed immediately, so downtime for the client 17s. I actually expeted reads to be allowed for 4 more seconds, right until 2023-02-14T10:23:39 when alert.log reports Switchover complete. Something is not right, must retest this.

Scenario: Only local SYNC standby – with ADG

Lets repeat the test, but now the target standby CDB is open in READ ONLY mode. But pluggable databases are in MOUNTED mode (to prevent the service starting). So Active Data Guard is activated (LICENSE WARNING!).

SQL>sho pdbs

    CON_ID CON_NAME                       OPEN MODE  RESTRICTED
---------- ------------------------------ ---------- ----------
         2 PDB$SEED                       READ ONLY  NO
         3 SOEFAIL                        MOUNTED

Data Guard status

Configuration - failtest

  Protection Mode: MaxAvailability
  Members:
  failtesti2 - Primary database
    failtesti1 - (*) Physical standby database 

Fast-Start Failover: Enabled in Zero Data Loss Mode

Configuration Status:
SUCCESS   (status updated 29 seconds ago)

Executing “switchover to failtesti1”

date -Iseconds ; dgmgrl "c##dataguard/sysdgpassword@connstr" "switchover to failtesti1" ; date -Iseconds

2023-02-14T10:39:13+00:00

Connected to "failtesti1"
Connected as SYSDG.
Performing switchover NOW, please wait...
New primary database "failtesti1" is opening...
Operation requires start up of instance "failtesti2" on database "failtesti2"
Starting instance "failtesti2"...

Alert.log from new primary failtesti1

2023-02-14T10:39:19.896716+00:00
ALTER DATABASE SWITCHOVER TO PRIMARY (failtesti1)
2023-02-14T10:39:21.177020+00:00
Background Media Recovery process shutdown (failtesti1)
2023-02-14T10:39:27.007310+00:00
Completed: alter pluggable database all close immediate
2023-02-14T10:39:32.184760+00:00
CLOSE: all sessions shutdown successfully.
Completed: alter pluggable database all close immediate
2023-02-14T10:39:33.404812+00:00
Switchover: Complete - Database mounted as primary
TMI: kcv_commit_to_so_to_primary Switchover from physical END 2023-02-14 10:39:33.532617
SWITCHOVER: completed request from primary database.
2023-02-14T10:39:37.033131+00:00
ALTER DATABASE OPEN
2023-02-14T10:39:43.467549+00:00
SOEFAIL(3):Started service soe/soe/soe.prod1.dbs
2023-02-14T10:39:43.468294+00:00
Pluggable database SOEFAIL opened read write
2023-02-14T10:39:43.694199+00:00
Completed: ALTER DATABASE OPEN
ALTER PLUGGABLE DATABASE ALL OPEN
Completed: ALTER PLUGGABLE DATABASE ALL OPEN

Alert.log from old primary failtesti2

2023-02-14T10:39:16.016238+00:00
ALTER DATABASE SWITCHOVER TO 'failtesti1'
2023-02-14T10:39:16.711884+00:00
ALTER DATABASE COMMIT TO SWITCHOVER TO PHYSICAL STANDBY [Process Id: 2441633] (failtesti2)
2023-02-14T10:39:17.723971+00:00
RSM0 (PID:2441633): Waiting for target standby to apply all redo
2023-02-14T10:39:19.547345+00:00
Switchover: Complete - Database shutdown required
TMI: kcv_switchover_to_target convert to physical END 2023-02-14 10:39:19.567773
RSM0 (PID:2441633): Sending request(convert to primary database) to switchover target failtesti1
2023-02-14T10:39:33.534253+00:00
RSM0 (PID:2441633): Switchover complete. Database shutdown required

According to alert.log switchover took from 2023-02-14T10:39:16.016238+00:00 to 2023-02-14T10:39:43.467549+00:00 (when service opened), so 27.45s… surprise, did not expect this.

And now the client perspective.

Same story as before, but having the target database open (Active Data Guard) adds about 9-10s to the downtime. From alert.log it looks like it needs to close all read only pluggable databases before switchover completes.

Scenario: Local SYNC standby and remote ASYNC standby – no ADG

Now in addition to local SYNC standby, I’ll add a remote ASYNC standby failtesti3 in the configuration, which is 53ms away (ping time). But I’m still ony interested in the role change between failtesti1 and failtesti2.

Configuration - failtest

  Protection Mode: MaxAvailability
  Members:
  failtesti1 - Primary database
    failtesti2 - (*) Physical standby database 
    failtesti3 - Physical standby database 

Fast-Start Failover: Enabled in Zero Data Loss Mode

Configuration Status:
SUCCESS   (status updated 31 seconds ago)

Start the tests and do switchover.

date -Iseconds ; dgmgrl "c##dataguard/sysdgpassword@connstr" "switchover to failtesti2" ; date -Iseconds

2023-02-24T10:18:13+00:00

Connected to "failtesti1"
Connected as SYSDG.
Performing switchover NOW, please wait...
Operation requires a connection to database "failtesti2"
Connecting ...
Connected to "failtesti2"
Connected as SYSDG.
New primary database "failtesti2" is opening...
Operation requires start up of instance "failtesti1" on database "failtesti1"
Starting instance "failtesti1"...

2023-02-24T10:19:35+00:00

From client perspective

The remote standby made little difference. It added a few seconds to the switchover time, during which the old primary was still readable. During switchover Data Guard Broker must sync the configuration to all standbys.

Alert log from old primary failtest1

2023-02-24T10:18:18.128228+00:00
ALTER DATABASE SWITCHOVER TO 'failtesti2'
2023-02-24T10:18:18.827865+00:00
ALTER DATABASE COMMIT TO SWITCHOVER TO PHYSICAL STANDBY [Process Id: 3284632] (failtesti1)
...
2023-02-24T10:18:25.855408+00:00
Switchover: Complete - Database shutdown required
2023-02-24T10:18:30.367681+00:00
RSM0 (PID:3284632): Switchover complete. Database shutdown required
Completed: ALTER DATABASE SWITCHOVER TO 'failtesti2'

I see that reads on the old primary were allowed right until switchover was complete. Writes and disabled immediately. This is the behaviour I expect.

Scenario: Local SYNC standby and remote standby with lagging apply

Here my goal is to see what happens when the remote standby is lagging behind. Will it affect the local switchover? To simulate that I’ll stop apply on failtesti3 and let swingbench run for a longer time before initiating switchover.

DGMGRL> edit database failtesti3 set state='apply-off';
-- run swingbenh for a few hours
DGMGRL> edit database failtesti3 set state='apply-on';
-- immediately run switchover

Before the test lets check how big the lag is on the remote standby. Before switchover I have over 3h lag, with over 40GB logs to apply on failtesti3.

DGMGRL> show configuration;

Configuration - failtest

  Protection Mode: MaxAvailability
  Members:
  failtesti2 - Primary database
    failtesti1 - (*) Physical standby database 
    failtesti3 - Physical standby database 

Fast-Start Failover: Enabled in Zero Data Loss Mode

Configuration Status:
SUCCESS   (status updated 37 seconds ago)

DGMGRL> edit database failtesti3 set state='apply-on';
Succeeded.
DGMGRL> show database failtesti3;

Database - failtesti3

  Role:               PHYSICAL STANDBY
  Intended State:     APPLY-ON
  Transport Lag:      0 seconds (computed 0 seconds ago)
  Apply Lag:          3 hours 7 minutes 36 seconds (computed 0 seconds ago)
  Average Apply Rate: 88.35 MByte/s
  Real Time Query:    ON
  Instance(s):
    failtesti3

  Database Warning(s):
    ORA-16853: apply lag has exceeded specified threshold

Database Status:
WARNING

-- switchover
2023-02-24T13:37:45+00:00
Connected to "failtesti2"
Connected as SYSDG.
Performing switchover NOW, please wait...
Operation requires a connection to database "failtesti1"
Connecting ...
Connected to "failtesti1"
Connected as SYSDG.
New primary database "failtesti1" is opening...
Operation requires start up of instance "failtesti2" on database "failtesti2"
Starting instance "failtesti2"...
2023-02-24T13:39:06+00:00

DGMGRL> show database failtesti3;

Database - failtesti3

  Role:               PHYSICAL STANDBY
  Intended State:     APPLY-ON
  Transport Lag:      0 seconds (computed 0 seconds ago)
  Apply Lag:          2 hours 28 minutes 47 seconds (computed 0 seconds ago)
  Average Apply Rate: 110.65 MByte/s
  Real Time Query:    ON
  Instance(s):
    failtesti3

After the switchover I still have plenty of lag in failtesti3 to apply.

What did the client see?

I don’t see any real difference to the test without the lag.

Scenario: Local SYNC standby and remote standby with lagging transport

Lets try again, but now I’ll change transport to remote failtesti3 standby to ARCH mode, expecting that during switchover it would need to transport a few GB.

DGMGRL> edit database failtesti3 set property logxptmode='arch';
-- run swingbench, so it has quite a few GB of redo collected before logswitch

2023-02-24T14:14:43+00:00
DGMGRL> switchover to failtesti2;
Connected to "failtesti1"
Connected as SYSDG.
Performing switchover NOW, please wait...
Operation requires a connection to database "failtesti2"
Connecting ...
Connected to "failtesti2"
Connected as SYSDG.
New primary database "failtesti2" is opening...
Operation requires start up of instance "failtesti1" on database "failtesti1"
Starting instance "failtesti1"...
2023-02-24T14:19:49+00:00

Finally… significantly slower switchover, taking minutes. How did the client see it?

And it continues this way for minutes… until

Read queries were still only down for about 13s, but writes were down for 4min 3s! That is the time it took to send the archived logfile to the remote standby. In my test the logfile was 3GB is size.

Lets look at the old primary alert.log

2023-02-24T14:14:48.009629+00:00
ALTER DATABASE SWITCHOVER TO 'failtesti2'
2023-02-24T14:14:48.714313+00:00
ALTER DATABASE COMMIT TO SWITCHOVER TO PHYSICAL STANDBY [Process Id: 2332229] (failtesti1)
2023-02-24T14:14:49.727293+00:00
RSM0 (PID:2332229): Generating and shipping final logs to target standby
Switchover End-Of-Redo Log thread 1 sequence 858 has been fixed
Switchover: Primary highest seen SCN set to 0x0000000017d7d99f
RSM0 (PID:2332229): Noswitch archival of T-1.S-858
RSM0 (PID:2332229): End-Of-Redo Branch archival of T-1.S-858
RSM0 (PID:2332229): LGWR is actively archiving to LAD:2
2023-02-24T14:18:38.393413+00:00
RSM0 (PID:2332229): Archiving is disabled due to current logfile archival
Primary will check for some target standby to have received all redo
RSM0 (PID:2332229): Waiting for target standby to apply all redo
2023-02-24T14:18:43.333433+00:00
RSM0 (PID:2332229): Switchover complete. Database shutdown required

I don’t really see any log entry that specifically says that it is waiting for the remote standby to receive the latest redo, although this is what it must be waiting on.

Anyway, good thing to know in case of performing switchovers. Make sure ALL standbys in configuration have minimal amount of redo to ship during switchover.

RETEST! Only local SYNC standby – no ADG

Don’t really understand why in my first test without any remote long distance standby, reads were stopped too early. I expected the old standby to be readable up until the very last moment of the switchover. Lets try again.

DGMGRL> remove database failtesti3;
Removed database "failtesti3" from the configuration

DGMGRL> show configuration;

Configuration - failtest

  Protection Mode: MaxAvailability
  Members:
  failtesti2 - Primary database
    failtesti1 - (*) Physical standby database 

Fast-Start Failover: Enabled in Zero Data Loss Mode

Configuration Status:
SUCCESS   (status updated 11 seconds ago)

2023-02-24T14:53:48+00:00
DGMGRL> switchover to failtest1;
Connected to "failtesti2"
Connected as SYSDG.
Performing switchover NOW, please wait...
Operation requires a connection to database "failtesti1"
Connecting ...
Connected to "failtesti1"
Connected as SYSDG.
New primary database "failtesti1" is opening...
Operation requires start up of instance "failtesti2" on database "failtesti2"
Starting instance "failtesti2"...
2023-02-24T14:55:06+00:00

What did the client see?

A little bit faster this time. 12s downtime for reads, 15s downtime for writes.

Take-aways

  • During switchover reads are permitted as long as possible, writes are blocked immediately when switchover starts with ORA-16456.
  • Having Active Data Guard database as the switchover target adds about 10s to the switchover time.
  • Having far away ASYNC standby databases will add a few seconds to the switchover time. Minimal.
  • But if one of the BYSTANDER remote standby databases is having a transport lag, during switchover writes will be blocked for quite a long time, depending on how long it will take to transport the missing redo. Apply lag did not have the same effect.

Next…

Failover methods