Tag: high availability
- Written by: ilmarkerm
- Category: Blog entry
- Published: February 24, 2023
All posts in this series
- Part 1: Setup
- Part 2: Switchover
- Part 3: Failover
Data Guard setup
My end goal is to find out how fast database clients (applications) can resume work (reading and writing) after a Data Guard role change operation, whether it happens as a planned or an unplanned operation. I’m testing different options to find the shortest downtime – the results apply to my environment only, your results will vary.
I’m doing it just to increase my knowledge about different Data Guard scenarios.
My database setup
I’m ONLY interested in the role change operation between instances failtesti1 and failtesti2, which are in the same DC but in different availability zones (0.165 ms avg ping). They are equal in every way and both are intended to serve application read+write traffic as primary databases. These two instances are in SYNC mode.
During the tests I will also throw in a remote standby database, failtesti3, located a few thousand km away (53 ms ping). This remote standby is in ASYNC or ARCH mode and failtesti3 is only a BYSTANDER during all tests.
DGMGRL> show configuration verbose;
Configuration - failtest
Protection Mode: MaxAvailability
Members:
failtesti1 - Primary database
failtesti2 - (*) Physical standby database
failtesti3 - Physical standby database
(*) Fast-Start Failover target
Properties:
FastStartFailoverThreshold = '25'
OperationTimeout = '30'
TraceLevel = 'USER'
FastStartFailoverLagLimit = '30'
CommunicationTimeout = '180'
ObserverReconnect = '1800'
FastStartFailoverAutoReinstate = 'TRUE'
FastStartFailoverPmyShutdown = 'TRUE'
BystandersFollowRoleChange = 'ALL'
ObserverOverride = 'FALSE'
ExternalDestination1 = ''
ExternalDestination2 = ''
PrimaryLostWriteAction = 'FAILOVER'
ConfigurationWideServiceName = 'QKNPUXMZ_CFG'
Fast-Start Failover: Enabled in Zero Data Loss Mode
Lag Limit: 30 seconds (not in use)
Threshold: 25 seconds
Active Target: failtesti2
Potential Targets: "failtesti2"
failtesti2 valid
Observer: failtest-observer
Shutdown Primary: TRUE
Auto-reinstate: TRUE
Observer Reconnect: 1800 seconds
Observer Override: FALSE
Configuration Status:
SUCCESS
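During the tests it is handy to quickly verify which instance is currently serving the primary role and whether it is open. A small Python check like the one below does it – this is just my own illustration, not part of the setup; the dbauser credentials and the dbauser_failtest service come from the commands later in this post, and the exact connect string will differ in your environment.
# Sketch only: connect through the CDB service and report the current role.
# Credentials and DSN are placeholders; thick mode matches the test client used later.
import oracledb

oracledb.init_oracle_client()

with oracledb.connect(user="dbauser", password="dbauser",
                      dsn="failtest-1/dbauser_failtest.prod1.dbs") as conn:
    cur = conn.cursor()
    cur.execute("select db_unique_name, database_role, open_mode from v$database")
    print(cur.fetchone())  # for example: ('FAILTESTI1', 'PRIMARY', 'READ WRITE')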
Services
Both the CDB and the PDB have a simple service created, together with a trigger that starts the service when the database opens.
exec dbms_service.create_service('dbauser_failtest','dbauser_failtest');
exec dbms_service.start_service('dbauser_failtest');
CREATE OR REPLACE TRIGGER sys.cdb_start_services AFTER STARTUP OR DB_ROLE_CHANGE ON DATABASE
DECLARE
v_srv_name varchar2(30):= 'dbauser_failtest';
cnt NUMBER;
BEGIN
SELECT count(*) INTO cnt FROM v$active_services WHERE name = v_srv_name;
IF cnt = 0 THEN
dbms_service.start_service(v_srv_name);
END IF;
END;
/
For PDB application use
alter session set container=soefail;
-- Let's create a simple service
exec dbms_service.create_service('soe','soe');
-- Now let's create one service with Application Continuity turned on
DECLARE
new_service_params dbms_service.svc_parameter_array;
p_srv_name varchar2(20):= 'soe_ac';
BEGIN
new_service_params('FAILOVER_TYPE'):= 'AUTO';
new_service_params('REPLAY_INITIATION_TIMEOUT'):= 1800;
new_service_params('RETENTION_TIMEOUT'):= 86400;
new_service_params('FAILOVER_DELAY'):= 5;
new_service_params('FAILOVER_RETRIES'):= 10;
new_service_params('FAILOVER_RESTORE'):= 'AUTO';
new_service_params('commit_outcome'):= 'true';
new_service_params('aq_ha_notifications'):= 'true';
dbms_service.create_service(p_srv_name, p_srv_name);
dbms_service.modify_service(p_srv_name, new_service_params);
dbms_service.start_service(p_srv_name);
END;
/
CREATE OR REPLACE TRIGGER sys.pdb_start_services AFTER STARTUP ON DATABASE
BEGIN
dbms_service.start_service('soe');
dbms_service.start_service('soe_ac');
dbms_service.start_service('dbauser_soefail');
END;
/
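To confirm that the triggers really bring these services up after an open or a role change, a quick check like the one below can be used – a sketch of my own, not part of the setup; the dbauser credentials and the dbauser_soefail service are taken from the swingbench command below, and a user with access to v$active_services is assumed.
# Sketch only: list the services currently running in the PDB, to verify that
# the startup triggers have started soe, soe_ac and dbauser_soefail.
import oracledb

oracledb.init_oracle_client()  # thick mode

with oracledb.connect(user="dbauser", password="dbauser",
                      dsn="failtest-1/dbauser_soefail.prod1.dbs") as conn:
    cur = conn.cursor()
    cur.execute("select name from v$active_services order by name")
    for (name,) in cur:
        print(name)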
Load swingbench SOE schema
swingbench/bin/oewizard -cl -cs tcp://failtest-1/dbauser_soefail.prod1.dbs -dba dbauser -dbap dbauser -u soe -p soe -scale 1 -create -tc 2 -ts soe
Command to run swingbench test
bin/charbench -cs "(description=(failover=on)(connect_timeout=2)(transport_connect_timeout=1 sec)(address_list=(address=(protocol=tcp)(host=failtest-1)(port=1521))(address=(protocol=tcp)(host=failtest-2)(port=1521)))(connect_data=(service_name=soe.prod1.dbs)))" \
-u soe -p soe -uc 4 -c SOE_Server_Side_V2.xml
During all tests I have swingbench running at around 1600 TPS – enough load for my small setup while still keeping the CPUs around 40% idle. I want the databases under load, but with enough free CPU time left that CPU starvation does not affect the test.
avg-cpu:  %user   %nice  %system  %iowait  %steal   %idle
          52.11    0.05     5.96     7.39    0.00   34.49

Device    r/s      w/s     rMB/s  wMB/s  rrqm/s  wrqm/s  %rrqm  %wrqm  r_await  w_await  aqu-sz  rareq-sz  wareq-sz  svctm  %util
vdc      33.00  1506.40     1.68   9.85   66.00    0.00  66.67   0.00     0.95     1.48    2.49     52.07      6.70   0.40  61.38
vdd      38.80  1487.80     1.84   8.83   72.00    0.60  64.98   0.04     0.84     1.42    2.37     48.66      6.08   0.40  61.00

avg-cpu:  %user   %nice  %system  %iowait  %steal   %idle
          50.33    0.05     4.84     9.33    0.00   35.44

Device    r/s      w/s     rMB/s  wMB/s  rrqm/s  wrqm/s  %rrqm  %wrqm  r_await  w_await  aqu-sz  rareq-sz  wareq-sz  svctm  %util
vdc      29.80  1388.80     1.41   8.91   23.60    1.20  44.19   0.09     0.84     2.05    3.11     48.32      6.57   0.42  59.42
vdd      34.20  1417.40     1.37   8.55   21.00    0.20  38.04   0.01     0.72     1.82    2.86     41.12      6.18   0.42  60.88
The client
I wrote a simple client in Python that connects to the database with two threads – one that only reads and one that only writes. The program records every error the client experiences during the test and, if a connection becomes unusable, reconnects to the database. Results are written to a CSV file to be analysed later.
Side note: The client is indeed using the new python-oracledb module, but since it is currently not feature complete, especially in the HA area, I’m using it in thick mode with Instant Client 21.9.
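To give an idea of the approach, here is a heavily simplified sketch of the client logic – the real, complete client is linked right below. The table name, credentials, sleep intervals and CSV layout here are illustrative placeholders only.
# Heavily simplified sketch of the test client idea (the real client is linked below).
# One thread only reads, one only writes; every connect and every error is recorded
# with a timestamp so downtime can be measured afterwards.
# Connection details, the test_heartbeat table and intervals are placeholders.
import csv, datetime, threading, time
import oracledb

oracledb.init_oracle_client()  # thick mode with Instant Client, as noted above

DSN = ("(description=(failover=on)(connect_timeout=2)"
       "(address_list="
       "(address=(protocol=tcp)(host=failtest-1)(port=1521))"
       "(address=(protocol=tcp)(host=failtest-2)(port=1521)))"
       "(connect_data=(service_name=soe.prod1.dbs)))")

csv_lock = threading.Lock()

def log(thread_name, event):
    # Append one row per event to the results file
    with csv_lock, open("results.csv", "a", newline="") as f:
        csv.writer(f).writerow([datetime.datetime.now().isoformat(), thread_name, event])

def read_work(conn):
    cur = conn.cursor()
    cur.execute("select sysdate from dual")
    cur.fetchone()

def write_work(conn):
    cur = conn.cursor()
    cur.execute("insert into test_heartbeat (ts) values (systimestamp)")  # placeholder table
    conn.commit()

def worker(name, work):
    # Keep a session open and working; on any database error, record it and reconnect
    while True:
        try:
            with oracledb.connect(user="soe", password="soe", dsn=DSN) as conn:
                log(name, "connected")
                while True:
                    work(conn)
                    time.sleep(0.1)
        except oracledb.Error as e:
            log(name, f"error: {e}")
            time.sleep(0.5)  # small back-off before reconnecting

threading.Thread(target=worker, args=("reader", read_work), daemon=True).start()
threading.Thread(target=worker, args=("writer", write_work), daemon=True).start()
time.sleep(3600)  # let the threads run for an hour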
Access the test client code here (github.com)
Example queries for data analysis are also part of the GitHub repository.
The plan
The next post will be about switchover… then about failover. Maybe more, who knows.