Testing client failover with Data Guard Fast-Start Failover 1/setup
- Written by: ilmarkerm
- Category: Blog entry
- Published: February 24, 2023
All posts in this series
- Part 1: Setup
- Part 2: Switchover
- Part 3: Failover
Data Guard setup
My end goal is to find out how fast database clients (applications) can resume work (reading and writing) after Data Guard role change operation – either it happens as a planned or unplanned operation. Testing different options to find the shortest downtime – results apply for my environment only, your results will vary.
I’m doing it just to increase my knowledge about different Data Guard scenarios.
My database setup
I’m ONLY interested in the role change operation between instances failtesti1 and failtesti2. Whitch are in the same DC, different availability zones (ping 0.165 ms avg). They equal in every way and are intended to service application read+write traffic as primary databases. These two instances are in SYNC mode.
During the test I will also throw in a remote standby database failtesti3 that is located a few thousand km away (53ms ping). This remote standby is in ASYNC or ARCH mode. failtesti3 is only a BYSTANDER during all tests.
DGMGRL> show configuration verbose;
Configuration - failtest
Protection Mode: MaxAvailability
Members:
failtesti1 - Primary database
failtesti2 - (*) Physical standby database
failtesti3 - Physical standby database
(*) Fast-Start Failover target
Properties:
FastStartFailoverThreshold = '25'
OperationTimeout = '30'
TraceLevel = 'USER'
FastStartFailoverLagLimit = '30'
CommunicationTimeout = '180'
ObserverReconnect = '1800'
FastStartFailoverAutoReinstate = 'TRUE'
FastStartFailoverPmyShutdown = 'TRUE'
BystandersFollowRoleChange = 'ALL'
ObserverOverride = 'FALSE'
ExternalDestination1 = ''
ExternalDestination2 = ''
PrimaryLostWriteAction = 'FAILOVER'
ConfigurationWideServiceName = 'QKNPUXMZ_CFG'
Fast-Start Failover: Enabled in Zero Data Loss Mode
Lag Limit: 30 seconds (not in use)
Threshold: 25 seconds
Active Target: failtesti2
Potential Targets: "failtesti2"
failtesti2 valid
Observer: failtest-observer
Shutdown Primary: TRUE
Auto-reinstate: TRUE
Observer Reconnect: 1800 seconds
Observer Override: FALSE
Configuration Status:
SUCCESS
Services
Both CDB and PDB have a simple service and on database open trigger created, to start the service when database opens.
exec dbms_service.create_service('dbauser_failtest','dbauser_failtest');
exec dbms_service.start_service('dbauser_failtest');
CREATE OR REPLACE TRIGGER sys.cdb_start_services AFTER STARTUP OR DB_ROLE_CHANGE ON DATABASE
DECLARE
v_srv_name varchar2(30):= 'dbauser_failtest';
cnt NUMBER;
BEGIN
SELECT count(*) INTO cnt FROM v$active_services WHERE name = v_srv_name;
IF cnt = 0 THEN
dbms_service.start_service(v_srv_name);
END IF;
END;
/
For PDB application use
alter session set container=soefail;
-- Lets create a simple service
exec dbms_service.create_service('soe','soe');
-- Now lets create one service with Application Continuity turned on
DECLARE
new_service_params dbms_service.svc_parameter_array;
p_srv_name varchar2(20):= 'soe_ac';
BEGIN
new_service_params('FAILOVER_TYPE'):= 'AUTO';
new_service_params('REPLAY_INITIATION_TIMEOUT'):= 1800;
new_service_params('RETENTION_TIMEOUT'):= 86400;
new_service_params('FAILOVER_DELAY'):= 5;
new_service_params('FAILOVER_RETRIES'):= 10;
new_service_params('FAILOVER_RESTORE'):= 'AUTO';
new_service_params('commit_outcome'):= 'true';
new_service_params('aq_ha_notifications'):= 'true';
dbms_service.create_service(p_srv_name, p_srv_name);
dbms_service.modify_service(p_srv_name, new_service_params);
dbms_service.start_service(p_srv_name);
END;
/
CREATE OR REPLACE TRIGGER sys.pdb_start_services AFTER STARTUP ON DATABASE
BEGIN
dbms_service.start_service('soe');
dbms_service.start_service('soe_ac');
dbms_service.start_service('dbauser_soefail');
END;
/
Load swingbench SOE schema
swingbench/bin/oewizard -cl -cs tcp://failtest-1/dbauser_soefail.prod1.dbs -dba dbauser -dbap dbauser -u soe -p soe -scale 1 -create -tc 2 -ts soe
Command to run swingbench test
bin/charbench -cs "(description=(failover=on)(connect_timeout=2)(transport_connect_timeout=1 sec)(address_list=(address=(protocol=tcp)(host=failtest-1)(port=1521))(address=(protocol=tcp)(host=failtest-2)(port=1521)))(connect_data=(service_name=soe.prod1.dbs)))" \
-u soe -p soe -uc 4 -c SOE_Server_Side_V2.xml
During all test I have swingbench running at around 1600 TPS – which is enough for my small setup to keep the CPUs 40% idle. I want databases under load, but still enough free CPU time for it not to affect the test.
avg-cpu: %user %nice %system %iowait %steal %idle 52.11 0.05 5.96 7.39 0.00 34.49 Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util vdc 33.00 1506.40 1.68 9.85 66.00 0.00 66.67 0.00 0.95 1.48 2.49 52.07 6.70 0.40 61.38 vdd 38.80 1487.80 1.84 8.83 72.00 0.60 64.98 0.04 0.84 1.42 2.37 48.66 6.08 0.40 61.00 avg-cpu: %user %nice %system %iowait %steal %idle 50.33 0.05 4.84 9.33 0.00 35.44 Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util vdc 29.80 1388.80 1.41 8.91 23.60 1.20 44.19 0.09 0.84 2.05 3.11 48.32 6.57 0.42 59.42 vdd 34.20 1417.40 1.37 8.55 21.00 0.20 38.04 0.01 0.72 1.82 2.86 41.12 6.18 0.42 60.88
The client
I wrote a simple client in Python that will connect two threads to the database – one that only reads and one that only writes, in separate threads. And the program will record all errors the client experiences during the test and in case connection becomes unusable – reconnect to the database. Results are recorded in CSV file to be analysed later.
Side note: The client is indeed using the new python-oracledb module, but since it is currently not complete, specially in the HA area – I’m using it in thick mode with Instantclient 21.9.
Access the test client code here (github.com)
Examples queries for data analytics are also part of the github repository.
The plan
The next post will be about switchover… then about failover. Maybe more, who knows.