High-performance data unloading from Oracle to an ODBC database
Some time ago I got a task: copy many millions of rows from Oracle to MySQL, as fast as possible.
Over the last year I have used Oracle Heterogeneous Services (HSODBC) a lot to select data from different non-Oracle databases, but I had not yet tried to unload data from Oracle with the best possible performance.
My first thought was, of course, that the process needs to run in parallel. Here I'll first take a look at overall HSODBC performance and then at how I can make the process run in parallel.
In this test I use Oracle Database 11.1.0.7 as the source, Oracle Heterogeneous Services 10.2.0.4 with the MyODBC 5.1 driver, and MySQL 5.1 as the destination database. In my setup, all three components run on separate servers.

First, I create a test table on the Oracle side:
SQL> CREATE TABLE hsperftest AS
     SELECT * FROM all_objects
     UNION ALL SELECT * FROM all_objects
     UNION ALL SELECT * FROM all_objects;

SQL> DESC hsperftest
 Name                 Null?    Type
 -------------------- -------- --------------
 OWNER                         VARCHAR2(30)
 OBJECT_NAME                   VARCHAR2(30)
 SUBOBJECT_NAME                VARCHAR2(30)
 OBJECT_ID                     NUMBER
 DATA_OBJECT_ID                NUMBER
 OBJECT_TYPE                   VARCHAR2(19)
 CREATED                       DATE
 LAST_DDL_TIME                 DATE
 TIMESTAMP                     VARCHAR2(19)
 STATUS                        VARCHAR2(7)
 TEMPORARY                     VARCHAR2(1)
 GENERATED                     VARCHAR2(1)
 SECONDARY                     VARCHAR2(1)
 NAMESPACE                     NUMBER
 EDITION_NAME                  VARCHAR2(30)
 ID                   NOT NULL NUMBER(10)
And I create the same table in MySQL.
CREATE TABLE `hsperftest` (
  `OWNER`          varchar(30) DEFAULT NULL,
  `OBJECT_NAME`    varchar(30) DEFAULT NULL,
  `SUBOBJECT_NAME` varchar(30) DEFAULT NULL,
  `OBJECT_ID`      int(11)     DEFAULT NULL,
  `DATA_OBJECT_ID` int(11)     DEFAULT NULL,
  `OBJECT_TYPE`    varchar(19) DEFAULT NULL,
  `CREATED`        datetime    DEFAULT NULL,
  `LAST_DDL_TIME`  datetime    DEFAULT NULL,
  `TIMESTAMP`      varchar(19) DEFAULT NULL,
  `STATUS`         varchar(7)  DEFAULT NULL,
  `TEMPORARY`      varchar(1)  DEFAULT NULL,
  `GENERATED`      varchar(1)  DEFAULT NULL,
  `SECONDARY`      varchar(1)  DEFAULT NULL,
  `NAMESPACE`      int(11)     DEFAULT NULL,
  `EDITION_NAME`   varchar(30) DEFAULT NULL,
  `ID`             int(11)     NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
Set up HSODBC to connect to the MySQL database and create a database link on the Oracle side. When setting up HSODBC, do not enable logging/tracing (HS_FDS_TRACE_LEVEL = off)! Tracing serializes all database calls to the remote database.
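For reference, here is roughly what my setup looked like. Treat it as a sketch: the gateway SID hsdb, the ODBC DSN mysql_dsn and the credentials are placeholders, and the matching LISTENER.ORA and TNSNAMES.ORA entries are assumed to be in place.

# inithsdb.ora -- gateway init file in $ORACLE_HOME/hs/admin (names are placeholders)
HS_FDS_CONNECT_INFO = mysql_dsn    # ODBC DSN pointing at the MySQL database
HS_FDS_TRACE_LEVEL = off           # keep tracing off; it serializes remote calls

-- On the Oracle side, create the database link used throughout this post:
CREATE DATABASE LINK hsdb
  CONNECT TO "mysqluser" IDENTIFIED BY "mysqlpass"
  USING 'hsdb';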
I tested this with HSODBC running on Windows Vista and Linux (using unixODBC). The results were similar.
Test that the MySQL table is accessible from Oracle:
SQL> DESC "hsperftest"@hsdb
 Name                 Null?    Type
 -------------------- -------- --------------
 OWNER                         VARCHAR2(30)
 OBJECT_NAME                   VARCHAR2(30)
 SUBOBJECT_NAME                VARCHAR2(30)
 OBJECT_ID                     NUMBER(10)
 DATA_OBJECT_ID                NUMBER(10)
 OBJECT_TYPE                   VARCHAR2(19)
 CREATED                       DATE
 LAST_DDL_TIME                 DATE
 TIMESTAMP                     VARCHAR2(19)
 STATUS                        VARCHAR2(7)
 TEMPORARY                     VARCHAR2(1)
 GENERATED                     VARCHAR2(1)
 SECONDARY                     VARCHAR2(1)
 NAMESPACE                     NUMBER(10)
 EDITION_NAME                  VARCHAR2(30)
 ID                   NOT NULL NUMBER(10)
Before continuing, check that all MySQL data types are mapped correctly to Oracle, and make sure that none of the columns use the deprecated LONG datatype.
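One way to check the mapping programmatically is to describe the remote select list with DBMS_SQL and flag any column that comes through with the LONG type code (8). This is just a quick sketch:

SET SERVEROUTPUT ON
DECLARE
  c    INTEGER := DBMS_SQL.OPEN_CURSOR;
  n    INTEGER;
  cols DBMS_SQL.DESC_TAB;
BEGIN
  -- Parse a select over the database link and describe its columns
  DBMS_SQL.PARSE(c, 'SELECT * FROM "hsperftest"@hsdb', DBMS_SQL.NATIVE);
  DBMS_SQL.DESCRIBE_COLUMNS(c, n, cols);
  FOR i IN 1 .. n LOOP
    -- Type code 8 = LONG; flag any column that mapped to the deprecated type
    DBMS_OUTPUT.PUT_LINE(cols(i).col_name || ' type code ' || cols(i).col_type ||
                         CASE WHEN cols(i).col_type = 8 THEN '  <-- LONG!' END);
  END LOOP;
  DBMS_SQL.CLOSE_CURSOR(c);
END;
/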
First of all… the following SQL will not work:
SQL> insert into "hsperftest"@hsdb select * from hsperftest;
insert into "hsperftest"@hsdb select * from hsperftest
*
ERROR at line 1:
ORA-02025: all tables in the SQL statement must be at the remote database
Some PL/SQL is needed to first select from an Oracle cursor and then bind the values into an INSERT statement executed at the remote database. The easiest way is:
BEGIN
  FOR rec IN (SELECT * FROM hsperftest) LOOP
    INSERT INTO "hsperftest"@hsdb VALUES rec;
  END LOOP;
  COMMIT;
END;
/

PL/SQL procedure successfully completed.

Elapsed: 00:14:28.38
When running this anonymous block I noticed that the load on both the Oracle database and the MySQL database was very light, but the HSODBC CPU usage was at its maximum. So HSODBC seems to be the bottleneck. What is it doing? With this kind of CPU usage it's not going to scale much in parallel. If it's performing some mapping between Oracle SQL and MySQL SQL, can I remove this bottleneck by using DBMS_HS_PASSTHROUGH, which sends the query unmodified to the remote database?
DECLARE
  p_c BINARY_INTEGER;
  nr  INTEGER;
BEGIN
  p_c := DBMS_HS_PASSTHROUGH.OPEN_CURSOR@hsdb;
  DBMS_HS_PASSTHROUGH.PARSE@hsdb (p_c,
    'insert into hsperftest (OWNER,OBJECT_NAME,SUBOBJECT_NAME,OBJECT_ID,
     DATA_OBJECT_ID,OBJECT_TYPE,CREATED,LAST_DDL_TIME,TIMESTAMP,STATUS,
     TEMPORARY,GENERATED,SECONDARY,NAMESPACE,EDITION_NAME,ID)
     VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)');
  FOR rec IN (SELECT * FROM hsperftest) LOOP
    DBMS_HS_PASSTHROUGH.BIND_VARIABLE@hsdb (p_c,  1, rec.owner);
    DBMS_HS_PASSTHROUGH.BIND_VARIABLE@hsdb (p_c,  2, rec.object_name);
    DBMS_HS_PASSTHROUGH.BIND_VARIABLE@hsdb (p_c,  3, rec.subobject_name);
    DBMS_HS_PASSTHROUGH.BIND_VARIABLE@hsdb (p_c,  4, rec.object_id);
    DBMS_HS_PASSTHROUGH.BIND_VARIABLE@hsdb (p_c,  5, rec.data_object_id);
    DBMS_HS_PASSTHROUGH.BIND_VARIABLE@hsdb (p_c,  6, rec.object_type);
    DBMS_HS_PASSTHROUGH.BIND_VARIABLE@hsdb (p_c,  7, rec.created);
    DBMS_HS_PASSTHROUGH.BIND_VARIABLE@hsdb (p_c,  8, rec.last_ddl_time);
    DBMS_HS_PASSTHROUGH.BIND_VARIABLE@hsdb (p_c,  9, rec.timestamp);
    DBMS_HS_PASSTHROUGH.BIND_VARIABLE@hsdb (p_c, 10, rec.status);
    DBMS_HS_PASSTHROUGH.BIND_VARIABLE@hsdb (p_c, 11, rec.temporary);
    DBMS_HS_PASSTHROUGH.BIND_VARIABLE@hsdb (p_c, 12, rec.generated);
    DBMS_HS_PASSTHROUGH.BIND_VARIABLE@hsdb (p_c, 13, rec.secondary);
    DBMS_HS_PASSTHROUGH.BIND_VARIABLE@hsdb (p_c, 14, rec.namespace);
    DBMS_HS_PASSTHROUGH.BIND_VARIABLE@hsdb (p_c, 15, rec.edition_name);
    DBMS_HS_PASSTHROUGH.BIND_VARIABLE@hsdb (p_c, 16, rec.id);
    nr := DBMS_HS_PASSTHROUGH.EXECUTE_NON_QUERY@hsdb (p_c);
  END LOOP;
  DBMS_HS_PASSTHROUGH.CLOSE_CURSOR@hsdb (p_c);
  COMMIT;
END;
/

PL/SQL procedure successfully completed.

Elapsed: 00:07:37.77
That is a huge difference in execution time: 7:38 versus 14:28, almost twice as fast. And with DBMS_HS_PASSTHROUGH, the HSODBC process CPU usage is much lower, making it more scalable when running in parallel.
Now, how to run it in parallel?
Since the destination database is non-Oracle, the database's built-in parallel execution cannot be used, so “DIY parallelism” is needed: split/partition the input data by some criteria and start a separate process for each partition.
In the real project I used each record's CREATED_DATE value for partitioning, because the underlying table was already partitioned by that column, but it is also possible to use table ROWID ranges for partitioning. A good example of this is in the book Expert Oracle Database Architecture by Thomas Kyte, chapter 14, Parallel Execution, “Do-It-Yourself Parallelism”.
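To illustrate the date-based approach, the job ranges could be generated along these lines. This is a hypothetical sketch: the date_job table, the 2008 date range and the month granularity are made up for illustration.

-- Hypothetical job table holding one CREATED_DATE range per parallel slave
CREATE TABLE date_job (
  job_no   NUMBER NOT NULL,
  low_date DATE   NOT NULL,  -- inclusive lower bound
  hi_date  DATE   NOT NULL   -- exclusive upper bound
);

-- One job per month of 2008 (illustrative); each slave then selects
-- WHERE created_date >= low_date AND created_date < hi_date
INSERT INTO date_job (job_no, low_date, hi_date)
SELECT LEVEL,
       ADD_MONTHS(DATE '2008-01-01', LEVEL - 1),
       ADD_MONTHS(DATE '2008-01-01', LEVEL)
  FROM dual
CONNECT BY LEVEL <= 12;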
Here I'm going to split the table into just three pieces to keep things simple. To get maximum I/O performance from Oracle, follow Tom Kyte's example; here I'll stick to the simple version.
The table hsperftest_job will keep the metadata for each parallel slave.
CREATE TABLE hsperftest_job (
  job_no         NUMBER NOT NULL,
  low_id         ROWID  NOT NULL,
  hi_id          ROWID  NOT NULL,
  max_in         NUMBER NOT NULL,
  start_date     DATE,
  end_date       DATE,
  rows_processed NUMBER
);

CREATE SEQUENCE seq_hsperftest_job;

DECLARE
  p_min   ROWID;
  p_max   ROWID;
  p_row1  ROWID;
  p_row2  ROWID;
  p_count NUMBER;
BEGIN
  SELECT MIN(rowid), MAX(rowid), COUNT(*)
    INTO p_min, p_max, p_count
    FROM hsperftest;
  SELECT MAX(a)
    INTO p_row1
    FROM (SELECT rowid a FROM hsperftest WHERE rownum < CEIL(p_count/3));
  SELECT MAX(a)
    INTO p_row2
    FROM (SELECT rowid a, rownum rn FROM hsperftest WHERE rowid > p_row1)
   WHERE rn < CEIL(p_count/3);
  INSERT INTO hsperftest_job (job_no, low_id, hi_id, max_in)
    VALUES (seq_hsperftest_job.nextval, p_min, p_row1, 0);
  INSERT INTO hsperftest_job (job_no, low_id, hi_id, max_in)
    VALUES (seq_hsperftest_job.nextval, p_row1, p_row2, 0);
  INSERT INTO hsperftest_job (job_no, low_id, hi_id, max_in)
    VALUES (seq_hsperftest_job.nextval, p_row2, p_max, 1);
  COMMIT;
END;
/

SELECT j.job_no, COUNT(1)
  FROM hsperftest_job j, hsperftest h
 WHERE h.rowid BETWEEN j.low_id AND j.hi_id
 GROUP BY j.job_no;

    JOB_NO   COUNT(1)
---------- ----------
         1      59313
         2      59314
         3      59317
Looks like my very simple ROWID hack is good enough for this test.
Now I'll create a procedure that will be executed in parallel.
CREATE OR REPLACE PROCEDURE hsperftest_proc(p_job_no hsperftest_job.job_no%TYPE) IS
  p_min    ROWID;
  p_max    ROWID;
  p_max_in NUMBER;
  p_c      BINARY_INTEGER;
  nr       INTEGER;
  p_count  NUMBER := 0;
BEGIN
  -- Read the job metadata
  SELECT low_id, hi_id, max_in
    INTO p_min, p_max, p_max_in
    FROM hsperftest_job
   WHERE job_no = p_job_no;
  UPDATE hsperftest_job SET start_date = SYSDATE WHERE job_no = p_job_no;
  COMMIT;
  -- Do the work
  p_c := DBMS_HS_PASSTHROUGH.OPEN_CURSOR@hsdb;
  DBMS_HS_PASSTHROUGH.PARSE@hsdb (p_c,
    'insert into hsperftest (OWNER,OBJECT_NAME,SUBOBJECT_NAME,OBJECT_ID,
     DATA_OBJECT_ID,OBJECT_TYPE,CREATED,LAST_DDL_TIME,TIMESTAMP,STATUS,
     TEMPORARY,GENERATED,SECONDARY,NAMESPACE,EDITION_NAME,ID)
     VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)');
  FOR rec IN (SELECT * FROM hsperftest
               WHERE rowid BETWEEN p_min AND p_max
                 AND (p_max_in = 1 OR (p_max_in = 0 AND rowid < p_max)))
  LOOP
    DBMS_HS_PASSTHROUGH.BIND_VARIABLE@hsdb (p_c,  1, rec.owner);
    DBMS_HS_PASSTHROUGH.BIND_VARIABLE@hsdb (p_c,  2, rec.object_name);
    DBMS_HS_PASSTHROUGH.BIND_VARIABLE@hsdb (p_c,  3, rec.subobject_name);
    DBMS_HS_PASSTHROUGH.BIND_VARIABLE@hsdb (p_c,  4, rec.object_id);
    DBMS_HS_PASSTHROUGH.BIND_VARIABLE@hsdb (p_c,  5, rec.data_object_id);
    DBMS_HS_PASSTHROUGH.BIND_VARIABLE@hsdb (p_c,  6, rec.object_type);
    DBMS_HS_PASSTHROUGH.BIND_VARIABLE@hsdb (p_c,  7, rec.created);
    DBMS_HS_PASSTHROUGH.BIND_VARIABLE@hsdb (p_c,  8, rec.last_ddl_time);
    DBMS_HS_PASSTHROUGH.BIND_VARIABLE@hsdb (p_c,  9, rec.timestamp);
    DBMS_HS_PASSTHROUGH.BIND_VARIABLE@hsdb (p_c, 10, rec.status);
    DBMS_HS_PASSTHROUGH.BIND_VARIABLE@hsdb (p_c, 11, rec.temporary);
    DBMS_HS_PASSTHROUGH.BIND_VARIABLE@hsdb (p_c, 12, rec.generated);
    DBMS_HS_PASSTHROUGH.BIND_VARIABLE@hsdb (p_c, 13, rec.secondary);
    DBMS_HS_PASSTHROUGH.BIND_VARIABLE@hsdb (p_c, 14, rec.namespace);
    DBMS_HS_PASSTHROUGH.BIND_VARIABLE@hsdb (p_c, 15, rec.edition_name);
    DBMS_HS_PASSTHROUGH.BIND_VARIABLE@hsdb (p_c, 16, rec.id);
    nr := DBMS_HS_PASSTHROUGH.EXECUTE_NON_QUERY@hsdb (p_c);
    p_count := p_count + 1;
  END LOOP;
  DBMS_HS_PASSTHROUGH.CLOSE_CURSOR@hsdb (p_c);
  COMMIT;
  -- Write the job end time and row count
  UPDATE hsperftest_job
     SET end_date = SYSDATE, rows_processed = p_count
   WHERE job_no = p_job_no;
  COMMIT;
END;
/
And finally I'll execute the jobs by putting them on the job queue.
DECLARE
  x PLS_INTEGER;
BEGIN
  FOR rec IN (SELECT job_no FROM hsperftest_job) LOOP
    DBMS_JOB.submit (job => x, what => 'hsperftest_proc(' || rec.job_no || ');');
  END LOOP;
  COMMIT;
END;
/

SQL> SELECT job, total_time, broken, what FROM user_jobs;

       JOB TOTAL_TIME B WHAT
---------- ---------- - --------------------
     10967         95 N hsperftest_proc(1);
     10968         95 N hsperftest_proc(2);
     10969         95 N hsperftest_proc(3);
While the jobs are running, monitor the CPU and memory usage of the HSODBC processes. If too many parallel slaves are running, they may become the bottleneck.
If HSODBC memory usage becomes a problem, then partition the source data into smaller chunks.
When everything is done, look at the results.
SQL> SELECT SUM(rows_processed) total_rows,
            (MAX(end_date) - MIN(start_date)) * 24 * 60 runtime_min
       FROM hsperftest_job;

TOTAL_ROWS RUNTIME_MIN
---------- -----------
    177942  3.58333333
3.58 min = 3 min 35 sec, roughly twice as fast as the serial DBMS_HS_PASSTHROUGH run. Nice 🙂