Index Oracle RDBMS in Solr using Apache Spark

After I wrote multiprocessing python script which can be scaled with multiple docker containers to load billions of records from Oracle RDBMS to Solr Cloud I started to think of simpler solutions.

The basic idea is to handle all multiprocessing and scaling aspects by spark. As Spark is able to partition jdbc datasources out of the box the prove was needed that data can be saved in solr. Spakr-Solr provides the needed functionality.

Environment Preparation:

  • Setup Solr Cloud on Docker (one node in cloud mode is sufficient, create a collection “test”)
  • Run Oracle 18c on Docker
  • Run a Spark Cluster on Docker (I used this Spark images)

Copy the jdbc driver and the spark-solr driver

docker cp ojdbc8.jar spark-master:/
docker cp spark-solr-3.6.4-shaded.jar  spark-master:/
docker cp ojdbc8.jar spark-worker-1:/
docker cp spark-solr-3.6.4-shaded.jar  spark-worker-1:/

Start the Spark Shell (I use the pyspark) and load the date into a data frame

docker exec -it spark-master /spark/bin/pyspark --jars /ojdbc8.jar,/spark-solr-3.6.4-shaded.jar

empDF = \
    .format("jdbc") \
    .option("url", "jdbc:oracle:thin:test/test@//") \
    .option("dbtable", "test.emp") \
    .option("user", "test") \
    .option("password", "****") \
    .option("driver", "oracle.jdbc.driver.OracleDriver") \

Now save the date to Solr Cloud


And the date is in Solr Cloud.

On a Test with more date I had to add the Option .option(“commit_within”, “5000”)
I could not find explicit commit on solr-spark.

This solution should able scale well on a spark cluster by partitioning the data on the jdbc side. Transformation can be added on Spark. And such a solution would much less complex and better maintainable then a python multiprocessing solution (who ever used python multiprocessing may know what I mean).

If you have used similar setup please share your experiences.

How check the jdbc client version in your database?

On the java side you can check the JDBC Client Version very easy.

java -jar ./mwhome/.../ojdbc7.jar -getversion
Oracle JDBC 4.1 compiled with JDK7 on Mon_Jun_30_11:30:34_PDT_2014
#Default Connection Properties Resource
#Tue Apr 17 14:13:51 CEST 2018

But somethimes you may not have access to the application server (here Weblogic) or there are many version installed on the application side.
So the best way is to check in you Oracle Database who your client are.

Crypto-checksumming service for Linux: Version - Production
Unknown                                  Heterogeneous                               jdbcthin

This is a reliable way to find how (jdbcthin, oci, ..) and with what JDBC Version the client are connecting.

Oracle: udpate a XML value in place with sql

Oracle sql provides a simple way to update values in a xml document by sql.
The “updatexml” function can udpate any fields in by xpath.

update SCOTT.DEMO" d
where id=1234;

to update multiple values at once:


Very handy to update large documents.

Login into Oracle Database without password / Autologin

Here I try to explain most simple and short way to login with a oracle wallen / without password.

create wallet:

mkstore -wrl -createCredential mkstore -wrl -listCredential

configure client side sqlnet.ora:


configure tns alias in tnsnames.ora: =

If the oracle instant client is used the location of sqlnet.ora and tnsnames.ora needs to defined with the TNS_ADMIN environment variable .


Oracle on Docker (I): Backup / Restore Oracle RDBMS on Docker

Running Oracle on Docker is a great way for Testing Features or Development environments, specially for offline development on notebooks which require an oracle database.

If you have setup an oracle Image according oracle rdbms docker and choosen to create your db on docker volumes. And mounted backup1 volume at /backup.

docker volume create oracledb1
docker volume create backup1

Create an offline backup:

docker exec -u 0 -it ora122 bash
bash-4.2# chown oracle /backup

docker exec -ti ora122 sqlplus / as sysdba
SQL> alter database close;

Database altered.

docker exec -ti ora122 rman target /
backup database format '/backup/%U' tag=bk1;

Restore Database from Backup

docker exec -ti ora122 rman target /
RMAN> restore database from tag=BK1

RMAN> shutdown immediate

RMAN> startup mount

RMAN> alter database open resetlogs;

PL/SQL Unit Test for SQL geneartor

If PL/SQL is unit tested utPLSQL ( is probably the most majure solution to implement it.
Here I suggest a solution to check if generated SQL is executable.

We have a package P_SERCH whith a function generate_sql to generate dynamic sql.
Since there is no native execption checking in utPLSQL we execute the generated SQL in a block.
On a error we to a False comparsion “ut.expect(‘ok’).to_equal(‘error’)”.
If the statement executes without error we to a True comparsion “ut.expect(‘ok’).to_equal(‘error’)”

To lower the impact of the sql execution the statement is wrap in a count select.
If theh statement execution is still inappropriate for performance reasons this could be replaced as well with a DBMS_SQL parse.

  DBMS_SQL.PARSE(c, 'select * from multi_tab order by 1', DBMS_SQL.NATIVE);

Package Header and Body of the unit test:

create or replace package ut3_P_SEARCH as
  -- %suite(p_search)
  -- %suitepath(

  procedure global_setup;

  procedure global_cleanup;

  procedure test_setup;

  procedure test_cleanup;
  -- %test
  -- %displayname(generate solr sql)
  procedure gensql_starship;

end ut3_P_SEARCH;

create or replace package body ut3_P_SEARCH as
  procedure global_setup is

  procedure test_setup is
  end test_setup;

  procedure global_cleanup is

  procedure test_cleanup is
  end test_cleanup;

  procedure gensql_starship is
    v_sql                 varchar2(4000);
    -- retrieve generated sql for person class
    select scott.p_search.generate_sql('STARSHIP') into v_sql from dual;
    -- check if executable
      execute immediate 'select count(*) from ('||v_sql||')';
    exception  when others then
    -- check tables 
    ut.expect(v_sql).to_be_like(a_mask=>'%'|| p_metadata_util.f_class_relational_name('PERSON') ||'%');
    -- check static attributes
  END gensql_person;

end ut3_P_SEARCH;

test execution
SQL> begin'scott.ut3_p_search.gensql_starship'); end;
  2  /
generate solr sql
Finished in .014174 seconds
1 tests, 0 failed, 0 errored, 0 disabled, 0 warning(s)

PL/SQL procedure successfully completed.

A good source for PL/SQL Testing and Error handling is Jacek Gebals blog (, he is also one of the main developer of utPLSQL (version 3)

Find Oracle SQL Profiles causing dynamic_sampling

Recently I hat memory lock issues with statements using dynamic sampling.
But after disabling dynamic_sampling at the system level there were still some statements causing issues with dynamic_sampling.
The execution plan showed that the problematic statements had SQL Profiles. Thanks to a blog from
Cristian Antognini ( all SQL Profiles with with a dynaimic sampling hint could be found.

Demo: Setup a statment with a sqlprofile

exec dbms_sqltune.drop_sql_profile('SQLPROFILE42');
     l_sql               clob;
     l_sql := q'!select id from t42 where id=:x!';

     dbms_sqltune.import_sql_profile( sql_text => l_sql, 
                                     name => 'SQLPROFILE42',
                                     profile => sqlprof_attr(q'!OPT_PARAM('optimizer_dynamic_sampling' 2)!'
                                                -- ,q'!FULL(@"SEL$1" "T42"@"SEL$1")!'
                                                ,q'!INDEX(@"SEL$1" "T42"@"SEL$1" "I42_3")!'
             force_match => true );

var x number;
exec :x := 2167658022;

explain plan for select id from t42 where id=:x;
select * from table(dbms_xplan.display(null,null,'ADVANCED'));

Finding the SQL Profiles with dynamic sampling

SELECT,extractValue(value(h),'.') AS hint
FROM sys.sqlobj$data od, sys.sqlobj$ so,
table(xmlsequence(extract(xmltype(od.comp_data),'/outline_data/hint'))) h
WHERE 1=1 -- = '&sqlprof'
  AND so.signature = od.signature
  AND so.category = od.category
  AND so.obj_type = od.obj_type
  AND so.plan_id = od.plan_id
  and extractValue(value(h),'.') like '%dynamic%'
OPT_PARAM('optimizer_dynamic_sampling' 2)

Will there be Oracel RDBMS 12.2 for ODA X6?

Will there be Oracel RDBMS 12.2 for ODA X6?

Doc ID 888888.1

* Currently X7-2 is released with but will not be provided for other ODA HW Types.
   The next bundle (pending) will be released for all other versions X6.2*, X5-2,X4-2, X3-2, and V1.

So it seems Oracle 12.2 is comming for the ODA x6 as well in the near future.

adjust column high and low value on column with hybrid histograms

On table where time series data get loaded it’s not uncommon to get out-of-range conditions for the CBO optimizer.
This means that the statistics are not uptodate and until they are gathered again, the optimizer may think that there is no data within a daterange, since the queries range may be higher than the columns high value.
As consequence you may end up with suboptimal plan which could include catesian joins und unwanted nested loops.

An even wors condition is if you have few outliers in a date range and bind variables are used. if all your data is from the past 10 Month but you have a record from the year 0001, the optimizer may guess without histogram that your data es even distributed. if the last month are queries the guess cardinality would be by magnitute to low. this usually end as well with bad plans.

A common solution to this issues is to set the high and/or low value explicitly with DBMS_STATS.set_column_stats (Doc ID 1276174.1).

This works well if there are no histograms on the column. It may even work if you have histograms (even they may get destroyed). On time series date there is usually a high-balanced histogram. But on 12.1 or 12.2 this often changes to a hybrid histogram.

And using the old methode may end up with:
ERROR at line 1:
ORA-20001: Invalid or inconsistent input values
ORA-06512: at "SYS.DBMS_STATS", line 13570
ORA-06512: at line 23

This is mostly because the bkvals array is incorrect.

Even the details are documented in the description of DBMS_STATS it may be no so obvious how set the high value in such a case.
Here I share my example which workes with a hybird histogram.

v_distcnt NUMBER;
v_density NUMBER;
v_nullcnt NUMBER;
v_avgclen NUMBER;
v_histogram_type user_tab_columns.histogram%type;
DBMS_STATS.get_column_stats (ownname => USER,
tabname => 'MYTAB',
colname => 'DCOL',
distcnt => v_distcnt,
density => v_density,
nullcnt => v_nullcnt,
srec => srec,
avgclen => v_avgclen
-- init a datearray with the correct number of values (buckets in the histogram)
v_datevals := DBMS_STATS.DATEARRAY();
-- check if there is a hybrid histogram
select histogram INTO v_histogram_type from user_tab_columns where table_name='MYTAB' and column_name='DCOL';
IF v_histogram_type='HYBRID' then
-- copy the values from the histogram to the datearray (! this is an example for date/timestamp columns only, Julian calendar !)
for i in 1..srec.novals.count loop
v_datevals(i) := to_date(srec.novals(i),'J');
end loop;
-- reset first and last (low/high value) !! according your needs!
v_datevals(1) := to_date('01.01.2017','DD.MM.YYYY');
v_datevals(datevals.last) := trunc(sysdate) + 14;
null; -- .. do what you do in other cases, this is only an example, think before run ..
-- srec.bkvals := null;
-- if oyu wnat a freq or hybrid histogram ... else null
-- srec.bkvals := dbms_stats.numarray(srec.bkvals(1),srec.bkvals(srec.bkvals.last));

DBMS_STATS.prepare_column_values (srec, v_datevals);
DBMS_STATS.set_column_stats (ownname => USER,
tabname => 'MYTAB',
colname => 'DCOL',
distcnt => v_distcnt,
density => v_density,
nullcnt => v_nullcnt,
srec => srec,
avgclen => v_avgclen,
force => true

A hybrid histogram on skewed time series can be really wonderful and the out-of-range condition can be prevented.

Please leave comments how you handle the out-of-range issues.