Index Oracle RDBMS in Solr using Apache Spark

After I wrote multiprocessing python script which can be scaled with multiple docker containers to load billions of records from Oracle RDBMS to Solr Cloud I started to think of simpler solutions.

The basic idea is to handle all multiprocessing and scaling aspects by spark. As Spark is able to partition jdbc datasources out of the box the prove was needed that data can be saved in solr. Spakr-Solr provides the needed functionality.

Environment Preparation:

  • Setup Solr Cloud on Docker (one node in cloud mode is sufficient, create a collection “test”)
  • Run Oracle 18c on Docker
  • Run a Spark Cluster on Docker (I used this Spark images)

Copy the jdbc driver and the spark-solr driver

docker cp ojdbc8.jar spark-master:/
docker cp spark-solr-3.6.4-shaded.jar  spark-master:/
docker cp ojdbc8.jar spark-worker-1:/
docker cp spark-solr-3.6.4-shaded.jar  spark-worker-1:/

Start the Spark Shell (I use the pyspark) and load the date into a data frame

docker exec -it spark-master /spark/bin/pyspark --jars /ojdbc8.jar,/spark-solr-3.6.4-shaded.jar

empDF = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:oracle:thin:test/test@//192.168.1.12:1521/demo01") \
    .option("dbtable", "test.emp") \
    .option("user", "test") \
    .option("password", "****") \
    .option("driver", "oracle.jdbc.driver.OracleDriver") \
    .load()
empDF.printSchema()
empDF.show()

Now save the date to Solr Cloud

empDF.write.format("solr").option("zkhost","192.168.1.12:9983").option("collection","test").mode("append").save()

And the date is in Solr Cloud.

On a Test with more date I had to add the Option .option(“commit_within”, “5000”)
I could not find explicit commit on solr-spark.

This solution should able scale well on a spark cluster by partitioning the data on the jdbc side. Transformation can be added on Spark. And such a solution would much less complex and better maintainable then a python multiprocessing solution (who ever used python multiprocessing may know what I mean).

If you have used similar setup please share your experiences.

How check the jdbc client version in your database?

On the java side you can check the JDBC Client Version very easy.

java -jar ./mwhome/.../ojdbc7.jar -getversion
Oracle 12.1.0.2.0 JDBC 4.1 compiled with JDK7 on Mon_Jun_30_11:30:34_PDT_2014
#Default Connection Properties Resource
#Tue Apr 17 14:13:51 CEST 2018

But somethimes you may not have access to the application server (here Weblogic) or there are many version installed on the application side.
So the best way is to check in you Oracle Database who your client are.

SQL> select SID,NETWORK_SERVICE_BANNER,CLIENT_CHARSET,CLIENT_CONNECTION,CLIENT_VERSION,CLIENT_DRIVER from v$session_connect_info;
       396
Crypto-checksumming service for Linux: Version 12.2.0.1.0 - Production
Unknown                                  Heterogeneous 12.1.0.2.0                               jdbcthin

This is a reliable way to find how (jdbcthin, oci, ..) and with what JDBC Version the client are connecting.