Index Oracle RDBMS in Solr using Apache Spark

After writing a multiprocessing Python script, scalable across multiple Docker containers, to load billions of records from Oracle RDBMS into Solr Cloud, I started to think about simpler solutions.

The basic idea is to let Spark handle all multiprocessing and scaling aspects. Since Spark can partition JDBC data sources out of the box, what remained was to prove that the data can be saved to Solr. Spark-Solr provides the needed functionality.

Environment Preparation:

  • Set up Solr Cloud on Docker (one node in cloud mode is sufficient; create a collection “test”)
  • Run Oracle 18c on Docker
  • Run a Spark cluster on Docker (I used these Spark images)

Copy the JDBC driver and the spark-solr jar to the Spark master and worker:

docker cp ojdbc8.jar spark-master:/
docker cp spark-solr-3.6.4-shaded.jar spark-master:/
docker cp ojdbc8.jar spark-worker-1:/
docker cp spark-solr-3.6.4-shaded.jar spark-worker-1:/

Start the Spark shell (I use pyspark) and load the data into a data frame

docker exec -it spark-master /spark/bin/pyspark --jars /ojdbc8.jar,/spark-solr-3.6.4-shaded.jar

# Read the Oracle table test.emp over JDBC into a DataFrame
empDF = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:oracle:thin:test/test@//") \
    .option("dbtable", "test.emp") \
    .option("user", "test") \
    .option("password", "****") \
    .option("driver", "oracle.jdbc.driver.OracleDriver") \
    .load()

Now save the data to Solr Cloud
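
A minimal sketch of the write, assuming spark-solr's "solr" data source with its zkhost and collection options. The helper name save_to_solr and the ZooKeeper address zookeeper:2181 are hypothetical and not taken from my setup. No save mode is set here; whether your spark-solr version needs one is worth checking.

```python
def save_to_solr(df, zkhost, collection):
    """Write a Spark DataFrame to a Solr Cloud collection via spark-solr.

    zkhost:     ZooKeeper connect string of the Solr Cloud (assumption)
    collection: name of the target collection, e.g. "test"
    """
    (df.write
       .format("solr")                    # data source registered by spark-solr
       .option("zkhost", zkhost)
       .option("collection", collection)
       .save())


# In the pyspark shell (the ZooKeeper address is a placeholder):
# save_to_solr(empDF, "zookeeper:2181", "test")
```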


And the data is in Solr Cloud.

In a test with more data I had to add the option .option("commit_within", "5000").
I could not find an explicit commit in spark-solr.
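
To show where that option sits, here is a small sketch that builds the option map for a spark-solr write. The helper name solr_write_options and the ZooKeeper address are hypothetical; zkhost, collection and commit_within are the spark-solr option names as I understand them.

```python
def solr_write_options(zkhost, collection, commit_within_ms=None):
    """Build the option map for a spark-solr write.

    commit_within_ms asks Solr to commit added documents within that
    many milliseconds; it is left out of the map when None.
    """
    opts = {"zkhost": zkhost, "collection": collection}
    if commit_within_ms is not None:
        # Spark data source options are passed as strings
        opts["commit_within"] = str(commit_within_ms)
    return opts


# In the pyspark shell (the ZooKeeper address is a placeholder):
# empDF.write.format("solr") \
#     .options(**solr_write_options("zookeeper:2181", "test", 5000)) \
#     .save()
```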

This solution should scale well on a Spark cluster by partitioning the data on the JDBC side. Transformations can be added in Spark. Such a solution would be much less complex and easier to maintain than a Python multiprocessing solution (whoever has used Python multiprocessing may know what I mean).
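
The JDBC-side partitioning could look roughly like this. It uses Spark's standard JDBC options partitionColumn, lowerBound, upperBound and numPartitions; the helper name jdbc_partition_options, the column EMPNO and the bounds are hypothetical and only serve as an illustration.

```python
def jdbc_partition_options(column, lower_bound, upper_bound, num_partitions):
    """Build the Spark JDBC options that split a table read into ranges.

    The partition column must be numeric, date or timestamp. The bounds
    only determine the partition stride; they do not filter rows.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if lower_bound >= upper_bound:
        raise ValueError("lower_bound must be smaller than upper_bound")
    return {
        "partitionColumn": column,
        "lowerBound": str(lower_bound),
        "upperBound": str(upper_bound),
        "numPartitions": str(num_partitions),
    }


# In the pyspark shell, added to the existing JDBC read
# (column name and bounds are placeholders):
# empDF = spark.read.format("jdbc") \
#     .option("url", "jdbc:oracle:thin:test/test@//") \
#     .option("dbtable", "test.emp") \
#     .option("user", "test") \
#     .option("password", "****") \
#     .option("driver", "oracle.jdbc.driver.OracleDriver") \
#     .options(**jdbc_partition_options("EMPNO", 1, 1000000, 8)) \
#     .load()
```

Each partition becomes its own query against Oracle, so numPartitions also caps the number of parallel JDBC connections.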

If you have used a similar setup, please share your experiences.