After writing a multiprocessing Python script, scalable across multiple Docker containers, to load billions of records from an Oracle RDBMS into Solr Cloud, I started to think about simpler solutions.
The basic idea is to let Spark handle all multiprocessing and scaling aspects. Since Spark can partition JDBC data sources out of the box, all that remained was to prove that the data can be saved to Solr. spark-solr provides the needed functionality.
- Setup Solr Cloud on Docker (one node in cloud mode is sufficient, create a collection “test”)
- Run Oracle 18c on Docker
- Run a Spark cluster on Docker (I used these Spark images)
Copy the JDBC driver and the spark-solr JAR to the Spark master and worker:
docker cp ojdbc8.jar spark-master:/
docker cp spark-solr-3.6.4-shaded.jar spark-master:/
docker cp ojdbc8.jar spark-worker-1:/
docker cp spark-solr-3.6.4-shaded.jar spark-worker-1:/
Start the Spark shell (I use pyspark) and load the data into a data frame:
docker exec -it spark-master /spark/bin/pyspark --jars /ojdbc8.jar,/spark-solr-3.6.4-shaded.jar
# Read the Oracle table through Spark's JDBC data source
empDF = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:oracle:thin:test/test@//192.168.1.12:1521/demo01") \
    .option("dbtable", "test.emp") \
    .option("user", "test") \
    .option("password", "****") \
    .option("driver", "oracle.jdbc.driver.OracleDriver") \
    .load()
Now save the data to Solr Cloud.
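The write call itself is not shown here. A minimal sketch of what it might look like with spark-solr follows. It assumes a ZooKeeper address of 192.168.1.12:9983 (hypothetical) and the collection "test" created earlier. The helper names are made up for illustration, while zkhost, collection, gen_uniq_key and commit_within are options documented by spark-solr:

```python
def solr_write_options(zkhost, collection, commit_within_ms=None):
    """Build the option map for a spark-solr write (all values as strings)."""
    opts = {
        "zkhost": zkhost,          # ZooKeeper connect string of the Solr Cloud
        "collection": collection,  # target Solr collection
        "gen_uniq_key": "true",    # let spark-solr generate ids if rows have none
    }
    if commit_within_ms is not None:
        # Ask Solr to commit added documents within this many milliseconds
        opts["commit_within"] = str(commit_within_ms)
    return opts


def save_to_solr(df, opts):
    """Write a DataFrame to Solr through the spark-solr data source."""
    df.write.format("solr").options(**opts).mode("overwrite").save()


# In the pyspark shell started above (the zkhost value is an assumption):
# save_to_solr(empDF, solr_write_options("192.168.1.12:9983", "test"))
```

Keeping the options in a plain dictionary makes it easy to add commit_within later without touching the write call.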
And the data is in Solr Cloud.
In a test with more data I had to add the option .option("commit_within", "5000"), which asks Solr to commit the added documents within 5000 ms.
I could not find a way to trigger an explicit commit in spark-solr.
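One possible workaround, outside of spark-solr, is to ask Solr for a commit over HTTP once the Spark job has finished. This is a minimal sketch, not something I have run as part of this setup. It assumes Solr is reachable at 192.168.1.12:8983 (hypothetical address), uses made-up helper names, and relies on Solr's update handler accepting the commit=true parameter:

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def commit_url(solr_base, collection):
    """Build the update-handler URL that triggers a hard commit."""
    query = urlencode({"commit": "true"})
    return f"{solr_base.rstrip('/')}/{collection}/update?{query}"


def commit(solr_base, collection, timeout=30):
    """Send the commit request and return the HTTP status code."""
    with urlopen(commit_url(solr_base, collection), timeout=timeout) as resp:
        return resp.status


# After the Spark write has finished (base URL is an assumption):
# commit("http://192.168.1.12:8983/solr", "test")
```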
This solution should scale well on a Spark cluster by partitioning the data on the JDBC side. Transformations can be added in Spark. Such a solution would also be much less complex and easier to maintain than a Python multiprocessing solution (whoever has used Python multiprocessing may know what I mean).
If you have used a similar setup, please share your experiences.
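To make the partitioning on the JDBC side concrete, here is a minimal sketch of the extra read options. partitionColumn, lowerBound, upperBound, numPartitions and fetchsize are standard Spark JDBC options. The numeric column EMPNO, its bounds, the partition count and the helper name are assumptions for illustration only:

```python
def partitioned_jdbc_options(column, lower, upper, num_partitions, fetch_size=10000):
    """Options that make Spark split one JDBC table read into parallel range queries."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if lower >= upper:
        raise ValueError("lower must be smaller than upper")
    return {
        "partitionColumn": column,             # numeric, date or timestamp column
        "lowerBound": str(lower),              # used only to compute the stride
        "upperBound": str(upper),              # used only to compute the stride
        "numPartitions": str(num_partitions),  # max parallel JDBC connections
        "fetchsize": str(fetch_size),          # rows fetched per round trip
    }


# In the pyspark shell, on top of the connection options used earlier
# (column name and bounds are hypothetical):
# reader = spark.read.format("jdbc").option("url", url).option("dbtable", "test.emp")
# empDF = reader.options(**partitioned_jdbc_options("EMPNO", 1, 1000000, 8)).load()
```

Note that the bounds do not filter rows. They only decide how the value range is cut into partitions, so rows outside the bounds still end up in the first or last partition.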