Reading data from an Oracle database with Spark can be done with the following steps.
1. Get the JDBC Thin Driver
Download the proper driver: ojdbc6.jar for Oracle 11.2 or ojdbc7.jar for Oracle 12c.
Check the compatibility of the drivers and their supported JDK versions.
2. Invoke spark-shell with the --jars option
spark-shell --jars "/home/work/ODBC/ojdbc7.jar"
3. Read the Oracle table through the Spark read function
Use Spark's read function to connect to the JDBC source with the proper connection string and options.
val empDF = spark.read.format("jdbc").
option("url", "jdbc:oracle:thin:@158.239.207.122:1521:DBSID").
option("dbtable", "employee").
option("user", "MYEMPDB").
option("password", "mypassword").
option("driver", "oracle.jdbc.driver.OracleDriver").
load()
In this example, the table employee is passed to the dbtable option. Instead of a table name, you can also pass a full-fledged query, as shown below.
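For instance, a subquery wrapped in parentheses and given an alias can be supplied to dbtable; the column list and the SALARY filter here are only illustrative, not part of the original example:
val highPaidDF = spark.read.format("jdbc").
option("url", "jdbc:oracle:thin:@158.239.207.122:1521:DBSID").
option("dbtable", "(SELECT EMPID, EMPNAME, SALARY FROM employee WHERE SALARY > 5000) emp").
option("user", "MYEMPDB").
option("password", "mypassword").
option("driver", "oracle.jdbc.driver.OracleDriver").
load()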
Ensure your connection string or URL is correct:
jdbc:oracle:thin:@host_IP:portnumber:SID
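If your database is registered under a service name rather than a SID, the thin driver also accepts the service-name form:
jdbc:oracle:thin:@//host_IP:portnumber/service_name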
This example was tested with Spark 2.4, Oracle 12c, and JDK 8.
scala> val empDF = spark.read.format("jdbc").
| option("url", "jdbc:oracle:thin:@158.234.207.122:1521:TESTSID").
| option("dbtable", "employee").
| option("user", "EMPDB").
| option("password", "Pass12@!").
| option("driver", "oracle.jdbc.driver.OracleDriver").
| load()
empDF: org.apache.spark.sql.DataFrame = [EMPID: decimal(38,10), EMPNAME: string ... 2 more fields]
scala> empDF.printSchema
root
|-- EMPID: decimal(38,10) (nullable = true)
|-- EMPNAME: string (nullable = true)
|-- TITLE: string (nullable = true)
|-- SALARY: decimal(38,10) (nullable = true)
scala> empDF.show
+---------------+-------+-------+---------------+
| EMPID|EMPNAME| TITLE| SALARY|
+---------------+-------+-------+---------------+
|1002.0000000000| Rahul| WK|2000.0000000000|
|1002.0000000000| Raina| VC|2000.0000000000|
|1001.0000000000| Dhoni|Captain|9000.0000000000|
+---------------+-------+-------+---------------+
scala>
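For larger tables, the same read can be split across executors with Spark's JDBC partitioning options (partitionColumn, lowerBound, upperBound, numPartitions). A minimal sketch, assuming EMPID is a reasonably even numeric key; the bounds and partition count below are illustrative:
val empPartitionedDF = spark.read.format("jdbc").
option("url", "jdbc:oracle:thin:@158.239.207.122:1521:DBSID").
option("dbtable", "employee").
option("user", "MYEMPDB").
option("password", "mypassword").
option("driver", "oracle.jdbc.driver.OracleDriver").
option("partitionColumn", "EMPID").
option("lowerBound", "1000").
option("upperBound", "2000").
option("numPartitions", "4").
load()
Note that lowerBound and upperBound only control how the EMPID ranges are split across partitions; they do not filter the rows that are read.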