Ways to Create Spark RDD

RDDs (Resilient Distributed Datasets) can be created in several different ways.

Reading data from different sources

Text file RDDs can be created using SparkContext’s textFile method. This method takes a URI for the file (either a local path on the machine, or an hdfs://, s3a://, etc. URI) and reads it as a collection of lines.

val rdd1 = sc.textFile("hdfs://...")
//From Unix File path use file:/// as URI
val rdd2 = sc.textFile("file:///home/user1/Filename.txt") 
val rdd3 = sc.textFile("s3://...") 

If no scheme is given in the path, textFile reads from the filesystem configured as the default (fs.defaultFS), which is HDFS on a typical cluster.

scala> val emp = sc.textFile("/data/emp_data.txt")
emp: org.apache.spark.rdd.RDD[String] = /data/emp_data.txt MapPartitionsRDD[1] at textFile at <console>:24
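textFile also accepts an optional second argument, minPartitions, which sets a lower bound on how many partitions Spark splits the file into. The following is a minimal self-contained sketch; the temporary file and the local[2] master are assumptions for illustration, not part of the original example.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Write a tiny temp file so the snippet runs anywhere (illustrative setup).
val tmp = Files.createTempFile("spark-demo", ".txt")
Files.write(tmp, "alpha\nbeta\ngamma\n".getBytes)

// Local SparkSession so the snippet works outside spark-shell.
val spark = SparkSession.builder()
  .master("local[2]")
  .appName("textfile-demo")
  .getOrCreate()
val sc = spark.sparkContext

// Second argument: minimum number of partitions for the resulting RDD.
val lines = sc.textFile(tmp.toUri.toString, 2)
println(lines.count())   // 3
println(lines.first())   // alpha

spark.stop()
```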

From a parallelized collection

RDDs can be created by calling the parallelize method of SparkContext. In the Spark shell, sc is shorthand for spark.sparkContext.

val rdd4 = sc.parallelize(Seq("Good Morning", "Happy Birthday"))  
val rdd5 = spark.sparkContext.parallelize(List(1,2,3,4,5),3)
val rdd6 = sc.parallelize(Range(1, 10000))
val rddDoubles = sc.parallelize(Array(2.0, 1.0, 2.1, 5.4))
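The optional second argument to parallelize (used in the rdd5 example above) sets the number of partitions. A self-contained sketch showing how to check it, assuming a local SparkSession rather than the shell:

```scala
import org.apache.spark.sql.SparkSession

// Local SparkSession so the snippet runs outside spark-shell.
val spark = SparkSession.builder()
  .master("local[2]")
  .appName("parallelize-demo")
  .getOrCreate()
val sc = spark.sparkContext

// Ask for 4 partitions explicitly.
val nums = sc.parallelize(1 to 100, 4)
println(nums.getNumPartitions)  // 4
println(nums.sum())             // 5050.0

spark.stop()
```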

From existing Apache Spark RDDs

Transform an existing RDD with the map method. Here we convert an RDD[Int] to an RDD[String].

scala> val rdd6 = sc.parallelize(Range(1, 10))
rdd6: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[11] at parallelize at <console>:24

scala> rdd6.map(x => x.toString) //returns an RDD[String]
res0: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[12] at map at <console>:26
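Transformations can also be chained; each one returns a new RDD and leaves its parent unchanged. A short sketch, assuming the same spark-shell session where sc is already defined (the values and names here are illustrative):

```scala
// Each transformation yields a new RDD; 'base' itself is never modified.
val base   = sc.parallelize(1 to 10)
val evens  = base.filter(_ % 2 == 0)       // RDD[Int]: 2, 4, 6, 8, 10
val labels = evens.map(n => s"item-$n")    // RDD[String]: item-2, item-4, ...
val tokens = labels.flatMap(_.split("-"))  // RDD[String]: item, 2, item, 4, ...
```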

Converting Datasets or DataFrames to RDDs

The rdd method can be used to convert a Dataset or DataFrame to an RDD.

val df = spark.read.csv("hdfs:///data/Employee.csv")
val rdd7 = df.rdd
val emprdd = spark.sql("select * from Employee").map(x => x.toString).rdd
val dataRDD = spark.read.json("path/of/json/file").rdd
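Note that df.rdd produces an RDD[Row], whose fields are read back by position or name. To make the round trip concrete, here is a self-contained sketch; the SparkSession setup and the name/age columns are illustrative assumptions, not from the examples above:

```scala
import org.apache.spark.sql.SparkSession

// Local SparkSession so the snippet runs outside spark-shell.
val spark = SparkSession.builder()
  .master("local[2]")
  .appName("df-to-rdd")
  .getOrCreate()
import spark.implicits._  // enables toDF on local collections

// Build a small DataFrame in memory (columns are illustrative).
val df = Seq(("Alice", 30), ("Bob", 25)).toDF("name", "age")

val rowRDD = df.rdd                      // RDD[Row]
val names  = rowRDD.map(_.getString(0))  // read the first column of each Row

println(names.collect().mkString(", "))  // Alice, Bob

spark.stop()
```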
