RDDs (Resilient Distributed Datasets) can be created in several different ways.
Reading data from different sources
Text file RDDs can be created using SparkContext’s textFile method. This method takes a URI for the file (either a local path on the machine, or an hdfs://, s3a://, etc. URI) and reads it as a collection of lines.
val rdd1 = sc.textFile("hdfs://...")
// For a local (Unix) file path, use file:/// as the URI
val rdd2 = sc.textFile("file:///home/user1/Filename.txt")
val rdd3 = sc.textFile("s3a://...")
If no URI scheme is given, textFile reads from the default filesystem configured for Hadoop (fs.defaultFS), which is typically HDFS on a cluster.
scala> val emp = sc.textFile("/data/emp_data.txt")
emp: org.apache.spark.rdd.RDD[String] = /data/emp_data.txt MapPartitionsRDD[1] at textFile at <console>:24
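textFile also takes an optional second argument: a minimum number of partitions for the resulting RDD. A minimal sketch, reusing the emp_data.txt path from above:

// Ask Spark for at least 10 partitions when reading the file
val emp10 = sc.textFile("/data/emp_data.txt", 10)
emp10.getNumPartitions // at least 10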
From a parallelized collection.
RDDs can be created using the parallelize method of SparkContext. In the Spark shell, sc is shorthand for spark.sparkContext.
val rdd4 = sc.parallelize(Seq("Good Morning", "Happy Birthday"))
val rdd5 = spark.sparkContext.parallelize(List(1, 2, 3, 4, 5), 3) // second argument = number of partitions
val rdd6 = sc.parallelize(Range(1, 10000))
val rdd7 = sc.parallelize(Array(2.0, 1.0, 2.1, 5.4))
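When the partition count is omitted, Spark falls back to spark.default.parallelism. A quick check in the shell, assuming the definitions above:

rdd5.getNumPartitions // 3, as requested
rdd4.getNumPartitions // default, taken from spark.default.parallelism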
From existing Apache Spark RDDs.
Transform an existing RDD with the map method. Here we convert an RDD[Int] into an RDD[String].
scala> val rdd6 = sc.parallelize(Range(1, 10))
rdd6: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[11] at parallelize at <console>:24
scala> rdd6.map(x => x.toString) //returns an RDD[String]
res0: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[12] at map at <console>:26
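map is not the only transformation that yields a new RDD from an existing one; filter and flatMap behave the same way. A short sketch reusing rdd6 and rdd4 from above:

val evens = rdd6.filter(x => x % 2 == 0) // RDD[Int] keeping only even numbers
val words = rdd4.flatMap(line => line.split(" ")) // RDD[String] of individual words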
Converting Datasets or DataFrames to RDDs
The rdd method converts a Dataset or DataFrame into an RDD; for a DataFrame the result is an RDD[Row].
val df = spark.read.csv("hdfs:///data/Employee.csv")
val rdd8 = df.rdd
df.createOrReplaceTempView("Employee")
val emprdd = spark.sql("select * from Employee").map(x => x.toString).rdd // RDD[String]; needs import spark.implicits._ outside the shell
val dataRDD = spark.read.json("path/of/json/file").rdd
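Since df.rdd returns an RDD[Row], individual columns must be pulled out of each Row. A minimal sketch, assuming the first column of Employee.csv is a string:

// Each element of rdd8 is an org.apache.spark.sql.Row
val firstCols = rdd8.map(row => row.getString(0)) // RDD[String] of the first column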