The Spark select method selects specific columns from a DataFrame. It is a transformation, so it is lazily evaluated and returns a new DataFrame. You can pass column names (strings) or column expressions as arguments.
Column names (org.apache.spark.sql.ColumnName) and column objects (org.apache.spark.sql.Column) can be written in any of these ways:
df("columnName")            | A column on the specific DataFrame df.
col("columnName")           | A generic column, not yet associated with a DataFrame.
col("columnName.field")     | Extracting a struct field.
col("`a.column.with.dots`") | Backticks escape . in column names.
$"columnName"               | Scala shorthand for a named column.
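The struct-field and escaped-dot forms are easiest to see on a small example. Here is a minimal spark-shell sketch; the people DataFrame, its addr struct, and the dotted.col name are invented purely for illustration:

import org.apache.spark.sql.functions.col

// A nested tuple becomes a struct column (fields _1, _2); one name contains a dot.
val people = Seq(("Mihir", ("Bangalore", "KA"), 36)).toDF("name", "addr", "dotted.col")

people.select(col("addr._1")).show()        // extract a field of the addr struct
people.select(col("`dotted.col`")).show()   // backticks escape the dot in the name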
Let's create a DataFrame from a sequence.
val df = Seq(("Mihir","Bangalore",36),
("Ranjan","Delhi",25), ("Prakash","Chennai",30)).toDF("name","city","age")
Select all the columns from the DataFrame:
scala> df.select("*").show
+-------+---------+---+
| name| city|age|
+-------+---------+---+
| Mihir|Bangalore| 36|
| Ranjan| Delhi| 25|
|Prakash| Chennai| 30|
+-------+---------+---+
The same result can be achieved with any of these statements:
- df.select(col("*")).show
- df.select('*).show
- df.select($"*").show
(The col function must first be imported with import org.apache.spark.sql.functions.col.)
Select specific columns from the DataFrame:
scala> df.select( $"name",$"city",$"age").show
+-------+---------+---+
| name| city|age|
+-------+---------+---+
| Mihir|Bangalore| 36|
| Ranjan| Delhi| 25|
|Prakash| Chennai| 30|
+-------+---------+---+
The same result can be obtained with any of these statements:
- df.select("name", "city", "age").show()
- df.select('name, 'city, 'age).show
- df.select($"name", $"city", $"age").show
- df.select(col("name"), col("age"), col("city")).show()
Note that all the arguments in a single select call must be of the same kind; plain string names and Column objects should not be mixed.
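A quick sketch of what this means in practice, using the df defined above; the first call will not compile, because select has separate overloads for strings and for Columns:

// df.select("name", $"age")     // does not compile: a String mixed with a Column
df.select($"name", $"age").show  // OK: all arguments are Columns
df.select("name", "age").show    // OK: all arguments are Strings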
Create a new column from an existing numeric column with an arithmetic operation:
scala> val df2 = df.select($"name", ($"age"+5) as "Age5")
df2: org.apache.spark.sql.DataFrame = [name: string, Age5: int]
scala> df2.show
+-------+----+
| name|Age5|
+-------+----+
| Mihir| 41|
| Ranjan| 30|
|Prakash| 35|
+-------+----+
scala> df.select(col("name"),col("age")+5).show
+-------+---------+
| name|(age + 5)|
+-------+---------+
| Mihir| 41|
| Ranjan| 30|
|Prakash| 35|
+-------+---------+
scala> val newdf = df.select('name,('age+5).as("NewAge"))
newdf: org.apache.spark.sql.DataFrame = [name: string, NewAge: int]
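The other arithmetic operators on Column work the same way; a brief sketch (the alias names here are made up):

df.select($"name", ($"age" * 2).as("doubleAge"), ($"age" % 10).as("ageMod10")).show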
Create a new column by concatenating two columns:
scala> import org.apache.spark.sql.functions.{concat, lit}
import org.apache.spark.sql.functions.{concat, lit}

scala> df.select(concat(col("name"),col("city"))).show
+------------------+
|concat(name, city)|
+------------------+
| MihirBangalore|
| RanjanDelhi|
| PrakashChennai|
+------------------+
scala> // join two columns with "-"
scala> val df3 = df.select(concat(col("name"),lit("-"),col("city")).as("NameCity"))
df3: org.apache.spark.sql.DataFrame = [NameCity: string]
scala> // Join with space using concat functions
scala> val df4 = df.select(concat($"name",lit(" "),$"city").as("NewCol"))
df4: org.apache.spark.sql.DataFrame = [NewCol: string]
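For reference, df3.show on the data above should print something like this (df4 would print the space-joined equivalents):

+---------------+
|       NameCity|
+---------------+
|Mihir-Bangalore|
|   Ranjan-Delhi|
|Prakash-Chennai|
+---------------+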
Creating Column and ColumnName objects
scala> val idCol = $"id"
idCol: org.apache.spark.sql.ColumnName = id
scala> val nameCol = $"name"
nameCol: org.apache.spark.sql.ColumnName = name

scala> val cityCol = col("city")
cityCol: org.apache.spark.sql.Column = city

Avoid naming a value col: doing so shadows the col function, and a later col("name") would then be interpreted as field extraction on that Column (yielding a confusing result like name[name]) instead of a call to the function.
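These objects can then be passed to select like any inline column expression; for example:

df.select(nameCol, cityCol).show   // same as df.select($"name", col("city")).show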