Spark DataFrame Select Columns

Spark's select method selects specific columns from a DataFrame. It is a transformation, so it is lazily evaluated and returns a new DataFrame. You can pass either column names (strings) or column expressions as arguments.
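A minimal sketch of the lazy behaviour described above. It is written as a standalone Scala script; the `local[*]` master, the app name, and the `people` DataFrame are assumptions for illustration (in spark-shell, `spark` already exists and the session lines can be skipped).

```scala
import org.apache.spark.sql.SparkSession

// Local session for the sketch; spark-shell already provides `spark`.
val spark = SparkSession.builder().master("local[*]").appName("select-lazy").getOrCreate()
import spark.implicits._

// Hypothetical sample data.
val people = Seq(("Mihir", "Bangalore", 36), ("Ranjan", "Delhi", 25)).toDF("name", "city", "age")

// select is a transformation: it only describes a new DataFrame.
val names = people.select("name", "city")

// The schema of the result is known without running a job.
println(names.columns.mkString(", "))

// An action such as collect() or show() triggers the computation.
val rows = names.collect()
println(rows.length)
```

The original `people` DataFrame is left unchanged; `select` always returns a new one.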

Column names (org.apache.spark.sql.ColumnName) or column objects (org.apache.spark.sql.Column) can be written in the following ways.

  • df("columnName"): a column on a specific df DataFrame.
  • col("columnName"): a generic column not yet associated with a DataFrame.
  • col("columnName.field"): extracts a struct field.
  • col("`a.column.with.dots`"): backticks escape . in column names.
  • $"columnName": Scala shorthand for a named column.
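The forms listed above can be tried out as follows. This is a sketch under assumptions: the `person` struct column and the column literally named `a.column.with.dots` are hypothetical, created here only so that each form has something to select.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, struct}

val spark = SparkSession.builder().master("local[*]").appName("column-forms").getOrCreate()
import spark.implicits._

// Hypothetical data: one column whose name contains dots.
val base = Seq(("Mihir", 36, "x")).toDF("name", "age", "a.column.with.dots")

// Add a hypothetical struct column called `person`.
val demo = base.select(
  col("name"),
  col("`a.column.with.dots`"),
  struct(col("name"), col("age")).as("person"))

val viaDf     = demo.select(demo("name"))                // bound to this DataFrame
val viaCol    = demo.select(col("name"))                 // generic column
val viaDollar = demo.select($"name")                     // Scala shorthand
val field     = demo.select(col("person.age"))           // struct field
val dotted    = demo.select(col("`a.column.with.dots`")) // escaped dots
```

Without the backticks, `col("a.column.with.dots")` would be read as a path into nested struct fields rather than as a single column name.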

Let's create a DataFrame from a sequence.

val df = Seq(("Mihir","Bangalore",36),
             ("Ranjan","Delhi",25),
             ("Prakash","Chennai",30)).toDF("name","city","age")

Select all the columns from the dataframe

+-------+---------+---+
|   name|     city|age|
+-------+---------+---+
|  Mihir|Bangalore| 36|
| Ranjan|    Delhi| 25|
|Prakash|  Chennai| 30|
+-------+---------+---+

The same result can be achieved with several equivalent statements.
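A sketch of a few equivalent ways to select every column, assuming the same sample data as above. The variable names `all1` to `all4` are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("select-all").getOrCreate()
import spark.implicits._

val df = Seq(("Mihir", "Bangalore", 36),
             ("Ranjan", "Delhi", 25),
             ("Prakash", "Chennai", 30)).toDF("name", "city", "age")

val all1 = df.select("*")                           // star as a string
val all2 = df.select($"*")                          // star as a ColumnName
val all3 = df.select(col("*"))                      // star as a Column
val all4 = df.select(df.columns.toSeq.map(col): _*) // every column name, expanded as varargs

all1.show()
```

The last form is handy when the column list is computed at runtime, for example after filtering `df.columns`.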


Select specific columns from the dataframe

scala> df.select($"name",$"city",$"age").show
+-------+---------+---+
|   name|     city|age|
+-------+---------+---+
|  Mihir|Bangalore| 36|
| Ranjan|    Delhi| 25|
|Prakash|  Chennai| 30|
+-------+---------+---+

The same result can be obtained with these statements.

  • df.select($"name",$"city",$"age").show

Make sure that all the arguments in a single select call are of the same type: pass either all strings or all Column objects, and do not mix the two.
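The rule above can be sketched as follows, on the assumption that "same type" means strings versus Column objects: select has one overload that takes only strings and another that takes only Columns.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("select-arg-types").getOrCreate()
import spark.implicits._

val df = Seq(("Mihir", "Bangalore", 36),
             ("Ranjan", "Delhi", 25),
             ("Prakash", "Chennai", 30)).toDF("name", "city", "age")

// All arguments are strings.
val byName = df.select("name", "city")

// All arguments are Column objects; $"...", col(...) and df(...) can be
// combined because each of them produces a Column.
val byColumn = df.select($"name", col("city"), df("age"))

// Mixing a String with a Column does not compile, because no select
// overload accepts both:
// df.select("name", $"city")

// Wrapping the string in col(...) makes every argument a Column again.
val fixed = df.select(col("name"), $"city")
```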

Create a new column from an existing numeric column with an arithmetic operation

scala> val df2 = df.select($"name", ($"age"+5) as "Age5")
df2: org.apache.spark.sql.DataFrame = [name: string, Age5: int]

+-------+----+
|   name|Age5|
+-------+----+
|  Mihir|  41|
| Ranjan|  30|
|Prakash|  35|
+-------+----+

+-------+---------+
|   name|(age + 5)|
+-------+---------+
|  Mihir|       41|
| Ranjan|       30|
|Prakash|       35|
+-------+---------+
scala> val newdf = df.select('name,('age+5).as("NewAge"))
newdf: org.apache.spark.sql.DataFrame = [name: string, NewAge: int]
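The two outputs above differ only in the column name. A sketch of why, assuming the same sample data: without an alias Spark names the column after the expression, while `as` (or `alias`) sets the name explicitly. The names `defaultName` and `aliased` are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("select-arithmetic").getOrCreate()
import spark.implicits._

val df = Seq(("Mihir", "Bangalore", 36),
             ("Ranjan", "Delhi", 25),
             ("Prakash", "Chennai", 30)).toDF("name", "city", "age")

// No alias: the column is named after the expression, "(age + 5)".
val defaultName = df.select($"name", $"age" + 5)
println(defaultName.columns.mkString(", "))

// With an alias: the column is named "Age5".
val aliased = df.select($"name", ($"age" + 5).as("Age5"))
aliased.show()
```

Note that the symbol syntax (`'name`) used in the last REPL statement is deprecated in Scala 2.13, so `$"name"` or `col("name")` is the safer choice in new code.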

Select a column by concatenating two columns

+------------------+
|concat(name, city)|
+------------------+
|    MihirBangalore|
|       RanjanDelhi|
|    PrakashChennai|
+------------------+
scala> // Join two columns with "-"
scala> val df3 = df.select(concat(col("name"),lit("-"),col("city")).as("NameCity"))
df3: org.apache.spark.sql.DataFrame = [NameCity: string]
scala> // Join with a space using the concat function
scala> val df4 = df.select(concat($"name",lit(" "),$"city").as("NewCol"))
df4: org.apache.spark.sql.DataFrame = [NewCol: string]
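As a self-contained sketch of the concatenation above: `concat` takes the separator as a `lit` column, while `concat_ws` (a different function from `org.apache.spark.sql.functions`, shown here only as an alternative) takes the separator as its first argument. The variable names are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat, concat_ws, lit}

val spark = SparkSession.builder().master("local[*]").appName("select-concat").getOrCreate()
import spark.implicits._

val df = Seq(("Mihir", "Bangalore", 36),
             ("Ranjan", "Delhi", 25),
             ("Prakash", "Chennai", 30)).toDF("name", "city", "age")

// concat: the separator is passed as a literal column.
val viaConcat = df.select(concat(col("name"), lit("-"), col("city")).as("NameCity"))

// concat_ws: the separator comes first, followed by the columns to join.
val viaConcatWs = df.select(concat_ws("-", col("name"), col("city")).as("NameCity"))

viaConcat.show()
```

One difference to keep in mind: `concat` returns null when any input is null, whereas `concat_ws` skips null inputs.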

Creating Column and ColumnName Objects

scala> val idCol = $"id"
idCol: org.apache.spark.sql.ColumnName = id

scala> // Avoid naming a value "col": it would shadow the col() function
scala> val nameColName = $"name"
nameColName: org.apache.spark.sql.ColumnName = name

scala> val nameCol = col("name")
nameCol: org.apache.spark.sql.Column = name
