Spark DataFrame select columns

The Spark select method is used to select specific columns from a DataFrame. It is a transformation, so it is lazily evaluated and returns a new DataFrame. You can pass a list of column names (strings) or column expressions as arguments.

Column names (org.apache.spark.sql.ColumnName) or column objects (org.apache.spark.sql.Column) can be passed in these forms:

  • df("columnName") - a column on a specific DataFrame df
  • col("columnName") - a generic column not yet associated with a DataFrame
  • col("columnName.field") - extracting a struct field
  • col("`a.column.with.dots`") - escaping . in column names with backticks
  • $"columnName" - Scala shorthand for a named column
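These forms depend on a couple of imports. As a minimal sketch (the local SparkSession here is an assumption for illustration; spark-shell creates the session and imports the implicits for you):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumed local session for illustration; spark-shell provides one as "spark".
val spark = SparkSession.builder().master("local[*]").appName("select-demo").getOrCreate()
import spark.implicits._ // enables toDF, $"columnName" and 'columnName

val df = Seq(("Mihir", 36)).toDF("name", "age")

df("name")  // Column bound to this specific DataFrame
col("name") // generic Column, resolved against a DataFrame at analysis time
$"name"     // ColumnName built by the $ string interpolator
```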

Let's create a DataFrame from a sequence.

val df = Seq(("Mihir","Bangalore",36),
             ("Ranjan","Delhi",25),
             ("Prakash","Chennai",30)).toDF("name","city","age")

Select all the columns from the dataframe

scala>  df.select("*").show
+-------+---------+---+
|   name|     city|age|
+-------+---------+---+
|  Mihir|Bangalore| 36|
| Ranjan|    Delhi| 25|
|Prakash|  Chennai| 30|
+-------+---------+---+

Same results can be achieved with these statements.

  • df.select(col("*")).show
  • df.select('*).show
  • df.select($"*").show

Select specific columns from the dataframe

scala>  df.select( $"name",$"city",$"age").show
+-------+---------+---+
|   name|     city|age|
+-------+---------+---+
|  Mihir|Bangalore| 36|
| Ranjan|    Delhi| 25|
|Prakash|  Chennai| 30|
+-------+---------+---+

Same result can be obtained with these statements.

  • df.select("name","city","age").show()
  • df.select('name,'city,'age).show
  • df.select($"name",$"city",$"age").show
  • df.select(col("name"),col("age"),col("city")).show()

Note that all arguments to a single select call must be of the same kind: either all column names (strings) or all Column objects. They cannot be mixed in one call.
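For example, with the same df, an all-string call and an all-Column call both compile, while a mixed call does not, since select has no overload that accepts both:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumed local session for illustration; spark-shell provides one as "spark".
val spark = SparkSession.builder().master("local[*]").appName("select-demo").getOrCreate()
import spark.implicits._

val df = Seq(("Mihir","Bangalore",36)).toDF("name","city","age")

df.select("name", "city")           // OK: all column names
df.select(col("name"), col("city")) // OK: all Column objects
// df.select("name", col("city"))   // does not compile: strings and Columns mixed
```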

Create a new column from an existing numeric column with an arithmetic operation

scala> val df2 = df.select($"name", ($"age"+5) as "Age5")
df2: org.apache.spark.sql.DataFrame = [name: string, Age5: int]

scala> df2.show
+-------+----+
|   name|Age5|
+-------+----+
|  Mihir|  41|
| Ranjan|  30|
|Prakash|  35|
+-------+----+

scala> df.select(col("name"),col("age")+5).show
+-------+---------+
|   name|(age + 5)|
+-------+---------+
|  Mihir|       41|
| Ranjan|       30|
|Prakash|       35|
+-------+---------+
scala> val newdf = df.select('name,('age+5).as("NewAge"))
newdf: org.apache.spark.sql.DataFrame = [name: string, NewAge: int]
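The same derived column can also be written as a SQL expression string with selectExpr. A sketch, using the same df as above:

```scala
import org.apache.spark.sql.SparkSession

// Assumed local session for illustration; spark-shell provides one as "spark".
val spark = SparkSession.builder().master("local[*]").appName("select-demo").getOrCreate()
import spark.implicits._

val df = Seq(("Mihir","Bangalore",36),
             ("Ranjan","Delhi",25),
             ("Prakash","Chennai",30)).toDF("name","city","age")

// SQL-style expression string instead of Column arithmetic
val newdf2 = df.selectExpr("name", "age + 5 as NewAge")
```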

Create a new column by concatenating two columns

scala>  df.select(concat(col("name"),col("city"))).show
+------------------+
|concat(name, city)|
+------------------+
|    MihirBangalore|
|       RanjanDelhi|
|    PrakashChennai|
+------------------+
scala>  // join two columns with "-"
scala>  val df3 = df.select(concat(col("name"),lit("-"),col("city")).as("NameCity"))
df3: org.apache.spark.sql.DataFrame = [NameCity: string]
scala> // Join with space using concat functions
scala>  val df4 = df.select(concat($"name",lit(" "),$"city").as("NewCol"))
df4: org.apache.spark.sql.DataFrame = [NewCol: string]
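When every value is joined with the same separator, concat_ws is a shorter alternative to interleaving lit() calls. A sketch with the same df:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.concat_ws

// Assumed local session for illustration; spark-shell provides one as "spark".
val spark = SparkSession.builder().master("local[*]").appName("select-demo").getOrCreate()
import spark.implicits._

val df = Seq(("Mihir","Bangalore",36),
             ("Ranjan","Delhi",25),
             ("Prakash","Chennai",30)).toDF("name","city","age")

// concat_ws takes the separator once instead of a lit() between every pair
val df5 = df.select(concat_ws(" ", $"name", $"city").as("NewCol"))
```

Unlike concat, which returns null if any input column is null, concat_ws skips null inputs.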

Creating Column and ColumnName objects

scala> val idCol = $"id"
idCol: org.apache.spark.sql.ColumnName = id

scala> val cityCol = $"city"
cityCol: org.apache.spark.sql.ColumnName = city

scala> val nameCol = col("name")
nameCol: org.apache.spark.sql.Column = name

Avoid naming a val col: it shadows the org.apache.spark.sql.functions.col function, so a later col("name") would invoke Column.apply and try to extract a field instead of creating a column.
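Column objects built this way can be stored in vals and reused in later selects, for example:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumed local session for illustration; spark-shell provides one as "spark".
val spark = SparkSession.builder().master("local[*]").appName("select-demo").getOrCreate()
import spark.implicits._

val df = Seq(("Mihir","Bangalore",36),
             ("Ranjan","Delhi",25),
             ("Prakash","Chennai",30)).toDF("name","city","age")

// build Column objects once, reuse them in a select
val nameCol = col("name")
val agePlus5 = (col("age") + 5).as("NewAge")
val result = df.select(nameCol, agePlus5)
```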
