SPARK REDUCE VS REDUCEBYKEY

Spark RDD reduce is an aggregate action used to aggregate the elements of a dataset, for example to calculate the min, max, or sum of the elements in a dataset (a short sketch follows the notes below).

SYNTAX: def reduce(f: (T, T) => T): T

  • The argument is a commutative and associative function
  • The parameter function should take two arguments (a binary operator) of the same data type
  • The return type of the function must also be the same as the argument types
  • The aggregation function should have these two properties
    • Commutative: A + B = B + A
    • Associative: (A + B) + C = A + (B + C)
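
For example, a minimal Scala sketch (assuming an existing SparkContext named sc, as in spark-shell) that uses reduce with commutative and associative functions to compute the sum and the minimum of a dataset:

// Minimal sketch, assuming a SparkContext named `sc` is already available.
val numbers = sc.parallelize(Seq(3, 1, 4, 1, 5, 9, 2, 6))

// Sum of all elements: `_ + _` is commutative and associative.
val sum = numbers.reduce(_ + _)                       // 31

// Minimum element: math.min is also commutative and associative.
val min = numbers.reduce((a, b) => math.min(a, b))    // 1

println(s"sum = $sum, min = $min")

Note that both results are plain values on the driver, not RDDs, because reduce is an action.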

Spark RDD reduceByKey is a transformation which merges the values for each key using an associative and commutative reduce function.

SYNTAX: def reduceByKey(func: (V, V) => V): RDD[(K, V)]
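
For example, a minimal Scala sketch (again assuming a SparkContext named sc) that sums the values per key with reduceByKey:

// Minimal sketch, assuming a SparkContext named `sc` is already available.
val sales = sc.parallelize(Seq(("apple", 2), ("banana", 3), ("apple", 5), ("banana", 1)))

// Merge the values for each key with an associative, commutative function.
val totals = sales.reduceByKey(_ + _)    // RDD[(String, Int)]

totals.collect().foreach(println)        // e.g. (apple,7), (banana,4)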

REDUCE | REDUCEBYKEY
Action | Transformation
reduce must pull the entire dataset down into a single location because it is reducing to one final value. | reduceByKey reduces to one value for each key.
reduce() is a function that operates on an RDD of objects. | reduceByKey() is a function that operates on an RDD of key-value pairs.
reduce() is a member of the RDD[T] class. | reduceByKey() is a member of the PairRDDFunctions[K, V] class.
The output is a collection, not an RDD, and is not added to the DAG; reduce cannot result in an RDD simply because its output is a single value. | The output is an RDD and is added to the DAG.
def reduce(f: (T, T) => T): T | def reduceByKey(func: (V, V) => V): RDD[(K, V)]
reduce is an action which aggregates the elements of the dataset using a function func (which takes two arguments and returns one). | reduceByKey, when called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V, V) => V.
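
The contrast in the table can be seen directly in code. In the sketch below (building on the same assumed SparkContext sc), reduce runs a job immediately and returns a plain value to the driver, while reduceByKey only builds a new RDD in the DAG and nothing executes until an action such as collect is called:

// Sketch only, assuming a SparkContext named `sc`.
val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 10)))

// Action: runs a job and returns a single value to the driver.
val grandTotal: Int = pairs.map(_._2).reduce(_ + _)     // 13

// Transformation: lazily adds a new RDD to the DAG; nothing executes yet.
val perKey: org.apache.spark.rdd.RDD[(String, Int)] = pairs.reduceByKey(_ + _)

// Only this action triggers the reduceByKey stage.
perKey.collect().foreach(println)                       // e.g. (a,3), (b,10)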
