Spark RDD reduce() is an action that aggregates the elements of a dataset, for example to calculate the min, max, or sum of the elements; a short example follows the property list below.
SYNTAX: def reduce(f: (T, T) => T): T
- The argument is a commutative and associative function.
- The parameter function must take two arguments (a binary operator) of the same data type.
- The return type of the function must also be the same as the argument types.
- The aggregation function must have these two properties:
  - Commutative: A + B = B + A
  - Associative: (A + B) + C = A + (B + C)
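Here is a minimal sketch of reduce() on a numeric RDD, assuming a local SparkSession; the application name and the sample values are just for illustration:

```scala
import org.apache.spark.sql.SparkSession

object ReduceExample {
  def main(args: Array[String]): Unit = {
    // Local session just for this sketch; any existing SparkContext works the same way.
    val spark = SparkSession.builder().appName("ReduceExample").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val numbers = sc.parallelize(List(3, 7, 1, 9, 4))

    // Sum: the binary operator (_ + _) is commutative and associative.
    val sum = numbers.reduce(_ + _)                      // 24

    // Max: math.max is also commutative and associative.
    val max = numbers.reduce((a, b) => math.max(a, b))   // 9

    println(s"sum = $sum, max = $max")
    spark.stop()
  }
}
```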
Spark RDD reduceByKey() is a transformation that merges the values for each key using an associative and commutative reduce function.
SYNTAX : def reduceByKey(func: (V, V) ⇒ V): RDD[(K, V)]
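A minimal sketch of reduceByKey() on a pair RDD, assuming the same sc as in the reduce sketch above; the ("apple", n) records are made-up sample data:

```scala
// Pair RDD of (key, value) records -- sample data for illustration only.
val sales = sc.parallelize(Seq(("apple", 2), ("banana", 5), ("apple", 3), ("banana", 1)))

// Merge the values for each key with an associative (and commutative) function.
val totals = sales.reduceByKey(_ + _)

totals.collect().foreach(println)
// (apple,5)
// (banana,6)
```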
| REDUCE | REDUCEBYKEY |
|---|---|
| Action | Transformation |
| reduce() must pull the entire dataset down to a single location (the driver) because it reduces everything to one final value. | reduceByKey() produces one reduced value for each key. |
| reduce() operates on an RDD of objects. | reduceByKey() operates on an RDD of key-value pairs. |
| reduce() is a member of the RDD[T] class. | reduceByKey() is a member of the PairRDDFunctions[K, V] class. |
| The output is a single value, not an RDD, and is not added to the DAG; reduce cannot result in an RDD because its output is a single value. | The output is an RDD and is added to the DAG. |
| def reduce(f: (T, T) => T): T | def reduceByKey(func: (V, V) => V): RDD[(K, V)] |
| reduce is an action that aggregates the elements of the dataset using a function func, which takes two arguments and returns one. | reduceByKey, when called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V, V) => V. |
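To make the contrast concrete, here is a small side-by-side sketch (again assuming the sc from the first example): reduce collapses everything to one value on the driver, while reduceByKey returns a new RDD that becomes part of the DAG.

```scala
import org.apache.spark.rdd.RDD

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// reduce is an action: it returns a single value to the driver, not an RDD.
val total: Int = pairs.map(_._2).reduce(_ + _)            // 6

// reduceByKey is a transformation: it returns a new RDD with one value per key.
val perKey: RDD[(String, Int)] = pairs.reduceByKey(_ + _)
println(perKey.toDebugString)                             // lineage added to the DAG
perKey.collect().foreach(println)                         // (a,4), (b,2)
```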