Spark RDD reduce is an action that aggregates the elements of a dataset into a single value, e.g. to calculate the min, max, or sum of its elements.
SYNTAX: def reduce(f: (T, T) => T): T
- The argument is a commutative and associative function.
- The function takes two arguments (it is a binary operator) of the same data type.
- The return type of the function must also be the same as the argument types.
- An aggregation function passed to reduce should have these two properties:
  - Commutative: A + B = B + A
  - Associative: (A + B) + C = A + (B + C)
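A plain-Python sketch of these properties (not Spark code; `functools.reduce` is the local analogue of the RDD `reduce` action, and the data list here is made up for illustration):

```python
from functools import reduce

data = [4, 1, 7, 3]

# Addition is commutative and associative, so it is a valid reduce function.
total = reduce(lambda a, b: a + b, data)

# min- and max-style functions are likewise commutative and associative.
smallest = reduce(lambda a, b: a if a < b else b, data)
largest = reduce(lambda a, b: a if a > b else b, data)

print(total, smallest, largest)  # 15 1 7
```

Because the function is commutative and associative, Spark is free to reduce each partition independently and combine the partial results in any order, which is why these two properties are required.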
Spark RDD reduceByKey is a transformation that merges the values for each key using an associative and commutative reduce function.
SYNTAX: def reduceByKey(func: (V, V) => V): RDD[(K, V)]
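The per-key merging can be sketched in plain Python (this is an analogue of the semantics, not Spark's implementation; the `reduce_by_key` helper and the sample pairs are made up for illustration):

```python
def reduce_by_key(pairs, func):
    """Plain-Python analogue of RDD.reduceByKey: merge the values for each
    key with func, which must be associative and commutative."""
    merged = {}
    for k, v in pairs:
        # Fold each value into the running result for its key; the first
        # value seen for a key seeds the result, so no zero element is needed.
        merged[k] = func(merged[k], v) if k in merged else v
    return merged

pairs = [("a", 1), ("b", 2), ("a", 3), ("b", 4)]
counts = reduce_by_key(pairs, lambda x, y: x + y)
print(counts)  # {'a': 4, 'b': 6}
```

In real Spark, this folding happens within each partition first (a map-side combine) and the partial results are then merged across partitions, which is why the function must be associative and commutative.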
| reduce() | reduceByKey() |
| --- | --- |
| Must bring the result down to a single location (the driver) because it reduces the whole dataset to one final value. | Reduces to one merged value for each key. |
| Operates on an RDD of objects. | Operates on an RDD of key-value pairs. |
| Is a member of the RDD[T] class. | Is a member of the PairRDDFunctions[K, V] class. |
| The output is a single value, not an RDD, and is not added to the DAG. | The output is an RDD and is added to the DAG. |
| def reduce(f: (T, T) => T): T | def reduceByKey(func: (V, V) => V): RDD[(K, V)] |
| Is an action: it aggregates the elements of the dataset using a function func that takes two arguments and returns one. | Is a transformation: when called on a dataset of (K, V) pairs, it returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V, V) => V. |

reduce cannot result in an RDD simply because its output is a single value.
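The contrast in the table can be seen side by side on the same data. This is a plain-Python sketch of the semantics (not Spark code; the word list is made up for illustration): reduce-style aggregation collapses everything to one value, while reduceByKey-style aggregation keeps one value per key.

```python
from functools import reduce

words = ["spark", "rdd", "spark", "reduce", "rdd", "spark"]
pairs = [(w, 1) for w in words]

# reduce-style: collapse all counts to one final value (total word count).
total = reduce(lambda a, b: a + b, (c for _, c in pairs))

# reduceByKey-style: merge counts per key (count per distinct word).
counts = {}
for w, c in pairs:
    counts[w] = counts[w] + c if w in counts else c

print(total)   # 6
print(counts)  # {'spark': 3, 'rdd': 2, 'reduce': 1}
```

This mirrors the table: the first result is a single value (not an RDD in Spark terms), while the second is again a keyed collection that further transformations could operate on.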