SPARK REDUCE VS REDUCEBYKEY

Spark's RDD reduce action is used to aggregate the elements of a dataset, e.g. to calculate the min or max of the elements in a dataset. SYNTAX: def reduce(f: (T, T) => T): T The argument is a commutative and associative function. The parameter function should…
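A minimal sketch of reduce in action, assuming a spark-shell session (where sc, the SparkContext, is predefined) and a small sample dataset invented for illustration:

// Assuming a spark-shell session; the sample numbers are placeholders
val numbers = sc.parallelize(Seq(4, 1, 9, 7, 3))

// reduce applies a commutative, associative function (T, T) => T
val minVal = numbers.reduce((a, b) => math.min(a, b))
val maxVal = numbers.reduce((a, b) => math.max(a, b))
// minVal = 1, maxVal = 9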

Spark write or save dataframes examples

DataFrames in Spark can be saved to different file formats with DataFrameWriter's write API. Spark supports the text, parquet, orc, and json file formats. By default it saves in the parquet file format. You can provide different compression options while saving the output. With mode you…
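A short sketch of the write API, again assuming a spark-shell session (spark, the SparkSession, is predefined); the DataFrame contents and output paths are placeholders, not taken from the original post:

// Assuming a spark-shell session; data and paths are placeholders
import org.apache.spark.sql.SaveMode
import spark.implicits._

val df = Seq(("alice", 30), ("bob", 25)).toDF("name", "age")

// Default format is parquet
df.write.mode(SaveMode.Overwrite).save("/tmp/people_parquet")

// Explicit format plus a compression option
df.write.mode(SaveMode.Overwrite).option("compression", "gzip").json("/tmp/people_json")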

Spark and Hadoop Compression Codecs

The table below lists the available compression codecs in the Spark and Hadoop ecosystems.

Compression   Fully qualified class name                                                                Alias
deflate       org.apache.hadoop.io.compress.DefaultCodec                                                deflate
gzip          org.apache.hadoop.io.compress.GzipCodec                                                   gzip
bzip2         org.apache.hadoop.io.compress.BZip2Codec                                                  bzip2
lzo           com.hadoop.compression.lzo.LzopCodec                                                      lzo
LZ4           org.apache.hadoop.io.compress.Lz4Codec, org.apache.spark.io.LZ4CompressionCodec           lz4
LZF           org.apache.spark.io.LZFCompressionCodec
Snappy        org.apache.hadoop.io.compress.SnappyCodec, org.apache.spark.io.SnappyCompressionCodec     snappy…
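As a hedged illustration of how these codecs are selected, the sketch below passes a Hadoop codec class to saveAsTextFile and notes the alias-based setting for Spark-internal compression; it assumes a spark-shell session, and the data and output paths are placeholders:

// Assuming a spark-shell session; paths and data are placeholders
import org.apache.hadoop.io.compress.{BZip2Codec, GzipCodec}

val lines = sc.parallelize(Seq("a", "b", "c"))

// Pass a Hadoop codec class when saving RDD output
lines.saveAsTextFile("/tmp/out_gzip", classOf[GzipCodec])
lines.saveAsTextFile("/tmp/out_bzip2", classOf[BZip2Codec])

// Spark-internal compression (shuffle, broadcast) is chosen by alias,
// e.g. via --conf or spark-defaults.conf: spark.io.compression.codec=lz4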

Ways to Create Spark RDD

RDDs (Resilient Distributed Datasets) can be created in many different ways. Reading data from different sources: text file RDDs can be created using SparkContext's textFile method. This method takes a URI for the file (either a local path on the machine,…
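A brief sketch of two common creation routes, assuming a spark-shell session; the input file path is a placeholder:

// Assuming a spark-shell session; the file path is a placeholder
// From an existing collection
val fromCollection = sc.parallelize(1 to 5)

// From a text file; the URI can be a local path, hdfs://, s3a://, etc.
val fromFile = sc.textFile("/tmp/input.txt")

fromCollection.count()
fromFile.getNumPartitions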