The Spark RDD reduce action aggregates the elements of a dataset, e.g. computing the min or max of the elements. SYNTAX: def reduce(f: (T, T) => T): T The argument must be a commutative and associative function. The parameter function should…
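A minimal sketch of reduce on a small RDD; the local SparkSession and the sample numbers are assumptions for illustration only:

    import org.apache.spark.sql.SparkSession

    object ReduceExample {
      def main(args: Array[String]): Unit = {
        // Local session and sample data are illustrative assumptions
        val spark = SparkSession.builder().appName("ReduceExample").master("local[*]").getOrCreate()
        val sc = spark.sparkContext

        val nums = sc.parallelize(Seq(4, 1, 9, 7, 3))

        // reduce applies a commutative, associative function pairwise across partitions
        val sum = nums.reduce(_ + _)
        val max = nums.reduce((a, b) => if (a > b) a else b)
        val min = nums.reduce((a, b) => if (a < b) a else b)

        println(s"sum=$sum max=$max min=$min")
        spark.stop()
      }
    }

Because the function is commutative and associative, Spark can combine partial results from each partition in any order and still get the same answer.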
DataFrames in Spark can be saved to different file formats with the DataFrameWriter's write API. Spark supports text, parquet, orc, and json file formats; by default it saves in parquet format. You can provide different compression options while saving the output. With mode you…
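A short sketch of saving a DataFrame in a few of these formats; the sample data and the output paths are assumptions:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("WriteFormats").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical sample DataFrame
    val df = Seq((1, "alice"), (2, "bob")).toDF("id", "name")

    df.write.parquet("/tmp/out/parquet")                          // default format
    df.write.format("orc").save("/tmp/out/orc")                   // explicit format
    df.write.option("compression", "gzip").json("/tmp/out/json")  // compressed JSON output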
The DataFrameWriter API can be used to save Spark DataFrames to different file formats and external sources. The common syntax is shown below. The default output format is Parquet, and by default files are saved to HDFS. The mode can…
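A sketch of the common save modes, assuming an existing DataFrame df and an illustrative output path:

    import org.apache.spark.sql.SaveMode

    df.write.mode(SaveMode.ErrorIfExists).parquet("/tmp/out/events") // default: fail if the path exists
    df.write.mode(SaveMode.Overwrite).parquet("/tmp/out/events")     // replace existing data
    df.write.mode("append").parquet("/tmp/out/events")               // string aliases also work
    df.write.mode(SaveMode.Ignore).parquet("/tmp/out/events")        // silently skip if the path exists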
The below table lists the available compression codecs in the Spark and Hadoop ecosystem.

    Compression   Fully qualified class name                                                                Alias
    deflate       org.apache.hadoop.io.compress.DefaultCodec                                                deflate
    gzip          org.apache.hadoop.io.compress.GzipCodec                                                   gzip
    bzip2         org.apache.hadoop.io.compress.BZip2Codec                                                  bzip2
    lzo           com.hadoop.compression.lzo.LzopCodec                                                      lzo
    LZ4           org.apache.hadoop.io.compress.Lz4Codec, org.apache.spark.io.LZ4CompressionCodec           lz4
    LZF           org.apache.spark.io.LZFCompressionCodec
    Snappy        org.apache.hadoop.io.compress.SnappyCodec, org.apache.spark.io.SnappyCompressionCodec     snappy…
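A few illustrative ways these codecs are referenced; df, rdd, and the paths are assumptions:

    import org.apache.hadoop.io.compress.GzipCodec

    // File output compression through a DataFrameWriter option (alias form)
    df.write.option("compression", "snappy").parquet("/tmp/out/events_snappy")

    // RDD text output with an explicit Hadoop codec class
    rdd.saveAsTextFile("/tmp/out/events_gzip", classOf[GzipCodec])

    // Spark-internal compression (shuffle, spill, broadcast), set before the SparkContext starts
    // new SparkConf().set("spark.io.compression.codec", "lz4")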
RDDs (Resilient Distributed Datasets) can be created in many different ways, for example by reading data from different sources. Text file RDDs can be created using SparkContext's textFile method. This method takes a URI for the file (either a local path on the machine,…
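A minimal sketch of creating RDDs from a text file and from an in-memory collection; the paths shown are assumptions:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("CreateRDDs").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Each line of the file becomes one element of the RDD
    val lines = sc.textFile("hdfs:///data/input.txt")   // local paths and s3a:// URIs also work

    // Parallelize an existing Scala collection
    val nums = sc.parallelize(1 to 100)

    println(lines.count())
    println(nums.sum())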