Spark and Hadoop Compression Codecs

The table below lists the compression codecs available in the Spark and Hadoop ecosystem.

Compression  Fully qualified class name                   Alias
deflate      org.apache.hadoop.io.compress.DefaultCodec   deflate
gzip         org.apache.hadoop.io.compress.GzipCodec      gzip
bzip2        org.apache.hadoop.io.compress.BZip2Codec     bzip2
lzo          com.hadoop.compression.lzo.LzopCodec         lzo
LZ4          org.apache.hadoop.io.compress.Lz4Codec       lz4
             org.apache.spark.io.LZ4CompressionCodec
LZF          org.apache.spark.io.LZFCompressionCodec
Snappy       org.apache.hadoop.io.compress.SnappyCodec    snappy
             org.apache.spark.io.SnappyCompressionCodec
…

Ways to Create Spark RDD

RDDs (Resilient Distributed Datasets) can be created in many different ways.

Reading data from different sources

Text file RDDs can be created using SparkContext’s textFile method. This method takes a URI for the file (either a local path on the machine,…

Spark DataFrame Select Columns

Spark's select method is used to select specific columns from a DataFrame. It is a transformation, so it is lazily evaluated and returns a new DataFrame. Pass a list of column names (strings) or column expressions as arguments. The…