Spark and Hadoop Compression Codecs

The table below lists the compression codecs available in the Spark and Hadoop ecosystem.

Compression     Fully qualified class name                    Alias
deflate         org.apache.hadoop.io.compress.DefaultCodec    deflate
gzip            org.apache.hadoop.io.compress.GzipCodec       gzip
bzip2           org.apache.hadoop.io.compress.BZip2Codec      bzip2
lzo             com.hadoop.compression.lzo.LzopCodec          lzo
LZ4             org.apache.hadoop.io.compress.Lz4Codec        lz4
                org.apache.spark.io.LZ4CompressionCodec
LZF             org.apache.spark.io.LZFCompressionCodec       -
Snappy          org.apache.hadoop.io.compress.SnappyCodec     snappy
                org.apache.spark.io.SnappyCompressionCodec
No compression  -                                             none, uncompressed
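
As a rough illustration, the aliases in the table can be kept in a small lookup so that a job can validate an alias before using it. This is only a sketch: `CODEC_CLASSES` and `resolve_codec` are hypothetical names that do not come from Spark or Hadoop, and the class names are simply transcribed from the table.

```python
# Alias -> fully qualified class names, transcribed from the table above.
# CODEC_CLASSES and resolve_codec are hypothetical names used only in this sketch.
CODEC_CLASSES = {
    "deflate": ["org.apache.hadoop.io.compress.DefaultCodec"],
    "gzip": ["org.apache.hadoop.io.compress.GzipCodec"],
    "bzip2": ["org.apache.hadoop.io.compress.BZip2Codec"],
    "lzo": ["com.hadoop.compression.lzo.LzopCodec"],
    "lz4": [
        "org.apache.hadoop.io.compress.Lz4Codec",
        "org.apache.spark.io.LZ4CompressionCodec",
    ],
    "snappy": [
        "org.apache.hadoop.io.compress.SnappyCodec",
        "org.apache.spark.io.SnappyCompressionCodec",
    ],
    # "none" and "uncompressed" both mean no compression, so no class is listed.
    "none": [],
    "uncompressed": [],
}


def resolve_codec(alias):
    """Return the class names listed for an alias (case-insensitive)."""
    key = alias.strip().lower()
    if key not in CODEC_CLASSES:
        raise ValueError(f"unknown compression alias: {alias!r}")
    return CODEC_CLASSES[key]
```

Failing early on an unknown alias is a design choice for this sketch; it keeps a typo from surfacing later in the middle of a write.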
     

When saving DataFrames to different file formats, you can set the compression codec through the property that matches the output type:

  • spark.sql.avro.compression.codec
  • spark.sql.parquet.compression.codec
  • spark.sql.orc.compression.codec
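
A minimal sketch of setting one of these properties from PySpark is shown here. The property names are the ones listed above; the `CODEC_PROPERTIES` mapping and the `set_output_codec` helper are hypothetical names introduced only for illustration, and `conf` is assumed to be an object with a `set(key, value)` method, such as `spark.conf`.

```python
# Property names come from the list above.
# CODEC_PROPERTIES and set_output_codec are hypothetical names for this sketch.
CODEC_PROPERTIES = {
    "avro": "spark.sql.avro.compression.codec",
    "parquet": "spark.sql.parquet.compression.codec",
    "orc": "spark.sql.orc.compression.codec",
}


def set_output_codec(conf, file_format, codec):
    """Set the compression codec property for one output format.

    `conf` is anything with a set(key, value) method, for example spark.conf.
    Returns the property name that was set.
    """
    key = CODEC_PROPERTIES[file_format.lower()]  # KeyError for unknown formats
    conf.set(key, codec)
    return key
```

With a live session, a call such as `set_output_codec(spark.conf, "parquet", "gzip")` before `df.write.parquet(...)` should be equivalent to calling `spark.conf.set` with the property name directly. Which codec values each format accepts depends on the Spark version, so check the documentation for the release you run.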

You can also check the current (or default) value of each property with these statements:

spark.conf.get("spark.sql.avro.compression.codec")
spark.conf.get("spark.sql.orc.compression.codec")
spark.conf.get("spark.sql.parquet.compression.codec")
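
The three lookups above can also be collected into one dictionary, which is handy for logging a job's settings. The sketch below assumes `conf` behaves like PySpark's `spark.conf`, whose `get` accepts an optional default as a second argument; `current_codecs` and the `"<not set>"` placeholder are hypothetical choices made for this example.

```python
# Hypothetical helper: gather the three codec properties in one call.
CODEC_KEYS = [
    "spark.sql.avro.compression.codec",
    "spark.sql.orc.compression.codec",
    "spark.sql.parquet.compression.codec",
]


def current_codecs(conf, fallback="<not set>"):
    """Return {property name: current value} for the codec properties.

    `conf` is anything with a get(key, default) method, for example spark.conf.
    `fallback` is returned for a property that has no value.
    """
    return {key: conf.get(key, fallback) for key in CODEC_KEYS}
```

In a session this would be called as `current_codecs(spark.conf)`. Passing a fallback avoids an error if a property is not defined in the Spark build being used.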
