The table below lists the compression codecs available in the Spark and Hadoop ecosystem.
Compression | Fully qualified class name | Alias |
---|---|---|
deflate | org.apache.hadoop.io.compress.DefaultCodec | deflate |
gzip | org.apache.hadoop.io.compress.GzipCodec | gzip |
bzip2 | org.apache.hadoop.io.compress.BZip2Codec | bzip2 |
lzo | com.hadoop.compression.lzo.LzopCodec | lzo |
LZ4 | org.apache.hadoop.io.compress.Lz4Codec, org.apache.spark.io.LZ4CompressionCodec | lz4 |
LZF | org.apache.spark.io.LZFCompressionCodec | |
Snappy | org.apache.hadoop.io.compress.SnappyCodec, org.apache.spark.io.SnappyCompressionCodec | snappy |
No compression | | none, uncompressed |
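The fully qualified Hadoop class names are what you pass to the lower-level RDD APIs, while the aliases are what you normally use in DataFrame options and SQL configs. A minimal sketch, assuming a spark-shell session with `spark` in scope and a hypothetical output path:

```scala
import org.apache.hadoop.io.compress.GzipCodec

// Write an RDD as text files compressed with the Hadoop GzipCodec from the table.
// The data and the /tmp path are placeholders for illustration only.
val rdd = spark.sparkContext.parallelize(Seq("a", "b", "c"))
rdd.saveAsTextFile("/tmp/output-gzip", classOf[GzipCodec])
```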
When saving DataFrames to different file formats, you can set the compression codec with one of these properties, depending on the output format (see the sketch after this list):
- spark.sql.avro.compression.codec
- spark.sql.parquet.compression.codec
- spark.sql.orc.compression.codec
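A minimal sketch of setting these configs per output format, using a hypothetical DataFrame and output paths:

```scala
import spark.implicits._

val df = Seq((1, "alice"), (2, "bob")).toDF("id", "name")

// Parquet output compressed with gzip
spark.conf.set("spark.sql.parquet.compression.codec", "gzip")
df.write.mode("overwrite").parquet("/tmp/users_parquet")

// ORC output compressed with snappy
spark.conf.set("spark.sql.orc.compression.codec", "snappy")
df.write.mode("overwrite").orc("/tmp/users_orc")

// Avro output (requires the spark-avro package) compressed with deflate
spark.conf.set("spark.sql.avro.compression.codec", "deflate")
df.write.mode("overwrite").format("avro").save("/tmp/users_avro")
```

The codec can also be chosen per write with the writer's `compression` option, for example `df.write.option("compression", "gzip").parquet(...)`, which takes precedence over the session-level config.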
You can also check the current or default values with these statements:
spark.conf.get("spark.sql.avro.compression.codec")
spark.conf.get("spark.sql.orc.compression.codec")
spark.conf.get("spark.sql.parquet.compression.codec")