DataFrames in Spark can be saved in different file formats with the DataFrameWriter's write API. Spark supports the text, parquet, orc, and json file formats; by default it saves in the parquet format. You can provide different compression options when saving the output. With mode you…
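As a minimal sketch (the DataFrame and output path are made up for illustration), a parquet write with an explicit save mode and compression option might look like this:

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("writer-example")
  .getOrCreate()

// Hypothetical DataFrame with a single column
val df = spark.range(10).toDF("id")

// Save as parquet (the default format) with snappy compression,
// overwriting any existing output at the target path
df.write
  .mode(SaveMode.Overwrite)
  .option("compression", "snappy")
  .parquet("/tmp/output/parquet_demo") // hypothetical path
```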
The DataFrameWriter API can be used to save Spark DataFrames to different file formats and external sources. The common syntax is shown below. The default write format is parquet, and files are saved to HDFS. The mode can…
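A sketch of that common format/mode/save syntax, assuming a hypothetical input file and output path:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("writer-syntax").getOrCreate()
val df = spark.read.json("/tmp/input/people.json") // hypothetical input

// format() picks the output format, mode() the save behaviour,
// save() the target path (resolved against HDFS on a cluster)
df.write
  .format("orc")          // text, parquet, orc, json, ...
  .mode("append")         // append, overwrite, ignore, error
  .save("/tmp/output/orc_demo")
```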
Reading data from an Oracle database with Spark can be done with these steps. Get the JDBC Thin Driver: download the proper driver, ojdbc6.jar for Oracle 11.2 and ojdbc7.jar for Oracle 12c. Check the compatibility…
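A sketch of the JDBC read itself, assuming hypothetical connection details (host, service name, table, and credentials) and that the ojdbc jar is already on the driver and executor classpaths (e.g. via --jars):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("oracle-read").getOrCreate()

// All connection details below are placeholders
val oracleDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")
  .option("dbtable", "HR.EMPLOYEES")
  .option("user", "hr_user")
  .option("password", "hr_password")
  .option("driver", "oracle.jdbc.OracleDriver")
  .load()

oracleDF.show()
```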
The table below lists the available compression codecs in the Spark and Hadoop ecosystems.

| Compression | Fully qualified class name | Alias |
|-------------|----------------------------|-------|
| deflate | org.apache.hadoop.io.compress.DefaultCodec | deflate |
| gzip | org.apache.hadoop.io.compress.GzipCodec | gzip |
| bzip2 | org.apache.hadoop.io.compress.BZip2Codec | bzip2 |
| lzo | com.hadoop.compression.lzo.LzopCodec | lzo |
| LZ4 | org.apache.hadoop.io.compress.Lz4Codec, org.apache.spark.io.LZ4CompressionCodec | lz4 |
| LZF | org.apache.spark.io.LZFCompressionCodec | |
| Snappy | org.apache.hadoop.io.compress.SnappyCodec, org.apache.spark.io.SnappyCompressionCodec | snappy… |
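As a sketch of how the two naming styles are used (output paths are hypothetical): the RDD API takes the fully qualified codec class, while the DataFrame writer accepts the short alias:

```scala
import org.apache.hadoop.io.compress.GzipCodec
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("codec-demo").getOrCreate()
val sc = spark.sparkContext

// RDD APIs take the fully qualified codec class...
sc.parallelize(Seq("a", "b", "c"))
  .saveAsTextFile("/tmp/output/text_gzip", classOf[GzipCodec])

// ...while the DataFrame writer accepts the short alias
spark.range(5).write
  .option("compression", "gzip")
  .parquet("/tmp/output/parquet_gzip")
```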
RDDs (Resilient Distributed Datasets) can be created in many different ways. Reading data from different sources: text file RDDs can be created using SparkContext's textFile method. This method takes a URI for the file (either a local path on the machine,…
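A minimal sketch with hypothetical paths, showing textFile with both a local and an HDFS URI:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("rdd-demo").setMaster("local[*]")
val sc = new SparkContext(conf)

// Local file and HDFS URIs both work; both paths here are made up
val localRdd = sc.textFile("file:///tmp/data/sample.txt")
val hdfsRdd  = sc.textFile("hdfs:///user/demo/sample.txt")

// count() is an action, so it triggers the actual read
println(localRdd.count())
```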
The Spark select method is used to select specific columns from a DataFrame. It is a transformation, lazily evaluated to create a new DataFrame. You should pass a list of column names (strings) or column expressions as arguments. The…
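A small sketch with a made-up DataFrame, showing both argument styles:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("select-demo").getOrCreate()
import spark.implicits._

// Hypothetical two-column DataFrame
val df = Seq(("alice", 30), ("bob", 25)).toDF("name", "age")

// Column names as strings
df.select("name", "age").show()

// Column expressions; select is lazy, show() triggers execution
df.select(col("name"), ($"age" + 1).alias("age_next_year")).show()
```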