Apache Spark
Apache Spark: Jump into Programming
1. SparkConf:
SparkConf holds the configuration for a Spark application and is used to set various Spark parameters as key-value pairs.
SparkConf conf = new SparkConf()
        .setAppName("Line count")
        .setMaster("local[2]")
        .set("spark.executor.memory", "1g");
2. JavaSparkContext:
It is the Java version of SparkContext; only one SparkContext may be active per JVM.
JavaSparkContext jsc = new JavaSparkContext(conf);
// or, with configuration read from system properties:
JavaSparkContext jsc = new JavaSparkContext();
It has some useful methods (a short example follows the lists below):
File related:
- addFile()
- binaryFiles(), binaryRecords()
- clearFiles()
- textFile() ----> returns a JavaRDD
- wholeTextFiles() ----> returns a JavaPairRDD
- sequenceFile()
- objectFile() ----> returns a JavaRDD
Job Related:
- cancelAllJobs(), cancelJobGroup()
- setJobGroup(), clearJobGroup()
Hadoop Related:
Most of these methods return a JavaPairRDD (hadoopConfiguration() returns the Hadoop Configuration object).
- hadoopRDD()
- hadoopConfiguration()
- hadoopFile()
- newAPIHadoopFile()
- newAPIHadoopRDD()
General:
- close(), stop()
- emptyRDD()
- accumulator()
- getSparkHome()
- parallelize() ----> returns a JavaRDD
- union() ----> returns a JavaRDD
- sc()
- toSparkContext() ... and many more
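A minimal sketch tying a few of these methods together (the app name and sample values are placeholders):

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ContextMethods {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("Context methods")
                .setMaster("local[2]");
        JavaSparkContext jsc = new JavaSparkContext(conf);

        // parallelize() turns a local collection into a JavaRDD
        JavaRDD<Integer> first = jsc.parallelize(Arrays.asList(1, 2, 3));
        JavaRDD<Integer> second = jsc.parallelize(Arrays.asList(4, 5, 6));

        // union() merges two RDDs (also available on the context itself)
        JavaRDD<Integer> merged = first.union(second);
        System.out.println(merged.count()); // prints 6

        jsc.stop(); // stop() shuts down the context and releases its resources
    }
}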
3. JavaRDD:
It has some important methods:
- distinct()
- sortBy()
- subtract()
- toString()
- randomSplit()
- filter()
- persist()
- unpersist()
- intersection()
- repartition()
- sample()
- wrapRDD()
- collect()
- flatMap(), map()
- reduce()
- foreach()
- max(), min()
- partitions()
Example:
SparkConf conf = new SparkConf()
        .setAppName("Line count")
        .setMaster("local[2]")
        .set("spark.executor.memory", "1g");
JavaSparkContext jsc = new JavaSparkContext(conf);
JavaRDD<String> textRDD = jsc.textFile("/usr/local/Cellar/apache-spark/2.1.0/README.md");
System.out.println(textRDD.count());
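Building on the example above, a short sketch of a few JavaRDD methods using Java 8 lambdas (same placeholder README.md path):

JavaRDD<String> sparkLines = textRDD.filter(line -> line.contains("Spark")); // keep only matching lines
JavaRDD<Integer> lineLengths = sparkLines.map(line -> line.length());        // transform each line to its length
Integer totalLength = lineLengths.reduce((a, b) -> a + b);                   // aggregate with reduce()
System.out.println("Lines mentioning Spark: " + sparkLines.count());
System.out.println("Characters in those lines: " + totalLength);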
Important Packages in Apache Spark
1. org.apache.spark.api.java
2. org.apache.spark.streaming.api.java
3. org.apache.spark.streaming.kafka
4. org.apache.spark.sql

Spark Streaming:
Apache Spark ships with a streaming module, Spark Streaming, which provides high-throughput, scalable, fault-tolerant stream processing of live data streams.
Spark Streaming provides an abstraction called discretized streams (D-Streams). Internally, each D-Stream is represented as a sequence of RDDs.
D-Streams:
A discretized stream (D-Stream) structures a streaming computation as a series of short, stateless, deterministic batch computations, providing a continuous sequence of data.
D-Streams can be created from various input sources like Kafka, Flume, etc.
Spark Streaming also provides a special fault-tolerance mechanism called checkpointing, which periodically saves state to reliable storage so a failed computation can recover.
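A one-line sketch of enabling checkpointing (assuming a JavaStreamingContext named jssc, created as in the example at the end of this section; the HDFS path is a placeholder):

// Save streaming state periodically so it can be recovered after a failure
jssc.checkpoint("hdfs://namenode:8020/spark/checkpoints");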
Spark Streaming Architecture:
Spark Streaming uses a micro-batch architecture: the stream is treated as a continuous series of batch computations on small batches of data, with new batches created at a regular interval called the batch interval.
The batch interval typically ranges from 500 milliseconds to several seconds.
Each input data stream is divided into mini-batches of the specified batch interval. Each batch forms an RDD, which is processed by Spark jobs to create other RDDs. The processed results are then pushed out to external systems in batches.
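A minimal Spark Streaming sketch of this micro-batch flow (the TCP socket source on localhost:9999 is a placeholder; word counts are computed per 1-second batch):

import java.util.Arrays;
import scala.Tuple2;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class StreamingWordCount {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf()
                .setAppName("Streaming word count")
                .setMaster("local[2]");
        // Batch interval of 1 second: each second of input becomes one RDD
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

        // Placeholder source: lines of text read from a TCP socket
        JavaDStream<String> lines = jssc.socketTextStream("localhost", 9999);

        // Word count over each micro-batch
        JavaPairDStream<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey((a, b) -> a + b);

        counts.print(); // push each batch's result (here, to stdout)

        jssc.start();            // start receiving data
        jssc.awaitTermination(); // block until the job is stopped
    }
}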



Maven dependency:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.10</artifactId>
    <version>2.1.0</version>
</dependency>
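The core API examples earlier in this post also need the matching spark-core artifact (assuming the same Scala and Spark versions as above):

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>2.1.0</version>
</dependency>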


