Apache Spark

Apache Spark: Jump into Programming


1. SparkConf:
    Configuration for a Spark application, used to set various Spark parameters as key-value pairs.

SparkConf conf = new SparkConf()
                     .setAppName("Line count")
                     .setMaster("local[2]")
                     .set("spark.executor.memory", "1g");

2. JavaSparkContext:
   The Java version of SparkContext; only one SparkContext may be active per JVM.

   JavaSparkContext jsc = new JavaSparkContext(conf);
   JavaSparkContext jsc = new JavaSparkContext();

   It has some useful methods:


    File related:
  • addFile()
  • binaryFiles(), binaryRecords()
  • clearFiles()
  • textFile() ----> returns a JavaRDD
  • wholeTextFiles() ----> returns a JavaPairRDD
  • sequenceFile()
  • objectFile() ----> returns a JavaRDD
    Job related:
  • cancelAllJobs(), cancelJobGroup()
  • setJobGroup(), clearJobGroup()
  • accumulator()
     Hadoop Related:

      Most of the Hadoop-related methods return a JavaPairRDD (hadoopConfiguration() returns the Hadoop Configuration object):

  • hadoopRDD()
  • hadoopConfiguration()
  • hadoopFile()
  • newAPIHadoopFile()
  • newAPIHadoopRDD()

    General:
  • close(), stop()
  • emptyRDD()
  • getSparkHome()
  • parallelize() ----> returns a JavaRDD
  • union() ----> returns a JavaRDD
  • sc()
  • toSparkContext() ... and many more
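A short sketch of a few of these methods; it shows parallelize(), union() (called here on the RDD itself, which the context-level union() mirrors), and stop(). The values and app name are illustrative:

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ContextMethods {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("Context methods")
                .setMaster("local[2]");
        JavaSparkContext jsc = new JavaSparkContext(conf);

        // parallelize() turns a local collection into a JavaRDD
        JavaRDD<Integer> a = jsc.parallelize(Arrays.asList(1, 2, 3));
        JavaRDD<Integer> b = jsc.parallelize(Arrays.asList(4, 5));

        // union() concatenates two RDDs without removing duplicates
        JavaRDD<Integer> all = a.union(b);
        System.out.println(all.count()); // prints 5

        // close()/stop() release the context's resources
        jsc.stop();
    }
}
```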

3. JavaRDD:

It has some important methods:

  • distinct()
  • sortBy()
  • subtract()
  • toString()
  • randomSplit()
  • filter()
  • persist()
  • unpersist()
  • intersection()
  • repartition()
  • sample()
  • wrapRDD()
  • collect()
  • flatMap(), map()
  • reduce()
  • foreach()
  • max(), min()
  • partitions()

Example:


SparkConf conf = new SparkConf()
       .setAppName("Line count")
       .setMaster("local[2]")
       .set("spark.executor.memory", "1g");

JavaSparkContext jsc = new JavaSparkContext(conf);
JavaRDD<String> textRDD = jsc.textFile("/usr/local/Cellar/apache-spark/2.1.0/README.md");
System.out.println(textRDD.count());
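The JavaRDD methods listed above compose naturally; a small sketch chaining distinct(), filter(), map(), and reduce() (the input values are illustrative):

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RDDOperations {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("RDD operations")
                .setMaster("local[2]");
        JavaSparkContext jsc = new JavaSparkContext(conf);

        JavaRDD<Integer> nums = jsc.parallelize(Arrays.asList(1, 2, 2, 3, 4, 4, 5));

        // distinct() drops duplicates; filter() keeps matching elements
        JavaRDD<Integer> evens = nums.distinct().filter(n -> n % 2 == 0);

        // map() transforms each element; reduce() folds them into one value
        int sumOfSquares = evens.map(n -> n * n).reduce(Integer::sum);

        System.out.println(sumOfSquares); // 2*2 + 4*4 = 20
        jsc.stop();
    }
}
```

Because distinct(), filter(), and map() are transformations, nothing is computed until the reduce() action triggers the job.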


Important Packages in Apache Spark


1. org.apache.spark.api.java

2. org.apache.spark.streaming.api.java

3. org.apache.spark.streaming.kafka

4. org.apache.spark.sql



Spark Streaming:

Apache Spark provides a streaming module, Spark Streaming, which enables high-throughput, scalable, fault-tolerant processing of live data streams.

Spark Streaming provides an abstraction called a discretized stream (DStream). Internally, each DStream is represented as a sequence of RDDs.

DStreams:
Discretized streams (DStreams) are built from short, stateless, deterministic batch computations and present their results as a continuous sequence of data.

DStreams can be created from various input sources such as Kafka, Flume, etc.

Spark Streaming also provides a recovery mechanism called checkpointing, which periodically saves streaming metadata and state to reliable storage so a job can recover from failures.



Spark Streaming Architecture:

Spark Streaming uses a micro-batch architecture: the stream is treated as a continuous series of batch computations on small batches of data, with new batches created at a regular interval called the batch interval.

The batch interval usually ranges from about 500 milliseconds to several seconds.

Each input data stream is divided into mini-batches of the specified batch interval. Each batch forms an RDD, which is processed by Spark jobs to produce other RDDs. The processed results are then pushed out to external systems in batches.







Maven dependency:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.10</artifactId>
    <version>2.1.0</version>
</dependency>
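With this dependency in place, the micro-batch model above can be sketched as a minimal streaming word count. The hostname, port, and batch interval are illustrative assumptions; it expects a text source on that socket (e.g. `nc -lk 9999`):

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

public class StreamingWordCount {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf()
                .setAppName("Streaming word count")
                .setMaster("local[2]");

        // Each micro-batch covers a 1-second batch interval
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

        // One DStream = a sequence of RDDs, one per batch interval
        JavaReceiverInputDStream<String> lines =
                jssc.socketTextStream("localhost", 9999);

        // Split lines into words, pair each with 1, and sum per word per batch
        JavaPairDStream<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey(Integer::sum);

        counts.print();          // push each batch's result to the console

        jssc.start();            // begin receiving and processing batches
        jssc.awaitTermination(); // run until the job is stopped
    }
}
```

Note that this job runs until stopped; each 1-second batch is processed independently, matching the micro-batch architecture described above.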
