Spark – Streaming

Streaming Fundamentals

  1. Streaming Context :
    – Consumes a stream of data in Spark
    – Register an InputDStream to produce Receiver object
    – It is the main Entry point for spark funcionality
    – Spark provides a number of default implementation of sources like Twitter, Akka Actor and ZeroMQ that are accessible from the context
    – A streamingContext object can be created from a SparkContext object
    – A SparkContext represent the connection to a Spark Cluster and can be used to create RDDs, acumulators and broadcast variables on that clusterimport org.apache.spark._
    import org.apache.spark.streaming._
    var ssc = new StreamingContext(sc,Second(1));
  2. DStream :
    – Discretized Stream is the basic abstraction provided by spark streaming
    – It is a continues stream of data
    – It is received from source or from a processed data stream generated by transforming the input stream
    – Internally , A Dstream is represented by a continuous series of RDD and each RDD contains data from a certain interval
    – Any operation applied on a DSTream translates to operations on the underlying RDDs.
    – Input DStream are DStreams representing the stream of input data received from streaming  sources. There are 2 sourceof DSTream
    A.) Basic Source includes  File System & Socket connection
    B.) Advance Source includes Kafka, Flume, Kinesis
    – Every input DStream is associated with a Receiver object which receives the data from a source and stores it in Spark’s memory for processingTransformation on DStream
    – Transformation allow the data from the input  DStream to be modified similar to RDD , DStream supports many of the transformation available on normal Spark RDD including map,flatmap,filter,reduce,groupby
    A.) map(func) : Return a new DStream by passing each element of source DStream through a function func
    B.) flatMap(func) is similar to map(func) but each input item can be mapped to 0 or more output items and returna  new DStream by passing each Dstream source
    C.) filter(func) returns a new DStream by selecting only the records that matches the criteria
    D.) reduce(func) return a new DStream of single-element RDDs
    E.) groupBy(func) return the new RDD which basically is made up with key and corresponding list of items of that group

    – DStream window :
    Spark Streaming also provides windowed computations which allow us to apply transformations over a sliding window of data
    – Output operation on DStream :
    Output operation allow Dstream’s data to be pushed out to external system like databases or file system
    output operation trigger the actual execution of all the DStream transformation
    It support print() , saveAsTextFile(prefix,[suffix]),saveAsObjectFiles(prefix,[suffix]), saveAsHadoopFiles(prefix,[suffix]), foreachRDD(func) types of output operation

    Example :
    dstream.foreachRDD(rdd => rdd.foreachPartition(pr =>
    val connection = ConnectionPool.getConnection()
    pr.foreach(record => connection.send(record))

  3. Caching and persistence  :
    – Dstream allows developer to cache/persis the stream’s data in memory. This is useful if the data in the DStream will be computed multiple times
    – This can be done using the persist() method on Dstream
    – For input stream that receive data over the network(such as Kafka,flume, socket etc) the default persistence level is set to replicate the data to wo nodes or fault-tolerance.
  4. Acuumulators , Broadcast variables & Checkpoints
    – Accumulators are variables that are only added through an associatives and commutative operation
    – They are used to implement counters or sums
    – Tracking accumulators in the UI can be useful for understanding the progress of running stages
    – Spark natively supports numeric accumulators We can create named or unnamed accumulators
    Broadcast variables :
    – Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks
    – They can be used to give every node a copy of a large input dataset in an efficient manner
    – Spark also attempts to distribute broadcast variables using efficient broadcast algorithm to reduce communication cost

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s