If an object and a class share the same name and are defined in the same source file, they are called companions.
A companion class can access the private methods and fields of its companion object, and vice versa.
Companion objects can be used to implement various design patterns, such as the factory pattern and the singleton pattern.
A companion object plays the same role that static members play in Java.
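Below is a minimal sketch (the Account class and its fields are illustrative) of a companion object acting as a factory for its companion class; the apply method can call the private constructor precisely because companions can access each other's private members.
class Account private (val id: Int, val owner: String)
object Account {
  // Factory method: callers write Account(1, "Alice") instead of new Account(...)
  def apply(id: Int, owner: String): Account = new Account(id, owner)
}
object CompanionDemo {
  def main(args: Array[String]): Unit = {
    val acc = Account(1, "Alice") // goes through the companion's apply factory
    println(acc.owner)
  }
}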
2. Case class & Case Object :
Data classes represent the data types of a domain. Imagine an online store where Customer, Account, Order and Inventory are all domain types.
Service classes represent work to be performed in an application. They know what to do but they do not hold data themselves; when the application calls a service it passes data to it, and the service transforms that data in some way. Examples: persistence, logging, a calculation engine.
A case class is a representation of a data type that removes a lot of boilerplate code. It generates convenient JVM-specific methods, makes every class parameter a field, is immutable by default (and hence thread safe), and performs value-based equality by default.
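A small sketch (the Customer type is illustrative) showing what a case class gives you for free: field accessors, value-based equality, a generated toString, and an immutable copy method.
case class Customer(id: Int, name: String)
object CaseClassDemo {
  def main(args: Array[String]): Unit = {
    val c1 = Customer(1, "Alice")        // no 'new' needed: apply is generated
    val c2 = c1.copy(name = "Bob")       // immutable update via generated copy
    println(c1 == Customer(1, "Alice"))  // true: value-based equality
    println(c2)                          // Customer(1,Bob): toString is generated
  }
}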
3. Higher Order functions :
Since Scala is a functional programming language, it supports higher-order functions: a function can take another function as an argument. The parameter function is declared in the outer function's signature and called inside its body, while its implementation is supplied at the call site, either as an anonymous function or as a separately defined named function passed by reference.
object HigherOrder {
  // applyF is a higher-order function: f is itself a function argument
  def applyF(i: Int, f: Int => Int): Unit = {
    println(f(i))
  }

  // squares its argument; passed to applyF by name below
  def square(x: Int): Int = {
    x * x
  }

  def main(args: Array[String]): Unit = {
    applyF(5, square)       // named function as an argument
    applyF(5, x => x * 2)   // anonymous function as an argument
  }
}
– DStream window :
Spark Streaming also provides windowed computations, which allow us to apply transformations over a sliding window of data.
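A minimal sketch of a windowed computation (the socket source, host, port and durations are illustrative): counting words over the last 30 seconds, recomputed every 10 seconds.
import org.apache.spark.streaming.{Seconds, StreamingContext}

// assumes an existing SparkContext `sc`; batch interval of 10 seconds
val ssc = new StreamingContext(sc, Seconds(10))
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))

// reduceByKeyAndWindow(reduceFunc, windowDuration, slideDuration)
val windowedCounts = words.map(w => (w, 1))
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
windowedCounts.print()

ssc.start()
ssc.awaitTermination()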
– Output operations on DStreams :
Output operations allow a DStream's data to be pushed out to external systems like databases or file systems.
Output operations trigger the actual execution of all the DStream transformations.
The supported output operations are print(), saveAsTextFiles(prefix, [suffix]), saveAsObjectFiles(prefix, [suffix]), saveAsHadoopFiles(prefix, [suffix]) and foreachRDD(func).
Example :
dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // one connection per partition, reused for all records in it
    val connection = ConnectionPool.getConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    ConnectionPool.returnConnection(connection) // return to the pool for future reuse
  }
}
Now let's discuss how Kafka works internally.
Consider a Kafka cluster of 4 nodes and a topic with a replication factor of 3. As we know, Kafka maintains the list of in-sync replicas (ISR) based on 2 configurations.
Sometimes in production we face a lot of these kinds of alerts, which is not a good sign. There are various reasons why a node or replica may be slow or down.
So the ideal practice to minimize such alerts is to set only replica.lag.time.max.ms and avoid setting replica.lag.max.messages.
Thanks to Neha Narkhede (https://www.confluent.io/blog/author/neha-narkhede/) for her awesome blog.
Reference :
https://www.confluent.io/blog/hands-free-kafka-replication-a-lesson-in-operational-simplicity/
Scala Code :
1. Find the two missing numbers from an array that should contain the values 1 to len
// Finds the two numbers missing from datax, which should contain 1..len with
// exactly two values removed. Uses the fact that the sum of the two missing
// numbers equals (1 + 2 + ... + len) - sum(datax).
def findMissingNums(datax: Array[Int], len: Int): Unit = {
  val total = datax.sum
  val actualTotal = (1 to len).sum
  val diff = actualTotal - total        // sum of the two missing numbers

  var firstElem = 0
  var secondElem = 0
  // scan candidates from high to low; both the candidate and its complement
  // (diff - candidate) must be absent from the array
  var i = math.min(diff - 1, len)
  while (i > 0 && firstElem == 0) {
    val complement = diff - i
    if (i != complement && !datax.contains(i) && !datax.contains(complement)) {
      firstElem = i
      secondElem = complement
    }
    i -= 1
  }
  println(firstElem + "," + secondElem)
}
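As a sanity check, the same result can be obtained more directly (a sketch, under the same assumption that the array holds 1..len with exactly two values removed) by diffing the full range against the array:
def findMissingNumsSimple(datax: Array[Int], len: Int): Unit = {
  // everything in 1..len that is not present in the array is a missing number
  val missing = (1 to len).filterNot(datax.toSet)
  println(missing.mkString(","))
}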
2. Swap two numbers without using a temp variable
def swapNum(num1: Int, num2: Int): Unit = {
  var number1 = num1
  var number2 = num2
  number1 = number2 - number1   // number1 now holds (num2 - num1)
  number2 = number2 - number1   // number2 becomes num1
  number1 = number1 + number2   // number1 becomes num2
  println("num1 = " + number1 + ", num2 = " + number2)
}
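For example, calling swapNum(5, 3) works through the arithmetic as: number1 = 3 - 5 = -2, then number2 = 3 - (-2) = 5, then number1 = -2 + 5 = 3, so it prints num1 = 3, num2 = 5.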
describe 'employee'
{NAME => 'address', BLOOMFILTER => 'ROW', VERSIONS => '5', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'GZ', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
{NAME => 'personal_info', BLOOMFILTER => 'ROW', VERSIONS => '2', IN_MEMORY => 'true', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'GZ', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
{NAME => 'professional_info', BLOOMFILTER => 'ROW', VERSIONS => '3', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'GZ', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
hbase> alter_async 't1', NAME => 'f1', METHOD => 'delete'
or a shorter version:
hbase> alter_async 't1', 'delete' => 'f1'
You can also change table-scope attributes like MAX_FILESIZE, MEMSTORE_FLUSHSIZE, READONLY, and DEFERRED_LOG_FLUSH. For example, to change the max file size to 128 MB, do:
hbase> alter 't1', METHOD => 'table_att', MAX_FILESIZE => '134217728'
There can be more than one alteration in one command:
hbase> alter 't1', {NAME => 'f1'}, {NAME => 'f2', METHOD => 'delete'}
To check if all the regions have been updated, use alter_status <table_name>
Reference :
https://learnhbase.wordpress.com/2013/03/02/hbase-shell-commands/
https://www.cloudera.com/documentation/enterprise/5-5-x/topics/admin_hbase_filtering.html
A distributed matrix has long-typed row and column indices and double-typed values, stored distributively in one or more RDDs. It is very important to choose the right format to store large and distributed matrices. Converting a distributed matrix to a different format may require a global shuffle, which is quite expensive. Four types of distributed matrices have been implemented so far.
A RowMatrix can be created from an RDD[Vector] instance. We can then compute its column summary statistics and decompositions. QR decomposition is of the form A = QR, where Q is an orthogonal matrix and R is an upper triangular matrix.
Example :
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val rows: RDD[Vector] = sc.parallelize(Seq(Vectors.dense(1.0, 0.0, 3.0), Vectors.dense(2.0, 30.0, 4.0)))
// O/p : rows: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = ParallelCollectionRDD[2] at parallelize at <console>:26
val mat: RowMatrix = new RowMatrix(rows)
// Get its size.
val m = mat.numRows()
val n = mat.numCols()
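Since QR decomposition was mentioned above, here is a small sketch (assuming RowMatrix's tallSkinnyQR method, available in recent MLlib versions) that computes it for the matrix created above:
// computeQ = true asks for both factors: qrResult.Q is a RowMatrix, qrResult.R a local Matrix
val qrResult = mat.tallSkinnyQR(computeQ = true)
println(qrResult.R)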
IndexedRowMatrix :
It is similar to RowMatrix, but each row is represented by its index and a local vector.
An IndexedRowMatrix can be created from an RDD[IndexedRow] instance, where IndexedRow is a wrapper over (Long, Vector).
Example :
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix, RowMatrix}
val rows: RDD[IndexedRow] = sc.parallelize(Seq(IndexedRow(0, Vectors.dense(1.0, 2.0, 3.0)), IndexedRow(1, Vectors.dense(4.0, 5.0, 6.0))))
// Create an IndexedRowMatrix from an RDD[IndexedRow].
val mat: IndexedRowMatrix = new IndexedRowMatrix(rows)
// Get its size.
val m = mat.numRows()
val n = mat.numCols()
// Drop its row indices.
val rowMat: RowMatrix = mat.toRowMatrix()
CoordinateMatrix :
A CoordinateMatrix is backed by an RDD of its entries, where each MatrixEntry is a wrapper over (Long, Long, Double). It should be used when both dimensions of the matrix are huge and the matrix is very sparse.
Example :
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
val entries: RDD[MatrixEntry] = sc.parallelize(Seq(MatrixEntry(0,1,1.2),MatrixEntry(0,2,1.5),MatrixEntry(0,3,2.5)))
// Create a CoordinateMatrix from an RDD[MatrixEntry].
val mat: CoordinateMatrix = new CoordinateMatrix(entries)
// Get its size.
val m = mat.numRows()
val n = mat.numCols()
// Convert it to an IndexedRowMatrix whose rows are sparse vectors.
val indexedRowMatrix = mat.toIndexedRowMatrix()
When we want to perform multiple matrix additions and multiplications, where each sub-matrix is addressed by its block index, we use a BlockMatrix.
A BlockMatrix can most easily be created from an IndexedRowMatrix or a CoordinateMatrix by calling toBlockMatrix, which creates blocks of size 1024 x 1024 by default.
Example :
import org.apache.spark.mllib.linalg.distributed.{BlockMatrix, CoordinateMatrix, MatrixEntry}
val entries: RDD[MatrixEntry] = sc.parallelize(Seq(MatrixEntry(0,1,1.2),MatrixEntry(0,2,1.5),MatrixEntry(0,3,2.5)))
// Create a CoordinateMatrix from an RDD[MatrixEntry].
val coordMat: CoordinateMatrix = new CoordinateMatrix(entries)
// Transform the CoordinateMatrix to a BlockMatrix
val matA: BlockMatrix = coordMat.toBlockMatrix().cache()
// Validate whether the BlockMatrix is set up properly. Throws an Exception when it is not valid.
// Nothing happens if it is valid.
matA.validate()
// Calculate A^T A.
val ata = matA.transpose.multiply(matA)
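To illustrate the addition mentioned above, a short sketch (assuming BlockMatrix's add method, which requires the two matrices to have matching dimensions and block sizes) computing A + A:
// element-wise addition of two BlockMatrices with the same dimensions and block sizes
val doubled: BlockMatrix = matA.add(matA)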
A local vector has integer-typed, 0-based indices and double-typed values. There are 2 types of local vectors:
1. A dense vector is backed by a double array representing its entry values.
Vectors.dense creates a dense vector from a double array.
Example :
val dv: Vector = Vectors.dense(1.0, 0.0, 3.0)
o/p : dv: org.apache.spark.mllib.linalg.Vector = [1.0,0.0,3.0]
2. A sparse vector is backed by two parallel arrays: indices and values.
Vectors.sparse creates a sparse vector from its index array and value array.
We can also declare a sparse vector another way, with a Seq of (index, value) pairs as the 2nd parameter and the size as the 1st.
This creates a sparse vector using unordered (index, value) pairs.
Example :
val sv1: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))
o/p : sv1: org.apache.spark.mllib.linalg.Vector = (3,[0,2],[1.0,3.0])
val sv2: Vector = Vectors.sparse(3, Seq((0, 1.0), (2, 3.0)))
o/p : sv2: org.apache.spark.mllib.linalg.Vector = (3,[0,2],[1.0,3.0])
3. Labeled Point :
A labeled point is a local vector, either dense or sparse, associated with a label/response. In MLlib, labeled points are used in supervised learning algorithms. We use a double to store a label, so we can use labeled points in both regression and classification. For binary classification, a label should be either 0 (negative) or 1 (positive). For multiclass classification, labels should be class indices starting from zero: 0, 1, 2, ....
A labeled point is represented by the case class LabeledPoint.
import org.apache.spark.mllib.regression.LabeledPoint
// Create a labeled point with a positive label and a dense feature vector.
val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))
// Create a labeled point with a negative label and a sparse feature vector.
val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))
Matrix :
Spark MLlib supports 2 types of matrices: 1. local matrices and 2. distributed matrices.
Local Matrix :
The base class of local matrices is Matrix, and Spark provides two implementations: DenseMatrix and SparseMatrix. It is recommended to use the factory methods implemented in Matrices to create local matrices. Remember, local matrices in MLlib are stored in column-major order.
Matrices.dense creates a column-major dense matrix.
Example :
val dm: Matrix = Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))
o/p : dm: org.apache.spark.mllib.linalg.Matrix =
1.0  2.0
3.0  4.0
5.0  6.0
Matrices.sparse creates a column-major sparse matrix in Compressed Sparse Column (CSC) format. Its parameters are:
numRows -> number of rows in the matrix
numCols -> number of columns in the matrix
colPtrs -> the index corresponding to the start of each new column
rowIndices -> the row index of each element, in column-major order
values -> the non-zero values, as doubles
Example : the matrix
1.0  0.0  4.0
0.0  3.0  5.0
2.0  0.0  6.0
is stored as values: [1.0, 2.0, 3.0, 4.0, 5.0, 6.0], rowIndices: [0, 2, 1, 0, 1, 2], colPointers: [0, 2, 3, 6].
Another example :
val sm: Matrix = Matrices.sparse(3, 2, Array(0, 1, 3), Array(0, 2, 1), Array(9, 6, 8))
O/p : sm: org.apache.spark.mllib.linalg.Matrix =
3 x 2 CSCMatrix
(0,0) 9.0
(2,1) 6.0
(1,1) 8.0
We will see distributed Matrix in next session.
Here I am giving a simple example of the collectAsync() future method, and also demonstrating how we can run Scala scripts containing object-oriented code through spark-shell.
collectAsync() returns a FutureAction that will hold an object of type Seq[Int] in the future. It enables parallel programming by allowing multiple operations to run concurrently, supporting non-blocking IO.
Test.scala :
import org.apache.spark.FutureAction
import org.apache.spark.{SPARK_VERSION, SparkContext}
import scala.concurrent.Future

object Test {
  def main(args: Array[String]) {
    //val sc1 = new SparkContext()
    val data = sc.parallelize(1 to 250, 1)
    var futureData: FutureAction[Seq[Int]] = data.collectAsync()
    while (!futureData.isCompleted) {
      println("Hello World")
    }
    var itr: Iterator[Int] = data.toLocalIterator
    while (itr.hasNext) {
      println(itr.next())
    }
  }
}
Test.main(null) /* This line is part of scala file Test.scala */
Now, in order to run the above program, you have to run the following command:
sudo spark-shell -i Test.scala
In this case spark-shell takes the Test.scala file and interprets it; after that,
Test.main(null) gets called, which invokes the main method of the Test object.
One thing to make sure of: do not create another SparkContext in your code;
you can re-utilize the SparkContext (sc) of the current spark-shell session.
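Instead of busy-waiting in a while loop as above, the returned FutureAction can also be consumed with a callback. A minimal sketch (using the standard Scala Future onComplete API, with the global execution context as an assumption):
import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.{Failure, Success}

futureData.onComplete {
  case Success(values) => println("Collected " + values.size + " elements")
  case Failure(e)      => println("Job failed: " + e.getMessage)
}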
A Queue follows the First-In-First-Out model, but sometimes we need to process the objects in the queue based on priority. That is when the Java PriorityQueue is used.
For example, let's say we have an application that generates stock reports for the daily trading session. This application processes a lot of data and takes time to do it. Customers send requests to the application that are queued, but we want to process premium customers first and standard customers after them. In this case the PriorityQueue implementation in Java can be really helpful.
PriorityQueue is an unbounded queue based on a priority heap, and the elements of the priority queue are ordered in their natural order by default. We can provide a Comparator for ordering at the time of instantiation of the priority queue.
Java PriorityQueue doesn't allow null values, and we can't create a PriorityQueue of objects that are non-comparable. We use the Java Comparable and Comparator interfaces for sorting objects, and PriorityQueue uses them for priority processing of its elements.
The simplest way to implement a priority queue data type is to keep an associative array mapping each priority to a list of elements with that priority. If association lists or hash tables are used to implement the associative array, adding an element takes constant time but removing or peeking at the element of highest priority takes linear (O(n)) time, because we must search all keys for the largest one. If a self-balancing binary search tree is used, all three operations take O(log n) time; this is a popular solution in environments that already provide balanced trees but nothing more sophisticated.
There are a number of specialized heap data structures that either supply additional operations or outperform the above approaches. The binary heap uses O(log n) time for both operations, but allows peeking at the element of highest priority without removing it in constant time. Binomial heaps add several more operations, but require O(log n) time for peeking. Fibonacci heaps can insert elements, peek at the maximum priority element, and decrease an element’s priority in amortized constant time (deletions are still O(log n)).
// BinaryHeap class
//
// CONSTRUCTION: empty or with initial array.
//
// ******************PUBLIC OPERATIONS*********************
// void insert( x )        --> Insert x
// Comparable deleteMin( ) --> Return and remove smallest item
// Comparable findMin( )   --> Return smallest item
// boolean isEmpty( )      --> Return true if empty; else false
// void makeEmpty( )       --> Remove all items
// ******************ERRORS********************************
// Throws UnderflowException for findMin and deleteMin when empty
/**
* Implements a binary heap.
* Note that all "matching" is based on the compareTo method.
* @author Bhavesh Gadoya
*/
public class BinaryHeap implements PriorityQueue {
/**
* Construct the binary heap.
*/
public BinaryHeap( ) {
currentSize = 0;
array = new Comparable[ DEFAULT_CAPACITY + 1 ];
}
/**
* Construct the binary heap from an array.
* @param items the inital items in the binary heap.
*/
public BinaryHeap( Comparable [ ] items ) {
currentSize = items.length;
array = new Comparable[ items.length + 1 ];
for( int i = 0; i < items.length; i++ )
array[ i + 1 ] = items[ i ];
buildHeap( );
}
/**
* Insert into the priority queue.
* Duplicates are allowed.
* @param x the item to insert.
* @return null, signifying that decreaseKey cannot be used.
*/
public PriorityQueue.Position insert( Comparable x ) {
if( currentSize + 1 == array.length )
doubleArray( );
// Percolate up
int hole = ++currentSize;
array[ 0 ] = x;
for( ; x.compareTo( array[ hole / 2 ] ) < 0; hole /= 2 )
array[ hole ] = array[ hole / 2 ];
array[ hole ] = x;
return null;
}
/**
* @throws UnsupportedOperationException because no Positions are returned
* by the insert method for BinaryHeap.
*/
public void decreaseKey( PriorityQueue.Position p, Comparable newVal ) {
throw new UnsupportedOperationException( "Cannot use decreaseKey for binary heap" );
}
/**
* Find the smallest item in the priority queue.
* @return the smallest item.
* @throws UnderflowException if empty.
*/
public Comparable findMin( ) {
if( isEmpty( ) )
throw new UnderflowException( "Empty binary heap" );
return array[ 1 ];
}
/**
* Remove the smallest item from the priority queue.
* @return the smallest item.
* @throws UnderflowException if empty.
*/
public Comparable deleteMin( ) {
Comparable minItem = findMin( );
array[ 1 ] = array[ currentSize-- ];
percolateDown( 1 );
return minItem;
}
/**
* Establish heap order property from an arbitrary
* arrangement of items. Runs in linear time.
*/
private void buildHeap( ) {
for( int i = currentSize / 2; i > 0; i-- )
percolateDown( i );
}
/**
* Test if the priority queue is logically empty.
* @return true if empty, false otherwise.
*/
public boolean isEmpty( ) {
return currentSize == 0;
}
/**
* Returns size.
* @return current size.
*/
public int size( ) {
return currentSize;
}
/**
* Make the priority queue logically empty.
*/
public void makeEmpty( ) {
currentSize = 0;
}
private static final int DEFAULT_CAPACITY = 100;
private int currentSize; // Number of elements in heap
private Comparable [ ] array; // The heap array
/**
* Internal method to percolate down in the heap.
* @param hole the index at which the percolate begins.
*/
private void percolateDown( int hole ) {
int child;
Comparable tmp = array[ hole ];
for( ; hole * 2 <= currentSize; hole = child ) {
child = hole * 2;
if( child != currentSize &&
array[ child + 1 ].compareTo( array[ child ] ) < 0 )
child++;
if( array[ child ].compareTo( tmp ) < 0 )
array[ hole ] = array[ child ];
else
break;
}
array[ hole ] = tmp;
}
/**
* Internal method to extend array.
*/
private void doubleArray( ) {
Comparable [ ] newArray;
newArray = new Comparable[ array.length * 2 ];
for( int i = 0; i < array.length; i++ )
newArray[ i ] = array[ i ];
array = newArray;
}
// Test program
public static void main( String [ ] args ) {
int numItems = 10000;
BinaryHeap h1 = new BinaryHeap( );
Integer [ ] items = new Integer[ numItems - 1 ];
int i = 37;
int j;
for( i = 37, j = 0; i != 0; i = ( i + 37 ) % numItems, j++ ) {
h1.insert( new Integer( i ) );
items[ j ] = new Integer( i );
}
for( i = 1; i < numItems; i++ )
if( ((Integer)( h1.deleteMin( ) )).intValue( ) != i )
System.out.println( "Oops! " + i );
BinaryHeap h2 = new BinaryHeap( items );
for( i = 1; i < numItems; i++ )
if( ((Integer)( h2.deleteMin( ) )).intValue( ) != i )
System.out.println( "Oops! " + i );
}
}
// PriorityQueue interface
//
// ******************PUBLIC OPERATIONS*********************
// Position insert( x )    --> Insert x
// Comparable deleteMin( ) --> Return and remove smallest item
// Comparable findMin( )   --> Return smallest item
// boolean isEmpty( )      --> Return true if empty; else false
// void makeEmpty( )       --> Remove all items
// int size( )             --> Return size
// void decreaseKey( p, v) --> Decrease value in p to v
// ******************ERRORS********************************
// Throws UnderflowException for findMin and deleteMin when empty
/**
* PriorityQueue interface.
* Some priority queues may support a decreaseKey operation,
* but this is considered an advanced operation. If so,
* a Position is returned by insert.
* Note that all "matching" is based on the compareTo method.
* @author Bhavesh Gadoya
*/
public interface PriorityQueue {
/**
* The Position interface represents a type that can
* be used for the decreaseKey operation.
*/
public interface Position {
/**
* Returns the value stored at this position.
* @return the value stored at this position.
*/
Comparable getValue( );
}
/**
* Insert into the priority queue, maintaining heap order.
* Duplicates are allowed.
* @param x the item to insert.
* @return may return a Position useful for decreaseKey.
*/
Position insert( Comparable x );
/**
* Find the smallest item in the priority queue.
* @return the smallest item.
* @throws UnderflowException if empty.
*/
Comparable findMin( );
/**
* Remove the smallest item from the priority queue.
* @return the smallest item.
* @throws UnderflowException if empty.
*/
Comparable deleteMin( );
/**
* Test if the priority queue is logically empty.
* @return true if empty, false otherwise.
*/
boolean isEmpty( );
/**
* Make the priority queue logically empty.
*/
void makeEmpty( );
/**
* Returns the size.
* @return current size.
*/
int size( );
/**
* Change the value of the item stored in the pairing heap.
* This is considered an advanced operation and might not
* be supported by all priority queues. A priority queue
* will signal its intention to not support decreaseKey by
* having insert return null consistently.
* @param p any non-null Position returned by insert.
* @param newVal the new value, which must be smaller
* than the currently stored value.
* @throws IllegalArgumentException if p invalid.
* @throws UnsupportedOperationException if appropriate.
*/
void decreaseKey( Position p, Comparable newVal );
}
/**
* Exception class for access in empty containers
* such as stacks, queues, and priority queues.
* @author Bhavesh Gadoya
*/
public class UnderflowException extends RuntimeException {
/**
* Construct this exception object.
* @param message the error message.
*/
public UnderflowException( String message ) {
super( message );
}
}
Built-in Java implementation of PriorityQueue :
package com.journaldev.collections;

public class Customer {
    private int id;
    private String name;

    public Customer(int i, String n) {
        this.id = i;
        this.name = n;
    }

    public int getId() {
        return id;
    }

    public String getName() {
        return name;
    }
}
We will use java random number generation to generate random customer objects. For natural ordering, I will use Integer that is also a java wrapper class.
Here is our final test code that shows how to use priority queue in java.
package com.journaldev.collections;

import java.util.Comparator;
import java.util.PriorityQueue;
import java.util.Queue;
import java.util.Random;

public class PriorityQueueExample {

    public static void main(String[] args) {
        // natural ordering example of priority queue
        Queue<Integer> integerPriorityQueue = new PriorityQueue<>(7);
        Random rand = new Random();
        for (int i = 0; i < 7; i++) {
            integerPriorityQueue.add(new Integer(rand.nextInt(100)));
        }
        for (int i = 0; i < 7; i++) {
            Integer in = integerPriorityQueue.poll();
            System.out.println("Processing Integer:" + in);
        }

        // PriorityQueue example with Comparator
        Queue<Customer> customerPriorityQueue = new PriorityQueue<>(7, idComparator);
        addDataToQueue(customerPriorityQueue);
        pollDataFromQueue(customerPriorityQueue);
    }

    // Comparator anonymous class implementation
    public static Comparator<Customer> idComparator = new Comparator<Customer>() {
        @Override
        public int compare(Customer c1, Customer c2) {
            return (int) (c1.getId() - c2.getId());
        }
    };

    // utility method to add random data to Queue
    private static void addDataToQueue(Queue<Customer> customerPriorityQueue) {
        Random rand = new Random();
        for (int i = 0; i < 7; i++) {
            int id = rand.nextInt(100);
            customerPriorityQueue.add(new Customer(id, "Pankaj " + id));
        }
    }

    // utility method to poll data from queue
    private static void pollDataFromQueue(Queue<Customer> customerPriorityQueue) {
        while (true) {
            Customer cust = customerPriorityQueue.poll();
            if (cust == null)
                break;
            System.out.println("Processing Customer with ID=" + cust.getId());
        }
    }
}