describe 'employee'
{NAME => 'address', BLOOMFILTER => 'ROW', VERSIONS => '5', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'GZ', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
{NAME => 'personal_info', BLOOMFILTER => 'ROW', VERSIONS => '2', IN_MEMORY => 'true', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'GZ', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
{NAME => 'professional_info', BLOOMFILTER => 'ROW', VERSIONS => '3', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'GZ', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
hbase> alter_async 't1', NAME => 'f1', METHOD => 'delete'
or a shorter version:
hbase> alter_async 't1', 'delete' => 'f1'
You can also change table-scope attributes like MAX_FILESIZE, MEMSTORE_FLUSHSIZE, READONLY, and DEFERRED_LOG_FLUSH. For example, to change the maximum file size of a region to 128MB, do:
hbase> alter 't1', METHOD => 'table_att', MAX_FILESIZE => '134217728'
There can be more than one alteration in one command:
hbase> alter 't1', {NAME => 'f1'}, {NAME => 'f2', METHOD => 'delete'}
To check if all the regions have been updated, use alter_status <table_name>.
Reference :
https://learnhbase.wordpress.com/2013/03/02/hbaseshellcommands/
https://www.cloudera.com/documentation/enterprise/55x/topics/admin_hbase_filtering.html
A distributed matrix has long-typed row and column indices and double-typed values, stored distributively in one or more RDDs. It is very important to choose the right format to store large and distributed matrices. Converting a distributed matrix to a different format may require a global shuffle, which is quite expensive. Four types of distributed matrices have been implemented so far.
A RowMatrix can be created from an RDD[Vector] instance. Then we can compute its column summary statistics and decompositions. QR decomposition is of the form A = QR, where Q is an orthogonal matrix and R is an upper triangular matrix.
Example :
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.rdd.RDD
val rows: RDD[Vector] = sc.parallelize(Seq(Vectors.dense(1.0, 0.0, 3.0), Vectors.dense(2.0, 30.0, 4.0)))
O/p : rows: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = ParallelCollectionRDD[2] at parallelize at <console>:26
val mat: RowMatrix = new RowMatrix(rows)
// Get its size.
val m = mat.numRows()
val n = mat.numCols()
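Since the text above mentions QR decomposition, here is a minimal sketch of computing it on the same RowMatrix; tallSkinnyQR is part of the RowMatrix API, and the computeQ flag controls whether the Q factor is materialized:
// Compute the QR decomposition of mat (A = QR); R is returned as a local matrix.
val qrResult = mat.tallSkinnyQR(computeQ = true)
println(qrResult.R)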
IndexedRowMatrix :
It is similar to RowMatrix, but each row is represented by its index and a local vector. An IndexedRowMatrix can be created from an RDD[IndexedRow] instance, where IndexedRow is a wrapper over (Long, Vector).
Example :
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix, RowMatrix}
import org.apache.spark.rdd.RDD
val rows: RDD[IndexedRow] = sc.parallelize(Seq(IndexedRow(0, Vectors.dense(1.0, 2.0, 3.0)), IndexedRow(1, Vectors.dense(4.0, 5.0, 6.0))))
// Create an IndexedRowMatrix from an RDD[IndexedRow].
val mat: IndexedRowMatrix = new IndexedRowMatrix(rows)
// Get its size.
val m = mat.numRows()
val n = mat.numCols()
// Drop its row indices.
val rowMat: RowMatrix = mat.toRowMatrix()
CoordinateMatrix :
A CoordinateMatrix is a distributed matrix backed by an RDD of its entries, where each entry is a MatrixEntry wrapper over (Long, Long, Double). It should be used when both dimensions of the matrix are huge and the matrix is very sparse.
Example :
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
val entries: RDD[MatrixEntry] = sc.parallelize(Seq(MatrixEntry(0,1,1.2),MatrixEntry(0,2,1.5),MatrixEntry(0,3,2.5)))
// Create a CoordinateMatrix from an RDD[MatrixEntry].
val mat: CoordinateMatrix = new CoordinateMatrix(entries)
// Get its size.
val m = mat.numRows()
val n = mat.numCols()
// Convert it to an IndexedRowMatrix whose rows are sparse vectors.
val indexedRowMatrix = mat.toIndexedRowMatrix()
BlockMatrix :
When we want to perform multiple matrix additions and multiplications, where each sub-matrix is backed by its index, we use BlockMatrix. A BlockMatrix can be most easily created from an IndexedRowMatrix or a CoordinateMatrix by calling toBlockMatrix, which creates blocks of size 1024 x 1024 by default.
Example :
import org.apache.spark.mllib.linalg.distributed.{BlockMatrix, CoordinateMatrix, MatrixEntry}
val entries: RDD[MatrixEntry] = sc.parallelize(Seq(MatrixEntry(0,1,1.2),MatrixEntry(0,2,1.5),MatrixEntry(0,3,2.5)))
// Create a CoordinateMatrix from an RDD[MatrixEntry].
val coordMat: CoordinateMatrix = new CoordinateMatrix(entries)
// Transform the CoordinateMatrix to a BlockMatrix
val matA: BlockMatrix = coordMat.toBlockMatrix().cache()
// Validate whether the BlockMatrix is set up properly. Throws an Exception when it is not valid.
// Nothing happens if it is valid.
matA.validate()
// Calculate A^T A.
val ata = matA.transpose.multiply(matA)
A local vector has integer-typed, 0-based indices and double-typed values. There are 2 types of local vectors:
1. A dense vector is backed by a double array representing its entry values. Vectors.dense creates a dense vector from a double array.
Example :
val dv: Vector = Vectors.dense(1.0, 0.0, 3.0)
o/p : dv: org.apache.spark.mllib.linalg.Vector = [1.0,0.0,3.0]
2. A sparse vector is backed by two parallel arrays: indices and values. Vectors.sparse creates a sparse vector given its size, an index array, and a value array. We can also declare a sparse vector in another way, with the size as the 1st parameter and a Seq of unordered (index, value) pairs as the 2nd.
Example :
val sv1: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))
o/p : sv1: org.apache.spark.mllib.linalg.Vector = (3,[0,2],[1.0,3.0])
val sv2: Vector = Vectors.sparse(3, Seq((0, 1.0), (2, 3.0)))
o/p : sv2: org.apache.spark.mllib.linalg.Vector = (3,[0,2],[1.0,3.0])
3. Labeled Point :
A labeled point is a local vector, either dense or sparse, associated with a label/response. In MLlib, labeled points are used in supervised learning algorithms. We use a double to store a label, so we can use labeled points in both regression and classification. For binary classification, a label should be either 0 (negative) or 1 (positive). For multiclass classification, labels should be class indices starting from zero: 0, 1, 2, .... A labeled point is represented by the case class LabeledPoint.
import org.apache.spark.mllib.regression.LabeledPoint
// Create a labeled point with a positive label and a dense feature vector.
val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))
// Create a labeled point with a negative label and a sparse feature vector.
val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))
Matrix
Spark MLlib supports 2 types of matrices: 1. Local matrix and 2. Distributed matrix.
Local Matrix :
The base class of local matrices is Matrix, and there are two implementations: DenseMatrix and SparseMatrix. We recommend using the factory methods implemented in Matrices to create local matrices. Remember, local matrices in MLlib are stored in column-major order. Matrices.dense creates a column-major dense matrix.
Example :
val dm: Matrix = Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))
o/p :
dm: org.apache.spark.mllib.linalg.Matrix =
1.0 2.0
3.0 4.0
5.0 6.0
Matrices.sparse creates a column-major sparse matrix in Compressed Sparse Column (CSC) format. Its arguments are:
numRows -> number of rows in the matrix
numCols -> number of columns in the matrix
colPtrs -> the index corresponding to the start of each new column
rowIndices -> the row index of each element, in column-major order
values -> the values, as doubles
Example : the matrix
1.0 0.0 4.0
0.0 3.0 5.0
2.0 0.0 6.0
is stored as values: [1.0, 2.0, 3.0, 4.0, 5.0, 6.0], rowIndices = [0, 2, 1, 0, 1, 2], colPtrs = [0, 2, 3, 6].
Another example :
val sm: Matrix = Matrices.sparse(3, 2, Array(0, 1, 3), Array(0, 2, 1), Array(9, 6, 8))
O/p :
sm: org.apache.spark.mllib.linalg.Matrix =
3 x 2 CSCMatrix
(0,0) 9.0
(2,1) 6.0
(1,1) 8.0
We will see distributed Matrix in next session.
Here is a simple example of the collectAsync() method; it also demonstrates how we can run Scala scripts containing object-oriented code through spark-shell.
collectAsync() returns a FutureAction that will hold an object of type Seq[Int] in the future, so it allows multiple operations to run in parallel and supports non-blocking IO.
Test.scala :
import org.apache.spark.FutureAction
import org.apache.spark.{SPARK_VERSION, SparkContext}
import scala.concurrent.Future

object Test {
  def main(args: Array[String]) {
    //val sc1 = new SparkContext()
    val data = sc.parallelize(1 to 250, 1)
    var futureData: FutureAction[Seq[Int]] = data.collectAsync()
    while (!futureData.isCompleted) {
      println("Hello World")
    }
    var itr: Iterator[Int] = data.toLocalIterator
    while (itr.hasNext) {
      println(itr.next())
    }
  }
}
Test.main(null) /* This line is part of scala file Test.scala */
Now, in order to run the above program, you have to use the following command :
sudo spark-shell -i Test.scala
So in this case spark-shell takes the Test.scala file and compiles it; after that, Test.main(null) gets called, which invokes the main method of the Test object. One thing to make sure of: do not create another SparkContext in your code; you can reuse the sparkContext of the present spark-shell session.
Queue follows the First-In-First-Out (FIFO) model, but sometimes we need to process the objects in the queue based on priority. That is when the Java PriorityQueue is used.
For example, let's say we have an application that generates stock reports for the daily trading session. This application processes a lot of data and takes time to process it. Customers send requests to the application, and the requests get queued, but we want to process premium customers first and standard customers after them. In this case the PriorityQueue implementation in Java can be really helpful.
PriorityQueue is an unbounded queue based on a priority heap and the elements of the priority queue are ordered by default in natural order. We can provide a Comparator for ordering at the time of instantiation of priority queue.
Java PriorityQueue doesn't allow null values, and we can't create a PriorityQueue of objects that are non-comparable. We use Java Comparable and Comparator for sorting objects, and PriorityQueue uses them for priority processing of its elements.
The simplest way to implement a priority queue data type is to keep an associative array mapping each priority to a list of elements with that priority. If association lists or hash tables are used to implement the associative array, adding an element takes constant time but removing or peeking at the element of highest priority takes linear (O(n)) time, because we must search all keys for the largest one. If a self-balancing binary search tree is used, all three operations take O(log n) time; this is a popular solution in environments that already provide balanced trees but nothing more sophisticated.
There are a number of specialized heap data structures that either supply additional operations or outperform the above approaches. The binary heap uses O(log n) time for both operations, but allows peeking at the element of highest priority without removing it in constant time. Binomial heaps add several more operations, but require O(log n) time for peeking. Fibonacci heaps can insert elements, peek at the maximum priority element, and decrease an element’s priority in amortized constant time (deletions are still O(log n)).
// BinaryHeap class
//
// CONSTRUCTION: empty or with initial array.
//
// ******************PUBLIC OPERATIONS*********************
// void insert( x )       --> Insert x
// Comparable deleteMin( )--> Return and remove smallest item
// Comparable findMin( )  --> Return smallest item
// boolean isEmpty( )     --> Return true if empty; else false
// void makeEmpty( )      --> Remove all items
// ******************ERRORS********************************
// Throws UnderflowException for findMin and deleteMin when empty
/**
* Implements a binary heap.
* Note that all "matching" is based on the compareTo method.
* @author Bhavesh Gadoya
*/
public class BinaryHeap implements PriorityQueue {
/**
* Construct the binary heap.
*/
public BinaryHeap( ) {
currentSize = 0;
array = new Comparable[ DEFAULT_CAPACITY + 1 ];
}
/**
* Construct the binary heap from an array.
* @param items the initial items in the binary heap.
*/
public BinaryHeap( Comparable [ ] items ) {
currentSize = items.length;
array = new Comparable[ items.length + 1 ];
for( int i = 0; i < items.length; i++ )
array[ i + 1 ] = items[ i ];
buildHeap( );
}
/**
* Insert into the priority queue.
* Duplicates are allowed.
* @param x the item to insert.
* @return null, signifying that decreaseKey cannot be used.
*/
public PriorityQueue.Position insert( Comparable x ) {
if( currentSize + 1 == array.length )
doubleArray( );
// Percolate up
int hole = ++currentSize;
array[ 0 ] = x;
for( ; x.compareTo( array[ hole / 2 ] ) < 0; hole /= 2 )
array[ hole ] = array[ hole / 2 ];
array[ hole ] = x;
return null;
}
/**
* @throws UnsupportedOperationException because no Positions are returned
* by the insert method for BinaryHeap.
*/
public void decreaseKey( PriorityQueue.Position p, Comparable newVal ) {
throw new UnsupportedOperationException( "Cannot use decreaseKey for binary heap" );
}
/**
* Find the smallest item in the priority queue.
* @return the smallest item.
* @throws UnderflowException if empty.
*/
public Comparable findMin( ) {
if( isEmpty( ) )
throw new UnderflowException( "Empty binary heap" );
return array[ 1 ];
}
/**
* Remove the smallest item from the priority queue.
* @return the smallest item.
* @throws UnderflowException if empty.
*/
public Comparable deleteMin( ) {
Comparable minItem = findMin( );
array[ 1 ] = array[ currentSize-- ];
percolateDown( 1 );
return minItem;
}
/**
* Establish heap order property from an arbitrary
* arrangement of items. Runs in linear time.
*/
private void buildHeap( ) {
for( int i = currentSize / 2; i > 0; i-- )
percolateDown( i );
}
/**
* Test if the priority queue is logically empty.
* @return true if empty, false otherwise.
*/
public boolean isEmpty( ) {
return currentSize == 0;
}
/**
* Returns size.
* @return current size.
*/
public int size( ) {
return currentSize;
}
/**
* Make the priority queue logically empty.
*/
public void makeEmpty( ) {
currentSize = 0;
}
private static final int DEFAULT_CAPACITY = 100;
private int currentSize; // Number of elements in heap
private Comparable [ ] array; // The heap array
/**
* Internal method to percolate down in the heap.
* @param hole the index at which the percolate begins.
*/
private void percolateDown( int hole ) {
int child;
Comparable tmp = array[ hole ];
for( ; hole * 2 <= currentSize; hole = child ) {
child = hole * 2;
if( child != currentSize &&
array[ child + 1 ].compareTo( array[ child ] ) < 0 )
child++;
if( array[ child ].compareTo( tmp ) < 0 )
array[ hole ] = array[ child ];
else
break;
}
array[ hole ] = tmp;
}
/**
* Internal method to extend array.
*/
private void doubleArray( ) {
Comparable [ ] newArray;
newArray = new Comparable[ array.length * 2 ];
for( int i = 0; i < array.length; i++ )
newArray[ i ] = array[ i ];
array = newArray;
}
// Test program
public static void main( String [ ] args ) {
int numItems = 10000;
BinaryHeap h1 = new BinaryHeap( );
Integer [ ] items = new Integer[ numItems - 1 ];
int i = 37;
int j;
for( i = 37, j = 0; i != 0; i = ( i + 37 ) % numItems, j++ ) {
h1.insert( new Integer( i ) );
items[ j ] = new Integer( i );
}
for( i = 1; i < numItems; i++ )
if( ((Integer)( h1.deleteMin( ) )).intValue( ) != i )
System.out.println( "Oops! " + i );
BinaryHeap h2 = new BinaryHeap( items );
for( i = 1; i < numItems; i++ )
if( ((Integer)( h2.deleteMin( ) )).intValue( ) != i )
System.out.println( "Oops! " + i );
}
}
// PriorityQueue interface
//
// ******************PUBLIC OPERATIONS*********************
// Position insert( x )   --> Insert x
// Comparable deleteMin( )--> Return and remove smallest item
// Comparable findMin( )  --> Return smallest item
// boolean isEmpty( )     --> Return true if empty; else false
// void makeEmpty( )      --> Remove all items
// int size( )            --> Return size
// void decreaseKey( p, v)--> Decrease value in p to v
// ******************ERRORS********************************
// Throws UnderflowException for findMin and deleteMin when empty
/**
* PriorityQueue interface.
* Some priority queues may support a decreaseKey operation,
* but this is considered an advanced operation. If so,
* a Position is returned by insert.
* Note that all "matching" is based on the compareTo method.
* @author Bhavesh Gadoya
*/
public interface PriorityQueue {
/**
* The Position interface represents a type that can
* be used for the decreaseKey operation.
*/
public interface Position {
/**
* Returns the value stored at this position.
* @return the value stored at this position.
*/
Comparable getValue( );
}
/**
* Insert into the priority queue, maintaining heap order.
* Duplicates are allowed.
* @param x the item to insert.
* @return may return a Position useful for decreaseKey.
*/
Position insert( Comparable x );
/**
* Find the smallest item in the priority queue.
* @return the smallest item.
* @throws UnderflowException if empty.
*/
Comparable findMin( );
/**
* Remove the smallest item from the priority queue.
* @return the smallest item.
* @throws UnderflowException if empty.
*/
Comparable deleteMin( );
/**
* Test if the priority queue is logically empty.
* @return true if empty, false otherwise.
*/
boolean isEmpty( );
/**
* Make the priority queue logically empty.
*/
void makeEmpty( );
/**
* Returns the size.
* @return current size.
*/
int size( );
/**
* Change the value of the item stored in the pairing heap.
* This is considered an advanced operation and might not
* be supported by all priority queues. A priority queue
* will signal its intention to not support decreaseKey by
* having insert return null consistently.
* @param p any nonnull Position returned by insert.
* @param newVal the new value, which must be smaller
* than the currently stored value.
* @throws IllegalArgumentException if p invalid.
* @throws UnsupportedOperationException if appropriate.
*/
void decreaseKey( Position p, Comparable newVal );
}
/**
* Exception class for access in empty containers
* such as stacks, queues, and priority queues.
* @author Bhavesh Gadoya
*/
public class UnderflowException extends RuntimeException {
/**
* Construct this exception object.
* @param message the error message.
*/
public UnderflowException( String message ) {
super( message );
}
}
Inbuilt Java implementation of PriorityQueue :
package com.journaldev.collections;

public class Customer {

    private int id;
    private String name;

    public Customer(int i, String n){
        this.id = i;
        this.name = n;
    }

    public int getId() {
        return id;
    }

    public String getName() {
        return name;
    }
}
We will use java random number generation to generate random customer objects. For natural ordering, I will use Integer that is also a java wrapper class.
Here is our final test code that shows how to use priority queue in java.
package com.journaldev.collections;

import java.util.Comparator;
import java.util.PriorityQueue;
import java.util.Queue;
import java.util.Random;

public class PriorityQueueExample {

    public static void main(String[] args) {
        // natural ordering example of priority queue
        Queue<Integer> integerPriorityQueue = new PriorityQueue<>(7);
        Random rand = new Random();
        for(int i = 0; i < 7; i++){
            integerPriorityQueue.add(new Integer(rand.nextInt(100)));
        }
        for(int i = 0; i < 7; i++){
            Integer in = integerPriorityQueue.poll();
            System.out.println("Processing Integer:" + in);
        }

        // PriorityQueue example with Comparator
        Queue<Customer> customerPriorityQueue = new PriorityQueue<>(7, idComparator);
        addDataToQueue(customerPriorityQueue);
        pollDataFromQueue(customerPriorityQueue);
    }

    // Comparator anonymous class implementation
    public static Comparator<Customer> idComparator = new Comparator<Customer>(){
        @Override
        public int compare(Customer c1, Customer c2) {
            return (int) (c1.getId() - c2.getId());
        }
    };

    // utility method to add random data to Queue
    private static void addDataToQueue(Queue<Customer> customerPriorityQueue) {
        Random rand = new Random();
        for(int i = 0; i < 7; i++){
            int id = rand.nextInt(100);
            customerPriorityQueue.add(new Customer(id, "Pankaj " + id));
        }
    }

    // utility method to poll data from queue
    private static void pollDataFromQueue(Queue<Customer> customerPriorityQueue) {
        while(true){
            Customer cust = customerPriorityQueue.poll();
            if(cust == null) break;
            System.out.println("Processing Customer with ID=" + cust.getId());
        }
    }
}
Spark GraphX is a graph processing framework built on top of Spark.
GraphX models graphs as property graphs where vertices and edges can have properties.
GraphX comes with its own package org.apache.spark.graphx. The Graph abstract class represents a collection of vertices and edges:
abstract class Graph[VD: ClassTag, ED: ClassTag]
The vertices attribute is of type VertexRDD while edges is of type EdgeRDD.
The Graph class comes with a small set of API:
Transformations
mapVertices
mapEdges
mapTriplets
reverse
subgraph
mask
groupEdges
Joins
outerJoinVertices
Computation
aggregateMessages
The Graph object comes with the following factory methods to create instances of Graph:
fromEdgeTuples
fromEdges
apply
Package summary for Apache Spark GraphX :
https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/graphx/package-summary.html
| Class | Description |
| --- | --- |
| Edge | A single directed edge consisting of a source id, target id, and the data associated with the edge. |
| EdgeContext | Represents an edge along with its neighboring vertices and allows sending messages along the edge. |
| EdgeDirection | The direction of a directed edge relative to a vertex. |
| EdgeRDD | EdgeRDD[ED, VD] extends RDD[Edge[ED]] by storing the edges in columnar format on each partition for performance. |
| EdgeTriplet | An edge triplet represents an edge along with the vertex attributes of its neighboring vertices. |
| Graph | The Graph abstractly represents a graph with arbitrary objects associated with vertices and edges. |
| GraphKryoRegistrator | Registers GraphX classes with Kryo for improved performance. |
| GraphLoader | Provides utilities for loading Graphs from files. |
| GraphOps | Contains additional functionality for Graph. |
| GraphXUtils | |
| PartitionStrategy.CanonicalRandomVertexCut$ | Assigns edges to partitions by hashing the source and destination vertex IDs in a canonical direction, resulting in a random vertex cut that colocates all edges between two vertices, regardless of direction. |
| PartitionStrategy.EdgePartition1D$ | Assigns edges to partitions using only the source vertex ID, colocating edges with the same source. |
| PartitionStrategy.EdgePartition2D$ | Assigns edges to partitions using a 2D partitioning of the sparse edge adjacency matrix, guaranteeing a 2 * sqrt(numParts) - 1 bound on vertex replication. |
| PartitionStrategy.RandomVertexCut$ | Assigns edges to partitions by hashing the source and destination vertex IDs, resulting in a random vertex cut that colocates all same-direction edges between two vertices. |
| Pregel | Implements a Pregel-like bulk-synchronous message-passing API. |
| TripletFields | Represents a subset of the fields of an EdgeTriplet or EdgeContext. |
| VertexRDD | Extends RDD[(VertexId, VD)] by ensuring that there is only one entry for each vertex and by pre-indexing the entries for fast, efficient joins. |
Suppose we want to construct a property graph consisting of the various collaborators on the GraphX project. The vertex property might contain the username and occupation. We could annotate edges with a string describing the relationships between collaborators:
The resulting graph would have the type signature:
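The original snippet is not reproduced here; based on the vertex and edge properties described above, a sketch of the type signature (with (String, String) for the username/occupation pair and String for the relationship) would be:
val userGraph: Graph[(String, String), String]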
There are numerous ways to construct a property graph from raw files, RDDs, and even synthetic generators and these are discussed in more detail in the section on graph builders. Probably the most general method is to use the Graph object. For example the following code constructs a graph from a collection of RDDs:
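The original code block is missing here; the following sketch follows the standard GraphX collaborators example, so the vertex ids, names, and relationship strings are illustrative assumptions. Later snippets in this section assume this graph value:
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
// Create an RDD for the vertices: (id, (username, occupation))
val users: RDD[(VertexId, (String, String))] =
  sc.parallelize(Seq((3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
                     (5L, ("franklin", "prof")), (2L, ("istoica", "prof"))))
// Create an RDD for the edges, annotated with a relationship string
val relationships: RDD[Edge[String]] =
  sc.parallelize(Seq(Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"),
                     Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
// Define a default vertex attribute in case an edge refers to a missing user
val defaultUser = ("John Doe", "Missing")
// Build the initial Graph
val graph: Graph[(String, String), String] = Graph(users, relationships, defaultUser)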
In the above example we make use of the Edge case class. Edges have a srcId and a dstId corresponding to the source and destination vertex identifiers. In addition, the Edge class has an attr member which stores the edge property.
We can deconstruct a graph into the respective vertex and edge views by using the graph.vertices and graph.edges members respectively.
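For instance, assuming the graph built above, the two views can be queried directly (a sketch):
// Count all users which are postdocs
graph.vertices.filter { case (id, (name, pos)) => pos == "postdoc" }.count
// Count all the edges where the source id is greater than the destination id
graph.edges.filter(e => e.srcId > e.dstId).count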
Note that graph.vertices returns a VertexRDD[(String, String)] which extends RDD[(VertexId, (String, String))], and so we use the Scala case expression to deconstruct the tuple. On the other hand, graph.edges returns an EdgeRDD containing Edge[String] objects. We could have also used the case class type constructor as in the following:
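A sketch of the same edge count written with the Edge case class constructor:
graph.edges.filter { case Edge(src, dst, prop) => src > dst }.count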
In addition to the vertex and edge views of the property graph, GraphX also exposes a triplet view. The triplet view logically joins the vertex and edge properties, yielding an RDD[EdgeTriplet[VD, ED]] containing instances of the EdgeTriplet class. Conceptually, this join matches each edge with the attributes of its source and destination vertices, much like a three-way relational join of the edge table with the vertex table on the source and destination vertex ids.
The EdgeTriplet class extends the Edge class by adding the srcAttr and dstAttr members which contain the source and destination properties respectively. We can use the triplet view of a graph to render a collection of strings describing relationships between users.
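A sketch of rendering relationship strings from the triplet view, assuming the graph built above (vertex attributes are (username, occupation) pairs):
val facts: RDD[String] =
  graph.triplets.map(triplet =>
    triplet.srcAttr._1 + " is the " + triplet.attr + " of " + triplet.dstAttr._1)
facts.collect.foreach(println(_))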
Just as RDDs have basic operations like map, filter, and reduceByKey, property graphs also have a collection of basic operators that take user-defined functions and produce new graphs with transformed properties and structure. The core operators that have optimized implementations are defined in Graph, and convenient operators that are expressed as compositions of the core operators are defined in GraphOps. However, thanks to Scala implicits, the operators in GraphOps are automatically available as members of Graph. For example, we can compute the in-degree of each vertex (defined in GraphOps) by the following:
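A one-line sketch, using the graph built above:
val inDegrees: VertexRDD[Int] = graph.inDegrees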
The reason for differentiating between core graph operations and GraphOps is to be able to support different graph representations in the future. Each graph representation must provide implementations of the core operations and can reuse many of the useful operations defined in GraphOps.
The following is a quick summary of the functionality defined in both Graph and GraphOps, but presented as members of Graph for simplicity. Note that some function signatures have been simplified (e.g., default arguments and type constraints removed) and some more advanced functionality has been removed, so please consult the API docs for the official list of operations.
Like the RDD map operator, the property graph contains the following:
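The original listing is not included; a simplified sketch of the property operator signatures is:
class Graph[VD, ED] {
  def mapVertices[VD2](map: (VertexId, VD) => VD2): Graph[VD2, ED]
  def mapEdges[ED2](map: Edge[ED] => ED2): Graph[VD, ED2]
  def mapTriplets[ED2](map: EdgeTriplet[VD, ED] => ED2): Graph[VD, ED2]
}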
Each of these operators yields a new graph with the vertex or edge properties modified by the user-defined map function.
Note that in each case the graph structure is unaffected. This is a key feature of these operators which allows the resulting graph to reuse the structural indices of the original graph. The following snippets are logically equivalent, but the first one does not preserve the structural indices and would not benefit from the GraphX system optimizations:
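A sketch of the non-preserving variant (mapUdf is a hypothetical user-defined function):
// Does NOT preserve the structural indices: rebuilds the graph from a plain RDD
val newVertices = graph.vertices.map { case (id, attr) => (id, mapUdf(id, attr)) }
val newGraph = Graph(newVertices, graph.edges)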
Instead, use mapVertices to preserve the indices:
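A sketch of the index-preserving version (again with the hypothetical mapUdf):
val newGraph = graph.mapVertices((id, attr) => mapUdf(id, attr))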
These operators are often used to initialize the graph for a particular computation or project away unnecessary properties. For example, given a graph with the out degrees as the vertex properties (we describe how to construct such a graph later), we initialize it for PageRank:
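A sketch of that PageRank initialization, assuming inputGraph is a graph whose vertex property is the out-degree (outerJoinVertices is described in the join operators section below):
// Given the collaborators graph, build a graph whose vertex property is the out degree
val inputGraph: Graph[Int, String] =
  graph.outerJoinVertices(graph.outDegrees)((vid, _, degOpt) => degOpt.getOrElse(0))
// Construct a graph where each edge contains the weight
// and each vertex is the initial PageRank
val outputGraph: Graph[Double, Double] =
  inputGraph.mapTriplets(triplet => 1.0 / triplet.srcAttr).mapVertices((id, _) => 1.0)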
Currently GraphX supports only a simple set of commonly used structural operators and we expect to add more in the future. The following is a list of the basic structural operators.
The reverse operator returns a new graph with all the edge directions reversed. This can be useful when, for example, trying to compute the inverse PageRank. Because the reverse operation does not modify vertex or edge properties or change the number of edges, it can be implemented efficiently without data movement or duplication.
The subgraph operator takes vertex and edge predicates and returns the graph containing only the vertices that satisfy the vertex predicate (evaluate to true) and edges that satisfy the edge predicate and connect vertices that satisfy the vertex predicate. The subgraph operator can be used in a number of situations to restrict the graph to the vertices and edges of interest or to eliminate broken links. For example, in the following code we remove broken links:
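A sketch, assuming the collaborators graph built earlier, where missing users were given the occupation "Missing":
// Remove missing vertices as well as the edges connected to them
val validGraph = graph.subgraph(vpred = (id, attr) => attr._2 != "Missing")
validGraph.vertices.collect.foreach(println(_))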
Note in the above example only the vertex predicate is provided. The subgraph operator defaults to true if the vertex or edge predicates are not provided.
The mask operator constructs a subgraph by returning a graph that contains the vertices and edges that are also found in the input graph. This can be used in conjunction with the subgraph operator to restrict a graph based on the properties in another related graph. For example, we might run connected components using the graph with missing vertices and then restrict the answer to the valid subgraph.
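A sketch of that pattern, again using the collaborators graph:
// Run connected components on the full graph
val ccGraph = graph.connectedComponents()
// Remove vertices with the "Missing" occupation as well as the edges connected to them
val validGraph = graph.subgraph(vpred = (id, attr) => attr._2 != "Missing")
// Restrict the connected-components answer to the valid subgraph
val validCCGraph = ccGraph.mask(validGraph)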
The groupEdges operator merges parallel edges (i.e., duplicate edges between pairs of vertices) in the multigraph. In many numerical applications, parallel edges can be added (their weights combined) into a single edge, thereby reducing the size of the graph.
In many cases it is necessary to join data from external collections (RDDs) with graphs. For example, we might have extra user properties that we want to merge with an existing graph or we might want to pull vertex properties from one graph into another. These tasks can be accomplished using the join operators. Below we list the key join operators:
The joinVertices operator joins the vertices with the input RDD and returns a new graph with the vertex properties obtained by applying the user-defined map function to the result of the joined vertices. Vertices without a matching value in the RDD retain their original value.
Note that if the RDD contains more than one value for a given vertex only one will be used. It is therefore recommended that the input RDD be made unique using the following, which will also pre-index the resulting values to substantially accelerate the subsequent join.
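A sketch of that pattern; the cost values are illustrative, costGraph is a hypothetical Graph[Double, _] whose vertex property is a running cost, and aggregateUsingIndex pre-indexes the RDD against the graph's vertices:
val nonUniqueCosts: RDD[(VertexId, Double)] =
  sc.parallelize(Seq((3L, 1.0), (3L, 2.0), (5L, 0.5)))
// Collapse duplicate vertex ids and pre-index against the graph's vertices
val uniqueCosts: VertexRDD[Double] =
  costGraph.vertices.aggregateUsingIndex(nonUniqueCosts, (a, b) => a + b)
// Add the extra cost onto each vertex's existing cost
val joinedGraph = costGraph.joinVertices(uniqueCosts)(
  (id, oldCost, extraCost) => oldCost + extraCost)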
The more general outerJoinVertices behaves similarly to joinVertices except that the user-defined map function is applied to all vertices and can change the vertex property type. Because not all vertices may have a matching value in the input RDD, the map function takes an Option type. For example, we can set up a graph for PageRank by initializing vertex properties with their outDegree.
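A sketch, using the collaborators graph; vertices with no out-degree get zero:
val outDegrees: VertexRDD[Int] = graph.outDegrees
val degreeGraph = graph.outerJoinVertices(outDegrees) { (id, oldAttr, outDegOpt) =>
  outDegOpt match {
    case Some(outDeg) => outDeg
    case None => 0 // No outDegree means zero outDegree
  }
}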
You may have noticed the multiple parameter lists (e.g., f(a)(b)) curried function pattern used in the above examples. While we could have equally written f(a)(b) as f(a,b), this would mean that type inference on b would not depend on a. As a consequence, the user would need to provide type annotation for the user-defined function:
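A sketch of what the joinVertices call above would look like in an uncurried form with explicit type annotations (illustrative only; the real API is curried):
val joinedGraph = costGraph.joinVertices(uniqueCosts,
  (id: VertexId, oldCost: Double, extraCost: Double) => oldCost + extraCost)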
A key step in many graph analytics tasks is aggregating information about the neighborhood of each vertex. For example, we might want to know the number of followers each user has or the average age of the followers of each user. Many iterative graph algorithms (e.g., PageRank, Shortest Path, and connected components) repeatedly aggregate properties of neighboring vertices (e.g., current PageRank value, shortest path to the source, and smallest reachable vertex id).
To improve performance, the primary aggregation operator changed from graph.mapReduceTriplets to the new graph.aggregateMessages. While the changes in the API are relatively small, we provide a transition guide below.
The core aggregation operation in GraphX is aggregateMessages. This operator applies a user-defined sendMsg function to each edge triplet in the graph and then uses the mergeMsg function to aggregate those messages at their destination vertex.
The user-defined sendMsg function takes an EdgeContext, which exposes the source and destination attributes along with the edge attribute and functions (sendToSrc and sendToDst) to send messages to the source and destination vertices. Think of sendMsg as the map function in map-reduce. The user-defined mergeMsg function takes two messages destined to the same vertex and yields a single message. Think of mergeMsg as the reduce function in map-reduce. The aggregateMessages operator returns a VertexRDD[Msg] containing the aggregate message (of type Msg) destined to each vertex. Vertices that did not receive a message are not included in the returned VertexRDD.
In addition, aggregateMessages takes an optional tripletFields argument which indicates what data is accessed in the EdgeContext (e.g., the source vertex attribute but not the destination vertex attribute). The possible options for tripletFields are defined in TripletFields, and the default value is TripletFields.All, which indicates that the user-defined sendMsg function may access any of the fields in the EdgeContext. The tripletFields argument can be used to notify GraphX that only part of the EdgeContext will be needed, allowing GraphX to select an optimized join strategy. For example, if we are computing the average age of the followers of each user, we would only require the source field, and so we would use TripletFields.Src to indicate that we only require the source field.
In earlier versions of GraphX we used bytecode inspection to infer TripletFields; however, we found bytecode inspection to be slightly unreliable and instead opted for more explicit user control.
In the following example we use the aggregateMessages operator to compute the average age of the more senior followers of each user.
import org.apache.spark.graphx.{Graph, VertexRDD}
import org.apache.spark.graphx.util.GraphGenerators
// Create a graph with "age" as the vertex property.
// Here we use a random graph for simplicity.
val graph: Graph[Double, Int] =
GraphGenerators.logNormalGraph(sc, numVertices = 100).mapVertices( (id, _) => id.toDouble )
// Compute the number of older followers and their total age
val olderFollowers: VertexRDD[(Int, Double)] = graph.aggregateMessages[(Int, Double)](
triplet => { // Map Function
if (triplet.srcAttr > triplet.dstAttr) {
// Send message to destination vertex containing counter and age
triplet.sendToDst(1, triplet.srcAttr)
}
},
// Add counter and age
(a, b) => (a._1 + b._1, a._2 + b._2) // Reduce Function
)
// Divide total age by number of older followers to get average age of older followers
val avgAgeOfOlderFollowers: VertexRDD[Double] =
olderFollowers.mapValues( (id, value) =>
value match { case (count, totalAge) => totalAge / count } )
// Display the results
avgAgeOfOlderFollowers.collect.foreach(println(_))
The aggregateMessages operation performs optimally when the messages (and the sums of messages) are constant sized (e.g., floats and addition instead of lists and concatenation).
In earlier versions of GraphX, neighborhood aggregation was accomplished using the mapReduceTriplets operator:
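A simplified sketch of its signature (the legacy operator is deprecated and may not be available in newer Spark versions):
class Graph[VD, ED] {
  def mapReduceTriplets[Msg](
      map: EdgeTriplet[VD, ED] => Iterator[(VertexId, Msg)],
      reduce: (Msg, Msg) => Msg)
    : VertexRDD[Msg]
}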
The mapReduceTriplets operator takes a user-defined map function which is applied to each triplet and can yield messages which are aggregated using the user-defined reduce function. However, we found the use of the returned iterator to be expensive, and it inhibited our ability to apply additional optimizations (e.g., local vertex renumbering). In aggregateMessages we introduced the EdgeContext, which exposes the triplet fields and also functions to explicitly send messages to the source and destination vertex. Furthermore, we removed bytecode inspection and instead require the user to indicate what fields in the triplet are actually required.
The following code block using mapReduceTriplets:
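The original snippet is not included; a sketch, assuming a small graph g with Int vertex properties and Float edge properties (the legacy API, which may be unavailable in newer Spark versions):
// An illustrative graph with Int vertex attributes and Float edge attributes
val g: Graph[Int, Float] = Graph.fromEdges(
  sc.parallelize(Seq(Edge(1L, 2L, 1.0f), Edge(2L, 3L, 2.0f))), defaultValue = 1)
def msgFun(triplet: EdgeTriplet[Int, Float]): Iterator[(VertexId, String)] =
  Iterator((triplet.dstId, "Hi"))
def reduceFun(a: String, b: String): String = a + " " + b
val result = g.mapReduceTriplets[String](msgFun, reduceFun)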
can be rewritten using aggregateMessages as:
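A sketch of the same computation with aggregateMessages, using the same illustrative g as above:
def sendMsg(triplet: EdgeContext[Int, Float, String]): Unit =
  triplet.sendToDst("Hi")
def mergeMsg(a: String, b: String): String = a + " " + b
val result2 = g.aggregateMessages[String](sendMsg, mergeMsg)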
A common aggregation task is computing the degree of each vertex: the number of edges adjacent to each vertex. In the context of directed graphs, it is often necessary to know the in-degree, out-degree, and the total degree of each vertex. The GraphOps class contains a collection of operators to compute the degrees of each vertex. For example, in the following we compute the max in, out, and total degrees:
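A sketch of computing the max degrees, using the graph built above:
// Define a reduce operation to compute the highest degree vertex
def max(a: (VertexId, Int), b: (VertexId, Int)): (VertexId, Int) =
  if (a._2 > b._2) a else b
// Compute the max degrees
val maxInDegree: (VertexId, Int)  = graph.inDegrees.reduce(max)
val maxOutDegree: (VertexId, Int) = graph.outDegrees.reduce(max)
val maxDegrees: (VertexId, Int)   = graph.degrees.reduce(max)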
In some cases it may be easier to express computation by collecting neighboring vertices and their attributes at each vertex. This can be easily accomplished using the collectNeighborIds and collectNeighbors operators.
These operators can be quite costly as they duplicate information and require substantial communication. If possible, try expressing the same computation using the aggregateMessages operator directly.
GraphX is much faster than naive Spark when graph computation is needed.
An example of a CueSheet application is shown below. Any Scala object extending CueSheet becomes a CueSheet application; the object body can then use variables like sc, sqlContext, and spark to write the business logic, as if it were inside spark-shell:
import com.kakao.cuesheet.CueSheet
object Example extends CueSheet {{
val rdd = sc.parallelize(1 to 100)
println(s"sum = ${rdd.sum()}")
println(s"sum2 = ${rdd.map(_ + 1).sum()}")
}}
CueSheet will take care of creating SparkContext or SparkSession according to the configuration given in a separate file, so that your application code can contain just the business logic. Furthermore, CueSheet will launch the application locally or on a YARN cluster by simply running your object as a Java application, eliminating the need to use spark-submit and accompanying shell scripts.
CueSheet also supports Spark Streaming applications, via ssc. When it is used in the object body, the application automatically becomes a Spark Streaming application, and ssc provides access to the StreamingContext.
libraryDependencies += "com.kakao.cuesheet" %% "cuesheet" % "0.10.0"
CueSheet can be used in Scala projects by configuring SBT as above. Note that this dependency is not specified as "provided", which makes it possible to launch the application right in the IDE, and even debug using breakpoints in driver code when launched in client mode.
Configurations for your CueSheet application, including Spark configurations and the arguments to spark-submit, are specified using the HOCON format. By default it is application.conf in your classpath root, but an alternate configuration file can be specified using -Dconfig.resource or -Dconfig.file. Below is an example configuration file.
spark {
master = "yarn:classpath:com.kakao.cuesheet.launcher.test"
deploy.mode = cluster
hadoop.user.name = "cloudera"
executor.instances = 2
executor.memory = 1g
driver.memory = 1g
streaming.blockInterval = 10000
eventLog.enabled = false
eventLog.dir = "hdfs:///user/spark/applicationHistory"
yarn.historyServer.address = "http://history.server:18088"
driver.extraJavaOptions = "-XX:MaxPermSize=512m"
}
Unlike the standard Spark configuration, spark.master for YARN should include an indicator for finding YARN/Hive/Hadoop configurations. It is easiest to put the XML files inside your classpath, usually by putting them under src/main/resources, and specify the package classpath as above. Alternatively, spark.master can contain a URL to download the configuration as a ZIP file, e.g. yarn:http://cloudera.manager/hive/configuration.zip, copied from Cloudera Manager's 'Download Client Configuration' link. The usual local or local[8] can also be used as spark.master.
deploy.mode can be either client or cluster, and spark.hadoop.user.name should be the username to be used as the Hadoop user. CueSheet assumes that this user has write permission to the home directory.
While submitting an application to YARN, CueSheet will copy Spark and CueSheet's dependency jars to HDFS. This way, the next time you submit your application, CueSheet will analyze your classpath to find and assemble only the classes that are not part of the already installed jars.
When given a tag name as the system property cuesheet.install, CueSheet will print a rather long shell command which can launch your application from anywhere the hdfs command is available. Below is an example of the one-liner shell command that CueSheet produces when given -Dcuesheet.install=v0.0.1 as a JVM argument.
rm -rf SimpleExample_2.10-v0.0.1 && mkdir SimpleExample_2.10-v0.0.1 && cd SimpleExample_2.10-v0.0.1 &&
echo '<configuration><property><name>dfs.ha.automatic-failover.enabled</name><value>false</value></property><property><name>fs.defaultFS</name><value>hdfs://quickstart.cloudera:8020</value></property></configuration>' > core-site.xml &&
hdfs --config . dfs -get hdfs:///user/cloudera/.cuesheet/applications/com.kakao.cuesheet.SimpleExample/v0.0.1/SimpleExample_2.10.jar \!SimpleExample_2.10.jar &&
hdfs --config . dfs -get hdfs:///user/cloudera/.cuesheet/lib/0.10.0-SNAPSHOT-scala-2.10-spark-2.1.0/*.jar &&
java -classpath "*" com.kakao.cuesheet.SimpleExample "hello" "world" && cd .. && rm -rf SimpleExample_2.10-v0.0.1
What this command does is download the CueSheet and Spark jars, as well as your application assembly, from HDFS, and launch the application in the same environment that was launched in the IDE. This way, it is not required to have HADOOP_CONF_DIR or SPARK_HOME properly installed and set on every node, making it much easier to use with distributed schedulers like Marathon, Chronos, or Aurora. These schedulers typically allow a single-line shell command as their job specification, so you can simply paste what CueSheet gives you into the scheduler's Web UI.
Having started as a library of reusable Spark functions, CueSheet contains a number of additional features, not in an extremely coherent manner. Many parts of CueSheet, including these features, are powered by the Mango library, another open-source project by Kakao.
One additional quirk is the “stop” tab CueSheet adds to the Spark UI. As shown below, it features three buttons with an increasing degree of seriousness. To stop a Spark Streaming application, to possibly trigger a restart by a scheduler like Marathon, one of the left two buttons will do the job. If you need to halt a Spark application ASAP, the red button will immediately kill the Spark driver.
| Algorithm | Data Structure | Time Complexity (Average) | Time Complexity (Worst) | Space Complexity (Worst) |
| --- | --- | --- | --- | --- |
| Depth First Search (DFS) | Graph of V vertices and E edges | - | O(E + V) | O(V) |
| Breadth First Search (BFS) | Graph of V vertices and E edges | - | O(E + V) | O(V) |
| Binary search | Sorted array of n elements | O(log(n)) | O(log(n)) | O(1) |
| Linear (Brute Force) | Array | O(n) | O(n) | O(1) |
| Shortest path by Dijkstra, using a Min-heap as priority queue | Graph with V vertices and E edges | O((V + E) log V) | O((V + E) log V) | O(V) |
| Shortest path by Dijkstra, using an unsorted array as priority queue | Graph with V vertices and E edges | O(V^2) | O(V^2) | O(V) |
| Shortest path by Bellman-Ford | Graph with V vertices and E edges | O(VE) | O(VE) | O(V) |
| Algorithm | Data Structure | Time Complexity (Best) | Time Complexity (Average) | Time Complexity (Worst) | Worst Case Auxiliary Space Complexity |
| --- | --- | --- | --- | --- | --- |
| Quicksort | Array | O(n log(n)) | O(n log(n)) | O(n^2) | O(log(n)) |
| Mergesort | Array | O(n log(n)) | O(n log(n)) | O(n log(n)) | O(n) |
| Heapsort | Array | O(n log(n)) | O(n log(n)) | O(n log(n)) | O(1) |
| Bubble Sort | Array | O(n) | O(n^2) | O(n^2) | O(1) |
| Insertion Sort | Array | O(n) | O(n^2) | O(n^2) | O(1) |
| Selection Sort | Array | O(n^2) | O(n^2) | O(n^2) | O(1) |
| Bucket Sort | Array | O(n+k) | O(n+k) | O(n^2) | O(nk) |
| Radix Sort | Array | O(nk) | O(nk) | O(nk) | O(n+k) |
| Data Structure | Indexing (Avg) | Search (Avg) | Insertion (Avg) | Deletion (Avg) | Indexing (Worst) | Search (Worst) | Insertion (Worst) | Deletion (Worst) | Space Complexity (Worst) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Basic Array | O(1) | O(n) | - | - | O(1) | O(n) | - | - | O(n) |
| Dynamic Array | O(1) | O(n) | O(n) | - | O(1) | O(n) | O(n) | - | O(n) |
| Singly-Linked List | O(n) | O(n) | O(1) | O(1) | O(n) | O(n) | O(1) | O(1) | O(n) |
| Doubly-Linked List | O(n) | O(n) | O(1) | O(1) | O(n) | O(n) | O(1) | O(1) | O(n) |
| Skip List | O(n) | O(log(n)) | O(log(n)) | O(log(n)) | O(n) | O(n) | O(n) | O(n) | O(n log(n)) |
| Hash Table | - | O(1) | O(1) | O(1) | - | O(n) | O(n) | O(n) | O(n) |
| Binary Search Tree | - | O(log(n)) | O(log(n)) | O(log(n)) | - | O(n) | O(n) | O(n) | O(n) |
| B-Tree | - | O(log(n)) | O(log(n)) | O(log(n)) | - | O(log(n)) | O(log(n)) | O(log(n)) | O(n) |
| Red-Black Tree | - | O(log(n)) | O(log(n)) | O(log(n)) | - | O(log(n)) | O(log(n)) | O(log(n)) | O(n) |
| AVL Tree | - | O(log(n)) | O(log(n)) | O(log(n)) | - | O(log(n)) | O(log(n)) | O(log(n)) | O(n) |
| Heaps | Heapify | Find Max | Extract Max | Increase Key | Insert | Delete | Merge |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Linked List (sorted) | - | O(1) | O(1) | O(n) | O(n) | O(1) | O(m+n) |
| Linked List (unsorted) | - | O(n) | O(n) | O(1) | O(1) | O(1) | O(1) |
| Binary Heap | O(n) | O(1) | O(log(n)) | O(log(n)) | O(log(n)) | O(log(n)) | O(m+n) |
| Binomial Heap | - | O(log(n)) | O(log(n)) | O(log(n)) | O(log(n)) | O(log(n)) | O(log(n)) |
| Fibonacci Heap | - | O(1) | O(log(n))* | O(1)* | O(1) | O(log(n))* | O(1) |
(* amortized time)
| Node / Edge Management | Storage | Add Vertex | Add Edge | Remove Vertex | Remove Edge | Query |
| --- | --- | --- | --- | --- | --- | --- |
| Adjacency list | O(V+E) | O(1) | O(1) | O(V + E) | O(E) | O(V) |
| Incidence list | O(V+E) | O(1) | O(1) | O(E) | O(E) | O(E) |
| Adjacency matrix | O(V^2) | O(V^2) | O(1) | O(V^2) | O(1) | O(1) |
| Incidence matrix | O(V ⋅ E) | O(V ⋅ E) | O(V ⋅ E) | O(V ⋅ E) | O(V ⋅ E) | O(E) |
Reference :
http://sandbox.runjs.cn/show/vsr4wsy7
Repartitioning : increases the number of partitions; it rebalances the partitions after a filter and increases parallelism.
You can define the number of partitions in Spark at the time of creating an RDD, as follows:
val users = sc.textFile("hdfs://atr3p11:8020/project/users.csv", 1)
where the 2nd argument is the number of partitions.
By default, if an HDFS path is not used, Spark creates partitions based on the number of cores; if an HDFS path is used, it creates partitions based on the input splits (by default, the HDFS block size).
To know the number of partitions, just enter in spark-shell:
users.partitions.size
Spark can only run 1 concurrent task for every partition of an RDD, up to the number of cores in your cluster. So if you have a cluster with 50 cores, you want your RDDs to have at least 50 partitions (and probably 2-3x that).
As far as choosing a "good" number of partitions, you generally want at least as many as the number of executors for parallelism. You can get this computed value by calling sc.defaultParallelism.
Also, the number of partitions determines how many files get generated by actions that save RDDs to files.
The maximum size of a partition is ultimately limited by the available memory of an executor.
In the first RDD transformation, e.g. reading from a file using sc.textFile(path, partition), the partition parameter will be applied to all further transformations and actions on this RDD.
When using textFile with compressed files (file.txt.gz, not file.txt or similar), Spark disables splitting, which results in an RDD with only 1 partition (as reads against gzipped files cannot be parallelized). In this case, to change the number of partitions you should do repartitioning.
Some operations, e.g. map, flatMap, and filter, don't preserve the partitioning; these operations apply a function to every element of each partition.
val rdd = sc.textFile("demo.gz")
val repartitioned = rdd.repartition(100)
With these lines, you end up with repartitioned having exactly 100 partitions of roughly equal size.
rdd.repartition(N) does a shuffle to split the data to match N partitions; partitioning is done on a round-robin basis.
Note : If the partitioning scheme doesn't work for you, you can write your own custom partitioner.
coalesce Transformation :
The coalesce transformation is used to change the number of partitions. It can trigger RDD shuffling depending on the shuffle flag (disabled by default, i.e. false).
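A minimal sketch in spark-shell (the numbers are illustrative):
val rdd = sc.parallelize(0 to 10, 8)
rdd.partitions.size                                  // 8
val res = rdd.coalesce(numPartitions = 8, shuffle = false)
res.partitions.size                                  // still 8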
shuffle is false by default and it is explicitly used here for demo purposes. Note that the number of partitions remains the same as the number of partitions in the source RDD rdd.