GraphX is a new component in Spark for graphs and graph-parallel computation. At a high level, GraphX extends the Spark RDD by introducing a new Graph abstraction: a directed multigraph with properties attached to each vertex and edge. To support graph computation, GraphX exposes a set of fundamental operators (e.g., subgraph, joinVertices, and aggregateMessages).
Spark GraphX is a graph processing framework built on top of Spark.
GraphX models graphs as property graphs where vertices and edges can have properties.
GraphX comes with its own package, org.apache.spark.graphx.
Graph
The Graph abstract class represents a collection of vertices and edges.
abstract class Graph[VD: ClassTag, ED: ClassTag]
The vertices attribute is of type VertexRDD, while edges is of type EdgeRDD.
Standard GraphX API
The Graph class comes with a small API.

Transformations: mapVertices, mapEdges, mapTriplets, reverse, subgraph, mask, groupEdges
Joins: outerJoinVertices
Computation: aggregateMessages
Creating Graphs (Graph object)
The Graph object comes with the following factory methods to create instances of Graph: fromEdgeTuples, fromEdges, and apply.
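As a minimal sketch of the three factory methods, the snippet below builds the same tiny graph three ways. The local SparkContext, the application name, and the sample ids and attributes are assumptions made for illustration; in spark-shell the existing `sc` can be used instead.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

// Assumption: a local SparkContext; in spark-shell reuse the existing `sc`.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("graph-factories"))

// fromEdgeTuples: only (srcId, dstId) pairs are given; each edge gets attribute 1.
val tupleGraph: Graph[String, Int] =
  Graph.fromEdgeTuples(sc.parallelize(Seq((1L, 2L), (2L, 3L))), defaultValue = "n/a")

// fromEdges: Edge objects carry their own attribute; vertices get the default value.
val edgeGraph: Graph[String, Double] =
  Graph.fromEdges(sc.parallelize(Seq(Edge(1L, 2L, 0.5), Edge(2L, 3L, 1.5))), defaultValue = "n/a")

// apply: explicit vertex and edge RDDs.
val applyGraph: Graph[String, Double] = Graph(
  sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c"))),
  sc.parallelize(Seq(Edge(1L, 2L, 0.5), Edge(2L, 3L, 1.5))))
```

fromEdgeTuples and fromEdges derive the vertex set from the edges, so they are convenient when no vertex properties are available up front.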
Package summary for org.apache.spark.graphx
https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/graphx/package-summary.html
Class: Description
Edge: A single directed edge consisting of a source id, target id, and the data associated with the edge.
EdgeContext: Represents an edge along with its neighboring vertices and allows sending messages along the edge.
EdgeDirection: The direction of a directed edge relative to a vertex.
EdgeRDD: EdgeRDD[ED, VD] extends RDD[Edge[ED]] by storing the edges in columnar format on each partition for performance.
EdgeTriplet: An edge triplet represents an edge along with the vertex attributes of its neighboring vertices.
Graph: The Graph abstractly represents a graph with arbitrary objects associated with vertices and edges.
GraphKryoRegistrator: Registers GraphX classes with Kryo for improved performance.
GraphLoader: Provides utilities for loading Graphs from files.
GraphOps: Contains additional functionality for Graph.
GraphXUtils: (no description given)
PartitionStrategy.CanonicalRandomVertexCut$: Assigns edges to partitions by hashing the source and destination vertex IDs in a canonical direction, resulting in a random vertex cut that colocates all edges between two vertices, regardless of direction.
PartitionStrategy.EdgePartition1D$: Assigns edges to partitions using only the source vertex ID, colocating edges with the same source.
PartitionStrategy.EdgePartition2D$: Assigns edges to partitions using a 2D partitioning of the sparse edge adjacency matrix, guaranteeing a 2 * sqrt(numParts) - 1 bound on vertex replication.
PartitionStrategy.RandomVertexCut$: Assigns edges to partitions by hashing the source and destination vertex IDs, resulting in a random vertex cut that colocates all same-direction edges between two vertices.
Pregel: Implements a Pregel-like bulk-synchronous message-passing API.
TripletFields: Represents a subset of the fields of an EdgeTriplet or EdgeContext.
VertexRDD: Extends RDD[(VertexId, VD)] by ensuring that there is only one entry for each vertex and by pre-indexing the entries for fast, efficient joins.
Example Property Graph
Suppose we want to construct a property graph consisting of the various collaborators on the GraphX project. The vertex property might contain the username and occupation. We could annotate edges with a string describing the relationships between collaborators.
The resulting graph would have the type signature Graph[(String, String), String].
There are numerous ways to construct a property graph from raw files, RDDs, and even synthetic generators and these are discussed in more detail in the section on graph builders. Probably the most general method is to use the Graph object. For example the following code constructs a graph from a collection of RDDs:
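A minimal sketch of such a construction follows. The usernames, occupations, relationship labels, the local SparkContext, and the default vertex value are illustrative assumptions, not data from the GraphX project itself.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// Assumption: a local SparkContext; in spark-shell reuse the existing `sc`.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("property-graph"))

// Vertices: (id, (username, occupation)). Sample values are hypothetical.
val users: RDD[(VertexId, (String, String))] = sc.parallelize(Seq(
  (1L, ("alice", "student")), (2L, ("bob", "postdoc")),
  (3L, ("carol", "prof")), (4L, ("dave", "prof"))))

// Edges: Edge(srcId, dstId, attr), where attr describes the relationship.
val relationships: RDD[Edge[String]] = sc.parallelize(Seq(
  Edge(1L, 2L, "collab"), Edge(3L, 1L, "advisor"),
  Edge(4L, 3L, "colleague"), Edge(3L, 2L, "pi")))

// Default property for vertices referenced by an edge but absent from `users`.
val defaultUser = ("John Doe", "Missing")

// Build the property graph.
val graph: Graph[(String, String), String] = Graph(users, relationships, defaultUser)
```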
In the above example we make use of the Edge case class. Edges have a srcId and a dstId corresponding to the source and destination vertex identifiers. In addition, the Edge class has an attr member which stores the edge property.
We can deconstruct a graph into the respective vertex and edge views by using the graph.vertices and graph.edges members respectively.
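The two views can be queried like any other RDD. The sketch below counts the postdocs and the edges whose source id exceeds the destination id; the sample graph and the local SparkContext are assumptions for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

// Assumption: a local SparkContext; in spark-shell reuse the existing `sc`.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("graph-views"))

// Hypothetical sample graph: (username, occupation) vertices, string edges.
val graph: Graph[(String, String), String] = Graph(
  sc.parallelize(Seq((1L, ("alice", "student")), (2L, ("bob", "postdoc")),
                     (3L, ("carol", "prof")), (4L, ("dave", "prof")))),
  sc.parallelize(Seq(Edge(1L, 2L, "collab"), Edge(3L, 1L, "advisor"),
                     Edge(4L, 3L, "colleague"), Edge(3L, 2L, "pi"))))

// Vertex view: pattern-match on the (id, (name, occupation)) tuple.
val postdocCount = graph.vertices.filter { case (id, (name, pos)) => pos == "postdoc" }.count()

// Edge view: count all edges where src > dst.
val backwardCount = graph.edges.filter(e => e.srcId > e.dstId).count()
```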
Note that graph.vertices returns a VertexRDD[(String, String)] which extends RDD[(VertexId, (String, String))], and so we use the Scala case expression to deconstruct the tuple. On the other hand, graph.edges returns an EdgeRDD containing Edge[String] objects. We could have also used the case class type constructor as in the following:
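A short sketch of that alternative, using a pattern on the Edge case class instead of field access. The sample edges and the local SparkContext are assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

// Assumption: a local SparkContext; in spark-shell reuse the existing `sc`.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("edge-pattern"))

// Hypothetical sample graph built from edges only; vertices get the default "user".
val graph: Graph[String, String] = Graph.fromEdges(
  sc.parallelize(Seq(Edge(1L, 2L, "collab"), Edge(3L, 1L, "advisor"),
                     Edge(4L, 3L, "colleague"), Edge(3L, 2L, "pi"))),
  defaultValue = "user")

// Deconstruct each Edge with the case class pattern rather than e.srcId / e.dstId.
val backwardCount = graph.edges.filter { case Edge(src, dst, prop) => src > dst }.count()
```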
In addition to the vertex and edge views of the property graph, GraphX also exposes a triplet view. The triplet view logically joins the vertex and edge properties, yielding an RDD[EdgeTriplet[VD, ED]] containing instances of the EdgeTriplet class. In SQL terms, this is a join of the edges with the vertices on both e.srcId = src.id and e.dstId = dst.id.
The EdgeTriplet class extends the Edge class by adding the srcAttr and dstAttr members which contain the source and destination properties respectively. We can use the triplet view of a graph to render a collection of strings describing relationships between users.
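One way to render such strings is sketched below. The sample users and relationships are hypothetical, as is the local SparkContext.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.rdd.RDD

// Assumption: a local SparkContext; in spark-shell reuse the existing `sc`.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("triplet-view"))

// Hypothetical sample graph: (username, occupation) vertices, string edges.
val graph: Graph[(String, String), String] = Graph(
  sc.parallelize(Seq((1L, ("alice", "student")), (2L, ("bob", "postdoc")),
                     (3L, ("carol", "prof")))),
  sc.parallelize(Seq(Edge(1L, 2L, "collab"), Edge(3L, 1L, "advisor"))))

// Each triplet exposes srcAttr, attr and dstAttr together.
val facts: RDD[String] = graph.triplets.map(t =>
  s"${t.srcAttr._1} is the ${t.attr} of ${t.dstAttr._1}")

facts.collect().foreach(println)
```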
Graph Operators
Just as RDDs have basic operations like map, filter, and reduceByKey, property graphs also have a collection of basic operators that take user defined functions and produce new graphs with transformed properties and structure. The core operators that have optimized implementations are defined in Graph, and convenient operators that are expressed as compositions of the core operators are defined in GraphOps. However, thanks to Scala implicits, the operators in GraphOps are automatically available as members of Graph. For example, we can compute the in-degree of each vertex (defined in GraphOps) by the following:
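A minimal sketch, assuming a small hand-made edge list and a local SparkContext. Note that vertices with no incoming edges do not appear in the result.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Graph, VertexRDD}

// Assumption: a local SparkContext; in spark-shell reuse the existing `sc`.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("in-degrees"))

// Hypothetical sample graph built from (srcId, dstId) pairs.
val graph: Graph[Int, Int] = Graph.fromEdgeTuples(
  sc.parallelize(Seq((1L, 2L), (3L, 1L), (4L, 3L), (3L, 2L))), defaultValue = 0)

// inDegrees is defined in GraphOps but is callable directly on Graph via an implicit.
val inDegrees: VertexRDD[Int] = graph.inDegrees
```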
The reason for differentiating between core graph operations and GraphOps is to be able to support different graph representations in the future. Each graph representation must provide implementations of the core operations and reuse many of the useful operations defined in GraphOps.
Summary List of Operators
The following is a quick summary of the functionality defined in both Graph and GraphOps but presented as members of Graph for simplicity. Note that some function signatures have been simplified (e.g., default arguments and type constraints removed) and some more advanced functionality has been removed, so please consult the API docs for the official list of operations.
Property Operators
Like the RDD map operator, the property graph contains the mapVertices, mapEdges, and mapTriplets operators.
Each of these operators yields a new graph with the vertex or edge properties modified by the user defined map function.
Note that in each case the graph structure is unaffected. This is a key feature of these operators which allows the resulting graph to reuse the structural indices of the original graph. The following snippets are logically equivalent, but the first one does not preserve the structural indices and would not benefit from the GraphX system optimizations:
Instead, use mapVertices to preserve the indices:
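Both approaches can be sketched side by side. The helper `mapUdf`, the sample graph, and the local SparkContext are hypothetical and exist only for this illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Graph, VertexId}

// Assumption: a local SparkContext; in spark-shell reuse the existing `sc`.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("map-vertices"))

// Hypothetical sample graph; every vertex starts with attribute 10.
val graph: Graph[Int, Int] = Graph.fromEdgeTuples(
  sc.parallelize(Seq((1L, 2L), (2L, 3L))), defaultValue = 10)

// Hypothetical user defined map function.
def mapUdf(id: VertexId, attr: Int): Int = attr * 2

// Approach 1: map the vertex RDD and rebuild the graph.
// The result is a plain RDD, so the structural indices are not reused.
val newVertices = graph.vertices.map { case (id, attr) => (id, mapUdf(id, attr)) }
val rebuiltGraph = Graph(newVertices, graph.edges)

// Approach 2: mapVertices keeps the graph structure and its indices.
val mappedGraph = graph.mapVertices((id, attr) => mapUdf(id, attr))
```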
These operators are often used to initialize the graph for a particular computation or project away unnecessary properties. For example, given a graph with the out degrees as the vertex properties (we describe how to construct such a graph later), we initialize it for PageRank:
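A sketch of that initialization, assuming a hand-built input graph whose vertex property already holds the out-degree. The sample data and the local SparkContext are assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

// Assumption: a local SparkContext; in spark-shell reuse the existing `sc`.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("pagerank-init"))

// Hypothetical input: the vertex property is the out-degree of the vertex.
val inputGraph: Graph[Int, String] = Graph(
  sc.parallelize(Seq((1L, 2), (2L, 1), (3L, 0))),
  sc.parallelize(Seq(Edge(1L, 2L, "a"), Edge(1L, 3L, "b"), Edge(2L, 3L, "c"))))

// Edge weight becomes 1 / out-degree of the source; every vertex starts at rank 1.0.
val outputGraph: Graph[Double, Double] =
  inputGraph.mapTriplets(triplet => 1.0 / triplet.srcAttr).mapVertices((id, _) => 1.0)
```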
Structural Operators
Currently GraphX supports only a simple set of commonly used structural operators and we expect to add more in the future. The following is a list of the basic structural operators.
The reverse operator returns a new graph with all the edge directions reversed. This can be useful when, for example, trying to compute the inverse PageRank. Because the reverse operation does not modify vertex or edge properties or change the number of edges, it can be implemented efficiently without data movement or duplication.
The subgraph operator takes vertex and edge predicates and returns the graph containing only the vertices that satisfy the vertex predicate (evaluate to true) and edges that satisfy the edge predicate and connect vertices that satisfy the vertex predicate. The subgraph operator can be used in a number of situations to restrict the graph to the vertices and edges of interest or eliminate broken links. For example, in the following code we remove broken links:
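A minimal sketch, assuming a sample graph in which one edge points at a vertex (id 0) that has no user record and therefore receives the default "Missing" property. All sample data and the local SparkContext are assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

// Assumption: a local SparkContext; in spark-shell reuse the existing `sc`.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("subgraph"))

// Hypothetical users; vertex 0 is referenced by an edge but has no record.
val users = sc.parallelize(Seq(
  (1L, ("alice", "student")), (2L, ("bob", "postdoc")), (3L, ("carol", "prof"))))
val relationships = sc.parallelize(Seq(
  Edge(1L, 2L, "collab"), Edge(3L, 1L, "advisor"), Edge(3L, 0L, "student")))
val defaultUser = ("John Doe", "Missing")
val graph = Graph(users, relationships, defaultUser)

// Keep only vertices with a real record; edges touching vertex 0 are dropped too.
val validGraph = graph.subgraph(vpred = (id, attr) => attr._2 != "Missing")
```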
Note in the above example only the vertex predicate is provided. The subgraph operator defaults to true if the vertex or edge predicates are not provided.
The mask operator constructs a subgraph by returning a graph that contains the vertices and edges that are also found in the input graph. This can be used in conjunction with the subgraph operator to restrict a graph based on the properties in another related graph. For example, we might run connected components using the graph with missing vertices and then restrict the answer to the valid subgraph.
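That combination could look roughly like this. The sample graph (with vertex 0 lacking a user record) and the local SparkContext are assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

// Assumption: a local SparkContext; in spark-shell reuse the existing `sc`.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("mask"))

// Hypothetical graph; vertex 0 has no user record and gets the default property.
val graph = Graph(
  sc.parallelize(Seq((1L, ("alice", "student")), (2L, ("bob", "postdoc")),
                     (3L, ("carol", "prof")))),
  sc.parallelize(Seq(Edge(1L, 2L, "collab"), Edge(3L, 1L, "advisor"),
                     Edge(3L, 0L, "student"))),
  ("John Doe", "Missing"))

// Run connected components on the full graph, including the missing vertex.
val ccGraph = graph.connectedComponents()

// Remove missing vertices and the edges connected to them.
val validGraph = graph.subgraph(vpred = (id, attr) => attr._2 != "Missing")

// Restrict the component labels to the valid subgraph.
val validCCGraph = ccGraph.mask(validGraph)
```

The component id is the smallest vertex id in the component, so here every remaining vertex is still labelled 0 even though vertex 0 itself has been masked out.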
The groupEdges operator merges parallel edges (i.e., duplicate edges between pairs of vertices) in the multigraph. In many numerical applications, parallel edges can be added (their weights combined) into a single edge, thereby reducing the size of the graph.
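A small sketch of merging parallel edges by summing their weights. The API documentation notes that the graph should first be partitioned with partitionBy so that identical edges are colocated; the choice of EdgePartition2D here, the sample edges, and the local SparkContext are assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, PartitionStrategy}

// Assumption: a local SparkContext; in spark-shell reuse the existing `sc`.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("group-edges"))

// Hypothetical multigraph: two parallel edges from 1 to 2.
val graph: Graph[Int, Double] = Graph.fromEdges(
  sc.parallelize(Seq(Edge(1L, 2L, 1.0), Edge(1L, 2L, 2.5), Edge(2L, 3L, 4.0))),
  defaultValue = 0)

// Repartition so duplicate edges share a partition, then sum their weights.
val merged: Graph[Int, Double] =
  graph.partitionBy(PartitionStrategy.EdgePartition2D).groupEdges((a, b) => a + b)
```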
Join Operators
In many cases it is necessary to join data from external collections (RDDs) with graphs. For example, we might have extra user properties that we want to merge with an existing graph or we might want to pull vertex properties from one graph into another. These tasks can be accomplished using the join operators. Below we list the key join operators:
The joinVertices operator joins the vertices with the input RDD and returns a new graph with the vertex properties obtained by applying the user defined map function to the result of the joined vertices. Vertices without a matching value in the RDD retain their original value.
Note that if the RDD contains more than one value for a given vertex, only one will be used. It is therefore recommended that the input RDD be made unique using the following, which will also pre-index the resulting values to substantially accelerate the subsequent join.
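A sketch of that pattern using VertexRDD.aggregateUsingIndex followed by joinVertices. The cost values, variable names, and the local SparkContext are hypothetical; the explicit `[Double]` type argument is added here to keep type inference unambiguous.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Graph, VertexId, VertexRDD}
import org.apache.spark.rdd.RDD

// Assumption: a local SparkContext; in spark-shell reuse the existing `sc`.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("join-vertices"))

// Hypothetical graph whose vertex property is a cost, initially 1.0.
val graph: Graph[Double, Int] = Graph.fromEdgeTuples(
  sc.parallelize(Seq((1L, 2L), (2L, 3L))), defaultValue = 1.0)

// Extra costs with duplicate entries for vertex 1 and none for vertex 3.
val nonUniqueCosts: RDD[(VertexId, Double)] =
  sc.parallelize(Seq((1L, 0.5), (1L, 0.25), (2L, 2.0)))

// Merge duplicates and index the result against the graph's vertices.
val uniqueCosts: VertexRDD[Double] =
  graph.vertices.aggregateUsingIndex[Double](nonUniqueCosts, (a, b) => a + b)

// Vertices without a match (vertex 3) keep their original value.
val joinedGraph = graph.joinVertices(uniqueCosts)(
  (id, oldCost, extraCost) => oldCost + extraCost)
```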
The more general outerJoinVertices behaves similarly to joinVertices except that the user defined map function is applied to all vertices and can change the vertex property type. Because not all vertices may have a matching value in the input RDD, the map function takes an Option type. For example, we can set up a graph for PageRank by initializing vertex properties with their outDegree.
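A minimal sketch of that setup, assuming a small sample graph and a local SparkContext. The vertex property type changes from String to Int, and vertices with no outgoing edges fall into the None case.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.Graph

// Assumption: a local SparkContext; in spark-shell reuse the existing `sc`.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("outer-join-vertices"))

// Hypothetical graph with a String vertex property.
val graph: Graph[String, Int] = Graph.fromEdgeTuples(
  sc.parallelize(Seq((1L, 2L), (1L, 3L), (2L, 3L))), defaultValue = "user")

// Replace each vertex property with its out-degree (0 when there is no match).
val degreeGraph: Graph[Int, Int] =
  graph.outerJoinVertices(graph.outDegrees) { (id, oldAttr, outDegOpt) =>
    outDegOpt match {
      case Some(outDeg) => outDeg
      case None => 0 // no out-degree entry means no outgoing edges
    }
  }
```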
You may have noticed the multiple parameter lists (e.g., f(a)(b)) curried function pattern used in the above examples. While we could have equally written f(a)(b) as f(a,b), this would mean that type inference on b would not depend on a. As a consequence, the user would need to provide type annotation for the user defined function:
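The effect can be illustrated without Spark. In the sketch below, `joinCurried` and `joinFlat` are hypothetical helpers invented for this example and are not part of GraphX; the remark about Scala 2 reflects how its type inference generally works across parameter lists.

```scala
// Hypothetical helpers, not GraphX API: same logic, different parameter lists.
def joinCurried[A, B](xs: Seq[A])(f: A => B): Seq[B] = xs.map(f)
def joinFlat[A, B](xs: Seq[A], f: A => B): Seq[B] = xs.map(f)

// Curried: A is fixed by the first parameter list, so `x` needs no annotation.
val viaCurried = joinCurried(Seq(1, 2, 3))(x => x * 2)

// Single list: in Scala 2 the lambda parameter generally has to be annotated,
// because A is not yet known when the function literal is type-checked.
val viaFlat = joinFlat(Seq(1, 2, 3), (x: Int) => x * 2)
```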
Neighborhood Aggregation
A key step in many graph analytics tasks is aggregating information about the neighborhood of each vertex. For example, we might want to know the number of followers each user has or the average age of the followers of each user. Many iterative graph algorithms (e.g., PageRank, Shortest Path, and connected components) repeatedly aggregate properties of neighboring vertices (e.g., current PageRank value, shortest path to the source, and smallest reachable vertex id).
To improve performance, the primary aggregation operator changed from graph.mapReduceTriplets to the new graph.aggregateMessages. While the changes in the API are relatively small, we provide a transition guide below.
Aggregate Messages (aggregateMessages)
The core aggregation operation in GraphX is aggregateMessages. This operator applies a user defined sendMsg function to each edge triplet in the graph and then uses the mergeMsg function to aggregate those messages at their destination vertex.
The user defined sendMsg function takes an EdgeContext, which exposes the source and destination attributes along with the edge attribute and functions (sendToSrc and sendToDst) to send messages to the source and destination vertices. Think of sendMsg as the map function in map-reduce. The user defined mergeMsg function takes two messages destined to the same vertex and yields a single message. Think of mergeMsg as the reduce function in map-reduce. The aggregateMessages operator returns a VertexRDD[Msg] containing the aggregate message (of type Msg) destined to each vertex. Vertices that did not receive a message are not included in the returned VertexRDD.
In addition, aggregateMessages takes an optional tripletFields argument which indicates what data is accessed in the EdgeContext (e.g., the source vertex attribute but not the destination vertex attribute). The possible options for tripletFields are defined in TripletFields and the default value is TripletFields.All, which indicates that the user defined sendMsg function may access any of the fields in the EdgeContext. The tripletFields argument can be used to notify GraphX that only part of the EdgeContext will be needed, allowing GraphX to select an optimized join strategy. For example, if we are computing the average age of the followers of each user we would only require the source field, and so we would use TripletFields.Src to indicate that we only require the source field.
In earlier versions of GraphX we used bytecode inspection to infer the TripletFields; however, we have found bytecode inspection to be slightly unreliable and instead opted for more explicit user control.
In the following example we use the aggregateMessages operator to compute the average age of the more senior followers of each user.
import org.apache.spark.graphx.{Graph, VertexRDD}
import org.apache.spark.graphx.util.GraphGenerators
// Create a graph with "age" as the vertex property.
// Here we use a random graph for simplicity.
val graph: Graph[Double, Int] =
  GraphGenerators.logNormalGraph(sc, numVertices = 100).mapVertices( (id, _) => id.toDouble )
// Compute the number of older followers and their total age
val olderFollowers: VertexRDD[(Int, Double)] = graph.aggregateMessages[(Int, Double)](
  triplet => { // Map Function
    if (triplet.srcAttr > triplet.dstAttr) {
      // Send message to destination vertex containing counter and age
      // (passed as an explicit tuple, since sendToDst takes a single message)
      triplet.sendToDst((1, triplet.srcAttr))
    }
  },
  // Add counter and age
  (a, b) => (a._1 + b._1, a._2 + b._2) // Reduce Function
)
// Divide total age by number of older followers to get average age of older followers
val avgAgeOfOlderFollowers: VertexRDD[Double] =
  olderFollowers.mapValues( (id, value) =>
    value match { case (count, totalAge) => totalAge / count } )
// Display the results
avgAgeOfOlderFollowers.collect.foreach(println(_))
The aggregateMessages operation performs optimally when the messages (and the sums of messages) are constant sized (e.g., floats and addition instead of lists and concatenation).
Map Reduce Triplets Transition Guide (Legacy)
In earlier versions of GraphX neighborhood aggregation was accomplished using the mapReduceTriplets operator:
The mapReduceTriplets operator takes a user defined map function which is applied to each triplet and can yield messages which are aggregated using the user defined reduce function. However, we found the use of the returned iterator to be expensive and it inhibited our ability to apply additional optimizations (e.g., local vertex renumbering). In aggregateMessages we introduced the EdgeContext, which exposes the triplet fields and also functions to explicitly send messages to the source and destination vertex. Furthermore, we removed bytecode inspection and instead require the user to indicate what fields in the triplet are actually required.
The following code block using mapReduceTriplets can be rewritten using aggregateMessages as:
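The sketch below shows the aggregateMessages form, with the legacy mapReduceTriplets form kept as a comment because that operator may not be available in recent Spark releases. The sample graph, the "Hi" message, the function names, and the local SparkContext are assumptions for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, EdgeContext, Graph, VertexRDD}

// Assumption: a local SparkContext; in spark-shell reuse the existing `sc`.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("aggregate-messages"))

// Hypothetical graph: vertices 1 and 2 both point at vertex 3.
val graph: Graph[Int, Float] = Graph.fromEdges(
  sc.parallelize(Seq(Edge(1L, 3L, 1.0f), Edge(2L, 3L, 1.0f))), defaultValue = 0)

// Legacy form (comment only):
//   def msgFun(triplet: EdgeTriplet[Int, Float]): Iterator[(VertexId, String)] =
//     Iterator((triplet.dstId, "Hi"))
//   val result = graph.mapReduceTriplets[String](msgFun, reduceFun)

// aggregateMessages form: send explicitly through the EdgeContext.
def msgFun(ctx: EdgeContext[Int, Float, String]): Unit = ctx.sendToDst("Hi")
def reduceFun(a: String, b: String): String = a + " " + b

val result: VertexRDD[String] = graph.aggregateMessages[String](msgFun, reduceFun)
```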
Computing Degree Information
A common aggregation task is computing the degree of each vertex: the number of edges adjacent to each vertex. In the context of directed graphs it is often necessary to know the in-degree, out-degree, and the total degree of each vertex. The GraphOps class contains a collection of operators to compute the degrees of each vertex. For example, in the following we compute the max in, out, and total degrees:
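A minimal sketch of that computation. The reduce helper `max`, the sample edges, and the local SparkContext are assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Graph, VertexId}

// Assumption: a local SparkContext; in spark-shell reuse the existing `sc`.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("max-degrees"))

// Hypothetical sample graph.
val graph: Graph[Int, Int] = Graph.fromEdgeTuples(
  sc.parallelize(Seq((1L, 2L), (1L, 3L), (2L, 3L), (4L, 3L))), defaultValue = 0)

// Reduce operation that keeps the (vertexId, degree) pair with the higher degree.
def max(a: (VertexId, Int), b: (VertexId, Int)): (VertexId, Int) =
  if (a._2 > b._2) a else b

val maxInDegree: (VertexId, Int) = graph.inDegrees.reduce(max)
val maxOutDegree: (VertexId, Int) = graph.outDegrees.reduce(max)
val maxDegrees: (VertexId, Int) = graph.degrees.reduce(max)
```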
Collecting Neighbors
In some cases it may be easier to express computation by collecting neighboring vertices and their attributes at each vertex. This can be easily accomplished using the collectNeighborIds and the collectNeighbors operators.
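A short sketch of both operators on a small sample graph. The edge list, the vertex property "v", and the local SparkContext are assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{EdgeDirection, Graph, VertexId, VertexRDD}

// Assumption: a local SparkContext; in spark-shell reuse the existing `sc`.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("collect-neighbors"))

// Hypothetical sample graph; every vertex has the property "v".
val graph: Graph[String, Int] = Graph.fromEdgeTuples(
  sc.parallelize(Seq((1L, 2L), (1L, 3L), (2L, 3L))), defaultValue = "v")

// Ids of the vertices each vertex points to.
val outNeighborIds: VertexRDD[Array[VertexId]] =
  graph.collectNeighborIds(EdgeDirection.Out)

// Ids and attributes of the vertices pointing at each vertex.
val inNeighbors: VertexRDD[Array[(VertexId, String)]] =
  graph.collectNeighbors(EdgeDirection.In)
```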
These operators can be quite costly as they duplicate information and require substantial communication. If possible, try expressing the same computation using the aggregateMessages operator directly.
GraphX is typically faster than naive, hand-written Spark code when graph computation is needed.