Data types of Spark MLlib – Part 2

Friends, so far we have gone through the basic data types of Spark MLlib. Spark MLlib also supports distributed matrices, which include RowMatrix, IndexedRowMatrix, CoordinateMatrix, and BlockMatrix.

A distributed matrix has long-typed row and column indices and double-typed values, stored distributively in one or more RDDs. It is very important to choose the right format to store large and distributed matrices. Converting a distributed matrix to a different format may require a global shuffle, which is quite expensive. Four types of distributed matrices have been implemented so far.

  1. RowMatrix :

    A RowMatrix is a distributed matrix made up of a collection of rows, each of which is a local vector. Since each row is represented by a local vector, the number of columns is limited by the integer range, but it should be much smaller in practice.

    A RowMatrix can be created from an RDD[Vector] instance. We can then compute its column summary statistics and decompositions. QR decomposition is of the form A = QR, where Q is an orthogonal matrix and R is an upper triangular matrix; a sketch of these operations follows the example below.

    Example :

    import org.apache.spark.mllib.linalg.{Vector, Vectors}
    import org.apache.spark.mllib.linalg.distributed.RowMatrix
    import org.apache.spark.rdd.RDD
    
    val rows: RDD[Vector] = sc.parallelize(Seq(Vectors.dense(1.0, 0.0, 3.0), Vectors.dense(2.0, 30.0, 4.0)))
    O/p : rows: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = ParallelCollectionRDD[2] at parallelize at <console>:26
    
    // Create a RowMatrix from an RDD[Vector].
    val mat: RowMatrix = new RowMatrix(rows)
    
    // Get its size.
    val m = mat.numRows()
    val n = mat.numCols()
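
    The column summary statistics and QR decomposition mentioned above can be computed directly on the matrix. A minimal sketch, reusing the mat created in the example:
    
    // Column-wise summary statistics (mean, variance, non-zero counts, ...).
    val summary = mat.computeColumnSummaryStatistics()
    println(summary.mean)
    println(summary.variance)
    
    // QR decomposition A = QR; computeQ = true also materializes Q.
    val qr = mat.tallSkinnyQR(computeQ = true)
    println(qr.R)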
    
  2. IndexedRowMatrix :

    It is similar to RowMatrix, except that each row is represented by its index and a local vector.
    An IndexedRowMatrix can be created from an RDD[IndexedRow] instance, where IndexedRow is a wrapper over (Long, Vector).

    Example :

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix, RowMatrix}
    import org.apache.spark.rdd.RDD
    
    val rows: RDD[IndexedRow] = sc.parallelize(Seq(IndexedRow(0, Vectors.dense(1.0, 2.0, 3.0)), IndexedRow(1, Vectors.dense(4.0, 5.0, 6.0))))
    // Create an IndexedRowMatrix from an RDD[IndexedRow].
    val mat: IndexedRowMatrix = new IndexedRowMatrix(rows)
    
    // Get its size.
    val m = mat.numRows()
    val n = mat.numCols()
    
    // Drop its row indices.
    val rowMat: RowMatrix = mat.toRowMatrix()
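
    Because each row carries its index, specific rows can be looked up, which a plain RowMatrix cannot do. A minimal sketch, reusing mat from above:
    
    // Look up the row with index 1 via the underlying RDD[IndexedRow].
    val secondRow = mat.rows.filter(_.index == 1L).first()
    println(secondRow.vector) // [4.0,5.0,6.0]
    
    // An IndexedRowMatrix can also be converted to a CoordinateMatrix.
    val coordMat = mat.toCoordinateMatrix()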


  3. CoordinateMatrix :

    A coordinate matrix is backed by entries of the form (rowIndex: Long, colIndex: Long, value: Double) and can be created from an RDD[MatrixEntry]. We can also convert a CoordinateMatrix to an IndexedRowMatrix by calling toIndexedRowMatrix.

    Example :

    import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
    import org.apache.spark.rdd.RDD
    
    val entries: RDD[MatrixEntry] = sc.parallelize(Seq(MatrixEntry(0, 1, 1.2), MatrixEntry(0, 2, 1.5), MatrixEntry(0, 3, 2.5)))
    // Create a CoordinateMatrix from an RDD[MatrixEntry].
    val mat: CoordinateMatrix = new CoordinateMatrix(entries)
    
    // Get its size.
    val m = mat.numRows()
    val n = mat.numCols()
    
    // Convert it to an IndexedRowMatrix whose rows are sparse vectors.
    val indexedRowMatrix = mat.toIndexedRowMatrix()
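
    A CoordinateMatrix also supports a simple transpose. A minimal sketch, reusing mat from above:
    
    // Transpose swaps the row and column index of every entry.
    val matT: CoordinateMatrix = mat.transpose()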

  4. BlockMatrix :

    When we want to perform matrix addition and multiplication, where each submatrix is backed by its index, we use BlockMatrix.

    A BlockMatrix can be most easily created from an IndexedRowMatrix or CoordinateMatrix by calling toBlockMatrix, which by default creates blocks of size 1024 x 1024.

    Example :

    import org.apache.spark.mllib.linalg.distributed.{BlockMatrix, CoordinateMatrix, MatrixEntry}
    import org.apache.spark.rdd.RDD
    
    val entries: RDD[MatrixEntry] = sc.parallelize(Seq(MatrixEntry(0, 1, 1.2), MatrixEntry(0, 2, 1.5), MatrixEntry(0, 3, 2.5)))
    // Create a CoordinateMatrix from an RDD[MatrixEntry].
    val coordMat: CoordinateMatrix = new CoordinateMatrix(entries)
    // Transform the CoordinateMatrix to a BlockMatrix
    val matA: BlockMatrix = coordMat.toBlockMatrix().cache()
    
    // Validate whether the BlockMatrix is set up properly. Throws an Exception when it is not valid.
    // Nothing happens if it is valid.
    matA.validate()
    
    // Calculate A^T A.
    val ata = matA.transpose.multiply(matA)
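
    The section above also mentions matrix addition; a minimal sketch of element-wise addition, reusing matA (the operands must have matching dimensions and block sizes):
    
    // Element-wise addition: A + A.
    val sum: BlockMatrix = matA.add(matA)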

Data types of Spark MLlib – Part 1

Hello friends, Spark MLlib supports multiple data types in the form of vectors and matrices.

A local vector has integer-typed, 0-based indices and double-typed values, stored on a single machine. There are 2 types of local vectors:

  1. Dense vector :

A dense vector is backed by a double array representing its entry values.

def dense(values: Array[Double]): Vector

Creates a dense vector from a double array.

Example : 

import org.apache.spark.mllib.linalg.{Vector, Vectors}

val dv: Vector = Vectors.dense(1.0, 0.0, 3.0)
o/p : dv: org.apache.spark.mllib.linalg.Vector = [1.0,0.0,3.0]

  2. Sparse vector :

A sparse vector is backed by two parallel arrays: indices and values.

def sparse(size: Int, indices: Array[Int], values: Array[Double]):  Vector

Creates a sparse vector providing its index array and value array.

We can also declare a sparse vector another way, with a Seq of (index, value) pairs as the 2nd parameter and the size as the 1st:

def sparse(size: Int, elements: Seq[(Int, Double)]): Vector

Creates a sparse vector using unordered (index, value) pairs.

Example : 

val sv1: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))
o/p : sv1: org.apache.spark.mllib.linalg.Vector = (3,[0,2],[1.0,3.0])

val sv2: Vector = Vectors.sparse(3, Seq((0, 1.0), (2, 3.0)))
o/p : sv2: org.apache.spark.mllib.linalg.Vector = (3,[0,2],[1.0,3.0])


  3. Labeled Point :

A labeled point is a local vector, either dense or sparse, associated with a label/response. In MLlib, labeled points are used in supervised learning algorithms. We use a double to store a label, so we can use labeled points in both regression and classification. For binary classification, a label should be either 0 (negative) or 1 (positive). For multiclass classification, labels should be class indices starting from zero: 0, 1, 2, ....

A labeled point is represented by the case class LabeledPoint.

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Create a labeled point with a positive label and a dense feature vector.
val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))

// Create a labeled point with a negative label and a sparse feature vector.
val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))
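
In practice, labeled points are often read from files in LIBSVM format. A minimal sketch using MLUtils (the file path is a placeholder):

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD

// Load labeled points stored in LIBSVM format (the path is a placeholder).
val examples: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")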

Matrix

Spark MLlib supports 2 types of matrices: 1. local matrix and 2. distributed matrix.

Local Matrix :

The base class of local matrices is Matrix, and we provide two implementations: DenseMatrix, and SparseMatrix. We recommend using the factory methods implemented in Matrices to create local matrices. Remember, local matrices in MLlib are stored in column-major order.

  1. DenseMatrix :

    def dense(numRows: Int, numCols: Int, values: Array[Double]): Matrix

    Creates a column-major dense matrix.

    Example :
    
    import org.apache.spark.mllib.linalg.{Matrices, Matrix}
    
    val dm: Matrix = Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))
    
    o/p : dm: org.apache.spark.mllib.linalg.Matrix =
    1.0  2.0
    3.0  4.0
    5.0  6.0
  2. SparseMatrix :

    def sparse(numRows: Int, numCols: Int, colPtrs: Array[Int], rowIndices: Array[Int], values: Array[Double]): Matrix

    Creates a column-major sparse matrix in Compressed Sparse Column (CSC) format.

    numRows -> number of rows in the matrix
    numCols -> number of columns in the matrix
    colPtrs -> index corresponding to the start of each new column
    rowIndices -> row index of each element, stored in column-major order
    values -> the non-zero values, as doubles

    Example : 
    
     1.0 0.0 4.0
     0.0 3.0 5.0
     2.0 0.0 6.0
     
    
    is stored as values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0], rowIndices = [0, 2, 1, 0, 1, 2], colPtrs = [0, 2, 3, 6]. A direct construction of this matrix in code follows the second example below.
    Another example :
    
    val sm: Matrix = Matrices.sparse(3, 2, Array(0, 1, 3), Array(0, 2, 1), Array(9.0, 6.0, 8.0))
    
    O/p : sm: org.apache.spark.mllib.linalg.Matrix =
    3 x 2 CSCMatrix
    (0,0) 9.0
    (2,1) 6.0
    (1,1) 8.0
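
    To tie the CSC layout back to code, here is a minimal sketch constructing the first 3 x 3 example above directly from its arrays:
    
    // values, rowIndices, and colPtrs exactly as derived in the worked example.
    val csc: Matrix = Matrices.sparse(3, 3, Array(0, 2, 3, 6), Array(0, 2, 1, 0, 1, 2), Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0))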

    We will see distributed matrices in the next session.

 

Python – Java integration (Jython)

Python is an easy to learn, powerful programming language. It has efficient high-level data structures and a simple but effective approach to object-oriented programming. Python’s elegant syntax and dynamic typing, together with its interpreted nature, make it an ideal language for scripting and rapid application development in many areas on most platforms.

The Python interpreter and the extensive standard library are freely available in source or binary form for all major platforms from the Python Web site, http://www.python.org/, and may be freely distributed. The same site also contains distributions of and pointers to many free third party Python modules, programs and tools, and additional documentation.

The Python interpreter is easily extended with new functions and data types implemented in C or C++ (or other languages callable from C). Python is also suitable as an extension language for customizable applications.

Jython is an implementation of Python for the JVM. Jython takes the Python programming language syntax and enables it to run on the Java platform. This allows seamless integration with the use of Java libraries and other Java-based applications. The Jython project strives to make all Python modules run on the JVM, but there are a few differences between the implementations. Perhaps the major difference between the two implementations is that Jython does not work with C extensions. Therefore, most of the Python modules will run without changes under Jython, but if they use C extensions then they will probably not work. Likewise, Jython code works with Java but CPython does not. Jython code should run seamlessly under CPython unless it contains Java integration.

For further information, read the following:

http://www.jython.org/docs/tutorial/indexprogress.html