Friends, so far we have gone through the basic Spark MLlib data types. Spark MLlib also supports distributed matrices, which include RowMatrix, IndexedRowMatrix, CoordinateMatrix, and BlockMatrix. A distributed matrix has long-typed row and column indices and double-typed values, stored distributively in one or more RDDs. It is very important to choose the right … More Data types of Spark Mlib – Part 2
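As a minimal sketch of the simplest distributed matrix, RowMatrix, the snippet below builds one from an RDD of local vectors. It assumes the RDD-based Spark MLlib API (`spark-mllib`) is on the classpath and uses a local master purely for illustration; the object name is my own.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

object RowMatrixDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("RowMatrixDemo").setMaster("local[*]"))

    // Each local vector in the RDD becomes one row of the distributed matrix
    val rows = sc.parallelize(Seq(
      Vectors.dense(1.0, 2.0, 3.0),
      Vectors.dense(4.0, 5.0, 6.0)
    ))

    val mat = new RowMatrix(rows)
    println(s"rows = ${mat.numRows()}, cols = ${mat.numCols()}")
    sc.stop()
  }
}
```

The other types follow the same pattern: IndexedRowMatrix adds a long row index to each row, and CoordinateMatrix stores (row, column, value) entries, which suits very sparse data.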
Hello friends, Spark MLlib supports multiple data types in the form of vectors and matrices. A local vector has integer-typed, 0-based indices and double-typed values. There are two types of vectors. Dense vector: a dense vector is backed by a double array … More Data types of Spark Mlib – Part 1
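To make the two vector kinds concrete, here is a short sketch using the MLlib `Vectors` factory (it assumes `spark-mllib` on the classpath; no SparkContext is needed for local vectors):

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}

object LocalVectorDemo {
  def main(args: Array[String]): Unit = {
    // Dense vector: every value is stored in a backing double array
    val dv: Vector = Vectors.dense(1.0, 0.0, 3.0)

    // Sparse vector: total size, indices of the non-zero entries, and their values
    val sv: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))

    println(dv)  // [1.0,0.0,3.0]
    println(sv)  // (3,[0,2],[1.0,3.0])
  }
}
```

Both represent the same logical vector; the sparse form only pays for the non-zero entries.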
Writing a script in Scala but still want to follow object-oriented programming? Since most programmers come from an OOP background thanks to Java practice, you can still execute Spark scripts through spark-shell in an OOP style. Here I give a simple example of the collectAsync() future method, and also demonstrate how we can … More Running scala oops files from command line- spark-shell
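A minimal sketch of the idea: wrap the driver logic in an object, save it to a file, and load it inside spark-shell with `:load`. The object and file names are my own; `collectAsync()` is the real RDD method, returning a `FutureAction` that can be awaited like any Scala `Future`.

```scala
import org.apache.spark.SparkContext
import scala.concurrent.Await
import scala.concurrent.duration._

// Save as CollectAsyncDemo.scala, then inside spark-shell:
//   :load CollectAsyncDemo.scala
//   CollectAsyncDemo.run(sc)   // sc is the shell's SparkContext
object CollectAsyncDemo {
  def run(sc: SparkContext): Unit = {
    val rdd = sc.parallelize(1 to 5)

    // collectAsync() submits the job and returns immediately with a FutureAction;
    // the driver thread is free to do other work until we await the result
    val future = rdd.collectAsync()

    val result = Await.result(future, 30.seconds)
    println(result.mkString(","))  // 1,2,3,4,5
  }
}
```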
Scala collections come in two categories: mutable and immutable collections. The collection framework is Scala's core power. Let's see its diagram below.
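The mutable/immutable split can be shown in a few lines of plain Scala (names are my own):

```scala
import scala.collection.mutable.ListBuffer

object CollectionsDemo {
  def main(args: Array[String]): Unit = {
    // Immutable: appending returns a NEW list, the original is untouched
    val immutableList = List(1, 2, 3)
    val extended = immutableList :+ 4
    println(immutableList)  // List(1, 2, 3)
    println(extended)       // List(1, 2, 3, 4)

    // Mutable: ListBuffer is updated in place
    val buffer = ListBuffer(1, 2, 3)
    buffer += 4
    println(buffer)         // ListBuffer(1, 2, 3, 4)
  }
}
```

Immutable collections live under `scala.collection.immutable` and are the default; mutable variants must be imported explicitly from `scala.collection.mutable`.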
In Hadoop, partitioning data allows huge volumes of data to be processed in parallel, so that the entire dataset takes a minimum amount of time to process. Apache Spark decides partitioning based on different factors. Factors that decide default partitioning: on Hadoop, partitions follow the HDFS splits; filter or map functions don't change partitioning; number of … More Re-partitioning & partition in spark
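The partitioning behaviour described above can be observed directly with `getNumPartitions`. A minimal sketch, assuming a local Spark install (the app name and counts are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("PartitionDemo").setMaster("local[4]"))

    val rdd = sc.parallelize(1 to 100, numSlices = 4)
    println(rdd.getNumPartitions)                      // 4

    // Narrow transformations like map/filter keep the parent's partitioning
    println(rdd.filter(_ % 2 == 0).getNumPartitions)   // 4

    // repartition shuffles the data into the requested number of partitions
    println(rdd.repartition(8).getNumPartitions)       // 8

    // coalesce reduces the partition count, avoiding a full shuffle
    println(rdd.coalesce(2).getNumPartitions)          // 2

    sc.stop()
  }
}
```

`repartition` always shuffles, so it works in both directions; `coalesce` is the cheaper choice when you only need to shrink the partition count.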