HBase shell commands

describe 'tablename': Displays the metadata of a particular table. Example: describe 'employee'  {NAME => 'address', BLOOMFILTER => 'ROW', VERSIONS => '5', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'GZ', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'} {NAME => 'personal_info', BLOOMFILTER => 'ROW', … More HBase shell commands
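A small sketch of reading the same table metadata programmatically, assuming the HBase 2.x client API and an existing 'employee' table (the table and family names are only illustrative):

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.ConnectionFactory

val conf = HBaseConfiguration.create()                  // reads hbase-site.xml from the classpath
val connection = ConnectionFactory.createConnection(conf)
val admin = connection.getAdmin

// Rough equivalent of `describe 'employee'`: one descriptor per column family.
val descriptor = admin.getDescriptor(TableName.valueOf("employee"))
descriptor.getColumnFamilies.foreach { cf =>
  println(s"${cf.getNameAsString}: VERSIONS=${cf.getMaxVersions}, TTL=${cf.getTimeToLive}, " +
          s"COMPRESSION=${cf.getCompressionType}")
}

connection.close()
```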

Data types of Spark MLlib – Part 2

Friends, so far we have gone through the basic Spark MLlib data types. Spark MLlib also supports distributed matrices, which include RowMatrix, IndexedRowMatrix, CoordinateMatrix, and BlockMatrix. A distributed matrix has long-typed row and column indices and double-typed values, stored distributively in one or more RDDs. It is very important to choose the right … More Data types of Spark MLlib – Part 2
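A minimal sketch of the simplest distributed matrix, a RowMatrix built from an RDD of local vectors, assuming a spark-shell session where sc is the SparkContext (the values are toy data):

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Each local Vector becomes one row of the distributed matrix.
val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0, 3.0),
  Vectors.dense(4.0, 5.0, 6.0),
  Vectors.dense(7.0, 8.0, 9.0)
))

val mat = new RowMatrix(rows)

// Row and column counts are computed over the underlying RDD.
println(s"rows = ${mat.numRows()}, cols = ${mat.numCols()}")
```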

Running Scala OOP files from the command line – spark-shell

When writing a script in Scala you may still want to follow object-oriented programming, since most programmers come from an OOP background through Java practice; you can still execute Spark scripts in an OOP way using spark-shell. Here I am giving a simple example of the collectAsync() future method and also demonstrating how we can … More Running Scala OOP files from the command line – spark-shell
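A minimal sketch of the pattern, assuming a hypothetical file MyJob.scala loaded into an interactive session with :load MyJob.scala (or spark-shell -i MyJob.scala), where sc is the SparkContext provided by spark-shell:

```scala
import scala.concurrent.Await
import scala.concurrent.duration._

// Hypothetical object wrapping the job logic in an OOP style.
object MyJob {
  def run(): Unit = {
    val rdd = sc.parallelize(1 to 100)

    // collectAsync() returns a FutureAction instead of blocking right away.
    val future = rdd.map(_ * 2).collectAsync()

    // Block only when the result is actually needed.
    val result = Await.result(future, 60.seconds)
    println(s"collected ${result.size} elements, first = ${result.head}")
  }
}

MyJob.run()
```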

Apache GraphX

GraphX is a new component in Spark for graphs and graph-parallel computation. At a high level, GraphX extends the Spark RDD by introducing a new Graph abstraction: a directed multigraph with properties attached to each vertex and edge. To support graph computation, GraphX exposes a set of fundamental operators. Spark GraphX is a graph processing … More Apache GraphX
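A minimal sketch of the property-graph abstraction, assuming a spark-shell session where sc is the SparkContext and using made-up vertex and edge data:

```scala
import org.apache.spark.graphx.{Edge, Graph}

// Vertices: (id, property) pairs; the names are illustrative only.
val vertices = sc.parallelize(Seq(
  (1L, "alice"),
  (2L, "bob"),
  (3L, "carol")
))

// Directed edges carrying a String property describing the relationship.
val edges = sc.parallelize(Seq(
  Edge(1L, 2L, "follows"),
  Edge(2L, 3L, "follows"),
  Edge(3L, 1L, "likes")
))

val graph = Graph(vertices, edges)

// Fundamental operators work over the vertex and edge views of the graph.
println(s"vertices = ${graph.numVertices}, edges = ${graph.numEdges}")
graph.triplets.collect().foreach(t => println(s"${t.srcAttr} ${t.attr} ${t.dstAttr}"))
```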

CueSheet – Easy spark application deployment guide

CueSheet is a framework for writing Apache Spark 2.x applications more conveniently, designed to neatly separate the concerns of the business logic and the deployment environment, as well as to minimize the usage of shell scripts, which are inconvenient to write and do not support validation. To jump-start, check out cuesheet-starter-kit, which provides the skeleton … More CueSheet – Easy spark application deployment guide

Complexity analysis – Big O notation table

Searching Algorithm          Data Structure                        Time (Average)   Time (Worst)    Space (Worst)
Depth First Search (DFS)     Graph of |V| vertices and |E| edges   –                O(|E| + |V|)    O(|V|)
Breadth First Search (BFS)   Graph of |V| vertices and |E| edges   –                O(|E| + |V|)    O(|V|)
Binary search                Sorted array of n elements            O(log(n))        O(log(n))       O(1)
Linear (Brute … More Complexity analysis – Big O notation table
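As a concrete illustration of the O(log(n)) entry above, here is a small sketch of an iterative binary search over a sorted array (the sample data is made up):

```scala
// Returns the index of target in the sorted array xs, or -1 if absent.
// Each iteration halves the search range: O(log(n)) time, O(1) extra space.
def binarySearch(xs: Array[Int], target: Int): Int = {
  var lo = 0
  var hi = xs.length - 1
  while (lo <= hi) {
    val mid = lo + (hi - lo) / 2
    if (xs(mid) == target) return mid
    else if (xs(mid) < target) lo = mid + 1
    else hi = mid - 1
  }
  -1
}

println(binarySearch(Array(2, 5, 8, 12, 23, 38, 56), 23))  // prints 4
```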

Re-partitioning & partitioning in Spark

In Hadoop, partitioning data allows a huge volume of data to be processed in parallel so that it takes the minimum amount of time to process the entire dataset. Apache Spark decides partitioning based on different factors. Factors that decide default partitioning: on Hadoop, the split is determined by HDFS; filter or map functions don't change partitioning; number of … More Re-partitioning & partitioning in Spark
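A minimal sketch of inspecting and changing partitioning, assuming a spark-shell session where sc is the SparkContext:

```scala
val rdd = sc.parallelize(1 to 1000)   // default partitioning decided by Spark
println(s"default partitions = ${rdd.getNumPartitions}")

// map/filter keep the existing partitioning.
val filtered = rdd.filter(_ % 2 == 0)
println(s"after filter       = ${filtered.getNumPartitions}")

// repartition(n) triggers a full shuffle to reach exactly n partitions.
val wider = filtered.repartition(8)
println(s"after repartition  = ${wider.getNumPartitions}")

// coalesce(n) reduces partitions while avoiding a full shuffle where possible.
val narrower = wider.coalesce(2)
println(s"after coalesce     = ${narrower.getNumPartitions}")
```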