HBase shell commands

  1. describe ‘tablename’ :
    Displays the metadata of particular table
    Example :

    describe 'employee' 
    
    {NAME => 'address', BLOOMFILTER => 'ROW', VERSIONS => '5', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', 
    
    TTL => 'FOREVER', COMPRESSION => 'GZ', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'} 
    
    {NAME => 'personal_info', BLOOMFILTER => 'ROW', VERSIONS => '2', IN_MEMORY => 'true', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NO
    
    NE', TTL => 'FOREVER', COMPRESSION => 'GZ', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'} 
    
    {NAME => 'professional_info', BLOOMFILTER => 'ROW', VERSIONS => '3', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING =
    
    > 'NONE', TTL => 'FOREVER', COMPRESSION => 'GZ', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
  2. disable ’employee’ :
    Always disable table before performing any DDL operation
  3. alter ’employee’, {NAME => ‘col_fam1’, COMPRESSION => ‘GZ’}
    With alter command you can add columns on fly as well as add different parameter to columns like storing IN_MEMORY , Setting compression for particular column family , setting number of versions etc.
  4. list :
    List all the tables stored in Hbase :
    Example :
    list
    [“employee”, “table1”, “user_hcat_load_table”]
  5. enable_all ‘t.*’ :
    Enables all the table matching regex
  6. exists ‘student’ :
    Check weather student table exist or not
  7. show_filters :
    Shows all the available filter in Hbase
  8. alter_status :
    Get the status of the alter command. Indicates the number of regions of the table that have received the updated schema Pass table name.
  9. alter_async :
    Alter column family schema, does not wait for all regions to receive the
    schema changes. Pass table name and a dictionary specifying new column
    family schema. Dictionaries are described on the main help command output.
    Dictionary must include name of column family to alter.
    To change or add the ‘f1’ column family in table ‘t1’ from defaults
    to instead keep a maximum of 5 cell VERSIONS, do:hbase> alter_async ‘t1’, NAME => ‘f1’, VERSIONS => 5To delete the ‘f1’ column family in table ‘t1’, do:

    hbase> alter_async ‘t1’, NAME => ‘f1’, METHOD => ‘delete’or a shorter version:hbase> alter_async ‘t1’, ‘delete’ => ‘f1’
    You can also change table-scope attributes like MAX_FILESIZE
    MEMSTORE_FLUSHSIZE, READONLY, and DEFERRED_LOG_FLUSH.

    For example, to change the max size of a family to 128MB, do:

    hbase> alter ‘t1’, METHOD => ‘table_att’, MAX_FILESIZE => ‘134217728’

    There could be more than one alteration in one command:

    hbase> alter ‘t1’, {NAME => ‘f1’}, {NAME => ‘f2’, METHOD => ‘delete’}

    To check if all the regions have been updated, use alter_status <table_name>

  10. count :
    Counts the number of rows in table. COUNT interval is by default 1000, one can increase the interval as well as set scan caching on count scan by default.
    hbase> count ‘t1’, INTERVAL => 100000
    hbase> count ‘t1’, CACHE => 1000
    hbase> count ‘t1’, INTERVAL => 10, CACHE => 1000

  11. delete :
    Put a delete cell value at specified table/row/column and optionally
    timestamp coordinates. Deletes must match the deleted cell’s
    coordinates exactly. When scanning, a delete cell suppresses older
    versions. To delete a cell from ‘t1’ at row ‘r1’ under column ‘c1’
    marked with the time ‘ts1’, do:hbase> delete ‘t1’, ‘r1’, ‘c1’, ts1
  12. deleteall :
    Delete all cells in a given row; pass a table name, row, and optionally
    a column and timestamp. Examples:hbase> deleteall ‘t1’, ‘r1’
    hbase> deleteall ‘t1’, ‘r1’, ‘c1’
    hbase> deleteall ‘t1’, ‘r1’, ‘c1’, ts1

  13. get :
    Get row or cell contents; pass table name, row, and optionally
    a dictionary of column(s), timestamp, timerange and versions.
    Examples:
    hbase> get ‘t1’, ‘r1’
    hbase> get ‘t1’, ‘r1’, {TIMERANGE => [ts1, ts2]}
    hbase> get ‘t1’, ‘r1’, {COLUMN => ‘c1’}
    hbase> get ‘t1’, ‘r1’, {COLUMN => [‘c1’, ‘c2’, ‘c3’]}
    hbase> get ‘t1’, ‘r1’, {COLUMN => ‘c1’, TIMESTAMP => ts1}
    hbase> get ‘t1’, ‘r1’, {COLUMN => ‘c1’, TIMERANGE => [ts1, ts2], VERSIONS => 4}
    hbase> get ‘t1’, ‘r1’, {COLUMN => ‘c1’, TIMESTAMP => ts1, VERSIONS => 4}
    hbase> get ‘t1’, ‘r1’, {FILTER => “ValueFilter(=, ‘binary:abc’)”}
    hbase> get ‘t1’, ‘r1’, ‘c1’
    hbase> get ‘t1’, ‘r1’, ‘c1’, ‘c2’
    hbase> get ‘t1’, ‘r1’, [‘c1’, ‘c2’]
  14.  put :
    Put a cell ‘value’ at specified table/row/column and optionally
    timestamp coordinates. To put a cell value into table ‘t1’ at
    row ‘r1’ under column ‘c1’ marked with the time ‘ts1’, do:hbase> put ‘t1’, ‘r1’, ‘c1’, ‘value’, ts1

  15. scan :
    Scan a table; pass table name and optionally a dictionary of scanner
    specifications. Scanner specifications may include one or more of:
    TIMERANGE, FILTER, LIMIT, STARTROW, STOPROW, TIMESTAMP, MAXLENGTH,
    or COLUMNS, CACHEIf no columns are specified, all columns will be scanned.
    To scan all members of a column family, leave the qualifier empty as in
    ‘col_family:’.The filter can be specified in two ways:
    1. Using a filterString – more information on this is available in the
    Filter Language document attached to the HBASE-4176 JIRA
    2. Using the entire package name of the filter.Some examples:hbase> scan ‘.META.’
    hbase> scan ‘.META.’, {COLUMNS => ‘info:regioninfo’}
    hbase> scan ‘t1’, {COLUMNS => [‘c1’, ‘c2’], LIMIT => 10, STARTROW => ‘xyz’}
    hbase> scan ‘t1’, {COLUMNS => ‘c1’, TIMERANGE => [1303668804, 1303668904]}
    hbase> scan ‘t1’, {FILTER => “(PrefixFilter (‘row2’) AND
    (QualifierFilter (>=, ‘binary:xyz’))) AND (TimestampsFilter ( 123, 456))”}
    hbase> scan ‘t1’, {FILTER =>
    org.apache.hadoop.hbase.filter.ColumnPaginationFilter.new(1, 0)}
  16. truncate :
    Disables, drops and recreates the specified table.
    Examples:
    hbase>truncate ‘t1’

    Reference :
    https://learnhbase.wordpress.com/2013/03/02/hbase-shell-commands/

    https://www.cloudera.com/documentation/enterprise/5-5-x/topics/admin_hbase_filtering.html

 

 

Advertisements

Data types of Spark Mlib – Part 2

Friends, so far we have gone through Basic types of spark mllib data type. Spark mllib also supports distributed matrices that includes Row Matrix , IndexedRowMatrix, CoordinateMatrix, BlockMatrix.

A distributed matrix has long-typed row and column indices and double-typed values, stored distributively in one or more RDDs. It is very important to choose the right format to store large and distributed matrices. Converting a distributed matrix to a different format may require a global shuffle, which is quite expensive. Four types of distributed matrices have been implemented so far.

  1. RowMatrix :A RowMatrix is made up of collection of row which made up of local vector. Since each row is represented by a local vector, the number of columns is limited by the integer range but it should be much smaller in practice.

    A RowMatrix can be created from an RDD[Vector] instance. Then we can compute its column summary statistics and decompositions. QR decomposition is of the form A = QR where Q is an orthogonal matrix and R is an upper triangular matrix.

    Example :

  2.  
    val rows: RDD[Vector] = sc.parallelize(Seq(Vectors.dense(1.0, 0.0, 3.0),Vectors.dense(2.0,30.0,4.0))) 
    O/p :rows: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = ParallelCollectionRDD[2] at parallelize at <console>:26
    val mat: RowMatrix = new RowMatrix(rows) // Get its size. val m = mat.numRows() val n = mat.numCols()
    
  3. IndexedRowMatrix :

    It is similar to RowMatrix where each row represented by it’s index & local vector
    An IndexedRowMatrix can be created from an RDD[IndexedRow] instance, where IndexedRow is a wrapper over (Long, Vector).

    Example :

    import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix, RowMatrix}
    
    val rows: RDD[IndexedRow] = sc.parallelize(Seq(IndexedRow(0,Vectors.dense(1.0,2.0,3.0)),IndexedRow(1,Vectors.dense(4.0,5.0,6.0)))
    // Create an IndexedRowMatrix from an RDD[IndexedRow].
    val mat: IndexedRowMatrix = new IndexedRowMatrix(rows)
    
    // Get its size.
    val m = mat.numRows()
    val n = mat.numCols()
    
    // Drop its row indices.
    val rowMat: RowMatrix = mat.toRowMatrix()

     

  4. Coordinate Matrix :
    A coordinate matrix is backed by rowindices , columnindices & value in double & can be created from an RDD[MatrixEntry]. We can also convert Coordinate matrix to indexedRowMatrix by calling toIndexRowMatrix

    Example :

    import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
    
    val entries: RDD[MatrixEntry] = sc.parallelize(Seq(MatrixEntry(0,1,1.2),MatrixEntry(0,2,1.5),MatrixEntry(0,3,2.5)))
    // Create a CoordinateMatrix from an RDD[MatrixEntry].
    val mat: CoordinateMatrix = new CoordinateMatrix(entries)
    
    // Get its size.
    val m = mat.numRows()
    val n = mat.numCols()
    
    // Convert it to an IndexRowMatrix whose rows are sparse vectors.
    val indexedRowMatrix = mat.toIndexedRowMatrix()
  5. BlockMatrix  :

    When we want to perform multiple matrix addition & multiplication where each submatrix backed by it’s index we uses BlockMatrix.

    A BlockMatrix can be most easily created from an IndexedRowMatrix or CoordinateMatrix by calling toBlockMatrix which will create block size of 1024X1024.

    Example :

    import org.apache.spark.mllib.linalg.distributed.{BlockMatrix, CoordinateMatrix, MatrixEntry}
    
    val entries: RDD[MatrixEntry] = sc.parallelize(Seq(MatrixEntry(0,1,1.2),MatrixEntry(0,2,1.5),MatrixEntry(0,3,2.5)))
    // Create a CoordinateMatrix from an RDD[MatrixEntry].
    val coordMat: CoordinateMatrix = new CoordinateMatrix(entries)
    // Transform the CoordinateMatrix to a BlockMatrix
    val matA: BlockMatrix = coordMat.toBlockMatrix().cache()
    
    // Validate whether the BlockMatrix is set up properly. Throws an Exception when it is not valid.
    // Nothing happens if it is valid.
    matA.validate()
    
    // Calculate A^T A.
    val ata = matA.transpose.multiply(matA)