Pig – Functions

Friends, functions in Pig come in four types:
1. Eval function
– A function that takes one or more expressions and returns another expression.
– Some eval functions are aggregate functions, such as MAX.
– Some functions are algebraic, which means that the result of the function may be calculated incrementally.
– In MapReduce terms, algebraic functions make use of the combiner and are much more efficient to calculate.
– Supports UDFs: import org.apache.pig.EvalFunc, extend EvalFunc, and override the exec method.

2. Filter function
– A special type of eval function that returns a logical Boolean result.
– Used with the FILTER operator to remove unwanted rows.
– Ex: IsEmpty
– Supports UDFs: import org.apache.pig.FilterFunc, extend FilterFunc, and override the exec method.

3. Load function
– Loads the data into a relation from external storage
– Supports UDFs: import org.apache.pig.LoadFunc and extend LoadFunc, but override a different set of methods: setLocation, getInputFormat, prepareToRead, and getNext.

4. Store function
– Specifies how to save the contents of a relation to external storage
– Ex: PigStorage, which loads data from delimited text files, can also store data in the same format.
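
To see why algebraic eval functions pair well with a combiner, here is a minimal sketch in plain Python (not Pig's API) of an average calculated incrementally. The names initial, combine, and final are illustrative assumptions that loosely mirror the map, combine, and reduce stages, and each chunk stands in for the records seen by one map task.

```python
from functools import reduce

def initial(values):
    """Map side: turn one chunk of values into a partial (sum, count)."""
    return (sum(values), len(values))

def combine(p1, p2):
    """Combiner: merge two partials without revisiting the raw values."""
    return (p1[0] + p2[0], p1[1] + p2[1])

def final(partial):
    """Reduce side: turn the merged partial into the average."""
    total, count = partial
    return total / count if count else None

# Hypothetical input, split the way three map tasks might see it.
chunks = [[1, 2, 3], [4, 5], [6]]
partials = [initial(c) for c in chunks]
average = final(reduce(combine, partials))
print(average)  # 3.5
```

Because only small (sum, count) pairs travel between stages, far less data has to be shuffled than if every raw value were sent to a single reducer.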

Detailed lists are given below.

Pig Built-in Functions

Eval
– AVG: Calculates the average (mean) value of the entries in a bag.
– CONCAT: Concatenates byte arrays or character arrays together.
– COUNT: Calculates the number of non-null entries in a bag.
– COUNT_STAR: Calculates the number of entries in a bag, including those that are null.
– DIFF: Calculates the set difference of two bags. If the two arguments are not bags, returns a bag containing both if they are not equal; otherwise, returns an empty bag.
– MAX: Calculates the maximum value of the entries in a bag.
– MIN: Calculates the minimum value of the entries in a bag.
– SIZE: Calculates the size of a type. For character arrays, it is the number of characters; for byte arrays, the number of bytes; for containers (tuple, bag, map), the number of entries.
– SUM: Calculates the sum of the values of the entries in a bag.
– TOBAG: Converts one or more expressions to individual tuples, which are then put in a bag.
– TOKENIZE: Tokenizes a character array into a bag of its constituent words.
– TOMAP: Converts an even number of expressions to a map of key-value pairs.
– TOP: Calculates the top n tuples in a bag.
– TOTUPLE: Converts one or more expressions to a tuple.

Filter
– IsEmpty: Tests whether a bag or map is empty.

Load/Store
– PigStorage: Loads or stores relations using a field-delimited text format. The delimiter defaults to a tab character.
– BinStorage: Loads or stores relations from or to binary files in a Pig-specific format that uses Hadoop Writable objects.
– TextLoader: Loads relations from a plain-text format.
– JsonLoader, JsonStorage: Load or store relations from or to a JSON format.
– HBaseStorage: Loads or stores relations from or to HBase.
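
The difference between COUNT and COUNT_STAR is easy to miss, so here is a small plain-Python model of it (a sketch of the semantics only, not Pig code). It assumes a bag can be represented as a list of tuples with None standing in for a Pig null, and that COUNT skips a tuple when its first field is null. The function names are illustrative.

```python
# A bag is modelled as a list of tuples; None stands in for a Pig null.
bag = [("apple",), (None,), ("banana",), (None,)]

def count(bag):
    """Model of COUNT: skip tuples whose first field is null."""
    return sum(1 for t in bag if t[0] is not None)

def count_star(bag):
    """Model of COUNT_STAR: count every tuple, nulls included."""
    return len(bag)

def is_empty(bag):
    """Model of IsEmpty: true when the bag holds no tuples."""
    return len(bag) == 0

print(count(bag), count_star(bag))   # 2 4
print(is_empty(bag), is_empty([]))   # False True
```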

Pig Latin Relational Operators

Loading & Storing
– LOAD: Loads data from the filesystem or other storage into a relation.
– STORE: Saves a relation to the filesystem or other storage.
– DUMP: Prints a relation to the console.

Filtering
– FILTER: Removes unwanted rows from a relation.
– DISTINCT: Removes duplicate rows from a relation.
– FOREACH...GENERATE: Adds or removes fields from a relation.
– MAPREDUCE: Runs a MapReduce job using a relation as input.
– STREAM: Transforms a relation using an external program.
– SAMPLE: Selects a random sample of a relation.

Grouping & Joining
– JOIN: Joins two or more relations.
– COGROUP: Groups the data in two or more relations.
– GROUP: Groups the data in a single relation.
– CROSS: Creates the cross-product of two or more relations.

Sorting
– ORDER: Sorts a relation by one or more fields.
– LIMIT: Limits the size of a relation to a maximum number of tuples.

Combining & Splitting
– UNION: Combines two or more relations into one.
– SPLIT: Splits a relation into two or more relations.
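
GROUP and COGROUP are the operators whose output shape tends to surprise newcomers, so the following plain-Python sketch models what each one produces. It is not Pig code: relations are assumed to be lists of tuples, the sample data is hypothetical, and the results are sorted by key only to make the printout deterministic (Pig makes no such ordering promise).

```python
from collections import defaultdict

def group(relation, key):
    """Model of GROUP: one (key, bag) pair per distinct key value."""
    bags = defaultdict(list)
    for row in relation:
        bags[row[key]].append(row)
    return sorted(bags.items())

def cogroup(rel_a, key_a, rel_b, key_b):
    """Model of COGROUP: one (key, bag_a, bag_b) triple per key in either input."""
    bags_a, bags_b = defaultdict(list), defaultdict(list)
    for row in rel_a:
        bags_a[row[key_a]].append(row)
    for row in rel_b:
        bags_b[row[key_b]].append(row)
    keys = sorted(set(bags_a) | set(bags_b))
    # A key missing from one input gets an empty bag for that input.
    return [(k, bags_a.get(k, []), bags_b.get(k, [])) for k in keys]

# Hypothetical sample relations.
stock = [("apple", 3), ("banana", 5), ("apple", 1)]
prices = [("apple", 0.5), ("cherry", 2.0)]

for record in group(stock, 0):
    print(record)
for record in cogroup(stock, 0, prices, 0):
    print(record)
```

In this model a JOIN would correspond to a COGROUP followed by flattening the two bags against each other, which is why keys with an empty bag on either side drop out of an inner join.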

Pig – A programmer friendly MapReduce tool

Pig raises the level of abstraction for processing large datasets. MapReduce allows you to specify a map function followed by a reduce function, but working out how to fit your data processing into this pattern, which often requires multiple MapReduce stages, can be a challenge.
Pig supports richer data structures, typically multivalued and nested, and the set of transformations you can apply to the data is much more powerful.
One of Pig's powerful features is the join, which is not for the faint of heart in MapReduce.
Pig is made up of 2 pieces:
1. The language used to express data flows, called Pig Latin
2. The execution environment to run Pig Latin programs.
For multiquery execution, it is recommended to use STORE instead of DUMP. DUMP is a diagnostic tool and always triggers execution, even in batch mode, whereas a STORE command in batch mode does not.
Consider the following example:
A = LOAD 'input/pig/multiquery/A';
B = FILTER A BY $1 == 'banana';
C = FILTER A BY $1 != 'banana';
STORE B INTO 'output/b';
STORE C INTO 'output/c';
Pig can run this script as a single MapReduce job by reading A once and writing two output files from the job, one for each of B and C. This feature is called multiquery execution.
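
The idea of reading the input once and feeding two outputs can be sketched in plain Python. This is only a model of the data flow under stated assumptions, not how Pig implements it: rows are tuples, None stands in for a Pig null, the sample rows are hypothetical, and the function name run_multiquery is made up for illustration.

```python
def run_multiquery(rows):
    """Read the input once and route each row to at most one of two outputs."""
    out_b, out_c = [], []
    for row in rows:                 # single pass over A
        if row[1] is None:
            continue                 # a null comparison passes neither filter
        if row[1] == "banana":
            out_b.append(row)        # B = FILTER A BY $1 == 'banana'
        else:
            out_c.append(row)        # C = FILTER A BY $1 != 'banana'
    return out_b, out_c

# Hypothetical contents of A.
a = [(1, "banana"), (2, "apple"), (3, None), (4, "banana")]
b, c = run_multiquery(a)
print(b)  # [(1, 'banana'), (4, 'banana')]
print(c)  # [(2, 'apple')]
```

Without this optimization, each STORE would be run as its own job and A would be read twice.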