Pig – Functions

Friends, functions in Pig come in four types:
1. Eval function
– A function that takes one or more expressions and returns another expression.
– Some eval functions are aggregate functions, such as MAX, which operate on a bag of values and return a single value.
– Some aggregate functions are algebraic, which means that the result of the function can be calculated incrementally.
– In MapReduce terms, algebraic functions make use of the combiner and are much more efficient to calculate.
– Supports UDFs: import org.apache.pig.EvalFunc, extend EvalFunc, and override the exec method.
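
As a rough sketch of an aggregate eval function in use, the script below applies the built-in MAX to each group of a relation. The input path and the (year, temperature) schema are assumptions made up for illustration.

```pig
-- Hypothetical input: tab-delimited (year, temperature) records
records = LOAD 'input/sample.txt' AS (year:chararray, temperature:int);

-- Group the records so that each year gets its own bag of tuples
grouped = GROUP records BY year;

-- MAX is an aggregate eval function: it takes a bag and returns a single value
max_temp = FOREACH grouped GENERATE group AS year, MAX(records.temperature) AS max_temperature;

DUMP max_temp;
```

Because MAX is algebraic, Pig can compute partial maxima in the combiner before the reduce phase.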

2. Filter function
– A special type of eval function that returns a logical Boolean result.
– Used with the FILTER operator to remove unwanted rows.
– Ex: IsEmpty
– Supports UDFs: import org.apache.pig.FilterFunc, extend FilterFunc, and override the exec method.
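
A minimal sketch of a filter function in use, assuming two hypothetical relations A and B that share an id field (paths and schemas are invented for illustration). IsEmpty is applied to the bag that COGROUP produces for B.

```pig
-- Hypothetical inputs; file names and schemas are assumptions
A = LOAD 'input/A' AS (id:int, name:chararray);
B = LOAD 'input/B' AS (id:int, amount:int);

-- COGROUP yields one tuple per id, holding a bag of A tuples and a bag of B tuples
C = COGROUP A BY id, B BY id;

-- IsEmpty returns a Boolean, so it can be used directly in FILTER:
-- keep only the ids that have no matching tuples in B
D = FILTER C BY IsEmpty(B);

DUMP D;
```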

3. Load function
– Loads the data into a relation from external storage
– Supports UDFs: import org.apache.pig.LoadFunc and extend LoadFunc, but here you override several methods instead of exec: setLocation, getInputFormat, prepareToRead, and getNext.

4. Store function
– Specifies how to save the contents of a relation to external storage
– Ex: PigStorage, which loads data from delimited text files, can also store data in the same format.
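
To make the load and store functions concrete, here is a small sketch that uses PigStorage in both roles. The file paths, delimiters, and schema are assumptions chosen only for illustration.

```pig
-- Load a comma-delimited file (hypothetical path and schema)
records = LOAD 'input/sample.csv' USING PigStorage(',')
          AS (year:chararray, temperature:int, quality:int);

-- Store the same relation as colon-delimited text (hypothetical output path)
STORE records INTO 'output/sample_colon' USING PigStorage(':');
```

If the USING clause is left out, Pig falls back to PigStorage with a tab delimiter.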

A detailed list is given below:

Pig Built-in Functions
Category	Function	Description
Eval	AVG	Calculates the average (mean) value of the entries in a bag
	CONCAT	Concatenates byte arrays or character arrays together
	COUNT	Calculates the number of non-null entries in a bag
	COUNT_STAR	Calculates the number of entries in a bag, including nulls
	DIFF	Calculates the set difference of two bags. If the two arguments are not bags, returns a bag containing both if they are not equal; otherwise, returns an empty bag
	MAX	Calculates the maximum value of the entries in a bag
	MIN	Calculates the minimum value of the entries in a bag
	SIZE	For character arrays, the number of characters; for byte arrays, the number of bytes; for containers (tuple, bag, map), the number of entries
	SUM	Calculates the sum of the values of the entries in a bag
	TOBAG	Converts one or more expressions to individual tuples, which are then put in a bag
	TOKENIZE	Tokenizes a character array into a bag of its constituent words
	TOMAP	Converts an even number of expressions to a map of key-value pairs
	TOP	Calculates the top n tuples in a bag
	TOTUPLE	Converts one or more expressions to a tuple

Filter	IsEmpty	Tests whether a bag or map is empty

Load/Store	PigStorage	Loads or stores relations using a field-delimited text format; the field delimiter defaults to a tab character
	BinStorage	Loads or stores relations from or to binary files in a Pig-specific format that uses Hadoop Writable objects
	TextLoader	Loads relations from a plain-text format
	JsonLoader, JsonStorage	Loads or stores relations from or to a JSON format
	HBaseStorage	Loads or stores relations from or to HBase
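
Several of these built-ins can be combined in a few lines. The sketch below is the usual word-count pattern, using TextLoader, TOKENIZE, and COUNT. The input path is an assumption, and FLATTEN (a Pig Latin operator not listed in the table) is used to turn the bag of words into one row per word.

```pig
-- Each line of the (hypothetical) text file becomes a single-field tuple
lines = LOAD 'input/text.txt' USING TextLoader() AS (line:chararray);

-- TOKENIZE returns a bag of words; FLATTEN un-nests it into one row per word
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- Group identical words together and count the entries in each bag
grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group AS word, COUNT(words) AS total;

DUMP counts;
```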

Pig Latin Relational Operators

	Category	Operator	Description
	Loading & Storing	LOAD	Loads data from the filesystem or other storage into a relation
		STORE	Saves a relation to the filesystem or other storage
		DUMP	Prints a relation to the console

	Filtering	FILTER	Removes unwanted rows from a relation
		DISTINCT	Removes duplicate rows from a relation
		FOREACH...GENERATE	Adds or removes fields from a relation
		MAPREDUCE	Runs a MapReduce job using a relation as input
		STREAM	Transforms a relation using an external program
		SAMPLE	Selects a random sample of a relation

	Grouping & Joining	JOIN	Joins two or more relations
		COGROUP	Groups the data in two or more relations
		GROUP	Groups the data in a single relation
		CROSS	Creates the cross-product of two or more relations

	Sorting	ORDER	Sorts a relation by one or more fields
		LIMIT	Limits the size of a relation to a maximum number of tuples

	Combining & Splitting	UNION	Combines two or more relations into one
		SPLIT	Splits a relation into two or more relations
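
Here is a short sketch that chains several of the operators from the table: JOIN, FOREACH...GENERATE, ORDER, LIMIT, SPLIT, and STORE. The relations, schemas, paths, and the threshold of 100 are all assumptions picked for illustration.

```pig
-- Hypothetical inputs
A = LOAD 'input/A' AS (id:int, name:chararray);
B = LOAD 'input/B' AS (id:int, amount:int);

-- Join the two relations on id
joined = JOIN A BY id, B BY id;

-- After a JOIN, fields are disambiguated with the relation::field syntax
slim = FOREACH joined GENERATE A::id AS id, A::name AS name, B::amount AS amount;

-- Sort by amount and keep at most five tuples
sorted = ORDER slim BY amount DESC;
top5 = LIMIT sorted 5;

-- Split the joined data into two relations by a condition
SPLIT slim INTO small IF amount < 100, large IF amount >= 100;

STORE top5 INTO 'output/top5';
STORE small INTO 'output/small';
STORE large INTO 'output/large';
```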

Pig – A programmer friendly MapReduce tool

Pig raises the level of abstraction for processing large datasets. MapReduce allows you to specify a map function followed by a reduce function, but working out how to fit your data processing into this pattern, which often requires multiple MapReduce stages, can be a challenge.
Pig supports much richer data structures, typically multivalued and nested, and the set of transformations you can apply to the data is much more powerful.
One of Pig's powerful features is the join, which is not for the faint of heart in MapReduce.
Pig is made up of 2 pieces:
1. The language used to express data flows, called Pig Latin
2. The execution environment to run Pig Latin programs.
For multiquery execution, it is recommended to use STORE instead of DUMP. DUMP is a diagnostic tool: it always triggers execution, even in batch mode, which the STORE command does not.
Consider the following example:
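-- Load A once; both filters below read from it
A = LOAD 'input/pig/multiquery/A';
B = FILTER A BY $1 == 'banana';
C = FILTER A BY $1 != 'banana';
-- Two STORE statements, one output location for each filtered relation
STORE B INTO 'output/b';
STORE C INTO 'output/c';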
Pig can run this script as a single MapReduce job by reading A once and writing two output files from the job, one for each of B and C. This feature is called multiquery execution.
