Pig – A programmer-friendly MapReduce tool

Pig raises the level of abstraction for processing large datasets. MapReduce lets you specify a map function followed by a reduce function, but working out how to fit your data processing into this pattern, which often requires multiple MapReduce stages, can be a challenge.
Pig supports richer data structures, typically multivalued and nested, and the set of transformations you can apply to the data is much more powerful.
One of Pig's powerful features is the join, an operation that is not for the faint of heart in MapReduce.
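To make "multivalued and nested" concrete, here is a minimal sketch of how a nested structure can arise in Pig Latin. The input path 'input/fruit' and the field names year and fruit are assumptions made up for illustration.

```pig
-- Hypothetical input of (year, fruit) records; path and field names are assumptions.
records = LOAD 'input/fruit' AS (year:int, fruit:chararray);

-- GROUP yields one tuple per distinct fruit. The first field is the key
-- (named group) and the second is a bag holding every matching input tuple.
by_fruit = GROUP records BY fruit;

-- Print the schema of the grouped relation to inspect the nested bag.
DESCRIBE by_fruit;
```

DESCRIBE should report a schema in which the second field is a bag of (year, fruit) tuples, which is the kind of nested value a plain key/value pair in MapReduce does not give you directly.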
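As a rough illustration of how compact a join is in Pig Latin, consider the sketch below. The two inputs, their paths, and their field names are hypothetical and chosen only for the example.

```pig
-- Hypothetical inputs; paths and schemas are assumptions for illustration.
users  = LOAD 'input/users'  AS (id:int, name:chararray);
orders = LOAD 'input/orders' AS (user_id:int, item:chararray);

-- Inner join: keep the pairs of tuples whose keys match.
joined = JOIN users BY id, orders BY user_id;

STORE joined INTO 'output/joined';
```

The join itself is a single statement. Writing the equivalent by hand in MapReduce would mean tagging records by source, grouping them on the key, and pairing them up in the reducer.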
Pig is made up of two pieces:
1. The language used to express data flows, called Pig Latin.
2. The execution environment to run Pig Latin programs.
For multiquery execution, it is always recommended to use STORE instead of DUMP. DUMP is a diagnostic tool: it always triggers execution immediately, even in batch mode, whereas STORE does not.
Consider the following example:
A = LOAD 'input/pig/multiquery/A';
B = FILTER A BY $1 == 'banana';
C = FILTER A BY $1 != 'banana';
STORE B INTO 'output/b';
STORE C INTO 'output/c';
Pig can run this script as a single MapReduce job by reading A once and writing two output files from the job, one each for B and C. This feature is called multiquery execution.
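For contrast, here is a sketch of the same script with a DUMP left in for debugging. This variant is an illustration only and is not part of the original example.

```pig
A = LOAD 'input/pig/multiquery/A';
B = FILTER A BY $1 == 'banana';
C = FILTER A BY $1 != 'banana';

-- DUMP is a diagnostic statement: it forces B to be computed and printed
-- right away instead of being planned together with the STORE below.
DUMP B;

STORE C INTO 'output/c';
```

In this form Pig would likely read A more than once, once for the DUMP and again for the STORE, so the single-job benefit described above is lost. Replacing the DUMP with a STORE keeps both outputs in one job.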
