Page 51 - Building Big Data Applications


Chapter 2 Infrastructure and technology 45

FOREACH: Apply an expression to each record and output one or more records
FILTER: Apply a predicate and remove records that do not return true
GROUP/COGROUP: Collect records with the same key from one or more inputs
JOIN: Join two or more inputs based on a key
ORDER: Sort records based on a key
DISTINCT: Remove duplicate records
UNION: Merge two data sets
SPLIT: Split data into two or more sets, based on filter conditions
STREAM: Send all records through a user-provided binary
DUMP: Write output to stdout
LIMIT: Limit the number of records
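To make the operator semantics concrete, the following is a minimal plain-Python sketch of what FILTER, DISTINCT, GROUP, FOREACH, ORDER, and LIMIT do to a small set of records. It is not Pig code and does not reflect how Pig executes these operators on a cluster; the sample records and variable names are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample records: (user, action) pairs.
records = [
    ("ann", "login"),
    ("bob", "login"),
    ("ann", "upload"),
    ("ann", "login"),   # duplicate of the first record
    ("cid", "logout"),
]

# FILTER: keep only records for which the predicate is true.
logins = [r for r in records if r[1] == "login"]

# DISTINCT: remove duplicate records (sorted here only for stable output).
distinct = sorted(set(records))

# GROUP: collect records that share a key (here, the user).
groups = defaultdict(list)
for r in records:
    groups[r[0]].append(r)

# FOREACH ... GENERATE: apply an expression to each group.
counts = [(user, len(rows)) for user, rows in groups.items()]

# ORDER then LIMIT: sort by count descending, keep the top 2.
top = sorted(counts, key=lambda t: (-t[1], t[0]))[:2]

print(logins)
print(distinct)
print(top)
```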
During program execution, Pig first validates the syntax and semantics of each
statement and then continues to the next one. Only when it encounters a DUMP or
STORE does it execute the statements needed to produce that output. For example,
a Pig job that processes compliance logs and extracts words and phrases looks like this:
A = load 'compliance_log';
B = foreach A generate
flatten(TOKENIZE((chararray)$0)) as word;
C = filter B by word matches '\\w+';
D = group C by word;
E = foreach D generate COUNT(C), group;
store E into 'compliance_log_freq';
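As a rough illustration of what this kind of word-frequency script computes, here is a plain-Python sketch of the same steps: tokenize, filter, group, and count. It makes a simplifying assumption that tokens are split on whitespace only, whereas Pig's TOKENIZE also splits on some other delimiters. The function name and sample lines are hypothetical.

```python
import re
from collections import Counter

def word_frequencies(lines):
    """Return a {word: count} mapping for tokens made only of word characters."""
    # TOKENIZE + flatten: emit one token per record.
    # Assumption: whitespace splitting only, a simplification of Pig's TOKENIZE.
    words = (w for line in lines for w in line.split())
    # FILTER ... matches '\\w+': Pig's matches must cover the whole string,
    # so fullmatch is the closest Python equivalent.
    kept = (w for w in words if re.fullmatch(r"\w+", w))
    # GROUP by word, then COUNT each group.
    return Counter(kept)

# Hypothetical log lines.
sample = ["audit trail review", "audit log-file review audit"]
freq = word_frequencies(sample)
print(sorted(freq.items()))
```

In this sketch the token log-file is dropped because the hyphen is not a word character, which mirrors the effect of the filter step.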
Now suppose that we want to analyze how many of these words appear in FDA
mandates:

A = load 'FDA_Data';
B = foreach A generate
flatten(TOKENIZE((chararray)$0)) as word;
C = filter B by word matches '\\w+';
D = group C by word;
E = foreach D generate COUNT(C), group;
store E into 'FDA_Data_freq';

We can then join these two outputs to create a result set:
compliance = LOAD 'compliance_log_freq' AS (freq, word);
FDA = LOAD 'FDA_Data_freq' AS (freq, word);
inboth = JOIN compliance BY word, FDA BY word;
STORE inboth INTO 'output';
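The effect of the join can be sketched in plain Python as an inner join on the word field. Each output record carries the fields of both inputs, and a word that appears in only one input is dropped. The helper name and the frequency values are hypothetical and chosen only for illustration.

```python
def inner_join(left, right):
    """Join two lists of (freq, word) records on word, keeping fields from both sides."""
    # Index the right-hand input by its join key.
    by_word = {}
    for freq, word in right:
        by_word.setdefault(word, []).append((freq, word))
    # Emit one combined record for every matching pair.
    joined = []
    for freq, word in left:
        for match in by_word.get(word, []):
            joined.append((freq, word) + match)
    return joined

# Hypothetical frequency tables.
compliance = [(3, "audit"), (2, "review"), (1, "trail")]
fda = [(5, "audit"), (4, "label"), (1, "review")]
inboth = inner_join(compliance, fda)
print(inboth)
```

Here trail and label do not appear in the result because each occurs in only one of the two inputs.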
