Page 51 - Building Big Data Applications

Chapter 2   Infrastructure and technology  45


                   FOREACH: Apply expression to each record and output one or more records
                   FILTER: Apply predicate and remove records that do not return true
                   GROUP/COGROUP: Collect records with the same key from one or more inputs
                   JOIN: Join two or more inputs based on a key
                   ORDER: Sort records based on a key
                   DISTINCT: Remove duplicate records
                   UNION: Merge two data sets
                   SPLIT: Split data into two or more sets, based on filter conditions
                   STREAM: Send all records through a user provided binary
                   DUMP: Write output to stdout
                   LIMIT: Limit the number of records
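
As a rough illustration of what several of these operators do to a set of records, here is a minimal Python sketch. It is not Pig code and does not reflect how Pig executes anything internally; the sample `records` and all variable names are made up for the example.

```python
# Hypothetical sample data: (name, value) records.
records = [("alice", 3), ("bob", 5), ("alice", 7), ("bob", 5)]

# FOREACH: apply an expression to each record.
doubled = [(name, value * 2) for name, value in records]

# FILTER: keep only records for which the predicate is true.
large = [r for r in records if r[1] > 4]

# GROUP: collect records that share the same key.
grouped = {}
for record in records:
    grouped.setdefault(record[0], []).append(record)

# DISTINCT: remove duplicate records (sorted only to make the result stable).
distinct = sorted(set(records))

# ORDER: sort records based on a key, here the numeric value.
ordered = sorted(records, key=lambda r: r[1])

# LIMIT: keep at most the first two records.
limited = ordered[:2]
```

Each step takes a collection of records and returns a new one, which is the same dataflow style that a Pig script expresses with named relations.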
                   During program execution, Pig first validates the syntax and semantics of each statement
                 and continues processing. Only when it encounters a DUMP or STORE does it execute the
                 statements that produce that output. For example, a Pig job to process compliance logs and
                 extract words and phrases looks like this:
                   A = load 'compliance_log';
                   B = foreach A generate
                       flatten(TOKENIZE((chararray)$0)) as word;
                   C = filter B by word matches '\\w+';
                   D = group C by word;
                   E = foreach D generate COUNT(C), group;
                   store E into 'compliance_log_freq';
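
To make the dataflow of the job above easier to follow, here is a minimal Python sketch of the same word-frequency logic. It is an approximation under stated assumptions: tokens are split on whitespace only (Pig's TOKENIZE also splits on some punctuation), and the function name `word_frequencies` and the sample lines are hypothetical.

```python
import re
from collections import Counter


def word_frequencies(lines):
    """Count how often each word-like token occurs in the input lines."""
    counts = Counter()
    for line in lines:                        # A: one record per input line
        for token in line.split():            # B: tokenize and flatten (whitespace only, an assumption)
            if re.fullmatch(r"\w+", token):   # C: keep tokens that match \w+ in full
                counts[token] += 1            # D and E: group by word and count
    return counts


# Hypothetical sample input standing in for the compliance log.
sample = ["audit the audit log", "log retention policy"]
frequencies = word_frequencies(sample)
```

The `fullmatch` call mirrors the fact that the `matches` operator tests the whole string against the pattern rather than searching for a substring.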
                   Now let us say that we want to analyze how many of these words also appear in FDA
                 mandates:

                   A = load 'FDA_Data';
                   B = foreach A generate
                       flatten(TOKENIZE((chararray)$0)) as word;
                   C = filter B by word matches '\\w+';
                   D = group C by word;
                   E = foreach D generate COUNT(C), group;
                   store E into 'FDA_Data_freq';

                   We can then join these two outputs to create a result set:
                   compliance = LOAD 'compliance_log_freq' AS (freq, word);
                   FDA = LOAD 'FDA_Data_freq' AS (freq, word);
                   inboth = JOIN compliance BY word, FDA BY word;
                   STORE inboth INTO 'output';
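
The join above is an inner join on `word`, so only words present in both frequency files reach the output. The following Python sketch shows that behavior on small, made-up inputs. The function name `inner_join_on_word` and the sample data are hypothetical, and the sketch says nothing about how Pig plans or runs the join.

```python
def inner_join_on_word(compliance, fda):
    """Join two lists of (freq, word) records on the word field."""
    # Index the second input by its join key.
    fda_by_word = {}
    for freq, word in fda:
        fda_by_word.setdefault(word, []).append(freq)

    # Emit one combined record per matching pair, keeping all fields of both sides.
    joined = []
    for freq, word in compliance:
        for fda_freq in fda_by_word.get(word, []):
            joined.append((freq, word, fda_freq, word))
    return joined


# Hypothetical frequency records in (freq, word) order, as stored by the jobs above.
compliance_freq = [(2, "audit"), (2, "log"), (1, "policy")]
fda_freq = [(4, "audit"), (1, "label"), (3, "policy")]
inboth = inner_join_on_word(compliance_freq, fda_freq)
```

Each joined record carries the fields of both inputs, so the word appears twice, once from each side. Words found in only one input, such as "log" and "label" here, are dropped.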