Apache Pig – Cross Operator

The CROSS operator computes the cross-product of two or more relations. This chapter explains, with an example, how to use the CROSS operator in Pig Latin.

Syntax

Given below is the syntax of the CROSS operator.

    grunt> Relation3_name = CROSS Relation1_name, Relation2_name;

Example

Assume that we have two files, customers.txt and orders.txt, in the /pig_data/ directory of HDFS as shown below.

customers.txt

    1,Ramesh,32,Ahmedabad,2000.00
    2,Khilan,25,Delhi,1500.00
    3,kaushik,23,Kota,2000.00
    4,Chaitali,25,Mumbai,6500.00
    5,Hardik,27,Bhopal,8500.00
    6,Komal,22,MP,4500.00
    7,Muffy,24,Indore,10000.00

orders.txt

    102,2009-10-08 00:00:00,3,3000
    100,2009-10-08 00:00:00,3,1500
    101,2009-11-20 00:00:00,2,1560
    103,2008-05-20 00:00:00,4,2060

We have loaded these two files into Pig as the relations customers and orders, as shown below.

    grunt> customers = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',')
       as (id:int, name:chararray, age:int, address:chararray, salary:int);

    grunt> orders = LOAD 'hdfs://localhost:9000/pig_data/orders.txt' USING PigStorage(',')
       as (oid:int, date:chararray, customer_id:int, amount:int);

Let us now compute the cross-product of these two relations using the CROSS operator, as shown below.

    grunt> cross_data = CROSS customers, orders;

Verification

Verify the relation cross_data using the DUMP operator as shown below.

    grunt> Dump cross_data;

Output

It will produce the following output, displaying the contents of the relation cross_data. Every customer tuple is paired with every order tuple, giving 7 x 4 = 28 tuples.

    (7,Muffy,24,Indore,10000,103,2008-05-20 00:00:00,4,2060)
    (7,Muffy,24,Indore,10000,101,2009-11-20 00:00:00,2,1560)
    (7,Muffy,24,Indore,10000,100,2009-10-08 00:00:00,3,1500)
    (7,Muffy,24,Indore,10000,102,2009-10-08 00:00:00,3,3000)
    (6,Komal,22,MP,4500,103,2008-05-20 00:00:00,4,2060)
    (6,Komal,22,MP,4500,101,2009-11-20 00:00:00,2,1560)
    (6,Komal,22,MP,4500,100,2009-10-08 00:00:00,3,1500)
    (6,Komal,22,MP,4500,102,2009-10-08 00:00:00,3,3000)
    (5,Hardik,27,Bhopal,8500,103,2008-05-20 00:00:00,4,2060)
    (5,Hardik,27,Bhopal,8500,101,2009-11-20 00:00:00,2,1560)
    (5,Hardik,27,Bhopal,8500,100,2009-10-08 00:00:00,3,1500)
    (5,Hardik,27,Bhopal,8500,102,2009-10-08 00:00:00,3,3000)
    (4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
    (4,Chaitali,25,Mumbai,6500,101,2009-11-20 00:00:00,2,1560)
    (4,Chaitali,25,Mumbai,6500,100,2009-10-08 00:00:00,3,1500)
    (4,Chaitali,25,Mumbai,6500,102,2009-10-08 00:00:00,3,3000)
    (3,kaushik,23,Kota,2000,103,2008-05-20 00:00:00,4,2060)
    (3,kaushik,23,Kota,2000,101,2009-11-20 00:00:00,2,1560)
    (3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
    (3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
    (2,Khilan,25,Delhi,1500,103,2008-05-20 00:00:00,4,2060)
    (2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
    (2,Khilan,25,Delhi,1500,100,2009-10-08 00:00:00,3,1500)
    (2,Khilan,25,Delhi,1500,102,2009-10-08 00:00:00,3,3000)
    (1,Ramesh,32,Ahmedabad,2000,103,2008-05-20 00:00:00,4,2060)
    (1,Ramesh,32,Ahmedabad,2000,101,2009-11-20 00:00:00,2,1560)
    (1,Ramesh,32,Ahmedabad,2000,100,2009-10-08 00:00:00,3,1500)
    (1,Ramesh,32,Ahmedabad,2000,102,2009-10-08 00:00:00,3,3000)
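To make the cross-product concrete, the following Python sketch models what CROSS computes on small in-memory data. It is only an illustration of the semantics, not how Pig executes the operator; the helper name cross and the list-of-tuples representation are assumptions made for this example, and the row order differs from the order Pig happens to print.

```python
from itertools import product

# Hypothetical in-memory copies of the customers and orders relations.
customers = [
    (1, "Ramesh", 32, "Ahmedabad", 2000),
    (2, "Khilan", 25, "Delhi", 1500),
    (3, "kaushik", 23, "Kota", 2000),
    (4, "Chaitali", 25, "Mumbai", 6500),
    (5, "Hardik", 27, "Bhopal", 8500),
    (6, "Komal", 22, "MP", 4500),
    (7, "Muffy", 24, "Indore", 10000),
]
orders = [
    (102, "2009-10-08 00:00:00", 3, 3000),
    (100, "2009-10-08 00:00:00", 3, 1500),
    (101, "2009-11-20 00:00:00", 2, 1560),
    (103, "2008-05-20 00:00:00", 4, 2060),
]


def cross(*relations):
    """Pair every tuple of each relation with every tuple of the others,
    concatenating the fields of each combination into one flat tuple."""
    return [sum(combo, ()) for combo in product(*relations)]


cross_data = cross(customers, orders)
print(len(cross_data))  # number of customers times number of orders
for row in cross_data[:4]:
    print(row)
```

Because the result size is the product of the input sizes, a cross-product of two large relations grows very quickly, which is worth keeping in mind before using CROSS on real datasets.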

Apache Pig – Describe Operator

The DESCRIBE operator is used to view the schema of a relation.

Syntax

The syntax of the DESCRIBE operator is as follows:

    grunt> Describe Relation_name;

Example

Assume we have a file student_data.txt in HDFS with the following content.

    001,Rajiv,Reddy,9848022337,Hyderabad
    002,siddarth,Battacharya,9848022338,Kolkata
    003,Rajesh,Khanna,9848022339,Delhi
    004,Preethi,Agarwal,9848022330,Pune
    005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
    006,Archana,Mishra,9848022335,Chennai

We have read it into a relation named student using the LOAD operator as shown below.

    grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' USING PigStorage(',')
       as ( id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray );

Now, let us describe the relation named student and verify the schema as shown below.

    grunt> describe student;

Output

Once you execute the above Pig Latin statement, it will produce the following output.

    student: { id: int,firstname: chararray,lastname: chararray,phone: chararray,city: chararray }

Apache Pig – Distinct Operator

The DISTINCT operator is used to remove redundant (duplicate) tuples from a relation.

Syntax

Given below is the syntax of the DISTINCT operator.

    grunt> Relation_name2 = DISTINCT Relation_name1;

Example

Assume that we have a file named student_details.txt in the HDFS directory /pig_data/ as shown below.

student_details.txt

    001,Rajiv,Reddy,9848022337,Hyderabad
    002,siddarth,Battacharya,9848022338,Kolkata
    002,siddarth,Battacharya,9848022338,Kolkata
    003,Rajesh,Khanna,9848022339,Delhi
    003,Rajesh,Khanna,9848022339,Delhi
    004,Preethi,Agarwal,9848022330,Pune
    005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
    006,Archana,Mishra,9848022335,Chennai
    006,Archana,Mishra,9848022335,Chennai

We have loaded this file into Pig with the relation name student_details as shown below.

    grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
       as (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);

Let us now remove the duplicate tuples from the relation student_details using the DISTINCT operator, and store the result as another relation named distinct_data, as shown below.

    grunt> distinct_data = DISTINCT student_details;

Verification

Verify the relation distinct_data using the DUMP operator as shown below.

    grunt> Dump distinct_data;

Output

It will produce the following output, displaying the contents of the relation distinct_data.

    (1,Rajiv,Reddy,9848022337,Hyderabad)
    (2,siddarth,Battacharya,9848022338,Kolkata)
    (3,Rajesh,Khanna,9848022339,Delhi)
    (4,Preethi,Agarwal,9848022330,Pune)
    (5,Trupthi,Mohanthy,9848022336,Bhuwaneshwar)
    (6,Archana,Mishra,9848022335,Chennai)
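As a rough illustration of what DISTINCT does, the Python sketch below removes duplicate tuples from an in-memory list. It models the result only; the function name distinct is hypothetical, and it keeps the first-seen order for readability, whereas Pig makes no promise about the order of the output.

```python
# Hypothetical in-memory copy of student_details, including the duplicates.
student_details = [
    (1, "Rajiv", "Reddy", "9848022337", "Hyderabad"),
    (2, "siddarth", "Battacharya", "9848022338", "Kolkata"),
    (2, "siddarth", "Battacharya", "9848022338", "Kolkata"),
    (3, "Rajesh", "Khanna", "9848022339", "Delhi"),
    (3, "Rajesh", "Khanna", "9848022339", "Delhi"),
    (4, "Preethi", "Agarwal", "9848022330", "Pune"),
    (5, "Trupthi", "Mohanthy", "9848022336", "Bhuwaneshwar"),
    (6, "Archana", "Mishra", "9848022335", "Chennai"),
    (6, "Archana", "Mishra", "9848022335", "Chennai"),
]


def distinct(relation):
    """Return the tuples of the relation with exact duplicates removed.

    Two tuples count as duplicates only if every field matches.
    """
    seen = set()
    result = []
    for row in relation:
        if row not in seen:
            seen.add(row)
            result.append(row)
    return result


distinct_data = distinct(student_details)
for row in distinct_data:
    print(row)
```

Note that the comparison is on whole tuples: two students with the same id but a different city would both be kept.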

Apache Pig – Reading Data

In general, Apache Pig works on top of Hadoop. It is an analytical tool that analyzes large datasets stored in the Hadoop Distributed File System (HDFS). To analyze data using Apache Pig, we first have to load the data into Apache Pig. This chapter explains how to load data into Apache Pig from HDFS.

Preparing HDFS

In MapReduce mode, Pig reads (loads) data from HDFS and stores the results back in HDFS. Therefore, let us start HDFS and create the following sample data in HDFS.

    Student ID   First Name   Last Name     Phone        City
    001          Rajiv        Reddy         9848022337   Hyderabad
    002          siddarth     Battacharya   9848022338   Kolkata
    003          Rajesh       Khanna        9848022339   Delhi
    004          Preethi      Agarwal       9848022330   Pune
    005          Trupthi      Mohanthy      9848022336   Bhuwaneshwar
    006          Archana      Mishra        9848022335   Chennai

The above dataset contains the personal details (id, first name, last name, phone number, and city) of six students.

Step 1: Verifying Hadoop

First of all, verify the installation using the Hadoop version command, as shown below.

    $ hadoop version

If your system contains Hadoop, and if you have set the PATH variable, then you will get the following output:

    Hadoop 2.6.0
    Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r e3496499ecb8d220fba99dc5ed4c99c8f9e33bb1
    Compiled by jenkins on 2014-11-13T21:10Z
    Compiled with protoc 2.5.0
    From source with checksum 18e43357c8f927c0695f1e9522859d6a
    This command was run using /home/Hadoop/hadoop/share/hadoop/common/hadoop-common-2.6.0.jar

Step 2: Starting HDFS

Browse to the sbin directory of Hadoop and start Hadoop DFS (distributed file system) and YARN as shown below.
    $ cd $HADOOP_HOME/sbin/

    $ start-dfs.sh
    localhost: starting namenode, logging to /home/Hadoop/hadoop/logs/hadoop-Hadoop-namenode-localhost.localdomain.out
    localhost: starting datanode, logging to /home/Hadoop/hadoop/logs/hadoop-Hadoop-datanode-localhost.localdomain.out
    Starting secondary namenodes [0.0.0.0]
    starting secondarynamenode, logging to /home/Hadoop/hadoop/logs/hadoop-Hadoop-secondarynamenode-localhost.localdomain.out

    $ start-yarn.sh
    starting yarn daemons
    starting resourcemanager, logging to /home/Hadoop/hadoop/logs/yarn-Hadoop-resourcemanager-localhost.localdomain.out
    localhost: starting nodemanager, logging to /home/Hadoop/hadoop/logs/yarn-Hadoop-nodemanager-localhost.localdomain.out

Step 3: Create a Directory in HDFS

In Hadoop DFS, you can create directories using the mkdir command. Create a new directory in HDFS with the name pig_data in the required path as shown below.

    $ cd $HADOOP_HOME/bin/
    $ hdfs dfs -mkdir hdfs://localhost:9000/pig_data

Step 4: Placing the data in HDFS

The input file of Pig contains one tuple (record) per line, and the fields of each record are separated by a delimiter (in our example, a comma). In the local file system, create an input file student_data.txt containing the data shown below.

    001,Rajiv,Reddy,9848022337,Hyderabad
    002,siddarth,Battacharya,9848022338,Kolkata
    003,Rajesh,Khanna,9848022339,Delhi
    004,Preethi,Agarwal,9848022330,Pune
    005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
    006,Archana,Mishra,9848022335,Chennai

Now, copy the file from the local file system to HDFS using the put command as shown below. (You can use the copyFromLocal command as well.)

    $ cd $HADOOP_HOME/bin
    $ hdfs dfs -put /home/Hadoop/Pig/Pig_Data/student_data.txt hdfs://localhost:9000/pig_data/

Verifying the file

You can use the cat command to verify whether the file has been copied into HDFS, as shown below.

    $ cd $HADOOP_HOME/bin
    $ hdfs dfs -cat hdfs://localhost:9000/pig_data/student_data.txt

Output

You can see the content of the file as shown below.
    15/10/01 12:16:55 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

    001,Rajiv,Reddy,9848022337,Hyderabad
    002,siddarth,Battacharya,9848022338,Kolkata
    003,Rajesh,Khanna,9848022339,Delhi
    004,Preethi,Agarwal,9848022330,Pune
    005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
    006,Archana,Mishra,9848022335,Chennai

The Load Operator

You can load data into Apache Pig from the file system (HDFS or local) using the LOAD operator of Pig Latin.

Syntax

The load statement consists of two parts divided by the "=" operator. On the left-hand side, we mention the name of the relation in which we want to store the data, and on the right-hand side, we define how the data is to be read. Given below is the syntax of the LOAD operator.

    Relation_name = LOAD 'Input file path' USING function as schema;

Where,

    relation_name: the relation in which we want to store the data.
    Input file path: the HDFS directory where the file is stored (in MapReduce mode).
    function: a function chosen from the set of load functions provided by Apache Pig (BinStorage, JsonLoader, PigStorage, TextLoader).
    schema: the schema of the data. We can define the required schema as follows:

        (column1 : data type, column2 : data type, column3 : data type);

Note: We can also load the data without specifying the schema. In that case, the columns are addressed by position as $0, $1, and so on.

Example

As an example, let us load the data in student_data.txt into Pig as the relation named student using the LOAD command.

Start the Pig Grunt Shell

First of all, open the Linux terminal. Start the Pig Grunt shell in MapReduce mode as shown below.

    $ pig -x mapreduce

It will start the Pig Grunt shell as shown below.
    15/10/01 12:33:37 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
    15/10/01 12:33:37 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
    15/10/01 12:33:37 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType

    2015-10-01 12:33:38,080 [main] INFO org.apache.pig.Main - Apache Pig version 0.15.0 (r1682971) compiled Jun 01 2015, 11:44:35
    2015-10-01 12:33:38,080 [main] INFO org.apache.pig.Main - Logging error messages to: /home/Hadoop/pig_1443683018078.log
    2015-10-01 12:33:38,242 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /home/Hadoop/.pigbootup not found
    2015-10-01 12:33:39,630 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost:9000

    grunt>

Execute the Load Statement

Now load the data from the file student_data.txt into Pig by executing the following Pig Latin statement in the Grunt shell.

    grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' USING PigStorage(',')
       as ( id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray );

Following is the description of the above statement.

    Relation name: We have stored the data in the relation student.
    Input file path: We are reading data from the file student_data.txt, which is in the /pig_data/ directory of HDFS.
    Storage function: We have used the PigStorage() function. It loads and stores data as structured text files. It takes as a parameter the delimiter that separates the fields of a tuple. By default, the delimiter is a tab ('\t').
    Schema: We have stored the data using the following schema.

        column     id    firstname   lastname    phone       city
        datatype   int   chararray   chararray   chararray   chararray

Note: The load statement simply loads the data into the specified relation in Pig. To verify the execution of the load statement, you have to use the Diagnostic Operators.
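The effect of a delimiter plus a typed schema can be pictured with a small Python sketch. This is an illustrative model under stated assumptions, not Pig's loader: the function name load and the SCHEMA list are hypothetical, the data is read from a string instead of HDFS, and only the int and chararray types used in this chapter are covered.

```python
# Hypothetical schema mirroring the "as (...)" clause of the LOAD statement:
# each entry is (field name, Python type standing in for the Pig type).
SCHEMA = [
    ("id", int),           # id:int
    ("firstname", str),    # firstname:chararray
    ("lastname", str),     # lastname:chararray
    ("phone", str),        # phone:chararray
    ("city", str),         # city:chararray
]

RAW_DATA = """001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai"""


def load(text, delimiter=",", schema=SCHEMA):
    """Split each line on the delimiter and cast every field to its schema type."""
    relation = []
    for line in text.splitlines():
        fields = line.split(delimiter)
        relation.append(
            tuple(cast(value) for (_name, cast), value in zip(schema, fields))
        )
    return relation


student = load(RAW_DATA)
for row in student:
    print(row)
```

The cast to int is why the id 001 in the file appears as 1 once the data is in the relation, while the phone number, declared as chararray, stays a string.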

Apache Pig – Group Operator

The GROUP operator is used to group the data in one or more relations. It collects the data having the same key.

Syntax

Given below is the syntax of the GROUP operator.

    grunt> Group_data = GROUP Relation_name BY age;

Example

Assume that we have a file named student_details.txt in the HDFS directory /pig_data/ as shown below.

student_details.txt

    001,Rajiv,Reddy,21,9848022337,Hyderabad
    002,siddarth,Battacharya,22,9848022338,Kolkata
    003,Rajesh,Khanna,22,9848022339,Delhi
    004,Preethi,Agarwal,21,9848022330,Pune
    005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
    006,Archana,Mishra,23,9848022335,Chennai
    007,Komal,Nayak,24,9848022334,trivendram
    008,Bharathi,Nambiayar,24,9848022333,Chennai

We have loaded this file into Apache Pig with the relation name student_details as shown below.

    grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
       as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);

Now, let us group the tuples in the relation by age as shown below.

    grunt> group_data = GROUP student_details by age;

Verification

Verify the relation group_data using the DUMP operator as shown below.

    grunt> Dump group_data;

Output

You will get output displaying the contents of the relation named group_data as shown below. Observe that the resulting schema has two columns:

    One is the age by which we have grouped the relation.
    The other is a bag, which contains the group of tuples (student records) with the respective age.
    (21,{(4,Preethi,Agarwal,21,9848022330,Pune),(1,Rajiv,Reddy,21,9848022337,Hyderabad)})
    (22,{(3,Rajesh,Khanna,22,9848022339,Delhi),(2,siddarth,Battacharya,22,9848022338,Kolkata)})
    (23,{(6,Archana,Mishra,23,9848022335,Chennai),(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar)})
    (24,{(8,Bharathi,Nambiayar,24,9848022333,Chennai),(7,Komal,Nayak,24,9848022334,trivendram)})

You can see the schema of the relation after grouping the data using the describe command as shown below.

    grunt> Describe group_data;

    group_data: {group: int,student_details: {(id: int,firstname: chararray,
                 lastname: chararray,age: int,phone: chararray,city: chararray)}}

In the same way, you can get a sample illustration of the schema using the illustrate command as shown below.

    grunt> Illustrate group_data;

It will produce the following output:

    -------------------------------------------------------------------------------------------------
    | group_data | group:int | student_details:bag{:tuple(id:int,firstname:chararray,lastname:chararray,age:int,phone:chararray,city:chararray)} |
    -------------------------------------------------------------------------------------------------
    |            | 21        | {(4, Preethi, Agarwal, 21, 9848022330, Pune), (1, Rajiv, Reddy, 21, 9848022337, Hyderabad)} |
    |            | 22        | {(2, siddarth, Battacharya, 22, 9848022338, Kolkata), (3, Rajesh, Khanna, 22, 9848022339, Delhi)} |
    -------------------------------------------------------------------------------------------------

Grouping by Multiple Columns

Let us group the relation by age and city as shown below.

    grunt> group_multiple = GROUP student_details by (age, city);

You can verify the content of the relation named group_multiple using the Dump operator as shown below.
    grunt> Dump group_multiple;

    ((21,Pune),{(4,Preethi,Agarwal,21,9848022330,Pune)})
    ((21,Hyderabad),{(1,Rajiv,Reddy,21,9848022337,Hyderabad)})
    ((22,Delhi),{(3,Rajesh,Khanna,22,9848022339,Delhi)})
    ((22,Kolkata),{(2,siddarth,Battacharya,22,9848022338,Kolkata)})
    ((23,Chennai),{(6,Archana,Mishra,23,9848022335,Chennai)})
    ((23,Bhuwaneshwar),{(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar)})
    ((24,Chennai),{(8,Bharathi,Nambiayar,24,9848022333,Chennai)})
    ((24,trivendram),{(7,Komal,Nayak,24,9848022334,trivendram)})

Group All

You can place all the tuples of a relation into a single group by using the ALL keyword, as shown below.

    grunt> group_all = GROUP student_details All;

Now, verify the content of the relation group_all as shown below.

    grunt> Dump group_all;

    (all,{(8,Bharathi,Nambiayar,24,9848022333,Chennai),(7,Komal,Nayak,24,9848022334,trivendram),
    (6,Archana,Mishra,23,9848022335,Chennai),(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar),
    (4,Preethi,Agarwal,21,9848022330,Pune),(3,Rajesh,Khanna,22,9848022339,Delhi),
    (2,siddarth,Battacharya,22,9848022338,Kolkata),(1,Rajiv,Reddy,21,9848022337,Hyderabad)})
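The three grouping variants in this chapter can be sketched in Python with one helper that takes a key function. This models the shape of the result only (a key paired with a bag of full tuples); the helper name group_by and the use of a dict of lists are assumptions for illustration, not Pig internals.

```python
from collections import defaultdict

# Hypothetical in-memory copy of student_details:
# (id, firstname, lastname, age, phone, city)
student_details = [
    (1, "Rajiv", "Reddy", 21, "9848022337", "Hyderabad"),
    (2, "siddarth", "Battacharya", 22, "9848022338", "Kolkata"),
    (3, "Rajesh", "Khanna", 22, "9848022339", "Delhi"),
    (4, "Preethi", "Agarwal", 21, "9848022330", "Pune"),
    (5, "Trupthi", "Mohanthy", 23, "9848022336", "Bhuwaneshwar"),
    (6, "Archana", "Mishra", 23, "9848022335", "Chennai"),
    (7, "Komal", "Nayak", 24, "9848022334", "trivendram"),
    (8, "Bharathi", "Nambiayar", 24, "9848022333", "Chennai"),
]


def group_by(relation, key):
    """Collect the tuples that share the same key into one bag (a list) per key."""
    groups = defaultdict(list)
    for row in relation:
        groups[key(row)].append(row)
    return dict(groups)


# GROUP student_details by age
group_data = group_by(student_details, key=lambda row: row[3])

# GROUP student_details by (age, city): the key is itself a tuple
group_multiple = group_by(student_details, key=lambda row: (row[3], row[5]))

# GROUP student_details All: every tuple lands in the single group "all"
group_all = group_by(student_details, key=lambda row: "all")

for age, bag in sorted(group_data.items()):
    print(age, bag)
```

In each case the grouped tuples are kept whole inside the bag, which is why the age still appears inside every student record as well as in the group key.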

Apache Solr – Updating Data

Updating the Document Using XML

Following is the XML file used to update fields in an existing document. Save this in a file with the name update.xml.

    <add>
       <doc>
          <field name = "id">001</field>
          <field name = "first name" update = "set">Raj</field>
          <field name = "last name" update = "add">Malhotra</field>
          <field name = "phone" update = "add">9000000000</field>
          <field name = "city" update = "add">Delhi</field>
       </doc>
    </add>

As you can observe, the XML file written to update data is just like the one we use to add documents. The only difference is that we use the update attribute of the field.

In our example, we will use the above document to update the fields of the document with the id 001.

Suppose the XML document exists in the bin directory of Solr. Since we are updating the index which exists in the core named my_core, you can update using the post tool as follows:

    [Hadoop@localhost bin]$ ./post -c my_core update.xml

On executing the above command, you will get the following output.

    /home/Hadoop/java/bin/java -classpath /home/Hadoop/Solr/dist/solr-core-6.2.0.jar
    -Dauto=yes -Dc=my_core -Ddata=files org.apache.solr.util.SimplePostTool update.xml
    SimplePostTool version 5.0.0
    Posting files to [base] url http://localhost:8983/solr/my_core/update...
    Entering auto mode. File endings considered are
    xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
    POSTing file update.xml (application/xml) to [base]
    1 files indexed.
    COMMITting Solr index changes to http://localhost:8983/solr/my_core/update...
    Time spent: 0:00:00.159

Verification

Visit the homepage of the Apache Solr web interface and select the core my_core. Retrieve all the documents by passing the query "*:*" in the text area q and executing the query. On executing, you can observe that the document is updated.
Updating the Document Using Java (Client API)

Following is a Java program that updates a document in the Apache Solr index. It adds a complete document with the id 002; because Solr overwrites an existing document that has the same unique key, this replaces any document already stored under that id. Save this code in a file with the name UpdatingDocument.java.

    import java.io.IOException;

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.request.UpdateRequest;
    import org.apache.solr.client.solrj.response.UpdateResponse;
    import org.apache.solr.common.SolrInputDocument;

    public class UpdatingDocument {
       public static void main(String[] args) throws SolrServerException, IOException {
          // Preparing the Solr client
          String urlString = "http://localhost:8983/solr/my_core";
          SolrClient solr = new HttpSolrClient.Builder(urlString).build();

          // Preparing an update request that commits as part of the request
          UpdateRequest updateRequest = new UpdateRequest();
          updateRequest.setAction(UpdateRequest.ACTION.COMMIT, false, false);

          // Preparing the Solr document
          SolrInputDocument myDocumentInstantlycommited = new SolrInputDocument();
          myDocumentInstantlycommited.addField("id", "002");
          myDocumentInstantlycommited.addField("name", "Rahman");
          myDocumentInstantlycommited.addField("age", "27");
          myDocumentInstantlycommited.addField("addr", "hyderabad");

          // Sending the document to Solr
          updateRequest.add(myDocumentInstantlycommited);
          UpdateResponse rsp = updateRequest.process(solr);
          System.out.println("Documents Updated");
       }
    }

Compile and run the above code by executing the following commands in the terminal:

    [Hadoop@localhost bin]$ javac UpdatingDocument.java
    [Hadoop@localhost bin]$ java UpdatingDocument

On executing the above commands, you will get the following output.

    Documents Updated
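The update attribute used in update.xml has a JSON counterpart in Solr's atomic-update syntax, where each changed field maps a modifier such as set or add to the new value. The Python sketch below only builds that payload and prints it; the helper name atomic_update is hypothetical, the field names are taken from the XML example, and nothing is sent to a Solr server.

```python
import json


def atomic_update(doc_id, changes):
    """Build a one-document atomic-update payload.

    `changes` maps a field name to a (modifier, value) pair, for example
    ("set", "Raj") to replace a value or ("add", "Delhi") to append one.
    """
    doc = {"id": doc_id}
    for field, (modifier, value) in changes.items():
        doc[field] = {modifier: value}
    return [doc]


payload = atomic_update(
    "001",
    {
        "first name": ("set", "Raj"),
        "last name": ("add", "Malhotra"),
        "phone": ("add", "9000000000"),
        "city": ("add", "Delhi"),
    },
)
print(json.dumps(payload, indent=2))
```

The id field carries no modifier because it identifies which document to change rather than being changed itself.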

Apache Pig – Architecture

The language used to analyze data in Hadoop using Pig is known as Pig Latin. It is a high-level data processing language which provides a rich set of data types and operators to perform various operations on the data.

To perform a particular task using Pig, programmers need to write a Pig script using the Pig Latin language and execute it using any of the execution mechanisms (Grunt Shell, UDFs, Embedded). After execution, these scripts go through a series of transformations applied by the Pig framework to produce the desired output.

Internally, Apache Pig converts these scripts into a series of MapReduce jobs, and thus it makes the programmer's job easy. The architecture of Apache Pig is shown below.

[Figure: Apache Pig architecture]

Apache Pig Components

As shown in the figure, there are various components in the Apache Pig framework. Let us take a look at the major components.

Parser

Initially, the Pig scripts are handled by the Parser. It checks the syntax of the script, does type checking, and performs other miscellaneous checks. The output of the parser is a DAG (directed acyclic graph), which represents the Pig Latin statements and logical operators. In the DAG, the logical operators of the script are represented as nodes and the data flows are represented as edges.

Optimizer

The logical plan (DAG) is passed to the logical optimizer, which carries out logical optimizations such as projection and pushdown.

Compiler

The compiler compiles the optimized logical plan into a series of MapReduce jobs.

Execution engine

Finally, the MapReduce jobs are submitted to Hadoop in a sorted order and executed there, producing the desired results.

Pig Latin Data Model

The data model of Pig Latin is fully nested, and it allows complex non-atomic data types such as map and tuple. Given below is the diagrammatical representation of Pig Latin's data model.

[Figure: Pig Latin data model]
Atom

Any single value in Pig Latin, irrespective of its data type, is known as an Atom. It is stored as a string and can be used as a string or a number. int, long, float, double, chararray, and bytearray are the atomic types of Pig. A piece of data or a simple atomic value is known as a field.

Example: 'raja' or '30'

Tuple

A record that is formed by an ordered set of fields is known as a tuple; the fields can be of any type. A tuple is similar to a row in a table of an RDBMS.

Example: (Raja, 30)

Bag

A bag is an unordered set of tuples. In other words, a collection of tuples (not necessarily unique) is known as a bag. Each tuple can have any number of fields (flexible schema). A bag is represented by '{}'. It is similar to a table in an RDBMS, but unlike a table in an RDBMS, it is not necessary that every tuple contain the same number of fields or that the fields in the same position (column) have the same type.

Example: {(Raja, 30), (Mohammad, 45)}

A bag can be a field in a relation; in that context, it is known as an inner bag.

Example: {Raja, 30, {9848022338, [email protected],}}

Map

A map (or data map) is a set of key-value pairs. The key needs to be of type chararray and should be unique. The value can be of any type. A map is represented by '[]'.

Example: [name#Raja, age#30]

Relation

A relation is a bag of tuples. The relations in Pig Latin are unordered (there is no guarantee that tuples are processed in any particular order).
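One way to get a feel for the nesting is to map each Pig type onto a familiar Python type and print the value back in Pig's notation. The mapping below (tuple for a tuple, list for a bag, dict for a map) and the function name to_pig are assumptions chosen for this sketch; the e-mail address is a made-up placeholder.

```python
def to_pig(value):
    """Render a nested Python value in Pig-style notation.

    tuple -> (a,b)    list (standing in for a bag) -> {t1,t2}
    dict (standing in for a map) -> [key#value]    anything else -> an atom
    """
    if isinstance(value, tuple):
        return "(" + ",".join(to_pig(v) for v in value) + ")"
    if isinstance(value, list):
        return "{" + ",".join(to_pig(v) for v in value) + "}"
    if isinstance(value, dict):
        return "[" + ",".join(f"{k}#{to_pig(v)}" for k, v in value.items()) + "]"
    return str(value)


atom = 30
record = ("Raja", 30)                          # tuple: ordered fields
people = [("Raja", 30), ("Mohammad", 45)]      # bag: collection of tuples
details = {"name": "Raja", "age": 30}          # map: chararray keys

# A tuple whose third field is an inner bag (placeholder contact values).
with_inner_bag = ("Raja", 30, [(9848022338,), ("raja@example.com",)])

for value in (atom, record, people, details, with_inner_bag):
    print(to_pig(value))
```

A relation is simply the outermost bag, so the people list above plays the role of a two-tuple relation.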

Apache Pig – Execution

In the previous chapter, we explained how to install Apache Pig. In this chapter, we will discuss how to execute Apache Pig.

Apache Pig Execution Modes

You can run Apache Pig in two modes, namely, Local Mode and MapReduce Mode.

Local Mode

In this mode, all the files are installed and run from your local host and local file system. There is no need for Hadoop or HDFS. This mode is generally used for testing purposes.

MapReduce Mode

MapReduce mode is where we load or process the data that exists in the Hadoop Distributed File System (HDFS) using Apache Pig. In this mode, whenever we execute Pig Latin statements to process the data, a MapReduce job is invoked in the back end to perform the operation on the data that exists in HDFS.

Apache Pig Execution Mechanisms

Apache Pig scripts can be executed in three ways, namely, interactive mode, batch mode, and embedded mode.

    Interactive Mode (Grunt shell): You can run Apache Pig in interactive mode using the Grunt shell. In this shell, you can enter Pig Latin statements and get the output (using the Dump operator).
    Batch Mode (Script): You can run Apache Pig in batch mode by writing the Pig Latin script in a single file with the .pig extension.
    Embedded Mode (UDF): Apache Pig lets us define our own functions (User Defined Functions) in programming languages such as Java and use them in our script.

Invoking the Grunt Shell

You can invoke the Grunt shell in the desired mode (local or MapReduce) using the -x option as shown below.

    Local mode:       $ ./pig -x local
    MapReduce mode:   $ ./pig -x mapreduce

Either of these commands gives you the Grunt shell prompt as shown below.

    grunt>

You can exit the Grunt shell using Ctrl + D.

After invoking the Grunt shell, you can execute a Pig script by directly entering the Pig Latin statements in it.
    grunt> customers = LOAD 'customers.txt' USING PigStorage(',');

Executing Apache Pig in Batch Mode

You can write an entire Pig Latin script in a file and execute it by passing the file to the pig command, using the -x option to select the mode. Let us suppose we have a Pig script in a file named sample_script.pig as shown below.

sample_script.pig

    student = LOAD 'hdfs://localhost:9000/pig_data/student.txt' USING PigStorage(',')
       as (id:int,name:chararray,city:chararray);

    Dump student;

Now, you can execute the script in the above file as shown below.

    Local mode:       $ pig -x local sample_script.pig
    MapReduce mode:   $ pig -x mapreduce sample_script.pig

Note: We will discuss in detail how to run a Pig script in batch mode and in embedded mode in subsequent chapters.

Apache Pig – Grunt Shell

After invoking the Grunt shell, you can run your Pig scripts in the shell. In addition to that, the Grunt shell provides certain useful shell and utility commands. This chapter explains those commands.

Note: In some portions of this chapter, commands like Load and Store are used. Refer to the respective chapters for detailed information on them.

Shell Commands

The Grunt shell of Apache Pig is mainly used to write Pig Latin scripts. In addition, we can invoke shell commands from it using sh and fs.

sh Command

Using the sh command, we can invoke any shell command from the Grunt shell. However, we cannot execute commands that are a part of the shell environment (for example, cd).

Syntax

Given below is the syntax of the sh command.

    grunt> sh shell command parameters

Example

We can invoke the ls command of the Linux shell from the Grunt shell using the sh command as shown below. In this example, it lists the files in the /pig/bin/ directory.

    grunt> sh ls

    pig
    pig_1444799121955.log
    pig.cmd
    pig.py

fs Command

Using the fs command, we can invoke any FsShell command from the Grunt shell.

Syntax

Given below is the syntax of the fs command.

    grunt> fs File System command parameters

Example

We can invoke the ls command of HDFS from the Grunt shell using the fs command. In the following example, it lists the files in the HDFS root directory.

    grunt> fs -ls

    Found 3 items
    drwxrwxrwx   - Hadoop supergroup          0 2015-09-08 14:13 Hbase
    drwxr-xr-x   - Hadoop supergroup          0 2015-09-09 14:52 seqgen_data
    drwxr-xr-x   - Hadoop supergroup          0 2015-09-08 11:30 twitter_data

In the same way, we can invoke all the other file system shell commands from the Grunt shell using the fs command.

Utility Commands

The Grunt shell provides a set of utility commands.
These include utility commands such as clear, help, history, quit, and set, and commands such as exec, kill, and run to control Pig from the Grunt shell. Given below is the description of the utility commands provided by the Grunt shell.

clear Command

The clear command is used to clear the screen of the Grunt shell.

Syntax

You can clear the screen of the Grunt shell using the clear command as shown below.

    grunt> clear

help Command

The help command gives you a list of Pig commands or Pig properties.

Usage

You can get a list of Pig commands using the help command as shown below.

    grunt> help

    Commands:
    <pig latin statement>; - See the PigLatin manual for details: http://hadoop.apache.org/pig

    File system commands:
        fs <fs arguments> - Equivalent to Hadoop dfs command: http://hadoop.apache.org/common/docs/current/hdfs_shell.html

    Diagnostic Commands:
        describe <alias>[::<alias] - Show the schema for the alias. Inner aliases can be described as A::B.
        explain [-script <pigscript>] [-out <path>] [-brief] [-dot|-xml]
            [-param <param_name>=<param_value>] [-param_file <file_name>] [<alias>] -
            Show the execution plan to compute the alias or for entire script.
            -script - Explain the entire script.
            -out - Store the output into directory rather than print to stdout.
            -brief - Don't expand nested plans (presenting a smaller graph for overview).
            -dot - Generate the output in .dot format. Default is text format.
            -xml - Generate the output in .xml format. Default is text format.
            -param <param_name - See parameter substitution for details.
            -param_file <file_name> - See parameter substitution for details.
            alias - Alias to explain.
        dump <alias> - Compute the alias and writes the results to stdout.

    Utility Commands:
        exec [-param <param_name>=param_value] [-param_file <file_name>] <script> -
            Execute the script with access to grunt environment including aliases.
            -param <param_name - See parameter substitution for details.
            -param_file <file_name> - See parameter substitution for details.
            script - Script to be executed.
        run [-param <param_name>=param_value] [-param_file <file_name>] <script> -
            Execute the script with access to grunt environment.
            -param <param_name - See parameter substitution for details.
            -param_file <file_name> - See parameter substitution for details.
            script - Script to be executed.
        sh <shell command> - Invoke a shell command.
        kill <job_id> - Kill the hadoop job specified by the hadoop job id.
        set <key> <value> - Provide execution parameters to Pig. Keys and values are case sensitive.
            The following keys are supported:
            default_parallel - Script-level reduce parallelism. Basic input size heuristics used by default.
            debug - Set debug on or off. Default is off.
            job.name - Single-quoted name for jobs. Default is PigLatin:<script name>
            job.priority - Priority for jobs. Values: very_low, low, normal, high, very_high. Default is normal
            stream.skippath - String that contains the path. This is used by streaming
            any hadoop property.
        help - Display this message.
        history [-n] - Display the list statements in cache.
            -n Hide line numbers.
        quit - Quit the grunt shell.

history Command

This command displays a list of the statements executed so far since the Grunt shell was invoked.

Usage

Assume we have executed three statements since opening the Grunt shell.

    grunt> customers = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',');

    grunt> orders = LOAD 'hdfs://localhost:9000/pig_data/orders.txt' USING PigStorage(',');

    grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student.txt' USING PigStorage(',');

Then, using the history command will produce the following output.

    grunt> history

    customers = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',');

    orders = LOAD 'hdfs://localhost:9000/pig_data/orders.txt' USING PigStorage(',');

    student = LOAD 'hdfs://localhost:9000/pig_data/student.txt' USING PigStorage(',');

set Command

The set command is used to show or assign values to keys used in Pig.
Usage

Using this command, you can set values for the following keys.

Key – Description and values
default_parallel – Sets the number of reducers for a MapReduce job. Pass any whole number as the value.
debug – Turns the debugging feature in Pig on or off. Pass on or off as the value.
job.name – Sets the name of the job. Pass a string as the value.
job.priority – Sets the priority of the job. Pass one of the following values: very_low, low, normal, high, very_high.
stream.skippath – For streaming, you can set the path

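As a rough sketch of how the keys above might be used together, the following shell snippet writes a small Pig script containing a few set statements and runs it only if Pig is available. The file name set_example.pig and the chosen values (10, 'my job', 'high') are illustrative assumptions, not part of the original tutorial.

```shell
# Hypothetical file name; any writable path works.
script="${TMPDIR:-/tmp}/set_example.pig"

# Write a small Pig script that assigns some of the keys described above.
# The values are examples only.
cat > "$script" <<'EOF'
set default_parallel 10;
set debug 'on';
set job.name 'my job';
set job.priority 'high';
EOF

# Run the script only if Pig is on the PATH (assumption: local mode is fine here).
if command -v pig >/dev/null 2>&1; then
    pig -x local "$script"
else
    echo "pig not found; script written to $script"
fi
```

The same set statements can also be typed one at a time at the grunt> prompt.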
Apache Pig – Installation

This chapter explains how to download, install, and set up Apache Pig on your system.

Prerequisites

You must have Hadoop and Java installed on your system before installing Apache Pig. Install them first by following the steps given in the following link − https://www.tutorialspoint.com/hadoop/hadoop_enviornment_setup.htm

Download Apache Pig

First of all, download the latest version of Apache Pig from the following website − https://pig.apache.org/

Step 1
Open the homepage of the Apache Pig website. Under the section News, click on the link release page.

Step 2
On clicking the specified link, you will be redirected to the Apache Pig Releases page. On this page, under the Download section, you will have two links, namely, Pig 0.8 and later and Pig 0.7 and before. Click on the link Pig 0.8 and later. You will then be redirected to a page listing a set of mirrors.

Step 3
Choose and click any one of these mirrors.

Step 4
These mirrors will take you to the Pig Releases page. This page contains various versions of Apache Pig. Click the latest version among them.

Step 5
Within these folders, you will find the source and binary files of Apache Pig in various distributions. Download the tar files of the source and binary files of Apache Pig 0.15, pig-0.15.0-src.tar.gz and pig-0.15.0.tar.gz.

Install Apache Pig

After downloading the Apache Pig software, install it in your Linux environment by following the steps given below.

Step 1
Create a directory with the name Pig in the same directory where Hadoop, Java, and other software were installed. (In our tutorial, we have created the Pig directory in the home directory of the user named Hadoop.)

$ mkdir Pig

Step 2
Extract the downloaded tar files as shown below.
$ cd Downloads/
$ tar zxvf pig-0.15.0-src.tar.gz
$ tar zxvf pig-0.15.0.tar.gz

Step 3
Move the contents of the extracted pig-0.15.0-src directory to the Pig directory created earlier as shown below.

$ mv pig-0.15.0-src/* /home/Hadoop/Pig/

Configure Apache Pig

After installing Apache Pig, we have to configure it. To configure it, we need to edit two files − .bashrc and pig.properties.

.bashrc file

In the .bashrc file, set the following variables −
PIG_HOME to the Apache Pig installation folder,
PATH to include the bin folder of the Pig installation, and
PIG_CLASSPATH to the configuration folder of your Hadoop installation (the directory that contains the core-site.xml, hdfs-site.xml and mapred-site.xml files).

export PIG_HOME=/home/Hadoop/Pig
export PATH=$PATH:/home/Hadoop/Pig/bin
export PIG_CLASSPATH=$HADOOP_HOME/conf

pig.properties file

In the conf folder of Pig, we have a file named pig.properties. In the pig.properties file, you can set various parameters. The following command lists the supported properties.

$ pig -h properties

The following properties are supported −

Logging:
    verbose=true|false; default is false. This property is the same as -v switch
    brief=true|false; default is false. This property is the same as -b switch
    debug=OFF|ERROR|WARN|INFO|DEBUG; default is INFO. This property is the same as -d switch
    aggregate.warning=true|false; default is true. If true, prints count of warnings of each type rather than logging each warning.

Performance tuning:
    pig.cachedbag.memusage=<mem fraction>; default is 0.2 (20% of all memory). Note that this memory is shared across all large bags used by the application.
    pig.skewedjoin.reduce.memusage=<mem fraction>; default is 0.3 (30% of all memory). Specifies the fraction of heap available for the reducer to perform the join.
    pig.exec.nocombiner=true|false; default is false. Only disable combiner as a temporary workaround for problems.
    opt.multiquery=true|false; multiquery is on by default.
    Only disable multiquery as a temporary workaround for problems.
    opt.fetch=true|false; fetch is on by default. Scripts containing Filter, Foreach, Limit, Stream, and Union can be dumped without MR jobs.
    pig.tmpfilecompression=true|false; compression is off by default. Determines whether output of intermediate jobs is compressed.
    pig.tmpfilecompression.codec=lzo|gzip; default is gzip. Used in conjunction with pig.tmpfilecompression. Defines compression type.
    pig.noSplitCombination=true|false. Split combination is on by default. Determines if multiple small files are combined into a single map.
    pig.exec.mapPartAgg=true|false. Default is false. Determines if partial aggregation is done within map phase, before records are sent to combiner.
    pig.exec.mapPartAgg.minReduction=<min aggregation factor>. Default is 10. If the in-map partial aggregation does not reduce the output num records by this factor, it gets disabled.

Miscellaneous:
    exectype=mapreduce|tez|local; default is mapreduce. This property is the same as -x switch
    pig.additional.jars.uris=<comma separated list of jars>. Used in place of register command.
    udf.import.list=<comma separated list of imports>. Used to avoid package names in UDF.
    stop.on.failure=true|false; default is false. Set to true to terminate on the first error.
    pig.datetime.default.tz=<UTC time offset>. e.g. +08:00. Default is the default timezone of the host. Determines the timezone used to handle datetime datatype and UDFs.

Additionally, any Hadoop property can be specified.

Verifying the Installation

Verify the installation of Apache Pig by typing the version command. If the installation is successful, you will get the version of Apache Pig as shown below.

$ pig -version
Apache Pig version 0.15.0 (r1682971) compiled Jun 01 2015, 11:44:35
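The configuration and verification steps can be combined into one shell sketch. It assumes the install location /home/Hadoop/Pig used in this tutorial and that HADOOP_HOME, if set, points to a Hadoop installation with a conf folder. Adjust both for your system. The guards around HADOOP_HOME and the pig command are additions for illustration, not part of the original steps.

```shell
# Assumed install location, taken from the tutorial's example paths.
export PIG_HOME="${PIG_HOME:-/home/Hadoop/Pig}"

# Make the pig launcher in $PIG_HOME/bin reachable from the shell.
export PATH="$PATH:$PIG_HOME/bin"

# Point Pig at the Hadoop configuration folder, but only if HADOOP_HOME is set.
if [ -n "${HADOOP_HOME:-}" ]; then
    export PIG_CLASSPATH="$HADOOP_HOME/conf"
else
    echo "HADOOP_HOME is not set; PIG_CLASSPATH left unchanged" >&2
fi

# Verify the installation: print the version if the launcher can be found.
if command -v pig >/dev/null 2>&1; then
    pig -version
else
    echo "pig is not on PATH; check PIG_HOME ($PIG_HOME)" >&2
fi
```

Placing the three export lines in .bashrc makes the settings persistent across shell sessions.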