Apache Pig – Union Operator

The UNION operator of Pig Latin is used to merge the contents of two relations. To perform a UNION operation on two relations, their columns and domains must be identical.

Syntax

Given below is the syntax of the UNION operator.

grunt> Relation_name3 = UNION Relation_name1, Relation_name2;

Example

Assume that we have two files namely student_data1.txt and student_data2.txt in the /pig_data/ directory of HDFS as shown below.

student_data1.txt

001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai

student_data2.txt

7,Komal,Nayak,9848022334,trivendram
8,Bharathi,Nambiayar,9848022333,Chennai

And we have loaded these two files into Pig with the relations student1 and student2 as shown below.

grunt> student1 = LOAD 'hdfs://localhost:9000/pig_data/student_data1.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);

grunt> student2 = LOAD 'hdfs://localhost:9000/pig_data/student_data2.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);

Let us now merge the contents of these two relations using the UNION operator as shown below.

grunt> student = UNION student1, student2;

Verification

Verify the relation student using the DUMP operator as shown below.

grunt> Dump student;

Output

It will display the following output, displaying the contents of the relation student.

(1,Rajiv,Reddy,9848022337,Hyderabad)
(2,siddarth,Battacharya,9848022338,Kolkata)
(3,Rajesh,Khanna,9848022339,Delhi)
(4,Preethi,Agarwal,9848022330,Pune)
(5,Trupthi,Mohanthy,9848022336,Bhuwaneshwar)
(6,Archana,Mishra,9848022335,Chennai)
(7,Komal,Nayak,9848022334,trivendram)
(8,Bharathi,Nambiayar,9848022333,Chennai)
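Note that, unlike SQL's UNION, Pig's UNION does not eliminate duplicate tuples. If the two input relations overlap and you need a duplicate-free result, follow the UNION with a DISTINCT. A minimal sketch, reusing the relations built above (the relation names student_all and student_nodup are illustrative):

grunt> student_all = UNION student1, student2;
grunt> student_nodup = DISTINCT student_all;
grunt> Dump student_nodup;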
Apache Pig – Reading Data

In general, Apache Pig works on top of Hadoop. It is an analytical tool that analyzes large datasets that exist in the Hadoop File System. To analyze data using Apache Pig, we have to first load the data into Apache Pig. This chapter explains how to load data into Apache Pig from HDFS.

Preparing HDFS

In MapReduce mode, Pig reads (loads) data from HDFS and stores the results back in HDFS. Therefore, let us start HDFS and create the following sample data in HDFS.

Student ID | First Name | Last Name   | Phone      | City
001        | Rajiv      | Reddy       | 9848022337 | Hyderabad
002        | siddarth   | Battacharya | 9848022338 | Kolkata
003        | Rajesh     | Khanna      | 9848022339 | Delhi
004        | Preethi    | Agarwal     | 9848022330 | Pune
005        | Trupthi    | Mohanthy    | 9848022336 | Bhuwaneshwar
006        | Archana    | Mishra      | 9848022335 | Chennai

The above dataset contains personal details such as id, first name, last name, phone number, and city of six students.

Step 1: Verifying Hadoop

First of all, verify the installation using the Hadoop version command, as shown below.

$ hadoop version

If your system contains Hadoop, and if you have set the PATH variable, then you will get the following output −

Hadoop 2.6.0
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r e3496499ecb8d220fba99dc5ed4c99c8f9e33bb1
Compiled by jenkins on 2014-11-13T21:10Z
Compiled with protoc 2.5.0
From source with checksum 18e43357c8f927c0695f1e9522859d6a
This command was run using /home/Hadoop/hadoop/share/hadoop/common/hadoop-common-2.6.0.jar

Step 2: Starting HDFS

Browse to the sbin directory of Hadoop and start yarn and Hadoop dfs (distributed file system) as shown below.

$ cd $HADOOP_HOME/sbin
$ start-dfs.sh
localhost: starting namenode, logging to /home/Hadoop/hadoop/logs/hadoop-Hadoop-namenode-localhost.localdomain.out
localhost: starting datanode, logging to /home/Hadoop/hadoop/logs/hadoop-Hadoop-datanode-localhost.localdomain.out
Starting secondary namenodes [0.0.0.0]
starting secondarynamenode, logging to /home/Hadoop/hadoop/logs/hadoop-Hadoop-secondarynamenode-localhost.localdomain.out

$ start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /home/Hadoop/hadoop/logs/yarn-Hadoop-resourcemanager-localhost.localdomain.out
localhost: starting nodemanager, logging to /home/Hadoop/hadoop/logs/yarn-Hadoop-nodemanager-localhost.localdomain.out

Step 3: Create a Directory in HDFS

In Hadoop DFS, you can create directories using the command mkdir. Create a new directory in HDFS with the name pig_data in the required path as shown below.

$ cd $HADOOP_HOME/bin
$ hdfs dfs -mkdir hdfs://localhost:9000/pig_data

Step 4: Placing the data in HDFS

The input file of Pig contains each tuple/record on an individual line, and the entities of the record are separated by a delimiter (in our example, ","). In the local file system, create an input file student_data.txt containing the data shown below.

001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai

Now, move the file from the local file system to HDFS using the put command as shown below. (You can use the copyFromLocal command as well.)

$ cd $HADOOP_HOME/bin
$ hdfs dfs -put /home/Hadoop/Pig/Pig_Data/student_data.txt hdfs://localhost:9000/pig_data/
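After the put, you can also list the target directory to confirm that the file landed in HDFS. This is an optional extra check, not part of the original steps; a minimal sketch using the standard ls command:

$ hdfs dfs -ls hdfs://localhost:9000/pig_data/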
Verifying the file

You can use the cat command to verify whether the file has been moved into HDFS, as shown below.

$ cd $HADOOP_HOME/bin
$ hdfs dfs -cat hdfs://localhost:9000/pig_data/student_data.txt

Output

You can see the content of the file as shown below.

15/10/01 12:16:55 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai

The Load Operator

You can load data into Apache Pig from the file system (HDFS/local) using the LOAD operator of Pig Latin.

Syntax

The load statement consists of two parts divided by the "=" operator. On the left-hand side, we mention the name of the relation where we want to store the data, and on the right-hand side, we define how we load the data. Given below is the syntax of the Load operator.

Relation_name = LOAD 'Input file path' USING function as schema;

Where,

relation_name − The relation in which we want to store the data.
Input file path − The HDFS directory where the file is stored (in MapReduce mode).
function − A function chosen from the set of load functions provided by Apache Pig (BinStorage, JsonLoader, PigStorage, TextLoader).
schema − The schema of the data. We can define the required schema as follows −

(column1 : data type, column2 : data type, column3 : data type);

Note − If we load the data without specifying the schema, the columns will be addressed as $0, $1, and so on.

Example

As an example, let us load the data in student_data.txt into Pig under the schema named Student using the LOAD command.

Start the Pig Grunt Shell

First of all, open the Linux terminal. Start the Pig Grunt shell in MapReduce mode as shown below.

$ pig -x mapreduce

It will start the Pig Grunt shell as shown below.

15/10/01 12:33:37 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
15/10/01 12:33:37 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
15/10/01 12:33:37 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
2015-10-01 12:33:38,080 [main] INFO org.apache.pig.Main - Apache Pig version 0.15.0 (r1682971) compiled Jun 01 2015, 11:44:35
2015-10-01 12:33:38,080 [main] INFO org.apache.pig.Main - Logging error messages to: /home/Hadoop/pig_1443683018078.log
2015-10-01 12:33:38,242 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /home/Hadoop/.pigbootup not found
2015-10-01 12:33:39,630 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost:9000

grunt>

Execute the Load Statement

Now load the data from the file student_data.txt into Pig by executing the following Pig Latin statement in the Grunt shell.

grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' USING PigStorage(',')
   as ( id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray );

Following is the description of the above statement.

Relation name − We have stored the data in the relation student.
Input file path − We are reading data from the file student_data.txt, which is in the /pig_data/ directory of HDFS.
Storage function − We have used the PigStorage() function. It loads and stores data as structured text files. It takes as a parameter the delimiter that separates the entities of a tuple. By default, it uses '\t' (tab) as the delimiter.
Schema − We have stored the data using the following schema.

column:   id  | firstname | lastname  | phone     | city
datatype: int | chararray | chararray | chararray | chararray

Note − The load statement will simply load the data into the specified relation in Pig. To verify the execution of the Load statement, you have to use the Diagnostic Operators, which are discussed in the next chapters.
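If you skip the schema (see the note on the Load operator above), the fields are untyped and are referenced by position. A minimal sketch of this, reusing student_data.txt; the relation names student_ns and names are illustrative:

grunt> student_ns = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' USING PigStorage(',');
grunt> names = FOREACH student_ns GENERATE $1 AS firstname, $4 AS city;
grunt> Dump names;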
Apache Pig – Group Operator

The GROUP operator is used to group the data in one or more relations. It collects the data having the same key.

Syntax

Given below is the syntax of the group operator.

grunt> Group_data = GROUP Relation_name BY age;

Example

Assume that we have a file named student_details.txt in the HDFS directory /pig_data/ as shown below.

student_details.txt

001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai

And we have loaded this file into Apache Pig with the relation name student_details as shown below.

grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);

Now, let us group the records/tuples in the relation by age as shown below.

grunt> group_data = GROUP student_details by age;

Verification

Verify the relation group_data using the DUMP operator as shown below.

grunt> Dump group_data;

Output

Then you will get output displaying the contents of the relation named group_data as shown below. Here you can observe that the resulting schema has two columns −

One is age, by which we have grouped the relation.
The other is a bag, which contains the group of tuples − the student records with the respective age.

(21,{(4,Preethi,Agarwal,21,9848022330,Pune),(1,Rajiv,Reddy,21,9848022337,Hyderabad)})
(22,{(3,Rajesh,Khanna,22,9848022339,Delhi),(2,siddarth,Battacharya,22,9848022338,Kolkata)})
(23,{(6,Archana,Mishra,23,9848022335,Chennai),(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar)})
(24,{(8,Bharathi,Nambiayar,24,9848022333,Chennai),(7,Komal,Nayak,24,9848022334,trivendram)})

You can see the schema of the table after grouping the data using the describe command as shown below.

grunt> Describe group_data;
group_data: {group: int,student_details: {(id: int,firstname: chararray,lastname: chararray,age: int,phone: chararray,city: chararray)}}

In the same way, you can get a sample illustration of the schema using the illustrate command as shown below.

grunt> Illustrate group_data;

It will produce the following output −

-------------------------------------------------------------------------------------------------
|group_data| group:int | student_details:bag{:tuple(id:int,firstname:chararray,lastname:chararray,age:int,phone:chararray,city:chararray)}|
-------------------------------------------------------------------------------------------------
|          | 21        | {(4,Preethi,Agarwal,21,9848022330,Pune),(1,Rajiv,Reddy,21,9848022337,Hyderabad)}|
|          | 22        | {(2,siddarth,Battacharya,22,9848022338,Kolkata),(3,Rajesh,Khanna,22,9848022339,Delhi)}|
-------------------------------------------------------------------------------------------------

Grouping by Multiple Columns

Let us group the relation by age and city as shown below.

grunt> group_multiple = GROUP student_details by (age, city);

You can verify the content of the relation named group_multiple using the Dump operator as shown below.
grunt> Dump group_multiple;

((21,Pune),{(4,Preethi,Agarwal,21,9848022330,Pune)})
((21,Hyderabad),{(1,Rajiv,Reddy,21,9848022337,Hyderabad)})
((22,Delhi),{(3,Rajesh,Khanna,22,9848022339,Delhi)})
((22,Kolkata),{(2,siddarth,Battacharya,22,9848022338,Kolkata)})
((23,Chennai),{(6,Archana,Mishra,23,9848022335,Chennai)})
((23,Bhuwaneshwar),{(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar)})
((24,Chennai),{(8,Bharathi,Nambiayar,24,9848022333,Chennai)})
((24,trivendram),{(7,Komal,Nayak,24,9848022334,trivendram)})

Group All

You can group a relation by all the columns as shown below.

grunt> group_all = GROUP student_details All;

Now, verify the content of the relation group_all as shown below.

grunt> Dump group_all;

(all,{(8,Bharathi,Nambiayar,24,9848022333,Chennai),(7,Komal,Nayak,24,9848022334,trivendram),(6,Archana,Mishra,23,9848022335,Chennai),(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar),(4,Preethi,Agarwal,21,9848022330,Pune),(3,Rajesh,Khanna,22,9848022339,Delhi),(2,siddarth,Battacharya,22,9848022338,Kolkata),(1,Rajiv,Reddy,21,9848022337,Hyderabad)})
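A grouped relation is usually consumed with FOREACH…GENERATE. As a small sketch beyond the original example, the following counts the students in each age group; the relation name age_count is illustrative:

grunt> age_count = FOREACH group_data GENERATE group AS age, COUNT(student_details) AS total;
grunt> Dump age_count;

With the data above, each age group holds two students, so this should print (21,2), (22,2), (23,2), and (24,2).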
Apache Pig – Explain Operator

The explain operator is used to display the logical, physical, and MapReduce execution plans of a relation.

Syntax

Given below is the syntax of the explain operator.

grunt> explain Relation_name;

Example

Assume we have a file student_data.txt in HDFS with the following content.

001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai

And we have read it into a relation student using the LOAD operator as shown below.

grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' USING PigStorage(',')
   as ( id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray );

Now, let us explain the relation named student using the explain operator as shown below.

grunt> explain student;

Output

It will produce the following output.

2015-10-05 11:32:43,660 [main] INFO org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer -
{RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator, GroupByConstParallelSetter,
LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, PartitionFilterOptimizer,
PredicatePushdownOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter]}

#-----------------------------------------------
# New Logical Plan:
#-----------------------------------------------
student: (Name: LOStore Schema: id#31:int,firstname#32:chararray,lastname#33:chararray,phone#34:chararray,city#35:chararray)
|
|---student: (Name: LOForEach Schema: id#31:int,firstname#32:chararray,lastname#33:chararray,phone#34:chararray,city#35:chararray)
    |   |
    |   (Name: LOGenerate[false,false,false,false,false] Schema: id#31:int,firstname#32:chararray,lastname#33:chararray,phone#34:chararray,city#35:chararray)ColumnPrune:InputUids=[34, 35, 32, 33, 31]ColumnPrune:OutputUids=[34, 35, 32, 33, 31]
    |   |   |
    |   |   (Name: Cast Type: int Uid: 31)
    |   |   |
    |   |   |---id:(Name: Project Type: bytearray Uid: 31 Input: 0 Column: (*))
    |   |   |
    |   |   (Name: Cast Type: chararray Uid: 32)
    |   |   |
    |   |   |---firstname:(Name: Project Type: bytearray Uid: 32 Input: 1 Column: (*))
    |   |   |
    |   |   (Name: Cast Type: chararray Uid: 33)
    |   |   |
    |   |   |---lastname:(Name: Project Type: bytearray Uid: 33 Input: 2 Column: (*))
    |   |   |
    |   |   (Name: Cast Type: chararray Uid: 34)
    |   |   |
    |   |   |---phone:(Name: Project Type: bytearray Uid: 34 Input: 3 Column: (*))
    |   |   |
    |   |   (Name: Cast Type: chararray Uid: 35)
    |   |   |
    |   |   |---city:(Name: Project Type: bytearray Uid: 35 Input: 4 Column: (*))
    |   |
    |   |---(Name: LOInnerLoad[0] Schema: id#31:bytearray)
    |   |
    |   |---(Name: LOInnerLoad[1] Schema: firstname#32:bytearray)
    |   |
    |   |---(Name: LOInnerLoad[2] Schema: lastname#33:bytearray)
    |   |
    |   |---(Name: LOInnerLoad[3] Schema: phone#34:bytearray)
    |   |
    |   |---(Name: LOInnerLoad[4] Schema: city#35:bytearray)
    |
    |---student: (Name: LOLoad Schema: id#31:bytearray,firstname#32:bytearray,lastname#33:bytearray,phone#34:bytearray,city#35:bytearray)RequiredFields:null

#-----------------------------------------------
# Physical Plan:
#-----------------------------------------------
student: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-36
|
|---student: New For Each(false,false,false,false,false)[bag] - scope-35
    |   |
    |   Cast[int] - scope-21
    |   |
    |   |---Project[bytearray][0] - scope-20
    |   |
    |   Cast[chararray] - scope-24
    |   |
    |   |---Project[bytearray][1] - scope-23
    |   |
    |   Cast[chararray] - scope-27
    |   |
    |   |---Project[bytearray][2] - scope-26
    |   |
    |   Cast[chararray] - scope-30
    |   |
    |   |---Project[bytearray][3] - scope-29
    |   |
    |   Cast[chararray] - scope-33
    |   |
    |   |---Project[bytearray][4] - scope-32
    |
    |---student: Load(hdfs://localhost:9000/pig_data/student_data.txt:PigStorage(',')) - scope-19

2015-10-05 11:32:43,682 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2015-10-05 11:32:43,684 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2015-10-05 11:32:43,685 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1

#--------------------------------------------------
# Map Reduce Plan
#--------------------------------------------------
MapReduce node scope-37
Map Plan
student: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-36
|
|---student: New For Each(false,false,false,false,false)[bag] - scope-35
    |   |
    |   Cast[int] - scope-21
    |   |
    |   |---Project[bytearray][0] - scope-20
    |   |
    |   Cast[chararray] - scope-24
    |   |
    |   |---Project[bytearray][1] - scope-23
    |   |
    |   Cast[chararray] - scope-27
    |   |
    |   |---Project[bytearray][2] - scope-26
    |   |
    |   Cast[chararray] - scope-30
    |   |
    |   |---Project[bytearray][3] - scope-29
    |   |
    |   Cast[chararray] - scope-33
    |   |
    |   |---Project[bytearray][4] - scope-32
    |
    |---student: Load(hdfs://localhost:9000/pig_data/student_data.txt:PigStorage(',')) - scope-19
--------
Global sort: false
----------------
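The plan above contains only a map phase, because a plain LOAD with casts needs no shuffle. To see how the plan changes when an operator forces a reduce phase, you can explain a derived relation. A small sketch beyond the original example (the relation name grouped is illustrative):

grunt> grouped = GROUP student BY city;
grunt> explain grouped;

The MapReduce plan for grouped should additionally show rearrange and package operators, reflecting the shuffle that GROUP introduces.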
Apache Pig – Split Operator

The SPLIT operator is used to split a relation into two or more relations.

Syntax

Given below is the syntax of the SPLIT operator.

grunt> SPLIT Relation1_name INTO Relation2_name IF (condition1), Relation3_name IF (condition2);

Example

Assume that we have a file named student_details.txt in the HDFS directory /pig_data/ as shown below.

student_details.txt

001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai

And we have loaded this file into Pig with the relation name student_details as shown below.

grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);

Let us now split the relation into two: one listing the students younger than 23, and the other listing the students aged 23 to 25.

grunt> SPLIT student_details into student_details1 if age < 23, student_details2 if (age >= 23 and age <= 25);

Verification

Verify the relations student_details1 and student_details2 using the DUMP operator as shown below.

grunt> Dump student_details1;
grunt> Dump student_details2;

Output

It will produce the following output, displaying the contents of the relations student_details1 and student_details2 respectively.

grunt> Dump student_details1;
(1,Rajiv,Reddy,21,9848022337,Hyderabad)
(2,siddarth,Battacharya,22,9848022338,Kolkata)
(3,Rajesh,Khanna,22,9848022339,Delhi)
(4,Preethi,Agarwal,21,9848022330,Pune)

grunt> Dump student_details2;
(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar)
(6,Archana,Mishra,23,9848022335,Chennai)
(7,Komal,Nayak,24,9848022334,trivendram)
(8,Bharathi,Nambiayar,24,9848022333,Chennai)
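SPLIT also accepts an OTHERWISE clause that catches every tuple not matched by the earlier conditions. A minimal sketch beyond the original example (the relation names under_23 and rest are illustrative):

grunt> SPLIT student_details into under_23 if age < 23, rest OTHERWISE;

With the data above, this yields the same partition as the two explicit conditions.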
Apache Pig – Filter Operator

The FILTER operator is used to select the required tuples from a relation based on a condition.

Syntax

Given below is the syntax of the FILTER operator.

grunt> Relation2_name = FILTER Relation1_name BY (condition);

Example

Assume that we have a file named student_details.txt in the HDFS directory /pig_data/ as shown below.

student_details.txt

001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai

And we have loaded this file into Pig with the relation name student_details as shown below.

grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);

Let us now use the Filter operator to get the details of the students who belong to the city Chennai.

grunt> filter_data = FILTER student_details BY city == 'Chennai';

Verification

Verify the relation filter_data using the DUMP operator as shown below.

grunt> Dump filter_data;

Output

It will produce the following output, displaying the contents of the relation filter_data as follows.

(6,Archana,Mishra,23,9848022335,Chennai)
(8,Bharathi,Nambiayar,24,9848022333,Chennai)
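Filter conditions can be combined with AND, OR, and NOT. As a small sketch beyond the original example (the relation name chennai_23plus is illustrative), selecting students from Chennai who are at least 23 years old:

grunt> chennai_23plus = FILTER student_details BY (age >= 23) AND (city == 'Chennai');
grunt> Dump chennai_23plus;

With this dataset the result matches filter_data above, since both Chennai students are 23 or older.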
Apache Pig – Cross Operator

The CROSS operator computes the cross-product of two or more relations. This chapter explains, with an example, how to use the cross operator in Pig Latin.

Syntax

Given below is the syntax of the CROSS operator.

grunt> Relation3_name = CROSS Relation1_name, Relation2_name;

Example

Assume that we have two files namely customers.txt and orders.txt in the /pig_data/ directory of HDFS as shown below.

customers.txt

1,Ramesh,32,Ahmedabad,2000.00
2,Khilan,25,Delhi,1500.00
3,kaushik,23,Kota,2000.00
4,Chaitali,25,Mumbai,6500.00
5,Hardik,27,Bhopal,8500.00
6,Komal,22,MP,4500.00
7,Muffy,24,Indore,10000.00

orders.txt

102,2009-10-08 00:00:00,3,3000
100,2009-10-08 00:00:00,3,1500
101,2009-11-20 00:00:00,2,1560
103,2008-05-20 00:00:00,4,2060

And we have loaded these two files into Pig with the relations customers and orders as shown below.

grunt> customers = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',')
   as (id:int, name:chararray, age:int, address:chararray, salary:int);

grunt> orders = LOAD 'hdfs://localhost:9000/pig_data/orders.txt' USING PigStorage(',')
   as (oid:int, date:chararray, customer_id:int, amount:int);

Let us now get the cross-product of these two relations using the cross operator as shown below.

grunt> cross_data = CROSS customers, orders;

Verification

Verify the relation cross_data using the DUMP operator as shown below.

grunt> Dump cross_data;

Output

It will produce the following output, displaying the contents of the relation cross_data.

(7,Muffy,24,Indore,10000,103,2008-05-20 00:00:00,4,2060)
(7,Muffy,24,Indore,10000,101,2009-11-20 00:00:00,2,1560)
(7,Muffy,24,Indore,10000,100,2009-10-08 00:00:00,3,1500)
(7,Muffy,24,Indore,10000,102,2009-10-08 00:00:00,3,3000)
(6,Komal,22,MP,4500,103,2008-05-20 00:00:00,4,2060)
(6,Komal,22,MP,4500,101,2009-11-20 00:00:00,2,1560)
(6,Komal,22,MP,4500,100,2009-10-08 00:00:00,3,1500)
(6,Komal,22,MP,4500,102,2009-10-08 00:00:00,3,3000)
(5,Hardik,27,Bhopal,8500,103,2008-05-20 00:00:00,4,2060)
(5,Hardik,27,Bhopal,8500,101,2009-11-20 00:00:00,2,1560)
(5,Hardik,27,Bhopal,8500,100,2009-10-08 00:00:00,3,1500)
(5,Hardik,27,Bhopal,8500,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
(4,Chaitali,25,Mumbai,6500,101,2009-11-20 00:00:00,2,1560)
(4,Chaitali,25,Mumbai,6500,100,2009-10-08 00:00:00,3,1500)
(4,Chaitali,25,Mumbai,6500,102,2009-10-08 00:00:00,3,3000)
(3,kaushik,23,Kota,2000,103,2008-05-20 00:00:00,4,2060)
(3,kaushik,23,Kota,2000,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(2,Khilan,25,Delhi,1500,103,2008-05-20 00:00:00,4,2060)
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(2,Khilan,25,Delhi,1500,100,2009-10-08 00:00:00,3,1500)
(2,Khilan,25,Delhi,1500,102,2009-10-08 00:00:00,3,3000)
(1,Ramesh,32,Ahmedabad,2000,103,2008-05-20 00:00:00,4,2060)
(1,Ramesh,32,Ahmedabad,2000,101,2009-11-20 00:00:00,2,1560)
(1,Ramesh,32,Ahmedabad,2000,100,2009-10-08 00:00:00,3,1500)
(1,Ramesh,32,Ahmedabad,2000,102,2009-10-08 00:00:00,3,3000)
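Keep in mind that CROSS produces one output tuple for every pair of input tuples (here 7 × 4 = 28), so it can be very expensive on large relations. As a small sketch beyond the original example (the relation name cross_sample is illustrative), you can cap the output with LIMIT while experimenting:

grunt> cross_sample = LIMIT cross_data 5;
grunt> Dump cross_sample;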
Apache Pig – Describe Operator

The describe operator is used to view the schema of a relation.

Syntax

The syntax of the describe operator is as follows −

grunt> Describe Relation_name;

Example

Assume we have a file student_data.txt in HDFS with the following content.

001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai

And we have read it into a relation student using the LOAD operator as shown below.

grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' USING PigStorage(',')
   as ( id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray );

Now, let us describe the relation named student and verify the schema as shown below.

grunt> describe student;

Output

Once you execute the above Pig Latin statement, it will produce the following output.

grunt> student: { id: int,firstname: chararray,lastname: chararray,phone: chararray,city: chararray }
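describe is particularly handy on derived relations, where the schema is less obvious. As a small sketch beyond the original example (the relation name grouped_by_city is illustrative), describing a grouped relation reveals the nested bag; the output should look along these lines:

grunt> grouped_by_city = GROUP student BY city;
grunt> describe grouped_by_city;
grouped_by_city: {group: chararray,student: {(id: int,firstname: chararray,lastname: chararray,phone: chararray,city: chararray)}}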
Apache Pig – Distinct Operator

The DISTINCT operator is used to remove redundant (duplicate) tuples from a relation.

Syntax

Given below is the syntax of the DISTINCT operator.

grunt> Relation_name2 = DISTINCT Relation_name1;

Example

Assume that we have a file named student_details.txt in the HDFS directory /pig_data/ as shown below.

student_details.txt

001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai
006,Archana,Mishra,9848022335,Chennai

And we have loaded this file into Pig with the relation name student_details as shown below.

grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);

Let us now remove the redundant (duplicate) tuples from the relation named student_details using the DISTINCT operator, and store the result as another relation named distinct_data as shown below.

grunt> distinct_data = DISTINCT student_details;

Verification

Verify the relation distinct_data using the DUMP operator as shown below.

grunt> Dump distinct_data;

Output

It will produce the following output, displaying the contents of the relation distinct_data as follows.

(1,Rajiv,Reddy,9848022337,Hyderabad)
(2,siddarth,Battacharya,9848022338,Kolkata)
(3,Rajesh,Khanna,9848022339,Delhi)
(4,Preethi,Agarwal,9848022330,Pune)
(5,Trupthi,Mohanthy,9848022336,Bhuwaneshwar)
(6,Archana,Mishra,9848022335,Chennai)
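To confirm how many duplicates were removed, you can count the tuples before and after. A small sketch beyond the original example, using GROUP ALL with COUNT (the relation names are illustrative):

grunt> all_rows = GROUP student_details ALL;
grunt> row_count = FOREACH all_rows GENERATE COUNT(student_details);
grunt> Dump row_count;        -- should print (9) for the input above
grunt> dist_rows = GROUP distinct_data ALL;
grunt> dist_count = FOREACH dist_rows GENERATE COUNT(distinct_data);
grunt> Dump dist_count;       -- should print (6)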
Apache Pig – Architecture

The language used to analyze data in Hadoop using Pig is known as Pig Latin. It is a high-level data processing language which provides a rich set of data types and operators to perform various operations on the data.

To perform a particular task, programmers need to write a Pig script using the Pig Latin language and execute it using any of the execution mechanisms (Grunt shell, UDFs, embedded). After execution, these scripts go through a series of transformations applied by the Pig framework to produce the desired output. Internally, Apache Pig converts these scripts into a series of MapReduce jobs, and thus makes the programmer's job easy.

Apache Pig Components

There are various components in the Apache Pig framework. Let us take a look at the major components.

Parser

Initially the Pig scripts are handled by the Parser. It checks the syntax of the script, does type checking, and other miscellaneous checks. The output of the parser is a DAG (directed acyclic graph), which represents the Pig Latin statements and logical operators. In the DAG, the logical operators of the script are represented as the nodes and the data flows are represented as edges.

Optimizer

The logical plan (DAG) is passed to the logical optimizer, which carries out logical optimizations such as projection and pushdown.

Compiler

The compiler compiles the optimized logical plan into a series of MapReduce jobs.

Execution engine

Finally, the MapReduce jobs are submitted to Hadoop in a sorted order, where they are executed to produce the desired results.

Pig Latin Data Model

The data model of Pig Latin is fully nested, and it allows complex non-atomic datatypes such as map and tuple.

Atom

Any single value in Pig Latin, irrespective of its data type, is known as an Atom. It is stored as a string and can be used as a string and as a number. int, long, float, double, chararray, and bytearray are the atomic values of Pig. A piece of data or a simple atomic value is known as a field.

Example − 'raja' or '30'

Tuple

A record formed by an ordered set of fields is known as a tuple; the fields can be of any type. A tuple is similar to a row in an RDBMS table.

Example − (Raja, 30)

Bag

A bag is an unordered set of tuples. In other words, a collection of tuples (non-unique) is known as a bag. Each tuple can have any number of fields (flexible schema). A bag is represented by '{}'. It is similar to a table in an RDBMS, but unlike an RDBMS table, it is not necessary that every tuple contain the same number of fields or that the fields in the same position (column) have the same type.

Example − {(Raja, 30), (Mohammad, 45)}

A bag can also be a field in a relation; in that context, it is known as an inner bag.

Example − {Raja, 30, {9848022338, …}}

Map

A map (or data map) is a set of key-value pairs. The key needs to be of type chararray and should be unique. The value can be of any type. A map is represented by '[]'.

Example − [name#Raja, age#30]

Relation

A relation is a bag of tuples. The relations in Pig Latin are unordered (there is no guarantee that tuples are processed in any particular order).
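These complex types can be declared directly in a LOAD schema. A minimal sketch under assumptions: the file complex_data.txt and its pipe-delimited layout are hypothetical, invented here purely for illustration:

grunt> complex = LOAD 'hdfs://localhost:9000/pig_data/complex_data.txt' USING PigStorage('|')
   as (name:chararray, details:map[], phones:bag{t:tuple(phone:chararray)});
grunt> describe complex;

Each record would then carry an atom (name), a map (details), and an inner bag of tuples (phones), mirroring the data model just described.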