Apache Pig – Cogroup Operator

The COGROUP operator works more or less in the same way as the GROUP operator. The only difference between the two operators is that the GROUP operator is normally used with one relation, while the COGROUP operator is used in statements involving two or more relations.

Grouping Two Relations using Cogroup

Assume that we have two files namely student_details.txt and employee_details.txt in the HDFS directory /pig_data/ as shown below.

student_details.txt

001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai

employee_details.txt

001,Robin,22,newyork
002,BOB,23,Kolkata
003,Maya,23,Tokyo
004,Sara,25,London
005,David,23,Bhuwaneshwar
006,Maggy,22,Chennai

And we have loaded these files into Pig with the relation names student_details and employee_details respectively, as shown below.

grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);

grunt> employee_details = LOAD 'hdfs://localhost:9000/pig_data/employee_details.txt' USING PigStorage(',')
   as (id:int, name:chararray, age:int, city:chararray);

Now, let us group the records/tuples of the relations student_details and employee_details with the key age, as shown below.

grunt> cogroup_data = COGROUP student_details by age, employee_details by age;

Verification

Verify the relation cogroup_data using the DUMP operator as shown below.

grunt> Dump cogroup_data;

Output

It will produce the following output, displaying the contents of the relation cogroup_data.
(21,{(4,Preethi,Agarwal,21,9848022330,Pune),(1,Rajiv,Reddy,21,9848022337,Hyderabad)},{})
(22,{(3,Rajesh,Khanna,22,9848022339,Delhi),(2,siddarth,Battacharya,22,9848022338,Kolkata)},{(6,Maggy,22,Chennai),(1,Robin,22,newyork)})
(23,{(6,Archana,Mishra,23,9848022335,Chennai),(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar)},{(5,David,23,Bhuwaneshwar),(3,Maya,23,Tokyo),(2,BOB,23,Kolkata)})
(24,{(8,Bharathi,Nambiayar,24,9848022333,Chennai),(7,Komal,Nayak,24,9848022334,trivendram)},{})
(25,{},{(4,Sara,25,London)})

The COGROUP operator groups the tuples from each relation according to age, where each group depicts a particular age value. For example, if we consider the first tuple of the result, it is grouped by age 21. It contains two bags − the first bag holds all the tuples from the first relation (student_details in this case) having age 21, and the second bag contains all the tuples from the second relation (employee_details in this case) having age 21. If a relation has no tuples with a given age value, it returns an empty bag.
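To make the grouping rule concrete, here is a minimal Python sketch (not Pig itself) of COGROUP semantics: for every key value, one bag of matching tuples is collected per input relation, and a relation with no matches contributes an empty bag.

```python
# Sketch of COGROUP semantics: (key, bag_from_relation1, bag_from_relation2, ...)
from collections import defaultdict

def cogroup(*relations_with_keys):
    """Each argument is (rows, key_index). Returns sorted
    (key, bag1, bag2, ...) tuples, with an empty bag where a
    relation has no rows for that key."""
    groups = defaultdict(lambda: [[] for _ in relations_with_keys])
    for i, (rows, key_idx) in enumerate(relations_with_keys):
        for row in rows:
            groups[row[key_idx]][i].append(row)
    return [(k, *bags) for k, bags in sorted(groups.items())]

# Tiny stand-ins for student_details and employee_details, keyed on age
students = [(1, "Rajiv", 21), (2, "siddarth", 22)]
employees = [(1, "Robin", 22), (4, "Sara", 25)]
result = cogroup((students, 2), (employees, 2))
# age 21 exists only among students, so its employee bag is empty
```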

Apache Pig – Bag & Tuple Functions

Given below is the list of Bag and Tuple functions.

1. TOBAG() − To convert two or more expressions into a bag.
2. TOP() − To get the top N tuples of a relation.
3. TOTUPLE() − To convert one or more expressions into a tuple.
4. TOMAP() − To convert the key-value pairs into a Map.
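As a rough illustration of what TOP() computes (this is a Python sketch, not Pig's built-in), "top N tuples" means the N tuples with the largest values in a chosen column:

```python
# Sketch of TOP(n, column, relation): the n tuples with the
# largest values in the given column, largest first.
import heapq

def top(n, column, rows):
    return heapq.nlargest(n, rows, key=lambda row: row[column])

scores = [("a", 10), ("b", 30), ("c", 20)]
top(2, 1, scores)  # the two tuples with the highest score in column 1
```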

Apache Pig – Date-time Functions

Apache Pig provides the following Date and Time functions −

1. ToDate(milliseconds) − Returns a date-time object according to the given parameters. The other alternatives for this function are ToDate(isostring), ToDate(userstring, format), and ToDate(userstring, format, timezone).
2. CurrentTime() − Returns the date-time object of the current time.
3. GetDay(datetime) − Returns the day of a month from the date-time object.
4. GetHour(datetime) − Returns the hour of a day from the date-time object.
5. GetMilliSecond(datetime) − Returns the millisecond of a second from the date-time object.
6. GetMinute(datetime) − Returns the minute of an hour from the date-time object.
7. GetMonth(datetime) − Returns the month of a year from the date-time object.
8. GetSecond(datetime) − Returns the second of a minute from the date-time object.
9. GetWeek(datetime) − Returns the week of a year from the date-time object.
10. GetWeekYear(datetime) − Returns the week year from the date-time object.
11. GetYear(datetime) − Returns the year from the date-time object.
12. AddDuration(datetime, duration) − Adds the Duration object to the Date-Time object and returns the result.
13. SubtractDuration(datetime, duration) − Subtracts the Duration object from the Date-Time object and returns the result.
14. DaysBetween(datetime1, datetime2) − Returns the number of days between the two date-time objects.
15. HoursBetween(datetime1, datetime2) − Returns the number of hours between two date-time objects.
16. MilliSecondsBetween(datetime1, datetime2) − Returns the number of milliseconds between two date-time objects.
17. MinutesBetween(datetime1, datetime2) − Returns the number of minutes between two date-time objects.
18. MonthsBetween(datetime1, datetime2) − Returns the number of months between two date-time objects.
19. SecondsBetween(datetime1, datetime2) − Returns the number of seconds between two date-time objects.
20. WeeksBetween(datetime1, datetime2) − Returns the number of weeks between two date-time objects.
21. YearsBetween(datetime1, datetime2) − Returns the number of years between two date-time objects.
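A rough Python analogue of a few of the accessor and "between" functions above (a sketch using the standard datetime module, not Pig itself) may help clarify what each one extracts:

```python
# Python datetime equivalents of some Pig date-time functions
from datetime import datetime

dt1 = datetime(2015, 10, 5, 13, 5, 5)
dt2 = datetime(2015, 10, 8, 15, 5, 5)

day = dt1.day                                         # like GetDay(datetime)
month = dt1.month                                     # like GetMonth(datetime)
days_between = (dt2 - dt1).days                       # like DaysBetween(dt2, dt1)
hours_between = (dt2 - dt1).total_seconds() // 3600   # like HoursBetween(dt2, dt1)
```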

Apache Pig – Storing Data

In the previous chapter, we learnt how to load data into Apache Pig. You can store the loaded data in the file system using the STORE operator. This chapter explains how to store data in Apache Pig using the STORE operator.

Syntax

Given below is the syntax of the STORE statement.

grunt> STORE Relation_name INTO 'required_directory_path' [USING function];

Example

Assume we have a file student_data.txt in HDFS with the following content.

001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai

And we have read it into a relation student using the LOAD operator as shown below.

grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' USING PigStorage(',')
   as ( id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray );

Now, let us store the relation in the HDFS directory /pig_Output/ as shown below.

grunt> STORE student INTO 'hdfs://localhost:9000/pig_Output/' USING PigStorage(',');

Output

After executing the STORE statement, you will get the following output. A directory is created with the specified name and the data is stored in it.

2015-10-05 13:05:05,429 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2015-10-05 13:05:05,429 [main] INFO org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - Script Statistics:

HadoopVersion  PigVersion  UserId  StartedAt            FinishedAt           Features
2.6.0          0.15.0      Hadoop  2015-10-05 13:03:03  2015-10-05 13:05:05  UNKNOWN

Success!
Job Stats (time in seconds):
JobId         Maps  Reduces  MaxMapTime  MinMapTime  AvgMapTime  MedianMapTime
job_14459_06  1     0        n/a         n/a         n/a         n/a

MaxReduceTime  MinReduceTime  AvgReduceTime  MedianReducetime  Alias    Feature
0              0              0              0                 student  MAP_ONLY

Output folder
hdfs://localhost:9000/pig_Output/

Input(s): Successfully read 0 records from: "hdfs://localhost:9000/pig_data/student_data.txt"
Output(s): Successfully stored 0 records in: "hdfs://localhost:9000/pig_Output"

Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG: job_1443519499159_0006

2015-10-05 13:06:06,192 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!

Verification

You can verify the stored data as shown below.

Step 1

First of all, list out the files in the directory named pig_Output using the ls command as shown below.

hdfs dfs -ls 'hdfs://localhost:9000/pig_Output/'

Found 2 items
-rw-r--r--  1 Hadoop supergroup    0  2015-10-05 13:03  hdfs://localhost:9000/pig_Output/_SUCCESS
-rw-r--r--  1 Hadoop supergroup  224  2015-10-05 13:03  hdfs://localhost:9000/pig_Output/part-m-00000

You can observe that two files were created after executing the STORE statement.

Step 2

Using the cat command, list the contents of the file named part-m-00000 as shown below.

$ hdfs dfs -cat 'hdfs://localhost:9000/pig_Output/part-m-00000'

1,Rajiv,Reddy,9848022337,Hyderabad
2,siddarth,Battacharya,9848022338,Kolkata
3,Rajesh,Khanna,9848022339,Delhi
4,Preethi,Agarwal,9848022330,Pune
5,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
6,Archana,Mishra,9848022335,Chennai
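The shape of the output above can be sketched in plain Python (this is an illustration of what a PigStorage store produces, not Pig itself): one delimited line per tuple in a part file, plus an empty _SUCCESS marker, inside the output directory.

```python
# Sketch of STORE ... USING PigStorage(','): write tuples as
# delimited lines into part-m-00000 and create a _SUCCESS marker.
import os
import tempfile

def store(rows, out_dir, delimiter=","):
    os.makedirs(out_dir, exist_ok=True)
    part = os.path.join(out_dir, "part-m-00000")
    with open(part, "w") as f:
        for row in rows:
            f.write(delimiter.join(str(v) for v in row) + "\n")
    # an empty _SUCCESS file marks a successful job
    open(os.path.join(out_dir, "_SUCCESS"), "w").close()
    return part

rows = [(1, "Rajiv", "Reddy"), (2, "siddarth", "Battacharya")]
part = store(rows, os.path.join(tempfile.mkdtemp(), "pig_Output"))
```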

Load & Store Functions

The Load and Store functions in Apache Pig are used to determine how data goes into and comes out of Pig. These functions are used with the LOAD and STORE operators. Given below is the list of Load and Store functions available in Pig.

1. PigStorage() − To load and store structured files.
2. TextLoader() − To load unstructured data into Pig.
3. BinStorage() − To load and store data into Pig using a machine-readable format.
4. Handling Compression − In Pig Latin, we can load and store compressed data.
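The core of what PigStorage() does on load can be sketched in a few lines of Python (an illustration of delimiter-based parsing against a declared schema, not Pig's actual implementation):

```python
# Sketch of PigStorage(',') on load: split each line on the
# delimiter and cast each field to its declared schema type.
def load_line(line, schema, delimiter=","):
    """schema is a list of (name, type) pairs, e.g. [("id", int), ...]."""
    fields = line.rstrip("\n").split(delimiter)
    return tuple(typ(val) for (name, typ), val in zip(schema, fields))

schema = [("id", int), ("firstname", str), ("phone", str)]
load_line("001,Rajiv,9848022337", schema)
```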

Apache Pig – Limit Operator

The LIMIT operator is used to get a limited number of tuples from a relation.

Syntax

Given below is the syntax of the LIMIT operator.

grunt> Result = LIMIT Relation_name required_number_of_tuples;

Example

Assume that we have a file named student_details.txt in the HDFS directory /pig_data/ as shown below.

student_details.txt

001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai

And we have loaded this file into Pig with the relation name student_details as shown below.

grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);

Now, let us get the first four tuples of the relation and store them in another relation named limit_data using the LIMIT operator as shown below. Note that unless the input is ordered (for example, with ORDER BY), there is no guarantee which tuples LIMIT returns.

grunt> limit_data = LIMIT student_details 4;

Verification

Verify the relation limit_data using the DUMP operator as shown below.

grunt> Dump limit_data;

Output

It will produce the following output, displaying the contents of the relation limit_data as follows.

(1,Rajiv,Reddy,21,9848022337,Hyderabad)
(2,siddarth,Battacharya,22,9848022338,Kolkata)
(3,Rajesh,Khanna,22,9848022339,Delhi)
(4,Preethi,Agarwal,21,9848022330,Pune)
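The effect of LIMIT is simple enough to sketch in one line of Python (an illustration of the semantics, not Pig itself): keep at most n tuples of the relation.

```python
# Sketch of LIMIT: keep at most n tuples; if the relation has
# fewer than n, all tuples are returned. Which n tuples you get
# is only defined when the input is ordered.
def limit(rows, n):
    return rows[:n]

students = [(1, "Rajiv"), (2, "siddarth"), (3, "Rajesh"),
            (4, "Preethi"), (5, "Trupthi")]
limit(students, 4)
```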

Pig Latin – Basics

Pig Latin is the language used to analyze data in Hadoop using Apache Pig. In this chapter, we are going to discuss the basics of Pig Latin such as Pig Latin statements, data types, general and relational operators, and Pig Latin UDFs.

Pig Latin – Data Model

As discussed in the previous chapters, the data model of Pig is fully nested. A Relation is the outermost structure of the Pig Latin data model. And it is a bag where −

A bag is a collection of tuples.
A tuple is an ordered set of fields.
A field is a piece of data.

Pig Latin – Statements

While processing data using Pig Latin, statements are the basic constructs. These statements work with relations. They include expressions and schemas. Every statement ends with a semicolon (;). We will perform various operations using operators provided by Pig Latin, through statements. Except LOAD and STORE, while performing all other operations, Pig Latin statements take a relation as input and produce another relation as output.

As soon as you enter a LOAD statement in the Grunt shell, its semantic checking will be carried out. To see the contents of the relation, you need to use the DUMP operator. Only after performing the DUMP operation will the MapReduce job for loading the data from the file system be carried out.

Example

Given below is a Pig Latin statement, which loads data into Apache Pig.

grunt> Student_data = LOAD 'student_data.txt' USING PigStorage(',') as
   ( id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray );

Pig Latin – Data types

The following table describes the Pig Latin data types.

1. int − Represents a signed 32-bit integer. Example: 8
2. long − Represents a signed 64-bit integer. Example: 5L
3. float − Represents a signed 32-bit floating point. Example: 5.5F
4. double − Represents a 64-bit floating point. Example: 10.5
5. chararray − Represents a character array (string) in Unicode UTF-8 format. Example: 'tutorials point'
6. bytearray − Represents a byte array (blob).
7. boolean − Represents a Boolean value. Example: true/false
8. datetime − Represents a date-time. Example: 1970-01-01T00:00:00.000+00:00
9. biginteger − Represents a Java BigInteger. Example: 60708090709
10. bigdecimal − Represents a Java BigDecimal. Example: 185.98376256272893883

Complex Types

11. tuple − A tuple is an ordered set of fields. Example: (raja, 30)
12. bag − A bag is a collection of tuples. Example: {(raju,30),(Mohhammad,45)}
13. map − A map is a set of key-value pairs. Example: ['name'#'Raju', 'age'#30]

Null Values

Values for all the above data types can be NULL. Apache Pig treats null values in a similar way as SQL does. A null can be an unknown value or a non-existent value. It is used as a placeholder for optional values. These nulls can occur naturally or can be the result of an operation.

Pig Latin – Arithmetic Operators

The following table describes the arithmetic operators of Pig Latin. Suppose a = 10 and b = 20.

+ − Addition − Adds values on either side of the operator. a + b will give 30
− − Subtraction − Subtracts the right-hand operand from the left-hand operand. a − b will give −10
* − Multiplication − Multiplies values on either side of the operator. a * b will give 200
/ − Division − Divides the left-hand operand by the right-hand operand. b / a will give 2
% − Modulus − Divides the left-hand operand by the right-hand operand and returns the remainder. b % a will give 0
? : − Bincond − Evaluates a Boolean expression. It has three operands, as shown below.
   variable x = (expression) ? value1 if true : value2 if false
   b = (a == 1) ? 20 : 30; if a == 1, the value of b is 20; if a != 1, the value of b is 30.
CASE WHEN THEN ELSE END − Case − The CASE operator is equivalent to the nested bincond operator.
   CASE f2 % 2 WHEN 0 THEN 'even' WHEN 1 THEN 'odd' END

Pig Latin – Comparison Operators

The following table describes the comparison operators of Pig Latin.
== − Equal − Checks if the values of two operands are equal or not; if yes, then the condition becomes true. (a == b) is not true.
!= − Not Equal − Checks if the values of two operands are equal or not. If the values are not equal, then the condition becomes true. (a != b) is true.
> − Greater than − Checks if the value of the left operand is greater than the value of the right operand. If yes, then the condition becomes true. (a > b) is not true.
< − Less than − Checks if the value of the left operand is less than the value of the right operand. If yes, then the condition becomes true. (a < b) is true.
>= − Greater than or equal to − Checks if the value of the left operand is greater than or equal to the value of the right operand. If yes, then the condition becomes true. (a >= b) is not true.
<= − Less than or equal to − Checks if the value of the left operand is less than or equal to the value of the right operand. If yes, then the condition becomes true. (a <= b) is true.
matches − Pattern matching − Checks whether the string on the left-hand side matches the constant on the right-hand side. f1 matches '.*tutorial.*'

Pig Latin – Type Construction Operators

The following table describes the type construction operators of Pig Latin.

() − Tuple constructor operator − This operator is used to construct a tuple. (Raju, 30)
{} − Bag constructor operator − This operator is used to construct a bag. {(Raju, 30), (Mohammad, 45)}
[] − Map constructor operator − This operator is used to construct a map. [name#Raja, age#30]

Pig Latin – Relational Operations

The following table describes the relational operators

Apache Pig – Illustrate Operator

The ILLUSTRATE operator gives you the step-by-step execution of a sequence of statements.

Syntax

Given below is the syntax of the ILLUSTRATE operator.

grunt> illustrate Relation_name;

Example

Assume we have a file student_data.txt in HDFS with the following content.

001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai

And we have read it into a relation student using the LOAD operator as shown below.

grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' USING PigStorage(',')
   as ( id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray );

Now, let us illustrate the relation named student as shown below.

grunt> illustrate student;

Output

On executing the above statement, you will get the following output.

INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map - Aliases being processed per job phase (AliasName[line,offset]): M: student[1,10] C: R:

------------------------------------------------------------------------------------------------
|student | id:int | firstname:chararray | lastname:chararray | phone:chararray | city:chararray |
------------------------------------------------------------------------------------------------
|        | 002    | siddarth            | Battacharya        | 9848022338      | Kolkata        |
------------------------------------------------------------------------------------------------
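The idea behind the output above can be sketched in Python (an illustration of the display format only; Pig's actual ILLUSTRATE samples and traces data through the whole script): take a representative tuple and render it against the declared schema.

```python
# Sketch of ILLUSTRATE's tabular output: a sample tuple shown
# under a schema-annotated header for the relation.
def illustrate(relation_name, schema, rows):
    header = f"|{relation_name} | " + " | ".join(f"{n}:{t}" for n, t in schema) + " |"
    sample = rows[len(rows) // 2]  # any representative row would do
    line = f"|{' ' * len(relation_name)} | " + " | ".join(str(v) for v in sample) + " |"
    return header + "\n" + line

schema = [("id", "int"), ("firstname", "chararray")]
rows = [(1, "Rajiv"), (2, "siddarth"), (3, "Rajesh")]
print(illustrate("student", schema, rows))
```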

Apache Pig – Diagnostic Operator

The LOAD statement will simply load the data into the specified relation in Apache Pig. To verify the execution of the LOAD statement, you have to use the Diagnostic Operators. Pig Latin provides four different types of diagnostic operators −

Dump operator
Describe operator
Explain operator
Illustrate operator

In this chapter, we will discuss the Dump operator of Pig Latin.

Dump Operator

The Dump operator is used to run Pig Latin statements and display the results on the screen. It is generally used for debugging purposes.

Syntax

Given below is the syntax of the Dump operator.

grunt> Dump Relation_Name;

Example

Assume we have a file student_data.txt in HDFS with the following content.

001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai

And we have read it into a relation student using the LOAD operator as shown below.

grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' USING PigStorage(',')
   as ( id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray );

Now, let us print the contents of the relation using the Dump operator as shown below.

grunt> Dump student;

Once you execute the above Pig Latin statement, it will start a MapReduce job to read data from HDFS. It will produce the following output.

2015-10-01 15:05:27,642 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2015-10-01 15:05:27,652 [main] INFO org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - Script Statistics:

HadoopVersion  PigVersion  UserId  StartedAt            FinishedAt           Features
2.6.0          0.15.0      Hadoop  2015-10-01 15:03:11  2015-10-01 15:05:27  UNKNOWN

Success!
Job Stats (time in seconds):
JobId           Maps  Reduces  MaxMapTime  MinMapTime  AvgMapTime  MedianMapTime
job_14459_0004  1     0        n/a         n/a         n/a         n/a

MaxReduceTime  MinReduceTime  AvgReduceTime  MedianReducetime  Alias    Feature
0              0              0              0                 student  MAP_ONLY

Outputs
hdfs://localhost:9000/tmp/temp580182027/tmp757878456

Input(s): Successfully read 0 records from: "hdfs://localhost:9000/pig_data/student_data.txt"
Output(s): Successfully stored 0 records in: "hdfs://localhost:9000/tmp/temp580182027/tmp757878456"

Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG: job_1443519499159_0004

2015-10-01 15:06:28,403 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2015-10-01 15:06:28,441 [main] INFO org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code.
2015-10-01 15:06:28,485 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2015-10-01 15:06:28,485 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1

(1,Rajiv,Reddy,9848022337,Hyderabad)
(2,siddarth,Battacharya,9848022338,Kolkata)
(3,Rajesh,Khanna,9848022339,Delhi)
(4,Preethi,Agarwal,9848022330,Pune)
(5,Trupthi,Mohanthy,9848022336,Bhuwaneshwar)
(6,Archana,Mishra,9848022335,Chennai)

Apache Pig – Join Operator

The JOIN operator is used to combine records from two or more relations. While performing a join operation, we declare one (or a group of) field(s) from each relation as keys. When these keys match, the two particular tuples are matched, else the records are dropped. Joins can be of the following types −

Self-join
Inner-join
Outer-join − left join, right join, and full join

This chapter explains with examples how to use the JOIN operator in Pig Latin. Assume that we have two files namely customers.txt and orders.txt in the /pig_data/ directory of HDFS as shown below.

customers.txt

1,Ramesh,32,Ahmedabad,2000.00
2,Khilan,25,Delhi,1500.00
3,kaushik,23,Kota,2000.00
4,Chaitali,25,Mumbai,6500.00
5,Hardik,27,Bhopal,8500.00
6,Komal,22,MP,4500.00
7,Muffy,24,Indore,10000.00

orders.txt

102,2009-10-08 00:00:00,3,3000
100,2009-10-08 00:00:00,3,1500
101,2009-11-20 00:00:00,2,1560
103,2008-05-20 00:00:00,4,2060

And we have loaded these two files into Pig with the relations customers and orders as shown below.

grunt> customers = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',')
   as (id:int, name:chararray, age:int, address:chararray, salary:int);

grunt> orders = LOAD 'hdfs://localhost:9000/pig_data/orders.txt' USING PigStorage(',')
   as (oid:int, date:chararray, customer_id:int, amount:int);

Let us now perform various join operations on these two relations.

Self-join

Self-join is used to join a table with itself as if the table were two relations, temporarily renaming at least one relation. Generally, in Apache Pig, to perform self-join, we will load the same data multiple times, under different aliases (names). Therefore let us load the contents of the file customers.txt as two tables as shown below.
grunt> customers1 = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',')
   as (id:int, name:chararray, age:int, address:chararray, salary:int);

grunt> customers2 = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',')
   as (id:int, name:chararray, age:int, address:chararray, salary:int);

Syntax

Given below is the syntax of performing self-join operation using the JOIN operator.

grunt> Relation3_name = JOIN Relation1_name BY key, Relation2_name BY key;

Example

Let us perform self-join operation on the relation customers, by joining the two relations customers1 and customers2 as shown below.

grunt> customers3 = JOIN customers1 BY id, customers2 BY id;

Verification

Verify the relation customers3 using the DUMP operator as shown below.

grunt> Dump customers3;

Output

It will produce the following output, displaying the contents of the relation customers3.

(1,Ramesh,32,Ahmedabad,2000,1,Ramesh,32,Ahmedabad,2000)
(2,Khilan,25,Delhi,1500,2,Khilan,25,Delhi,1500)
(3,kaushik,23,Kota,2000,3,kaushik,23,Kota,2000)
(4,Chaitali,25,Mumbai,6500,4,Chaitali,25,Mumbai,6500)
(5,Hardik,27,Bhopal,8500,5,Hardik,27,Bhopal,8500)
(6,Komal,22,MP,4500,6,Komal,22,MP,4500)
(7,Muffy,24,Indore,10000,7,Muffy,24,Indore,10000)

Inner Join

Inner join is used quite frequently; it is also referred to as equijoin. An inner join returns rows when there is a match in both tables. It creates a new relation by combining column values of two relations (say A and B) based upon the join-predicate. The query compares each row of A with each row of B to find all pairs of rows which satisfy the join-predicate. When the join-predicate is satisfied, the column values for each matched pair of rows of A and B are combined into a result row.

Syntax

Here is the syntax of performing inner join operation using the JOIN operator.
grunt> result = JOIN relation1 BY columnname, relation2 BY columnname;

Example

Let us perform inner join operation on the two relations customers and orders as shown below.

grunt> coustomer_orders = JOIN customers BY id, orders BY customer_id;

Verification

Verify the relation coustomer_orders using the DUMP operator as shown below.

grunt> Dump coustomer_orders;

Output

You will get the following output that displays the contents of the relation named coustomer_orders.

(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)

Outer Join

Unlike inner join, outer join returns all the rows from at least one of the relations. An outer join operation is carried out in three ways −

Left outer join
Right outer join
Full outer join

Left Outer Join

The left outer join operation returns all rows from the left relation, even if there are no matches in the right relation.

Syntax

Given below is the syntax of performing left outer join operation using the JOIN operator.

grunt> Relation3_name = JOIN Relation1_name BY id LEFT OUTER, Relation2_name BY customer_id;

Example

Let us perform left outer join operation on the two relations customers and orders as shown below.

grunt> outer_left = JOIN customers BY id LEFT OUTER, orders BY customer_id;

Verification

Verify the relation outer_left using the DUMP operator as shown below.

grunt> Dump outer_left;

Output

It will produce the following output, displaying the contents of the relation outer_left.
(1,Ramesh,32,Ahmedabad,2000,,,,)
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
(5,Hardik,27,Bhopal,8500,,,,)
(6,Komal,22,MP,4500,,,,)
(7,Muffy,24,Indore,10000,,,,)

Right Outer Join

The right outer join operation returns all rows from the right relation, even if there are no matches in the left relation.

Syntax

Given below is the syntax of performing right outer join operation using the JOIN operator.

grunt> Relation3_name = JOIN Relation1_name BY id RIGHT OUTER, Relation2_name BY customer_id;

Example

Let us perform right outer join operation on the two relations customers and orders as shown below.

grunt> outer_right = JOIN customers BY id RIGHT, orders BY customer_id;

Verification

Verify the relation outer_right using the DUMP operator as shown below.

grunt> Dump outer_right;

Output

It will produce the following output, displaying the contents of the relation outer_right.

(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)

Full Outer Join

The full outer join operation returns rows when there is a match in either of the relations.

Syntax

Given below is the syntax of performing full outer join using the JOIN operator.

grunt> Relation3_name = JOIN Relation1_name BY id FULL OUTER, Relation2_name BY customer_id;

Example

Let us perform full outer join operation on the two relations customers and orders as shown below.

grunt> outer_full = JOIN customers BY id FULL OUTER, orders BY customer_id;

Verification

Verify the relation outer_full using the DUMP operator as shown below.

grunt> Dump outer_full;

Output

It will produce the following output, displaying the contents of the relation outer_full.
(1,Ramesh,32,Ahmedabad,2000,,,,)
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
(5,Hardik,27,Bhopal,8500,,,,)
(6,Komal,22,MP,4500,,,,)
(7,Muffy,24,Indore,10000,,,,)

Using Multiple Keys

We can perform the JOIN operation using multiple keys.

Syntax

Here is how you can perform a JOIN operation on two tables using multiple keys.

grunt> Relation3_name = JOIN Relation1_name BY (key1, key2), Relation2_name BY (key1, key2);

Assume that we have two files namely employee.txt and employee_contact.txt in the /pig_data/ directory of HDFS as shown