Apache Pig – Cogroup Operator

The COGROUP operator works more or less in the same way as the GROUP operator. The only difference between the two is that the GROUP operator is normally used with one relation, while the COGROUP operator is used in statements involving two or more relations.

Grouping Two Relations using Cogroup

Assume that we have two files, student_details.txt and employee_details.txt, in the HDFS directory /pig_data/ as shown below.

student_details.txt

001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai

employee_details.txt

001,Robin,22,newyork
002,BOB,23,Kolkata
003,Maya,23,Tokyo
004,Sara,25,London
005,David,23,Bhuwaneshwar
006,Maggy,22,Chennai

And we have loaded these files into Pig with the relation names student_details and employee_details respectively, as shown below.

grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);

grunt> employee_details = LOAD 'hdfs://localhost:9000/pig_data/employee_details.txt' USING PigStorage(',')
   as (id:int, name:chararray, age:int, city:chararray);

Now, let us group the records/tuples of the relations student_details and employee_details with the key age, as shown below.

grunt> cogroup_data = COGROUP student_details BY age, employee_details BY age;

Verification

Verify the relation cogroup_data using the DUMP operator as shown below.

grunt> Dump cogroup_data;

Output

It will produce the following output, displaying the contents of the relation named cogroup_data.
(21,{(4,Preethi,Agarwal,21,9848022330,Pune),(1,Rajiv,Reddy,21,9848022337,Hyderabad)},{})
(22,{(3,Rajesh,Khanna,22,9848022339,Delhi),(2,siddarth,Battacharya,22,9848022338,Kolkata)},{(6,Maggy,22,Chennai),(1,Robin,22,newyork)})
(23,{(6,Archana,Mishra,23,9848022335,Chennai),(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar)},{(5,David,23,Bhuwaneshwar),(3,Maya,23,Tokyo),(2,BOB,23,Kolkata)})
(24,{(8,Bharathi,Nambiayar,24,9848022333,Chennai),(7,Komal,Nayak,24,9848022334,trivendram)},{})
(25,{},{(4,Sara,25,London)})

The COGROUP operator groups the tuples from each relation according to age, where each group represents a particular age value. For example, consider the first tuple of the result: it is grouped by age 21 and contains two bags. The first bag holds all the tuples from the first relation (student_details in this case) having age 21, and the second bag holds all the tuples from the second relation (employee_details in this case) having age 21. If a relation has no tuples with a given age value, the corresponding bag is empty.
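The grouping above can also be sketched outside Pig. The following Python snippet (an illustration only, not Pig's implementation) reproduces the COGROUP semantics on a reduced sample of the data: for each key it collects one bag per input relation, leaving a bag empty when a relation has no matching tuples.

```python
from collections import defaultdict

# A few rows mirroring student_details.txt and employee_details.txt
# (only id, name, age are kept for brevity)
students = [(1, "Rajiv", 21), (2, "siddarth", 22), (4, "Preethi", 21)]
employees = [(1, "Robin", 22), (4, "Sara", 25)]

def cogroup(rel1, rel2, key1, key2):
    """COGROUP-like grouping: for every key value, one bag per relation."""
    groups = defaultdict(lambda: ([], []))
    for t in rel1:
        groups[t[key1]][0].append(t)   # first bag: tuples from relation 1
    for t in rel2:
        groups[t[key2]][1].append(t)   # second bag: tuples from relation 2
    return dict(groups)

cogrouped = cogroup(students, employees, key1=2, key2=2)
# Age 21 appears only among students, so the employee bag is empty.
print(cogrouped[21])   # ([(1, 'Rajiv', 21), (4, 'Preethi', 21)], [])
print(cogrouped[25])   # ([], [(4, 'Sara', 25)])
```

As in Pig, a key that is present in only one relation still produces a result row; the bag for the other relation is simply empty.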
Apache Pig – Bag & Tuple Functions

Given below is the list of Bag and Tuple functions.

1. TOBAG(): Converts two or more expressions into a bag.
2. TOP(): Gets the top N tuples of a relation.
3. TOTUPLE(): Converts one or more expressions into a tuple.
4. TOMAP(): Converts key-value pairs into a map.
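As a rough analogy (an illustration only; Pig bags are unordered multisets of tuples, not Python lists), the conversions performed by TOTUPLE() and TOBAG() can be pictured in Python like this:

```python
def to_tuple(*expressions):
    """TOTUPLE-like: pack one or more values into a single tuple."""
    return tuple(expressions)

def to_bag(*expressions):
    """TOBAG-like: wrap each expression in its own tuple and collect
    them in a bag (modelled here as a list of one-field tuples)."""
    return [(e,) for e in expressions]

print(to_tuple("Rajiv", "Reddy"))   # ('Rajiv', 'Reddy')
print(to_bag("Rajiv", "Reddy"))     # [('Rajiv',), ('Reddy',)]
```

The key difference the sketch highlights: TOTUPLE() yields one tuple containing all the expressions, while TOBAG() yields a bag in which each expression becomes a separate tuple.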
Apache Pig – Date-time Functions

Apache Pig provides the following Date and Time functions:

1. ToDate(milliseconds): Returns a date-time object according to the given parameters. The other variants of this function are ToDate(isostring), ToDate(userstring, format), and ToDate(userstring, format, timezone).
2. CurrentTime(): Returns the date-time object for the current time.
3. GetDay(datetime): Returns the day of a month from the date-time object.
4. GetHour(datetime): Returns the hour of a day from the date-time object.
5. GetMilliSecond(datetime): Returns the millisecond of a second from the date-time object.
6. GetMinute(datetime): Returns the minute of an hour from the date-time object.
7. GetMonth(datetime): Returns the month of a year from the date-time object.
8. GetSecond(datetime): Returns the second of a minute from the date-time object.
9. GetWeek(datetime): Returns the week of a year from the date-time object.
10. GetWeekYear(datetime): Returns the week year from the date-time object.
11. GetYear(datetime): Returns the year from the date-time object.
12. AddDuration(datetime, duration): Adds the given duration to the date-time object and returns the result.
13. SubtractDuration(datetime, duration): Subtracts the duration object from the date-time object and returns the result.
14. DaysBetween(datetime1, datetime2): Returns the number of days between the two date-time objects.
15. HoursBetween(datetime1, datetime2): Returns the number of hours between two date-time objects.
16. MilliSecondsBetween(datetime1, datetime2): Returns the number of milliseconds between two date-time objects.
17. MinutesBetween(datetime1, datetime2): Returns the number of minutes between two date-time objects.
18. MonthsBetween(datetime1, datetime2): Returns the number of months between two date-time objects.
19. SecondsBetween(datetime1, datetime2): Returns the number of seconds between two date-time objects.
20. WeeksBetween(datetime1, datetime2): Returns the number of weeks between the two date-time objects.
21. YearsBetween(datetime1, datetime2): Returns the number of years between the two date-time objects.
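For readers more familiar with Python, several of these functions have direct datetime analogues (an illustration only; Pig's implementations are not based on Python):

```python
from datetime import datetime

dt1 = datetime(2015, 10, 5, 13, 5, 5)
dt2 = datetime(2015, 10, 8, 13, 5, 5)

# GetDay / GetMonth / GetYear analogues: extract calendar fields
print(dt1.day, dt1.month, dt1.year)        # 5 10 2015

# DaysBetween / HoursBetween analogues: differences between two date-times
delta = dt2 - dt1
print(delta.days)                          # 3
print(int(delta.total_seconds() // 3600))  # 72
```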
Apache Pig – Storing Data

In the previous chapter, we learnt how to load data into Apache Pig. You can store the loaded data in the file system using the STORE operator. This chapter explains how to store data in Apache Pig using the STORE operator.

Syntax

Given below is the syntax of the STORE statement.

STORE Relation_name INTO 'required_directory_path' [USING function];

Example

Assume we have a file student_data.txt in HDFS with the following content.

001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai

And we have read it into a relation student using the LOAD operator as shown below.

grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);

Now, let us store the relation in the HDFS directory /pig_Output/ as shown below.

grunt> STORE student INTO 'hdfs://localhost:9000/pig_Output/' USING PigStorage(',');

Output

After executing the STORE statement, you will get the following output. A directory is created with the specified name and the data is stored in it.

2015-10-05 13:05:05,429 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2015-10-05 13:05:05,429 [main] INFO org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - Script Statistics:

HadoopVersion  PigVersion  UserId  StartedAt            FinishedAt           Features
2.6.0          0.15.0     Hadoop  2015-10-05 13:03:03  2015-10-05 13:05:05  UNKNOWN

Success!
Job Stats (time in seconds):

JobId         Maps  Reduces  MaxMapTime  MinMapTime  AvgMapTime  MedianMapTime  MaxReduceTime  MinReduceTime  AvgReduceTime  MedianReducetime  Alias    Feature
job_14459_06  1     0        n/a         n/a         n/a         n/a            0              0              0              0                 student  MAP_ONLY

Output folder:
hdfs://localhost:9000/pig_Output/

Input(s): Successfully read 0 records from: "hdfs://localhost:9000/pig_data/student_data.txt"
Output(s): Successfully stored 0 records in: "hdfs://localhost:9000/pig_Output"

Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG: job_1443519499159_0006

2015-10-05 13:06:06,192 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!

Verification

You can verify the stored data as shown below.

Step 1

First of all, list out the files in the directory named pig_Output using the ls command as shown below.

hdfs dfs -ls 'hdfs://localhost:9000/pig_Output/'

Found 2 items
-rw-r--r--   1 Hadoop supergroup     0  2015-10-05 13:03  hdfs://localhost:9000/pig_Output/_SUCCESS
-rw-r--r--   1 Hadoop supergroup   224  2015-10-05 13:03  hdfs://localhost:9000/pig_Output/part-m-00000

You can observe that two files were created after executing the STORE statement.

Step 2

Using the cat command, list the contents of the file named part-m-00000 as shown below.

$ hdfs dfs -cat 'hdfs://localhost:9000/pig_Output/part-m-00000'

1,Rajiv,Reddy,9848022337,Hyderabad
2,siddarth,Battacharya,9848022338,Kolkata
3,Rajesh,Khanna,9848022339,Delhi
4,Preethi,Agarwal,9848022330,Pune
5,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
6,Archana,Mishra,9848022335,Chennai
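The part-m-00000 file above is plain delimited text: PigStorage(',') writes each tuple's fields joined by the chosen delimiter, one tuple per line. A minimal Python sketch of that serialization (an illustration only, not Pig's actual code):

```python
def pig_storage_store(tuples, delimiter=","):
    """Serialize tuples the way PigStorage does on output:
    one line per tuple, fields joined by the delimiter."""
    return "\n".join(delimiter.join(str(f) for f in t) for t in tuples)

student = [(1, "Rajiv", "Reddy", "9848022337", "Hyderabad"),
           (2, "siddarth", "Battacharya", "9848022338", "Kolkata")]
print(pig_storage_store(student))
# 1,Rajiv,Reddy,9848022337,Hyderabad
# 2,siddarth,Battacharya,9848022338,Kolkata
```

Passing a different delimiter (for example a tab) to PigStorage changes the field separator in the stored file in exactly the same way.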
Apache Pig – Illustrate Operator

The ILLUSTRATE operator gives you the step-by-step execution of a sequence of statements on a small sample of the data.

Syntax

Given below is the syntax of the ILLUSTRATE operator.

grunt> illustrate Relation_name;

Example

Assume we have a file student_data.txt in HDFS with the following content.

001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai

And we have read it into a relation student using the LOAD operator as shown below.

grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);

Now, let us illustrate the relation named student as shown below.

grunt> illustrate student;

Output

On executing the above statement, you will get the following output.

grunt> illustrate student;

INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map - Aliases being processed per job phase (AliasName[line,offset]): M: student[1,10] C: R:
---------------------------------------------------------------------------------------------
|student | id:int | firstname:chararray | lastname:chararray | phone:chararray | city:chararray |
---------------------------------------------------------------------------------------------
|        | 002    | siddarth            | Battacharya        | 9848022338      | Kolkata        |
---------------------------------------------------------------------------------------------
Apache Pig – Diagnostic Operators

The LOAD statement simply loads the data into the specified relation in Apache Pig. To verify the execution of the LOAD statement, you have to use the diagnostic operators. Pig Latin provides four different types of diagnostic operators:

Dump operator
Describe operator
Explain operator
Illustrate operator

In this chapter, we will discuss the Dump operator of Pig Latin.

Dump Operator

The Dump operator is used to run the Pig Latin statements and display the results on the screen. It is generally used for debugging purposes.

Syntax

Given below is the syntax of the Dump operator.

grunt> Dump Relation_Name

Example

Assume we have a file student_data.txt in HDFS with the following content.

001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai

And we have read it into a relation student using the LOAD operator as shown below.

grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);

Now, let us print the contents of the relation using the Dump operator as shown below.

grunt> Dump student

Once you execute the above Pig Latin statement, it will start a MapReduce job to read data from HDFS. It will produce the following output.

2015-10-01 15:05:27,642 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2015-10-01 15:05:27,652 [main] INFO org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - Script Statistics:

HadoopVersion  PigVersion  UserId  StartedAt            FinishedAt           Features
2.6.0          0.15.0     Hadoop  2015-10-01 15:03:11  2015-10-01 15:05:27  UNKNOWN

Success!
Job Stats (time in seconds):

JobId           Maps  Reduces  MaxMapTime  MinMapTime  AvgMapTime  MedianMapTime  MaxReduceTime  MinReduceTime  AvgReduceTime  MedianReducetime  Alias    Feature
job_14459_0004  1     0        n/a         n/a         n/a         n/a            0              0              0              0                 student  MAP_ONLY

Outputs:
hdfs://localhost:9000/tmp/temp580182027/tmp757878456

Input(s): Successfully read 0 records from: "hdfs://localhost:9000/pig_data/student_data.txt"
Output(s): Successfully stored 0 records in: "hdfs://localhost:9000/tmp/temp580182027/tmp757878456"

Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG: job_1443519499159_0004

2015-10-01 15:06:28,403 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2015-10-01 15:06:28,441 [main] INFO org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code.
2015-10-01 15:06:28,485 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2015-10-01 15:06:28,485 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1

(1,Rajiv,Reddy,9848022337,Hyderabad)
(2,siddarth,Battacharya,9848022338,Kolkata)
(3,Rajesh,Khanna,9848022339,Delhi)
(4,Preethi,Agarwal,9848022330,Pune)
(5,Trupthi,Mohanthy,9848022336,Bhuwaneshwar)
(6,Archana,Mishra,9848022335,Chennai)
Apache Pig – Join Operator

The JOIN operator is used to combine records from two or more relations. While performing a join operation, we declare one (or a group of) field(s) from each relation as keys. When these keys match, the corresponding tuples are matched; otherwise the records are dropped. Joins can be of the following types:

Self-join
Inner-join
Outer-join: left join, right join, and full join

This chapter explains with examples how to use the JOIN operator in Pig Latin. Assume that we have two files namely customers.txt and orders.txt in the /pig_data/ directory of HDFS as shown below.

customers.txt

1,Ramesh,32,Ahmedabad,2000.00
2,Khilan,25,Delhi,1500.00
3,kaushik,23,Kota,2000.00
4,Chaitali,25,Mumbai,6500.00
5,Hardik,27,Bhopal,8500.00
6,Komal,22,MP,4500.00
7,Muffy,24,Indore,10000.00

orders.txt

102,2009-10-08 00:00:00,3,3000
100,2009-10-08 00:00:00,3,1500
101,2009-11-20 00:00:00,2,1560
103,2008-05-20 00:00:00,4,2060

And we have loaded these two files into Pig with the relations customers and orders as shown below.

grunt> customers = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',')
   as (id:int, name:chararray, age:int, address:chararray, salary:int);

grunt> orders = LOAD 'hdfs://localhost:9000/pig_data/orders.txt' USING PigStorage(',')
   as (oid:int, date:chararray, customer_id:int, amount:int);

Let us now perform various join operations on these two relations.

Self-join

Self-join is used to join a table with itself, as if the table were two separate relations, temporarily renaming at least one of them. Generally, in Apache Pig, to perform a self-join we load the same data multiple times under different aliases (names). Therefore, let us load the contents of the file customers.txt as two tables as shown below.
grunt> customers1 = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',')
   as (id:int, name:chararray, age:int, address:chararray, salary:int);

grunt> customers2 = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',')
   as (id:int, name:chararray, age:int, address:chararray, salary:int);

Syntax

Given below is the syntax of performing a self-join operation using the JOIN operator.

grunt> Relation3_name = JOIN Relation1_name BY key, Relation2_name BY key;

Example

Let us perform a self-join operation on the relation customers, by joining the two relations customers1 and customers2 as shown below.

grunt> customers3 = JOIN customers1 BY id, customers2 BY id;

Verification

Verify the relation customers3 using the DUMP operator as shown below.

grunt> Dump customers3;

Output

It will produce the following output, displaying the contents of the relation customers3.

(1,Ramesh,32,Ahmedabad,2000,1,Ramesh,32,Ahmedabad,2000)
(2,Khilan,25,Delhi,1500,2,Khilan,25,Delhi,1500)
(3,kaushik,23,Kota,2000,3,kaushik,23,Kota,2000)
(4,Chaitali,25,Mumbai,6500,4,Chaitali,25,Mumbai,6500)
(5,Hardik,27,Bhopal,8500,5,Hardik,27,Bhopal,8500)
(6,Komal,22,MP,4500,6,Komal,22,MP,4500)
(7,Muffy,24,Indore,10000,7,Muffy,24,Indore,10000)

Inner Join

Inner join is used quite frequently; it is also referred to as equijoin. An inner join returns rows when there is a match in both tables. It creates a new relation by combining the column values of two relations (say A and B) based upon the join predicate. The query compares each row of A with each row of B to find all pairs of rows that satisfy the join predicate. When the join predicate is satisfied, the column values for each matched pair of rows of A and B are combined into a result row.

Syntax

Here is the syntax of performing an inner join operation using the JOIN operator.
grunt> result = JOIN relation1 BY columnname, relation2 BY columnname;

Example

Let us perform an inner join operation on the two relations customers and orders as shown below.

grunt> customer_orders = JOIN customers BY id, orders BY customer_id;

Verification

Verify the relation customer_orders using the DUMP operator as shown below.

grunt> Dump customer_orders;

Output

You will get the following output, displaying the contents of the relation named customer_orders.

(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)

Note: Unlike an inner join, an outer join returns all the rows from at least one of the relations. An outer join operation is carried out in three ways:

Left outer join
Right outer join
Full outer join

Left Outer Join

The left outer join operation returns all rows from the left relation, even if there are no matches in the right relation.

Syntax

Given below is the syntax of performing a left outer join operation using the JOIN operator.

grunt> Relation3_name = JOIN Relation1_name BY id LEFT OUTER, Relation2_name BY customer_id;

Example

Let us perform a left outer join operation on the two relations customers and orders as shown below.

grunt> outer_left = JOIN customers BY id LEFT OUTER, orders BY customer_id;

Verification

Verify the relation outer_left using the DUMP operator as shown below.

grunt> Dump outer_left;

Output

It will produce the following output, displaying the contents of the relation outer_left.
(1,Ramesh,32,Ahmedabad,2000,,,,)
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
(5,Hardik,27,Bhopal,8500,,,,)
(6,Komal,22,MP,4500,,,,)
(7,Muffy,24,Indore,10000,,,,)

Right Outer Join

The right outer join operation returns all rows from the right relation, even if there are no matches in the left relation.

Syntax

Given below is the syntax of performing a right outer join operation using the JOIN operator.

grunt> Relation3_name = JOIN Relation1_name BY id RIGHT OUTER, Relation2_name BY customer_id;

Example

Let us perform a right outer join operation on the two relations customers and orders as shown below.

grunt> outer_right = JOIN customers BY id RIGHT, orders BY customer_id;

Verification

Verify the relation outer_right using the DUMP operator as shown below.

grunt> Dump outer_right;

Output

It will produce the following output, displaying the contents of the relation outer_right.

(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)

Full Outer Join

The full outer join operation returns rows when there is a match in either of the relations.

Syntax

Given below is the syntax of performing a full outer join using the JOIN operator.

grunt> Relation3_name = JOIN Relation1_name BY id FULL OUTER, Relation2_name BY customer_id;

Example

Let us perform a full outer join operation on the two relations customers and orders as shown below.

grunt> outer_full = JOIN customers BY id FULL OUTER, orders BY customer_id;

Verification

Verify the relation outer_full using the DUMP operator as shown below.

grunt> Dump outer_full;

Output

It will produce the following output, displaying the contents of the relation outer_full.
(1,Ramesh,32,Ahmedabad,2000,,,,)
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
(5,Hardik,27,Bhopal,8500,,,,)
(6,Komal,22,MP,4500,,,,)
(7,Muffy,24,Indore,10000,,,,)

Using Multiple Keys

We can perform a JOIN operation using multiple keys.

Syntax

Here is how you can perform a JOIN operation on two tables using multiple keys.

grunt> Relation3_name = JOIN Relation1_name BY (key1, key2), Relation2_name BY (key1, key2);

Assume that we have two files namely employee.txt and employee_contact.txt in the /pig_data/ directory of HDFS as shown
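The four join variants in this chapter differ only in which unmatched rows they keep. As a rough sketch (an illustration only, using a plain nested-loop join in Python rather than Pig's implementation), with unmatched fields filled with nulls just as Pig fills them with empty values:

```python
def join(left, right, lkey, rkey, how="inner"):
    """Minimal nested-loop join over lists of tuples.
    how: 'inner', 'left', 'right', or 'full'."""
    lnulls = (None,) * len(left[0]) if left else ()
    rnulls = (None,) * len(right[0]) if right else ()
    result, rmatched = [], set()
    for lt in left:
        hit = False
        for i, rt in enumerate(right):
            if lt[lkey] == rt[rkey]:
                result.append(lt + rt)   # matched pair: combine the rows
                rmatched.add(i)
                hit = True
        if not hit and how in ("left", "full"):
            result.append(lt + rnulls)   # keep unmatched left rows
    if how in ("right", "full"):
        for i, rt in enumerate(right):
            if i not in rmatched:
                result.append(lnulls + rt)   # keep unmatched right rows
    return result

# Reduced versions of customers (id, name) and orders (oid, customer_id)
customers = [(1, "Ramesh"), (2, "Khilan")]
orders = [(101, 2), (102, 2)]
print(join(customers, orders, 0, 1, "inner"))
# [(2, 'Khilan', 101, 2), (2, 'Khilan', 102, 2)]
print(join(customers, orders, 0, 1, "left"))
# [(1, 'Ramesh', None, None), (2, 'Khilan', 101, 2), (2, 'Khilan', 102, 2)]
```

An inner join drops Ramesh (no orders), while the left outer join keeps him with null order fields, mirroring the `(1,Ramesh,32,Ahmedabad,2000,,,,)` rows in the Pig output above.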
Apache Pig – Union Operator

The UNION operator of Pig Latin is used to merge the contents of two relations. To perform a UNION operation on two relations, their columns and domains must be identical.

Syntax

Given below is the syntax of the UNION operator.

grunt> Relation_name3 = UNION Relation_name1, Relation_name2;

Example

Assume that we have two files namely student_data1.txt and student_data2.txt in the /pig_data/ directory of HDFS as shown below.

student_data1.txt

001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai

student_data2.txt

7,Komal,Nayak,9848022334,trivendram
8,Bharathi,Nambiayar,9848022333,Chennai

And we have loaded these two files into Pig with the relations student1 and student2 as shown below.

grunt> student1 = LOAD 'hdfs://localhost:9000/pig_data/student_data1.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);

grunt> student2 = LOAD 'hdfs://localhost:9000/pig_data/student_data2.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);

Let us now merge the contents of these two relations using the UNION operator as shown below.

grunt> student = UNION student1, student2;

Verification

Verify the relation student using the DUMP operator as shown below.

grunt> Dump student;

Output

It will display the following output, displaying the contents of the relation student.

(1,Rajiv,Reddy,9848022337,Hyderabad)
(2,siddarth,Battacharya,9848022338,Kolkata)
(3,Rajesh,Khanna,9848022339,Delhi)
(4,Preethi,Agarwal,9848022330,Pune)
(5,Trupthi,Mohanthy,9848022336,Bhuwaneshwar)
(6,Archana,Mishra,9848022335,Chennai)
(7,Komal,Nayak,9848022334,trivendram)
(8,Bharathi,Nambiayar,9848022333,Chennai)
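UNION is essentially bag concatenation: it keeps duplicates and, in real Pig, does not guarantee any particular ordering of the result. A quick Python sketch of the semantics (an illustration only):

```python
def union(rel1, rel2):
    """UNION-like merge: concatenates the tuples of two relations.
    Schemas (field counts) must match; duplicates are kept."""
    assert not rel1 or not rel2 or len(rel1[0]) == len(rel2[0]), "schemas must match"
    return rel1 + rel2

student1 = [(1, "Rajiv"), (2, "siddarth")]
student2 = [(7, "Komal"), (8, "Bharathi")]
print(union(student1, student2))
# [(1, 'Rajiv'), (2, 'siddarth'), (7, 'Komal'), (8, 'Bharathi')]
```

If you need duplicates removed after a UNION, Pig provides the separate DISTINCT operator for that.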
Apache Pig – Load & Store Functions

The load and store functions in Apache Pig are used to determine how data goes into and comes out of Pig. These functions are used with the LOAD and STORE operators. Given below is the list of load and store functions available in Pig.

1. PigStorage(): Loads and stores structured files.
2. TextLoader(): Loads unstructured data into Pig.
3. BinStorage(): Loads and stores data in Pig using a machine-readable format.
4. Handling Compression: In Pig Latin, we can load and store compressed data.
Apache Pig – Limit Operator

The LIMIT operator is used to get a limited number of tuples from a relation.

Syntax

Given below is the syntax of the LIMIT operator.

grunt> Result = LIMIT Relation_name number_of_tuples;

Example

Assume that we have a file named student_details.txt in the HDFS directory /pig_data/ as shown below.

student_details.txt

001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai

And we have loaded this file into Pig with the relation name student_details as shown below.

grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);

Now, let us get the first four tuples of the relation student_details and store them in another relation named limit_data using the LIMIT operator as shown below.

grunt> limit_data = LIMIT student_details 4;

Verification

Verify the relation limit_data using the DUMP operator as shown below.

grunt> Dump limit_data;

Output

It will produce the following output, displaying the contents of the relation limit_data as follows.

(1,Rajiv,Reddy,21,9848022337,Hyderabad)
(2,siddarth,Battacharya,22,9848022338,Kolkata)
(3,Rajesh,Khanna,22,9848022339,Delhi)
(4,Preethi,Agarwal,21,9848022330,Pune)
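LIMIT simply truncates the relation to at most the requested number of tuples, much like list slicing in Python (an illustration only; note that in Pig, without a preceding ORDER BY, there is no guarantee of which tuples are returned):

```python
def limit(relation, n):
    """LIMIT-like: keep at most n tuples from the relation."""
    return relation[:n]

student_details = [(1, "Rajiv", 21), (2, "siddarth", 22),
                   (3, "Rajesh", 22), (4, "Preethi", 21),
                   (5, "Trupthi", 23)]
print(limit(student_details, 4))
# [(1, 'Rajiv', 21), (2, 'siddarth', 22), (3, 'Rajesh', 22), (4, 'Preethi', 21)]
```

Asking for more tuples than the relation contains is not an error; the whole relation is returned, just as in Pig.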