Apache Pig – Execution

In the previous chapter, we explained how to install Apache Pig. In this chapter, we will discuss how to execute Apache Pig.

Apache Pig Execution Modes

You can run Apache Pig in two modes, namely, Local mode and MapReduce (HDFS) mode.

Local Mode

In this mode, all the files are installed and run from your local host and local file system. There is no need for Hadoop or HDFS. This mode is generally used for testing purposes.

MapReduce Mode

MapReduce mode is where we load or process data that exists in the Hadoop Distributed File System (HDFS) using Apache Pig. In this mode, whenever we execute Pig Latin statements to process the data, a MapReduce job is invoked in the back-end to perform a particular operation on the data that exists in HDFS.

Apache Pig Execution Mechanisms

Apache Pig scripts can be executed in three ways, namely, interactive mode, batch mode, and embedded mode.

Interactive Mode (Grunt shell) − You can run Apache Pig in interactive mode using the Grunt shell. In this shell, you can enter Pig Latin statements and get the output (using the Dump operator).

Batch Mode (Script) − You can run Apache Pig in batch mode by writing the Pig Latin script in a single file with the .pig extension.

Embedded Mode (UDF) − Apache Pig provides the provision of defining our own functions (User Defined Functions) in programming languages such as Java, and using them in our script.

Invoking the Grunt Shell

You can invoke the Grunt shell in the desired mode (local/MapReduce) using the -x option as shown below.

Local mode −
$ ./pig -x local

MapReduce mode −
$ ./pig -x mapreduce

Either of these commands gives you the Grunt shell prompt as shown below.

grunt>

You can exit the Grunt shell using 'ctrl + d'.

After invoking the Grunt shell, you can execute a Pig script by directly entering the Pig Latin statements in it.

grunt> customers = LOAD 'customers.txt' USING PigStorage(',');

Executing Apache Pig in Batch Mode

You can write an entire Pig Latin script in a file and execute it by passing the file name to the pig command; the -x option again selects the execution mode. Let us suppose we have a Pig script in a file named sample_script.pig as shown below.

sample_script.pig

student = LOAD 'hdfs://localhost:9000/pig_data/student.txt' USING PigStorage(',')
   as (id:int, name:chararray, city:chararray);

Dump student;

Now, you can execute the script in the above file as shown below.

Local mode −
$ pig -x local sample_script.pig

MapReduce mode −
$ pig -x mapreduce sample_script.pig

Note − We will discuss in detail how to run a Pig script in batch mode and in embedded mode in subsequent chapters.
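Before moving on, here is a minimal end-to-end local-mode session that ties the pieces above together. The file people.txt and its two rows (Alice,30 and Bob,25) are assumptions invented for illustration, not part of any standard dataset.

$ pig -x local

grunt> people = LOAD 'people.txt' USING PigStorage(',') AS (name:chararray, age:int);
grunt> Dump people;
(Alice,30)
(Bob,25)

Because the shell was started in local mode, the LOAD statement reads people.txt from the local file system, and Dump prints the tuples to stdout without touching HDFS.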
Apache Pig – Grunt Shell
After invoking the Grunt shell, you can run your Pig scripts in it. In addition to that, there are certain useful shell and utility commands provided by the Grunt shell. This chapter explains those shell and utility commands.

Note − In some portions of this chapter, commands like Load and Store are used. Refer to the respective chapters for detailed information on them.

Shell Commands

The Grunt shell of Apache Pig is mainly used to write Pig Latin scripts. Prior to that, we can invoke any shell commands using sh and fs.

sh Command

Using the sh command, we can invoke any shell commands from the Grunt shell. Note that we cannot execute commands that are a part of the shell environment (for example, cd) using sh from the Grunt shell.

Syntax

Given below is the syntax of the sh command.

grunt> sh shell_command parameters

Example

We can invoke the ls command of the Linux shell from the Grunt shell using the sh option as shown below. In this example, it lists out the files in the /pig/bin/ directory.

grunt> sh ls

pig
pig_1444799121955.log
pig.cmd
pig.py

fs Command

Using the fs command, we can invoke any FsShell commands from the Grunt shell.

Syntax

Given below is the syntax of the fs command.

grunt> fs file_system_command parameters

Example

We can invoke the ls command of HDFS from the Grunt shell using the fs command. In the following example, it lists the files in the HDFS root directory.

grunt> fs -ls

Found 3 items
drwxrwxrwx   - Hadoop supergroup          0 2015-09-08 14:13 Hbase
drwxr-xr-x   - Hadoop supergroup          0 2015-09-09 14:52 seqgen_data
drwxr-xr-x   - Hadoop supergroup          0 2015-09-08 11:30 twitter_data

In the same way, we can invoke all the other file system shell commands from the Grunt shell using the fs command.

Utility Commands

The Grunt shell provides a set of utility commands. These include utility commands such as clear, help, history, quit, and set; and commands such as exec, kill, and run to control Pig from the Grunt shell. Given below is the description of the utility commands provided by the Grunt shell.

clear Command

The clear command is used to clear the screen of the Grunt shell.

Syntax

You can clear the screen of the Grunt shell using the clear command as shown below.

grunt> clear

help Command

The help command gives you a list of Pig commands or Pig properties.

Usage

You can get a list of Pig commands using the help command as shown below.

grunt> help

Commands:
<pig latin statement>; - See the PigLatin manual for details: http://hadoop.apache.org/pig

File system commands:
fs <fs arguments> - Equivalent to Hadoop dfs command: http://hadoop.apache.org/common/docs/current/hdfs_shell.html

Diagnostic Commands:
describe <alias>[::<alias>] - Show the schema for the alias. Inner aliases can be described as A::B.
explain [-script <pigscript>] [-out <path>] [-brief] [-dot|-xml] [-param <param_name>=<param_value>] [-param_file <file_name>] [<alias>] - Show the execution plan to compute the alias or for the entire script.
    -script - Explain the entire script.
    -out - Store the output into directory rather than print to stdout.
    -brief - Don't expand nested plans (presenting a smaller graph for overview).
    -dot - Generate the output in .dot format. Default is text format.
    -xml - Generate the output in .xml format. Default is text format.
    -param <param_name>=<param_value> - See parameter substitution for details.
    -param_file <file_name> - See parameter substitution for details.
    alias - Alias to explain.
dump <alias> - Compute the alias and writes the results to stdout.
Utility Commands:
exec [-param <param_name>=<param_value>] [-param_file <file_name>] <script> - Execute the script with access to grunt environment including aliases.
    -param <param_name>=<param_value> - See parameter substitution for details.
    -param_file <file_name> - See parameter substitution for details.
    script - Script to be executed.
run [-param <param_name>=<param_value>] [-param_file <file_name>] <script> - Execute the script with access to grunt environment.
    -param <param_name>=<param_value> - See parameter substitution for details.
    -param_file <file_name> - See parameter substitution for details.
    script - Script to be executed.
sh <shell command> - Invoke a shell command.
kill <job_id> - Kill the hadoop job specified by the hadoop job id.
set <key> <value> - Provide execution parameters to Pig. Keys and values are case sensitive.
    The following keys are supported:
    default_parallel - Script-level reduce parallelism. Basic input size heuristics used by default.
    debug - Set debug on or off. Default is off.
    job.name - Single-quoted name for jobs. Default is PigLatin:<script name>
    job.priority - Priority for jobs. Values: very_low, low, normal, high, very_high. Default is normal.
    stream.skippath - String that contains the path. This is used by streaming.
    In addition, any Hadoop property can be set.
help - Display this message.
history [-n] - Display the list of statements in cache.
    -n - Hide line numbers.
quit - Quit the grunt shell.

history Command

This command displays a list of the statements executed since the Grunt shell was invoked.

Usage

Assume we have executed three statements since opening the Grunt shell.

grunt> customers = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',');
grunt> orders = LOAD 'hdfs://localhost:9000/pig_data/orders.txt' USING PigStorage(',');
grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student.txt' USING PigStorage(',');

Then, using the history command will produce the following output.

grunt> history

customers = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',');
orders = LOAD 'hdfs://localhost:9000/pig_data/orders.txt' USING PigStorage(',');
student = LOAD 'hdfs://localhost:9000/pig_data/student.txt' USING PigStorage(',');

set Command

The set command is used to show/assign values to keys used in Pig.

Usage

Using this command, you can set values to the following keys.

default_parallel − You can set the number of reducers for a MapReduce job by passing any whole number as a value to this key.

debug − You can turn the debugging feature in Pig on or off by passing on/off to this key.

job.name − You can set the job name for the required job by passing a string value to this key.

job.priority − You can set the priority of a job by passing one of the following values to this key − very_low, low, normal, high, very_high.

stream.skippath − For streaming, you can set the path from which the data is not to be shipped, by passing the desired path as a string to this key.
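Putting several of these commands together, the following Grunt session sketch shows sh, fs, and set in context. The HDFS path /pig_data and the job name 'customer-load' are assumptions chosen for illustration.

grunt> sh date
grunt> fs -ls /pig_data
grunt> set job.name 'customer-load'
grunt> set default_parallel 5
grunt> customers = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',');
grunt> Dump customers;

Here sh date runs the local date command, fs -ls lists the assumed HDFS directory, and the two set statements name the upcoming MapReduce job and request five reducers before the Dump triggers it.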
Apache Pig – Installation
This chapter explains how to download, install, and set up Apache Pig on your system.

Prerequisites

It is essential that you have Hadoop and Java installed on your system before you go for Apache Pig. Therefore, prior to installing Apache Pig, install Hadoop and Java by following the steps given in the following link −

https://www.tutorialspoint.com/hadoop/hadoop_enviornment_setup.htm

Download Apache Pig

First of all, download the latest version of Apache Pig from the following website − https://pig.apache.org/

Step 1 − Open the homepage of the Apache Pig website. Under the section News, click on the link release page.

Step 2 − On clicking the specified link, you will be redirected to the Apache Pig Releases page. On this page, under the Download section, you will have two links, namely, Pig 0.8 and later and Pig 0.7 and before. Click on the link Pig 0.8 and later; you will then be redirected to a page having a set of mirrors.

Step 3 − Choose and click any one of these mirrors.

Step 4 − These mirrors will take you to the Pig Releases page. This page contains various versions of Apache Pig. Click the latest version among them.

Step 5 − Within these folders, you will have the source and binary files of Apache Pig in various distributions. Download the tar files of the source and binary files of Apache Pig 0.15: pig-0.15.0-src.tar.gz and pig-0.15.0.tar.gz.

Install Apache Pig

After downloading the Apache Pig software, install it in your Linux environment by following the steps given below.

Step 1 − Create a directory with the name Pig in the same directory where the installation directories of Hadoop, Java, and other software were installed. (In our tutorial, we have created the Pig directory in the home directory of the user named Hadoop.)

$ mkdir Pig

Step 2 − Extract the downloaded tar files as shown below.

$ cd Downloads/
$ tar zxvf pig-0.15.0-src.tar.gz
$ tar zxvf pig-0.15.0.tar.gz

Step 3 − Move the contents of the extracted pig-0.15.0-src directory to the Pig directory created earlier as shown below.

$ mv pig-0.15.0-src/* /home/Hadoop/Pig/

Configure Apache Pig

After installing Apache Pig, we have to configure it. To configure it, we need to edit two files − .bashrc and pig.properties.

.bashrc file

In the .bashrc file, set the following variables − PIG_HOME to Apache Pig's installation folder, PATH to include the bin folder, and PIG_CLASSPATH to the etc (configuration) folder of your Hadoop installation (the directory that contains the core-site.xml, hdfs-site.xml and mapred-site.xml files).

export PIG_HOME=/home/Hadoop/Pig
export PATH=$PATH:/home/Hadoop/Pig/bin
export PIG_CLASSPATH=$HADOOP_HOME/conf

pig.properties file

In the conf folder of Pig, we have a file named pig.properties. In the pig.properties file, you can set various parameters as given below. You can list the supported properties with the following command.

pig -h properties

The following properties are supported −

Logging:
verbose = true|false; default is false. This property is the same as the -v switch.
brief = true|false; default is false. This property is the same as the -b switch.
debug = OFF|ERROR|WARN|INFO|DEBUG; default is INFO. This property is the same as the -d switch.
aggregate.warning = true|false; default is true. If true, prints the count of warnings of each type rather than logging each warning.

Performance tuning:
pig.cachedbag.memusage=<mem fraction>; default is 0.2 (20% of all memory). Note that this memory is shared across all large bags used by the application.
pig.skewedjoin.reduce.memusage=<mem fraction>; default is 0.3 (30% of all memory). Specifies the fraction of heap available for the reducer to perform the join.
pig.exec.nocombiner = true|false; default is false. Only disable the combiner as a temporary workaround for problems.
opt.multiquery = true|false; multiquery is on by default. Only disable multiquery as a temporary workaround for problems.
opt.fetch = true|false; fetch is on by default. Scripts containing Filter, Foreach, Limit, Stream, and Union can be dumped without MR jobs.
pig.tmpfilecompression = true|false; compression is off by default. Determines whether the output of intermediate jobs is compressed.
pig.tmpfilecompression.codec = lzo|gzip; default is gzip. Used in conjunction with pig.tmpfilecompression. Defines the compression type.
pig.noSplitCombination = true|false. Split combination is on by default. Determines if multiple small files are combined into a single map.
pig.exec.mapPartAgg = true|false. Default is false. Determines if partial aggregation is done within the map phase, before records are sent to the combiner.
pig.exec.mapPartAgg.minReduction=<min aggregation factor>. Default is 10. If the in-map partial aggregation does not reduce the number of output records by this factor, it gets disabled.

Miscellaneous:
exectype = mapreduce|tez|local; default is mapreduce. This property is the same as the -x switch.
pig.additional.jars.uris=<comma separated list of jars>. Used in place of the register command.
udf.import.list=<comma separated list of imports>. Used to avoid package names in UDFs.
stop.on.failure = true|false; default is false. Set to true to terminate on the first error.
pig.datetime.default.tz=<UTC time offset>, e.g. +08:00. Default is the default timezone of the host. Determines the timezone used to handle the datetime datatype and UDFs.

Additionally, any Hadoop property can be specified.

Verifying the Installation

Verify the installation of Apache Pig by typing the version command. If the installation is successful, you will get the version of Apache Pig as shown below.

$ pig -version

Apache Pig version 0.15.0 (r1682971)
compiled Jun 01 2015, 11:44:35
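To tie the configuration steps together, here is what a minimal pig.properties might look like, using only keys from the list above. The values shown are illustrative examples, not recommendations.

# pig.properties - example values only
exectype=mapreduce
debug=INFO
aggregate.warning=true
pig.tmpfilecompression=true
pig.tmpfilecompression.codec=gzip
stop.on.failure=false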
Apache Pig – Home
Apache Pig is an abstraction over MapReduce. It is a tool/platform used to analyze large data sets, representing them as data flows. Pig is generally used with Hadoop; we can perform all the data manipulation operations in Hadoop using Pig.

Audience

This tutorial is meant for all those professionals working on Hadoop who would like to perform MapReduce operations without having to type complex Java code.

Prerequisites

To make the most of this tutorial, you should have a good understanding of the basics of Hadoop and HDFS commands. It will certainly help if you are good at SQL.
Apache Pig – Foreach Operator

The FOREACH operator is used to generate specified data transformations based on the column data.

Syntax

Given below is the syntax of the FOREACH operator.

grunt> Relation_name2 = FOREACH Relation_name1 GENERATE (required data);

Example

Assume that we have a file named student_details.txt in the HDFS directory /pig_data/ as shown below.

student_details.txt

001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai

And we have loaded this file into Pig with the relation name student_details as shown below.

grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);

Let us now get the id, age, and city values of each student from the relation student_details and store them in another relation named foreach_data using the FOREACH operator as shown below.

grunt> foreach_data = FOREACH student_details GENERATE id,age,city;

Verification

Verify the relation foreach_data using the DUMP operator as shown below.

grunt> Dump foreach_data;

Output

It will produce the following output, displaying the contents of the relation foreach_data.

(1,21,Hyderabad)
(2,22,Kolkata)
(3,22,Delhi)
(4,21,Pune)
(5,23,Bhuwaneshwar)
(6,23,Chennai)
(7,24,trivendram)
(8,24,Chennai)
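FOREACH is not limited to projecting columns; it can also apply expressions and Pig's built-in functions to them. Below is a small sketch reusing the student_details relation above; UPPER is a standard Pig built-in, and the alias city_data is an arbitrary name chosen for illustration.

grunt> city_data = FOREACH student_details GENERATE id, UPPER(city) AS city;
grunt> Dump city_data;

Given the sample data above, this would uppercase the city column, producing tuples such as (1,HYDERABAD) and (7,TRIVENDRAM).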