Apache Pig – Join Operator

The JOIN operator is used to combine records from two or more relations. While performing a join operation, we declare one (or a group of) tuple(s) from each relation as keys. When these keys match, the corresponding tuples are matched; otherwise the records are dropped. Joins can be of the following types −

Self-join
Inner join
Outer join − left join, right join, and full join

This chapter explains with examples how to use the JOIN operator in Pig Latin.

Assume that we have two files namely customers.txt and orders.txt in the /pig_data/ directory of HDFS as shown below.

customers.txt

1,Ramesh,32,Ahmedabad,2000.00
2,Khilan,25,Delhi,1500.00
3,kaushik,23,Kota,2000.00
4,Chaitali,25,Mumbai,6500.00
5,Hardik,27,Bhopal,8500.00
6,Komal,22,MP,4500.00
7,Muffy,24,Indore,10000.00

orders.txt

102,2009-10-08 00:00:00,3,3000
100,2009-10-08 00:00:00,3,1500
101,2009-11-20 00:00:00,2,1560
103,2008-05-20 00:00:00,4,2060

And we have loaded these two files into Pig with the relations customers and orders as shown below.

grunt> customers = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',')
   as (id:int, name:chararray, age:int, address:chararray, salary:int);

grunt> orders = LOAD 'hdfs://localhost:9000/pig_data/orders.txt' USING PigStorage(',')
   as (oid:int, date:chararray, customer_id:int, amount:int);

Let us now perform various join operations on these two relations.

Self-join

Self-join is used to join a table with itself as if the table were two relations, temporarily renaming at least one relation. Generally, in Apache Pig, to perform a self-join, we load the same data multiple times under different aliases (names). Therefore, let us load the contents of the file customers.txt as two relations as shown below.

grunt> customers1 = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',')
   as (id:int, name:chararray, age:int, address:chararray, salary:int);

grunt> customers2 = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',')
   as (id:int, name:chararray, age:int, address:chararray, salary:int);

Syntax

Given below is the syntax of performing a self-join operation using the JOIN operator.

grunt> Relation3_name = JOIN Relation1_name BY key, Relation2_name BY key;

Example

Let us perform a self-join operation on the relation customers, by joining the two relations customers1 and customers2 as shown below.

grunt> customers3 = JOIN customers1 BY id, customers2 BY id;

Verification

Verify the relation customers3 using the DUMP operator as shown below.

grunt> Dump customers3;

Output

It will produce the following output, displaying the contents of the relation customers3.

(1,Ramesh,32,Ahmedabad,2000,1,Ramesh,32,Ahmedabad,2000)
(2,Khilan,25,Delhi,1500,2,Khilan,25,Delhi,1500)
(3,kaushik,23,Kota,2000,3,kaushik,23,Kota,2000)
(4,Chaitali,25,Mumbai,6500,4,Chaitali,25,Mumbai,6500)
(5,Hardik,27,Bhopal,8500,5,Hardik,27,Bhopal,8500)
(6,Komal,22,MP,4500,6,Komal,22,MP,4500)
(7,Muffy,24,Indore,10000,7,Muffy,24,Indore,10000)

Inner Join

Inner join is used quite frequently; it is also referred to as equijoin. An inner join returns rows when there is a match in both tables. It creates a new relation by combining the column values of two relations (say A and B) based upon the join predicate. The query compares each row of A with each row of B to find all pairs of rows that satisfy the join predicate. When the join predicate is satisfied, the column values of each matched pair of rows of A and B are combined into a result row.
Syntax

Here is the syntax of performing an inner join operation using the JOIN operator.

grunt> result = JOIN relation1 BY columnname, relation2 BY columnname;

Example

Let us perform an inner join operation on the two relations customers and orders as shown below.

grunt> customer_orders = JOIN customers BY id, orders BY customer_id;

Verification

Verify the relation customer_orders using the DUMP operator as shown below.

grunt> Dump customer_orders;

Output

You will get the following output, displaying the contents of the relation named customer_orders.

(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)

Outer Join

Unlike inner join, outer join returns all the rows from at least one of the relations. An outer join operation is carried out in three ways −

Left outer join
Right outer join
Full outer join

Left Outer Join

The left outer join operation returns all rows from the left relation, even if there are no matches in the right relation.

Syntax

Given below is the syntax of performing a left outer join operation using the JOIN operator.

grunt> Relation3_name = JOIN Relation1_name BY id LEFT OUTER, Relation2_name BY customer_id;

Example

Let us perform a left outer join operation on the two relations customers and orders as shown below.

grunt> outer_left = JOIN customers BY id LEFT OUTER, orders BY customer_id;

Verification

Verify the relation outer_left using the DUMP operator as shown below.

grunt> Dump outer_left;

Output

It will produce the following output, displaying the contents of the relation outer_left.

(1,Ramesh,32,Ahmedabad,2000,,,,)
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
(5,Hardik,27,Bhopal,8500,,,,)
(6,Komal,22,MP,4500,,,,)
(7,Muffy,24,Indore,10000,,,,)

Right Outer Join

The right outer join operation returns all rows from the right relation, even if there are no matches in the left relation.

Syntax

Given below is the syntax of performing a right outer join operation using the JOIN operator.

grunt> Relation3_name = JOIN Relation1_name BY id RIGHT OUTER, Relation2_name BY customer_id;

Example

Let us perform a right outer join operation on the two relations customers and orders as shown below.

grunt> outer_right = JOIN customers BY id RIGHT, orders BY customer_id;

Verification

Verify the relation outer_right using the DUMP operator as shown below.

grunt> Dump outer_right;

Output

It will produce the following output, displaying the contents of the relation outer_right.

(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)

Full Outer Join

The full outer join operation returns rows when there is a match in either of the relations.

Syntax

Given below is the syntax of performing a full outer join using the JOIN operator.

grunt> Relation3_name = JOIN Relation1_name BY id FULL OUTER, Relation2_name BY customer_id;

Example

Let us perform a full outer join operation on the two relations customers and orders as shown below.

grunt> outer_full = JOIN customers BY id FULL OUTER, orders BY customer_id;

Verification

Verify the relation outer_full using the DUMP operator as shown below.
grunt> Dump outer_full;

Output

It will produce the following output, displaying the contents of the relation outer_full.

(1,Ramesh,32,Ahmedabad,2000,,,,)
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
(5,Hardik,27,Bhopal,8500,,,,)
(6,Komal,22,MP,4500,,,,)
(7,Muffy,24,Indore,10000,,,,)

Using Multiple Keys

We can perform a JOIN operation using multiple keys.

Syntax

Here is how you can perform a JOIN operation on two tables using multiple keys.

grunt> Relation3_name = JOIN Relation1_name BY (key1, key2), Relation2_name BY (key1, key2);

Assume that we have two files namely employee.txt and employee_contact.txt in the /pig_data/ directory of HDFS as shown
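A minimal sketch of the multiple-key join pattern follows; the schemas of these two files and the id and jobid key columns are assumptions made for illustration.

grunt> employee = LOAD 'hdfs://localhost:9000/pig_data/employee.txt' USING PigStorage(',')
   as (id:int, name:chararray, jobid:chararray);

grunt> employee_contact = LOAD 'hdfs://localhost:9000/pig_data/employee_contact.txt' USING PigStorage(',')
   as (id:int, phone:chararray, jobid:chararray);

grunt> emp = JOIN employee BY (id, jobid), employee_contact BY (id, jobid);

A pair of tuples is matched only when both key columns (id and jobid) match.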

Apache Pig – Union Operator

The UNION operator of Pig Latin is used to merge the content of two relations. To perform a UNION operation on two relations, their columns and domains must be identical.

Syntax

Given below is the syntax of the UNION operator.

grunt> Relation_name3 = UNION Relation_name1, Relation_name2;

Example

Assume that we have two files namely student_data1.txt and student_data2.txt in the /pig_data/ directory of HDFS as shown below.

student_data1.txt

001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai

student_data2.txt

7,Komal,Nayak,9848022334,trivendram
8,Bharathi,Nambiayar,9848022333,Chennai

And we have loaded these two files into Pig with the relations student1 and student2 as shown below.

grunt> student1 = LOAD 'hdfs://localhost:9000/pig_data/student_data1.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);

grunt> student2 = LOAD 'hdfs://localhost:9000/pig_data/student_data2.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);

Let us now merge the contents of these two relations using the UNION operator as shown below.

grunt> student = UNION student1, student2;

Verification

Verify the relation student using the DUMP operator as shown below.

grunt> Dump student;

Output

It will display the following output, showing the contents of the relation student.

(1,Rajiv,Reddy,9848022337,Hyderabad)
(2,siddarth,Battacharya,9848022338,Kolkata)
(3,Rajesh,Khanna,9848022339,Delhi)
(4,Preethi,Agarwal,9848022330,Pune)
(5,Trupthi,Mohanthy,9848022336,Bhuwaneshwar)
(6,Archana,Mishra,9848022335,Chennai)
(7,Komal,Nayak,9848022334,trivendram)
(8,Bharathi,Nambiayar,9848022333,Chennai)
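Note that UNION in Pig Latin does not eliminate duplicate tuples. If duplicates should be removed after the merge, the DISTINCT operator can be applied to the result, as in this small sketch.

grunt> distinct_student = DISTINCT student;
grunt> Dump distinct_student;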

Apache Storm – Quick Guide

Apache Storm – Introduction

What is Apache Storm?

Apache Storm is a distributed real-time big data-processing system. Storm is designed to process vast amounts of data in a fault-tolerant and horizontally scalable manner. It is a streaming data framework that supports very high ingestion rates. Though Storm is stateless, it manages the distributed environment and cluster state via Apache ZooKeeper. It is simple, and you can execute all kinds of manipulations on real-time data in parallel.

Apache Storm continues to be a leader in real-time data analytics. Storm is easy to set up and operate, and it guarantees that every message will be processed through the topology at least once.

Apache Storm vs Hadoop

Basically, the Hadoop and Storm frameworks are both used for analyzing big data. They complement each other and differ in some aspects. Apache Storm does all the operations except persistency, while Hadoop is good at everything but lags in real-time computation. The following comparison contrasts the attributes of Storm and Hadoop.

Storm − Real-time stream processing. Hadoop − Batch processing.
Storm − Stateless. Hadoop − Stateful.
Storm − Master/slave architecture with ZooKeeper-based coordination; the master node is called nimbus and the slaves are supervisors. Hadoop − Master/slave architecture with or without ZooKeeper-based coordination; the master node is the JobTracker and the slave nodes are TaskTrackers.
Storm − A streaming process can access tens of thousands of messages per second on a cluster. Hadoop − The Hadoop Distributed File System (HDFS) uses the MapReduce framework to process vast amounts of data, which takes minutes or hours.
Storm − A topology runs until it is shut down by the user or hits an unexpected unrecoverable failure. Hadoop − MapReduce jobs are executed in a sequential order and completed eventually.
Both frameworks are distributed and fault-tolerant.
Storm − If nimbus or a supervisor dies, restarting it makes it continue from where it stopped, hence nothing gets affected. Hadoop − If the JobTracker dies, all the running jobs are lost.

Use-Cases of Apache Storm

Apache Storm is very famous for real-time big data stream processing. For this reason, many companies use Storm as an integral part of their system. Some notable examples are as follows −

Twitter − Twitter uses Apache Storm for its range of “Publisher Analytics products”. The “Publisher Analytics products” process every tweet and click on the Twitter platform. Apache Storm is deeply integrated with the Twitter infrastructure.

NaviSite − NaviSite uses Storm for an event log monitoring/auditing system. Every log generated in the system goes through Storm. Storm checks the message against the configured set of regular expressions, and if there is a match, that particular message is saved to the database.

Wego − Wego is a travel metasearch engine located in Singapore. Travel-related data comes from many sources all over the world with different timing. Storm helps Wego search real-time data, resolve concurrency issues, and find the best match for the end user.

Apache Storm Benefits

Here is a list of the benefits that Apache Storm offers −

Storm is open source, robust, and user friendly. It can be utilized in small companies as well as large corporations.
Storm is fault tolerant, flexible, reliable, and supports any programming language.
Allows real-time stream processing.
Storm is unbelievably fast because it has enormous data-processing power.
Storm can keep up its performance even under increasing load by adding resources linearly. It is highly scalable.
Storm performs data refresh and end-to-end delivery of responses in seconds or minutes, depending upon the problem. It has very low latency.
Storm has operational intelligence.
Storm provides guaranteed data processing even if any of the connected nodes in the cluster die or messages are lost.

Apache Storm – Core Concepts

Apache Storm reads a raw stream of real-time data from one end, passes it through a sequence of small processing units, and outputs the processed / useful information at the other end. The following diagram depicts the core concept of Apache Storm.

Let us now have a closer look at the components of Apache Storm −

Tuple − Tuple is the main data structure in Storm. It is a list of ordered elements. By default, a Tuple supports all data types. Generally, it is modelled as a set of comma-separated values and passed to a Storm cluster.

Stream − Stream is an unordered sequence of tuples.

Spouts − Source of a stream. Generally, Storm accepts input data from raw data sources like the Twitter Streaming API, an Apache Kafka queue, a Kestrel queue, etc. Otherwise, you can write spouts to read data from data sources. “ISpout” is the core interface for implementing spouts. Some of the specific interfaces are IRichSpout, BaseRichSpout, KafkaSpout, etc.

Bolts − Bolts are logical processing units. Spouts pass data to bolts, and bolts process it and produce a new output stream. Bolts can perform the operations of filtering, aggregation, joining, and interacting with data sources and databases. A bolt receives data and emits to one or more bolts. “IBolt” is the core interface for implementing bolts. Some of the common interfaces are IRichBolt, IBasicBolt, etc.

Let’s take a real-time example of “Twitter Analysis” and see how it can be modelled in Apache Storm. The following diagram depicts the structure.

The input for the “Twitter Analysis” comes from the Twitter Streaming API. The spout reads the tweets of the users using the Twitter Streaming API and outputs them as a stream of tuples. A single tuple from the spout will have a Twitter username and a single tweet as comma-separated values. Then, this stream of tuples is forwarded to the bolt, and the bolt splits the tweet into individual words, calculates the word count, and persists the information to a configured datasource. Now, we can easily get the result by querying the datasource.

Topology

Spouts and bolts are connected together and they form a topology. Real-time application logic is specified inside a Storm topology. In simple words, a topology is a directed graph where vertices are computations and edges are streams of data. A simple topology starts with spouts. Spout emits the data to one or more bolts. Bolt

Apache Storm – Applications

The Apache Storm framework supports many of today's best industrial applications. We will provide a very brief overview of some of the most notable applications of Storm in this chapter.

Klout

Klout is an application that uses social media analytics to rank its users based on online social influence through the Klout Score, which is a numerical value between 1 and 100. Klout uses Apache Storm's inbuilt Trident abstraction to create complex topologies that stream data.

The Weather Channel

The Weather Channel uses Storm topologies to ingest weather data. It has tied up with Twitter to enable weather-informed advertising on Twitter and mobile applications.

OpenSignal is a company that specializes in wireless coverage mapping. StormTag and WeatherSignal are weather-based projects created by OpenSignal. StormTag is a Bluetooth weather station that attaches to a keychain. The weather data collected by the device is sent to the WeatherSignal app and OpenSignal servers.

Telecom Industry

Telecommunication providers process millions of phone calls per second. They perform forensics on dropped calls and poor sound quality. Call detail records flow in at a rate of millions per second, and Apache Storm processes them in real time and identifies any troubling patterns. Storm analysis can be used to continuously improve call quality.

Apache Tajo – String Functions

The following list describes the string functions in Tajo.

1. concat(string1, ..., stringN) − Concatenates the given strings.
2. length(string) − Returns the length of the given string.
3. lower(string) − Returns the lowercase format of the string.
4. upper(string) − Returns the uppercase format of the given string.
5. ascii(string text) − Returns the ASCII code of the first character of the text.
6. bit_length(string text) − Returns the number of bits in a string.
7. char_length(string text) − Returns the number of characters in a string.
8. octet_length(string text) − Returns the number of bytes in a string.
9. digest(input text, method text) − Calculates the digest hash of the string. Here, the second argument, method, refers to the hash method.
10. initcap(string text) − Converts the first letter of each word to upper case.
11. md5(string text) − Calculates the MD5 hash of the string.
12. left(string text, int size) − Returns the first n characters of the string.
13. right(string text, int size) − Returns the last n characters of the string.
14. locate(source text, target text, start_index) − Returns the location of the specified substring.
15. strposb(source text, target text) − Returns the binary location of the specified substring.
16. substr(source text, start index, length) − Returns the substring of the specified length.
17. trim(string text[, characters text]) − Removes the characters (a space by default) from the start/end/both ends of the string.
18. split_part(string text, delimiter text, field int) − Splits a string on the delimiter and returns the given field (counting from one).
19. regexp_replace(string text, pattern text, replacement text) − Replaces substrings matched by a given regular expression pattern.
20. reverse(string) − Reverses the given string.
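As a quick sketch of how a few of these functions can be tried out, assuming Tajo's tsql shell (the default> prompt and the constant arguments are illustrative):

default> select concat('Hello ', 'Tajo');
default> select upper('tajo'), lower('TAJO');
default> select length('tutorialspoint');
default> select split_part('www.tutorialspoint.com', '.', 2);

With these inputs, the expected results are Hello Tajo, TAJO and tajo, 14, and tutorialspoint respectively.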

Apache Solr – Updating Data

Updating the Document Using XML

Following is the XML file used to update a field in an existing document. Save it in a file with the name update.xml.

<add>
   <doc>
      <field name = "id">001</field>
      <field name = "first name" update = "set">Raj</field>
      <field name = "last name" update = "add">Malhotra</field>
      <field name = "phone" update = "add">9000000000</field>
      <field name = "city" update = "add">Delhi</field>
   </doc>
</add>

As you can observe, the XML file written to update data is just like the one we use to add documents. The only difference is that we use the update attribute of the field.

In our example, we will use the above document and try to update the fields of the document with the id 001. Suppose the XML document exists in the bin directory of Solr. Since we are updating the index which exists in the core named my_core, you can update it using the post tool as follows −

[Hadoop@localhost bin]$ ./post -c my_core update.xml

On executing the above command, you will get the following output.

/home/Hadoop/java/bin/java -classpath /home/Hadoop/Solr/dist/Solr-core
6.2.0.jar -Dauto = yes -Dc = my_core -Ddata = files
org.apache.Solr.util.SimplePostTool update.xml
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/Solr/my_core/update...
Entering auto mode. File endings considered are
xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file update.xml (application/xml) to [base]
1 files indexed.
COMMITting Solr index changes to http://localhost:8983/Solr/my_core/update...
Time spent: 0:00:00.159

Verification

Visit the homepage of the Apache Solr web interface and select the core as my_core. Try to retrieve all the documents by passing the query "*:*" in the text area q and execute the query. On executing, you can observe that the document is updated.

Updating the Document Using Java (Client API)

Following is the Java program to update documents in the Apache Solr index. Save this code in a file with the name UpdatingDocument.java.
import java.io.IOException;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.client.solrj.response.UpdateResponse;
import org.apache.solr.common.SolrInputDocument;

public class UpdatingDocument {
   public static void main(String args[]) throws SolrServerException, IOException {
      // Preparing the Solr client
      String urlString = "http://localhost:8983/solr/my_core";
      SolrClient solr = new HttpSolrClient.Builder(urlString).build();

      // Preparing the update request; the commit action is applied when the request is processed
      UpdateRequest updateRequest = new UpdateRequest();
      updateRequest.setAction(UpdateRequest.ACTION.COMMIT, false, false);

      // Preparing the Solr document
      SolrInputDocument myDocumentInstantlycommited = new SolrInputDocument();
      myDocumentInstantlycommited.addField("id", "002");
      myDocumentInstantlycommited.addField("name", "Rahman");
      myDocumentInstantlycommited.addField("age", "27");
      myDocumentInstantlycommited.addField("addr", "hyderabad");

      updateRequest.add(myDocumentInstantlycommited);

      UpdateResponse rsp = updateRequest.process(solr);
      System.out.println("Documents Updated");
   }
}

Compile the above code by executing the following commands in the terminal −

[Hadoop@localhost bin]$ javac UpdatingDocument.java
[Hadoop@localhost bin]$ java UpdatingDocument

On executing the above commands, you will get the following output.

Documents Updated
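The program above re-adds a complete document. To perform the same kind of partial (atomic) update from Java as the XML example does, a common SolrJ pattern is to wrap the new value in a map keyed by the update operation. The following is a minimal sketch, assuming the same my_core core and an existing document with id 001; the class name AtomicUpdateExample is illustrative.

import java.util.HashMap;
import java.util.Map;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class AtomicUpdateExample {
   public static void main(String[] args) throws Exception {
      SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/my_core").build();

      // Identify the existing document by its unique key and describe only the change.
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "001");

      Map<String, Object> cityUpdate = new HashMap<String, Object>();
      cityUpdate.put("set", "Delhi");   // "set" replaces the current value of the field
      doc.addField("city", cityUpdate);

      solr.add(doc);      // send the atomic update
      solr.commit();      // make the change visible to searches
      solr.close();
   }
}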

Apache Storm – Core Concepts

Apache Storm reads a raw stream of real-time data from one end, passes it through a sequence of small processing units, and outputs the processed / useful information at the other end. The following diagram depicts the core concept of Apache Storm.

Let us now have a closer look at the components of Apache Storm −

Tuple − Tuple is the main data structure in Storm. It is a list of ordered elements. By default, a Tuple supports all data types. Generally, it is modelled as a set of comma-separated values and passed to a Storm cluster.

Stream − Stream is an unordered sequence of tuples.

Spouts − Source of a stream. Generally, Storm accepts input data from raw data sources like the Twitter Streaming API, an Apache Kafka queue, a Kestrel queue, etc. Otherwise, you can write spouts to read data from data sources. “ISpout” is the core interface for implementing spouts. Some of the specific interfaces are IRichSpout, BaseRichSpout, KafkaSpout, etc.

Bolts − Bolts are logical processing units. Spouts pass data to bolts, and bolts process it and produce a new output stream. Bolts can perform the operations of filtering, aggregation, joining, and interacting with data sources and databases. A bolt receives data and emits to one or more bolts. “IBolt” is the core interface for implementing bolts. Some of the common interfaces are IRichBolt, IBasicBolt, etc.

Let’s take a real-time example of “Twitter Analysis” and see how it can be modelled in Apache Storm. The following diagram depicts the structure.

The input for the “Twitter Analysis” comes from the Twitter Streaming API. The spout reads the tweets of the users using the Twitter Streaming API and outputs them as a stream of tuples. A single tuple from the spout will have a Twitter username and a single tweet as comma-separated values. Then, this stream of tuples is forwarded to the bolt, and the bolt splits the tweet into individual words, calculates the word count, and persists the information to a configured datasource. Now, we can easily get the result by querying the datasource.

Topology

Spouts and bolts are connected together and they form a topology. Real-time application logic is specified inside a Storm topology. In simple words, a topology is a directed graph where vertices are computations and edges are streams of data.

A simple topology starts with spouts. A spout emits the data to one or more bolts. A bolt represents a node in the topology having the smallest processing logic, and the output of a bolt can be emitted into another bolt as input. Storm keeps the topology running until you kill it. Apache Storm's main job is to run the topology, and it will run any number of topologies at a given time.

Tasks

Now you have a basic idea of spouts and bolts. They are the smallest logical units of the topology, and a topology is built using a single spout and an array of bolts. They should be executed properly in a particular order for the topology to run successfully. The execution of each and every spout and bolt by Storm is called a “task”. In simple words, a task is either the execution of a spout or a bolt. At a given time, each spout and bolt can have multiple instances running in multiple separate threads.

Workers

A topology runs in a distributed manner, on multiple worker nodes. Storm spreads the tasks evenly over all the worker nodes. The worker node's role is to listen for jobs and start or stop processes whenever a new job arrives.
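To make the word-splitting bolt of the “Twitter Analysis” example concrete, here is a minimal sketch written against the older backtype.storm API used by the examples later in this tutorial; the class name SplitTweetBolt and its output field are illustrative, not part of any library.

import java.util.Map;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.IRichBolt;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class SplitTweetBolt implements IRichBolt {
   private OutputCollector collector;

   @Override
   public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
      this.collector = collector;
   }

   @Override
   public void execute(Tuple tuple) {
      // The spout emits (username, tweet); split the tweet text into words.
      String tweet = tuple.getString(1);

      for(String word : tweet.split(" ")) {
         if(!word.isEmpty()) {
            collector.emit(new Values(word));   // one output tuple per word
         }
      }
      collector.ack(tuple);
   }

   @Override
   public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("word"));
   }

   @Override
   public void cleanup() {}

   @Override
   public Map<String, Object> getComponentConfiguration() {
      return null;
   }
}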
Stream Grouping

A stream of data flows from spouts to bolts or from one bolt to another bolt. Stream grouping controls how the tuples are routed in the topology and helps us understand the flow of tuples through the topology. There are four built-in groupings, as explained below.

Shuffle Grouping

In shuffle grouping, tuples are distributed randomly and roughly equally across all of the workers executing the bolt. The following diagram depicts the structure.

Field Grouping

In field grouping, tuples that share the same values for the specified fields are routed to the same worker executing the bolt. For example, if the stream is grouped by the field “word”, then all tuples with the same string “Hello” will move to the same worker. The following diagram shows how field grouping works.

Global Grouping

All the streams can be grouped and forwarded to one bolt. This grouping sends tuples generated by all instances of the source to a single target instance (specifically, it picks the worker with the lowest ID).

All Grouping

All grouping sends a single copy of each tuple to all instances of the receiving bolt. This kind of grouping is used to send signals to bolts. All grouping is useful for join operations.
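As a minimal sketch of how these groupings are declared in code, reusing the YahooFinanceSpout and PriceCutOffBolt classes that are built in the Yahoo! Finance chapter later in this tutorial (the component names are illustrative):

import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

public class GroupingExamples {
   public static void main(String[] args) {
      TopologyBuilder builder = new TopologyBuilder();
      builder.setSpout("spout", new YahooFinanceSpout());

      // Each setBolt call chooses one of the four built-in groupings for its input stream.
      builder.setBolt("shuffle-bolt", new PriceCutOffBolt()).shuffleGrouping("spout");
      builder.setBolt("fields-bolt", new PriceCutOffBolt()).fieldsGrouping("spout", new Fields("company"));
      builder.setBolt("global-bolt", new PriceCutOffBolt()).globalGrouping("spout");
      builder.setBolt("all-bolt", new PriceCutOffBolt()).allGrouping("spout");
   }
}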

Apache Pig – Explain Operator

The explain operator is used to display the logical, physical, and MapReduce execution plans of a relation.

Syntax

Given below is the syntax of the explain operator.

grunt> explain Relation_name;

Example

Assume we have a file student_data.txt in HDFS with the following content.

001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai

And we have read it into a relation student using the LOAD operator as shown below.

grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' USING PigStorage(',')
   as ( id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray );

Now, let us explain the relation named student using the explain operator as shown below.

grunt> explain student;

Output

It will produce the following output.

$ explain student;

2015-10-05 11:32:43,660 [main]
2015-10-05 11:32:43,660 [main] INFO org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer -
{RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator, GroupByConstParallelSetter,
LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, PartitionFilterOptimizer,
PredicatePushdownOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter]}

#-----------------------------------------------
# New Logical Plan:
#-----------------------------------------------
student: (Name: LOStore Schema: id#31:int,firstname#32:chararray,lastname#33:chararray,phone#34:chararray,city#35:chararray)
|
|---student: (Name: LOForEach Schema: id#31:int,firstname#32:chararray,lastname#33:chararray,phone#34:chararray,city#35:chararray)
    |
    (Name: LOGenerate[false,false,false,false,false] Schema: id#31:int,firstname#32:chararray,lastname#33:chararray,phone#34:chararray,city#35:chararray)ColumnPrune:InputUids=[34, 35, 32, 33, 31]ColumnPrune:OutputUids=[34, 35, 32, 33, 31]
    |
    (Name: Cast Type: int Uid: 31)
    |
    |---id:(Name: Project Type: bytearray Uid: 31 Input: 0 Column: (*))
    |
    (Name: Cast Type: chararray Uid: 32)
    |
    |---firstname:(Name: Project Type: bytearray Uid: 32 Input: 1 Column: (*))
    |
    (Name: Cast Type: chararray Uid: 33)
    |
    |---lastname:(Name: Project Type: bytearray Uid: 33 Input: 2 Column: (*))
    |
    (Name: Cast Type: chararray Uid: 34)
    |
    |---phone:(Name: Project Type: bytearray Uid: 34 Input: 3 Column: (*))
    |
    (Name: Cast Type: chararray Uid: 35)
    |
    |---city:(Name: Project Type: bytearray Uid: 35 Input: 4 Column: (*))
    |
    |---(Name: LOInnerLoad[0] Schema: id#31:bytearray)
    |
    |---(Name: LOInnerLoad[1] Schema: firstname#32:bytearray)
    |
    |---(Name: LOInnerLoad[2] Schema: lastname#33:bytearray)
    |
    |---(Name: LOInnerLoad[3] Schema: phone#34:bytearray)
    |
    |---(Name: LOInnerLoad[4] Schema: city#35:bytearray)
|
|---student: (Name: LOLoad Schema: id#31:bytearray,firstname#32:bytearray,lastname#33:bytearray,phone#34:bytearray,city#35:bytearray)RequiredFields:null

#-----------------------------------------------
# Physical Plan:
#-----------------------------------------------
student: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-36
|
|---student: New For Each(false,false,false,false,false)[bag] - scope-35
    |
    Cast[int] - scope-21
    |
    |---Project[bytearray][0] - scope-20
    |
    Cast[chararray] - scope-24
    |
    |---Project[bytearray][1] - scope-23
    |
    Cast[chararray] - scope-27
    |
    |---Project[bytearray][2] - scope-26
    |
    Cast[chararray] - scope-30
    |
    |---Project[bytearray][3] - scope-29
    |
    Cast[chararray] - scope-33
    |
    |---Project[bytearray][4] - scope-32
|
|---student: Load(hdfs://localhost:9000/pig_data/student_data.txt:PigStorage(',')) - scope-19

2015-10-05 11:32:43,682 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler -
File concatenation threshold: 100 optimistic? false
2015-10-05 11:32:43,684 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer -
MR plan size before optimization: 1
2015-10-05 11:32:43,685 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer -
MR plan size after optimization: 1

#--------------------------------------------------
# Map Reduce Plan
#--------------------------------------------------
MapReduce node scope-37
Map Plan
student: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-36
|
|---student: New For Each(false,false,false,false,false)[bag] - scope-35
    |
    Cast[int] - scope-21
    |
    |---Project[bytearray][0] - scope-20
    |
    Cast[chararray] - scope-24
    |
    |---Project[bytearray][1] - scope-23
    |
    Cast[chararray] - scope-27
    |
    |---Project[bytearray][2] - scope-26
    |
    Cast[chararray] - scope-30
    |
    |---Project[bytearray][3] - scope-29
    |
    Cast[chararray] - scope-33
    |
    |---Project[bytearray][4] - scope-32
|
|---student: Load(hdfs://localhost:9000/pig_data/student_data.txt:PigStorage(',')) - scope-19
--------
Global sort: false
----------------
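In addition to explaining a single alias interactively, the explain command accepts a few options. The sketch below follows the documented Pig syntax; the file paths and script name are illustrative.

grunt> explain -brief student;
(prints the plans without expanding nested details)

grunt> explain -out /tmp/student_plan.txt student;
(writes the plans to a file instead of the console)

grunt> explain -script myscript.pig -out /tmp/plans;
(explains all the aliases of an entire Pig Latin script)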

Apache Storm in Yahoo! Finance

Yahoo! Finance is the Internet's leading business news and financial data website. It is a part of Yahoo! and gives information about financial news, market statistics, international market data and other information about financial resources that anyone can access. If you are a registered Yahoo! user, then you can customize Yahoo! Finance to take advantage of certain of its offerings. The Yahoo! Finance API is used to query financial data from Yahoo! This API displays data that is delayed by 15 minutes from real time and updates its database every minute, to access current stock-related information.

Now let us take a real-time scenario of a company and see how to raise an alert when its stock value goes below 100.

Spout Creation

The purpose of the spout is to get the details of the companies and emit the prices to the bolts. You can use the following program code to create the spout.

Coding: YahooFinanceSpout.java

import java.util.*;
import java.io.*;
import java.math.BigDecimal;

//import yahoofinance packages
import yahoofinance.YahooFinance;
import yahoofinance.Stock;

import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import backtype.storm.topology.IRichSpout;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;

public class YahooFinanceSpout implements IRichSpout {
   private SpoutOutputCollector collector;
   private boolean completed = false;
   private TopologyContext context;

   @Override
   public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
      this.context = context;
      this.collector = collector;
   }

   @Override
   public void nextTuple() {
      try {
         Stock stock = YahooFinance.get("INTC");
         BigDecimal price = stock.getQuote().getPrice();
         this.collector.emit(new Values("INTC", price.doubleValue()));

         stock = YahooFinance.get("GOOGL");
         price = stock.getQuote().getPrice();
         this.collector.emit(new Values("GOOGL", price.doubleValue()));

         stock = YahooFinance.get("AAPL");
         price = stock.getQuote().getPrice();
         this.collector.emit(new Values("AAPL", price.doubleValue()));
      } catch(Exception e) {}
   }

   @Override
   public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("company", "price"));
   }

   @Override
   public void close() {}

   public boolean isDistributed() {
      return false;
   }

   @Override
   public void activate() {}

   @Override
   public void deactivate() {}

   @Override
   public void ack(Object msgId) {}

   @Override
   public void fail(Object msgId) {}

   @Override
   public Map<String, Object> getComponentConfiguration() {
      return null;
   }
}

Bolt Creation

Here the purpose of the bolt is to process a given company's price when it falls below 100. It uses a Java Map object to set the cutoff price alert to true when a stock price falls below 100, and to false otherwise.
The complete program code is as follows −

Coding: PriceCutOffBolt.java

import java.util.HashMap;
import java.util.Map;

import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.IRichBolt;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.tuple.Tuple;

public class PriceCutOffBolt implements IRichBolt {
   Map<String, Integer> cutOffMap;
   Map<String, Boolean> resultMap;

   private OutputCollector collector;

   @Override
   public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
      this.cutOffMap = new HashMap<String, Integer>();
      this.cutOffMap.put("INTC", 100);
      this.cutOffMap.put("AAPL", 100);
      this.cutOffMap.put("GOOGL", 100);

      this.resultMap = new HashMap<String, Boolean>();
      this.collector = collector;
   }

   @Override
   public void execute(Tuple tuple) {
      String company = tuple.getString(0);
      Double price = tuple.getDouble(1);

      if(this.cutOffMap.containsKey(company)) {
         Integer cutOffPrice = this.cutOffMap.get(company);

         if(price < cutOffPrice) {
            this.resultMap.put(company, true);
         } else {
            this.resultMap.put(company, false);
         }
      }

      collector.ack(tuple);
   }

   @Override
   public void cleanup() {
      for(Map.Entry<String, Boolean> entry : resultMap.entrySet()) {
         System.out.println(entry.getKey() + " : " + entry.getValue());
      }
   }

   @Override
   public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("cut_off_price"));
   }

   @Override
   public Map<String, Object> getComponentConfiguration() {
      return null;
   }
}

Submitting a Topology

This is the main application where YahooFinanceSpout.java and PriceCutOffBolt.java are connected together to produce a topology. The following program code shows how you can submit a topology.

Coding: YahooFinanceStorm.java

import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.topology.TopologyBuilder;

public class YahooFinanceStorm {
   public static void main(String[] args) throws Exception {
      Config config = new Config();
      config.setDebug(true);

      TopologyBuilder builder = new TopologyBuilder();
      builder.setSpout("yahoo-finance-spout", new YahooFinanceSpout());

      builder.setBolt("price-cutoff-bolt", new PriceCutOffBolt())
         .fieldsGrouping("yahoo-finance-spout", new Fields("company"));

      LocalCluster cluster = new LocalCluster();
      cluster.submitTopology("YahooFinanceStorm", config, builder.createTopology());
      Thread.sleep(10000);
      cluster.shutdown();
   }
}

Building and Running the Application

The complete application has three Java source files. They are as follows −

YahooFinanceSpout.java
PriceCutOffBolt.java
YahooFinanceStorm.java

The application can be built using the following command −

javac -cp "/path/to/storm/apache-storm-0.9.5/lib/*":"/path/to/yahoofinance/lib/*" *.java

The application can be run using the following command −

java -cp "/path/to/storm/apache-storm-0.9.5/lib/*":"/path/to/yahoofinance/lib/*":. YahooFinanceStorm

Output

The output will be similar to the following −

GOOGL : false
AAPL : false
INTC : true

Apache Pig – Split Operator

The SPLIT operator is used to split a relation into two or more relations.

Syntax

Given below is the syntax of the SPLIT operator.

grunt> SPLIT Relation1_name INTO Relation2_name IF (condition1), Relation3_name IF (condition2);

Example

Assume that we have a file named student_details.txt in the HDFS directory /pig_data/ as shown below.

student_details.txt

001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai

And we have loaded this file into Pig with the relation name student_details as shown below.

grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);

Let us now split the relation into two: one listing the students of age less than 23, and the other listing the students having an age between 22 and 25.

grunt> SPLIT student_details into student_details1 if age < 23, student_details2 if (age > 22 and age < 25);

Verification

Verify the relations student_details1 and student_details2 using the DUMP operator as shown below.

grunt> Dump student_details1;
grunt> Dump student_details2;

Output

It will produce the following output, displaying the contents of the relations student_details1 and student_details2 respectively.

grunt> Dump student_details1;
(1,Rajiv,Reddy,21,9848022337,Hyderabad)
(2,siddarth,Battacharya,22,9848022338,Kolkata)
(3,Rajesh,Khanna,22,9848022339,Delhi)
(4,Preethi,Agarwal,21,9848022330,Pune)

grunt> Dump student_details2;
(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar)
(6,Archana,Mishra,23,9848022335,Chennai)
(7,Komal,Nayak,24,9848022334,trivendram)
(8,Bharathi,Nambiayar,24,9848022333,Chennai)
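Pig Latin also supports a default branch in SPLIT through the OTHERWISE keyword (available in newer Pig releases), which collects every tuple that satisfies none of the listed conditions. A small sketch on the same relation; the relation names young and others are illustrative.

grunt> SPLIT student_details into young if age < 23, others OTHERWISE;
grunt> Dump others;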