Apache Solr – Querying Data

In addition to storing data, Apache Solr also provides the facility of querying it back as and when required. Solr provides certain parameters using which we can query the data stored in it. Listed below are the various query parameters available in Apache Solr.

q − This is the main query parameter of Apache Solr; documents are scored by their similarity to the terms in this parameter.

fq − This parameter represents the filter query of Apache Solr, which restricts the result set to documents matching this filter.

start − The start parameter represents the starting offset for a page of results; the default value of this parameter is 0.

rows − This parameter represents the number of documents that are to be retrieved per page. The default value of this parameter is 10.

sort − This parameter specifies the list of fields, separated by commas, based on which the results of the query are to be sorted.

fl − This parameter specifies the list of fields to return for each document in the result set.

wt − This parameter represents the type of response writer we want to use to view the result.

You can see all these parameters as options to query Apache Solr. Visit the homepage of Apache Solr. On the left-hand side of the page, click on the option Query. Here, you can see the fields for the parameters of a query.

Retrieving the Records

Assume we have 3 records in the core named my_core. To retrieve a particular record from the selected core, you need to pass the name and value pairs of the fields of a particular document. For example, if you want to retrieve the record with the value of the field id, you need to pass the name-value pair of the field as id:001 as the value for the parameter q and execute the query.

In the same way, you can retrieve all the records from an index by passing *:* as a value to the parameter q, as shown in the following screenshot.

Retrieving from the 2nd Record

We can retrieve the records from the second record by passing 2 as a value to the parameter start (the offset is zero-based), as shown in the following screenshot.

Restricting the Number of Records

You can restrict the number of records by specifying a value in the rows parameter. For example, we can restrict the total number of records in the result of the query to 2 by passing the value 2 to the parameter rows, as shown in the following screenshot.

Response Writer Type

You can get the response in the required document type by selecting one from the provided values of the parameter wt. In the above instance, we have chosen the .csv format to get the response.

List of the Fields

If we want to have particular fields in the resulting documents, we need to pass the list of the required fields, separated by commas, as a value to the parameter fl. In the following example, we are trying to retrieve the fields id, phone, and first_name.
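The same parameters can also be set programmatically through the SolrJ client. The following is a minimal sketch, assuming a core named my_core running at http://localhost:8983/solr and the field names used in the examples above; the filter query on the city field is purely illustrative.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class QueryParametersExample {
   public static void main(String[] args) throws Exception {
      HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/my_core").build();

      SolrQuery query = new SolrQuery();
      query.setQuery("*:*");                          // q     - match all documents
      query.addFilterQuery("city:Hyderabad");         // fq    - restrict the result set (hypothetical filter)
      query.setStart(0);                              // start - offset of the first result
      query.setRows(2);                               // rows  - number of documents per page
      query.setSort("id", SolrQuery.ORDER.asc);       // sort  - order the results by the id field
      query.setFields("id", "phone", "first_name");   // fl    - fields to return

      QueryResponse response = solr.query(query);     // the response format (wt) is handled by the client
      System.out.println(response.getResults());
      solr.close();
   }
}

Each setter corresponds to one of the query parameters described above, so the same request can be issued either from the admin Query screen or from Java code.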

Apache Solr – Retrieving Data

In this chapter, we will discuss how to retrieve data using the Java Client API. Suppose we have a .csv document named sample.csv with the following content.

001,9848022337,Hyderabad,Rajiv,Reddy
002,9848022338,Kolkata,Siddarth,Battacharya
003,9848022339,Delhi,Rajesh,Khanna

You can index this data under the core named Solr_sample using the post command.

[Hadoop@localhost bin]$ ./post -c Solr_sample sample.csv

Following is the Java program to retrieve documents from the Apache Solr index. Save this code in a file named RetrievingData.java.

import java.io.IOException;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocumentList;

public class RetrievingData {
   public static void main(String args[]) throws SolrServerException, IOException {
      //Preparing the Solr client
      String urlString = "http://localhost:8983/solr/my_core";
      SolrClient solr = new HttpSolrClient.Builder(urlString).build();

      //Preparing the Solr query
      SolrQuery query = new SolrQuery();
      query.setQuery("*:*");

      //Adding the field to be retrieved
      query.addField("*");

      //Executing the query
      QueryResponse queryResponse = solr.query(query);

      //Storing the results of the query
      SolrDocumentList docs = queryResponse.getResults();
      System.out.println(docs);
      System.out.println(docs.get(0));
      System.out.println(docs.get(1));
      System.out.println(docs.get(2));

      //Releasing the client resources
      solr.close();
   }
}

Compile and run the above code by executing the following commands in the terminal −

[Hadoop@localhost bin]$ javac RetrievingData.java
[Hadoop@localhost bin]$ java RetrievingData

On executing the above commands, you will get the following output.

{numFound = 3,start = 0,docs = [SolrDocument{id=001, phone = [9848022337], city = [Hyderabad], first_name = [Rajiv], last_name = [Reddy], _version_ = 1547262806014820352}, SolrDocument{id = 002, phone = [9848022338], city = [Kolkata], first_name = [Siddarth], last_name = [Battacharya], _version_ = 1547262806026354688}, SolrDocument{id = 003, phone = [9848022339], city = [Delhi], first_name = [Rajesh], last_name = [Khanna], _version_ = 1547262806029500416}]}

SolrDocument{id = 001, phone = [9848022337], city = [Hyderabad], first_name = [Rajiv], last_name = [Reddy], _version_ = 1547262806014820352}
SolrDocument{id = 002, phone = [9848022338], city = [Kolkata], first_name = [Siddarth], last_name = [Battacharya], _version_ = 1547262806026354688}
SolrDocument{id = 003, phone = [9848022339], city = [Delhi], first_name = [Rajesh], last_name = [Khanna], _version_ = 1547262806029500416}
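As a follow-up, the query can be narrowed to a single document. The sketch below is an illustrative variant of the program above (the core URL and field names are reused from the earlier examples and are assumptions about your setup); it retrieves only the record whose id is 001 and prints the selected fields.

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class RetrievingSingleDocument {
   public static void main(String[] args) throws Exception {
      SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/my_core").build();

      //Query a single document by its id field and restrict the returned fields
      SolrQuery query = new SolrQuery("id:001");
      query.setFields("id", "first_name", "phone");

      QueryResponse response = solr.query(query);
      System.out.println(response.getResults());
      solr.close();
   }
}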

Apache Solr – Architecture

In this chapter, we will discuss the architecture of Apache Solr. The following illustration shows a block diagram of the architecture of Apache Solr.

Solr Architecture ─ Building Blocks

Following are the major building blocks (components) of Apache Solr −

Request Handler − The requests we send to Apache Solr are processed by these request handlers. The requests might be query requests or index update requests. Based on our requirement, we need to select the request handler. To pass a request to Solr, we will generally map the handler to a certain URI end-point and the specified request will be served by it.

Search Component − A search component is a type (feature) of search provided in Apache Solr. It might be spell checking, query, faceting, hit highlighting, etc. These search components are registered as search handlers. Multiple components can be registered to a search handler.

Query Parser − The Apache Solr query parser parses the queries that we pass to Solr and verifies the queries for syntactical errors. After parsing the queries, it translates them to a format which Lucene understands.

Response Writer − A response writer in Apache Solr is the component which generates the formatted output for the user queries. Solr supports response formats such as XML, JSON, CSV, etc. We have different response writers for each type of response.

Analyzer/Tokenizer − Lucene recognizes data in the form of tokens. Apache Solr analyzes the content, divides it into tokens, and passes these tokens to Lucene. An analyzer in Apache Solr examines the text of fields and generates a token stream. A tokenizer breaks the token stream prepared by the analyzer into tokens.

Update Request Processor − Whenever we send an update request to Apache Solr, the request is run through a set of plugins (signature, logging, indexing), collectively known as an update request processor. This processor is responsible for modifications such as dropping a field, adding a field, etc.
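To make the request-handler role a little more concrete, the short SolrJ sketch below is an illustration only (the core name my_core is reused from the other chapters; it is not part of this architecture discussion). It explicitly selects the /select request handler as the URI end-point that serves the request; the query parser then interprets the q parameter and a response writer formats the result before it is returned to the client.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class HandlerExample {
   public static void main(String[] args) throws Exception {
      HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/my_core").build();

      SolrQuery query = new SolrQuery("*:*");
      //The request handler is the end-point that will serve this request
      query.setRequestHandler("/select");

      //The query parser interprets q; the response writer formats the output for the client
      QueryResponse response = solr.query(query);
      System.out.println(response.getResults());
      solr.close();
   }
}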

Apache Solr – Terminology

In this chapter, we will try to understand the real meaning of some of the terms that are frequently used while working on Solr.

General Terminology

The following is a list of general terms that are used across all types of Solr setups −

Instance − Just like a Tomcat instance or a Jetty instance, this term refers to the application server, which runs inside a JVM. The home directory of Solr provides reference to each of these Solr instances, in which one or more cores can be configured to run in each instance.

Core − While running multiple indexes in your application, you can have multiple cores in each instance, instead of multiple instances each having one core.

Home − The term $SOLR_HOME refers to the home directory which has all the information regarding the cores and their indexes, configurations, and dependencies.

Shard − In distributed environments, the data is partitioned between multiple Solr instances, where each chunk of data can be called a shard. It contains a subset of the whole index.

SolrCloud Terminology

In an earlier chapter, we discussed how to install Apache Solr in standalone mode. Note that we can also install Solr in distributed mode (cloud environment), where Solr is installed in a master-slave pattern. In distributed mode, the index is created on the master server and it is replicated to one or more slave servers. The key terms associated with SolrCloud are as follows −

Node − In SolrCloud, each single instance of Solr is regarded as a node.

Cluster − All the nodes of the environment combined together make a cluster.

Collection − A cluster has a logical index that is known as a collection.

Shard − A shard is a portion of the collection which has one or more replicas of the index.

Replica − In SolrCloud, a copy of a shard that runs in a node is known as a replica.

Leader − It is also a replica of the shard, which distributes the requests of SolrCloud to the remaining replicas.

Zookeeper − It is an Apache project that SolrCloud uses for centralized configuration and coordination, to manage the cluster and to elect a leader.

Configuration Files

The main configuration files in Apache Solr are as follows −

solr.xml − It is the file in the $SOLR_HOME directory that contains SolrCloud related information. To load the cores, Solr refers to this file, which helps in identifying them.

solrconfig.xml − This file contains the definitions and core-specific configurations related to request handling and response formatting, along with indexing, configuring, managing memory and making commits.

schema.xml − This file contains the whole schema along with the fields and field types.

core.properties − This file contains the configurations specific to the core. It is referred to for core discovery, as it contains the name of the core and the path of the data directory. It can be placed in any directory, which will then be treated as the core directory.
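As an illustration of core discovery, a core.properties file can be as small as a single line. The sketch below is only an assumption-based example: the core name is reused from the other chapters, and the commented entries show the kind of optional overrides such a file may carry.

name=my_core
# optional overrides, shown only as an illustration
# dataDir=data
# config=solrconfig.xml
# schema=schema.xml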

Apache Solr – Basic Commands

Starting Solr

After installing Solr, browse to the bin folder in the Solr home directory and start Solr using the following command.

[Hadoop@localhost ~]$ cd
[Hadoop@localhost ~]$ cd Solr/
[Hadoop@localhost Solr]$ cd bin/
[Hadoop@localhost bin]$ ./solr start

This command starts Solr in the background, listening on port 8983, and displays the following message.

Waiting up to 30 seconds to see Solr running on port 8983 []
Started Solr server on port 8983 (pid = 6035). Happy searching!

Starting Solr in foreground

If you start Solr using the start command, then Solr will start in the background. Instead, you can start Solr in the foreground using the -f option.

[Hadoop@localhost bin]$ ./solr start -f

5823 INFO (coreLoadExecutor-6-thread-2) [ ] o.a.s.c.SolrResourceLoader Adding 'file:/home/Hadoop/Solr/contrib/extraction/lib/xmlbeans-2.6.0.jar' to classloader
5823 INFO (coreLoadExecutor-6-thread-2) [ ] o.a.s.c.SolrResourceLoader Adding 'file:/home/Hadoop/Solr/dist/Solr-cell-6.2.0.jar' to classloader
5823 INFO (coreLoadExecutor-6-thread-2) [ ] o.a.s.c.SolrResourceLoader Adding 'file:/home/Hadoop/Solr/contrib/clustering/lib/carrot2-guava-18.0.jar' to classloader
5823 INFO (coreLoadExecutor-6-thread-2) [ ] o.a.s.c.SolrResourceLoader Adding 'file:/home/Hadoop/Solr/contrib/clustering/lib/attributes-binder1.3.1.jar' to classloader
5823 INFO (coreLoadExecutor-6-thread-2) [ ] o.a.s.c.SolrResourceLoader Adding 'file:/home/Hadoop/Solr/contrib/clustering/lib/simple-xml-2.7.1.jar' to classloader
……………………………………………………………………………………
12901 INFO (coreLoadExecutor-6-thread-1) [ x:Solr_sample] o.a.s.u.UpdateLog Took 24.0ms to seed version buckets with highest version 1546058939881226240
12902 INFO (coreLoadExecutor-6-thread-1) [ x:Solr_sample] o.a.s.c.CoreContainer registering core: Solr_sample
12904 INFO (coreLoadExecutor-6-thread-2) [ x:my_core] o.a.s.u.UpdateLog Took 16.0ms to seed version buckets with highest version 1546058939894857728
12904 INFO (coreLoadExecutor-6-thread-2) [ x:my_core] o.a.s.c.CoreContainer registering core: my_core

Starting Solr on another port

Using the -p option of the start command, we can start Solr on another port, as shown in the following code block.

[Hadoop@localhost bin]$ ./solr start -p 8984

Waiting up to 30 seconds to see Solr running on port 8984 [-]
Started Solr server on port 8984 (pid = 10137). Happy searching!

Stopping Solr

You can stop Solr using the stop command.

$ ./solr stop

This command stops Solr, displaying a message as shown below.

Sending stop command to Solr running on port 8983 … waiting 5 seconds to allow Jetty process 6035 to stop gracefully.

Restarting Solr

The restart command of Solr stops Solr for 5 seconds and starts it again. You can restart Solr using the following command −

./solr restart

This command restarts Solr, displaying the following message −

Sending stop command to Solr running on port 8983 … waiting 5 seconds to allow Jetty process 6671 to stop gracefully.
Waiting up to 30 seconds to see Solr running on port 8983 [|] [/]
Started Solr server on port 8983 (pid = 6906). Happy searching!

Solr ─ help Command

The help command of Solr can be used to check the usage of the Solr prompt and its options.
[Hadoop@localhost bin]$ ./solr -help

Usage: solr COMMAND OPTIONS
       where COMMAND is one of: start, stop, restart, status, healthcheck, create, create_core, create_collection, delete, version, zk

  Standalone server example (start Solr running in the background on port 8984):
    ./solr start -p 8984

  SolrCloud example (start Solr running in SolrCloud mode using localhost:2181 to connect to ZooKeeper, with 1g max heap size and remote Java debug options enabled):
    ./solr start -c -m 1g -z localhost:2181 -a "-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=1044"

Pass -help after any COMMAND to see command-specific usage information, such as:
    ./solr start -help or ./solr stop -help

Solr ─ status Command

The status command of Solr can be used to search for and find out the running Solr instances on your computer. It can provide you information about a Solr instance, such as its version, memory usage, etc.

You can check the status of a Solr instance using the status command as follows −

[Hadoop@localhost bin]$ ./solr status

On executing, the above command displays the status of Solr as follows −

Found 1 Solr nodes:

Solr process 6906 running on port 8983
{
   "solr_home":"/home/Hadoop/Solr/server/Solr",
   "version":"6.2.0 764d0f19151dbff6f5fcd9fc4b2682cf934590c5 - mike - 2016-08-20 05:41:37",
   "startTime":"2016-09-20T06:00:02.877Z",
   "uptime":"0 days, 0 hours, 5 minutes, 14 seconds",
   "memory":"30.6 MB (%6.2) of 490.7 MB"
}

Solr Admin

After starting Apache Solr, you can visit the homepage of the Solr web interface by using the following URL.

http://localhost:8983/solr/

The interface of Solr Admin appears as follows −

Apache Pig – Architecture

The language used to analyze data in Hadoop using Pig is known as Pig Latin. It is a high-level data processing language which provides a rich set of data types and operators to perform various operations on the data.

To perform a particular task using Pig, programmers need to write a Pig script using the Pig Latin language and execute it using any of the execution mechanisms (Grunt shell, UDFs, Embedded). After execution, these scripts will go through a series of transformations applied by the Pig framework to produce the desired output. Internally, Apache Pig converts these scripts into a series of MapReduce jobs, and thus it makes the programmer’s job easy. The architecture of Apache Pig is shown below.

Apache Pig Components

As shown in the figure, there are various components in the Apache Pig framework. Let us take a look at the major components.

Parser − Initially the Pig scripts are handled by the Parser. It checks the syntax of the script, does type checking, and other miscellaneous checks. The output of the parser will be a DAG (directed acyclic graph), which represents the Pig Latin statements and logical operators. In the DAG, the logical operators of the script are represented as the nodes and the data flows are represented as edges.

Optimizer − The logical plan (DAG) is passed to the logical optimizer, which carries out the logical optimizations such as projection and pushdown.

Compiler − The compiler compiles the optimized logical plan into a series of MapReduce jobs.

Execution engine − Finally, the MapReduce jobs are submitted to Hadoop in a sorted order. These MapReduce jobs are then executed on Hadoop, producing the desired results.

Pig Latin Data Model

The data model of Pig Latin is fully nested and it allows complex non-atomic datatypes such as map and tuple. Given below is the diagrammatical representation of Pig Latin’s data model.

Atom − Any single value in Pig Latin, irrespective of its data type, is known as an Atom. It is stored as a string and can be used as a string and a number. int, long, float, double, chararray, and bytearray are the atomic values of Pig. A piece of data or a simple atomic value is known as a field. Example − ‘raja’ or ‘30’

Tuple − A record that is formed by an ordered set of fields is known as a tuple; the fields can be of any type. A tuple is similar to a row in a table of an RDBMS. Example − (Raja, 30)

Bag − A bag is an unordered set of tuples. In other words, a collection of tuples (non-unique) is known as a bag. Each tuple can have any number of fields (flexible schema). A bag is represented by ‘{}’. It is similar to a table in an RDBMS, but unlike a table in an RDBMS, it is not necessary that every tuple contain the same number of fields or that the fields in the same position (column) have the same type. Example − {(Raja, 30), (Mohammad, 45)}

A bag can be a field in a relation; in that context, it is known as an inner bag. Example − {Raja, 30, {9848022338, [email protected],}}

Map − A map (or data map) is a set of key-value pairs. The key needs to be of type chararray and should be unique. The value might be of any type. It is represented by ‘[]’. Example − [name#Raja, age#30]

Relation − A relation is a bag of tuples. The relations in Pig Latin are unordered (there is no guarantee that tuples are processed in any particular order).

Apache Kafka – Fundamentals

Before moving deeper into Kafka, you must be aware of the main terminologies such as topics, brokers, producers and consumers. The following diagram illustrates the main terminologies, and the list below describes the diagram components in detail.

In the above diagram, a topic is configured into three partitions. Partition 1 has two offset factors, 0 and 1. Partition 2 has four offset factors, 0, 1, 2, and 3. Partition 3 has one offset factor, 0. The id of the replica is the same as the id of the server that hosts it.

Assume that if the replication factor of the topic is set to 3, then Kafka will create 3 identical replicas of each partition and place them in the cluster to make them available for all its operations. To balance the load in the cluster, each broker stores one or more of those partitions. Multiple producers and consumers can publish and retrieve messages at the same time.

Topics − A stream of messages belonging to a particular category is called a topic. Data is stored in topics. Topics are split into partitions. For each topic, Kafka keeps a minimum of one partition. Each such partition contains messages in an immutable ordered sequence. A partition is implemented as a set of segment files of equal sizes.

Partition − Topics may have many partitions, so they can handle an arbitrary amount of data.

Partition offset − Each partitioned message has a unique sequence id called an offset.

Replicas of partition − Replicas are nothing but backups of a partition. Replicas never read or write data. They are used to prevent data loss.

Brokers − Brokers are simple systems responsible for maintaining the published data. Each broker may have zero or more partitions per topic. Assume there are N partitions in a topic and N brokers; then each broker will have one partition. Assume there are N partitions in a topic and more than N brokers (n + m); then the first N brokers will have one partition each and the next M brokers will not have any partition for that particular topic. Assume there are N partitions in a topic and fewer than N brokers (n - m); then each broker will have one or more partitions shared among them. This scenario is not recommended due to unequal load distribution among the brokers.

Kafka Cluster − Kafka having more than one broker is called a Kafka cluster. A Kafka cluster can be expanded without downtime. These clusters are used to manage the persistence and replication of message data.

Producers − Producers are the publishers of messages to one or more Kafka topics. Producers send data to Kafka brokers. Every time a producer publishes a message to a broker, the broker simply appends the message to the last segment file. Actually, the message will be appended to a partition. Producers can also send messages to a partition of their choice.

Consumers − Consumers read data from brokers. Consumers subscribe to one or more topics and consume published messages by pulling data from the brokers.

Leader − The leader is the node responsible for all reads and writes for the given partition. Every partition has one server acting as a leader.

Follower − A node which follows leader instructions is called a follower. If the leader fails, one of the followers will automatically become the new leader. A follower acts as a normal consumer, pulls messages and updates its own data store.
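The relationship between a topic, its partitions and the offsets can also be seen from a producer's point of view. The following minimal Java sketch is an illustration only — the topic name my-topic, the broker address localhost:9092 and the choice of partition 1 are assumptions, not part of the tutorial. It sends one message to an explicitly chosen partition and prints the partition and offset the broker reports back.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class PartitionOffsetExample {
   public static void main(String[] args) throws Exception {
      Properties props = new Properties();
      props.put("bootstrap.servers", "localhost:9092");
      props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
      props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

      KafkaProducer<String, String> producer = new KafkaProducer<>(props);

      //Send a message to partition 1 of the topic and wait for the acknowledgement
      ProducerRecord<String, String> record = new ProducerRecord<>("my-topic", 1, "key1", "value1");
      RecordMetadata metadata = producer.send(record).get();

      //The broker reports the partition and the offset at which the message was appended
      System.out.println("partition = " + metadata.partition() + ", offset = " + metadata.offset());
      producer.close();
   }
}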

Simple Producer Example

Let us create an application for publishing and consuming messages using a Java client. The Kafka producer client consists of the following APIs.

KafkaProducer API

Let us understand the most important set of Kafka producer APIs in this section. The central part of the KafkaProducer API is the KafkaProducer class. The KafkaProducer class provides an option to connect a Kafka broker in its constructor with the following methods.

The KafkaProducer class provides a send method to send messages asynchronously to a topic. The signature of send() is as follows −

producer.send(new ProducerRecord<byte[],byte[]>(topic, partition, key1, value1), callback);

ProducerRecord − The producer manages a buffer of records waiting to be sent.

Callback − A user-supplied callback to execute when the record has been acknowledged by the server (null indicates no callback).

The KafkaProducer class provides a flush method to ensure all previously sent messages have actually completed. The syntax of the flush method is as follows −

public void flush()

The KafkaProducer class provides the partitionsFor method, which helps in getting the partition metadata for a given topic. This can be used for custom partitioning.

The KafkaProducer class also provides a metrics method −

public Map metrics()

It returns the map of internal metrics maintained by the producer.

public void close() − The KafkaProducer class provides a close method that blocks until all previously sent requests are completed.

Producer API

The central part of the Producer API is the Producer class. The Producer class provides an option to connect the Kafka broker in its constructor by the following methods.

The Producer Class

The producer class provides a send method to send messages to either single or multiple topics using the following signatures.

public void send(KeyedMessage<k,v> message) − sends the data to a single topic, partitioned by key, using either the sync or the async producer.

public void send(List<KeyedMessage<k,v>> messages) − sends data to multiple topics.

Properties prop = new Properties();
prop.put("producer.type", "async");
ProducerConfig config = new ProducerConfig(prop);

There are two types of producers − Sync and Async. The same API configuration applies to the Sync producer as well. The difference between them is that a sync producer sends messages directly, while an async producer sends messages in the background. The async producer is preferred when you want a higher throughput. In previous releases like 0.8, an async producer does not have a callback for send() to register error handlers. This is available only in the current release of 0.9.

public void close() − The Producer class provides a close method to close the producer pool connections to all Kafka brokers.

Configuration Settings

The Producer API’s main configuration settings are listed below for better understanding −

client.id − identifies the producer application.

producer.type − either sync or async.

acks − The acks config controls the criteria under which producer requests are considered complete.

retries − If a producer request fails, then automatically retry with a specific value.

bootstrap.servers − bootstrapping list of brokers.

linger.ms − if you want to reduce the number of requests, you can set linger.ms to something greater than some value.

key.serializer − Key for the serializer interface.

value.serializer − Value for the serializer interface.

batch.size − Buffer size.

buffer.memory − controls the total amount of memory available to the producer for buffering.
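Since the tutorial notes that send() accepts a user-supplied callback, the sketch below shows what registering such a callback can look like. It is an illustration under assumptions only — the topic name, broker address and string serializers are not taken from the tutorial.

import java.util.Properties;
import org.apache.kafka.clients.producer.Callback;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class CallbackProducerExample {
   public static void main(String[] args) {
      Properties props = new Properties();
      props.put("bootstrap.servers", "localhost:9092");
      props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
      props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

      KafkaProducer<String, String> producer = new KafkaProducer<>(props);
      ProducerRecord<String, String> record = new ProducerRecord<>("my-topic", "key1", "value1");

      //The callback is invoked once the record has been acknowledged by the server
      producer.send(record, new Callback() {
         public void onCompletion(RecordMetadata metadata, Exception exception) {
            if (exception != null) {
               System.out.println("Send failed: " + exception.getMessage());
            } else {
               System.out.println("Sent to partition " + metadata.partition() + " at offset " + metadata.offset());
            }
         }
      });

      producer.close();
   }
}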
ProducerRecord API

ProducerRecord is a key/value pair that is sent to the Kafka cluster. The ProducerRecord class constructor for creating a record with partition, key and value pairs uses the following signature.

public ProducerRecord (string topic, int partition, k key, v value)

Topic − user-defined topic name to which the record will be appended.

Partition − the partition to which the record should be sent.

Key − the key that will be included in the record.

Value − record contents.

public ProducerRecord (string topic, k key, v value)

This ProducerRecord class constructor is used to create a record with key and value pairs and without a partition.

Topic − a topic to assign the record to.

Key − key for the record.

Value − record contents.

public ProducerRecord (string topic, v value)

This ProducerRecord class constructor creates a record without partition and key.

Topic − a topic to assign the record to.

Value − record contents.

The ProducerRecord class methods are listed below −

public string topic() − The topic to which the record will be appended.

public K key() − The key that will be included in the record. If there is no such key, null will be returned here.

public V value() − Record contents.

partition() − The partition for the record, if one was specified.

SimpleProducer Application

Before creating the application, first start ZooKeeper and the Kafka broker, then create your own topic in the Kafka broker using the create topic command. After that, create a Java class named SimpleProducer.java and type in the following code.

//import util.properties packages
import java.util.Properties;

//import simple producer packages
import org.apache.kafka.clients.producer.Producer;

//import KafkaProducer packages
import org.apache.kafka.clients.producer.KafkaProducer;

//import ProducerRecord packages
import org.apache.kafka.clients.producer.ProducerRecord;

//Create java class named "SimpleProducer"
public class SimpleProducer {
   public static void main(String[] args) throws Exception {
      // Check arguments length value
      if(args.length == 0){
         System.out.println("Enter topic name");
         return;
      }

      //Assign topicName to string variable
      String topicName = args[0].toString();

      //Create instance for properties to access producer configs
      Properties props = new Properties();
      props.put("bootstrap.servers", "localhost:9092");   //Kafka broker address
      props.put("acks", "all");                           //acknowledgements for producer requests
      props.put("retries", 0);                            //retry count if a request fails
      props.put("batch.size", 16384);                     //buffer size
      props.put("linger.ms", 1);                          //reduce the number of requests
      props.put("buffer.memory", 33554432);               //total memory available to the producer for buffering
      props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
      props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

      Producer<String, String> producer = new KafkaProducer<String, String>(props);

      //Send ten simple messages to the given topic
      for(int i = 0; i < 10; i++)
         producer.send(new ProducerRecord<String, String>(topicName, Integer.toString(i), Integer.toString(i)));

      System.out.println("Message sent successfully");
      producer.close();
   }
}

Apache Kafka – Discussion

Apache Kafka originated at LinkedIn and later became an open-sourced Apache project in 2011, then a first-class Apache project in 2012. Kafka is written in Scala and Java. Apache Kafka is a publish-subscribe based, fault-tolerant messaging system. It is fast, scalable and distributed by design.

This tutorial explores the principles of Kafka, its installation and operations, and then walks you through the deployment of a Kafka cluster. Finally, we conclude with real-time applications and integration with Big Data technologies.

Apache Kafka – Introduction

In Big Data, an enormous volume of data is used. Regarding data, we have two main challenges. The first challenge is how to collect a large volume of data and the second challenge is to analyze the collected data. To overcome those challenges, you need a messaging system.

Kafka is designed for distributed high throughput systems. Kafka tends to work very well as a replacement for a more traditional message broker. In comparison to other messaging systems, Kafka has better throughput, built-in partitioning, replication and inherent fault-tolerance, which makes it a good fit for large-scale message processing applications.

What is a Messaging System?

A messaging system is responsible for transferring data from one application to another, so the applications can focus on the data without worrying about how to share it. Distributed messaging is based on the concept of reliable message queuing. Messages are queued asynchronously between client applications and the messaging system. Two types of messaging patterns are available − one is point to point and the other is publish-subscribe (pub-sub) messaging. Most of the messaging patterns follow pub-sub.

Point to Point Messaging System

In a point-to-point system, messages are persisted in a queue. One or more consumers can consume the messages in the queue, but a particular message can be consumed by a maximum of one consumer only. Once a consumer reads a message in the queue, it disappears from that queue. The typical example of this system is an order processing system, where each order will be processed by one order processor, but multiple order processors can work at the same time. The following diagram depicts the structure.

Publish-Subscribe Messaging System

In the publish-subscribe system, messages are persisted in a topic. Unlike the point-to-point system, consumers can subscribe to one or more topics and consume all the messages in those topics. In the publish-subscribe system, message producers are called publishers and message consumers are called subscribers. A real-life example is Dish TV, which publishes different channels like sports, movies, music, etc., and anyone can subscribe to their own set of channels and get them whenever their subscribed channels are available.

What is Kafka?

Apache Kafka is a distributed publish-subscribe messaging system and a robust queue that can handle a high volume of data and enables you to pass messages from one end-point to another. Kafka is suitable for both offline and online message consumption. Kafka messages are persisted on the disk and replicated within the cluster to prevent data loss. Kafka is built on top of the ZooKeeper synchronization service. It integrates very well with Apache Storm and Spark for real-time streaming data analysis.

Benefits

Following are a few benefits of Kafka −

Reliability − Kafka is distributed, partitioned, replicated and fault tolerant.

Scalability − The Kafka messaging system scales easily without downtime.

Durability − Kafka uses a distributed commit log, which means messages persist on disk as fast as possible; hence it is durable.

Performance − Kafka has high throughput for both publishing and subscribing messages. It maintains stable performance even when many TB of messages are stored. Kafka is very fast and guarantees zero downtime and zero data loss.

Use Cases

Kafka can be used in many use cases. Some of them are listed below −

Metrics − Kafka is often used for operational monitoring data.
This involves aggregating statistics from distributed applications to produce centralized feeds of operational data.

Log Aggregation Solution − Kafka can be used across an organization to collect logs from multiple services and make them available in a standard format to multiple consumers.

Stream Processing − Popular frameworks such as Storm and Spark Streaming read data from a topic, process it, and write the processed data to a new topic where it becomes available for users and applications. Kafka’s strong durability is also very useful in the context of stream processing.

Need for Kafka

Kafka is a unified platform for handling all the real-time data feeds. Kafka supports low-latency message delivery and gives a guarantee of fault tolerance in the presence of machine failures. It has the ability to handle a large number of diverse consumers. Kafka is very fast and performs 2 million writes per second. Kafka persists all data to the disk, which essentially means that all the writes go to the page cache of the OS (RAM). This makes it very efficient to transfer data from the page cache to a network socket.