Apache Kafka – Useful Resources

The following resources contain additional information on Apache Kafka. Please use them to get more in-depth knowledge on the topic.

Useful Video Courses

- Apache Spark Online Training Course − 47 lectures, 3.5 hours (Tutorialspoint)
- Apache Cassandra for Beginners − 28 lectures, 2 hours (Navdeep Kaur)
- NGINX, Apache, SSL Encryption – Training Course − 60 lectures, 3.5 hours (YouAccel)
- Learn Advanced Apache Kafka from Scratch − 154 lectures, 9 hours (Learnkart Technology Pvt Ltd)
- Kafka Basics and Develop Kafka Java Clients − 27 lectures, 3 hours (Narender Singh Chaudhary)
- Kafka Streams with Spring Cloud Stream − 49 lectures, 7 hours (Packt Publishing)

Apache Pig – Foreach Operator

The FOREACH operator is used to generate specified data transformations based on the column data.

Syntax

Given below is the syntax of the FOREACH operator.

grunt> Relation_name2 = FOREACH Relation_name1 GENERATE (required data);

Example

Assume that we have a file named student_details.txt in the HDFS directory /pig_data/ as shown below.

student_details.txt

001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai

And we have loaded this file into Pig with the relation name student_details as shown below.

grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);

Let us now get the id, age, and city values of each student from the relation student_details and store them in another relation named foreach_data using the FOREACH operator as shown below.

grunt> foreach_data = FOREACH student_details GENERATE id,age,city;

Verification

Verify the relation foreach_data using the DUMP operator as shown below.

grunt> Dump foreach_data;

Output

It will produce the following output, displaying the contents of the relation foreach_data.

(1,21,Hyderabad)
(2,22,Kolkata)
(3,22,Delhi)
(4,21,Pune)
(5,23,Bhuwaneshwar)
(6,23,Chennai)
(7,24,trivendram)
(8,24,Chennai)

Apache Pig – Discussion

Apache Pig is an abstraction over MapReduce. It is a tool/platform used to analyze large data sets by representing them as data flows. Pig is generally used with Hadoop; we can perform all the data manipulation operations in Hadoop using Pig.

Apache Kafka – Fundamentals

Before moving deep into Kafka, you must be aware of the main terminologies such as topics, brokers, producers and consumers. The following diagram illustrates the main terminologies, and the list below describes the diagram components in detail.

In the above diagram, a topic is configured into three partitions. Partition 1 has two offsets, 0 and 1. Partition 2 has four offsets, 0, 1, 2, and 3. Partition 3 has one offset, 0. The id of a replica is the same as the id of the server that hosts it. Assume the replication factor of the topic is set to 3; then Kafka will create 3 identical replicas of each partition and place them in the cluster to make them available for all its operations. To balance the load in the cluster, each broker stores one or more of those partitions. Multiple producers and consumers can publish and retrieve messages at the same time.

1. Topics − A stream of messages belonging to a particular category is called a topic. Data is stored in topics. Topics are split into partitions. For each topic, Kafka keeps a minimum of one partition. Each such partition contains messages in an immutable ordered sequence. A partition is implemented as a set of segment files of equal size.

2. Partition − Topics may have many partitions, so a topic can handle an arbitrary amount of data.

3. Partition offset − Each partitioned message has a unique sequence id called an offset.

4. Replicas of partition − Replicas are nothing but backups of a partition. Replicas never read or write data; they are used to prevent data loss.

5. Brokers − Brokers are simple systems responsible for maintaining the published data. Each broker may have zero or more partitions per topic. Assume there are N partitions in a topic and N brokers; then each broker will have one partition. Assume there are N partitions in a topic and more than N brokers (N + M); then the first N brokers will have one partition each and the remaining M brokers will not have any partition for that particular topic. Assume there are N partitions in a topic and fewer than N brokers (N - M); then each broker will have one or more partitions shared among them. This scenario is not recommended due to unequal load distribution among the brokers.

6. Kafka Cluster − Kafka with more than one broker is called a Kafka cluster. A Kafka cluster can be expanded without downtime. These clusters are used to manage the persistence and replication of message data.

7. Producers − Producers are the publishers of messages to one or more Kafka topics. Producers send data to Kafka brokers. Every time a producer publishes a message to a broker, the broker simply appends the message to the last segment file. Actually, the message will be appended to a partition. A producer can also send messages to a partition of its choice.

8. Consumers − Consumers read data from brokers. Consumers subscribe to one or more topics and consume published messages by pulling data from the brokers.

9. Leader − The leader is the node responsible for all reads and writes for the given partition. Every partition has one server acting as a leader.

10. Follower − A node which follows leader instructions is called a follower. If the leader fails, one of the followers will automatically become the new leader. A follower acts as a normal consumer: it pulls messages and updates its own data store.
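As a hedged illustration of these terms, the sketch below creates a topic with three partitions and a replication factor of 3 using the Java AdminClient API. Note that AdminClient is available only in kafka-clients 0.11 and later (not in the 0.9 release used elsewhere in this tutorial), and the topic name, broker address and the replication factor of 3 (which requires at least three brokers) are assumptions chosen to match the diagram, not part of the original text.

import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicExample {
   public static void main(String[] args) throws Exception {
      Properties props = new Properties();
      // Assumed broker address; adjust to your cluster
      props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

      try (AdminClient admin = AdminClient.create(props)) {
         // Topic "my-topic" with 3 partitions and replication factor 3,
         // matching the partition/replica layout described above.
         // A replication factor of 3 needs at least three brokers.
         NewTopic topic = new NewTopic("my-topic", 3, (short) 3);
         admin.createTopics(Collections.singletonList(topic)).all().get();
         System.out.println("Topic created");
      }
   }
}

Each of the three partitions then gets its own leader and followers, and consumers track their position in each partition by offset, as described in the list above.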

Apache Kafka – Simple Producer Example

Let us create an application for publishing and consuming messages using a Java client. The Kafka producer client consists of the following APIs.

KafkaProducer API

Let us understand the most important set of Kafka producer APIs in this section. The central part of the KafkaProducer API is the KafkaProducer class. The KafkaProducer class provides an option to connect to a Kafka broker in its constructor and exposes the following methods.

The KafkaProducer class provides a send method to send messages asynchronously to a topic. The signature of send() is as follows −

producer.send(new ProducerRecord<byte[],byte[]>(topic, partition, key1, value1), callback);

ProducerRecord − The producer manages a buffer of records waiting to be sent.

Callback − A user-supplied callback to execute when the record has been acknowledged by the server (null indicates no callback).

The KafkaProducer class provides a flush method to ensure all previously sent messages have actually completed. The syntax of the flush method is as follows −

public void flush()

The KafkaProducer class provides a partitionsFor method, which returns the partition metadata for a given topic. This can be used for custom partitioning. The signature of this method is as follows −

public List<PartitionInfo> partitionsFor(String topic)

It also provides a metrics method, which returns the map of internal metrics maintained by the producer −

public Map<MetricName, ? extends Metric> metrics()

public void close() − The KafkaProducer class provides a close method that blocks until all previously sent requests are completed.

Producer API

The central part of the Producer API is the Producer class. The Producer class provides an option to connect to a Kafka broker in its constructor by the following methods.

The Producer Class

The Producer class provides a send method to send messages to either single or multiple topics using the following signatures.

public void send(KeyedMessage<k,v> message) − sends the data to a single topic, partitioned by key, using either the sync or the async producer.

public void send(List<KeyedMessage<k,v>> messages) − sends data to multiple topics.

Properties prop = new Properties();
prop.put("producer.type", "async");
ProducerConfig config = new ProducerConfig(prop);

There are two types of producers − Sync and Async. The same API configuration applies to the Sync producer as well. The difference between them is that a sync producer sends messages directly, while an async producer sends messages in the background. The async producer is preferred when you want higher throughput. In previous releases like 0.8, the async producer did not have a callback for send() to register error handlers; this is available only in the current release, 0.9.

public void close() − The Producer class provides a close method to close the producer pool connections to all Kafka brokers.

Configuration Settings

The Producer API's main configuration settings are listed below for better understanding −

1. client.id − identifies the producer application.

2. producer.type − either sync or async.

3. acks − controls the criteria under which producer requests are considered complete.

4. retries − if a producer request fails, automatically retry the specified number of times.

5. bootstrap.servers − bootstrapping list of brokers.

6. linger.ms − if you want to reduce the number of requests, you can set linger.ms to something greater than some value.

7. key.serializer − serializer class for the record key.

8. value.serializer − serializer class for the record value.

9. batch.size − buffer size.

10. buffer.memory − controls the total amount of memory available to the producer for buffering.
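To tie the methods and configuration settings above together, here is a minimal sketch of configuring and using a KafkaProducer with the org.apache.kafka.clients API, including an acknowledgement callback and a flush before closing. The broker address, topic name and the specific values chosen for acks, retries, batch.size, linger.ms and buffer.memory are assumptions for illustration, not values mandated by the tutorial.

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ProducerConfigExample {
   public static void main(String[] args) {
      Properties props = new Properties();
      props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
      props.put("acks", "all");                          // wait for full acknowledgement
      props.put("retries", 0);                           // do not retry failed requests
      props.put("batch.size", 16384);                    // batch buffer size in bytes
      props.put("linger.ms", 1);                         // small delay to allow batching
      props.put("buffer.memory", 33554432);              // total producer buffer memory
      props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
      props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

      Producer<String, String> producer = new KafkaProducer<>(props);

      // send() is asynchronous; the callback runs once the broker acknowledges the record
      producer.send(new ProducerRecord<>("my-topic", "key-1", "value-1"),
         (metadata, exception) -> {
            if (exception != null) {
               exception.printStackTrace();
            } else {
               System.out.println("Sent to partition " + metadata.partition()
                  + " at offset " + metadata.offset());
            }
         });

      producer.flush();  // block until all buffered records have been sent
      producer.close();  // release the producer's resources
   }
}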
ProducerRecord API

ProducerRecord is a key/value pair that is sent to the Kafka cluster. The ProducerRecord class provides a constructor for creating a record with partition, key and value using the following signature.

public ProducerRecord (string topic, int partition, k key, v value)

Topic − user-defined topic name that will be appended to the record.

Partition − the partition to which the record should be sent.

Key − the key that will be included in the record.

Value − record contents.

public ProducerRecord (string topic, k key, v value)

This constructor is used to create a record with key and value but without a partition.

Topic − topic to assign the record to.

Key − key for the record.

Value − record contents.

public ProducerRecord (string topic, v value)

This constructor creates a record without a partition and key.

Topic − topic to assign the record to.

Value − record contents.

The ProducerRecord class methods are listed below −

1. public string topic() − topic that will be appended to the record.

2. public K key() − key that will be included in the record. If there is no such key, null will be returned.

3. public V value() − record contents.

4. partition() − partition to which the record will be sent, if one was specified.

SimpleProducer application

Before creating the application, first start ZooKeeper and the Kafka broker, then create your own topic in the Kafka broker using the create topic command. After that, create a Java class named SimpleProducer.java and type in the following code (the listing breaks off partway through in this extract; a complete sketch follows after it).

//import util.properties packages
import java.util.Properties;

//import simple producer packages
import org.apache.kafka.clients.producer.Producer;

//import KafkaProducer packages
import org.apache.kafka.clients.producer.KafkaProducer;

//import ProducerRecord packages
import org.apache.kafka.clients.producer.ProducerRecord;

//Create java class named "SimpleProducer"
public class SimpleProducer {

   public static void main(String[] args) throws Exception{

      // Check arguments length value
      if(args.length == 0){
         System.out.println("Enter
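Since the SimpleProducer listing above is cut short, here is a minimal, self-contained sketch of what such a producer typically looks like, reconstructed under assumptions: it reads the topic name from the command line, connects to a broker assumed to run on localhost:9092, and publishes ten string messages. Treat the configuration values and message contents as illustrative rather than as the tutorial's original code.

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SimpleProducer {

   public static void main(String[] args) throws Exception {

      // Check arguments length value
      if (args.length == 0) {
         System.out.println("Enter topic name");
         return;
      }

      // Assign the topic name passed on the command line
      String topicName = args[0];

      // Create an instance of Properties to hold the producer configs
      Properties props = new Properties();
      props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
      props.put("acks", "all");
      props.put("retries", 0);
      props.put("batch.size", 16384);
      props.put("linger.ms", 1);
      props.put("buffer.memory", 33554432);
      props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
      props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

      Producer<String, String> producer = new KafkaProducer<>(props);

      // Publish ten messages whose key and value are the loop index
      for (int i = 0; i < 10; i++) {
         producer.send(new ProducerRecord<>(topicName, Integer.toString(i), Integer.toString(i)));
      }
      System.out.println("Message sent successfully");

      producer.close();
   }
}

Compile it against the kafka-clients jar and run it as, for example, java SimpleProducer my-topic while the broker started in the installation chapter is running.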

Apache Kafka – Discussion

Apache Kafka originated at LinkedIn and later became an open-sourced Apache project in 2011, and then a first-class Apache project in 2012. Kafka is written in Scala and Java. Apache Kafka is a publish-subscribe based, fault-tolerant messaging system. It is fast, scalable and distributed by design. This tutorial explores the principles of Kafka, its installation and operations, and then walks you through the deployment of a Kafka cluster. Finally, we conclude with real-time applications and integration with Big Data technologies.

Apache Flume – Discussion

Flume is a standard, simple, robust, flexible, and extensible tool for data ingestion from various data producers (such as web servers) into Hadoop. In this tutorial, we use simple and illustrative examples to explain the basics of Apache Flume and how to use it in practice.

Apache Flume – Useful Resources

The following resources contain additional information on Apache Flume. Please use them to get more in-depth knowledge on the topic.

Useful Video Courses

- Apache Spark Online Training Course − 47 lectures, 3.5 hours (Tutorialspoint)
- Delta Lake with Apache Spark using Scala − 53 lectures, 2 hours (Bigdata Engineer)
- Apache Spark with Scala for Certified Databricks Professional − 78 lectures, 5.5 hours (Bigdata Engineer)
- Apache Cassandra for Beginners − 28 lectures, 2 hours (Navdeep Kaur)
- NGINX, Apache, SSL Encryption – Training Course − 60 lectures, 3.5 hours (YouAccel)
- Learn Advanced Apache Kafka from Scratch − 154 lectures, 9 hours (Learnkart Technology Pvt Ltd)

Apache Kafka – Applications

Kafka supports many of today's best industrial applications. We will provide a very brief overview of some of the most notable applications of Kafka in this chapter.

Twitter

Twitter is an online social networking service that provides a platform to send and receive user tweets. Registered users can read and post tweets, but unregistered users can only read tweets. Twitter uses Storm-Kafka as a part of their stream processing infrastructure.

LinkedIn

Apache Kafka is used at LinkedIn for activity stream data and operational metrics. The Kafka messaging system helps LinkedIn with various products like LinkedIn Newsfeed and LinkedIn Today for online message consumption, in addition to offline analytics systems like Hadoop. Kafka's strong durability is also one of the key factors behind its use at LinkedIn.

Netflix

Netflix is an American multinational provider of on-demand Internet streaming media. Netflix uses Kafka for real-time monitoring and event processing.

Mozilla

Mozilla is a free-software community, created in 1998 by members of Netscape. Kafka will soon be replacing a part of Mozilla's current production system to collect performance and usage data from the end user's browser for projects like Telemetry, Test Pilot, etc.

Oracle

Oracle provides native connectivity to Kafka from its Enterprise Service Bus product called OSB (Oracle Service Bus), which allows developers to leverage OSB's built-in mediation capabilities to implement staged data pipelines.

Apache Kafka – Installation Steps

Following are the steps for installing Java, ZooKeeper, and Kafka on your machine.

Step 1 − Verifying Java Installation

Hopefully you have already installed Java on your machine, so you can verify it using the following command.

$ java -version

If Java is successfully installed on your machine, you will see the version of the installed Java.

Step 1.1 − Download JDK

If Java is not installed, download the latest version of the JDK from the following link.

http://www.oracle.com/technetwork/java/javase/downloads/index.html

At the time of writing, the latest version is JDK 8u60 and the file is "jdk-8u60-linux-x64.tar.gz". Please download the file onto your machine.

Step 1.2 − Extract Files

Generally, downloaded files are stored in the downloads folder. Verify it and extract the tar setup using the following commands.

$ cd /go/to/download/path
$ tar -zxf jdk-8u60-linux-x64.tar.gz

Step 1.3 − Move to Opt Directory

To make Java available to all users, move the extracted Java content to the /opt/jdk/ folder.

$ su
password: (type password of root user)
$ mkdir /opt/jdk
$ mv jdk1.8.0_60 /opt/jdk/

Step 1.4 − Set Path

To set the PATH and JAVA_HOME variables, add the following commands to the ~/.bashrc file.

export JAVA_HOME=/opt/jdk/jdk1.8.0_60
export PATH=$PATH:$JAVA_HOME/bin

Now apply all the changes to the currently running system.

$ source ~/.bashrc

Step 1.5 − Java Alternatives

Use the following command to change the Java alternatives.

update-alternatives --install /usr/bin/java java /opt/jdk/jdk1.8.0_60/bin/java 100

Step 1.6 − Now verify Java using the verification command (java -version) explained in Step 1.

Step 2 − ZooKeeper Framework Installation

Step 2.1 − Download ZooKeeper

To install the ZooKeeper framework on your machine, visit the following link and download the latest version of ZooKeeper.

http://zookeeper.apache.org/releases.html

As of now, the latest version of ZooKeeper is 3.4.6 (zookeeper-3.4.6.tar.gz).

Step 2.2 − Extract the tar file

Extract the tar file using the following commands.

$ cd /opt/
$ tar -zxf zookeeper-3.4.6.tar.gz
$ cd zookeeper-3.4.6
$ mkdir data

Step 2.3 − Create the Configuration File

Open the configuration file named conf/zoo.cfg using the command vi conf/zoo.cfg and set all the following parameters as a starting point.

$ vi conf/zoo.cfg

tickTime=2000
dataDir=/path/to/zookeeper/data
clientPort=2181
initLimit=5
syncLimit=2

Once the configuration file has been saved and you have returned to the terminal, you can start the ZooKeeper server.

Step 2.4 − Start the ZooKeeper Server

$ bin/zkServer.sh start

After executing this command, you will get a response as shown below −

$ JMX enabled by default
$ Using config: /Users/../zookeeper-3.4.6/bin/../conf/zoo.cfg
$ Starting zookeeper ... STARTED

Step 2.5 − Start the CLI

$ bin/zkCli.sh

After typing the above command, you will be connected to the ZooKeeper server and will get the response below.

Connecting to localhost:2181
.................
Welcome to ZooKeeper!
.................
WATCHER::
WatchedEvent state:SyncConnected type:None path:null
[zk: localhost:2181(CONNECTED) 0]

Step 2.6 − Stop the ZooKeeper Server

After connecting to the server and performing all the operations, you can stop the ZooKeeper server with the following command −

$ bin/zkServer.sh stop

Now you have successfully installed Java and ZooKeeper on your machine. Let us see the steps to install Apache Kafka.

Step 3 − Apache Kafka Installation

Let us continue with the following steps to install Kafka on your machine.
Step 3.1 − Download Kafka

To install Kafka on your machine, download it from the following link −

https://www.apache.org/dyn/closer.cgi?path=/kafka/0.9.0.0/kafka_2.11-0.9.0.0.tgz

The latest version at the time of writing, kafka_2.11-0.9.0.0.tgz, will be downloaded onto your machine.

Step 3.2 − Extract the tar file

Extract the tar file using the following commands −

$ cd /opt/
$ tar -zxf kafka_2.11-0.9.0.0.tgz
$ cd kafka_2.11-0.9.0.0

Now you have the latest version of Kafka on your machine.

Step 3.3 − Start the Server

You can start the server by giving the following command −

$ bin/kafka-server-start.sh config/server.properties

After the server starts, you will see a response like the one below on your screen −

$ bin/kafka-server-start.sh config/server.properties
[2016-01-02 15:37:30,410] INFO KafkaConfig values:
   request.timeout.ms = 30000
   log.roll.hours = 168
   inter.broker.protocol.version = 0.9.0.X
   log.preallocate = false
   security.inter.broker.protocol = PLAINTEXT
   .......................................

Step 4 − Stop the Server

After performing all the operations, you can stop the server using the following command −

$ bin/kafka-server-stop.sh config/server.properties

Now that we have discussed the Kafka installation, we can learn how to perform basic operations on Kafka in the next chapter.