Sqoop – Quick Guide
Sqoop – Introduction

The traditional application management system, that is, the interaction of applications with relational databases through an RDBMS, is one of the sources that generate Big Data. Such Big Data, generated by the RDBMS, is stored in relational database servers in the relational database structure.

When Big Data storage and analysis tools of the Hadoop ecosystem, such as MapReduce, Hive, HBase, Cassandra, and Pig, came into the picture, they required a tool to interact with relational database servers to import and export the Big Data residing in them. Sqoop occupies this place in the Hadoop ecosystem, providing feasible interaction between relational database servers and Hadoop’s HDFS.

Sqoop − “SQL to Hadoop and Hadoop to SQL”

Sqoop is a tool designed to transfer data between Hadoop and relational database servers. It is used to import data from relational databases such as MySQL and Oracle into Hadoop HDFS, and to export data from the Hadoop file system back to relational databases. It is provided by the Apache Software Foundation.

How Does Sqoop Work?
The following image describes the workflow of Sqoop.

Sqoop Import
The import tool imports individual tables from an RDBMS to HDFS. Each row of a table is treated as a record in HDFS. All records are stored as text data in text files or as binary data in Avro and Sequence files.

Sqoop Export
The export tool exports a set of files from HDFS back to an RDBMS. The files given as input to Sqoop contain records, which are called rows in the table. These are read and parsed into a set of records, delimited with a user-specified delimiter.

Sqoop – Installation

As Sqoop is a sub-project of Hadoop, it works only on the Linux operating system. Follow the steps given below to install Sqoop on your system.

Step 1: Verifying Java Installation
You need to have Java installed on your system before installing Sqoop. Verify the Java installation using the following command −

$ java -version

If Java is already installed on your system, you will see the following response −

java version "1.7.0_71"
Java(TM) SE Runtime Environment (build 1.7.0_71-b13)
Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)

If Java is not installed on your system, then follow the steps given below.

Installing Java
Follow the simple steps given below to install Java on your system.

Step 1
Download Java (JDK <latest version> – X64.tar.gz) by visiting the following link. The file jdk-7u71-linux-x64.tar.gz will then be downloaded onto your system.

Step 2
Generally, you will find the downloaded Java file in the Downloads folder. Verify it and extract the jdk-7u71-linux-x64.tar.gz file using the following commands.

$ cd Downloads/
$ ls
jdk-7u71-linux-x64.tar.gz
$ tar zxf jdk-7u71-linux-x64.tar.gz
$ ls
jdk1.7.0_71  jdk-7u71-linux-x64.tar.gz

Step 3
To make Java available to all users, move it to the location “/usr/local/”. Open root and type the following commands.

$ su
password:
# mv jdk1.7.0_71 /usr/local/java
# exit

Step 4
To set up the PATH and JAVA_HOME variables, add the following commands to the ~/.bashrc file.

export JAVA_HOME=/usr/local/java
export PATH=$PATH:$JAVA_HOME/bin

Now apply all the changes to the currently running system.
$ source ~/.bashrc

Step 5
Use the following commands to configure Java alternatives −

# alternatives --install /usr/bin/java java /usr/local/java/bin/java 2
# alternatives --install /usr/bin/javac javac /usr/local/java/bin/javac 2
# alternatives --install /usr/bin/jar jar /usr/local/java/bin/jar 2
# alternatives --set java /usr/local/java/bin/java
# alternatives --set javac /usr/local/java/bin/javac
# alternatives --set jar /usr/local/java/bin/jar

Now verify the installation using the command java -version from the terminal as explained above.

Step 2: Verifying Hadoop Installation
Hadoop must be installed on your system before installing Sqoop. Verify the Hadoop installation using the following command −

$ hadoop version

If Hadoop is already installed on your system, then you will get the following response −

Hadoop 2.4.1
Subversion https://svn.apache.org/repos/asf/hadoop/common -r 1529768
Compiled by hortonmu on 2013-10-07T06:28Z
Compiled with protoc 2.5.0
From source with checksum 79e53ce7994d1628b240f09af91e1af4

If Hadoop is not installed on your system, then proceed with the following steps −

Downloading Hadoop
Download and extract Hadoop 2.4.1 from the Apache Software Foundation using the following commands.

$ su
password:
# cd /usr/local
# wget http://apache.claz.org/hadoop/common/hadoop-2.4.1/hadoop-2.4.1.tar.gz
# tar xzf hadoop-2.4.1.tar.gz
# mv hadoop-2.4.1/* hadoop/
# exit

Installing Hadoop in Pseudo-Distributed Mode
Follow the steps given below to install Hadoop 2.4.1 in pseudo-distributed mode.

Step 1: Setting up Hadoop
You can set the Hadoop environment variables by appending the following commands to the ~/.bashrc file.

export HADOOP_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin

Now, apply all the changes to the currently running system.

$ source ~/.bashrc

Step 2: Hadoop Configuration
You can find all the Hadoop configuration files in the location “$HADOOP_HOME/etc/hadoop”. You need to make suitable changes in those configuration files according to your Hadoop infrastructure.

$ cd $HADOOP_HOME/etc/hadoop

In order to develop Hadoop programs using Java, you have to reset the Java environment variables in the hadoop-env.sh file by replacing the JAVA_HOME value with the location of Java on your system.

export JAVA_HOME=/usr/local/java

Given below is the list of files that you need to edit to configure Hadoop.

core-site.xml
The core-site.xml file contains information such as the port number used for the Hadoop instance, the memory allocated for the file system, the memory limit for storing the data, and the size of the read/write buffers. Open core-site.xml and add the following properties between the <configuration> and </configuration> tags.

<configuration>
   <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:9000</value>
   </property>
</configuration>

hdfs-site.xml
The hdfs-site.xml file contains information such as the data replication value, the namenode path, and the datanode path of your local file systems, that is, the place where you want to store the Hadoop infrastructure. Let us assume the following data.

dfs.replication (data replication value) = 1

(In the following path, /hadoop/ is the user name. hadoopinfra/hdfs/namenode is the directory created by the HDFS file system.)
namenode path = //home/hadoop/hadoopinfra/hdfs/namenode

(hadoopinfra/hdfs/datanode is the directory created by the HDFS file system.)

datanode path = //home/hadoop/hadoopinfra/hdfs/datanode

Open this file and add the following properties between the <configuration> and </configuration> tags.
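A minimal sketch of these properties, assuming the replication value and the namenode/datanode paths described above (dfs.name.dir and dfs.data.dir are the property names commonly used in Hadoop 2.x-era guides; adjust them to your Hadoop version):

<configuration>
   <!-- replication factor and local namenode/datanode directories assumed from the text above -->
   <property>
      <name>dfs.replication</name>
      <value>1</value>
   </property>
   <property>
      <name>dfs.name.dir</name>
      <value>file:///home/hadoop/hadoopinfra/hdfs/namenode</value>
   </property>
   <property>
      <name>dfs.data.dir</name>
      <value>file:///home/hadoop/hadoopinfra/hdfs/datanode</value>
   </property>
</configuration>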
Tableau – Trend Lines
Trend lines are used to predict the continuation of a certain trend of a variable. They also help to identify the correlation between two variables by observing their trends simultaneously. There are many mathematical models for establishing trend lines. Tableau provides four options: Linear, Logarithmic, Exponential, and Polynomial. In this chapter, only the linear model is discussed.

Tableau takes a time dimension and a measure field to create a trend line.

Creating a Trend Line
Using the Sample-superstore data source, find the trend in the value of the measure Sales for the next year. Follow the steps below to achieve this objective.

Step 1 − Drag the dimension Order Date to the Columns shelf and the measure Sales to the Rows shelf. Choose Line chart as the chart type. In the Analysis menu, go to Model → Trend Line. Clicking Trend Line pops up an option showing the different types of trend lines that can be added. Choose the Linear model as shown in the following screenshot.

Step 2 − On completing the above step, you will get the trend lines. The chart also shows the mathematical expression for the correlation between the fields, the p-value, and the R-squared value.

Describing the Trend Line
Right-click on the chart and select the option Describe Trend Line to get a detailed description of the trend line. It shows the coefficients, the intercept value, and the equation. These details can also be copied to the clipboard and used in further analysis.
Tableau – Histogram
A histogram represents the frequencies of the values of a variable bucketed into ranges. A histogram is similar to a bar chart, but it groups the values into continuous ranges. The height of each bar in a histogram represents the number of values present in that range.

Tableau creates a histogram by taking one measure. It creates an additional bin field for the measure used in creating the histogram.

Creating a Histogram
Using the Sample-superstore data source, plan to find the quantities of sales for different regions. To achieve this, drag the measure named Quantity to the Rows shelf. Then open Show Me and select the Histogram chart. The following diagram shows the chart created. It shows the quantities automatically bucketed into values ranging from 0 to 4811 and divided into 12 bins.

Creating a Histogram with a Dimension
You can also add dimensions to measures to create a histogram. This creates a stacked histogram, where each bar has stacks representing the values of the dimension. Following the steps of the above example, add the Region dimension to the Color shelf under the Marks card. This creates the following histogram, where each bar also includes the visualization for the different regions.
Tableau – Waterfall Charts
Waterfall charts effectively display the cumulative effect of sequential positive and negative values. They show where a value starts, where it ends, and how it gets there incrementally. So, we are able to see both the size of the changes and the difference in values between consecutive data points.

Tableau needs one dimension and one measure to create a waterfall chart.

Creating a Waterfall Chart
Using the Sample-superstore data source, plan to find the variation of Sales for each Sub-Category of products. Follow the steps below to achieve this objective.

Step 1 − Drag the dimension Sub-Category to the Columns shelf and the measure Sales to the Rows shelf. Sort the data in ascending order of sales value. For this, use the sort option that appears in the middle of the vertical axis when you hover the mouse over it. The following chart appears on completing this step.

Step 2 − Next, right-click on the SUM(Sales) value and select Running Total from the table calculation options. Change the chart type to Gantt Bar. The following chart appears.

Step 3 − Create a calculated field named -sales and give it the following formula, which is simply the negation of the Sales measure: -[Sales].

Step 4 − Drag the newly created calculated field (-sales) to the Size shelf under the Marks card. The chart above now changes to produce the following chart, which is a waterfall chart.

Waterfall Chart with Color
Next, give different color shades to the bars in the chart by dragging the Sales measure to the Color shelf under the Marks card. You get the following waterfall chart with color.
Zookeeper – Fundamentals
Before going deep into the working of ZooKeeper, let us take a look at its fundamental concepts. We will discuss the following topics in this chapter −

Architecture
Hierarchical namespace
Session
Watches

Architecture of ZooKeeper
Take a look at the following diagram. It depicts the “client-server architecture” of ZooKeeper. Each of the components that is part of the ZooKeeper architecture is explained below.

Client − Clients, nodes in our distributed application cluster, access information from the server. At a particular time interval, every client sends a message to the server to let the server know that the client is alive. Similarly, the server sends an acknowledgement when a client connects. If there is no response from the connected server, the client automatically redirects the message to another server.

Server − A server, one of the nodes in our ZooKeeper ensemble, provides all the services to clients. It sends an acknowledgement to the client to inform it that the server is alive.

Ensemble − A group of ZooKeeper servers. The minimum number of nodes required to form an ensemble is 3.

Leader − The server node that performs automatic recovery if any of the connected nodes fails. Leaders are elected on service startup.

Follower − A server node that follows the leader’s instructions.

Hierarchical Namespace
The following diagram depicts the tree structure of the ZooKeeper file system used for memory representation. A ZooKeeper node is referred to as a znode. Every znode is identified by a name and separated by a sequence of path separators (/).

In the diagram, first you have a root znode separated by “/”. Under root, you have two logical namespaces, config and workers. The config namespace is used for centralized configuration management and the workers namespace is used for naming.

Under the config namespace, each znode can store up to 1 MB of data. This is similar to the UNIX file system except that the parent znode can store data as well. The main purpose of this structure is to store synchronized data and describe the metadata of the znode. This structure is called the ZooKeeper Data Model.

Every znode in the ZooKeeper data model maintains a stat structure. A stat simply provides the metadata of a znode. It consists of a version number, an access control list (ACL), a timestamp, and the data length.

Version number − Every znode has a version number, which means that every time the data associated with the znode changes, its corresponding version number also increases. The version number is important when multiple ZooKeeper clients try to perform operations on the same znode.

Access Control List (ACL) − The ACL is basically an authentication mechanism for accessing the znode. It governs all the znode read and write operations.

Timestamp − The timestamp represents the time elapsed since znode creation and modification. It is usually represented in milliseconds. ZooKeeper identifies every change to a znode by its “transaction ID” (zxid). The zxid is unique and maintains time for each transaction, so that you can easily identify the time elapsed from one request to another.

Data length − The total amount of data stored in a znode is the data length. You can store a maximum of 1 MB of data.

Types of Znodes
Znodes are categorized as persistent, sequential, and ephemeral.

Persistent znode − A persistent znode stays alive even after the client that created that particular znode disconnects.
By default, all znodes are persistent unless otherwise specified.

Ephemeral znode − Ephemeral znodes are active only as long as the client is alive. When a client gets disconnected from the ZooKeeper ensemble, its ephemeral znodes get deleted automatically. For this reason, ephemeral znodes are not allowed to have children. If an ephemeral znode is deleted, then the next suitable node will fill its position. Ephemeral znodes play an important role in leader election.

Sequential znode − Sequential znodes can be either persistent or ephemeral. When a new znode is created as a sequential znode, ZooKeeper sets the path of the znode by attaching a 10-digit sequence number to the original name. For example, if a znode with path /myapp is created as a sequential znode, ZooKeeper will change the path to /myapp0000000001 and set the next sequence number to 0000000002. If two sequential znodes are created concurrently, ZooKeeper never assigns the same number to both znodes. Sequential znodes play an important role in locking and synchronization.

Sessions
Sessions are very important for the operation of ZooKeeper. Requests in a session are executed in FIFO order. Once a client connects to a server, a session is established and a session ID is assigned to the client.

The client sends heartbeats at a particular time interval to keep the session valid. If the ZooKeeper ensemble does not receive heartbeats from a client for more than the period (session timeout) specified at the start of the service, it decides that the client has died. Session timeouts are usually represented in milliseconds. When a session ends for any reason, the ephemeral znodes created during that session also get deleted.

Watches
Watches are a simple mechanism for a client to get notifications about changes in the ZooKeeper ensemble. Clients can set watches while reading a particular znode. Watches send a notification to the registered client for any change to the znode on which the client registered. Znode changes are modifications of the data associated with the znode or changes in the znode’s children. Watches are triggered only once. If a client wants a notification again, it must set the watch through another read operation. When a connection session expires, the client is disconnected from the server and the associated watches are also removed.
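The following ZooKeeper CLI (zkCli.sh) session is a minimal sketch of these znode types and the stat structure; the paths and data values are made-up examples, and the exact output format varies between ZooKeeper versions.

$ bin/zkCli.sh -server localhost:2181   # assumes a server is running locally on port 2181

# a plain create makes a persistent znode that survives client disconnects
create /config "app-settings"

# -e makes an ephemeral znode tied to this session (parent /workers must already exist)
create -e /workers/worker1 "192.168.1.10"

# -s makes a sequential znode; ZooKeeper appends a 10-digit counter,
# e.g. /config/job-0000000000
create -s /config/job- "payload"

# show the stat structure (version, zxid, data length, ...) of a znode
stat /config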
Zookeeper – Leader Election
Let us analyze how a leader node can be elected in a ZooKeeper ensemble. Consider that there are N nodes in a cluster. The process of leader election is as follows −

All the nodes create a sequential, ephemeral znode with the same path, /app/leader_election/guid_.

The ZooKeeper ensemble appends a 10-digit sequence number to the path, so the znodes created will be /app/leader_election/guid_0000000001, /app/leader_election/guid_0000000002, and so on.

For a given instance, the node that creates the znode with the smallest sequence number becomes the leader and all the other nodes are followers.

Each follower node watches the znode with the next smallest number. For example, the node that creates the znode /app/leader_election/guid_0000000008 watches the znode /app/leader_election/guid_0000000007, and the node that creates the znode /app/leader_election/guid_0000000007 watches the znode /app/leader_election/guid_0000000006.

If the leader goes down, then its corresponding znode /app/leader_election/guid_N gets deleted.

The next-in-line follower node gets the notification about the leader’s removal through its watcher.

The next-in-line follower node then checks whether there are other znodes with a smaller number. If there are none, it assumes the role of the leader. Otherwise, it treats the node that created the znode with the smallest number as the leader.

Similarly, all other follower nodes elect the node that created the znode with the smallest number as the leader.

Leader election is a complex process when it is done from scratch, but the ZooKeeper service makes it very simple; the CLI sketch below illustrates the recipe. After that, let us move on to the installation of ZooKeeper for development purposes in the next chapter.
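A hedged sketch of the same recipe using the ZooKeeper CLI (zkCli.sh); the election path and data are illustrative, and a real implementation would use the client API with watches rather than manual commands.

# each candidate creates an ephemeral, sequential znode under the election path
# (assumes /app and /app/leader_election already exist)
create -s -e /app/leader_election/guid_ "candidate-1"
# ZooKeeper answers with the created path, e.g. /app/leader_election/guid_0000000001

# list the candidates; the creator of the smallest sequence number is the leader
ls /app/leader_election

# each follower then watches the znode immediately preceding its own, for example
# stat -w /app/leader_election/guid_0000000001   (watch flag syntax varies by CLI version)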
Zookeeper – Overview
ZooKeeper is a distributed coordination service for managing a large set of hosts. Coordinating and managing a service in a distributed environment is a complicated process. ZooKeeper solves this issue with its simple architecture and API. ZooKeeper allows developers to focus on core application logic without worrying about the distributed nature of the application.

The ZooKeeper framework was originally built at Yahoo! for accessing their applications in an easy and robust manner. Later, Apache ZooKeeper became a standard for organized services used by Hadoop, HBase, and other distributed frameworks. For example, Apache HBase uses ZooKeeper to track the status of distributed data. Before moving further, it is important that we know a thing or two about distributed applications. So, let us start the discussion with a quick overview of distributed applications.

Distributed Application
A distributed application can run on multiple systems in a network at a given time (simultaneously) by coordinating among themselves to complete a particular task in a fast and efficient manner. Normally, complex and time-consuming tasks, which would take hours to complete on a non-distributed application (running on a single system), can be done in minutes by a distributed application using the computing capabilities of all the systems involved.

The time to complete the task can be further reduced by configuring the distributed application to run on more systems. A group of systems on which a distributed application runs is called a cluster, and each machine running in a cluster is called a node.

A distributed application has two parts, the server application and the client application. Server applications are actually distributed and have a common interface so that clients can connect to any server in the cluster and get the same result. Client applications are the tools used to interact with a distributed application.

Benefits of Distributed Applications
Reliability − Failure of a single system or a few systems does not make the whole system fail.
Scalability − Performance can be increased as and when needed by adding more machines, with minor changes in the configuration of the application and no downtime.
Transparency − Hides the complexity of the system and presents itself as a single entity/application.

Challenges of Distributed Applications
Race condition − Two or more machines trying to perform a particular task which actually needs to be done only by a single machine at any given time. For example, shared resources should only be modified by a single machine at any given time.
Deadlock − Two or more operations waiting for each other to complete indefinitely.
Inconsistency − Partial failure of data.

What is Apache ZooKeeper Meant For?
Apache ZooKeeper is a service used by a cluster (group of nodes) to coordinate among themselves and maintain shared data with robust synchronization techniques. ZooKeeper is itself a distributed application providing services for writing distributed applications.

The common services provided by ZooKeeper are as follows −
Naming service − Identifying the nodes in a cluster by name. It is similar to DNS, but for nodes.
Configuration management − The latest and up-to-date configuration information of the system for a joining node.
Cluster management − Joining/leaving of a node in a cluster and node status in real time.
Leader election − Electing a node as leader for coordination purposes.
Locking and synchronization service − Locking the data while modifying it.
This mechanism helps in automatic failure recovery while connecting other distributed applications like Apache HBase.
Highly reliable data registry − Availability of data even when one or a few nodes are down.

Distributed applications offer a lot of benefits, but they also pose a few complex and hard-to-crack challenges. The ZooKeeper framework provides a complete mechanism to overcome all of these challenges. Race conditions and deadlocks are handled using a fail-safe synchronization approach. Another main drawback is inconsistency of data, which ZooKeeper resolves with atomicity.

Benefits of ZooKeeper
Here are the benefits of using ZooKeeper −
Simple distributed coordination process
Synchronization − Mutual exclusion and co-operation between server processes. This process helps Apache HBase with configuration management.
Ordered messages
Serialization − Encode the data according to specific rules and ensure your application runs consistently. This approach can be used in MapReduce to coordinate queues and execute running threads.
Reliability
Atomicity − A data transfer either succeeds or fails completely; no transaction is partial.
Statistics – Stratified Sampling
This sampling strategy is used in situations where the population can easily be partitioned into groups, or strata, which are distinctly different from one another, while the elements within a group are homogeneous with respect to some characteristics. For example, the students of a college can be divided into strata on the basis of gender, courses offered, age, and so on. The population is first divided into strata and then a simple random sample is taken from each stratum. Stratified sampling is of two types: proportionate stratified sampling and disproportionate stratified sampling.

Proportionate Stratified Sampling − Here the number of units selected from each stratum is proportionate to the share of the stratum in the population. For example, a college has 2500 students in total, of which 1500 are enrolled in undergraduate courses and 1000 in postgraduate courses. If a sample of 100 is to be chosen using proportionate stratified sampling, then the sample would contain 60 undergraduate students and 40 postgraduate students. Thus the two strata are represented in the sample in the same proportion as in the population. This method is most suitable when the purpose of sampling is to estimate the population value of some characteristic and there is no difference in within-stratum variances.

Disproportionate Stratified Sampling − When the purpose of the study is to compare the differences among strata, it becomes necessary to draw equal units from all strata irrespective of their share in the population. Sometimes some strata are more variable with respect to some characteristic than others; in such a case a larger number of units may be drawn from the more variable strata. In both situations the sample drawn is a disproportionate stratified sample.

The difference in stratum size and stratum variability can be optimally allocated using the following formula for determining the sample size from different strata.

Formula

${n_i = \frac{n \cdot N_i\sigma_i}{N_1\sigma_1 + N_2\sigma_2 + \cdots + N_k\sigma_k}\ for\ i = 1, 2, \ldots, k}$

Where −
${n_i}$ = the sample size of the i-th stratum
${n}$ = the total sample size
${N_i}$ = the size of the i-th stratum
${\sigma_i}$ = the standard deviation of the i-th stratum

In addition, there might be a situation where the cost of collecting a sample is higher in one stratum than in another. Optimal disproportionate sampling should then allocate the sample in proportion to ${\frac{N_i\sigma_i}{\sqrt{c_i}}}$, that is,

${\frac{n_1\sqrt{c_1}}{N_1\sigma_1} = \frac{n_2\sqrt{c_2}}{N_2\sigma_2} = \cdots = \frac{n_k\sqrt{c_k}}{N_k\sigma_k}}$

Where ${c_1, c_2, \ldots, c_k}$ refer to the cost of sampling in the k strata. The sample size for each stratum can then be determined using the following formula:

${n_i = \frac{\frac{n \cdot N_i\sigma_i}{\sqrt{c_i}}}{\frac{N_1\sigma_1}{\sqrt{c_1}}+\frac{N_2\sigma_2}{\sqrt{c_2}}+\cdots+\frac{N_k\sigma_k}{\sqrt{c_k}}}\ for\ i = 1, 2, \ldots, k}$

Example
Problem Statement:
An organisation has 5000 employees who have been stratified into three levels.
Stratum A: 50 executives with standard deviation = 9
Stratum B: 1250 non-manual workers with standard deviation = 4
Stratum C: 3700 manual workers with standard deviation = 1
How should a sample of 300 employees be drawn on a disproportionate basis with optimum allocation?

Solution:
Using the formula of disproportionate sampling for optimum allocation (sampling costs are taken as equal across strata),
${n_i = \frac{n \cdot N_i\sigma_i}{N_1\sigma_1+N_2\sigma_2+N_3\sigma_3}}$

For Stratum A: ${n_1 = \frac{300(50)(9)}{(50)(9)+(1250)(4)+(3700)(1)} = \frac{135000}{9150} = 14.75}$, or say 15.

For Stratum B: ${n_2 = \frac{300(1250)(4)}{(50)(9)+(1250)(4)+(3700)(1)} = \frac{1500000}{9150} = 163.93}$, or say 164.

For Stratum C: ${n_3 = \frac{300(3700)(1)}{(50)(9)+(1250)(4)+(3700)(1)} = \frac{1110000}{9150} = 121.31}$, or say 121.
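For comparison (this worked check is an addition to the original example), a purely proportionate allocation of the same 300 employees would ignore the stratum standard deviations:

${n_A = 300 \times \frac{50}{5000} = 3,\quad n_B = 300 \times \frac{1250}{5000} = 75,\quad n_C = 300 \times \frac{3700}{5000} = 222}$

So the optimum allocation above deliberately oversamples the small but highly variable executive stratum (15 instead of 3) and undersamples the large, homogeneous manual-worker stratum (121 instead of 222), while both schemes still sum to the required total: ${15 + 164 + 121 = 3 + 75 + 222 = 300}$.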
Zookeeper – Workflow
Once a ZooKeeper ensemble starts, it will wait for the clients to connect. Clients will connect to one of the nodes in the ZooKeeper ensemble. It may be a leader or a follower node. Once a client is connected, the node assigns a session ID to the particular client and sends an acknowledgement to the client. If the client does not get an acknowledgement, it simply tries to connect to another node in the ZooKeeper ensemble. Once connected to a node, the client sends heartbeats to the node at a regular interval to make sure that the connection is not lost.

If a client wants to read a particular znode, it sends a read request to the node with the znode path, and the node returns the requested znode by getting it from its own database. For this reason, reads are fast in a ZooKeeper ensemble.

If a client wants to store data in the ZooKeeper ensemble, it sends the znode path and the data to the server. The connected server forwards the request to the leader and the leader then reissues the write request to all the followers. If a majority of the nodes respond successfully, the write request succeeds and a successful return code is sent to the client. Otherwise, the write request fails. This strict majority of nodes is called a quorum.

Nodes in a ZooKeeper Ensemble
Let us analyze the effect of having different numbers of nodes in the ZooKeeper ensemble.

If we have a single node, then the ZooKeeper ensemble fails when that node fails. It constitutes a “single point of failure” and is not recommended in a production environment.

If we have two nodes and one node fails, we do not have a majority either, since one out of two is not a majority.

If we have three nodes and one node fails, we still have a majority, so three nodes is the minimum requirement. It is mandatory for a ZooKeeper ensemble to have at least three nodes in a live production environment.

If we have four nodes and two nodes fail, the ensemble fails again; this is similar to having three nodes. The extra node does not serve any purpose, so it is better to add nodes in odd numbers, e.g., 3, 5, 7.

We know that a write process is more expensive than a read process in a ZooKeeper ensemble, since all the nodes need to write the same data to their databases. So, it is better to have a smaller number of nodes (3, 5, or 7) than a large number of nodes for a balanced environment.

The following diagram depicts the ZooKeeper workflow, and its different components are explained below.

Write − The write process is handled by the leader node. The leader forwards the write request to all the other nodes and waits for their answers. If a majority of the nodes reply, the write process is complete.

Read − Reads are performed internally by the specific connected node, so there is no need to interact with the rest of the cluster.

Replicated Database − Used to store data in ZooKeeper. Each node has its own database, and every node holds the same data at all times with the help of consistency.

Leader − The leader is the server node that is responsible for processing write requests.

Follower − Followers receive write requests from the clients and forward them to the leader node.

Request Processor − Present only in the leader node. It governs write requests coming from the follower nodes.

Atomic broadcasts − Responsible for broadcasting the changes from the leader node to the follower nodes.
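To make the quorum discussion concrete, here is a minimal sketch of a zoo.cfg for a three-node ensemble; the host names and directories are placeholder assumptions, not values from this tutorial.

# zoo.cfg – minimal three-node ensemble (hypothetical hosts)
tickTime=2000        # basic time unit in milliseconds
initLimit=10         # ticks a follower may take to connect and sync to the leader
syncLimit=5          # ticks a follower may lag behind the leader
dataDir=/var/lib/zookeeper
clientPort=2181
# server.<myid>=<host>:<peer-port>:<leader-election-port>
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888

With three participants, any two servers form a quorum, so the ensemble keeps serving writes through the loss of a single node.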