Apache Flink – Flink vs Spark vs Hadoop

The table below compares three of the most popular big data frameworks: Apache Hadoop, Apache Spark and Apache Flink.

| Feature | Apache Hadoop | Apache Spark | Apache Flink |
|---|---|---|---|
| Year of Origin | 2005 | 2009 | 2009 |
| Place of Origin | MapReduce (Google), Hadoop (Yahoo) | University of California, Berkeley | Technical University of Berlin |
| Data Processing Engine | Batch | Batch | Stream |
| Processing Speed | Slower than Spark and Flink | Up to 100x faster than Hadoop | Faster than Spark |
| Programming Languages | Java, C, C++, Ruby, Groovy, Perl, Python | Java, Scala, Python and R | Java and Scala |
| Programming Model | MapReduce | Resilient Distributed Datasets (RDD) | Cyclic dataflows |
| Data Transfer | Batch | Batch | Pipelined and Batch |
| Memory Management | Disk Based | JVM Managed | Active Managed |
| Latency | High | Medium | Low |
| Throughput | Medium | High | High |
| Optimization | Manual | Manual | Automatic |
| API | Low-level | High-level | High-level |
| Streaming Support | NA | Spark Streaming | Flink Streaming |
| SQL Support | Hive, Impala | SparkSQL | Table API and SQL |
| Graph Support | NA | GraphX | Gelly |
| Machine Learning Support | NA | SparkML | FlinkML |
Apache Flink – Machine Learning

Apache Flink's machine learning library is called FlinkML. Because the use of machine learning has grown rapidly over the last five years, the Flink community decided to add a machine learning API to its ecosystem. The list of contributors and algorithms in FlinkML keeps growing. This API is not yet part of the binary distribution.

Here is an example of linear regression using FlinkML −

    import org.apache.flink.api.scala._
    import org.apache.flink.ml.common.LabeledVector
    import org.apache.flink.ml.math.Vector
    import org.apache.flink.ml.preprocessing.Splitter
    import org.apache.flink.ml.regression.MultipleLinearRegression

    // Option 1: supply the training and testing data directly.
    // LabeledVector is a feature vector with a label (class or real value)
    val trainingData: DataSet[LabeledVector] = …
    val testingData: DataSet[Vector] = …

    // Option 2 (use instead of Option 1): a Splitter breaks up a DataSet
    // into training and testing data.
    val dataSet: DataSet[LabeledVector] = …
    val trainTestData: DataSet[TrainTestDataSet] = Splitter.trainTestSplit(dataSet)
    val trainingData: DataSet[LabeledVector] = trainTestData.training
    val testingData: DataSet[Vector] = trainTestData.testing.map(lv => lv.vector)

    // Configure the learner: step size, iteration limit and convergence threshold
    val mlr = MultipleLinearRegression()
      .setStepsize(1.0)
      .setIterations(100)
      .setConvergenceThreshold(0.001)

    // Train the model on the training data
    mlr.fit(trainingData)

    // The fitted model can now be used to make predictions
    val predictions: DataSet[LabeledVector] = mlr.predict(testingData)

Inside the flink-1.7.1/examples/batch/ path, you will find the KMeans.jar file. Let us run this sample clustering program. It runs with the default point and centroid data sets.

    ./bin/flink run examples/batch/KMeans.jar --output Print
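To make the step size, iteration limit and convergence threshold from the regression example above more concrete, here is a minimal sketch in plain Java with no Flink dependency. It assumes ordinary fixed-step batch gradient descent on a mean squared error loss, which is not necessarily how FlinkML optimizes internally. The class name LinearRegressionSketch and its methods are hypothetical names chosen for illustration only.

```java
class LinearRegressionSketch {

    /**
     * Fits y ≈ w0 + w1*x1 + ... + wd*xd by batch gradient descent.
     * Returns the weights as [w0, w1, ..., wd].
     * Assumption: a fixed step size and a mean squared error loss.
     */
    static double[] fit(double[][] x, double[] y,
                        double stepsize, int iterations, double threshold) {
        int n = x.length;
        int d = x[0].length;
        double[] w = new double[d + 1];
        double previousLoss = Double.MAX_VALUE;

        for (int it = 0; it < iterations; it++) {
            double[] grad = new double[d + 1];
            double loss = 0.0;

            // Accumulate the loss and its gradient over all training points.
            for (int i = 0; i < n; i++) {
                double err = predict(w, x[i]) - y[i];
                loss += err * err / n;
                grad[0] += 2 * err / n;
                for (int j = 0; j < d; j++) {
                    grad[j + 1] += 2 * err * x[i][j] / n;
                }
            }

            // Move each weight against its gradient, scaled by the step size.
            for (int j = 0; j <= d; j++) {
                w[j] -= stepsize * grad[j];
            }

            // Stop early once the loss barely changes between iterations.
            if (Math.abs(previousLoss - loss) < threshold) {
                break;
            }
            previousLoss = loss;
        }
        return w;
    }

    /** Applies the fitted weights to one feature vector. */
    static double predict(double[] w, double[] features) {
        double sum = w[0];
        for (int j = 0; j < features.length; j++) {
            sum += w[j + 1] * features[j];
        }
        return sum;
    }

    public static void main(String[] args) {
        // Toy data generated from y = 2x + 1.
        double[][] x = {{0}, {1}, {2}, {3}};
        double[] y = {1, 3, 5, 7};
        double[] w = fit(x, y, 0.1, 5000, 1e-12);
        System.out.println("intercept=" + w[0] + ", slope=" + w[1]);
    }
}
```

On this toy data the fitted intercept and slope come out close to 1 and 2. A larger step size converges in fewer iterations but can diverge if it is too large for the data, which is why the step size, the iteration limit and the threshold are usually tuned together.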
Discuss Apache Flink

Apache Flink is an open source stream processing framework with both batch and stream processing capabilities. It is very similar to Apache Spark, but it follows a stream-first approach. It is also part of the Big Data tools landscape. This tutorial explains the basics of the Flink architecture, its ecosystem and its APIs.
Apache Flink – Conclusion

The comparison table in the previous chapter summarizes most of the key points. Apache Flink is well suited to real-time processing use cases. Its single engine is distinctive in that it can process both batch and streaming data, through the DataSet and DataStream APIs respectively.

This does not mean Hadoop and Spark are out of the game; the choice of the most suitable big data framework always depends on the use case. There can be several use cases where a combination of Hadoop and Flink, or Spark and Flink, is the better fit.

Nevertheless, Flink is currently among the strongest frameworks for real-time processing. Its growth has been remarkable, and the number of contributors to its community is growing day by day.

Happy Flinking!
Apache Flink – Quick Guide

Apache Flink – Big Data Platform

The growth of data over the last 10 years has been enormous, and it gave rise to the term "Big Data". No fixed size makes data "big"; any data that a traditional system (RDBMS) cannot handle is Big Data. Big Data can be structured, semi-structured or unstructured. Initially, data had three dimensions − Volume, Velocity and Variety. The dimensions have since grown beyond the three Vs to include Veracity, Validity, Vulnerability, Value, Variability, etc.

Big Data led to the emergence of multiple tools and frameworks that help in the storage and processing of data. Popular big data frameworks include Hadoop, Spark, Hive, Pig, Storm and Zookeeper. It also created the opportunity to build next-generation products in domains such as Healthcare, Finance, Retail, E-Commerce and more. Whether it is an MNC or a start-up, everyone is leveraging Big Data to store and process data and make smarter decisions.

Apache Flink – Batch vs Real-time Processing

In terms of Big Data, there are two types of processing −

- Batch Processing
- Real-time Processing

Processing based on data collected over time is called Batch Processing. For example, a bank manager wants to process the past month's data (collected over time) to find out how many cheques were cancelled in that month.

Processing based on immediate data for an instant result is called Real-time Processing. For example, a bank manager receives a fraud alert immediately after a fraudulent transaction has occurred (instant result).

The table given below lists the differences between Batch and Real-Time Processing −

| Batch Processing | Real-Time Processing |
|---|---|
| Static files | Event streams |
| Processed periodically (every minute, hour, day, etc.) | Processed immediately, as events arrive |
| Past data on disk storage | In-memory storage |
| Example − Bill generation | Example − ATM transaction alert |

These days, real-time processing is widely used across organizations. Use cases like fraud detection, real-time alerts in healthcare and network attack alerts require real-time processing of instant data; a delay of even a few milliseconds can have a huge impact.

An ideal tool for such real-time use cases is one that takes its input as a stream and not as a batch. Apache Flink is that real-time processing tool.

Apache Flink – Introduction

Apache Flink is a real-time processing framework that can process streaming data. It is an open source stream processing framework for high-performance, scalable and accurate real-time applications. It has a true streaming model and does not take input data as batches or micro-batches.

Apache Flink grew out of the Stratosphere research project; its original creators founded the company data Artisans. It is now developed under the Apache License by the Apache Flink community, which has over 479 contributors and 15500+ commits so far.

Ecosystem on Apache Flink

The Apache Flink ecosystem consists of the following layers −

Storage

Apache Flink can read data from and write data to multiple systems. Below is a basic storage list −

- HDFS (Hadoop Distributed File System)
- Local File System
- S3
- RDBMS (MySQL, Oracle, MS SQL etc.)
- MongoDB
- HBase
- Apache Kafka
- Apache Flume

Deploy

You can deploy Apache Flink in local mode, in cluster mode or on the cloud. Cluster mode can be standalone, YARN or Mesos. On the cloud, Flink can be deployed on AWS or GCP.

Kernel

This is the runtime layer, which provides distributed processing, fault tolerance, reliability, native iterative processing capability and more.

APIs & Libraries

This is the top layer and the most important layer of Apache Flink. It has the DataSet API, which takes care of batch processing, and the DataStream API, which takes care of stream processing.
There are other libraries as well, such as FlinkML (for machine learning), Gelly (for graph processing) and the Table API (for SQL). This layer provides diverse capabilities to Apache Flink.

Apache Flink – Architecture

Apache Flink works on the Kappa architecture. The Kappa architecture has a single processor, the stream processor, which treats all input as a stream; the streaming engine processes the data in real time. Batch data in the Kappa architecture is a special case of streaming.

The key idea in the Kappa architecture is to handle both batch and real-time data through a single stream processing engine.

Most big data frameworks work on the Lambda architecture, which has separate processors for batch and streaming data. In the Lambda architecture, you have separate codebases for the batch and stream views. To query and get a result, the two views need to be merged. Maintaining separate codebases/views and merging them is a pain; the Kappa architecture avoids this because it has only one view, the real-time view, so no merging is required.

That does not mean the Kappa architecture replaces the Lambda architecture; which one is preferable depends entirely on the use case and the application.

The Apache Flink job execution architecture consists of the following components −

Program

It is a piece of code that you run on the Flink cluster.

Client

It is responsible for taking the code (program), constructing the job dataflow graph and passing it to the JobManager. It also retrieves the job results.

JobManager

After receiving the job dataflow graph from the Client, it is responsible for creating the execution graph. It assigns the job to the TaskManagers in the cluster and supervises the execution of the job.

TaskManager

It is responsible for executing all the tasks that have been assigned by the JobManager. All the TaskManagers run the tasks in their separate slots with the specified parallelism. Each TaskManager is also responsible for sending the status of its tasks to the JobManager.

Features of Apache Flink

The features of Apache Flink are as follows −

- It has a streaming processor, which can run both batch and stream programs.
- It can process data at very high speed.
- APIs are available in Java, Scala and Python.
- It provides APIs for all the common operations, which makes it easy for programmers to use.
- It processes data with low latency (on the order of milliseconds) and high throughput.
- It is fault tolerant. If a node,
Apache Flink – Libraries

In this chapter, we will learn about the different libraries of Apache Flink.

Complex Event Processing (CEP)

FlinkCEP is an API in Apache Flink that analyses event patterns on continuous streaming data. These events arrive in near real time, with high throughput and low latency. The API is used mostly on sensor data, which arrives in real time and is complex to process.

CEP analyses the pattern of the input stream and returns the result with very little delay. It can provide real-time notifications and alerts even when the event pattern is complex. FlinkCEP can connect to different kinds of input sources and analyse patterns in them.

In a sample architecture with CEP, sensor data comes in from different sources, Kafka acts as a distributed messaging framework that distributes the streams to Apache Flink, and FlinkCEP analyses the complex event patterns.

You can write programs in Apache Flink for complex event processing using the Pattern API. It allows you to decide which event patterns to detect in the continuous stream data. Below are some of the most commonly used CEP patterns −

Begin

It is used to define the starting state. The following program shows how it is defined in a Flink program −

    Pattern<Event, ?> start = Pattern.<Event>begin("start");

Where

It is used to define a filter condition in the current state.

    patternState.where(new FilterFunction<Event>() {
        @Override
        public boolean filter(Event value) throws Exception {
            // Return true for events that should match this state.
            return true; // placeholder condition
        }
    });

Next

It is used to append a new pattern state. The matching event must directly follow the event that matched the previous pattern state.

    Pattern<Event, ?> next = start.next("next");

FollowedBy

It is used to append a new pattern state, but here other events can occur between the two matching events.

    Pattern<Event, ?> followedBy = start.followedBy("next");

Gelly

Apache Flink's Graph API is Gelly. Gelly is used to perform graph analysis on Flink applications using a set of methods and utilities.
You can analyse huge graphs in a distributed fashion with Gelly. There are other graph libraries for the same purpose, such as Apache Giraph, but since Gelly runs on top of Apache Flink, it uses a single API. This is very helpful from a development and operations point of view.

Let us run an example using Gelly.

First, copy the two Gelly jar files from the opt directory of Apache Flink to its lib directory. Then run the flink-gelly-examples jar.

    cp opt/flink-gelly* lib/
    ./bin/flink run examples/gelly/flink-gelly-examples_*.jar

Let us now run the PageRank example.

PageRank computes a per-vertex score, which is the sum of the PageRank scores transmitted over in-edges. Each vertex's score is divided evenly among its out-edges. High-scoring vertices are linked to by other high-scoring vertices. The result contains the vertex ID and the PageRank score.

    usage: flink run examples/flink-gelly-examples_<version>.jar --algorithm PageRank [algorithm options] --input <input> [input options] --output <output> [output options]

    ./bin/flink run examples/gelly/flink-gelly-examples_*.jar --algorithm PageRank --input CycleGraph --vertex_count 2 --output Print
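The scoring rule described above can be sketched in a few lines of plain Java, without Gelly. This is an illustrative sketch only, not Gelly's implementation: the class name PageRankSketch, the adjacency-array input format and the damping factor of 0.85 are all assumptions made for the example, and the two-vertex cycle is assumed to be the directed graph 0 -> 1 -> 0.

```java
import java.util.Arrays;

class PageRankSketch {

    /**
     * Iterative PageRank. outEdges[v] lists the vertices that v links to.
     * Assumption: vertices without out-edges simply transmit nothing.
     */
    static double[] pageRank(int[][] outEdges, double damping, int iterations) {
        int n = outEdges.length;
        double[] score = new double[n];
        Arrays.fill(score, 1.0 / n); // start with a uniform score

        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            Arrays.fill(next, (1.0 - damping) / n); // base score for every vertex

            for (int v = 0; v < n; v++) {
                if (outEdges[v].length == 0) {
                    continue;
                }
                // Each vertex's score is divided evenly among its out-edges.
                double share = damping * score[v] / outEdges[v].length;
                for (int target : outEdges[v]) {
                    // A vertex's new score sums the shares arriving over its in-edges.
                    next[target] += share;
                }
            }
            score = next;
        }
        return score;
    }

    public static void main(String[] args) {
        // Two-vertex cycle: 0 -> 1 and 1 -> 0.
        int[][] cycle = {{1}, {0}};
        System.out.println(Arrays.toString(pageRank(cycle, 0.85, 20)));
    }
}
```

In the two-vertex cycle both vertices are symmetric, so each ends up with a score of about 0.5. In a less regular graph, a vertex with no in-edges keeps only the base score, while vertices that many others point to accumulate more.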
Apache Flink – Use Cases

In this chapter, we will look at a few use cases of Apache Flink.

Apache Flink − Bouygues Telecom

Bouygues Telecom is one of the largest telecom organizations in France. It has 11+ million mobile subscribers and 2.5+ million fixed customers. Bouygues first heard about Apache Flink at a Hadoop group meeting held in Paris. Since then, they have been using Flink for multiple use cases, processing billions of messages a day in real time.

This is what Bouygues has to say about Apache Flink: "We ended up with Flink because the system supports true streaming – both at the API and at the runtime level, giving us the programmability and low latency that we were looking for. In addition, we were able to get our system up and running with Flink in a fraction of the time compared to other solutions, which resulted in more available developer resources for expanding the business logic in the system."

At Bouygues, customer experience is the highest priority. They analyse data in real time so that they can give the following insights to their engineers −

- Real-time customer experience over their network
- What is happening globally on the network
- Network evaluations and operations

They created a system called LUX (Logged User Experience), which processes massive log data from network equipment together with internal reference data. It produces quality-of-experience indicators that log the customer experience, and it includes an alarm function that detects any failure in data consumption within 60 seconds.

To achieve this, they needed a framework that can take in massive data in real time, is easy to set up and provides a rich set of APIs for processing the streamed data. Apache Flink was a perfect fit for Bouygues Telecom.

Apache Flink − Alibaba

Alibaba is the largest e-commerce retail company in the world, with $394 billion in annual sales in 2015.
Alibaba search is the entry point for all customers; it shows the search results and makes recommendations accordingly. Alibaba uses Apache Flink in its search engine to show results in real time with the highest accuracy and relevancy for each user.

Alibaba was looking for a framework that −

- Is agile, maintaining one codebase for the entire search infrastructure process.
- Provides low latency for availability changes to products on the website.
- Is consistent and cost effective.

Apache Flink met all the above requirements. They needed a framework with a single processing engine that can process both batch and stream data, and that is what Apache Flink does.

They also use Blink, a forked version of Flink, to meet some unique requirements of their search. In addition, they use Apache Flink's Table API, with a few improvements, for their search.

This is what Alibaba had to say about Apache Flink: "Looking back, it was no doubt a huge year for Blink and Flink at Alibaba. No one thought that we would make this much progress in a year, and we are very grateful to all the people who helped us in the community. Flink is proven to work at the very large scale. We are more committed than ever to continue our work with the community to move Flink forward!"