Apache NiFi – Processors

Apache NiFi processors are the basic building blocks of a data flow. Each processor provides a different function and contributes to the creation of the output flowfile. The data flow shown in the image below fetches a file from one directory using the GetFile processor and stores it in another directory using the PutFile processor.

GetFile

The GetFile processor fetches files of a specific format from a specific directory. It also gives the user further options to control fetching, which are discussed in the properties section below.

GetFile Settings

The GetFile processor has the following settings −

Name − A user can give the processor any name, either one that follows the project's conventions or one that makes its purpose clearer.

Enable − A user can enable or disable the processor using this setting.

Penalty Duration − This setting specifies how long a flowfile is penalized, that is, held back from further processing, when the processor fails to process it.

Yield Duration − This setting specifies the yield time of the processor. During this period, the processor is not scheduled again.

Bulletin Level − This setting specifies the log level at which the processor generates bulletins.

Automatically Terminate Relationships − This setting lists a check box for every relationship available on the processor. By checking a box, a user tells the processor to terminate flowfiles routed to that relationship instead of sending them further in the flow.

GetFile Scheduling

The GetFile processor offers the following scheduling options −

Schedule Strategy − The processor can be scheduled on a time basis by selecting Timer driven, or by a CRON string by selecting CRON driven.

Concurrent Tasks − This option defines how many tasks of this processor may run concurrently.

Execution − This option defines whether the processor runs on all nodes or only on the primary node.

Run Schedule − This option defines the interval for the Timer driven strategy or the CRON expression for the CRON driven strategy.

GetFile Properties

GetFile offers multiple properties, as shown in the image below, ranging from compulsory properties like Input Directory and File Filter to optional properties like Path Filter and Maximum File Size. A user can manage the file fetching process using these properties.

GetFile Comments

This section is used to record any information about the processor.

PutFile

The PutFile processor stores a file from the data flow in a specific location.

PutFile Settings

The PutFile processor has the following settings −

Name − A user can give the processor any name, either one that follows the project's conventions or one that makes its purpose clearer.

Enable − A user can enable or disable the processor using this setting.

Penalty Duration − This setting specifies how long a flowfile is penalized, that is, held back from further processing, when the processor fails to process it.

Yield Duration − This setting specifies the yield time of the processor. During this period, the processor is not scheduled again.

Bulletin Level − This setting specifies the log level at which the processor generates bulletins.

Automatically Terminate Relationships − This setting lists a check box for every relationship available on the processor. By checking a box, a user tells the processor to terminate flowfiles routed to that relationship instead of sending them further in the flow.

PutFile Scheduling

The PutFile processor offers the following scheduling options −

Schedule Strategy − The processor can be scheduled on a time basis by selecting Timer driven, or by a CRON string by selecting CRON driven. There is also an experimental Event driven strategy, which triggers the processor on a specific event.

Concurrent Tasks − This option defines how many tasks of this processor may run concurrently.

Execution − This option defines whether the processor runs on all nodes or only on the primary node.

Run Schedule − This option defines the interval for the Timer driven strategy or the CRON expression for the CRON driven strategy.

PutFile Properties

The PutFile processor provides properties such as Directory, which specifies the output directory for the file transfer, and others that manage the transfer, as shown in the image below.

PutFile Comments

This section is used to record any information about the processor.
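To make the GetFile and PutFile pairing above more concrete, the following Python sketch imitates the flow outside NiFi: it picks up files from an input directory whose names match a filter and moves them to an output directory. It is only an analogy under stated assumptions, not NiFi code. The function name move_matching_files, the directory names, and the sample files are hypothetical, and the filter is treated as a regular expression matched against the whole file name.

```python
import re
import shutil
import tempfile
from pathlib import Path
from typing import List


def move_matching_files(input_dir: Path, output_dir: Path, file_filter: str) -> List[str]:
    """Move files whose names match file_filter from input_dir to output_dir.

    Loosely mirrors a GetFile -> PutFile flow; this is an analogy, not NiFi code.
    """
    pattern = re.compile(file_filter)
    output_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for candidate in sorted(input_dir.iterdir()):
        # Only regular files whose full name matches the filter are picked up.
        if candidate.is_file() and pattern.fullmatch(candidate.name):
            shutil.move(str(candidate), str(output_dir / candidate.name))
            moved.append(candidate.name)
    return moved


# Hypothetical sample data in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "in"
    target = Path(tmp) / "out"
    source.mkdir()
    (source / "orders.csv").write_text("id,amount\n1,10\n")
    (source / "notes.txt").write_text("ignore me\n")

    moved_names = move_matching_files(source, target, r".*\.csv")
    remaining_names = sorted(p.name for p in source.iterdir())
    print("moved:", moved_names)
    print("left in input directory:", remaining_names)
```

In the sketch, the input directory and the filter play the role of the compulsory GetFile properties, and the output directory plays the role of the PutFile Directory property.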

Apache NiFi – Labels

Apache NiFi offers labels that let a developer write information about the components present on the NiFi canvas. The leftmost icon in the top menu of the NiFi UI is used to add a label to the canvas. A developer can change the color of the label and the size of its text by right-clicking the label and choosing the appropriate option from the menu.

Apache NiFi – Data Provenance

Apache NiFi logs and stores information about every event that occurs on the ingested data in a flow. The data provenance repository stores this information and provides a UI to search it. Data provenance can be accessed for the whole NiFi instance and also for an individual processor.

The NiFi Data Provenance event list has the following fields −

S.No. | Field Name | Description
1 | Date/Time | Date and time of the event.
2 | Type | Type of the event, such as 'CREATE'.
3 | FlowFileUuid | UUID of the flowfile on which the event is performed.
4 | Size | Size of the flowfile.
5 | Component Name | Name of the component which performed the event.
6 | Component Type | Type of the component.
7 | Show lineage | The last column has the show lineage icon, which is used to see the flowfile lineage as shown in the image below.

To get more information about an event, a user can click the information icon in the first column of the NiFi Data Provenance UI.

The following properties in the nifi.properties file are used to manage the NiFi Data Provenance repository −

S.No. | Property Name | Default Value | Description
1 | nifi.provenance.repository.directory.default | ./provenance_repository | The default path of the NiFi data provenance repository.
2 | nifi.provenance.repository.max.storage.time | 24 hours | The maximum retention time of NiFi data provenance.
3 | nifi.provenance.repository.max.storage.size | 1 GB | The maximum storage size of NiFi data provenance.
4 | nifi.provenance.repository.rollover.time | 30 secs | The rollover time of NiFi data provenance.
5 | nifi.provenance.repository.rollover.size | 100 MB | The rollover size of NiFi data provenance.
6 | nifi.provenance.repository.indexed.fields | EventType, FlowFileUUID, Filename, ProcessorID, Relationship | The fields used to index and search NiFi data provenance.
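The idea of searching provenance events by their fields can be sketched in a few lines of Python. This is a toy model for illustration only and does not use or reproduce NiFi's provenance API. The class ProvenanceEvent, the function search_events, the component names, and all sample values are hypothetical; only the field names are taken from the event list described above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ProvenanceEvent:
    # Field names follow the columns of the event list described above.
    date_time: str
    event_type: str
    flowfile_uuid: str
    size_bytes: int
    component_name: str
    component_type: str


def search_events(events: List[ProvenanceEvent], **criteria) -> List[ProvenanceEvent]:
    """Return the events whose fields equal every given criterion."""
    return [
        event
        for event in events
        if all(getattr(event, name) == value for name, value in criteria.items())
    ]


# Made-up sample events; component names and types are placeholders.
events = [
    ProvenanceEvent("2018-07-01 10:00:00", "CREATE", "uuid-1", 120, "source-step", "ExampleSource"),
    ProvenanceEvent("2018-07-01 10:00:01", "CLONE", "uuid-1", 120, "copy-step", "ExampleCopy"),
    ProvenanceEvent("2018-07-01 10:00:02", "CREATE", "uuid-2", 64, "source-step", "ExampleSource"),
]

created = search_events(events, event_type="CREATE")
history_of_first = search_events(events, flowfile_uuid="uuid-1")
print(len(created), [event.event_type for event in history_of_first])
```

Filtering on flowfile_uuid gives a rough picture of what the lineage view shows: every event recorded for one flowfile, in order.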

Apache NiFi – FlowFile

A flowfile is the basic processing entity in Apache NiFi. It consists of data content and attributes, which NiFi processors use to process the data. The content normally holds the data fetched from source systems. The most common attributes of an Apache NiFi FlowFile are −

UUID − This stands for Universally Unique Identifier, a unique identity that NiFi generates for each flowfile.

Filename − This attribute contains the filename of the flowfile. It should not contain any directory structure.

File Size − This contains the size of the flowfile.

mime.type − This specifies the MIME type of the flowfile.

path − This attribute contains the relative path of the file to which the flowfile belongs. It does not contain the file name.
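The split between content and attributes can be pictured with a small Python data class. This is a minimal sketch for illustration, assuming nothing about NiFi's actual Java classes: the name FlowFileSketch, the default path value, and the sample content are hypothetical, while the attribute keys follow the list above.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class FlowFileSketch:
    """Toy model of a flowfile: raw content plus a map of string attributes."""

    content: bytes
    attributes: Dict[str, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Generate a unique identifier when none was supplied.
        self.attributes.setdefault("uuid", str(uuid.uuid4()))
        # Assumed default for this sketch: a relative path without the file name.
        self.attributes.setdefault("path", "./")

    @property
    def file_size(self) -> int:
        # The size is derived from the content rather than stored separately.
        return len(self.content)


flowfile = FlowFileSketch(
    content=b"id,amount\n1,10\n",
    attributes={"filename": "orders.csv", "mime.type": "text/csv"},
)
print(flowfile.attributes["filename"], flowfile.file_size, flowfile.attributes["path"])
```

Keeping the filename free of directory parts and storing the directory separately in path matches the descriptions of those two attributes above.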

Apache NiFi – Logging

Apache NiFi uses the logback library to handle its logging. The file logback.xml in the conf directory of NiFi is used to configure logging. The logs are generated in the logs folder of NiFi, and the log files are described below.

nifi-app.log

This is the main log file of NiFi. It logs all the activities of the Apache NiFi application, ranging from the loading of NAR files to the runtime errors or bulletins encountered by NiFi components. Below is the default appender in the logback.xml file for nifi-app.log.

    <appender name="APP_FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
        <file>${org.apache.nifi.bootstrap.config.log.dir}/nifi-app.log</file>
        <rollingPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedRollingPolicy">
            <fileNamePattern>${org.apache.nifi.bootstrap.config.log.dir}/nifi-app_%d{yyyy-MM-dd_HH}.%i.log</fileNamePattern>
            <maxFileSize>100MB</maxFileSize>
            <maxHistory>30</maxHistory>
        </rollingPolicy>
        <immediateFlush>true</immediateFlush>
        <encoder class="ch.qos.logback.classic.encoder.PatternLayoutEncoder">
            <pattern>%date %level [%thread] %logger{40} %msg%n</pattern>
        </encoder>
    </appender>

The appender name is APP_FILE, and the class is RollingFileAppender, which means the logger follows a rolling (rollover) policy. By default, the maximum file size is 100 MB; it can be changed to the required size. The maxHistory value for APP_FILE is 30, which limits how much archived log history is retained, and it can be changed as required.

nifi-user.log

This log contains user events such as web security, web API configuration, and user authorization. Below is the appender for nifi-user.log in the logback.xml file.

    <appender name="USER_FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
        <file>${org.apache.nifi.bootstrap.config.log.dir}/nifi-user.log</file>
        <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
            <fileNamePattern>${org.apache.nifi.bootstrap.config.log.dir}/nifi-user_%d.log</fileNamePattern>
            <maxHistory>30</maxHistory>
        </rollingPolicy>
        <encoder class="ch.qos.logback.classic.encoder.PatternLayoutEncoder">
            <pattern>%date %level [%thread] %logger{40} %msg%n</pattern>
        </encoder>
    </appender>

The appender name is USER_FILE. It also follows a rollover policy, and its maxHistory value is 30. Below are the default loggers in logback.xml that write to the USER_FILE appender.

    <logger name="org.apache.nifi.web.security" level="INFO" additivity="false">
        <appender-ref ref="USER_FILE"/>
    </logger>
    <logger name="org.apache.nifi.web.api.config" level="INFO" additivity="false">
        <appender-ref ref="USER_FILE"/>
    </logger>
    <logger name="org.apache.nifi.authorization" level="INFO" additivity="false">
        <appender-ref ref="USER_FILE"/>
    </logger>
    <logger name="org.apache.nifi.cluster.authorization" level="INFO" additivity="false">
        <appender-ref ref="USER_FILE"/>
    </logger>
    <logger name="org.apache.nifi.web.filter.RequestLogger" level="INFO" additivity="false">
        <appender-ref ref="USER_FILE"/>
    </logger>

nifi-bootstrap.log

This log contains the bootstrap logs, Apache NiFi's standard output (everything written to System.out in the code, mainly for debugging), and standard error (everything written to System.err in the code). Below is the default appender for nifi-bootstrap.log in logback.xml.

    <appender name="BOOTSTRAP_FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
        <file>${org.apache.nifi.bootstrap.config.log.dir}/nifi-bootstrap.log</file>
        <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
            <fileNamePattern>${org.apache.nifi.bootstrap.config.log.dir}/nifi-bootstrap_%d.log</fileNamePattern>
            <maxHistory>5</maxHistory>
        </rollingPolicy>
        <encoder class="ch.qos.logback.classic.encoder.PatternLayoutEncoder">
            <pattern>%date %level [%thread] %logger{40} %msg%n</pattern>
        </encoder>
    </appender>

The appender for the nifi-bootstrap.log file is named BOOTSTRAP_FILE, and it also follows a rollover policy. Its maxHistory value is 5. Below are the default loggers for the nifi-bootstrap.log file.

    <logger name="org.apache.nifi.bootstrap" level="INFO" additivity="false">
        <appender-ref ref="BOOTSTRAP_FILE" />
    </logger>
    <logger name="org.apache.nifi.bootstrap.Command" level="INFO" additivity="false">
        <appender-ref ref="CONSOLE" />
        <appender-ref ref="BOOTSTRAP_FILE" />
    </logger>
    <logger name="org.apache.nifi.StdOut" level="INFO" additivity="false">
        <appender-ref ref="BOOTSTRAP_FILE" />
    </logger>
    <logger name="org.apache.nifi.StdErr" level="ERROR" additivity="false">
        <appender-ref ref="BOOTSTRAP_FILE" />
    </logger>
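When reviewing or changing these appenders, it can help to read the rollover settings programmatically. The Python sketch below parses a shortened copy of the APP_FILE appender with the standard library and reports its rolling policy, maximum file size, and maxHistory. It is an illustrative helper under stated assumptions, not a NiFi or logback tool: the function name summarize_appender is hypothetical, and the embedded XML is an excerpt rather than a complete logback.xml.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the APP_FILE appender, written with plain double quotes.
APPENDER_XML = """
<appender name="APP_FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
  <file>${org.apache.nifi.bootstrap.config.log.dir}/nifi-app.log</file>
  <rollingPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedRollingPolicy">
    <fileNamePattern>${org.apache.nifi.bootstrap.config.log.dir}/nifi-app_%d{yyyy-MM-dd_HH}.%i.log</fileNamePattern>
    <maxFileSize>100MB</maxFileSize>
    <maxHistory>30</maxHistory>
  </rollingPolicy>
</appender>
"""


def summarize_appender(xml_text: str) -> dict:
    """Extract the rollover-related settings of a single appender element."""
    appender = ET.fromstring(xml_text)
    policy = appender.find("rollingPolicy")
    return {
        "name": appender.get("name"),
        # Keep only the simple class name of the rolling policy.
        "policy": policy.get("class").rsplit(".", 1)[-1],
        # maxFileSize is absent for purely time-based policies, giving None.
        "max_file_size": policy.findtext("maxFileSize"),
        "max_history": int(policy.findtext("maxHistory")),
    }


summary = summarize_appender(APPENDER_XML)
print(summary)
```

The same function can be pointed at the USER_FILE or BOOTSTRAP_FILE appender text, in which case max_file_size comes back as None because those appenders define no maxFileSize element.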

Apache NiFi – Home

Apache NiFi is an open source data ingestion platform. It was developed by the NSA and is now maintained and further developed under the Apache Software Foundation. It is based on Java and runs in a Jetty server. It is licensed under the Apache License, Version 2.0. This tutorial explains the basics of Apache NiFi and its features.

Audience

This tutorial is designed for software professionals who want to learn the basics of Apache NiFi and its programming concepts in simple and easy steps. It describes the components of Apache NiFi with suitable examples.

Prerequisites

You should have a basic understanding of Java, ETL, and data ingestion and transformation. You should also be familiar with web servers, platform configuration, and regex patterns.

Apache NiFi – Configuration

Apache NiFi is a highly configurable platform. The nifi.properties file in the conf directory contains most of the configuration. The commonly used properties of Apache NiFi are as follows −

Core properties

This section contains the properties that are compulsory to run a NiFi instance.

S.No. | Property name | Default Value | Description
1 | nifi.flow.configuration.file | ./conf/flow.xml.gz | The path to the flow.xml file. This file contains all the data flows created in NiFi.
2 | nifi.flow.configuration.archive.enabled | true | Enables or disables archiving in NiFi.
3 | nifi.flow.configuration.archive.dir | ./conf/archive/ | The archive directory.
4 | nifi.flow.configuration.archive.max.time | 30 days | The retention time for archived content.
5 | nifi.flow.configuration.archive.max.storage | 500 MB | The maximum size to which the archive directory can grow.
6 | nifi.authorizer.configuration.file | ./conf/authorizers.xml | The authorizer configuration file, which is used for user authorization.
7 | nifi.login.identity.provider.configuration.file | ./conf/login-identity-providers.xml | The configuration file of the login identity providers.
8 | nifi.templates.directory | ./conf/templates | The directory where NiFi templates are stored.
9 | nifi.nar.library.directory | ./lib | The path to the library folder, from which NiFi loads all components using the NAR files present in it.
10 | nifi.nar.working.directory | ./work/nar/ | The directory that stores the unpacked NAR files once NiFi has processed them.
11 | nifi.documentation.working.directory | ./work/docs/components | The directory that contains the documentation of all components.

State Management

These properties control how the state of components is stored. Stored state lets a component resume processing where it left off, both after a restart and at its next scheduled run.

S.No. | Property name | Default Value | Description
1 | nifi.state.management.configuration.file | ./conf/state-management.xml | The path to the state-management.xml file, which configures how the state of the components in the data flows of this NiFi instance is stored.
2 | nifi.state.management.provider.local | local-provider | The ID of the local state provider.
3 | nifi.state.management.provider.cluster | zk-provider | The ID of the cluster-wide state provider. It is ignored if NiFi is not clustered but must be populated when running in a cluster.
4 | nifi.state.management.embedded.zookeeper.start | false | Specifies whether this instance of NiFi should run an embedded ZooKeeper server.
5 | nifi.state.management.embedded.zookeeper.properties | ./conf/zookeeper.properties | The path of the properties file that provides the ZooKeeper properties to use if nifi.state.management.embedded.zookeeper.start is set to true.

FlowFile Repository

Let us now look at the important properties of the FlowFile repository −

S.No. | Property name | Default Value | Description
1 | nifi.flowfile.repository.implementation | org.apache.nifi.controller.repository.WriteAheadFlowFileRepository | Specifies whether flowfiles are stored on disk or in memory. To store flowfiles in memory, change the value to "org.apache.nifi.controller.repository.VolatileFlowFileRepository".
2 | nifi.flowfile.repository.directory | ./flowfile_repository | The directory of the flowfile repository.
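A quick way to inspect such settings is to read the key and value pairs into a dictionary. The Python sketch below does this for a short excerpt built from the default values in the tables above. It is a simplified reader written for illustration, not the parser NiFi itself uses: the function name parse_properties is hypothetical, and the sketch ignores Java properties features such as line continuations and escape sequences.

```python
# Excerpt assembled from default values in the tables above (not a full file).
SAMPLE_PROPERTIES = """\
# Core properties
nifi.flow.configuration.file=./conf/flow.xml.gz
nifi.flow.configuration.archive.enabled=true
nifi.flow.configuration.archive.max.time=30 days

# State management
nifi.state.management.embedded.zookeeper.start=false

# FlowFile repository
nifi.flowfile.repository.directory=./flowfile_repository
"""


def parse_properties(text: str) -> dict:
    """Parse simple key=value lines, skipping blank lines and # comments."""
    properties = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first "=" only, so values may themselves contain "=".
        key, _, value = line.partition("=")
        properties[key.strip()] = value.strip()
    return properties


props = parse_properties(SAMPLE_PROPERTIES)
embedded_zookeeper = props["nifi.state.management.embedded.zookeeper.start"] == "true"
print(props["nifi.flow.configuration.file"], embedded_zookeeper)
```

All values are kept as strings, so flags such as the embedded ZooKeeper switch have to be compared against "true" explicitly, as the last lines of the sketch do.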

Apache NiFi – Templates

Apache NiFi offers the concept of templates, which makes it easier to reuse and distribute NiFi flows. The flows can then be used by other developers or in other NiFi clusters. Templates also help NiFi developers share their work in repositories like GitHub.

Create Template

Let us create a template for the flow that we created in chapter 15, "Apache NiFi – Creating Flows". Select all the components of the flow using the Shift key and then click the create template icon on the left-hand side of the NiFi canvas. You can also see a tool box as shown in the above image. Click the create template icon, marked in blue in the above picture. Enter a name for the template. A developer can also add a description, which is optional.

Download Template

Go to the NiFi templates option in the menu at the top right-hand corner of the NiFi UI, as shown in the picture below. Now click the download icon (at the right-hand side of the list) of the template you want to download. An XML file with the template name will be downloaded.

Upload Template

To use a template in NiFi, a developer has to upload its XML file to NiFi using the UI. There is an Upload Template icon (marked in blue in the image below) beside the Create Template icon. Click it and browse to the XML file.

Add Template

In the top toolbar of the NiFi UI, the template icon comes before the label icon. The icon is marked in blue as shown in the picture below. Drag the template icon onto the canvas, choose the template from the drop-down list, and click Add. This adds the template to the NiFi canvas.

Apache NiFi – Reporting Task

Apache NiFi reporting tasks are similar to controller services: they run in the background and send or log the statistics of a NiFi instance. Reporting tasks can be accessed from the same page as the controller settings, but in a different tab. To add a reporting task, a developer clicks the plus button at the top right-hand side of the reporting tasks page. Reporting tasks are mainly used for monitoring the activities of a NiFi instance, through either the bulletins or the provenance data. Most of these reporting tasks use Site-to-Site to transport the NiFi statistics to another node or an external system.

Let us now add and configure a reporting task for a better understanding.

MonitorMemory

This reporting task generates bulletins when a memory pool crosses a specified percentage of usage. Follow these steps to configure the MonitorMemory reporting task −

1. Click the plus sign and search for MonitorMemory in the list.
2. Select MonitorMemory and click ADD.
3. Once it appears on the main page of reporting tasks, click the configure icon.
4. In the properties tab, select the memory pool that you want to monitor.
5. Select the percentage after which you want bulletins to alert the users.
6. Start the reporting task.
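The rule behind this kind of threshold monitoring is simple enough to sketch in Python: compare the used share of a memory pool with a configured percentage and produce a message when it is exceeded. This is a conceptual illustration only, not the MonitorMemory implementation. The function name memory_bulletin, the message wording, the pool name, and the byte counts are all assumptions made for the example.

```python
from typing import Optional


def memory_bulletin(pool_name: str, used_bytes: int, max_bytes: int,
                    threshold_percent: float) -> Optional[str]:
    """Return a bulletin-style message when usage exceeds the threshold."""
    usage_percent = 100.0 * used_bytes / max_bytes
    if usage_percent > threshold_percent:
        return (
            f"Memory pool '{pool_name}' is at {usage_percent:.1f}% "
            f"(threshold {threshold_percent}%)"
        )
    # Below or at the threshold: nothing to report.
    return None


# Hypothetical readings for one memory pool.
alert = memory_bulletin("G1 Old Gen", used_bytes=700, max_bytes=1000, threshold_percent=65)
quiet = memory_bulletin("G1 Old Gen", used_bytes=500, max_bytes=1000, threshold_percent=65)
print(alert)
print(quiet)
```

The two parameters pool_name and threshold_percent correspond to the two choices made in the properties tab in steps 4 and 5.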

Apache NiFi – Introduction

Apache NiFi is a powerful, easy-to-use, and reliable system for processing and distributing data between disparate systems. It is based on the Niagara Files technology, which was developed by the NSA and donated to the Apache Software Foundation after eight years. It is distributed under the Apache License Version 2.0, January 2004. At the time of writing, the latest version of Apache NiFi was 1.7.1.

Apache NiFi is a real-time data ingestion platform that can transfer data, and manage that transfer, between different source and destination systems. It supports a wide variety of data formats such as logs, geolocation data, and social feeds. It also supports many protocols and systems such as SFTP, HDFS, and Kafka. This support for a wide variety of data sources and protocols has made the platform popular in many IT organizations.

Apache NiFi − General Features

The general features of Apache NiFi are as follows −

Apache NiFi provides a web-based user interface, which offers a seamless experience across design, control, feedback, and monitoring.

It is highly configurable. This gives users guaranteed delivery, low latency, high throughput, dynamic prioritization, back pressure, and the ability to modify flows at runtime.

It provides a data provenance module to track and monitor data from the start to the end of the flow.

Developers can create their own custom processors and reporting tasks according to their needs.

NiFi supports secure protocols like SSL, HTTPS, and SSH, as well as other forms of encryption.

It supports user and role management and can be configured with LDAP for authorization.

Apache NiFi − Key Concepts

The key concepts of Apache NiFi are as follows −

Process Group − It is a group of NiFi flows, which helps a user manage flows and keep them in a hierarchical manner.

Flow − It is created by connecting different processors to transfer data, and modify it if required, from one or more data sources to destination data sources.

Processor − A processor is a Java module responsible for either fetching data from a source system or storing it in a destination system. Other processors are used to add attributes to a flowfile or change its content.

Flowfile − It is the basic unit of NiFi and represents a single object of data picked up from a source system. NiFi processors make changes to a flowfile while it moves from the source processor to the destination. Different events like CREATE, CLONE, and RECEIVE are performed on a flowfile by different processors in a flow.

Event − Events represent the changes made to a flowfile while it traverses a NiFi flow. These events are tracked in data provenance.

Data provenance − It is a repository. It also has a UI, which lets users check information about a flowfile and helps in troubleshooting any issues that arise during the processing of a flowfile.

Apache NiFi Advantages

Apache NiFi enables data fetching from remote machines by using SFTP and guarantees data lineage.

Apache NiFi supports clustering, so it can run the same flow on multiple nodes, each processing different data, which increases the performance of data processing.

It provides security policies at the user level, the process group level, and for other modules too.

Its UI can run on HTTPS, which makes the interaction of users with NiFi secure.

NiFi supports around 188 processors, and a user can also create custom plugins to support a wide variety of data systems.

Apache NiFi Disadvantages

When a node gets disconnected from the NiFi cluster while a user is making changes in it, the flow.xml becomes invalid. The node cannot connect back to the cluster unless an admin manually copies flow.xml from a connected node.

Apache NiFi has a state persistence issue when the primary node switches, which sometimes leaves processors unable to fetch data from source systems.