Apache Pig – Bag & Tuple Functions

Given below is the list of Bag and Tuple functions.

1. TOBAG() − Converts two or more expressions into a bag.
2. TOP() − Gets the top N tuples of a relation.
3. TOTUPLE() − Converts one or more expressions into a tuple.
4. TOMAP() − Converts key-value pairs into a map.
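The short Grunt-shell sketch below shows how these functions are typically invoked. The relation emp_data, its fields, and the HDFS path are hypothetical and only illustrate the calls.

grunt> emp_data = LOAD 'hdfs://localhost:9000/pig_data/emp.txt' USING PigStorage(',')
   as (id:int, name:chararray, age:int, city:chararray);

grunt> -- TOBAG: pack the name and city of every record into a single bag
grunt> emp_bag = FOREACH emp_data GENERATE TOBAG(name, city);

grunt> -- TOTUPLE: pack the id, name and age into one tuple
grunt> emp_tuple = FOREACH emp_data GENERATE TOTUPLE(id, name, age);

grunt> -- TOMAP: build a map with the name as the key and the age as the value
grunt> emp_map = FOREACH emp_data GENERATE TOMAP(name, age);

grunt> -- TOP: within each city group, keep the 2 tuples with the highest age (column index 2)
grunt> emp_group = GROUP emp_data BY city;
grunt> top_two = FOREACH emp_group GENERATE TOP(2, 2, emp_data);

You can DUMP any of these relations to inspect the resulting bags, tuples, and maps.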
Apache Tajo – Discussion
Discuss Apache Tajo

Apache Tajo is an open-source distributed data warehouse framework for Hadoop. Tajo was initially started by Gruter, a Hadoop-based infrastructure company in South Korea. Later, experts from Intel, Etsy, NASA, Cloudera, and Hortonworks also contributed to the project. Tajo is the Korean word for ostrich. In March 2014, Tajo was granted top-level Apache open-source project status. This tutorial explores the basics of Tajo, then explains cluster setup, the Tajo shell, SQL queries, and integration with other big data technologies, and finally concludes with some examples.
AVRO – Environment Setup
The Apache Software Foundation provides Avro in various releases. You can download the required release from the Apache mirrors. Let us see how to set up the environment to work with Avro.

Downloading Avro

To download Apache Avro, proceed as follows −

Open the web page Apache.org and go to the Apache Avro homepage.

Click on project → releases. You will get a list of releases. Select the latest release, which leads you to a download link.

mirror.nexcess is one of the links where you can find the list of all the libraries of the different languages that Avro supports.

You can select and download the library for any of the languages provided. In this tutorial, we use Java. Hence, download the jar files avro-1.7.7.jar and avro-tools-1.7.7.jar.

Avro with Eclipse

To use Avro in the Eclipse environment, follow the steps given below −

Step 1. Open Eclipse.
Step 2. Create a project.
Step 3. Right-click on the project name. You will get a shortcut menu.
Step 4. Click on Build Path. It leads you to another shortcut menu.
Step 5. Click on Configure Build Path… The Properties window of your project opens.
Step 6. Under the Libraries tab, click on the Add External JARs… button.
Step 7. Select the jar file avro-1.7.7.jar you have downloaded.
Step 8. Click on OK.

Avro with Maven

You can also get the Avro library into your project using Maven. Given below is a pom.xml file for Avro.

<project xmlns="http://maven.apache.org/POM/4.0.0"
   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
   xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
   <modelVersion>4.0.0</modelVersion>
   <groupId>Test</groupId>
   <artifactId>Test</artifactId>
   <version>0.0.1-SNAPSHOT</version>
   <build>
      <sourceDirectory>src</sourceDirectory>
      <plugins>
         <plugin>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.1</version>
            <configuration>
               <source>1.7</source>
               <target>1.7</target>
            </configuration>
         </plugin>
      </plugins>
   </build>
   <dependencies>
      <dependency>
         <groupId>org.apache.avro</groupId>
         <artifactId>avro</artifactId>
         <version>1.7.7</version>
      </dependency>
      <dependency>
         <groupId>org.apache.avro</groupId>
         <artifactId>avro-tools</artifactId>
         <version>1.7.7</version>
      </dependency>
      <dependency>
         <groupId>org.apache.logging.log4j</groupId>
         <artifactId>log4j-api</artifactId>
         <version>2.0-beta9</version>
      </dependency>
      <dependency>
         <groupId>org.apache.logging.log4j</groupId>
         <artifactId>log4j-core</artifactId>
         <version>2.0-beta9</version>
      </dependency>
   </dependencies>
</project>

Setting Classpath

To work with Avro in a Linux environment, download the following jar files −

avro-1.7.7.jar
avro-tools-1.7.7.jar
log4j-api-2.0-beta9.jar
log4j-core-2.0-beta9.jar

Copy these files into a folder and set the classpath to that folder in the ~/.bashrc file, as shown below.

#class path for Avro
export CLASSPATH=$CLASSPATH:/home/Hadoop/Avro_Work/jars/*
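Before moving on, it can help to confirm that the environment actually resolves the Avro classes. The tiny program below is only a sketch of such a check; the class name AvroSetupCheck and the inline schema are invented for illustration, and it should be compiled and run with the classpath set as described above.

import org.apache.avro.Schema;

public class AvroSetupCheck {
   public static void main(String[] args) {
      // A throwaway schema, used only to confirm that the Avro classes resolve.
      String schemaJson = "{\"type\":\"record\",\"name\":\"emp\","
            + "\"fields\":[{\"name\":\"name\",\"type\":\"string\"},"
            + "{\"name\":\"id\",\"type\":\"int\"}]}";

      // Schema.Parser turns the JSON definition into an in-memory Schema object.
      Schema schema = new Schema.Parser().parse(schemaJson);
      System.out.println("Avro is on the classpath. Parsed schema: " + schema.getName());
   }
}

If the program prints the schema name, the jars are visible to the compiler and runtime; a ClassNotFoundException instead usually points to a classpath problem.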
AVRO – Reference API
In the previous chapter, we described the input type of Avro, i.e., Avro schemas. In this chapter, we will explain the classes and methods used in the serialization and deserialization of Avro schemas.

SpecificDatumWriter Class

This class belongs to the package org.apache.avro.specific. It implements the DatumWriter interface, which converts Java objects into an in-memory serialized format.

Constructor
1. SpecificDatumWriter(Schema schema)

Method
1. SpecificData getSpecificData() − Returns the SpecificData implementation used by this writer.

SpecificDatumReader Class

This class belongs to the package org.apache.avro.specific. It implements the DatumReader interface, which reads the data of a schema and determines the in-memory data representation. SpecificDatumReader is the class which supports generated Java classes.

Constructor
1. SpecificDatumReader(Schema schema) − Constructs a reader where the writer's and reader's schemas are the same.

Methods
1. SpecificData getSpecificData() − Returns the contained SpecificData.
2. void setSchema(Schema actual) − This method is used to set the writer's schema.

DataFileWriter Class

This class writes a sequence of serialized records of data conforming to a schema, along with the schema, in a file.

Constructor
1. DataFileWriter(DatumWriter<D> dout)

Methods
1. void append(D datum) − Appends a datum to the file.
2. DataFileWriter<D> appendTo(File file) − This method is used to open a writer that appends to an existing file.

DataFileReader Class

This class provides random access to files written with DataFileWriter. It inherits the class DataFileStream.

Constructor
1. DataFileReader(File file, DatumReader<D> reader)

Methods
1. D next() − Reads the next datum in the file.
2. boolean hasNext() − Returns true if more entries remain in this file.

Schema.Parser Class

This class is a parser for JSON-format schemas. It contains methods to parse the schema. It belongs to the org.apache.avro package.

Constructor
1. Schema.Parser()

Methods
1. parse(File file) − Parses the schema provided in the given file.
2. parse(InputStream in) − Parses the schema provided in the given InputStream.
3. parse(String s) − Parses the schema provided in the given String.

GenericRecord Interface

This interface provides methods to access the fields by name as well as by index.

Methods
1. Object get(String key) − Returns the value of a field given its name.
2. void put(String key, Object v) − Sets the value of a field given its name.

GenericData.Record Class

Constructor
1. GenericData.Record(Schema schema)

Methods
1. Object get(String key) − Returns the value of a field of the given name.
2. Schema getSchema() − Returns the schema of this instance.
3. void put(int i, Object v) − Sets the value of a field given its position in the schema.
4. void put(String key, Object value) − Sets the value of a field given its name.
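The following sketch shows how these classes fit together end to end. It uses GenericDatumWriter and GenericDatumReader from the org.apache.avro.generic package (the generic counterparts of the Specific classes above); the schema, field names, and the file name emp.avro are made up for illustration.

import java.io.File;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DatumWriter;

public class ReferenceApiSketch {
   public static void main(String[] args) throws IOException {
      // Parse a throwaway schema with Schema.Parser (field names are illustrative).
      Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"emp\","
            + "\"fields\":[{\"name\":\"name\",\"type\":\"string\"},"
            + "{\"name\":\"id\",\"type\":\"int\"}]}");

      // Build a record and fill its fields by name using GenericData.Record.
      GenericRecord emp = new GenericData.Record(schema);
      emp.put("name", "Rajiv");
      emp.put("id", 1);

      // Serialize the record to a file with DataFileWriter.
      File file = new File("emp.avro");
      DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<GenericRecord>(schema);
      DataFileWriter<GenericRecord> fileWriter = new DataFileWriter<GenericRecord>(datumWriter);
      fileWriter.create(schema, file);
      fileWriter.append(emp);
      fileWriter.close();

      // Read the record back with DataFileReader, using hasNext()/next().
      DatumReader<GenericRecord> datumReader = new GenericDatumReader<GenericRecord>(schema);
      DataFileReader<GenericRecord> fileReader = new DataFileReader<GenericRecord>(file, datumReader);
      while (fileReader.hasNext()) {
         GenericRecord read = fileReader.next();
         System.out.println(read.get("name") + " : " + read.get("id"));
      }
      fileReader.close();
   }
}

The same flow applies to SpecificDatumWriter and SpecificDatumReader once Java classes have been generated from the schema.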
AVRO – Quick Guide
AVRO – Overview

To transfer data over a network or for its persistent storage, you need to serialize the data. Besides the serialization APIs provided by Java and Hadoop, we have a special utility called Avro, a schema-based serialization technique.

This tutorial teaches you how to serialize and deserialize data using Avro. Avro provides libraries for various programming languages. In this tutorial, we demonstrate the examples using the Java library.

What is Avro?

Apache Avro is a language-neutral data serialization system. It was developed by Doug Cutting, the father of Hadoop. Since Hadoop writable classes lack language portability, Avro is quite helpful, as it deals with data formats that can be processed by multiple languages. Avro is a preferred tool to serialize data in Hadoop.

Avro has a schema-based system. A language-independent schema is associated with its read and write operations. Avro serializes data that has a built-in schema. Avro serializes the data into a compact binary format, which can be deserialized by any application.

Avro uses JSON format to declare the data structures. Presently, it supports languages such as Java, C, C++, C#, Python, and Ruby.

Avro Schemas

Avro depends heavily on its schema. It allows data to be written and later read with no prior, out-of-band knowledge of the schema, because the schema is stored along with the Avro data in a file for any further processing. It serializes fast and the resulting serialized data is smaller in size.

In RPC, the client and the server exchange schemas during the connection. This exchange helps in the communication between same-named fields, missing fields, extra fields, etc.

Avro schemas are defined with JSON, which simplifies their implementation in languages with JSON libraries.

Like Avro, there are other serialization mechanisms in Hadoop such as Sequence Files, Protocol Buffers, and Thrift.

Comparison with Thrift and Protocol Buffers

Thrift and Protocol Buffers are the libraries that compete most closely with Avro. Avro differs from these frameworks in the following ways −

Avro supports both dynamic and static types as per the requirement. Protocol Buffers and Thrift use Interface Definition Languages (IDLs) to specify schemas and their types. These IDLs are used to generate code for serialization and deserialization.

Avro is built into the Hadoop ecosystem, whereas Thrift and Protocol Buffers are not.

Unlike Thrift and Protocol Buffers, Avro's schema definition is in JSON and not in any proprietary IDL.

Property                 Avro    Thrift & Protocol Buffers
Dynamic schema           Yes     No
Built into Hadoop        Yes     No
Schema in JSON           Yes     No
No need to compile       Yes     No
No need to declare IDs   Yes     No
Bleeding edge            Yes     No

Features of Avro

Listed below are some of the prominent features of Avro −

Avro is a language-neutral data serialization system.

It can be processed by many languages (currently C, C++, C#, Java, Python, and Ruby).

Avro creates a binary structured format that is both compressible and splittable. Hence it can be efficiently used as the input to Hadoop MapReduce jobs.

Avro provides rich data structures. For example, you can create a record that contains an array, an enumerated type, and a sub-record. These datatypes can be created in any language, can be processed in Hadoop, and the results can be fed to a third language.

Avro schemas, defined in JSON, facilitate implementation in languages that already have JSON libraries.
Avro creates a self-describing file named the Avro Data File, in which it stores data along with its schema in the metadata section.

Avro is also used in Remote Procedure Calls (RPCs). During RPC, the client and server exchange schemas in the connection handshake.

General Working of Avro

To use Avro, you need to follow the given workflow −

Step 1 − Create schemas. Here you need to design an Avro schema according to your data.

Step 2 − Read the schemas into your program. It is done in two ways −

By Generating a Class Corresponding to the Schema − Compile the schema using Avro. This generates a class file corresponding to the schema.

By Using the Parsers Library − You can directly read the schema using the parsers library.

Step 3 − Serialize the data using the serialization API provided for Avro, which is found in the package org.apache.avro.specific.

Step 4 − Deserialize the data using the deserialization API provided for Avro, which is found in the package org.apache.avro.specific.

AVRO – Serialization

Data is serialized for two objectives −

For persistent storage

To transport the data over a network

What is Serialization?

Serialization is the process of translating data structures or object state into a binary or textual form in order to transport the data over a network or to store it on some persistent storage. Once the data is transported over the network or retrieved from the persistent storage, it needs to be deserialized again. Serialization is termed marshalling and deserialization is termed unmarshalling.

Serialization in Java

Java provides a mechanism called object serialization, where an object can be represented as a sequence of bytes that includes the object's data as well as information about the object's type and the types of data stored in the object.

After a serialized object is written into a file, it can be read from the file and deserialized. That is, the type information and bytes that represent the object and its data can be used to recreate the object in memory.

The ObjectOutputStream and ObjectInputStream classes are used to serialize and deserialize an object, respectively, in Java (a minimal sketch of this mechanism is shown a little further below).

Serialization in Hadoop

Generally, in distributed systems like Hadoop, the concept of serialization is used for Interprocess Communication and Persistent Storage.

Interprocess Communication

To establish interprocess communication between the nodes connected in a network, the RPC technique is used. RPC uses internal serialization to convert the message into binary format before sending it to the remote node via the network. At the other end, the remote system deserializes the binary stream into the original message. The RPC serialization format is required to be as follows −

Compact − To make the best use of network bandwidth.
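Returning to the Serialization in Java discussion above, the following sketch shows the ObjectOutputStream / ObjectInputStream round trip. The Employee class and the file name employee.ser are invented purely for illustration.

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class JavaSerializationSketch {

   // A throwaway Serializable class, used only to illustrate the mechanism.
   static class Employee implements Serializable {
      private static final long serialVersionUID = 1L;
      String name;
      int id;

      Employee(String name, int id) {
         this.name = name;
         this.id = id;
      }
   }

   public static void main(String[] args) throws Exception {
      Employee emp = new Employee("Rajiv", 1);

      // Serialize: write the object's type information and data to a file.
      try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream("employee.ser"))) {
         out.writeObject(emp);
      }

      // Deserialize: recreate the object in memory from the stored bytes.
      try (ObjectInputStream in = new ObjectInputStream(new FileInputStream("employee.ser"))) {
         Employee restored = (Employee) in.readObject();
         System.out.println(restored.name + " : " + restored.id);
      }
   }
}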
Apache Tajo – Database Creation

This section explains the Tajo DDL commands. Tajo has a built-in database named default.

Create Database Statement

Create Database is a statement used to create a database in Tajo. The syntax for this statement is as follows −

CREATE DATABASE [IF NOT EXISTS] <database_name>

Query
default> create database if not exists test;

Result
The above query will generate the following result.

OK

A database is the namespace in Tajo. A database can contain multiple tables, each with a unique name.

Show Current Database

To check the current database name, issue the following command −

Query
default> \c

Result
The above query will generate the following result.

You are now connected to database "default" as user "user1".
default>

Connect to Database

As of now, you have created a database named "test". The following syntax is used to connect to the "test" database.

\c <database name>

Query
default> \c test

Result
The above query will generate the following result.

You are now connected to database "test" as user "user1".
test>

You can now see that the prompt has changed from the default database to the test database.

Drop Database

To drop a database, use the following syntax −

DROP DATABASE <database-name>

Query
test> \c default
You are now connected to database "default" as user "user1".
default> drop database test;

Result
The above query will generate the following result.

OK
Apache Pig – Date-time Functions

Apache Pig provides the following Date and Time functions −

1. ToDate(milliseconds) − Returns a date-time object according to the given parameters. The other alternatives for this function are ToDate(isostring), ToDate(userstring, format), and ToDate(userstring, format, timezone).
2. CurrentTime() − Returns the date-time object of the current time.
3. GetDay(datetime) − Returns the day of a month from the date-time object.
4. GetHour(datetime) − Returns the hour of a day from the date-time object.
5. GetMilliSecond(datetime) − Returns the millisecond of a second from the date-time object.
6. GetMinute(datetime) − Returns the minute of an hour from the date-time object.
7. GetMonth(datetime) − Returns the month of a year from the date-time object.
8. GetSecond(datetime) − Returns the second of a minute from the date-time object.
9. GetWeek(datetime) − Returns the week of a year from the date-time object.
10. GetWeekYear(datetime) − Returns the week year from the date-time object.
11. GetYear(datetime) − Returns the year from the date-time object.
12. AddDuration(datetime, duration) − Adds the duration object to the date-time object and returns the result.
13. SubtractDuration(datetime, duration) − Subtracts the duration object from the date-time object and returns the result.
14. DaysBetween(datetime1, datetime2) − Returns the number of days between the two date-time objects.
15. HoursBetween(datetime1, datetime2) − Returns the number of hours between two date-time objects.
16. MilliSecondsBetween(datetime1, datetime2) − Returns the number of milliseconds between two date-time objects.
17. MinutesBetween(datetime1, datetime2) − Returns the number of minutes between two date-time objects.
18. MonthsBetween(datetime1, datetime2) − Returns the number of months between two date-time objects.
19. SecondsBetween(datetime1, datetime2) − Returns the number of seconds between two date-time objects.
20. WeeksBetween(datetime1, datetime2) − Returns the number of weeks between two date-time objects.
21. YearsBetween(datetime1, datetime2) − Returns the number of years between two date-time objects.
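A short Grunt-shell sketch using a few of these functions follows. The relation, its fields, the HDFS path, and the date format string are all hypothetical and only show how the functions are called.

grunt> raw_dates = LOAD 'hdfs://localhost:9000/pig_data/dates.txt' USING PigStorage(',')
   as (id:int, joined:chararray, relieved:chararray);

grunt> -- ToDate: convert the chararray columns into date-time objects
grunt> emp_dates = FOREACH raw_dates GENERATE id,
   ToDate(joined, 'yyyy-MM-dd') AS joined_date,
   ToDate(relieved, 'yyyy-MM-dd') AS relieved_date;

grunt> -- GetYear and DaysBetween: extract a field and compute a difference
grunt> emp_summary = FOREACH emp_dates GENERATE id,
   GetYear(joined_date) AS joined_year,
   DaysBetween(relieved_date, joined_date) AS days_employed;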
Apache Pig – Storing Data
In the previous chapter, we learnt how to load data into Apache Pig. You can store the loaded data in the file system using the store operator. This chapter explains how to store data in Apache Pig using the Store operator.

Syntax

Given below is the syntax of the Store statement.

STORE Relation_name INTO 'required_directory_path' [USING function];

Example

Assume we have a file student_data.txt in HDFS with the following content.

001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai

And we have read it into a relation student using the LOAD operator as shown below.

grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt'
   USING PigStorage(',')
   as ( id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray );

Now, let us store the relation in the HDFS directory "/pig_Output/" as shown below.

grunt> STORE student INTO 'hdfs://localhost:9000/pig_Output/' USING PigStorage(',');

Output

After executing the store statement, you will get the following output. A directory is created with the specified name and the data will be stored in it.

2015-10-05 13:05:05,429 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2015-10-05 13:05:05,429 [main] INFO org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - Script Statistics:

HadoopVersion  PigVersion  UserId  StartedAt            FinishedAt           Features
2.6.0          0.15.0      Hadoop  2015-10-05 13:03:03  2015-10-05 13:05:05  UNKNOWN

Success!

Job Stats (time in seconds):
JobId         Maps  Reduces  MaxMapTime  MinMapTime  AvgMapTime  MedianMapTime
job_14459_06  1     0        n/a         n/a         n/a         n/a

MaxReduceTime  MinReduceTime  AvgReduceTime  MedianReducetime  Alias    Feature
0              0              0              0                 student  MAP_ONLY

OutPut folder
hdfs://localhost:9000/pig_Output/

Input(s): Successfully read 0 records from: "hdfs://localhost:9000/pig_data/student_data.txt"

Output(s): Successfully stored 0 records in: "hdfs://localhost:9000/pig_Output"

Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG: job_1443519499159_0006

2015-10-05 13:06:06,192 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!

Verification

You can verify the stored data as shown below.

Step 1

First of all, list out the files in the directory named pig_Output using the ls command as shown below.

hdfs dfs -ls 'hdfs://localhost:9000/pig_Output/'
Found 2 items
-rw-r--r--   1 Hadoop supergroup          0 2015-10-05 13:03 hdfs://localhost:9000/pig_Output/_SUCCESS
-rw-r--r--   1 Hadoop supergroup        224 2015-10-05 13:03 hdfs://localhost:9000/pig_Output/part-m-00000

You can observe that two files were created after executing the store statement.

Step 2

Using the cat command, list the contents of the file named part-m-00000 as shown below.

$ hdfs dfs -cat 'hdfs://localhost:9000/pig_Output/part-m-00000'
1,Rajiv,Reddy,9848022337,Hyderabad
2,siddarth,Battacharya,9848022338,Kolkata
3,Rajesh,Khanna,9848022339,Delhi
4,Preethi,Agarwal,9848022330,Pune
5,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
6,Archana,Mishra,9848022335,Chennai
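As a further optional check, the stored part file can be loaded back into Pig and dumped. The statements below are a sketch that reuses the output path from this example.

grunt> stored_student = LOAD 'hdfs://localhost:9000/pig_Output/part-m-00000'
   USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);

grunt> DUMP stored_student;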
Apache Tajo – Installation
To install Apache Tajo, you must have the following software on your system −

Hadoop version 2.3 or greater
Java version 1.7 or higher
Linux or Mac OS

Let us now continue with the following steps to install Tajo.

Verifying Java Installation

Hopefully, you have already installed Java version 8 on your machine. Now, you just need to proceed by verifying it. To verify, use the following command −

$ java -version

If Java is successfully installed on your machine, you could see the present version of the installed Java. If Java is not installed, follow these steps to install Java 8 on your machine.

Download JDK

Download the latest version of the JDK by visiting the following link −

https://www.oracle.com

The latest version is JDK 8u92 and the file is "jdk-8u92-linux-x64.tar.gz". Please download the file to your machine. Following this, extract the files and move them to a specific directory. Now, set the Java alternatives. Finally, Java is installed on your machine.

Verifying Hadoop Installation

You have already installed Hadoop on your system. Now, verify it using the following command −

$ hadoop version

If everything is fine with your setup, then you could see the version of Hadoop. If Hadoop is not installed, download and install Hadoop by visiting the following link −

https://www.apache.org

Apache Tajo Installation

Apache Tajo provides two execution modes − local mode and fully distributed mode. After verifying the Java and Hadoop installation, proceed with the following steps to install a Tajo cluster on your machine. A local-mode Tajo instance requires only a very simple configuration.

Download the latest version of Tajo by visiting the following link −

https://www.apache.org/dyn/closer.cgi/tajo

Now you can download the file "tajo-0.11.3.tar.gz" to your machine.

Extract Tar File

Extract the tar file by using the following command −

$ cd opt/
$ tar -xzvf tajo-0.11.3.tar.gz
$ cd tajo-0.11.3

Set Environment Variable

Add the following changes to the "conf/tajo-env.sh" file −

$ cd tajo-0.11.3
$ vi conf/tajo-env.sh

# Hadoop home. Required
export HADOOP_HOME=/Users/path/to/Hadoop/hadoop-2.6.2

# The java implementation to use. Required.
export JAVA_HOME=/path/to/jdk1.8.0_92.jdk/

Here, you must specify the Hadoop and Java paths in the "tajo-env.sh" file. After the changes are made, save the file and quit the terminal.

Start Tajo Server

To launch the Tajo server, execute the following command −

$ bin/start-tajo.sh

You will receive a response similar to the following −

Starting single TajoMaster
starting master, logging to /Users/path/to/Tajo/tajo-0.11.3/bin/../
localhost: starting worker, logging to /Users/path/to/Tajo/tajo-0.11.3/bin/../logs/

Tajo master web UI: http://local:26080
Tajo Client Service: local:26002

Now, type the command "jps" to see the running daemons.

$ jps
1010 TajoWorker
1140 Jps
933 TajoMaster

Launch Tajo Shell (Tsql)

To launch the Tajo shell client, use the following command −

$ bin/tsql

You will receive the tsql welcome banner (an ASCII-art Tajo logo along with the version, 0.11.3), followed by −

Try \? for help.

Quit Tajo Shell

Execute the following command to quit tsql −

default> \q
bye!

Here, default refers to the built-in default database in Tajo.

Web UI

Type the following URL to launch the Tajo web UI −

http://localhost:26080/

You will now see the Tajo web UI, which includes an ExecuteQuery option.
Stop Tajo

To stop the Tajo server, use the following command −

$ bin/stop-tajo.sh

You will get the following response −

localhost: stopping worker
stopping master
Apache Tajo – Configuration Settings

Tajo's configuration is based on Hadoop's configuration system. This chapter explains the Tajo configuration settings in detail.

Basic Settings

Tajo uses the following two config files −

catalog-site.xml − configuration for the catalog server.
tajo-site.xml − configuration for other Tajo modules.

Distributed Mode Configuration

A distributed mode setup runs on the Hadoop Distributed File System (HDFS). Let's follow the steps to configure a Tajo distributed mode setup.

tajo-site.xml

This file is available in the /path/to/tajo/conf directory and acts as the configuration for other Tajo modules. To access Tajo in distributed mode, apply the following changes to "tajo-site.xml" −

<property>
   <name>tajo.rootdir</name>
   <value>hdfs://hostname:port/tajo</value>
</property>

<property>
   <name>tajo.master.umbilical-rpc.address</name>
   <value>hostname:26001</value>
</property>

<property>
   <name>tajo.master.client-rpc.address</name>
   <value>hostname:26002</value>
</property>

<property>
   <name>tajo.catalog.client-rpc.address</name>
   <value>hostname:26005</value>
</property>

Master Node Configuration

Tajo uses HDFS as its primary storage type. The configuration is as follows and should be added to "tajo-site.xml" −

<property>
   <name>tajo.rootdir</name>
   <value>hdfs://namenode_hostname:port/path</value>
</property>

Catalog Configuration

If you want to customize the catalog service, copy $path/to/Tajo/conf/catalog-site.xml.template to $path/to/Tajo/conf/catalog-site.xml and add any of the following configurations as needed. For example, if you use the "Hive catalog store" to access Tajo, then the configuration should be like the following −

<property>
   <name>tajo.catalog.store.class</name>
   <value>org.apache.tajo.catalog.store.HCatalogStore</value>
</property>

If you need to store the catalog in MySQL, then apply the following changes −

<property>
   <name>tajo.catalog.store.class</name>
   <value>org.apache.tajo.catalog.store.MySQLStore</value>
</property>

<property>
   <name>tajo.catalog.jdbc.connection.id</name>
   <value><mysql user name></value>
</property>

<property>
   <name>tajo.catalog.jdbc.connection.password</name>
   <value><mysql user password></value>
</property>

<property>
   <name>tajo.catalog.jdbc.uri</name>
   <value>jdbc:mysql://<mysql host name>:<mysql port>/<database name for tajo>?createDatabaseIfNotExist=true</value>
</property>

Similarly, you can register the other Tajo-supported catalogs in the configuration file.

Worker Configuration

By default, the TajoWorker stores temporary data on the local file system. It is defined in the "tajo-site.xml" file as follows −

<property>
   <name>tajo.worker.tmpdir.locations</name>
   <value>/disk1/tmpdir,/disk2/tmpdir,/disk3/tmpdir</value>
</property>

To increase the capacity of the running tasks of each worker resource, choose the following configuration −

<property>
   <name>tajo.worker.resource.cpu-cores</name>
   <value>12</value>
</property>

<property>
   <name>tajo.task.resource.min.memory-mb</name>
   <value>2000</value>
</property>

<property>
   <name>tajo.worker.resource.disks</name>
   <value>4</value>
</property>

To make the Tajo worker run in a dedicated mode, choose the following configuration −

<property>
   <name>tajo.worker.resource.dedicated</name>
   <value>true</value>
</property>