AVRO – Reference API

In the previous chapter, we described the input type of Avro, i.e., Avro schemas. In this chapter, we will explain the classes and methods used in the serialization and deserialization of Avro schemas.

SpecificDatumWriter Class

This class belongs to the package org.apache.avro.specific. It implements the DatumWriter interface, which converts Java objects into an in-memory serialized format.

Constructor
1. SpecificDatumWriter(Schema schema)

Method
1. SpecificData getSpecificData() − Returns the SpecificData implementation used by this writer.

SpecificDatumReader Class

This class belongs to the package org.apache.avro.specific. It implements the DatumReader interface, which reads the data of a schema and determines the in-memory data representation. SpecificDatumReader is the class which supports generated Java classes.

Constructor
1. SpecificDatumReader(Schema schema) − Constructs a reader where the writer's and reader's schemas are the same.

Methods
1. SpecificData getSpecificData() − Returns the contained SpecificData.
2. void setSchema(Schema actual) − Sets the writer's schema.

DataFileWriter Class

This class writes a sequence of serialized records of data conforming to a schema, along with the schema, in a file.

Constructor
1. DataFileWriter(DatumWriter<D> dout)

Methods
1. void append(D datum) − Appends a datum to the file.
2. DataFileWriter<D> appendTo(File file) − Opens a writer that appends to an existing file.

DataFileReader Class

This class provides random access to files written with DataFileWriter. It inherits the class DataFileStream.

Constructor
1. DataFileReader(File file, DatumReader<D> reader)

Methods
1. D next() − Reads the next datum in the file.
2. boolean hasNext() − Returns true if more entries remain in this file.

Class Schema.Parser

This class is a parser for JSON-format schemas. It contains methods to parse a schema. It belongs to the org.apache.avro package.

Constructor
1. Schema.Parser()

Methods
1. parse(File file) − Parses the schema provided in the given file.
2. parse(InputStream in) − Parses the schema provided in the given InputStream.
3. parse(String s) − Parses the schema provided in the given String.

Interface GenericRecord

This interface provides methods to access the fields of a record by name as well as by index.

Methods
1. Object get(String key) − Returns the value of a field given its name.
2. void put(String key, Object v) − Sets the value of a field given its name.

Class GenericData.Record

This class is a concrete, general-purpose implementation of the GenericRecord interface.

Constructor
1. GenericData.Record(Schema schema)

Methods
1. Object get(String key) − Returns the value of a field of the given name.
2. Schema getSchema() − Returns the schema of this instance.
3. void put(int i, Object v) − Sets the value of a field given its position in the schema.
4. void put(String key, Object value) − Sets the value of a field given its name.
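To see how these classes fit together, the following two sketches perform a complete write and read using the parser-based (generic) API. Both assume a schema file named emp.avsc; its contents below are a hypothetical minimal employee schema for this illustration, not the exact schema used elsewhere in this tutorial.

   {
      "namespace" : "tutorialspoint",
      "type" : "record",
      "name" : "emp",
      "fields" : [
         { "name" : "name", "type" : "string" },
         { "name" : "id", "type" : "int" }
      ]
   }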
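First, the write side − a minimal sketch, assuming Avro is on the classpath. Schema.Parser, GenericData.Record, and DataFileWriter are the classes described above; GenericDatumWriter, from the package org.apache.avro.generic, is used here as the generic-record counterpart of SpecificDatumWriter (it is not covered in the tables above). Class and file names are illustrative.

   import java.io.File;
   import java.io.IOException;

   import org.apache.avro.Schema;
   import org.apache.avro.file.DataFileWriter;
   import org.apache.avro.generic.GenericData;
   import org.apache.avro.generic.GenericDatumWriter;
   import org.apache.avro.generic.GenericRecord;
   import org.apache.avro.io.DatumWriter;

   public class WriteEmp {
      public static void main(String[] args) throws IOException {
         // Parse the JSON schema file into a Schema object
         Schema schema = new Schema.Parser().parse(new File("emp.avsc"));

         // Build a record generically, setting fields by name
         GenericRecord emp = new GenericData.Record(schema);
         emp.put("name", "Ramu");
         emp.put("id", 1);

         // Write the record, together with its schema, to emp.avro
         DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<>(schema);
         DataFileWriter<GenericRecord> fileWriter = new DataFileWriter<>(datumWriter);
         fileWriter.create(schema, new File("emp.avro"));
         fileWriter.append(emp);
         fileWriter.close();
      }
   }

DataFileWriter.create() stores the schema in the file header, which is what makes the resulting emp.avro self-describing.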
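And the read side, using the hasNext()/next() methods of DataFileReader listed above; GenericDatumReader (also from org.apache.avro.generic) plays the role that SpecificDatumReader plays for generated classes.

   import java.io.File;
   import java.io.IOException;

   import org.apache.avro.Schema;
   import org.apache.avro.file.DataFileReader;
   import org.apache.avro.generic.GenericDatumReader;
   import org.apache.avro.generic.GenericRecord;
   import org.apache.avro.io.DatumReader;

   public class ReadEmp {
      public static void main(String[] args) throws IOException {
         Schema schema = new Schema.Parser().parse(new File("emp.avsc"));

         // DataFileReader gives access to the records written by DataFileWriter
         DatumReader<GenericRecord> datumReader = new GenericDatumReader<>(schema);
         DataFileReader<GenericRecord> fileReader =
               new DataFileReader<>(new File("emp.avro"), datumReader);

         while (fileReader.hasNext()) {
            GenericRecord emp = fileReader.next();
            // GenericRecord.get(String) returns the field value by name
            System.out.println(emp.get("name") + " : " + emp.get("id"));
         }
         fileReader.close();
      }
   }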
AVRO – Quick Guide

AVRO – Overview

To transfer data over a network or for its persistent storage, you need to serialize the data. Besides the serialization APIs provided by Java and Hadoop, we have a special utility called Avro, a schema-based serialization technique.

This tutorial teaches you how to serialize and deserialize data using Avro. Avro provides libraries for various programming languages. In this tutorial, we demonstrate the examples using the Java library.

What is Avro?

Apache Avro is a language-neutral data serialization system. It was developed by Doug Cutting, the father of Hadoop. Since Hadoop writable classes lack language portability, Avro is quite helpful, as it deals with data formats that can be processed by multiple languages. Avro is a preferred tool to serialize data in Hadoop.

Avro has a schema-based system. A language-independent schema is associated with its read and write operations. Avro serializes data, whose schema is built in, into a compact binary format that can be deserialized by any application. Avro uses JSON format to declare the data structures. Presently, it supports languages such as Java, C, C++, C#, Python, and Ruby.

Avro Schemas

Avro depends heavily on its schema. Because the schema is always present when data is read, data can be written with no per-value overheads, which makes serialization fast and the resulting serialized data smaller. The schema is stored along with the Avro data in a file for any further processing.

In RPC, the client and the server exchange schemas during the connection. This exchange helps reconcile same-named fields, missing fields, extra fields, etc., between the two sides.

Avro schemas are defined in JSON, which simplifies their implementation in languages that already have JSON libraries.

Like Avro, there are other serialization mechanisms in Hadoop, such as Sequence Files, Protocol Buffers, and Thrift.

Comparison with Thrift and Protocol Buffers

Thrift and Protocol Buffers are Avro's closest competitors. Avro differs from these frameworks in the following ways −

Avro supports both dynamic and static types, as per the requirement. Protocol Buffers and Thrift use Interface Definition Languages (IDLs) to specify schemas and their types; these IDLs are used to generate code for serialization and deserialization.
Avro is built into the Hadoop ecosystem; Thrift and Protocol Buffers are not.
Unlike Thrift and Protocol Buffers, Avro's schema definition is in JSON and not in any proprietary IDL.

Property                   Avro   Thrift & Protocol Buffers
Dynamic schema             Yes    No
Built into Hadoop          Yes    No
Schema in JSON             Yes    No
No need to compile         Yes    No
No need to declare IDs     Yes    No
Bleeding edge              Yes    No

Features of Avro

Listed below are some of the prominent features of Avro −

Avro is a language-neutral data serialization system. It can be processed by many languages (currently C, C++, C#, Java, Python, and Ruby).
Avro creates a binary structured format that is both compressible and splittable. Hence it can be efficiently used as the input to Hadoop MapReduce jobs.
Avro provides rich data structures. For example, you can create a record that contains an array, an enumerated type, and a sub-record. These datatypes can be created in any language, can be processed in Hadoop, and the results can be fed to a third language.
Avro schemas, defined in JSON, facilitate implementation in languages that already have JSON libraries.
Avro creates a self-describing file named Avro Data File, in which it stores data along with its schema in the metadata section.
Avro is also used in Remote Procedure Calls (RPCs). During RPC, the client and server exchange schemas in the connection handshake.

General Working of Avro

To use Avro, you need to follow the given workflow −

Step 1 − Create schemas. Here you need to design an Avro schema according to your data.

Step 2 − Read the schemas into your program. This is done in two ways −
By generating a class corresponding to the schema − Compile the schema using Avro. This generates a class file corresponding to the schema.
By using the parsers library − You can directly read the schema using the parsers library.

Step 3 − Serialize the data using the serialization API provided for Avro, which is found in the package org.apache.avro.specific.

Step 4 − Deserialize the data using the deserialization API provided for Avro, which is found in the package org.apache.avro.specific.

AVRO – Serialization

Data is serialized for two objectives −

For persistent storage
To transport the data over a network

What is Serialization?

Serialization is the process of translating data structures or object state into a binary or textual form, either to transport the data over a network or to store it in some persistent storage. Once the data is transported over the network or retrieved from the persistent storage, it needs to be deserialized again. Serialization is termed as marshalling and deserialization is termed as unmarshalling.

Serialization in Java

Java provides a mechanism called object serialization, where an object can be represented as a sequence of bytes that includes the object's data as well as information about the object's type and the types of data stored in the object.

After a serialized object is written into a file, it can be read from the file and deserialized. That is, the type information and the bytes that represent the object and its data can be used to recreate the object in memory.

The ObjectOutputStream and ObjectInputStream classes are used to serialize and deserialize an object, respectively, in Java, as the following sketch shows.
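A minimal sketch of this built-in mechanism; the Employee class and the file name employee.ser are illustrative, not from this tutorial.

   import java.io.*;

   // A class must implement java.io.Serializable to be serialized this way
   class Employee implements Serializable {
      String name;
      int id;
      Employee(String name, int id) { this.name = name; this.id = id; }
   }

   public class JavaSerializationDemo {
      public static void main(String[] args) throws IOException, ClassNotFoundException {
         // Serialize (marshal): write the object's type information and state as bytes
         try (ObjectOutputStream out =
               new ObjectOutputStream(new FileOutputStream("employee.ser"))) {
            out.writeObject(new Employee("Ramu", 1));
         }

         // Deserialize (unmarshal): recreate the object in memory from those bytes
         try (ObjectInputStream in =
               new ObjectInputStream(new FileInputStream("employee.ser"))) {
            Employee e = (Employee) in.readObject();
            System.out.println(e.name + " : " + e.id);
         }
      }
   }

Note that, unlike an Avro Data File, the bytes produced here embed Java-specific type information, which is why this format is not portable across languages.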
Serialization in Hadoop

Generally, in distributed systems like Hadoop, the concept of serialization is used for Interprocess Communication and Persistent Storage.

Interprocess Communication

To establish interprocess communication between the nodes connected in a network, the RPC technique is used. RPC uses internal serialization to convert the message into a binary format before sending it to the remote node via the network. At the other end, the remote system deserializes the binary stream into the original message.

The RPC serialization format is required to be as follows −

Compact − To make the best use of network bandwidth, which is the most scarce resource in a data center.