TIKA – Content Extraction

Tika uses various parser libraries to extract content from documents. It chooses the right parser for the given document type. For parsing documents, the parseToString() method of the Tika facade class is generally used. The steps of the parsing process, which parseToString() abstracts, are as follows:

1. When we pass a document to Tika, it uses a suitable type detection mechanism to detect the document type.
2. Once the document type is known, it chooses a suitable parser from its parser repository. The parser repository contains classes that make use of external libraries.
3. The document is then passed to the chosen parser, which parses the content, extracts the text, and throws exceptions for unreadable formats.

Content Extraction using Tika

Given below is the program for extracting text from a file using the Tika facade class:

    import java.io.File;
    import java.io.IOException;

    import org.apache.tika.Tika;
    import org.apache.tika.exception.TikaException;

    public class TikaExtraction {

       public static void main(final String[] args) throws IOException, TikaException {

          //Assume sample.txt is in your current directory
          File file = new File("sample.txt");

          //Instantiating the Tika facade class
          Tika tika = new Tika();

          //Detects the type, picks a parser, and returns the extracted text
          String filecontent = tika.parseToString(file);
          System.out.println("Extracted Content: " + filecontent);
       }
    }

Save the above code as TikaExtraction.java and run it from the command prompt:

    javac TikaExtraction.java
    java TikaExtraction

Given below is the content of sample.txt.

    Hi students welcome to tutorialspoint

It gives you the following output:

    Extracted Content: Hi students welcome to tutorialspoint

Content Extraction using Parser Interface

The parser package of Tika provides several interfaces and classes with which we can parse a text document.
Given below is the block diagram of the org.apache.tika.parser package. There are several parser classes available, e.g., PDFParser, Mp3Parser, OfficeParser, etc., to parse the respective document types individually. All these classes implement the Parser interface.

CompositeParser

The given diagram shows Tika's general-purpose parser classes: CompositeParser and AutoDetectParser. Since the CompositeParser class follows the composite design pattern, you can use a group of parser instances as a single parser. The CompositeParser class also allows access to all the classes that implement the Parser interface.

AutoDetectParser

This is a subclass of CompositeParser that provides automatic type detection. Using this functionality, the AutoDetectParser automatically sends incoming documents to the appropriate parser classes using the composite methodology.

parse() method

Along with parseToString(), you can also use the parse() method of the Parser interface. The prototype of this method is shown below.

    parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)

The following table lists the four objects it accepts as parameters.

    Sr.No.  Object & Description
    1       InputStream stream: any InputStream object that contains the content of the file.
    2       ContentHandler handler: Tika passes the document as XHTML content to this handler; the document is then processed using the SAX API. It provides efficient postprocessing of the contents of a document.
    3       Metadata metadata: the metadata object is used both as a source and a target of document metadata.
    4       ParseContext context: this object is used in cases where the client application wants to customize the parsing process.

Example

Given below is an example that shows how the parse() method is used.

Step 1: To use the parse() method of the Parser interface, instantiate any of the classes that implement this interface.
There are individual parser classes such as PDFParser, OfficeParser, XMLParser, etc. You can use any of these individual document parsers. Alternatively, you can use either CompositeParser or AutoDetectParser, which use all the parser classes internally and extract the contents of a document using a suitable parser.

    Parser parser = new AutoDetectParser();
       (or)
    Parser parser = new CompositeParser();
       (or)
    an object of any individual parser given in the Tika library

Step 2: Create a handler class object. Given below are the three content handlers:

    Sr.No.  Class & Description
    1       BodyContentHandler: picks the body part of the XHTML output and writes that content to the output writer or output stream. It then redirects the XHTML content to another content handler instance.
    2       LinkContentHandler: detects and collects all the links (href attributes) of the XHTML document and forwards them for use by tools like web crawlers.
    3       TeeContentHandler: forwards the SAX events to several content handlers, so multiple tools can be used simultaneously.

Since our target is to extract the text contents from a document, instantiate BodyContentHandler as shown below:

    BodyContentHandler handler = new BodyContentHandler();

Step 3: Create the Metadata object as shown below:

    Metadata metadata = new Metadata();

Step 4: Create any of the input stream objects, and pass it the file whose content should be extracted.

FileInputStream

Instantiate a File object by passing the file path as a parameter, and pass this object to the FileInputStream class constructor.

Note: The path passed to the File object should not contain spaces.

The problem with these input stream classes is that they do not support random access reads, which are required to process some file formats efficiently. To resolve this problem, Tika provides TikaInputStream.
    File file = new File(filepath);
    FileInputStream inputstream = new FileInputStream(file);
       (or)
    InputStream stream = TikaInputStream.get(new File(filename));

Step 5: Create a parse context object as shown below:

    ParseContext context = new ParseContext();

Step 6: Instantiate the parser object, invoke the parse method, and pass all the objects required, as shown below:

    parser.parse(inputstream, handler, metadata, context);

Given below is the program for content extraction using the Parser interface:

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;

    import org.apache.tika.exception.TikaException;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.Parser;
    import org.apache.tika.sax.BodyContentHandler;

    import org.xml.sax.SAXException;

    public class ParserExtraction {

       public static void main(final String[] args) throws IOException, SAXException, TikaException {

          //Assume sample.txt is in your current directory
          File file = new File("sample.txt");

          //parse method parameters
          Parser parser = new AutoDetectParser();
          BodyContentHandler handler = new BodyContentHandler();
          Metadata metadata = new Metadata();
          FileInputStream inputstream = new FileInputStream(file);
          ParseContext context = new ParseContext();

          //parsing the file
          parser.parse(inputstream, handler, metadata, context);

          //the handler now holds the extracted text
          System.out.println("File content : " + handler.toString());
       }
    }
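The composite idea behind CompositeParser, where a group of parsers is used as if it were a single parser, can be sketched in plain Java without Tika on the classpath. This is a minimal illustration, not Tika's implementation: the SimpleParser interface, the SimpleCompositeParser class, and the two toy parsers are hypothetical names invented for this sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CompositeSketch {

    /** Hypothetical stand-in for a parser: turns raw input into extracted text. */
    interface SimpleParser {
        String parse(String raw);
    }

    /** Holds several parsers keyed by media type and behaves like a single parser. */
    static class SimpleCompositeParser {
        private final Map<String, SimpleParser> parsers = new LinkedHashMap<>();

        void register(String mediaType, SimpleParser parser) {
            parsers.put(mediaType, parser);
        }

        /** Delegates to the parser registered for the given media type. */
        String parse(String mediaType, String raw) {
            SimpleParser parser = parsers.get(mediaType);
            if (parser == null) {
                throw new IllegalArgumentException("No parser for " + mediaType);
            }
            return parser.parse(raw);
        }
    }

    /** Builds a composite with two toy parsers and parses one input. */
    static String demo(String mediaType, String raw) {
        SimpleCompositeParser composite = new SimpleCompositeParser();
        // Plain text: only trim surrounding whitespace.
        composite.register("text/plain", String::trim);
        // HTML: strip anything that looks like a tag (a crude assumption, for illustration only).
        composite.register("text/html", s -> s.replaceAll("<[^>]*>", "").trim());
        return composite.parse(mediaType, raw);
    }

    public static void main(String[] args) {
        System.out.println(demo("text/html", "<p>Hi students</p>"));
    }
}
```

The caller only talks to the composite; which concrete parser does the work depends on the media type, which is the role automatic type detection plays for AutoDetectParser.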
TIKA – Environment
This chapter takes you through the process of setting up Apache Tika on Windows and Linux. Administrator privileges are needed while installing Apache Tika.

System Requirements

    JDK               Java SE JDK 1.6 or above
    Memory            1 GB RAM (recommended)
    Disk Space        No minimum requirement
    Operating System  Windows XP or above, Linux

Step 1: Verifying Java Installation

To verify the Java installation, open the console and execute the following java command.

    OS       Task                    Command
    Windows  Open command console    > java -version
    Linux    Open command terminal   $ java -version

If Java has been installed properly on your system, then you should get one of the following outputs, depending on the platform you are working on.

    Windows:
       java version "1.7.0_60"
       Java(TM) SE Runtime Environment (build 1.7.0_60-b19)
       Java HotSpot(TM) 64-Bit Server VM (build 24.60-b09, mixed mode)

    Linux:
       java version "1.7.0_25"
       OpenJDK Runtime Environment (rhel-2.3.10.4.el6_4-x86_64)
       OpenJDK 64-Bit Server VM (build 23.7-b01, mixed mode)

We assume the readers of this tutorial have Java 1.7.0_60 installed on their system before proceeding. In case you do not have the Java SDK, download its current version from https://www.oracle.com/technetwork/java/javase/downloads/index.html and install it.

Step 2: Setting Java Environment

Set the JAVA_HOME environment variable to point to the base directory where Java is installed on your machine. For example:

    OS       Setting
    Windows  Set the environment variable JAVA_HOME to C:\Program Files\Java\jdk1.7.0_60
    Linux    export JAVA_HOME=/usr/local/java-current

Append the full path of the Java compiler location to the system path.

    OS       Setting
    Windows  Append the string ;C:\Program Files\Java\jdk1.7.0_60\bin to the end of the system variable PATH.
    Linux    export PATH=$PATH:$JAVA_HOME/bin/

Verify the setup by running the command java -version from the command prompt, as explained above.
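As a quick cross-check of Steps 1 and 2, the same information can be read from inside a Java program. The class below is a small sketch that uses only the standard System.getProperty and System.getenv calls; the class name EnvCheck is just an illustrative choice.

```java
public class EnvCheck {

    /** Returns a one-line summary of the running Java version and the JAVA_HOME setting. */
    static String describe() {
        String version = System.getProperty("java.version");
        String javaHome = System.getenv("JAVA_HOME");
        return "java.version=" + version
                + ", JAVA_HOME=" + (javaHome == null ? "(not set)" : javaHome);
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

If JAVA_HOME is reported as not set, the environment variable from Step 2 is not visible to the shell that started the program.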
Step 3: Setting up Apache Tika Environment

Programmers can integrate Apache Tika in their environment by using the command line, the Tika API, the command line interface (CLI) of Tika, the graphical user interface (GUI) of Tika, or the source code.

For any of these approaches, you first have to download the source code of Tika. You will find it at https://Tika.apache.org/download.html, where there are two links:

    apache-tika-1.6-src.zip: contains the source code of Tika.
    tika-app-1.6.jar: a jar file that contains the Tika application.

Download these two files. A snapshot of the official website of Tika is shown below.

After downloading the files, set the classpath for the jar file tika-app-1.6.jar. Add the complete path of the jar file as shown in the table below.

    OS       Setting
    Windows  Append the string "C:\jars\tika-app-1.6.jar" to the user environment variable CLASSPATH
    Linux    export CLASSPATH=$CLASSPATH:/usr/share/jars/tika-app-1.6.jar

Apache also provides the Tika application with a graphical user interface (GUI), which can be built using Eclipse.

Tika-Maven Build using Eclipse

Open Eclipse and create a new project. If you do not have Maven in your Eclipse, set it up by following the given steps.

1. Open the link https://wiki.eclipse.org/M2E_updatesite_and_gittags. There you will find the m2e plugin releases in a tabular format.
2. Pick the latest version and save the URL from the p2 url column.
3. Return to Eclipse; in the menu bar, click Help and choose Install New Software from the dropdown menu.
4. Click the Add button and type any name you like, as the name is optional. Now paste the saved URL in the Location field.
5. A new plugin is added with the name you chose in the previous step. Check the checkbox in front of it and click Next.
6. Proceed with the installation. Once completed, restart Eclipse.
7. Now right click on the project, and in the Configure option, select Convert to Maven Project.
8. A new wizard for creating a new pom appears.
Enter the Group Id as org.apache.tika, enter the latest version of Tika, select the packaging as jar, and click Finish.

The Maven plugin is now installed, and your project is converted into a Maven project. Now you have to configure the pom.xml file.

Configure the XML File

Get the Tika Maven dependency from https://mvnrepository.com/artifact/org.apache.tika

Shown below are the Maven dependencies of Apache Tika. Each artifact goes in its own dependency element.

    <dependency>
       <groupId>org.apache.tika</groupId>
       <artifactId>tika-core</artifactId>
       <version>1.6</version>
    </dependency>
    <dependency>
       <groupId>org.apache.tika</groupId>
       <artifactId>tika-parsers</artifactId>
       <version>1.6</version>
    </dependency>
    <dependency>
       <groupId>org.apache.tika</groupId>
       <artifactId>tika</artifactId>
       <version>1.6</version>
    </dependency>
    <dependency>
       <groupId>org.apache.tika</groupId>
       <artifactId>tika-serialization</artifactId>
       <version>1.6</version>
    </dependency>
    <dependency>
       <groupId>org.apache.tika</groupId>
       <artifactId>tika-app</artifactId>
       <version>1.6</version>
    </dependency>
    <dependency>
       <groupId>org.apache.tika</groupId>
       <artifactId>tika-bundle</artifactId>
       <version>1.6</version>
    </dependency>
TIKA – GUI
Graphical User Interface (GUI)

Tika provides a jar file along with its source code at the following link: https://tika.apache.org/download.html.

Download both files and set the classpath for the jar file. Extract the source code zip folder and open the tika-app folder. In the extracted folder at "tika-1.6\tika-app\src\main\java\org\apache\tika\gui" you will see two source files: ParsingTransferHandler.java and TikaGUI.java.

Compile both source files and run the TikaGUI class. It opens the following window.

Let us now see how to make use of the Tika GUI.

On the GUI, click Open, then browse and select the file to be extracted, or drag it onto the whitespace of the window.

Tika extracts the content of the file and displays it in five different formats, viz. metadata, formatted text, plain text, main content, and structured text. You can choose any of the formats you want.

In the same way, you will also find the CLI class in the "tika-1.6\tika-app\src\main\java\org\apache\tika\cli" folder.

The following illustration shows what Tika can do. When we drop the image on the GUI, Tika extracts and displays its metadata.
TIKA – Metadata Extraction
Besides content, Tika also extracts the metadata from a file. Metadata is the additional information supplied with a file. For an audio file, for example, the artist name, album name, and title are metadata.

XMP Standards

The Extensible Metadata Platform (XMP) is a standard for processing and storing information related to the content of a file. It was created by Adobe Systems Inc. XMP provides standards for defining, creating, and processing metadata. You can embed this standard into several file formats such as PDF, JPEG, GIF, HTML, etc.

Property Class

Tika uses the Property class to follow the XMP property definition. It provides the PropertyType and ValueType enums to capture the name and value of a metadata entry.

Metadata Class

This class implements various interfaces such as ClimateForecast, CreativeCommons, Geographic, TIFF, etc. to provide support for various metadata models. In addition, this class provides various methods to access the metadata extracted from a file.

Metadata Names

We can get the list of all metadata names of a file from its metadata object using the method names(). It returns all the names as a string array. Using the name of a metadata entry, we can get its value using the get() method. It takes a metadata name and returns the value associated with it.

    String[] metadataNames = metadata.names();

    String value = metadata.get(name);

Extracting Metadata using Parse Method

Whenever we parse a file using parse(), we pass an empty metadata object as one of the parameters. This method extracts the metadata of the given file (if the file contains any) and places it in the metadata object. Therefore, after parsing the file using parse(), we can read the metadata from that object.
    Parser parser = new AutoDetectParser();
    BodyContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();   //empty metadata object
    FileInputStream inputstream = new FileInputStream(file);
    ParseContext context = new ParseContext();
    parser.parse(inputstream, handler, metadata, context);

    // now this metadata object contains the extracted metadata of the given file.
    metadata.names();

Given below is the complete program to extract metadata from a file.

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;

    import org.apache.tika.exception.TikaException;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.Parser;
    import org.apache.tika.sax.BodyContentHandler;

    import org.xml.sax.SAXException;

    public class GetMetadata {

       //parse() can throw SAXException, so it is declared here as well
       public static void main(final String[] args) throws IOException, SAXException, TikaException {

          //Assume that boy.jpg is in your current directory
          File file = new File("boy.jpg");

          //Parser method parameters
          Parser parser = new AutoDetectParser();
          BodyContentHandler handler = new BodyContentHandler();
          Metadata metadata = new Metadata();
          FileInputStream inputstream = new FileInputStream(file);
          ParseContext context = new ParseContext();

          parser.parse(inputstream, handler, metadata, context);
          System.out.println(handler.toString());

          //getting the list of all metadata elements
          String[] metadataNames = metadata.names();

          for(String name : metadataNames) {
             System.out.println(name + ": " + metadata.get(name));
          }
       }
    }

Save the above code as GetMetadata.java and run it from the command prompt using the following commands:

    javac GetMetadata.java
    java GetMetadata

Given below is the snapshot of boy.jpg

If you execute the above program, it will give you the following output:

    X-Parsed-By: org.apache.tika.parser.DefaultParser
    Resolution Units: inch
    Compression Type: Baseline
    Data Precision: 8 bits
    Number of Components: 3
    tiff:ImageLength: 3000
    Component 2: Cb component: Quantization table 1, Sampling factors 1 horiz/1 vert
    Component 1: Y component: Quantization table 0, Sampling factors 2 horiz/2 vert
    Image Height: 3000 pixels
    X Resolution: 300 dots
    Original Transmission Reference: 53616c7465645f5f2368da84ca932841b336ac1a49edb1a93fae938b8db2cb3ec9cc4dc28d7383f1
    Image Width: 4000 pixels
    IPTC-NAA record: 92 bytes binary data
    Component 3: Cr component: Quantization table 1, Sampling factors 1 horiz/1 vert
    tiff:BitsPerSample: 8
    Application Record Version: 4
    tiff:ImageWidth: 4000
    Content-Type: image/jpeg
    Y Resolution: 300 dots

We can also get our desired metadata values.

Adding New Metadata Values

We can add new metadata values using the add() method of the Metadata class. Given below is the syntax of this method. Here we are adding the author name.

    metadata.add("author", "Tutorials point");

The Metadata class has predefined properties, including the properties inherited from interfaces like ClimateForecast, CreativeCommons, Geographic, etc., to support various data models. Shown below is the usage of the SOFTWARE property inherited from the TIFF interface, which Tika implements to follow the XMP metadata standards for TIFF image formats.

    metadata.add(Metadata.SOFTWARE, "ms paint");

Given below is the complete program that demonstrates how to add metadata values to a given file. Here the list of the metadata elements is displayed in the output so that you can observe the change in the list after adding new values.
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.util.Arrays;

    import org.apache.tika.exception.TikaException;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.Parser;
    import org.apache.tika.sax.BodyContentHandler;

    import org.xml.sax.SAXException;

    public class AddMetadata {

       public static void main(final String[] args) throws IOException, SAXException, TikaException {

          //create a file object and assume Example.txt is in your current directory
          File file = new File("Example.txt");

          //Parser method parameters
          Parser parser = new AutoDetectParser();
          BodyContentHandler handler = new BodyContentHandler();
          Metadata metadata = new Metadata();
          FileInputStream inputstream = new FileInputStream(file);
          ParseContext context = new ParseContext();

          //parsing the document
          parser.parse(inputstream, handler, metadata, context);

          //list of metadata elements before adding new elements
          System.out.println("metadata elements: " + Arrays.toString(metadata.names()));

          //adding a new metadata name value pair
          metadata.add("Author", "Tutorials Point");
          System.out.println("metadata name value pair is successfully added");

          //printing all the metadata elements after adding new elements
          System.out.println("Here is the list of all the metadata elements after adding new elements");
          System.out.println(Arrays.toString(metadata.names()));
       }
    }

Save the above code as AddMetadata.java and run it from the command prompt:

    javac AddMetadata.java
    java AddMetadata

Given below is the content of Example.txt

    Hi students welcome to tutorialspoint

If you execute the above program, it will give you the following output:

    metadata elements: [Content-Encoding, Content-Type]
    metadata name value pair is successfully added
    Here is the list of all the metadata elements after adding new elements
    [Content-Encoding, Author, Content-Type]

Setting Values to Existing Metadata Elements

You can set values to the existing metadata elements using the set() method. The syntax of setting the date property using
TIKA – Language Detection
Need for Language Detection

To classify documents by the language they are written in, for example on a multilingual website, a language detection tool is needed. This tool should accept documents without language annotation (metadata), detect the language, and add that information to the metadata of the document.

Algorithms for Profiling Corpus

What is Corpus?

To detect the language of a document, a language profile is constructed and compared with the profiles of the known languages. The text set of these known languages is known as a corpus.

A corpus is a collection of texts of a written language that shows how the language is used in real situations. The corpus is developed from books, transcripts, and other data resources like the Internet. The accuracy of the corpus depends upon the profiling algorithm we use to frame it.

What are Profiling Algorithms?

The common way of detecting languages is by using dictionaries. The words used in a given piece of text are matched with those in the dictionaries.

A list of common words used in a language is the simplest effective corpus for detecting a particular language, for example, the articles a, an, the in English.

Using Word Sets as Corpus

Using word sets, a simple algorithm is framed to find the distance between two corpora, which is equal to the sum of the differences between the frequencies of matching words.

Such algorithms suffer from the following problems:

1. Since the number of matching words is very small in short texts, the algorithm cannot work efficiently with small texts having few sentences. It needs a lot of text for an accurate match.
2. It cannot detect word boundaries for languages having compound words, and for those having no word dividers like spaces or punctuation marks.

Due to these difficulties in using word sets as a corpus, individual characters or character groups are considered instead.
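The word-set distance described above can be sketched in a few lines of plain Java. This is only an illustration of the idea under stated assumptions (relative word frequencies, whitespace and punctuation as word dividers); the class and method names are hypothetical and it is not code from Tika.

```java
import java.util.HashMap;
import java.util.Map;

public class WordSetDistance {

    /** Computes the relative frequency of each word in the text. */
    static Map<String, Double> frequencies(String text) {
        Map<String, Integer> counts = new HashMap<>();
        int total = 0;
        // Assumption: words are separated by non-word characters such as spaces or punctuation.
        for (String word : text.toLowerCase().split("\\W+")) {
            if (word.isEmpty()) {
                continue;
            }
            counts.merge(word, 1, Integer::sum);
            total++;
        }
        Map<String, Double> freq = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            freq.put(e.getKey(), e.getValue() / (double) total);
        }
        return freq;
    }

    /** Sums the absolute frequency differences over the words present in both profiles. */
    static double distance(Map<String, Double> a, Map<String, Double> b) {
        double sum = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                sum += Math.abs(e.getValue() - other);
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        double d = distance(frequencies("the the cat"), frequencies("the dog"));
        System.out.println(d);
    }
}
```

Because only matching words contribute, two short texts with no words in common get a distance of 0, which illustrates why this approach needs a lot of text to be reliable.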
Using Character Sets as Corpus

Since the characters that are commonly used in a language are finite in number, it is easier to apply an algorithm based on character frequencies than one based on word frequencies. This algorithm works even better for character sets that are used in only one or very few languages.

This algorithm suffers from the following drawbacks:

1. It is difficult to differentiate two languages having similar character frequencies.
2. There is no specific tool or algorithm to identify a language when the character set used as the corpus is shared by multiple languages.

N-gram Algorithm

The drawbacks stated above gave rise to a new approach of using character sequences of a given length for profiling a corpus. Such sequences of characters are called N-grams, where N represents the length of the character sequence.

The N-gram algorithm is an effective approach for language detection, especially for European languages like English. It also works fine with short texts.

Though there are more advanced language profiling algorithms that can detect multiple languages in a multilingual document, Tika uses the 3-gram algorithm, as it is suitable in most practical situations.

Language Detection in Tika

Among all the 184 standard languages standardized by ISO 639-1, Tika can detect 18 languages. Language detection in Tika is done using the getLanguage() method of the LanguageIdentifier class. This method returns the code of the language as a String. Given below is the list of the 18 language-code pairs detected by Tika:

    da: Danish      de: German      et: Estonian    el: Greek
    en: English     es: Spanish     fi: Finnish     fr: French
    hu: Hungarian   is: Icelandic   it: Italian     nl: Dutch
    no: Norwegian   pl: Polish      pt: Portuguese  ru: Russian
    sv: Swedish     th: Thai

While instantiating the LanguageIdentifier class, you should pass either the content to be analyzed as a String, or a LanguageProfile object.
    LanguageIdentifier object = new LanguageIdentifier("this is english");

Given below is the example program for language detection in Tika.

    import java.io.IOException;

    import org.apache.tika.exception.TikaException;
    import org.apache.tika.language.LanguageIdentifier;

    import org.xml.sax.SAXException;

    public class LanguageDetection {

       public static void main(String args[]) throws IOException, SAXException, TikaException {

          //Build a profile of the given text and compare it with the known languages
          LanguageIdentifier identifier = new LanguageIdentifier("this is english ");
          String language = identifier.getLanguage();
          System.out.println("Language of the given content is : " + language);
       }
    }

Save the above code as LanguageDetection.java and run it from the command prompt using the following commands:

    javac LanguageDetection.java
    java LanguageDetection

If you execute the above program, it gives the following output:

    Language of the given content is : en

Language Detection of a Document

To detect the language of a given document, you have to parse it using the parse() method. The parse() method parses the content and stores it in the handler object, which was passed to it as one of the arguments.
Pass the String form of the handler object to the constructor of the LanguageIdentifier class as shown below:

    parser.parse(inputstream, handler, metadata, context);
    LanguageIdentifier object = new LanguageIdentifier(handler.toString());

Given below is the complete program that demonstrates how to detect the language of a given document:

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;

    import org.apache.tika.exception.TikaException;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.Parser;
    import org.apache.tika.sax.BodyContentHandler;
    import org.apache.tika.language.*;

    import org.xml.sax.SAXException;

    public class DocumentLanguageDetection {

       public static void main(final String[] args) throws IOException, SAXException, TikaException {

          //Instantiating a file object
          File file = new File("Example.txt");

          //Parser method parameters
          Parser parser = new AutoDetectParser();
          BodyContentHandler handler = new BodyContentHandler();
          Metadata metadata = new Metadata();
          FileInputStream content = new FileInputStream(file);

          //Parsing the given document
          parser.parse(content, handler, metadata, new ParseContext());

          //Detecting the language of the extracted text
          LanguageIdentifier object = new LanguageIdentifier(handler.toString());
          System.out.println("Language name :" + object.getLanguage());
       }
    }

Save the above code as DocumentLanguageDetection.java and run it from the command prompt:

    javac DocumentLanguageDetection.java
    java DocumentLanguageDetection

Given below is the content of Example.txt.

    Hi students welcome to tutorialspoint

If you execute the above program, it will give you the following output:

    Language name :en

Along with the Tika jar, Tika provides a Graphical User Interface (GUI) application and a Command Line Interface (CLI) application. You can also execute a Tika application from the command prompt like other Java applications.
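To make the 3-gram profiling described in this chapter more concrete, here is a minimal sketch in plain Java. It is not Tika's LanguageIdentifier: the class name, the tiny sample texts, and the use of cosine similarity to compare profiles are all assumptions chosen for illustration, and real language profiles are built from far larger corpora.

```java
import java.util.HashMap;
import java.util.Map;

public class TrigramSketch {

    /** Counts every sequence of three consecutive characters in the text. */
    static Map<String, Integer> profile(String text) {
        Map<String, Integer> counts = new HashMap<>();
        String s = text.toLowerCase();
        for (int i = 0; i + 3 <= s.length(); i++) {
            counts.merge(s.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    /** Cosine similarity of two trigram profiles (an assumed measure, not Tika's). */
    static double similarity(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (int v : b.values()) {
            normB += v * v;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / Math.sqrt(normA * normB);
    }

    /** Returns the language code whose sample text is most similar to the input. */
    static String detect(String text, Map<String, String> samplesByLanguage) {
        Map<String, Integer> target = profile(text);
        String best = null;
        double bestScore = -1.0;
        for (Map.Entry<String, String> e : samplesByLanguage.entrySet()) {
            double score = similarity(target, profile(e.getValue()));
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical one-sentence "corpora"; far too small for real use.
        Map<String, String> samples = new HashMap<>();
        samples.put("en", "the quick brown fox jumps over the lazy dog and then it runs there");
        samples.put("de", "der schnelle braune fuchs springt ueber den faulen hund und dann rennt er");
        System.out.println(detect("the dog is over there", samples));
    }
}
```

The structure mirrors the description above: build a profile of the unknown text, compare it with the profile of each known language, and report the closest match.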
TIKA – Architecture
Application-Level Architecture of Tika

Application programmers can easily integrate Tika in their applications. Tika provides a Command Line Interface and a GUI to make it user friendly.

In this chapter, we will discuss the four important modules that constitute the Tika architecture. The following illustration shows the architecture of Tika along with its four modules:

    Language detection mechanism.
    MIME detection mechanism.
    Parser interface.
    Tika Facade class.

Language Detection Mechanism

Whenever a text document is passed to Tika, it detects the language in which the document was written. It accepts documents without language annotation and adds that information to the metadata of the document by detecting the language.

To support language identification, Tika has a class called LanguageIdentifier in the package org.apache.tika.language, and a language identification repository that contains algorithms for language detection from a given text. Tika internally uses the N-gram algorithm for language detection.

MIME Detection Mechanism

Tika can detect the document type according to the MIME standards. Default MIME type detection in Tika is done using org.apache.tika.mime.MimeTypes. It uses the org.apache.tika.detect.Detector interface for most of the content type detection.

Internally, Tika uses several techniques such as file globs, content-type hints, magic bytes, and character encodings.

Parser Interface

The Parser interface of org.apache.tika.parser is the key interface for parsing documents in Tika. This interface extracts the text and the metadata from a document and presents them to external users who want to write parser plugins.

Using different concrete parser classes, each specific to an individual document type, Tika supports a lot of document formats.
These format-specific classes provide support for different document formats, either by directly implementing the parser logic or by using external parser libraries.

Tika Facade Class

Using the Tika facade class is the simplest and most direct way of calling Tika from Java, and it follows the facade design pattern. You can find the Tika facade class in the org.apache.tika package of the Tika API.

By implementing the basic use cases, the facade acts as a broker between the application and the rest of Tika. It abstracts the underlying complexity of the Tika library, such as the MIME detection mechanism, the parser interface, and the language detection mechanism, and provides users with a simple interface.

Features of Tika

    Unified parser interface: Tika encapsulates all the third party parser libraries within a single parser interface. Due to this feature, the user is freed from the burden of selecting the suitable parser library and using it according to the file type encountered.
    Low memory usage: Tika consumes few memory resources, so it is easily embeddable in Java applications. We can also use Tika within applications that run on platforms with limited resources, such as mobile PDAs.
    Fast processing: Quick content detection and extraction can be expected.
    Flexible metadata: Tika understands all the metadata models which are used to describe files.
    Parser integration: Tika can use the various parser libraries available for each document type in a single application.
    MIME type detection: Tika can detect and extract content from all the media types included in the MIME standards.
    Language detection: Tika includes a language identification feature, so it can be used to classify documents by language on multilingual websites.

Functionalities of Tika

Tika supports various functionalities:

    Document type detection
    Content extraction
    Metadata extraction
    Language detection

Document Type Detection

Tika uses various detection techniques to detect the type of the document given to it.
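Two of the detection techniques named in this chapter, magic bytes and file globs, can be illustrated with a short plain-Java sketch. It is not Tika's MimeTypes implementation: the class name, the handful of signatures, and the fallback order are assumptions made for illustration only.

```java
import java.nio.charset.StandardCharsets;

public class MagicSketch {

    /** Guesses a media type from the leading bytes, falling back to the file name. */
    static String detect(byte[] head, String fileName) {
        // Magic bytes: well-known signatures at the start of the file.
        if (startsWith(head, new byte[] {'%', 'P', 'D', 'F'})) {
            return "application/pdf";
        }
        if (startsWith(head, new byte[] {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF})) {
            return "image/jpeg";
        }
        if (startsWith(head, new byte[] {(byte) 0x89, 'P', 'N', 'G'})) {
            return "image/png";
        }
        // File glob fallback: decide by extension when no signature matched.
        String lower = fileName == null ? "" : fileName.toLowerCase();
        if (lower.endsWith(".txt")) {
            return "text/plain";
        }
        if (lower.endsWith(".html") || lower.endsWith(".htm")) {
            return "text/html";
        }
        return "application/octet-stream";
    }

    /** Returns true when data begins with all bytes of prefix. */
    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] head = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(head, "report.bin"));
    }
}
```

Checking the content first and the name second means a PDF saved under a misleading extension is still recognized, which is the usual reason for combining several techniques.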
Content Extraction

Tika has a parser library that can parse the content of various document formats and extract it. After detecting the type of the document, it selects the appropriate parser from the parser repository and passes the document to it. Different classes of Tika have methods to parse different document formats.

Metadata Extraction

Along with the content, Tika extracts the metadata of the document with the same procedure as in content extraction. For some document types, Tika has classes to extract metadata.

Language Detection

Internally, Tika follows algorithms like n-gram to detect the language of the content in a given document. Tika depends on classes like LanguageIdentifier and Profiler for language identification.