Chunking a sentence refers to dividing it into groups of related words, such as noun groups and verb groups.
Chunking a Sentence using OpenNLP
To chunk sentences, OpenNLP uses a model, a file named en-chunker.bin. This is a predefined model that is trained to chunk the sentences in the given raw text.
The opennlp.tools.chunker package contains the classes and interfaces that are used to find non-recursive syntactic annotation such as noun phrase chunks.
You can chunk a sentence using the chunk() method of the ChunkerME class. This method accepts the tokens of a sentence and their POS tags as parameters. Therefore, before chunking, you first need to tokenize the sentence and generate its POS tags.
To chunk a sentence using the OpenNLP library, you need to −
- Tokenize the sentence.
- Generate POS tags for it.
- Load the en-chunker.bin model using the ChunkerModel class.
- Instantiate the ChunkerME class.
- Chunk the sentence using the chunk() method of this class.
The steps to write a program that chunks the sentences in the given raw text are described below.
Step 1: Tokenizing the sentence
Tokenize the sentence using the tokenize() method of the WhitespaceTokenizer class, as shown in the following code block.
//Tokenizing the sentence
String sentence = "Hi welcome to Tutorialspoint";
WhitespaceTokenizer whitespaceTokenizer = WhitespaceTokenizer.INSTANCE;
String[] tokens = whitespaceTokenizer.tokenize(sentence);
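As a rough approximation of what this step produces, the sketch below splits the same sentence on runs of whitespace using only the JDK. The class WhitespaceSplitSketch and its split() method are hypothetical names, and String.split is only a stand-in here; it is not how OpenNLP implements WhitespaceTokenizer. One consequence of splitting on whitespace alone is that punctuation stays attached to the neighbouring word.

```java
import java.util.Arrays;

// Hypothetical helper (not part of OpenNLP): approximates whitespace tokenization with the JDK.
final class WhitespaceSplitSketch {

    // Splits on one or more whitespace characters; an empty or blank input yields no tokens.
    static String[] split(String sentence) {
        String trimmed = sentence.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    public static void main(String[] args) {
        // Same sentence as in the tutorial; it contains four whitespace-separated tokens.
        System.out.println(Arrays.toString(split("Hi welcome to Tutorialspoint")));
    }
}
```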
Step 2: Generating the POS tags
Generate the POS tags of the sentence using the tag() method of the POSTaggerME class, as shown in the following code block.
//Generating the POS tags
File file = new File("C:/OpenNLP_models/en-pos-maxent.bin");
POSModel model = new POSModelLoader().load(file);

//Constructing the tagger
POSTaggerME tagger = new POSTaggerME(model);

//Generating tags from the tokens
String[] tags = tagger.tag(tokens);
Step 3: Loading the model
The model for chunking a sentence is represented by the class named ChunkerModel, which belongs to the package opennlp.tools.chunker.
To load a chunker model −
- Create an InputStream object of the model (instantiate the FileInputStream class and pass the path of the model in String format to its constructor).
- Instantiate the ChunkerModel class and pass the InputStream object of the model as a parameter to its constructor, as shown in the following code block −
//Loading the chunker model; try-with-resources closes the stream once the model is read
ChunkerModel chunkerModel;
try (InputStream inputStream = new FileInputStream("C:/OpenNLP_models/en-chunker.bin")) {
   chunkerModel = new ChunkerModel(inputStream);
}
Step 4: Instantiating the ChunkerME class
The ChunkerME class of the package opennlp.tools.chunker contains methods to chunk sentences. This is a maximum-entropy-based chunker.
Instantiate this class and pass the model object created in the previous step.
//Instantiate the ChunkerME class
ChunkerME chunkerME = new ChunkerME(chunkerModel);
Step 5: Chunking the sentence
The chunk() method of the ChunkerME class chunks a tokenized sentence. It accepts two String arrays as parameters, the tokens and their POS tags, and returns a String array holding one chunk tag per token.
Invoke this method by passing the token array and tag array created in the previous steps as parameters.
//Generating the chunks
String[] result = chunkerME.chunk(tokens, tags);
Example
Following is the program to chunk the sentences in the given raw text. Save this program in a file with the name ChunkerExample.java.
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import opennlp.tools.chunker.ChunkerME;
import opennlp.tools.chunker.ChunkerModel;
import opennlp.tools.cmdline.postag.POSModelLoader;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.WhitespaceTokenizer;

public class ChunkerExample {

   public static void main(String args[]) throws IOException {

      //Tokenizing the sentence
      String sentence = "Hi welcome to Tutorialspoint";
      WhitespaceTokenizer whitespaceTokenizer = WhitespaceTokenizer.INSTANCE;
      String[] tokens = whitespaceTokenizer.tokenize(sentence);

      //Generating the POS tags
      //Load the parts of speech model
      File file = new File("C:/OpenNLP_models/en-pos-maxent.bin");
      POSModel model = new POSModelLoader().load(file);

      //Constructing the tagger
      POSTaggerME tagger = new POSTaggerME(model);

      //Generating tags from the tokens
      String[] tags = tagger.tag(tokens);

      //Loading the chunker model; try-with-resources closes the stream once the model is read
      ChunkerModel chunkerModel;
      try (InputStream inputStream = new FileInputStream("C:/OpenNLP_models/en-chunker.bin")) {
         chunkerModel = new ChunkerModel(inputStream);
      }

      //Instantiate the ChunkerME class
      ChunkerME chunkerME = new ChunkerME(chunkerModel);

      //Generating the chunks (one chunk tag per token)
      String[] result = chunkerME.chunk(tokens, tags);

      for (String s : result)
         System.out.println(s);
   }
}
Compile and execute the saved Java file from the Command prompt using the following commands −
javac ChunkerExample.java
java ChunkerExample
On execution, the above program reads the given String, chunks the sentence in it, and prints one chunk tag per token, as shown below.
Loading POS Tagger model ... done (1.040s)
B-NP
I-NP
B-VP
I-VP
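The B- and I- prefixes in this output follow the common BIO convention: a tag starting with B- opens a chunk of the named type, and a tag starting with I- continues the chunk that is currently open. As an illustration of how the tags map back to phrases, the following sketch groups the tokens by their tags using only the JDK. The class BioChunkGrouper and its group() method are hypothetical helpers, not part of OpenNLP, and the tag array is copied from the output above. For brevity the sketch does not check that an I- tag has the same type as the chunk it continues.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of OpenNLP): groups tokens into phrases by their B-/I- chunk tags.
final class BioChunkGrouper {

    static List<String> group(String[] tokens, String[] chunkTags) {
        if (tokens.length != chunkTags.length) {
            throw new IllegalArgumentException("tokens and chunkTags must have the same length");
        }
        List<String> chunks = new ArrayList<>();
        StringBuilder current = null; // the chunk being built, or null if none is open
        for (int i = 0; i < tokens.length; i++) {
            String tag = chunkTags[i];
            if (tag.startsWith("B-")) {
                // A B- tag closes any open chunk and starts a new one of the named type.
                if (current != null) chunks.add(current.append(']').toString());
                current = new StringBuilder("[" + tag.substring(2) + " " + tokens[i]);
            } else if (tag.startsWith("I-") && current != null) {
                // An I- tag extends the open chunk.
                current.append(' ').append(tokens[i]);
            } else {
                // Any other tag (or an I- tag with no open chunk) is emitted as a bare token.
                if (current != null) {
                    chunks.add(current.append(']').toString());
                    current = null;
                }
                chunks.add(tokens[i]);
            }
        }
        if (current != null) chunks.add(current.append(']').toString());
        return chunks;
    }

    public static void main(String[] args) {
        String[] tokens = {"Hi", "welcome", "to", "Tutorialspoint"};
        String[] chunkTags = {"B-NP", "I-NP", "B-VP", "I-VP"}; // copied from the program output
        for (String chunk : group(tokens, chunkTags)) {
            System.out.println(chunk);
        }
    }
}
```

For the tags shown above, this groups "Hi welcome" into one NP chunk and "to Tutorialspoint" into one VP chunk.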
Detecting the Positions of the Chunks
We can also detect the positions or spans of the chunks using the chunkAsSpans() method of the ChunkerME class. This method returns an array of objects of the type Span. The Span class of the opennlp.tools.util package stores the start (inclusive) and end (exclusive) indices of a range, here a range of tokens, along with the chunk type.
You can store the spans returned by the chunkAsSpans() method in the Span array and print them, as shown in the following code block.
//Generating the tagged chunk spans
Span[] span = chunkerME.chunkAsSpans(tokens, tags);

for (Span s : span)
   System.out.println(s.toString());
Example
Following is the program which detects the spans of the chunks in the given raw text. Save this program in a file with the name ChunkerSpansExample.java.
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import opennlp.tools.chunker.ChunkerME;
import opennlp.tools.chunker.ChunkerModel;
import opennlp.tools.cmdline.postag.POSModelLoader;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.WhitespaceTokenizer;
import opennlp.tools.util.Span;

public class ChunkerSpansExample {

   public static void main(String args[]) throws IOException {

      //Load the parts of speech model
      File file = new File("C:/OpenNLP_models/en-pos-maxent.bin");
      POSModel model = new POSModelLoader().load(file);

      //Constructing the tagger
      POSTaggerME tagger = new POSTaggerME(model);

      //Tokenizing the sentence
      String sentence = "Hi welcome to Tutorialspoint";
      WhitespaceTokenizer whitespaceTokenizer = WhitespaceTokenizer.INSTANCE;
      String[] tokens = whitespaceTokenizer.tokenize(sentence);

      //Generating tags from the tokens
      String[] tags = tagger.tag(tokens);

      //Loading the chunker model; try-with-resources closes the stream once the model is read
      ChunkerModel chunkerModel;
      try (InputStream inputStream = new FileInputStream("C:/OpenNLP_models/en-chunker.bin")) {
         chunkerModel = new ChunkerModel(inputStream);
      }
      ChunkerME chunkerME = new ChunkerME(chunkerModel);

      //Generating the tagged chunk spans
      Span[] span = chunkerME.chunkAsSpans(tokens, tags);

      for (Span s : span)
         System.out.println(s.toString());
   }
}
Compile and execute the saved Java file from the Command prompt using the following commands −
javac ChunkerSpansExample.java
java ChunkerSpansExample
On execution, the above program reads the given String, detects the spans of the chunks in it, and displays the following output −
Loading POS Tagger model ... done (1.059s)
[0..2) NP
[2..4) VP
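Each span in this output is a half-open range over the token array: [0..2) covers the tokens at indices 0 and 1, and [2..4) covers the tokens at indices 2 and 3. To make that concrete, the sketch below recovers the text of each chunk from its start and end indices. It uses a hypothetical ChunkRange class as a stand-in for Span so that it runs with the JDK alone; the index values and types are copied from the output above.

```java
import java.util.Arrays;

// Hypothetical demo (not part of OpenNLP): maps half-open token ranges back to text.
final class ChunkRangeDemo {

    // Stand-in for a chunk span: start is inclusive, end is exclusive,
    // and both are indices into the token array.
    static final class ChunkRange {
        final int start;
        final int end;
        final String type;

        ChunkRange(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    // Joins the tokens covered by the range [start, end) with single spaces.
    static String coveredText(ChunkRange range, String[] tokens) {
        return String.join(" ", Arrays.copyOfRange(tokens, range.start, range.end));
    }

    public static void main(String[] args) {
        String[] tokens = {"Hi", "welcome", "to", "Tutorialspoint"};
        // Ranges copied from the program output: [0..2) NP and [2..4) VP.
        ChunkRange[] ranges = {new ChunkRange(0, 2, "NP"), new ChunkRange(2, 4, "VP")};
        for (ChunkRange range : ranges) {
            System.out.println(range.type + ": " + coveredText(range, tokens));
        }
    }
}
```

With the real Span objects, the getStart(), getEnd() and getType() methods supply the same three values.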
Chunker Probability Detection
The probs() method of the ChunkerME class returns the probabilities of the last decoded sequence, that is, one probability per token for the chunk tags assigned by the most recent call to chunk().
//Getting the probabilities of the last decoded sequence
double[] probs = chunkerME.probs();
Following is the program to print the probabilities of the last decoded sequence by the chunker. Save this program in a file with the name ChunkerProbsExample.java.
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import opennlp.tools.chunker.ChunkerME;
import opennlp.tools.chunker.ChunkerModel;
import opennlp.tools.cmdline.postag.POSModelLoader;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.WhitespaceTokenizer;

public class ChunkerProbsExample {

   public static void main(String args[]) throws IOException {

      //Load the parts of speech model
      File file = new File("C:/OpenNLP_models/en-pos-maxent.bin");
      POSModel model = new POSModelLoader().load(file);

      //Constructing the tagger
      POSTaggerME tagger = new POSTaggerME(model);

      //Tokenizing the sentence
      String sentence = "Hi welcome to Tutorialspoint";
      WhitespaceTokenizer whitespaceTokenizer = WhitespaceTokenizer.INSTANCE;
      String[] tokens = whitespaceTokenizer.tokenize(sentence);

      //Generating tags from the tokens
      String[] tags = tagger.tag(tokens);

      //Loading the chunker model; try-with-resources closes the stream once the model is read
      ChunkerModel cModel;
      try (InputStream inputStream = new FileInputStream("C:/OpenNLP_models/en-chunker.bin")) {
         cModel = new ChunkerModel(inputStream);
      }
      ChunkerME chunkerME = new ChunkerME(cModel);

      //Generating the chunk tags
      chunkerME.chunk(tokens, tags);

      //Getting the probabilities of the last decoded sequence
      double[] probs = chunkerME.probs();

      for (int i = 0; i < probs.length; i++)
         System.out.println(probs[i]);
   }
}
Compile and execute the saved Java file from the Command prompt using the following commands −
javac ChunkerProbsExample.java
java ChunkerProbsExample
On executing, the above program reads the given String, chunks it, and prints the probabilities of the last decoded sequence.
0.9592746040797778
0.6883933131241501
0.8830563473996004
0.8951150529746051
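Because the output holds one probability per token, in token order, the values can be lined up with the tokens to spot tags the chunker was less sure about. The following sketch does this using only the JDK. The class LowConfidenceTags, its below() method, and the 0.7 threshold are assumptions made for illustration and are not part of OpenNLP; the probabilities are copied from the output above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of OpenNLP): finds tokens whose chunk tag has a low probability.
final class LowConfidenceTags {

    // Returns the indices whose probability is strictly below the given threshold.
    static List<Integer> below(double[] probs, double threshold) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < probs.length; i++) {
            if (probs[i] < threshold) {
                indices.add(i);
            }
        }
        return indices;
    }

    public static void main(String[] args) {
        String[] tokens = {"Hi", "welcome", "to", "Tutorialspoint"};
        // Values copied from the program output, in token order.
        double[] probs = {0.9592746040797778, 0.6883933131241501,
                          0.8830563473996004, 0.8951150529746051};
        double threshold = 0.7; // arbitrary cut-off chosen for this illustration

        for (int index : below(probs, threshold)) {
            System.out.println(tokens[index] + " " + probs[index]);
        }
    }
}
```

With this threshold, only the second token, "welcome", falls below the cut-off.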