Natural Language Processing – Syntactic Analysis

Syntactic analysis, also called parsing or syntax analysis, is the third phase of NLP. The purpose of this phase is to analyze the grammatical structure of the text, that is, to check whether it conforms to the rules of formal grammar. Note the contrast with semantic analysis: a phrase like “hot ice-cream” is syntactically well-formed but would be rejected by the semantic analyzer, whereas the syntactic analyzer rejects word sequences that violate the grammar. In this sense, syntactic analysis or parsing may be defined as the process of analyzing strings of symbols in natural language for conformance to the rules of formal grammar. The word ‘parsing’ comes from the Latin word ‘pars’, which means ‘part’.

Concept of Parser
A parser is used to implement the task of parsing. It may be defined as the software component designed to take input data (text) and give a structural representation of the input after checking it for correct syntax as per formal grammar. It also builds a data structure, generally in the form of a parse tree, an abstract syntax tree or some other hierarchical structure. The main roles of the parser include −
To report any syntax error.
To recover from commonly occurring errors so that the processing of the remainder of the program can be continued.
To create the parse tree.
To create the symbol table.
To produce intermediate representations (IR).

Types of Parsing
Derivation divides parsing into the following two types −
Top-down Parsing
Bottom-up Parsing

Top-down Parsing
In this kind of parsing, the parser starts constructing the parse tree from the start symbol and then tries to transform the start symbol into the input. The most common form of top-down parsing uses recursive procedures to process the input. The main disadvantage of recursive-descent parsing is backtracking.

Bottom-up Parsing
In this kind of parsing, the parser starts with the input symbols and tries to construct the parse tree up to the start symbol.

Concept of Derivation
In order to get the input string, we need a sequence of production rules. A derivation is such a sequence of applications of production rules. During parsing, we need to decide which non-terminal is to be replaced, along with the production rule by which it will be replaced.

Types of Derivation
In this section, we will learn about the two types of derivations, which differ in how the non-terminal to be replaced is chosen −

Left-most Derivation
In the left-most derivation, the sentential form of an input is scanned and replaced from left to right. The sentential form in this case is called the left-sentential form.

Right-most Derivation
In the right-most derivation, the sentential form of an input is scanned and replaced from right to left. The sentential form in this case is called the right-sentential form.

Concept of Parse Tree
A parse tree may be defined as the graphical depiction of a derivation. The start symbol of the derivation serves as the root of the parse tree. In every parse tree, the leaf nodes are terminals and the interior nodes are non-terminals. A property of the parse tree is that in-order traversal will produce the original input string.

Concept of Grammar
Grammar is essential for describing the syntactic structure of well-formed programs. In the literary sense, grammars denote syntactical rules for conversation in natural languages. Linguists have attempted to define grammars since the inception of natural languages like English, Hindi, etc.
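As a minimal sketch of the top-down strategy described above (assuming the NLTK library, which is introduced later in this tutorial, and a toy grammar invented purely for illustration), the following example builds a small grammar and lets NLTK's RecursiveDescentParser construct a parse tree from the start symbol down to the input −

import nltk

# A toy grammar; the symbols and vocabulary are illustrative assumptions only.
toy_grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the'
N -> 'dog' | 'cat'
V -> 'chased'
""")

# RecursiveDescentParser works top-down: it starts from the start symbol S
# and rewrites it until the derivation matches the input tokens.
parser = nltk.RecursiveDescentParser(toy_grammar)
for tree in parser.parse("the dog chased the cat".split()):
    tree.pretty_print()

A bottom-up alternative in NLTK is ShiftReduceParser, which starts from the input tokens and reduces them towards the start symbol.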
The theory of formal languages is also applicable in the fields of Computer Science, mainly in programming languages and data structures. For example, in the ‘C’ language, precise grammar rules state how functions are made out of lists and statements. A mathematical model of grammar was given by Noam Chomsky in 1956, which is effective for writing computer languages.
Mathematically, a grammar G can be formally written as a 4-tuple (N, T, S, P) where −
N or VN = set of non-terminal symbols, i.e., variables.
T or ∑ = set of terminal symbols.
S = start symbol, where S ∈ N.
P = the production rules for terminals as well as non-terminals. A rule has the form α → β, where α and β are strings over VN ∪ ∑ and at least one symbol of α belongs to VN.

Phrase Structure or Constituency Grammar
Phrase structure grammar, introduced by Noam Chomsky, is based on the constituency relation. That is why it is also called constituency grammar. It is the opposite of dependency grammar.

Example
Before giving an example of constituency grammar, we need to know the fundamental points about constituency grammar and the constituency relation.
All the related frameworks view the sentence structure in terms of the constituency relation.
The constituency relation is derived from the subject-predicate division of Latin as well as Greek grammar.
The basic clause structure is understood in terms of the noun phrase NP and the verb phrase VP.
We can represent the sentence “This tree is illustrating the constituency relation” as a constituency-based parse tree.

Dependency Grammar
It is the opposite of constituency grammar and is based on the dependency relation. It was introduced by Lucien Tesnière. Dependency grammar (DG) is opposite to constituency grammar because it lacks phrasal nodes.

Example
Before giving an example of dependency grammar, we need to know the fundamental points about dependency grammar and the dependency relation.
In DG, the linguistic units, i.e., words, are connected to each other by directed links.
The verb becomes the center of the clause structure.
Every other syntactic unit is connected to the verb by a directed link. These syntactic units are called dependencies.
We can represent the sentence “This tree is illustrating the dependency relation” as a dependency-based parse tree.
A parse tree that uses constituency grammar is called a constituency-based parse tree, and a parse tree that uses dependency grammar is called a dependency-based parse tree.

Context Free Grammar
Context free grammar, also called CFG, is a notation for describing languages and a superset of regular grammar.

Definition of CFG
A CFG consists of a finite set of grammar rules with the following four components − a set of non-terminals, a set of terminals, a set of productions and a start symbol.

Set of Non-terminals
It is denoted by V. The non-terminals are syntactic variables that denote sets of strings, which help define the language generated by the grammar.
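As a hedged sketch of how the 4-tuple above looks in practice, the toy NLTK grammar below (the rule set is an illustrative assumption, written just to cover the example sentence used in this section) exposes the start symbol S and the productions P, and a chart parser draws the constituency-based parse tree for “This tree is illustrating the constituency relation” −

import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det Nom
Nom -> N | N Nom
VP -> Aux V NP
Det -> 'This' | 'the'
Aux -> 'is'
V -> 'illustrating'
N -> 'tree' | 'constituency' | 'relation'
""")

print(grammar.start())         # the start symbol S
print(grammar.productions())   # the production rules P

parser = nltk.ChartParser(grammar)
tokens = "This tree is illustrating the constituency relation".split()
for tree in parser.parse(tokens):
    tree.pretty_print()        # the constituency-based parse tree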
Part of Speech (PoS) Tagging

Tagging is a kind of classification that may be defined as the automatic assignment of descriptors to tokens. Here the descriptor is called a tag, which may represent one of the parts of speech, semantic information and so on. Now, if we talk about Part-of-Speech (PoS) tagging, then it may be defined as the process of assigning one of the parts of speech to a given word. It is generally called POS tagging. In simple words, we can say that POS tagging is the task of labelling each word in a sentence with its appropriate part of speech. We already know that parts of speech include nouns, verbs, adverbs, adjectives, pronouns, conjunctions and their sub-categories. Most POS tagging falls under rule-based POS tagging, stochastic POS tagging and transformation-based tagging.

Rule-based POS Tagging
One of the oldest techniques of tagging is rule-based POS tagging. Rule-based taggers use a dictionary or lexicon to get the possible tags for each word. If a word has more than one possible tag, then rule-based taggers use hand-written rules to identify the correct tag. Disambiguation can also be performed in rule-based tagging by analyzing the linguistic features of a word along with its preceding as well as following words. For example, if the preceding word of a word is an article, then the word must be a noun. As the name suggests, all such information in rule-based POS tagging is coded in the form of rules. These rules may be either −
Context-pattern rules, or
Regular expressions compiled into finite-state automata, intersected with a lexically ambiguous sentence representation.
We can also understand rule-based POS tagging by its two-stage architecture −
First stage − In the first stage, it uses a dictionary to assign each word a list of potential parts of speech.
Second stage − In the second stage, it uses large lists of hand-written disambiguation rules to narrow down the list to a single part of speech for each word.

Properties of Rule-based POS Tagging
Rule-based POS taggers possess the following properties −
These taggers are knowledge-driven taggers.
The rules in rule-based POS tagging are built manually.
The information is coded in the form of rules.
There is a limited number of rules, approximately around 1000.
Smoothing and language modeling are defined explicitly in rule-based taggers.

Stochastic POS Tagging
Another technique of tagging is stochastic POS tagging. Now, the question that arises here is which model can be called stochastic. Any model that includes frequency or probability (statistics) can be called stochastic, and any number of different approaches to the problem of part-of-speech tagging can be referred to as stochastic tagging. The simplest stochastic taggers apply the following approaches for POS tagging −

Word Frequency Approach
In this approach, the stochastic taggers disambiguate words based on the probability that a word occurs with a particular tag. In other words, the tag encountered most frequently with the word in the training set is the one assigned to an ambiguous instance of that word. The main issue with this approach is that it may yield an inadmissible sequence of tags.

Tag Sequence Probabilities
This is another approach to stochastic tagging, where the tagger calculates the probability of a given sequence of tags occurring. It is also called the n-gram approach, because the best tag for a given word is determined by the probability with which it occurs with the n previous tags.
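As a minimal sketch of the two stochastic approaches just described (assuming NLTK and its Penn Treebank sample; the corpus split and the test sentence are illustrative), a unigram tagger implements the word-frequency approach and a bigram tagger with back-off implements the tag-sequence approach −

import nltk
from nltk.corpus import treebank

# Requires: nltk.download('treebank')
train_sents = treebank.tagged_sents()[:3000]

# Word Frequency Approach − assign each word its most frequent tag in the training data.
unigram = nltk.UnigramTagger(train_sents)

# Tag Sequence Probabilities − condition on the previous tag (an n-gram model),
# backing off to the unigram tagger for unseen contexts.
bigram = nltk.BigramTagger(train_sents, backoff=unigram)

print(unigram.tag("The market closed higher today .".split()))
print(bigram.tag("The market closed higher today .".split()))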
Properties of Stochastic POS Tagging
Stochastic POS taggers possess the following properties −
This POS tagging is based on the probability of a tag occurring.
It requires a training corpus.
There would be no probability for words that do not exist in the corpus.
It uses a different testing corpus (other than the training corpus).
It is the simplest POS tagging because it chooses the most frequent tags associated with a word in the training corpus.

Transformation-based Tagging
Transformation-based tagging is also called Brill tagging. It is an instance of transformation-based learning (TBL), which is a rule-based algorithm for automatic tagging of POS to the given text. TBL allows us to have linguistic knowledge in a readable form; it transforms one state to another by applying transformation rules. It draws inspiration from both of the previously explained taggers − rule-based and stochastic. If we look at the similarity between the rule-based and transformation taggers, then, like the rule-based tagger, it is also based on rules that specify what tags need to be assigned to what words. On the other hand, if we look at the similarity between the stochastic and transformation taggers, then, like the stochastic tagger, it is a machine learning technique in which rules are automatically induced from data.

Working of Transformation-Based Learning (TBL)
In order to understand the working and concept of transformation-based taggers, we need to understand the working of transformation-based learning. Consider the following steps to understand the working of TBL −
Start with the solution − TBL usually starts with some solution to the problem and works in cycles.
Most beneficial transformation chosen − In each cycle, TBL will choose the most beneficial transformation.
Apply to the problem − The transformation chosen in the last step will be applied to the problem.
The algorithm stops when the transformation selected in step 2 no longer adds value or there are no more transformations to be selected. Such kind of learning is best suited to classification tasks.

Advantages of Transformation-based Learning (TBL)
The advantages of TBL are as follows −
We learn a small set of simple rules, and these rules are enough for tagging.
Development as well as debugging is very easy in TBL because the learned rules are easy to understand.
Complexity in tagging is reduced because in TBL there is an interlacing of machine-learned and human-generated rules.
The transformation-based tagger is much faster than a Markov-model tagger.

Disadvantages of Transformation-based Learning (TBL)
The disadvantages of TBL are as follows −
Transformation-based learning (TBL) does not provide tag probabilities.
In TBL, the training time is very long, especially on large corpora.

Hidden Markov Model (HMM)
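An HMM tagger treats the tags as hidden states and the words as the observations they emit, and chooses the most probable tag sequence for a sentence. As a minimal sketch (assuming NLTK and its Penn Treebank sample; the corpus split and test sentence are illustrative), such a tagger can be trained as follows −

import nltk
from nltk.corpus import treebank
from nltk.tag import hmm

# Requires: nltk.download('treebank')
train_sents = treebank.tagged_sents()[:3000]

# train_supervised estimates transition probabilities between tags (hidden states)
# and emission probabilities of words given tags from the labelled corpus.
trainer = hmm.HiddenMarkovModelTrainer()
hmm_tagger = trainer.train_supervised(train_sents)

print(hmm_tagger.tag("The market closed higher today .".split()))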
Natural Language Discourse Processing

Processing natural language is the most difficult problem of artificial intelligence, and one of the major problems within NLP is discourse processing − building theories and models of how utterances stick together to form coherent discourse. Actually, language always consists of collocated, structured and coherent groups of sentences rather than isolated and unrelated sentences. These coherent groups of sentences are referred to as discourse.

Concept of Coherence
Coherence and discourse structure are interconnected in many ways. Coherence, a property of good text, is used to evaluate the output quality of a natural language generation system. The question that arises here is what it means for a text to be coherent. Suppose we collected one sentence from every page of the newspaper; would it be a discourse? Of course not, because these sentences do not exhibit coherence. A coherent discourse must possess the following properties −

Coherence relation between utterances
The discourse is coherent if it has meaningful connections between its utterances. This property is called the coherence relation. For example, some sort of explanation must be there to justify the connection between utterances.

Relationship between entities
Another property that makes a discourse coherent is that there must be a certain kind of relationship between the entities. Such kind of coherence is called entity-based coherence.

Discourse structure
An important question regarding discourse is what kind of structure the discourse must have. The answer to this question depends upon the segmentation we apply to the discourse. Discourse segmentation may be defined as determining the types of structures for a large discourse. It is quite difficult to implement discourse segmentation, but it is very important for applications like information retrieval, text summarization and information extraction.

Algorithms for Discourse Segmentation
In this section, we will learn about the algorithms for discourse segmentation. The algorithms are described below −

Unsupervised Discourse Segmentation
The class of unsupervised discourse segmentation is often represented as linear segmentation. We can understand the task of linear segmentation with the help of an example. In the example, there is a task of segmenting the text into multi-paragraph units; the units represent passages of the original text. These algorithms depend on cohesion, which may be defined as the use of certain linguistic devices to tie the textual units together. In particular, lexical cohesion is the cohesion indicated by the relationship between two or more words in two units, such as the use of synonyms.

Supervised Discourse Segmentation
The earlier method does not have any hand-labeled segment boundaries. On the other hand, supervised discourse segmentation needs boundary-labeled training data, which is easy to acquire. In supervised discourse segmentation, discourse markers or cue words play an important role. A discourse marker or cue word is a word or phrase that functions to signal discourse structure. These discourse markers are domain-specific.

Text Coherence
Lexical repetition is a way to find structure in a discourse, but it does not satisfy the requirement of being a coherent discourse.
To achieve coherent discourse, we must focus on coherence relations in particular. As we know, a coherence relation defines the possible connection between utterances in a discourse. Hobbs proposed such relations as follows, taking two terms S0 and S1 to represent the meanings of the two related sentences −

Result
It infers that the state asserted by term S0 could cause the state asserted by S1. For example, the two statements “Ram was caught in the fire. His skin burned.” show the relationship result.

Explanation
It infers that the state asserted by S1 could cause the state asserted by S0. For example, the two statements “Ram fought with Shyam’s friend. He was drunk.” show this relationship.

Parallel
It infers p(a1,a2,…) from the assertion of S0 and p(b1,b2,…) from the assertion of S1, where ai and bi are similar for all i. For example, the two statements “Ram wanted a car. Shyam wanted money.” are parallel.

Elaboration
It infers the same proposition P from both the assertions S0 and S1. For example, the two statements “Ram was from Chandigarh. Shyam was from Kerala.” show the relation elaboration.

Occasion
It happens when a change of state can be inferred from the assertion of S0, whose final state can be inferred from S1, and vice versa. For example, the two statements “Ram picked up the book. He gave it to Shyam.” show the relation occasion.

Building Hierarchical Discourse Structure
The coherence of an entire discourse can also be considered in terms of the hierarchical structure between coherence relations. For example, the following passage can be represented as a hierarchical structure −
S1 − Ram went to the bank to deposit money.
S2 − He then took a train to Shyam’s cloth shop.
S3 − He wanted to buy some clothes.
S4 − He did not have new clothes for the party.
S5 − He also wanted to talk to Shyam regarding his health.

Reference Resolution
Interpretation of the sentences in any discourse is another important task, and to achieve this we need to know who or what entity is being talked about. Here, interpretation of reference is the key element. Reference may be defined as a linguistic expression used to denote an entity or individual. For example, in the passage “Ram, the manager of ABC bank, saw his friend Shyam at a shop. He went to meet him.”, the linguistic expressions like Ram, his and he are references. On the same note, reference resolution may be defined as the task of determining which entities are referred to by which linguistic expressions.

Terminology Used in Reference Resolution
We use the following terminology in reference resolution −
Referring expression − The natural language expression that is used to perform reference is called a referring expression. For example, the expressions used in the passage above are referring expressions.
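Returning to the unsupervised, cohesion-based segmentation described earlier in this chapter: NLTK ships an implementation of the TextTiling algorithm, one such linear segmentation method. The sketch below assumes a file named article.txt (a hypothetical name) containing multi-paragraph text separated by blank lines −

import nltk
from nltk.tokenize import TextTilingTokenizer

# Requires: nltk.download('stopwords')
# article.txt is a hypothetical file; TextTiling expects blank lines between paragraphs.
text = open('article.txt').read()

tt = TextTilingTokenizer()
for segment in tt.tokenize(text):
    # each segment is a multi-paragraph, topically coherent unit
    print(segment[:60].replace('\n', ' '), '...')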
Applications of NLP

Natural Language Processing (NLP) is an emerging technology behind many of the forms of AI we see today, and its use for creating a seamless as well as interactive interface between humans and machines will continue to be a top priority for today’s and tomorrow’s increasingly cognitive applications. Here, we are going to discuss some of the very useful applications of NLP.

Machine Translation
Machine translation (MT), the process of translating text in one source language into another language, is one of the most important applications of NLP.

Types of Machine Translation Systems
There are different types of machine translation systems. Let us see what the different types are.
Bilingual MT System
Bilingual MT systems produce translations between two particular languages.
Multilingual MT System
Multilingual MT systems produce translations between any pair of languages. They may be either uni-directional or bi-directional in nature.

Approaches to Machine Translation (MT)
Let us now learn about the important approaches to machine translation. The approaches to MT are as follows −

Direct MT Approach
It is the oldest, though now less popular, approach to MT. The systems that use this approach are capable of translating SL (source language) directly into TL (target language). Such systems are bi-lingual and uni-directional in nature.

Interlingua Approach
The systems that use the Interlingua approach translate SL into an intermediate language called Interlingua (IL) and then translate IL into TL. This approach is often illustrated with the MT pyramid.

Transfer Approach
Three stages are involved in this approach. In the first stage, source language (SL) texts are converted into abstract SL-oriented representations. In the second stage, the SL-oriented representations are converted into equivalent target language (TL)-oriented representations. In the third stage, the final text is generated.

Empirical MT Approach
This is an emerging approach to MT. Basically, it uses a large amount of raw data in the form of parallel corpora. The raw data consists of texts and their translations. Analogy-based, example-based and memory-based machine translation techniques use the empirical MT approach.

Fighting Spam
One of the most common problems these days is unwanted email. This makes spam filters all the more important, because they are the first line of defense against this problem. A spam filtering system can be developed by using NLP functionality while considering the major false-positive and false-negative issues.

Existing NLP models for spam filtering
Following are some existing NLP models for spam filtering −

N-gram Modeling
An N-gram is an N-character slice of a longer string. In this model, N-grams of several different lengths are used simultaneously in processing and detecting spam emails.

Word Stemming
Spammers, the generators of spam emails, usually change one or more characters of attacking words in their spam so that they can breach content-based spam filters. That is why we can say that content-based filters are not useful if they cannot understand the meaning of the words or phrases in the email. In order to eliminate such issues in spam filtering, a rule-based word stemming technique that can match words which look alike and sound alike has been developed.

Bayesian Classification
This has now become a widely used technology for spam filtering.
In this statistical technique, the incidence of words in an email is measured against their typical occurrence in a database of unsolicited (spam) and legitimate (ham) email messages.

Automatic Summarization
In this digital era, the most valuable thing is data, or you can say information. However, do we really get useful as well as the required amount of information? The answer is no, because information is overloaded and our access to knowledge and information far exceeds our capacity to understand it. We are in serious need of automatic text summarization because the flood of information over the internet is not going to stop. Text summarization may be defined as the technique of creating a short, accurate summary of longer text documents. Automatic text summarization will help us get relevant information in less time. Natural language processing (NLP) plays an important role in developing automatic text summarization.

Question-answering
Another main application of natural language processing (NLP) is question-answering. Search engines put the information of the world at our fingertips, but they are still lacking when it comes to answering the questions posted by human beings in their natural language. Big tech companies like Google are also working in this direction. Question-answering is a Computer Science discipline within the fields of AI and NLP. It focuses on building systems that automatically answer questions posted by human beings in their natural language. A computer system that understands natural language has the capability to translate sentences written by humans into an internal representation so that valid answers can be generated by the system. The exact answers can be generated by doing syntactic and semantic analysis of the questions. Lexical gap, ambiguity and multilingualism are some of the challenges for NLP in building a good question-answering system.

Sentiment Analysis
Another important application of natural language processing (NLP) is sentiment analysis. As the name suggests, sentiment analysis is used to identify the sentiments among several posts. It is also used to identify the sentiment where the emotions are not expressed explicitly. Companies are using sentiment analysis, an application of natural language processing (NLP), to identify the opinion and sentiment of their customers online. It helps companies understand what their customers think about their products and services. Companies can judge their overall reputation from customer posts with the help of sentiment analysis. In this way, we can say that beyond determining simple polarity, sentiment analysis understands sentiments in context to help us better understand what is behind the expressed opinion.
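As a toy, hedged illustration of the Bayesian classification idea for spam filtering described above (the tiny training set, the feature function and the test message are all invented purely for demonstration), NLTK's Naive Bayes classifier can be trained on word-presence features −

import nltk

def word_features(message):
    # represent a message as a bag of lower-cased word-presence features
    return {word.lower(): True for word in message.split()}

train = [
    ("win cash prize now", "spam"),
    ("cheap loans click here", "spam"),
    ("meeting agenda for tomorrow", "ham"),
    ("lunch with the project team", "ham"),
]
featuresets = [(word_features(text), label) for text, label in train]

classifier = nltk.NaiveBayesClassifier.train(featuresets)
print(classifier.classify(word_features("claim your cash prize")))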
Natural Language Processing – Python

In this chapter, we will learn about language processing using Python. The following features make Python different from other languages −
Python is interpreted − We do not need to compile our Python program before executing it because the interpreter processes Python at runtime.
Interactive − We can directly interact with the interpreter to write our Python programs.
Object-oriented − Python is object-oriented in nature, which makes it easier to write programs because this programming technique encapsulates code within objects.
Easy for beginners to learn − Python is also called a beginner’s language because it is very easy to understand, and it supports the development of a wide range of applications.

Prerequisites
At the time of writing, the latest released version of Python 3 is Python 3.7.1, which is available for Windows, Mac OS and most flavors of Linux. For Windows and Mac OS, we can download the installer from python.org. In the case of Linux, different flavors use different package managers for the installation of new packages. For example, to install Python 3 on Ubuntu Linux, we can use the following command from the terminal −
$sudo apt-get install python3-minimal
To study more about Python programming, read a basic Python 3 tutorial.

Getting Started with NLTK
We will be using the Python library NLTK (Natural Language Toolkit) for doing text analysis in the English language. The Natural Language Toolkit (NLTK) is a collection of Python libraries designed especially for identifying and tagging the parts of speech found in natural language text such as English.

Installing NLTK
Before starting to use NLTK, we need to install it. With the help of the following command, we can install it in our Python environment −
pip install nltk
If we are using Anaconda, then a Conda package for NLTK can be installed by using the following command −
conda install -c anaconda nltk

Downloading NLTK’s Data
After installing NLTK, another important task is to download its preset text repositories so that they can be easily used. However, before that we need to import NLTK the way we import any other Python module. The following command will help us in importing NLTK −
import nltk
Now, download the NLTK data with the help of the following command −
nltk.download()
It will take some time to install all available packages of NLTK.

Other Necessary Packages
Some other Python packages like gensim and pattern are also very necessary for text analysis as well as for building natural language processing applications by using NLTK. The packages can be installed as shown below −
gensim
gensim is a robust semantic modeling library which can be used for many applications. We can install it with the following command −
pip install gensim
pattern
It can be used to make the gensim package work properly. The following command helps in installing pattern −
pip install pattern

Tokenization
Tokenization may be defined as the process of breaking the given text into smaller units called tokens. Words, numbers or punctuation marks can be tokens. It may also be called word segmentation.
Example
Input − Bed and chair are types of furniture.
We have different packages for tokenization provided by NLTK. We can use these packages based on our requirements. The packages and the details of their installation are as follows −

sent_tokenize package
This package can be used to divide the input text into sentences.
We can import it by using the following command −
from nltk.tokenize import sent_tokenize

word_tokenize package
This package can be used to divide the input text into words. We can import it by using the following command −
from nltk.tokenize import word_tokenize

WordPunctTokenizer package
This package can be used to divide the input text into words and punctuation marks. We can import it by using the following command −
from nltk.tokenize import WordPunctTokenizer

Stemming
Due to grammatical reasons, language includes lots of variation. Variation in the sense that languages, English as well as others, have different forms of a word. For example, the words democracy, democratic and democratization. For machine learning projects, it is very important for machines to understand that such different words have the same base form. That is why it is very useful to extract the base forms of words while analyzing the text. Stemming is a heuristic process that helps in extracting the base forms of words by chopping off their ends. The different packages for stemming provided by the NLTK module are as follows −

PorterStemmer package
Porter’s algorithm is used by this stemming package to extract the base form of words. With the help of the following command, we can import this package −
from nltk.stem.porter import PorterStemmer
For example, ‘write’ would be the output of the word ‘writing’ given as the input to this stemmer.

LancasterStemmer package
Lancaster’s algorithm is used by this stemming package to extract the base form of words. With the help of the following command, we can import this package −
from nltk.stem.lancaster import LancasterStemmer
For example, ‘writ’ would be the output of the word ‘writing’ given as the input to this stemmer.

SnowballStemmer package
The Snowball algorithm is used by this stemming package to extract the base form of words. With the help of the following command, we can import this package −
from nltk.stem.snowball import SnowballStemmer
For example, ‘write’ would be the output of the word ‘writing’ given as the input to this stemmer.

Lemmatization
It is another way to extract the base form of words, normally aiming to remove inflectional endings by using vocabulary and morphological analysis. After lemmatization, the base form of any word is called a lemma. The NLTK module provides the following package for lemmatization −

WordNetLemmatizer package
This package will extract the base form of the word depending upon whether it is used as a noun or as a verb. The following command can be used to import this package −
from nltk.stem import WordNetLemmatizer
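Putting the pieces of this chapter together, here is a small hedged sketch (the sample sentence is the one used above; the lemmatizer calls on ‘books’ and ‘writing’ are illustrative) that tokenizes, stems and lemmatizes with the packages just introduced −

from nltk.tokenize import sent_tokenize, word_tokenize, WordPunctTokenizer
from nltk.stem.porter import PorterStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.stem import WordNetLemmatizer

# Requires: nltk.download('punkt') and nltk.download('wordnet')
text = "Bed and chair are types of furniture."

print(sent_tokenize(text))                  # one sentence
print(word_tokenize(text))                  # words plus the final '.'
print(WordPunctTokenizer().tokenize(text))  # splits punctuation separately

print(PorterStemmer().stem('writing'))            # 'write'
print(LancasterStemmer().stem('writing'))         # 'writ'
print(SnowballStemmer('english').stem('writing')) # 'write'

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('books'))            # 'book' (treated as a noun by default)
print(lemmatizer.lemmatize('writing', pos='v')) # 'write' (treated as a verb)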
Natural Language Processing – Quick Guide

Natural Language Processing – Introduction
Language is a method of communication with the help of which we can speak, read and write. For example, we think, we make decisions, plans and more in natural language; precisely, in words. However, the big question that confronts us in this AI era is whether we can communicate in a similar manner with computers. In other words, can human beings communicate with computers in their natural language? It is a challenge for us to develop NLP applications because computers need structured data, but human speech is unstructured and often ambiguous in nature. In this sense, we can say that Natural Language Processing (NLP) is the sub-field of Computer Science, especially Artificial Intelligence (AI), that is concerned with enabling computers to understand and process human language. Technically, the main task of NLP is to program computers to analyze and process huge amounts of natural language data.

History of NLP
We have divided the history of NLP into four phases. The phases have distinctive concerns and styles.

First Phase (Machine Translation Phase) – Late 1940s to late 1960s
The work done in this phase focused mainly on machine translation (MT). This phase was a period of enthusiasm and optimism. Let us now see all that the first phase had in it −
The research on NLP started in the early 1950s after Booth & Richens’ investigation and Weaver’s memorandum on machine translation in 1949.
1954 was the year when a limited experiment on automatic translation from Russian to English was demonstrated in the Georgetown-IBM experiment. In the same year, the publication of the journal MT (Machine Translation) started.
The first international conference on Machine Translation (MT) was held in 1952 and the second in 1956.
In 1961, the work presented at the Teddington International Conference on Machine Translation of Languages and Applied Language Analysis was the high point of this phase.

Second Phase (AI Influenced Phase) – Late 1960s to late 1970s
In this phase, the work done was mainly related to world knowledge and its role in the construction and manipulation of meaning representations. That is why this phase is also called the AI-flavored phase. The phase had in it the following −
In early 1961, work began on the problems of addressing and constructing data or knowledge bases. This work was influenced by AI.
In the same year, a BASEBALL question-answering system was also developed. The input to this system was restricted and the language processing involved was a simple one.
A much more advanced system was described in Minsky (1968). This system, when compared to the BASEBALL question-answering system, recognized and provided for the need for inference on the knowledge base in interpreting and responding to language input.

Third Phase (Grammatico-logical Phase) – Late 1970s to late 1980s
This phase can be described as the grammatico-logical phase. Due to the failure of practical system building in the last phase, researchers moved towards the use of logic for knowledge representation and reasoning in AI. The third phase had the following in it −
The grammatico-logical approach, towards the end of the decade, helped us with powerful general-purpose sentence processors like SRI’s Core Language Engine and Discourse Representation Theory, which offered a means of tackling more extended discourse.
In this phase we got some practical resources and tools like parsers, e.g.
Alvey Natural Language Tools, along with more operational and commercial systems, e.g. for database query.
The work on the lexicon in the 1980s also pointed in the direction of the grammatico-logical approach.

Fourth Phase (Lexical & Corpus Phase) – The 1990s
We can describe this as the lexical and corpus phase. The phase had a lexicalized approach to grammar that appeared in the late 1980s and became an increasing influence. There was a revolution in natural language processing in this decade with the introduction of machine learning algorithms for language processing.

Study of Human Languages
Language is a crucial component of human life and also the most fundamental aspect of our behavior. We can experience it mainly in two forms – written and spoken. In the written form, it is a way to pass our knowledge from one generation to the next. In the spoken form, it is the primary medium for human beings to coordinate with each other in their day-to-day behavior. Language is studied in various academic disciplines. Each discipline comes with its own set of problems and a set of solutions to address them. Consider the following summary to understand this −

Linguists
Problems − How can phrases and sentences be formed with words? What curbs the possible meanings of a sentence?
Tools − Intuitions about well-formedness and meaning; mathematical models of structure, for example model-theoretic semantics and formal language theory.

Psycholinguists
Problems − How do human beings identify the structure of sentences? How is the meaning of words identified? When does understanding take place?
Tools − Experimental techniques, mainly for measuring the performance of human beings; statistical analysis of observations.

Philosophers
Problems − How do words and sentences acquire meaning? How are objects identified by words? What is meaning?
Tools − Natural language argumentation using intuition; mathematical models like logic and model theory.

Computational Linguists
Problems − How can we identify the structure of a sentence? How can knowledge and reasoning be modeled? How can we use language to accomplish specific tasks?
Tools − Algorithms, data structures, formal models of representation and reasoning, and AI techniques like search and representation methods.

Ambiguity and Uncertainty in Language
Ambiguity, as the term is generally used in natural language processing, can be referred to as the capability of being understood in more than one way. Natural language is very ambiguous. NLP has the following types of ambiguities −

Lexical Ambiguity
The ambiguity of a single word is called lexical ambiguity. For example, treating the word silver as a noun, an adjective, or a verb.

Syntactic Ambiguity
This kind of ambiguity occurs when a sentence can be parsed in different ways.
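As a hedged illustration of syntactic ambiguity (the grammar below is a toy invented for this classic prepositional-phrase attachment example), NLTK's chart parser returns two parse trees for the same sentence, one per reading −

import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> 'I' | Det N | Det N PP
VP -> V NP | VP PP
PP -> P NP
Det -> 'the'
N -> 'man' | 'telescope'
V -> 'saw'
P -> 'with'
""")

parser = nltk.ChartParser(grammar)
# One tree attaches "with the telescope" to the verb phrase (I used the telescope),
# the other attaches it to the noun phrase (the man had the telescope).
for tree in parser.parse("I saw the man with the telescope".split()):
    print(tree)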
NLP – Word Sense Disambiguation

We understand that words have different meanings based on the context of their usage in a sentence. If we talk about human languages, then they are ambiguous too, because many words can be interpreted in multiple ways depending upon the context of their occurrence.

Word sense disambiguation, in natural language processing (NLP), may be defined as the ability to determine which meaning of a word is activated by the use of the word in a particular context. Lexical ambiguity, syntactic or semantic, is one of the very first problems that any NLP system faces. Part-of-speech (POS) taggers with a high level of accuracy can resolve a word’s syntactic ambiguity. On the other hand, the problem of resolving semantic ambiguity is called WSD (word sense disambiguation). Resolving semantic ambiguity is harder than resolving syntactic ambiguity.

For example, consider the two examples of the distinct senses that exist for the word “bass” −
I can hear bass sound.
He likes to eat grilled bass.
The occurrence of the word bass denotes a distinct meaning in each case. In the first sentence it means frequency, and in the second it means fish. Hence, if it were disambiguated by WSD, the correct meanings could be assigned to the above sentences as follows −
I can hear bass/frequency sound.
He likes to eat grilled bass/fish.

Evaluation of WSD
The evaluation of WSD requires the following two inputs −

A Dictionary
The very first input for the evaluation of WSD is a dictionary, which is used to specify the senses to be disambiguated.

Test Corpus
Another input required by WSD is a sense-annotated test corpus that has the target or correct senses. The test corpora can be of two types −
Lexical sample − This kind of corpus is used in systems where it is required to disambiguate a small sample of words.
All-words − This kind of corpus is used in systems where it is expected to disambiguate all the words in a piece of running text.

Approaches and Methods to Word Sense Disambiguation (WSD)
Approaches and methods to WSD are classified according to the source of knowledge used in word disambiguation. Let us now see the four conventional methods of WSD −

Dictionary-based or Knowledge-based Methods
As the name suggests, for disambiguation these methods primarily rely on dictionaries, thesauri and lexical knowledge bases. They do not use corpus evidence for disambiguation. The Lesk method is the seminal dictionary-based method, introduced by Michael Lesk in 1986. The Lesk definition, on which the Lesk algorithm is based, is “measure overlap between sense definitions for all words in context”. However, in 2000, Kilgarriff and Rosensweig gave the simplified Lesk definition as “measure overlap between sense definitions of word and current context”, which further means identifying the correct sense for one word at a time. Here the current context is the set of words in the surrounding sentence or paragraph.

Supervised Methods
For disambiguation, machine learning methods make use of sense-annotated corpora for training. These methods assume that the context can provide enough evidence on its own to disambiguate the sense. In these methods, world knowledge and reasoning are deemed unnecessary. The context is represented as a set of “features” of the words. It includes information about the surrounding words as well. Support vector machines and memory-based learning are the most successful supervised learning approaches to WSD.
These methods rely on a substantial amount of manually sense-tagged corpora, which are very expensive to create.

Semi-supervised Methods
Due to the lack of training corpora, most word sense disambiguation algorithms use semi-supervised learning methods. It is because semi-supervised methods use both labelled as well as unlabeled data. These methods require a very small amount of annotated text and a large amount of plain unannotated text. The technique used by semi-supervised methods is bootstrapping from seed data.

Unsupervised Methods
These methods assume that similar senses occur in similar contexts. That is why the senses can be induced from text by clustering word occurrences using some measure of similarity of context. This task is called word sense induction or discrimination. Unsupervised methods have great potential to overcome the knowledge acquisition bottleneck due to their non-dependency on manual efforts.

Applications of Word Sense Disambiguation (WSD)
Word sense disambiguation (WSD) is applied in almost every application of language technology. Let us now see the scope of WSD −

Machine Translation
Machine translation or MT is the most obvious application of WSD. In MT, lexical choice for words that have distinct translations for different senses is done by WSD. The senses in MT are represented as words in the target language. Most machine translation systems do not use an explicit WSD module.

Information Retrieval (IR)
Information retrieval (IR) may be defined as a software program that deals with the organization, storage, retrieval and evaluation of information from document repositories, particularly textual information. The system basically assists users in finding the information they require, but it does not explicitly return answers to questions. WSD is used to resolve the ambiguities of the queries provided to an IR system. Like MT, current IR systems do not explicitly use a WSD module; they rely on the concept that the user will type enough context in the query to retrieve only relevant documents.

Text Mining and Information Extraction (IE)
In most applications, WSD is necessary to do accurate analysis of text. For example, WSD helps an intelligent gathering system flag the correct words; a medical intelligence system might need flagging of “illegal drugs” rather than “medical drugs”.

Lexicography
WSD and lexicography can work together in a loop because modern lexicography is corpus-based. With lexicography, WSD provides rough empirical sense groupings as well as statistically significant contextual indicators of sense.

Difficulties in Word Sense Disambiguation (WSD)
Following are some difficulties faced by word sense disambiguation (WSD) −

Differences between dictionaries
The major problem of WSD is deciding the sense of a word, because different senses can be very closely related.
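NLTK includes an implementation of the simplified Lesk algorithm discussed earlier in this chapter. As a hedged sketch using WordNet as the sense inventory (the chosen synsets are whatever Lesk's overlap heuristic selects, which may not always match human judgement), it can be applied to the “bass” examples like this −

from nltk.tokenize import word_tokenize
from nltk.wsd import lesk

# Requires: nltk.download('punkt') and nltk.download('wordnet')
sentence1 = word_tokenize("I can hear bass sound")
sentence2 = word_tokenize("He likes to eat grilled bass")

# lesk picks the WordNet synset whose definition overlaps most with the context words.
sense1 = lesk(sentence1, 'bass')
sense2 = lesk(sentence2, 'bass')
print(sense1, sense1.definition())
print(sense2, sense2.definition())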
Natural Language Processing – Semantic Analysis

The purpose of semantic analysis is to draw the exact meaning, or you can say the dictionary meaning, from the text. The work of the semantic analyzer is to check the text for meaningfulness.

We already know that lexical analysis also deals with the meaning of words, so how is semantic analysis different from lexical analysis? Lexical analysis is based on smaller tokens, while semantic analysis focuses on larger chunks. That is why semantic analysis can be divided into the following two parts −

Studying the meaning of individual words
It is the first part of semantic analysis, in which the study of the meaning of individual words is performed. This part is called lexical semantics.

Studying the combination of individual words
In the second part, the individual words are combined to provide meaning in sentences.

The most important task of semantic analysis is to get the proper meaning of the sentence. For example, analyze the sentence “Ram is great.” In this sentence, the speaker is talking either about Lord Ram or about a person whose name is Ram. That is why the job of the semantic analyzer, to get the proper meaning of the sentence, is important.

Elements of Semantic Analysis
Following are some important elements of semantic analysis −

Hyponymy
It may be defined as the relationship between a generic term and instances of that generic term. Here the generic term is called the hypernym and its instances are called hyponyms. For example, the word color is a hypernym and the colors blue, yellow, etc. are its hyponyms.

Homonymy
It may be defined as words having the same spelling or the same form but different and unrelated meanings. For example, the word “bat” is a homonym because a bat can be an implement used to hit a ball, and a bat is also a nocturnal flying mammal.

Polysemy
Polysemy is a Greek word, which means “many signs”. It refers to a word or phrase with different but related senses. In other words, we can say that a polysemous word has the same spelling but different and related meanings. For example, the word “bank” is a polysemous word having the following meanings −
A financial institution.
The building in which such an institution is located.
A synonym for “to rely on”.

Difference between Polysemy and Homonymy
Both polysemous and homonymous words have the same syntax or spelling. The main difference between them is that in polysemy the meanings of the words are related, but in homonymy the meanings of the words are not related. For example, if we talk about the same word “bank”, we can write the meanings ‘a financial institution’ and ‘a river bank’. In that case it would be an example of homonymy, because the meanings are unrelated to each other.

Synonymy
It is the relation between two lexical items having different forms but expressing the same or a close meaning. Examples are ‘author/writer’, ‘fate/destiny’.

Antonymy
It is the relation between two lexical items having symmetry between their semantic components relative to an axis. The scope of antonymy is as follows −
Application of a property or not − Examples are ‘life/death’, ‘certitude/incertitude’.
Application of a scalable property − Examples are ‘rich/poor’, ‘hot/cold’.
Application of a usage − Examples are ‘father/son’, ‘moon/sun’.

Meaning Representation
Semantic analysis creates a representation of the meaning of a sentence. But before getting into the concepts and approaches related to meaning representation, we need to understand the building blocks of the semantic system.
Building Blocks of Semantic System
In word representation or representation of the meaning of words, the following building blocks play an important role −
Entities − They represent individuals such as a particular person, location, etc. For example, Haryana, India and Ram are all entities.
Concepts − They represent the general categories of individuals, such as a person, a city, etc.
Relations − They represent the relationships between entities and concepts. For example, Ram is a person.
Predicates − They represent the verb structures. For example, semantic roles and case grammar are examples of predicates.
Now, we can understand that meaning representation shows how to put together the building blocks of semantic systems. In other words, it shows how to put together entities, concepts, relations and predicates to describe a situation. It also enables reasoning about the semantic world.

Approaches to Meaning Representation
Semantic analysis uses the following approaches for the representation of meaning −
First order predicate logic (FOPL)
Semantic Nets
Frames
Conceptual dependency (CD)
Rule-based architecture
Case Grammar
Conceptual Graphs

Need of Meaning Representations
A question that arises here is why we need meaning representation. Following are the reasons for the same −

Linking of linguistic elements to non-linguistic elements
The very first reason is that with the help of meaning representation, the linking of linguistic elements to non-linguistic elements can be done.

Representing variety at the lexical level
With the help of meaning representation, unambiguous, canonical forms can be represented at the lexical level.

Can be used for reasoning
Meaning representation can be used for reasoning, to verify what is true in the world as well as to infer knowledge from the semantic representation.

Lexical Semantics
The first part of semantic analysis, studying the meaning of individual words, is called lexical semantics. It includes words, sub-words, affixes (sub-units), compound words and phrases. All the words, sub-words, etc. are collectively called lexical items. In other words, we can say that lexical semantics is the relationship between lexical items, the meaning of sentences and the syntax of sentences. Following are the steps involved in lexical semantics −
Classification of lexical items like words, sub-words, affixes, etc. is performed in lexical semantics.
Decomposition of lexical items like words, sub-words, affixes, etc. is performed in lexical semantics.
Differences as well as similarities between various lexical semantic structures are also analyzed.
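The lexical relations described in this chapter (hyponymy, synonymy, antonymy) can be explored directly in WordNet through NLTK. The sketch below is illustrative, and the particular synset and lemma names ('color.n.01', 'writer.n.01', 'good.a.01') are assumptions about how WordNet labels these senses −

from nltk.corpus import wordnet as wn

# Requires: nltk.download('wordnet')

# Hyponymy / hypernymy: specific colours sit below the generic term "color".
color = wn.synset('color.n.01')
print([h.name() for h in color.hyponyms()])

# Synonymy: lemmas grouped in the same synset express the same or a close meaning.
print(wn.synset('writer.n.01').lemma_names())   # e.g. writer/author

# Antonymy: a symmetric opposition between lemmas.
good = wn.synset('good.a.01').lemmas()[0]
print(good.antonyms())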