Natural Language Processing – Inception

In this chapter, we will discuss the linguistic foundations of Natural Language Processing. To begin with, let us first understand what natural language grammar is.

Natural Language Grammar

In linguistics, language is a system of arbitrary vocal signs. We may say that language is creative, governed by rules, innate as well as universal at the same time. It is also distinctively human. The nature of language is understood differently by different people, and there are many misconceptions about it. That is why it is very important to understand the meaning of the ambiguous term ‘grammar’. In linguistics, the term grammar may be defined as the rules or principles by which a language works. In a broad sense, we can divide grammar into two categories −

Descriptive Grammar

The set of rules that linguists and grammarians formulate to describe the speaker’s grammar is called descriptive grammar.

Prescriptive Grammar

It is a very different sense of grammar, which attempts to maintain a standard of correctness in the language. This category has little to do with the actual working of the language.

Components of Language

The study of language is divided into interrelated components, which are conventional as well as arbitrary divisions of linguistic investigation. The explanation of these components is as follows −

Phonology

The very first component of language is phonology. It is the study of the speech sounds of a particular language. The origin of the word can be traced to the Greek language, where ‘phone’ means sound or voice. Phonetics, a closely related field, is the study of the speech sounds of human language from the perspective of their production, perception or physical properties. The IPA (International Phonetic Alphabet) is a tool that represents human speech sounds in a regular way while studying phonology. In the IPA, every written symbol represents one and only one speech sound and vice versa.
Phonemes

A phoneme may be defined as one of the units of sound that differentiate one word from another in a language. In linguistics, phonemes are written between slashes. For example, the phoneme /k/ occurs in words such as kit and skit.

Morphology

It is the second component of language. It is the study of the structure and classification of the words in a particular language. The origin of the word is from the Greek language, where the word ‘morphe’ means ‘form’. Morphology considers the principles of formation of words in a language. In other words, it studies how sounds combine into meaningful units like prefixes, suffixes and roots. It also considers how words can be grouped into parts of speech.

Lexeme

In linguistics, the abstract unit of morphological analysis that corresponds to a set of forms taken by a single word is called a lexeme. The way in which a lexeme is used in a sentence is determined by its grammatical category. A lexeme can be an individual word or a multiword expression. For example, talk is an individual-word lexeme, which has grammatical variants such as talks, talked and talking. A multiword lexeme is made up of more than one orthographic word. For example, speak up and pull through are multiword lexemes.

Syntax

It is the third component of language. It is the study of the order and arrangement of words into larger units. The word can be traced to the Greek language, where the word suntassein means ‘to put in order’. Syntax studies the types of sentences and the structure of sentences, clauses and phrases.

Semantics

It is the fourth component of language. It is the study of how meaning is conveyed. The meaning can be related to the outside world or to the grammar of the sentence. The word can be traced to the Greek language, where the word semainein means ‘to signify’, ‘show’, ‘signal’.

Pragmatics

It is the fifth component of language. It is the study of the functions of language and its use in context.
The origin of the word can be traced to the Greek language, where the word ‘pragma’ means ‘deed’, ‘affair’.

Grammatical Categories

A grammatical category may be defined as a class of units or features within the grammar of a language. These units are the building blocks of language and share a common set of characteristics. Grammatical categories are also called grammatical features. The inventory of grammatical categories is described below −

Number

It is the simplest grammatical category. We have two terms related to this category − singular and plural. Singular is the concept of ‘one’ whereas plural is the concept of ‘more than one’. For example, dog/dogs, this/these.

Gender

In English, grammatical gender is expressed by variation in the 3rd person singular personal pronouns − he, she, it. The first and second person forms − I, we and you − and the 3rd person plural form they are either common gender or neuter gender.

Person

Another simple grammatical category is person. Under this, the following three terms are recognized −

1st person − The person who is speaking is recognized as 1st person.
2nd person − The person who is the hearer or the person spoken to is recognized as 2nd person.
3rd person − The person or thing about whom we are speaking is recognized as 3rd person.

Case

It is one of the most difficult grammatical categories. It may be defined as an indication of the function of a noun phrase (NP) or the relationship of a noun phrase to a verb or to the other noun phrases in the sentence. We have the following three cases expressed in personal and interrogative pronouns −

Nominative case − It is the function of subject. For example, I, we, you, he, she, it, they and who are nominative.
Genitive case − It is the function of possessor. For example, my/mine, our/ours, his, her/hers, its, their/theirs, whose are genitive.
Objective case − It is the function of object. For example, me, us, you, him, her, them, whom
NLP – Information Retrieval

Information retrieval (IR) may be defined as a software program that deals with the organization, storage, retrieval and evaluation of information from document repositories, particularly textual information. The system assists users in finding the information they require, but it does not explicitly return answers to their questions. It informs them of the existence and location of documents that might contain the required information. The documents that satisfy the user’s requirement are called relevant documents. A perfect IR system would retrieve only relevant documents.

With the help of the following diagram, we can understand the process of information retrieval (IR) −

It is clear from the above diagram that a user who needs information has to formulate a request in the form of a query in natural language. The IR system then responds by retrieving the relevant output, in the form of documents, about the required information.

Classical Problem in Information Retrieval (IR) System

The main goal of IR research is to develop a model for retrieving information from repositories of documents. Here, we are going to discuss a classical problem related to IR systems, named the ad-hoc retrieval problem.

In ad-hoc retrieval, the user enters a query in natural language that describes the required information. The IR system then returns the documents related to the desired information. For example, suppose we search for something on the Internet and the results include some pages that are exactly relevant to our requirement, along with some non-relevant pages. This is due to the ad-hoc retrieval problem.

Aspects of Ad-hoc Retrieval

The following are some aspects of ad-hoc retrieval that are addressed in IR research −

How can users, with the help of relevance feedback, improve the original formulation of a query?
How can database merging be implemented, i.e., how can results from different text databases be merged into one result set?
How can partly corrupted data be handled? Which models are appropriate for it?

Information Retrieval (IR) Model

Mathematical models are used in many scientific areas with the objective of understanding some phenomenon in the real world. A model of information retrieval predicts and explains what a user will find relevant to a given query. An IR model is basically a pattern that defines the above-mentioned aspects of the retrieval procedure and consists of the following −

A model for documents.
A model for queries.
A matching function that compares queries to documents.

Mathematically, a retrieval model consists of −

D − Representation for documents.
Q − Representation for queries.
F − The modeling framework for D and Q, along with the relationship between them.
R(q, di) − A similarity function which orders the documents with respect to the query. It is also called ranking.

Types of Information Retrieval (IR) Model

An information retrieval (IR) model can be classified into the following three types −

Classical IR Model

It is the simplest IR model and the easiest to implement. This model is based on mathematical knowledge that is easily recognized and understood. Boolean, Vector and Probabilistic are the three classical IR models.

Non-Classical IR Model

It is the opposite of the classical IR model. Such IR models are based on principles other than similarity, probability and Boolean operations. The information logic model, the situation theory model and interaction models are examples of non-classical IR models.

Alternative IR Model

It is an enhancement of the classical IR model that makes use of specific techniques from other fields. The cluster model, the fuzzy model and latent semantic indexing (LSI) models are examples of alternative IR models.

Design features of Information retrieval (IR) systems

Let us now learn about the design features of IR systems −

Inverted Index

The primary data structure of most IR systems is the inverted index.
We can define an inverted index as a data structure that lists, for every word, all the documents that contain it and the frequency of its occurrences in each document. It makes it easy to search for ‘hits’ of a query word.

Stop Word Elimination

Stop words are high-frequency words that are deemed unlikely to be useful for searching. They carry little semantic weight. All such words are kept in a list called a stop list. For example, the articles “a”, “an”, “the” and prepositions like “in”, “of”, “for”, “at” are stop words. The size of the inverted index can be significantly reduced by a stop list. As per Zipf’s law, a stop list covering a few dozen words reduces the size of the inverted index by almost half. On the other hand, the elimination of a stop word may sometimes eliminate a term that is useful for searching. For example, if we eliminate the letter “A” from “Vitamin A”, the remaining term loses its significance.

Stemming

Stemming, a simplified form of morphological analysis, is the heuristic process of extracting the base form of words by chopping off their endings. For example, the words laughing, laughs and laughed would be stemmed to the root word laugh.

In our subsequent sections, we will discuss some important and useful IR models.

The Boolean Model

It is the oldest information retrieval (IR) model. The model is based on set theory and Boolean algebra, where documents are sets of terms and queries are Boolean expressions on terms. The Boolean model can be defined as −

D − A set of words, i.e., the indexing terms present in a document. Here, each term is either present (1) or absent (0).
Q − A Boolean expression, where the terms are the index terms and the operators are logical product − AND, logical sum − OR and logical difference − NOT.
F − Boolean algebra over sets of terms as well as over sets of documents.

In terms of relevance feedback, relevance prediction in the Boolean IR model can be defined as follows −

R − A document is predicted as relevant to the query expression if and only
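
As a rough illustration of the inverted index, stop list and Boolean operators described above, the following Python sketch builds a tiny index and evaluates AND, OR and NOT as set operations. The three sample documents, the function names and the whitespace tokenizer are assumptions made for this example only; the stop list reuses the words given as examples in the text, and stemming is left out to keep the sketch short.

```python
from collections import defaultdict

# Stop list taken from the examples in the text (illustrative, not exhaustive).
STOP_WORDS = {"a", "an", "the", "in", "of", "for", "at"}


def build_inverted_index(docs):
    """Map each term to {doc_id: frequency}, skipping stop words."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            if token in STOP_WORDS:
                continue
            index[token][doc_id] = index[token].get(doc_id, 0) + 1
    return index


def docs_with(index, term):
    """Return the set of document ids that contain the term."""
    return set(index.get(term, {}))


# Hypothetical sample collection.
docs = {
    1: "the cat sat in the hat",
    2: "the dog chased the cat",
    3: "a dog in the park",
}
index = build_inverted_index(docs)

# Boolean operators map directly onto set operations.
cat_and_dog = docs_with(index, "cat") & docs_with(index, "dog")  # AND
cat_or_dog = docs_with(index, "cat") | docs_with(index, "dog")   # OR
cat_not_dog = docs_with(index, "cat") - docs_with(index, "dog")  # NOT

print(cat_and_dog, cat_or_dog, cat_not_dog)
```

Because each document is treated as a set of terms, a query either matches a document or it does not; this sketch does no ranking, which is one reason the vector and probabilistic models are treated as separate classical models.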
Natural Language Processing Tutorial

Language is a method of communication with the help of which we can speak, read and write. Natural Language Processing (NLP) is a subfield of Computer Science and Artificial Intelligence (AI) that enables computers to understand and process human language.

Audience

This tutorial is designed to benefit graduates, postgraduates, and research students who either have an interest in this subject or have this subject as a part of their curriculum. The reader can be a beginner or an advanced learner.

Prerequisites

The reader must have basic knowledge of Artificial Intelligence. The reader should also be familiar with the basic terminology of English grammar and with Python programming concepts.
NLP – Word Level Analysis

In this chapter, we will understand word level analysis in Natural Language Processing.

Regular Expressions

A regular expression (RE) is a language for specifying text search strings. An RE helps us to match or find other strings or sets of strings, using a specialized syntax held in a pattern. Regular expressions are used to search text in UNIX tools as well as in MS Word, in broadly similar ways. Various search engines also use a number of RE features.

Properties of Regular Expressions

The following are some of the important properties of REs −

The American mathematician Stephen Cole Kleene formalized the regular expression language.
An RE is a formula in a special language, which can be used for specifying simple classes of strings, where a string is a sequence of symbols. In other words, an RE is an algebraic notation for characterizing a set of strings.
A regular expression search requires two things: the pattern that we wish to search for, and a corpus of text to search in.

Mathematically, a regular expression can be defined as follows −

ε is a regular expression, which denotes the language containing only the empty string.
φ is a regular expression, which denotes the empty language.
If X and Y are regular expressions, then X, Y, X.Y (concatenation of X and Y), X+Y (union of X and Y), and X*, Y* (Kleene closure of X and Y) are also regular expressions.
If a string is derived from the above rules, then it is also a regular expression.

Examples of Regular Expressions

The following table shows a few examples of regular expressions −

Regular Expression | Regular Set
(0 + 10*) | {0, 1, 10, 100, 1000, 10000, …}
(0*10*) | {1, 01, 10, 010, 0010, …}
(0 + ε)(1 + ε) | {ε, 0, 1, 01}
(a+b)* | The set of strings of a’s and b’s of any length, including the null string, i.e. {ε, a, b, aa, ab, bb, ba, aaa, …}
(a+b)*abb | The set of strings of a’s and b’s ending with the string abb, i.e. {abb, aabb, babb, aaabb, ababb, …}
(11)* | The set of strings consisting of an even number of 1’s, including the empty string, i.e. {ε, 11, 1111, 111111, …}
(aa)*(bb)*b | The set of strings consisting of an even number of a’s followed by an odd number of b’s, i.e. {b, aab, aabbb, aabbbbb, aaaab, aaaabbb, …}
(aa + ab + ba + bb)* | The set of strings of a’s and b’s of even length that can be obtained by concatenating any combination of the strings aa, ab, ba and bb, including the null string, i.e. {ε, aa, ab, ba, bb, aaab, aaba, …}

Regular Sets & Their Properties

A regular set may be defined as the set that represents the value of a regular expression. It has specific properties.

Properties of regular sets

The union of two regular sets is also regular.
The intersection of two regular sets is also regular.
The complement of a regular set is also regular.
The difference of two regular sets is also regular.
The reversal of a regular set is also regular.
The closure of a regular set is also regular.
The concatenation of two regular sets is also regular.

Finite State Automata

The term automata, derived from the Greek word “αὐτόματα” meaning “self-acting”, is the plural of automaton, which may be defined as an abstract self-propelled computing device that follows a predetermined sequence of operations automatically.

An automaton having a finite number of states is called a Finite Automaton (FA) or Finite State Automaton (FSA).

Mathematically, an automaton can be represented by a 5-tuple (Q, Σ, δ, q0, F), where −

Q is a finite set of states.
Σ is a finite set of symbols, called the alphabet of the automaton.
δ is the transition function.
q0 is the initial state from where any input is processed (q0 ∈ Q).
F is a set of final states of Q (F ⊆ Q).

Relation between Finite Automata, Regular Grammars and Regular Expressions

The following points give a clear view of the relationship between finite automata, regular grammars and regular expressions −

Finite state automata are the theoretical foundation of much computational work, and regular expressions are one way of describing them.
Any regular expression can be implemented as an FSA, and any FSA can be described with a regular expression.
A regular expression is also a way to characterize a kind of language called a regular language. Hence, a regular language can be described with the help of both an FSA and a regular expression.
A regular grammar, a formal grammar that can be right-regular or left-regular, is another way to characterize a regular language.

The following diagram shows that finite automata, regular expressions and regular grammars are equivalent ways of describing regular languages.

Types of Finite State Automata (FSA)

Finite state automata are of two types. Let us see what the types are.

Deterministic Finite Automaton (DFA)

It may be defined as the type of finite automaton wherein, for every input symbol, we can determine the state to which the machine will move. Because the next state is uniquely determined and the number of states is finite, the machine is called a Deterministic Finite Automaton (DFA).

Mathematically, a DFA can be represented by a 5-tuple (Q, Σ, δ, q0, F), where −

Q is a finite set of states.
Σ is a finite set of symbols, called the alphabet of the automaton.
δ is the transition function, where δ: Q × Σ → Q.
q0 is the initial state from where any input is processed (q0 ∈ Q).
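
To make the 5-tuple concrete, here is a minimal Python sketch of a DFA, with F taken as the set of final states as in the general automaton definition. It encodes a four-state machine for the expression (a+b)*abb from the examples table and compares its answers with Python's `re` module, where union is written `|` rather than `+`. The class name `DFA`, the state names q0 to q3 and the transition table are assumptions chosen for this illustration, not part of the original tutorial.

```python
import re
from dataclasses import dataclass


@dataclass
class DFA:
    states: set    # Q: finite set of states
    alphabet: set  # Σ: finite set of input symbols
    delta: dict    # δ: maps (state, symbol) to the next state
    start: str     # q0: initial state
    finals: set    # F: set of final states, a subset of Q

    def accepts(self, text):
        """Run the automaton over text and report whether it ends in F."""
        state = self.start
        for symbol in text:
            if symbol not in self.alphabet:
                return False  # symbol outside Σ, so the string is rejected
            state = self.delta[(state, symbol)]
        return state in self.finals


# Hypothetical DFA for (a+b)*abb: each state records how much of the
# suffix "abb" has been matched so far.
abb_dfa = DFA(
    states={"q0", "q1", "q2", "q3"},
    alphabet={"a", "b"},
    delta={
        ("q0", "a"): "q1", ("q0", "b"): "q0",
        ("q1", "a"): "q1", ("q1", "b"): "q2",
        ("q2", "a"): "q1", ("q2", "b"): "q3",
        ("q3", "a"): "q1", ("q3", "b"): "q0",
    },
    start="q0",
    finals={"q3"},
)

# The same language written as a Python regular expression.
pattern = re.compile(r"(a|b)*abb")

for word in ["abb", "aabb", "babb", "ab", "abba", ""]:
    assert abb_dfa.accepts(word) == bool(pattern.fullmatch(word))
    print(repr(word), abb_dfa.accepts(word))
```

The transition table has exactly one entry for every (state, symbol) pair, which is what makes the machine deterministic.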
NLP – Linguistic Resources

In this chapter, we will learn about the linguistic resources in Natural Language Processing.

Corpus

A corpus is a large and structured set of machine-readable texts that have been produced in a natural communicative setting. Its plural is corpora. Corpora can be derived in different ways, such as from text that was originally electronic, from transcripts of spoken language, and from optical character recognition.

Elements of Corpus Design

Language is infinite but a corpus has to be finite in size. For a finite corpus to be well designed, we need to sample and proportionally include a wide range of text types. Let us now learn about some important elements of corpus design −

Corpus Representativeness

Representativeness is a defining feature of corpus design. The following definitions from two great researchers − Leech and Biber − will help us understand corpus representativeness −

According to Leech (1991), “A corpus is thought to be representative of the language variety it is supposed to represent if the findings based on its contents can be generalized to the said language variety”.
According to Biber (1993), “Representativeness refers to the extent to which a sample includes the full range of variability in a population”.

In this way, we can conclude that the representativeness of a corpus is determined by the following two factors −

Balance − The range of genres included in a corpus.
Sampling − How the chunks for each genre are selected.

Corpus Balance

Another very important element of corpus design is corpus balance − the range of genres included in a corpus. We have already seen that the representativeness of a general corpus depends upon how balanced the corpus is. A balanced corpus covers a wide range of text categories, which are supposed to be representative of the language. We do not have any reliable scientific measure for balance, so best estimation and intuition are relied upon here.
In other words, we can say that the accepted balance is determined by the intended uses of the corpus.

Sampling

Another important element of corpus design is sampling. Corpus representativeness and balance are very closely associated with sampling. That is why we can say that sampling is inescapable in corpus building.

According to Biber (1993), “Some of the first considerations in constructing a corpus concern the overall design: for example, the kinds of texts included, the number of texts, the selection of particular texts, the selection of text samples from within texts, and the length of text samples. Each of these involves a sampling decision, either conscious or not.”

While obtaining a representative sample, we need to consider the following −

Sampling unit − It refers to the unit which requires a sample. For example, for written text, a sampling unit may be a newspaper, a journal or a book.
Sampling frame − The list of all sampling units is called a sampling frame.
Population − It may be referred to as the assembly of all sampling units. It is defined in terms of language production, language reception or language as a product.

Corpus Size

Another important element of corpus design is its size. How large should the corpus be? There is no specific answer to this question. The size of the corpus depends upon the purpose for which it is intended as well as on some practical considerations, as follows −

The kind of query anticipated from the user.
The methodology used by the users to study the data.
The availability of the source of data.

With the advancement of technology, corpus sizes have also increased.
The following comparison table will help you understand how corpus size has grown −

Year | Name of the Corpus | Size (in words)
1960s – 70s | Brown and LOB | 1 million
1980s | The Birmingham corpora | 20 million
1990s | The British National Corpus | 100 million
Early 21st century | The Bank of English corpus | 650 million

In our subsequent sections, we will look at a few examples of corpora.

TreeBank Corpus

It may be defined as a linguistically parsed text corpus that annotates syntactic or semantic sentence structure. Geoffrey Leech coined the term ‘treebank’, which reflects the fact that the most common way of representing a grammatical analysis is by means of a tree structure. Generally, Treebanks are created on top of a corpus that has already been annotated with part-of-speech tags.

Types of TreeBank Corpus

Semantic and Syntactic Treebanks are the two most common types of Treebanks in linguistics. Let us now learn more about these types −

Semantic Treebanks

These Treebanks use a formal representation of a sentence’s semantic structure. They vary in the depth of their semantic representation. Robot Commands Treebank, Geoquery, Groningen Meaning Bank and RoboCup Corpus are some examples of Semantic Treebanks.

Syntactic Treebanks

In contrast to semantic Treebanks, the inputs to Syntactic Treebank systems are expressions of a formal language obtained from the conversion of parsed Treebank data. The outputs of such systems are predicate-logic-based meaning representations. Various syntactic Treebanks in different languages have been created so far. For example, the Penn Arabic Treebank and the Columbia Arabic Treebank are syntactic Treebanks created for the Arabic language. The Sinica syntactic Treebank was created for the Chinese language. Lucy, SUSANNE and the BLLIP WSJ syntactic corpus were created for the English language.
Applications of TreeBank Corpus

The following are some of the applications of TreeBanks −

In Computational Linguistics

In computational linguistics, the best use of TreeBanks is to engineer state-of-the-art natural language processing systems such as part-of-speech taggers, parsers, semantic analyzers and machine translation systems.

In Corpus Linguistics

In corpus linguistics, the best use of Treebanks is to study syntactic phenomena.

In Theoretical Linguistics and Psycholinguistics

The best use of Treebanks in theoretical linguistics and psycholinguistics is as interaction evidence.

PropBank Corpus

PropBank, more specifically called “Proposition Bank”, is a corpus annotated with verbal propositions and their arguments. The corpus is a verb-oriented resource; the annotations here are more closely related to the syntactic level. Martha Palmer et al., Department of Linguistics, University of Colorado
Natural Language Processing – Introduction

Language is a method of communication with the help of which we can speak, read and write. For example, we think, make decisions, make plans and more in natural language; precisely, in words. However, the big question that confronts us in this AI era is whether we can communicate in a similar manner with computers. In other words, can human beings communicate with computers in their natural language? Developing NLP applications is a challenge because computers need structured data, whereas human speech is unstructured and often ambiguous in nature.

In this sense, we can say that Natural Language Processing (NLP) is the sub-field of Computer Science, especially Artificial Intelligence (AI), that is concerned with enabling computers to understand and process human language. Technically, the main task of NLP is to program computers to analyze and process huge amounts of natural language data.

History of NLP

We have divided the history of NLP into four phases. The phases have distinctive concerns and styles.

First Phase (Machine Translation Phase) – Late 1940s to late 1960s

The work done in this phase focused mainly on machine translation (MT). This phase was a period of enthusiasm and optimism. Let us now see what the first phase included −

Research on NLP started in the early 1950s, after Booth & Richens’ investigation and Weaver’s memorandum on machine translation in 1949.
1954 was the year when a limited experiment on automatic translation from Russian to English was demonstrated in the Georgetown-IBM experiment.
In the same year, the publication of the journal MT (Machine Translation) started.
The first international conference on Machine Translation (MT) was held in 1952 and the second was held in 1956.
In 1961, the work presented at the Teddington International Conference on Machine Translation of Languages and Applied Language Analysis was the high point of this phase.
Second Phase (AI Influenced Phase) – Late 1960s to late 1970s

In this phase, the work done was mainly related to world knowledge and its role in the construction and manipulation of meaning representations. That is why this phase is also called the AI-flavored phase. The phase included the following −

In early 1961, work began on the problems of addressing and constructing data or knowledge bases. This work was influenced by AI.
In the same year, the BASEBALL question-answering system was also developed. The input to this system was restricted and the language processing involved was simple.
A much more advanced system was described in Minsky (1968). Compared with the BASEBALL question-answering system, this system recognized and provided for the need for inference on the knowledge base when interpreting and responding to language input.

Third Phase (Grammatico-logical Phase) – Late 1970s to late 1980s

This phase can be described as the grammatico-logical phase. Due to the failure of practical system building in the previous phase, researchers moved towards the use of logic for knowledge representation and reasoning in AI. The third phase included the following −

Towards the end of the decade, the grammatico-logical approach gave us powerful general-purpose sentence processors like SRI’s Core Language Engine, as well as Discourse Representation Theory, which offered a means of tackling more extended discourse.
In this phase we got some practical resources and tools like parsers, e.g. the Alvey Natural Language Tools, along with more operational and commercial systems, e.g. for database query.
The work on the lexicon in the 1980s also pointed in the direction of the grammatico-logical approach.

Fourth Phase (Lexical & Corpus Phase) – The 1990s

We can describe this as a lexical & corpus phase. The phase had a lexicalized approach to grammar that appeared in the late 1980s and became an increasing influence.
There was a revolution in natural language processing in this decade with the introduction of machine learning algorithms for language processing.

Study of Human Languages

Language is a crucial component of human lives and also the most fundamental aspect of our behavior. We experience it in mainly two forms − written and spoken. In the written form, it is a way to pass our knowledge from one generation to the next. In the spoken form, it is the primary medium for human beings to coordinate with each other in their day-to-day behavior. Language is studied in various academic disciplines. Each discipline comes with its own set of problems and a set of tools to address them.

Consider the following table to understand this −

Discipline | Problems | Tools
Linguists | How can phrases and sentences be formed with words? What curbs the possible meanings of a sentence? | Intuitions about well-formedness and meaning. Mathematical models of structure, for example, model-theoretic semantics and formal language theory.
Psycholinguists | How can human beings identify the structure of sentences? How can the meaning of words be identified? When does understanding take place? | Experimental techniques, mainly for measuring the performance of human beings. Statistical analysis of observations.
Philosophers | How do words and sentences acquire meaning? How are objects identified by words? What is meaning? | Natural language argumentation using intuition. Mathematical models like logic and model theory.
Computational Linguists | How can we identify the structure of a sentence? How can knowledge and reasoning be modeled? How can we use language to accomplish specific tasks? | Algorithms. Data structures. Formal models of representation and reasoning. AI techniques like search and representation methods.

Ambiguity and Uncertainty in Language

Ambiguity, as the term is generally used in natural language processing, refers to the capability of being understood in more than one way.
Natural language is very ambiguous. NLP has the following types of ambiguities −

Lexical Ambiguity

The ambiguity of a single word is called lexical ambiguity. For example, the word silver can be treated as a noun, an adjective, or a verb.

Syntactic Ambiguity

This kind of ambiguity occurs when a sentence can be parsed in different ways. For example, the