Python – Text Munging

Munging in general means cleaning up anything messy by transforming it. In our case we will see how we can transform text to produce some desirable change in the data. At a simple level it is only about transforming the text we are dealing with.

Example

In the below example we shuffle and rearrange all the letters of each word in a sentence, except the first and the last one, to get the alternate forms a human might produce as misspellings while writing. This rearrangement helps us in generating plausible misspelled variants of the original words.

import random
import re

def replace(t):
   inner_word = list(t.group(2))
   random.shuffle(inner_word)
   return t.group(1) + "".join(inner_word) + t.group(3)

text = "Hello, You should reach the finish line."
print(re.sub(r"(\w)(\w+)(\w)", replace, text))
print(re.sub(r"(\w)(\w+)(\w)", replace, text))

When we run the above program we get the following output −

Hlleo, You slouhd raech the fsiinh lnie.
Hlleo, You suolhd raceh the fniish line.

Here you can see how the words are jumbled except for their first and last letters. By taking a statistical approach to wrong spelling we can decide which words are commonly misspelled and supply the correct spelling for them.
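A natural follow-up is to test whether a candidate word could be an inner-letter jumble of a known word, by comparing the first letter, the last letter, and the sorted inner letters. This is a minimal sketch of that check, not part of the original example; the function name could_be_jumbled is our own.

def could_be_jumbled(candidate, original):
   # A jumble keeps the length, the first letter, the last letter and
   # the multiset of inner letters; words under 3 letters cannot change.
   if len(candidate) != len(original) or len(candidate) < 3:
      return candidate == original
   return (candidate[0] == original[0]
           and candidate[-1] == original[-1]
           and sorted(candidate[1:-1]) == sorted(original[1:-1]))

print(could_be_jumbled("slouhd", "should"))   # True
print(could_be_jumbled("slouhe", "should"))   # False

Such a check could serve as one building block of the statistical spelling-correction approach mentioned above.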

Python – Text Summarization

Text summarization involves generating a summary from a large body of text which somewhat describes the context of that text. In the below example we use the gensim module and its summarize function to achieve this. We install the gensim package to achieve this; note that the summarization module was removed in gensim 4.0, so a pre-4.0 version is required.

pip install "gensim<4.0"

The below paragraph is about a movie plot. The summarize function is applied to extract a few lines from the text body itself to produce the summary.

from gensim.summarization import summarize

text = ("In late summer 1945, guests are gathered for the wedding reception of Don Vito Corleones "
        "daughter Connie (Talia Shire) and Carlo Rizzi (Gianni Russo). Vito (Marlon Brando), "
        "the head of the Corleone Mafia family, is known to friends and associates as Godfather. "
        "He and Tom Hagen (Robert Duvall), the Corleone family lawyer, are hearing requests for favors "
        "because, according to Italian tradition, no Sicilian can refuse a request on his daughter's "
        "wedding day. One of the men who asks the Don for a favor is Amerigo Bonasera, a successful "
        "mortician and acquaintance of the Don, whose daughter was brutally beaten by two young men "
        "because she refused their advances; the men received minimal punishment from the presiding "
        "judge. The Don is disappointed in Bonasera, who'd avoided most contact with the Don due to "
        "Corleone's nefarious business dealings. The Don's wife is godmother to Bonasera's shamed "
        "daughter, a relationship the Don uses to extract new loyalty from the undertaker. The Don "
        "agrees to have his men punish the young men responsible (in a non-lethal manner) in return "
        "for future service if necessary.")

print(summarize(text))

When we run the above program we get the following output −

He and Tom Hagen (Robert Duvall), the Corleone family lawyer, are hearing requests for favors because, according to Italian tradition, no Sicilian can refuse a request on his daughter's wedding day.

Extracting Keywords

We can also extract keywords from a body of text by using the keywords function from the gensim library, applied to the same text variable as above.

from gensim.summarization import keywords

print(keywords(text))

When we run the above program, we get the following output −

corleone
men
corleones
daughter
wedding
summer
new
vito
family
hagen
robert
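The summary length can also be tuned. In gensim's pre-4.0 summarization module, summarize accepts ratio and word_count arguments and keywords accepts a words argument; below is a minimal sketch, reusing the text variable defined above.

from gensim.summarization import summarize, keywords

# Keep roughly 20% of the original sentences
print(summarize(text, ratio=0.2))

# Cap the summary at about 25 words instead
print(summarize(text, word_count=25))

# Return only the top 3 keywords
print(keywords(text, words=3))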

Python – Stemming Algorithms

In the areas of Natural Language Processing we come across situations where two or more words have a common root. For example, the three words agreed, agreeing and agreeable have the same root word agree. A search involving any of these words should treat them as the same word, namely the root word. So it becomes essential to link all such words to their root word. The NLTK library has methods to do this linking and to show the root word.

There are three widely used stemming algorithms available in nltk. They give slightly different results. The below example shows the use of all three stemming algorithms and their results.

import nltk
from nltk.stem.porter import PorterStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem import SnowballStemmer

porter_stemmer = PorterStemmer()
lanca_stemmer = LancasterStemmer()
sb_stemmer = SnowballStemmer("english")

word_data = "Aging head of famous crime family decides to transfer his position to one of his subalterns"

# First word tokenization
nltk_tokens = nltk.word_tokenize(word_data)

# Next find the roots of the words
print("***PorterStemmer****\n")
for w_port in nltk_tokens:
   print("Actual: %s || Stem: %s" % (w_port, porter_stemmer.stem(w_port)))

print("\n***LancasterStemmer****\n")
for w_lanca in nltk_tokens:
   print("Actual: %s || Stem: %s" % (w_lanca, lanca_stemmer.stem(w_lanca)))

print("\n***SnowballStemmer****\n")
for w_snow in nltk_tokens:
   print("Actual: %s || Stem: %s" % (w_snow, sb_stemmer.stem(w_snow)))

When we run the above program we get the following output −

***PorterStemmer****

Actual: Aging || Stem: age
Actual: head || Stem: head
Actual: of || Stem: of
Actual: famous || Stem: famou
Actual: crime || Stem: crime
Actual: family || Stem: famili
Actual: decides || Stem: decid
Actual: to || Stem: to
Actual: transfer || Stem: transfer
Actual: his || Stem: hi
Actual: position || Stem: posit
Actual: to || Stem: to
Actual: one || Stem: one
Actual: of || Stem: of
Actual: his || Stem: hi
Actual: subalterns || Stem: subaltern

***LancasterStemmer****

Actual: Aging || Stem: ag
Actual: head || Stem: head
Actual: of || Stem: of
Actual: famous || Stem: fam
Actual: crime || Stem: crim
Actual: family || Stem: famy
Actual: decides || Stem: decid
Actual: to || Stem: to
Actual: transfer || Stem: transf
Actual: his || Stem: his
Actual: position || Stem: posit
Actual: to || Stem: to
Actual: one || Stem: on
Actual: of || Stem: of
Actual: his || Stem: his
Actual: subalterns || Stem: subaltern

***SnowballStemmer****

Actual: Aging || Stem: age
Actual: head || Stem: head
Actual: of || Stem: of
Actual: famous || Stem: famous
Actual: crime || Stem: crime
Actual: family || Stem: famili
Actual: decides || Stem: decid
Actual: to || Stem: to
Actual: transfer || Stem: transfer
Actual: his || Stem: his
Actual: position || Stem: posit
Actual: to || Stem: to
Actual: one || Stem: one
Actual: of || Stem: of
Actual: his || Stem: his
Actual: subalterns || Stem: subaltern
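To see the linking described in the introduction, the sketch below stems the words agreed, agreeing and agreeable with each of the three algorithms; note that the algorithms do not always agree on the exact stem they produce.

from nltk.stem.porter import PorterStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem import SnowballStemmer

words = ["agreed", "agreeing", "agreeable"]
for stemmer in (PorterStemmer(), LancasterStemmer(), SnowballStemmer("english")):
   # Print the stem each algorithm assigns to every variant
   print(type(stemmer).__name__, [stemmer.stem(w) for w in words])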

Python – Sentiment Analysis

Sentiment analysis is about analysing the general opinion of the audience. It may be a reaction to a piece of news, a movie or a tweet about some matter under discussion. Generally, such reactions are taken from social media and clubbed into a file to be analysed through NLP. We take the simple case of defining positive and negative words first, and then analyse sentences built from those words. We use the sentiment utilities from nltk. We first carry out the analysis with single words, then with paired words, also called bigrams. Finally, we mark the words with negative sentiment as defined by the mark_negation function.

import nltk
import nltk.sentiment.util

# Analysing single words
def OneWord():
   positive_words = ['good', 'progress', 'luck']
   text = 'Hard Work brings progress and good luck.'.split()
   analysis = nltk.sentiment.util.extract_unigram_feats(text, positive_words)
   print(' ** Sentiment with one word **\n')
   print(analysis)

# Analysing a pair of words
def WithBigrams():
   word_sets = [('Regular', 'fit'), ('fit', 'fine')]
   text = 'Regular exercise makes you fit and fine'.split()
   analysis = nltk.sentiment.util.extract_bigram_feats(text, word_sets)
   print('\n*** Sentiment with bigrams ***\n')
   print(analysis)

# Analysing the negation words
def NegativeWord():
   text = 'Lack of good health can not bring success to students'.split()
   analysis = nltk.sentiment.util.mark_negation(text)
   print('\n**Sentiment with Negative words**\n')
   print(analysis)

OneWord()
WithBigrams()
NegativeWord()

When we run the above program we get the following output −

** Sentiment with one word **

{'contains(luck)': False, 'contains(good)': True, 'contains(progress)': True}

*** Sentiment with bigrams ***

{'contains(fit - fine)': False, 'contains(Regular - fit)': False}

**Sentiment with Negative words**

['Lack', 'of', 'good', 'health', 'can', 'not', 'bring_NEG', 'success_NEG', 'to_NEG', 'students_NEG']
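Beyond hand-made word lists, nltk also ships VADER, a pre-trained rule-based scorer that returns graded polarity scores instead of boolean features. Below is a minimal sketch, assuming the vader_lexicon resource is available (it can be fetched with nltk.download).

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')   # one-time download of the scoring lexicon

sia = SentimentIntensityAnalyzer()
# polarity_scores returns neg/neu/pos components and a compound score in [-1, 1]
print(sia.polarity_scores("Hard work brings progress and good luck."))
print(sia.polarity_scores("Lack of good health can not bring success."))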

Python – Text Classification

Many times we need to categorise the available text into various categories by some pre-defined criteria. nltk provides such a feature as part of various corpora. In the below example we look at the movie review corpus and check the categorization available.

# Let's see how the movie reviews are classified
from nltk.corpus import movie_reviews

all_cats = []
for w in movie_reviews.categories():
   all_cats.append(w.lower())
print(all_cats)

When we run the above program, we get the following output −

['neg', 'pos']

Now let's look at the content of one of the files with a positive review. The sentences in this file are tokenized and we print the first four sentences to see the sample.

from nltk.corpus import movie_reviews
from nltk.tokenize import sent_tokenize

sample = movie_reviews.raw("pos/cv944_13521.txt")
token = sent_tokenize(sample)

for lines in range(4):
   print(token[lines])

When we run the above program we get the following output −

meteor threat set to blow away all volcanoes & twisters !
summer is here again !
this season could probably be the most ambitious = season this decade with hollywood churning out films like deep impact , = godzilla , the x-files , armageddon , the truman show , all of which has but = one main aim , to rock the box office .
leading the pack this summer is = deep impact , one of the first few film releases from the = spielberg-katzenberg-geffen's dreamworks production company .

(The stray "=" characters come from the raw review file itself.)

Next, we tokenize the words across all the reviews and find the most common words by using the FreqDist function from nltk.

import nltk
from nltk.corpus import movie_reviews

all_words = []
for w in movie_reviews.words():
   all_words.append(w.lower())

all_words = nltk.FreqDist(all_words)
print(all_words.most_common(10))

When we run the above program we get the following output −

[(',', 77717), ('the', 76529), ('.', 65876), ('a', 38106), ('and', 35576), ('of', 34123), ('to', 31937), ("'", 30585), ('is', 25195), ('in', 21822)]
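The categories and word frequencies above are the raw ingredients of an actual classifier. Below is a minimal sketch of the classic NLTK recipe: training a Naive Bayes model on boolean bag-of-words features. The 2000-word feature cutoff and the 100-document test split are arbitrary choices for illustration.

import random
import nltk
from nltk.corpus import movie_reviews

# Each document is (list_of_words, category)
documents = [(list(movie_reviews.words(fid)), cat)
             for cat in movie_reviews.categories()
             for fid in movie_reviews.fileids(cat)]
random.shuffle(documents)

# Use the 2000 most frequent words as boolean features
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = [w for w, _ in all_words.most_common(2000)]

def document_features(words):
   present = set(words)
   return {'contains(%s)' % w: (w in present) for w in word_features}

featuresets = [(document_features(d), c) for d, c in documents]
train_set, test_set = featuresets[100:], featuresets[:100]

classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))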

Python – Reading RSS feed

RSS (Rich Site Summary) is a format for delivering regularly changing web content. Many news-related sites, weblogs and other online publishers syndicate their content as an RSS feed to whoever wants it. In Python we take the help of the below package to read and process these feeds.

pip install feedparser

Feed Structure

In the below example we get the structure of the feed so that we can analyse further which parts of the feed we want to process.

import feedparser

NewsFeed = feedparser.parse("https://timesofindia.indiatimes.com/rssfeedstopstories.cms")
entry = NewsFeed.entries[1]
print(entry.keys())

When we run the above program, we get the following output −

['summary_detail', 'published_parsed', 'links', 'title', 'summary', 'guidislink', 'title_detail', 'link', 'published', 'id']

Feed Title and Posts

In the below example we read the number of posts in the RSS feed and the title of one post.

import feedparser

NewsFeed = feedparser.parse("https://timesofindia.indiatimes.com/rssfeedstopstories.cms")

print('Number of RSS posts :', len(NewsFeed.entries))

entry = NewsFeed.entries[1]
print('Post Title :', entry.title)

When we run the above program we get the following output −

Number of RSS posts : 5
Post Title : Cong-JD(S) in SC over choice of pro tem speaker

Feed Details

Based on the above entry structure we can derive the necessary details from the feed using a Python program as shown below. As entry is a dictionary, we utilise its keys to produce the values needed.

import feedparser

NewsFeed = feedparser.parse("https://timesofindia.indiatimes.com/rssfeedstopstories.cms")
entry = NewsFeed.entries[1]

print(entry.published)
print("******")
print(entry.summary)
print("------News Link--------")
print(entry.link)

When we run the above program we get the following output −

Fri, 18 May 2018 20:13:13 GMT
******
Controversy erupted on Friday over the appointment of BJP MLA K G Bopaiah as pro tem speaker for the assembly, with Congress and JD(S) claiming the move went against convention that the post should go to the most senior member of the House. The combine approached the SC to challenge the appointment. Hearing is scheduled for 10:30 am today.
------News Link--------
https://timesofindia.indiatimes.com/india/congress-jds-in-sc-over-bjp-mla-made-pro-tem-speaker-hearing-at-1030-am/articleshow/64228740.cms
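Since NewsFeed.entries is an ordinary list, we can also iterate over it to process every post rather than a single one; a minimal sketch printing each title with its link:

import feedparser

NewsFeed = feedparser.parse("https://timesofindia.indiatimes.com/rssfeedstopstories.cms")

for entry in NewsFeed.entries:
   # Each entry exposes the same fields we inspected with entry.keys()
   print(entry.title)
   print(entry.link)
   print("---")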

Python – Chunk Classification

Classification-based chunking involves classifying the text as groups of words rather than as individual words. A simple scenario is tagging the text in sentences. We will use a corpus to demonstrate the classification. We choose the corpus conll2000, which has data from the Wall Street Journal (WSJ) corpus used for noun-phrase-based chunking. First, we add the corpus to our environment using the following command.

import nltk
nltk.download('conll2000')

Let's have a look at the first few sentences in this corpus.

from nltk.corpus import conll2000

x = conll2000.sents()
for i in range(3):
   print(x[i])
   print('\n')

When we run the above program we get the following output −

['Confidence', 'in', 'the', 'pound', 'is', 'widely', 'expected', 'to', 'take', 'another', 'sharp', 'dive', 'if', 'trade', 'figures', 'for', 'September', ',', 'due', 'for', 'release', 'tomorrow', ',', 'fail', 'to', 'show', 'a', 'substantial', 'improvement', 'from', 'July', 'and', 'August', "'s", 'near-record', 'deficits', '.']

['Chancellor', 'of', 'the', 'Exchequer', 'Nigel', 'Lawson', "'s", 'restated', 'commitment', 'to', 'a', 'firm', 'monetary', 'policy', 'has', 'helped', 'to', 'prevent', 'a', 'freefall', 'in', 'sterling', 'over', 'the', 'past', 'week', '.']

['But', 'analysts', 'reckon', 'underlying', 'support', 'for', 'sterling', 'has', 'been', 'eroded', 'by', 'the', 'chancellor', "'s", 'failure', 'to', 'announce', 'any', 'new', 'policy', 'measures', 'in', 'his', 'Mansion', 'House', 'speech', 'last', 'Thursday', '.']

Next we use the function tagged_sents() to get the sentences tagged with their part-of-speech classifiers.

from nltk.corpus import conll2000

x = conll2000.tagged_sents()
for i in range(3):
   print(x[i])
   print('\n')

When we run the above program we get the following output −

[('Confidence', 'NN'), ('in', 'IN'), ('the', 'DT'), ('pound', 'NN'), ('is', 'VBZ'), ('widely', 'RB'), ('expected', 'VBN'), ('to', 'TO'), ('take', 'VB'), ('another', 'DT'), ('sharp', 'JJ'), ('dive', 'NN'), ('if', 'IN'), ('trade', 'NN'), ('figures', 'NNS'), ('for', 'IN'), ('September', 'NNP'), (',', ','), ('due', 'JJ'), ('for', 'IN'), ('release', 'NN'), ('tomorrow', 'NN'), (',', ','), ('fail', 'VB'), ('to', 'TO'), ('show', 'VB'), ('a', 'DT'), ('substantial', 'JJ'), ('improvement', 'NN'), ('from', 'IN'), ('July', 'NNP'), ('and', 'CC'), ('August', 'NNP'), ("'s", 'POS'), ('near-record', 'JJ'), ('deficits', 'NNS'), ('.', '.')]

[('Chancellor', 'NNP'), ('of', 'IN'), ('the', 'DT'), ('Exchequer', 'NNP'), ('Nigel', 'NNP'), ('Lawson', 'NNP'), ("'s", 'POS'), ('restated', 'VBN'), ('commitment', 'NN'), ('to', 'TO'), ('a', 'DT'), ('firm', 'NN'), ('monetary', 'JJ'), ('policy', 'NN'), ('has', 'VBZ'), ('helped', 'VBN'), ('to', 'TO'), ('prevent', 'VB'), ('a', 'DT'), ('freefall', 'NN'), ('in', 'IN'), ('sterling', 'NN'), ('over', 'IN'), ('the', 'DT'), ('past', 'JJ'), ('week', 'NN'), ('.', '.')]

[('But', 'CC'), ('analysts', 'NNS'), ('reckon', 'VBP'), ('underlying', 'VBG'), ('support', 'NN'), ('for', 'IN'), ('sterling', 'NN'), ('has', 'VBZ'), ('been', 'VBN'), ('eroded', 'VBN'), ('by', 'IN'), ('the', 'DT'), ('chancellor', 'NN'), ("'s", 'POS'), ('failure', 'NN'), ('to', 'TO'), ('announce', 'VB'), ('any', 'DT'), ('new', 'JJ'), ('policy', 'NN'), ('measures', 'NNS'), ('in', 'IN'), ('his', 'PRP$'), ('Mansion', 'NNP'), ('House', 'NNP'), ('speech', 'NN'), ('last', 'JJ'), ('Thursday', 'NNP'), ('.', '.')]
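The tagged sentences above are exactly what a chunker consumes. The sketch below runs a hand-written noun-phrase grammar over the first tagged sentence; the pattern (an optional determiner, any adjectives, then a noun) is a simplified illustration, not the corpus's own chunk definition.

import nltk
from nltk.corpus import conll2000

# A toy NP grammar: optional determiner, any number of adjectives, then a noun
grammar = "NP: {<DT>?<JJ>*<NN.*>}"
chunk_parser = nltk.RegexpParser(grammar)

tagged_sentence = conll2000.tagged_sents()[0]
tree = chunk_parser.parse(tagged_sentence)
print(tree)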

Python – Constrained Search

Many times, after we get the result of a search, we need to search one level deeper into part of the existing result. For example, in a given body of text we aim to get the web addresses and also extract the different parts of each web address, like the protocol, the domain name and so on. In such a scenario we take the help of the group function, which divides the search result into various groups based on the regular expression assigned. We create such group expressions by putting parentheses around each searchable part of the pattern, excluding the fixed text we want to match.

import re

text = "The web address is https://www.tutorialspoint.com"

# Taking "://" and "." to separate the groups
result = re.search(r'([\w.-]+)://([\w.-]+)\.([\w.-]+)', text)
if result:
   print("The main web Address: ", result.group())
   print("The protocol: ", result.group(1))
   print("The domain name: ", result.group(2))
   print("The TLD: ", result.group(3))

When we run the above program, we get the following output −

The main web Address:  https://www.tutorialspoint.com
The protocol:  https
The domain name:  www.tutorialspoint
The TLD:  com
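Positional group numbers become hard to follow as patterns grow. Python's re module also supports named groups with the (?P<name>...) syntax, so the same search can be written more readably; a minimal sketch:

import re

text = "The web address is https://www.tutorialspoint.com"

# The same pattern, with (?P<name>...) labels instead of positional groups
result = re.search(r'(?P<protocol>[\w.-]+)://(?P<domain>[\w.-]+)\.(?P<tld>[\w.-]+)', text)
if result:
   print("The protocol: ", result.group('protocol'))
   print("The domain name: ", result.group('domain'))
   print("The TLD: ", result.group('tld'))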

Python – Tagging Words

Tagging is an essential feature of text processing, where we tag words with their grammatical categorization. We take the help of tokenization and the pos_tag function to create the tags for each word.

import nltk

text = nltk.word_tokenize("A Python is a serpent which eats eggs from the nest")
tagged_text = nltk.pos_tag(text)
print(tagged_text)

When we run the above program, we get the following output −

[('A', 'DT'), ('Python', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('serpent', 'NN'), ('which', 'WDT'), ('eats', 'VBZ'), ('eggs', 'NNS'), ('from', 'IN'), ('the', 'DT'), ('nest', 'JJS')]

Tag Descriptions

We can describe the meaning of each tag by using the following program which shows the built-in values.

import nltk

nltk.help.upenn_tagset('NN')
nltk.help.upenn_tagset('IN')
nltk.help.upenn_tagset('DT')

When we run the above program, we get the following output −

NN: noun, common, singular or mass
   common-carrier cabbage knuckle-duster Casino afghan shed thermostat
   investment slide humour falloff slick wind hyena override subhumanity
   machinist ...
IN: preposition or conjunction, subordinating
   astride among uppon whether out inside pro despite on by throughout
   below within for towards near behind atop around if like until below
   next into if beside ...
DT: determiner
   all an another any both del each either every half la many much nary
   neither no some such that the them these this those

Tagging a Corpus

We can also tag a corpus and see the tagged result for each word in that corpus.

import nltk
from nltk.tokenize import sent_tokenize
from nltk.corpus import gutenberg

sample = gutenberg.raw("blake-poems.txt")
tokenized = sent_tokenize(sample)

for i in tokenized[:2]:
   words = nltk.word_tokenize(i)
   tagged = nltk.pos_tag(words)
   print(tagged)

When we run the above program we get the following output −

[('[', 'JJ'), ('Poems', 'NNP'), ('by', 'IN'), ('William', 'NNP'), ('Blake', 'NNP'), ('1789', 'CD'), (']', 'NNP'), ('SONGS', 'NNP'), ('OF', 'NNP'), ('INNOCENCE', 'NNP'), ('AND', 'NNP'), ('OF', 'NNP'), ('EXPERIENCE', 'NNP'), ('and', 'CC'), ('THE', 'NNP'), ('BOOK', 'NNP'), ('of', 'IN'), ('THEL', 'NNP'), ('SONGS', 'NNP'), ('OF', 'NNP'), ('INNOCENCE', 'NNP'), ('INTRODUCTION', 'NNP'), ('Piping', 'VBG'), ('down', 'RP'), ('the', 'DT'), ('valleys', 'NN'), ('wild', 'JJ'), (',', ','), ('Piping', 'NNP'), ('songs', 'NNS'), ('of', 'IN'), ('pleasant', 'JJ'), ('glee', 'NN'), (',', ','), ('On', 'IN'), ('a', 'DT'), ('cloud', 'NN'), ('I', 'PRP'), ('saw', 'VBD'), ('a', 'DT'), ('child', 'NN'), (',', ','), ('And', 'CC'), ('he', 'PRP'), ('laughing', 'VBG'), ('said', 'VBD'), ('to', 'TO'), ('me', 'PRP'), (':', ':'), ('``', '``'), ('Pipe', 'VB'), ('a', 'DT'), ('song', 'NN'), ('about', 'IN'), ('a', 'DT'), ('Lamb', 'NN'), ('!', '.'), ("''", "''")]
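When the fine-grained Penn Treebank tags are more detail than needed, pos_tag can map them onto the coarser universal tagset through its tagset parameter. A minimal sketch follows; the universal_tagset resource may need a one-time nltk.download.

import nltk

nltk.download('universal_tagset')   # mapping tables for the coarse tags

text = nltk.word_tokenize("A Python is a serpent which eats eggs from the nest")
# Penn tags such as NNP and VBZ collapse into classes like NOUN and VERB
print(nltk.pos_tag(text, tagset='universal'))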

Python – Frequency Distribution

Counting the frequency of occurrence of a word in a body of text is often needed during text processing. This can be achieved by applying the word_tokenize() function and then counting each token's occurrences in a list, as shown in the below program.

from nltk.tokenize import word_tokenize
from nltk.corpus import gutenberg

sample = gutenberg.raw("blake-poems.txt")
token = word_tokenize(sample)
wlist = []

for i in range(50):
   wlist.append(token[i])

wordfreq = [wlist.count(w) for w in wlist]
print("Pairs\n" + str(list(zip(wlist, wordfreq))))

When we run the above program, we get the following output −

Pairs
[('[', 1), ('Poems', 1), ('by', 1), ('William', 1), ('Blake', 1), ('1789', 1), (']', 1), ('SONGS', 2), ('OF', 3), ('INNOCENCE', 2), ('AND', 1), ('OF', 3), ('EXPERIENCE', 1), ('and', 1), ('THE', 1), ('BOOK', 1), ('of', 2), ('THEL', 1), ('SONGS', 2), ('OF', 3), ('INNOCENCE', 2), ('INTRODUCTION', 1), ('Piping', 2), ('down', 1), ('the', 1), ('valleys', 1), ('wild', 1), (',', 3), ('Piping', 2), ('songs', 1), ('of', 2), ('pleasant', 1), ('glee', 1), (',', 3), ('On', 1), ('a', 2), ('cloud', 1), ('I', 1), ('saw', 1), ('a', 2), ('child', 1), (',', 3), ('And', 1), ('he', 1), ('laughing', 1), ('said', 1), ('to', 1), ('me', 1), (':', 1), ('``', 1)]

Conditional Frequency Distribution

A conditional frequency distribution is used when we want to count words meeting specific criteria under each of several conditions, such as the genre of the text.

import nltk
from nltk.corpus import brown

cfd = nltk.ConditionalFreqDist(
   (genre, word)
   for genre in brown.categories()
   for word in brown.words(categories=genre))

categories = ['hobbies', 'romance', 'humor']
searchwords = ['may', 'might', 'must', 'will']
cfd.tabulate(conditions=categories, samples=searchwords)

When we run the above program, we get the following output −

         may might must will
hobbies  131    22   83  264
romance   11    51   45   43
humor      8     8    9   13
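Counting with list.count inside a comprehension rescans the list for every token. nltk's FreqDist, used earlier in the Text Classification chapter, does the same bookkeeping in a single pass; below is a minimal sketch over the same first 50 tokens.

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import gutenberg

sample = gutenberg.raw("blake-poems.txt")
tokens = word_tokenize(sample)[:50]

# FreqDist tallies every token in one pass
fdist = nltk.FreqDist(tokens)
print(fdist.most_common(5))   # five most frequent of the first 50 tokens
print(fdist['OF'])            # count for one specific token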