Python – Text Munging

Munging in general means cleaning up anything messy by transforming it. In our case we will see how we can transform text to produce some desirable change in the data. At a simple level it is only about transforming the text we are dealing with.

Example

In the below example we shuffle and rearrange all the letters of each word in a sentence, except the first and the last one, to get the alternate forms a human might produce as misspellings while writing. This rearrangement helps us in generating plausible misspelled variants of the original words.

import random
import re

def replace(t):
   inner_word = list(t.group(2))
   random.shuffle(inner_word)
   return t.group(1) + "".join(inner_word) + t.group(3)

text = "Hello, You should reach the finish line."
print(re.sub(r"(\w)(\w+)(\w)", replace, text))
print(re.sub(r"(\w)(\w+)(\w)", replace, text))

When we run the above program we get the following output −

Hlleo, You slouhd raech the fsiinh lnie.
Hlleo, You suolhd raceh the fniish line.

Here you can see how the words are jumbled except for their first and last letters. By taking a statistical approach to wrong spelling we can decide which words are commonly misspelled and supply the correct spelling for them.
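A natural follow-up is to test whether a candidate word could be an inner-letter jumble of a known word, by comparing the first letter, the last letter, and the sorted inner letters. This is a minimal sketch of that check, not part of the original example; the function name could_be_jumbled is our own.

def could_be_jumbled(candidate, original):
   # A jumble keeps the length, the first letter, the last letter and
   # the multiset of inner letters; words under 3 letters cannot change.
   if len(candidate) != len(original) or len(candidate) < 3:
      return candidate == original
   return (candidate[0] == original[0]
           and candidate[-1] == original[-1]
           and sorted(candidate[1:-1]) == sorted(original[1:-1]))

print(could_be_jumbled("slouhd", "should"))   # True
print(could_be_jumbled("slouhe", "should"))   # False

Such a check could serve as one building block of the statistical spelling-correction approach mentioned above.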

Python – Text Summarization

Text summarization involves generating a summary from a large body of text which somewhat describes the context of that text. In the below example we use the gensim module and its summarize function to achieve this. We install the gensim package to achieve this; note that the summarization module was removed in gensim 4.0, so a pre-4.0 version is required.

pip install "gensim<4.0"

The below paragraph is about a movie plot. The summarize function is applied to extract a few lines from the text body itself to produce the summary.

from gensim.summarization import summarize

text = ("In late summer 1945, guests are gathered for the wedding reception of Don Vito Corleones "
        "daughter Connie (Talia Shire) and Carlo Rizzi (Gianni Russo). Vito (Marlon Brando), "
        "the head of the Corleone Mafia family, is known to friends and associates as Godfather. "
        "He and Tom Hagen (Robert Duvall), the Corleone family lawyer, are hearing requests for favors "
        "because, according to Italian tradition, no Sicilian can refuse a request on his daughter's "
        "wedding day. One of the men who asks the Don for a favor is Amerigo Bonasera, a successful "
        "mortician and acquaintance of the Don, whose daughter was brutally beaten by two young men "
        "because she refused their advances; the men received minimal punishment from the presiding "
        "judge. The Don is disappointed in Bonasera, who'd avoided most contact with the Don due to "
        "Corleone's nefarious business dealings. The Don's wife is godmother to Bonasera's shamed "
        "daughter, a relationship the Don uses to extract new loyalty from the undertaker. The Don "
        "agrees to have his men punish the young men responsible (in a non-lethal manner) in return "
        "for future service if necessary.")

print(summarize(text))

When we run the above program we get the following output −

He and Tom Hagen (Robert Duvall), the Corleone family lawyer, are hearing requests for favors because, according to Italian tradition, no Sicilian can refuse a request on his daughter's wedding day.

Extracting Keywords

We can also extract keywords from a body of text by using the keywords function from the gensim library, applied to the same text variable as above.

from gensim.summarization import keywords

print(keywords(text))

When we run the above program, we get the following output −

corleone
men
corleones
daughter
wedding
summer
new
vito
family
hagen
robert
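The summary length can also be tuned. In gensim's pre-4.0 summarization module, summarize accepts ratio and word_count arguments and keywords accepts a words argument; below is a minimal sketch, reusing the text variable defined above.

from gensim.summarization import summarize, keywords

# Keep roughly 20% of the original sentences
print(summarize(text, ratio=0.2))

# Cap the summary at about 25 words instead
print(summarize(text, word_count=25))

# Return only the top 3 keywords
print(keywords(text, words=3))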

Python – Stemming Algorithms

In the areas of Natural Language Processing we come across situations where two or more words have a common root. For example, the three words agreed, agreeing and agreeable have the same root word agree. A search involving any of these words should treat them as the same word, namely the root word. So it becomes essential to link all such words to their root word. The NLTK library has methods to do this linking and to show the root word.

There are three widely used stemming algorithms available in nltk. They give slightly different results. The below example shows the use of all three stemming algorithms and their results.

import nltk
from nltk.stem.porter import PorterStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem import SnowballStemmer

porter_stemmer = PorterStemmer()
lanca_stemmer = LancasterStemmer()
sb_stemmer = SnowballStemmer("english")

word_data = "Aging head of famous crime family decides to transfer his position to one of his subalterns"

# First word tokenization
nltk_tokens = nltk.word_tokenize(word_data)

# Next find the roots of the words
print("***PorterStemmer****\n")
for w_port in nltk_tokens:
   print("Actual: %s || Stem: %s" % (w_port, porter_stemmer.stem(w_port)))

print("\n***LancasterStemmer****\n")
for w_lanca in nltk_tokens:
   print("Actual: %s || Stem: %s" % (w_lanca, lanca_stemmer.stem(w_lanca)))

print("\n***SnowballStemmer****\n")
for w_snow in nltk_tokens:
   print("Actual: %s || Stem: %s" % (w_snow, sb_stemmer.stem(w_snow)))

When we run the above program we get the following output −

***PorterStemmer****

Actual: Aging || Stem: age
Actual: head || Stem: head
Actual: of || Stem: of
Actual: famous || Stem: famou
Actual: crime || Stem: crime
Actual: family || Stem: famili
Actual: decides || Stem: decid
Actual: to || Stem: to
Actual: transfer || Stem: transfer
Actual: his || Stem: hi
Actual: position || Stem: posit
Actual: to || Stem: to
Actual: one || Stem: one
Actual: of || Stem: of
Actual: his || Stem: hi
Actual: subalterns || Stem: subaltern

***LancasterStemmer****

Actual: Aging || Stem: ag
Actual: head || Stem: head
Actual: of || Stem: of
Actual: famous || Stem: fam
Actual: crime || Stem: crim
Actual: family || Stem: famy
Actual: decides || Stem: decid
Actual: to || Stem: to
Actual: transfer || Stem: transf
Actual: his || Stem: his
Actual: position || Stem: posit
Actual: to || Stem: to
Actual: one || Stem: on
Actual: of || Stem: of
Actual: his || Stem: his
Actual: subalterns || Stem: subaltern

***SnowballStemmer****

Actual: Aging || Stem: age
Actual: head || Stem: head
Actual: of || Stem: of
Actual: famous || Stem: famous
Actual: crime || Stem: crime
Actual: family || Stem: famili
Actual: decides || Stem: decid
Actual: to || Stem: to
Actual: transfer || Stem: transfer
Actual: his || Stem: his
Actual: position || Stem: posit
Actual: to || Stem: to
Actual: one || Stem: one
Actual: of || Stem: of
Actual: his || Stem: his
Actual: subalterns || Stem: subaltern
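To see the linking described in the introduction, the sketch below stems the words agreed, agreeing and agreeable with each of the three algorithms; note that the algorithms do not always agree on the exact stem they produce.

from nltk.stem.porter import PorterStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem import SnowballStemmer

words = ["agreed", "agreeing", "agreeable"]
for stemmer in (PorterStemmer(), LancasterStemmer(), SnowballStemmer("english")):
   # Print the stem each algorithm assigns to every variant
   print(type(stemmer).__name__, [stemmer.stem(w) for w in words])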

Python – Sentiment Analysis

Sentiment analysis is about analysing the general opinion of the audience. It may be a reaction to a piece of news, a movie or a tweet about some matter under discussion. Generally, such reactions are taken from social media and clubbed into a file to be analysed through NLP. We take the simple case of defining positive and negative words first, and then analyse sentences built from those words. We use the sentiment utilities from nltk. We first carry out the analysis with single words, then with paired words, also called bigrams. Finally, we mark the words with negative sentiment as defined by the mark_negation function.

import nltk
import nltk.sentiment.util

# Analysing single words
def OneWord():
   positive_words = ['good', 'progress', 'luck']
   text = 'Hard Work brings progress and good luck.'.split()
   analysis = nltk.sentiment.util.extract_unigram_feats(text, positive_words)
   print(' ** Sentiment with one word **\n')
   print(analysis)

# Analysing a pair of words
def WithBigrams():
   word_sets = [('Regular', 'fit'), ('fit', 'fine')]
   text = 'Regular exercise makes you fit and fine'.split()
   analysis = nltk.sentiment.util.extract_bigram_feats(text, word_sets)
   print('\n*** Sentiment with bigrams ***\n')
   print(analysis)

# Analysing the negation words
def NegativeWord():
   text = 'Lack of good health can not bring success to students'.split()
   analysis = nltk.sentiment.util.mark_negation(text)
   print('\n**Sentiment with Negative words**\n')
   print(analysis)

OneWord()
WithBigrams()
NegativeWord()

When we run the above program we get the following output −

** Sentiment with one word **

{'contains(luck)': False, 'contains(good)': True, 'contains(progress)': True}

*** Sentiment with bigrams ***

{'contains(fit - fine)': False, 'contains(Regular - fit)': False}

**Sentiment with Negative words**

['Lack', 'of', 'good', 'health', 'can', 'not', 'bring_NEG', 'success_NEG', 'to_NEG', 'students_NEG']
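Beyond hand-made word lists, nltk also ships VADER, a pre-trained rule-based scorer that returns graded polarity scores instead of boolean features. Below is a minimal sketch, assuming the vader_lexicon resource is available (it can be fetched with nltk.download).

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')   # one-time download of the scoring lexicon

sia = SentimentIntensityAnalyzer()
# polarity_scores returns neg/neu/pos components and a compound score in [-1, 1]
print(sia.polarity_scores("Hard work brings progress and good luck."))
print(sia.polarity_scores("Lack of good health can not bring success."))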

Python – Text Classification

Many times we need to categorise the available text into various categories by some pre-defined criteria. nltk provides such a feature as part of various corpora. In the below example we look at the movie review corpus and check the categorization available.

# Let's see how the movie reviews are classified
from nltk.corpus import movie_reviews

all_cats = []
for w in movie_reviews.categories():
   all_cats.append(w.lower())
print(all_cats)

When we run the above program, we get the following output −

['neg', 'pos']

Now let's look at the content of one of the files with a positive review. The sentences in this file are tokenized and we print the first four sentences to see the sample.

from nltk.corpus import movie_reviews
from nltk.tokenize import sent_tokenize

sample = movie_reviews.raw("pos/cv944_13521.txt")
token = sent_tokenize(sample)

for lines in range(4):
   print(token[lines])

When we run the above program we get the following output −

meteor threat set to blow away all volcanoes & twisters !
summer is here again !
this season could probably be the most ambitious = season this decade with hollywood churning out films like deep impact , = godzilla , the x-files , armageddon , the truman show , all of which has but = one main aim , to rock the box office .
leading the pack this summer is = deep impact , one of the first few film releases from the = spielberg-katzenberg-geffen's dreamworks production company .

(The stray "=" characters come from the raw review file itself.)

Next, we tokenize the words across all the reviews and find the most common words by using the FreqDist function from nltk.

import nltk
from nltk.corpus import movie_reviews

all_words = []
for w in movie_reviews.words():
   all_words.append(w.lower())

all_words = nltk.FreqDist(all_words)
print(all_words.most_common(10))

When we run the above program we get the following output −

[(',', 77717), ('the', 76529), ('.', 65876), ('a', 38106), ('and', 35576), ('of', 34123), ('to', 31937), ("'", 30585), ('is', 25195), ('in', 21822)]
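The categories and word frequencies above are the raw ingredients of an actual classifier. Below is a minimal sketch of the classic NLTK recipe: training a Naive Bayes model on boolean bag-of-words features. The 2000-word feature cutoff and the 100-document test split are arbitrary choices for illustration.

import random
import nltk
from nltk.corpus import movie_reviews

# Each document is (list_of_words, category)
documents = [(list(movie_reviews.words(fid)), cat)
             for cat in movie_reviews.categories()
             for fid in movie_reviews.fileids(cat)]
random.shuffle(documents)

# Use the 2000 most frequent words as boolean features
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = [w for w, _ in all_words.most_common(2000)]

def document_features(words):
   present = set(words)
   return {'contains(%s)' % w: (w in present) for w in word_features}

featuresets = [(document_features(d), c) for d, c in documents]
train_set, test_set = featuresets[100:], featuresets[:100]

classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))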

Python – Reading RSS feed

RSS (Rich Site Summary) is a format for delivering regularly changing web content. Many news-related sites, weblogs and other online publishers syndicate their content as an RSS feed to whoever wants it. In Python we take the help of the below package to read and process these feeds.

pip install feedparser

Feed Structure

In the below example we get the structure of the feed so that we can analyse further which parts of the feed we want to process.

import feedparser

NewsFeed = feedparser.parse("https://timesofindia.indiatimes.com/rssfeedstopstories.cms")
entry = NewsFeed.entries[1]
print(entry.keys())

When we run the above program, we get the following output −

['summary_detail', 'published_parsed', 'links', 'title', 'summary', 'guidislink', 'title_detail', 'link', 'published', 'id']

Feed Title and Posts

In the below example we read the number of posts in the RSS feed and the title of one post.

import feedparser

NewsFeed = feedparser.parse("https://timesofindia.indiatimes.com/rssfeedstopstories.cms")

print('Number of RSS posts :', len(NewsFeed.entries))

entry = NewsFeed.entries[1]
print('Post Title :', entry.title)

When we run the above program we get the following output −

Number of RSS posts : 5
Post Title : Cong-JD(S) in SC over choice of pro tem speaker

Feed Details

Based on the above entry structure we can derive the necessary details from the feed using a Python program as shown below. As entry is a dictionary, we utilise its keys to produce the values needed.

import feedparser

NewsFeed = feedparser.parse("https://timesofindia.indiatimes.com/rssfeedstopstories.cms")
entry = NewsFeed.entries[1]

print(entry.published)
print("******")
print(entry.summary)
print("------News Link--------")
print(entry.link)

When we run the above program we get the following output −

Fri, 18 May 2018 20:13:13 GMT
******
Controversy erupted on Friday over the appointment of BJP MLA K G Bopaiah as pro tem speaker for the assembly, with Congress and JD(S) claiming the move went against convention that the post should go to the most senior member of the House. The combine approached the SC to challenge the appointment. Hearing is scheduled for 10:30 am today.
------News Link--------
https://timesofindia.indiatimes.com/india/congress-jds-in-sc-over-bjp-mla-made-pro-tem-speaker-hearing-at-1030-am/articleshow/64228740.cms
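Since NewsFeed.entries is an ordinary list, we can also iterate over it to process every post rather than a single one; a minimal sketch printing each title with its link:

import feedparser

NewsFeed = feedparser.parse("https://timesofindia.indiatimes.com/rssfeedstopstories.cms")

for entry in NewsFeed.entries:
   # Each entry exposes the same fields we inspected with entry.keys()
   print(entry.title)
   print(entry.link)
   print("---")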

Python – Chunk Classification

Classification-based chunking involves classifying the text as groups of words rather than as individual words. A simple scenario is tagging the text in sentences. We will use a corpus to demonstrate the classification. We choose the corpus conll2000, which has data from the Wall Street Journal (WSJ) corpus used for noun-phrase-based chunking. First, we add the corpus to our environment using the following command.

import nltk
nltk.download('conll2000')

Let's have a look at the first few sentences in this corpus.

from nltk.corpus import conll2000

x = conll2000.sents()
for i in range(3):
   print(x[i])
   print('\n')

When we run the above program we get the following output −

['Confidence', 'in', 'the', 'pound', 'is', 'widely', 'expected', 'to', 'take', 'another', 'sharp', 'dive', 'if', 'trade', 'figures', 'for', 'September', ',', 'due', 'for', 'release', 'tomorrow', ',', 'fail', 'to', 'show', 'a', 'substantial', 'improvement', 'from', 'July', 'and', 'August', "'s", 'near-record', 'deficits', '.']

['Chancellor', 'of', 'the', 'Exchequer', 'Nigel', 'Lawson', "'s", 'restated', 'commitment', 'to', 'a', 'firm', 'monetary', 'policy', 'has', 'helped', 'to', 'prevent', 'a', 'freefall', 'in', 'sterling', 'over', 'the', 'past', 'week', '.']

['But', 'analysts', 'reckon', 'underlying', 'support', 'for', 'sterling', 'has', 'been', 'eroded', 'by', 'the', 'chancellor', "'s", 'failure', 'to', 'announce', 'any', 'new', 'policy', 'measures', 'in', 'his', 'Mansion', 'House', 'speech', 'last', 'Thursday', '.']

Next we use the function tagged_sents() to get the sentences tagged with their part-of-speech classifiers.

from nltk.corpus import conll2000

x = conll2000.tagged_sents()
for i in range(3):
   print(x[i])
   print('\n')

When we run the above program we get the following output −

[('Confidence', 'NN'), ('in', 'IN'), ('the', 'DT'), ('pound', 'NN'), ('is', 'VBZ'), ('widely', 'RB'), ('expected', 'VBN'), ('to', 'TO'), ('take', 'VB'), ('another', 'DT'), ('sharp', 'JJ'), ('dive', 'NN'), ('if', 'IN'), ('trade', 'NN'), ('figures', 'NNS'), ('for', 'IN'), ('September', 'NNP'), (',', ','), ('due', 'JJ'), ('for', 'IN'), ('release', 'NN'), ('tomorrow', 'NN'), (',', ','), ('fail', 'VB'), ('to', 'TO'), ('show', 'VB'), ('a', 'DT'), ('substantial', 'JJ'), ('improvement', 'NN'), ('from', 'IN'), ('July', 'NNP'), ('and', 'CC'), ('August', 'NNP'), ("'s", 'POS'), ('near-record', 'JJ'), ('deficits', 'NNS'), ('.', '.')]

[('Chancellor', 'NNP'), ('of', 'IN'), ('the', 'DT'), ('Exchequer', 'NNP'), ('Nigel', 'NNP'), ('Lawson', 'NNP'), ("'s", 'POS'), ('restated', 'VBN'), ('commitment', 'NN'), ('to', 'TO'), ('a', 'DT'), ('firm', 'NN'), ('monetary', 'JJ'), ('policy', 'NN'), ('has', 'VBZ'), ('helped', 'VBN'), ('to', 'TO'), ('prevent', 'VB'), ('a', 'DT'), ('freefall', 'NN'), ('in', 'IN'), ('sterling', 'NN'), ('over', 'IN'), ('the', 'DT'), ('past', 'JJ'), ('week', 'NN'), ('.', '.')]

[('But', 'CC'), ('analysts', 'NNS'), ('reckon', 'VBP'), ('underlying', 'VBG'), ('support', 'NN'), ('for', 'IN'), ('sterling', 'NN'), ('has', 'VBZ'), ('been', 'VBN'), ('eroded', 'VBN'), ('by', 'IN'), ('the', 'DT'), ('chancellor', 'NN'), ("'s", 'POS'), ('failure', 'NN'), ('to', 'TO'), ('announce', 'VB'), ('any', 'DT'), ('new', 'JJ'), ('policy', 'NN'), ('measures', 'NNS'), ('in', 'IN'), ('his', 'PRP$'), ('Mansion', 'NNP'), ('House', 'NNP'), ('speech', 'NN'), ('last', 'JJ'), ('Thursday', 'NNP'), ('.', '.')]
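The tagged sentences above are exactly what a chunker consumes. The sketch below runs a hand-written noun-phrase grammar over the first tagged sentence; the pattern (an optional determiner, any adjectives, then a noun) is a simplified illustration, not the corpus's own chunk definition.

import nltk
from nltk.corpus import conll2000

# A toy NP grammar: optional determiner, any number of adjectives, then a noun
grammar = "NP: {<DT>?<JJ>*<NN.*>}"
chunk_parser = nltk.RegexpParser(grammar)

tagged_sentence = conll2000.tagged_sents()[0]
tree = chunk_parser.parse(tagged_sentence)
print(tree)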

Python – Constrained Search

Many times, after we get the result of a search, we need to search one level deeper into part of the existing result. For example, in a given body of text we aim to get the web addresses and also extract the different parts of each web address, like the protocol, the domain name and so on. In such a scenario we take the help of the group function, which divides the search result into various groups based on the regular expression assigned. We create such group expressions by putting parentheses around each searchable part of the pattern, excluding the fixed text we want to match.

import re

text = "The web address is https://www.tutorialspoint.com"

# Taking "://" and "." to separate the groups
result = re.search(r'([\w.-]+)://([\w.-]+)\.([\w.-]+)', text)
if result:
   print("The main web Address: ", result.group())
   print("The protocol: ", result.group(1))
   print("The domain name: ", result.group(2))
   print("The TLD: ", result.group(3))

When we run the above program, we get the following output −

The main web Address:  https://www.tutorialspoint.com
The protocol:  https
The domain name:  www.tutorialspoint
The TLD:  com
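Positional group numbers become hard to follow as patterns grow. Python's re module also supports named groups with the (?P<name>...) syntax, so the same search can be written more readably; a minimal sketch:

import re

text = "The web address is https://www.tutorialspoint.com"

# The same pattern, with (?P<name>...) labels instead of positional groups
result = re.search(r'(?P<protocol>[\w.-]+)://(?P<domain>[\w.-]+)\.(?P<tld>[\w.-]+)', text)
if result:
   print("The protocol: ", result.group('protocol'))
   print("The domain name: ", result.group('domain'))
   print("The TLD: ", result.group('tld'))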

Python – Tagging Words

Tagging is an essential feature of text processing, where we tag words with their grammatical categorization. We take the help of tokenization and the pos_tag function to create the tags for each word.

import nltk

text = nltk.word_tokenize("A Python is a serpent which eats eggs from the nest")
tagged_text = nltk.pos_tag(text)
print(tagged_text)

When we run the above program, we get the following output −

[('A', 'DT'), ('Python', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('serpent', 'NN'), ('which', 'WDT'), ('eats', 'VBZ'), ('eggs', 'NNS'), ('from', 'IN'), ('the', 'DT'), ('nest', 'JJS')]

Tag Descriptions

We can describe the meaning of each tag by using the following program which shows the built-in values.

import nltk

nltk.help.upenn_tagset('NN')
nltk.help.upenn_tagset('IN')
nltk.help.upenn_tagset('DT')

When we run the above program, we get the following output −

NN: noun, common, singular or mass
   common-carrier cabbage knuckle-duster Casino afghan shed thermostat
   investment slide humour falloff slick wind hyena override subhumanity
   machinist ...
IN: preposition or conjunction, subordinating
   astride among uppon whether out inside pro despite on by throughout
   below within for towards near behind atop around if like until below
   next into if beside ...
DT: determiner
   all an another any both del each either every half la many much nary
   neither no some such that the them these this those

Tagging a Corpus

We can also tag a corpus and see the tagged result for each word in that corpus.

import nltk
from nltk.tokenize import sent_tokenize
from nltk.corpus import gutenberg

sample = gutenberg.raw("blake-poems.txt")
tokenized = sent_tokenize(sample)

for i in tokenized[:2]:
   words = nltk.word_tokenize(i)
   tagged = nltk.pos_tag(words)
   print(tagged)

When we run the above program we get the following output −

[('[', 'JJ'), ('Poems', 'NNP'), ('by', 'IN'), ('William', 'NNP'), ('Blake', 'NNP'), ('1789', 'CD'), (']', 'NNP'), ('SONGS', 'NNP'), ('OF', 'NNP'), ('INNOCENCE', 'NNP'), ('AND', 'NNP'), ('OF', 'NNP'), ('EXPERIENCE', 'NNP'), ('and', 'CC'), ('THE', 'NNP'), ('BOOK', 'NNP'), ('of', 'IN'), ('THEL', 'NNP'), ('SONGS', 'NNP'), ('OF', 'NNP'), ('INNOCENCE', 'NNP'), ('INTRODUCTION', 'NNP'), ('Piping', 'VBG'), ('down', 'RP'), ('the', 'DT'), ('valleys', 'NN'), ('wild', 'JJ'), (',', ','), ('Piping', 'NNP'), ('songs', 'NNS'), ('of', 'IN'), ('pleasant', 'JJ'), ('glee', 'NN'), (',', ','), ('On', 'IN'), ('a', 'DT'), ('cloud', 'NN'), ('I', 'PRP'), ('saw', 'VBD'), ('a', 'DT'), ('child', 'NN'), (',', ','), ('And', 'CC'), ('he', 'PRP'), ('laughing', 'VBG'), ('said', 'VBD'), ('to', 'TO'), ('me', 'PRP'), (':', ':'), ('``', '``'), ('Pipe', 'VB'), ('a', 'DT'), ('song', 'NN'), ('about', 'IN'), ('a', 'DT'), ('Lamb', 'NN'), ('!', '.'), ("''", "''")]
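When the fine-grained Penn Treebank tags are more detail than needed, pos_tag can map them onto the coarser universal tagset through its tagset parameter. A minimal sketch follows; the universal_tagset resource may need a one-time nltk.download.

import nltk

nltk.download('universal_tagset')   # mapping tables for the coarse tags

text = nltk.word_tokenize("A Python is a serpent which eats eggs from the nest")
# Penn tags such as NNP and VBZ collapse into classes like NOUN and VERB
print(nltk.pos_tag(text, tagset='universal'))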

Python – Frequency Distribution

Counting the frequency of occurrence of a word in a body of text is often needed during text processing. This can be achieved by applying the word_tokenize() function and then counting each token's occurrences in a list, as shown in the below program.

from nltk.tokenize import word_tokenize
from nltk.corpus import gutenberg

sample = gutenberg.raw("blake-poems.txt")
token = word_tokenize(sample)
wlist = []

for i in range(50):
   wlist.append(token[i])

wordfreq = [wlist.count(w) for w in wlist]
print("Pairs\n" + str(list(zip(wlist, wordfreq))))

When we run the above program, we get the following output −

Pairs
[('[', 1), ('Poems', 1), ('by', 1), ('William', 1), ('Blake', 1), ('1789', 1), (']', 1), ('SONGS', 2), ('OF', 3), ('INNOCENCE', 2), ('AND', 1), ('OF', 3), ('EXPERIENCE', 1), ('and', 1), ('THE', 1), ('BOOK', 1), ('of', 2), ('THEL', 1), ('SONGS', 2), ('OF', 3), ('INNOCENCE', 2), ('INTRODUCTION', 1), ('Piping', 2), ('down', 1), ('the', 1), ('valleys', 1), ('wild', 1), (',', 3), ('Piping', 2), ('songs', 1), ('of', 2), ('pleasant', 1), ('glee', 1), (',', 3), ('On', 1), ('a', 2), ('cloud', 1), ('I', 1), ('saw', 1), ('a', 2), ('child', 1), (',', 3), ('And', 1), ('he', 1), ('laughing', 1), ('said', 1), ('to', 1), ('me', 1), (':', 1), ('``', 1)]

Conditional Frequency Distribution

A conditional frequency distribution is used when we want to count words meeting specific criteria under each of several conditions, such as the genre of the text.

import nltk
from nltk.corpus import brown

cfd = nltk.ConditionalFreqDist(
   (genre, word)
   for genre in brown.categories()
   for word in brown.words(categories=genre))

categories = ['hobbies', 'romance', 'humor']
searchwords = ['may', 'might', 'must', 'will']
cfd.tabulate(conditions=categories, samples=searchwords)

When we run the above program, we get the following output −

         may might must will
hobbies  131    22   83  264
romance   11    51   45   43
humor      8     8    9   13
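Counting with list.count inside a comprehension rescans the list for every token. nltk's FreqDist, used earlier in the Text Classification chapter, does the same bookkeeping in a single pass; below is a minimal sketch over the same first 50 tokens.

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import gutenberg

sample = gutenberg.raw("blake-poems.txt")
tokens = word_tokenize(sample)[:50]

# FreqDist tallies every token in one pass
fdist = nltk.FreqDist(tokens)
print(fdist.most_common(5))   # five most frequent of the first 50 tokens
print(fdist['OF'])            # count for one specific token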