Python Text Processing Archives - Page 2 of 4 - Donotsad

Aug 09

Python – Process Word Document

To read a Word document we use the docx module, which is provided by the python-docx package. We first install the package, then write a program that uses the module's functions to read the entire file paragraph by paragraph.

The following command installs the module into our environment:

    pip install python-docx

In the example below we read the content of a Word document by appending the text of each paragraph to a list and finally joining the list into a single string, one paragraph per line.

    import docx

    def readtxt(filename):
        # Open the document and collect the text of every paragraph
        doc = docx.Document(filename)
        fullText = []
        for para in doc.paragraphs:
            fullText.append(para.text)
        return "\n".join(fullText)

    print(readtxt("path/Tutorialspoint.docx"))

When we run the above program, we get the following output:

    Tutorials Point originated from the idea that there exists a class of readers who respond better to online content and prefer to learn new skills at their own pace from the comforts of their drawing rooms.
    The journey commenced with a single tutorial on HTML in 2006 and elated by the response it generated, we worked our way to adding fresh tutorials to our repository which now proudly flaunts a wealth of tutorials and allied articles on topics ranging from programming languages to web designing to academics and much more.

Reading Individual Paragraphs

We can read a specific paragraph from the Word document by indexing the paragraphs attribute. In the example below we print the number of paragraphs and then the text of the paragraph at index 2. Indexing is zero-based and empty paragraphs count as entries, so the index of a paragraph can differ from its visible position on the page.

    import docx

    doc = docx.Document("path/Tutorialspoint.docx")
    print(len(doc.paragraphs))
    print(doc.paragraphs[2].text)

When we run the above program, the second print statement produces the following output:

    The journey commenced with a single tutorial on HTML in 2006 and elated by the response it generated, we worked our way to adding fresh tutorials to our repository which now proudly flaunts a wealth of tutorials and allied articles on topics ranging from programming languages to web designing to academics and much more.
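To make the paragraph indexing above more concrete, here is a minimal sketch that reads paragraph text using only the standard library. It is not how the docx module is meant to be used. It relies on the fact that a .docx file is a ZIP archive whose main text lives in word/document.xml. The helper names read_docx_paragraphs and make_demo_docx are hypothetical, and the demo archive is a stripped-down stand-in that Word itself would not open.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W_URI = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W_NS = "{" + W_URI + "}"


def read_docx_paragraphs(source):
    """Return the text of every <w:p> paragraph in a .docx path or file object."""
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is spread over one or more <w:t> elements
        paragraphs.append("".join(t.text or "" for t in para.iter(W_NS + "t")))
    return paragraphs


def make_demo_docx(texts):
    """Build an in-memory archive holding only word/document.xml (demo only)."""
    body = "".join(
        "<w:p><w:r><w:t>{}</w:t></w:r></w:p>".format(t) if t else "<w:p/>"
        for t in texts
    )
    xml = '<w:document xmlns:w="{}"><w:body>{}</w:body></w:document>'.format(W_URI, body)
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    buffer.seek(0)
    return buffer


# An empty paragraph still occupies an index, so the entry at index 2 is the
# third paragraph even though it is only the second visible block of text.
demo = make_demo_docx(["First paragraph", "", "Third entry"])
paragraphs = read_docx_paragraphs(demo)
print(len(paragraphs))
print(paragraphs[2])
```

Unlike the docx module, this sketch also picks up paragraphs nested inside tables and ignores styles, so it is best treated as an illustration of the file layout rather than a replacement.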

Aug 09

Python – Process PDF

Python can read PDF files and print their content after extracting the text. For that we first install the required module, PyPDF2. The command below installs it; pip must already be available in your Python environment.

    pip install pypdf2

After a successful installation we can read PDF files using the classes and methods the module provides. The names used below are those of PyPDF2 3.0 and later; earlier releases used PdfFileReader, getPage(), extractText(), getNumPages() and getPageNumber() for the same operations.

    import PyPDF2

    pdfName = "path/Tutorialspoint.pdf"
    read_pdf = PyPDF2.PdfReader(pdfName)
    # Pages are indexed from zero
    page = read_pdf.pages[0]
    page_content = page.extract_text()
    print(page_content)

When we run the above program, we get the following output:

    Tutorials Point originated from the idea that there exists a class of readers who respond better to online content and prefer to learn new skills at their own pace from the comforts of their drawing rooms. The journey commenced with a single tutorial on HTML in 2006 and elated by the response it generated, we worked our way to adding fresh tutorials to our repository which now proudly flaunts a wealth of tutorials and allied articles on topics ranging from programming languages to web designing to academics and much more.

Reading Multiple Pages

To read a PDF with multiple pages and print each page under its page number, we loop over the pages and call get_page_number(), which returns the zero-based index of a page. In the example below we read a PDF file which has two pages. The contents are printed under two separate page headings.

    import PyPDF2

    pdfName = "Path/Tutorialspoint2.pdf"
    read_pdf = PyPDF2.PdfReader(pdfName)
    for i in range(len(read_pdf.pages)):
        page = read_pdf.pages[i]
        # get_page_number() is zero-based, so add 1 for display
        print("Page No - " + str(1 + read_pdf.get_page_number(page)))
        page_content = page.extract_text()
        print(page_content)

When we run the above program, we get the following output:

    Page No - 1
    Tutorials Point originated from the idea that there exists a class of readers who respond better to online content and prefer to learn new skills at their own pace from the comforts of their drawing rooms.
    Page No - 2
    The journey commenced with a single tutorial on HTML in 2006 and elated by the response it generated, we worked our way to adding fresh tutorials to our repository which now proudly flaunts a wealth of tutorials and allied articles on topics ranging from programming languages to web designing to academics and much more.
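The page-heading loop above can also be written so that the numbering is independent of the PDF library. The sketch below is one possible arrangement, not part of PyPDF2: the helper label_pages is a hypothetical name, and the two page strings are placeholders standing in for text that a PDF reader would have extracted.

```python
def label_pages(page_texts):
    """Join page texts under 'Page No - N' headings, counting from 1."""
    lines = []
    # enumerate(..., start=1) gives the display number directly,
    # so no "+ 1" adjustment of a zero-based index is needed
    for number, text in enumerate(page_texts, start=1):
        lines.append("Page No - " + str(number))
        lines.append(text)
    return "\n".join(lines)


# Placeholder strings standing in for extracted page text
pages = ["Text of the first page.", "Text of the second page."]
report = label_pages(pages)
print(report)
```

Keeping the formatting in its own function makes it possible to test the headings without a PDF file at hand.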

Aug 09

Python – Text Translation

Translating text from one language to another is increasingly common on websites that cater to an international audience. The Python package which helps us do this is called translate. It provides translation for major languages and can be installed as follows.

    pip install translate

Below is an example of translating a simple sentence from English to German. The default source language is English.

    from translate import Translator

    translator = Translator(to_lang="German")
    translation = translator.translate("Good Morning!")
    print(translation)

When we run the above program, we get the following output:

    Guten Morgen!

Between Any Two Languages

If we need to specify both the source language and the target language, we pass from_lang and to_lang as in the program below.

    from translate import Translator

    translator = Translator(from_lang="german", to_lang="spanish")
    translation = translator.translate("Guten Morgen")
    print(translation)

When we run the above program, we get the following output:

    Buenos días

Aug 09

Python – Constrained Search

Often, after we get the result of a search, we need to search one level deeper into part of that result. For example, in a given body of text we may want to find web addresses and also extract the different parts of each address, such as the protocol and the domain name. In such a scenario we use the group() method of the match object, which divides the search result into groups based on the regular expression. We create the groups by placing parentheses around each part of the pattern we want to capture, leaving the fixed separators outside the parentheses.

    import re

    text = "The web address is https://www.tutorialspoint.com"

    # Taking "://" and "." to separate the groups
    result = re.search(r"([\w.-]+)://([\w.-]+)\.([\w.-]+)", text)
    if result:
        print("The main web Address:", result.group())
        print("The protocol:", result.group(1))
        print("The domain name:", result.group(2))
        print("The TLD:", result.group(3))

When we run the above program, we get the following output:

    The main web Address: https://www.tutorialspoint.com
    The protocol: https
    The domain name: www.tutorialspoint
    The TLD: com
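Numbered groups become hard to follow as a pattern grows. As a small sketch of an alternative, the same expression can use named groups, written (?P<name>...), and return all parts at once with groupdict(). The group names protocol, domain and tld and the helper split_web_address are choices made for this illustration.

```python
import re

# Same pattern as above, with each captured part given a name
URL_PATTERN = re.compile(
    r"(?P<protocol>[\w.-]+)://(?P<domain>[\w.-]+)\.(?P<tld>[\w.-]+)"
)


def split_web_address(text):
    """Return the named parts of the first web address in text, or None."""
    match = URL_PATTERN.search(text)
    return match.groupdict() if match else None


parts = split_web_address("The web address is https://www.tutorialspoint.com")
print(parts)
```

Because the last dot in the address separates the final two groups, the domain group keeps the leading "www." just as group(2) does in the numbered version.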

Aug 09

Python – Tagging Words

Tagging is an essential feature of text processing in which we assign each word its grammatical category (part of speech). We use tokenization and the pos_tag function to create the tags for each word. The tokenizer and tagger models must be downloaded once with nltk.download() before the programs below will run.

    import nltk

    text = nltk.word_tokenize("A Python is a serpent which eats eggs from the nest")
    tagged_text = nltk.pos_tag(text)
    print(tagged_text)

When we run the above program, we get the following output:

    [('A', 'DT'), ('Python', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('serpent', 'NN'), ('which', 'WDT'), ('eats', 'VBZ'), ('eggs', 'NNS'), ('from', 'IN'), ('the', 'DT'), ('nest', 'JJS')]

Tag Descriptions

We can look up the meaning of each tag with the following program, which prints the built-in description and some example words for the tag.

    import nltk

    nltk.help.upenn_tagset('NN')
    nltk.help.upenn_tagset('IN')
    nltk.help.upenn_tagset('DT')

When we run the above program, we get the following output:

    NN: noun, common, singular or mass
        common-carrier cabbage knuckle-duster Casino afghan shed thermostat
        investment slide humour falloff slick wind hyena override subhumanity
        machinist ...
    IN: preposition or conjunction, subordinating
        astride among uppon whether out inside pro despite on by throughout
        below within for towards near behind atop around if like until below
        next into if beside ...
    DT: determiner
        all an another any both del each either every half la many much nary
        neither no some such that the them these this those

Tagging a Corpus

We can also tag the text of a corpus and see the tag assigned to each word in it. The example below tags the first two sentences of a Gutenberg corpus file.

    import nltk
    from nltk.tokenize import sent_tokenize
    from nltk.corpus import gutenberg

    sample = gutenberg.raw("blake-poems.txt")
    tokenized = sent_tokenize(sample)
    for i in tokenized[:2]:
        words = nltk.word_tokenize(i)
        tagged = nltk.pos_tag(words)
        print(tagged)

When we run the above program, we get the following output:

    [('[', 'JJ'), ('Poems', 'NNP'), ('by', 'IN'), ('William', 'NNP'), ('Blake', 'NNP'), ('1789', 'CD'), (']', 'NNP'), ('SONGS', 'NNP'), ('OF', 'NNP'), ('INNOCENCE', 'NNP'), ('AND', 'NNP'), ('OF', 'NNP'), ('EXPERIENCE', 'NNP'), ('and', 'CC'), ('THE', 'NNP'), ('BOOK', 'NNP'), ('of', 'IN'), ('THEL', 'NNP'), ('SONGS', 'NNP'), ('OF', 'NNP'), ('INNOCENCE', 'NNP'), ('INTRODUCTION', 'NNP'), ('Piping', 'VBG'), ('down', 'RP'), ('the', 'DT'), ('valleys', 'NN'), ('wild', 'JJ'), (',', ','), ('Piping', 'NNP'), ('songs', 'NNS'), ('of', 'IN'), ('pleasant', 'JJ'), ('glee', 'NN'), (',', ','), ('On', 'IN'), ('a', 'DT'), ('cloud', 'NN'), ('I', 'PRP'), ('saw', 'VBD'), ('a', 'DT'), ('child', 'NN'), (',', ','), ('And', 'CC'), ('he', 'PRP'), ('laughing', 'VBG'), ('said', 'VBD'), ('to', 'TO'), ('me', 'PRP'), (':', ':'), ('``', '``'), ('Pipe', 'VB'), ('a', 'DT'), ('song', 'NN'), ('about', 'IN'), ('a', 'DT'), ('Lamb', 'NN'), ('!', '.'), ("''", "''")]
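Once text is tagged, the (word, tag) pairs are ordinary Python tuples and can be summarised without NLTK. The sketch below works on the tagged sentence from the first example in this article, copied in as a literal so that it runs on its own. Treating every tag that starts with "NN" as a noun is an assumption that follows the Penn Treebank naming (NN, NNS, NNP).

```python
from collections import Counter

# Tagged output of the first example, copied in as a literal
tagged_text = [
    ("A", "DT"), ("Python", "NNP"), ("is", "VBZ"), ("a", "DT"),
    ("serpent", "NN"), ("which", "WDT"), ("eats", "VBZ"), ("eggs", "NNS"),
    ("from", "IN"), ("the", "DT"), ("nest", "JJS"),
]

# How often each tag occurs in the sentence
tag_counts = Counter(tag for _, tag in tagged_text)

# Words whose tag belongs to the noun family (NN, NNS, NNP, ...)
nouns = [word for word, tag in tagged_text if tag.startswith("NN")]

print(tag_counts.most_common(2))
print(nouns)
```

Note that "nest" is missing from the noun list because the tagger labelled it JJS in the output above; counts built on tags inherit any tagging mistakes.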

Aug 09

Python – Bigrams

Some English words occur together more frequently than others, for example: sky high, do or die, best performance, heavy rain. In a text document we may need to identify such pairs of words, which can help in sentiment analysis. First, we need to generate word pairs from the existing sentence while maintaining the order in which the words appear. Such pairs are called bigrams. The NLTK library provides a bigrams function which generates these pairs.

Example

    import nltk

    word_data = "The best performance can bring in sky high success."
    nltk_tokens = nltk.word_tokenize(word_data)
    print(list(nltk.bigrams(nltk_tokens)))

When we run the above program, we get the following output:

    [('The', 'best'), ('best', 'performance'), ('performance', 'can'), ('can', 'bring'), ('bring', 'in'), ('in', 'sky'), ('sky', 'high'), ('high', 'success'), ('success', '.')]

This result can be used to compute statistics on how frequently such pairs occur in a given text. Those frequencies can then be correlated with the general sentiment of the descriptions present in the body of the text.
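The pairing itself is simple enough to sketch with the standard library, which also makes the frequency counting mentioned above concrete. The code below is an illustration, not NLTK's implementation: tokenize is a hypothetical regex-based stand-in for nltk.word_tokenize that only separates words from punctuation, and the second sentence is made up for the counting step.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into words and single punctuation marks (simplified)."""
    return re.findall(r"\w+|[^\w\s]", text)


def bigrams(tokens):
    """Pair every token with its successor, keeping the original order."""
    return list(zip(tokens, tokens[1:]))


tokens = tokenize("The best performance can bring in sky high success.")
pairs = bigrams(tokens)
print(pairs)

# Counting how often each pair occurs in a made-up sentence
sample = tokenize("heavy rain today and heavy rain tomorrow")
pair_counts = Counter(bigrams(sample))
print(pair_counts.most_common(1))
```

Zipping the token list with itself shifted by one position yields one pair fewer than there are tokens, which matches the nine pairs in the NLTK output for the ten tokens of the example sentence.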

Aug 09

Python – Search and Match

Regular expressions offer two fundamental operations which appear similar but differ significantly. re.match() checks for a match only at the beginning of the string, while re.search() checks for a match anywhere in the string. The distinction matters in text processing, where we often have to write exactly the right regular expression to retrieve a chunk of text, for example for sentiment analysis.

    import re

    if re.search("tor", "Tutorial"):
        print("1. search result found anywhere in the string")

    if re.match("Tut", "Tutorial"):
        print("2. Match with beginning of string")

    if not re.match("tor", "Tutorial"):
        print("3. No match with match if not beginning")

    # Search as Match: "^" anchors the search to the start of the string
    if not re.search("^tor", "Tutorial"):
        print("4. search as match")

When we run the above program, we get the following output:

    1. search result found anywhere in the string
    2. Match with beginning of string
    3. No match with match if not beginning
    4. search as match
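The difference is easier to see when the position of the match is shown as well. The sketch below uses a hypothetical helper, describe, to run the same pattern through re.search(), re.match() and also re.fullmatch(), which succeeds only when the pattern covers the entire string.

```python
import re


def describe(pattern, text):
    """Report how search, match and fullmatch treat the same pattern."""
    found = re.search(pattern, text)
    return {
        # (start, end) of the first occurrence anywhere in the string
        "search_span": found.span() if found else None,
        # True only if the pattern matches at position 0
        "match": re.match(pattern, text) is not None,
        # True only if the pattern matches the whole string
        "fullmatch": re.fullmatch(pattern, text) is not None,
    }


print(describe("tor", "Tutorial"))
print(describe("Tut", "Tutorial"))
print(describe("Tutorial", "Tutorial"))
```

For "tor" the search span starts at index 2, which is exactly why re.match() rejects it: a match at any position other than 0 does not count.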

Aug 09

Python – Filter Duplicate Words

Often we need to analyse a text only for the unique words present in it, so we have to eliminate the duplicate words. This is achieved with the word tokenization available in nltk and Python's built-in set type.

Without preserving the order

In the example below we first tokenize the sentence into words. Then we apply the set() function, which creates an unordered collection of unique elements. The result contains the unique words in no particular order, and the order may differ from run to run.

    import nltk

    word_data = "The Sky is blue also the ocean is blue also Rainbow has a blue colour."

    # First Word tokenization
    nltk_tokens = nltk.word_tokenize(word_data)

    # Applying Set
    no_order = list(set(nltk_tokens))
    print(no_order)

When we run the above program, we get output like the following:

    ['blue', 'Rainbow', 'is', 'Sky', 'colour', 'ocean', 'also', 'a', '.', 'The', 'has', 'the']

Preserving the Order

To remove the duplicates while preserving the order of the words in the sentence, we walk through the words, remember each word we have already seen in a set, and append a word to the result list only the first time it appears.

    import nltk

    word_data = "The Sky is blue also the ocean is blue also Rainbow has a blue colour."

    # First Word tokenization
    nltk_tokens = nltk.word_tokenize(word_data)

    seen = set()   # words encountered so far
    result = []    # unique words in order of first appearance
    for word in nltk_tokens:
        if word not in seen:
            seen.add(word)
            result.append(word)
    print(result)

When we run the above program, we get the following output:

    ['The', 'Sky', 'is', 'blue', 'also', 'the', 'ocean', 'Rainbow', 'has', 'a', 'colour', '.']
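The output above keeps both "The" and "the" because the comparison is case-sensitive. As a minimal sketch of two variations, the code below removes duplicates in order with dict.fromkeys(), which relies on dictionaries keeping insertion order (Python 3.7 and later), and optionally ignores case. The helpers tokenize and unique_in_order are hypothetical names, and tokenize is a simplified regex stand-in for nltk.word_tokenize.

```python
import re


def tokenize(text):
    """Split text into words and single punctuation marks (simplified)."""
    return re.findall(r"\w+|[^\w\s]", text)


def unique_in_order(tokens, ignore_case=False):
    """Drop repeated tokens, keeping the first occurrence of each."""
    if not ignore_case:
        # dict keys are unique and keep insertion order
        return list(dict.fromkeys(tokens))
    seen = set()
    result = []
    for token in tokens:
        key = token.lower()  # compare case-insensitively, keep original spelling
        if key not in seen:
            seen.add(key)
            result.append(token)
    return result


word_data = "The Sky is blue also the ocean is blue also Rainbow has a blue colour."
tokens = tokenize(word_data)
print(unique_in_order(tokens))
print(unique_in_order(tokens, ignore_case=True))
```

Whether "The" and "the" should count as the same word depends on the analysis; for word-frequency work they usually should, while for tasks such as named-entity detection the capitalisation can carry meaning.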

Aug 09

Python – Synonyms and Antonyms

Synonyms and antonyms are available as part of WordNet, which is a lexical database for the English language. It is accessible through the nltk corpora. In WordNet, synonyms are words that denote the same concept and are interchangeable in many contexts, so they are grouped into unordered sets called synsets. We use these synsets to derive the synonyms and antonyms as shown in the programs below.

    from nltk.corpus import wordnet

    synonyms = []
    for syn in wordnet.synsets("Soil"):
        # Every lemma of a synset is a synonym of the word
        for lm in syn.lemmas():
            synonyms.append(lm.name())
    print(set(synonyms))

When we run the above program, we get the following output (the order of a set's elements may vary):

    {'grease', 'filth', 'dirt', 'begrime', 'soil', 'grime', 'land', 'bemire', 'dirty', 'grunge', 'stain', 'territory', 'colly', 'ground'}

To get the antonyms we use the antonyms() method of each lemma.

    from nltk.corpus import wordnet

    antonyms = []
    for syn in wordnet.synsets("ahead"):
        for lm in syn.lemmas():
            # antonyms() returns an empty list when a lemma has no antonym
            if lm.antonyms():
                antonyms.append(lm.antonyms()[0].name())
    print(set(antonyms))

When we run the above program, we get the following output:

    {'backward', 'back'}

Aug 09

Python – WordNet Interface

WordNet is a dictionary of English, similar to a traditional thesaurus. NLTK includes the English WordNet. We can use it as a reference for getting the meanings of words, usage examples and definitions. A group of words with the same meaning is called a synset, and each word in a synset is a lemma. The words in WordNet are organized as nodes and edges, where the nodes represent the word text and the edges represent the relations between the words. Below we will see how we can use the WordNet module.

All Lemmas

    from nltk.corpus import wordnet as wn

    res = wn.synset('locomotive.n.01').lemma_names()
    print(res)

When we run the above program, we get the following output:

    ['locomotive', 'engine', 'locomotive_engine', 'railway_locomotive']

Word Definition

The dictionary definition of a word can be obtained by using the definition() function. It describes the meaning of the word as we would find it in a normal dictionary.

    from nltk.corpus import wordnet as wn

    resdef = wn.synset('ocean.n.01').definition()
    print(resdef)

When we run the above program, we get the following output:

    a large body of water constituting a principal part of the hydrosphere

Usage Examples

We can get example sentences showing how a word is used with the examples() function.

    from nltk.corpus import wordnet as wn

    res_exm = wn.synset('good.n.01').examples()
    print(res_exm)

When we run the above program, we get the following output:

    ['for your own good', "what's the good of worrying?"]

Opposite Words

We get all the opposite words by using the antonyms() method of a lemma.

    from nltk.corpus import wordnet as wn

    # get all the antonyms
    res_a = wn.lemma('horizontal.a.01.horizontal').antonyms()
    print(res_a)

When we run the above program, we get the following output:

    [Lemma('inclined.a.02.inclined'), Lemma('vertical.a.01.vertical')]