Python – Process Word Document

To read a Word document we take the help of the docx module, which is installed via the python-docx package. We first install it as shown below and then write a program that uses the module to read the entire file paragraph by paragraph.

pip install python-docx

In the below example we read the content of a Word document by appending the text of each paragraph to a list and finally joining and printing all of the paragraph text.

import docx

def readtxt(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)

print(readtxt('path/Tutorialspoint.docx'))

When we run the above program, we get the following output −

Tutorials Point originated from the idea that there exists a class of readers who respond better to online content and prefer to learn new skills at their own pace from the comforts of their drawing rooms.
The journey commenced with a single tutorial on HTML in 2006 and elated by the response it generated, we worked our way to adding fresh tutorials to our repository which now proudly flaunts a wealth of tutorials and allied articles on topics ranging from programming languages to web designing to academics and much more.

Reading Individual Paragraphs

We can read a specific paragraph from the Word document using the paragraphs attribute, which is a zero-indexed list of the document's paragraphs. In the below example we print the number of paragraphs and then the paragraph at index 2, which holds the second block of text in the sample document.

import docx

doc = docx.Document('path/Tutorialspoint.docx')
print(len(doc.paragraphs))
print(doc.paragraphs[2].text)

When we run the above program, we get the following output −

The journey commenced with a single tutorial on HTML in 2006 and elated by the response it generated, we worked our way to adding fresh tutorials to our repository which now proudly flaunts a wealth of tutorials and allied articles on topics ranging from programming languages to web designing to academics and much more.
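The same module can also create documents. The following is a minimal sketch, not part of the original tutorial, that writes a few paragraphs to a new file; the file name new.docx is only a placeholder.

import docx

# Create an empty document, add three body paragraphs, then save it to disk.
doc = docx.Document()
doc.add_paragraph('Tutorials Point')
doc.add_paragraph('This paragraph was written from Python.')
doc.add_paragraph('python-docx stores each block of text as a separate paragraph.')
doc.save('new.docx')   # placeholder output path

Reading the file back with the readtxt() function defined above would print the three paragraphs on separate lines.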
Python – Process PDF

Python can read PDF files and print out the content after extracting the text from them. For that we first have to install the required module, which is PyPDF2. Below is the command to install the module; you should already have pip installed in your Python environment.

pip install PyPDF2

On successful installation of this module we can read PDF files using the methods available in it. (The examples below use the classic PyPDF2 interface with PdfFileReader, getPage() and extractText(); in newer releases, PyPDF2 3.x and its successor pypdf, these have been renamed to PdfReader, the pages list and extract_text().)

import PyPDF2

pdfName = 'path/Tutorialspoint.pdf'
read_pdf = PyPDF2.PdfFileReader(pdfName)
page = read_pdf.getPage(0)
page_content = page.extractText()
print(page_content)

When we run the above program, we get the following output −

Tutorials Point originated from the idea that there exists a class of readers who respond better to online content and prefer to learn new skills at their own pace from the comforts of their drawing rooms. The journey commenced with a single tutorial on HTML in 2006 and elated by the response it generated, we worked our way to adding fresh tutorials to our repository which now proudly flaunts a wealth of tutorials and allied articles on topics ranging from programming languages to web designing to academics and much more.

Reading Multiple Pages

To read a PDF with multiple pages and print each page under its own page number, we loop over the page count returned by getNumPages() and label each page using getPageNumber(). In the below example the PDF file has two pages, so the contents are printed under two separate page headings.

import PyPDF2

pdfName = 'path/Tutorialspoint2.pdf'
read_pdf = PyPDF2.PdfFileReader(pdfName)
for i in range(read_pdf.getNumPages()):
    page = read_pdf.getPage(i)
    print('Page No – ' + str(1 + read_pdf.getPageNumber(page)))
    page_content = page.extractText()
    print(page_content)

When we run the above program, we get the following output −

Page No – 1
Tutorials Point originated from the idea that there exists a class of readers who respond better to online content and prefer to learn new skills at their own pace from the comforts of their drawing rooms.

Page No – 2
The journey commenced with a single tutorial on HTML in 2006 and elated by the response it generated, we worked our way to adding fresh tutorials to our repository which now proudly flaunts a wealth of tutorials and allied articles on topics ranging from programming languages to web designing to academics and much more.
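For newer installations, the same page-by-page extraction can be written against the current pypdf interface. The following is a sketch, assuming the pypdf package is installed; the file name is a placeholder.

from pypdf import PdfReader

# PdfReader exposes the pages as a list; extract_text() replaces the
# older extractText() method.
reader = PdfReader('path/Tutorialspoint2.pdf')   # placeholder path
for number, page in enumerate(reader.pages, start=1):
    print('Page No –', number)
    print(page.extract_text())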
Python – Text Translation

Text translation from one language to another is increasingly common for websites that cater to an international audience. The Python package which helps us do this is called translate, and it provides translation for the major languages. It can be installed as follows.

pip install translate

Below is an example of translating a simple sentence from English to German. The source language defaults to English.

from translate import Translator

translator = Translator(to_lang="German")
translation = translator.translate("Good Morning!")
print(translation)

When we run the above program, we get the following output −

Guten Morgen!

Between Any Two Languages

If we need to specify both the source language and the target language, we can do so as in the below program.

from translate import Translator

translator = Translator(from_lang="german", to_lang="spanish")
translation = translator.translate("Guten Morgen")
print(translation)

When we run the above program, we get the following output −

Buenos días
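A single Translator instance can be reused for several strings. The sketch below is an addition to the original example, with made-up sample phrases, and simply calls translate() once per phrase.

from translate import Translator

translator = Translator(to_lang="German")

# Reuse one Translator for several phrases; each call returns the
# translated string for that phrase.
phrases = ["Good Morning!", "Good Night!", "See you tomorrow."]
for phrase in phrases:
    print(phrase, '->', translator.translate(phrase))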
Python – Constrained Search

Many times, after we get the result of a search, we need to search one level deeper into part of the existing result. For example, in a given body of text we aim to get the web addresses and also extract the different parts of each web address, such as the protocol and the domain name. In such a scenario we take the help of the group function, which divides the search result into groups based on the regular expression supplied. We create such a group expression by putting parentheses around each searchable part, excluding the fixed characters we want to match.

import re

text = "The web address is https://www.tutorialspoint.com"

# Taking "://" and "." to separate the groups
result = re.search(r'([\w.-]+)://([\w.-]+)\.([\w.-]+)', text)
if result:
    print("The main web Address: ", result.group())
    print("The protocol: ", result.group(1))
    print("The domain name: ", result.group(2))
    print("The TLD: ", result.group(3))

When we run the above program, we get the following output −

The main web Address:  https://www.tutorialspoint.com
The protocol:  https
The domain name:  www.tutorialspoint
The TLD:  com
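The same constrained search can be made more readable with named groups, which are standard re syntax. The sketch below is an optional variation of the example above, not part of the original page.

import re

text = "The web address is https://www.tutorialspoint.com"

# (?P<name>...) assigns a name to each group, so the parts can be
# retrieved by name instead of by position.
pattern = r'(?P<protocol>[\w.-]+)://(?P<domain>[\w.-]+)\.(?P<tld>[\w.-]+)'
result = re.search(pattern, text)
if result:
    print("The protocol:", result.group('protocol'))
    print("The domain name:", result.group('domain'))
    print("The TLD:", result.group('tld'))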
Python – Tagging Words

Tagging is an essential feature of text processing in which we tag words with their grammatical categories. We take the help of tokenization and the pos_tag function to create the tags for each word. (The tokenizer and tagger models may need to be downloaded first with nltk.download('punkt') and nltk.download('averaged_perceptron_tagger').)

import nltk

text = nltk.word_tokenize("A Python is a serpent which eats eggs from the nest")
tagged_text = nltk.pos_tag(text)
print(tagged_text)

When we run the above program, we get the following output −

[('A', 'DT'), ('Python', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('serpent', 'NN'), ('which', 'WDT'), ('eats', 'VBZ'), ('eggs', 'NNS'), ('from', 'IN'), ('the', 'DT'), ('nest', 'JJS')]

Tag Descriptions

We can look up the meaning of each tag by using the following program, which prints the built-in descriptions.

import nltk

nltk.help.upenn_tagset('NN')
nltk.help.upenn_tagset('IN')
nltk.help.upenn_tagset('DT')

When we run the above program, we get the following output −

NN: noun, common, singular or mass
    common-carrier cabbage knuckle-duster Casino afghan shed thermostat
    investment slide humour falloff slick wind hyena override subhumanity
    machinist ...
IN: preposition or conjunction, subordinating
    astride among uppon whether out inside pro despite on by throughout
    below within for towards near behind atop around if like until below
    next into if beside ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those

Tagging a Corpus

We can also tag corpus data and see the tagged result for each word in that corpus.

import nltk
from nltk.tokenize import sent_tokenize
from nltk.corpus import gutenberg

sample = gutenberg.raw("blake-poems.txt")
tokenized = sent_tokenize(sample)

for i in tokenized[:2]:
    words = nltk.word_tokenize(i)
    tagged = nltk.pos_tag(words)
    print(tagged)

When we run the above program, we get the following output −

[('[', 'JJ'), ('Poems', 'NNP'), ('by', 'IN'), ('William', 'NNP'), ('Blake', 'NNP'), ('1789', 'CD'), (']', 'NNP'), ('SONGS', 'NNP'), ('OF', 'NNP'), ('INNOCENCE', 'NNP'), ('AND', 'NNP'), ('OF', 'NNP'), ('EXPERIENCE', 'NNP'), ('and', 'CC'), ('THE', 'NNP'), ('BOOK', 'NNP'), ('of', 'IN'), ('THEL', 'NNP'), ('SONGS', 'NNP'), ('OF', 'NNP'), ('INNOCENCE', 'NNP'), ('INTRODUCTION', 'NNP'), ('Piping', 'VBG'), ('down', 'RP'), ('the', 'DT'), ('valleys', 'NN'), ('wild', 'JJ'), (',', ','), ('Piping', 'NNP'), ('songs', 'NNS'), ('of', 'IN'), ('pleasant', 'JJ'), ('glee', 'NN'), (',', ','), ('On', 'IN'), ('a', 'DT'), ('cloud', 'NN'), ('I', 'PRP'), ('saw', 'VBD'), ('a', 'DT'), ('child', 'NN'), (',', ','), ('And', 'CC'), ('he', 'PRP'), ('laughing', 'VBG'), ('said', 'VBD'), ('to', 'TO'), ('me', 'PRP'), (':', ':'), ('``', '``'), ('Pipe', 'VB'), ('a', 'DT'), ('song', 'NN'), ('about', 'IN'), ('a', 'DT'), ('Lamb', 'NN'), ('!', '.'), ("''", "''")]
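As a further illustration, not part of the original tutorial, the tagged output can be post-processed with ordinary Python, for example to pull out only the nouns or to count how often each tag occurs.

import nltk
from collections import Counter

text = nltk.word_tokenize("A Python is a serpent which eats eggs from the nest")
tagged_text = nltk.pos_tag(text)

# Keep only the tokens whose tag starts with 'NN' (the noun tags).
nouns = [word for word, tag in tagged_text if tag.startswith('NN')]
print(nouns)

# Count how many words received each tag.
tag_counts = Counter(tag for word, tag in tagged_text)
print(tag_counts)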
Python – Synonyms and Antonyms

Synonyms and antonyms are available as part of WordNet, which is a lexical database for the English language. It is accessible through the nltk corpora (download it first with nltk.download('wordnet') if necessary). In WordNet, synonyms are words that denote the same concept and are interchangeable in many contexts, so they are grouped into unordered sets called synsets. We use these synsets to derive the synonyms and antonyms, as shown in the below programs.

from nltk.corpus import wordnet

synonyms = []
for syn in wordnet.synsets("Soil"):
    for lm in syn.lemmas():
        synonyms.append(lm.name())
print(set(synonyms))

When we run the above program, we get the following output −

{'grease', 'filth', 'dirt', 'begrime', 'soil', 'grime', 'land', 'bemire', 'dirty', 'grunge', 'stain', 'territory', 'colly', 'ground'}

To get the antonyms we simply use the antonyms function on the lemmas.

from nltk.corpus import wordnet

antonyms = []
for syn in wordnet.synsets("ahead"):
    for lm in syn.lemmas():
        if lm.antonyms():
            antonyms.append(lm.antonyms()[0].name())
print(set(antonyms))

When we run the above program, we get the following output −

{'backward', 'back'}
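The two loops above can be wrapped into a small helper so that synonyms and antonyms for any word are collected in one pass. This is a sketch built on the same wordnet calls and is not part of the original page.

from nltk.corpus import wordnet

def synonyms_and_antonyms(word):
    """Return (synonyms, antonyms) sets for the given word using WordNet."""
    synonyms, antonyms = set(), set()
    for syn in wordnet.synsets(word):
        for lm in syn.lemmas():
            synonyms.add(lm.name())
            for ant in lm.antonyms():
                antonyms.add(ant.name())
    return synonyms, antonyms

print(synonyms_and_antonyms("ahead"))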
Python – WordNet Interface

WordNet is a dictionary of English, similar to a traditional thesaurus, and NLTK includes the English WordNet. We can use it as a reference for getting the meaning of words, usage examples and definitions. Words that denote the same concept are grouped into synsets, and the individual entries within a synset are called lemmas. The words in WordNet are organized as nodes and edges, where the nodes represent the word text and the edges represent the relations between the words. Below we will see how we can use the WordNet module.

All Lemmas

from nltk.corpus import wordnet as wn

res = wn.synset('locomotive.n.01').lemma_names()
print(res)

When we run the above program, we get the following output −

['locomotive', 'engine', 'locomotive_engine', 'railway_locomotive']

Word Definition

The dictionary definition of a word can be obtained by using the definition function. It describes the meaning of the word as we would find it in a normal dictionary.

from nltk.corpus import wordnet as wn

resdef = wn.synset('ocean.n.01').definition()
print(resdef)

When we run the above program, we get the following output −

a large body of water constituting a principal part of the hydrosphere

Usage Examples

We can get sentences showing some usage examples of a word using the examples() function.

from nltk.corpus import wordnet as wn

res_exm = wn.synset('good.n.01').examples()
print(res_exm)

When we run the above program, we get the following output −

['for your own good', "what's the good of worrying?"]

Opposite Words

We can get all the opposite words by using the antonyms function.

from nltk.corpus import wordnet as wn

# get all the antonyms
res_a = wn.lemma('horizontal.a.01.horizontal').antonyms()
print(res_a)

When we run the above program, we get the following output −

[Lemma('inclined.a.02.inclined'), Lemma('vertical.a.01.vertical')]
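WordNet can also be used to measure how closely two synsets are related. The sketch below is an addition to the original page and uses the path_similarity() method of a synset; the score ranges from 0 to 1, with 1 meaning the synsets are identical.

from nltk.corpus import wordnet as wn

dog = wn.synset('dog.n.01')
cat = wn.synset('cat.n.01')

# path_similarity scores how close two synsets are in the hypernym hierarchy.
print(dog.path_similarity(cat))
print(dog.path_similarity(wn.synset('ocean.n.01')))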
Python – Pretty Print

The Python module pprint is used for giving proper printing formats to various data objects in Python. Those data objects can be dictionaries or even objects containing JSON data. In the below example we see how the data looks before applying the pprint module and after applying it.

import pprint

student_dict = {'Name': 'Tusar', 'Class': 'XII',
                'Address': {'FLAT': 1308, 'BLOCK': 'A', 'LANE': 2, 'CITY': 'HYD'}}

print(student_dict)
print("\n")
print("***With Pretty Print***")
print("-----------------------")
pprint.pprint(student_dict, width=-1)

When we run the above program, we get the following output −

{'Name': 'Tusar', 'Class': 'XII', 'Address': {'FLAT': 1308, 'BLOCK': 'A', 'LANE': 2, 'CITY': 'HYD'}}

***With Pretty Print***
-----------------------
{'Address': {'BLOCK': 'A',
             'CITY': 'HYD',
             'FLAT': 1308,
             'LANE': 2},
 'Class': 'XII',
 'Name': 'Tusar'}

Handling JSON Data

pprint can also handle JSON data by formatting it into a more readable layout.

import pprint

emp = {"Name": ["Rick", "Dan", "Michelle", "Ryan", "Gary", "Nina", "Simon", "Guru"],
       "Salary": ["623.3", "515.2", "611", "729", "843.25", "578", "632.8", "722.5"],
       "StartDate": ["1/1/2012", "9/23/2013", "11/15/2014", "5/11/2014", "3/27/2015",
                     "5/21/2013", "7/30/2013", "6/17/2014"],
       "Dept": ["IT", "Operations", "IT", "HR", "Finance", "IT", "Operations", "Finance"]}

x = pprint.pformat(emp, indent=2)
print(x)

When we run the above program, we get the following output −

{ 'Dept': ['IT', 'Operations', 'IT', 'HR', 'Finance', 'IT', 'Operations', 'Finance'],
  'Name': ['Rick', 'Dan', 'Michelle', 'Ryan', 'Gary', 'Nina', 'Simon', 'Guru'],
  'Salary': ['623.3', '515.2', '611', '729', '843.25', '578', '632.8', '722.5'],
  'StartDate': ['1/1/2012', '9/23/2013', '11/15/2014', '5/11/2014', '3/27/2015', '5/21/2013', '7/30/2013', '6/17/2014']}
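For data that is going to be exchanged as JSON, the standard json module offers a similar pretty-printing facility. The sketch below is an addition to the original page and uses a shortened sample of the emp dictionary defined above.

import json

emp = {"Name": ["Rick", "Dan"], "Dept": ["IT", "Operations"]}   # shortened sample data

# json.dumps with indent and sort_keys produces readable, valid JSON text.
print(json.dumps(emp, indent=2, sort_keys=True))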
Python – Word Replacement

Replacing a complete string or a part of a string is a very frequent requirement in text processing. The replace() method returns a copy of the string in which the occurrences of old have been replaced with new, optionally restricting the number of replacements to max.

Following is the syntax for the replace() method −

str.replace(old, new[, max])

Parameters

old − the old substring to be replaced.
new − the new substring, which replaces the old substring.
max − if this optional argument is given, only the first max occurrences are replaced.

This method returns a copy of the string with all occurrences of the substring old replaced by new. If the optional argument max is given, only the first max occurrences are replaced.

Example

The following example shows the usage of the replace() method.

text = "this is string example....wow!!! this is really string"
print(text.replace("is", "was"))
print(text.replace("is", "was", 3))

Result

When we run the above program, it produces the following result −

thwas was string example....wow!!! thwas was really string
thwas was string example....wow!!! thwas is really string

Replacement Ignoring Case

To replace text regardless of its case, we can compile a regular expression with the re.IGNORECASE flag and use its sub() method.

import re

pattern = re.compile("Tutor", re.IGNORECASE)
replaced = pattern.sub("Tutor", "Tutorialspoint has the best tutorials for learning.")
print(replaced)

When we run the above program, we get the following output −

Tutorialspoint has the best Tutorials for learning.
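When several different words have to be replaced in one pass, re.sub() can also take a function as the replacement argument. The sketch below is an extra illustration, not part of the original page, and the replacement table is made up.

import re

text = "the cat sat on the mat"
replacements = {"cat": "dog", "mat": "rug"}   # hypothetical mapping

# Build one alternation pattern from the keys and look up each match
# in the dictionary to get its replacement.
pattern = re.compile(r'\b(' + '|'.join(map(re.escape, replacements)) + r')\b')
print(pattern.sub(lambda m: replacements[m.group(0)], text))
# the dog sat on the rug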
Python – Remove Stopwords

Stopwords are English words which do not add much meaning to a sentence. They can safely be ignored without sacrificing the meaning of the sentence; examples are words like the, he and have. Such words are already captured in a corpus named stopwords that ships with NLTK. We first download it to our Python environment.

import nltk
nltk.download('stopwords')

It will download a file containing the stopword lists, including the English stopwords.

Verifying the Stopwords

from nltk.corpus import stopwords

stopwords.words('english')
print(stopwords.words()[620:680])

When we run the above program, we get the following output −

['your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at']

Stopword lists are available for various languages besides English; the available languages can be listed as shown below.

from nltk.corpus import stopwords
print(stopwords.fileids())

When we run the above program, we get the following output −

['arabic', 'azerbaijani', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'greek', 'hungarian', 'indonesian', 'italian', 'kazakh', 'nepali', 'norwegian', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish', 'turkish']

Example

We use the below example to show how stopwords are removed from a list of words.

from nltk.corpus import stopwords

en_stops = set(stopwords.words('english'))

all_words = ['There', 'is', 'a', 'tree', 'near', 'the', 'river']
for word in all_words:
    if word not in en_stops:
        print(word)

When we run the above program, we get the following output −

There
tree
near
river
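Note that the comparison above is case sensitive, which is why the capitalised 'There' survives the filter. The following sketch, an addition to the original example, tokenizes a sentence and compares lower-cased tokens against the stopword set.

import nltk
from nltk.corpus import stopwords

# Requires the 'punkt' tokenizer data: nltk.download('punkt')
en_stops = set(stopwords.words('english'))

sentence = "There is a tree near the river"
words = nltk.word_tokenize(sentence)

# Lower-case each token before checking it against the stopword set,
# so that capitalised stopwords such as 'There' are also removed.
filtered = [w for w in words if w.lower() not in en_stops]
print(filtered)
# ['tree', 'near', 'river']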