Python – Filter Duplicate Words

Text analysis often needs only the unique words in a text, so the duplicate words have to be removed. This can be done by combining the word tokenization available in nltk with Python's built-in set type.

Without preserving the order

In the example below we first tokenize the sentence into words. We then apply set(), which creates an unordered collection of unique elements. The result contains each word once, in no particular order.

    import nltk

    word_data = "The Sky is blue also the ocean is blue also Rainbow has a blue colour."

    # First, word tokenization
    nltk_tokens = nltk.word_tokenize(word_data)

    # Applying set() removes the duplicates (and loses the order)
    no_order = list(set(nltk_tokens))

    print(no_order)

When we run the above program, we get output like the following (the order can differ between runs):

    ['blue', 'Rainbow', 'is', 'Sky', 'colour', 'ocean', 'also', 'a', '.', 'The', 'has', 'the']

The comparison is case sensitive, so 'The' and 'the' are kept as two different words.

Preserving the Order

To remove the duplicates while keeping the words in the order in which they appear in the sentence, we walk through the tokens, remember the ones already seen in a set, and append each new word to a result list.

    import nltk

    word_data = "The Sky is blue also the ocean is blue also Rainbow has a blue colour."

    # First, word tokenization
    nltk_tokens = nltk.word_tokenize(word_data)

    seen_tokens = set()   # words encountered so far
    result = []           # unique words in their original order

    for word in nltk_tokens:
        if word not in seen_tokens:
            seen_tokens.add(word)
            result.append(word)

    print(result)

When we run the above program, we get the following output:

    ['The', 'Sky', 'is', 'blue', 'also', 'the', 'ocean', 'Rainbow', 'has', 'a', 'colour', '.']
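The order-preserving idea can also be sketched with the standard library alone. The example below is an illustration, not the nltk approach: the regular-expression tokenizer is a rough stand-in for word_tokenize, and the function name unique_words and its ignore_case option are assumptions made for this sketch. It relies on dictionaries keeping insertion order (Python 3.7 and later).

```python
import re

def unique_words(text, ignore_case=False):
    """Return the tokens of `text` in first-seen order, without duplicates."""
    # Rough tokenizer: runs of word characters, or single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    seen = {}
    for token in tokens:
        key = token.lower() if ignore_case else token
        # setdefault keeps the first spelling seen for each key.
        seen.setdefault(key, token)
    return list(seen.values())

sentence = "The Sky is blue also the ocean is blue also Rainbow has a blue colour."
print(unique_words(sentence))
print(unique_words(sentence, ignore_case=True))
```

With ignore_case=True the words "The" and "the" are treated as the same word and only the first spelling is kept.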
Python – Synonyms and Antonyms

Synonyms and antonyms are available through WordNet, a lexical database for the English language that nltk exposes as one of its corpora. In WordNet, synonyms are words that denote the same concept and are interchangeable in many contexts, so they are grouped into unordered sets called synsets. We use these synsets to derive synonyms and antonyms, as shown in the programs below.

    from nltk.corpus import wordnet

    synonyms = []

    # Collect the lemma names of every synset that contains the word
    for syn in wordnet.synsets("Soil"):
        for lm in syn.lemmas():
            synonyms.append(lm.name())

    print(set(synonyms))

When we run the above program, we get output like the following (sets are unordered, so the order can differ):

    {'grease', 'filth', 'dirt', 'begrime', 'soil', 'grime', 'land', 'bemire', 'dirty', 'grunge', 'stain', 'territory', 'colly', 'ground'}

To get the antonyms, we call the antonyms() function on each lemma.

    from nltk.corpus import wordnet

    antonyms = []

    for syn in wordnet.synsets("ahead"):
        for lm in syn.lemmas():
            # Not every lemma has an antonym; take the first one when present
            if lm.antonyms():
                antonyms.append(lm.antonyms()[0].name())

    print(set(antonyms))

When we run the above program, we get the following output:

    {'backward', 'back'}
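The two loops can be folded into one helper. The sketch below is an illustration under assumptions: the function name related_words is hypothetical, and it only relies on the interface used in this section (synsets offering lemmas(), lemmas offering name() and antonyms()). Unlike the program above, it keeps every antonym of a lemma rather than only the first.

```python
def related_words(synsets):
    """Collect synonym and antonym names from an iterable of synsets.

    Each synset is expected to offer .lemmas(); each lemma is expected to
    offer .name() and .antonyms(), as in the programs above.
    """
    synonyms, antonyms = set(), set()
    for syn in synsets:
        for lemma in syn.lemmas():
            synonyms.add(lemma.name())
            # A lemma can have several antonyms; keep all of them.
            for ant in lemma.antonyms():
                antonyms.add(ant.name())
    # Sorted lists give a stable, readable result.
    return sorted(synonyms), sorted(antonyms)

# With nltk and the WordNet corpus installed, this could be called as:
#   from nltk.corpus import wordnet
#   print(related_words(wordnet.synsets("ahead")))
```

Passing the synsets in as an argument keeps the helper independent of nltk, so it can be tried out with small stand-in objects.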
Python – WordNet Interface
WordNet is a dictionary of English, similar to a traditional thesaurus, and NLTK includes the English WordNet. We can use it as a reference for the meaning of words, their definitions and usage examples. The words grouped together in a synset are called lemmas. WordNet is organized as nodes and edges, where the nodes represent the word text and the edges represent the relations between the words. Below we will see how to use the WordNet module.

All Lemmas

    from nltk.corpus import wordnet as wn

    res = wn.synset("locomotive.n.01").lemma_names()
    print(res)

When we run the above program, we get the following output:

    ['locomotive', 'engine', 'locomotive_engine', 'railway_locomotive']

Word Definition

The dictionary definition of a word can be obtained with the definition() function. It describes the meaning of the word as we would find it in a normal dictionary.

    from nltk.corpus import wordnet as wn

    resdef = wn.synset("ocean.n.01").definition()
    print(resdef)

When we run the above program, we get the following output:

    a large body of water constituting a principal part of the hydrosphere

Usage Examples

We can get example sentences showing how a word is used with the examples() function.

    from nltk.corpus import wordnet as wn

    res_exm = wn.synset("good.n.01").examples()
    print(res_exm)

When we run the above program, we get the following output:

    ['for your own good', "what's the good of worrying?"]

Opposite Words

We can get all the opposite words with the antonyms() function.

    from nltk.corpus import wordnet as wn

    # get all the antonyms
    res_a = wn.lemma("horizontal.a.01.horizontal").antonyms()
    print(res_a)

When we run the above program, we get the following output:

    [Lemma('inclined.a.02.inclined'), Lemma('vertical.a.01.vertical')]
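The synset identifiers used in this section, such as locomotive.n.01, can be read as word, part-of-speech tag and sense number separated by dots. The small sketch below splits such a name into its parts; the names SynsetName and parse_synset_name are assumptions introduced for illustration and are not part of nltk.

```python
from typing import NamedTuple

class SynsetName(NamedTuple):
    word: str    # e.g. "locomotive"
    pos: str     # part-of-speech tag, e.g. "n" for noun, "a" for adjective
    sense: int   # sense number, e.g. 1 for "01"

def parse_synset_name(name):
    """Split a name like 'locomotive.n.01' into word, pos and sense."""
    # Split from the right so a word that itself contains dots stays intact.
    word, pos, sense = name.rsplit(".", 2)
    return SynsetName(word, pos, int(sense))

print(parse_synset_name("locomotive.n.01"))
```

A lemma identifier such as horizontal.a.01.horizontal carries one more dotted part, the lemma name, and is not handled by this sketch.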
Python – Tokenization
In Python, tokenization refers to splitting a larger body of text into smaller units such as sentences or words, including for text in a language other than English. Several tokenization functions are built into the nltk module and can be used in programs as shown below.

Line Tokenization

In the example below we divide a given text into its sentences by using the function sent_tokenize.

    import nltk

    sentence_data = "The First sentence is about Python. The Second: about Django. You can learn Python,Django and Data Analysis here. "
    nltk_tokens = nltk.sent_tokenize(sentence_data)
    print(nltk_tokens)

When we run the above program, we get the following output:

    ['The First sentence is about Python.', 'The Second: about Django.', 'You can learn Python,Django and Data Analysis here.']

Non-English Tokenization

In the example below we tokenize German text.

    import nltk

    german_tokenizer = nltk.data.load("tokenizers/punkt/german.pickle")
    german_tokens = german_tokenizer.tokenize("Wie geht es Ihnen? Gut, danke.")
    print(german_tokens)

When we run the above program, we get the following output:

    ['Wie geht es Ihnen?', 'Gut, danke.']

Word Tokenization

We tokenize the words using the word_tokenize function available as part of nltk.

    import nltk

    word_data = "It originated from the idea that there are readers who prefer learning new skills from the comforts of their drawing rooms"
    nltk_tokens = nltk.word_tokenize(word_data)
    print(nltk_tokens)

When we run the above program, we get the following output:

    ['It', 'originated', 'from', 'the', 'idea', 'that', 'there', 'are', 'readers', 'who', 'prefer', 'learning', 'new', 'skills', 'from', 'the', 'comforts', 'of', 'their', 'drawing', 'rooms']
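To make the idea of tokenization concrete, here is a deliberately naive sketch that uses only regular expressions. It is not how nltk tokenizes: it knows nothing about abbreviations such as "Dr." or about contractions, and the function names split_sentences and split_words are assumptions made for this illustration.

```python
import re

def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

def split_words(text):
    """Naive word splitter: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

print(split_sentences("Wie geht es Ihnen? Gut, danke."))
print(split_words("Gut, danke."))
```

On simple inputs like these the result agrees with the nltk output shown in this section; on real text a trained tokenizer is the safer choice.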
Python – Convert Binary to ASCII

Conversion between binary data and ASCII text is handled by the built-in binascii module. Its usage is straightforward: each function takes the input data and performs the conversion. The program below uses the binascii functions b2a_uu and a2b_uu. The uu stands for "UNIX-to-UNIX encoding", which converts binary data into printable ASCII characters and back. In Python 3 these functions work on bytes, so the text is written as a bytes literal.

    import binascii

    # b2a_uu expects bytes, not str
    text = b"Simply Easy Learning"

    # Converting binary to ascii
    data_b2a = binascii.b2a_uu(text)
    print("**Binary to Ascii** \n")
    print(data_b2a)

    # Converting back from ascii to binary
    data_a2b = binascii.a2b_uu(data_b2a)
    print("**Ascii to Binary** \n")
    print(data_a2b)

When we run the above program, we get the following output:

    **Binary to Ascii** 

    b'44VEM<&QY($5A<WD@3&5A<FYI;F< \n'
    **Ascii to Binary** 

    b'Simply Easy Learning'
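A small helper makes the round trip explicit when the starting point is a str rather than bytes. This is a sketch under assumptions: the function name uu_round_trip and the choice of UTF-8 are made for illustration. It also checks the 45-byte limit that b2a_uu places on a single call, and shows hexlify as another binascii conversion for comparison.

```python
import binascii

def uu_round_trip(text, encoding="utf-8"):
    """Encode `text` with uuencoding and decode it again."""
    raw = text.encode(encoding)          # str -> bytes
    if len(raw) > 45:
        # b2a_uu converts one line at a time, at most 45 bytes of input.
        raise ValueError("b2a_uu accepts at most 45 bytes per call")
    encoded = binascii.b2a_uu(raw)       # bytes -> one uuencoded line
    decoded = binascii.a2b_uu(encoded)   # uuencoded line -> original bytes
    return encoded, decoded.decode(encoding)

encoded, decoded = uu_round_trip("Simply Easy Learning")
print(encoded)
print(decoded)

# hexlify is another binascii conversion: two hex digits per input byte.
print(binascii.hexlify(b"Simply"))
```

Longer inputs would have to be split into chunks of at most 45 bytes and encoded line by line.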
Python – Spelling Check
Spell checking is a basic requirement in any text processing or analysis. The Python package pyspellchecker finds words that may have been misspelled and suggests possible corrections.

First, we need to install the package using the following command in our Python environment.

    pip install pyspellchecker

Below we see how the package is used to point out the wrongly spelled words and to suggest possible correct words.

    from spellchecker import SpellChecker

    spell = SpellChecker()

    # find those words that may be misspelled
    misspelled = spell.unknown(["let", "us", "wlak", "on", "the", "groun"])

    for word in misspelled:
        # Get the one `most likely` answer
        print(spell.correction(word))

        # Get a list of `likely` options
        print(spell.candidates(word))

When we run the above program, we get the following output:

    group
    {'group', 'ground', 'groan', 'grout', 'grown', 'groin'}
    walk
    {'flak', 'weak', 'walk'}

Case Sensitive

If we use Let in place of let, the comparison with the closest matching words in the dictionary becomes case sensitive, and the result looks different.

    from spellchecker import SpellChecker

    spell = SpellChecker()

    # find those words that may be misspelled
    misspelled = spell.unknown(["Let", "us", "wlak", "on", "the", "groun"])

    for word in misspelled:
        # Get the one `most likely` answer
        print(spell.correction(word))

        # Get a list of `likely` options
        print(spell.candidates(word))

When we run the above program, we get the following output:

    group
    {'groin', 'ground', 'groan', 'group', 'grown', 'grout'}
    walk
    {'walk', 'flak', 'weak'}
    get
    {'aet', 'ret', 'get', 'cet', 'bet', 'vet', 'pet', 'wet', 'let', 'yet', 'det', 'het', 'set', 'et', 'jet', 'tet', 'met', 'fet', 'net'}
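The suggestions above are all words that differ from the misspelling by a small edit. The toy sketch below illustrates that idea with candidates exactly one edit away (a deletion, a swap of neighbours, a replacement or an insertion). It is not the pyspellchecker implementation; the function names and the tiny vocabulary are assumptions made for this illustration.

```python
import string

def edits1(word):
    """All strings that are exactly one simple edit away from `word`."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def candidates(word, vocabulary):
    """Known words equal to `word` or one edit away from it."""
    word = word.lower()
    if word in vocabulary:
        return {word}
    return edits1(word) & vocabulary

# A tiny, made-up vocabulary for the demonstration.
vocabulary = {"walk", "weak", "flak", "ground", "group", "let"}
print(candidates("wlak", vocabulary))
print(candidates("groun", vocabulary))
```

A real spell checker also ranks the candidates, typically by how frequent each word is, which this sketch leaves out.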
Python – Capitalize and Translate

Capitalizing strings is a regular need in any text processing system. Python achieves it with functions from the standard library. In the example below we use capwords() from the string module and the string method upper(). While capwords() capitalizes the first letter of each word, upper() converts the entire string to upper case.

    import string

    text = "Tutorialspoint – simple easy learning."

    print(string.capwords(text))
    print(text.upper())

When we run the above program, we get the following output:

    Tutorialspoint – Simple Easy Learning.
    TUTORIALSPOINT – SIMPLE EASY LEARNING.

Translation in Python essentially means substituting specific letters with other letters. It can be used for simple encryption and decryption of strings.

    text = "Tutorialspoint – simple easy learning."

    # Map t -> w, p -> x, o -> y, l -> z
    transtable = str.maketrans("tpol", "wxyz")

    print(text.translate(transtable))

When we run the above program, we get the following output:

    Tuwyriazsxyinw – simxze easy zearning.
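For decryption, the substitution has to be reversible. The mapping from "tpol" to "wxyz" is not reversible in general: the "y" in "easy" is untouched by encoding, but a reverse table would turn it into "o". The sketch below illustrates one way around this, assuming the cipher letters are a rearrangement of the plain letters; the function name make_tables and the letter choice "polt" are assumptions made for this illustration.

```python
def make_tables(plain, cipher):
    """Build an encoding table and its inverse for str.translate."""
    if len(plain) != len(cipher):
        raise ValueError("plain and cipher must have the same length")
    return str.maketrans(plain, cipher), str.maketrans(cipher, plain)

# The cipher letters are a rearrangement of the plain letters, so every
# substituted letter can be mapped back without ambiguity.
encode_table, decode_table = make_tables("tpol", "polt")

text = "Tutorialspoint - simple easy learning."
secret = text.translate(encode_table)
restored = secret.translate(decode_table)

print(secret)
print(restored)
```

Because both tables cover the same four letters, decoding restores the original text exactly.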
Python – Text Processing State Machine

A state machine is a way of designing a program to control the flow in an application. It is a directed graph consisting of a set of nodes (the states) and a set of transition functions. Processing a text file very often consists of reading each chunk of the file in sequence and doing something in response to each chunk read. The meaning of a chunk depends on what types of chunks came before it and what chunks come after it.

Consider a scenario where the input text has to be a continuous repetition of the sequence AGC (used in protein analysis). As long as this sequence is maintained in the input string, the state of the machine remains TRUE, but as soon as the sequence deviates, the state of the machine becomes FALSE and remains FALSE afterwards. This ensures that further processing is stopped even though more chunks with the correct sequence may be available later.

The program below defines a state machine with functions to start the machine, take inputs for processing the text, and step through the processing.
    class StateMachine:

        # Initialize
        def start(self):
            self.state = self.startState

        # Step through the input
        def step(self, inp):
            (s, o) = self.getNextValues(self.state, inp)
            self.state = s
            return o

        # Loop through the input
        def feeder(self, inputs):
            self.start()
            return [self.step(inp) for inp in inputs]

    # Determine the TRUE or FALSE state
    class TextSeq(StateMachine):
        startState = 0

        def getNextValues(self, state, inp):
            if state == 0 and inp == "A":
                return (1, True)
            elif state == 1 and inp == "G":
                return (2, True)
            elif state == 2 and inp == "C":
                return (0, True)
            else:
                # State 3 is the failure state; no rule leads out of it
                return (3, False)

    InSeq = TextSeq()

    x = InSeq.feeder(["A", "A", "A"])
    print(x)

    y = InSeq.feeder(["A", "G", "C", "A", "C", "A", "G"])
    print(y)

When we run the above program, we get the following output:

    [True, False, False]
    [True, True, True, True, False, False, False]

In the result for x, the AGC pattern fails at the second input, right after the first "A". The result remains False from then on. In the result for y, the AGC pattern holds up to the 4th input, so the result remains True up to that point. At the 5th input the result changes to False, because G is expected but C is found.
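The same machine can also be written as a transition table. The sketch below is an alternative illustration rather than a replacement for the class above; the names TRANSITIONS, FAILED and check_sequence are assumptions made for this example. Each allowed move is listed as a (state, input) pair, and any pair that is missing from the table leads to the failure state.

```python
# Allowed moves: (current state, input symbol) -> next state
TRANSITIONS = {
    (0, "A"): 1,
    (1, "G"): 2,
    (2, "C"): 0,
}
FAILED = 3  # no entry starts from this state, so the machine stays here

def check_sequence(symbols, start=0):
    """Return one True/False flag per input symbol, as feeder() does above."""
    state = start
    results = []
    for symbol in symbols:
        # Any move that is not in the table sends the machine to FAILED.
        state = TRANSITIONS.get((state, symbol), FAILED)
        results.append(state != FAILED)
    return results

print(check_sequence(["A", "A", "A"]))
print(check_sequence(["A", "G", "C", "A", "C", "A", "G"]))
```

Keeping the rules in a dictionary means a different pattern only needs a different table, not new branching code.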
Python – Extract URL from Text

URLs can be extracted from a text file by using a regular expression. The expression returns the text wherever it matches the pattern. Only the re module is needed for this purpose.

Example

We can take an input file containing some URLs and process it through the following program to extract the URLs. The findall() function is used to find all instances matching the regular expression.

Input File

The input file is shown below. It contains two URLs.

    Now a days you can learn almost anything by just visiting http://www.google.com. But if you are completely new to computers or internet then first you need to learn those fundamentals.
    Next you can visit a good e-learning site like – https://www.tutorialspoint.com to learn further on a variety of subjects.

When we process the above input file through the following program, we get the required output, which contains only the URLs extracted from the file.

    import re

    # Replace the path with the actual location of the input file
    with open("path/url_example.txt") as file:
        for line in file:
            urls = re.findall(r"https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+", line)
            print(urls)

When we run the above program, we get the following output:

    ['http://www.google.com.']
    ['https://www.tutorialspoint.com']

Note that the first match includes the full stop that ends the sentence, because the pattern accepts dots as part of a URL.
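Because the pattern accepts dots, a URL at the end of a sentence is returned together with the sentence's full stop. The sketch below wraps the same pattern in a helper that trims trailing punctuation; the function name extract_urls and the set of trimmed characters are assumptions made for this illustration.

```python
import re

# Same pattern as in the program above, compiled once for reuse.
URL_PATTERN = re.compile(r"https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+")

def extract_urls(text):
    """Return the URLs found in `text`, without trailing punctuation."""
    return [match.rstrip(".,;:!?") for match in URL_PATTERN.findall(text)]

sample = ("Now a days you can learn almost anything by just visiting "
          "http://www.google.com. Next you can visit a good e-learning site "
          "like https://www.tutorialspoint.com to learn further.")
print(extract_urls(sample))
```

The pattern stops at the first character that is not a letter, digit, underscore, hyphen, dot or percent-escape, so a path after the host name (anything from the next "/") is not captured.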
Python – Backward File Reading

When we read a file normally, the contents are read line by line from the beginning of the file. But there may be scenarios where we want to read the last line first. For example, the file may hold the latest record at the bottom, and we want to read the latest records first. To do this, we install the required package by using the command below.

    pip install file-read-backwards

Before reading the file backwards, let's read its content line by line so that we can compare the result with the backward reading.

    # Replace the path with the actual location of the file
    with open("Path/GodFather.txt", "r") as BigFile:
        data = BigFile.readlines()

    # Print each line
    for i in range(len(data)):
        print("Line No-", i)
        print(data[i])

When we run the above program, we get the following output:

    Line No- 0
    Vito Corleone is the aging don (head) of the Corleone Mafia Family.

    Line No- 1
    His youngest son Michael has returned from WWII just in time to see the wedding of Connie Corleone (Michael's sister) to Carlo Rizzi.

    Line No- 2
    All of Michael's family is involved with the Mafia, but Michael just wants to live a normal life. Drug dealer Virgil Sollozzo is looking for Mafia families to offer him protection in exchange for a profit of the drug money.

    Line No- 3
    He approaches Don Corleone about it, but, much against the advice of the Don's lawyer Tom Hagen, the Don is morally against the use of drugs, and turns down the offer.

    Line No- 4
    This does not please Sollozzo, who has the Don shot down by some of his hit men.

    Line No- 5
    The Don barely survives, which leads his son Michael to begin a violent mob war against Sollozzo and tears the Corleone family apart.

Reading Lines Backward

Now, to read the file backwards, we use the installed module.
    from file_read_backwards import FileReadBackwards

    with FileReadBackwards("Path/GodFather.txt", encoding="utf-8") as BigFile:

        # getting the lines one by one, starting from the last line
        for line in BigFile:
            print(line)

When we run the above program, we get the following output:

    The Don barely survives, which leads his son Michael to begin a violent mob war against Sollozzo and tears the Corleone family apart.
    This does not please Sollozzo, who has the Don shot down by some of his hit men.
    He approaches Don Corleone about it, but, much against the advice of the Don's lawyer Tom Hagen, the Don is morally against the use of drugs, and turns down the offer.
    All of Michael's family is involved with the Mafia, but Michael just wants to live a normal life. Drug dealer Virgil Sollozzo is looking for Mafia families to offer him protection in exchange for a profit of the drug money.
    His youngest son Michael has returned from WWII just in time to see the wedding of Connie Corleone (Michael's sister) to Carlo Rizzi.
    Vito Corleone is the aging don (head) of the Corleone Mafia Family.

You can verify that the lines have been read in reverse order.

Reading Words Backward

We can also read the words in the file backward. For this we first read the lines backwards and then tokenize the words in each line and reverse them. In the example below the word tokens of the same file are printed backwards, using both the package and the nltk module.
    import nltk
    from file_read_backwards import FileReadBackwards

    with FileReadBackwards("Path/GodFather.txt", encoding="utf-8") as BigFile:

        # getting the lines one by one, starting from the last line,
        # then tokenizing each line and applying reverse()
        for line in BigFile:
            word_data = line
            nltk_tokens = nltk.word_tokenize(word_data)
            nltk_tokens.reverse()
            print(nltk_tokens)

When we run the above program, we get the following output:

    ['.', 'apart', 'family', 'Corleone', 'the', 'tears', 'and', 'Sollozzo', 'against', 'war', 'mob', 'violent', 'a', 'begin', 'to', 'Michael', 'son', 'his', 'leads', 'which', ',', 'survives', 'barely', 'Don', 'The']
    ['.', 'men', 'hit', 'his', 'of', 'some', 'by', 'down', 'shot', 'Don', 'the', 'has', 'who', ',', 'Sollozzo', 'please', 'not', 'does', 'This']
    ['.', 'offer', 'the', 'down', 'turns', 'and', ',', 'drugs', 'of', 'use', 'the', 'against', 'morally', 'is', 'Don', 'the', ',', 'Hagen', 'Tom', 'lawyer', "'s", 'Don', 'the', 'of', 'advice', 'the', 'against', 'much', ',', 'but', ',', 'it', 'about', 'Corleone', 'Don', 'approaches', 'He']
    ['.', 'money', 'drug', 'the', 'of', 'profit', 'a', 'for', 'exchange', 'in', 'protection', 'him', 'offer', 'to', 'families', 'Mafia', 'for', 'looking', 'is', 'Sollozzo', 'Virgil', 'dealer', 'Drug', '.', 'life', 'normal', 'a', 'live', 'to', 'wants', 'just', 'Michael', 'but', ',', 'Mafia', 'the', 'with', 'involved', 'is', 'family', "'s", 'Michael', 'of', 'All']
    ['.', 'Rizzi', 'Carlo', 'to', ')', 'sister', "'s", 'Michael', '(', 'Corleone', 'Connie', 'of', 'wedding', 'the', 'see', 'to', 'time', 'in', 'just', 'WWII', 'from', 'returned', 'has', 'Michael', 'son', 'youngest', 'His']
    ['.', 'Family', 'Mafia', 'Corleone', 'the', 'of', ')', 'head', '(', 'don', 'aging', 'the', 'is', 'Corleone', 'Vito']
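For small files, the same effect can be sketched with the standard library alone. This is an illustration under assumptions, not the file_read_backwards approach: the function names are made up for this example, words are split on whitespace instead of being tokenized with nltk (so punctuation stays attached to the words), and the whole file is loaded into memory, which makes it unsuitable for very large files.

```python
import os
import tempfile

def read_lines_backward(path, encoding="utf-8"):
    """Return the lines of a small file, last line first."""
    with open(path, encoding=encoding) as handle:
        lines = handle.read().splitlines()   # whole file is read into memory
    return lines[::-1]

def read_words_backward(path, encoding="utf-8"):
    """Return, for each line from last to first, its words in reverse order."""
    return [line.split()[::-1] for line in read_lines_backward(path, encoding)]

# Demonstration on a temporary file so the example is self-contained.
with tempfile.TemporaryDirectory() as folder:
    sample = os.path.join(folder, "sample.txt")
    with open(sample, "w", encoding="utf-8") as handle:
        handle.write("first line\nsecond line\nthird line\n")
    print(read_lines_backward(sample))
    print(read_words_backward(sample))
```

For a log file too large to fit in memory, a reader that seeks from the end of the file, as the package in this section does, is the better fit.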