Python – Tokenization

In Python, tokenization means splitting a larger body of text into smaller units such as sentences or words, including text written in languages other than English. The nltk module has several tokenization functions built in, and they can be used in programs as shown below.

Line Tokenization

In the example below we divide a given text into individual sentences by using the function sent_tokenize.

import nltk

# Requires the Punkt tokenizer data, e.g. nltk.download('punkt')
# (recent NLTK releases use the 'punkt_tab' resource instead).
sentence_data = "The First sentence is about Python. The Second: about Django. You can learn Python,Django and Data Analysis here. "
nltk_tokens = nltk.sent_tokenize(sentence_data)
print(nltk_tokens)

When we run the above program, we get the following output −

['The First sentence is about Python.', 'The Second: about Django.', 'You can learn Python,Django and Data Analysis here.']
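To get a feel for what a sentence tokenizer has to decide, here is a minimal sketch that uses only the standard library. It is not the algorithm NLTK uses; the helper name naive_sent_tokenize is hypothetical, and the rule (split after '.', '!' or '?' followed by whitespace) is an assumption chosen for illustration.

```python
import re

def naive_sent_tokenize(text):
    """Split text after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

# Works for plain sentences.
simple = naive_sent_tokenize("The first sentence is about Python. The second is about Django.")
print(simple)
# ['The first sentence is about Python.', 'The second is about Django.']

# Breaks on abbreviations: 'Dr.' is wrongly treated as a sentence end.
tricky = naive_sent_tokenize("Dr. Smith teaches Python. He also teaches Django.")
print(tricky)
# ['Dr.', 'Smith teaches Python.', 'He also teaches Django.']
```

The second result shows why a trained tokenizer such as the Punkt model behind sent_tokenize is usually preferred: it is designed to cope with abbreviations and similar cases that a single punctuation rule gets wrong.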

Non-English Tokenization

In the example below we split German text into sentences using the pre-trained German Punkt tokenizer.

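import nltk

# Load the pre-trained Punkt sentence model for German.
# Older NLTK releases ship it as a pickle; recent releases may no longer load this path.
german_tokenizer = nltk.data.load('tokenizers/punkt/german.pickle')
german_tokens = german_tokenizer.tokenize('Wie geht es Ihnen?  Gut, danke.')
print(german_tokens)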

When we run the above program, we get the following output −

['Wie geht es Ihnen?', 'Gut, danke.']
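An alternative that avoids loading the pickle file directly is to pass the language parameter to sent_tokenize. The sketch below assumes NLTK is installed and its Punkt data has been downloaded. The wrapper name german_sentences is hypothetical, and it returns None when either is missing instead of raising an error.

```python
def german_sentences(text):
    """Return German sentences, or None if NLTK or its Punkt data is unavailable."""
    try:
        import nltk
        # sent_tokenize selects the Punkt model by language name.
        return nltk.sent_tokenize(text, language="german")
    except (ImportError, LookupError):
        # ImportError: nltk is not installed.
        # LookupError: the Punkt data has not been downloaded.
        return None

german_result = german_sentences("Wie geht es Ihnen?  Gut, danke.")
print(german_result)
```

If this prints None, install NLTK and download the Punkt resource before running it again.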

Word Tokenization

We split a text into individual words using the word_tokenize function, which is available as part of nltk.

import nltk

# word_tokenize also relies on the Punkt tokenizer data being installed.
word_data = "It originated from the idea that there are readers who prefer learning new skills from the comforts of their drawing rooms"
nltk_tokens = nltk.word_tokenize(word_data)
print(nltk_tokens)

When we run the above program we get the following output −

['It', 'originated', 'from', 'the', 'idea', 'that', 'there', 'are', 'readers',
'who', 'prefer', 'learning', 'new', 'skills', 'from', 'the',
'comforts', 'of', 'their', 'drawing', 'rooms']
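The sample sentence above contains no punctuation, so it does not show how a word tokenizer differs from a plain whitespace split. The sketch below illustrates that difference with the standard library only. It is not the rule set word_tokenize applies; the helper name naive_word_tokenize and its regular expression are assumptions made for illustration.

```python
import re

def naive_word_tokenize(text):
    """Return runs of word characters and single punctuation marks as tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

sample = "You can learn Python,Django and Data Analysis here."

# Splitting on whitespace leaves punctuation glued to the words.
whitespace_tokens = sample.split()
print(whitespace_tokens)
# ['You', 'can', 'learn', 'Python,Django', 'and', 'Data', 'Analysis', 'here.']

# Treating punctuation as separate tokens gives cleaner words.
regex_tokens = naive_word_tokenize(sample)
print(regex_tokens)
# ['You', 'can', 'learn', 'Python', ',', 'Django', 'and', 'Data', 'Analysis', 'here', '.']
```

A dedicated tokenizer such as word_tokenize goes further than this pattern, so prefer it for real text.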
