In Python, tokenization refers to splitting a larger body of text into smaller units such as sentences or words, including text written in languages other than English. Several tokenization functions are built into the nltk module and can be used in programs as shown below.
Sentence Tokenization
In the example below, we divide a given text into individual sentences by using the function sent_tokenize.
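import nltk

# sent_tokenize needs the Punkt models, e.g. nltk.download('punkt')
# (newer NLTK releases name this resource 'punkt_tab').
sentence_data = "The First sentence is about Python. The Second: about Django. You can learn Python,Django and Data Ananlysis here. "
nltk_tokens = nltk.sent_tokenize(sentence_data)
print(nltk_tokens)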
When we run the above program, we get the following output −
['The First sentence is about Python.', 'The Second: about Django.', 'You can learn Python,Django and Data Ananlysis here.']
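To see why a dedicated sentence tokenizer is useful, it helps to compare it with a naive approach. The sketch below uses only the standard library and splits after every '.', '!' or '?' that is followed by whitespace. The function name naive_sent_split and the sample strings are illustrative assumptions, and this is not how nltk works internally.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' whenever whitespace follows (illustrative only)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

# Works for simple text.
print(naive_sent_split("The first sentence is short. The second one is too."))

# Breaks on abbreviations: 'Dr.' is treated as the end of a sentence.
print(naive_sent_split("Dr. Smith arrived. He sat down."))
```

The second call splits after 'Dr.', which is the kind of mistake a trained model such as Punkt, used by sent_tokenize, is designed to reduce.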
Non-English Tokenization
In the example below, we split German text into sentences by loading the German Punkt tokenizer that ships with the nltk data.
import nltk

german_tokenizer = nltk.data.load('tokenizers/punkt/german.pickle')
german_tokens = german_tokenizer.tokenize('Wie geht es Ihnen? Gut, danke.')
print(german_tokens)
When we run the above program, we get the following output −
['Wie geht es Ihnen?', 'Gut, danke.']
Word Tokenization
We split text into words using the word_tokenize function, which is available as part of nltk.
import nltk

word_data = "It originated from the idea that there are readers who prefer learning new skills from the comforts of their drawing rooms"
nltk_tokens = nltk.word_tokenize(word_data)
print(nltk_tokens)
When we run the above program, we get the following output −
['It', 'originated', 'from', 'the', 'idea', 'that', 'there', 'are', 'readers', 'who', 'prefer', 'learning', 'new', 'skills', 'from', 'the', 'comforts', 'of', 'their', 'drawing', 'rooms']
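The sample text above contains no punctuation, so it does not show how word tokenization differs from splitting on spaces. The sketch below uses only the standard library to illustrate the idea: a plain split leaves punctuation attached to words, while a simple regular expression emits each punctuation mark as its own token. The function simple_word_tokenize is an illustrative assumption and only a rough approximation of what word_tokenize does.

```python
import re

def simple_word_tokenize(text):
    """Return runs of word characters and single punctuation marks as tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

text = "You can learn Python, Django and Data Analysis here."

# Whitespace splitting keeps 'Python,' and 'here.' as single tokens.
print(text.split())

# The regex version separates ',' and '.' from the neighbouring words.
print(simple_word_tokenize(text))
```

Separating punctuation matters for later steps such as counting word frequencies, where 'here' and 'here.' should count as the same word.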