Python – Corpora Access

A corpus is a collection of text documents; the plural is corpora. One well-known example is the Gutenberg Corpus: NLTK includes a small selection of texts from Project Gutenberg, which hosts some 25,000 free electronic books at http://www.gutenberg.org/. In the example below we list the file identifiers in this corpus. Each one is the name of a plain text file ending in .txt. If the corpus data is not yet installed, nltk.download('gutenberg') fetches it.

from nltk.corpus import gutenberg

# List the identifiers (file names) of every text in the corpus
file_ids = gutenberg.fileids()

print(file_ids)

When we run the above program, we get the following output −

['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt',
'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt',
'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt',
'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt',
'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt',
'shakespeare-macbeth.txt', 'whitman-leaves.txt']
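
Because the file identifiers follow an author-title.txt pattern, ordinary string methods are enough to pick out a subset of them. The following is a minimal sketch rather than part of NLTK: the helper name select_by_prefix is hypothetical, and the list is copied from the output above so that the snippet runs without the corpus data. With NLTK installed, gutenberg.fileids() would supply the same list.

```python
# File IDs copied from the output above. With NLTK and the Gutenberg data
# installed, gutenberg.fileids() would supply this list instead.
file_ids = [
    "austen-emma.txt", "austen-persuasion.txt", "austen-sense.txt",
    "bible-kjv.txt", "blake-poems.txt", "bryant-stories.txt",
    "burgess-busterbrown.txt", "carroll-alice.txt", "chesterton-ball.txt",
    "chesterton-brown.txt", "chesterton-thursday.txt", "edgeworth-parents.txt",
    "melville-moby_dick.txt", "milton-paradise.txt", "shakespeare-caesar.txt",
    "shakespeare-hamlet.txt", "shakespeare-macbeth.txt", "whitman-leaves.txt",
]


def select_by_prefix(ids, prefix, suffix=".txt"):
    """Hypothetical helper: keep IDs that start with prefix and end with suffix."""
    return [fid for fid in ids if fid.startswith(prefix) and fid.endswith(suffix)]


# Keep only the Shakespeare plays
shakespeare_files = select_by_prefix(file_ids, "shakespeare-")
print(shakespeare_files)
```

Any of the selected identifiers can then be passed to the corpus reader in the same way as a name typed by hand.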

Accessing Raw Text

We can read the raw text of any of these files with gutenberg.raw() and then split it into sentences with the sent_tokenize function, which is also available in nltk. In the example below we print the first two sentences of the Blake poems text.

from nltk.tokenize import sent_tokenize
from nltk.corpus import gutenberg

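# Read the whole file as a single string
sample = gutenberg.raw("blake-poems.txt")

# Split the raw text into a list of sentences
sentences = sent_tokenize(sample)

# Print the first two sentences
for sentence in sentences[:2]:
    print(sentence)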

When we run the above program we get the following output −

[Poems by William Blake 1789]

 
SONGS OF INNOCENCE AND OF EXPERIENCE
and THE BOOK of THEL


 SONGS OF INNOCENCE
 
 
 INTRODUCTION
 
 Piping down the valleys wild,
   Piping songs of pleasant glee,
 On a cloud I saw a child,
   And he laughing said to me:
 
 "Pipe a song about a Lamb!"
So I piped with merry cheer.
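
The output shows that a tokenized sentence can run across titles, headings, and several lines of verse, because sentence boundaries follow punctuation rather than layout. When the layout matters more, one option is to split the raw text on blank lines instead. The sketch below assumes a short hard-coded excerpt as a stand-in for gutenberg.raw("blake-poems.txt"), and split_blocks is a hypothetical helper name, not an NLTK function.

```python
import re

# Hypothetical stand-in for gutenberg.raw("blake-poems.txt"): a short excerpt
excerpt = (
    "[Poems by William Blake 1789]\n"
    "\n"
    "SONGS OF INNOCENCE AND OF EXPERIENCE\n"
    "and THE BOOK of THEL\n"
    "\n"
    "\n"
    "Piping down the valleys wild,\n"
    "  Piping songs of pleasant glee,\n"
)


def split_blocks(text):
    """Split text on runs of blank lines and drop empty pieces."""
    pieces = re.split(r"\n\s*\n", text)
    return [piece.strip() for piece in pieces if piece.strip()]


blocks = split_blocks(excerpt)

# Print the first two blocks, with a separator line after each
for block in blocks[:2]:
    print(block)
    print("---")
```

Here the title line and the two-line heading come out as separate blocks, whereas the sentence tokenizer returned them as part of one long sentence.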
