In a previous post on text manipulation I discussed text mining operations that can be performed on a data frame. In this post I will explore what can be done with a corpus. Start by loading the text mining package tm, which has many useful methods for creating a corpus from various sources. My texts are in a directory as xhtml files, one per chapter. I will use VCorpus(DirSource()) to read the files into a corpus object:
> library(tm)
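As a rough sketch of the read step, assuming a hypothetical chapter directory (created here in a temporary folder, with made-up file names and contents, so the example runs on its own), VCorpus(DirSource()) might be called like this:

```r
library(tm)

# Hypothetical stand-in for the real chapter directory: a temp folder
# holding two tiny made-up XHTML files.
chapter_dir <- file.path(tempdir(), "chapters_demo")
dir.create(chapter_dir, showWarnings = FALSE)
writeLines(c("<html>", "<body>", "<p>It was a quiet morning.</p>", "</body>", "</html>"),
           file.path(chapter_dir, "ch01.xhtml"))
writeLines(c("<html>", "<body>", "<p>The storm arrived at noon.</p>", "</body>", "</html>"),
           file.path(chapter_dir, "ch02.xhtml"))

# DirSource() lists the matching files; VCorpus() reads each one into
# its own document, using the default plain-text reader.
demo_corp <- VCorpus(DirSource(chapter_dir, pattern = "\\.xhtml$"))

length(demo_corp)                # one document per file
length(content(demo_corp[[1]]))  # one element per line of the first file
```

The name demo_corp and the pattern argument are illustrative choices; with a real directory only the path passed to DirSource() would change.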
The variable “corp” is a 17-member list, each member containing a chapter. tm provides many useful methods for word munging, referred to as “transformations”, which are applied with the tm_map() function. Below I remove white space, remove stop words, stem (i.e. strip common endings such as “ing”, “es”, “s”), and so on:
corp <- tm_map(corp, removeWords, stopwords("english"))
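A chain of transformations along these lines could look like the sketch below. It runs on a two-sentence toy corpus rather than the chapters. The order of the steps and the lower-casing step are my assumptions, and stemming is only attempted if the SnowballC package (which tm's stemDocument relies on) is installed:

```r
library(tm)

# Toy corpus standing in for the chapters (made-up sentences).
docs <- VCorpus(VectorSource(c("The   dogs are running in the   park",
                               "She was walking and singing")))

# Lower-case first so capitalised stop words such as "The" are matched.
docs <- tm_map(docs, content_transformer(tolower))

# Drop English stop words, then collapse runs of white space.
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, stripWhitespace)

# Stem only when the SnowballC backend is available.
if (requireNamespace("SnowballC", quietly = TRUE)) {
  docs <- tm_map(docs, stemDocument)
}

content(docs[[1]])
```

Base functions such as tolower need to be wrapped in content_transformer() so that tm_map() returns documents rather than bare character vectors.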
A corpus object allows metadata to be added. I will add two events per chapter, which may be useful as overlays during graphing:
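A minimal sketch of attaching per-chapter metadata with meta(), using a toy corpus. The tag names event1 and event2 and the event labels are hypothetical and not taken from the actual chapters:

```r
library(tm)

# Toy two-chapter corpus (made-up text).
docs <- VCorpus(VectorSource(c("chapter one text", "chapter two text")))

# Attach two hypothetical events to each chapter as document-level metadata.
meta(docs[[1]], "event1") <- "storm begins"
meta(docs[[1]], "event2") <- "ship departs"
meta(docs[[2]], "event1") <- "landfall"
meta(docs[[2]], "event2") <- "crew rests"

# Read a single tag back, or list all metadata for a chapter.
meta(docs[[1]], "event1")
meta(docs[[2]])
```

Tags set this way travel with the individual document, so they can be looked up later when building per-chapter plots.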
The corpus object is a list of lists. The main object has 17 elements, one for each chapter, but each chapter element is also a list. The “content” element of that list holds the original xhtml file contents, with each entry being either XML markup, a blank line, or a paragraph of text. Looking at the second chapter, corp[[2]]$content has 18 elements, and the first paragraph of the chapter is element 6:
> length(corp[[2]]$content)
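To illustrate how the first paragraph might be located without counting by hand, here is a small sketch on a made-up document. The sample lines and the rule "non-blank and not starting with <" are assumptions for illustration only. Real chapter files may wrap paragraphs in tags, in which case the test would need adjusting:

```r
library(tm)

# Made-up lines mimicking a chapter file: markup, blank lines, then prose.
chapter_lines <- c("<html>", "<body>", "", "<h1>Chapter 2</h1>", "",
                   "The first paragraph starts here.", "", "</body>", "</html>")
doc <- PlainTextDocument(chapter_lines)

# Number of elements in the document's content.
length(doc$content)

# Heuristic: a text line is non-blank and does not begin with "<".
is_text <- nzchar(trimws(doc$content)) & !grepl("^\\s*<", doc$content)
first_par <- which(is_text)[1]

doc$content[first_par]
```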
This corpus is the end product of the preprocessing stage and will be the input for a document-term matrix, discussed in the next post.