In my previous post on text manipulation I discussed the process of OCR and text munging to create a list of chapter contents. In this post I will investigate what can be done with a dataframe; future posts will cover building a corpus and a document-term matrix.
Each chapter is an XML file, so I read those into a variable and inspect them:
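A minimal sketch of gathering the files, assuming they sit in a chapters/ directory:

```r
# Gather the chapter XML files (the chapters/ directory is an assumption)
xml.files <- list.files("chapters", pattern = "\\.xml$", full.names = TRUE)
length(xml.files)          # how many chapters were found
head(basename(xml.files))  # quick look at the file names
```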
Next, create a dataframe that will serve both as the worklist to process through and as a store of data about each chapter. The dataframe will contain a row for each chapter and tally information such as:
- bname: base name of the chapter XML file
- chpt: chapter number
- paragraphs: number of paragraphs
- total: total number of words
- nosmall: number of words remaining after removing small (< 4 character) words
- uniques: number of unique words
- nonstop: number of non-stop words
- unnstop: number of unique non-stop words
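A sketch of the dataframe setup, assuming the xml.files vector from the sketch above:

```r
# One row per chapter; the counting columns start at zero and are filled in later
d <- data.frame(bname      = basename(xml.files),
                chpt       = seq_along(xml.files),
                paragraphs = 0L,
                total      = 0L,
                nosmall    = 0L,
                uniques    = 0L,
                nonstop    = 0L,
                unnstop    = 0L,
                stringsAsFactors = FALSE)
```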
I will read the chapter XML files into a list and at the same time count the number of paragraphs per chapter:
```r
chpts <- vector(mode = "list", length = nrow(d))
```
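A sketch of how the rest of the loop could look, assuming the xml2 package and that each paragraph sits in a &lt;p&gt; node (the node name is a guess):

```r
library(xml2)

for (i in seq_len(nrow(d))) {
  doc   <- read_xml(xml.files[i])          # parse one chapter file
  paras <- xml_find_all(doc, "//p")        # every paragraph node
  chpts[[i]]      <- xml_text(paras)       # paragraph text for this chapter
  d$paragraphs[i] <- length(paras)         # paragraph count for this chapter
}
```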
Each quote from a character is given its own paragraph, so a high paragraph count is indicative of lots of conversation.
Next I create a list for each parameter I would like to extract. Stop words are provided by the tidytext package:
```r
library(tidytext)
```
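A sketch of how the word lists and counts could be filled in, using a simple lowercase split on non-letter characters (the tokenisation details here are illustrative):

```r
data("stop_words")   # stop word list shipped with tidytext

words     <- vector(mode = "list", length = nrow(d))  # all words per chapter
nosmall   <- vector(mode = "list", length = nrow(d))  # small words removed
non.stops <- vector(mode = "list", length = nrow(d))  # stop words removed as well

for (i in seq_len(nrow(d))) {
  w <- unlist(strsplit(tolower(paste(chpts[[i]], collapse = " ")), "[^a-z']+"))
  w <- w[w != ""]

  words[[i]]     <- w
  nosmall[[i]]   <- w[nchar(w) >= 4]
  non.stops[[i]] <- nosmall[[i]][!nosmall[[i]] %in% stop_words$word]

  d$total[i]   <- length(w)
  d$nosmall[i] <- length(nosmall[[i]])
  d$uniques[i] <- length(unique(w))
  d$nonstop[i] <- length(non.stops[[i]])
  d$unnstop[i] <- length(unique(non.stops[[i]]))
}
```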
The word count trends are the same for all categories, which is expected. I am interested in the “Non-stop (Big words)” category, the red line, because I don’t want to normalize word dosage: if the word “happy” is present 20 times, I want the 20x dose of the happiness sentiment that I wouldn’t get using unique words. To visually inspect the word lists I will pull out the first 50 words from each category for chapter 2:
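One way to do that, assuming the per-chapter lists from the sketch above:

```r
# First 50 words of each category for chapter 2
head(words[[2]], 50)
head(nosmall[[2]], 50)
head(non.stops[[2]], 50)
```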
Comparing nosmall to non.stops, the first two words eliminated are words 9 and 24, “several” and “through”, two words I agree don’t contribute to sentiment or content.
Next I will make a wordcloud of the entire book. To do so I must get the words into a dataframe.
```r
word <- non.stops[[1]]
```
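A sketch of one way to build the frequency dataframe across all chapters and plot it, assuming the wordcloud package:

```r
library(wordcloud)

word <- unlist(non.stops)                                    # all chapters together
freq <- as.data.frame(table(word), stringsAsFactors = FALSE) # word / count columns
names(freq) <- c("word", "freq")

wordcloud(words = freq$word, freq = freq$freq,
          min.freq = 5, random.order = FALSE)
```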
Appropriately, “rock” is the most frequent word. The word cloud contains many proper nouns. I will make a vector of these nouns, remove them from the collection of words, and re-plot:
```r
prop.nouns <- c("Albert", "Miranda", "Mike", "Michael", "Edith", "Irma", "Sara",
                "Dora", "Appleyard", "Hussey", "Fitzhubert", "Bumpher", "Leopold",
                "McCraw", "Marion", "Woodend", "Lumley")
```
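A sketch of the filtering and re-plot, reusing the hypothetical freq dataframe from the earlier sketch:

```r
# Drop the proper nouns (case-insensitive) and re-plot the cloud
freq2 <- freq[!tolower(freq$word) %in% tolower(prop.nouns), ]

wordcloud(words = freq2$word, freq = freq2$freq,
          min.freq = 5, random.order = FALSE)
```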
In the next post I will look at what can be done with a corpus.