2016-07-10

PAHR Sentiment Trajectory

In my previous post on sentiment I discussed building data frames of chapter metrics and word lists. Here I use the word data frame to track sentiment across the book. I am working with the non-unique, non-stop words longer than 3 characters (the red line from the previous post). Comparing the word list to the text, I can see that the words are in the order in which they appear in the novel. I will use the Bing sentiment lexicon from the tidytext package to annotate each word as either positive or negative. I will then group the annotated words into runs of 15 and calculate the net sentiment (positive count minus negative count) of each group.

##make a data frame of all chapters
##use non.stop, which also has words with <=3 chars removed

word <- non.stop[[1]]
chpt <- rep(1, length(word))
pahr.words <- data.frame(cbind(chpt, word))

for(i in 2:17){
    word <- non.stop[[i]]
    chpt <- rep(i, length(word))
    holder <- cbind(chpt, word)
    pahr.words <- rbind(pahr.words, holder)
    rm(holder)
}

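##I checked and words are in the order that they appear
##in the novel
library(dplyr)     ##provides %>%, filter(), select(), inner_join(), count()
library(tidytext)
##in the tidytext release used here, sentiments bundles several lexicons;
##newer releases provide get_sentiments("bing") instead
bing <- sentiments %>%
        filter(lexicon == "bing") %>%
        select(-score)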



d2 <- pahr.words %>%
    inner_join(bing, by = "word") %>%
    cbind(sort(c(rep(1:201, 15), rep(202, 4)))) ##groups the 3019 matched words by 15 (201 full groups plus a last group of 4)

names(d2)[5] <- "group"

d3 <- count(d2, chpt,group,sentiment)

library(tidyr)

d4 <- spread(d3, sentiment, n, fill = 0) ##fill = 0 so a group with no positive (or no negative) words does not give NA
d4$sentiment <- d4$positive - d4$negative
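
The group index above is hard-coded for the 3019 words that matched the lexicon. As a minimal sketch, assuming a toy data frame in place of d2 (the names toy and group_size are made up for illustration), the same grouping can be derived from the row count with integer division, so it would not need adjusting if the word list changed:

```r
# Toy stand-in for the joined data frame (d2 in the post); values are invented.
toy <- data.frame(
  word = paste0("w", 1:34),
  sentiment = rep(c("positive", "negative"), 17),
  stringsAsFactors = FALSE
)

group_size <- 15  # same window size as in the post

# Rows 1-15 fall in group 1, rows 16-30 in group 2, the remainder in group 3
toy$group <- (seq_len(nrow(toy)) - 1) %/% group_size + 1

# Net sentiment per group: number of positives minus number of negatives
net <- tapply(toy$sentiment == "positive", toy$group, sum) -
       tapply(toy$sentiment == "negative", toy$group, sum)
print(net)
```

For a data frame of 3019 rows this produces the same index as the sort(c(rep(...))) construction: 201 groups of 15 followed by one group of 4.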

I plot the result as a line graph, with odd chapters colored black and even chapters colored grey. I also annotate a few moments of trauma within the narrative.

library(ggplot2)
mycols <- c(rep(c("black", "darkgrey"), 8), "black") ##17 colors, alternating by chapter
##factor() with explicit levels keeps chapters in numeric order even if chpt is stored as text
ggplot(d4, aes(group, sentiment, color = factor(chpt, levels = 1:17))) +
    geom_line() +
    scale_color_manual(values = mycols, name = "chpt") +
    geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
    annotate("text", x = 146, y = -14, label = "Hysteria in the gymnasium") +
    annotate("text", x = 147, y = -13, label = "x") +
    annotate("text", x = 12, y = -11, label = "Edith screams on Rock") +
    annotate("text", x = 35, y = -11, label = "x") +
    annotate("text", x = 68, y = -13, label = "Bad news delivered\n to Mrs Appleyard") +
    annotate("text", x = 49, y = -13, label = "x")

We can see that the novel starts with positive sentiment (“Beautiful day for a picnic…”), which gradually moves into negative territory and remains there for the majority of the book.

Does sentiment analysis really work? That depends on how accurately each word's sentiment is characterized. Consider the word “drag”:

> d2[d2$word=="drag",]
     chpt word sentiment lexicon group
133     1 drag  negative    bing     9
141     1 drag  negative    bing    10
162     1 drag  negative    bing    11
169     1 drag  negative    bing    12
183     1 drag  negative    bing    13
198     1 drag  negative    bing    14
199     1 drag  negative    bing    14
213     1 drag  negative    bing    15
227     1 drag  negative    bing    16
250     1 drag  negative    bing    17
263     1 drag  negative    bing    18
275     2 drag  negative    bing    19
300     2 drag  negative    bing    20
457     3 drag  negative    bing    31
468     3 drag  negative    bing    32
585     4 drag  negative    bing    39
602     4 drag  negative    bing    41
630     4 drag  negative    bing    42
633     4 drag  negative    bing    43
665     4 drag  negative    bing    45
678     4 drag  negative    bing    46
679     4 drag  negative    bing    46
743     5 drag  negative    bing    50
1224    7 drag  negative    bing    82
2978   16 drag  negative    bing   199
>

Many instances of the word drag are annotated as negative. Consider the sentence “It’s a drag that sentiment analysis isn’t reliable.” There, drag is used in a negative context. In Picnic, however, a drag is a buggy pulled by horses. It is mentioned many times, imparting a lot of undeserved negative sentiment to the novel. Drag in Picnic is neutral and should have been discarded. Inspecting the sentiment-annotated word list, one can find many other examples like drag, some contributing negative and some positive sentiment, which on average probably cancel each other out. Properly annotated words are even more abundant and, on balance, may convey the proper sentiment. I would be skeptical, though, of any sentiment analysis that does not use a properly curated word list.
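
One way to curate the list might be to remove words known to be neutral in the novel from the lexicon before joining. The sketch below assumes dplyr and uses invented toy data; toy_lexicon, toy_words and neutral_here are hypothetical stand-ins, not objects from this analysis:

```r
library(dplyr)

# Toy lexicon and word list standing in for bing and pahr.words; values invented.
toy_lexicon <- data.frame(
  word = c("drag", "beautiful", "scream"),
  sentiment = c("negative", "positive", "negative"),
  stringsAsFactors = FALSE
)
toy_words <- data.frame(
  word = c("beautiful", "drag", "picnic", "drag", "scream"),
  stringsAsFactors = FALSE
)

# Hypothetical list of words judged neutral in this particular novel
neutral_here <- data.frame(word = c("drag"), stringsAsFactors = FALSE)

# Drop the neutral words from the lexicon, then annotate as before
curated <- anti_join(toy_lexicon, neutral_here, by = "word")
annotated <- inner_join(toy_words, curated, by = "word")
print(annotated)
```

The neutral list still has to be built by reading the annotated words against the text, so this only automates the removal, not the judgment.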

In the next post I will look at what can be done with a corpus.