Sentiment analysis - DT matrix

As I work with various packages related to text manipulation, I am beginning to realize what a mess the R package ecosystem can turn into: a variety of packages written by different contributors with no coordination among them, overlapping functionality, and colliding nomenclature. Many functions exist for “convenience” when base R could do the job. I also noticed this with packages like dplyr. I have commenced learning dplyr on multiple occasions only to find I don’t need it - I can do everything with base R without loading an extra package and learning new terminology. The problem I now encounter is that as these packages gain in popularity, code snippets and examples use them, and I need to learn and understand the packages to make sense of those examples.

In my previous post on text manipulation I discussed the process of creating a corpus object. In this post I will investigate what can be done with a document term matrix. Starting with the previous post’s corpus:


dtm <- DocumentTermMatrix(corp)

There are a variety of methods available to inspect the document term matrix:

> dtm
<<DocumentTermMatrix (documents: 17, terms: 5500)>>
Non-/sparse entries: 18083/75417
Sparsity : 81%
Maximal term length: 26
Weighting : term frequency (tf)
> dim(dtm)
[1] 17 5500
> inspect(dtm[2, 50:100])
<<DocumentTermMatrix (documents: 1, terms: 51)>>
Non-/sparse entries: 10/41
Sparsity : 80%
Maximal term length: 9
Weighting : term frequency (tf)

Terms
Docs accentu accept access accid accompani accord account accumul
chapter02.xhtml 0 0 0 0 0 1 1 0
Terms
Docs accus accustom ach achiev acid acquaint acquir acr across
chapter02.xhtml 0 0 0 0 0 0 0 0 0
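
Beyond printing and inspect(), tm has a couple of other handy helpers for poking at a document term matrix. A quick sketch - the thresholds here are arbitrary values of mine, not from the original analysis:

findFreqTerms(dtm, lowfreq = 100)       # terms appearing at least 100 times across the book
findAssocs(dtm, "rock", corlimit = 0.8) # terms whose chapter counts correlate with "rock" at r >= 0.8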

Note the sparsity is 81%. Remove sparse terms and inspect:

> dtms <- removeSparseTerms(dtm, 0.1) # This makes a matrix that is 10% empty space, maximum.   
> dtms
<<DocumentTermMatrix (documents: 17, terms: 66)>>
Non-/sparse entries: 1082/40
Sparsity : 4%
Maximal term length: 26
Weighting : term frequency (tf)

Now sparsity is down to 4%. Next, calculate word frequencies from the full dtm and plot the most frequent words as a bar chart.

> freq <- sort(colSums(as.matrix(dtm)), decreasing=TRUE)   
> head(freq, 14)
the said like one rock girl now littl miss look mrs know day
367 204 188 184 180 170 167 164 163 148 142 133 122
come
116
> wf <- data.frame(word=names(freq), freq=freq)
> head(wf)
word freq
the the 367
said said 204
like like 188
one one 184
rock rock 180
girl girl 170

> wf$nc <- sapply(as.character(wf$word), nchar)
> wf <- wf[wf$nc > 3,]   ## drop words of 3 or fewer characters
>
> library(ggplot2)
> p <- ggplot(subset(wf, freq>60), aes(word, freq))
> p <- p + geom_bar(stat="identity")
> p <- p + theme(axis.text.x=element_text(angle=45, hjust=1))
> p

We can use hierarchical clustering to group related words. I wouldn’t read much meaning into this for Picnic, but it is comforting to see the xml/html terms clustering together in the third group - a sort of positive control.


library(cluster)
dtms <- removeSparseTerms(dtm, 0.05) # This makes a matrix that is 5% empty space, maximum.
d <- dist(t(dtms), method="euclidean")
fit <- hclust(d=d, method="ward.D") # "ward" was renamed "ward.D" in recent versions of R
plot(fit, hang=-1)
groups <- cutree(fit, k=4) # "k=" defines the number of clusters you are using
rect.hclust(fit, k=4, border="red") # draw dendrogram with red borders around the 4 clusters

We can also use K-means clustering:

library(fpc)
d <- dist(t(dtms), method="euclidean")
kfit <- kmeans(d, 4)
clusplot(as.matrix(d), kfit$cluster, color=T, shade=T, labels=2, lines=0)
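
The choice of 4 clusters is arbitrary. A rough elbow check is one way to sanity-check it; this sketch is my own addition, not part of the original analysis:

# within-cluster sum of squares for a range of k
wss <- sapply(2:10, function(k) kmeans(as.matrix(d), centers = k, nstart = 10)$tot.withinss)
plot(2:10, wss, type = "b", xlab = "number of clusters k", ylab = "total within-cluster sum of squares")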

Back here I didn’t mention that when creating the epub, it displayed fine on my computer but would not display on my Nook. A solution was to pass the file through Calibre. I diffed the files coming out of Calibre against my originals but was not able to determine the minimum set of changes required for Nook compatibility. You can download the Calibre-modified epub here, and the original here. If you determine what those Nook requirements are, please inform me.


Sentiment analysis - Corpus

In a previous post on text manipulation I discussed text mining manipulations that can be performed with a data frame. In this post I will explore what can be done with a corpus. Start by loading the text mining package tm; it has many useful methods for creating a corpus from various sources. My texts are in a directory as xhtml files, one per chapter. I will use VCorpus(DirSource()) to read the files into a corpus data object:

> library(tm)
>
> myfiles <- paste(getwd(),"/xhtmlfiles",sep="")
> corp <- VCorpus(DirSource(myfiles), readerControl = list(language="en"))
>
> length(corp)
[1] 17
> corp[[2]]
<<PlainTextDocument>>
Metadata: 7
Content: chars: 14639

> writeLines(as.character(corp[[2]]))
<?xml version="1.0" encoding="utf-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">

<head><meta charset="UTF-8" /></head>
<body><p>Chapter 11</p><p> </p><p> Mrs Fitzhubert at the breakfast table looked out on to the mist-shrouded garden, and decided to instruct the maids to begin putting away the chintzes preparatory to the....

The variable “corp” is a 17-member list, each member containing a chapter. tm provides many useful methods for word munging, referred to as “transformations”. Transformations are applied with the tm_map() function. Below I remove stop words, strip punctuation, remove proper nouns, convert to lower case, stem (i.e. remove common endings like “ing”, “es”, “s”), and strip extra white space:

corp <- tm_map(corp, removeWords, stopwords("english"))
corp <- tm_map(corp, removePunctuation, preserve_intra_word_dashes = TRUE)
prop.nouns <- c("Albert","Miranda","Mike","Michael","Edith","Irma","Sara","Dora","Appleyard","Hussey","Fitzhubert","Bumpher","Leopold","McCraw","Marion","Woodend","Leopold","Lumley","pp","p" )
corp <- tm_map(corp, removeWords, prop.nouns)
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, stemDocument)
corp <- tm_map(corp, stripWhitespace)

> writeLines(as.character(corp[[2]]))
xml version10 encodingutf-8
html xmlnshttpwwww3org1999xhtml

headmeta charsetutf-8 head
bodypchapt 2pp
manmad improv natur picnic ground consist sever circl flat stone serv fireplac wooden privi shape japanes pagoda the creek close summer ran sluggish long dri grass now almost disappear re-appear shallow pool lunch set larg white tablecloth close shade heat sun two three spre

A corpus object allows for the addition of metadata. I will add two events per chapter, which may be useful as overlays during graphing:


meta(corp[[1]], "event1") <- "Exposition of main characters"
meta(corp[[1]], "event2") <- "Journey to the rock"
meta(corp[[2]], "event1") <- "Picnic"
meta(corp[[2]], "event2") <- "Crossing the creek"
meta(corp[[3]], "event1") <- "A surprising number of human beings are without purpose..."
meta(corp[[3]], "event2") <- "Edith screams, girls disappear"
meta(corp[[4]], "event1") <- "Sara hasn't memorized 'The Hesperus'"
meta(corp[[4]], "event2") <- "Drag returns from the Rock"
meta(corp[[5]], "event1") <- "Michael interviewed by Constable Bumpher"
meta(corp[[5]], "event2") <- "The red cloud"
meta(corp[[6]], "event1") <- "The garden party"
meta(corp[[6]], "event2") <- "Mike decides to search for the girls"
meta(corp[[7]], "event1") <- "Mike decides to spend the night on the rock"
meta(corp[[7]], "event2") <- "Mike hallucinates on the rock"
meta(corp[[8]], "event1") <- "Michael rescued on the rock"
meta(corp[[8]], "event2") <- "Irma is found alive"
meta(corp[[9]], "event1") <- "Letters to/from parents"
meta(corp[[9]], "event2") <- "Sara informed of her debts to the school"
meta(corp[[10]], "event1") <- "Visit from the Spracks"
meta(corp[[10]], "event2") <- "Michael and Irma meet, date, break up"
meta(corp[[11]], "event1") <- "Michael avoids luncheon with Irma"
meta(corp[[11]], "event2") <- "Fitzhuberts entertain Irma"
meta(corp[[12]], "event1") <- "Irma visits the gymnasium"
meta(corp[[12]], "event2") <- "Mademoiselle de Poitiers threatens Dora Lumley"
meta(corp[[13]], "event1") <- "Reg collects his sister Dora"
meta(corp[[13]], "event2") <- "Reg and Dora die in a fire"
meta(corp[[14]], "event1") <- "Albert describes a dream about his kid sister"
meta(corp[[14]], "event2") <- "Mr Leopold thanks Albert with a cheque"
meta(corp[[15]], "event1") <- "Mrs Appleyard lies about Sara's situation"
meta(corp[[15]], "event2") <- "Mademoiselle de Poitiers reminisces about Sara"
meta(corp[[16]], "event1") <- "Mademoiselle de Poitiers letter to Constable Bumpher"
meta(corp[[16]], "event2") <- "Sara found dead"
meta(corp[[17]], "event1") <- "Newspaper extract"
meta(corp[[17]], "event2") <- ""

> meta(corp[[2]], "event2")
[1] "Crossing the creak"

The corpus object is a list of lists. The main object has 17 elements, one for each chapter, but each chapter element is also a list. The “content” variable of a chapter holds the original xml file contents, with each element being either xml notation, a blank line, or a paragraph of text. Looking at the second chapter, corp[[2]]$content has 18 elements, and the first paragraph of the chapter begins at element 6:

> length(corp[[2]]$content)
[1] 18
> corp[[2]]$content[1]
[1] "xml version10 encodingutf-8"
> corp[[2]]$content[2]
[1] "html xmlnshttpwwww3org1999xhtml"
> corp[[2]]$content[3]
[1] ""
> corp[[2]]$content[4]
[1] "headmeta charsetutf-8 head"
> corp[[2]]$content[5]
[1] "bodypchapt 2pp"
> corp[[2]]$content[6]
[1] " manmad improv natur picnic ground consist sever circl flat stone serv fireplac wooden privi shape japanes pagoda the creek close summer ran sluggish long dri grass now almost disappear re-appear shallow pool lunch set larg white tablecloth close shade heat sun two three spread gum in addit chicken pie angel cake jelli tepid banana insepar australian picnic cook provid handsom ice cake shape heart tom oblig cut mould piec tin mr boil two immens billycan tea fire bark leav now enjoy pipe shadow drag keep watch eye hors tether shade"
>

This corpus is the end of the preprocessing stage and will be the input for the document term matrix discussed in the next post.
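
As a preview, the document term matrix in the next post is built directly from this object:

dtm <- DocumentTermMatrix(corp)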


PAHR Sentiment Network

In my previous post on sentiment analysis I used a data frame to plot the trajectory of sentiment across the novel Picnic at Hanging Rock. In this post I will use the same data frame of non-unique, non-stop, greater-than-3-character words (the red line from an earlier post) to create a network of associated words. Words can be grouped by sentence, paragraph, or chapter. I have already removed stop words and punctuation, so I will use my previous grouping of every 15 words in the order they appear in the novel. Looking at rows 10 to 20 of my data frame:

> d2[10:20,]
chpt word sentiment lexicon group
10 1 silent positive bing 1
11 1 ridiculous negative bing 1
12 1 supported positive bing 1
13 1 bust negative bing 1
14 1 tortured negative bing 1
15 1 hopeless negative bing 1
16 1 misfit negative bing 2
17 1 clumsy negative bing 2
18 1 gold positive bing 2
19 1 suitable positive bing 2
20 1 insignificant negative bing 2
>

You can see the column “group” has grouped every 15 words. First I create a table of word co-occurrences using the pair_count function, then I use ggraph to create the network graph. The number of co-occurrences is reflected in edge opacity and width. At the time of this writing, ggraph was still in beta and had to be downloaded from GitHub and built locally. The igraph package provides the graph_from_data_frame function.


library(igraph)
library(ggraph)
library(dplyr)   # for %>% and filter()

word_cooccurences <- d2 %>%
  pair_count(group, word, sort = TRUE)

set.seed(1900)
word_cooccurences %>%
  filter(n >= 4) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = n, edge_width = n)) +
  geom_node_point(color = "darkslategray4", size = 5) +
  geom_node_text(aes(label = name), vjust = 1.8) +
  ggtitle(expression(paste("Word Network ",
                           italic("Picnic at Hanging Rock")))) +
  theme_void()
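
pair_count() has since been removed from tidytext. If you are reproducing this today, the same co-occurrence table can be built with the widyr package - a sketch, assuming widyr is installed:

library(widyr)
word_cooccurences <- d2 %>%
  pairwise_count(word, group, sort = TRUE) # count pairs of words sharing a 15-word group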

Let’s regroup every 25 words:


d2$group <- sort( c(rep(1:120, 25), rep(121, 19))) # 120 groups of 25 plus a final group of 19 covers all 3019 sentiment-annotated words

And now include only words with 5 occurrences or more:


..... filter(n >=5) %>% ....

PAHR Sentiment Trajectory

In my previous post on sentiment I discussed the process of building data frames of chapter metrics and word lists. I will use the word data frame to monitor sentiment across the book. I am working with non-unique, non-stop, greater-than-3-character words (the red line from the previous post). Looking at the word list and comparing it to the text, I can see that the words are in the order in which they appear in the novel. I will use the Bing sentiment determinations from the tidytext package to annotate each word as having either positive or negative sentiment. I will then group every 15 words and calculate the average sentiment.

##make a dataframe of all chapters
##use non.stops which also has words with <=3 chars removed

word <- non.stops[[1]]
chpt <- rep(1, length(word))
pahr.words <- data.frame( cbind(chpt, word))

for(i in 2:17){
  word <- non.stops[[i]]
  chpt <- rep(i, length(word))
  holder <- cbind(chpt, word)
  pahr.words <- rbind(pahr.words, holder)
  rm(holder)
}

##I checked and words are in the order that they appear
##in the novel
library(tidytext)
library(dplyr)   # for %>%, filter, select, count
bing <- sentiments %>%
  filter(lexicon == "bing") %>%
  select(-score)

d2 <- pahr.words %>%
  inner_join(bing) %>%
  cbind(sort( c(rep(1:201, 15), rep(202, 4)))) ##group words into runs of 15 (201*15 + 4 = 3019 words) for averaging sentiment

names(d2)[5] <- "group"

d3 <- count(d2, chpt, group, sentiment)

library(tidyr)

d4 <- spread(d3, sentiment, n)
d4$sentiment <- d4$positive - d4$negative
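
A note if you run this against a current tidytext: the bundled sentiments object no longer carries lexicon or score columns, so the filter/select above will fail; as far as I can tell the one-line replacement is:

bing <- get_sentiments("bing")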

Plot as a line graph, with odd chapters colored black and even chapters colored grey. I also annotate a few moments of trauma within the narrative.

library(ggplot2)
mycols <- c(rep(c("black","darkgrey"),8),"black")
ggplot( d4, aes(group, sentiment, color=chpt)) +
  geom_line() +
  scale_color_manual(values = mycols) +
  geom_hline(yintercept=0, linetype="dashed", color="red") +
  annotate("text", x = 146, y = -14, label = "Hysteria in the gymnasium") +
  annotate("text", x = 147, y = -13, label = "x") +
  annotate("text", x = 12, y = -11, label = "Edith screams on Rock") +
  annotate("text", x = 35, y = -11, label = "x") +
  annotate("text", x = 68, y = -13, label = "Bad news delivered\n to Ms Appleyard") +
  annotate("text", x = 49, y = -13, label = "x")

We can see that the novel starts with a positive sentiment - “Beautiful day for a picnic…” - which gradually moves into negative territory and remains there for the majority of the book.

Does sentiment analysis really work? That depends on how accurately word sentiment is characterized. Consider the word “drag”:

> d2[d2$word=="drag",]
chpt word sentiment lexicon group
133 1 drag negative bing 9
141 1 drag negative bing 10
162 1 drag negative bing 11
169 1 drag negative bing 12
183 1 drag negative bing 13
198 1 drag negative bing 14
199 1 drag negative bing 14
213 1 drag negative bing 15
227 1 drag negative bing 16
250 1 drag negative bing 17
263 1 drag negative bing 18
275 2 drag negative bing 19
300 2 drag negative bing 20
457 3 drag negative bing 31
468 3 drag negative bing 32
585 4 drag negative bing 39
602 4 drag negative bing 41
630 4 drag negative bing 42
633 4 drag negative bing 43
665 4 drag negative bing 45
678 4 drag negative bing 46
679 4 drag negative bing 46
743 5 drag negative bing 50
1224 7 drag negative bing 82
2978 16 drag negative bing 199
>

There are many instances of the word “drag” annotated as negative. Consider the sentence “It’s a drag that sentiment analysis isn’t reliable.” That is drag in a negative context. In Picnic, however, a drag is a buggy pulled by horses; it is mentioned many times, imparting a lot of undeserved negative sentiment to the novel. Drag in Picnic is neutral and should have been discarded. Inspecting the sentiment-annotated word list, many other examples similar to drag could be found, some contributing negative and some positive sentiment, on average probably cancelling each other out. Even more abundant are words properly annotated, which, on balance, may convey the proper sentiment. I would be skeptical, though, of any sentiment analysis without a properly curated word list.
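
A minimal sketch of that discarding step - the neutral.words vector is my own illustration, and “drag” is the only entry I have actually vetted:

neutral.words <- c("drag") # words whose Bing sentiment does not apply in this novel
d2.curated <- pahr.words %>%
  filter(!word %in% neutral.words) %>% # drop context-neutral words before joining the lexicon
  inner_join(bing)
## the group column would then need to be rebuilt, since fewer rows survive the join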

In the next post I will look at what can be done with a corpus.


Sentiment analysis

In my previous post on text manipulation I discussed the process of OCR and text munging to create a list of chapter contents. In this post I will investigate what can be done with a data frame; future posts will discuss using a corpus and a document term matrix.

Each chapter is an XHTML file; start by listing the file names and inspecting them:


##Indicate working directory
> setwd("~/pahr/sentiment/")
> all.files <- list.files(paste(getwd(), "/xhtmlfiles", sep=""))
> all.files
[1] "chapter10.xhtml" "chapter11.xhtml" "chapter12.xhtml" "chapter13.xhtml"
[5] "chapter14.xhtml" "chapter15.xhtml" "chapter16.xhtml" "chapter17.xhtml"
[9] "chapter1.xhtml" "chapter2.xhtml" "chapter3.xhtml" "chapter4.xhtml"
[13] "chapter5.xhtml" "chapter6.xhtml" "chapter7.xhtml" "chapter8.xhtml"
[17] "chapter9.xhtml"
>

Create a data frame that will serve both as the worklist I process through and as a holder of data about each chapter. The data frame will contain a row for each chapter and tally information such as:

  • file.name: name of the chapter XML file
  • bname: base name of the chapter XML file
  • chpt: chapter number
  • paragraphs: number of paragraphs
  • total: total number of words
  • nosmall: number of words remaining after removing small (<4 character) words
  • uniques: number of unique words
  • nonstop: number of non-stop words
  • unnstop: number of unique non-stop words

d <- data.frame(matrix(ncol = 9, nrow = length(all.files)))
names(d) <- c("file.name","bname","chpt","paragraphs","total","nosmall","uniques","nonstop","unnstop")

d$file.name <- all.files
for(i in 1:nrow(d)){
  numc <- nchar(d[i,"file.name"])
  d[i,"bname"] <- substring( d[i,"file.name"], 1, numc - 6)
  d[i,"chpt"] <- as.integer(substring( d[i,"file.name"], 8, numc - 6))
}
d <- d[order(d$chpt),]

> head(d)
file.name bname chpt paragraphs total nosmall uniques nonstop
9 chapter1.xhtml chapter1 1 NA NA NA NA NA
10 chapter2.xhtml chapter2 2 NA NA NA NA NA
11 chapter3.xhtml chapter3 3 NA NA NA NA NA
12 chapter4.xhtml chapter4 4 NA NA NA NA NA
13 chapter5.xhtml chapter5 5 NA NA NA NA NA
14 chapter6.xhtml chapter6 6 NA NA NA NA NA
unnstop
9 NA
10 NA
11 NA
12 NA
13 NA
14 NA
>

I will read the chapter XML files into a list and at the same time count the number of paragraphs per chapter:

library(XML)   # provides xmlToList()
chpts <- vector(mode="list", length=nrow(d))

for(i in 1:nrow(d)){
  chpt.num <- d[i,"chpt"]
  chpts[[chpt.num]] <- xmlToList( paste( getwd(), "/xhtmlfiles/", d[i,"file.name"], sep=""))
  d[i,"paragraphs"] <- length(chpts[[chpt.num]]$body )
}

Each quote from a character is given its own paragraph, so a high paragraph count is indicative of lots of conversation.

Next, create a list for each parameter I would like to extract. Stop words are provided by the tidytext package:

library(tidytext)
total <- vector(mode="list", length=nrow(d))
nosmall <- vector(mode="list", length=nrow(d))
un <- vector(mode="list", length=nrow(d)) ##uniques no blanks
data(stop_words) #from tidytext package
non.stops <- vector(mode="list", length=nrow(d))
unstops <- vector(mode="list", length=nrow(d))

for(i in 1:nrow(d)){
  chpt.num <- d[i,"chpt"]
  total[[chpt.num]] <- strsplit(gsub( "[[:punct:]]", "", chpts[[chpt.num]])[2], " ", fixed=TRUE)
  d[i,"total"] <- length(total[[chpt.num]][[1]] )
  ##eliminate words with fewer than 4 characters
  nosmall[[chpt.num]] <- total[[chpt.num]][[1]][!(nchar(total[[chpt.num]][[1]] )<4)]
  d[i,"nosmall"] <- length(nosmall[[chpt.num]] )
  ##uniques
  un[[chpt.num]] <- unique(nosmall[[chpt.num]])
  d[i,"uniques"] <- length( un[[chpt.num]] )
  ##no stops (but not unique)
  non.stops[[chpt.num]] <- nosmall[[chpt.num]][!(nosmall[[chpt.num]] %in% as.list(stop_words[,1])$word)]
  d[i,"nonstops"] <- length(non.stops[[chpt.num]] )
  ##no stops AND unique
  unstops[[chpt.num]] <- un[[chpt.num]][!(un[[chpt.num]] %in% as.list(stop_words[,1])$word)]
  d[i,"unstop"] <- length(unstops[[chpt.num]] )
}

> head(d)
file.name bname chpt paragraphs total nosmall uniques nonstop
9 chapter1.xhtml chapter1 1 50 5151 2854 1649 NA
10 chapter2.xhtml chapter2 2 59 3490 1844 1077 NA
11 chapter3.xhtml chapter3 3 42 2904 1632 971 NA
12 chapter4.xhtml chapter4 4 59 4064 2011 1066 NA
13 chapter5.xhtml chapter5 5 100 6216 3267 1572 NA
14 chapter6.xhtml chapter6 6 48 3305 1741 1028 NA
unnstop nonstops unstop
9 NA 2061 1414
10 NA 1228 883
11 NA 1107 786
12 NA 1290 843
13 NA 2124 1306
14 NA 1171 835
>

plot(d$chpt, d$total, type="o", ylab="Words", xlab="Chapter Number", main="Words by Chapter", ylim=c(0,9000))
points(d$chpt, d$nosmall, type="o", col="lightblue")
points(d$chpt, d$uniques, type="o", col="blue")
points(d$chpt, d$nonstops, type="o", col="red")
points(d$chpt, d$unstop, type="o", col="orange")

# add a legend matching the line colors
legend(1, 9000, c("Total words","Big words (> 3 chars)","Unique(Big words)","Non-stop(Big words)","Unique*Non-stop(Big words)"), col=c("black","lightblue","blue","red","orange"), lty=1, pch=1, cex=0.9)

The word count trends are the same for all categories, which is expected. I am most interested in “Non-stop(Big words)”, the red line, as I don’t want to normalize word dosage - i.e. if the word “happy” is present 20 times, I want the 20x dose of happiness sentiment that I wouldn’t get using unique words. To visually inspect the word lists I will simply pull out the first 50 words from each category for chapter 2:
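
A quick way to pull those lists, using the chapter-2 entries of the vectors built above:

head(total[[2]][[1]], 50) # all words
head(nosmall[[2]], 50)    # words of 4 or more characters
head(un[[2]], 50)         # unique big words
head(non.stops[[2]], 50)  # big words with stop words removed
head(unstops[[2]], 50)    # unique, non-stop big words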

Comparing nosmall to non.stops, the first two words eliminated are words 9 and 24, “several” and “through” - two words I agree don’t contribute to sentiment or content.
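
A quick way to confirm which positions were dropped (stop-word removal deletes every instance of a word, so membership testing against non.stops works):

head(which(!(nosmall[[2]] %in% non.stops[[2]])), 2) # positions of the first removed stop words (9 and 24 here)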

Next I will make a wordcloud of the entire book. To do so I must get the words into a dataframe.

word <- non.stops[[1]]
chpt <- rep(1, length(word))
pahr.words <- data.frame( cbind(chpt, word))

for(i in 2:17){
  word <- non.stops[[i]]
  chpt <- rep(i, length(word))
  holder <- cbind(chpt, word)
  pahr.words <- rbind(pahr.words, holder)
  rm(holder)
}

library('wordcloud')
wordcloud(pahr.words$word, max.words = 100, random.order = FALSE)

Appropriately “rock” is the most frequent word. The word cloud contains many proper nouns. I will make a vector of these nouns, remove them from the collection of words and re-plot:

> prop.nouns <- c("Albert","Miranda","Mike","Michael","Edith","Irma","Sara","Dora","Appleyard","Hussey","Fitzhubert","Bumpher","Leopold","McCraw","Marion","Woodend","Leopold","Lumley" )
> cloud.words <- as.character(pahr.words$word)
> ioi <- (cloud.words %in% prop.nouns)
> summary(ioi)
Mode FALSE TRUE NA's
logical 24524 1194 0
> cw2 <- cloud.words[!ioi]
>
> wordcloud(cw2, max.words = 100, random.order = FALSE)
>

In the next post I will look at what can be done with a corpus.


Unnecessariat

This article provides me with a few useful terms: unnecessariat, doomstead, hobbyfarm. 


Greenlandic Inuit show genetic signatures of diet and climate adaptation

This report, as well as the Netflix documentary “Fed Up”, has convinced me to give up fish oil supplements.


Geek vs. Nerd
