Albion’s Seed

Nice review of Albion’s Seed

AR-15

Evolution of gun culture in America

Why the NRA has been so successful.

Guns don’t kill people.

2015 Rand study on interventions

Suicide / Homicide coincidence (not)

ebook text manipulation

In my first post on creating an ebook I discussed the physical manipulation required to convert a paperback book into images and ultimately text files. Now I want to convert the text files into an ebook. Here is the sequence of events:

  1. Organize text in chapter/page order
  2. Read into a list, combining pages into chapters
  3. Remove ligatures, common misspellings, combine hyphenated word fragments
  4. Annotate with ebook XML tags
  5. Generate the ebook

Organize text

I start with my dataframe listing all files and their page numbers and read each individual page text file into an R list.

library(tidytext)
library(dplyr)
library(stringr)
library(readr)
library(tokenizers)
> d
file.name bname chpt eo page img.num pnumber chpteo
1 ch1o-0001.txt ch1o-0001 1 o 1 1 NA 1o
2 ch1e-0010.txt ch1e-0010 1 e 2 10 NA 1e
3 ch1o-0002.txt ch1o-0002 1 o 3 2 NA 1o
4 ch1e-0009.txt ch1e-0009 1 e 4 9 NA 1e
5 ch1o-0003.txt ch1o-0003 1 o 5 3 NA 1o
6 ch1e-0008.txt ch1e-0008 1 e 6 8 NA 1e
7 ch1o-0004.txt ch1o-0004 1 o 7 4 7 1o
8 ch1e-0007.txt ch1e-0007 1 e 8 7 8 1e
9 ch1o-0005.txt ch1o-0005 1 o 9 5 9 1o
10 ch1e-0006.txt ch1e-0006 1 e 10 6 NA 1e




By counting the rows associated with each chapter in the dataframe, I determine the number of pages per chapter, then combine those pages into a list by chapter (17 chapters in total for Picnic). I will not annotate individual pages with page numbers; instead I combine all pages into a chapter and let the epub format handle the flow.

# read every page into a list, one character vector of lines per page
pages <- vector(mode="list", length=nrow(d))
for(i in seq_len(nrow(d))){
  pages[[i]] <- read_lines( paste( getwd(), "/textfiles/", d[i,"bname"], ".txt", sep=""), skip=0)
}

# how many pages per chapter
n.chapters <- max(d$chpt) # 17 for Picnic
chap.lengths <- vector('integer', length=n.chapters)
for(i in seq_len(n.chapters)){
  chap.lengths[i] <- nrow(d[d$chpt == i,])
}

chapter <- vector(mode = "list", length = n.chapters)
page.counter <- 1 # position in the whole book; assumes d is in page order

for(i in seq_along(chap.lengths)){ # i is the chapter number
  for(j in seq_len(chap.lengths[i])){ # j is the page number within the chapter
    chapter[[i]] <- c(chapter[[i]], pages[[page.counter]])
    page.counter <- page.counter + 1
  }
}
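
As an aside, the same grouping can be sketched without an explicit page counter by splitting the page list on the chapter column. This is only an illustration on made-up data: d.toy and pages.toy are hypothetical stand-ins for d and pages, and it assumes the rows of the dataframe are already in page order.

```r
# hypothetical stand-ins for the real `d` and `pages` objects
d.toy <- data.frame(chpt = c(1, 1, 2, 2, 2))
pages.toy <- list(c("a1", "a2"), "b1", "c1", c("d1", "d2"), "e1")

# pages per chapter, counted directly from the chapter column
chap.lengths.toy <- as.integer(table(d.toy$chpt))

# group the page list by chapter number, then flatten each group into
# one character vector of lines per chapter
chapter.toy <- lapply(split(pages.toy, d.toy$chpt), unlist, use.names = FALSE)

print(chap.lengths.toy)
print(chapter.toy)
```

The result is a list named by chapter number, so chapter.toy[["2"]] holds every line of chapter 2.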

Remove ligatures, correct misspellings

OCR will have introduced many misspellings, some of which can be corrected in bulk. I also want to remove ligatures, as these will interfere with word recognition when I am performing spell checking. Finally, the typesetting process introduces many hyphenated words at the end of a line of text, to preserve readability. I want to remove these and let the epub flow the text instead.

I create the function replaceforeignchars, which will replace ligatures and common misspellings. The replacements to be executed are tabulated in the table “fromto”:

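## define the replacement table and the function that applies it
## (ﬁ and ﬂ in the "from" column are the single-character ligatures)
fromto <- read.table(text="
from to
š s
-— -
—- -
» ''
œ oe
ﬁ fi
ﬂ fl
ğ g
mr Mr
mrs Mrs", header=TRUE, stringsAsFactors=FALSE)

replaceforeignchars <- function(dat, fromto) {
  # apply each from/to pair in turn to every line of text
  for(i in seq_len(nrow(fromto))) {
    dat <- gsub(fromto$from[i], fromto$to[i], dat)
  }
  dat
}

chapters2 <- chapter

for(i in seq_along(chapters2)){
  chapters2[[i]] <- replaceforeignchars( chapters2[[i]], fromto)
}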
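
To see the effect on a small input, here is a minimal sketch of the same loop run on two made-up words. The names fromto.toy and replace.fixed are hypothetical, the ligatures are written as Unicode escapes so the example does not depend on file encoding, and fixed = TRUE is my own choice so that each pattern is treated as literal text rather than a regular expression.

```r
# hypothetical replacement table: two ligatures and one digraph
fromto.toy <- data.frame(
  from = c("\ufb01", "\ufb02", "\u0153"),  # fi ligature, fl ligature, oe digraph
  to   = c("fi", "fl", "oe"),
  stringsAsFactors = FALSE
)

replace.fixed <- function(dat, fromto) {
  for (i in seq_len(nrow(fromto))) {
    # fixed = TRUE: match the characters literally, no regex interpretation
    dat <- gsub(fromto$from[i], fromto$to[i], dat, fixed = TRUE)
  }
  dat
}

cleaned <- replace.fixed(c("\ufb01re\ufb02y", "man\u0153uvre"), fromto.toy)
print(cleaned)
```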

With pages combined into chapters, I remove hyphenated words at the end of lines.


chapters2 <- vector(mode = "list", length = length(chapter)) # one entry per chapter
concat.flag <- FALSE # if TRUE, the previous line ended in a hyphen and this line was already handled

for(i in seq_along(chapter)){ # i is the chapter counter
  for(j in seq_along(chapter[[i]])){ # j is the line counter
    sentence <- chapter[[i]][j]
    sentence.length <- nchar(sentence)
    if( substring( sentence, sentence.length, sentence.length)=='-'){
      # join the last word of this line to the first word of the next line
      concat.flag <- TRUE
      words.first <- tokenize_words(sentence)
      words.first.len <- length(words.first[[1]])
      words.second <- tokenize_words(chapter[[i]][j+1])
      words.second.len <- length(words.second[[1]])
      concatenated.word <- paste( words.first[[1]][words.first.len], words.second[[1]][1], sep="", collapse="")
      new.sen.first <- paste(c(words.first[[1]][1:(words.first.len-1)], concatenated.word), sep=" ", collapse=" ")
      new.sen.second <- paste( words.second[[1]][2:(words.second.len)], sep=" ", collapse=" ")

      chapters2[[i]] <- c(chapters2[[i]], new.sen.first, new.sen.second)
    }else{
      if(concat.flag){
        concat.flag <- FALSE # this line was already appended as new.sen.second
      }else{
        chapters2[[i]] <- c(chapters2[[i]], chapter[[i]][j] )
      }
    }
  }
}

In practice this didn’t work so well. Tokenizing a sentence removes capitalization, which then has to be manually corrected. There were also occasions where a line was duplicated, and this had to be manually corrected. I decided to remove hyphens manually while editing the text.
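
For comparison, here is a minimal sketch of a different approach that avoids tokenizing altogether, so capitalization and punctuation are left untouched. It is not what I used: dehyphenate is a hypothetical helper that joins the lines, deletes every "hyphen plus line break" that sits between two letters, and splits the text back into lines. It merges the two affected lines into one, which is harmless when the epub reflows the text, and it will also close up genuinely hyphenated words (such as re-appear) when they happen to break at the end of a line.

```r
dehyphenate <- function(lines) {
  # glue the lines together so that line-end hyphens become "-\n"
  txt <- paste(lines, collapse = "\n")
  # drop the hyphen and the line break when both neighbours are letters
  txt <- gsub("([[:alpha:]])-\n([[:alpha:]])", "\\1\\2", txt)
  # back to one element per remaining line
  strsplit(txt, "\n", fixed = TRUE)[[1]]
}

joined <- dehyphenate(c("Mr Hussey had boiled up two im-",
                        "mense billycans of tea",
                        "on a fire."))
print(joined)
```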

Next I print out each chapter as a page with xhtml annotation:

# chpts3 is assumed to hold the edited chapters (it is not defined in the code shown here)
for(i in seq_along(chpts3)){
  out.file <- paste(getwd(), "/content/chapter", i, ".xhtml", sep="")

  # the first cat() starts a fresh file; the rest append to it
  cat("<?xml version=\"1.0\" encoding=\"utf-8\"?>\n", file = out.file, append=FALSE)
  cat("<html xmlns=\"http://www.w3.org/1999/xhtml\">\n\n", file = out.file, append=TRUE)
  cat("<head>", file = out.file, append=TRUE)
  cat("<meta charset=\"UTF-8\" />", file = out.file, append=TRUE)
  cat("</head>\n", file = out.file, append=TRUE)
  cat("<body>", file = out.file, append=TRUE)
  cat(chpts3[[i]], file = out.file, append=TRUE)

  cat("</body>", file = out.file, append=TRUE)
  cat("\n</html>", file = out.file, append=TRUE)
}
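
The same page can be built as a character vector and written in one call. The sketch below is an illustration only: escape.xml and chapter.to.xhtml are hypothetical helpers, the title element is my own addition, and the sketch assumes each element of the input is one plain-text paragraph that still needs its p tags and its XML special characters escaped.

```r
# escape the characters that are special in XML text (ampersand first)
escape.xml <- function(x) {
  x <- gsub("&", "&amp;", x, fixed = TRUE)
  x <- gsub("<", "&lt;", x, fixed = TRUE)
  gsub(">", "&gt;", x, fixed = TRUE)
}

# wrap a vector of plain-text paragraphs in a minimal xhtml page
chapter.to.xhtml <- function(paragraphs, title) {
  c('<?xml version="1.0" encoding="utf-8"?>',
    '<html xmlns="http://www.w3.org/1999/xhtml">',
    sprintf("<head><title>%s</title></head>", escape.xml(title)),
    "<body>",
    paste0("<p>", escape.xml(paragraphs), "</p>"),
    "</body>",
    "</html>")
}

page <- chapter.to.xhtml(c("Tom & Albert", "a < b"), "Chapter 2")
out.file <- file.path(tempdir(), "chapter2.xhtml")  # temporary location for the demo
writeLines(page, out.file)
print(page)
```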

A useful command is saveRDS which allows for the saving of R objects. Here I save my list, which I can read back into an object, modify, and resave.

saveRDS(chapters2, paste(getwd(), "/chptobj/chptobj.list", sep=""))
chapters2 <- readRDS(paste(getwd(), "/chptobj/chptobj.list", sep=""))

The package qdap provides an interactive method, check_spelling_interactive, for spell checking. A dialog box will pop up for each unrecognized word in turn, providing you with a pick list of potential corrections or the opportunity to type in a correction manually.

library(qdap)
check_spelling_interactive(pages[[100]], range=2, assume.first.correct=FALSE)

I found that the pick list often did not provide the appropriate choice, that capitalization was not preserved, and that Picnic has many slang words, which forced interaction with qdap too frequently. I decided to read through the text and correct it manually.

Here is qdap flagging the French ‘alors’. There are settings for qdap that may improve the word choices available, but I did not spend the time investigating.

Once the chapters have been edited and proofread, it is time to create the epub. An epub is a zip file with the extension “.epub”. It also has a well-defined directory layout and required files that define chapters, images, flow control, etc. ePub specifications and tutorials are readily available online. Here I will show examples of some of the epub contents.
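
Two pieces of that layout are fixed by the epub container format: a file named mimetype at the top level, containing exactly the text application/epub+zip with no trailing newline, and a META-INF directory that holds container.xml. A minimal sketch of creating that skeleton from R follows; book.dir is a hypothetical scratch location, and where the remaining files go is up to you.

```r
# hypothetical scratch directory for the unzipped book
book.dir <- file.path(tempdir(), "epub-skeleton")
dir.create(file.path(book.dir, "META-INF"), recursive = TRUE, showWarnings = FALSE)

# cat() writes the string as-is, without adding a newline at the end
mimetype.file <- file.path(book.dir, "mimetype")
cat("application/epub+zip", file = mimetype.file)

print(list.files(book.dir, recursive = TRUE, include.dirs = TRUE))
```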

File toc.ncx

<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1">
<head>
<meta name="dtb:uid" content="granitemtn.net [2016.05.30-07:52:00]"/>
<meta name="dtb:depth" content="3"/>
<meta name="dtb:totalPageCount" content="190"/>
<meta name="dtb:maxPageNumber" content="190"/>
</head>
<docTitle>
<text>Picnic at Hanging Rock</text>
</docTitle>
<navMap>
<navPoint id="navpoint-1" playOrder="1">
<navLabel>
<text>Cover</text>
</navLabel>
<content src="content/cover.xhtml"/>
</navPoint>
<navPoint id="navpoint-2" playOrder="2">
<navLabel>
<text>Title Page</text>
</navLabel>
<content src="content/title.xhtml"/>
</navPoint>
.
.
.
<navPoint id="navpoint-21" playOrder="21">
<navLabel>
<text>Back</text>
</navLabel>
<content src="content/back.xhtml"/>
</navPoint>
</navMap>
</ncx>
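
Typing twenty-one navPoint entries by hand is tedious, so here is a minimal sketch of generating them with sprintf. The helper nav.point is hypothetical, and the third label and file below are placeholders for illustration, since the listing above elides the middle entries.

```r
# build one navPoint element per label; n doubles as id suffix and play order
nav.point <- function(n, label, src) {
  sprintf(paste0(
    '<navPoint id="navpoint-%d" playOrder="%d">\n',
    '<navLabel>\n<text>%s</text>\n</navLabel>\n',
    '<content src="%s"/>\n',
    '</navPoint>'), n, n, label, src)
}

labels <- c("Cover", "Title Page", "Chapter 1")   # "Chapter 1" is a placeholder label
srcs   <- c("content/cover.xhtml", "content/title.xhtml", "content/chapter1.xhtml")

nav.map <- nav.point(seq_along(labels), labels, srcs)
cat(nav.map, sep = "\n")
```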

File metadata.opf

<package xmlns="http://www.idpf.org/2007/opf" version="2.0" unique-identifier="bookid">
<metadata xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:opf="http://www.idpf.org/2007/opf">
<dc:title>Picnic at Hanging Rock</dc:title>
<dc:creator opf:file-as="Lindsay, Joan" opf:role="aut">Joan Lindsay</dc:creator>
<dc:language>en-US</dc:language>
<dc:identifier id="bookid">granitemtn.net [2016.05.30-07:52:00]</dc:identifier>
<dc:rights>Public Domain</dc:rights>
</metadata>
<manifest>
<item id="ncx" href="toc.ncx" media-type="application/x-dtbncx+xml"/>

<item id="title" href="content/title.xhtml" media-type="application/xhtml+xml"/>
<item id="characters" href="content/characters.xhtml" media-type="application/xhtml+xml"/>
<item id="chapter1" href="content/chapter1.xhtml" media-type="application/xhtml+xml"/>
<item id="chapter2" href="content/chapter2.xhtml" media-type="application/xhtml+xml"/>
<item id="chapter3" href="content/chapter3.xhtml" media-type="application/xhtml+xml"/>
<item id="chapter4" href="content/chapter4.xhtml" media-type="application/xhtml+xml"/>
<item id="chapter5" href="content/chapter5.xhtml" media-type="application/xhtml+xml"/>
<item id="chapter6" href="content/chapter6.xhtml" media-type="application/xhtml+xml"/>
<item id="chapter7" href="content/chapter7.xhtml" media-type="application/xhtml+xml"/>
<item id="chapter8" href="content/chapter8.xhtml" media-type="application/xhtml+xml"/>
<item id="chapter9" href="content/chapter9.xhtml" media-type="application/xhtml+xml"/>
<item id="chapter10" href="content/chapter10.xhtml" media-type="application/xhtml+xml"/>
<item id="chapter11" href="content/chapter11.xhtml" media-type="application/xhtml+xml"/>
<item id="chapter12" href="content/chapter12.xhtml" media-type="application/xhtml+xml"/>
<item id="chapter13" href="content/chapter13.xhtml" media-type="application/xhtml+xml"/>
<item id="chapter14" href="content/chapter14.xhtml" media-type="application/xhtml+xml"/>
<item id="chapter15" href="content/chapter15.xhtml" media-type="application/xhtml+xml"/>
<item id="chapter16" href="content/chapter16.xhtml" media-type="application/xhtml+xml"/>
<item id="chapter17" href="content/chapter17.xhtml" media-type="application/xhtml+xml"/>
<item id="cover-image" href="content/images/cover.jpg" media-type="image/jpeg"/>
<item id="back-image" href="content/images/back.jpg" media-type="image/jpeg"/>
<item id="cover" href="content/cover.xhtml" media-type="application/xhtml+xml"/>
<item id="back" href="content/back.xhtml" media-type="application/xhtml+xml"/>

</manifest>
<spine>
<itemref idref="cover"/>
<itemref idref="title"/>
<itemref idref="characters"/>
<itemref idref="chapter1"/>
<itemref idref="chapter2"/>
<itemref idref="chapter3"/>
<itemref idref="chapter4"/>
<itemref idref="chapter5"/>
<itemref idref="chapter6"/>
<itemref idref="chapter7"/>
<itemref idref="chapter8"/>
<itemref idref="chapter9"/>
<itemref idref="chapter10"/>
<itemref idref="chapter11"/>
<itemref idref="chapter12"/>
<itemref idref="chapter13"/>
<itemref idref="chapter14"/>
<itemref idref="chapter15"/>
<itemref idref="chapter16"/>
<itemref idref="chapter17"/>
<itemref idref="back"/>

</spine>
</package>
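
The chapter entries in the manifest and spine are repetitive enough that a single typo is easy to miss, so it may be safer to generate them. A minimal sketch, assuming the chapter files are named chapter1.xhtml to chapter17.xhtml as in the listing above; the variable names are my own.

```r
chapter.ids <- paste0("chapter", 1:17)

# one manifest <item> per chapter file
manifest.items <- sprintf(
  '<item id="%s" href="content/%s.xhtml" media-type="application/xhtml+xml"/>',
  chapter.ids, chapter.ids)

# one spine <itemref> per chapter, in reading order
spine.items <- sprintf('<itemref idref="%s"/>', chapter.ids)

cat(manifest.items, sep = "\n")
cat(spine.items, sep = "\n")
```

The printed lines can be pasted between the manifest and spine tags.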

File container.xml:

<?xml version="1.0"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
<rootfiles>
<rootfile full-path="EPUB/pahf.opf"
media-type="application/oebps-package+xml" />
</rootfiles>
</container>



File - an example chapter:

<?xml version="1.0" encoding="utf-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">

<head><meta charset="UTF-8" /></head>
<body><p>Chapter 2</p><p> </p>
<p> Manmade improvements on Nature at the Picnic Grounds consisted of several circles of flat stones to serve as fireplaces and a wooden privy in the shape of a Japanese pagoda. The creek at the close of summer ran sluggishly through long dry grass, now and then almost disappearing to re-appear as a shallow pool. Lunch had been set out on large white tablecloths close by, shaded from the heat of the sun by two or three spreading gums. In addition to the chicken pie, angel cake, jellies and the tepid bananas inseparable from an Australian picnic, Cook had provided a handsome iced cake in the shape of a heart, for which Tom had obligingly cut a mould from a piece of tin. Mr Hussey had boiled up two immense billycans of tea on a fire of bark and leaves and was now enjoying a pipe in the shadow of the drag where he could keep a watchful eye on his horses tethered in the shade. </p>
.
.
.
<p> The four girls were already out of sight when Mike came out of the first belt of trees. He looked up at the vertical face of the Rock and wondered how far they would go before turning back. The Hanging Rock, according to Albert, was a tough proposition even for experienced climbers. If Albert was right and they were only schoolgirls about the same age as his sisters in England, how was it they were allowed to set out alone, at the end of a summer afternoon? He reminded himself that he was in Australia now: Australia, where anything might happen. In England everything had been done before: quite often by one’s own ancestors, over and over again. He sat down on a fallen log, heard Albert calling him through the trees, and knew that this was the country where he, Michael Fitzhubert, was going to live. What was her name, the tall pale girl with straight yellow hair, who had gone skimming over the water like one of the white swans on his Uncle’s lake? </p><p></p></body>
</html>


Once the files are in order they are zipped into an epub. Navigate to the directory containing your files and:

zip -Xr9D Picnic_at_Hanging_Rock.epub mimetype * -x .DS_Store

Some of the switches I am using:

-X: Exclude extra file attributes (permissions, ownership, anything that adds extra bytes)

-r: Recurse into directories

-9: Slowest but most optimized compression

-D: Do not create directory entries in the zip archive

-x .DS_Store: Don’t include Mac OS X’s hidden file of snapshots etc.

The next post in this series discusses sentiment analysis.

Create an eBook

One of my all-time favorite movies is Picnic at Hanging Rock by Peter Weir. Every scene is a painting, and the atmosphere transports you back to the Australian bush of 1900. The movie is based on a book by Joan Lindsay, who had the genius to leave the plot’s main mystery unresolved. During her lifetime she never discouraged anyone from claiming the book was based on real events. After her death in 1984 a “lost” final chapter was discovered, which purportedly resolved the mystery. Most (including myself) believe the final chapter is a hoax.
Recently on R-bloggers there has been a run of articles discussing sentiment analysis. I thought it would be fun to text mine and sentiment analyze Picnic. I purchased a paperback version of the book years ago, which I read while on vacation.

My book is old and the pages are yellowing. Time to preserve it for posterity.
In this post I will discuss converting a paperback into an ebook. Future posts will discuss the text mining/sentiment analysis. The steps are:

  1. Cut off the spine
  2. Scan the pages, one image per page
  3. Perform OCR (optical character recognition)
  4. Assemble the text in page order
  5. Proofread

As an aside, one of the most impressive pieces of crowdsourcing software I have seen is Project Gutenberg’s Distributed Proofreaders website. Dump in your scanned images and the site will coordinate proofreading and text assembly. Procedures are in place for managing the workflow, resolving discrepancies, motivating volunteers, etc. Picnic doesn’t qualify for this treatment as it is not in the public domain. I will have to do it myself.

Cut off the spine

I used a single edge razor blade. Cut as smoothly and straight as possible. Keep the pages in numerical order.

Scan

I have an HP OfficeJet 5610 All-in-One multifunction printer equipped with a document feeder. I am working with Debian Linux, so I use Xsane as the scanning software. Searching the web I find that there is a lot of discussion concerning the optimum resolution, color, and file format that should be used for images destined for OCR. I decided on 300dpi grayscale TIFF, which in retrospect was a good choice. I load one chapter at a time onto the document feeder, positioned such that the smooth edge enters the feeder first. This results in odd pages being rotated 90 degrees counterclockwise, and even pages being rotated 90 degrees clockwise. Xsane will auto-number the images, but I supply a prefix following the convention “chNN[e|o]-NNNN”, where NN is the chapter number, e or o stands for even or odd page numbers, and NNNN is the Xsane-supplied image number. The image number starts at 1 for each set (even or odd) of chapter pages.
Once all images are scanned, I need to rotate them either 90 or 270 degrees to prepare for OCR, using the rotate.image function from the adimpro package. I use the following code, depositing the rotated images in a separate directory:

library("adimpro")

# populate a vector with all image file names
all.files <- list.files(paste(getwd(), "/rawimages", sep=""))

for(i in 1:length(all.files)){
  img <- read.image(paste(getwd(), "/rawimages/", all.files[i], sep=""))

  # the e/o marker is the 4th character for single-digit chapters
  # (14-character file names) and the 5th for two-digit chapters
  if(nchar(all.files[i])==14){enum <- 4}else{enum <- 5}

  if( substring(all.files[i], enum, enum)=="e" ){
    img <- rotate.image(img, angle = 270, compress=NULL)
  }
  if( substring(all.files[i], enum, enum)=="o" ){
    img <- rotate.image(img, angle = 90, compress=NULL)
  }
  write.image(img, file = paste(getwd(), "/rotatedimages/", all.files[i], sep=""))
}

OCR

Next I perform OCR on each image. I use tesseract from Google, which has a Debian package.


for(i in 1:length(all.files)){
  # strip the 5-character extension; tesseract adds ".txt" to the output base name
  basefile <- substring(all.files[i], 1, nchar(all.files[i])-5)
  system( paste("tesseract", paste(getwd(),"/rotatedimages/", all.files[i], sep=""), paste(getwd(),"/textfiles/", basefile, sep=""), sep=" "))
}
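
If a file name or directory ever contains a space, the pasted command will break. A minimal sketch of a more defensive way to build the same command is shown below. The helper ocr.command and the directory names are hypothetical, and the ".tiff" extension is an assumption inferred from the five characters stripped above. The sketch only builds the command string; passing it to system() would run the OCR.

```r
ocr.command <- function(image.file, in.dir, out.dir) {
  # drop the extension, whatever its length
  base <- tools::file_path_sans_ext(basename(image.file))
  # shQuote() protects paths that contain spaces or shell metacharacters
  paste("tesseract",
        shQuote(file.path(in.dir, image.file)),
        shQuote(file.path(out.dir, base)))
}

cmd <- ocr.command("ch1o-0001.tiff", "rotatedimages", "textfiles")
print(cmd)
```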


Seems to work well. Here is a comparison of image and text:

Assemble text

I need to create a table of textfile name, page number, words per page etc. to coordinate assembly of the final text and assist with future text mining. Here are the contents of the all.files variable:

> all.files <- list.files(paste(getwd(), "/textfiles", sep=""))
> all.files
[1] "ch10e-0001.txt" "ch10e-0002.txt" "ch10e-0003.txt" "ch10e-0004.txt"
[5] "ch10e-0005.txt" "ch10e-0006.txt" "ch10o-0001.txt" "ch10o-0002.txt"
[9] "ch10o-0003.txt" "ch10o-0004.txt" "ch10o-0005.txt" "ch10o-0006.txt"
[13] "ch11e-0001.txt" "ch11e-0002.txt" "ch11e-0003.txt" "ch11e-0004.txt"
[17] "ch11o-0001.txt" "ch11o-0002.txt" "ch11o-0003.txt" "ch11o-0004.txt"
.....

Make a data.frame extracting relevant information from the filenames:

all.files <- list.files(paste(getwd(), "/textfiles", sep=""))

d <- data.frame(matrix(ncol = 11, nrow = 190))
names(d) <- c("file.name","bname","chpt","eo","page","lines","words","img.num","text","problems","pnumber")

d$file.name <- all.files

for(i in 1:nrow(d)){
  numc <- nchar(d[i,"file.name"])
  d[i,"bname"] <- substring( d[i,"file.name"], 1, numc - 4) # drop ".txt"
  if(numc==13){ # single-digit chapter, e.g. ch1o-0001.txt
    d[i,"chpt"] <- as.numeric(as.character(substring(d[i,"file.name"], 3, 3)))
    d[i,"eo"] <- substring( d[i,"file.name"], 4, 4)
    d[i,"img.num"] <- substring( d[i,"file.name"], 6, 9)
  }else{ # two-digit chapter, e.g. ch10e-0001.txt
    d[i,"chpt"] <- as.numeric(as.character(substring( d[i,"file.name"], 3, 4)))
    d[i,"eo"] <- substring( d[i,"file.name"], 5, 5)
    d[i,"img.num"] <- substring( d[i,"file.name"], 7, 10)
  }
}
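
The same fields can also be pulled out with a single regular expression, which removes the need to count characters. This is a minimal sketch under the assumption that every file follows the chNN[e|o]-NNNN.txt pattern; parse.names is a hypothetical helper and only fills the columns derived from the file name.

```r
parse.names <- function(file.name) {
  # groups: (chapter number)(e or o)-(four-digit image number)
  pattern <- "^ch([0-9]+)([eo])-([0-9]{4})\\.txt$"
  data.frame(
    file.name = file.name,
    bname     = sub("\\.txt$", "", file.name),
    chpt      = as.numeric(sub(pattern, "\\1", file.name)),
    eo        = sub(pattern, "\\2", file.name),
    img.num   = sub(pattern, "\\3", file.name),
    stringsAsFactors = FALSE
  )
}

parsed <- parse.names(c("ch1o-0001.txt", "ch10e-0006.txt"))
print(parsed)
```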

Read in all the pages of text using the read_lines function from the readr package:

library(readr)
pages <- vector(mode="list", length=nrow(d))

for(i in 1:nrow(d)){
pages[i] <- list(read_lines( paste( getwd(), "/textfiles/", d[i,"bname"], ".txt" , sep=""), skip =0))
}

If I look at some random pages, I can see that usually the second to the last line has the page number, when it exists on a page:

>pages[[10]]
.......
[63] "needed, the poor young things . . ."
[64] "As soon- as he Could escape from his Aunt’s dinner table"
[65] "1 I 7" #actually page 117
[66] ""


>pages[[21]]
.......
[37] "The gold padlock on the Head’s heavy chain bracelet rattled"
[38] ""
[39] "142"
[40] ""
>

Many of the page numbers are corrupt, i.e. the OCR has thrown in random characters by mistake. I make note of these characters and use gsub to get rid of them. Some escape my efforts, but enough are accurate that I can compare the extracted page number to the expected page number, determined by the order in which the pages were fed into the scanner.
I will extract the second to the last line (stll) and include it in my table:

pnumber <- vector(mode = "list", length = nrow(d))

for(i in seq_len(nrow(d))){
  n.lines <- length(pages[[i]])
  stll <- if(n.lines >= 2) pages[[i]][n.lines - 1] else NA_character_ # second to last line
  # get rid of: ' . : - x |
  pnumber[[i]] <- gsub("['.:x|-]", "", stll)
  # anything that is still not a clean integer becomes NA
  d[i,"pnumber"] <- suppressWarnings(as.integer(pnumber[[i]]))
}

For the expected page number, create a column “chpteo” which is the concatenation of the chapter number and e or o for even or odd. Then number the odd pages and the even pages sequentially in steps of 2.

d$chpteo <- paste0(d$chpt, d$eo)

# odd pages were scanned in reading order: sort by chapter, then image number
odds <- d[d$eo=="o",]
odds <- odds[ order(as.numeric(odds$chpt), as.numeric(odds$img.num)),]
odds$page <- seq(1,189,by=2)

# even pages were scanned in reverse: sort by chapter, then image number descending
evens <- d[d$eo=="e",]
evens <- evens[ order(as.numeric(evens$chpt), -as.numeric(evens$img.num)),]
evens$page <- seq(2,190,by=2)

d2 <- rbind(evens, odds)
d2 <- d2[order(d2$page),]

Here is what my data.frame “d2” looks like:

> head(d2)
file.name bname chpt eo page lines words img.num text problems
91 ch1o-0001.txt ch1o-0001 1 o 1 NA NA 0001 NA NA
92 ch1o-0002.txt ch1o-0002 1 o 3 NA NA 0002 NA NA
93 ch1o-0003.txt ch1o-0003 1 o 5 NA NA 0003 NA NA
94 ch1o-0004.txt ch1o-0004 1 o 7 NA NA 0004 NA NA
95 ch1o-0005.txt ch1o-0005 1 o 9 NA NA 0005 NA NA
96 ch1o-0006.txt ch1o-0006 1 o 11 NA NA 0006 NA NA
pnumber chpteo
91 NA 1o
92 NA 1o
93 NA 1o
94 7 1o
95 9 1o
96 NA 1o

“page” is the expected page number based on scanning order.
“pnumber” is the OCR extracted page. Compare them:

> d2[,c("page","pnumber")]
page pnumber
91 1 NA
92 3 NA
93 5 NA
94 7 7
95 9 9
96 11 NA
97 13 NA
98 15 NA
99 17 17
100 19 NA
109 21 NA
110 23 23
111 25 95
112 27 97
113 29 29
114 31 31
115 33 33
116 35 35
122 37 37
123 39 39
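
Rather than scanning the two columns by eye, the comparison can be summarized in a few lines. The sketch below uses a handful of rows copied from the output above; check is a hypothetical stand-in for d2[,c("page","pnumber")].

```r
# a few rows taken from the comparison above
check <- data.frame(page    = c(1, 3, 5, 7, 9, 25, 27, 29),
                    pnumber = c(NA, NA, NA, 7, 9, 95, 97, 29))

readable <- !is.na(check$pnumber)                     # OCR produced a number
agree <- check$page[readable] == check$pnumber[readable]

# rows where a number was read but does not match the expected page
mismatches <- check[readable, ][!agree, ]

cat(sum(agree), "of", sum(readable), "readable page numbers match\n")
print(mismatches)
```

Mismatches such as 25 read as 95 are digit-level OCR errors rather than ordering problems, which is why a few of them are tolerable here.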

Looks good. There are some OCR errors but enough come through to verify that the order is correct. Now I can sort on page and use that order to assemble the ebook. Read each page file and append to an output file. Since I want to be able to refer to images to correct problems, I also insert the image information between text files:


####write it all out
out.file <- paste(getwd(), "/ebook-draft/output.txt", sep="")

for(i in 1:nrow(d2)){

a <- readLines(con = paste(getwd(), "/textfiles/",d2[i,"file.name"] ,sep=""), n = -1L, ok = TRUE, warn = TRUE,encoding = "unknown", skipNul = FALSE)
cat( paste(d2[i,"file.name"], "\n\n", sep=""),file = out.file, fill=80, append=TRUE)
cat(a, file = out.file, fill=80, append=TRUE)
cat("\n\n", file = out.file, fill=80, append=TRUE)
}

Here is what a page junction looks like:

 

You can see the page number when present, which will provide a method to confirm the correct order. The file name is included, which will allow me to go back to the original image during the proofreading process to verify words I may be uncertain of.

Proofread

It would be nice to have the image and text juxtaposed during the proofreading process. To see what this looks like, take a look at Project Gutenberg’s Distributed Proofreaders website. I will have to read on a device that allows me to refer to the images when needed. Once the proofreading is complete, I will be ready for sentiment analysis.

The next post in this series discusses text manipulation.

Safari

WSJ

Dorobo

Aardvark

Guile

BU 353 USB GPS unit

Associate GPS coordinates with a street address

Init.el

;;;; Guile/Lisp  Setup
(load-file "~/.emacs.d/elpa/geiser-20170325.1956/geiser.el")

Guile alternatives to DSLs

From https://ambrevar.xyz/guix-advance/

  • XML, HTML (better idea: S-XML)
  • Make, Autoconf, Automake, CMake, etc.
  • Bash, Zsh, Fish (better ideas: Eshell or scsh)
  • JSON, TOML, YAML
  • Nix language, Portage’s Ebuild and many other OS package definition syntax rules.
  • Firefox when it used XUL (but since then Mozilla has moved on) and most other homebrewed extensibility languages
  • SQL
  • Octave, R, PARI/GP, most scientific programs (better ideas: Common Lisp, Racket and other Schemes)
  • Regular expressions (better ideas: Emacs’ rx, Racket’s PEG, etc.)
  • sed, AWK, etc.
  • Most init system configurations, including systemd (better idea: GNU Shepherd)
  • cron (better idea: mcron)
  • conky (not fully programmable while this is probably the main feature you would expect from such a program)
  • TeX, LaTeX (and all the derivatives), Asymptote (better ideas: scribble, skribilo – still in development and as of January 2019 TeX/LaTeX are still used as an intermediary step for PDF output)
  • Most programs with configurations that don’t use a general-purpose programming language.

Issues

Install guile 2.2.7; can’t find libffi

https://guile-user.gnu.narkive.com/qcZSj1pL/libffi-not-found-even-if-installed-in-default-path

$ find /usr -name 'libffi.*'
/usr/lib/libffi.dylib
/usr/local/lib/libffi.5.dylib
/usr/local/lib/libffi.a
/usr/local/lib/libffi.dylib
/usr/local/lib/libffi.la
/usr/local/lib/pkgconfig/libffi.pc
/usr/local/share/info/libffi.info
However, it did not help, and the ./configure still fails with the very same error message.
Did you try running configure as something like

‘PKG_CONFIG_PATH=/usr/local/lib/pkgconfig ./configure’

My command:

PKG_CONFIG_PATH=/usr/lib/x86_64-linux-gnu/pkgconfig ./configure

Configuration

In .bash_profile (not .bashrc)
(see https://guix.gnu.org/manual/en/html_node/Invoking-guix-environment.html#Invoking-guix-environment)

GUIX_PROFILE="$HOME/.guix-profile" ;
source "$HOME/.guix-profile/etc/profile"

PKG_CONFIG_PATH

Missing development packages error

sudo find / -name "guile*.pc"
Look at the directories that turn up. Are they present in the output of
echo $PKG_CONFIG_PATH
Say you find one in /usr/local/lib/pkgconfig; then add to .bashrc:
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig

%load-path

Is it in the path:
scheme@(guile-user)> (search-path %load-path "dbi/dbi.scm")

# find / -wholename '*/dbi/dbi.scm'
Note that I find /usr/local/share/guile/site/2.2/dbi/dbi.scm, so to (use-modules (dbi dbi)) I must add to .bashrc:

export GUILE_LOAD_PATH="/usr/local/share/guile/site/2.2:/home/mbc/.guix-profile/share/guile/site/3.0${GUILE_LOAD_PATH:+:}$GUILE_LOAD_PATH"

copy-paste-stack-overflow
