2016-05-06

Create an eBook

One of my all-time favorite movies is Picnic at Hanging Rock by Peter Weir. Every scene is a painting, and the atmosphere transports you back to the Australian bush of 1900. The movie is based on a book by Joan Lindsay, who had the genius to leave the plot’s main mystery unresolved. During her lifetime she never discouraged anyone from claiming the book was based on real events. After her death in 1984 a “lost” final chapter was discovered, which purportedly resolved the mystery. Most readers (myself included) believe the final chapter is a hoax.
Recently on R-bloggers there has been a run of articles discussing sentiment analysis. I thought it would be fun to text mine and sentiment analyze Picnic. I purchased a paperback version of the book years ago, which I read while on vacation.

My book is old and the pages are yellowing. Time to preserve it for posterity.
In this post I will discuss converting a paperback into an ebook. Future posts will discuss the text mining/sentiment analysis. The steps are:

Cut off the spine
Scan the pages, one image per page
Perform OCR (optical character recognition)
Assemble the text in page order
Proofread

As an aside, one of the most impressive pieces of crowdsourcing software I have seen is Project Gutenberg’s Distributed Proofreaders website. Dump in your scanned images and the site will coordinate proofreading and text assembly. Procedures are in place for managing the workflow, resolving discrepancies, motivating volunteers, etc. Picnic doesn’t qualify for this treatment as it is not in the public domain. I will have to do it myself.

Cut off the spine

I used a single-edge razor blade. Cut as smoothly and as straight as possible. Keep the pages in numerical order.

Scan

I have an HP OfficeJet 5610 All-in-One multifunction printer equipped with a document feeder. I am working with Debian Linux, so I use Xsane as the scanning software. Searching the web, I find a lot of discussion about the optimum resolution, color, and file format for images destined for OCR. I decided on 300 dpi grayscale TIFF, which in retrospect was a good choice. I load one chapter at a time onto the document feeder, positioned so that the smooth edge enters the feeder first. This results in odd pages being rotated 90 degrees counterclockwise and even pages being rotated 90 degrees clockwise. Xsane will auto-number the images, but I supply a prefix following the convention “chNN[e|o]-NNNN”, where NN is the chapter number (one or two digits), e or o stands for even or odd page numbers, and NNNN is the Xsane-supplied image number. The image number starts at 1 for each set (even or odd) of chapter pages.
Once all images are scanned, each one must be rotated by either 90 or 270 degrees to prepare for OCR, using the rotate.image function from the adimpro package. I use the following code, depositing the rotated images in a separate directory:

library("adimpro")

#populate a vector with all image file names
all.files <- list.files(file.path(getwd(), "rawimages"))

for(i in seq_along(all.files)){

    img <- read.image(file.path(getwd(), "rawimages", all.files[i]))
    #single digit chapters ("ch1o-0001.tiff", 14 characters) have e/o at position 4,
    #double digit chapters ("ch10o-0001.tiff") have it at position 5
    if(nchar(all.files[i])==14){enum <- 4}else{enum <- 5}
    eo <- substring(all.files[i], enum, enum)

    if( eo=="e" ){
        img <- rotate.image(img, angle = 270, compress=NULL)
    }
    if( eo=="o" ){
        img <- rotate.image(img, angle = 90, compress=NULL)
    }
    write.image(img, file = file.path(getwd(), "rotatedimages", all.files[i]))
}
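The even/odd test above depends on the length of the file name. A possible alternative is to pull the e/o marker out with a regular expression. This is only a sketch: rotation_angle is a name made up for illustration, and it assumes every file name follows the chNN[e|o]-NNNN convention followed by an extension.

```r
# Hypothetical helper: choose the rotation angle from the e/o marker in a
# file name such as "ch1o-0001.tiff" or "ch10e-0003.tiff".
rotation_angle <- function(fname) {
    # capture the single letter between the chapter number and the dash
    eo <- sub("^ch[0-9]+([eo])-[0-9]+\\..*$", "\\1", fname)
    # sub() returns its input unchanged when the pattern does not match
    if (identical(eo, fname)) {
        stop("file name does not follow the convention: ", fname)
    }
    # same angles as the loop above: 270 for even pages, 90 for odd pages
    if (eo == "e") 270 else 90
}

rotation_angle("ch1o-0001.tiff")
rotation_angle("ch10e-0003.tiff")
```

A file that does not match the convention raises an error instead of being written out unrotated.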

OCR

Next perform OCR on each image. I use tesseract from Google which has a Debian package.


for(i in seq_along(all.files)){
    #strip the 5 character ".tiff" extension; tesseract appends ".txt" to the output base name
    basefile <- substring(all.files[i], 1, nchar(all.files[i])-5)
    system( paste("tesseract", file.path(getwd(), "rotatedimages", all.files[i]), file.path(getwd(), "textfiles", basefile), sep=" "))
}
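If a directory name ever contains a space, the pasted command will break. One way to guard against that is to quote each path before handing it to the shell. A minimal sketch, where tesseract_command and its in.dir and out.dir arguments are hypothetical names chosen for illustration; it only builds the command string and does not run it:

```r
# Hypothetical helper: build the tesseract command line for one image.
# tesseract takes the input image and an output base name (it adds ".txt").
tesseract_command <- function(img.name, in.dir = "rotatedimages", out.dir = "textfiles") {
    base <- tools::file_path_sans_ext(img.name)   # drop the extension
    paste("tesseract",
          shQuote(file.path(in.dir, img.name)),
          shQuote(file.path(out.dir, base)))
}

cmd <- tesseract_command("ch1o-0001.tiff")
cmd
# system(cmd)   # uncomment to run; requires tesseract to be installed
```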

Seems to work well. Here is a comparison of image and text:

Assemble text

I need to create a table of textfile name, page number, words per page etc. to coordinate assembly of the final text and assist with future text mining. Here are the contents of the all.files variable:

> all.files <- list.files(paste(getwd(), "/textfiles", sep=""))
> all.files
  [1] "ch10e-0001.txt" "ch10e-0002.txt" "ch10e-0003.txt" "ch10e-0004.txt"
  [5] "ch10e-0005.txt" "ch10e-0006.txt" "ch10o-0001.txt" "ch10o-0002.txt"
  [9] "ch10o-0003.txt" "ch10o-0004.txt" "ch10o-0005.txt" "ch10o-0006.txt"
 [13] "ch11e-0001.txt" "ch11e-0002.txt" "ch11e-0003.txt" "ch11e-0004.txt"
 [17] "ch11o-0001.txt" "ch11o-0002.txt" "ch11o-0003.txt" "ch11o-0004.txt"
 .....

Make a data.frame extracting relevant information from the filenames:

all.files <- list.files(file.path(getwd(), "textfiles"))

#one row per text file (190 for my book); the remaining columns are filled in later
d <- data.frame(matrix(ncol = 11, nrow = length(all.files)))
names(d) <- c("file.name","bname","chpt","eo","page","lines","words","img.num","text","problems","pnumber")

d$file.name <- all.files

for(i in 1:nrow(d)){
    numc <- nchar(d[i,"file.name"])
    d[i,"bname"] <- substring( d[i,"file.name"], 1, numc - 4)   #drop ".txt"
    if(numc==13){
        #single digit chapter, e.g. "ch1o-0001.txt"
        d[i,"chpt"] <- as.numeric(substring( d[i,"file.name"], 3, 3))
        d[i,"eo"] <- substring( d[i,"file.name"], 4, 4)
        d[i,"img.num"] <- substring( d[i,"file.name"], 6, 9)
    }else{
        #double digit chapter, e.g. "ch10e-0001.txt"
        d[i,"chpt"] <- as.numeric(substring( d[i,"file.name"], 3, 4))
        d[i,"eo"] <- substring( d[i,"file.name"], 5, 5)
        d[i,"img.num"] <- substring( d[i,"file.name"], 7, 10)
    }
}
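The same fields can also be extracted without counting character positions. The following is a sketch rather than the code I ran: parse_page_file is a hypothetical name, and the pattern assumes the names always look like the ones listed above.

```r
# Hypothetical helper: split names like "ch1o-0001.txt" into their parts.
parse_page_file <- function(file.name) {
    pattern <- "^ch([0-9]+)([eo])-([0-9]+)\\.txt$"
    # refuse to continue if any name does not follow the convention
    stopifnot(all(grepl(pattern, file.name)))
    data.frame(
        file.name = file.name,
        bname     = sub("\\.txt$", "", file.name),
        chpt      = as.numeric(sub(pattern, "\\1", file.name)),
        eo        = sub(pattern, "\\2", file.name),
        img.num   = sub(pattern, "\\3", file.name),   # kept as text, e.g. "0001"
        stringsAsFactors = FALSE
    )
}

parsed <- parse_page_file(c("ch1o-0001.txt", "ch10e-0006.txt"))
parsed
```

Because sub works on whole vectors, no loop is needed and one- and two-digit chapters are handled by the same pattern.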

Read in all the pages of text using the read_lines function from the readr package:

library(readr)
pages  <- vector(mode="list", length=nrow(d))

for(i in 1:nrow(d)){
        pages[i] <-  list(read_lines( paste( getwd(), "/textfiles/", d[i,"bname"], ".txt" , sep=""), skip =0))
}
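With the pages in a list, the “lines” and “words” columns of the table can be computed directly. A minimal sketch on a made-up two-page list (pages.demo, n.lines and n.words are names introduced here), counting a “word” as any run of non-whitespace characters:

```r
# Toy stand-in for the "pages" list: one character vector of lines per page.
pages.demo <- list(
    c("The gold padlock on the Head's heavy chain bracelet rattled", "", "142", ""),
    c("needed, the poor young things . . .", "1 I 7", "")
)

# number of lines on each page, blank lines included
n.lines <- lengths(pages.demo)

# number of whitespace-separated tokens on each page; empty pieces are ignored
n.words <- sapply(pages.demo, function(p) {
    tokens <- unlist(strsplit(p, "\\s+"))
    sum(nzchar(tokens))
})

data.frame(lines = n.lines, words = n.words)
```

Note that this simple definition counts the page number and stray punctuation as words, so the totals are slightly inflated until the text has been cleaned.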

If I look at some random pages, I can see that usually the second to the last line has the page number, when it exists on a page:

>pages[[10]]
.......
[63] "needed, the poor young things . . ."                            
[64] "As soon- as he Could escape from his Aunt’s dinner table"       
[65] "1 I 7"                     #actually page 117                                     
[66] ""


>pages[[21]]
.......
[37] "The gold padlock on the Head’s heavy chain bracelet rattled"   
[38] ""                                                              
[39] "142"                                                           
[40] ""                                                              
>

Many of the page numbers are corrupt, i.e. the OCR has thrown in random characters by mistake. I make note of these characters and use gsub to get rid of them. Some escape my efforts, but enough are accurate that I can compare the extracted page number to the expected page number, determined by the order in which the pages were fed into the scanner.
I will extract the second to the last line (stll) and include it in my table:

pnumber <- vector(mode="list", length=nrow(d))

for(i in 1:nrow(d)){
    n <- length(pages[[i]])
    stll <- if(n >= 2) pages[[i]][n-1] else ""   #second to last line
    #get rid of:  ' . : - x |
    #inside [] the characters . and | are literal; - is literal when placed last
    pnumber[[i]] <- gsub("['.:x|-]", "", stll)
    #anything that is still not a number becomes NA
    d[i,"pnumber"] <- suppressWarnings(as.integer(pnumber[[i]]))
}
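To see how the stray-character cleanup behaves, it helps to try it on a few sample strings. The sketch below wraps the idea in a hypothetical function, clean_page_number; the inputs other than "142" and "1 I 7" are invented for illustration.

```r
# Hypothetical helper: strip the characters ' . : - x | and convert what is
# left to an integer; anything still non-numeric becomes NA.
clean_page_number <- function(x) {
    cleaned <- gsub("['.:x|-]", "", x)
    suppressWarnings(as.integer(cleaned))
}

cleaned <- clean_page_number(c("142", "1-4:2", "x23.", "1 I 7", ""))
cleaned
```

The first three strings come out as 142, 142 and 23. The string "1 I 7" (really page 117) is returned as NA, because the space and the letter I are not among the characters being removed. That is the kind of page number that escapes the cleanup.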

For the expected page number, create a column “chpteo”, which is the concatenation of the chapter number and e or o for even or odd. Then number the odd pages 1, 3, 5, ... and the even pages 2, 4, 6, ... in scanning order.

d$chpteo <- paste0(d$chpt, d$eo)

odds <- d[d$eo=="o",]
#sort by chapter, then by ascending image number
#(separate arguments to order(); wrapping them in c() would merge them into one vector)
odds <- odds[ order(odds$chpt, as.numeric(odds$img.num)),]
odds$page <- seq(1, by=2, length.out=nrow(odds))

evens <- d[d$eo=="e",]
#sort by chapter, then by descending image number
evens <- evens[ order(evens$chpt, -as.numeric(evens$img.num)),]
evens$page <- seq(2, by=2, length.out=nrow(evens))

d2 <- rbind(evens, odds)
d2 <- d2[order(d2$page),]
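The interleaving is easier to follow on a small example. Here is a sketch with a made-up chapter of three odd and three even pages (toy and toy2 are names introduced for illustration). It assumes, as the descending sort implies, that within a chapter the even pages were scanned in reverse order.

```r
# Toy chapter with three even and three odd pages, named as in the post.
toy <- data.frame(
    file.name = c("ch1e-0001.txt", "ch1e-0002.txt", "ch1e-0003.txt",
                  "ch1o-0001.txt", "ch1o-0002.txt", "ch1o-0003.txt"),
    chpt    = 1,
    eo      = c("e", "e", "e", "o", "o", "o"),
    img.num = c(1, 2, 3, 1, 2, 3),
    stringsAsFactors = FALSE
)

# odd pages: ascending image number; even pages: descending image number
odds  <- toy[toy$eo == "o", ]
odds  <- odds[order(odds$chpt, odds$img.num), ]
evens <- toy[toy$eo == "e", ]
evens <- evens[order(evens$chpt, -evens$img.num), ]

# number the two sets 1, 3, 5 and 2, 4, 6, then merge on page
odds$page  <- seq(1, by = 2, length.out = nrow(odds))
evens$page <- seq(2, by = 2, length.out = nrow(evens))

toy2 <- rbind(evens, odds)
toy2 <- toy2[order(toy2$page), ]
toy2[, c("page", "file.name")]
```

The last even image (ch1e-0003) ends up as page 2 and the first even image (ch1e-0001) as page 6, which is the effect of the descending sort.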

Here is what my data.frame “d2” looks like:

> head(d2)
       file.name     bname chpt eo page lines words img.num text problems
91 ch1o-0001.txt ch1o-0001    1  o    1    NA    NA    0001   NA       NA
92 ch1o-0002.txt ch1o-0002    1  o    3    NA    NA    0002   NA       NA
93 ch1o-0003.txt ch1o-0003    1  o    5    NA    NA    0003   NA       NA
94 ch1o-0004.txt ch1o-0004    1  o    7    NA    NA    0004   NA       NA
95 ch1o-0005.txt ch1o-0005    1  o    9    NA    NA    0005   NA       NA
96 ch1o-0006.txt ch1o-0006    1  o   11    NA    NA    0006   NA       NA
   pnumber chpteo
91      NA     1o
92      NA     1o
93      NA     1o
94       7     1o
95       9     1o
96      NA     1o

“page” is the expected page number based on scanning order.
“pnumber” is the OCR extracted page. Compare them:

> d2[,c("page","pnumber")]
    page pnumber
91     1      NA
92     3      NA
93     5      NA
94     7       7
95     9       9
96    11      NA
97    13      NA
98    15      NA
99    17      17
100   19      NA
109   21      NA
110   23      23
111   25      95
112   27      97
113   29      29
114   31      31
115   33      33
116   35      35
122   37      37
123   39      39
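The same comparison can be made in code rather than by eye. A minimal sketch using only the 20 rows printed above, typed in by hand (readable and agrees are names introduced for illustration):

```r
# expected page numbers and OCR page numbers for the 20 rows shown above
page    <- seq(1, 39, by = 2)
pnumber <- c(NA, NA, NA, 7, 9, NA, NA, NA, 17, NA,
             NA, 23, 95, 97, 29, 31, 33, 35, 37, 39)

readable <- !is.na(pnumber)              # OCR produced some number
agrees   <- readable & pnumber == page   # and it matches the expected page

sum(readable)               # how many page numbers were read
sum(agrees)                 # how many of those match
page[readable & !agrees]    # expected pages where the OCR number disagrees
```

On these rows, 12 page numbers were readable and 10 of them agree with the expected page. The two disagreements are pages 25 and 27.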

Looks good. There are some OCR errors (pages 25 and 27 were read as 95 and 97), but enough page numbers come through to verify that the order is correct. Now I can sort on page and use that order to assemble the ebook. Read each page file and append to an output file. Since I want to be able to refer to images to correct problems, I also insert the image information between text files:


####write it all out
#append=TRUE adds to an existing file, so delete output.txt before re-running
out.file <- file.path(getwd(), "ebook-draft", "output.txt")

for(i in 1:nrow(d2)){

    a <- readLines(con = file.path(getwd(), "textfiles", d2[i,"file.name"]))
    #file name first, so each page of text can be traced back to its image
    cat( paste(d2[i,"file.name"], "\n\n", sep=""), file = out.file, fill=80, append=TRUE)
    cat(a, file = out.file, fill=80, append=TRUE)
    cat("\n\n", file = out.file, fill=80, append=TRUE)
}
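A variation on the loop above is to open the output file once and write through a connection, which starts from an empty file on every run. The sketch below is self-contained: it works in a temporary directory with two made-up page files, and unlike cat with fill=80 it keeps the line breaks of the OCR output as they are.

```r
# Demo in a temporary directory; the two page files are invented.
demo.dir <- tempfile("ebook-demo")
dir.create(demo.dir)
writeLines(c("first page text", "1"), file.path(demo.dir, "ch1o-0001.txt"))
writeLines(c("second page text", "2"), file.path(demo.dir, "ch1e-0001.txt"))
page.order <- c("ch1o-0001.txt", "ch1e-0001.txt")   # already sorted by page

out.demo <- file.path(demo.dir, "output.txt")
con <- file(out.demo, open = "w")    # "w" truncates any earlier output
for (f in page.order) {
    writeLines(c(f, ""), con)                            # file name marker
    writeLines(readLines(file.path(demo.dir, f)), con)   # page text
    writeLines("", con)                                  # blank separator
}
close(con)

assembled <- readLines(out.demo)
assembled
```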

Here is what a page junction looks like:

You can see the page number when present, which will provide a method to confirm the correct order. The file name is included, which will allow me to go back to the original image during the proofreading process to verify words I may be uncertain of.

Proofread

It would be nice to have the image and text juxtaposed during the proofreading process. To see what this looks like, take a look at Project Gutenberg’s Distributed Proofreaders website. I will have to read on a device that allows me to refer to the images when needed. Once the proofreading is complete, I will be ready for sentiment analysis.

The next post in this series discusses text manipulation.