• Date of Class: 2/14
  • First due: 2/21
  • Comments due: 2/28
  • Revisions due: 3/6

In this class, we discussed the technologies of book scanning, including OCR, material-specific scanning processes, and Google’s scanning and de-warping protocols, as well as the errors these processes introduce.

To start the class, we had a conversation about the paper A Research Agenda for Historical and Multilingual Optical Character Recognition. As part of this conversation, we discussed our initial assumption that book scanning was a ‘one and done’ process, and how this turns out not to be true. We concluded that as OCR technologies continue to improve, repositories like Google and HathiTrust should consider re-OCRing their materials to improve accuracy. We also discussed whether these repositories have a responsibility to continue to improve and maintain their scans, and decided that as long as companies like Google claimed a ‘moral imperative’ to scan books, they also had a moral imperative to scan, OCR, and maintain those books to a high standard. We noted that one benefit of a physical book over a digital copy is that it probably does not require the same ongoing level of maintenance.
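To make the idea of re-OCRing concrete, here is a minimal sketch of re-running a newer OCR engine over an already-scanned page image. It assumes Tesseract and the pytesseract Python bindings are installed; the file name page_scan.png is a hypothetical placeholder, and this is not the actual pipeline used by Google or HathiTrust.

```python
# Minimal sketch: re-run OCR on a stored page image with a current engine.
# Assumes Tesseract and pytesseract are installed; "page_scan.png" is a
# hypothetical placeholder, not a real repository asset.
from PIL import Image
import pytesseract

def reocr_page(image_path: str, lang: str = "eng") -> str:
    """Run OCR over an existing page scan and return the recognized text."""
    page = Image.open(image_path)
    # Convert to grayscale; older scans often recognize better this way.
    page = page.convert("L")
    return pytesseract.image_to_string(page, lang=lang)

if __name__ == "__main__":
    text = reocr_page("page_scan.png")
    print(text[:500])  # preview the first 500 recognized characters
```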

Additionally, we discussed how this article’s recommendations about OCR technology reflected a slower, more thorough academic approach to scanning, as opposed to Google’s, and concluded that this kind of intentional planning was likely very valuable, especially because it would help to create standards that digitizers could be held to.

Next, we reviewed the articles Google Books: Making the public domain universally accessible and Google Book Search: Document Understanding on a Massive Scale. We discussed the fact that we hadn’t seen many of the innovations Google Books had intended to implement (e.g., book summarizations or the ability to download a PDF of any book) and how these innovations might have changed the way we approach books. We concluded that we probably wouldn’t trust, and would feel concerned about, Google summarizing books for the public, especially if we didn’t know exactly how the summarization worked. We also talked about how access to PDFs changes the way we interact with books (searching vs. flipping through, reading the whole book instead of snippets), and about the cultural implications of the loss of physical books – for example, books no longer being treasured possessions or objects we feel a close personal connection to.

Third, we discussed Measuring Mass Text Digitization Quality and Usefulness: Lessons Learned from Assessing the OCR Accuracy of the British Library’s 19th Century Online Newspaper Archive. As part of this discussion, we spoke about how surprised we were by the high levels of error in some of these corpora, and about the implications of these errors for research. We concluded that the magnitude of these errors makes it difficult for researchers to produce statistically rigorous, complete, and replicable results, which perhaps makes tools like Google Ngrams more valuable for identifying possible projects than for actually executing them. During this conversation, we also discussed how these papers (and previous readings) have illustrated the importance of involving historians and librarians in the digitization process – for example, a historian could have helped the OCR experts deal with the ‘suck’ vs. ‘fuck’ typeface challenge we discussed earlier in the course while they were initially training their algorithms.
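To make the notion of “level of error” concrete, the sketch below computes a character error rate (CER) by comparing OCR output against a hand-corrected transcription using edit distance. The sample strings are invented, and this is only an illustration of the general idea, not the British Library study’s actual methodology.

```python
# Hypothetical sketch of measuring OCR quality as a character error rate (CER):
# edit distance between OCR output and a ground-truth transcription, divided by
# the length of the ground truth. The sample strings below are invented.

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def character_error_rate(ocr_text: str, ground_truth: str) -> float:
    """Fraction of ground-truth characters that would need editing."""
    return edit_distance(ocr_text, ground_truth) / max(len(ground_truth), 1)

if __name__ == "__main__":
    truth = "The quick brown fox jumps over the lazy dog."
    ocr = "Tbe qu1ck brovvn fox jumps ovcr the 1azy d0g."
    print(f"CER: {character_error_rate(ocr, truth):.2%}")
```

Even a CER of a few percent can matter: errors concentrated in names or dates can silently drop records from keyword searches, which is why aggregate tools like Ngrams are better for exploration than for final analysis.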

Finally, we covered Preserving Imperfection: Assessing the Incidence of Digital Imaging Error in HathiTrust. We talked about how the different researchers discussed in this paper had different standards for acceptable scans, and about what we ourselves considered an acceptable scan. We concluded that the criteria probably differ widely depending on the intended use of the corpus – and that scanners should probably aim for a scan quality much better than the lowest common denominator. We also spoke about the different errors found in HathiTrust, and how interesting it was that books in terrible condition often produced the best scans. We thought it was particularly interesting that the article concluded that “severe [scanning] error occurs largely independent of physical form” (24) and that the scanning process created artifacts that would preserve these errors.