History of Book Scanning and Google Books

Date of Class: 1/31
First due: 2/7
Comments due: 2/14
Revisions due: 2/21

During this seminar, we discussed the history of book scanning and Google Books.

To begin, we dove into the article “The Dark History of HathiTrust” by Alissa Centivany, and the accompanying “UM-Google Cooperative Agreement”. We first asked whether this counts as a “dark” history. The class discussed whether Google or Michigan committed any moral misconduct, and the extent to which Michigan went over the CIC’s head. We reached something of a consensus that although UM and UI acted quickly in order to secure funding and complete the project, they were relatively responsible in how they handled the aftermath. In some ways, we believed this was a brave decision and a monumental event in book scanning history. This raised the question of whether this kind of “fast and loose” approach to book scanning, adopted in order to get the project running, was appropriate considering the nature of the content. The article talks about the “moral obligation” to scan, a value shared by both the librarians at Michigan and the technologists at Google.

We concluded the discussion of this article by looking at some online comments on an article published around the same time (2008) about the release of HathiTrust. There, many people shared concerns about the dwindling physical presence of libraries and about paper resources being largely replaced by digital ones. We evaluated whether this holds true now, twelve years later, and the class had mixed accounts. Some students said that they still check out physical books from Hunt Library and the Carnegie libraries, but that they feel they are often among the only ones still doing so. The Sorrells Library redesign was also mentioned, in which many of the books that once lined the walls of the library were removed to make space for study corners and alternative working spaces.

We also had an interesting discussion about our often preconceived notion that a digital resource is timeless and permanent because of its format, when in many cases it isn’t. Although there is an internet web archive, many important pages (such as digital humanities conference pages) are completely lost once the event has passed, along with all of the valuable data they may have hosted.

The class then spent some time together exploring Bookworm, the HathiTrust ngram exploration tool. Immediately, we decided that the interface of Bookworm was much cleaner and more user-friendly than that of the Google Ngram Viewer, although the two do seem to serve different purposes. The filtering function was particularly interesting to look at, as there were some fields (such as “Is Gov”) that seemed to be remnants of stray tagging or librarian error. It was relevant to see this after discussing librarian culture and the clash between librarians and technologists during the HathiTrust scanning, and it gave us another argument about the downsides of rushing such a large project.

Following this connection between librarian culture and the moral obligation to scan, we moved on to discuss the “Million Books Project”. We first identified a direct connection in a quote from journalist Kevin Kelly, “‘The desire of all creators is for their works to find their way into all minds,’” and talked about whether the creators of this printed content really do want their work to be scanned and digitized. This raised a point about publishing today: when writers create work now, it exists primarily in digital form, so printing and scanning become less necessary. Some students pointed out that if you email the authors of many paywalled articles or publications, they will send you the PDF for free. This changed the context of the conversation. Throughout this discussion, I found we were repeatedly trying to imagine what reputation book scanning had in 2004 or 2008, versus the one it has now. The difference in the presence of technology and tech conglomerates in our lives makes it more challenging to assess the reactions to each scanning project. We also discussed why this article was written. It appears that in some ways it was a response to the HathiTrust project, especially the buzz that Michigan was receiving from working with this “hip” tech company. Reputation, which we talked about with the previous article as well, seemed to play a huge role in the implementation and development of this project, and this article seemed in part to be a little shout-out to Carnegie Mellon University and the other people on this project for doing very similar work first.

More interesting points brought up:
- Although we didn’t leave enough time to really dive into it, the process of converting scanned lines into code, as outlined by Michael Gavin in “How To Think About EEBO”, was fascinating and worth exploring further. We thought it was bizarre how similar the markup was to HTML, except that it was categorizing the lines of a sonnet.
- Are digitization tools tools for librarians, or are they making librarians obsolete? We discussed briefly how the role of the librarian is changing as digitization changes the scope of the content.