Thoughts on Google Books


1. Introduction: Against Google-Bashing

I'm not a "Google-basher." Their activities have been of incalculable benefit to me, and I admire them as one of the few organizations with truly long-term (if usually secret) goals. It seems to me, moreover, that the major problems researchers encounter in using Google Books are really due to our misunderstanding of its stated purpose (though this misunderstanding has become institutionalized and is now a problem in itself). Google Books is, however, making one serious error and committing one annoying infelicity. (I'll discuss both here in the hope that Google will actually see these notes and correct these problems in an otherwise extraordinary service.)

2. Google Books Is Not a Digital Library

The basic complaint about Google Books' freely viewable volumes is that the image quality is poor. It is. Eight-point text in mid-19th-century English journals is almost unreadable. The images are also heavily post-processed to reduce file size. This is most noticeable in the plates, which are frequently unusable. (Digitization practice also affects the interpretation of bound volumes. It was common practice in the 20th century to bind volumes of periodicals as if they were a book, discarding advertising material and covers or binding them at the end. With a paper copy, you have some chance of re-associating these materials with their proper issues; with the Google digitizations, this information is lost.) Most annoyingly, Google's digitizers never fold open fold-out plates, rendering them entirely useless. To the antiquarian technologist, these matters can be infuriating.

But we must realize what Google's purpose is. They're quite open about it: their purpose is to make the world's texts searchable. This is not the same as building a digital library. The intended use of Google Books is to search for, and in, a text and then to grab the physical copy from the shelf at hand. I do this often (I have a fair number of books), even when a full-view Google digital version is available. This is also why Google won the lawsuits brought against them over this project.

The full-view books are really just carrots held in front of the academic community to encourage university and other libraries to let Google digitize their collections. We make a profound mistake (and I mean that very seriously) if we consider Google Books a digital library.

So the best way to proceed, as a community, is to take Google for what it is - a very good search engine - and to continue independent efforts to digitize texts at a sufficiently high quality to constitute a proper digital library. Several organizations are doing this, including the Internet Archive, the University of Toronto, the Library of Congress, the Getty, Winterthur, and several European libraries (ETH Zurich, BNF/Gallica, etc.). We owe these institutions a profound debt of gratitude. Unfortunately, we know from information uncovered by the Internet Archive that some university libraries are starting to "deaccession" their books after Google has digitized them, in the entirely false belief that the Google digitization they now hold constitutes a sufficient digital surrogate. (The IA is quite polite, and won't say which libraries are guilty of this.) This is an instance of the partial destruction of civilization and its knowledge by ignorance - but there's nothing new in that.

3. JBIG2 and the Silent Corruption of Texts

Google Books is making one very significant error, however. It's a stupid error to make, and is (or was, depending on how they're storing their digitizations internally) an easy error to correct.

The PDFs supplied for full-view Google Books use two data-compression algorithms: JBIG2 for the text and JPEG2000 (JPX) for the images (or for text treated as an image). Both are "lossy" algorithms, but that in itself isn't the problem. A good lossy algorithm (such as JPEG or JPEG2000) presents an acceptable view of the text or image. More importantly, a good lossy algorithm lets you detect when the text or image has been degraded. If you can't quite read a numeral '6', for example, you see a fuzzy '6' which differs from the other '6's, fuzzy or clear, that appear on the page. You know that the problem is a bad scan (or perhaps bad printing on the original physical page).

The JBIG2 algorithm, however, does not do this. When it detects a '6', it stores a single image of that glyph and then substitutes this same image everywhere else on the page it encounters a '6'. Or, more accurately, everywhere it thinks it encounters a '6': if it mistakes an '8' for a '6', it substitutes a clean '6' in its place. The important point is that once this is done there is no way, in principle, ever to know that the original text actually had an '8'. JBIG2 corrupts texts silently and undetectably.
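
To make the mechanism concrete, here is a toy sketch of symbol-substitution coding in Python. It is not JBIG2 itself - the bitmaps, the distance measure, and the matching threshold are all invented for the demonstration - but it shows the essential failure: a smudged '8' that falls within the matcher's tolerance of the stored '6' is encoded as that '6', and the decoded page shows a crisp, confident '6' with no record that any substitution occurred.

    # Toy illustration of symbol-substitution ("pattern matching") coding,
    # the idea underlying JBIG2's symbol mode. All bitmaps, the distance
    # measure, and the threshold here are invented for the demonstration.

    GLYPH_6 = ("###",        # an idealized 3x5 numeral '6'
               "#..",
               "###",
               "#.#",
               "###")

    SCANNED_8 = ("###",      # a numeral '8' marred by a bad scan:
                 "#.#",
                 "###",
                 "..#",      # a clean '8' would read "#.#" here
                 "###")

    def distance(a, b):
        """Count the mismatched pixels between two equal-sized bitmaps."""
        return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))

    class SymbolCoder:
        """Encode each glyph as an index into a dictionary of representative
        bitmaps; any glyph within THRESHOLD pixels of an existing entry is
        silently replaced by that entry."""
        THRESHOLD = 2

        def __init__(self):
            self.dictionary = []

        def encode(self, glyph):
            for i, rep in enumerate(self.dictionary):
                if distance(glyph, rep) <= self.THRESHOLD:
                    return i               # reuse: the original pixels are discarded
            self.dictionary.append(glyph)
            return len(self.dictionary) - 1

        def decode(self, index):
            return self.dictionary[index]

    coder = SymbolCoder()
    i6 = coder.encode(GLYPH_6)     # the '6' enters the dictionary
    i8 = coder.encode(SCANNED_8)   # distance 2 <= THRESHOLD: matched to the '6'

    assert i6 == i8                       # both decode to the same crisp '6' ...
    print("\n".join(coder.decode(i8)))    # ... and nothing records the error

A real JBIG2 encoder is vastly more sophisticated, but the trade-off is structural: aggressive symbol matching is precisely what buys the small file sizes.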

It's astonishing that such a stupid algorithm was ever invented. It's alarming that Google Books is using it for their texts. The short of it is that if your research depends upon exact data from the past, you can never trust a Google Book to present that data to you. The digitization may have changed it, and you can never know whether it did or not.

The long-term problem is that in reality the Google Books digitizations will be the only surviving versions of many, many texts. Nobody else has the blind patience to digitize everything, including the "boring" stuff. I have no idea what format Google uses internally to store its digitizations, but I sincerely hope that it is not the same JBIG2/JPX combination that they use to present them.

(Note that not all PDFs use JBIG2. PDF is a versatile container format. But see the discussion of Undetectable Data Corruption with JB2/JBIG Formats. The increasingly popular DjVu format uses JB2, which has the same problems as JBIG2.)
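
For the curious: the standard PDF filter names appear literally in a file's raw bytes, so a crude check of what a particular downloaded PDF declares is possible. The sketch below simply searches for those names; filters referenced from inside compressed object streams won't be found this way, so treat "not found" as inconclusive rather than as a clean bill of health.

    # Rough check of which compression filters a PDF declares, by scanning
    # the raw bytes for the standard PDF filter names. Absence of a name is
    # inconclusive (it may be hidden inside a compressed object stream).
    import sys

    FILTER_NAMES = [b"/JBIG2Decode", b"/JPXDecode", b"/DCTDecode", b"/CCITTFaxDecode"]

    def report_filters(path):
        with open(path, "rb") as f:
            data = f.read()
        for name in FILTER_NAMES:
            count = data.count(name)
            if count:
                print(f"{name.decode()} appears {count} time(s)")

    if __name__ == "__main__":
        report_filters(sys.argv[1])   # e.g.: python pdf_filters.py some_google_book.pdf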

4. An Infelicity

This isn't a bug; it's a consequence of Google's origins in searching. But it is inelegant. I mention it because so much of Google is very elegant - think of the simplicity of their main page, or the fact that their rather sophisticated software works so well almost all of the time. (I'm really, really impressed by both the designers and coders at Google. I suspect that even if I were younger I could never qualify for a job there.)

Google Books has many digitizations from serial or multivolume literature: periodicals, classic encyclopaedias (such as Rees'), etc. But their arrangement is maddening - essentially random. I would suggest to Google that it would greatly endear them to the academic community if they were to employ a few low-paid graduate students to synthesize single-page indexes of these works. The Hathi Trust has managed this (often with Google digitizations, though Hathi versions are page-limited outside of participating institutions). Surely Google has a bigger budget.