Breaking the “Code” by Scanning the Text?

By Gareth Williams and Santino Prinzi

 

One of the most exciting scenes in Robin Sloan’s Mr. Penumbra’s 24-Hour Bookstore comes when Clay sneaks into the Festina Lente Company (FLC) with the GrumbleGear book scanner so he can scan Manutius’ codex vitae, which the FLC are trying to crack. His plan is to have a computer read the text on his behalf, in the hope of discovering the secret of immortality. Clay also scans Ajax Penumbra’s codex vitae, as he fears the FLC will destroy it if they find out what has been done (which they do). By scanning the codex vitae Clay remediates the physical printed text into PDF images, which we can do ourselves, but there’s more to it than that. The images are then converted into readable, workable plain text using Optical Character Recognition (OCR) software, and this doesn’t always work.

Although the characters use a rudimentary cardboard system in the novel, the process of scanning texts has become incredibly popular. In the same way that the FLC use the codex vitae to preserve life, many historians and archivists are turning to OCR in order to save texts that could easily be lost. However, this isn’t an easy process: first the book must be scanned to an image file such as a JPEG or PDF, and only then can the software try to recognise characters from the pixels of that image. Obviously, the more degraded the original copy, the harder it is to get a clear, legible transcript. OCR is often described as ‘brittle’ for this reason; errors created in the early stages of the process are quite likely to end up in the final product. Here are some examples:

Images courtesy of HathiTrust. Images of the original text available at: http://hdl.handle.net/2027/nyp.33433082227533?urlappend=%3Bseq=11

As you can see, although a lot of the text has been recognised easily, some of the letters were misread by the software (such as the ‘E’ being so condensed that it is seen as an ‘x’). The main problems arise from the use of different typefaces and smaller fonts.

Images courtesy of HathiTrust. Image of original text available at: http://hdl.handle.net/2027/uc1.31378008333604?urlappend=%3Bseq=11

As you can see from the second example, there are times when the software almost completely fails. It not only struggles to recognise words in italics, but it also reads the smaller fonts as symbols. It often happens that some words appear almost legible even though they have collided together. This is demonstrated in the plague manuscript: “Infection of the Plague seldom, if ever, …” is rendered as “Ilffctctiofl’ of the Plagae ſeldonyiflevjer”. Errors are unavoidable in any computer system, but when scanning older texts it is apparent that they are incredibly frequent.
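To put a rough number on how badly a line like this has gone wrong, one common approach is to count the single-character edits (insertions, deletions and substitutions) needed to turn the OCR output back into the correct text, and divide by the length of the correct text. The Python sketch below is only an illustration of that idea, not something HathiTrust or any particular OCR package does for you. The function names are our own, and the two strings are the fragments quoted above with the trailing ellipsis dropped, so the figure it prints applies to this short snippet only.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the fewest single-character insertions, deletions or
    substitutions needed to turn string a into string b."""
    # previous[j] holds the distance between the first i-1 characters
    # of a and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning i characters of a into an empty string
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, ocr_output: str) -> float:
    """Edit distance divided by the length of the correct text."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, ocr_output) / len(reference)


# Fragments quoted in this post (ellipsis removed); illustrative only.
reference = "Infection of the Plague seldom, if ever,"
ocr_output = "Ilffctctiofl’ of the Plagae ſeldonyiflevjer"

print(f"Character error rate: {character_error_rate(reference, ocr_output):.2f}")
```

A rate of 0 would mean the OCR output matches the original exactly; the closer the figure gets to 1 (or beyond), the closer you are to retyping the whole line by hand.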

This example demonstrates a key message in Sloan’s novel about the use of technology: embrace it, but don’t rely on it solely. There are distinct differences between the PDF file and the plain text, and the plain text would require substantial manual editing to correct what the OCR software couldn’t do for us. Failures like this do not mean we shouldn’t be using computers to aid us, just as the failures in Sloan’s novel do not stop Clay from trying to break the code of the codex vitae. We’re looking forward to any future failures (and successes, hopefully) as we experiment with OCR software and other digital tools on this module.

One thought on “Breaking the “Code” by Scanning the Text?”

  1. Some great use of images and text from HathiTrust to demonstrate your point! I think you might need to be careful about the differences between digitization (at its most simple, taking photos of a document to save in a digital image format such as PDF, JPEG, TIFF, or PNG); OCR (the computer-automated process of transforming the words in an image file into a text file format); and encoding (the process of marking up or tagging texts so that they can be processed by computers for a variety of analyses).
