Monthly Archives: November 2015

How are literary texts preserved, disseminated and displayed on the internet?

By Christine Bradley and Lorien Kaack

The internet has caused growing anxiety over how important information is preserved, displayed and disseminated. Over the past few weeks we have been looking at authors such as Roy Rosenzweig, Robin Sloan and Alan Liu, who have written on these anxieties. Rosenzweig's tone is notably more negative than that of Liu, who sees the internet and its progression as positive.

Rosenzweig speaks of his anxiety about data on the internet and how it can be preserved. He states, ‘Ignacio’s sudden deletion of Bert should capture our interest as historians since it dramatically illustrates the fragility of evidence in the digital era’ [1]. He fears that important information stored on the internet will be lost, since most data is subject to bitrot and has a life expectancy of 10 years. We can’t keep saving all this information, so how do we choose what to preserve and what not to preserve? ‘The most calibrated mix of technical solutions will not save the past for the future because the problems are much more than technical and involve difficult social, political and organisational questions of authenticity, ownership and responsibility’ [2].

The character Kat in Mr Penumbra’s 24-Hour Bookstore is very optimistic. She has complete faith in the idea of the whole internet being stored within containers, and seems rather zealous and proud of this. People such as Rosenzweig, however, are very resistant to the idea of keeping everything on the internet in one place. Can it all be preserved, and if not, how do we decide what to preserve?

The idea of displaying information on the internet can be illustrated through a digital literary project: a collection of information which has been manipulated and made machine-readable in some way, and then made viewable by humans in some form.

It requires three stages:




Each of these three processes involves important decisions about the scholarly project you are undertaking and what you want to do with it. These include intellectual decisions that shape the information and what you want it to be used for.

By cataloguing a source you are grouping it into one category and not another. Similarly, by transcribing it you are separating the text from its context. All of these things are necessary if you are going to make a digital project, but each of them creates a certain authority and certain limitations. Breaking the project down this way can help you understand it.

An example of this is END (the Early Novels Database):

  • University of Pennsylvania Library: 1200 novels, 1660-1830. We don’t see the actual text, only its metadata (publication date, title and so on), of which there is a great deal.

  • Metadata organised into specific fields

  • Available for people to interact with

For this project the editors have decided to give only metadata on the texts, rather than the texts themselves. This is important as it shows what they specifically want the website to be used for. Rather than putting all the information from the books in the database, they have chosen a certain category of information to display, and this determines who uses it and why. On the END website’s ‘About’ page they describe their reasoning for displaying the information this way: ‘The END (Early Novels Database) Project creates super-rich metadata about fiction in English in order to help researchers imagine new histories for the novel’ [3]. It is also a way to show how these early novels organise themselves, gathering information ‘about how early novels instruct readers about themselves, carefully describing prefaces, introductions, and dedications; tables of contents, indexes; title-page genre terms and footnotes buried deep within the text’ [4].
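To make the idea of "metadata organised into specific fields" more concrete, here is a minimal sketch in Python. The field names and sample entries are our own invention for illustration; they are not END's actual schema or data. It simply shows how a catalogue that stores descriptions of books, rather than their full text, can still be searched in useful ways.

```python
from dataclasses import dataclass, field


@dataclass
class NovelRecord:
    """One catalogue entry: metadata about a novel, not its full text.

    The fields are hypothetical and chosen only for illustration.
    """
    title: str
    year: int
    paratexts: list[str] = field(default_factory=list)  # e.g. preface, dedication


def with_paratext(records, kind):
    """Return the records whose metadata lists the given kind of paratext."""
    return [r for r in records if kind in r.paratexts]


# Invented sample entries, not real catalogue data.
records = [
    NovelRecord("Example Novel A", 1720, ["preface", "dedication"]),
    NovelRecord("Example Novel B", 1795, ["table of contents"]),
    NovelRecord("Example Novel C", 1801, ["preface", "footnotes"]),
]

for record in with_paratext(records, "preface"):
    print(record.year, record.title)
```

The choice of fields is itself an editorial decision: a researcher can only ask questions that the chosen fields make possible.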




[1] Rosenzweig, Roy. ‘Scarcity or Abundance? Preserving the Past in a Digital Era.’ The American Historical Review, 108 (3), 2003, p. 736.


[2] Ibid., p. 747.


[3] Early Novels Database [Online] Available from:


[4] Early Novels Database [Online] Available from:



What makes a book worthy of preservation?

By Sophie Lee, Erin Brown and Jimmy Barton.


Commercial reasons

The most obvious answer would, of course, be commercial gain. The popularity of a text goes a long way in determining whether it is republished, which is a method of preserving the text, although not the physical work.

Historical importance

If a text has particular historical importance, for example if it still contains handwriting and annotations from the 17th or 18th century, then it would also usually be seen as an important piece of work to preserve. This, of course, would be preservation in the physical sense as well as preservation of the words in the text.

To an extent, both the text and the physical copy can be preserved digitally, thanks to scanning and encoding. However, only an image of the physical book would be preserved, rather than the physical book itself.

Preservation in Mr Penumbra’s 24-Hour Bookstore

In Mr Penumbra’s 24-Hour Bookstore, there is a belief that the ‘Codex Vitae’ holds the key to immortality, and so it is being preserved. This shows how a book’s influence and effect are key values in determining preservation. If the book wasn’t so significant, would they still be trying to crack the code and preserve it? There is an air of selfishness, in the sense that only a select few know the truth behind the book and believe that it holds the key to immortality. It is a symbol of power. On the other hand, the age of the book also makes it desirable. Google wishes to digitize and preserve it because that is its goal with everything. The character Kat expresses this desire for all-encompassing knowledge.

How do these corporations operate? The Festina Lente Company makes its money from copyrighted fonts. It is much smaller than Google and has more personal reasons for preservation, such as the numerous ‘Codex Vitae’ of the members of the Unbroken Spine. Google’s advertising revenue drives its projects of preservation and digitization. It wishes to preserve life through keeping these works of literature, and its scope and audience reflect its aims of preservation. Clay uses the ‘Grumble Gear’ scanner to digitally preserve these ‘Codex Vitae’ for his team and Google. This method is much more contemporary, and much more technologically supported, than the book chase and decoding that the Unbroken Spine requires of its members.

Google’s immortality is symbolized in the widespread use of its search engine, and in people’s continuous, subconscious use of its web services. It is similar to the immortality that is hidden in the font ‘Gerritszoon’: it is so obvious that it is hard to notice. The preservation value of a book lies in its significance for others, be it personal or commercial. Alan Liu describes how a symbiosis of Google’s methods and the Festina Lente Company’s methods of preservation would ultimately be the most effective. Liu states, ‘it may be that experiencing and communicating literature through social-computing technologies will do more than supplement older reading, interpreting, and performing practices. The payoff will be an evolution in our understanding of the nature of reading, interpreting, and performing.’

Here, technology and conventional methods work together to improve the preserved piece, through its interpretation and understanding.


1813 – Second edition of Sense and Sensibility – Jane Austen.

Image courtesy of Peter Harrington – London

2011 published edition of Sense and Sensibility – Jane Austen.



Image courtesy of AustenProse.

Breaking the “Code” by Scanning the Text?

By Gareth Williams and Santino Prinzi


One of the most exciting scenes in Mr Penumbra’s 24-Hour Bookstore by Robin Sloan is when Clay has snuck into the Festina Lente Company (FLC) with the Grumble Gear book scanner so he can scan Manutius’ codex vitae, which the FLC are trying to crack. This is so he can use a computer to read the text on his behalf, in the hope of discovering the meaning of immortality. Clay also scans Ajax Penumbra’s codex vitae, as he fears the FLC will destroy it if they find out what he has done (which they do). By scanning the codex vitae Clay remediates the physical printed text into PDF images, which we can do ourselves, but there’s more to it than that. The images are transformed into readable, workable plain text using Optical Character Recognition (OCR) software, but this doesn’t always work.

Although the characters use a rudimentary cardboard system in the novel, the process of scanning texts has become incredibly popular. In the same way that the FLC use the codex vitae to preserve life, many historians and archivists are turning to OCR in order to save texts that could easily be lost. However, this isn’t an easy process: first the book must be scanned to a JPEG or PDF, and then the text must be encoded from the resulting bitmap image. Obviously, the more degraded the original copy, the harder it is to get a clear, legible transcript. OCR is often described as ‘brittle’ for this reason: errors created in the early stages of encoding are quite likely to end up in the final product. Here are some examples:

Images courtesy of HathiTrust. Images of the original text available at:

As you can see, although a lot of the text has been recognised easily, some of the letters were not recognised by the software (such as the ‘E’, which is so condensed that it is read as an ‘x’). The main problems arise from the use of different typefaces and smaller fonts.

Images courtesy of HathiTrust. Image of original text available at:

As you can see from the second example, there are times when the software almost completely fails. It not only struggles to recognise words in italics, but also reads the smaller fonts as symbols. Often, words remain almost legible even though they have run together. This is demonstrated in the plague manuscript: “Infection of the Plague seldom, if ever, …” is encoded as “Ilffctctiofl’ of the Plagae ſeldonyiflevjer”. Errors are unavoidable in any computer system, but when scanning older texts it is apparent that they are incredibly frequent.
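One rough way to put a number on how badly the OCR has failed is the character error rate: the fewest single-character edits needed to turn the OCR output into the correct transcription, divided by the length of the correct transcription. The sketch below is our own illustration in Python, not part of any particular OCR package, and the function names are ones we made up. It uses the plague example quoted above, leaving off the trailing comma and ellipsis.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the fewest single-character insertions,
    deletions and substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, ocr_output: str) -> float:
    """Edit distance divided by the length of the correct transcription."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(reference, ocr_output) / len(reference)


# Strings taken from the plague example in this post.
reference = "Infection of the Plague seldom, if ever"
ocr_output = "Ilffctctiofl’ of the Plagae ſeldonyiflevjer"

print(f"character error rate: {character_error_rate(reference, ocr_output):.0%}")
```

A measure like this only works if someone has already produced a correct transcription to compare against, which is exactly the manual work that OCR is supposed to save.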

This example demonstrates a key message in Sloan’s novel about the use of technology: embrace it, but don’t rely on it solely. There are distinct differences between the PDF file and the plain text, and the plain text would require substantial manual editing to correct what the OCR software couldn’t do for us. Failures like this do not mean we shouldn’t be using computers to aid us, just as the failures in Sloan’s novel do not stop Clay from trying to break the code of the codex vitae. We’re looking forward to any future failures (and successes, hopefully) as we experiment with OCR software and other digital tools on this module.

Which books?

Last week, we bounced around a few ideas for books we might like to attempt to part-digitise and edit. Below is the list, plus a couple of others. As you’ll see, it might depend upon the ability to find the right kind of source book. Also, as I mentioned, we won’t be digitizing or editing a large portion of these texts. In fact, it’ll be up to you to decide what combination of features (physical, literary, historical) you think is most interesting about your chosen book, and so which pages you will be digitizing and editing. I will insist, however, that you include the title page. More details to come.

Sheridan, School for Scandal

Defoe, Robinson Crusoe.

Austen, Persuasion

Voltaire, Candide

NB: for these, you will need to find the oldest edition you can afford (e.g. Everyman editions from the early 20th century are around £4; 19th-century editions tend to cost more, but you might find a bargain). I’d suggest a visit to Bayntun’s antiquarian bookshop near the station: it’s fun to browse downstairs, where they keep the cheaper second-hand copies.

Nehemiah Grew, Musaeum Regalis Societatis. or a catalogue & description of the natural and artificial rarities belonging to the Royal Society (1681)

The Most Delectable History of Mr Reynard (1701)

Both of the above from BSU special collections.

Henry Fielding, Tom Jones (1749) [Everyman, c.1910].

Alexander Pope, The Works, vols I and II (1736).

Both of these are my personal copies which may be used.