All posts by santinoprinzi

Data Visualisation: The Hidden Considerations

By Lorien and Tino

 

Using the English Short Title Catalogue, we searched for all the fiction titles published between 1660 and 1799 by the publisher Noble. We wanted to see what we could learn about literary history by finding the most popular words in those titles, since the titles should indicate the types of fictional narrative this publisher was producing in the period: for example, Adventure, History, or Romance. Although this is what we set out to discover, this post focuses on the considerations we need to make when choosing how to visualise data.

To do this, we searched the database using our chosen criteria. This produced 36 titles, which we then edited so that only the titles themselves remained and details such as authors' names were excluded.

Using Voyant, we can input the text and discover the most popular words used in these titles. However, it is not quite that simple: some titles contain words such as ‘volumes’ which, though part of the title, tell us nothing about the type of book. Not all stop words are undesirable, so we produced two word clouds: one with stop words included, and one without.
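The counting that Voyant does for us can be sketched in a few lines of Python. This is only an illustration, not Voyant's own method: the three titles and the small stop-word list are invented placeholders rather than our ESTC data, and `top_words` is a hypothetical helper name.

```python
import re
from collections import Counter

# Placeholder titles for illustration only; not the 36 ESTC titles.
TITLES = [
    "The history of Miss Example, in two volumes",
    "The adventures of Mr Sample",
    "The history of a placeholder, in two volumes",
]

# A tiny, hand-picked stop-word list (an assumption; Voyant ships its own lists).
STOP_WORDS = {"the", "of", "a", "in", "volumes"}


def top_words(titles, stop_words=frozenset(), n=10):
    """Return the n most frequent lower-cased words, skipping stop words."""
    words = []
    for title in titles:
        words.extend(re.findall(r"[a-z]+", title.lower()))
    counts = Counter(w for w in words if w not in stop_words)
    return counts.most_common(n)


with_stops = top_words(TITLES)
without_stops = top_words(TITLES, STOP_WORDS)
print(with_stops)
print(without_stops)
```

Running the function twice, once with and once without the stop-word list, mirrors the two word clouds: the first list is dominated by words like ‘the’ and ‘of’, while the second brings words like ‘history’ to the top.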

The first word cloud hints at the syntactical structure of the titles. Though this is interesting, it does not answer our questions, hence the second word cloud, from which the stop words have been removed. By way of comparison, here are the top ten words from both word clouds presented as pie charts:

[Figures: two pie charts of the top ten words. Captions: "with stop words included"; "with stop words"]

We can see from the two charts that ‘History’ and ‘Adventures’ are two popular types of novel being published. What is also interesting is the presence of ‘Mr’ and ‘Miss’, but not ‘Mrs’, suggesting that female protagonists in these novels may be unmarried women. From what we already know about 18th-century novels, this may point to a moral standpoint within the novels.

Distant reading, however, cannot give us conclusive details. For example, one word in the top ten is ‘two’, and close reading would be required to determine whether it belongs to the title proper or indicates the number of volumes in which the novel was originally published. If the latter, then we have “stumbled” across a result we had not been expecting.

There is a flaw with our pie charts, or rather, with the way we have chosen to represent this data. A pie chart implies the whole of something by splitting a complete circle into smaller segments, but the data we are visualising is not the whole corpus, only the ten most popular words and their frequencies. A bar chart would be a better choice because it shows the number of times each word appears without implying a totality that the data does not have.
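To illustrate the point, here is a rough sketch in Python that prints a plain-text bar chart. The counts are invented placeholders rather than our Voyant results, and `text_bar_chart` is a hypothetical helper. Each bar shows a raw count, so nothing suggests that the listed words add up to the whole corpus.

```python
# Invented counts for illustration; not our actual results.
COUNTS = {"history": 9, "adventures": 6, "miss": 4, "mr": 3, "two": 3}


def text_bar_chart(counts):
    """Return one line per word: the word, a bar of '#' marks, and the count."""
    width = max(len(word) for word in counts)
    lines = []
    # Sort from most to least frequent so the ranking is easy to read.
    for word, count in sorted(counts.items(), key=lambda item: -item[1]):
        lines.append(f"{word.ljust(width)} {'#' * count} {count}")
    return lines


chart = text_bar_chart(COUNTS)
print("\n".join(chart))
```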

The most important thing we have learned about data visualisations is exactly that: the visual. What we instantly see can tell us many things, right or wrong, so it is important that we think carefully about how we choose to present our data.


Essay Blog Post: The (New) Translator’s Task?

In her book My Mother Was a Computer: Digital Subjects and Literary Texts, N. Katherine Hayles writes the following: ‘I use the term “media translation” to suggest that recreating a text in another medium is so significant a change that it is analogous to translating from one language to another’ (p.109).

Hayles is suggesting that translating a text from one language to another is analogous to remediating a text from the physical codex into a digital form, in other words, creating a digital edition of a physical text. The heart of my essay responds to this claim by exploring the challenges faced by a digital scholarly editor when remediating a text, to see whether they run parallel to the challenges faced by a translator of languages.

The theoretical discussion of how a translator’s challenge depends on the target (whether the aim is to replicate the text as closely as possible or to reimagine it for an intended audience) supports Hayles’ claim. Discussing this only in theory would be limiting, however, because remediating texts and translating languages are practical activities too. My essay therefore applies these challenges to my first-hand experience of digitally remediating Reynard the Fox.

I found that the decisions I had to make in my own translation of the text were not dissimilar to those faced by translators of language, so both the theoretical arguments and the practical process of digital remediation support Hayles’ claim.

Breaking the “Code” by Scanning the Text?

By Gareth Williams and Santino Prinzi

 

One of the most exciting scenes in Mr Penumbra’s 24-Hour Bookstore by Robin Sloan is when Clay sneaks into the Festina Lente Company (FLC) with the Grumble Gear Book Scanner so he can scan Manutius’ codex vitae, which the FLC are trying to crack. He does this so that a computer can read the text on his behalf, in the hope of discovering the meaning of immortality. Clay also scans Ajax Penumbra’s codex vitae, as he fears the FLC will destroy it if they find out what has been done (which they do). By scanning the codex vitae, Clay remediates the physical printed text into PDF images, something we can do ourselves, but there’s more to it than that. Optical Character Recognition (OCR) software transforms the images into readable, workable plain text, but this doesn’t always work.

Although the characters use a rudimentary cardboard system in the novel, the process of scanning texts has become incredibly popular. In the same way that the FLC use the codex vitae to preserve life, many historians and archivists are turning to OCR in order to save texts that could easily be lost. However, this isn’t an easy process: first the book must be scanned to a JPEG or PDF and then encoded using a bitmap system. Obviously, the more degraded the original copy, the harder it is to get a clear, legible transcript. OCR is often described as ‘brittle’ for this reason; errors created in the early stages of encoding are quite likely to end up in the final product. Here are some examples:

Images courtesy of HathiTrust. Images of the original text available at: http://hdl.handle.net/2027/nyp.33433082227533?urlappend=%3Bseq=11

As you can see, although a lot of the text has been recognised easily, some of the letters were not recognised by the software (such as the ‘E’ being so condensed that it is read as an ‘x’). The main problems arise from the use of different typefaces and smaller fonts.

Images courtesy of HathiTrust. Image of original text available at: http://hdl.handle.net/2027/uc1.31378008333604?urlappend=%3Bseq=11

As you can see from the second example, there are times when the software almost completely fails. It not only struggles to recognise words in italics, but also reads the smaller fonts as symbols. It often happens that some words appear almost legible even though they have collided together. This is demonstrated in the plague manuscript: “Infection of the Plague seldom, if ever, …” is encoded as “Ilffctctiofl’ of the Plagae ſeldonyiflevjer”. Errors are unavoidable in any computer system, but when scanning older texts it is apparent that they are incredibly frequent.
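One way to put a number on how badly a line has gone wrong is to count the single-character edits (insertions, deletions, and substitutions) needed to turn the OCR output back into the correct text, and then divide by the length of the correct text. The sketch below is only an illustration in Python, not part of any OCR package: `edit_distance` and `character_error_rate` are hypothetical helper names, and the reference string is the quoted phrase without its trailing ellipsis.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, ocr_output):
    """Edits needed per character of the reference text."""
    return edit_distance(ocr_output, reference) / len(reference)


reference = "Infection of the Plague seldom, if ever,"
ocr_output = "Ilffctctiofl’ of the Plagae ſeldonyiflevjer"
print(round(character_error_rate(reference, ocr_output), 2))
```

A rate of 0 would mean the OCR output matches the reference exactly; the higher the value, the more correction by hand the plain text needs.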

This example demonstrates a key message in Sloan’s novel about the use of technology: embrace it, but don’t rely on it solely. There are distinct differences between the PDF file and the plain text, and the plain text would require substantial manual editing to correct what the OCR software couldn’t do for us. Failures like this do not mean we shouldn’t use computers to aid us, just as the failures in Sloan’s novel do not stop Clay from trying to break the code of the codex vitae. We’re looking forward to any future failures (and successes, hopefully) as we experiment with OCR software and other digital tools on this module.