Voyant 2

Voyant has released version 2 of its textual analyzer and visualizer: http://voyant-tools.org/ (link is also in ‘Tools’). It has a cleaner interface and seems, at least during my first testing, quicker. In the top left-hand tool panel, where the word cloud appears, you can switch between views: word cloud (‘Cirrus’), raw word counts (‘Terms’), and word-frequency links (‘Links’).

For more features, check out this Profhacker blog post.

NB: see also their new ‘Guides’ section!



Wrong Search, Wrong Answer

Distant reading is an undeniably useful tool for searching data that spans a large pool of texts. On its own, however, distant reading lacks context: we must couple the knowledge gained through close reading with the data gained through distant reading in order to create accurate search terms and obtain accurate answers. We can see this in our graph (Figure 1), which shows the data for works of fiction published between 1660 and 1799 (according to the ESTC) which contain the term ‘adventure’.

Figure 1.

This data raises several problems. The first is the potential to be misled, as the data does not take into account synonyms or translations of the term ‘adventure’, and so may not give us an accurate count of the texts. Additionally, the count does not take into account multiple editions published in the same year, and would therefore require further distant reading. Although this is a rudimentary analysis, and could be extended by taking into account other popular terms (e.g. ‘adventure’ alongside ‘romance’), it does provide us with a rough point at which to start.

However, this would require further close reading of the texts before the data could carry real weight as evidence. Franco Moretti, a scholar in the field of digital text analysis, is a strong advocate for distant reading; however, he cannot deny that close reading is often still necessary. Moretti concedes that things didn’t unfold as planned: somewhere along the line, he writes, he “drifted from quantification to the qualitative analysis of plot”.

The answers we are looking for can only be found through specific search terms. If we search for the wrong terms, we receive the wrong answers. The right terms can be identified through close reading, which shows that both close reading and distant reading are needed in order to gain reliable and accurate information.

Data Visualisation: The Hidden Considerations

By Lorien and Tino


Using the English Short Title Catalogue, we searched for all the fiction titles published between 1660 and 1799 by the publisher Noble, as we wanted to see what we could learn about literary history by finding the most popular words in those titles. The titles should give us an indication of the types of fictional narrative this publisher issued in this period, for example Adventure, History, or Romance. Though this is what we set out to discover, this post focuses on the considerations we need to make about how we choose to visualise data.

In order to do this, we needed to search the database inputting the criteria we wanted. This produced 36 titles, which we then edited so we only had the titles themselves. We needed to edit the data so we could ensure details such as authors were not included.

Using Voyant, we can input the text and discover the most popular words used in these titles. However, it is not as simple as this, as some of the titles have words such as ‘volumes’ and, though this is part of the title, it does not tell us anything about the type of book it is. Not all stop words are undesirable, and so we have produced two word clouds: one with stop words included, and one without.
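To make the comparison concrete, here is a minimal sketch of the kind of counting that sits behind a word cloud, run once with stop words kept and once with them removed. It is not how Voyant itself is implemented; the three titles and the short stop-word list are invented for illustration and are not our 36 ESTC results.

```python
from collections import Counter
import re

# Hypothetical titles for illustration only; NOT the 36 ESTC results.
titles = [
    "The History of Miss Sally, in Two Volumes",
    "The Adventures of Mr Brown, in Two Volumes",
    "The History of a Young Lady",
]

# A small, assumed stop-word list; Voyant's built-in list is much longer.
STOP_WORDS = {"the", "of", "in", "a", "and"}


def word_counts(texts, stop_words=frozenset()):
    """Count lowercase word tokens across texts, skipping any stop words."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(t for t in tokens if t not in stop_words)
    return counts


stop_words_kept = word_counts(titles)
stop_words_removed = word_counts(titles, STOP_WORDS)

print(stop_words_kept.most_common(5))
print(stop_words_removed.most_common(5))
```

Even in this toy example, ‘volumes’ survives the stop-word filter, which is the problem described above: deciding whether to add such a word to the list is an editorial choice, not a purely technical one.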

The first word cloud reveals something about the syntactical structure of the titles, and though this is really interesting it does not answer our questions; hence the second word cloud, from which the stop words have been removed. By way of comparison, here are the top ten words from both word clouds presented as pie charts:

[Figures: two pie charts, captioned ‘with stop words included’ and ‘with stop words’]

We can see from the two charts that ‘History’ and ‘Adventures’ are two popular types of novel being published. What is also interesting is the presence of ‘Mr’ and ‘Miss’, but not ‘Mrs’, suggesting that the female protagonists of these novels may be unmarried women; from what we already know about eighteenth-century novels, this may point to a moral standpoint in the novels.

Distant reading, however, cannot give us conclusive details. For example, one word present in the top ten is ‘two’, and close reading would be required to determine whether this belongs to the title of the novel or indicates the number of volumes in which the novel was originally published. If the latter, then we have seemingly “stumbled” across a result we had not been expecting.

There is a flaw with our pie charts, or rather, with the way we have chosen to represent this data. A pie chart implies the whole of something by splitting a complete circle into smaller segments, but the data we are visualising is not the whole corpus; it is only the ten most popular words and their frequencies. A better way to visualise this data would be a bar chart, because we could see the number of times each word is mentioned, and it would not imply a totality that the data cannot support.
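The distortion can be shown with a little arithmetic. The sketch below uses invented counts (not the figures behind our charts) and two hypothetical helper functions to compare the share a word appears to have in a pie of only the top words with its share of every word counted.

```python
from collections import Counter

# Hypothetical counts for illustration only; not the figures behind our charts.
counts = Counter({
    "history": 12, "adventures": 9, "miss": 7,
    "two": 5, "lady": 3, "letters": 2, "memoirs": 2,
})


def pie_share(counts, word, n):
    """Share a word appears to have when only the top n words fill the pie."""
    top_total = sum(freq for _, freq in counts.most_common(n))
    return counts[word] / top_total


def corpus_share(counts, word):
    """Share a word has of every word occurrence that was counted."""
    return counts[word] / sum(counts.values())


print(f"pie of top 3: {pie_share(counts, 'history', 3):.0%}")
print(f"whole corpus: {corpus_share(counts, 'history'):.0%}")
```

In a pie of the top words, every slice is inflated because the words left out of the chart are silently dropped from the total. A bar chart of raw counts makes no such claim about the whole.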

The most important aspect we have learned about data visualisations is exactly that: the visual. What we instantly see can tell us so many things, right or wrong. It is therefore important that we think carefully about the way we choose to present our data.

Essay Blog Post: Constraints of the Codex

In Digital Scholarly Editing, Susan Schreibman states that ‘Digital scholarly editors are no longer bound by the constraints of the codex and the economics of print publication.’

Schreibman is suggesting that when a text is transcribed from a codex to a digital edition, or is simply born a digital edition to begin with, it is freed from the restrictions it suffers as a physical edition. My own previous study into physical and digital editions of the same text, particularly with Reynard the Fox, concurs with the idea that digital editions are liberated from these restrictions. Digital editions are cheaper, perhaps even free, to make, thus freeing them from the financial restrictions of print publication. They are also easier to navigate, search, and access, due to the use of hyperlinks, encoding and the immense audience of the internet. The digital text is liberated from the spatial restrictions enforced upon a codex, in that it takes up no physical space at all, and any errors that are made in a digital edition can be easily rectified, something which I have already experienced, through study of Reynard the Fox, is not the case with a codex.

My essay will argue that, although digital editions carry some minor restrictions which are not a problem in physical ones, the majority of the constraints of the codex, as well as the problems with the economics of print publication, are abolished when a text is created digitally. Taking the problems of type, print, accessibility and navigation found in bound versions of Reynard the Fox, I will consider whether these same problems are still present in the digital version, and so seek to confirm Schreibman’s assertion.

Editorial goals are different from archival ones.

Peter Shillingsburg states that ‘Editorial goals … are different from archival ones’ in Literary Documents, Texts, and Works Represented Digitally.

I intend to explore the truth of this statement through the Early Novels Database, which archives ‘super-rich metadata about fiction in English in order to help researchers imagine new histories for the novel. By uniting twenty-first-century database and search technologies with the sensibility of eighteenth-century indexing practices, END creates several innovative access points to our collection of controlled-term and descriptive metadata about novels published between 1660-1830.’[1] The END will be compared to a project we carried out in class, in which we digitised sections of the Nehemiah Grew text. The processes we undertook, and the editorial decisions we faced when remediating this text, strongly influenced how it appeared in digital form.

Editorial goals, therefore, seem to require decisions: the editor chooses which parts of the text are most appropriate, and whether things such as writing in the margins of the book should be carried over into the digital form. Archival goals aim more at an accurate representation of a text or, in the case of the END, of the metadata of the text. The END offers a more interactive experience, whereas our digitising of the Grew text served to translate a very old book into a digital format. Although both serve to make texts digital, editorial goals rely more upon giving old texts new life in a digital form, whereas archives provide accurate representations for a researcher to explore. In the light of these suggestions, my essay will agree with Peter Shillingsburg’s statement.


[1] The Early Novels Database About Page. [Online] Available from: http://earlynovels.org/?page_id=7 [Accessed 20/01/16].

Sharing knowledge is more important than continuously building digital systems.

In the book Defining Digital Humanities: A Reader, Mark Sample states, in reference to the promise of the digital, ‘It’s not about building, it’s about sharing.’ (p.255)

Views on the influence of the digital industry seem to fall into three positions. The first favours more traditional, possibly archaic values, and refuses to fully accept the industry’s ever-growing nature. The second is very encouraging of the rise and usefulness of the digital industry and its ability to aid and enhance pre-existing standards and research. The third attempts to create a symbiosis of both in order to find a happy medium. This is the position that Sample seemingly takes, as do I.

This essay will argue that Sample’s statement is true to its fullest extent, highlighting the importance of sharing knowledge over the mass construction of digital systems. Sample’s emphasis on the significance of sharing accords with my own research into the crowdsourcing and community-sourcing tools, databases and projects that I have studied in class. Sample also states, ‘The heart of the digital humanities is not in the production of knowledge; it’s the reproduction of knowledge.’ (p.256) This not only affects the theoretical approach to the question at hand but also opens up the exploration of databases such as Wikipedia and the Early Novels Database (END). These two resources will help evidence the argument that sharing knowledge is more important than building knowledge. This debate over the correct application of digital knowledge matters because of the volatile nature of the industry, and because of the potential the digital world has to reshape the representation, sharing, and discussion of knowledge.