Week eighteen: analysing a corpus

To recap, Johanna Drucker usefully summarises distant reading here:

“Distant reading is the idea of processing content in (subjects, themes, persons, places etc.) or information about (publication date, place, author, title) a large number of textual items without engaging in the reading of the actual text. The “reading” is a form of data mining that allows information in the text or about the text to be processed and analyzed. Debates about distant reading range from the suggestion that it is a misnomer to call it reading, since it is really statistical processing and/or data mining, to arguments that the reading of the corpus of literary or historical (or other) works has a role to play in the humanities. Proponents of the method argue for the ability of text processing to expose aspects of texts at a scale that is not possible for human readers and which provide new points of departure for research. Patterns in changes in vocabulary, nomenclature, terminology, moods, themes, and a nearly inexhaustible number of other topics can be detected using distant reading techniques, and larger social and cultural questions can be asked about what has been included in and left out of traditional studies of literary and historical materials.”

This week, we will continue our exploration of distant reading, this time applying our text analysis to a much larger number of texts. There are several steps to this worksheet, so be sure to read it through before you begin the preparation:

Read the following.

Patricia Cohen, Analyzing Literature by Words and Numbers, New York Times, 2010

Kathryn Schulz, What Is Distant Reading?, New York Times, 2011

Ted Underwood, Distant reading and representativeness

These concern Victorian novels, but the methodological issues remain the same for our own experiments. Now:

  1. Assess the arguments for and against this approach.
  2. What points are made about the literary canon?
  3. Are such methods more representative? Or is that beside the point?

Make your own corpus.

First, using the list of eighteenth-century novels I’ve created [in Bibliographies], find and download as many novels as you can. Use Project Gutenberg, the Oxford Text Archive, the Internet Archive, and HathiTrust (note: in this exercise there are no copyright issues). Remember to save them as plain text files, and give each one a short but self-explanatory file name. You should aim for at least 20 novels. Record which novels you’ve found. Then ask yourself this question: how would you characterise the corpus you have created? Random? Author-based? Genre-based? Period-based? As-many-as-I-could-get?
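If you would like to keep your record of the corpus in step with the files themselves, one option is to build it from the file names. The sketch below is only an illustration and assumes a naming pattern of my own invention, year_author_title.txt (for example 1719_defoe_robinson-crusoe.txt), saved in a folder called corpus. Neither the pattern nor the folder name is required for this exercise, and the function names are hypothetical.

```python
from pathlib import Path


def parse_filename(name):
    """Split a name like '1719_defoe_robinson-crusoe.txt' into its parts.

    Assumes the hypothetical pattern year_author_title.txt.
    """
    stem = Path(name).stem
    year, author, title = stem.split("_", 2)
    return {"year": int(year), "author": author, "title": title, "file": name}


def build_manifest(folder):
    """Return one record per .txt file in the folder, earliest year first."""
    records = [parse_filename(path.name) for path in Path(folder).glob("*.txt")]
    return sorted(records, key=lambda r: (r["year"], r["author"], r["title"]))


if __name__ == "__main__":
    # 'corpus' is an assumed folder name; change it to wherever you saved your files.
    for record in build_manifest("corpus"):
        print(record["year"], record["author"], record["title"], sep="\t")
```

Putting the year first in the file name also means that an ordinary alphabetical sort of the folder lists the novels in date order.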

Next, using Wikipedia, find out how many of the novels you found have full entries. What might be the relationship between availability of digital versions and the entries on Wikipedia? What might that reveal about the nature of the literary canon?

Text analysis

Next, upload these files to Voyant – making sure to upload them in chronological order. We will then consider these questions:

  • What keywords did you pursue and why?
  • What challenges did you find in choosing suitable keywords? What were ‘suitable keywords’? Were there keywords that didn’t ‘work’? Note examples of when you had to ‘turn concepts into quantifiable entities’ (Moretti lecture, 2015).
  • What patterns did you notice?
  • Perhaps there were no patterns? Why? What did you do when this happened?
  • Why did I ask you to upload them in chronological order?
  • What difference did your choice of texts make to the questions you asked, and the results?
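To see what ‘turning concepts into quantifiable entities’ involves at its simplest, it can help to sketch the counting by hand. The example below is not how Voyant works internally. It is a minimal illustration, with function names of my own choosing, that counts one keyword in each text, scales the count to a rate per 10,000 words, and lists the results in date order. The two ‘texts’ in the demonstration are invented placeholder strings, not quotations from any novel.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into runs of the letters a-z."""
    return re.findall(r"[a-z]+", text.lower())


def relative_frequency(text, keyword, per=10_000):
    """Occurrences of keyword per `per` words of text (0.0 for an empty text)."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return Counter(tokens)[keyword.lower()] * per / len(tokens)


def trend(corpus, keyword):
    """corpus is a list of (year, text) pairs; returns (year, rate) pairs by date."""
    ordered = sorted(corpus, key=lambda pair: pair[0])
    return [(year, relative_frequency(text, keyword)) for year, text in ordered]


if __name__ == "__main__":
    # Invented placeholder strings standing in for whole novels.
    demo = [
        (1749, "love love hate"),
        (1719, "love hate hate hate"),
    ]
    for year, rate in trend(demo, "love"):
        print(year, round(rate, 1))
```

Dividing by the length of each text keeps a long novel from dominating simply because it contains more words, and sorting by year is what allows the rates to be read as a change over time. Note that this sketch matches exact word forms only, so ‘love’ does not count ‘loved’ or ‘loving’.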

Next, go to ECCO and use Artemis’s ‘term frequency’ visualizer. So that we are all searching the same corpus:

  • Specify the database as ‘Eighteenth-Century Collections Online’ only
  • Limit the year range to 1700 to 1800.
  • Click ‘monographs’ and ‘Newspapers and periodicals’.
  • Now enter the same words / terms you explored in Voyant and compare the results.
  • Did any of the words stand out as particularly common? Or, perhaps just as interestingly, did any of them stand out as particularly uncommon words?