Week fifteen: playing with n-gram viewers

Since this session coincides with your essay hand-in deadline, we’re going to introduce ourselves gently to the next section of the module – ‘Distant Reading’ – through some in-class experimentation.

Johanna Drucker usefully summarises distant reading here:

“Distant reading is the idea of processing content in (subjects, themes, persons, places etc.) or information about (publication date, place, author, title) a large number of textual items without engaging in the reading of the actual text. The “reading” is a form of data mining that allows information in the text or about the text to be processed and analyzed. Debates about distant reading range from the suggestion that it is a misnomer to call it reading, since it is really statistical processing and/or data mining, to arguments that the reading of the corpus of literary or historical (or other) works has a role to play in the humanities. Proponents of the method argue for the ability of text processing to expose aspects of texts at a scale that is not possible for human readers and which provide new points of departure for research. Patterns in changes in vocabulary, nomenclature, terminology, moods, themes, and a nearly inexhaustible number of other topics can be detected using distant reading techniques, and larger social and cultural questions can be asked about what has been included in and left out of traditional studies of literary and historical materials.”

We’ll begin to explore these techniques and tools by experimenting with four different n-gram viewers. The term ‘n-gram’ comes from computational linguistics and means a contiguous sequence of n tokens – for example, counting by words, the phrase ‘What do I mean?’ is a single 4-gram. All four viewers are available via the ‘Tools’ section of the module website, and the short sketch below shows the counting idea in miniature.
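
To make the definition concrete, here is a minimal Python sketch (an illustration only, not one of the module tools) that splits a sentence into word tokens and counts its 2-grams. Splitting on whitespace is a deliberate simplification; the real viewers tokenise far more carefully, over enormous corpora.

    from collections import Counter

    def ngrams(tokens, n):
        # Slide a window of length n across the token list.
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    # A toy corpus; the real viewers work over millions of scanned volumes.
    text = "what do I mean when I say what I mean"
    tokens = text.lower().split()  # crude whitespace tokenisation, for illustration only

    # The four-word phrase is itself a single 4-gram:
    print(ngrams("what do I mean".lower().split(), 4))

    # Count every 2-gram (bigram) and list the most frequent:
    for gram, freq in Counter(ngrams(tokens, 2)).most_common(3):
        print(" ".join(gram), freq)

This is, in miniature, what the viewers plot: the frequency of a chosen n-gram across a corpus, usually broken down by year of publication.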

Follow-up reading:

Culturomics (background on Google’s Ngram Viewer)

Patricia Cohen, ‘In 500 Billion Words, New Window on Culture’, New York Times, Dec 16, 2010 (part of the series ‘Humanities 2.0’)