Focus of Part 2
Part 1 of the “Foundations of Analysis” post briefly introduced the field of the Digital Humanities (DH) and the associated notion of Distant Reading, the key method underlying DH text analysis. In this second part, I’ll discuss a potential framework for doing data analysis in the DH, one that rests on frameworks used in data mining and data science. It’s a framework that I have used in the past to analyze pop-cultural phenomena and artifacts and plan to use in subsequent posts dealing with this subject matter. To set the stage, I first look at a simple example based on a “distant reading visualization” produced in an earlier study by Stefanie Posavec (an independent graphic designer).
On the Road (Again): Example of Distant Reading Visualization of a Single Text
In Part 1 of this post, reference was made to a recent report by Jänicke et al. that surveyed the “close and distant reading visualization techniques” used in DH research papers from 2005 to 2014. One of the key findings was that a number of the studies examining single texts or small collections of texts utilized distant reading visualizations, either solely or in combination with close reading visualizations.
A frequently cited example of a distant reading of a single text is Stefanie Posavec’s 2007 visual analysis of Jack Kerouac’s On the Road. The analysis was part of an MA project (Writing without Words) designed to visually represent text and explore differences in writing styles among authors (including “sentence length, themes, parts-of-speech, sentence rhythm, punctuation, and the underlying structure of the text”). Although the distant part is often highlighted in citations, Posavec began with a manual, close reading of the book. Figure 1 exemplifies the close reading visualization techniques that formed the base for her distant analysis. Essentially, she divided the text into segments (chapters, paragraphs, sentences, and words) and then used color and annotations to record things like the number of the paragraph (circled on the left), the sentences and their length (slashed numbers on the right), and the general theme of the paragraph (denoted by color). These close reading visualizations served as the underpinnings for a number of unique and highly artistic distant reading visualizations.
One of her distant reading visualizations is displayed in Panel 1 of Figure 2. Here, the combined panels highlight the relationship between the components of the tree (in the visualization) and the segments of the book. More specifically, in Panel 1 the lines emanating from the center of the tree represent the chapters in the book (14 of them). Panel 2 shows the paragraphs (15) emanating from one of the chapters. Finally, in Panel 3 we see the sentences and words for one of the paragraphs (I think there are 26 sentences; I don’t know how many words). While there are individual lines for the words, they dissolve into a “fan” display in the visualization. Across these structures, the colors represent the categories or general themes of the sentences in the novel. A major value of this display is that it makes it easy to see the general structure of the novel, as well as the coverage of the various themes by segments of the book, all the way down to the sentences.
An interesting facet of Posavec’s visualizations (this and others) is that they were all done manually, both the close and distant reading visualizations. Given Posavec’s background, it’s easy to understand her approach. By training she’s a graphic designer “with a focus on data-related design,” and her visualization work reflects it. As David McCandless (the author of Information is Beautiful and Knowledge is Beautiful) put it, “Stefanie’s personal work verges more into data art and is concerned with the unveiling of things unseen.” She’s not a digital humanist per se, nor is she a data scientist or data analyst (which isn’t a handicap). Had she been one of the latter, she probably would have used a computer to do the close and distant reading work, although at the time that would have been a little more difficult, especially for the tree displays.
Web-Based Text Analysis of On the Road
Today, it’s a pretty straightforward task to do small-scale text analysis using computer-based or web-based tools (see chapter 3 of Jockers’s Macroanalysis). For instance, I found a digital version of Kerouac’s On the Road on the web and pasted it into Voyant, “a web-based reading and analysis environment for digital texts.” It took me about 20 minutes to find the text and produce the analysis shown in Figure 3. As usual, most of that time (about 15 minutes) was spent cleaning up issues with the digitized text (e.g., paragraphs that had been split into multiple pieces or letters that had been incorrectly digitized).
The various counts provided by tools like Voyant (e.g., sentence and word counts) are similar to the types of data that formed the base for Posavec’s displays. There are a small number of publicly available tools of this sort that non-experts can use to perform various computational linguistic tasks. Of course, what they don’t provide is the automated classification of the paragraphs and sentences into the various (color-coded) themes, nor the automated creation of the tree diagram. Both of these require programming, specialized software, or a combination of the two. (For a discussion of the issues and analytical techniques underlying automated analysis of “themes,” see chapter 8 of Jockers’s Macroanalysis.)
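For readers who would rather script these counts than use a web tool, the kinds of summary statistics Voyant reports (total words, unique word forms, average sentence length, top terms) can be approximated in a few lines of standard-library Python. This is only a rough sketch: the tokenization and sentence-splitting rules here are naive assumptions of my own, not Voyant’s actual implementation, and the sample text is invented for illustration.

```python
import re
from collections import Counter

def text_stats(text):
    """Compute Voyant-style summary counts for a plain-text document.

    Tokenization is deliberately naive (letters and apostrophes only);
    real tools use more careful rules for hyphens, numbers, etc.
    """
    words = re.findall(r"[A-Za-z']+", text.lower())
    # Split sentences on terminal punctuation: a rough heuristic.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "total_words": len(words),
        "unique_words": len(set(words)),
        "avg_sentence_length": len(words) / max(len(sentences), 1),
        "top_terms": Counter(words).most_common(5),
    }

sample = "The road went on. The road was long, and the night was long."
stats = text_stats(sample)
print(stats["total_words"], stats["unique_words"])  # → 13 8
```

Run against a full digitized novel, the same function gives the raw material for the kind of segment-level bookkeeping Posavec did by hand.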
Framework for Data Analysis of (Pop) Culture
While incomplete, the brief discussion of Posavec’s visual analysis of On the Road and the outline of a potential computer-based analysis of the same text in “digital” form hint at the broad steps used in performing distant reading (and, for that matter, distant viewing or distant listening). So, returning to the question raised in Part 1: “what are the steps used in doing analysis — data and visual — of some cultural phenomenon of interest and its related artifacts?”
I think I can safely say that there is no “standard” or “approved” framework or methodology for defining and delineating the steps to be used in doing DH. Unlike the “scientific method” or the methods for doing “statistical hypothesis testing,” there are no well-developed cultural practices for doing analysis in the DH, much less any standards committees sitting around voting on standard practices. However, this doesn’t mean that there is or should be “epistemological anarchy” (to use an impressive sounding phrase from Feyerabend cited in Rosenbloom 2012). There are some options that are used in other fields to perform similar sorts of analysis.
The first framework comes from the field of data mining (i.e., discovering meaningful patterns in large data sets using pattern-recognition technologies) and is known as the CRISP-DM methodology. The acronym stands for Cross-Industry Standard Process for Data Mining. Basically, it’s a standards-based methodology detailing the process and steps for carrying out a data mining project. Figure 4 provides an overview of the process (this is the “official” diagram).
These processes and steps also form the basis of various methodologies and frameworks proposed and employed in the emerging field of data science. Wikipedia defines data science (DS) as “an interdisciplinary field about processes and systems to extract knowledge or insights from large volumes of data in various forms, either structured or unstructured, which is a continuation of some of the data analysis fields such as statistics, data mining and predictive analytics, as well as knowledge discovery in databases (KDD).” Obviously, data science is broader than data mining; however, the generic processes used in DS are often very similar. One rendition of the DS Process is shown in Figure 5.
The process starts with the “Real World.” In some cases the “Real World” represents a real-world process producing some data of interest. In other instances, it’s a surrogate for a problem, question, or goal of interest. Similarly, the outcome(s) at the end can be a “Decision” and/or a “Data Product.” Again, the “Decision” represents an actual decision, solution, or answer, while a “Data Product” usually refers to a software application (often automated) that solves a particular problem or improves a process. In many cases the “Real World” and the associated decisions and data products come from the business world. However, because the breadth of DS projects encompasses a wide variety of models and algorithms, including “elements of statistics, machine learning, optimization, image processing, signal processing, text retrieval and natural language processing,” the DS process has been and can be used in a wide variety of theoretical and applied domains in the physical and biological sciences, the social sciences, and the humanities.
To show its applicability to the humanities, reconsider Posavec’s analysis of On the Road. If we focus on the topical analysis and ignore her lexical and stylistic analysis, the general steps of the DS process might look like:
- Real World (Questions) – What were the major themes in Kerouac’s On the Road and how did they vary across the key segments of the text (chapters, paragraphs, and sentences)?
- Raw Data Collected – This is the text of the book tagged by key segments (e.g. first chapter, 5th paragraph, 2nd sentence) and stored in digital form.
- Data Processed into Cleaned Data – Utilize text analysis and natural language processing procedures (e.g. tokenization, stopword removal, stemming, etc.) to prepare and transform the original text into a clean dataset, as well as an appropriate form and format so that it can either be “explored” and/or “modeled.”
- Models & Algorithms – First, employ a topic modeling algorithm (e.g. Latent Dirichlet Allocation – LDA) to “automatically” determine in an unsupervised fashion the underlying themes (i.e. topics) of the book. Second, utilize the results from the topic modeling algorithm to determine the major theme (topic) for each paragraph and sentence. This latter step, which is needed to assign the colors to be used in the tree diagram, requires a new algorithm because it is not automatically performed by the topic modeling algorithm.
- Communicate, Visualize & Report – In this instance, the primary focus of Posavec’s analysis and output was a tree diagram in which the limbs and branches represented the segments of the book, displayed in the order in which they appeared in the text, and the colors of the limbs and branches represented the themes (topics) of the various paragraphs and sentences. There are a number of publicly available algorithms and programs that can produce color-coded tree diagrams (see the output of some of these programs in Figure 6). While none of those I have seen can reproduce Posavec’s handcrafted displays, I’m pretty confident that at least one of them could be modified so that the resulting diagram is close to the original.
- Make Decisions & Data Products — As noted, Posavec’s primary focus was on visualization. Typically, text analysis of this sort would result in written discussion tying the analytical results to the associated problems or questions of interest. Alternatively or additionally, the various algorithms could be packaged into an application or system that could perform similar sorts of analysis and visualization for other collections of text.
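The middle steps of the list above (cleaning the text, then assigning a theme to each segment) can be sketched end to end in plain Python. To stay dependency-free, this sketch substitutes a simple keyword-lookup classifier for a real topic model such as LDA (which would normally come from a library like gensim or scikit-learn), and the theme keywords, stopword list, and sample sentences are all invented for illustration.

```python
import re

# Hypothetical theme keywords standing in for the topics a real
# topic-modeling algorithm (e.g., LDA) would learn from the full text.
THEME_KEYWORDS = {
    "travel": {"road", "drove", "miles", "west"},
    "friendship": {"dean", "friend", "together"},
}
STOPWORDS = {"the", "a", "and", "was", "we", "of", "to", "in"}

def preprocess(sentence):
    """Clean a sentence: lowercase, tokenize, drop stopwords."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return [t for t in tokens if t not in STOPWORDS]

def assign_theme(sentence):
    """Label a sentence with the theme whose keywords it matches most.

    This stands in for the per-segment topic-assignment step; a real
    analysis would use the per-document topic distributions from LDA.
    """
    tokens = preprocess(sentence)
    scores = {theme: sum(t in words for t in tokens)
              for theme, words in THEME_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unclassified"

paragraph = [
    "We drove west on the road for miles.",
    "Dean was a friend, and we rode together.",
    "The night was quiet.",
]
themes = [assign_theme(s) for s in paragraph]
print(themes)  # → ['travel', 'friendship', 'unclassified']
```

The per-sentence labels produced this way are exactly what the final visualization step needs: one theme (and hence one color) per segment, ready to be attached to the limbs and branches of a tree diagram.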
Even though the above example only involves a single book, the DS Process is general enough and broad enough to serve as the framework for DH analysis of all sorts, regardless of the specific cultural phenomenon and artifacts being studied, the size of the data collection involved (micro, meso, or macro), and the nature of the questions or problems being addressed (the who and with whom, the what, the when, and the where). That said, there are a number of details that vary with each of these factors. These details will be provided in the posts to come, starting with the next couple of posts dealing with the text analysis of hip-hop and rap lyrics.
Visual Frameworks and Methods
Visualization projects (like Posavec’s) often require a number of specialized steps over and above those outlined in the DS Process. For projects of this sort, there are specialized frameworks that can be used to either supplement or supplant the DS Process. As noted in the introductory post, one of the best of the visualization frameworks was developed by Katy Börner and her team at Indiana University (IU). Börner is the Victor H. Yngve Professor of Information Science in the Department of Information and Library Science at IU in Bloomington, IN, as well as the Founding Director of the Cyberinfrastructure for Network Science Center at IU. Their visualization framework is detailed in three books: Atlas of Science (2010), Visual Insights (2014), and Atlas of Knowledge (2015). The latest rendition of the framework is described in the Atlas of Knowledge and is shown below in Figure 7 (along with the page numbers in the Atlas associated with the various steps). The framework is supported by a toolset, the Science of Science (Sci2) Tool, which was also developed by Börner and her team. A detailed description of the toolset is provided in Visual Insights, along with a number of examples and case studies illustrating its use.
When looking for guidance in doing visual analysis, another avenue to consider (besides the DS Process and visualization frameworks) is the “informal” processes and practices employed by the graphic designers and illustrators involved in much of the innovative work being done in the areas of information and data visualization. A recent book entitled Raw Data: Infographic Designers’ Sketchbooks (2014) provides a detailed look at the work being done by seventy-three of these designers across the world.
Börner, K. (2010). Atlas of Science: Visualizing What We Know. Cambridge, MA: MIT Press.
Börner, K., and D. Polley (2014). Visual Insights: A Practical Guide to Making Sense of Data. Cambridge, MA: MIT Press.
Börner, K. (2015). Atlas of Knowledge: Anyone Can Map. Cambridge, MA: MIT Press.
Heller, S., and R. Landers (2014). Raw Data: Infographic Designers’ Sketchbooks. London: Thames & Hudson.
Jockers, M. (2013). Macroanalysis: Digital Methods and Literary History (Topics in the Digital Humanities). Urbana, IL: University of Illinois Press.
McCandless, D. (2009). Information is Beautiful. New York: Harper Collins.
McCandless, D. (2010). “Great Visualizers: Stefanie Posavec.” informationisbeautiful.net/2010/great-visualizers-stefanie-posavec/.
McCandless, D. (2014). Knowledge is Beautiful. New York: Harper Collins.
Posavec, S. (2007). “Writing without Words.” stefanieposavec.co.uk/writing-without-words.
Rosenbloom, P. (2012). “Towards a Conceptual Framework for the Digital Humanities.” Digital Humanities Quarterly 6(2). digitalhumanities.org/dhq/vol/6/2/000127/000127.html.
Wikipedia. “Data Science.” en.wikipedia.org/wiki/Data_science.