Foundations of Analysis: Part 2

Focus of Part 2

Part 1 of the “Foundations of Analysis” post briefly introduced the field of the Digital Humanities (DH) and the associated notion of distant reading, the key method underlying DH text analysis. In this second part of the post, I’ll discuss a potential framework for doing data analysis in the DH. The framework draws on processes used in data mining and data science. It’s a framework that I have used in the past to analyze pop cultural phenomena and artifacts and plan to use in subsequent posts dealing with this subject matter. To set the stage for this discussion, I first look at a simple example based on a “distant reading visualization” produced in an earlier study by Stefanie Posavec, an independent graphic designer.

On the Road (Again): Example of Distant Reading Visualization of a Single Text

In Part 1 of this post, reference was made to a recent report by Jänicke et al. that surveyed the “close and distant reading visualization techniques” used in DH research papers from 2005 to 2014. One of the key findings was that a number of the research studies examining single texts or small collections of texts utilized distant reading visualizations, either solely or in combination with close reading visualizations.

[Figure 1: Posavec’s close reading annotations of On the Road]

A frequently cited example of a distant reading of a single text is Stefanie Posavec’s 2007 visual analysis of Jack Kerouac’s On the Road. The analysis was part of an MA project (Writing without Words) designed to visually represent text and explore differences in writing styles among authors (including “sentence length, themes, parts-of-speech, sentence rhythm, punctuation, and the underlying structure of the text”). Although the distant part is often highlighted in citations, Posavec began with a manual, close reading of the book. Figure 1 exemplifies the close reading visualization techniques that formed the base for her distant analysis. Essentially, she divided the text into segments — chapters, paragraphs, sentences and words — and then used color and annotations to record things like the number of the paragraph (circled on the left), the sentences and their lengths (slashed numbers on the right), and the general theme of the paragraph (denoted by color). These close reading visualizations served as the underpinnings for a number of unique and highly artistic distant reading visualizations.

[Figure 2: Components of Posavec’s tree visualization and the corresponding segments of the book]

One of her distant reading visualizations is displayed in Panel 1 of Figure 2. Here, the combined panels highlight the relationship between the components of the tree (in the visualization) and the segments of the book. More specifically, in Panel 1 the lines emanating from the center of the tree represent the chapters in the book (14 of them). Panel 2 shows the paragraphs (15) emanating from one of the chapters. Finally, in Panel 3 we see the sentences and words for one of the paragraphs (I think there are 26 sentences and I don’t know how many words). While there are individual lines for the words, they dissolve into a “fan” display in the visualization. For the various structures, the colors represent the categories or general themes of the sentences in the novel. A major value of this display is that it is easy to see the general structure of the novel, as well as the coverage of various themes by the segments of the book, all the way down to the sentences.

An interesting facet of Posavec’s visualizations (this and others) is that they were all done manually — both the close and distant reading visualizations. Given Posavec’s background, it’s easy to understand her approach. By training she’s a graphic designer “with a focus on data-related design,” and her visualization work reflects it. As David McCandless (the author of Information is Beautiful and Knowledge is Beautiful) put it, “Stefanie’s personal work verges more into data art and is concerned with the unveiling of things unseen.” She’s not a digital humanist per se, nor is she a data scientist or data analyst (which isn’t a handicap). If she were the latter, she probably would have used a computer to do the close and distant reading work, although at the time it would have been a little more difficult, especially for the tree displays.

Web-Based Text Analysis of On the Road

Today, it’s a pretty straightforward task to do small-scale text analysis using computer-based or web-based tools (see chapter 3 of Jockers’s Macroanalysis). For instance, I found a digital version of Kerouac’s On the Road on the web and pasted it into Voyant — “a web-based reading and analysis environment for digital texts.” It took me about 20 minutes to find the data and produce the analysis shown in Figure 3. As usual, most of that time (about 15 minutes) was spent cleaning up issues with the digitized text (e.g. paragraphs that had been split into multiple pieces or letters that had been incorrectly digitized).

[Figure 3: Voyant analysis of On the Road]

The various counts provided by tools like Voyant (e.g. sentence and word counts) are similar to the types of data underlying Posavec’s displays. There are a small number of publicly available tools of this sort that can be used by non-experts to perform various computational linguistic tasks. Of course, what they don’t provide is the automated classification of the paragraphs and sentences into the various (color-coded) themes, nor the automated creation of the tree diagram. Both of these require programming, specialized software, or a combination of the two. (For a discussion of the issues and analytical techniques underlying automated analysis of “themes,” see chapter 8 of Jockers’s Macroanalysis.)
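For readers who would rather script the counting side than use a web tool, here is a minimal Python sketch of the kind of summary statistics a tool like Voyant reports. The file name is a placeholder of mine, and the sentence splitter is deliberately naive (production tools use trained tokenizers):

```python
# A rough cut at the summary counts a Voyant-style tool reports.
# "on_the_road.txt" is a hypothetical file name for a plain-text copy.
import re
from collections import Counter

with open("on_the_road.txt", encoding="utf-8") as f:
    text = f.read()

# naive split on terminal punctuation; real tools use trained sentence models
sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
words = re.findall(r"[A-Za-z']+", text.lower())

print(f"{len(words)} words, {len(set(words))} unique, {len(sentences)} sentences")
print("most frequent:", Counter(words).most_common(10))
```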

Framework for Data Analysis of (Pop) Culture

Though incomplete, the brief discussion of Posavec’s visual analysis of On the Road and the outline of a potential computer-based analysis of the same text in “digital” form hint at the broad steps used in performing distant reading, and distant viewing or distant listening for that matter. So let’s return to the question raised in Part 1: “what are the steps used in doing analysis — data and visual — of some cultural phenomenon of interest and its related artifacts?”

I think I can safely say that there is no “standard” or “approved” framework or methodology for defining and delineating the steps to be used in doing DH. Unlike the “scientific method” or the methods for doing “statistical hypothesis testing,” there are no well-developed cultural practices for doing analysis in the DH, much less any standards committees sitting around voting on standard practices. However, this doesn’t mean that there is or should be “epistemological anarchy” (to use an impressive-sounding phrase from Feyerabend cited in Rosenbloom 2012). There are options used in other fields to perform similar sorts of analysis.

[Figure 4: The CRISP-DM process]

The first framework comes from the field of data mining (i.e. discovering meaningful patterns in large data sets using pattern recognition technologies) and is known as the CRISP-DM methodology. The acronym stands for Cross-Industry Standard Process for Data Mining. Basically, it’s a standards-based methodology detailing the process and steps for carrying out a data mining project. Figure 4 provides an overview of the “Process” (this is the “official” diagram).

These processes and steps also form the basis of various methodologies and frameworks proposed and employed in the emerging field of data science. Wikipedia defines data science (DS) as “an interdisciplinary field about processes and systems to extract knowledge or insights from large volumes of data in various forms, either structured or unstructured, which is a continuation of some of the data analysis fields such as statistics, data mining and predictive analytics, as well as knowledge discovery in databases (KDD).” Obviously, data science is broader than data mining; however, the generic processes used in DS are often very similar. One rendition of the DS Process is shown in Figure 5.

[Figure 5: The data science (DS) process]

The process starts with the “Real World.” In some cases the “Real World” represents an actual real-world process producing data of interest. In other instances, it’s a surrogate for a problem, question, or goal of interest. Similarly, the outcome(s) at the end can be a “Decision” and/or a “Data Product.” Again, the “Decision” represents an actual decision, solution, or answer, while a “Data Product” usually refers to a software application (often automated) that solves a particular problem or improves a process. In many cases the “Real World” and the associated decisions and data products come from the business world. However, because the breadth of DS projects encompasses a wide variety of models and algorithms, including “elements of statistics, machine learning, optimization, image processing, signal processing, text retrieval and natural language processing,” the DS process has been and can be used in a wide variety of theoretical and applied domains in the physical and biological sciences, the social sciences, and the humanities.

To show its applicability to the humanities, reconsider Posavec’s analysis of On the Road. If we focus on the topical analysis and ignore her lexical and stylistic analysis, the general steps of the DS process might look like:

  1. Real World (Questions) – What were the major themes in Kerouac’s On the Road and how did they vary by key segments of the text — chapters, paragraphs and sentences?
  2. Raw Data Collected – This is the text of the book tagged by key segments (e.g. first chapter, 5th paragraph, 2nd sentence) and stored in digital form.
  3. Data Processed into Cleaned Data – Utilize text analysis and natural language processing procedures (e.g. tokenization, stopword removal, stemming) to prepare and transform the original text into a clean dataset, in a form and format that can be “explored” and/or “modeled.”
  4. Models & Algorithms – First, employ a topic modeling algorithm (e.g. Latent Dirichlet Allocation – LDA) to “automatically” determine, in an unsupervised fashion, the underlying themes (i.e. topics) of the book. Second, utilize the results from the topic modeling algorithm to determine the major theme (topic) for each paragraph and sentence. This latter step, which is needed to assign the colors used in the tree diagram, requires an extra step because it is not automatically performed by the topic modeling algorithm. (A minimal code sketch of steps 3 and 4 appears after this list.)
  5. Communicate, Visualize & Report – In this instance, the primary focus of Posavec’s analysis and output was a tree diagram in which the limbs and branches represented the segments of the book, displayed in the order in which they appeared in the text, and the colors of the limbs and branches represented the themes (topics) of the various paragraphs and sentences. There are a number of publicly available algorithms and programs that can produce color-coded tree diagrams (see the output of some of these programs in Figure 6, and the toy sketch that follows it). While none of the ones I have seen can reproduce Posavec’s handcrafted displays, I’m pretty confident that at least one of them could be modified so that the resulting diagram is close to the original display.
  6. Make Decisions & Data Products — As noted, Posavec’s primary focus was on visualization.  Typically, text analysis of this sort would result in written discussion tying the analytical results to the associated problems or questions of interest.  Alternatively or additionally, the various algorithms could be packaged into an application or system that could perform similar sorts of analysis and visualization for other collections of text.
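To make steps 3 and 4 concrete, here is a hedged Python sketch using scikit-learn. This is not Posavec’s method (she worked by hand); it is just one plausible automation of it. The paragraph strings are toy stand-ins, not actual Kerouac text, and the number of topics is an assumption:

```python
# A sketch of steps 3 and 4 under the assumptions stated above.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

paragraphs = [  # hypothetical: one string per paragraph of the novel
    "we drove all night across the desert talking about jazz",
    "dean talked of god and time and the road ahead",
    "the city lights burned while we hunted for jazz in the bars",
    "i thought about home and the sad american night",
]

# Step 3: tokenize, lowercase, and drop stopwords -> document-term matrix
vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(paragraphs)

# Step 4a: unsupervised discovery of themes (topics) via LDA
lda = LatentDirichletAllocation(n_components=3, random_state=0)  # 3 topics assumed
doc_topics = lda.fit_transform(dtm)  # row i = topic distribution of paragraph i

# Step 4b: the extra step -- tag each paragraph with its dominant topic,
# the label Posavec encoded as color in her tree diagram
dominant_topic = doc_topics.argmax(axis=1)
print(dominant_topic)

# inspect the top words that characterize each discovered theme
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[-5:][::-1]]
    print(f"topic {k}: {', '.join(top)}")
```

The `dominant_topic` array is exactly the per-paragraph label that would drive the color coding in step 5.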

[Figure 6: Color-coded tree diagrams produced by publicly available programs]
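As a toy illustration of the kind of display at least one of those programs could be bent toward, the following matplotlib sketch draws a crude radial tree: gray limbs for chapters, colored branches for paragraphs keyed to hypothetical topic ids. It is a gesture toward Posavec’s display, not a reproduction of it:

```python
# Toy radial tree: chapters as gray limbs, paragraphs as colored branches.
import math
import matplotlib.pyplot as plt

book = {               # toy hierarchy: chapter -> topic id of each paragraph
    "ch1": [0, 2, 1],
    "ch2": [1, 1, 0, 2],
    "ch3": [2, 0],
}
topic_colors = ["#d62728", "#1f77b4", "#2ca02c"]  # one color per topic

fig, ax = plt.subplots(figsize=(6, 6))
n_chapters = len(book)
for i, (chapter, topics) in enumerate(book.items()):
    angle = 2 * math.pi * i / n_chapters
    x1, y1 = math.cos(angle), math.sin(angle)   # chapter limb endpoint
    ax.plot([0, x1], [0, y1], color="gray", lw=2)
    n_par = len(topics)
    for j, t in enumerate(topics):              # paragraph branches fan out
        spread = 0.8                            # fan width in radians
        p_angle = angle + spread * (j / max(n_par - 1, 1) - 0.5)
        x2, y2 = x1 + 0.5 * math.cos(p_angle), y1 + 0.5 * math.sin(p_angle)
        ax.plot([x1, x2], [y1, y2], color=topic_colors[t], lw=1.5)

ax.set_aspect("equal")
ax.axis("off")
plt.savefig("toy_tree.png", dpi=150)
```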

Future Analyses

Even though the above example involves only a single book, the DS Process is general enough and broad enough to serve as the framework for DH analyses of all sorts, regardless of the specific cultural phenomenon and artifacts being studied, the size of the data collection involved (micro, meso or macro), and the nature of the questions or problems being addressed (the who and with whom, the what, the when and the where). That said, there are a number of details that vary with each of these factors. These details will be provided in the posts to come, starting with the next couple of posts dealing with the text analysis of hip-hop and rap lyrics.

Visual Frameworks and Methods

Visualization projects (like Posavec’s) often require a number of specialized steps over and above those outlined in the DS Process. For projects of this sort, there are specialized frameworks that can be used to either supplement or supplant the DS Process. As noted in the introductory post, one of the best of the visualization frameworks was developed by Katy Börner and her team at Indiana University (IU). Börner is the Victor H. Yngve Professor of Information Science in the Department of Information and Library Science at IU in Bloomington, IN, as well as the Founding Director of the Cyberinfrastructure for Network Science Center at IU. Their visualization framework is detailed in three books: Atlas of Science (2010), Visual Insights (2014), and Atlas of Knowledge (2015). The latest rendition of the framework is described in the Atlas of Knowledge and is shown below in Figure 7 (along with the page numbers in the Atlas associated with the various steps). The framework is supported by a toolset — the Science of Science (Sci2) Tool — which was also developed by Börner and her team. A detailed description of the toolset is provided in Visual Insights, along with a number of examples and case studies illustrating its use.

[Figure 7: Börner’s visualization framework, with associated page numbers from the Atlas of Knowledge]

When looking for guidance in doing visual analysis, another avenue to consider (besides the DS Process and visualization frameworks) is the “informal” processes and practices employed by the graphic designers and illustrators involved in much of the innovative work being done in the areas of information and data visualization. A recent book entitled Raw Data: Infographic Designers’ Sketchbooks (2014) provides a detailed look at the work being done by seventy-three of these designers across the world.

Resources

References

Börner, K. (2010). Atlas of Science: Visualizing What We Know. Cambridge, MA: MIT Press.

Börner, K., and D. Polley (2014). Visual Insights: A Practical Guide to Making Sense of Data. Cambridge, MA: MIT Press.

Börner, K. (2015). Atlas of Knowledge: Anyone Can Map. Cambridge, MA: MIT Press.

Heller, S., and R. Landers (2014). Raw Data: Infographic Designers’ Sketchbooks. London: Thames & Hudson.

Jockers, M. (2013). Macroanalysis: Digital Methods and Literary History (Topics in the Digital Humanities). Urbana, Chicago, and Springfield, IL: University of Illinois Press.

McCandless, D. (2010). “Great Visualizers: Stefanie Posavec.” informationisbeautiful.net/2010/great-visualizers-stefanie-posavec/.

McCandless, D. (2009). Information is Beautiful. New York: Harper Collins.

McCandless, D. (2014). Knowledge is Beautiful. New York: Harper Collins.

Posavec, S. (2007). “Writing without Words.” stefanieposavec.co.uk/writing-without-words.

Rosenbloom, P. (2012). “Towards a Conceptual Framework for the Digital Humanities.” Digital Humanities Quarterly 6(2). digitalhumanities.org/dhq/vol/6/2/000127/000127.html.

Wikipedia. “Data Science.” en.wikipedia.org/wiki/Data_science.

People

Katy Börner
David McCandless
Stefanie Posavec

Places (Virtual and Real)

Cyberinfrastructure for Network Science Center

Tools

Science of Science (Sci2) Tool

Voyant

Foundations of Analysis: Part 1

On the Shoulders of Giants and Other Researchers of Above Average Height

[Image: “standing on the shoulders of giants” photo collage; the numbered individuals are identified below]

In the introductory post, I indicated that most of my research in Pop Culture derives its foundation from the field of Digital Humanities (DH) and the works of “giants” in the field like Lev Manovich and Franco Moretti. In this post, I want to put this foundation in context, discussing some of the key ideas and the researchers behind those ideas.

For the record, Manovich and Moretti are #3 and #4 in the picture above. The identity of the others is revealed below. That’s me “standing on their shoulders.” You may have heard the phrase “If I have seen further, it is by standing on the shoulders of giants,” often attributed to Sir Isaac Newton, or read Stephen Hawking’s book On the Shoulders of Giants. Matthew Jockers, one of the leaders in DH (#5 above and discussed at the end of the post), also noted in describing the state of DH’s union that, “In 2012 we stand upon the shoulders of giants, and the view from the top is breathtaking.” Actually, the original phrase, and the one quoted in the picture above, was more like “A dwarf standing on the shoulders of a giant may see further than the giant himself.” This phrasing dates to sometime in the 12th century, well before political correctness. The term was meant to be metaphorical, referring to someone with modest skills. For me it’s both metaphorical and literal, since I’m 3 standard deviations to the left of the average height for US males.

Roots of Digital Humanities

The humanities encompass a variety of disciplines focused on the study of human culture.  Among the various disciplines are literature, history, philosophy, religion, languages, the performing arts, and the visual arts. Besides their focus on human culture, the humanities traditionally have been distinguished (from the sciences) by the methods they employ, relying primarily on “reflective assessment, introspection and speculation” underpinned by a large element of historical (textual) data.

In recent years a growing segment of humanists (and researchers in other disciplines) have begun to apply some of the newer analytical and visualization methods to a range of traditional humanities topics. The application of these newer tools and techniques falls under the rubric of digital humanities (DH). The DH label was created at the beginning of the 21st century to distinguish the field from its earlier counterpart — humanities computing, which encompassed a variety of efforts aimed at standardizing the digital encoding of humanities texts — and to avoid any confusion that might be raised by labeling these newer activities as digitized humanities.

[Figure: Google Trends and Google Ngram results for “digital humanities” vs. “humanities computing”]

Basically, digitized humanities refers to the digitization of resources so they can be accessed by digital means (e.g. digitizing images so they can be viewed online or digitizing an image catalog so it can be queried online). In contrast, DH refers to humanities study and research performed with the aid of advanced computational, analytical, and visualization techniques enabled by digital technology. Digitization is simply one component of DH, albeit a critical one.

“The ‘IZEs’ have it” or should that be “The ‘IZations’ have it”

Speaking of “digitization,” have you ever noticed the proliferation of “izes” and “izations” in today’s rhetoric, especially IT rhetoric? As John Burkardt, a research associate in the Department of Scientific Computing at Florida State University, reminds us, we owe this phenomenon to the “16th century poet Thomas Nashe for inventing the suffix –ize as a means of easily generating new and longer words” and to the (Madison Ave.) ad-speak of the 1950s, which relied on (among other things) the use of -ize and -ization to create new verbs and nouns. Burkardt provides a long list of 393 examples, many of them current. In the world of IT, “izes” and “izations” abound. For instance, a recent Tweet highlighted “The digitization and ‘cloud’ization of data #digital data.” Similarly, recent articles about the Internet of Things (IoT) have opined that the IoT is about “the ‘dataization’ of our bodies, ourselves, and our environment.”

In the context of this posting, some of the key “izes” and “izations” impacting the world of DH include:

  • Digitize and digitization – Converting cultural artifacts (e.g. text documents, paintings, photographs, song lyrics) to digital form.
  • Webize and webization – Conversion to digital form often goes hand-in-hand with the process of adapting digitized artifacts to the Web (or Internet).
  • Dataize and dataization – “…translating digitized cultural artifacts into ‘data’ which captures their content, form and use.”
  • Algorithmize and algorithmization – Converting an informal description of a process or a procedure into an algorithm.
  • (Data or Information) Visualize and visualization – Presenting data or information in a pictorial or graphic form.
  • Artize and artization – Emphasizing the artistic quality of an information or data visualization.

All but the last of these “izes” is “real,” meaning that I was able to find definitions for the words on the Web. The last one is sort of a figment of my imagination. I say “sort of” because the phenomenon is real and important, even though the word doesn’t and probably shouldn’t exist. Lev Manovich calls this phenomenon “artistic visualization” — visualizations that place a heavy emphasis on their artistic dimension. I plan to cover this dimension in detail in future postings. A couple of places where you can see a number of examples are Manuel Lima’s visualcomplexity.com and Mike Bostock’s D3 gallery.

In a crude sense, the sequence of “izes” listed above provides an outline for doing research in DH. That is to say, for some (pop) cultural phenomenon of interest and its associated artifacts, the artifacts first have to be digitized and maybe webized, then dataized, then algorithmized, and finally visualized (at least in a primitive sense). So, the question is: how do you do this? Some of the answers are aimed at specific types of cultural phenomena and artifacts, while others rest on very general frameworks for doing data analysis, data science, or (data) visualization. The section below briefly discusses a couple of instances of the former, while the general frameworks are described in Part 2 of the post.

Distant Analysis: Reading, Viewing and Listening

Close Reading

Much of the research and analysis conducted in the humanities revolves around the examination of textual information utilizing a method known as close reading. Close reading involves concentrated, thoughtful and critical analysis that focuses on significant details garnered from one text or a small sample of texts (e.g. a novel or collection of novels) with an eye toward understanding key ideas and events, as well as the underlying form (words and structure). While close reading remains the primary method in literary analysis and criticism, it is not without its drawbacks. Given its attention to detail, as well as its reliance on the skills of the individual reader, close reading makes it difficult: (1) to replicate results; (2) to ascertain general patterns within a single text or among a collection of texts; and (3) to generalize findings from the analysis of a single text or a small (non-random) sample of texts to some larger population of which the analyzed text or sample is a small part.

Distant Reading

This is where distant reading can come into play.  The term distant reading was coined by the literary expert Franco Moretti in 2000 to advocate the use of various statistical, data analysis, and visualization techniques to understand the aggregate patterns of large(r) collections of text at a “distance.” Towards this end, he suggested that the (types of) “graphs used in quantitative history, the maps from geography, and trees from evolutionary theory” could serve to “reduce and abstract” the text within the collections of interest and to “place the literary field literally in front of our eyes — and show us how little we know about it” (Moretti 2007). Obviously, these are fighting words to the average humanist, which means that distant reading has an abundance of critics (both literary and otherwise).

Closely related to the concept of distant reading is the notion of culturomics. As Wikipedia notes, this is a form of “computational lexicology” (I’ll let you look that one up) that studies human culture and cultural trends by means of the quantitative analysis of words and phrases in a very large corpus of digitized texts. The best-known example of culturomics is the Google Labs’ Ngram Viewer project, which uses n-grams to analyze the Google Books digital library for cultural patterns in language use over time. Some interesting examples of the types of analysis that can be performed are provided in a research article by Jean-Baptiste Michel (photo #6) and Erez Lieberman Aiden (photo #2) et al. entitled “Quantitative Analysis of Culture Using Millions of Digitized Books,” which appeared in the January 2011 issue of Science.
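For the curious, the mechanics behind such counts are simple. Here is a toy Python sketch of n-gram counting; the Ngram Viewer computes counts like these per publication year across millions of books, so the scale, not the arithmetic, is the hard part:

```python
# Toy illustration of the n-gram counting behind culturomics-style analysis.
import re
from collections import Counter

def ngram_counts(text, n=2):
    """Count the n-word sequences (n-grams) in a text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

sample = "the best of times the worst of times"
print(ngram_counts(sample).most_common(3))
# -> [(('of', 'times'), 2), (('the', 'best'), 1), (('best', 'of'), 1)]
```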

While the notions of distant reading and culturomics highlight the need for analyzing larger collections of texts, a number of DH research projects conducted in the past 10 years (since the publication of Moretti’s Graphs, Maps, Trees in 2005) have focused on individual texts or smaller samples of texts, employing the analytical and visualization techniques advocated by Moretti to enhance or supplement a close reading exercise. A very recent report by Jänicke (photo #1) et al. (2015) surveyed the “close and distant reading visualization techniques” utilized by DH research papers published in key visualization and digital humanities journals from 2005 to 2014. The close reading visualizations used things like color, font size, glyphs or connection lines to highlight various features of the text or the reader’s annotations, while the distant reading visualizations involved the usual suspects – charts, graphs, maps, networks, etc. Below is a table displaying the relationship between the type(s) of visualization used (close, distant or both) and the size of the text sample being analyzed (single text, small collection, large corpus).

[Table: close vs. distant reading visualization techniques by size of text sample, from Jänicke et al. (2015)]

Among other things, the results indicate that: (1) although all of these studies fall under the DH umbrella, a sizeable number (37 out of 100) used either a single text or a smaller collection of texts; and (2) within those studies, almost half (18 out of 37) used either distant reading visualization techniques alone or some combination of close and distant visualization techniques.

A Word about Distant Viewing and Listening

Even in humanistic fields outside of literature, like the visual arts, where the cultural artifacts of interest (e.g. paintings or sculptures) are non-textual, something akin to close reading, along with text itself, plays a critical role. That is, single pieces of art, single artists, or even specific styles or movements are the focal point of much of the scholarly research in this area, and much of the scholarly communication is delivered in textual form as “catalogues, treatises, monographs, articles and books.” Increasingly, the tools and techniques of digital humanities are also being applied to these non-textual areas – what we might call distant viewing and distant listening of much larger, digital collections of art, artists, music and musicians.

In the world of distant viewing, the leading light is Lev Manovich. Currently, Manovich is a Professor at The Graduate Center, City University of New York (CUNY) and Director of the Software Studies Initiative, which has labs at CUNY and the University of California, San Diego (UCSD). His focus, and the focus of the labs, is on the intertwined topics of software (as a cultural phenomenon and artifact) and cultural analytics. Cultural analytics is defined as “the use of computational methods for the analysis of massive cultural data sets and flows.” Basically, it’s the subset of the Digital Humanities focused on “big (cultural) data,” especially really big image data sets. Manovich and his colleagues at the Software Studies Initiative are very prolific, so it’s hard to pinpoint one article or book that summarizes their work. However, Manovich’s article “Museum without Walls, Art History without Names: Visualization Methods for Humanities and Media Studies” does an excellent job of summarizing a number of his articles dealing with the topics of this post, including distant and close reading.

A Final Reference

This post barely touches the world of Digital Humanities and the associated notions of distant and close reading. For a really good book on these topics, take a look at Matthew Jockers’ Macroanalysis: Digital Methods and Literary History — Topics in the Digital Humanities (2013). As the title implies, Jockers prefers the terms microanalysis and macroanalysis (à la economics), with a bit of mesoanalysis thrown in between, to the standard terms close and distant reading. He’s also written a how-to book — Text Analysis with R for Students of Literature (Quantitative Methods in the Humanities and Social Sciences) — that details how to perform micro-, meso- and macroanalysis of text.

As an aside, Jockers (photo #5) was a colleague of Moretti’s at Stanford University and with Moretti was co-founder and co-director of the Stanford Literary Lab.  Today he is an Associate Professor of English at the University of Nebraska, Lincoln, Faculty Fellow in the Center for Digital Research in the Humanities and Director of the Nebraska Literary Lab.

Resources

References

Michel, J.-B.,* and E. Lieberman Aiden,* et al. (2011). “Quantitative Analysis of Culture Using Millions of Digitized Books.” Science 331(6014): 176–182. [* joint lead authors]

Bostock, M. D3 Gallery.

Burkardt, John. Word Play.

Lima, M. Visual Complexity.

Jänicke, S., et al. (2015). “On Close and Distant Reading in Digital Humanities: A Survey and Future Challenges.” Proceedings of EuroVis 2015 (STARs).

Jockers, M. (2013). Macroanalysis: Digital Methods and Literary History (Topics in the Digital Humanities). University of Illinois Press.

Jockers, M. (2014). Text Analysis with R for Students of Literature (Quantitative Methods in the Humanities and Social Sciences). Springer.

Manovich, L. (2013). “Museum without Walls, Art History without Names: Visualization Methods for Humanities and Media Studies.” In Oxford Handbook of Sound and Image in Digital Media. Oxford University Press.

Moretti, F. (2007). Graphs, Maps, Trees: Abstract Models for a Literary History. London: Verso.

Moretti, F. (2013). Distant Reading. London: Verso.

Places

Center for Digital Research in the Humanities, University of Nebraska–Lincoln.

Stanford Literary Lab, Stanford University.

Culturomics.

Software Studies Initiative.