Quantified Me (A Relatively Quick Note)

Infographics: A New Edition

Figure 1. Best American Infographics

A couple of weeks ago I purchased the latest edition of The Best American Infographics (2016), edited by Gareth Cook. Cook "is a Pulitzer Prize winning journalist and contributing writer to the New York Times Magazine." In the past he worked for the Boston Globe and was their science editor from 2007 to 2011 [note: he graduated from Brown University with degrees in International Relations and Mathematical Physics, a bit of an odd combination]. This particular edition was released on Oct. 4 and is the fourth in the series. Like the other editions in the series, this one covers a range of infographics "originally" published for a North American audience, online or in print, during 2015. If you follow this particular genre of data or info visualization, you may have seen some of them before.

Figure 2. Defining Infographics with an Infographic

Of course, this all raises the question of "what is an infographic?", for which I have no really good answer. There is a long list of possibilities, of which these are some:

  • Wikipedia: Information graphics or infographics are graphic visual representations of information, data or knowledge intended to present information quickly and clearly.
  • Customermagnetism.com: An infographic is a popular form of content marketing that can help you simplify a complicated subject or turn an otherwise boring subject into a captivating experience. Ideally, an infographic should be visually engaging and contain a subject matter and data that is appealing to your target audience …something that is truly ‘link worthy’ or ‘share worthy’.
  • Visual.ly: Infographics are visualizations that:  1. present complex information quickly and clearly; 2. integrate words and graphics to reveal information, patterns or trends; 3. are easier to understand than words alone; 4. are beautiful and engaging.
  • By Example: You’ll know one when you see one, which is sort of Cook’s approach (some examples of infographics about infographics are shown in Figure 2).

An Example: Dear Data

Figure 3. Dear Data – 52 weeks of Postcard Infographs

The introduction to this newest edition, which was written by Robert Krulwich of RadioLab and NPR fame, highlights one of the entries called Dear Data. Actually, it's not a single infographic but a set of 104 (52 pairs of) infographic postcards exchanged between two artists, Giorgia Lupi and Stefanie Posavec, over the course of 2015. At the time, Lupi was living in Brooklyn and Posavec in London. Basically, they came up with a (personal) question for each of the 52 weeks of 2015, which they each answered by collecting their own data, creating a graphic to visualize the data, putting it all on a postcard along with a brief explanation of the visualization, and then sending the card to the other. The questions covered a multitude of topics, some more personal than others, including: How did you check the time? How often did you say 'thank you'? How often did you have intimate contact? How often did you complain? How often did you drink? How often did you walk through doors? And so on.

This project is actually documented in a new book by the same name.  The flavor of the book is reminiscent of some of the examples in a slightly earlier book (Oct. 2014) called  Raw Data: Infographic Designers’ Sketchbooks by Heller and Landers. Some of the sketches by Posavec (see the cover in Figure 3) are also reminiscent of some of her earlier work (like her MA project Writing without Words)  which I discussed in “Foundations of Analysis: Part 1.” Finally, the entire project exemplifies a larger movement known as the Quantified Self.

The Quantified Self

On a variety of occasions I've tried to keep an online diary of my dietary and exercise activities for an extended period of time. A couple of times in the past, I did at least 30 mins of running, cycling or swimming every day for a year, recording and analyzing the particulars as the year progressed. It's hard to keep a streak like that going, especially when you're flying all over the world on a frequent basis (75-100+K miles a year). Much of this effort was designed to track various training regimes for upcoming events (marathons, stage races, distance bike rides, and the like). Part of the time it was just to see if I could persevere for the year. A number of times I've tracked my diet in a similar fashion for extended periods (up to 6 months at a time), usually for medical reasons. Think about recording everything you eat and drink, determining the amount of calories, salt, carbs, and fats you consume for the day. Also, think about trying to analyze the content of a meal eaten at a restaurant. With the exception of the occasional fast food restaurant, most restaurants don't post this information. Again, this isn't too daunting unless you're traveling and eating out a lot (say 50-70% of the year like I used to do).

In the case of exercise, I’ve had a variety of activity trackers over the years (Garmin watches of various sorts). Until recently, these have worked well for running but not so well for cycling and swimming.  Now, with a combination of my Garmin watch and iPhone, it’s easy to track all three and to have the data automatically upload to one of the fitness tracking sites for archiving and analysis. Some of these sites (like MyFitness) also make it possible to enter the food and drinks you consume and to have the various nutritional counts automatically computed.  However, even with programs of this sort, I can attest that it’s still a royal pain in the backside to track your diet (unless you’re on a mono-diet).

The first time I tracked my info for any extended period of time, I viewed it as a major accomplishment. That is, until I came across some of the leading lights of the Quantified Self (QS) movement, including Lifeloggers like Steve Mann and Gordon Bell. This movement focuses on "incorporating technology into the acquisition of data about various aspects of a person's daily life." When they say "various," they aren't kidding. It's hard to think of an aspect of one's life or self that someone isn't spending time recording and analyzing (in many cases an inordinate amount of time). This includes, among other things, data about "inputs (food consumed, quality of surrounding air), states (mood, arousal, blood oxygen levels), and work and performance, whether mental or physical."

One of my favorites is  Alberto Frigo‘s right hand (like “My Left Foot” but different).  In his own words:

It has been more than 10 years since I have started that project, to be precise today the 11th of May 2014, is my 3.882nd day I have been photographing every object my right hand uses.

Figure 4. Frigo’s Lifelogging Site

This project is briefly described in an article that appeared in the Irish online news journal (TheJournal.ie) in 2015. As the article notes, the project is "documented on his (Frigo's) frankly kind-of-impenetrable website" (see Figure 4). It is slated to run until 2040, when Frigo will turn 60. At the current rate of 76 pictures a day, that's just under 1 million pictures.

 

Figure 5. Feltron's Annual Lifelogging Report for 2014

On a slightly broader scale, and more in keeping with the infographic theme, Nicholas Feltron, who was one of the lead designers of the Facebook Timeline, produced an annual report for each of the ten years from 2005-2014 based on data about his personal activities and surrounding environs. Like an annual corporate financial report, Feltron’s report summarized his data for each of the four quarters, as well as for the total year.  The totals for the last report in the series (2014) are shown below in Figure 5.

Life-Logging Before and After Apps: The Rise of Personal Visualization and Personal Visual Analytics

The shift to "commercially available applications and products" reflects a general trend away from PC-based solutions toward mobile applications, in large part because mobile devices are more frequently used to record activities as they happen. This shift has not only simplified data collection, but has also marked a major move toward data visualizations revolving around personal dashboards. The visualizations in Figure 6 are typical of those produced a few years ago, while the visualization in Figure 7 is easily recognized as a recent mobile app.

Figure 6. Early Examples of Lifelogging PC Applications

Figure 7. Example of a Mobile Lifelogging App

Figure 8. Ryan’s New Book on the Visual Culture

Until recently, research on the use of these sorts of applications, services and tools has been spread across a variety of domains. This means that "many common lessons, findings, issues and gaps may be missed even though these research communities actually share many common goals." Efforts to address these issues and gaps have given rise to an emerging research community focused on personal visualization (PV) and personal visual analytics (PVA). The former involves "the design of interactive visual representations for personal use in a personal context," while the latter is the "science of analytical reasoning facilitated by visual representations used within a personal context" (Tory and Carpendale, 2015). Among those with an interest in PV and PVA, there is the belief "that integrating techniques from a variety of areas (among these visualization, analytics, ubiquitous computing, human-computer interaction, and personal informatics) will not only impact applications for formal data users but will also empower individuals to make better use of the personal data that is relevant to their personal lives, interests, and needs" (Huang et al., 2015; also see Ryan, 2016 for a broader perspective).

Constructing Infographs: Left for a Later Date

In this note, I haven’t addressed any of the issues, techniques or tools revolving around the construction of infographs. This will be addressed at a later date.

Resources

References

Cook, Gareth (ed.). The Best American Infographics 2016. Mariner Books.

Customer Magnetism. What is an Infographic?

Huang, D. et al. Personal Visualization and Personal Visual Analytics. March 2015. IEEE Transactions on Visualization and Computer Graphics.

Heller, S. and R. Landers. Raw Data: Infographic Designers’ Sketchbooks. 2014. Thames & Hudson Ltd.

Lupi, G. and S. Posavec. Dear Data. 2016. Princeton Architectural Press.

Ryan, L. The Visual Imperative: Creating a Visual Culture of Data Discovery. 2016. Morgan-Kaufmann Publishers.

TheJournal.ie. "This man has been taking a photo of everything he touches… for the last 11 years." 2015.

Tory, M. and S. Carpendale. “Personal Visualization and Personal Visual Analytics.”   July/August 2015. IEEE Computer Society.

Wikipedia. Infographic.

Visually. What is an Infographic?

 

People

Gordon Bell

Gareth Cook

Nicholas Feltron

Alberto Frigo

Robert Krulwich

Giorgia Lupi

Steve Mann

Stefanie Posavec

 

Places (Virtual and Real)

Quantified Self Movement: Self Knowledge Through Numbers

 

Analyzing Rap Lyrics – Part 3: Analytical Results and Implications

Part 3 is divided into three major sections. The first section focuses on the results produced by analyzing the changes and trends in lyrical "style" across time for our sample of 1206 rap songs. In this section the results were generated from a series of standard one-way ANOVA tests (with polynomial contrasts) run against the data contained in the song-property matrix described at the end of Part 2. The second section of this part discusses the results produced by analyzing the trends in lyrical "content" across time for the sample. More specifically, these results come from applying an analytical technique called non-negative matrix factorization (NMF) to the song-stem (tf-idf) matrix also described at the end of Part 2. Finally, the third section deals with the practical implications of this and similar sorts of research, most of which revolve around a larger body of research known as Music Information Retrieval (MIR).

Trend Analysis: Selection of Time Periods

Even those unfamiliar with the genre can recognize that rap and hip-hop are not what they used to be. A pre-2005 hip-hop or rap hit can be easily distinguished from a track released in the past decade, and artists who have gotten into the game within the last ten years bear little similarity to what was the norm for ‘90s-era rappers… Perhaps the most striking difference between 1990s hip-hop and more modern tracks is the lyrics (McNulty, 2014).

One book that attempts to provide a historical understanding of some of the shifts in (American) rap lyrics is The Anthology of Rap (edited by Bradley and Dubois, 2010). The book is divided into four historical sections: (1) Old School (1978-84); (2) Golden Age (1985-1992); (3) Mainstream (1993-1999); and (4) the New Millennium (2000-2010). Similar distinctions were made in a special magazine issue devoted to Hip Hop Legends (Jaime, 2015), although the time periods and genres were slightly different and included: Old School (1970-1982); Golden Age (1983-1987); Gangsta (1988-1998); and Millennium (1999-2010). In these, as well as other informal discussions of the historical evolution of rap, the naming and delineation of the evolutionary periods have been based on "practical means of organizing the material" rather than on a well-formulated and consensual set of terms and times.

In the trend analysis of rap lyrics provided in this discussion, time is treated differently for the two types of analysis – lyrical “style” and lyrical “content.”  This mimics the approaches taken by earlier researchers dealing with the same topics.

In the case of "style," one-way ANOVA (with contrasts) has often been used to examine the changes in various stylistic variables across time. Here, time has typically been divided into groups or periods. For example, this is the approach used by Petrie et al. (2008) and Fell and Sporleder (2014) in their historical analyses of the lyrics of various genres. In the analysis of style that follows, time is divided into six periods. The specific periods, along with the number of (sample) songs in each period, are shown in Table 1:

[Table 1. Time periods used in the trend analysis, with the number of sample songs in each period]
With the exception of the first and last periods, each of the other time periods is 5 years long. The first (1980-89) and last (2010-15) periods were extended beyond 5 years primarily to ensure that there were enough songs for comparison with the other periods.

 

This is the same sort of approach used by Petrie et al. (2008) and Fell and Sporleder (2014) in their historical analyses of other types of song lyrics (although the exact slices of time were different). It is not the approach used in "The Evolution of Popular Music: USA 1960-2010" (Mauch 2015), nor is it the approach used in "The Evolution of Pop Lyrics and a Tale of Two LDA's" (Thompson 2015). As noted earlier, these two "Evolution" studies examined shifts by year in the audio and lyrics, respectively, of the US Billboard Hot 100 between 1960 and 2010. In each instance the analysis was based on the individual years rather than larger time periods because the sample size was large enough to support more detailed analysis (i.e. with approximately 17K songs in the sample there was an average of about 350 songs per year). With the sample of 1206 rap songs analyzed here, many of the earlier and later years had fewer than 5 songs, which makes meaningful comparisons among individual years difficult.

Trends in Style

To reiterate what was said in Part 2, while lyrical style covers a wide range of properties, the focus of this analysis is on a subset of four stylistic dimensions including:

[Table 2. The four dimensions of the style analysis]

With the first three dimensions – Vocabulary, Style and Orientation – only a handful of specific measures are considered. In the case of the fourth dimension – Semantics – the feature of “Conceptual Imagery” is a surrogate for a large subset of stylistic features and measures based on the research of Pennebaker et al. and produced by a computer application called the  Linguistic Inquiry and Word Count (LIWC) that was derived from this research. The larger subset of conceptual features and measures considered here includes:

[Table 3. Semantic features and measures used in the style analysis]

Each of the measures listed in Table 3 denotes the % of total words that are found in a dictionary of words associated with that particular concept. For example, in LIWC there is a list of words that denote "positive emotions." Thus, if a song has a total of 500 words, of which 5 are found in the dictionary of positive emotion words, then for this song the measure would equal 1% (= (5/500)*100). (A minimal code sketch of this computation follows the two notes below.) In reviewing the results for these measures it is important to note:

(1) Sub-features are not exhaustive – The top measure for each feature is an "overall" measure for that feature, while the subsequent measures associated with that feature represent key sub-features. The sub-features are not exhaustive, so the percentages for these do not sum to the overall percentage. For instance, the % Affective measure is the overall measure for the Affective feature. The % Positive and % Negative emotions are key sub-features of the Affective feature. You can't add the % Positive to the % Negative to get the % Affective.

(2) Features (and sub-features) are not mutually exclusive – Words can, and often do, appear in the dictionaries for more than one feature or sub-feature. For instance, the word "love" is in the dictionaries for the Affective and Drive features (as well as the dictionaries for Positive emotions and Affiliation). So, again, you can't add the overall percentages for the various features (like Affective, Social, … Drives) to come up with some total percentage for the Semantic Dimension or to come up with the total Word Count.
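To make the computation concrete, here is a minimal sketch of how a dictionary-based measure of this sort can be computed. The tiny word list below is an illustrative placeholder, not the actual LIWC "positive emotions" dictionary:

    # % of a song's words that appear in a concept dictionary (LIWC-style measure).
    positive_words = {"love", "like", "happy", "good", "sweet"}   # illustrative placeholder list

    def percent_in_dictionary(lyrics, dictionary):
        words = lyrics.lower().split()
        hits = sum(1 for w in words if w in dictionary)
        return 100.0 * hits / len(words)

    # A 500-word song with 5 dictionary hits scores (5/500)*100 = 1%.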

One-way ANOVA with Linear and Quadratic Contrasts

Following the example set by Petrie et al. (2008) in their analysis of changes in the Beatles' lyrics over time, simple one-way analysis of variance (ANOVA) was used to determine whether there were any statistically significant changes (p values <= .01) in the various features over the six time periods delineated in Table 1. For those measures where the overall ANOVA was significant, polynomial contrasts were computed to determine whether there were any significant linear, quadratic, or cubic effects. In general, the nature of the historical trend is best described by the highest power or order of the significant effects. So, for instance, if the only significant effect is the linear, then the measure under consideration is either trending upward or downward in a linear fashion. If the only significant effect is the quadratic, then the trend is curvilinear with a single inflection point (either convex or concave). Finally, if the only significant effect is cubic, then the trend is curvilinear with multiple inflection points (like a wave pattern). Of course, often two or three of the effects are significant. In these cases the interpretation is a little more complex, but it is usually (but not always) best to describe the trend in terms of the effect with the highest power (e.g. if the linear and quadratic effects are both significant, then the curvilinear trend usually prevails).

While Python was used to perform most of the processing done in Parts 1 and 2, in this analysis I’ve relied primarily on R.  Even though Python has a variety of statistical packages that can be used for doing one-way ANOVA, it’s easier to analyze the polynomial contrasts with R. The results from the one-way ANOVA of all the stylistic features and measures are shown in the attached spreadsheet of ANOVA results and PDF of trend graphs.
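The polynomial contrasts themselves were run in R, but for readers who prefer to stay in Python, a roughly equivalent sketch can be put together with statsmodels and patsy's orthogonal polynomial contrast coding. The file name and column names below are assumptions for illustration, not the actual data set used in this analysis:

    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    # One row per song: an ordered time period (P1..P6) and a stylistic measure.
    df = pd.read_csv("rap_style_measures.csv")          # hypothetical file
    df["period"] = pd.Categorical(df["period"],
                                  categories=["P1", "P2", "P3", "P4", "P5", "P6"],
                                  ordered=True)

    # Overall one-way ANOVA on, say, lexical diversity.
    fit = smf.ols("lex_div ~ C(period)", data=df).fit()
    print(sm.stats.anova_lm(fit, typ=1))

    # Orthogonal polynomial contrasts: the summary reports separate tests for the
    # linear, quadratic, and cubic components of the trend across the six periods.
    poly_fit = smf.ols("lex_div ~ C(period, Poly)", data=df).fit()
    print(poly_fit.summary())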

The spreadsheet provides: (1) the overall means (averages) for each measure; (2) the means for each measure for the six time periods; (3) the overall significance of the F-value from the one-way ANOVA, as well as the significance of each of the effects in the polynomial contrasts (linear, quadratic and cubic); and (4) for significant relationships, a summary description of the general pattern of the trend (e.g. positive convex). Additionally, where available, the spreadsheet provides an overall mean for each measure based on statistics compiled from a number of LIWC studies conducted by Pennebaker et al. (2015). These studies were based on word usage in a variety of settings including blogs, expressive writing samples, novels, natural speech, the New York Times and Twitter (in the remainder of the discussion they are collectively referred to as the LIWC Studies). Combined, they represent the "utterances of over 80K writers or speakers totaling over 231 million words." These combined LIWC means are provided as a point of reference to see how rap lyrics compare to general usage. So, for example, from the spreadsheet we can see that rap lyrics utilize fewer Past Tense verbs than the earlier LIWC studies (2.8% of all the words in the songs are past tense verbs, versus 4.6% of all the words in the LIWC studies), while the use of Power words in rap lyrics is surprisingly close to the general LIWC usage (2.4% versus 2.5%, respectively).

The PDF file contains a series of line graphs plotting the trend in means and associated standard errors for each of the measures by time period. For each line graph a "dot" represents the mean for a given time period and the "whiskers" represent 1 standard error above and below that mean. The plots for each measure also contain a blue line representing the mean of the measure for all rap lyrics and, where available, a red line representing the mean for that measure in the earlier LIWC Studies. In addition, the overall significance of the F-value from the one-way ANOVA and the significance of each of the effects in the polynomial contrasts are noted at the top of each plot along with a summary description of the overall trend. For instance, the plot in Figure 1 below shows the trend in the mean usage of first person pronouns (I and we) in the 1206 rap lyrics from 1980 to 2015. From this figure we can see that the overall mean percentage of first person pronouns (PerFirst) in the sample was just under 9% (blue line), while the mean for the LIWC Studies in general was under 6% (red line). The percentage declined in the earlier years from P1 to P3 (1980-1999) but has been on the rise since then. Looking at the standard errors from one period to the next, there was obviously more variability from song to song in P1 (se ~ 5% for 1980-89) than for the other periods (se ~ 3%), which raises some issues about homoscedasticity (which I'll ignore for the moment). Based on the one-way ANOVA, the overall F-value is significant at the .001 level (***) with both the linear and quadratic effects also significant at this level, while the cubic effect was not significant (note the line at the top "***,***,***,NS: convex w/ increasing slope"). Based on these significance levels, one could describe the overall trend as a "convex curve with an increasing slope" (which is also known as a concave upward or convex downward curve).

[Figure 1. Trend in the mean percentage of first person pronouns by time period]
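For anyone who wants to reproduce this style of plot, here is a minimal matplotlib sketch. The means and standard errors below are rough placeholder values loosely based on the description above, not the exact figures from the spreadsheet:

    import matplotlib.pyplot as plt

    periods = ["P1", "P2", "P3", "P4", "P5", "P6"]
    means = [9.0, 8.5, 8.0, 8.5, 9.5, 10.0]     # % first person pronouns (illustrative values)
    ses   = [5.0, 3.0, 3.0, 3.0, 3.0, 3.0]      # standard errors (illustrative values)

    plt.errorbar(range(len(periods)), means, yerr=ses, fmt="o-", capsize=4)
    plt.axhline(9.0, color="blue", label="overall rap mean")   # ~9% in the sample
    plt.axhline(6.0, color="red", label="LIWC Studies mean")   # ~6% in the LIWC Studies
    plt.xticks(range(len(periods)), periods)
    plt.ylabel("% first person pronouns")
    plt.legend()
    plt.show()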

“A convex curve with an increasing slope” is only one of a number of trend patterns that can occur. Figure 2 provides a pictorial summary of the types of trends that are possible in this analysis.

[Figure 2. Possible trend patterns in this analysis]

Detailed ANOVA Results

  1. Vocabulary – A number of lyrical studies have indicated that for most popular genres there has been an increase in the average Word Count per song. This is also the pattern in this sample, although the increase was not uniform across the time periods. In P1 the average was around 450 words per song, in P2-P5 around 560, and in P6 (2010+) around 680. With this increase in the total Word Count, there was also a shift in the ratio of the number of unique words to the total number of words (i.e. Lexical Diversity). From P1 to P3 the Lexical Diversity went from 42% to 46%, but it then steadily declined to an average of 34% in P6. In terms of the use of uncommon words (i.e. % not in Wiktionary) or slang words (i.e. % in Urban Dictionary but not in Wiktionary), there was no significant variation across time.
  2. Length – The changes in word count are almost a direct function of the changes in the Number of Lines per song and the Number of (lower case) Words per Line.  In this case, the average Number of Lines rose from 63 in P1, to 72 for P2-P5, and then to 84 in P6.  Not only did the number of lines increase on average but so did the Number of Words per Line from approximately 7 to 8.  Part of the increase in the Number of Lines came from repetition. While the average Number of Unique Lines shifted from 50 in P1 to 65 in P3 and back down to about 60 in P6, the percentage of Unique Lines (not included in the spreadsheet) went from 82 to 90 to 70 in those same periods (so the percentage of non-unique or repetitive lines went from 18 to 10 to 30).  So, on average one of the reasons songs are getting longer is because there is more repetition (and thus less lexical diversity).
  3. Rhyming and Echoisms – There are clearly differences among rap artists with respect to their rhyming structures. While automated detection of rhyming in rap music is very complex (as the 2010 article by Hirjee and Brown clearly demonstrates), this analysis only looks at assonance (i.e. "repetition of vowel sounds to create internal rhyming within phrases or sentences") and relies on an algorithm developed by Eric Malmi (2015) which computes the length of the Longest Matching vowel sequence that ends with one of the 15 previous words, along with the Average Rhyme length, which averages the lengths of the longest matching vowel sequences of all words. Unfortunately, there were no significant trends in the average rhyme length, nor were there any systematic changes in the longest sequences, even though they seem to vary a little bit from one time period to the next. The same is true for echoisms, which measure the % of words with repetitive letters (e.g. "fuuuture") or repetitive words (e.g. "love, love, love"). Again, there are no significant differences across time.
  4. Temporal – This feature measures the degree to which a song is focused on the past, the present and the future. It is calculated by determining the percentages of all words that are past, present, or future tense verbs, which in this case were approximately 3%, 14%, and 2%, respectively. This pattern also holds for the other LIWC Studies, although the differences are not as great (with the percentages being 5%, 10% and 1%, respectively). In terms of changes across time, the past tense verbs declined slightly from 4% in P1 to 2.5% in P2, before edging back up to 3.0% in P6. Similarly, the present tense verbs declined from 15% in P1 to 13% in P3 and then rose to 15% again in P6. In contrast, there were no significant changes in the use of future tense verbs. As a consequence, the ratio of the past tense to present and future tense followed the same curvilinear convex pattern as the past and present tense verbs alone.
  5. Egocentric – The ratio of the percentage of words that are first person pronouns in comparison to the percentage that are second or third person pronouns provides an indicator of how egocentric a song is. In this sample, on average the percentage  of first person pronouns was about 9%, while the average percentage  of second person and third person combined was around 6%. For the earlier LIWC Studies, the percentages were close to 6% and 4%, respectively.  Obviously, the lyrics for the sample of rap songs were generally more egocentric. From the standpoint of trends, the percentage of first person pronouns followed a curvilinear path going from 9% in P1 to 8% in P3 to  10% in P6, while the combined percentage of second and third person pronouns oscillated around their overall mean. The result is that the ratio of first to second and third personal pronouns also followed a curvilinear path from 2.4 (P1) to 1.7 (P4) to 2.6 (P6). These data suggest that the focus on oneself versus others was on the downswing until 2005 but has been on the upswing for the last 10 years.
  6. Semantics – The remainder of the data deals with the "topics that are mentioned and the images they invoke." The assumption is "that the words people use convey psychological information over and above their literal meaning and independent of their semantic context." As noted, these results were produced by applying Linguistic Inquiry and Word Count (LIWC) analysis to the lyrics. Before we consider the detailed results, consider for a moment the overall mean percentages for the various dimensions and features shown in Figures 3 and 4.

[Figure 3. Overall mean percentages for the semantic dimensions: rap lyrics vs. the LIWC Studies]

As Figure 3 shows, while there are differences in the percentages for a given dimension between Rap Lyrics and the other LIWC studies, the differences are usually not that large and the relative importance of the various categories is basically the same for the two. For example, while the percentage of Social words was higher in the Rap Lyrics versus the other LIWC Studies, the percentages for both were high relative to the other categories. In fact if you correlate the means of the dimensions for the two groups (Rap vs. Other LIWC) the correlation is .90.

[Figure 4. Mean percentages for the semantic features: rap lyrics vs. the LIWC Studies]

As shown in Figure 4, the same pattern holds when you look at the various features within these dimensions. For example, while there is a major difference between the percentage of Positive words for Rap Lyrics and the other LIWC Studies, for both it was still the highest percentage among all the features. Again, there is a strong correlation between the percentages for the two groups (correlation is ~.8).

With these similarities as a backdrop, here are the details for the various (sub)dimensions and features under the Semantic umbrella.

  1. Affective Words –  Affective words deal with moods, feelings and emotions.  In our sample of rap lyrics about 5% of the words match the words in the LIWC Affective list or dictionary.  This holds true regardless of the year.  Regardless of this count, there is variation in the percentage of words that express positive and negative emotions. Overall, the percent of positive words (e.g. love, like) was around 2.5%.  This was lower than the average percentage of positive words (5.6%) for the LIWC Studies in general.  Additionally, for rap lyrics the percentage of positive words dropped in a curvilinear fashion from 3.6% (P1) to 2.4% (P6).   In contrast, the percentage of negative words (e.g. hate, kill, hurt, sad) in rap lyrics increased slightly over time from 1.5% (P1) to 2.6% (P3) where it remained.
  2. Social Words – Social words provide details about social relations and social processes (e.g. who has status and who doesn’t, or who is dominating a relationship). Overall, a substantial percentage (13%) of the words in rap lyrics fall in this category, as opposed to the percentage (10%) for the LIWC Studies in general. Originally, in P1 the percentage of social words was 15%, but it dropped in P2 to 12%, and oscillated above and below 13% after that. In this category, fewer than 1% of the words (combined) refer to family and friends which was also true for other LIWC studies. Overall, about 1.5% of the words referred to females, while only 1% referred to males. These figures are reversed for other LIWC Studies. Additionally, the percentage of words referring to males  steadily declined over the six time periods.
  3. Cognitive Words – Cognitive words (e.g. think, know, because) provide clues as “to how people process information and interpret to make sense of their environment.” Approximately 9% of the words in our sample of rap lyrics fell into this category which is slightly lower than it was for other LIWC Studies. For most of the subcategories in this dimension (including insight, causation, discrepancy, tentative, certainty, and differentiation words), the percentage of words was around 1.5% and, for the most part, there was little significant change in the percentage from P1 thru P6.
  4. Perceptual Words – Perceptual words deal with the acts of seeing, hearing and feeling (i.e. touching). Overall, 3.5% of the words were in this category which was slightly higher than it was for the LIWC Studies (2.7%).  The percentage in this category declined somewhat from 4.0% in P1 to 3.3%  in P3 but was back to 4.3% in P6. Among the subcategories, the percentage of See words rose steadily from 1% in P1 to 2% in P6, the percentage of Hear words declined slightly from 1.5% in P1 to 1% in P6, and the percentage of Feel words remained pretty much the same (around 1%).
  5. Biological Words – Biological words (e.g. blood, pain, intercourse) deal with various biological processes, actions, and parts of the body. Again, the percentage of words in this category was relatively small (around 3%). This was higher than it was for the other LIWC Studies (at 2%) and remained relatively unchanged from P1 to P6. The subcategories of this dimension include Body, Health, Sexual and Ingestion. The overall percentage of Body words is around 1.6% (the highest of the subcategories). The percent in this category increased somewhat from P1 to P4 and P5 but is now close to the overall mean. Surprisingly, the percent of Sexual words is very small (around .4%). However, the percent has risen from close to 0% in P1 to .6% in P6.
  6. Drive Words – Drive words refer to the (psychological) factors driving behavior including Affiliation, Achievement, Power, Reward and Risk. On average, close to 8% of the words fell into this category.  This percentage was only slightly higher than it was for the other LIWC Studies (7%) and was relatively unchanged over time. Among this category, the key subcategories are Affiliation, Power, and Reward. These subcategories are also more prominent in the other LIWC Studies. In rap lyrics, references to Affiliation (e.g. friends, enemy, ally) were somewhat higher in P1 (3%) then leveled out to 2% in the remainder of the time periods, references to Power peaked at around 3% in P3, and references to Reward increased slightly from P1 (1.7%) to P5 (2.5%) but declined in P6 to the overall average of 2%. Again, it is somewhat surprising that the percentage of words dealing with Risk (e.g. danger) was so low (overall at .3%).
  7. Other Words – Besides the dimensions and features discussed above, LIWC has been applied to a range of other words. Three classes of words that are particularly pertinent to rap lyrics are Death, Money, and Swear words. First, on average the percent of words pertaining to Death in Rap Lyrics (.2%) was about the same as in other LIWC Studies (.16%). Additionally, the percentage in the sample Rap Lyrics did not change significantly over the years. Second, the overall percentage of words dealing with Money was also close to the overall percent in the other LIWC studies (.6% versus .7%). Again, the percent in Rap Lyrics did not change significantly over the years. Finally, and not unexpectedly, the percentage of Swear words in Rap Lyrics (1.6%) was greater than in the other LIWC Studies (.2%). This percentage increased in a curvilinear (wave-like) fashion from .1% in P1 to 2% in P6. Given the general impression of Rap Lyrics among the public, the fact that the percent of Swear words was on average only around 2% may seem surprising. However, there was quite a bit of variability from song to song. While the maximum percentage for any song in the sample was 14%, over 80% of the songs had 2% or less. Obviously, it's the songs at the upper range that garner the attention.

Summary of ANOVA trends

Reading a detailed statistical analysis of this sort is much like getting the results from a comprehensive series of blood panel tests – there are a myriad of details which can make it difficult to see various relationships or to paint an overall picture.  A general summary will be provided after the results of the “topic” analysis have been discussed. For the moment, Table 3 provides a summary of the significant trends and the patterns found in the stylistic analysis. The summary is organized by the possible trends displayed in Figure 2.

[Table: Summary of the significant ANOVA trends, organized by the trend patterns shown in Figure 2]

Trends in Content: Topic Modeling

Content(s) refers to the subject, topics, or themes of a document or one of its parts. Obviously, for songs the content rests on the topics or themes reflected in the words of the lyrics. Topic modeling is a collection of (text mining) algorithms designed to ferret out the themes or topics that occur across a corpus of documents. A recent addition to this collection of tools is Non-negative Matrix Factorization (NMF). It has been successfully applied in a range of areas including bioinformatics, image processing, audio processing, as well as text analysis (e.g. sentiment analysis of restaurant reviews). Technically, NMF is "an unsupervised family of algorithms that simultaneously performs dimension reduction and clustering" in order to reveal the "latent" features underlying the data (see Kuang, 2013 and Lee and Seung, 1999). In this case the data of interest are found in the song-stem tf-idf matrix described in Figure 10 of Part 2 of this analysis.

To understand how NMF works and what the terms "dimension reduction" and "clustering" mean, let's reconsider the matrix of tf-idf values discussed in Part 2. However, this time, instead of focusing on the song-stem matrix for the individual songs, we're going to first combine the stem counts for those songs that were released in the same time periods (according to Table 2 in Part 2), and then calculate the tf-idf values for each stem in each of the groups of songs by year (the process is depicted in Figure 5). In essence we are treating each group of songs by year as a single song. In this way we can more easily explore the trends in content by year. For purposes of discussion we'll label this matrix S.

[Figure 5. Converting the song-stem matrix into the songByYear-stem matrix S]
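A minimal sketch of this aggregation step, assuming the per-song stem counts are sitting in a pandas DataFrame with one column per stem plus a 'year' column (the file and column names are illustrative, not the actual variables used here):

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfTransformer

    counts = pd.read_csv("song_stem_counts.csv")   # hypothetical file of raw stem counts per song
    by_year = counts.groupby("year").sum()         # combine the counts of all songs released in the same year
    years = by_year.index.tolist()                 # the 22 year labels (rows of S)

    # Re-compute tf-idf on the aggregated counts, treating each year as a single "song".
    S = TfidfTransformer().fit_transform(by_year.values).toarray()   # 22 x 500 matrix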

This matrix is 22 (years) by 500 (stems). In this analysis the stems are sorted in alphabetical order. The matrix is much too large to reproduce in tabular form (at least in this format). However, the following heatmap provides a general overview of the values.  Each row of the map contains a color coded rendition of the normalized stem frequencies (TFIDFs) for a given year. The colors reflect the relative magnitude of a given TFIDF with the darker colors representing more important terms (see index at the top).

[Figure: Heatmap of the TF-IDF values in the songByYear-stem matrix S (rows = years, columns = stems)]

The heatmap indicates that:

  • For any given year (row), most of the TFIDF values are toward the lower end of the spectrum (.05 or less). These stems are of lesser importance.
  • For any given year (row), approximately 10% of the stems (~50) have values toward the higher end of the spectrum (.20 or more).  These are the stems that are key to understanding the content for that particular year.
  • Looking down the years (across the rows), there are a variety of instances (denoted by the darker lines) where the TFIDF values appear to remain relatively strong from year to year. These instances provide visual clues to the ebbs and flows of content from year to year.
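A heatmap like the one above can be generated directly from the matrix S; a minimal seaborn sketch (continuing from the aggregation snippet above):

    import seaborn as sns
    import matplotlib.pyplot as plt

    plt.figure(figsize=(12, 6))
    sns.heatmap(S, cmap="YlOrBr", yticklabels=years, xticklabels=False,
                cbar_kws={"label": "tf-idf"})      # darker cells = more important stems
    plt.xlabel("stems (alphabetical order)")
    plt.ylabel("year")
    plt.show()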

Of course, the problem with the display is that it's hard to produce a succinct summary of the patterns and changes in content from year to year. This is true for both the patterns among the stems and those among the songs. This is where topic modeling and NMF come into play.

Factorization: Simplifying the Representation

Mathematical factorization is the decomposition of an object (e.g. number, polynomial, or matrix) into a product of other objects which when multiplied together give the original. For example, the polynomial x² − 4 can be decomposed into (x − 2)(x + 2). In all cases, a product of simpler objects is obtained… The aim … is usually to reduce something to "basic building blocks" (e.g. polynomials to irreducible polynomials).

This is the basic idea underlying NMF. As the name implies, Non-negative Matrix Factorization (NMF) starts with a non-negative matrix (in this case the songByYear-stem matrix S) and factorizes it into two smaller non-negative matrices W and T. In this instance, W is an (N x K) matrix where N is the number of years and K is the number of latent topics discovered by the factorization process. Each entry in the matrix represents the importance of a particular topic for that particular year. In contrast, T is a (K x M) matrix where K is the number of latent topics and M is the number of original stems. Here, the entries represent the importance of a particular stem for a particular topic. So, in NMF the idea is to (iteratively) derive the two sets of weights so that when W and T are multiplied together the product approximates the original matrix S (i.e. S[N x M] ≈ W[N x K] * T[K x M]).

In carrying out this factorization process, the first goal is dimension reduction, which revolves around the analysis of the weights in the "T" matrix. The aim of this analysis is to surface a reduced collection of K topics, each consisting of a small set of highly weighted, distinctive stems that differ from topic to topic and represent a relatively coherent theme (not just a seemingly random collection of words). The second goal is clustering, which revolves around the analysis of the weights in the "W" matrix. The aim of this analysis is to determine the similarities and differences in topics among the various time periods in order to surface trends in the changes in topics across the years.

Calculating and Analyzing Topics

There are a series of papers that detail and illustrate the specific steps and calculations used in performing NMF topic modeling (Blei, 2012; Das, 2015; Green, 2015; and Riddell, 2015A). Virtually all of these papers begin the process with a "raw" corpus, detail the NLP steps needed to turn the corpus into a document-term matrix of frequencies, then describe the steps needed to turn the doc-term matrix of frequencies into a matrix of TFIDF values, and finally submit this matrix of TFIDF values to the NMF decomposition process to yield the W and T matrices. With these latter matrices in hand, the papers finally demonstrate various procedures for analyzing the results of the decomposition.

To simplify the steps (and to avoid the necessity of having a comprehensive understanding of the underlying math), virtually all of these papers utilize Python programs to carry out the computations and to assist in the analysis. In the case of our sample of rap songs, it was very easy to perform the modeling computations. The reason is that I already had the matrix of TFIDF values in hand (i.e. the matrix S discussed earlier), which meant that, after importing scikit-learn's decomposition module and setting the number of topics (ncomp), I only needed three lines of Python code to handle the computations:

  1. model = decomposition.NMF(init="nndsvd", n_components=ncomp, max_iter=500)
  2. W = model.fit_transform(S)  # W: the (years x topics) weight matrix
  3. T = model.components_  # T: the (topics x stems) weight matrix

The only real decision that needs to be made is to pre-specify the number of topics (ncomp) to be produced by the process. After a bit of experimentation, I settled on 5 topics because I found little practical difference in the results whether the number was 5, 10, 15, or 20.

Setting ncomp = 5 resulted in a T matrix with 5 rows (topics) by 500 columns (stems). As noted, each cell(i,j) provides a weight for the stem relative to the topic, so by "ordering the weights in a given row and selecting the top-ranked stems, we can produce a description of the corresponding topic." The top 15 stems for each of the topics are provided in Figure 7 (the sensitive words have been masked to protect the innocent):

[Figure 7. Top 15 stems for each of the five topics]
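The ordering-and-selection step described above takes only a few lines; a sketch continuing from the NMF snippet, assuming 'stems' is the list of 500 stem strings in the column order of S:

    import numpy as np

    for k, row in enumerate(T):                    # one row of T per topic
        top = np.argsort(row)[::-1][:15]           # indices of the 15 most heavily weighted stems
        print("Topic%d: %s" % (k, ", ".join(stems[i] for i in top)))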

Similarly, the top 50 stems for the same 5 topics are shown in a series of wordclouds displayed in Figure 8. The size and color of the words in each of the clouds reflects their relative weights for the associated topic. One of the first things you might notice about the lists and the clouds is that there seems to be overlap from one topic to the next. However, looks can be deceiving. Actually, in this case when you calculate something like the Jaccard Index of Similarity between any two sets of stems, the overlap is relatively low (i.e. most of the pairwise similarities are around .20 out of a max of 1.0). Except for Topic 0 (which is stereotypical Gangsta), another thing you might notice is that even though there are variations in the top ranked terms, it's fairly difficult to come up with descriptive terms or titles for each of the individual topics. To do this requires a deeper understanding of the trends in the topics across time and a look at some examples of the individual songs within those time periods.

[Figure 8. Wordclouds of the top 50 stems for each topic]
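The pairwise overlap between topics can be checked with a simple Jaccard index on the top-50 stem sets; a minimal sketch, again continuing from the snippets above:

    import numpy as np

    def top_stems(row, n=50):
        return {stems[i] for i in np.argsort(row)[::-1][:n]}

    def jaccard(a, b):
        return len(a & b) / len(a | b)             # intersection over union

    topic_sets = [top_stems(row) for row in T]
    for i in range(len(topic_sets)):
        for j in range(i + 1, len(topic_sets)):
            print("Topic%d vs Topic%d: %.2f" % (i, j, jaccard(topic_sets[i], topic_sets[j])))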

Trends in Topics

The W matrix provides the key to understanding the associations between the various years and the derived topics.  In this case the weights in the individual cells of W represent the importance of a particular topic  for a particular year (of songs). By comparing successive years we can gain an understanding of the trends in topics over time. Before comparisons are made, the weights in this matrix are usually normalized so they sum to a value of 1 for each row (year) in the matrix. The results of this normalization are displayed in Table 6.
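The normalization itself is a one-liner on the W matrix returned by fit_transform above:

    W_norm = W / W.sum(axis=1, keepdims=True)      # each row (year) now sums to 1.0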

[Table 6. Normalized topic weights (the W matrix) by year]


As Table 6 shows, for example, Topic2 had a weight of 1.0 in y91 (1991), which means that the remaining topics (0, 1, 3 and 4) all had weights of 0.0, since the sum of all the weights for any year adds to 1.0. In y92 things were a bit different. Topic2 had a weight of ~.8 in y92 (1992), followed by Topic3, which had a weight of ~.16. The weights for the other topics (0, 1, and 4) were negligible.

One way to visually represent the data in Table 6 is with a series of radar charts. A radar chart is a graph or plot for displaying multivariate data. A radar chart consists of a series of axes equal in number to the number of variables. The axes emanate from the same origin, they have equal length (i.e. they are normalized), and they are separated by the same angles (like the spokes of a wheel). Typically, a point or dot is placed on each axis to represent the value on the associated variable. Finally, all the dots on adjacent axes are connected by lines, resulting in the formation of a closed polygon. Figure 9 shows the radar chart for a single year (y95), while Figure 10 shows the radar charts for all 22 time periods.

[Figure 9. Radar chart of the topic weights for 1995]
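A single radar chart like Figure 9 can be drawn with matplotlib's polar projection. The weights below are illustrative values that roughly match the 1995 pattern described in the text, not the exact entries of Table 6:

    import numpy as np
    import matplotlib.pyplot as plt

    labels  = ["Topic0", "Topic1", "Topic2", "Topic3", "Topic4"]
    weights = [0.5, 0.0, 0.0, 0.5, 0.0]            # illustrative normalized weights for y95

    angles = np.linspace(0, 2 * np.pi, len(weights), endpoint=False).tolist()
    angles += angles[:1]                           # repeat the first angle to close the polygon
    values = weights + weights[:1]

    ax = plt.subplot(polar=True)
    ax.plot(angles, values, "o-")
    ax.fill(angles, values, alpha=0.25)
    ax.set_xticks(angles[:-1])
    ax.set_xticklabels(labels)
    ax.set_ylim(0, 1)
    ax.set_title("y95")
    plt.show()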


[Figure 10. Radar charts of the topic weights for all 22 years]

The picture that begins to emerge from Table 6 and Figure 10 is that most of the years are dominated by one or two topics (just like 1995) and that the patterns seem to run in streaks.  Take, for example, the years from 1995 to 1998. They are almost clones of one another, i.e. for each of these years the weights for Topic0 and Topic3 are both close to .5 while the weights of the remaining 3 topics are close to or equal to 0.

The various streaks are also highlighted in Figure 11, which contains a matrix of correlations between the topic weights for any two years in our sample (e.g. r = 0.9 between y89 and y90). In most cases, the correlations between adjacent time periods are large (between .7 and 1.0), although periodically they aren't (e.g. r = 0.4 between y92 and y93). Blocks of highly correlated adjacent cells signal streaks where the same topics have dominated from year to year. In Figure 11, the box to the lower left summarizes the streaks based on these blocks, as well as the dominant topics in those years.
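Because each row of the normalized W matrix is a year's topic profile, the year-to-year correlations behind Figure 11 can be computed directly; a sketch continuing from the normalization snippet above:

    import numpy as np
    import seaborn as sns
    import matplotlib.pyplot as plt

    corr = np.corrcoef(W_norm)                     # 22 x 22 correlations between the topic profiles of years
    sns.heatmap(corr, xticklabels=years, yticklabels=years, cmap="RdBu_r", vmin=-1, vmax=1)
    plt.show()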

[Figure 11. Correlation matrix of the topic weights between years, with the streaks summarized in the box at the lower left]

There are other analyses that can be run to refine the notion of streaks (e.g. variants of hierarchical clustering). In looking at the topic modeling results discussed here (as well as running a number of other analyses I won't be discussing here), it appears that the evolution of the topics and themes of rap lyrics has followed the same general pattern as the evolution of music audio features found by Mauch et al. (2015). More specifically, their analysis found that, "while musical evolution was ceaseless, there were periods of relative stasis punctuated by periods of rapid change." For them there were three major revolutions within the continuous evolution – 1964 (Motown and the British Invasion), 1983 (New Wave, Disco and Hard Rock) and 1991 (the rise of Rap). In the case of the evolution of rap lyrics, the shift in lyrical topics seems to have followed a similar pattern – continuing evolution punctuated by abrupt changes brought about by the rapid appearance of key sub-genres, i.e. the shifts from (1) Old School Funk and Soul, to (2) Golden Age East Coast, New School, Rap Rock, and Pop Rap, to (3) Gangsta G-Funk, West Coast and East Coast Hardcore, and back to the future with (4) mainstream Pop Rap, Dirty South, and Crunk, and now to a more mainstream sound.

Analytical Summary

Sometime after starting this analysis (which has been going on forever), I began to think that it would suffer the same fate as Thompson’s earlier analysis of The Evolution of Pop Lyrics (2015).  In Thompson’s words:

Sadly the results for predicting the musical genre purely based on the lyrics aren’t good. It was only correct for 16% of results… The heat chart below shows a cross tab of the main lyric topic (LT) crossed with the musical genre (MG). It’s a pretty even spread. Therefore it wouldn’t matter what classification model was used these features aren’t going to predict genres… I may return to this dataset and try some different language processing techniques to try and create some different features and hopefully improve on the model accuracy…

In other words I had the sinking feeling that when I was finally finished, the analysis would only be good for the Journal of Non-Significant Findings or the Journal of Negative Results. Those are actually real journals, but unfortunately not in my fields of endeavor. So, not only would the results be useless, but they couldn’t even be published in a journal about useless results.

Nearing completion, I'm a little more optimistic than that – for a couple of reasons. First, many of the results are statistically significant (at very high levels of probability). There are a number of stylistic properties in rap lyrics that vary significantly across time – like lexical diversity, which has been on the decline (a concept that was the genesis of this analysis in the first place). Similarly, the data indicate that lyrical content also varies in significant ways across time. Second, unlike Thompson's analysis, the results from this analysis seem to mirror the patterns found in Mauch et al.'s (2015) study of The Evolution of Popular Music: USA 1960-2010, even though the subject matter differs. I already discussed their notion of "continuous evolution with abrupt changes" and the usefulness of this idea for understanding the results from the NMF topic modeling analysis. In the same vein, they also noted that "The frequency of topics in the Hot 100 varied greatly: some topics became rarer, others became more common, yet others cycled" (see Figure 12, which is reproduced from their study). By topics, they were referring to a series of harmonic and timbral features in music. Essentially, the same thing occurred with the various stylistic properties examined in this study – some went up, some went down, some cycled and some stayed the same. Finally, as they noted in their closing, "we have not addressed the cause of the dynamics that we detect." Like biological evolution, a causal account of musical evolution must "ultimately account for how musicians imitate, and modify, existing music when creating new songs, that is, an account of the mode of inheritance, the production of musical novelty and its constraints." The same can be said of this analysis.

Hindsight: Round up the Usual Suspects

Any results I’ve reported have to be tempered by a few caveats:

  1. Sample — There are a couple of problems here. First, the songs in this study come from a list generated from the Hot Rap Songs chart on Billboard. It's a source used by many of the published studies dealing with lyrics of all sorts. Obviously, these are the songs that have enjoyed commercial success and, as a consequence, are probably biased in a number of ways that can potentially impact the style and content of the lyrics. For instance, one known bias (that I recently discovered when I came across an article on Quantifying Lexical Novelty in Song Lyrics (Ellis et al. 2015)) is that lexical novelty is significantly lower for the songs and artists on the Billboard Top 100. Obviously, this has a number of implications for both the general content and style. Second, the sample size of 1206 songs was probably too small for a really detailed study of the trends in rap over a 35 year period. This was especially true for the songs of the 80s, as well as the songs from 2010 to the present (both around n = 40). Given this smaller sample size, it is very easy for the songs of a handful of artists to skew the results, especially those dealing with the analysis of lyrical style.
  2. Independent Variable – In this study the focal point was on time, which seemed natural since I was interested in the evolution of rap. However, as noted above, the key to this analysis rests with artists and writers.  So a better starting point might be to begin with a list of artists who are representative of various sub-genres, locations and eras and build a sample of songs from the albums of these artists. In this way, trends could be viewed from a multi-dimensional rather than a unidimensional perspective.
  3. Data Preparation – Among all the musical lyric genres, rap may be the easiest to recognize with automated NLP but one of the hardest, if not the hardest, to analyze with NLP. It is rife with slang, (intentional) misspellings, phonetic spellings, abbreviations, grammatical issues, and the like. This makes it extremely difficult and very time consuming to convert published lyrics into representative tokens of various sorts (e.g. into stems). This is true for both the analysis of style and content. Surprisingly, when LIWC was applied to the sample of 1206 rap song lyrics in their original lowercase form, 85% of the words were found to be in one or more of the LIWC dictionaries, which means that 15% weren't in the dictionaries. 15% may seem like a lot, but it is essentially the same percentage of words not found in the dictionaries for the earlier LIWC studies. In terms of analyzing content, quite a bit of work went into substituting one version of a word for all its different forms before any NLP was done. For example, the word m*f*ker had innumerable spellings but was translated into its correct spelling before processing. Of course, the problem is that it is virtually impossible to catch all of these ahead of time, so it takes a number of iterations to handle the corrections.
  4. Dependent Variables – In this analysis we used very crude measures for handling rhymes. There are a number of papers devoted solely to this subject. In subsequent analyses, it would be preferable to investigate this particular aspect in more detail. Similarly, this analysis only focused on unigrams and, like a number of earlier studies, completely ignored bigrams and trigrams. The notable exception was Fell and Sporleder's study of Lyric-Based Analysis and Classification of Music (referenced in Part 2 of the analysis), which found that n-grams (n <= 3) were the most important factor in classification tasks over and above a range of other stylistic and content variables. The use of bigrams or trigrams might assist with another major caveat – interpreting the topics and themes.
  5. Interpreting Topics – NMF is a type of unsupervised machine learning technique. That is, it "infers a function to describe hidden structure from unlabeled data." Basically, it cranks the numbers and assigns a weight to a generic topic label (like the weight for Topic0). In contrast, with supervised learning the topics (and their meanings) are known ahead of time. The sample on which the model is built is divided into various training and testing sets. The goal is to use the training sets with the learning technique to arrive at an algorithm which, when used with the testing sets, will correctly classify or predict the known topic for a given object. As noted, a major issue with unsupervised techniques like NMF is that once we've arrived at the weights for the various topics it is usually very difficult to arrive at a meaningful label (unless you know some of the possibilities a priori).
  6. Other Modeling Techniques – There are a number of other techniques that can be used for topic modeling besides NMF. I chose it primarily because it is easy to understand the meaning of the decomposed factors (i.e. the matrices W and T). Latent semantic analysis (LSA) and Latent Dirichlet allocation (LDA) are two other unsupervised techniques I've used in the past for other topic modeling projects. It is possible that one of these other techniques might be preferable (that's work for another time).
  7. Visualizing the Results – There are a variety of static visualizations that can be used with this sort of trend and topic analysis.  This analysis only employed a few, including standard bar charts, heatmaps, word clouds, and radar charts.  Although they weren't used here, I also created a variety of dendrograms, other heatmap types, stacked area charts, streamgraphs, themerivers, and dendrogram/heatmap combinations. They were constructed with a combination of R, Python and Excel.  For those who are interested, take a look at Visualizing Topic Models at a site devoted to Text Analysis with Topic Models for the Humanities and Social Sciences (Riddell 2015B).  These programs are written in Python. If you prefer R, then the Tutorial on Area Charts at FlowingData.com is worthwhile.  This is Nathan Yau's site; he's the author of Visualize This and Data Points. It costs a bit of money to access the tutorials, but it's worth it if you are a frequent user of R.
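To make item 6 concrete, here is a minimal NMF topic-modeling sketch along the lines of the scikit-learn approach referenced above (Greene 2015). The toy corpus and every variable name are illustrative only; the real input is the weighted song-stem matrix described in Part 2.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

# A toy stand-in for the 1206 preprocessed (stemmed, stopword-free) lyrics.
songs = [
    "money cash hustl grind money",
    "love heart night danc love",
    "street block hustl money grind",
    "danc parti night love heart",
]

# Build the song-stem tf-idf matrix (rows = songs, columns = stems).
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(songs)

# Factor X into W (song-by-topic weights) and H (topic-by-stem weights).
model = NMF(n_components=2, init="nndsvda", random_state=42)
W = model.fit_transform(X)
H = model.components_

# The hard part: inspect the top stems for each generic topic (Topic0, Topic1, ...)
# and try to attach a meaningful label by hand.
terms = vectorizer.get_feature_names_out()
for k, topic in enumerate(H):
    top = topic.argsort()[::-1][:5]
    print(f"Topic{k}:", ", ".join(terms[i] for i in top))
```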

What I haven’t tried is any interactive visualizations with this data set. These are especially useful if you’re trying to develop applications that enable users to interactively browse, search and retrieve song lyrics based on the results of real-time analysis of lyric content and style. These sorts of visualizations provide one entry way into more practical applications of this sort of research.

Applications of Analysis

The Ellis (2015) article cited above begins with the intriguing claim that, "From 2004 through 2013, both U.S. and worldwide Google searches for 'lyrics' outnumbered searches for 'games', 'news', and 'weather', as computed by Google Trends." Actually, there's an update. The relative search frequencies of these four terms have shifted somewhat since 2013 (Figure 12), so that "lyrics" no longer dominates.  Yet the sentiment remains the same.  The interest in lyrics is still very high, and it continues to motivate "several explorations for translating a song's lyrics into 'queryable' features: for example, by topic, genre, or mood."

Figure 12. Google Trends: searches for "lyrics" vs. "games," "news," and "weather"

This interest in the queryable features of lyrics is really a small part of  the larger interest in the area of music information retrieval (MIR). While the research side of MIR covers a range of topics including “classic similarity retrieval, genre classification, visualization of music collections, or user interfaces for accessing (digital) audio collections” (Schedl 2014),

[It]… is foremost concerned with the extraction and inference of meaningful features from music (from the audio signal, symbolic representation or external sources such as web pages), indexing of music using these features, and the development of different search and retrieval schemes (for instance, content-based search, music recommendation systems, or user interfaces for browsing large music collections)…

Table 7, which comes from Downie (2009) and has been modified somewhat, details some of the common tasks undertaken by those with an interest in domains as varied as “digital libraries, consumer digital devices, content delivery and musical performance.”

Table 7. Common MIR tasks (adapted from Downie 2009)

MIR is really in its infancy (for an overview of the field see Downie 2009 and Schedl et al. 2014). MIR only started 10 to 15 years ago and has only one major international society (ISMIR) solely devoted to its study. Regardless, it has certainly captured the interest of, and investment from, a number of major players in the world of commercial music, including companies involved in obtaining, storing, indexing, identifying, streaming, recommending, and delivering music products and services to end consumers.

Much of the early interest and activity, both from a research and a commercial standpoint, has revolved around the audio side of musical life.  More recently, the span of interest seems to be widening to include the lyrical side.  How this might play out will be the subject of a future posting.

Resources

References

Blei, David. "Probabilistic Topic Models." April, 2012. Communications of the ACM.

Bradley, A. and A. Dubois. The Anthology of Rap. 2014. Yale University Press.

Das, Sudeep. “Finding Key Themes from Free-Text Reviews: Topic Modeling.” January, 2015. 

Downie, J. “Music Information Retrieval.” 2009.

Ellis, R. et al. “Quantifying Lexical Novelty In Song Lyrics.” Proceedings of the 16th ISMIR Conference, Malaga, Spain, October 26-30, 2015. 

Greene, D. “NMF Topic Modeling with scikit-learn.” March, 2015.

Hirjee, H. and D. Brown. “Using Automated Rhyme Detection to Characterize Rhyming Style in Rap Music.” October, 2010. Empirical Musicology Review.

Jaime, M. Hip Hop Legends: 55 Game Changing Artists. 2015. Engaged Media.

Kuang, D. et al. “Nonnegative matrix factorization for interactive topic modeling and document clustering.” 2013. 

Lee, D. and Seung, H. “Learning the parts of objects by non-negative matrix factorization.” October 1999. 

Malmi, E. "Algorithm That Counts Rap Rhymes and Scouts Mad Lines." February, 2015.

McNulty-Finn, C. "The Evolution of Rap." April, 2014. Harvard Political Review.

Pennebaker, J. et al. "LIWC Linguistic Inquiry and Word Count: LIWC2015."

Riddell, A. “Topic Modeling in Python.” 2015A. Text Analysis with Topic Models for the Humanities and Social Sciences.

Riddell, A. “Visualizing topic models.” 2015B. Text Analysis with Topic Models for the Humanities and Social Sciences.

Schedl, M. et al. "Music Information Retrieval: Recent Developments and Applications." 2014. Foundations and Trends in Information Retrieval.

Places (Virtual and Real)

International Society of Music Information Retrieval (ISMIR).

Tools and Modules

Basic Radar Chart in R – “fmsb” package.

One-way Analysis of Variance (AOV) with Contrasts – R “stats” package.

Linguistic Inquiry and Word Count (LIWC).

Line plot for Means and Standard Error – R “ggplot2” package with geom_line and geom_errorbar.

Non-Negative Matrix Factorization (NMF) Topic Modeling in Python.  Also, see Sudeep Das' article.

Word Clouds in R – “wordcloud” package.


Analyzing Rap Lyrics – Part 2: Preparing, Cleaning, and Exploring Rap Lyrics

To briefly recap, the end result of Part 1 was the creation of a corpus of lyrics based on the weekly Billboard (BB) 15 Hot Rap Songs and extracted from the ChartLyrics.com archive using their API. On closer review of the lyrics, it turned out that a number of the songs were either missing, contained major errors or omissions, or were in some language other than English. As a consequence, where possible I replaced the missing or erroneous songs with corrected versions from Genius.com, resulting in a final tally of 1206 tracks (which are available in the Dataset section of this blog).

In its current state, this corpus of lyrics is basically a collection of unstructured text that requires a series of steps to convert it into a structured format amenable to data analysis and mining. It's these steps that are the focus of this part of the analysis.

From Unstructured to Structured Data

Most data analysis and data mining techniques rest on the assumption that the data are structured. Structured data “refers to any data that resides in a fixed field within a record or file.” For example, most of the data in relational databases and spreadsheets qualify as structured. The fact that these data rest on an explicit data model and are well-formed and organized makes them amenable to numerical and analytical manipulation. In contrast, unstructured data “refers to information that either does not have a pre-defined data model or is not organized in a pre-defined manner.” It’s typically, although not exclusively, “text-heavy” which in its native form requires transformation before numerical techniques can be applied.

Simply stated, before text can be manipulated or mined it needs to be “turned into numbers” which can then be analyzed with various computational, statistical and data mining methods. As Jennifer Thompson (2012) suggests,

The purpose of Text Mining is to process unstructured (textual) information, extract meaningful numeric indices from the text, and, thus, make the information contained in the text accessible to the various … statistical and machine learning algorithms. Information can be extracted to derive summaries for the words contained in the documents or to compute summaries for the documents based on the words contained in them. Hence, you can analyze words, clusters of words used in documents, etc., or you could analyze documents and determine similarities between them or how they are related to other variables of interest in the data mining project.

Here, the word document is a catchall term that represents the individual entities making up a corpus. For a corpus of tweets it’s the single tweet, for blogs it’s the individual blog, and for our rap lyrics it’s the individual song track.

Rap Analysis Steps

In the case of text (like our corpus of rap lyrics), the transformation is accomplished with the aid of various computational linguistics and statistical natural language processing (NLP) techniques. As Figure 1 shows (at the bottom), the specific steps that are used in transforming text into numbers depend in part on the type of analysis or mining to be performed.


In prior research about the classification of song lyrics (see Appendix 1), two types of analysis have generally been performed:

  1. Content – This type of analysis focuses on the content of the lyrics, determining the similarities or differences in the specific words or sequences of words among the individual songs or groups of songs in the corpus.  As indicated in Figure 1, this is done by using natural language processing to convert the text in the corpus of lyrics into individual terms or sequences of terms (S-3) and to determine both the unique terms found in the corpus (i.e. the vocabulary) and how frequently each of the terms or sequences occurs in the individual songs or groups of songs (S-4).

    Rap Analysis Song-Stem

    Usually these steps produce a document-term matrix (S-5A) where the rows represent the individual documents or groups of documents, each column represents one of the unique terms or sequences of terms in the corpus, and each cell of the matrix — the intersection of row and column — indicates the number of times (weighted or unweighted) that the term occurs in the document (see Figure 2). In mining this corpus of lyrics, we end up mathematically comparing the row vectors of numbers in the matrix with one another – often basing the comparison on the computed distances among the vectors — in order to discover the underlying topics found within the various songs or groups of songs.

    This sort of matrix is also called a "Bag of Words" (BOW). It's the type of structure that MXM used to represent their million song dataset.  This type of structure is not without its critics. As Yang and Chen (2011) point out, lyrics differ from regular documents such as news articles in a variety of ways:
    • Lyrics are composed in a poem-like fashion with rich metaphors that make word disambiguation difficult and degrade the performance of unigram models (like BOWs).
    • Negation terms such as no or not can play important roles in lyric analysis, especially those focused on emotion and sentiment.  Simply treating them as single words or as stop words can alter the semantic meaning completely.
    • Lyrics are recurrent because of stanzas.  This recurrent nature is not modeled by BOWs since word order is disregarded.

    For this reason, many of these critics have focused on the structure and style embodied in the lyrics rather than concentrating on their content.

  2. Style – While the content of a lyric refers to "what was written or said," the style refers to the "way it was said." As Fell and Sporleder (2014) note, songwriters employ unique stylistic devices to build their lyrics, some of which can be measured automatically, and some of which are unique enough to distinguish one song or group of songs from another.  Examples of these devices include the use of non-standard words and slang, rhythmic structures, references to the past and present, and self- or other-referencing (e.g. "I" versus "you"). Essentially, this type of analysis begins with a set of predefined features and associated categorical or ordinal values, along with a dictionary (listing) of specific terms, phrases or (grammatical) structures that represent those values (i.e. the include lists for S-3C). For example, words or phrases that are slang (versus non-slang), or specific grammatical structures that represent various types of rhyming. With these features and values in hand, the task is to convert the lyrics in a song or group of songs into individual words, phrases or structures that enable automatic matching and counting of the (absolute or relative) number of times particular values within the dictionary occur (S-4).  In this type of analysis, standard statistical tests (e.g. Chi-Square, F-tests, and ANOVA) are used to compare the songs or groups of songs.

    Although there is no explicit requirement in this type of analysis to generate the equivalent of a document-term matrix (S-5B), sometimes it makes programmatic sense to do so.

    Rap Analysis Song-Props

    One option is to use a structure like the one created by Pennebaker et al. and used in an application known as the Linguistic Inquiry and Word Count (LIWC). Here, the documents are still the rows, but the columns are the values of one or more properties, and the row-column intersection represents the number of times (absolute or relative) that a particular property value occurs within that document (see Figure 3). So, in our case, the rows are the songs or groups of songs; the columns might include slang words, swear words, past-tense verbs, etc.; and the intersections would be the (absolute or relative) number of slang words, followed by the number of swear words, followed by the number of past-tense verbs, and so on.  This type of table is called a data frame in the programming worlds of R, Python, and Spark.

In practice, it’s not really a binary choice between the doc-term matrix or the doc-property data frame.  Instead, those studies that focus on style rather than content usually perform a BOW analysis in order to have a baseline against which to compare the classification capabilities of the various stylistic properties.

Converting the Corpus

For our purposes, Steps 1 and 2 in Figure 1 have already been completed since we have an "established corpus" of rap lyrics in hand.  The next step (3) involves a series of mini-steps aimed at "converting" the corpus. These mini-steps are based on standard natural language processing (NLP) practice and include:

  • Step 3A – Tokenizing:  Parsing the text to generate terms or tokens. Sophisticated analyzers can also extract various phrases and structures from the text. This is typically a first step in the conversion process regardless of the type of analysis being performed.
  • Step 3B – Normalizing:  Using some combination of filtering and conversion capabilities to discard punctuation, eliminate numbers, and convert uppercase letters to lowercase.  With the exception of eliminating numbers, analysis aimed at understanding the content usually embraces all of these steps.  In contrast, analysis focused on style typically ignores this step, with the exception of converting all the terms to lowercase.  When you make everything lowercase, the capitalized and lowercase versions of a word (e.g. "Shorty" and "shorty") are treated and counted as the same word rather than as two.
  • Step 3C – Including or Excluding Particular Classes of Words: The heart of style analysis revolves around a series of include lists (dictionaries) against which the tokens produced in Step 3A are compared. In contrast, the analysis of content revolves around a series of exclude lists or cutoff points. For example, take a look at Figure 5.  It shows the counts for the 50 most frequent lowercase tokens (words) in our sample of 1206 rap lyrics.  Notice anything about the tokens or words on the horizontal axis? These are not the words that come immediately to mind when you think of rap lyrics.  These are all words we use constantly in everyday discourse. The overwhelming majority of them can be classified as function words, as opposed to content words. In NLP vernacular they are also called stopwords. These are the words that "connect, shape, and organize content words." There is no way to avoid them in written or spoken communication. In English there are a little over 300 of these words, yet this small number usually accounts for 50% or more of all the words in any kind of corpus.  Since every song has a plethora of these kinds of words, they are of little value in distinguishing the content of one song from the next.  So, they are eliminated.

    Rap Analysis Top 50 All

    In the analysis of content, the same exclusion tactic is often used at the other end of the spectrum to eliminate words that occur very infrequently. Those words that occur only once in an entire corpus are called hapax legomenon (hapax for short).  What happens if you exclude one of these tokens — say "357" (that's a gun), or "skitzomanie", or "blooooood", or even "blueprint", all of which are hapax in our sample of rap songs? The answer is not much.  However, what happens if we exclude not one but all the hapax? Figure 7 tells the story. In our sample of lyrics, it turns out that there are around 660K tokens in the corpus (excluding punctuation), while there are about 23K tokens in the vocabulary underlying the corpus. If you eliminate all the hapax (the tall blue bar on the left), you knock out 55% of the vocabulary (~13K/23K) but only 2% (~13K/660K) of the corpus. If you eliminate those tokens that appear 2-4 times in the corpus, you knock out another 28% of the vocabulary and only 3% more of the corpus.  When comparing one song or group of songs with another, the elimination of these infrequent words is not likely to have much of an impact on any topic modeling process.  This is one of the reasons that this strategy was used to build the BOWs for the MXM million song dataset. Recall that the MXM data set started with a corpus of 55M words with an underlying vocabulary of ~500K unique words.  Of the 500K words, 238K were hapax and another 112K occurred between 2-4 times.  Eliminating these still left a hefty 111K words, way too many for analysis.  In the end, the vocabulary was whittled down to the 5000 most frequently occurring stems (not words), including the stems for the function words.  This left 92% of the original corpus (51M out of 55M tokens).

    Again, this sort of exclusion is antithetical to the analysis of style.  Eliminating all these infrequent words can substantially impact, for example, the counts for specialized groups of words like echoisms — words with repeating characters or substrings — or rhyming sequences.
  • Step 3D – Stemming:  The final step in the conversion process is stemming, which involves converting a token or word into its root form so that noun forms like 'doggs' and 'dogg' are considered the same rather than unique, and verb forms like 'want', 'wants', and 'wanted' are treated as the same rather than as different words.  The result of the stemming process is often a shortened word form (e.g. 'achieve', 'achieves', 'achieved' and 'achieving' all become 'achiev'). Converting words or tokens to their stemmed form is an art rather than a science – a crude one at best. For English language text, there are a variety of stemming algorithms that can be used to carry out the task.  By far the most popular is Porter's, which is what I used with my sample of rap lyrics.

    Stemming is basically a chopping algorithm, so it doesn't handle word forms that mean the same thing but are spelled differently or word forms that are spelled the same but mean different things. This is where lemmatization comes into play. This process removes inflectional forms (e.g. 'wants' to 'want') and returns the base form (i.e. lemma) of a word so that, for example, the verb forms 'am', 'are', and 'is' would yield 'be'. In the same vein, it can distinguish words that are spelled the same but play different roles in a particular context (e.g. the verb 'saw', which is converted to 'see', versus the noun 'saw', which remains 'saw'). Unfortunately, in the world of rap lyrics both standard lemmatization and stemming fall far short because of the large number of slang terms, abbreviations and intentional misspellings (e.g. most verbs ending in 'ing' are shortened to 'in'). In essence they're fine when faced with standard English; they don't work well when the text strays from the standard path.

    Because the lemmatization and stemming algorithms produce very similar results (especially after the stopwords are removed), I haven't applied formal lemmatization to the lyrics. I did, however, create special functions to handle swear words and some of the slang terms. In rap, swear words and their close relatives have an amazing variety of spellings and presentations. For example, in my sample of rap lyrics the word m***f***r occurred a little over 700 times. There were over 60 different renditions of the word, with the standard dictionary spelling accounting for about 200 occurrences. To handle these specialized terms and versions, I created a conversion routine to transform them into the same dictionary spelling or a substitute term (for presentation purposes).

    As you might imagine, stemming is a no-no in the analysis of style. It would make it impossible, for example, to determine the tense of any verb. (A minimal end-to-end sketch of mini-steps 3A-3D follows this list.)
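Here is a minimal end-to-end sketch of mini-steps 3A-3D using the Python Natural Language Toolkit (nltk), the library used for the corpus conversion in this project. The sample line and the normalization choices are illustrative only, not the exact code used here.

```python
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# One-time downloads: nltk.download("punkt"); nltk.download("stopwords")

line = "We keep rolling down the boulevard, chasing dreams at night"

# 3A - Tokenizing: split the raw text into tokens.
tokens = word_tokenize(line)

# 3B - Normalizing: lowercase everything, then drop punctuation and numbers.
tokens = [t.lower() for t in tokens]
tokens = [t for t in tokens if re.fullmatch(r"[a-z']+", t)]

# 3C - Excluding: drop the English stopwords (function words).
stops = set(stopwords.words("english"))
non_stops = [t for t in tokens if t not in stops]

# 3D - Stemming: reduce each remaining token to its Porter root form.
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in non_stops]

print(tokens)      # normalized tokens
print(non_stops)   # tokens with stopwords removed
print(stems)       # e.g. 'rolling' -> 'roll', 'chasing' -> 'chase'
```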

Converting Rap Lyrics

Simple Example

With these general steps in mind, let’s look at how the conversion steps work for the lines from the end segment of  a rap song in the corpus called “Ease My Mind” by Arrested Development (1994). The choice of this particular song and segment is basically random, although the wording is more straightforward than most of the other 1205 songs in the collection and misogyny and vulgarity are virtually non-existent (so on the surface the words don’t automatically offend anyone).

Rap Analysis Ease My Mind

The lines from the segment are shown in Figure 6 along with the results of the various conversion steps. The results were produced by a Python program I created using the Natural Language Toolkit (nltk) in combination with a collection of regular expressions and some input/output code modeled after earlier work done by Toby Segaran (2007).  In other text analysis, I've also relied on the sklearn feature extraction module in Python and on the text mining (tm) package in R to accomplish similar sorts of tasks. They all can do the job.

Impact on the Rap Corpus

Rap Analysis Corpus Conversion Results

With each step, the number of terms in the corpus and the associated vocabulary diminishes. The reduction is apparent in Figure 6, but it's readily seen in the aggregate counts that result when the entire corpus of rap lyrics is converted (see Table 1). Here, it's obvious that the biggest impact occurs when the stop and function words are excluded.

As noted, the need for specific steps varies depending on the type of analysis – style versus content. Those studies focused on style usually perform some form of tokenization and normalization and end up working with the lowercase versions of the tokens.  In contrast, most studies dealing with content perform steps 3A-3C, while step 3D is optional. For instance, in his study of the evolution of the content in pop lyrics, James Thompson (2014) chose to bypass stemming. In his words,

After scraping the lyrics, I then built a Python script to process them into a useful format… This involved clearing out punctuation, rogue characters etc. then splitting the lyrics into words and removing the stop words…The Python stop words package provides a good list. I also experimented with stemming; which takes the word back to its root stemming becomes stem etc. In the end I preferred the results without this…

which means that his LDA analysis employed the terms that resulted from excluding various stop and function words (the non-stops).

I’m actually performing both types of analysis.  For style, I’ll be working primarily with the lowercase terms.  For content, I’ll be using the stems which is in keeping with the MXM BOWs

Creating a Song-Stem Matrix: Counting and Weighting

It’s a bit more straightforward to understand the process of creating a document-term matrix (a BOWs) to be used in topic modeling, as opposed to creating a document-property data frame for analyzing style. So, let’s start with the BOWs or bag-of-stems (BOS) in this case.

Selecting the Stems (Columns)

Recall that the end goal for this type of lyric analysis is a matrix whose rows are songs (i.e. documents), whose columns are the unique stems (i.e. terms) across all the songs, and whose cells represent the number of times that a particular song contains a particular stem.  If it doesn't contain the stem, the number in the cell is zero (0).  There are a variety of ways to come up with the list of unique stems for a corpus, but basically we gather all the stems from all the songs in the corpus and find the "set" for this list. Mathematically, a set is a collection of distinct or unique elements.  In our vernacular it's the vocabulary.

Rap Analysis Sorted Set

For example, the (sorted) set associated with the list of stems at the bottom of Figure 6 is shown in Figure 7. Here, the list of stems has gone from 42 to 32. When you gather the stems for the entire corpus into a single list and then create a set from the list, the aggregate change corresponds to the numbers at the bottom of Figure 8.  The list of stems has ~242K terms, while the set of unique stems is ~18K terms. Practically speaking, 18K elements is too large for analytical purposes; after all, the MXM BOW is only 5K, and it represents the corpus for 210K songs versus the 1206 rap songs in my corpus.

As the earlier discussion (S-3C) suggested, one way to reduce this number is to eliminate terms that appear very infrequently in the corpus.  While you can do this earlier in the process, I decided to postpone this step until after I had empirically examined the results from stemming.   It’s akin to what James Thompson did during his topic analysis of pop lyrics:

Now I needed to build the LDA (topic) model… The biggest decisions I needed to make were around how many iterations to run, and how many topics I wanted to find. I ran the process with a variety of iterations…  It was during this process I also discovered the real art to LDA is in the stop words. I found myself adding many more to the list, particularly more lyric based ‘noises’ rather than words. Things like oohhh, ahhh, doo deee, dah etc. These weren’t adding any value to my topics.

Of course, it’s probably easier to look at a frequency distribution of all the terms or stems to determine what constitutes “noise” rather than excluding them in a piecemeal fashion.  As discussed in S-3C, one way to find the “noise” is by determining those terms or stems that only appear a few times in the corpus, say 4 times or less. Figure 5 only hints at the impact of this reduction because it’s based on the tokens not the non-stops or stems.  Using the stems, if you eliminate those that occur 4 times or less, the number of unique stems drops about 70% and the corpus of stems by about 40%.  This still leaves a list of close to 5000 unique stems, which may be too large a number for the sample size of songs (i.e. 1206 rows versus 5000 columns).

A second way to eliminate the "noise" is by looking at those terms or stems that only appear in a small percentage of the songs, say 10% or less.  This is the sort of tactic suggested by Segaran (2007).  In fact, he suggested excluding those terms that appeared in less than 10% of the documents (songs) on the low end and more than 50% on the high end.  For this analysis, the high end has pretty much already been covered simply by eliminating the function or stop words; the stem that appears in the most songs is "time", which shows up in 54% of the songs.  If we jettison those that appear in 10% or less of the 1206 songs, we'd end up with ~220 unique stems and reduce the corpus of stems by ~40%.  Here, 220 stems may be too small.

Rap Analysis Freq in Songs vs Corpus

As Figure 8 shows, the number of songs in which a stem appears is strongly related to the number of times it appears in the corpus (obvious).  If they were linearly related, the correlation "r" would be .92. The choice is kind of a tossup, although the two criteria are easily combined.

Somewhat arbitrarily, I decided for starters to work with those stems that occur a minimum of 50 times in the corpus and appear in at least 50 songs.  This combination yields about 500 unique stems (note: Figure 9 provides a look at the Top 50 of the remaining stems – those that appear most frequently in the corpus). So, the starting song-stem matrix has 1206 rows and 500 columns. I say starting because it will take a few iterations through the topic modeling process to determine whether 500 is sufficient to distinguish the songs and to determine whether the topics have evolved over time.

Rap Analysis Top 50 Stems
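For what it's worth, the combined filter is only a few lines of Python. This is a sketch under the assumption that the stems are held in one list per song; the variable names and toy data are illustrative (the real corpus has 1206 such lists).

```python
from collections import Counter

# One list of stems per song (toy data; 1206 lists in the actual analysis).
song_stems = [
    ["money", "cash", "money", "grind"],
    ["cash", "hustl", "grind"],
    ["love", "heart", "money"],
]

corpus_counts = Counter(s for song in song_stems for s in song)      # occurrences in the corpus
song_counts = Counter(s for song in song_stems for s in set(song))   # number of songs containing the stem

MIN_CORPUS, MIN_SONGS = 50, 50      # the cutoffs described in the text
keep = [s for s in corpus_counts
        if corpus_counts[s] >= MIN_CORPUS and song_counts[s] >= MIN_SONGS]
print(len(keep), "stems retained")  # ~500 for the full rap corpus
```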

Weighting the Counts

As it now stands, each of the cells in the song-stem matrix indicates the actual number of times that a particular stem occurs in a particular song.  When determining the importance of a word within a song, or comparing the differences in word counts from one song to another, the raw number can be inflated because some songs have more words than others and some words occur more frequently in the corpus than others. For these reasons, it's standard practice in text analysis to substitute a weighted value for the raw count.  A common weighting scheme is term frequency-inverse document frequency (tf-idf).  There are various ways to compute tf-idf, but the simplest version is:

tf-idf =  tf(t,d) * log(D/Di)

where tf is the frequency of term t in document d, D is the number of documents, and Di is the number of documents containing the term.

As Manning et al. (2008) note, tf-idf is:

  1. highest when the term or stem occurs many times within a small number of documents (thus lending high discriminating power to those documents);
  2. lower when the term occurs fewer times in a document, or occurs in many documents (thus offering a less pronounced relevance signal);
  3. lowest when the term occurs in virtually all documents.

For example, the stem "time" appears in more songs than any other stem — 661 out of the 1206 songs.  In the song "Party to Damascus" (Wyclef Jean 2003) it appears 6 times.  So, instead of using a 6 for this cell in the matrix, we'd use the weighted value 3.6 (i.e. tf-idf = 6 * ln(1206/661)), reducing its prominence in the song because the stem occurs so frequently in the corpus.
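The arithmetic for that example is easy to verify directly; the snippet below just applies the formula given above (note that libraries such as scikit-learn use slightly different tf-idf variants, so their numbers won't match exactly).

```python
import math

tf = 6        # occurrences of the stem "time" in "Party to Damascus"
D = 1206      # number of songs in the corpus
Di = 661      # number of songs containing the stem "time"

tf_idf = tf * math.log(D / Di)
print(round(tf_idf, 1))   # -> 3.6
```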

Song-Stem Matrix for Topic Modeling

Rap Analysis TF-IDF Matrix

The end result of the various conversion, selection and weighting steps outlined in Figure 1 is a song-stem matrix like the one shown in Figure 10. This is the form that will be used for topic modeling in Part 3 of the analysis.

Creating a Document-Property Data Frame for Analysis of Style

Lyrical style covers a wide range of properties – everything from emotions and moods, to the intelligence factor, to the rhyming structure. Among the various studies concerned with lyrical style, the analysis by Fell and Sporleder (2014) deals with the widest range of properties.  They group the various properties into five dimensions and 13 features including:

  • Vocabulary — TopK n-grams (n<=3), lexical diversity (type-token ratio) and non-standard words
  • Style – POS/chunk tags, length, echoisms, and rhyme features
  • Semantics – Conceptual imagery
  • Orientation – Pronouns, past tense
  • Song Structure – Chorus, title and repetitive structures.

In this analysis I’ll be covering a subset of these dimensions and features including:

Rap Analysis Style Dimensions Features

The measures for the first two dimensions in the above subset — vocabulary and style — are very similar to those used by Fell and Sporleder.  The only exception is that I'm using an algorithm recently posted by Eric Malmi (2015).  For the other two dimensions — semantics and orientation — the features are generally the same but the measures are more extensive.  The reason they're more extensive is not because I've personally come up with a better set of measures; it's because I'm relying on a computer application called Linguistic Inquiry and Word Count (LIWC).  LIWC is the result of years of research by James Pennebaker et al. that began back in the early 1990s. The basic premise of all this work is that "the words we use in daily life reflect who we are and the social relations we are in," as well as "what we are paying attention to, what we are thinking about, what we are trying to avoid, how we are feeling, and how we are organizing and analyzing our worlds."  Towards this end, they have developed LIWC, an application that analyzes written (and spoken) text with respect to 90 psychometric properties.  For the most part Pennebaker et al. have used LIWC for a "higher purpose" — determining and improving an individual's psychological well-being — as opposed to using the application to analyze rap lyrics.  Fortunately, LIWC is very simple to use, yields a much broader coverage of the semantic and orientation dimensions than previous studies, and can be applied to virtually any corpus.

Song-Property Matrix for Statistical Style Analysis

Rap Analysis Song-Props for Style

Although it is certainly not mandatory, once the data corresponding to the features and measures in Table 2 have been compiled, it's useful to put them in a data frame which can then be used for further analysis.  For example, this is what the LIWC does.  It's also what I've done for the next stages of the statistical analysis of style.
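A bare-bones version of that data frame, built with pandas, might look like the sketch below; the property columns and counts are made up purely to show the structure (rows are songs, columns are property values).

```python
import pandas as pd

# Illustrative counts only; in practice there is one row per song (1206 rows),
# with a column for each stylistic property measured.
rows = [
    {"song": "Song A", "slang": 14, "swear": 6, "past_tense": 9},
    {"song": "Song B", "slang": 3,  "swear": 0, "past_tense": 12},
]
df = pd.DataFrame(rows).set_index("song")
print(df)
```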

Prelude to Part 3

As many researchers have lamented, in most data analysis projects the processes of gathering, cleaning and preparing the data occupy about 80% of the time. The other 80% (that's not a typo) is devoted to the actual modeling, communicating, visualizing, reporting and packaging.  It's these latter steps that are discussed in Part 3 of this series.

Resources

References

Bertin-Mahieux, T. et al. “The Million Song Dataset.” International Society for Music Information Retrieval (2011).

Fell, M. and Sporleder, C. “Lyrics-Based Analysis and Classification of Music.” 25th International Conference on Computational Linguistics (2014).

Manning, C. and Schutze, H. Foundations of Statistical Natural Language Processing. MIT Press (1999).

Manning, C., Raghavan, P. and Schutze, H. Introduction to Information Retrieval. Cambridge University Press (2008).

Tausczik, Y. and Pennebaker, J. The Psychological Meaning of Words: LIWC and Computerized Text Analysis Methods. Journal of Language and Social Psychology (2010).  [Note: there are a number of articles by Pennebaker et al.  This has one of the more complete descriptions of LIWC].

Segaran, T. Programming Collective Intelligence: Building Smart Web 2.0 Applications. O'Reilly (2007).

Thompson, James. “The Evolution of Popular Lyrics and a Tale of Two LDA’s.” (2015).

Thompson, Jennifer. "Text Mining (Big Data, Unstructured Data)." (2012).

People

Christopher Manning

Toby Segaran

James Pennebaker

Tools and Modules

Linguistic Inquiry and Word Count (LIWC)

Python Natural Language Toolkit (NLTK)

Python Sklearn Feature Extraction Module

R Text Mining (TM Package)


Analyzing Rap Lyrics – Part 1 of 3: Creating a Corpus

My initial interest in rap music was spurred a while back by an online announcement I read about the winners of the 2014 Kantar Information is Beautiful Awards.  Among the 2014 awards, the one that piqued my curiosity was Matthew Daniels' "Rappers, Sorted by Size of Vocabulary," the gold winner in the "Data Visualization" category.

Daniels Hip Hop Analysis

The visualization, which is shown in Figure 1, actually appeared in a more extensive article entitled "The Largest Vocabulary in Hip Hop: Rappers, Ranked by the Number of Unique Words Used in their Lyric." The answer to the question about whose is largest depends on who's being compared, the specific lyrics being examined, and the meaning of the terms "vocabulary" and "largest." Stated succinctly, Daniels used 85 well-known rappers, compared "each rapper's first 35K lyrics" (about 3-5 albums worth), defined vocabulary as the number of unique words (actually tokens), and created an award winning visualization to display the rankings. By the way, Aesop Rock was the winner (7,392), DMX was the loser (3,214), and the vocabulary of a number of rappers compared favorably to Shakespeare (5,170 unique words in 7 well-known plays) and Melville's Moby Dick (6,122 unique words in the first 35,000 words).

For those folks versed in text analysis, a few thoughts come to mind.  First, "Damn, wish I'd thought of that!" Second, Daniels glossed over a number of difficult analytical questions, some of which he acknowledged and others he ignored.  For instance, he used tokens not stems, so that things like "pimps" and "pimp" are treated as two unique words, not one. Similarly, he ignored the commonality of slang terms like "shorty" and "shawty," so that they were also treated as different when in fact they often have the same meaning.  Decisions like these can result in different rankings, although Aesop would win regardless of the choices. In all fairness, this wasn't an academic article and, as Daniels noted, even with the issues "it's still directionally interesting."

The offshoot of my fascination with Daniels’ article is that I have spent the past 18 months or more periodically investigating rap lyrics, using various sorts of text analysis and text mining to document and visualize their evolution from earlier years to the present.  Like most of the research work I’ve done in the area of automated text analysis and mining (working with online tweets,  blogs, news articles, and product comments, reviews and recommendations), rap lyrics present a number of tough analytical hurdles whose solutions can inform text analysis done in other areas (and vice-versa). Similarly, if you can make a business case for analyzing rap lyrics, you can probably make a business case for doing analysis in other pop culture arenas. I’ll talk about the business specifics later.

Text Analysis of Rap Lyrics

Just to reset.  In this analysis, the discussion will follow the steps outlined in my last blog entry (excuse the hiatus).  The outline includes: 1. Research Question(s); 2. Data Collection; 3. Data Processing & Cleansing; 4. Exploratory Data Analysis; 5. Modeling; 6. Communicating, Visualizing and Reporting; and 7. Packaging.  Part 1 of this analysis will cover steps 1-2, Part 2 will be devoted to steps 3-4, and Part 3 to steps 5-7.  More specifically, the aim of Part 1 is to arrive at a corpus of lyrics covering a number of years and artists that can be cleaned, prepared and initially explored in Part 2, and finally analyzed in Part 3.

Step 1 – Research Question(s)
Blog - Summary of Articles and Papers on Lyrics

In addition to Daniels' article (and other articles he has written on related subjects), there have been a wide variety of articles and academic papers focusing on the text analysis of lyrics in general and rap and hip hop lyrics in particular. A detailed summary of a number of the more recent articles and papers is provided in Appendix 1. Table 1 provides an overview of the primary focus of those articles and papers.


Here, my primary focus is on using machine learning algorithms and data visualization to determine whether there have been major shifts in topic content and lyric style across time and, if so, what those shifts have been and whether they align with the shifts from one genre of rap to the next.

In this respect, this analysis follows in the footsteps of a variety of earlier studies including:

In particular, the analysis in this discussion will follow two strains: (1) topic mining, which relies on an algorithm known as Non-negative Matrix Factorization; and (2) classification based on the underlying style of a lyric. At the end of this three part discussion (step 7), I'll reflect on the practical uses of the research being done on rap lyrics.

Step 2 – Data Collection

The foundation of any text analysis is the text corpus on which it is built. By text corpus I simply mean “a large and structured set of texts.”  In this case the set of texts is a collection of lyrics for a sample of rap song titles or tracks.

In the past, textual analysis based on larger collections of song lyrics was virtually impossible.  These earlier studies relied on "close reading" and manual coding of the various words and phrases in the lyrics of interest – a mind numbing process at best. Today, there are innumerable Web sites providing very large collections of song lyrics of various genres. There are so many of these sites that there are special directories devoted to maintaining lists of and links to them (e.g. allrecordlabels.com).  A number of earlier studies have relied on these sites for their data (see Appendix 1).

Blog - Study Headlines

Be wary of "crowds bearing gifts." As you might expect, there is a lot of overlap in song titles from one site to the next, although no site has a complete or perfect archive.  At a number of the larger lyric sites, the collections are built through crowdsourcing.

Submissions from the laity are prone to typos, omissions and other errors. Like Wikipedia, not only do these sites rely on end users for submissions but also for editing.  Apparently, when it comes to lyrics, their oversight is less than sterling. MacRae and Dixon (2012) did a thorough job of comparing the similarities among the same song lyrics at different sites. I'll spare you the details. The punch line: they found an average accuracy of less than 40%, which indicates substantial variability among the various copies of a lyric.  Bottom line: before you start grabbing lyrics from one site or another, you need to get a sense of how much accuracy you require, and how "accurate" and complete the lyrics are at the site or sites you intend to use. And yes, you have to be careful about copyright and licensing issues if you plan to distribute the lyrics.

Building a Sample of Rap Artist Names and Song Titles

A couple of caveats.  I know there’s a difference between the terms “hip hop” and “rap,” but in this paper I’ll usually use the term “rap.” Similarly, I use the term “artist” to cover both individual musicians as well as groups.

In collecting a large sample of rap lyrics, one of the first challenges is creating the list of songs whose lyrics will be included in the sample. Practically speaking, you can't go to Google or Bing and search for something like "Give me a list of rap songs from 1979 to the present."  For the uninitiated, 1979 is considered by many to be the year the first commercially successful rap song was released (Sugarhill Gang's Rapper's Delight). Obviously, you can physically make the search request, but what you'll get back is a series of sites providing lists of their top songs for a given year (say 100), or maybe their top song for each year starting with 1979, or the top songs of all time. What you won't find is a site providing a large, up-to-date laundry list of rap or hip hop song titles.

So, how do you build a list of songs?  Currently, there seem to be three ways to do this:

  1. Create your own list from scratch.
  2. Use a preexisting list.
  3. Use a preexisting database or dataset of songs which is based on a specified list.

Which one is preferable? It really depends on the question you’re trying to answer and how much work you’re willing to do.  In my case, whatever approach is used needs to yield a sample of songs that is large enough and variable enough to adequately trace the shifts in lyrics from the early years to the present (assuming there have been shifts).

Create your own list of artists and songs from scratch

While most lyric sites don't have an organized list of songs, they do have a list of artists. If you go to a lyric archive site and (physically or programmatically) click on the name of an artist, up pops a list of their songs or albums which lead to the songs. With a substantial list of artists and a little programming effort, you can extract a sizable sample of song lyrics. However, most of the larger and more popular lyric sites (e.g. genius.com, azlyrics.com, or metrolyrics.com) don't categorize artists and their songs by genre.  While you can get a list of artists from these sites, there's no easy or direct way to ferret out the rap artists from the others.

Blog - OHHLA site

Fortunately, there is a site that is solely devoted to hip hop and rap.  It's OHHLA.com, "the Original Hip-Hop Lyrics Archive." OHHLA provides a list of artists that comes in alphabetical chunks that are divided into five separate pages (e.g. A-E, F-J, etc.). A little manual copying, cutting and pasting produces a list of over 3K hip hop artists. Another source is Wikipedia, which has two lists that are pertinent – one for hip hop musicians and one for hip hop groups.  Each of these lists is presented on a single page that is divided into segments based on the first number or letter of the musician's or group's name (so two pages with 27 segments). Again, a little manual effort results in a list of 1671 hip hop or rap artists. Certainly, you could write a program to do this work, but I'd only resort to programming if the list was separated into a much larger number of segments (like an individual page for each of the 27 segments).

Now, with one of these lists of rap and hip hop artist names in hand, you can use it to search one or more of the lyric sites for the list of songs associated with each of the names.  Sometimes it's a simple list (like you find on Genius.com) and other times it's a list of albums leading to a list of songs (like Azlyrics.com).  These lists can be quite long (sometimes in the 1000s) and can include songs where the artist is the sole singer/rapper, songs sung with other artists, songs where they are the primary and others are featured, songs where another artist is primary and he or she is featured … you get the general idea.  Generally, you only want the songs where the artist is sole or primary, whether others are featured or not. Unlike the creation of the list of artists, there's no way you can do this manually unless you have an inordinate amount of spare time on your hands (more about this later).

Use a Preexisting List of Artists and Songs

Most of the previous analyses of song lyrics do not create their lists of artists and songs from scratch.  Instead, they use someone else's list. A lot of researchers have built their lists from various segments of Billboard (BB) Magazine's "Hot Rap" song charts.

Blog - Billboard Rap

Billboard has been charting "Hot Rap" Songs since March 11, 1989. Originally, they were reported every other week, but toward the end of 1989, with the November 4th chart, they began reporting weekly.  Actually, there are a series of charts, but the three most often used for the analysis of rap songs are:


  1. The 15 Hot Rap Songs released weekly — billboard.com/charts/rap-song/YYYY-MM-DD (this chart is published on Saturdays and requires the exact dates of those Saturdays in order to query the charts; a small date-generation sketch follows this list);
  2. The Rap Song at the top of the weekly Hot Rap Songs chart — billboard.com/archive/charts/YYYY/rap-song
  3. Top 100 Rap Songs released every year — billboard.com/charts/year-end/YYYY/hot-rap-songs starting in 2006.
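Generating those Saturday dates (and the corresponding chart URLs) takes only a few lines of Python. This is just an illustrative sketch; the start and end dates shown are examples, and the URL pattern is the one quoted in item 1 above.

```python
from datetime import date, timedelta

# Example span starting from the first weekly chart; the end date is illustrative.
start, end = date(1989, 11, 4), date(2015, 12, 26)   # both are Saturdays

d = start
while d <= end:
    print(f"https://www.billboard.com/charts/rap-song/{d:%Y-%m-%d}")
    d += timedelta(weeks=1)
```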

About 6 or 7 months ago, I started looking in detail at the BB charts.  My initial BB examination started with the rap songs at the top of the weekly lists (#2). I did a pretty thorough text analysis of the 300 top ranked songs that appeared on the weekly lists (the data are provided in the Datasets section for your analyzing leisure). However, the small sample size on which the analysis was built concerned me, especially when it came to comparing songs released in different eras. So, I decided to look at the Top 15 songs over the same time period (#1).

Blog - Evolution of Music

Clearly, looking at the top 15 BB rap song charts results in a much larger sample than restricting it to the songs at the very top – about 1900 songs compared to approximately 300. I was just starting to analyze this larger sample when I came across Mauch et al.'s study of "The Evolution of Popular Music: USA 1960–2010" (2015). The data backbone of this study was the US Billboard Hot 100 charts for the designated time span.  During this time period over 17,000 songs appeared on the weekly BB Hot 100 chart. While their study utilized "features extracted from audio rather than from scores," their dataset, which is publicly available, provides a list of artist names and songs from a wide variety of genres, including rap and hip hop, that can be used to seed a lyric study.  In fact, this is exactly what James Thompson did in a follow-up study of the lyrics for the same sample of songs.

In Thompson’s words: “Inspired by this amazing paper … I thought it would be interesting to see if something similar could be done with Pop lyrics. First, I need to get the lyrics for the songs. The authors … kindly published the data from their paper. It contains all the song titles and artists. Using this data and the API for ChartLyrics.com, I wrote a Python script to scrape the lyrics for each song. Unfortunately I wasn’t able to find every song, but I found around 80%. This was also an automated process and I haven’t thoroughly checked the output, so I would imagine there are a few incorrect songs in there.” Thompson’s analysis made it easy to follow in his footsteps since he also published his data.

So, inspired by both articles, my next trek down the data rabbit hole was to take their list of songs and metadata (primarily subgenre and years), reduce it to a set of rap and hip hop artists and songs, and, like Thompson, write a Python script to scrape the lyrics for each rap song from ChartLyrics.com.  Based on their sets, this meant scraping about 2K rap titles.  Unfortunately, while Thompson found lyrics for 80% of his original list, I only found 1185 of the rap and hip hop songs whose lyrics were available at ChartLyrics.com.  That's a hit rate of 57%, which is pretty poor, although 1185 is certainly larger than my original sample of 300 lyrics.

Creating the Corpus – Retrieving and Extracting (Rap) Song Lyrics

Whether you like it or not, if you're going to do text analysis of any reasonably sized corpus, there's virtually no way to avoid programming. And if you're relying on Web sources for your data, it's close to impossible to avoid something called web scraping.

Web Scraping

Blog - Automate the boring stuff

"Web scraping is the term for using a program to download and process content from the Web" (Sweigart, 2015). Generally, it includes the following tasks (Mitchell, 2015):

  • Retrieving HTML data (a page) from a domain name (URL)
  • Parsing the retrieved data/page for target information
  • Storing the target information
  • Optionally moving to another page to repeat the process

For our purposes, the scraping steps could include one or more of the following:

  1. From a chosen lyric site or sites, retrieving the HTML source page(s) associated with a particular artist's name.  As noted, the vast majority of the time the retrieved page will contain either a list of album titles associated with the artist, or a list of song titles, or both. Quite often it's a good idea to store each page in a local file and work from this file, especially in the early stages of a project.
  2. Next, each source page is parsed to retrieve the titles of the associated artist’s albums or songs.  Typically, the titles are listed as domain names or URLs that can be used to retrieve either the list of songs associated with the albums or the lyrics of the songs.
  3. If the page contains a list of album URLs, each one of these will be retrieved and parsed to obtain the list of song URLs in the album.
  4. With the list of song URLs in hand, the final phase is to use these URLs to retrieve the lyric pages.  I usually store these pages locally (sans any images) so I don't have to keep visiting the original sites.  These pages will form the basis of the corpus for the study. (A bare-bones retrieve-and-cache sketch follows this list.)
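The retrieve-and-cache step (items 1 and 4 above) boils down to a loop like the one below. It's only a sketch: the artist slugs and delay are illustrative, the URL pattern mirrors the azlyrics.com form shown later, and any real scraper should respect the target site's terms of use and robots.txt.

```python
import os
import time
import requests

artists = ["drake", "nas"]                # slugs built from an artist list (illustrative)
os.makedirs("pages", exist_ok=True)

for name in artists:
    # URL pattern varies by site; this mirrors the azlyrics.com/d/drake.html form.
    url = f"https://www.azlyrics.com/{name[0]}/{name}.html"
    html = requests.get(url, timeout=30).text
    with open(f"pages/{name}.html", "w", encoding="utf-8") as f:
        f.write(html)                     # work from the local copy afterwards
    time.sleep(5)                         # be polite; don't hammer the site
```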

Blog - Song Titles from Artist Names

As an example, suppose we use the Wikipedia list of hip hop artists in order to retrieve the song titles, and consequently the lyrics, from one of the larger sites that is open to scraping activity. In order to write the scraping code to do this, you first have to construct the URL requests needed to access the artists' albums or songs. Unfortunately, there are no standards when it comes to the way artist names are listed in the URLs (or any other request for that matter).  This is especially true for artists with two or more words in their names, which is the vast majority. Table 2 illustrates some of the differences for the sites listed earlier.

The result for each of the above requests is a single HTML Web page whose underlying source contains a list of songs associated with the specific artist.  The source can be scraped to retrieve the list.  Again, there are no standards from one site to the next.  Obviously, you'll have to preview the underlying source for a few artists on a site in order to understand the HTML tags that can be used to locate and extract the song lists. Figure 2 shows what the relevant sections of the underlying HTML source look like when requesting the page for the rap artist "Drake" at Azlyrics.com (i.e. azlyrics.com/d/drake.html):

Blog - List of Albums and Songs

  • #1 – Announces the pertinent content with a comment;
  • #2 – Declares the overall list of albums with a div(ision) “id” tag;
  • #3 & #5 – For each album, it provides the title and year with a div(ision) class;
  • #4 & #6 – Each album is followed by a list of song titles, each denoted by an anchor reference (URL) giving the relative Web address where the song lyrics can be retrieved.

Blog - Song Lyrics

For instance, if we follow the implicit anchor "…/lyrics/drake/dowhatyoudo.html" for the song "Do What You Do" at Azlyrics.com, this will take us to a page with the lyrics for that song. The underlying source HTML is shown in Figure 3. It's these underlying lyric pages that can form the corpus for a text analysis of rap lyrics.

Regardless of the specific lyric site, once you understand the underlying HTML structure, it’s a straightforward task to create a software program to retrieve and scrape the pages of interest and to extract and save the lyrics from the source to a local file. This can be done in a programming language like Python, using some combination of the Beautiful Soup module, regular expression matching (yes I understand all the debates about this), and custom code. See Sweigart 2015, Mitchell 2015, or Lawson 2015 for details.
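As a rough illustration of the parsing step, the sketch below runs Beautiful Soup over a locally saved artist page and pulls out the relative song URLs. The tag and attribute choices are assumptions on my part; each site's markup differs and changes over time, so the selectors always need to be adapted to the source you previewed.

```python
from bs4 import BeautifulSoup

# Parse a locally cached artist page (saved by the retrieval loop shown earlier).
with open("pages/drake.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

# Collect the relative song URLs, i.e. anchors that point at .../lyrics/... pages.
song_urls = [a["href"] for a in soup.find_all("a", href=True)
             if "/lyrics/" in a["href"]]

print(len(song_urls), "song links found")
print(song_urls[:5])
```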

A Word about APIs

Blog - Genius API

Various lyric sites have created application programming interfaces (APIs) to support the integration of their platforms with 3rd party commercial applications. Two of the better known APIs for music lyrics are from Genius.com and Musixmatch.com. The Genius API primarily supports programmatic access for creating, managing and viewing of annotations, i.e. interpretations of various lyric phrases and lines made by the site's end users. While their API provides access to data and metadata about particular artists and lyrics, the application developer needs to supply Genius' proprietary numerical IDs for artists and songs in order to gain access to pertinent data about them.  Because there is no consolidated list of the IDs available to the developer, it would be virtually impossible to use the API to develop a large corpus of lyrics.

Blog - Musixmatch API

Like Genius, Musixmatch's API is also aimed at commercial rather than research use.  Musixmatch was started in 2010 and focuses on allowing "users to scan their music library and streaming playlist to retrieve lyrics via Musixmatch apps."  This includes 3rd party apps as well.  The site has a lyrics catalog of 7 million lyrics and information about 5 million artists. On the surface, it's ideal for creating a large sample of lyrics for research purposes.  The problem is that access to their catalog costs a minimum of $25K, a bit prohibitive for the average researcher.  The reason for the charge revolves around lyric copyright issues and licensing restrictions which Musixmatch has to pay in order to display the lyrics on their apps (the same is true for other sites — see Minsker (2013) for the problems Genius.com has encountered). To their credit, Musixmatch recognizes the value of this information for research purposes and has provided a special "bag-of-words" dataset available for non-commercial uses.

The last API I'll mention is ChartLyrics.com — the one that Thompson used, and that I subsequently used. As the name implies, their API is aimed squarely at retrieving lyrics from their catalog of songs, which spans all genres. The API has two key requirements. First, it requires the name of the artist and the song title in their URL format — nothing new there. Second, they impose a 20-second governor on retrieval requests of all sorts. This is somewhat new — a lot of APIs have time or frequency restrictions. In Thompson's case, he used two 20-second delays, one for each of the two retrieval steps (40 seconds per song overall), before a song's lyrics could be retrieved. As he quipped, "This meant it took days to run and my mac kept going to sleep mid-process." Following Thompson's lead, when I used the Top 100 rap song set discussed earlier, I only had to retrieve approximately 3,000 rap songs versus his 17,000 "pop music" songs. Fortunately, it only took part of a day, plus I have a second computer that I use for this sort of task and, thanks to a hint from him, I avoided the "going to sleep" part.
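
For the curious, here is a rough sketch of that two-step retrieval with the 20-second pause in between. The endpoint names (SearchLyric, GetLyric), parameter names, and XML element names reflect my reading of the ChartLyrics apiv1 documentation, so treat them as assumptions to verify before relying on the code.

    # Sketch: two-step ChartLyrics lookup (search, wait, then fetch the lyric).
    import time
    import requests
    import xml.etree.ElementTree as ET

    BASE = "http://api.chartlyrics.com/apiv1.asmx"

    def first_text(root, local_name):
        # Text of the first element whose tag ends with local_name
        # (avoids hard-coding the response's XML namespace).
        for el in root.iter():
            if el.tag.endswith(local_name) and el.text and el.text.strip():
                return el.text.strip()
        return None

    def get_lyrics(artist, song, pause=20):
        # Step 1: search for the song to obtain its lyric id and checksum.
        r1 = requests.get(BASE + "/SearchLyric", params={"artist": artist, "song": song})
        r1.raise_for_status()
        root = ET.fromstring(r1.content)
        lyric_id = first_text(root, "LyricId")
        checksum = first_text(root, "LyricChecksum")
        if not lyric_id or lyric_id == "0":
            return None                      # no match found

        time.sleep(pause)                    # respect the 20-second governor

        # Step 2: fetch the lyric text itself.
        r2 = requests.get(BASE + "/GetLyric",
                          params={"lyricId": lyric_id, "lyricCheckSum": checksum})
        r2.raise_for_status()
        return first_text(ET.fromstring(r2.content), "Lyric")

    # e.g. get_lyrics("Drake", "Do What You Do"), then pause again before the next song.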

Cheating – Copying a corpus from the Million Song Dataset

I know all this programming sounds like fun, but you can avoid much of the initial pain thanks to Echo Nest (now part of Spotify), LabROSA (aka Columbia University's Laboratory for the Recognition and Organization of Speech and Audio), and Musixmatch.com. In 2011 Echo Nest (echonest.com), in conjunction with LabROSA, released the Million Song Dataset (MSD). In their words, it's "a freely-available collection of audio features and metadata for a million contemporary popular music tracks" (Bertin-Mahieux et al. 2011). As the name implies, the dataset includes audio features for 1M songs/files along with metadata covering over 44K unique artists, 2K MusicBrainz tags, 2M artist similarity relationships, and release dates for 515K tracks. However, no lyrics. This is where Musixmatch comes in. Somewhat later, Echo Nest, in partnership with Musixmatch, released the musiXmatch (MXM) dataset, a collection of lyrics for over 237K tracks. Again, in their words, "Through this dataset, the MSD links audio features, tags, artist similarity, … to lyrics" (see labrosa.ee.columbia.edu/millionsong/musixmatch).

Copyright issues prevent the partnership from distributing the full, original lyrics. As a consequence, the lyrics come in a "bag-of-words" format and are stemmed. So, technically speaking, it's a bag-of-stems — more specifically, a bag-of-5000-stems (in essence, the root forms of the words in the lyrics). For example, this is what a bag-of-words entry looks like for a sample track:

TRZCLPH12903CD1957,8914621,1:26,2:19,…,4764:1

The first field, "TRZCLPH12903CD1957", is the MSD track id, while the second value, "8914621", is the MXM track id. These are followed by a series of "<word idx>:<cnt>" pairs. For instance, in this track word 1 appears 26 times ("1:26"), word 2 occurs 19 times ("2:19"), and so on until the last pair, word 4764, which appears 1 time ("4764:1"). The translation of the word IDs can be determined from either the MXM training or testing data — word 1 is "I," word 2 is "the", …, word 4764 is "flirt" even though the word in the song is "flirting" (remember, these are stems, not words).
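
To make the format concrete, here is a minimal parsing sketch. It assumes the standard layout of the MXM text files — comment lines starting with "#", a "%"-prefixed line listing the 5000 stems in index order, then one comma-separated line per track — and the file name shown is the training split as I recall it, so verify it against your download.

    # Sketch: turn one MXM bag-of-words line into a {stem: count} dictionary.

    def load_vocab(path):
        """Return the list of 5000 stems; vocab[0] corresponds to word index 1."""
        with open(path, encoding="utf-8") as f:
            for line in f:
                if line.startswith("%"):
                    return line[1:].strip().split(",")
        raise ValueError("no vocabulary ('%') line found")

    def parse_track_line(line, vocab):
        """Parse 'msd_id,mxm_id,idx:cnt,idx:cnt,...' into its pieces."""
        fields = line.strip().split(",")
        msd_id, mxm_id = fields[0], fields[1]
        counts = {}
        for pair in fields[2:]:
            idx, cnt = pair.split(":")
            counts[vocab[int(idx) - 1]] = int(cnt)   # word indices are 1-based
        return msd_id, mxm_id, counts

    vocab = load_vocab("mxm_dataset_train.txt")        # hypothetical local path
    with open("mxm_dataset_train.txt", encoding="utf-8") as f:
        for line in f:
            if line.startswith(("#", "%")):
                continue                               # skip comments and the vocab line
            msd_id, mxm_id, bow = parse_track_line(line, vocab)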

The 5000 words were chosen to represent the words that occur most frequently (culled to clear out junk like foreign symbols, glued-together words, etc.). Obviously, there are more than 5000 unique words in the entire collection of 237K lyrics. In fact, according to the official count there are approximately 498K unique words with a total of over 55M occurrences. The 5000 words in the dataset account for close to 51M of those 55M occurrences (over 90%). Of the 5000 words, only those that actually appear in a track are recorded in the dataset. This particular track contains 195 of the 5000 words. For the remaining 4805 words there is no "<word idx>:0" pair; this is done to reduce the size of the dataset for storage and transmission purposes.

Using this BOW dataset certainly simplifies the task of building a corpus of song lyrics for analysis, eliminating the need to create a list of artists and songs and to retrieve and extract the lyrics for a very large sample of songs. Yet, there is still a bit of housecleaning to be done, and there are some major drawbacks when it comes to text analysis. From a housecleaning perspective, we need to determine which of the songs are rap songs, which of the rap songs have lyrics, what years the songs were released, and which of the tracks in the set are duplicates (of which there are a number). Once we address those issues, we'll have a "clean bag of stems". That's the good news; the bad news is that we'll only have the stems. Because word order is lost, this limits the types of text analysis we can do — even simple things like "n-gram" analysis of frequently occurring sequential pairs or triplets of words. If you understand how to do the housecleaning, don't mind the restricted analysis, and have a machine that can crunch data of this size without going into cardiac arrest, then it's a good place to start and will certainly result in a large corpus of lyrics.
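
As a rough illustration of that housecleaning, the sketch below assumes the BOW tracks have already been joined with the MSD metadata into a pandas DataFrame with hypothetical columns ['msd_id', 'artist', 'title', 'year', 'bow'], and that rap_artists is a set of artist names you've already identified as rap/hip-hop (e.g. from genre tags); none of these names come from the datasets themselves.

    # Sketch: keep rap tracks that have lyrics and a release year, then de-duplicate.
    import pandas as pd

    def clean_rap_tracks(tracks: pd.DataFrame, rap_artists: set) -> pd.DataFrame:
        df = tracks.copy()

        # Keep only rap artists, tracks with a non-empty bag of stems, and known years.
        df = df[df["artist"].str.lower().isin({a.lower() for a in rap_artists})]
        df = df[df["bow"].map(bool)]
        df = df[df["year"] > 0]              # MSD uses 0 for "year unknown"

        # Drop duplicate recordings of the same song (normalize artist/title first).
        df["key"] = (df["artist"].str.lower().str.strip() + "|" +
                     df["title"].str.lower().str.replace(r"[^a-z0-9 ]", "", regex=True).str.strip())
        df = df.sort_values("year").drop_duplicates("key", keep="first")
        return df.drop(columns="key")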

[A parenthetical note about using someone else's corpus of rap lyrics. In 2011 Tahir Hemphill introduced his Hip Hop Word Count, "a searchable ethnographic database manually built from the lyrics of over 50,000 hip-hop songs from 1979 to present day." The massive database is searchable by date, artist, word, word complexity, locality, and a host of other metrics. Originally, my understanding was that the database was going to be made publicly available. To my knowledge that hasn't happened, although he has used it for some special youth-oriented projects. I wish it were publicly available; it would have saved me a substantial amount of work.]

Prelude to Part 2

In Part 2 of the discussion I'll delve into the details of preparing the corpus of rap lyrics for exploratory analysis, as well as carrying out that initial analysis. The primary dataset will be the corpus I've extracted from the weekly BB Top 100 Rap Songs using the ChartLyrics API.

Resources

References

Bertin-Mahieux, T. et al. “The Million Song Dataset.” International Society for Music Information Retrieval (2011).

Fell, M. and Sporleder, C. “Lyrics-Based Analysis and Classification of Music.” 25th International Conference on Computational Linguistics (2014).

Lawson, R. Web Scraping with Python. Packt Publishing (2015).

Macrae, R. and Dixon, S. “Ranking Lyrics for Online Search.” 13th International Society for Music Information Retrieval Conference (2012).

Mauch, M. et al. "The Evolution of Popular Music: USA 1960–2010." Royal Society Open Science (2015).

Minsker, E. “Study by Camper Van Beethoven’s David Lowery Prompts NMPA to File Take-Down Notices Against 50 Lyric Websites: Rap Genius tops list of 50 ‘undesirable’ sites.” (2013).

Mitchell, R. Web Scraping with Python: Collecting Data from the Modern Web. O'Reilly Media (2015).

Sterckx, L. Topic Detection in a Million Songs. Dissertation, Universiteit Gent (2013).

Sweigart, A. Automate the Boring Stuff with Python: Practical Programming for Total Beginners. No Starch Press (2015).

Thompson, J. “The Evolution of Popular Lyrics and a Tale of Two LDA’s.” (2015).

People

James Thompson

Matthew Daniels

Tahir Hemphill

Places (Virtual and Real)
Azlyrics.com

Echo Nest

Genius.com

LabROSA

Original Hip Hop Lyrics Archive (OHHLA)

Tools & Modules

Beautiful Soup

ChartLyrics API

Genius API

Musixmatch API

Musixmatch Million Song Dataset

Web Scraping

Foundations of Analysis: Part 2

Focus of Part 2

Part 1 of the "Foundations of Analysis" post briefly introduced the field of the Digital Humanities (DH) and the associated notion of Distant Reading, the key method underlying DH text analysis. In this second part of the post, I'll discuss a potential framework for doing data analysis in the DH. The framework rests on those used in data mining and data science. It's a framework that I have used in the past to analyze pop cultural phenomena and artifacts and plan to use in subsequent posts dealing with this subject matter. To set the stage for this discussion, I first look at a simple example based on a "distant reading visualization" produced in an earlier study by Stefanie Posavec (an independent graphic designer).

On the Road (Again): Example of Distant Reading Visualization of a Single Text

In Part 1 of this post, reference was made to a recent report by Jänicke et al. that surveyed the "close and distant reading visualization techniques" used in DH research papers from 2005 to 2014. One of the key findings was that a number of the research studies examining single texts or small collections of texts utilized distant reading visualizations, either on their own or in combination with close reading visualizations.

Figure 1. Posavec's close reading annotations

A frequently cited example of a distant reading of a single text is Stefanie Posavec's 2007 visual analysis of Jack Kerouac's On the Road. The analysis was part of an MA project (Writing without Words) designed to visually represent text and explore differences in writing styles among authors (including "sentence length, themes, parts-of-speech, sentence rhythm, punctuation, and the underlying structure of the text"). Although the distant part is often highlighted in citations, Posavec began with a manual, close reading of the book. Figure 1 exemplifies the close reading visualization techniques that formed the base for her distant analysis. Essentially, she divided the text into segments — chapters, paragraphs, sentences and words — and then used color and annotations to record things like the number of the paragraph (circled on the left), the sentences and their lengths (slashed numbers on the right), and the general theme of the paragraph (denoted by color). These close reading visualizations served as the underpinnings for a number of unique and highly artistic distant reading visualizations.

Figure 2. Components of the tree in Posavec's visualization

One of her distant reading visualizations is displayed in Panel 1 of Figure 2. Here, the combined panels highlight the relationship between the components of the tree (in the visualization) and the segments of the book. More specifically, in Panel 1 the lines emanating from the center of the tree represent the chapters in the book (14 of them). Panel 2 shows the paragraphs (15) emanating from one of the chapters. Finally, in Panel 3 we see the sentences and words for one of the paragraphs (I think there are 26 sentences, and I don't know how many words). While there are individual lines for the words, they dissolve into a "fan" display in the visualization. For the various structures, the colors represent the categories or general themes of the sentences in the novel. A major value of this display is that it is easy to see the general structure of the novel, as well as the coverage of the various themes by the segments of the book all the way down to the sentences.

An interesting facet of Posavec's visualizations (this and others) is that they were all done manually — both the close and distant reading visualizations. Given Posavec's background, it's easy to understand her approach. By training she's a graphic designer "with a focus on data-related design," and her visualization work reflects it. As David McCandless (the author of Information is Beautiful and Knowledge is Beautiful) put it, "Stefanie's personal work verges more into data art and is concerned with the unveiling of things unseen." She's not a digital humanist per se, nor is she a data scientist or data analyst (which isn't a handicap). If she were the latter, she probably would have used a computer to do the close and distant reading work, although at the time it would have been a little more difficult, especially the tree displays.

Web-Based Text Analysis of On the Road

Today, it's a pretty straightforward task to do small-scale text analysis using computer-based or web-based tools (see chapter 3 of Jockers's Macroanalysis). For instance, I found a digital version of Kerouac's On the Road on the web and pasted it into Voyant — "a web-based reading and analysis environment for digital texts." It took me about 20 minutes to find the data and produce the analysis shown in Figure 3. As usual, most of that time (about 15 minutes) was spent cleaning up issues with the digitized text (e.g. paragraphs that had been split into multiple pieces or letters that had been incorrectly digitized).

Figure 3. Voyant analysis of On the Road

The various counts provided by tools like Voyant (e.g. sentence and word counts) are similar to the types of data underlying Posavec's displays. There are a small number of publicly available tools of this sort that can be used by non-experts to perform various computational linguistic tasks. Of course, what they don't provide is the automated classification of the paragraphs and sentences into the various (color-coded) themes, nor the automated creation of the tree diagram. Both of these require programming, specialized software, or a combination of the two. For a discussion of the issues and analytical techniques underlying automated analysis of "themes," see chapter 8 in Jockers's Macroanalysis.
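
If you'd rather compute the basic counts yourself instead of (or alongside) Voyant, a few lines of Python will do. This is only a rough sketch — the file name is hypothetical, and the sentence splitter and tokenizer are deliberately crude.

    # Sketch: Voyant-style summary counts for a plain-text copy of the novel.
    import re
    from collections import Counter

    with open("on_the_road.txt", encoding="utf-8") as f:
        text = f.read()

    sentences = re.split(r"(?<=[.!?])\s+", text)       # crude sentence splitter
    words = re.findall(r"[a-z']+", text.lower())       # crude tokenizer
    freq = Counter(words)

    print("sentences:", len(sentences))
    print("tokens:", len(words), " unique words:", len(freq))
    print("top 10:", freq.most_common(10))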

Framework for Data Analysis of (Pop) Culture

While incomplete, the brief discussion of Posavec's visual analysis of On the Road and the outline of a potential computer-based analysis of the same text in "digital" form hint at the broad steps that are used in performing distant reading, and distant viewing or distant listening for that matter. So, returning to the question raised in Part 1: "what are the steps used in doing analysis — data and visual — of some cultural phenomenon of interest and its related artifacts?"

I think I can safely say that there is no "standard" or "approved" framework or methodology for defining and delineating the steps to be used in doing DH. Unlike the "scientific method" or the methods for doing "statistical hypothesis testing," there are no well-developed cultural practices for doing analysis in the DH, much less any standards committees sitting around voting on standard practices. However, this doesn't mean that there is or should be "epistemological anarchy" (to use an impressive-sounding phrase from Feyerabend cited in Rosenbloom 2012). There are some options that are used in other fields to perform similar sorts of analysis.

Figure 4. The CRISP-DM Process

The first framework comes from the field of data mining (i.e. discovering meaningful patterns in large data sets using pattern recognition technologies) and is known as the CRISP-DM methodology. The acronym stands for Cross-Industry Standard Process for Data Mining. Basically, it's a standards-based methodology detailing the process and steps for carrying out a data mining project. Figure 4 provides an overview of the "Process" (this is the "official" diagram).

These processes and steps also form the basis of various methodologies and frameworks proposed and employed in the emerging field of data science. Wikipedia defines data science (DS) as "an interdisciplinary field about processes and systems to extract knowledge or insights from large volumes of data in various forms, either structured or unstructured, which is a continuation of some of the data analysis fields such as statistics, data mining and predictive analytics, as well as knowledge discovery in databases (KDD)." Obviously, data science is broader than data mining; however, the generic processes used in DS are often very similar. One rendition of the DS Process is shown in Figure 5.

Figure 5. The Data Science (DS) Process

The process starts with the "Real World." In some cases the "Real World" represents a real-world process producing some data of interest. In other instances, it's a surrogate for a problem, question, or goal of interest. Similarly, the outcome(s) at the end can be a "Decision" and/or a "Data Product." Again, the "Decision" represents an actual decision, solution, or answer, while a "Data Product" usually refers to a software application (often automated) that solves a particular problem or improves a process. In many cases the "Real World" and the associated decisions and data products come from the business world. However, because DS projects encompass a wide variety of models and algorithms, including "elements of statistics, machine learning, optimization, image processing, signal processing, text retrieval and natural language processing," the DS process has been and can be used in theoretical and applied domains across the physical and biological sciences, the social sciences, and the humanities.

To show its applicability to the humanities, reconsider Posavec's analysis of On the Road. If we focus on the topical analysis and ignore her lexical and stylistic analysis, then the general steps of the DS process might look like:

  1. Real World (Questions) – What were the major themes in Kerouac's On the Road and how did they vary by key segments of the text — chapters, paragraphs and sentences?
  2. Raw Data Collected – This is the text of the book tagged by key segments (e.g. first chapter, 5th paragraph, 2nd sentence) and stored in digital form.
  3. Data Processed into Cleaned Data – Utilize text analysis and natural language processing procedures (e.g. tokenization, stopword removal, stemming, etc.) to prepare and transform the original text into a clean dataset, in a form and format that can be "explored" and/or "modeled."
  4. Models & Algorithms – First, employ a topic modeling algorithm (e.g. Latent Dirichlet Allocation – LDA) to "automatically" determine, in an unsupervised fashion, the underlying themes (i.e. topics) of the book. Second, utilize the results from the topic modeling algorithm to determine the major theme (topic) for each paragraph and sentence. This latter step, which is needed to assign the colors used in the tree diagram, requires additional code because it is not automatically performed by the topic modeling algorithm. (A minimal sketch of steps 3 and 4 appears below, after Figure 6.)
  5. Communicate, Visualize & Report – In this instance, the primary focus of Posavec's analysis and output was a tree diagram where the limbs and branches represented the segments of the book, displayed in the order in which they appeared in the text, and the colors of the limbs and branches represented the themes (topics) of the various paragraphs and sentences. There are a number of publicly available algorithms and programs that can produce color-coded tree diagrams (see the output of some of these programs in Figure 6). While none of the ones I have seen can reproduce Posavec's handcrafted displays, I'm pretty confident that at least one of them could be modified so that the resulting diagram is close to the original display.
  6. Make Decisions & Data Products – As noted, Posavec's primary focus was on visualization. Typically, text analysis of this sort would result in a written discussion tying the analytical results to the associated problems or questions of interest. Alternatively or additionally, the various algorithms could be packaged into an application or system that could perform similar sorts of analysis and visualization for other collections of text.

Figure 6. Color-coded tree diagrams produced by publicly available programs
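
Here is the promised sketch of steps 3 and 4. To be clear, this is not Posavec's method (her analysis was manual); it simply shows how the preprocessing and the unsupervised topic model might be wired together with scikit-learn, with the file name and the choice of 10 topics as assumptions.

    # Sketch: LDA over the novel's paragraphs, plus the dominant topic per paragraph
    # (which is what the color coding in a tree diagram would need).
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    with open("on_the_road.txt", encoding="utf-8") as f:
        paragraphs = [p.strip() for p in f.read().split("\n\n") if p.strip()]

    # Step 3: tokenize, drop English stopwords and rare terms.
    vectorizer = CountVectorizer(stop_words="english", min_df=5)
    X = vectorizer.fit_transform(paragraphs)

    # Step 4: fit the topic model, then take each paragraph's dominant topic.
    lda = LatentDirichletAllocation(n_components=10, random_state=0)
    doc_topics = lda.fit_transform(X)              # rows are P(topic | paragraph)
    dominant_topic = doc_topics.argmax(axis=1)     # drives the color of each branch

    # Inspect the top words of each topic.
    terms = vectorizer.get_feature_names_out()
    for k, weights in enumerate(lda.components_):
        top = [terms[i] for i in weights.argsort()[-8:][::-1]]
        print("topic", k, ":", ", ".join(top))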

Future Analyses

Even though the above example involves only a single book, the DS Process is general enough and broad enough to serve as the framework for DH analysis of all sorts, regardless of the specific cultural phenomenon and artifacts being studied, the size of the data collection involved (micro, meso or macro), and the nature of the questions or problems being addressed (the who and with whom, the what, the when and the where). That said, there are a number of details that vary with each of these factors. These details will be provided in the posts to come, starting with the next couple of posts dealing with the text analysis of hip-hop and rap lyrics.

Visual Frameworks and Methods

Visualization projects (like Posavec's) often require a number of specialized steps over and above those outlined in the DS Process. For projects of this sort, there are specialized frameworks that can be used to either supplement or supplant the DS Process. As noted in the introductory post, one of the best of the visualization frameworks was developed by Katy Börner and her team at Indiana University (IU). Börner is the Victor H. Yngve Professor of Information Science in the Department of Information and Library Science at IU in Bloomington, IN, as well as the Founding Director of the Cyberinfrastructure for Network Science Center at IU. Their visualization framework is detailed in three books: Atlas of Science (2010), Visual Insights (2014), and Atlas of Knowledge (2015). The latest rendition of the framework is described in the Atlas of Knowledge and is shown below in Figure 7 (along with the page numbers in the Atlas associated with the various steps). The framework is supported by a toolset — the Science of Science (Sci2) Tool — which was also developed by Börner and her team. A detailed description of the toolset is provided in Visual Insights, along with a number of examples and case studies illustrating its use.

Figure 7. Börner's visualization framework (from the Atlas of Knowledge)

When looking for guidance in doing visual analysis, another avenue to consider (besides the DS Process and visualization frameworks) is the set of "informal" processes and practices employed by the graphic designers and illustrators involved in much of the innovative work being done in the areas of information and data visualization. A recent book entitled Raw Data: Infographic Designers' Sketchbooks (2014) provides a detailed look at the work being done by seventy-three of these designers across the world.

Resources

References

Börner, K. (2010). Atlas of Science: Visualizing What We Know. Cambridge, MA: MIT Press.

Börner, K. and D. Polley (2014). Visual Insights: A Practical Guide to Making Sense of Data. Cambridge, MA: MIT Press.

Börner, K. (2015). Atlas of Knowledge: Anyone Can Map. Cambridge, MA: MIT Press.

Heller, S. and R. Landers (2014). Raw Data: Infographic Designers' Sketchbooks. London: Thames and Hudson.

Jockers, M. (2013). Macroanalysis: Digital Methods and Literary History (Topics in the Digital Humanities). Urbana, Chicago, Springfield IL: University of Illinois Press.

McCandless, D. (2010). "Great Visualizers: Stefanie Posavec." informationisbeautiful.net/2010/great-visualizers-stefanie-posavec/.

McCandless, David. (2009). Information is Beautiful. New York: Harper Collins.

McCandless, David. (2014). Knowledge is Beautiful. New York: Harper Collins.

Posavec, S. (2007). "Writing without Words." stefanieposavec.co.uk/writing-without-words.

Rosenbloom, P. (2012). "Towards a Conceptual Framework for the Digital Humanities." digitalhumanities.org/dhq/vol/6/2/000127/000127.html.

Wikipedia. “Data Science.” en.wikipedia.org/wiki/Data_science.

People

Katy Börner
David McCandless
Stefanie Posavec

Places (Virtual and Real)

Cyberinfrastructure for Network Science Center

Tools

Science of Science (Sci2) Tool

Voyant

Foundations of Analysis: Part 1

On the Shoulders of Giants and Other Researchers of Above Average Height

"Standing on the shoulders of giants" (photo montage)

In the introductory post, I indicated that most of my research in Pop Culture derives its foundation from the field of Digital Humanities (DH) and the works of “giants” in the field like Lev Manovich and Franco Moretti. In this post, I want to put this foundation in context, discussing some of the key ideas and the researchers behind those ideas.

For the record, Manovich and Moretti are #3 and #4 in the picture above. The identities of the others are revealed below. That's me "standing on their shoulders." You may have heard the phrase "If I have seen further, it is by standing on the shoulders of giants," often attributed to Sir Isaac Newton, or read Stephen Hawking's book On the Shoulders of Giants. Matthew Jockers, one of the leaders in DH (#5 above and discussed at the end of the post), also noted in describing the state of DH's union that, "In 2012 we stand upon the shoulders of giants, and the view from the top is breath taking." Actually, the original phrase, and the one quoted in the picture above, was more like "A dwarf standing on the shoulders of a giant may see further than the giant himself." This phrasing dates to sometime in the 12th century, well before political correctness. The term was meant to be metaphorical, referring to someone with mortal skills. For me it's both metaphorical and literal, since I'm 3 standard deviations to the left of the average height for US males.

Roots of Digital Humanities

The humanities encompass a variety of disciplines focused on the study of human culture.  Among the various disciplines are literature, history, philosophy, religion, languages, the performing arts, and the visual arts. Besides their focus on human culture, the humanities traditionally have been distinguished (from the sciences) by the methods they employ, relying primarily on “reflective assessment, introspection and speculation” underpinned by a large element of historical (textual) data.

In recent years a growing segment of humanists (and researchers in other disciplines) has also begun to apply some of the newer analytical and visualization methods to a range of traditional humanities topics. The application of these newer tools and techniques falls under the rubric of digital humanities (DH). The DH label was created at the beginning of the 21st century to distinguish it from its earlier counterpart — humanities computing, which encompassed a variety of efforts aimed at standardizing the digital encoding of humanities texts — and to avoid any confusion that might be raised by labeling these newer activities as digitized humanities.

Google Trends and Ngram comparison: "Digital Humanities" and "Humanities Computing"

Basically, digitized humanities refers to the digitization of resources so they can be accessed by digital means (e.g. digitizing images so they can be viewed online or digitizing an image catalog so it can be queried online). In contrast, DH refers to humanities study and research performed with the aid of advanced computational, analytical, and visualization techniques enabled by digital technology. Digitization is simply one component of DH, albeit a critical one.

“The ‘IZEs’ have it” or should that be “The ‘IZations’ have it”

Speaking of "digitization," have you ever noticed the proliferation of "izes" and "izations" in today's rhetoric, especially IT rhetoric? As John Burkardt, a research associate in the Department of Scientific Computing at Florida State University, reminds us, we owe this phenomenon to the "16th century poet Thomas Nashe for inventing the suffix –ize as a means of easily generating new and longer words" and to the (Madison Ave.) ad-speak of the 1950s, which relied on (among other things) the use of -ize and -ization to create new verbs and nouns. Burkardt provides a long list of examples (393 of them), many of which are current day. In the world of IT, "izes" and "izations" abound. For instance, a recent Tweet highlighted "The digitization and 'cloud'ization of data #digital data." Similarly, recent articles about the Internet of Things (IoT) have opined that the IoT is about "the 'dataization' of our bodies, ourselves, and our environment."

In the context of this posting, some of the key “izes” and “izations” impacting the world of DH include:

  • Digitize and digitization – Converting cultural artifacts (e.g. text documents, paintings, photographs, song lyrics) to digital form.
  • Webize and webization – Conversion to digital form often goes hand-in-hand with the process of adapting digitized artifacts to the Web (or Internet).
  • Dataize and dataization – “…translating digitized cultural artifacts into ‘data’ which captures their content, form and use.”
  • Algorithmize and algorithmization – Converting an informal description of a process or a procedure into an algorithm.
  • (Data or Information) Visualize and visualization – Presenting data or information in a pictorial or graphic form.
  • Artize and artization – Emphasizing the artistic quality of an information or data visualization.

All but the last of these "izes" are "real," meaning that I was able to find definitions for the words on the Web. The last one is sort of a figment of my imagination. I say "sort of" because the phenomenon is real and important, even though the word doesn't and probably shouldn't exist. Lev Manovich calls this phenomenon "artistic visualization" — visualizations that place a heavy emphasis on their artistic dimension. I plan to cover this dimension in detail in future postings. A couple of places where you can see a number of examples are Manuel Lima's visualcomplexity.com and Mike Bostock's D3 gallery.

In a crude sense, the sequence of "izes" listed above provides an outline for doing research in DH. That is to say, for some (pop) cultural phenomenon of interest and its associated artifacts, the artifacts first have to be digitized and maybe webized, then dataized, then algorithmized, and finally visualized (at least in a primitive sense). So, the question is: how do you do this? Some of the answers are aimed at specific types of cultural phenomena and artifacts, while others rest on very general frameworks for doing data analysis, data science, or (data) visualization. The section below briefly discusses a couple of instances of the former, while general frameworks are described in Part 2 of the post.

Distant Analysis: Reading, Viewing and Listening

Close Reading

Much of the research and analysis conducted in the humanities revolves around the examination of textual information utilizing a method known as close reading. Close reading involves concentrated, thoughtful, and critical analysis that focuses on significant details garnered from one text or a small sample of texts (e.g. a novel or collection of novels) with an eye toward understanding key ideas and events, as well as the underlying form (words and structure). While close reading remains the primary method in literary analysis and criticism, it is not without its drawbacks. Given its attention to detail, as well as its reliance on the skills of the individual reader, close reading makes it difficult: (1) to replicate results; (2) to ascertain general patterns within a single text or among a collection of texts; and (3) to generalize findings from the analysis of a single text or a small (non-random) sample of texts to some larger population of which the analyzed text or sample is a small part.

Distant Reading

This is where distant reading can come into play.  The term distant reading was coined by the literary expert Franco Moretti in 2000 to advocate the use of various statistical, data analysis, and visualization techniques to understand the aggregate patterns of large(r) collections of text at a “distance.” Towards this end, he suggested that the (types of) “graphs used in quantitative history, the maps from geography, and trees from evolutionary theory” could serve to “reduce and abstract” the text within the collections of interest and to “place the literary field literally in front of our eyes — and show us how little we know about it” (Moretti 2007). Obviously, these are fighting words to the average humanist, which means that distant reading has an abundance of critics (both literary and otherwise).

Closely related to the concept of distant reading is the notion of culturomics. As Wikipedia notes, this is a form of "computational lexicology" (I'll let you look that one up) that studies human culture and cultural trends by means of the quantitative analysis of words and phrases in a very large corpus of digitized texts. The best known example of culturomics is the Google Labs' Ngram Viewer project, which uses n-grams to analyze the Google Books digital library for cultural patterns in language use over time. Some interesting examples of the types of analysis that can be performed are provided in a research article by Jean-Baptiste Michel (photo #6), Erez Lieberman Aiden (photo #2), et al. entitled "Quantitative Analysis of Culture Using Millions of Digitized Books," which appeared in the January 2011 issue of Science.

While the notions of distant reading and culturomics highlight the need for analyzing larger collections of texts, a number of DH research projects conducted in the past 10 years (since the publication of Moretti's Graphs, Maps and Trees in 2005) have focused on individual texts or smaller samples of texts, employing the analytical and visualization techniques advocated by Moretti to enhance or supplement a close reading exercise. A very recent report by Jänicke (photo #1) et al. (2015) surveyed the "close and distant reading visualization techniques" utilized by DH research papers published in key visualization and digital humanities journals from 2005 to 2014.* The close reading visualizations used things like color, font size, glyphs, or connection lines to highlight various features of the text or the reader's annotations, while the distant reading visualizations involved the usual suspects — charts, graphs, maps, networks, etc. Below is a table displaying the relationship between the type(s) of visualization used (close, distant, or both) and the size of the text sample being analyzed (single text, small collection, large corpus).

Table: Visualization type (close, distant, or both) by size of text sample

Among other things, the results indicate that: 1. Although all of these studies fall under the DH umbrella, a sizeable number (37 out of 100) used either a single text or a smaller collection of texts; 2. Within the studies that employed either a single text or a smaller collection of texts, almost half (18 out of 37) used either distant reading visualization techniques or some combination of close and distant visualization techniques.

A Word about Distant Viewing and Listening

Even in humanistic fields outside of literature, like the visual arts where the cultural artifacts of interest (e.g. paintings or sculptures) are non-textual, something akin to close reading and text play critical roles.  That is, single pieces of art, single artists, or even specific styles or movements are the focal point of much of the scholarly research in this area, and much of the scholarly communication is delivered in textual form as “catalogues, treatises, monographs, articles and books.”  Increasingly, the tools and techniques of digital humanities are also being applied to these non-textual areas – what we might call distant viewing and distant listening of much larger, digital collections of art, artists, music and musicians.

In the world of distant viewing, the leading light is Lev Manovich. Currently, Manovich is a Professor at The Graduate Center, City University of New York (CUNY) and Director of the Software Studies Initiative, which has labs at CUNY and the University of California, San Diego (UCSD). His focus, and the focus of the labs, is on the intertwined topics of software (as a cultural phenomenon and artifact) and cultural analytics. Cultural analytics is defined as "the use of computational methods for the analysis of massive cultural data sets and flows." Basically, it's the subset of Digital Humanities focused on "big (cultural) data," especially really big image data sets. Manovich and his colleagues at the Software Studies Initiative are very prolific, so it's hard to pinpoint one article or book that summarizes their work. However, Manovich's article "Museum without Walls, Art History without Names: Visualization Methods for Humanities and Media Studies" does an excellent job of summarizing a number of his articles dealing with the topics of this post, including distant and close reading.

A Final Reference

This post barely touches the world of Digital Humanities and the associated notions of distant and close reading. For a really good book on these topics, take a look at Matthew Jockers' Macroanalysis: Digital Methods and Literary History — Topics in Digital Humanities (2013). As the title implies, Jockers prefers the terms microanalysis and macroanalysis (à la economics), with a bit of mesoanalysis thrown in between, instead of the standard terms close and distant reading. He has also written a how-to book — Text Analysis with R for Students of Literature (Quantitative Methods in the Humanities and Social Sciences) — that details how to perform micro-, meso-, and macroanalysis of text.

As an aside, Jockers (photo #5) was a colleague of Moretti’s at Stanford University and with Moretti was co-founder and co-director of the Stanford Literary Lab.  Today he is an Associate Professor of English at the University of Nebraska, Lincoln, Faculty Fellow in the Center for Digital Research in the Humanities and Director of the Nebraska Literary Lab.

Resources

References

Michel, J.-B.*, E. Lieberman Aiden* et al. (2011). "Quantitative Analysis of Culture Using Millions of Digitized Books." Science. [* joint lead authors]

Bostock, M. D3 Gallery.

Burkardt, John. Word Play.

Lima, M. Visual Complexity.

Jänicke, S. et al. (2015). "On Close and Distant Reading in Digital Humanities: A Survey and Future Challenges." EuroVis 2015 – State of the Art Reports (STARs).

Jockers, M. (2013) Macroanalysis: Digital Methods and Literary History  – Topics in the Digital Humanities. University of Illinois Press.

Jockers, M. (2014). Text Analysis with R for Students of Literature (Quantitative Methods in the Humanities and Social Sciences). Springer.

Manovich, L. (2013). “Museum without Walls, Art History without Names: Visualization Methods for Humanities and Media Studies.” In Oxford Handbook of Sound and Image in Digital Media. Oxford University Press.

Moretti, Franco (2007). Graphs, Maps and Trees:  Abstract Models for Literary History. London: Verso.

Moretti, Franco (2013). Distant Reading.  London: Verso.

Places

Center for Digital Research, University of Nebraska at Lincoln.

Stanford Literary Lab, Stanford University.

Culturomics.

Software Studies Initiative.

What is Dataffiti?


Dataffiti is a portmanteau that blends the words data and graffiti.  I could have used the term daffiti but it sounds too much like “defeated,” so I opted to keep the “t.” As the tagline — “from pop culture to analytical insights” – suggests, the blend is meant to represent the application of data science techniques and tools to the world of popular or pop culture in order to gain insights into its aggregate structures and trends — the who and with whom, the what, the when and the where.


The terms popular culture or pop culture encompass a variety of definitions – both broad and narrow. The fact that I've combined the word data with (gr)affiti indicates that my interests are narrower and will be focused on cultural products that are on, or at least started out on, the fringes of society before they reached the mainstream. Under this rubric, the types of projects to be considered include: text analysis of the lyrics in hip-hop and rap music; image analysis of graffiti and street art; social network analysis (SNA) of the casts of Bollywood movies; statistical analysis of the episodes of Man vs. Food and Diners, Drive-Ins and Dives; and dynamic analysis of the influence networks among the "isms" (like cubism) of modern and contemporary art, to name a few.


For those familiar with the field of Digital Humanities, you may recognize that these topics and the associated analysis follow in the footsteps of Franco Moretti's work on literary history, especially his essay "Graphs, Maps, and Trees," and Lev Manovich's cultural analytics, principally the work he and his team have done on the visual analysis of large collections of images (e.g. Instagram images, manga pages, and Impressionist paintings). Unlike Moretti's, but like Manovich's, most of the (pop culture) data to be analyzed comes from the Web, albeit in a variety of shapes, forms, and fashions – including online databases, APIs, and the underlying HTML source of associated Web pages. As the sample topics indicate, the data science techniques and tools to be employed run the gamut from standard statistical analysis, to machine learning and data mining, to natural language analysis, to image processing and computer vision, to (social) network analysis. The analysis will rely on two basic kinds of tools: programming languages – Python, R, and JavaScript – and specialized software for Social Network Analysis (SNA) – Pajek, UCINet, Gephi, and ORA.


The analysis will also be supported and supplemented by a wide variety of visualizations encompassed by the terms "charts, tables, (statistical) graphs, geospatial maps and network graphs," supporting statistical, temporal, geospatial, topical, and network analysis, respectively. Most of the visualizations will employ the visualization framework detailed by Börner and friends. Like the base analysis, these visualizations will be generated either by the visualization capabilities of Python and R or by the specialized visualization capabilities of D3.js, Sci2, or the Processing programming language for the visual arts.

References

Wikipedia. https://en.wikipedia.org/wiki/Popular_culture

Crossman, Ashley. http://sociology.about.com/od/P_Index/g/Popular-Culture.htm

Moretti, Franco (2007). Graphs, Maps and Trees: Abstract Models for Literary History. London: Verso.

Manovich, Lev.  Manovich.net & http://lab.softwarestudies.com/p/cultural-analytics.html

Katy Börner et al. at the Cyberinfrastructure for Network Science Center. Major works include: (2010) Atlas of Science: Visualizing What We Know; (2014) Visual Insights: A Practical Guide to Making Sense of Data with David Polley; and (2015) Atlas of Knowledge: Anyone Can Map. All are published by MIT Press.