NLP for News Narrative Analysis: An Explanatory Sequential Study Design

One of the challenges of my project is that I am not only building a digital experience but also attempting a new method of analysis. The method is new to my own research toolbox, and it is rarely used in my field.

NLP (or natural language processing) is a computational method that is often used in computational linguistics, social media analysis, and other big data research disciplines. However, in journalism, we tend to be fairly old school. Many of us journalism academics are former practicing journalists. Unless we’re interactive designers, we typically don’t have much training in or knowledge of coding languages. And as researchers, we’re rarely trained in computational methods, though we are steeped in discourse analysis methods. Even quantitative studies like content analyses are typically coded by hand. But when looking at news narratives, hand coding makes analyzing trends across large data sets extremely labor-intensive.

For my project, I wanted to look at the differences between a national newspaper (The New York Times) and a local newspaper (The StarTribune) in coverage of the murder of George Floyd. What are the narratives that these papers are presenting to their audiences about the same event? How do they change over time? What meaning are they trying to make and imprint in our collective memory? The sample quickly became overwhelming. In just the first month of coverage, I collected 372 articles (252 New York Times articles and 120 StarTribune articles). There would be no way I could conduct a discourse analysis of a sample that large in a semester.

Enter NLP. As someone who traditionally gravitates toward qualitative methods, I found NLP an intimidating prospect. I spent many days agonizing over which NLP tool to use. Many studies I have read use Google’s Word2Vec. Phil, another CHI fellow, used spaCy. Eventually, I landed on the Natural Language Toolkit (NLTK). As an open-source project, it benefits from the wisdom of crowds: its community is constantly updating and improving it. My goal was to use NLP as a framing tool for further narrative analysis. With a sample as large as this one, it can be overwhelming to “find a way in” to the data. The aim is to use NLP to identify the words most frequently used in relation to George Floyd’s name. These word embeddings (a term I use loosely here for frequently co-occurring words, not the numeric vectors that tools like Word2Vec produce) would provide a quantitative starting point for examining the qualitative data: the narratives media created around Floyd’s murder.
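To make the idea of "words used in relation to George Floyd's name" concrete, here is a minimal sketch of one way to count the words that appear near a target word. It is not the code used for this project: the function names, the window size, the sample sentences, and the short stop word list are all assumptions made for illustration, and it uses only Python's standard library so it runs without any downloads.

```python
import re
from collections import Counter

# Hypothetical stop word list for illustration only; not the project's list.
STOP_WORDS = {"the", "a", "of", "in", "on", "and", "to", "said", "including"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def words_near(tokens, target, window=3):
    """Count non-stop words within `window` tokens of each `target` occurrence.

    A word that sits near two occurrences of the target is counted twice.
    """
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            neighbor = tokens[j]
            if j != i and neighbor != target and neighbor not in STOP_WORDS:
                counts[neighbor] += 1
    return counts

# Made-up sentences standing in for article text.
sample = (
    "Police arrested George Floyd on Monday. "
    "Protesters said police killed Floyd."
)
near_floyd = words_near(tokenize(sample), "floyd", window=3)
print(near_floyd.most_common(3))
```

In NLTK itself, `nltk.FreqDist` is a subclass of `Counter`, so it offers the same `most_common` method and could be swapped in for the counting step.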

Method

Step 1. Using ProQuest Newsstream, I collected all articles from May 25, 2020 to June 1, 2021, using the search phrase “George Floyd” (N = 1,214).

Step 2. The data was collated by publication and publishing date.

Step 3. The 10 most common word embeddings for the articles published within a given month were identified using NLTK’s NLP code. Stop words, such as “said” and “including,” were incorporated into the analysis process. This resulted in 47 word embeddings for the New York Times sample and 35 word embeddings for the StarTribune sample. (See collected data).

Step 4. These embeddings will be used as a “way in” to analyze the data qualitatively. For example, “police” was consistently the most-used word in relation to stories about George Floyd, regardless of publication or publication date. The next step will be to qualitatively examine how these words play into the larger narrative conveyed about the “news event” we refer to as the murder of George Floyd.
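Steps 2 and 3 can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual code: the article records are invented, the function name `top_words_by_month` is hypothetical, and the sketch assumes that stop words are filtered out before counting, which is the usual practice but is not spelled out above. It uses Python's standard library rather than NLTK so that it runs as written.

```python
import re
from collections import Counter, defaultdict
from datetime import date

# Illustrative stop words; "said" and "including" are the examples from Step 3.
# ASSUMPTION: stop words are removed before counting.
STOP_WORDS = {"the", "a", "of", "in", "and", "to", "said", "including"}

# Hypothetical records: (publication, publication date, article text).
articles = [
    ("New York Times", date(2020, 5, 28), "Police said the protests grew in Minneapolis."),
    ("New York Times", date(2020, 5, 30), "Police and protesters clashed, including in Brooklyn."),
    ("StarTribune", date(2020, 5, 27), "Police chief fired the officers."),
    ("StarTribune", date(2020, 6, 2), "The memorial drew crowds to the city."),
]

def top_words_by_month(records, n=10):
    """Group articles by (publication, year-month) and return the n most
    common non-stop words for each group."""
    counts = defaultdict(Counter)
    for publication, published, text in records:
        key = (publication, published.strftime("%Y-%m"))
        tokens = re.findall(r"[a-z']+", text.lower())
        counts[key].update(t for t in tokens if t not in STOP_WORDS)
    return {key: counter.most_common(n) for key, counter in counts.items()}

top = top_words_by_month(articles, n=3)
for (publication, month), words in sorted(top.items()):
    print(publication, month, words)
```

Keying the counts on both publication and month keeps the two newspapers separate, so the monthly word lists can be compared side by side before turning to the qualitative reading described in Step 4.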