We can always do more Natural Language Semantics – Cultural Heritage Informatics Initiative

In my last blog post, I discussed a few details about my current CH/DH NLP research project where I intend to investigate to what extent a corpus can be “reconstructed” using NLP techniques. Since then, I’ve done some practical work and some broader thinking. I’ll start with updates on the practical.

As my project calls for NLP analysis, I’ve set myself to learning those tools. I’ve started working with spaCy and have found William J.B. Mattingly’s textbook very helpful: Mattingly, William J.B. 2023. Introduction to Python for Humanists. Chapman & Hall/CRC the Python Series. Boca Raton: CRC Press, DOI: 10.1201/9781003342175. His YouTube channel has also been useful: https://www.youtube.com/pythontutorialsfordigitalhumanities.

I’ve jumped into the primary literature and have been reading about what other researchers are pursuing in the world of information extraction. While I enjoy reading articles that genuinely constitute Cultural Heritage/Digital Humanities work and involve applications of NLP, as a formal semanticist, I cannot resist looking at research where the primary focus is on refining the techniques themselves. Recently, I looked at an article that had a little bit of both and I’d like to recommend it.

Lassner, David et al. 2023. Domain-Specific Word Embeddings with Structure Prediction. Transactions of the Association for Computational Linguistics. vol 11: 320-335.

In the article, the researchers propose some refined methods for analyzing word embeddings, specifically improvements to Word2Vec (W2V) procedures. They show that, via the new methods, a more accurate domain-specific “semantics” can be obtained. They then suggest that the refined techniques could benefit fields such as computational literary studies by helping to identify “historical person networks, knowledge distributions and intellectual circles.” To demonstrate this, they conduct a small study using their improved W2V procedures on the “lemmatized versions” of “high literature” texts from the German Text Archives in order to uncover author relationships. They claim to do just that.

(Disclaimer: The following comments are not directed at Lassner et al. (2023), which I suggest people read because it’s interesting. But I have some general thoughts on my survey of the literature so far.)

This is a very interesting paper. However, as with many NLP articles, it isn’t always clear to me what the researchers believe that they are uncovering about machine understanding of Natural Language “meaning.” That is, I’m not always sure what assumptions the researchers have about Natural (Human) language semantics and frequently, I don’t seem to be able to figure out how it is that whatever they’ve shown then links up with anything that Linguists and the Cognitive Science community have demonstrated are the genuine problems and research areas. Formal Semantics has a tool kit for carving up and tackling Semantic problems and framing those discussions but I don’t often see those types of things employed in NLP papers.

For example, I’m interested in seeing what we can do about improving machine understanding of inferences (who isn’t?). And when I read a cool paper like the one I cited above, I quickly drift away from the data in the paper and onto Semantics puzzles. I want to know things like: how can I use what’s outlined in Lassner et al. (2023) to tackle data like that given below in sentences (1) and (2)?

In (1) and (2), the approximative adverb “almost” appears alongside a quantified noun phrase (QNP), which has been bracketed. One thing that we might hope to have a way of predicting is the following pattern: when the approximative directly modifies the verb phrase (appears immediately to its left) and the QNP is high, we have (at least) one inference about the verbal antonym. Specifically, we have an assertion of positive sentiment that contributes the information that all the students failed. By positive/affirmative, I mean that there are no overt pieces of negation like “no,” “not,” or “never.” On the other hand, if we raise the adverbial leftward above the QNP, then we have a different inference. Although still a positive sentiment, the information contributed is that very few students failed.

(1) [All of the students] almost passed the exam. = all of them failed
(2) Almost [all of the students] passed the exam. = many of the students passed = very few failed
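The pattern in (1) and (2) can be sketched as a toy rule in Python. This is only an illustration under strong assumptions, not a method from Lassner et al. (2023) or from any NLP library: the QNP span is annotated by hand rather than produced by a parser, and the function name `almost_target` and the `INFERENCE` table are hypothetical names introduced here.

```python
# Toy sketch: read off the inference from where "almost" sits relative to a
# hand-annotated QNP. Nothing here is learned; the rule is stipulated.

def almost_target(tokens, qnp_span):
    """Return which constituent "almost" modifies: "QNP" or "VP".

    tokens   : list of lowercase word strings
    qnp_span : (start, end) token indices of the QNP, end exclusive
    """
    start, end = qnp_span
    idx = tokens.index("almost")
    if idx + 1 == start:
        # "almost" sits immediately to the left of the QNP, as in (2)
        return "QNP"
    if idx >= end:
        # "almost" follows the QNP, at the left edge of the verb phrase, as in (1)
        return "VP"
    raise ValueError("position of 'almost' not covered by this toy rule")

# Stipulated inferences for the two example sentences (assumption, per the text)
INFERENCE = {
    "VP": "all of the students failed",
    "QNP": "very few of the students failed",
}

s1 = "all of the students almost passed the exam".split()
s2 = "almost all of the students passed the exam".split()

print(INFERENCE[almost_target(s1, (0, 4))])
print(INFERENCE[almost_target(s2, (1, 5))])
```

The sketch makes the difficulty concrete: the two sentences contain exactly the same words, so any representation that ignores word order and structure cannot tell the two inferences apart.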

This, of course, means that “almost” is very different from something like “quickly,” which has a very free distribution with limited or no effect on the meaning contribution regardless of its position. In other words, it means the same thing no matter where you place it. In (3), the parentheses show all the potential spots. For this example sentence, there are four.

(3) (Quickly), John (quickly) ran (quickly) down the street (quickly).

Adverbials are a fascinating area that allows us to think about the connections that exist between form and meaning, and to what degree these things are flexible. There are even greater problems than those shown above. For example, the scope that one adverb takes over another and the interaction of their meanings is not always what one might predict, and it can also show us where dictionary-type definitions fail us. Again, in the case of “almost,” when we look this adverb up, we find it listed as synonymous with “nearly” and/or “not quite.” Certainly, you can create examples that appear to corroborate this. I provide one in (4). No matter which adverb you choose, the sentence in (4) means roughly that I got a gift from most, but not all, of my friends.

(4) Almost/Nearly/Not quite [all of my friends] bought me a gift. = a few friends didn’t buy me a gift

But there are numerous ways of demonstrating that the dictionaries’ characterization of these adverbs is inaccurate. One way to do this is to create a construction where an Evaluative adverb like “fortunately” scopes over one of these items. This is provided in (5) and (6).

(5) Fortunately, almost all of my friends visited me. = it’s good that most friends came to visit
(6) Fortunately, not quite all of my friends visited me. = it’s good that not all friends came to visit

Now, if we take the meaning contribution of “almost” and “not quite” to be a two-part meaning, given in (7), that amounts to something like “close to X but not X,” then it’s not obvious why our Evaluative adverb “fortunately” interacts with opposite pieces of the content in (5) and (6). In the case of (5), it’s fortunate that close to X occurred, but in (6) it’s fortunate that not X was what occurred.

(7) almost/not quite = Close to X but not X, e.g. close to all my friends but not all my friends.
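To make the puzzle in (5) through (7) concrete, here is a minimal Python sketch. It is an illustration only: the class `TwoPartMeaning`, the field names `proximal` and `polar`, and the `EVALUATIVE_TARGET` table are hypothetical labels introduced here, not an analysis proposed in the post or in the cited paper.

```python
from dataclasses import dataclass


@dataclass
class TwoPartMeaning:
    proximal: str  # the "close to X" part of (7)
    polar: str     # the "not X" part of (7)


def approximative(adverb, x):
    # Per (7), both adverbs receive exactly the same two components,
    # so the adverb argument plays no role in the result.
    return TwoPartMeaning(proximal=f"close to {x}", polar=f"not {x}")


# Stipulated by hand from the judgments in (5) and (6): which component
# "fortunately" comments on. Nothing in (7) predicts this table.
EVALUATIVE_TARGET = {"almost": "proximal", "not quite": "polar"}


def fortunately(adverb, x):
    meaning = approximative(adverb, x)
    return "it is fortunate that: " + getattr(meaning, EVALUATIVE_TARGET[adverb])


x = "all of my friends visited me"
print(fortunately("almost", x))
print(fortunately("not quite", x))
```

The two adverbs get identical meanings from `approximative`, yet `fortunately` has to consult a separate, hand-written table to produce the right reading. That extra table is exactly the information that the definition in (7) leaves out.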

To be clear, the examples that I’ve provided are not cases of ambiguity or vagueness. These are predictable and fixed interactions: (5) can never mean (6). And given that these adverbs largely share a distribution, we’re going to have to do some very careful thinking about how to handle these kinds of cases, which are ubiquitous. So, it seems to me that regardless of the NLP study undertaken, we ought to be as reflective as possible, and pause and ask ourselves: what do we think we’ve learned about machine understanding of Natural Language Semantics, and why do we think we’ve learned that? I absolutely understand that not all NLP/Digital Humanities/Cultural Heritage work is about the questions I’ve raised, but the exciting thing is, it could always be a little part.