Recently, I read a very short but interesting article about a new project that seeks to improve the methodology for organizing, classifying and examining Cultural Heritage corpora. It is a collaboration between researchers at the University of Lausanne, Harvard Houghton Library, Groningen University, and Bibliotheca Hertziana, and the goal is to digitize and computationally reorganize the Charles Sanders Peirce archive at Houghton Library.
If you are interested in reading the article, you can find it here:
Citation: Picca et al. (2023) “Exploring the Automated Analysis and Organization of Charles S. Peirce’s PAP Manuscript”. In 34th ACM Conference on Hypertext and Social Media (HT ’23), September 4–8, 2023, Rome, Italy. ACM, New York, NY, USA, 4 pages. https://doi.org/10.1145/3603163.3609066
Basically, the researchers are interested in refining the NLP techniques needed to assist digitization and organization efforts when a large corpus has unknown aspects to its structure. In this case, they are dealing with the 100,000 pages making up the Peirce archive at Harvard University, where much remains unknown about the chronological order and structure of the manuscript pages (i.e., how the pages relate to one another in terms of intellectual development or theme). Their preliminary investigative work, the subject of their article, focusses on the “Prolegomena to an Apology for Pragmaticism” (PAP). A version of this work was published in The Monist in 1906 and you can find that here:
Citation: Peirce, C. S. (1906) “Prolegomena to an Apology for Pragmaticism,” The Monist, Vol. 16, No. 4 (October): pp. 492-546. https://www.jstor.org/stable/27899680
The researchers have begun with the PAP and will then use their findings to engage with the remaining corpus. The goal of their textual analysis was primarily to identify semantic relationships that exist throughout the text. Semantic proximity between manuscript pages was estimated using vector representations of words computed by spaCy, and network analysis (Relation Extraction) was performed via REBEL. The researchers stated in closing that, “we have successfully identified semantic connections between different sections of the PAP document. The resulting cartographic visualization provides an intuitive representation of the manuscript’s structure and the relationships between its various parts.”
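To make the idea of "semantic proximity between pages" a bit more concrete, here is a minimal sketch of one common way to compute it: average the word vectors on each page and compare the averages with cosine similarity. This is my own illustration, not the configuration used in the article. The tiny vector table, the page texts, and the function names are all made up for the example. With spaCy and a model that ships word vectors, `doc.vector` and `doc.similarity` would play the roles of `page_vector` and `cosine` below.

```python
import math

# Toy 3-dimensional word vectors, invented purely for illustration.
# Real word vectors (e.g. from a spaCy model) have hundreds of dimensions.
TOY_VECTORS = {
    "sign": [0.9, 0.1, 0.0],
    "symbol": [0.8, 0.2, 0.1],
    "icon": [0.7, 0.3, 0.0],
    "graph": [0.1, 0.9, 0.2],
    "diagram": [0.2, 0.8, 0.1],
    "line": [0.0, 0.7, 0.3],
}


def page_vector(text):
    """Average the vectors of the known words on a page; unknown words are skipped."""
    vecs = [TOY_VECTORS[w] for w in text.lower().split() if w in TOY_VECTORS]
    if not vecs:
        return [0.0, 0.0, 0.0]
    return [sum(column) / len(vecs) for column in zip(*vecs)]


def cosine(u, v):
    """Cosine similarity of two vectors; defined as 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical "pages": two about signs, one about diagrams.
pages = {
    "p1": "sign symbol icon",
    "p2": "symbol sign",
    "p3": "graph diagram line",
}
vectors = {name: page_vector(text) for name, text in pages.items()}

sim_p1_p2 = cosine(vectors["p1"], vectors["p2"])
sim_p1_p3 = cosine(vectors["p1"], vectors["p3"])
print(f"p1 vs p2: {sim_p1_p2:.3f}")
print(f"p1 vs p3: {sim_p1_p3:.3f}")
```

In this toy setup the two pages about signs come out closer to each other than either is to the page about diagrams, which is the kind of signal one would hope to see between thematically related manuscript pages.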
Sounds cool.
So, I’ve decided that over the course of this Spring Semester, I will attempt to replicate this type of study. I am intrigued by the project aims and the methodology. I think a pretty natural question to ask is: how accurate is such a methodology? Clearly, the better we understand its accuracy, the better we can refine the procedures. Although I’m unsure at the moment how to quantify accuracy, conceptually, my idea is as follows: suppose I were to take a large document with a known and well-articulated structure and then treat it as though it were a disordered collection with unknown relationships (like the previously discussed Peirce collection). Would the techniques outlined in Picca et al. (2023) provide me with enough information to reconstruct anything approaching the actual (and known) structure? That is, could I use this approach to put the document back together and if so, how well could that be done?
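As a first stab at the "how well" part, one candidate measure (an assumption on my part, not something taken from the article, and not a method I have settled on) is the fraction of page pairs that a reconstruction places in the same relative order as the original. The function name and the example orderings below are hypothetical.

```python
from itertools import combinations


def pairwise_order_agreement(true_order, predicted_order):
    """Fraction of item pairs whose relative order matches the true order.

    Both arguments are sequences containing the same items exactly once.
    Returns 1.0 for a perfect reconstruction and 0.0 for a fully reversed one.
    """
    position = {item: i for i, item in enumerate(predicted_order)}
    # combinations() yields each pair (a, b) with a before b in the true order.
    pairs = list(combinations(true_order, 2))
    if not pairs:
        return 1.0
    agree = sum(1 for a, b in pairs if position[a] < position[b])
    return agree / len(pairs)


true_pages = ["a", "b", "c"]
perfect = pairwise_order_agreement(true_pages, ["a", "b", "c"])
one_swap = pairwise_order_agreement(true_pages, ["a", "c", "b"])
reversed_score = pairwise_order_agreement(true_pages, ["c", "b", "a"])
print(perfect, one_swap, reversed_score)
```

This is closely related to Kendall's tau. It only speaks to linear order, so it would say nothing about whether thematic groupings were recovered, and that part would need a different measure.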
The documents that I will be using for the study are the collected papers of David Lewis:
Citation: David Lewis. 1983. Philosophical Papers Volume 1. New York: Oxford U.P.; David Lewis. 1986. Philosophical Papers, Volume 2. New York: Oxford U.P.
I have chosen these for three reasons: 1) as a linguist, I’m familiar with the content; 2) I need something that approaches the kind of complexity found in the Peirce manuscripts (i.e., technical philosophical work); and 3) these papers have already been digitized and organized, which saves me some time. Having said that, I’d like to plug the work of The Philosophy Data Project: https://philosophydata.com/ where I found the Lewis papers. Here, you will find other interesting NLP projects related to Philosophy. There are a variety of materials located at GitHub Philosophy NLP & Text Classification: https://github.com/kcalizadeh/phil_nlp.
The nice part about a project like this is that even if you are unsuccessful in your primary goal, you learn a lot about your tools and refine your understanding of both the questions posed and the assumptions guiding the methods employed to tackle them. My next post, in February, will be an update on my progress.
Thanks for reading.