Database Decisions – Cultural Heritage Informatics Initiative

After pivoting my project to create a searchable text database of collective bargaining agreements (CBAs) instead of a speculative, collaborative map with local union members, my next step was to decide how to actually create this database.

To build or not to build?

There are two main types of databases: relational and non-relational. Relational databases, which are typically queried with Structured Query Language (SQL), can be thought of as a well-organized filing cabinet where data is stored in tables with rows and columns. Because these databases require a predefined structure and use a specific language to manage and retrieve data, they follow strict rules that keep data consistent and accurate. In contrast, NoSQL databases are non-relational and are more like a storage room. They don’t require data to be stored in tables with rows and columns, and there aren’t predetermined relationships between different data items. As a result, NoSQL databases are especially good at handling unstructured or semi-structured data, like text from union contracts, because they can accommodate variety and complexity without needing a predefined structure.
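To make the filing cabinet versus storage room comparison concrete, here is a minimal Python sketch that stores a contract article both ways. It is only an illustration: the table name, field names, and sample values are hypothetical, and a plain list of dictionaries stands in for a document database such as MongoDB.

```python
import sqlite3

# Relational style: every row must fit columns that are declared up front.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE articles ("
    "id INTEGER PRIMARY KEY, "
    "union_name TEXT NOT NULL, "
    "university_type TEXT NOT NULL, "
    "title TEXT NOT NULL, "
    "body TEXT NOT NULL)"
)
conn.execute(
    "INSERT INTO articles (union_name, university_type, title, body) "
    "VALUES (?, ?, ?, ?)",
    ("Example Union", "public", "Article 1: Wages", "Placeholder text."),
)
rows = conn.execute(
    "SELECT title FROM articles WHERE university_type = ?", ("public",)
).fetchall()

# Document style (hypothetical stand-in for a NoSQL collection):
# each record can carry different fields, with no schema declared in advance.
documents = [
    {
        "union_name": "Example Union",
        "university_type": "public",
        "title": "Article 1: Wages",
        "body": "Placeholder text.",
    },
    {
        "union_name": "Another Example Union",
        "title": "Appendix A",
        "side_letters": ["Placeholder letter."],  # field the first record lacks
    },
]
public_titles = [
    doc["title"] for doc in documents if doc.get("university_type") == "public"
]
```

The relational table would reject a row that is missing a required column, while the document list accepts the second record even though its fields differ from the first. That flexibility is the trade-off described above.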

Initially, I considered building a database from the ground up using a NoSQL database like MongoDB. This strategy would have the advantage of building my technical skills and would allow me to retain control over the data itself. While confidentiality isn’t a particular concern here, since collective bargaining agreements are publicly archived with the US Department of Labor and the universities themselves, I am concerned about losing access, or losing affordable access, when using a software as a service (SaaS) platform. Ultimately, after researching MongoDB tutorials, I decided against this route, simply because I wouldn’t have enough time to learn how to create a database, learn how to integrate search functionality into my website, and build the rest of the project before the end of the semester. That said, I may return to this idea in the future and redo the Graduate Labor Rising site to use a MongoDB document database.

After deciding not to create the database from scratch, I looked at SaaS options. There are many available, but most of them are tailored to internal business uses. This presented two challenges. First, because they are designed as internal products, many database tools either don’t support external search, providing access to the data only for logged-in users, or offer an unpleasant user experience for outside visitors. While I want visitors to be able to search the CBAs easily, I don’t want them to need to create an account or to have editing privileges. Second, some tools that do allow external search without editing privileges are prohibitively expensive. This is an academic project, and while I hope that the site will be widely useful to graduate unions across the U.S., I don’t have sustainable funding to maintain an expensive text database for multiple years.

Luckily, on the recommendation of Dr. Watrall, I’ve decided to use Baserow.io to build the text database. This is an open-source, no-code tool that allows me to embed a searchable (but not editable) view of my database into my own site using an iframe. Additionally, I can use image headers to provide some stylistic customization. So far, this tool has been working well. Visitors to Graduate Labor Rising will be able to conduct a full-text keyword search of contract articles, as well as filter by selected features, such as public versus private universities.
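For readers curious about what a keyword search combined with a filter does conceptually, here is a small Python sketch. It does not show how Baserow implements search, and it is not part of the site. The function name, field names, and sample records are all hypothetical.

```python
def search_articles(articles, keyword, university_type=None):
    """Return articles whose text contains the keyword (case-insensitive),
    optionally restricted to one university type."""
    needle = keyword.casefold()
    results = []
    for article in articles:
        # Apply the filter first, if one was requested.
        if university_type is not None and article["university_type"] != university_type:
            continue
        # Then check for a simple substring match on the article text.
        if needle in article["text"].casefold():
            results.append(article)
    return results


# Hypothetical sample records, not real contract language.
sample_articles = [
    {"university_type": "public", "text": "Placeholder text about Health Insurance."},
    {"university_type": "private", "text": "Placeholder text about health insurance."},
    {"university_type": "public", "text": "Placeholder text about grievance procedures."},
]

all_matches = search_articles(sample_articles, "health insurance")
public_matches = search_articles(sample_articles, "health insurance", "public")
```

Searching without a filter returns both health insurance records, while adding the public filter narrows the results to the first one.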

Curation decisions

Because storage is limited with this tool, I’ve had to make decisions about which contract sections to include. Collective bargaining agreements are legal documents, and as such, they often share boilerplate features that aren’t particularly noteworthy. For example, each CBA typically begins with a definition of who is included in the bargaining unit. Since the National Labor Relations Board has ruled that graduate students at private universities may unionize and since employees at public universities are subject to state laws about unionization eligibility, this information isn’t particularly helpful on a broad scale, and I’ve omitted these sections from the database. In addition to deciding which CBA sections to include, I’ve also needed to decide which contracts to incorporate first, given the limited time to complete this project. Since the goal of the overall project is to highlight union victories, I’ve chosen to prioritize contracts from unions that have won particularly strong victories, such as the Graduate Employees Union’s funding for professional learning communities at Michigan State University. Coupled with the longform narrative on the project homepage, these selected CBAs will form the core of the project as I continue to expand the database.