Texts from the past meet next-generation tech: The JCB welcomes inaugural STEM Fellow Ethan Meidinger
We often talk here at the JCB about libraries as the original information infrastructure. In a world of proliferating digital data, it has been important to think about how our data structures can best serve access to our collections. A major step for the JCB is our commitment to our digital platform, Americana, and to the complete digitization of our rare books. We have also been exploring interdisciplinary approaches to our data.
This summer Ethan Meidinger joined the team at the John Carter Brown Library as the inaugural STEM Fellow. Ethan is a data science and computer science major at William & Mary who also works as a software developer specializing in optical character recognition (OCR). At the JCB he is working with Associate Director for Digital Asset Management Pedro Germano Leal on a project geared toward improving scholars' ability to analyze texts written in Guaraní, an Indigenous language spoken in Paraguay and elsewhere in South America.
One of the longstanding problems of doing optical character recognition in the humanities is the lack of training data for manuscript and early printed materials. Each book often has its own distinctive way of writing or printing letters, making it difficult to train accurate OCR models from the small samples any one source provides. One way around this is to gather the limited examples that do exist and use them to generate synthetic copies, known as deep fakes: letterforms produced by AI and designed to mimic the patterns of the originals.
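The post does not say which generative technique the project uses; as a rough illustration of the idea, here is a minimal sketch of one common approach, a generative adversarial network (GAN), in PyTorch. Everything in it (the image size, network shapes, and function names) is an illustrative assumption, not the project's actual code.

```python
# A minimal sketch of the "deep fake" idea: a small GAN that learns to
# generate new images of a single letterform from a handful of scanned
# examples. All sizes and names here are illustrative assumptions.
import torch
import torch.nn as nn

LATENT_DIM = 64   # size of the random noise vector fed to the generator
IMG_SIZE = 32     # glyphs rescaled to 32x32 grayscale crops (assumption)

# Generator: random noise in, a fake 32x32 glyph image out.
generator = nn.Sequential(
    nn.Linear(LATENT_DIM, 256), nn.ReLU(),
    nn.Linear(256, IMG_SIZE * IMG_SIZE), nn.Tanh(),  # pixels in [-1, 1]
)

# Discriminator: an image in, one score out (real scan vs. generated fake).
discriminator = nn.Sequential(
    nn.Linear(IMG_SIZE * IMG_SIZE, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1),
)

opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
loss_fn = nn.BCEWithLogitsLoss()

def train_step(real_glyphs: torch.Tensor) -> None:
    """One adversarial update; real_glyphs is a (batch, 32*32) tensor of
    scanned examples of the target letter, normalized to [-1, 1]."""
    batch = real_glyphs.size(0)
    fakes = generator(torch.randn(batch, LATENT_DIM))

    # Discriminator learns to tell real scans from generated glyphs.
    opt_d.zero_grad()
    d_loss = (loss_fn(discriminator(real_glyphs), torch.ones(batch, 1))
              + loss_fn(discriminator(fakes.detach()), torch.zeros(batch, 1)))
    d_loss.backward()
    opt_d.step()

    # Generator learns to produce glyphs the discriminator accepts as real.
    opt_g.zero_grad()
    g_loss = loss_fn(discriminator(fakes), torch.ones(batch, 1))
    g_loss.backward()
    opt_g.step()
```

Because the real scans already contain fading, bleed-through, and scribal variation, a generator trained this way learns to reproduce that variation, and the generated glyphs can then be paired with labels to expand an OCR training set.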
The image above shows generated examples of the letter ȃ as used by Guaraní scribes in the early eighteenth century. Rather than settling on a single pattern, the model accounts for variation in the data by generating many different instances of ȃ, reflecting issues such as ink fading, bleed-through from text on the reverse of the page, and inconsistencies in how the letter was written. The goal of producing these deep fakes is to build a comprehensive training dataset for OCR of Guaraní text, making it easier for scholars to work with items written in Guaraní in their research. We're excited to see the results of Ethan and Pedro's work!