High School Students Helped an AI Learn to Read Old Handwritten Texts

In Italy, 120 high school students helped solve a centuries-old problem: how to give researchers access to the Vatican Secret Archives, a massive collection of documents detailing the Vatican's activities as far back as the eighth century.

That should look pretty great on their college applications.

The shelves of the Vatican Secret Archives are about 85 kilometers (53 miles) long and house 35,000 volumes of catalogues. But the documents that researchers have scanned and uploaded take up less than an inch. Transcribed documents searchable via computer? Even rarer. That's because the Vatican seems not to have wanted to share the information. Not that it could, anyway: even today's optical-character-recognition (OCR) software simply can't handle the irregularities of the handwritten text.

So if researchers want to view the documents, they have no choice but to visit the Archives in person (assuming the Vatican approves their request for access).

Now, a team of researchers from the Archives and Roma Tre University has a research project designed to address this problem, and it is using artificial intelligence (AI) to transcribe the documents. Their research was published in ERCIM News, the magazine for the European Research Consortium for Informatics and Mathematics.


The problem: computers aren't the best at reading human handwriting. So the first step in the so-called In Codice Ratio project was for the students to help train the AI. Using an online platform built by the researchers, the students "voted" on whether a handwritten character sampled from two pages of the Vatican Registers (a collection of letters to and from the Pope) matched variations of a character identified by paleographers (experts who study old handwriting).

For example, a student might see what looked like a handwritten letter M, accompanied by a series of expert-approved, handwritten M's. If the student thought the sampled character matched the expert examples closely enough, they voted "Yes." If not, "No." Enough "Yes" votes, and that handwritten character received a label: M. It took the 120 students just a few hours to work through the entire training set.
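The voting step can be pictured with a short Python sketch. It is only an illustration: the thresholds `min_votes` and `min_yes_ratio`, the function names, and the shape of `vote_log` are all assumptions, since the article does not say how many "Yes" votes the researchers required.

```python
def label_from_votes(votes, min_votes=3, min_yes_ratio=0.7):
    """Decide whether a candidate label is accepted.

    votes: list of booleans, one per student ("Yes" = True).
    min_votes and min_yes_ratio are made-up thresholds for illustration.
    """
    if len(votes) < min_votes:
        return False
    yes = sum(1 for v in votes if v)
    return yes / len(votes) >= min_yes_ratio


def assign_labels(vote_log):
    """vote_log maps (sample_id, candidate_label) -> list of votes."""
    labels = {}
    for (sample_id, candidate), votes in vote_log.items():
        if label_from_votes(votes):
            labels[sample_id] = candidate
    return labels


# Hypothetical vote log: two character images, both proposed as "M".
vote_log = {
    ("img1", "M"): [True, True, True, False],
    ("img2", "M"): [True, False, False],
}
labels = assign_labels(vote_log)
```

With these invented votes, only `img1` would end up labeled M; `img2` stays unlabeled because too few students agreed.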

But the AI needed more training. Next, the researchers taught their AI to identify the handwritten characters using a method they called "jigsaw segmentation."

Instead of looking at the handwriting as a series of words, or even a combination of letters, the AI looked for strokes. For example, a handwritten M wouldn't look like one character; it would be three strokes close together. Based on what it knew from the data set produced by the high schoolers, these strokes could be M, or perhaps III.

To help the AI "read" these strokes, the researchers fed it a data set of 1.5 million words in Latin, the language in which the texts are written. Then, when it saw the three strokes, it could determine they probably denoted an M, and not III, since the latter wasn't likely to appear in a Latin word.
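A toy version of this disambiguation, in Python, might look like the following. It is not the researchers' model: it swaps in a plain whole-word frequency lookup, and the tiny corpus, the function names, and the `segment_options` structure are assumptions made for the example.

```python
from collections import Counter
from itertools import product


def build_word_counts(corpus_text):
    """Count how often each word appears in a (Latin) corpus."""
    return Counter(corpus_text.lower().split())


def best_reading(segment_options, word_counts):
    """Pick the most frequent word among all possible readings.

    segment_options: one list of candidate strings per group of strokes,
    e.g. [["m", "iii"], ["a"], ...]. Unseen words count as 0, so if no
    reading is known the first combination is returned.
    """
    candidates = ("".join(parts) for parts in product(*segment_options))
    return max(candidates, key=lambda word: word_counts[word])


# Stand-in corpus; the project used about 1.5 million Latin words.
counts = build_word_counts("mater dei gratia mater ecclesia")

# The first group of strokes could be read as "m" or as "iii".
options = [["m", "iii"], ["a"], ["t"], ["e"], ["r"]]
reading = best_reading(options, counts)
```

Here `reading` comes out as "mater", because "iiiater" never appears in the corpus. A real system would more likely score sequences of characters rather than look up whole words, so treat this as the idea in miniature.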

When the researchers tested their AI using four pages of the Vatican Registers, it correctly transcribed 65 percent of the words. That's nowhere near perfect, but it's not useless, either. According to the researchers, these transcriptions are accurate enough to provide paleographers with "a solid basis" that could expedite the transcription process. And they're already working to improve the system.

That would be particularly helpful because the Vatican only grants access to something like three documents per day. So a researcher might *think* they know what documents they want to see and visit the Vatican just to realize those documents aren’t helpful.

If everything is transcribed, perhaps researchers worldwide could eventually search the entire collection for a keyword (“Michelangelo,” or something) and see what documents include it, then ask for access to those. Or, perhaps, get the information they need from the Vatican Secret Archives without taking a trip to Vatican City.
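For a sense of what such a search could look like once transcriptions exist, here is a minimal Python sketch of a keyword index. The document IDs and texts are invented, and nothing here reflects an actual Archives system.

```python
import re
from collections import defaultdict


def build_index(transcriptions):
    """Map each lowercase word to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in transcriptions.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(doc_id)
    return index


def search(index, keyword):
    """Return the sorted IDs of documents that mention the keyword."""
    return sorted(index.get(keyword.lower(), set()))


# Hypothetical transcriptions keyed by made-up document IDs.
transcriptions = {
    "reg_001": "Letter concerning Michelangelo and the chapel",
    "reg_002": "Grant of land to the abbey",
}
index = build_index(transcriptions)
hits = search(index, "Michelangelo")
```

In this example `hits` contains only `reg_001`, which is the document a researcher would then request. With transcriptions that are right about 65 percent of the time, a practical search would also need to tolerate misspelled words, which this sketch does not attempt.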
