
Read, Hot and Digitized: AI for OCR & Translation

Read, hot & digitized: Librarians and the digital scholarship they love. In this series, librarians from UTL’s Arts, Humanities and Global Studies Engagement Team briefly present, explore and critique existing examples of digital scholarship. Our hope is that these monthly reviews will inspire critical reflection on, and future creative contributions to, the growing fields of digital scholarship.


The foundation of digital humanities is data.  Lots of it.

As the early phases of AI have shown us, there is a staggering amount of textual data available to manipulate and compute, both openly available and behind paywalls. All too often, however, digital scholarly textual data in non-English languages and non-Roman scripts lacks depth and accessibility. Rather than be left behind or constrained by these lacunae, individual scholars are working to generate their own digital research corpora, often building upon AI tools.

Recently I was introduced to the MITRA project and have been nothing short of amazed.

A research project from the University of California, Berkeley’s AI Research Lab, MITRA “focuses on bridging the linguistic divide between ancient wisdom source languages and contemporary languages through the application of advanced Deep Learning and AI technologies.” Using Gemini APIs, MITRA builds upon an extensive digitized text corpus and contributions from translators and researchers alike to “harness AI technologies to promote the scholarly study and personal practice of the dharma and to accelerate academic and individual research through open-source collaboration on datasets, models and applications.” In so doing, MITRA aims to “overcome the challenges inherent in low-resource language translation,” to “minimize language barriers,” and to create “more equitable access to literature and wisdom.”

I have engaged with OCR and digital text conversion for years but have always found it to be a labor-intensive and ultimately less-than-satisfying (or accurate) experience, especially for non-Roman languages and scripts. Of interest to me, therefore, is how MITRA has harnessed AI so that one can drag and drop PDF files into the tool, at which point it can both detect the language (Sanskrit and other Devanagari-based languages, Tibetan, scriptural Chinese or English) and use OCR to produce a relatively accurate text file. That in itself is pretty amazing. From there, however, one can quickly transliterate the text, or translate and/or explain it in Sanskrit, Buddhist and Modern Chinese, Russian, Korean, Japanese, German, French, Italian, Hindi or Spanish.

To test it out, I grabbed a small amount of openly accessible text from HathiTrust. I chose an early Hindi novel, Rāmalāla Varmmā’s Banārasī Dupaṭṭā Yā Gularū Zarīnā from 1916, which is readily available in PDF form on HathiTrust. I took the first page of the novel, which looks like this:

Page one of Banārasī Dupaṭṭā Yā Gularū Zarīnā from HathiTrust

I then put a PDF of that page into MITRA to see if it could OCR the text. Despite some blurriness in the original source text, it most certainly could (even if not with 100% accuracy):

MITRA’s OCR of page one of Banārasī Dupaṭṭā Yā Gularū Zarīnā

Encouraged, I then asked MITRA both to transliterate the text (that is, to convert it from Devanagari script to Roman script) and to translate it, which it also did quite quickly and easily.
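For readers curious about what transliteration involves, here is a minimal sketch in Python. It is not MITRA's method, which is not described here; it is a simple rule-based illustration that maps Devanagari characters to IAST-style Roman characters. The lookup tables are deliberately incomplete, the function name `transliterate` is my own, and the sketch keeps every inherent "a" vowel rather than applying the schwa deletion of spoken Hindi.

```python
# A minimal, incomplete Devanagari-to-Roman (IAST-style) sketch.
# Assumption: these tables cover only a subset of the script.
VOWELS = {"अ": "a", "आ": "ā", "इ": "i", "ई": "ī", "उ": "u",
          "ऊ": "ū", "ए": "e", "ऐ": "ai", "ओ": "o", "औ": "au"}
MATRAS = {"ा": "ā", "ि": "i", "ी": "ī", "ु": "u", "ू": "ū",
          "े": "e", "ै": "ai", "ो": "o", "ौ": "au"}
CONSONANTS = {"क": "k", "ख": "kh", "ग": "g", "घ": "gh", "च": "c",
              "छ": "ch", "ज": "j", "ट": "ṭ", "ठ": "ṭh", "ड": "ḍ",
              "त": "t", "थ": "th", "द": "d", "ध": "dh", "न": "n",
              "प": "p", "फ": "ph", "ब": "b", "भ": "bh", "म": "m",
              "य": "y", "र": "r", "ल": "l", "व": "v", "श": "ś",
              "ष": "ṣ", "स": "s", "ह": "h"}
SIGNS = {"ं": "ṃ", "ः": "ḥ"}  # anusvara and visarga
VIRAMA = "्"  # suppresses a consonant's inherent "a"


def transliterate(text):
    """Convert Devanagari text to Roman script, character by character."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            if nxt == VIRAMA:
                i += 1  # skip the virama; no vowel is written
            elif nxt in MATRAS:
                out.append(MATRAS[nxt])  # dependent vowel replaces "a"
                i += 1
            else:
                out.append("a")  # inherent vowel
        elif ch in VOWELS:
            out.append(VOWELS[ch])
        elif ch in SIGNS:
            out.append(SIGNS[ch])
        else:
            out.append(ch)  # pass through spaces, punctuation, unknowns
        i += 1
    return "".join(out)


print(transliterate("बनारसी दुपट्टा"))
```

Even this toy version shows why the task is harder than a one-to-one character swap: vowels attach to consonants, and conjuncts such as the doubled consonant in "dupaṭṭā" depend on the virama.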

Ever more optimistic, I then clicked on “English explained” and MITRA was also quite adept at parsing the translated text, the original script of the text, and the grammar and vocabulary. 

MITRA’s “English Explained” of page one of Banārasī Dupaṭṭā Yā Gularū Zarīnā

I repeat, I stand amazed.

While MITRA has clearly captured my attention and my appreciation, I will note that there are other similar projects currently available and equally commendable, from Andrew Ollett’s Indological and OCR tools (and fabulous related explanations) to Tyler Neill’s toolkit, Skrutable.

Likewise, UT Libraries is here to help you explore producing your own digital content for research. The Scan Tech Studio in the PCL Scholars Lab has the hardware and software you might need to convert print into digital texts, as well as a group of specialists to help you. We have online guides that introduce the practices and concepts of OCR, as well as recordings from OCR workshops.

I encourage anyone interested in exploring non-English or non-Roman digital texts to jump in, kick the tires, and have some fun with these impressive conversion projects.

An Adventure in El Paso, Texas

One of my favorite parts of being a librarian is the opportunity to participate in community engagement projects. So when the opportunity arose to work with Albert A. Palacios on a traveling exhibit as one of my rotations, I immediately said yes. The exhibit was a collaboration with the University of Texas at El Paso’s C.L. Sonnichsen Special Collections Department, which was especially exciting for me as a UTEP alumnus. This is part of a long-standing partnership made possible by a U.S. Department of Education National Resource Center grant. Our exhibit brought together holdings from the Benson Latin American Collection, the C.L. Sonnichsen Special Collections, and the Municipal Archive of Saltillo in a joint physical and digital exhibit about the Mexican Revolution.

A Fight for Democracy exhibit at UTEP
Intertwined Destinies: El Paso and Northern Mexico exhibit at UTEP.

Albert and I traveled to El Paso in May 2025 to finally see the fruits of our labor. When we got to the library’s third floor, Claudia Rivers (Director of the C.L. Sonnichsen Special Collections) was hard at work putting the finishing touches on her exhibit. The U.S.-Mexico border played a big role in the Mexican Revolution, which means that UTEP has a lot of special objects in its archives. One of these objects is a commemorative cigar from when Porfirio Díaz and William Howard Taft met at the border in 1909. It was an incredible experience to see these objects firsthand, and to have people from the community view them as well.

The next day was dedicated to digital scholarship workshops for local scholars. We had participants from all over the El Paso-Juárez region, and an archivist even drove three hours from Alpine to attend! Elisabet Takehana, Director of UTEP’s Center of the Digital Humanities, taught stylometry using the stylo package in R. Sergio Morales, LLILAS Benson Digital Scholarship Graduate Research Assistant and Latin American Studies Master’s student, taught the ArcGIS Online and StoryMaps tools for presenting spatial research, using the official photographs from Mexico’s 1910 independence centennial celebration. And finally, I taught how to use Voyant Tools and UDPipe for text analysis, using telegrams between Francisco Villa and Lázaro de la Garza. By the end of the day, participants had gotten hands-on experience with all of these different digital humanities tools and processes.

Sergio Morales teaching ArcGIS Online and StoryMaps tools.
Ana A. Rico teaching text analysis.
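
For anyone who missed the text analysis session, the core idea behind a term-frequency view can be sketched in a few lines of Python. This is not how Voyant Tools or UDPipe are implemented; it is a simplified illustration, and the sample sentence, the stopword list, and the function name `term_frequencies` are all invented for the example (the sample is not taken from the Villa telegrams).

```python
import re
from collections import Counter


def term_frequencies(text, stopwords=frozenset()):
    """Count how often each word appears, ignoring case and stopwords."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(t for t in tokens if t not in stopwords)


# Invented sample text and a tiny, hypothetical Spanish stopword list.
sample = "Llegaron las tropas. Las tropas esperan órdenes."
STOPWORDS = frozenset({"las", "los", "el", "la", "de"})

freqs = term_frequencies(sample, STOPWORDS)
for word, count in freqs.most_common(3):
    print(word, count)
```

Dedicated tools add much more on top of raw counts, such as lemmatization and part-of-speech tagging, but counting words after removing stopwords is a useful first look at any corpus.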

After the workshops, we headed upstairs to the third floor once again for the exhibit opening. The exhibit curated by Claudia Rivers was incredible, showcasing a silk print of Porfirio Díaz, a camera from the early 1900s, and portraits of Francisco I. Madero and his wife, which were taken by an El Paso photographer. Though our exhibit didn’t get there in time for the opening (Albert and I learned how to roll with the punches), we were able to direct people to the digital version of the exhibit. All in all, it was a day full of learning and celebration, as well as making connections with scholars in the area.

People viewing exhibits during the opening reception.

Finally, on the third day, our exhibit arrived and we put it up for students, faculty, and the public to enjoy! It was a joy to share the Benson Latin American Collection with a wider audience. The exhibit, A Fight for Democracy: The First Years of the Mexican Revolution, will be displayed at UTEP for the summer and then travel to the El Paso Border Heritage Center in the fall. A second copy will circulate through the Austin Public Library later this year.

Albert A. Palacios and Ana A. Rico in front of their exhibit.

Acknowledgements
This initiative would not have been possible without the support of the following individuals and sponsors:

C.L. Sonnichsen Special Collections Department, The University of Texas at El Paso
● Claudia Rivers, Head
● Susannah Holliday, Assistant Head
● Gina Stevenson, Photo and Processing Archivist

Center of the Digital Humanities, The University of Texas at El Paso
● Elisabet Takehana, Director

Municipal Archive of Saltillo
● Olivia Strozzi, Director
● Iván Vartan Muñoz Cotera, Head of Outreach

LLILAS Benson Latin American Studies and Collections
● Melissa Guy, Director, Benson Latin American Collection
● Ryan Lynch, Head of Special Collections
● Jennifer Mailloux, Graphic Designer (special thanks)
● Adela Pineda Franco, LLILAS Director & Lozano Long Endowed Professor
● Theresa Polk, Head of Digital Initiatives
● Ramya Iyer, Grants and Contracts Specialist
● Susanna Sharpe, Communications Coordinator (special thanks)
● Cindy Garza, Accountant
● Leah Long, Administrative Manager

Sponsors
● U.S. Department of Education National Resource Center Title VI Grant
● LLILAS Benson Collaborative Funds