Read, hot & digitized: Librarians and the digital scholarship they love — In this series, librarians from UTL’s Arts, Humanities and Global Studies Engagement Team briefly present, explore and critique existing examples of digital scholarship. Our hope is that these monthly reviews will inspire critical reflection of and future creative contributions to the growing fields of digital scholarship.
The foundation of digital humanities is data. Lots of it.
As the early phases of AI have shown us, there is a staggering amount of textual data available to manipulate and compute, both openly available and behind paywalls. All too often, however, the depth and accessibility of digital scholarly textual data in non-English languages and non-Roman scripts are lacking. Rather than be left behind or constrained by these lacunae, individual scholars are working to generate their own digital research corpora, often building upon AI tools.
Recently I was introduced to the MITRA project and have been nothing short of amazed.
A research project from the University of California-Berkeley’s AI Research Lab, MITRA “focuses on bridging the linguistic divide between ancient wisdom source languages and contemporary languages through the application of advanced Deep Learning and AI technologies.” Using Gemini APIs, MITRA builds upon an extensive digitized text corpus and contributions from translators and researchers alike to “harness AI technologies to promote the scholarly study and personal practice of the dharma and to accelerate academic and individual research through open-source collaboration on datasets, models and applications.” In so doing, MITRA aims to “overcome the challenges inherent in low-resource language translation,” to “minimize language barriers,” and to create “more equitable access to literature and wisdom.”
I have engaged with OCR and digital text conversion for years but have always found it to be a labor-intensive and ultimately less-than-satisfying [or accurate] experience, especially for non-Roman languages and scripts. Of interest to me, therefore, is how MITRA has harnessed AI to allow one to drag and drop PDF files into the tool, at which point it can both detect the language (Sanskrit & other Devanagari-based languages, Tibetan, scriptural Chinese or English) and use OCR to produce a relatively accurate text file. That in itself is pretty amazing. From there, however, one can quickly transliterate the text, translate it, and/or have it explained in Sanskrit, Buddhist & Modern Chinese, Russian, Korean, Japanese, German, French, Italian, Hindi or Spanish.
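For readers curious about what "detecting the language" can involve at its simplest, here is a minimal Python sketch that guesses the dominant script of a piece of text from Unicode code point ranges. This is only an illustration and an assumption on my part: it is not how MITRA works internally (the project does not describe its method here), it identifies scripts rather than languages, and the function name `guess_script` and the range table are hypothetical.

```python
from collections import Counter

# Unicode ranges for the scripts mentioned above. Illustrative only:
# each script is reduced to a single range, which is not exhaustive
# (e.g. Latin Extended Additional and CJK extensions are left out).
SCRIPT_RANGES = {
    "Devanagari": (0x0900, 0x097F),
    "Tibetan": (0x0F00, 0x0FFF),
    "Chinese": (0x4E00, 0x9FFF),
    "Latin": (0x0041, 0x024F),
}


def guess_script(text):
    """Return the name of the most frequent script in text, or None."""
    counts = Counter()
    for ch in text:
        # Count letters only, so digits, spaces, punctuation and
        # combining vowel signs do not influence the result.
        if not ch.isalpha():
            continue
        code_point = ord(ch)
        for name, (low, high) in SCRIPT_RANGES.items():
            if low <= code_point <= high:
                counts[name] += 1
                break
    if not counts:
        return None
    return counts.most_common(1)[0][0]


print(guess_script("बनारसी दुपट्टा"))
print(guess_script("Banārasī Dupaṭṭā"))
```

A real tool has to go much further, since Hindi and Sanskrit share the Devanagari script and scanned pages must pass through OCR before any text exists to inspect, but the sketch shows why script is an easy first signal.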
To test it out, I grabbed a small amount of openly accessible text from HathiTrust: an early Hindi novel, Rāmalāla Varmmā’s Banārasī Dupaṭṭā Yā Gularū Zarīnā from 1916, which is readily available there in PDF form. I took the first page of the novel, which looks like this:
I then put a PDF of that page into MITRA to see if it could OCR the text. Despite some blurriness in the original source text, it most certainly could, even if not with 100% accuracy:
Encouraged, I then asked MITRA both to transliterate the text (that is, to convert it from Devanagari script into Roman script) and to translate it, which it also did quite quickly and easily:


Ever more optimistic, I then clicked on “English explained” and MITRA was also quite adept at parsing the translated text, the original script of the text, and the grammar and vocabulary.
I repeat, I stand amazed.
While MITRA has clearly captured my attention and my appreciation, I will note that there are other similar projects currently available and equally commendable, from Andrew Ollett’s Indological and OCR tools [and fabulous related explanations] to Tyler Neill’s toolkit, Skrutable.
Likewise, the UT Libraries is here to help explore the production of your own digital content for research. The Scan Tech Studio in the PCL Scholars Lab has the hardware and software you might need to convert print into digital texts, as well as a group of specialists to help you. We have online guides to introduce the practices and concepts of OCR as well as recordings from OCR workshops.
I encourage anyone interested in exploring non-English or non-Roman digital texts to jump in, kick the tires, and have some fun with these impressive conversion projects.


