Read, Hot and Digitized: More is less? Less is more? Minimal computing in South Asian Lexicography

Read, hot & digitized: Librarians and the digital scholarship they love — In this series, librarians from the UT Libraries Arts, Humanities and Global Studies Engagement Team briefly present, explore and critique existing examples of digital scholarship. Our hope is that these monthly reviews will inspire critical reflection of, and future creative contributions to, the growing fields of digital scholarship.

I recently had the good fortune to catch Nickoal Eichmann-Kalwara’s presentation on the University of Colorado’s Digital El Diario project at the UC San Diego Digital Initiatives Symposium, in which she advocated for the use of “minimal computing” to achieve “archival justice.” I was deeply inspired by her comments but woefully ignorant of the corpus on minimal computing within DS/DH (which seems to combine an activist turn and a digital turn on the “less process, more product” concept in archival work). So I took it upon myself to learn more, as I struggle with the constant, nagging tension between achieving the immediate task at hand (“will a simple Google chart effectively communicate my point?”), exploiting technologies to their fullest extent (“boy, I sure bet I would impress folks if I used a sexy Tableau dashboard”), and justifying resources (“this will cost how much??”). When, I wondered, is less actually more in DS/DH, when is more actually more, and how should we negotiate those differences?

Way back in 2017, Roopika Risam and Susan Edwards argued (in “Micro DH: Digital Humanities at the Small Scale”) that the fixation on everything “large” is not conducive to justice across our institutions, our staff, or our data:

“Digital humanities practices are often understood in terms of significant scale: big data, large data sets, digital humanities centers… This emphasis leads to the perception that projects cannot be completed without substantial access to financial resources, data, and labor… While this can be the case, such presumptions serve as a deterrent to the development of an inclusive digital humanities community with representation across academic hierarchies (student, librarian, faculty), types of institutions (public, private, regional), and geographies (Global North, Global South).”

I found their argument compelling and wondered where I had seen these tensions in practice. As a South Asianist, I had to look no further than the uniquely colonial way of knowing (lexicography) and the uniquely 21st-century way of access (digital reformatting).

For over 20 years, the Digital Dictionaries of South Asia (part of the Digital South Asia Library at the University of Chicago) has arguably been the gold standard for online South Asian language dictionaries. Recognizing the inadequacies of OCR tools to convert images of most South Asian scripts to accurate text data, the DDSA has utilized strategies such as “double blind keying” to produce highly accurate digital editions of established and respected dictionaries. The process is time-consuming and expensive but produces trusted full-text data that can be used and manipulated in a variety of ways, including those beyond dictionaries. The institutional positioning of the University of Chicago has allowed for many successful grants over the years to fund DDSA, including those from the US Department of Education, the Mellon Foundation, the Association of Research Libraries and others. The DDSA is truly extensive in scope and in impact.
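To make the idea of double keying concrete, here is a minimal sketch, assuming the common form of the technique in which two people type the same page independently and only their disagreements are sent to a reviewer. The function name and the sample entry are hypothetical, and this is not a description of the DDSA's actual workflow.

```python
from difflib import SequenceMatcher


def keying_conflicts(first_pass, second_pass):
    """Return (first, second) pairs wherever two independent keyings disagree."""
    a = first_pass.split()
    b = second_pass.split()
    # Compare word by word; autojunk is turned off so that frequently
    # repeated words are not silently ignored by the matcher.
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    conflicts = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            conflicts.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return conflicts


# Invented sample entry, keyed twice; the second typist made one slip.
print(keying_conflicts("kitabu n. a book", "kitabu n. a bock"))
```

Where the two passes agree, the text is accepted without further review. A human only has to look at the flagged pairs, which is what makes the method accurate but labor-intensive.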

At the other end of the spectrum is the DigitalRoses project. In this pilot, an individual researcher, Gil Ben-Herut, Professor of Religious Studies at the University of South Florida, presents another approach to digital dictionary making. Rather than seeking a fully searchable, text-mineable dictionary, Ben-Herut suggests that simple encoding that operationalizes headwords alone (rather than the full text) for navigation within a dictionary is sufficient for most user applications. Using target words, the DigitalRoses approach “resolves a common problem in OCR text ingestion through the utilization of manual indexing of the first entry word on each page in physical media, [thereby]… ingesting dictionaries at a fraction of the time and cost of full digitization,… streamlining searching by allowing partial, wildcard and fuzzy searches, and maintaining the richness of the printed layout.”
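The headword-only idea can be sketched in a few lines. This is a minimal illustration, assuming an index that records just the first headword of each scanned page, in the dictionary's sort order. The romanized words, page numbers, and function names are invented for the example and are not taken from DigitalRoses itself.

```python
from bisect import bisect_right
from difflib import get_close_matches
from fnmatch import fnmatchcase

# Hypothetical index: (first headword on the page, page number),
# kept in the same order that Python uses to compare strings.
PAGE_INDEX = [
    ("abala", 1),
    ("kada", 2),
    ("kita", 3),
    ("mane", 4),
    ("pustaka", 5),
]
FIRST_WORDS = [first for first, _ in PAGE_INDEX]


def page_for(word):
    """Return the page whose first headword is the last one <= word."""
    pos = bisect_right(FIRST_WORDS, word)
    # A word that sorts before every indexed headword goes to the first page.
    return PAGE_INDEX[max(pos - 1, 0)][1]


def wildcard_pages(pattern):
    """Return indexed headwords (with pages) matching a shell-style pattern."""
    return [(w, p) for w, p in PAGE_INDEX if fnmatchcase(w, pattern)]


def fuzzy_headword(word):
    """Return the closest indexed headword, or None if nothing is similar."""
    matches = get_close_matches(word, FIRST_WORDS, n=1)
    return matches[0] if matches else None


print(page_for("kitabu"))        # the page image a reader would be sent to
print(wildcard_pages("ki*"))
print(fuzzy_headword("pustak"))
```

The reader is taken to a page image rather than to a transcribed entry, so the only text that ever needs to be typed is one word per page. Note that the wildcard and fuzzy helpers here only see the indexed first words, not every entry in the dictionary.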

In comparison, then, we have two approaches to the same problem and therefore two solutions. See, for example, a search for the Kannada word for “book,” Kitaba/ಕಿತಾಬು, in the DDSA version of Kittel’s Kannada-English Dictionary and in the DigitalRoses version.

The thoroughly meticulous approaches used in the DDSA model produce a robust and unique digital experience built on fully manipulatable, multiscript data, while the simple imaging and only partial inputting of the DigitalRoses project produce a quick digital surrogate for the analog counterpart.

Turning back to “minimal computing,” these two projects offer up models to complicate our understanding of who gets to do what and how in our technologically informed research.  Grant funding allows for big data and big research at big institutional levels.  Minimal computing allows individuals and less resourced cohorts to also meaningfully contribute to the field.  Both approaches have the potential to positively impact users and the creation of new knowledge. 

I encourage you to consider where you fall in this debate: Is less more? Is more more? And when does it matter?


For more on minimal computing, justice through DS/DH, lexicography, and Kannada, see:

Constance Crompton, Richard J. Lane and Ray Siemens, eds.  Doing digital humanities: practice, training, research (London; New York: Routledge, 2016)

Howard Jackson, ed. The Bloomsbury Handbook of Lexicography (London: Bloomsbury Academic, 2022)

Ferdinand Kittel and Mariappa Bhatt. Kittel’s Kannaḍa-English dictionary. (Madras: University of Madras, 1968-1971)

Roopika Risam. New digital worlds: postcolonial digital humanities in theory, praxis, and pedagogy (Evanston, Illinois: Northwestern University Press, 2019)

Read, Hot, and Digitized: Adventures in Data-Sitting


It will come as no surprise that I, the English Literature Librarian, was a nerdy little bookworm as a child. I actively participated in the Book It! reading program, a literacy initiative sponsored by Pizza Hut. The premise of Book It! was simple: After completing five books and getting the sign-off from my teacher, I would “earn” a coupon for a personal pan pizza. When I was in 5th grade, I read enough Baby-Sitters Club (BSC) books in a single week to earn three pizzas. I felt a tinge of guilt because I had skipped early chapters in each book where the text was reused, word-for-word, from previous books in the series. It was always Chapter 2!

Every devoted Baby-Sitters Club fan knows the text was reused to introduce the characters and the premise of the series. There were over 200 books published in the span of 13 years, so of course some of it would be repetitive! But let’s take it a step further. What if we could quantitatively demonstrate the reuse of Chapter 2 text, while also comparing stylistic and narrative changes across multiple ghostwriters and cultural trends? And how would you do this kind of analysis of 200+ novels, spin-offs, and graphic novel adaptations? Well, a feminist collective of scholars called the Data-Sitters Club (DSC) is attempting to do just that.
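For a sense of what "quantitatively demonstrating reuse" can look like, here is a minimal sketch, assuming one simple measure: the share of overlapping three-word sequences between two passages (Jaccard similarity over word trigrams). The function names and the sample sentences are invented for illustration, and this is not necessarily the method the Data-Sitters Club uses.

```python
def shingles(text, n=3):
    """Return the set of n-word sequences that occur in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def reuse_score(passage_a, passage_b, n=3):
    """Jaccard similarity of word n-grams: 0.0 means no overlap, 1.0 identical."""
    a = shingles(passage_a, n)
    b = shingles(passage_b, n)
    if not a or not b:
        # Passages shorter than n words have nothing to compare.
        return 0.0
    return len(a & b) / len(a | b)


# Invented sentences standing in for two "Chapter 2" passages.
first = "the club meets on mondays at five"
second = "the club meets on fridays at five"
print(reuse_score(first, second))
```

Run over every pair of chapters in a corpus, a score like this would make the word-for-word Chapter 2 passages stand out from chapters that merely share a topic.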

Cover art for the Data-Sitters Club, by artist Claire Chenette

The Data-Sitters Club describe their project as “a fun way to learn about computational text analysis for digital humanities”. They created a corpus of Ann M. Martin’s influential young adult series and have analyzed it using a variety of DH methods and tools (Python, R, TEI, Voyant, just to name a few). The Baby-Sitters Club has had a long pop culture shelf-life for Gen X and Millennial readers, with the recent Netflix reboot (which was sadly canceled after two seasons) and the podcasts Stuck in Stonybrook and the Baby-Sitters Club Club. According to the publisher Scholastic, the series has been in print since 1986 and has sold more than 190 million copies. Given the series’ immense popularity and continued pop culture influence, the books are a gold mine for researchers interested in gender, race, class, and sexuality, but, like much of girl culture, the books haven’t been the subject of serious research.

So the Data-Sitters Club saw an opportunity for new research, while also making DH more accessible, especially to women and other marginalized groups often sidelined in DH projects. The DSC does this through a series of 16 blog posts on their GitHub site, written to mimic the narrative style of the book series, including titles that riff on the originals. Each blog post covers a use case for the BSC corpus and features a different tool, coding language or technique. Two of my favorites are DSC #2: Katia and the Phantom Corpus and DSC #5: The DSC and the Impossible TEI Quandaries. (A running joke throughout the blog is that later posts refer the reader back to “Chapter 2” to explain the corpus and how it was created, an intentional reference to the Chapter 2 in the original series that reused text to explain the series’ premise.)

Cover art for DSC #2: Katia and the Phantom Corpus, which parodies an original Baby-Sitters Club book cover that I’m pretty sure I read in 3rd or 4th grade. Image courtesy of the Data-Sitters Club

One thing you won’t find on the DSC GitHub site is the corpus itself. The team scanned print books to create a legal corpus, but as of right now, it’s not available publicly online. The DSC has used the project as an advocacy tool to promote the loosening of ebook copyright restrictions to build literary corpora for private research. In partnership with the non-profit Authors Alliance, they wrote to the Librarian of Congress asking for exemptions to the Digital Millennium Copyright Act of 1998 to access the full BSC corpus. Of all the DSC blog posts, I found DSC #7: The DSC and the Mean Copyright Law to be the most fascinating, and the most frustrating.

I would recommend the Data-Sitters Club blog to any emerging DH scholar or librarian looking to try a new tool or method. Much of the content is highly technical, but the fun, approachable tone of each blog post makes the content accessible. I hope they are able to get legal access to the full ebook corpus so we can see more research on the Baby-Sitters Club books and better understand their cultural impact on a generation of women and girls.

You can find print copies of the original Baby-Sitters Club series in the PCL Youth Collection, and I highly recommend the recent essay collection We Are the Baby-Sitters Club: Essays and Artwork from Grown-up Readers, available at the PCL.  

Meng-Fen Su Honored with Emerita

In recognition of her 20 years of excellent service to the UT Libraries, President Jay Hartzell granted former East Asian Studies Liaison Librarian Meng-fen Su “Emerita Status,” an honorary designation conferred upon retirees to recognize their contributions and accomplishments over their university careers.

Meng-fen came to UT Libraries in 2000 after serving as a cataloger at Ohio State and at Harvard-Yenching Library. 

During her tenure at the university, the East Asian Studies collection more than doubled in size (from 91,000 volumes to 190,000) and was carefully curated to create a more representative balance between Chinese, Japanese and Korean materials.

Her reserved demeanor belies the fact that she was an expert at networking to bolster resources. For example, Meng-fen established the first Taiwan Resource Center for Chinese Studies in 201 by building a collaboration between the Libraries and the National Central Library in Taiwan. She submitted multiple successful grant applications to garner support for both physical (Reference Materials Distribution Program) and digital (Korean Studies e-Resources grants) materials from the Korea Foundation, and also forged relationships to receive publications from research institutes throughout East Asia, including the Academia Sinica, the National Museum of Taiwan Literature, Waseda University and the Korean Film Council, to name a few.

Congratulations and thanks to Meng-fen Su for her devotion to her work on behalf of the university and to the Libraries.