Recent years have witnessed explosive growth in the volume of research publications (Hanson et al., 2024). To maintain the basic tenets of scholarship, stakeholders such as funders and publishers are increasingly introducing policies to promote research best practices. For example, the 2022 Nelson Memo directed federal agencies to revise their policies so that the outputs of federally funded research are made publicly available without embargo. Concurrent with the evolution of these policies, research institutions are innovating and developing the infrastructure needed to support researchers, of which the libraries are an essential component.
These stakeholders, and various subgroups within them, have a range of interests in tracking the publication of research outputs. To make data-driven decisions about which services we provide in the libraries and how we provide them, we need data about our research community. There is a long history of tracking the publication of articles and books, and the infrastructure for doing so is relatively well developed (e.g., Web of Science, Scopus, Google Scholar). In this regard, we are well positioned to continue monitoring these outputs in line with the Nelson Memo's stipulations for immediate public access. However, the Nelson Memo also stipulated that the research data supporting publications must be shared publicly. Compared to open access publishing, open sharing of data is less developed both culturally and structurally, which makes it all the more important to develop a workflow for gathering data on this front.
Predictably, the infrastructure for tracking the sharing of data is not nearly as well developed as that for articles or books. While some of this is likely due to the relative lack of emphasis on data publishing, there are other reasons why tracking data isn't as easy for motivated parties. Journals, despite wide variation in style and formatting, adhere to relatively uniform metadata standards. This is largely because their products are homogeneous across disciplines: primarily peer-reviewed research articles typeset into PDFs. That uniformity allows proprietary solutions like Web of Science and Scopus to harvest vast amounts of metadata (through CrossRef) and make it available in a readily usable format, with relatively little work required to format, clean, or transform it. In contrast, research data are published in a wide variety of formats, ranging from loosely structured text-based documents like letters or transcripts to objects with complex or structured formatting like geospatial and genomic data. As a result, there can be significant differences between the platforms that host and publish research data, including whether metadata and file support is general or discipline-specific, the level of detail in author information, the use of persistent identifiers like DOIs, and curation and quality assurance measures (or the lack thereof).

While a few proprietary solutions are beginning to emerge that purport to track institutional research data outputs (e.g., Web of Science), these products have notable shortcomings, including significant cost, difficulty in assessing the completeness of retrieval, and limits on the number of records that can be retrieved. To create a more sustainable and transparent solution, the Research Data Services team has developed a Python-based workflow that uses a number of publicly accessible APIs for data repositories and DOI registries. The code for running this workflow has been publicly shared through the UT Libraries GitHub at https://github.com/utlibraries/research-data-discovery so that others can also use this open approach to gathering information about research data outputs from user-defined institutions; the code will continue to be maintained and expanded to improve coverage and accuracy. To date, the workflow has identified more than 3,000 dataset publications by UT Austin researchers across nearly 70 different platforms, ranging from generalist repositories that accept any form of data, like Dryad, figshare, and Zenodo, to highly specialized repositories like the Digital Rocks Portal (for visualizing porous microstructures), DesignSafe (for natural hazards), and PhysioNet (for physiological signal data).
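
As a concrete illustration of the general approach (a minimal sketch, not the full workflow in the repository linked above), the code below queries the DataCite REST API for dataset-type DOIs whose creators list a UT Austin affiliation. The endpoint and filters follow DataCite's public documentation, but the query syntax and the single affiliation string are simplifying assumptions; the production workflow queries multiple sources and handles many more cases.

    """Minimal sketch: query the DataCite REST API for dataset DOIs whose
    creators list a given affiliation. Illustrative only; field names and
    query syntax are assumptions based on DataCite's documented search."""
    import requests

    DATACITE_API = "https://api.datacite.org/dois"
    AFFILIATION = '"University of Texas at Austin"'  # one of several name variants

    def fetch_datasets(affiliation_query: str) -> list[dict]:
        """Page through DataCite results for dataset-type DOIs matching the query."""
        records = []
        params = {
            "query": f"creators.affiliation.name:{affiliation_query}",
            "resource-type-id": "dataset",   # restrict to the 'Dataset' resource type
            "page[size]": 100,
            "page[cursor]": 1,               # cursor-based pagination
        }
        url = DATACITE_API
        while url:
            resp = requests.get(url, params=params)
            resp.raise_for_status()
            payload = resp.json()
            records.extend(payload.get("data", []))
            # later pages are followed via the 'next' link returned by the API
            url = payload.get("links", {}).get("next")
            params = None
        return records

    if __name__ == "__main__":
        results = fetch_datasets(AFFILIATION)
        print(f"Retrieved {len(results)} dataset records")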

This work is still very much in progress. Perhaps equally important to the data we were able to obtain are the data we suspect exist but were unable to retrieve via our workflow (e.g., we didn't retrieve any UT-affiliated datasets from the Qualitative Data Repository, even though we are an institutional member). The variation we observed in metadata schemas, cross-walks, and quality is similarly informative, and it can help shape our strategies for providing guidance on the importance of high-quality metadata. For example, this process relies on proper affiliation metadata being recorded and cross-walked to DataCite. Some repositories simply don't record or cross-walk any affiliation metadata, making it essentially impossible to identify which, if any, of their deposits are UT-affiliated. Others record the affiliation in a field that isn't the actual affiliation field (e.g., in the same field as the author name); some even record the affiliation as an author. All of this is on top of the complexity introduced by the multiple ways in which researchers record their university affiliation (UT Austin, University of Texas at Austin, the University of Texas at Austin, etc.).
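
To make this concrete, the sketch below checks a DataCite record's creators against a small list of UT Austin name variants, falling back to the author-name field for repositories that cross-walk the affiliation there. The variant list, field handling, and fallback logic are illustrative assumptions rather than the workflow's exact rules.

    """Illustrative sketch of matching UT Austin affiliation variants in
    DataCite creator metadata. The variant pattern is deliberately small;
    the actual workflow handles many more spellings and edge cases."""
    import re

    # Common ways researchers record the affiliation (non-exhaustive)
    UT_VARIANTS = re.compile(
        r"(university of texas at austin|ut austin|univ\.? of texas,? austin)",
        re.IGNORECASE,
    )

    def is_ut_affiliated(record: dict) -> bool:
        """Return True if any creator's affiliation (or, as a fallback, the
        creator name itself) matches a known UT Austin variant."""
        creators = record.get("attributes", {}).get("creators", [])
        for creator in creators:
            for affil in creator.get("affiliation", []):
                # affiliations may be plain strings or objects with a 'name' key
                name = affil.get("name", "") if isinstance(affil, dict) else str(affil)
                if UT_VARIANTS.search(name):
                    return True
            # some repositories cross-walk the affiliation into the author name field
            if UT_VARIANTS.search(creator.get("name", "")):
                return True
        return False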

We also have to account for variation in the granularity of the objects that receive a persistent identifier (PID). For example, in our Texas Data Repository (TDR), which is built on Dataverse software, a dataset and each of its constituent files all receive unique DOIs, and each file is also recorded as a 'dataset' because the metadata schema used by the DOI minter, DataCite, doesn't currently support a 'file' resource type. We thus have to account for raw retrieval results that would initially inflate the number of datasets in TDR by at least two orders of magnitude. The inverse situation occurs in Zenodo, which assigns a parent DOI that always resolves to the most recent version, while each version of an object also receives its own DOI; as a result, all Zenodo deposits have at least two DOIs, even if they are never updated.
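
One way to handle this over-granularity is to map every retrieved DOI back to a canonical parent before counting. The heuristic sketched below assumes that file-level DOIs carry an 'IsPartOf' related identifier pointing at their parent dataset and that Zenodo version DOIs carry an 'IsVersionOf' relation pointing at the concept DOI; real records vary, so this is an illustration of the idea rather than the workflow's exact logic.

    """Rough sketch of collapsing over-granular DOIs into one record per
    dataset, using DataCite relatedIdentifiers. The relation types assumed
    here ('IsPartOf' for file DOIs, 'IsVersionOf' for version DOIs) are a
    heuristic; the production workflow applies additional checks."""

    def canonical_doi(record: dict) -> str:
        """Map a DataCite record to the identifier we actually want to count."""
        attrs = record.get("attributes", {})
        doi = attrs.get("doi", "")
        for rel in attrs.get("relatedIdentifiers", []):
            rel_type = rel.get("relationType")
            rel_id = rel.get("relatedIdentifier", "")
            if rel_type == "IsPartOf" and rel_id:      # e.g., a file DOI -> its dataset DOI
                return rel_id.lower()
            if rel_type == "IsVersionOf" and rel_id:   # e.g., a version DOI -> its concept DOI
                return rel_id.lower()
        return doi.lower()

    def deduplicate(records: list[dict]) -> dict[str, dict]:
        """Keep one record per canonical DOI so files and versions are not
        counted as separate datasets."""
        unique: dict[str, dict] = {}
        for rec in records:
            unique.setdefault(canonical_doi(rec), rec)
        return unique
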
The custom open source solution that we have developed using Python, one of the most widely used programming languages (per GitHub), offers the flexibility to overcome the challenges posed by differences between data repositories and by variations in the metadata provided by researchers. Our approach also avoids the shortcomings of proprietary solutions: it is transparent, so users can understand exactly how dataset information is retrieved, and it is available at no cost to anyone who might want to use it. In many ways, this workflow embodies the best practices that we encourage researchers to adopt: open, freely available, transparent processes. It also allows others (at UT or beyond) to adopt our workflow and, if necessary, adapt it for their own purposes.