Category Archives: Open Science

Navigating the Data Landscape: An Open Source Workflow

Recent years have witnessed explosive growth in the volume of research publications (Hanson et al., 2024). In order to maintain the basic tenets of scholarship, stakeholders such as funders and publishers are increasingly introducing policies to promote research best practices. For example, the 2022 Nelson Memo directed federal agencies that dispense at least $100m in research funding to revise policies around making the outputs of federally funded research available. Concurrent with the evolution of these policies, research institutions are innovating and developing the necessary infrastructure to support researchers, for which the libraries are an essential component.

These stakeholders and various subgroups within them have a range of interests in tracking the publishing of research outputs. In order to make data-driven decisions around what services we provide in the libraries and how we provide them, we need data about our research community. There is a long history of tracking publication of articles and books, and the infrastructure for doing so is relatively well-developed (e.g., Web of Science, Scopus, Google Scholar). In this regard, we are well-positioned to continue monitoring these outputs in line with the new stipulations for immediate public access in the Nelson Memo. However, the Nelson Memo also stipulated that the research data supporting publications need to be shared publicly. Compared to open access publishing, open sharing of data is less developed culturally and structurally, which makes it all the more important to develop a workflow to begin to gather data on this front.

Predictably, the infrastructure for tracking the sharing of data is not nearly as well-developed as that for articles or books. While some of this is likely due to the relative lack of emphasis on data publishing, there are a variety of reasons why tracking data isn’t quite as easy for motivated parties. Journals, in spite of wide-ranging aesthetic and syntax standards, have relatively uniform metadata standards. In large part, this is because of the homogeneity of their products, across disciplines, which are primarily peer-reviewed research articles that are typeset into PDFs. This allows proprietary solutions like Web of Science and Scopus to harvest vast amounts of metadata (through CrossRef) and to make it available in a readily usable format with relatively little work required to format, clean, or transform. In contrast, research data are published in a wide variety of formats, ranging from loosely structured text-based documents like letters or transcripts to objects with complex or structured formatting like geospatial data and genomic data. As a result, there can be significant differences between platforms that host and publish research data, ranging from general to discipline-specific metadata and file support, level of detail in author information, use of persistent identifiers like DOIs, and curation and quality assurance measures (or lack thereof).

Comparison of annual volume of dataset publications. ‘All’ refers to the volume across all discovered repositories and is compared to our institutional repository, the Texas Data Repository, and two common generalists, Dryad and Zenodo.

While a few proprietary solutions are beginning to emerge that purport to be able to track institutional research data outputs (e.g., Web of Science), these products have notable shortcomings, including significant cost, difficulty assessing thoroughness of retrieval, and limited number of retrievals. In order to create a more sustainable and transparent solution, the Research Data Services team has developed a Python-based workflow that uses a number of publicly accessible APIs for data repositories and DOI registries. The code for running this workflow has been publicly shared through the UT Libraries GitHub at https://github.com/utlibraries/research-data-discovery so that others can also utilize this open approach to gathering information about research data outputs from user-defined institutions; the code will continue to be maintained and expanded to improve coverage and accuracy. To date, the workflow has identified more than 3,000 dataset publications by UT Austin researchers across nearly 70 different platforms, ranging from generalist repositories that accept any form of data like Dryad, figshare, and Zenodo to highly specialized repositories like the Digital Rocks Portal (for visualizing porous microstructures), DesignSafe (for natural hazards), and PhysioNet (for physiological signal data).
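To give a sense of what querying a DOI registry by affiliation can look like, here is a minimal sketch against the public DataCite REST API. It is illustrative only and is not taken from the repository linked above; the function names are hypothetical, and the exact-phrase search on a single spelling of the institution name is a simplifying assumption.

```python
from urllib.parse import urlencode

DATACITE_DOIS_ENDPOINT = "https://api.datacite.org/dois"


def build_affiliation_query(affiliation_name, page_size=100):
    """Build a DataCite REST API URL for datasets whose creators list an affiliation.

    Hypothetical helper: searches for one exact spelling of the institution name.
    """
    params = {
        # Exact-phrase search on the creators' affiliation names.
        "query": f'creators.affiliation.name:"{affiliation_name}"',
        # Restrict results to the general resource type "dataset".
        "resource-type-id": "dataset",
        # Ask for affiliations as structured objects rather than bare strings.
        "affiliation": "true",
        "page[size]": page_size,
        # Start cursor-based paging, which suits large result sets.
        "page[cursor]": 1,
    }
    return f"{DATACITE_DOIS_ENDPOINT}?{urlencode(params)}"


def summarize_record(record):
    """Pull a few fields out of one item from the response's "data" list."""
    attrs = record.get("attributes", {})
    titles = attrs.get("titles") or [{}]
    return {
        "doi": attrs.get("doi"),
        "publisher": attrs.get("publisher"),
        "year": attrs.get("publicationYear"),
        "title": titles[0].get("title"),
    }
```

The returned URL can be fetched with any HTTP client, with each item in the JSON response's "data" list passed to summarize_record. In practice a query like this would need to be repeated for every spelling of the institution name and supplemented with repository-specific APIs.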

Horizontal bar chart comparing the total number of UT-Austin-affiliated datasets published in different repositories. Only repositories with at least 30 datasets are individually listed; the remainder are grouped into an 'Other' category. The Texas Data Repository has the most discovered datasets (nearly 1,250), followed by Dryad, Zenodo, Harvard Dataverse, the aggregated 'other', ICPSR, figshare, DesignSafe, Mendeley Data, the Digital Rocks Portal, and EMSL. No repository other than the Texas Data Repository has more than 400 datasets.
Comparison of total number of dataset publications between repositories. Only repositories with more than 30 UT-affiliated publications are depicted individually; all others are grouped into ‘Other.’

This work is still very much in progress. Just as important as the data we were able to obtain are the data that we suspect exist but could not retrieve via our workflow (e.g., we didn’t retrieve any UT-affiliated datasets from the Qualitative Data Repository, even though we are an institutional member). So too is the variation in metadata schemas, cross-walks, and quality, which can help to inform our strategies around providing guidance on the importance of high-quality metadata. For example, this process relies on proper affiliation metadata being recorded and cross-walked to DataCite. Some repositories simply don’t record or cross-walk any affiliation metadata, making it essentially impossible to identify which, if any, of their deposits are UT-affiliated. Others record the affiliation in a field that isn’t the actual affiliation field (e.g., in the same field as the author name); some even record the affiliation as an author. All of this is on top of the complexity introduced by the multiple ways in which researchers record their university affiliation (UT Austin, University of Texas at Austin, the University of Texas at Austin, etc.).

Horizontal bar chart comparing the frequency of different name permutations of UT Austin that were entered in UT Austin datasets. A total of eight different permutations were detected, ranging from 'University of Texas at Austin' to 'UT Austin.' The most common is to use 'at Austin' rather than some form of punctuation like a comma or hyphen instead of 'at.'
Comparison of the frequency of different permutations of ‘UT Austin’ that were entered as affiliation metadata in discovered datasets.
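One way to cope with these permutations is to match affiliation strings against a tolerant pattern rather than a single exact name. The sketch below is a minimal illustration and not the matching logic from our repository; the pattern and function names are hypothetical, and it covers only the kinds of variants mentioned above ('at', a comma, a hyphen, or the 'UT' abbreviation).

```python
import re
from collections import Counter

# Hypothetical pattern; a production list should be built from the
# affiliation strings actually observed in repository metadata.
UT_AUSTIN_PATTERN = re.compile(
    r"""
    \b(?:
        university\s+of\s+texas            # full institution name
        (?:\s+at\s+|\s*[,\-]\s*|\s+)       # "at", comma, hyphen, or only a space
        austin
      |
        ut[\s\-]*austin                    # "UT Austin", "UT-Austin"
    )\b
    """,
    re.IGNORECASE | re.VERBOSE,
)


def is_ut_austin(affiliation):
    """Return True if the affiliation string contains a recognized UT Austin variant."""
    return bool(affiliation) and UT_AUSTIN_PATTERN.search(affiliation) is not None


def tally_variants(affiliations):
    """Count how often each matching spelling occurs, ignoring surrounding whitespace."""
    return Counter(a.strip() for a in affiliations if is_ut_austin(a))
```

Because the pattern is searched rather than matched against the whole string, it also catches values where the affiliation shares a field with other text, such as a department name. It cannot recover affiliations that a repository never recorded.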

We also have to account for variation in the granularity of objects, particularly those that receive a persistent identifier (PID). For example, in our Texas Data Repository (TDR), which is built on Dataverse software, both a dataset and each of its constituent files receive a unique DOI. Each file is also recorded as a ‘dataset’ because the metadata schema used by the DOI minter, DataCite, doesn’t currently support a ‘file’ resource type. We thus have to account for a raw data output that will initially inflate the number of datasets in TDR by at least two orders of magnitude. The inverse of this is Zenodo, which assigns a parent DOI that always resolves to the most recent version, with each version of an object getting its own DOI (so all Zenodo deposits have at least two DOIs, even if they are never updated).
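A rough sketch of how both kinds of inflation might be corrected is shown below. It rests on two assumptions that should be checked against real metadata: that a Dataverse file DOI is its dataset's DOI plus one extra path segment, and that each Zenodo version record points to its parent DOI through an 'IsVersionOf' related identifier. The record shape (a flat dictionary with 'doi' and 'relatedIdentifiers' keys) and the function names are hypothetical simplifications, not the structures used in our repository.

```python
def drop_file_level_dois(records):
    """Remove records whose DOI is nested directly under another DOI in the set.

    Assumes file DOIs take the form <dataset DOI>/<file identifier>.
    """
    all_dois = {r["doi"].lower() for r in records}
    return [
        r for r in records
        if r["doi"].lower().rpartition("/")[0] not in all_dois
    ]


def version_family(record):
    """Return the parent DOI a record is a version of, else the record's own DOI.

    Assumes version records carry an "IsVersionOf" related identifier.
    """
    for rel in record.get("relatedIdentifiers", []):
        if rel.get("relationType") == "IsVersionOf":
            return rel["relatedIdentifier"].lower()
    return record["doi"].lower()


def collapse_versions(records):
    """Keep one record per version family (the first one encountered)."""
    families = {}
    for record in records:
        families.setdefault(version_family(record), record)
    return list(families.values())
```

Keeping the first record seen in each family is an arbitrary choice made for brevity; selecting the most recent version, or always preferring the parent DOI, would be equally reasonable depending on what is being counted.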

The custom open source solution that we have developed uses Python, one of the most common programming languages (per GitHub), and offers the flexibility to overcome the challenges posed by differences between data repositories and variations in the metadata provided by researchers. Our approach also avoids the shortcomings of proprietary solutions: it is transparent, so users can understand exactly how dataset information is retrieved, and it is available at no cost to anyone who might want to use it. In many ways, this workflow embodies the best practices that we encourage researchers to adopt: open, freely available, transparent processes. It also allows others (at UT or beyond) to adopt our workflow and, if necessary, to adapt it for their own purposes.

New Open Access Initiatives and Annual Report Highlights

Welcome to our semi-annual update on the University of Texas Libraries’ (UTL) commitment to supporting open access (OA) publishing. In this update, we’re excited to announce several new OA initiatives available for the UT community to utilize, alongside a glimpse into the significant cost savings achieved through our OA agreements.

Cogitatio Press

Cogitatio Press offers a range of five OA journals covering diverse fields such as Media and Communication, Politics and Governance, and Urban Planning. Launching late this year, their ‘Ocean and Society’ journal will provide a platform for ocean-related research. The best part? UT Austin corresponding authors can publish in these journals without incurring Article Processing Charges (APCs), thanks to our agreement with Cogitatio.

Free Journal Network (FJN)

FJN, a non-profit organization, focuses on supporting diamond OA journals, ensuring no fees for readers or authors. Their mission includes facilitating journal coordination, sharing best practices, promoting FJN journals, securing funding for journal enhancement, and advocating for improvements in scholarly publishing. We’re thrilled to collaborate with FJN in advancing open access initiatives.

Institute of Physics (IOP)

UTL has secured a Read and Publish deal with the Institute of Physics (IOP), granting the UT community access to all IOP journals. Moreover, UT Austin corresponding authors can publish OA in IOP journals without bearing APC costs, contributing to the dissemination of impactful research across disciplines.

Bloomsbury Open Collections

Bloomsbury is pioneering a collective funding model for OA books, akin to the successful Subscribe to Open model for journals. We’re proud to support the African Studies + International Development collection, which aims to make 20 frontlist titles available immediately upon publication. This initiative underscores our commitment to promoting diverse voices and perspectives in scholarly literature.

Peer Community In

Peer Community In (PCI) is a scientist-led initiative that provides a reviewing and recommending service for pre-print articles, similar to the peer review process for journal articles. Recommended pre-prints can then be submitted to the Peer Community Journal or to a PCI-friendly journal, which will accept the recommended pre-print with waived or expedited peer review. We are excited to support this unique publishing model that aims to provide additional value around pre-prints as an important part of the OA ecosystem.

Understanding UT Austin Corresponding Authors

You might wonder, what exactly is a UT Austin corresponding author? In essence, they’re the primary point of contact for communication regarding an article. While typically a senior researcher such as a faculty member, this role isn’t exclusive and can be fulfilled by any UT Austin affiliate involved in the research. For OA agreements offering direct author benefits like waived APCs, eligibility is contingent upon the corresponding author’s affiliation with UT Austin.

Annual Report Highlights

In our latest annual report, completed last fall, we celebrated significant milestones achieved through our OA agreements. Notably, these initiatives resulted in over $600,000 of cost savings through waived or reduced APCs. This substantial figure underscores the tangible impact of our commitment to open access publishing and reflects the growing momentum towards equitable and accessible scholarly communication.


As we continue to champion open access initiatives, we invite the UT community to explore these new opportunities and join us in advancing knowledge dissemination for the betterment of academia and society at large.

For more information on these initiatives and our ongoing efforts, please visit our OA LibGuide.

Thank you for your continued support and engagement in fostering a culture of openness and accessibility in scholarly publishing.

Scholars Lab Hosts First Open Science Summit

The doors of the new Scholars Lab at the Perry-Castañeda Library swung open for the first Texas Open Science Summit, held on Wednesday, September 20.

Hosted by the Libraries, this summit was organized as a call to action for the advancement of open science in recognition of the Year of Open Science, a move by the White House Office of Science and Technology Policy (OSTP) to advance national open science policies across the federal government in 2023.

The Summit was an inaugural gathering that highlighted the commitment of advocates in the campus community to openness, collaboration, and the dissemination of knowledge. The event took place both in person and virtually to ensure accessibility to a wide audience.

The event offered attendees a diversity of ideas and perspectives, with participants from various disciplines and backgrounds coming together to explore the benefits of open science practices and to share their individual experiences in applying them. It provided a platform for sharing success stories, discussing challenges, and brainstorming solutions, all with the ultimate goal of promoting transparency and accessibility in research.

The summit provided inspiring keynote addresses and panel discussions featuring local and national experts in open science, including representatives from Higher Education Leadership Initiative for Open Scholarship (HELIOS) and NASA’s Transform to Open Science (TOPS) program.

These thought-provoking sessions covered a broad spectrum of topics, from open-access publishing to data sharing and reproducibility. Participants left inspired and armed with practical insights to implement in their own work.

Attendees were also introduced to the university’s new Open Source Programs Office (OSPO) – funded by the Alfred P. Sloan Foundation – which has recently been launched to promote open source and open science opportunities to students, faculty, staff and researchers at UT.

Attendees described the Summit as a resounding success in reaffirming the global scientific community’s dedication to open science principles. Participants left the event with a deeper understanding of open science practices and a shared commitment to making research more transparent and accessible.