
Navigating the Data Landscape: An Open Source Workflow

Recent years have witnessed explosive growth in the volume of research publications (Hanson et al., 2024). To maintain the basic tenets of scholarship, stakeholders such as funders and publishers are increasingly introducing policies that promote research best practices. For example, the 2022 Nelson Memo directed federal agencies that dispense at least $100 million in research funding to revise their policies for making the outputs of federally funded research publicly available. Concurrent with the evolution of these policies, research institutions are innovating and developing the infrastructure needed to support researchers, and libraries are an essential component of that infrastructure.

These stakeholders, and various subgroups within them, have a range of interests in tracking the publication of research outputs. To make data-driven decisions about which services the libraries provide and how we provide them, we need data about our research community. There is a long history of tracking the publication of articles and books, and the infrastructure for doing so is relatively well developed (e.g., Web of Science, Scopus, Google Scholar). In this regard, we are well positioned to continue monitoring these outputs in line with the Nelson Memo's new stipulations for immediate public access. However, the Nelson Memo also stipulated that the research data supporting publications be shared publicly. Compared to open access publishing, open data sharing is less developed both culturally and structurally, which makes it all the more important to develop a workflow for gathering data on this front.

Predictably, the infrastructure for tracking the sharing of data is not nearly as well developed as that for articles or books. Some of this is likely due to the relative lack of emphasis on data publishing, but there are other reasons why tracking data isn't easy for motivated parties. Journals, in spite of wide-ranging aesthetic and stylistic conventions, have relatively uniform metadata standards. This is largely because of the homogeneity of their products across disciplines: primarily peer-reviewed research articles typeset into PDFs. That uniformity allows proprietary solutions like Web of Science and Scopus to harvest vast amounts of metadata (through Crossref) and make it available in a readily usable format with relatively little formatting, cleaning, or transformation required. In contrast, research data are published in a wide variety of formats, ranging from loosely structured text-based documents like letters or transcripts to objects with complex or structured formatting like geospatial and genomic data. As a result, platforms that host and publish research data can differ significantly in their general or discipline-specific metadata and file support, the level of detail in author information, their use of persistent identifiers like DOIs, and their curation and quality assurance measures (or lack thereof).

Comparison of annual volume of dataset publications. ‘All’ refers to the volume across all discovered repositories and is compared to our institutional repository, the Texas Data Repository, and two common generalists, Dryad and Zenodo.

While a few proprietary solutions are beginning to emerge that purport to track institutional research data outputs (e.g., Web of Science), these products have notable shortcomings, including significant cost, difficulty assessing the thoroughness of retrieval, and limits on the number of retrievals. To create a more sustainable and transparent solution, the Research Data Services team has developed a Python-based workflow that uses a number of publicly accessible APIs for data repositories and DOI registries. The code has been publicly shared through the UT Libraries GitHub at https://github.com/utlibraries/research-data-discovery so that others can also use this open approach to gathering information about research data outputs from user-defined institutions; it will continue to be maintained and expanded to improve coverage and accuracy. To date, the workflow has identified more than 3,000 dataset publications by UT Austin researchers across nearly 70 different platforms, ranging from generalist repositories like Dryad, figshare, and Zenodo that accept any form of data to highly specialized repositories like the Digital Rocks Portal (for visualizing porous microstructures), DesignSafe (for natural hazards), and PhysioNet (for physiological signal data).
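
The repository-specific handling lives in the GitHub repository linked above, but the core retrieval step can be sketched in a few lines. The snippet below is a minimal sketch that queries the DataCite REST API for DOIs of resource type 'dataset' whose creator affiliations mention a given institution; the query field and parameters are illustrative assumptions, not the exact production code.

```python
import requests

DATACITE_API = "https://api.datacite.org/dois"

def find_institution_datasets(affiliation_query, page_size=100):
    """Page through DataCite records for datasets matching an affiliation string."""
    url = DATACITE_API
    params = {
        # Full-text query against creator affiliation names (illustrative field)
        "query": f'creators.affiliation.name:"{affiliation_query}"',
        "resource-type-id": "dataset",  # restrict to the DataCite 'Dataset' type
        "page[size]": page_size,
        "page[cursor]": 1,              # initiates cursor-based deep pagination
    }
    records = []
    while url:
        response = requests.get(url, params=params, timeout=30)
        response.raise_for_status()
        payload = response.json()
        records.extend(payload.get("data", []))
        # DataCite supplies a ready-made 'next' URL until results run out
        url = payload.get("links", {}).get("next")
        params = None  # the 'next' URL already encodes all query parameters
    return records

if __name__ == "__main__":
    datasets = find_institution_datasets("University of Texas at Austin")
    print(f"Retrieved {len(datasets)} candidate dataset records")
```

Because most data repositories register their DOIs through DataCite, a single query point like this can surface deposits from dozens of platforms at once; repository-specific APIs then fill in the gaps.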

Horizontal bar chart comparing the total number of UT-Austin-affiliated datasets published in different repositories. Only repositories with at least 30 datasets are individually listed; the remainder are grouped into an 'Other' category. The Texas Data Repository has the most discovered datasets (nearly 1,250), followed by Dryad, Zenodo, Harvard Dataverse, the aggregated 'other', ICPSR, figshare, DesignSafe, Mendeley Data, the Digital Rocks Portal, and EMSL. No repository other than the Texas Data Repository has more than 400 datasets.
Comparison of total number of dataset publications between repositories. Only repositories with more than 30 UT-affiliated publications are depicted individually; all others are grouped into ‘Other.’

This work is still very much in progress. Perhaps as important as the data we were able to obtain are the data we suspect exist but were unable to retrieve via our workflow (e.g., we didn't retrieve any UT-affiliated datasets from the Qualitative Data Repository, even though we are an institutional member). Equally informative is the variation in metadata schemas, cross-walks, and quality, which can help inform our guidance to researchers on the importance of high-quality metadata. For example, this process relies on proper affiliation metadata being recorded and cross-walked to DataCite. Some repositories simply don't record or cross-walk any affiliation metadata, making it essentially impossible to identify which, if any, of their deposits are UT-affiliated. Others record the affiliation in a field other than the actual affiliation field (e.g., in the same field as the author name); some even record the affiliation as an author. All of this is on top of the complexity introduced by the multiple ways researchers record their university affiliation (UT Austin, University of Texas at Austin, the University of Texas at Austin, etc.), as illustrated in the figure and the sketch below.

Horizontal bar chart comparing the frequency of different name permutations of UT Austin that were entered in UT Austin datasets. A total of eight different permutations were detected, ranging from 'University of Texas at Austin' to 'UT Austin.' The most common is to use 'at Austin' rather than some form of punctuation like a comma or hyphen instead of 'at.'
Comparison of the frequency of different permutations of ‘UT Austin’ that were entered as affiliation metadata in discovered datasets.
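
Folding these permutations into a single canonical name requires a normalization step. Below is a minimal sketch using a regular expression that covers the permutations shown in the figure; the pattern and function name are illustrative assumptions rather than the production code.

```python
import re

# Illustrative pattern covering permutations like 'University of Texas at Austin',
# 'The University of Texas at Austin', 'University of Texas, Austin',
# 'University of Texas - Austin', 'UT-Austin', and 'UT Austin'.
UT_AUSTIN_PATTERN = re.compile(
    r"^(the\s+)?(university\s+of\s+texas|ut)\s*(at|,|-)?\s*austin$",
    re.IGNORECASE,
)

def normalize_affiliation(raw):
    """Map any recognized permutation to a single canonical institution name."""
    if UT_AUSTIN_PATTERN.match(raw.strip()):
        return "University of Texas at Austin"
    return raw  # leave unrecognized affiliations as-is for manual review

# A few quick checks of the normalization
assert normalize_affiliation("UT Austin") == "University of Texas at Austin"
assert normalize_affiliation("The University of Texas at Austin") == "University of Texas at Austin"
assert normalize_affiliation("University of Texas, Austin") == "University of Texas at Austin"
```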

We also have to account for variation in the granularity of objects, particularly those that receive a persistent identifier (PID). For example, in our Texas Data Repository (TDR), which is built on Dataverse software, both a dataset and each of its constituent files receive a unique DOI; each file is also recorded as a 'dataset' because the metadata schema used by the DOI minter, DataCite, doesn't currently support a 'file' resource type. We thus have to account for a raw retrieval that initially inflates the number of datasets in TDR by at least two orders of magnitude. The inverse of this is Zenodo, which assigns a parent DOI that always resolves to the most recent version, with each version of an object getting its own DOI (so every Zenodo deposit has at least two DOIs, even if it is never updated).
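
A hedged sketch of how such records can be collapsed follows, assuming DataCite-style metadata in which a Dataverse file DOI carries an 'IsPartOf' relation to its parent dataset and a Zenodo version DOI carries an 'IsVersionOf' relation to its concept DOI; repositories vary in how completely they populate these relations, so this is an illustration rather than the production logic.

```python
def deduplicate_records(records):
    """Collapse file-level and version-level DOIs to one record per dataset.

    Assumes DataCite-style JSON:API records whose 'relatedIdentifiers' may
    carry 'IsPartOf' (e.g., a Dataverse file pointing to its parent dataset)
    or 'IsVersionOf' (e.g., a Zenodo version pointing to its concept DOI).
    """
    kept = {}
    for record in records:
        attrs = record["attributes"]
        doi = attrs["doi"]
        relations = {
            rel.get("relationType"): rel.get("relatedIdentifier")
            for rel in attrs.get("relatedIdentifiers", [])
        }
        if "IsPartOf" in relations:
            continue  # file-level DOI: skip it, the parent dataset is counted
        # Collapse version DOIs onto their parent/concept DOI where present
        canonical = (relations.get("IsVersionOf") or doi).lower()
        kept.setdefault(canonical, record)
    return list(kept.values())
```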

The custom open source solution we have developed in Python, one of the most widely used programming languages (per GitHub), offers the flexibility to overcome the challenges posed by differences between data repositories and variations in the metadata provided by researchers. Our approach also avoids the shortcomings of proprietary solutions: it is transparent, so users can understand exactly how dataset information is retrieved, and it is available at no cost to anyone who wants to use it. In many ways, this workflow embodies the best practices we encourage researchers to adopt: open, freely available, transparent processes. It also allows others (at UT or beyond) to adopt our workflow and, if necessary, adapt it for their own purposes.

OER Faculty Author Spotlight: Dr. Victor Eijkhout

Dr. Victor Eijkhout, Texas Advanced Computing Center

In observation of Open Education Week, UT Libraries is proud to spotlight a few of our talented faculty members who are at the forefront of the open education movement as open educational resource (OER) authors! Because we can’t limit ourselves to just one week, we’re excited to celebrate open education throughout the month of March.

We’re starting this year’s series with Dr. Victor Eijkhout. Dr. Eijkhout is part of the Texas Advanced Computing Center, which he joined in 2005 as a Research Scientist in the High Performance Computing group. He conducts research in linear algebra, scientific computing, parallel programming, and machine learning. Before coming to TACC, he held positions at the University of Illinois, the University of California at Los Angeles, and the University of Tennessee in Knoxville.

Dr. Eijkhout has authored open courseware, including several open textbooks and accompanying programs and code sets. Below, he generously shares his experiences developing OER with us.

Do you recall how you first became aware of open educational resources (OER) or the open education movement more broadly?

“In science, open software and open courseware predate the term ‘Open Source’ by a wide margin. In the 1980s I provided feedback on a tutorial document that someone on a different continent was making, and that proved very popular. In the mid-1990s I co-authored a computer science textbook for which we got the publisher (SIAM) to agree on a dual license: the book was for sale but also downloadable (including software) and viewable as web pages. In a similar spirit, I started writing my textbooks about 15 years ago without any awareness of being part of a movement. After I finished my first open textbook I did some searching and found the Saylor Foundation, which develops OER. They licensed my book for probably a similar amount to what I would have made by publishing it commercially.”

You’ve developed a wealth of open courseware, including several open textbooks and accompanying materials like Introduction to High-Performance Scientific Computing; Parallel Programming in MPI, OpenMP, PETSc; and Introduction to Scientific Programming in Modern C++ and Fortran. What inspired you to create these resources?

“These textbooks were written for courses that TACC teaches. (The Texas Advanced Computing Center provides a small number of academic courses in addition to many short trainings. These courses are – for historic reasons – provided as part of the SDS department.) When I was slated to teach a course, I searched for available textbooks, but usually I disagreed in some way or other with the approaches they took, so I started writing my own. In a way, writing a textbook, for me, is a form of self-defense: if I only prepare lecture notes, I will often find, standing in front of the class, that I miss details. By writing out everything in full paragraphs and mathematical derivations, I make sure I don’t overlook anything.”

What was the most challenging part of developing your own resources? Was there anything that surprised you?

“The challenge is in dotting the i’s and crossing the t’s. As in most things, the first 80 percent is easy. Getting to a finished product is hard, which is why you find many more lecture notes online than textbooks. An example of what I ran into in my programming books is the challenge of making sure code is 100% correct and corresponds 100% to the output given. For this, I developed a whole infrastructure of example programs, from which snippets are clipped to be included in the text, and similarly the output is captured to be included side-by-side.

In this aspect, self-publishing the way I do, through downloads and repositories, has advantages over publishing commercially: you can release a product informally at an earlier stage and revise it more easily and more often.”

Do you use any OER developed by others as teaching resources?

“Not directly, but if I come across resources I will often peruse them to get inspiration, or even to ‘borrow’ bits for my own texts.”

How do your students respond to the resources you’ve developed?

“I wish I could say that they really appreciate it, but the reactions have a wide range. For many, of course, a textbook is just a textbook and it goes unmentioned. Some of them have delved into the literature and tell me my book is really good. On the other hand, in a sign of the times, students’ first reaction to problems seems to be to look online rather than in the textbook. Unfortunately, in programming this sometimes leads them to outdated material.”

What advice would you offer to an instructor who is interested in using or developing their own OER but isn’t sure how to get started?

“The threshold for open resources is low. Any lecture notes you put up for download will be found by the search engines. My advice would be to write what *you* need. If it’s useful to other people it will be found.”

Want to get started with OER or find other free or low-cost course materials? Contact Ashley Morrison, Tocker Open Education Librarian (ashley.morrison@austin.utexas.edu).

Open Access in 2017

As we’ve prepared for Open Access (OA) Week 2017, it’s been exciting to think back on how far we’ve come in the last several years. For those who aren’t familiar, OA Week is a celebration of efforts to make research publications and data more accessible and usable. Just ten short years ago we lacked much of the infrastructure and support for open access that exist today.

By 2007 we had implemented one of the core pieces of our OA infrastructure by joining the Texas Digital Library (TDL), a consortium of higher education institutions in the state of Texas formed to help build institutions’ capacity for providing access to their unique digital collections. TDL’s membership continues to grow, and it now hosts our institutional repository, Texas ScholarWorks; our data repository, the Texas Data Repository; and our electronic thesis and dissertation submission system, Vireo. TDL is also involved in our digital object identifier (DOI) minting service, which makes citing articles and data easier and more reliable. These services form the backbone of our open access publishing offerings.

Our institutional repository, Texas ScholarWorks (TSW), went live in 2008. TSW is an online archive that allows us to share some of the exciting research being created at the university. We showcase electronic theses and dissertations, journal articles, conference papers, technical reports and white papers, undergraduate honors theses, class and event lectures, and many other types of UT Austin-authored content.

TSW has over 53,000 items that have been downloaded over 19 million times in the past nine years.

In the spring of 2017 we launched the Texas Data Repository (TDR) as a resource for those who are required to share their research data. TDR was intended to serve as the data repository of choice for researchers who lack a discipline-specific repository or who would prefer to use an institutionally supported repository. TDR complements Texas ScholarWorks: researchers who use both repositories can share their data and associated publications and provide links between the two.

For several years the library has been supporting alternative forms of publishing, such as open access publishers and community-supported publishing and sharing. Examples of this support include arXiv, Luminos, PeerJ, Open Library of the Humanities, Knowledge Unlatched, and Reveal Digital. These memberships are important because they allow us to support publishing options that are more financially sustainable than traditional toll-access journals. Many of these memberships also provide a direct financial benefit to our university community, like the 15% discount on article processing charges from our BioMed Central membership.

In an effort to lead by example, the UT Libraries passed an open access policy for library staff in 2016. This is an opt-out policy that applies to journal articles and conference papers authored by UT Libraries employees. With this policy the library joins dozens of other institutions across the U.S. that have department-level open access policies.

This past year we started a very popular drop-in workshop series called Data & Donuts. It takes place at the same time every week, with a different data-related topic highlighted each session. All the sessions share the goal of improving the reproducibility of science.

Data & Donuts has attracted over 340 people in the past nine months, which makes it one of our most successful outreach activities.

We have another reason to be optimistic this year. The Texas state legislature passed a bill this summer that should expand awareness and use of open educational resources (OER). SB810 directs colleges to make information about course materials available to students via the course catalog. If the catalog has an online search feature, the college has to make it possible to filter searches for courses that incorporate OER. The catalog functionality is set to go into effect this spring, so we’ll be keeping an eye on how things develop over this academic year.

We will continue the momentum we have generated from the launch of TDR, our Data & Donuts series, and our support of open publishers. We are putting together topics for Data & Donuts this spring, planning events associated with open access and author rights, and continuing to improve our online self-help resources. We are committed to offering assistance to any faculty, staff, or student at the university who has a question about open access.

We encourage department chairs and tenure and promotion committees to talk with their colleagues and/or engage with us in discussions about what open access means for their discipline.

UT Libraries will continue to explore new publishing models and initiatives to share UT’s rich scholarship and discoveries, to find ways to increase access to open educational resources, and to support future faculty and scholars in accessing, using, and curating the growing body of data that is central to the research enterprise.