Shining Light on Dark Data

27 April 2020

In his PhD research at HLRS, computer scientist Björn Schembera proposed strategies to make valuable research data more productive and long-term data management at HPC centers more efficient.

Simulations running at high-performance computing (HPC) centers produce massive amounts of data. Once a research project is finished, however, potentially valuable data too often ends up abandoned and forgotten, taking up space on long-term storage servers as researchers move on to other topics.

In a publication released in the March 2020 issue of the journal Philosophy & Technology, HLRS computer scientist Björn Schembera and philosopher of science Juan Durán characterize such data as "dark data." Just as astrophysicists know that dark matter must comprise a sizable proportion of the universe's mass even if it can't be observed, dark data can fill countless petabytes of storage — unlabeled, unorganized, and unusable by researchers.

The accumulation of dark data at HPC centers presents several problems. For one, creating, storing, and curating large data sets requires sizable funding, including the costs of building ever larger data storage systems and supplying them with power. From a scientific perspective, the virtual disappearance of dark data also means lost opportunities for computational scientists and engineers working on research that would benefit from access to it. Dark data can also pose security or legal risks, particularly in relation to personally identifiable data and data ownership.

"The concept of dark data has been discussed in other contexts," Schembera says, "but we wanted to better understand its unique features within a high-performance computing context. The paper was the first step toward identifying strategies that could minimize its accumulation." In his recently completed doctoral thesis, Schembera proposes several potential solutions for this problem.

Causes and effects of dark data

While pursuing his PhD at HLRS, Schembera also worked in the Project & User Management and Accounting department, which oversees data management at the center. The position gave him a first-hand perspective on how data is produced, saved, and used there, and from this experience he identified two principal sources of dark data.

In many cases, data goes dark because of missing or difficult-to-interpret metadata, the standardized information about a dataset that gives it structure. Scientists typically have neither the time nor the incentive to tag their data carefully, and often organize it in individual, ad hoc filing systems without systematically annotating it. Although this might suffice in the midst of an active simulation project, it often becomes extremely difficult later to reconstruct what the data represents or to identify connections to other related data.
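To make the problem concrete, the following sketch shows what a minimal descriptive metadata record might look like, and how the absence of a few key fields is enough to render a file effectively dark. The field names and the notion of a "required" set are purely illustrative assumptions, not part of any HLRS or EngMeta standard.

```python
# Hypothetical minimal metadata record for one simulation output file.
# All field names here are illustrative, not drawn from a real schema.
record = {
    "file": "run_042/output.h5",
    "project": "turbulent-channel-flow",
    "creator": "j.doe",
    "created": "2019-11-02",
    "software": "OpenFOAM 7",
    "parameters": {"reynolds_number": 5600, "grid": [256, 128, 128]},
    "description": "DNS of turbulent channel flow, statistically steady",
}

def is_dark(record):
    """A file is effectively 'dark' if the fields needed to
    reinterpret it later are missing from its metadata."""
    required = {"project", "creator", "software", "description"}
    return not required.issubset(record)

print(is_dark(record))            # fully annotated record: False
print(is_dark({"file": "x.h5"}))  # bare file with no annotations: True
```

A check like this captures the core issue: without the annotated fields, no one who was not involved in the original project can reconstruct what the file contains.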

A second source of dark data arises when users of HPC systems become inactive. Once a simulation at HLRS is complete, for example, its output is stored on a data server and later moved to long-term tape storage. When scientists disengage from the center, as when graduate students leave the university for other jobs, that data remains unclaimed.

Schembera points out that the accumulation of dark data has a number of implications: It costs the center financially to maintain and operate the storage hardware needed to retain the data; it raises legal and security risks when personal data is included; and it is inconsistent with the FAIR principles (findability, accessibility, interoperability, and reusability), which govern best practices in data management and reuse. Eliminating dark data could therefore improve the operation and scientific productivity of HPC centers in multiple ways.

The Scientific Data Officer

Because academic users of HPC lack the necessary training or incentives to avoid producing dark data, Schembera argues, high-performance computing centers need to take responsibility for managing the problem.

The paper Schembera published with Durán proposes addressing this issue through the creation of a new kind of administrative position within high-performance computing centers: the scientific data officer (SDO).

Specifically, the SDO would be an expert in data management and HPC tools who moderates among researchers, administrators, and an HPC center's management to ensure that best practices are followed in data management. The SDO's responsibilities would include implementing and maintaining a standardized metadata framework for labeling data that is consistent with FAIR standards, and assisting in the management and retrieval of stored data.

Moreover, the SDO would work to reduce the amount of dark data saved at an HPC center. This could include identifying data connected to inactive or deleted users that could be eliminated from the system, evaluating left-behind data to determine whether it should be preserved, and making decisions with regard to data stewardship. To guard against abuse of the position's authority, Schembera and Durán also recommend a code of conduct governing the SDO's professional behavior, ensuring that data is managed ethically.

Automating metadata curation

Considering the enormous amount of data being generated at an HPC center like HLRS, organizing it through metadata is a formidable task for researchers or for a potential SDO.

In his dissertation, Schembera addresses this challenge by introducing a metadata model called EngMeta, which specifies a standardized framework for categorizing and organizing research data in computational engineering. He also extends this framework with software that automates metadata extraction. Although such a tool would currently need the support of an SDO or researcher to identify significant discipline-specific keywords, he suggests it could fold the often tedious process of metadata management automatically into simulation workflows.
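The idea of automated metadata harvesting can be sketched as follows: scan the text output a solver already produces and lift key-value pairs into a structured record. This is only a minimal illustration of the general technique, under an assumed log format; it does not reproduce EngMeta's actual schema or Schembera's extraction software.

```python
import re

# Hypothetical solver log; the format and field names are assumptions
# for illustration, not output of any real HLRS code.
LOG = """\
solver  = channelDNS
version = 2.4.1
Re_tau  = 180
nx ny nz = 256 128 128
wall time = 86400 s
"""

def extract_metadata(log_text):
    """Harvest 'key = value' lines from a solver log into a metadata
    dict, normalizing multi-word keys with underscores."""
    meta = {}
    for line in log_text.splitlines():
        m = re.match(r"\s*([\w .]+?)\s*=\s*(.+)", line)
        if m:
            key = m.group(1).strip().replace(" ", "_")
            meta[key] = m.group(2).strip()
    return meta

meta = extract_metadata(LOG)
print(meta["solver"])   # channelDNS
print(meta["Re_tau"])   # 180
```

Run as part of a simulation workflow, a harvester of this kind could populate most of a dataset's descriptive record without any manual tagging, leaving only discipline-specific keywords for a researcher or SDO to supply.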

Ultimately, Schembera sees potential in these proposals to improve the productivity and efficiency of HPC centers on multiple levels. Reducing the amount of dark data that is produced and stored could make computing centers more economically efficient and, considering the power requirements of keeping large computer servers running, more environmentally sustainable. Archiving the right kinds of data from past simulations in a more organized and accessible way also holds the potential to make data more scientifically productive.

Christopher Williams

Learn more

Schembera B, Durán JM. 2020. Dark data as the new challenge for big data science and the introduction of the Scientific Data Officer. Philos Technol. 33:93-115.

This article about Schembera's work appeared in the 2019 HLRS Annual Report.