Minisymposia PASC22
“Leveraging Data Lakes to Manage and Process Scientific Data”

Month XXth 2022, XX:00 - XX:00

In recent years, data lakes have become increasingly popular as central storage, particularly for unstructured data. Generally, data lakes aim to integrate heterogeneous data from diverse sources into a unified information management system in which data is retained in its original format. Storing data in raw format, as opposed to enforcing a schema on write as is commonly done in a data warehouse, supports the reuse and sharing of already collected data among researchers. The idea is essentially to dump the data into the lake and later fish for knowledge using sophisticated analysis tools. Collecting all data in one integrated data management system prevents the formation of independent data silos, i.e. isolated information systems. In such silos, research teams can neither easily exchange data and analysis workflows nor perform cross-silo analyses, so the silos typically exclude people unfamiliar with group-internal knowledge. A data lake, in contrast, integrates data and processes while adhering to overarching standards available to all users, thereby opening up the available knowledge to a broader audience.
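
The difference between schema-on-write and schema-on-read can be illustrated with a minimal sketch. It is not taken from any of the talks; the record contents, the field names (`sensor`, `temp_c`, `temperature`) and the function `read_with_schema` are assumptions made purely for illustration. Raw records are kept exactly as they arrived, and a schema is applied only when a particular analysis reads them.

```python
import json

# Hypothetical raw records, stored exactly as they arrived.
# No schema is enforced on write, so heterogeneous entries coexist.
RAW_LAKE = [
    '{"sensor": "t1", "temp_c": "21.5", "ts": "2022-01-01T10:00:00"}',
    '{"sensor": "t2", "temperature": 19.0}',
    'free-text lab note: sample 7 looked cloudy',
]


def read_with_schema(raw_records):
    """Apply a schema at read time: keep records that fit, skip the rest."""
    rows = []
    for raw in raw_records:
        try:
            doc = json.loads(raw)
        except json.JSONDecodeError:
            # Not JSON; the record stays in the lake for other analyses.
            continue
        if not isinstance(doc, dict):
            continue
        # Sources name the field differently; the reader reconciles them.
        value = doc.get("temp_c", doc.get("temperature"))
        if "sensor" in doc and value is not None:
            rows.append({"sensor": doc["sensor"], "temp_c": float(value)})
    return rows


rows = read_with_schema(RAW_LAKE)
print(rows)
```

The free-text note does not fit this particular schema and is skipped by this reader, but it is not lost: a different analysis, such as text mining, can later read the same raw data with a different schema.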

Integrating a data lake into research workflows, however, has its own technical challenges. Most importantly, it must be ensured that all data, no matter the number or size of the individual data sets, can be found and accessed later on. Another important question is how a data lake can be designed to attract a wider range of users and not only specialists. Especially for domain researchers in public research institutions, a research data management solution should not only ensure the preservation of the data but also support and guide scientists in complying with good scientific practice from the very beginning. To discuss the current challenges and their possible solutions and to share personal insights into data lakes, we want to bring together different experts and discuss with the scientific community the potential of data lakes and the technical approaches to realizing scientific workflows.

Speaker Details

Prof. Dr. Rihan Hai (Affiliation: TU Delft)

Title: Data Integration in Data Lakes.

Although big data has been discussed for some years, it still poses many research challenges, such as the variety of data. Diverse data sources often exist in information silos, which are collections of non-integrated data management systems with heterogeneous schemas, query languages, and data models. It is very difficult to efficiently integrate, access, and query the large volume of diverse data in these information silos with traditional 'schema-on-write' approaches such as data warehouses. Data lake systems, which are repositories that store raw data in its original formats and provide a common access interface, have been proposed as a solution to this problem. In this talk, I will discuss the landscape of existing data lake problems and our solutions for integrating multiple heterogeneous data sources in data lakes. I will also introduce recent advances in supporting AI in data lakes.

Dr. Pegdwendé Nicolas Sawadogo (Affiliation: Fondation de l'AP-HP)

Title: Enabling industrialized analysis of textual documents in data lakes.

The concept of the data lake was introduced in 2010 by James Dixon as an alternative to data warehouses for big data analysis and management. Unlike data warehouses, data lakes follow a schema-on-read approach to better support ad hoc analyses. In the absence of a fixed schema, data from the lake can be handled in many different ways. This, however, makes industrialized analyses from data lakes difficult. More recently, the concept of the data lakehouse has been proposed as a solution to enable industrialized analyses in data lakes. It consists of merging the best of the data lake and data warehouse concepts. Nevertheless, data lakehouses are still limited, as they essentially focus on structured data management. Yet the majority of big data consists of unstructured data, including textual data. To remedy the limitations of data lakehouses, we introduce a new approach to enable industrialized analyses of textual documents from a data lake. Our approach is based on techniques from the information retrieval and text mining domains. In this presentation, we particularly focus on architecture and metadata management, which are essential issues when building a data lake system.

Dr. Mark Greiner (Affiliation: Max Planck Institute for Chemical Energy Conversion)

Title: Utilizing Data Lakes for Managing Multidisciplinary Research Data.

Scientific research institutes face many of the same challenges as commercial organizations when it comes to managing data. Just as in commercial organizations, a common situation is data silos or, even worse, data swamps. The fundamental problem is that the continual manual effort needed to govern data proves to be too much for many research institutes. A possible solution is to automate as much of the process as possible and to minimize duplicated effort. In this talk, we discuss our current efforts to improve the data management of a mid-sized academic research institution. We show that, while some aspects of data management (such as data ingestion, processing, and reporting) are very similar to those faced in commercial organizations, other aspects are quite specific to the use case of academic research. In these cases, we adapt or rebuild custom modules to accommodate the unique workflows of researchers. In the end, we aim to make use of known best practices and technologies while embracing the uniqueness of research practices.

Hendrik Nolte (Affiliation: Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen)

Title: A FAIR Digital Object-Based Data Lake Architecture to Support Various User Groups and Scientific Domains.

Across various domains, data lakes are successfully utilized to centrally store all data of an organization in its raw format. This promises high reusability of the stored data since a schema is applied only on read, which prevents information loss due to ETL (Extract, Transform, Load) processes. Despite this schema-on-read approach, some modeling is mandatory to ensure proper data integration, comprehensibility, and quality. These data models are maintained within a central data catalog which can be queried. To further organize the data in the data lake, different architectures have been proposed, such as the widely known zone architecture, in which data is assigned to different zones according to its degree of processing. In this talk, a novel data lake architecture based on FAIR (Findable, Accessible, Interoperable, Reusable) Digital Objects (FDOs) with (high-performance) processing capabilities is presented. These FDOs abstract away the handling of the underlying mass storage and databases, thereby enforcing a homogeneous state while offering flat yet easily comprehensible research data management. The FDOs are connected by a provenance-centered graph. Users can define generic workflows, which are reproducible by design, making this data lake implementation ideally suited for science.
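
To make the idea of objects connected by a provenance-centered graph more concrete, the following is a minimal sketch. It does not describe the architecture presented in the talk; the class names `DigitalObject` and `ProvenanceGraph`, the identifiers, and the storage locations are hypothetical and chosen only for illustration. Each object records which objects it was derived from, so the full lineage of a result can be traced back to the raw data.

```python
from dataclasses import dataclass, field


@dataclass
class DigitalObject:
    """Hypothetical stand-in for a digital object: ID, metadata, storage reference."""
    pid: str
    metadata: dict
    location: str  # where the payload lives; hidden from the user in practice
    derived_from: list = field(default_factory=list)  # PIDs of the inputs


class ProvenanceGraph:
    """Registry of objects whose derived_from links form a provenance graph."""

    def __init__(self):
        self.objects = {}

    def add(self, obj):
        # Require all inputs to be registered so lineage never dangles.
        for parent in obj.derived_from:
            if parent not in self.objects:
                raise KeyError(f"unknown input object: {parent}")
        self.objects[obj.pid] = obj

    def lineage(self, pid):
        """Return all ancestor PIDs of an object, nearest first (breadth-first)."""
        seen = []
        queue = list(self.objects[pid].derived_from)
        while queue:
            current = queue.pop(0)
            if current not in seen:
                seen.append(current)
                queue.extend(self.objects[current].derived_from)
        return seen


graph = ProvenanceGraph()
graph.add(DigitalObject("raw-1", {"type": "raw"}, "store://raw/1"))
graph.add(DigitalObject("cleaned-1", {"type": "cleaned"}, "store://clean/1", ["raw-1"]))
graph.add(DigitalObject("stats-1", {"type": "result"}, "store://res/1", ["cleaned-1"]))
print(graph.lineage("stats-1"))
```

Because every derived object names its inputs, a workflow that produced `stats-1` can in principle be replayed from `raw-1`, which is one way to read the claim that workflows are reproducible by design.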

Presenter	Contact	Topic
Mark Greiner	E-Mail: mark.greiner@cec.mpg.de	“Utilizing Data Lakes for Managing Multidisciplinary Research Data”
Rihan Hai	E-Mail: r.hai@tudelft.nl	“Data Integration in Data Lakes”
Pegdwendé Nicolas Sawadogo	E-Mail: sawadogonicholas44@gmail.com	“Enabling Industrialized Analysis of Textual Documents in Data Lakes”
Hendrik Nolte	E-Mail: hendrik.nolte@gwdg.de	“A FAIR Digital Object-Based Data Lake Architecture to Support Various User Groups and Scientific Domains”

Important Information

Date and Time	Day, Month XXth 2022, XX:00 - XX:00
Venue	Open
Organizers	Hendrik Nolte (GWDG), hendrik.nolte@gwdg.de
	Julian Kunkel (Uni Göttingen/GWDG), julian.kunkel@gwdg.de

Funding

This workshop is funded by the GWDG.

Acknowledgements

We gratefully acknowledge funding by the “Niedersächsisches Vorab” funding line of the Volkswagen Foundation and NHR.