By adopting a new approach to the reusability of scientific workflows, we demonstrate benefits for ease of sharing and for real-world performance of a biology machine learning workflow and of relevant scientific benchmark examples. Making workflow reusability a primary concern not only improved the experience of scientific users but also yielded better performance and scalability.
Significance and Impact
The acronym FAIR, for “Findable, Accessible, Interoperable, Reusable”, has become an important watchword for scientific data sets that should be open and accessible to the community, whether for policy, reproducibility, or intellectual merit. By connecting the FAIR principles as defined for data to software engineering concerns, we define a new approach to making open, community-accessible workflows. This approach prioritizes reusability, and in particular the automation of reusability, in order to make workflows and scientific pipelines shareable without incurring substantial human costs.
- Addressing FAIR workflows and software engineering concerns for technical debt simultaneously drives a need for metadata collection that can be used for automation of reuse.
- Identified six categories of properties to gauge workflow reusability. The six are grouped into three that relate to data (access, schema, and semantics) and three that relate to software (granularity, customizability, and provenance).
- Leveraged and extended previous ASCR efforts by utilizing existing software stacks such as Cheetah, ADIOS, and Skel.
- Experimental evaluations on:
  - Computational biology pipeline
  - Reference performance optimization workflow
  - ML bioinformatics workflow
- Improved reusability and abstraction allowed for better performance (up to 400% improvement) in addition to the target of better shareability and reuse.
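As an illustration only, the six property categories listed above could be captured as a small metadata record that tooling checks for completeness before a workflow is shared. The sketch below is not the implementation described in the paper and does not use the Cheetah, ADIOS, or Skel APIs; the class name `ReusabilityMetadata`, the helper `missing_categories`, and the example values are all hypothetical.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class ReusabilityMetadata:
    """Hypothetical record with one free-text note per reusability category."""

    # Data-related categories
    access: Optional[str] = None
    schema: Optional[str] = None
    semantics: Optional[str] = None
    # Software-related categories
    granularity: Optional[str] = None
    customizability: Optional[str] = None
    provenance: Optional[str] = None


def missing_categories(meta: ReusabilityMetadata) -> List[str]:
    """Return the categories that have no metadata yet, in declaration order."""
    return [f.name for f in fields(meta) if getattr(meta, f.name) is None]


# Example values are invented for illustration.
example = ReusabilityMetadata(
    access="input files staged on a shared file system",
    provenance="run configuration archived with each output",
)
print(missing_categories(example))
```

A check of this kind is one way that collected metadata could feed automated reuse: categories that come back as missing mark the places where a person would still have to step in.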
The FAIR principles of open science (Findable, Accessible, Interoperable, and Reusable) have had transformative effects on modern large-scale computational science. How best to apply the FAIR principles to workflows themselves, and software more generally, is not yet well understood. This work demonstrates that the software engineering concept of technical debt management provides a useful guide for application of those principles to workflows, and in particular that it implies reusability should be considered as “first among equals”. The work constructs novel systems and tools that are based on a new abstraction approach for reusable workflows, with demonstrations for both synthetic workloads and real-world computational biology and machine learning workflows. This makes it easier to selectively reason about and automate the trade-offs across user ease and performance concerns.
From: Wolf, M., Logan, J., Choi, J. Y., Mehta, K., Jacobson, D., Cashman, M., ... Cli, A. (2021). Reusability first: toward FAIR workflows. In Proceedings of 2021 IEEE International Conference on Cluster Computing. CLUSTER ’21. (10.1109/Cluster48925.2021.00053)
Last Updated: October 20, 2021 - 1:54 pm