Provenance-based Semantic Approach for the Reproducibility of Scientific Experiments
Data provenance has become an integral part of the natural sciences, where data flow through several complex steps of processing and analysis to generate intermediate and final results. To reproduce scientific experiments, scientists need to understand how the steps were performed in order to check the validity of the results. Scientific experiments consist of activities in the real world (e.g., wet lab or field work) and activities in cyberspace. Many scientists now write scripts as part of their field research for different tasks, including data analysis, statistical modeling, numerical simulation, computation and visualization of results. Reproducibility of both the computational and the non-computational parts is an important step towards reproducibility of the experiment as a whole. To reproduce results or to detect which error occurred in the output, one needs to know which input data was responsible for the output, the steps involved in generating it, the devices and materials used, the settings of those devices, the dependencies, the agents involved and the execution environment. The aim of our work is to semantically describe the provenance of the complete execution of a scientific experiment in a structured form using linked data, without having to worry about the underlying technologies. We propose an approach to ensure this reproducibility by collecting the provenance data of the experiment and using the REPRODUCE-ME ontology, which extends existing W3C vocabularies, to describe the steps and the sequence of steps performed in an experiment. The ontology is developed to describe a scientific experiment along with its steps, its input and output variables, and their relationships with each other. The semantic layer on top of the captured provenance, together with ontology-based data access, allows scientists to understand and visualize the complete path taken in a computational experiment along with its execution environment.
We also provide a provenance-based semantic approach that captures data from interactive notebooks in a multi-user environment provided by JupyterHub and semantically describes the data using the REPRODUCE-ME ontology.
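As a rough illustration of what describing experiment steps as linked data can look like, the following minimal sketch records two steps as subject-predicate-object triples and traces an output back to the inputs responsible for it. It uses only terms from the W3C PROV-O vocabulary (`prov:Activity`, `prov:Entity`, `prov:Agent`, `prov:used`, `prov:wasGeneratedBy`, `prov:wasAssociatedWith`). The `http://example.org/experiment/` namespace, the step and entity names, and the helper functions `describe_step` and `inputs_responsible_for` are assumptions made for this example; the sketch does not use actual REPRODUCE-ME terms and is not the implementation described in this work.

```python
# Illustrative only: plain tuples stand in for RDF triples.
PROV = "http://www.w3.org/ns/prov#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/experiment/"  # hypothetical namespace


def describe_step(step, inputs, outputs, agent):
    """Return PROV-O style triples for one experiment step (hypothetical helper)."""
    triples = [
        (EX + step, RDF_TYPE, PROV + "Activity"),
        (EX + agent, RDF_TYPE, PROV + "Agent"),
        (EX + step, PROV + "wasAssociatedWith", EX + agent),
    ]
    for name in inputs:
        triples.append((EX + name, RDF_TYPE, PROV + "Entity"))
        triples.append((EX + step, PROV + "used", EX + name))
    for name in outputs:
        triples.append((EX + name, RDF_TYPE, PROV + "Entity"))
        triples.append((EX + name, PROV + "wasGeneratedBy", EX + step))
    return triples


def inputs_responsible_for(output, triples):
    """Walk wasGeneratedBy/used links back from an output to all upstream inputs."""
    found, frontier = set(), [EX + output]
    while frontier:
        entity = frontier.pop()
        # Steps that generated this entity.
        steps = [o for s, p, o in triples
                 if s == entity and p == PROV + "wasGeneratedBy"]
        for step in steps:
            # Everything those steps used is an upstream input.
            for s, p, o in triples:
                if s == step and p == PROV + "used" and o not in found:
                    found.add(o)
                    frontier.append(o)
    return found


# Two hypothetical steps: a real-world acquisition and a computational analysis.
graph = (
    describe_step("image_acquisition", ["sample"], ["raw_image"], "alice")
    + describe_step("segmentation", ["raw_image", "script_v1"], ["mask"], "alice")
)
```

Calling `inputs_responsible_for("mask", graph)` in this sketch returns the identifiers of `raw_image`, `script_v1` and `sample`, that is, both the direct inputs of the computational step and the material used in the preceding real-world step. In practice an RDF library and a SPARQL endpoint would take the place of the tuple list.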