Abstract
Science is experiencing an ongoing reproducibility crisis. In light of this crisis, our objective is to investigate whether machine learning platforms provide out-of-the-box reproducibility. Our method is twofold: first, we survey machine learning platforms for whether they provide features that simplify making experiments reproducible out of the box; second, we conduct the exact same experiment on four different machine learning platforms, thereby varying only the processing unit and ancillary software. The survey shows that no machine learning platform supports the feature set described by the proposed framework, while the experiment reveals statistically significant differences in results when the exact same experiment is conducted on different machine learning platforms. The surveyed machine learning platforms do not on their own enable users to achieve the full reproducibility potential of their research, and the platforms with the most users provide less functionality for achieving it. Furthermore, results differ when the same experiment is executed on the different platforms, so wrong conclusions can be inferred at the 95% confidence level. Hence, we conclude that machine learning platforms do not provide reproducibility out of the box and that results generated on one machine learning platform alone cannot be fully trusted.
1. Introduction
A concern has grown in the scientific community related to the reproducibility of scientific results. The concern is not unjustified. According to a Nature survey, the scientific community agrees that there is an ongoing reproducibility crisis [1]. According to the findings of the ICLR 2018 Reproducibility Challenge, experts in machine learning have similar concerns about reproducibility; more worryingly, their concern increased after trying to reproduce research results [2]. In psychology, the reproducibility project was able to reproduce statistically significant results for only 36 out of 100 psychology research articles [3]. Braun and Ong argue that computer science and machine learning should be in better shape than other sciences, as many, if not all, experiments are conducted entirely on computers [4]. However, even though this is true, computer science and machine learning research is not necessarily reproducible. Collberg and Proebsting report an experiment in which they tried to execute the code published as part of 601 papers. Their efforts succeeded for 32.1% of the papers when not communicating with the authors and for 48.3% when communicating with them [5]. In their experiment, they only tried to run the code; they did not evaluate whether the results were reproducible.
Machine learning is still, to a very large degree, an empirical science, so these issues with reproducibility are a concern. For example, to establish which algorithm is better for a task, an experiment is designed in which the algorithms are trained and tested on the same datasets representing the task. The one that compares best according to one or more performance metrics is deemed the best for that task. Now, imagine that we have two algorithms that we want to compare: algorithm A is our own, and algorithm B is developed by a third party. The result depends on how much documentation the third party who authored algorithm B has made available to us. For example, if we only have access to written material, we have to implement the algorithm ourselves and test it on data that we collect ourselves, and there is practically no way to verify that we have implemented and configured the algorithm in exactly the same way as the original authors. So, the more documentation (textual description, code, and data) the original investigators release, the easier it is for independent investigators to reproduce the reported results.
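The comparison scenario above can be illustrated with a minimal sketch in Python using scikit-learn. The choice of library, dataset (scikit-learn's bundled digits data), models, split, and seeds is an assumption made purely for illustration and is not the setup used in our experiment:

# Minimal sketch of the comparison scenario described above: two algorithms
# ("A" and "B") trained and tested on the same dataset and compared on one
# performance metric. The models, split, and seeds are illustrative assumptions.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X, y = load_digits(return_X_y=True)

# Fixing the split and the seeds documents part of the experiment, but the
# outcome can still vary with library versions, hardware, and ancillary software.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

algorithm_a = LogisticRegression(max_iter=1000, random_state=0)          # "our" algorithm A
algorithm_b = RandomForestClassifier(n_estimators=100, random_state=0)   # third-party algorithm B

for name, model in [("A", algorithm_a), ("B", algorithm_b)]:
    model.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))
    print(f"Algorithm {name}: accuracy = {accuracy:.4f}")

Even with the split and seeds fixed as in this sketch, the reported accuracies can still depend on library versions, hardware, and other ancillary software, which is the kind of variation examined in the remainder of this paper.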