Abstract
Modern exact algorithms for structure learning in Bayesian networks first compute an exact local score for every candidate parent set, and then find a network structure by combinatorial optimization so as to maximize the global score. This approach assumes that each local score can be computed quickly, which becomes problematic when the scarcity of the data calls for structured local models, or when there are both continuous and discrete variables, since these cases have lacked efficiently computable local scores. To address this challenge, we introduce a local score based on a class of classification and regression trees. We show that, under modest restrictions on the possible branchings in the tree structure, it is feasible to find a structure maximizing a Bayes score in a range of moderate-size problem instances. In particular, this enables global optimization of the Bayesian network structure, including the local structure. In addition, we introduce a related model class that extends ordinary conditional probability tables to continuous variables by employing an adaptive discretization approach. The two model classes are compared empirically by learning Bayesian networks from benchmark real-world and synthetic data sets. We discuss the relative strengths of the model classes in terms of their structure learning capability, predictive performance, and running time.
Introduction
Bayesian networks (BNs) compose a multivariate distribution as a product of univariate conditional probability distributions (CPDs). The potential complexity of the local CPDs, together with the global acyclicity constraint of the BN structure, makes the task of learning BNs from data challenging in many ways. Even if we assume that the available data contain neither unobserved variables nor missing entries, as we do throughout this article, the following three concerns often remain:

(i) Statistical efficiency: the data may be scarce, rendering simple tabular parameterizations of the CPDs, such as conditional probability tables (CPTs), statistically inefficient.

(ii) Heterogeneous variables: so-called hybrid BNs contain both discrete and continuous variables, again ruling out the most convenient and standard parameterizations of CPDs.

(iii) Computational efficiency: the most natural formulations of the learning problem are computationally hard, which limits both the dimensions of the data and the complexity of the CPDs relative to what can be solved under useful quality guarantees.

Each of these issues (i–iii) has been addressed in the literature, as we describe in the next paragraphs. We are, however, not aware of any prior work that simultaneously addresses all three concerns.
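To fix notation (ours, not necessarily that of the later sections), a BN with directed acyclic graph (DAG) $G$ over variables $X_1, \dots, X_n$ encodes the factorization

\[
  p(x_1, \dots, x_n) \;=\; \prod_{i=1}^{n} p\bigl(x_i \mid x_{\mathrm{Pa}_G(i)}\bigr),
\]

where $\mathrm{Pa}_G(i)$ denotes the parent set of $X_i$ in $G$. The exact score-based approach sketched in the abstract exploits the matching decomposition of the global score: a local score $s_i$ is precomputed for every candidate parent set of each $X_i$, after which structure learning becomes the combinatorial optimization problem

\[
  G^{\star} \;=\; \operatorname*{arg\,max}_{G} \; \sum_{i=1}^{n} s_i\bigl(\mathrm{Pa}_G(i)\bigr),
\]

the maximization ranging over all DAGs on the $n$ variables.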