Adaptive Thresholding in Structure Learning of a Bayesian Network
Boaz Lerner, Michal Afek, Rafi Bojmel
Ben-Gurion University of the Negev, Israel
boaz@bgu.ac.il; {caspimic,rafibojmel}@gmail.com
Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence

Abstract
Thresholding a measure in conditional
independence (CI) tests using a fixed value enables
learning and removing edges as part of learning a
Bayesian network structure. However, the learned
structure is sensitive to the threshold that is
commonly selected: 1) arbitrarily; 2) irrespective
of characteristics of the domain; and 3) fixed for all
CI tests. We analyze the impact on mutual
information – a CI measure – of factors, such as
sample size, degree of variable dependence, and
variables’ cardinalities. Following this analysis, we
suggest adaptively thresholding individual tests based on
these factors. We show that adaptive thresholds better
distinguish between pairs of dependent variables
and pairs of independent variables and enable
learning structures more accurately and quickly
than when using fixed thresholds.
1 Introduction
Constraint-based (CB) structure-learning algorithms of
Bayesian networks (BN) use conditional independence (CI)
tests to threshold information-based measures, such as
mutual information (MI), or statistics, such as X2 or G. CB
algorithms are popular, but arbitrariness in CI testing
increases the chances of learning redundant edges, removing
true edges, and orienting edges wrongly or not at all, and
thus learning an erroneous structure [10, 14].
Arbitrariness is a result of two main sources: 1)
inaccurate estimation of the CI measure, especially using
small samples and large cardinalities of the conditioning set
and the involved variables (henceforth “cardinality”), and 2)
uneducated selection of the threshold on the measure.
As the degrees of freedom (df) for the X² statistic and the
cardinality for conditional MI (CMI) increase, the statistic
and CMI increase, respectively, because both sum increasing
numbers of non-negative terms. As the sample size increases,
the statistic increases and the bias between the estimated
and real MI decreases [32]. Compared to X², CMI has no
means, such as df, to adapt to an increase in cardinality,
and neither measure has a means to adapt to the sample size.
So far, the analysis of cardinality (or df) and sample size
(henceforth “factors”), as sources of arbitrariness in
estimating CI measures, has been limited [10, 12, 14, 15, 31,
33], yielding no practical ways to tackle this arbitrariness.
Regarding the second main source of arbitrariness –
threshold selection – a wrong threshold influences structure
learning [2, 12, 14, 20]. Commonly, a threshold is selected
according to three main methods: as a user default [10, 29,
31, 33], based on limited trial and error experimentation [5],
or automatically as the threshold, from a range of
candidates, that maximizes (or minimizes) a performance
measure [9, 14, 20, 23, 35]. The first method provides
simplicity and low runtime but cannot guarantee satisfactory
performance for most problems. The second method avoids
exhaustive experimentation by using a limited number of
candidates, but it is manual and incomplete. The third
method may find a globally optimal threshold but may also
suffer from a long runtime.
A threshold is usually not only selected arbitrarily but
also irrespective of the test factors. While the critical
value χ²_{df,α} increases with df, it is indifferent to the
sample size. The CMI threshold is indifferent to all factors.
To the best of our knowledge, no method for setting a CMI
threshold that depends on the factors has ever been offered,
nor has any proposal been made to adapt a threshold to the
factors of each individual CMI test.
We suggest adaptive thresholding for individual CI tests
based on the test factors and apply this concept to CMI-based
CI testing. Statistical tests are well established and
consistent, but compared with CMI tests they can only reject
independence, not confirm dependence, and they lose accuracy
due to multiple comparisons. Section 2 reviews common tests
and motivates adaptive thresholding.
Section 3 suggests adaptive thresholds that depend on the
test factors. Section 4 shows that CMI-based adaptive
thresholds: 1) follow MI’s sensitivity to the factors; 2)
distinguish pairs of dependent from pairs of independent
variables better than fixed thresholds; 3) enable learning
structures that are significantly more accurate than when
using fixed thresholds; 4) reduce complexity and expedite
learning; and 5) allow learning structures that fit and
classify data more accurately. Conclusions are in Section 5.
2 Background
A BN is a graphical model that describes independence
relations between variables. Its structure is a directed acyclic
graph (DAG) composed of nodes representing variables and
directed edges that connect nodes and probabilistically
quantify their connection [21, 28, 31].
Learning a BN structure is NP-hard [12], and we focus on
the CB approach to structure learning [9, 10, 20, 28, 31, 35].
First, CB methods (starting from a complete graph) test
pairs of variables to determine whether edges should be
removed. If a statistical test is used, the null hypothesis is
that the two variables are (conditionally) independent. If the
p-value, corresponding to the probability of obtaining a
value as extreme as the one observed, given that the null is
true, is less than the significance level, α, the null is rejected
[25]. If a CMI test is used [13], two (conditionally)
independent variables are assumed (for an infinite sample
size) to share a zero CMI [30], hence we test them (for a
finite sample size) against a small threshold [10]. Failing
the test (CMI) or failing to reject the null (X² or G)
indicates node (conditional) independence, which yields edge
removal.
By performing tests for conditioning sets of different
sizes, we form an undirected graph [10, 29, 31, 35] that is
directed using orientation rules [17, 26, 27]. The result is a
completed partial DAG (CPDAG) that shares the same
oriented edges with the true DAG and uses undirected edges
to represent statistically indistinguishable connections in the
true graph [11, 31]. We concentrate on ways to improve CI
testing in domains with discrete variables and complete
data.
2.1 Common CI Tests
Common ways of CI testing are thresholding CMI or a
statistic that measures statistical independence between
variables (as in Pearson’s chi-square test or the
likelihood-ratio G-test).
Mutual information MI between variables X and Y
measures the amount of information shared between these
variables. It also measures how much uncertainty about Y
decreases when X is observed (and vice versa) [13]
$$MI(X;Y) = \sum_{x \in X}\sum_{y \in Y} P(x,y)\,\log_2\frac{P(x,y)}{P(x)\,P(y)} \qquad (1)$$
MI is the KL divergence between P(X,Y) and P(X)P(Y) [13],
measuring how much the joint distribution differs from the
product of the marginals, or how much the variables can be
considered not independent. CMI between X and Y measures the
information flow between X and Y given a conditioning set S
$$CMI(X;Y|S) = \sum_{s \in S}\sum_{x \in X}\sum_{y \in Y} P(x,y,s)\,\log_2\frac{P(x,y|s)}{P(x|s)\,P(y|s)} \qquad (2)$$
By definition, MI(X;Y) and CMI(X;Y|S) are non-negative.
MI(X;Y) = 0 (CMI(X;Y|S) = 0) iff X and Y are independent
(given S). The true MI is unknown, and the estimated M̂I is
larger than MI [32], and thus for independent variables
larger than 0. Practically, M̂I is compared to a small
threshold, ε, to distinguish pairs of dependent from pairs of
independent variables [2, 6, 9, 10]. If M̂I(X;Y) ≤ ε, X and Y
are regarded as independent, and the edge connecting them is
removed. The test for CI using CMI is similar.
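To make the decision rule concrete, here is a minimal sketch (not the authors’ code) of an MI-based marginal test for two discrete samples; the helper names `mi_hat` and `independent` are ours:

```python
import math
from collections import Counter

def mi_hat(xs, ys):
    """Empirical mutual information (in bits) between two discrete samples."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))           # joint counts O_xy
    px, py = Counter(xs), Counter(ys)    # marginal counts O_x, O_y
    # MI-hat = sum_xy (O_xy/n) * log2( (O_xy/n) / ((O_x/n)(O_y/n)) )
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def independent(xs, ys, eps):
    """Edge-removal rule: declare X, Y independent if MI-hat <= eps."""
    return mi_hat(xs, ys) <= eps
```

For fully dependent binary samples `mi_hat` returns 1 bit; for unrelated samples it approaches 0 only as the sample grows, which is exactly why a threshold ε is needed.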
Pearson’s chi-square and G test Statistical tests compare
the null hypothesis that two variables are independent with
the alternative hypothesis. If the null is rejected (cannot be
rejected), the edge is learned (removed). A statistic, which is
asymptotically chi-square distributed, is calculated and
compared to a critical value. If it is greater (smaller) than
the critical value, the null is rejected (cannot be rejected) [1,
31]. In Pearson’s chi-square test, the statistic X2st is
$$X^2_{st} = \sum_{x=1}^{|X|}\sum_{y=1}^{|Y|}\frac{(O_{xy}-E_{xy})^2}{E_{xy}} \qquad (3)$$
where Oxy (Exy) is the number of records (expected to be if
the null was correct) for which X = x, Y = y, and |X| and |Y|
are the corresponding cardinalities. If the null is correct,
P(x,y) = P(x)P(y), and we expect that E_{xy} = E_x·E_y/N,
where E_x and E_y are the numbers of records in which X = x
and Y = y, respectively, and N is the total number of records.
If X²st is larger than a critical value χ²_{df,α} for a
significance level α and df = (|X|−1)(|Y|−1) degrees of
freedom, then we reject the null.
Instead, based on maximum likelihood, if the statistic Gst
$$G_{st} = 2\sum_{x=1}^{|X|}\sum_{y=1}^{|Y|} O_{xy}\,\ln\frac{O_{xy}}{E_{xy}} \qquad (4)$$
is larger than the same critical value χ²_{df,α}, then we
reject the null.
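Both statistics above can be sketched in a few lines, using E_{xy} = E_x·E_y/N as defined for (3); the function name is ours:

```python
import math
from collections import Counter

def chi2_and_g(xs, ys):
    """Pearson X^2 and likelihood-ratio G statistics for two discrete samples."""
    n = len(xs)
    oxy = Counter(zip(xs, ys))
    ox, oy = Counter(xs), Counter(ys)
    x2 = g = 0.0
    for x in ox:
        for y in oy:
            e = ox[x] * oy[y] / n          # E_xy under the null
            o = oxy.get((x, y), 0)
            x2 += (o - e) ** 2 / e         # Pearson term of (3)
            if o > 0:
                g += 2.0 * o * math.log(o / e)  # G term of (4), natural log
    return x2, g
```

For fully dependent binary samples of size N, this yields G = 2N·ln2, i.e., 2N·ln2 times an M̂I of 1 bit, consistent with the G–MI relation used in Section 3.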
2.2 Why Adaptive Thresholding?
Too high a threshold applied in CI testing leads to many
unjustified edge removals and thus a sparse network that
represents the domain partially. Too low a threshold yields
erroneous learning of too many edges and thus a dense
network that overfits the data and requires high-order tests,
which decrease reliability and increase complexity and
runtime. Clearly, both networks are wrong. Not only should a
threshold not be set arbitrarily, but it should also be
sensitive to the factors (cardinalities and sample size) that
affect the measure. In this study, we concentrate on, and
analyze, the sensitivity of the MI measure to the factors.
MI sensitivity to sample size If the sample were unlimited,
M̂I would estimate MI accurately, M̂I of a pair of
independent variables would be 0, and a threshold would be
redundant. But the sample size is finite and affects M̂I [32,
33]. As the sample size decreases, the bias between M̂I and
MI increases, and M̂I for independent variables increases
from 0. It then becomes harder to find a threshold that
distinguishes between dependent and independent pairs.
Moreover, the sample size changes with the domain.
MI sensitivity to cardinality Asymptotically for
independent variables, MI(X;Y) = 0, and there is no
dependence on variable cardinality. However, cardinality
affects M̂I [18, 32, 34]. Since MI(X;Y) = H(X) − H(X|Y), we
can write MI(X;Y) ≤ min{H(X), H(Y)}. For totally dependent
variables (i.e., variables that take the same values),
MI(X;Y) = H(X) = H(Y). As variable cardinality increases,
variable entropy increases (due to increased uncertainty), as
does its MI with other variables (e.g., for a uniform
distribution, the entropy is the log of the cardinality [13]).
The statistical tests (3) and (4) also depend on the
cardinality (df). Naturally, the cardinality changes with the
CI test, and so should the threshold.
3 Adaptive Thresholds
Focusing on MI, our adaptive thresholds change their values
among CI tests and, similar to MI, increase with the
cardinality and decrease with the sample size.
3.1 MI Threshold Candidates
C1 threshold A relation between MI and M̂I is [32]

$$\widehat{MI} = MI + \frac{(|X|-1)(|Y|-1)}{2N\ln 2} + \sum_{j \ge 2}\frac{C_j}{N^j} \qquad (5)$$

where C_j is a non-converging series of coefficients. The
term (|X|−1)(|Y|−1)/(2N·ln2) approximates the bias M̂I − MI
[32] when the sample size N is large. Thus, to separate
between independent variables (MI = 0) and dependent
variables (MI > 0), we suggest an adaptive threshold,

$$\varepsilon_{C1} = \frac{(|X|-1)(|Y|-1)}{2N\ln 2} \qquad (6)$$
S1 threshold Since G_st and M̂I are related [7],

$$G_{st} = 2N\,\widehat{MI}(X;Y) \qquad (7)$$

we suggest a second adaptive threshold, which is (see (4))

$$\varepsilon_{S1} = \frac{\chi^2_{df,\alpha}}{2N} \qquad (8)$$

S2 threshold Since (7) is true when M̂I is calculated using
the natural logarithm, whereas MI in (1) is calculated using
base 2, we propose S2 by changing bases for S1, i.e.,

$$\varepsilon_{S2} = \frac{\chi^2_{df,\alpha}}{2N\ln 2} \qquad (9)$$
Between the thresholds All thresholds decrease with N,
as desired. S2 is larger than S1 by a factor of 1/ln2 and
thus increases with cardinality faster than S1. C1’s
numerator equals the expected value of a chi-square
distributed variable, which is the variable’s number of df
[1], whereas S2’s numerator is the chi-square critical value;
thus, the latter is higher than the former. Both S1’s
numerator and denominator are larger than C1’s numerator and
denominator, and S1 is larger than C1 if χ²_{df,α} > df/ln2.
Usually (α = 0.05), this happens for df ≤ 35, which is common.
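The three candidate thresholds can be sketched as follows. Since the exact chi-square quantile requires a statistics library, this sketch substitutes the Wilson–Hilferty approximation for χ²_{df,α} (z = 1.6449 is the upper-tail normal quantile for α = 0.05) — an assumption of ours, not part of the paper:

```python
import math

def chi2_crit(df, z=1.6449):
    """Wilson-Hilferty approximation to the critical value chi^2_{df,alpha};
    z is the upper-tail standard-normal quantile (1.6449 for alpha = 0.05)."""
    return df * (1.0 - 2.0 / (9.0 * df) + z * math.sqrt(2.0 / (9.0 * df))) ** 3

def c1(cx, cy, n):
    """C1 threshold (6): df / (2 N ln 2), the first-order bias of MI-hat."""
    return (cx - 1) * (cy - 1) / (2.0 * n * math.log(2))

def s1(cx, cy, n, z=1.6449):
    """S1 threshold (8): chi^2_{df,alpha} / (2 N)."""
    return chi2_crit((cx - 1) * (cy - 1), z) / (2.0 * n)

def s2(cx, cy, n, z=1.6449):
    """S2 threshold (9): chi^2_{df,alpha} / (2 N ln 2)."""
    return s1(cx, cy, n, z) / math.log(2)
```

With |X| = |Y| = 3 and N = 500, this gives S2 > S1 > C1, matching the ordering observed in Figure 1 (df = 4 is well below the ≈35 crossover); all three shrink as N grows.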
3.2 Threshold Extension to CMI
Extending the thresholds from MI to CMI, we account for
the increase in df due to the cardinalities of all variables
in the conditioning set S. Thus, when conditionally testing
two variables, the df in the adaptive thresholds is corrected
by [31, 33]

$$df = (|X|-1)(|Y|-1)\prod_{S_i \in S}|S_i| \qquad (10)$$
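The df correction for conditioning sets is a one-line computation; a sketch (function name ours):

```python
def cmi_df(card_x, card_y, cond_cards):
    """Degrees of freedom for a CMI test of X and Y given a conditioning
    set S: (|X|-1)(|Y|-1) times the product of the cardinalities of S."""
    df = (card_x - 1) * (card_y - 1)
    for c in cond_cards:
        df *= c
    return df
```

For example, |X| = 2, |Y| = 3, and S with cardinalities {2, 4} gives 1·2·8 = 16; an empty conditioning set recovers the marginal df.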
3.3 Complexity
Complexities of the CMI, X²st, and G_st measures are O(c³n)
for n variables and maximal variable cardinality c. The
complexity of a fixed threshold is O(1), and that of an
adaptive threshold (and of χ²_{df,α} for thresholding X²st or
G_st) is O(n); hence, adaptive thresholding does not add to
the complexity.
4 Empirical Evaluation
In six experiments, we compared adaptive to fixed
thresholds. We used ANOVA with a 0.05 significance level
to evaluate the significance of the results. ANOVA examines
[3] whether the effect of a factor (e.g., sample size, variable
cardinality) on the dependent variable (e.g., CMI value) is
significant. The null hypothesis is that all levels of a
factor perform similarly and the observed differences are
merely random; thus, the effect is insignificant. ANOVA
divides the total
variability into that caused by each of the factors and the
residual (error) variability. If the factor variability is
considerably larger than the error variability, the null is
rejected, and we conclude that some factor has a significant
effect on the dependent variable. Then, we find which factor
makes the effect using a post hoc test (Tukey-HSD) [16].
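The ANOVA logic described here reduces to comparing between-group and within-group variability; a minimal one-way sketch (the paper itself uses a three-way ANOVA with Tukey-HSD from standard statistical software):

```python
def anova_f(groups):
    """One-way ANOVA F statistic: mean square between groups over
    mean square within groups (the residual/error variability)."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ssb = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ssw = sum(sum((v - m) ** 2 for v in g) for g, m in zip(groups, means))
    return (ssb / (k - 1)) / (ssw / (n - k))
```

An F value large relative to the F-distribution critical value rejects the null that all group means are equal.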
4.1 Experiment 1
To learn about the suitability of the adaptive thresholds to
MI-based CI testing, we sampled datasets of 500, 1,000, and
5,000 records from the Alarm, Insurance, and Child
networks [4]. For every dataset and pair of variables, we
plotted M̂I against the thresholds’ values, as computed by
(6), (8), and (9). We plotted for increasing products of the
cardinality values (|X|·|Y|). For each product, we presented
the thresholds and M̂I for all pairs of variables with the
same product, along with the average M̂I. Figure 1, for Alarm
and 500 records, validates the analysis in Section 3.1, by
which S2 is larger than S1, which is larger than C1. The
analysis for all networks and dataset sizes reveals that M̂I
increases as the dataset size decreases, but it is not yet
clear whether the increase with cardinality is due to the
cardinality itself or to high dependence between tested
variables with high cardinality.
Figure 1. MI and thresholds’ values (Alarm, 500 records).
4.2 Experiment 2
To understand the joint effect on M̂I of cardinality, sample
size, and the independence level between the variables, we
generated 48 synthetic datasets. They describe, for two
variables connected by a directed edge, all combinations of:
1) |X1| = 2, 5; 2) |X2| = 3, 6; 3) 500, 1,000, and 5,000
records; and 4) four variable independence levels:
independent, “quite” independent, “quite” dependent, and
dependent.
To create independent variables, each variable was
generated based on a different sequence of random numbers
drawn from U(0,1). For example, if |X1|=2, and the first
sampled random number from the first sequence was lower
than 0.5, then X1’s value in the first record was 1, otherwise
it was 2. If |X2|=3, and the first sampled random number
from the second sequence was lower than 0.33, then X2’s
value in the first record was 1, otherwise, if it was lower
than 0.66, then X2’s value in the first record was 2,
otherwise 3. In practice, N random numbers were sampled
to represent each variable; hence, the variables are fully
independent [24]. To create “quite” independent variables,
66.6% of the variables’ values were generated based on
different sequences of random numbers (as for independent
variables) and 33.3% of the values based on the same
sequence of random numbers for both X1 and X2. To create
“quite” dependent variables, 33.3% of the variables’ values
were generated based on different sequences and 66.6% of
the variables’ values based on the same sequence for both
X1 and X2. To create dependent variables, the variables were
sampled based on the same sequence of random numbers,
i.e., each random number in the sequence simultaneously
determines a specific value for both X1 and X2.
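The generation procedure above can be sketched as follows; instead of switching between two fixed sequences per block of records, this sketch shares the underlying uniform draw with the stated probability per record, an approximation of ours rather than the authors’ exact procedure (values also start at 0 rather than 1):

```python
import random

def sample_pair(n, card1, card2, shared_frac, seed=0):
    """Sample n records of (X1, X2). A record uses the same U(0,1) draw for
    both variables with probability shared_frac (0.0 -> independent,
    1.0 -> fully dependent), mimicking the four independence levels."""
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        u1 = rng.random()
        u2 = u1 if rng.random() < shared_frac else rng.random()
        # A uniform draw u maps to value int(u * cardinality): equal-width bins
        records.append((int(u1 * card1), int(u2 * card2)))
    return records
```

With shared_frac = 1.0 and equal cardinalities, the two columns coincide, giving fully dependent variables; 0.33 and 0.66 approximate the two “quite” levels.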
Figure 2. Average M̂I values as a function of (a) independence
level, (b) dataset size, and (c) cardinality.
Each of the 48 datasets was randomly replicated 30 times.
For every dataset replication, the M̂I value between the two
variables was calculated. The effects of the factors on M̂I
were analyzed using a three-way ANOVA test (the three
examined factors were independence level, dataset size, and
cardinality). It was followed by a Tukey-HSD post hoc
analysis [16] whenever significance was found by ANOVA [3].
Independence level As expected, M̂I increases with the
degree of dependence (Figure 2(a)), and this degree was found
significant to M̂I in the statistical analysis (p-value = 0.00).
Dataset size Figure 2(b) shows that M̂I decreases with the
dataset size, and the size was also found significant.
Cardinality M̂I increases with cardinality (Figure 2(c)).
This factor was also found statistically significant.
Figure 3. Average M̂I values as a function of cardinality and
independence level.
Figure 3 shows results for the interaction between
cardinality and independence level, which is the most
interesting one, because we expect that when the variables
are dependent, the effect of the cardinality will be stronger
than when the variables are independent. Figure 3 shows
that M̂I increases faster with cardinality when the variables
are more dependent. This interaction was found significant
(p-value = 0.00), as were the interactions between
cardinality and dataset size and between independence level
and dataset size. Overall, the
results of this experiment encourage the use of thresholds
that are adaptive to the sample size and cardinality.
4.3 Experiment 3
We evaluated the adaptive thresholds in distinguishing pairs
of independent variables from pairs of dependent variables
using the Alarm, Insurance, Child, Mildew, Asia, and Hail-
Finder networks [4]. Samples for each network included
500, 1,000, and 5,000 records. For every pair of variables,
we calculated M̂I and each of the adaptive thresholds. If the
M̂I value was higher than a tested threshold, we decided
that, based on the data, the variables are dependent;
otherwise, we decided they are independent. By examining
d-separation
between all variable pairs (for conditioning sets of 0 order)
using the true network, we could test these decisions for all
variable pairs and compute the accuracy of the decisions for
each adaptive threshold. We repeated this procedure with
two common, fixed thresholds, 10^-2 and 10^-3 [2, 10, 33, 35],
and computed an average accuracy (weighted by the number
of pairs tested for each network) for each threshold. The
results show that S1 (α = 0.01, 0.05) and S2 (α = 0.01) are the
best thresholds and the fixed thresholds (and C1) are the
worst. ANOVA showed that threshold type significantly
affects the percentage of correct decisions (p-value=0.00).
Table 1 presents a Tukey-HSD post hoc analysis by which
S1 (α = 0.01, 0.05) and S2 (α = 0.01) make a subgroup of
thresholds that distinguish between pairs of independent
variables and pairs of dependent variables most accurately,
and 10^-3 is the least accurate threshold.
Threshold    T       Subset
                     1       2       3       4       5
10^-3        10110   .5934
C1           10110           .6436
10^-2        10110           .6624   .6624
S1 (0.1)     10110                   .6693   .6693
S2 (0.01)    10110                   .6816   .6816   .6816
S1 (0.05)    10110                           .6843   .6843
S1 (0.01)    10110                                   .6920

Table 1. Homogeneous subsets of the accuracy of the thresholds in
distinguishing pairs of independent from pairs of dependent
variables. The number of observations (T) is the number of tested
pairs in 6 networks (e.g., 666 for Alarm) and 3 dataset sizes.
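The accuracy computation in this experiment reduces to comparing thresholded decisions against d-separation ground truth; a sketch (names ours):

```python
def decision_accuracy(mi_values, truly_dependent, eps):
    """Fraction of variable pairs whose decision (MI-hat > eps means
    'dependent') agrees with d-separation in the true network."""
    correct = sum((mi > eps) == dep
                  for mi, dep in zip(mi_values, truly_dependent))
    return correct / len(mi_values)
```

Sweeping eps over the fixed and adaptive candidate values reproduces the kind of comparison summarized in Table 1.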
4.4 Experiment 4
We checked the effects on M̂I of the number of variables in
the conditioning set and their cardinalities.
In Experiment A, we generated a serial connection,
X1→X2→X3, with cardinalities |X1| = |X3| = 2 and |X2| =