Adaptive Thresholding in Structure Learning of a Bayesian Network
Boaz Lerner, Michal Afek, Rafi Bojmel
Ben-Gurion University of the Negev, Israel
boaz@bgu.ac.il; {caspimic,rafibojmel}@gmail.com
Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence

Abstract
Thresholding a measure in conditional
independence (CI) tests using a fixed value enables
learning and removing edges as part of learning a
Bayesian network structure. However, the learned
structure is sensitive to the threshold that is
commonly selected: 1) arbitrarily; 2) irrespective
of characteristics of the domain; and 3) fixed for all
CI tests. We analyze the impact on mutual
information – a CI measure – of factors, such as
sample size, degree of variable dependence, and
variables’ cardinalities. Following this analysis, we
suggest adaptively thresholding individual tests based on
these factors. We show that adaptive thresholds better
distinguish between pairs of dependent variables
and pairs of independent variables and enable
learning structures more accurately and quickly
than when using fixed thresholds.
1 Introduction
Constraint-based (CB) structure-learning algorithms of
Bayesian networks (BN) use conditional independence (CI)
tests to threshold information-based measures, such as
mutual information (MI), or statistics, such as X2 or G. CB
algorithms are popular, but arbitrariness in CI testing
increases the chances of learning redundant edges, removing
true edges, and orienting edges wrongly or not at all, and
thus learning an erroneous structure [10, 14].
Arbitrariness is a result of two main sources: 1)
inaccurate estimation of the CI measure, especially using
small samples and large cardinalities of the conditioning set
and the involved variables (henceforth “cardinality”), and 2)
uneducated selection of the threshold on the measure.
As the degrees of freedom (df) for the X² statistic and the
cardinality for conditional MI (CMI) increase, the statistic
and CMI increase, respectively, because both sum increasing
numbers of non-negative terms. As the sample size increases,
the statistic increases and the bias between the estimated
and real MI decreases [32]. Compared to X², CMI has no
means, such as df, to adapt to an increase in cardinality,
and neither measure has a means to adapt to the sample size.
So far, the analysis of cardinality (or df) and sample size
(henceforth “factors”), as sources of arbitrariness in
estimating CI measures, has been limited [10, 12, 14, 15, 31,
33], yielding no practical ways to tackle this arbitrariness.
Regarding the second main source of arbitrariness –
threshold selection – a wrong threshold influences structure
learning [2, 12, 14, 20]. Commonly, a threshold is selected
according to three main methods: as a user default [10, 29,
31, 33], based on limited trial and error experimentation [5],
or automatically as the threshold, from a range of
candidates, that maximizes (or minimizes) a performance
measure [9, 14, 20, 23, 35]. The first method provides
simplicity and low runtime but cannot guarantee satisfactory
performance for most problems. The second method avoids
exhaustive experimentation by using a limited number of
candidates, but it is manual and incomplete. The third
method may find a globally optimal threshold but may also
suffer from a long runtime.
A threshold is usually not only selected arbitrarily but
also irrespective of the test factors. While the critical
value χ²_{df,α} increases with df, it is indifferent to the
sample size. The CMI threshold is indifferent to all factors.
To the best of our knowledge, no method for setting a CMI
threshold that depends on the factors has ever been offered,
nor has any proposal been made to adapt a threshold to the
factors of each individual CMI test.
We suggest adaptive thresholding for individual CI tests
based on the test factors and apply this concept to CMI-based
CI testing. Statistical tests are well established and
consistent, but compared with CMI tests they can only reject
independence, not confirm dependence, and they lose accuracy
due to multiple comparisons. Section 2 reviews common tests
and motivates adaptive thresholding.
Section 3 suggests adaptive thresholds that depend on the
test factors. Section 4 shows that CMI-based adaptive
thresholds: 1) follow MI’s sensitivity to the factors; 2)
distinguish pairs of dependent from pairs of independent
variables better than fixed thresholds; 3) enable learning
structures that are significantly more accurate than when
using fixed thresholds; 4) reduce complexity and expedite
learning; and 5) allow learning structures that fit and
classify data more accurately. Conclusions are in Section 5.
2 Background
A BN is a graphical model that describes independence
relations between variables. Its structure is a directed acyclic
graph (DAG) composed of nodes representing variables and
directed edges that connect nodes and probabilistically
quantify their connection [21, 28, 31].
Learning a BN structure is NP-hard [12], and we focus on
the CB approach to structure learning [9, 10, 20, 28, 31, 35].
First, CB methods (starting from a complete graph) test
pairs of variables to determine whether edges should be
removed. If a statistical test is used, the null hypothesis is
that the two variables are (conditionally) independent. If the
p-value, corresponding to the probability of obtaining a
value as extreme as the one observed, given that the null is
true, is less than the significance level, α, the null is rejected
[25]. If a CMI test is used [13], two (conditionally)
independent variables are assumed (for an infinite sample
size) to share a zero CMI [30], hence we test them (for a
finite sample size) against a small threshold [10]. Failing
the test (CMI) or failing to reject the null (X² or G)
indicates node (conditional) independence, which yields edge
removal.
By performing tests for conditioning sets of different
sizes, we form an undirected graph [10, 29, 31, 35] that is
directed using orientation rules [17, 26, 27]. The result is a
completed partial DAG (CPDAG) that shares the same
oriented edges with the true DAG and uses undirected edges
to represent statistically indistinguishable connections in the
true graph [11, 31]. We concentrate on ways to improve CI
testing in domains with discrete variables and complete
data.
2.1 Common CI Tests
Common ways of CI testing are thresholding CMI or a
statistic that measures statistical independence between
variables (as in Pearson’s chi-square test or the
likelihood-ratio G-test).
Mutual information MI between variables X and Y
measures the amount of information shared between these
variables. It also measures how much uncertainty about Y
decreases when X is observed (and vice versa) [13]
$$MI(X;Y) = \sum_{x \in X}\sum_{y \in Y} P(x,y)\,\log_2\frac{P(x,y)}{P(x)\,P(y)} \qquad (1)$$
MI is the KL divergence between P(X,Y) and P(X)P(Y) [13],
measuring how much the joint distribution differs from the
product of the marginals, or how much the variables can be
considered not independent. CMI between X and Y measures the
information flow between X and Y given a conditioning set S
$$CMI(X;Y|S) = \sum_{s \in S}\sum_{x \in X}\sum_{y \in Y} P(x,y,s)\,\log_2\frac{P(x,y|s)}{P(x|s)\,P(y|s)} \qquad (2)$$
By definition, MI(X;Y) and CMI(X;Y|S) are non-negative.
MI(X;Y) = 0 (CMI(X;Y|S) = 0) iff X and Y are independent
(given S). The true MI is unknown, and the estimated M̂I is
larger than MI [32], and thus for independent variables
larger than 0. Practically, M̂I is compared to a small
threshold, ε, to distinguish pairs of dependent from pairs of
independent variables [2, 6, 9, 10]. If M̂I(X;Y) ≤ ε, X and Y
are regarded as independent, and the edge connecting them is
removed. The test for CI using CMI is similar.
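To make the decision rule concrete, here is a minimal sketch (not the authors’ code) of an MI-based marginal test for two discrete samples; the helper names `mi_hat` and `independent` are ours:

```python
import math
from collections import Counter

def mi_hat(xs, ys):
    """Empirical mutual information (in bits) between two discrete samples."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))           # joint counts O_xy
    px, py = Counter(xs), Counter(ys)    # marginal counts O_x, O_y
    # MI-hat = sum_xy (O_xy/n) * log2( (O_xy/n) / ((O_x/n)(O_y/n)) )
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def independent(xs, ys, eps):
    """Edge-removal rule: declare X, Y independent if MI-hat <= eps."""
    return mi_hat(xs, ys) <= eps
```

For fully dependent binary samples `mi_hat` returns 1 bit; for unrelated samples it approaches 0 only as the sample grows, which is exactly why a threshold ε is needed.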
Pearson’s chi-square and G test Statistical tests compare
the null hypothesis that two variables are independent with
the alternative hypothesis. If the null is rejected (cannot be
rejected), the edge is learned (removed). A statistic, which is
asymptotically chi-square distributed, is calculated and
compared to a critical value. If it is greater (smaller) than
the critical value, the null is rejected (cannot be rejected) [1,
31]. In Pearson’s chi-square test, the statistic X2st is
$$X^2_{st} = \sum_{x=1}^{|X|}\sum_{y=1}^{|Y|}\frac{(O_{xy}-E_{xy})^2}{E_{xy}} \qquad (3)$$
where Oxy (Exy) is the number of records (expected to be if
the null was correct) for which X = x, Y = y, and |X| and |Y|
are the corresponding cardinalities. If the null is correct,
P(x,y) = P(x)P(y), and we expect that E_{xy} = E_x·E_y/N,
where E_x and E_y are the numbers of records in which X = x
and Y = y, respectively, and N is the total number of records.
If X²st is larger than a critical value χ²_{df,α} for a
significance level α and df = (|X|−1)(|Y|−1) degrees of
freedom, then we reject the null.
Instead, based on maximum likelihood, if the statistic Gst
$$G_{st} = 2\sum_{x=1}^{|X|}\sum_{y=1}^{|Y|} O_{xy}\,\ln\frac{O_{xy}}{E_{xy}} \qquad (4)$$
is larger than the same critical value χ²_{df,α}, then we
reject the null.
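Both statistics above can be sketched in a few lines, using E_{xy} = E_x·E_y/N as defined for (3); the function name is ours:

```python
import math
from collections import Counter

def chi2_and_g(xs, ys):
    """Pearson X^2 and likelihood-ratio G statistics for two discrete samples."""
    n = len(xs)
    oxy = Counter(zip(xs, ys))
    ox, oy = Counter(xs), Counter(ys)
    x2 = g = 0.0
    for x in ox:
        for y in oy:
            e = ox[x] * oy[y] / n          # E_xy under the null
            o = oxy.get((x, y), 0)
            x2 += (o - e) ** 2 / e         # Pearson term of (3)
            if o > 0:
                g += 2.0 * o * math.log(o / e)  # G term of (4), natural log
    return x2, g
```

For fully dependent binary samples of size N, this yields G = 2N·ln2, i.e., 2N·ln2 times an M̂I of 1 bit, consistent with the G–MI relation used in Section 3.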
2.2 Why Adaptive Thresholding?
Too high a threshold applied in CI testing leads to many
unjustified edge removals and thus a sparse network that
represents the domain partially. Too low a threshold yields
erroneous learning of too many edges and thus a dense
network that overfits the data and requires high-order tests,
which decrease reliability and increase complexity and
runtime. Clearly, both networks are wrong. Not only should a
threshold not be set arbitrarily, but it should also be
sensitive to the factors (cardinalities and sample size) that
affect the measure. In this study, we concentrate on, and
analyze, the sensitivity of the MI measure to the factors.
MI sensitivity to sample size If the sample were unlimited,
M̂I would estimate MI accurately, M̂I of a pair of
independent variables would be 0, and a threshold would be
redundant. But the sample size is finite and affects M̂I [32,
33]. As the sample size decreases, the bias between M̂I and
MI increases, and M̂I for independent variables increases
from 0. It then becomes harder to find a threshold that
distinguishes between dependent and independent pairs.
Moreover, the sample size changes with the domain.
MI sensitivity to cardinality Asymptotically for
independent variables, MI(X;Y) = 0, and there is no
dependence on variable cardinality. However, cardinality
affects M̂I [18, 32, 34]. Since MI(X;Y) = H(X) − H(X|Y), we
can write MI(X;Y) ≤ min{H(X), H(Y)}. For totally dependent
variables (i.e., variables that take the same values),
MI(X;Y) = H(X) = H(Y). As variable cardinality increases,
variable entropy increases (due to increased uncertainty), as
does its MI with other variables (e.g., for a uniform
distribution, the entropy is the log of the cardinality [13]).
The statistical tests (3) and (4) also depend on the
cardinality (df). Naturally, the cardinality changes with the
CI test, and so should the threshold.
3 Adaptive Thresholds
Focusing on MI, our adaptive thresholds change their values
among CI tests and, similar to MI, increase with the
cardinality and decrease with the sample size.
3.1 MI Threshold Candidates
C1 threshold A relation between MI and M̂I is [32]

$$\widehat{MI} = MI + \frac{(|X|-1)(|Y|-1)}{2N\ln 2} + \sum_{j \ge 2}\frac{C_j}{N^j} \qquad (5)$$

where C_j is a non-converging series of coefficients. The
term (|X|−1)(|Y|−1)/(2N·ln2) approximates the bias M̂I − MI
[32] when the sample size N is large. Thus, to separate
between independent variables (MI = 0) and dependent
variables (MI > 0), we suggest an adaptive threshold,

$$\varepsilon_{C1} = \frac{(|X|-1)(|Y|-1)}{2N\ln 2} \qquad (6)$$
S1 threshold Since G_st and M̂I are related [7],

$$G_{st} = 2N\,\widehat{MI}(X;Y) \qquad (7)$$

we suggest a second adaptive threshold, which is (see (4))

$$\varepsilon_{S1} = \frac{\chi^2_{df,\alpha}}{2N} \qquad (8)$$

S2 threshold Since (7) is true when M̂I is calculated using
the natural logarithm, whereas MI in (1) is calculated using
base 2, we propose S2 by changing bases for S1, i.e.,

$$\varepsilon_{S2} = \frac{\chi^2_{df,\alpha}}{2N\ln 2} \qquad (9)$$
Between the thresholds All thresholds decrease with N,
as desired. S2 is larger than S1 by a factor of 1/ln2 and
thus increases with cardinality faster than S1. C1’s
numerator equals the expected value of a chi-square
distributed variable, which is the variable’s number of df
[1], whereas S2’s numerator is the chi-square critical value;
thus, the latter is higher than the former. Both S1’s
numerator and denominator are larger than C1’s numerator and
denominator, and S1 is larger than C1 if χ²_{df,α} > df/ln2.
Usually (α = 0.05), this happens for df ≤ 35, which is common.
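The three candidate thresholds can be sketched as follows. Since the exact chi-square quantile requires a statistics library, this sketch substitutes the Wilson–Hilferty approximation for χ²_{df,α} (z = 1.6449 is the upper-tail normal quantile for α = 0.05) — an assumption of ours, not part of the paper:

```python
import math

def chi2_crit(df, z=1.6449):
    """Wilson-Hilferty approximation to the critical value chi^2_{df,alpha};
    z is the upper-tail standard-normal quantile (1.6449 for alpha = 0.05)."""
    return df * (1.0 - 2.0 / (9.0 * df) + z * math.sqrt(2.0 / (9.0 * df))) ** 3

def c1(cx, cy, n):
    """C1 threshold (6): df / (2 N ln 2), the first-order bias of MI-hat."""
    return (cx - 1) * (cy - 1) / (2.0 * n * math.log(2))

def s1(cx, cy, n, z=1.6449):
    """S1 threshold (8): chi^2_{df,alpha} / (2 N)."""
    return chi2_crit((cx - 1) * (cy - 1), z) / (2.0 * n)

def s2(cx, cy, n, z=1.6449):
    """S2 threshold (9): chi^2_{df,alpha} / (2 N ln 2)."""
    return s1(cx, cy, n, z) / math.log(2)
```

With |X| = |Y| = 3 and N = 500, this gives S2 > S1 > C1, matching the ordering observed in Figure 1 (df = 4 is well below the ≈35 crossover); all three shrink as N grows.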
3.2 Threshold Extension to CMI
Extending the thresholds from MI to CMI, we account for
the increase in df due to the cardinalities of all variables
in the conditioning set S. Thus, when conditionally testing
two variables, the df in the adaptive thresholds is corrected
by [31, 33]

$$df = (|X|-1)(|Y|-1)\prod_{S_i \in S}|S_i| \qquad (10)$$
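The df correction for conditioning sets is a one-line computation; a sketch (function name ours):

```python
def cmi_df(card_x, card_y, cond_cards):
    """Degrees of freedom for a CMI test of X and Y given a conditioning
    set S: (|X|-1)(|Y|-1) times the product of the cardinalities of S."""
    df = (card_x - 1) * (card_y - 1)
    for c in cond_cards:
        df *= c
    return df
```

For example, |X| = 2, |Y| = 3, and S with cardinalities {2, 4} gives 1·2·8 = 16; an empty conditioning set recovers the marginal df.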
3.3 Complexity
Complexities of the CMI, X²st, and G_st measures are O(c³n)
for n variables and maximal variable cardinality c. The
complexity of a fixed threshold is O(1), and that of an
adaptive threshold (and of χ²_{df,α} for thresholding X²st or
G_st) is O(n); hence, adaptive thresholding does not add to
the complexity.
4 Empirical Evaluation
In six experiments, we compared adaptive to fixed
thresholds. We used ANOVA with a 0.05 significance level
to evaluate the significance of the results. ANOVA examines
[3] whether the effect of a factor (e.g., sample size, variable
cardinality) on the dependent variable (e.g., CMI value) is
significant. The null hypothesis is that all levels of a
factor perform similarly and the observed differences are
merely random; thus, the effect is insignificant. ANOVA
divides the total
variability into that caused by each of the factors and the
residual (error) variability. If the factor variability is
considerably larger than the error variability, the null is
rejected, and we conclude that some factor has a significant
effect on the dependent variable. Then, we find which factor
makes the effect using a post hoc test (Tukey-HSD) [16].
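The ANOVA logic described here reduces to comparing between-group and within-group variability; a minimal one-way sketch (the paper itself uses a three-way ANOVA with Tukey-HSD from standard statistical software):

```python
def anova_f(groups):
    """One-way ANOVA F statistic: mean square between groups over
    mean square within groups (the residual/error variability)."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ssb = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ssw = sum(sum((v - m) ** 2 for v in g) for g, m in zip(groups, means))
    return (ssb / (k - 1)) / (ssw / (n - k))
```

An F value large relative to the F-distribution critical value rejects the null that all group means are equal.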
4.1 Experiment 1
To learn about the suitability of the adaptive thresholds to
MI-based CI testing, we sampled datasets of 500, 1,000, and
5,000 records from the Alarm, Insurance, and Child
networks [4]. For every dataset and pair of variables, we
plotted M̂I against the thresholds’ values, as computed by
(6), (8), and (9). We plotted for increasing products of the
cardinality values (|X|·|Y|). For each product, we presented
the thresholds and M̂I for all pairs of variables with the
same product, along with the average M̂I. Figure 1, for Alarm
and 500 records, validates the analysis in Section 3.1, by
which S2 is larger than S1, which is larger than C1. The
analysis for all networks and dataset sizes reveals that M̂I
increases as the dataset size decreases, but it is not yet
clear whether the increase with cardinality is due to the
cardinality itself or to high dependence between tested
variables with high cardinality.
Figure 1. MI and thresholds’ values (Alarm, 500 records).
4.2 Experiment 2
To understand the joint effect on M̂I of cardinality, sample
size, and the independence level between the variables, we
generated 48 synthetic datasets. They describe, for two
variables connected by a directed edge, all combinations of:
1) |X1| = 2, 5; 2) |X2| = 3, 6; 3) 500, 1,000, and 5,000
records; and 4) four variable independence levels:
independent, “quite” independent, “quite” dependent, and
dependent.
To create independent variables, each variable was
generated based on a different sequence of random numbers
drawn from U(0,1). For example, if |X1|=2, and the first
sampled random number from the first sequence was lower
than 0.5, then X1’s value in the first record was 1, otherwise
it was 2. If |X2|=3, and the first sampled random number
from the second sequence was lower than 0.33, then X2’s
value in the first record was 1, otherwise, if it was lower
than 0.66, then X2’s value in the first record was 2,
otherwise 3. In practice, N random numbers were sampled
to represent each variable; hence, the variables are fully
independent [24]. To create “quite” independent variables,
66.6% of the variables’ values were generated based on
different sequences of random numbers (as for independent
variables) and 33.3% of the values based on the same
sequence of random numbers for both X1 and X2. To create
“quite” dependent variables, 33.3% of the variables’ values
were generated based on different sequences and 66.6% of
the variables’ values based on the same sequence for both
X1 and X2. To create dependent variables, the variables were
sampled based on the same sequence of random numbers,
i.e., each random number in the sequence simultaneously
determines a specific value for both X1 and X2.
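The generation procedure above can be sketched as follows; instead of switching between two fixed sequences per block of records, this sketch shares the underlying uniform draw with the stated probability per record, an approximation of ours rather than the authors’ exact procedure (values also start at 0 rather than 1):

```python
import random

def sample_pair(n, card1, card2, shared_frac, seed=0):
    """Sample n records of (X1, X2). A record uses the same U(0,1) draw for
    both variables with probability shared_frac (0.0 -> independent,
    1.0 -> fully dependent), mimicking the four independence levels."""
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        u1 = rng.random()
        u2 = u1 if rng.random() < shared_frac else rng.random()
        # A uniform draw u maps to value int(u * cardinality): equal-width bins
        records.append((int(u1 * card1), int(u2 * card2)))
    return records
```

With shared_frac = 1.0 and equal cardinalities, the two columns coincide, giving fully dependent variables; 0.33 and 0.66 approximate the two “quite” levels.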
Figure 2. Average M̂I values as a function of (a) independence
level, (b) dataset size, and (c) cardinality.
Each of the 48 datasets was randomly replicated 30 times.
For every dataset replication, the M̂I value between the two
variables was calculated. The effects of the factors on M̂I
were analyzed using a three-way ANOVA test (the three
examined factors were independence level, dataset size, and
cardinality). It was followed by a Tukey-HSD post hoc
analysis [16] whenever significance was found by ANOVA [3].
Independence level As expected, M̂I increases with the
degree of dependence (Figure 2(a)), and this degree was found
significant to M̂I in the statistical analysis (p-value = 0.00).
Dataset size Figure 2(b) shows that M̂I decreases with the
dataset size, and the size was also found significant.
Cardinality M̂I increases with cardinality (Figure 2(c)).
This factor was also found statistically significant.
Figure 3. Average M̂I values as a function of cardinality and
independence level.
Figure 3 shows results for the interaction between
cardinality and independence level, which is the most
interesting one, because we expect that when the variables
are dependent, the effect of the cardinality will be stronger
than when the variables are independent. Figure 3 shows
that M̂I increases faster with cardinality when the variables
are more dependent. This interaction was found significant
(p-value = 0.00), as were the interactions between
cardinality and dataset size and between independence level
and dataset size. Overall, the
results of this experiment encourage the use of thresholds
that are adaptive to the sample size and cardinality.
4.3 Experiment 3
We evaluated the adaptive thresholds in distinguishing pairs
of independent variables from pairs of dependent variables
using the Alarm, Insurance, Child, Mildew, Asia, and Hail-
Finder networks [4]. Samples for each network included
500, 1,000, and 5,000 records. For every pair of variables,
we calculated M̂I and each of the adaptive thresholds. If the
M̂I value was higher than a tested threshold, we decided
that, based on the data, the variables are dependent;
otherwise, we decided they are independent. By examining
d-separation
between all variable pairs (for conditioning sets of 0 order)
using the true network, we could test these decisions for all
variable pairs and compute the accuracy of the decisions for
each adaptive threshold. We repeated this procedure with
two common, fixed thresholds, 10^-2 and 10^-3 [2, 10, 33, 35],
and computed an average accuracy (weighted by the number
of pairs tested for each network) for each threshold. The
results show that S1 (α = 0.01, 0.05) and S2 (α = 0.01) are the
best thresholds and the fixed thresholds (and C1) are the
worst. ANOVA showed that threshold type significantly
affects the percentage of correct decisions (p-value=0.00).
Table 1 presents a Tukey-HSD post hoc analysis by which
S1 (α = 0.01, 0.05) and S2 (α = 0.01) make a subgroup of
thresholds that distinguish between pairs of independent
variables and pairs of dependent variables most accurately,
and 10^-3 is the least accurate threshold.
Threshold    T       Subset
                     1       2       3       4       5
10^-3        10110   .5934
C1           10110           .6436
10^-2        10110           .6624   .6624
S1 (0.1)     10110                   .6693   .6693
S2 (0.01)    10110                   .6816   .6816   .6816
S1 (0.05)    10110                           .6843   .6843
S1 (0.01)    10110                                   .6920

Table 1. Homogeneous subsets of the accuracy of the thresholds in
distinguishing pairs of independent from pairs of dependent
variables. The number of observations (T) is the number of tested
pairs in 6 networks (e.g., 666 for Alarm) and 3 dataset sizes.
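The accuracy computation in this experiment reduces to comparing thresholded decisions against d-separation ground truth; a sketch (names ours):

```python
def decision_accuracy(mi_values, truly_dependent, eps):
    """Fraction of variable pairs whose decision (MI-hat > eps means
    'dependent') agrees with d-separation in the true network."""
    correct = sum((mi > eps) == dep
                  for mi, dep in zip(mi_values, truly_dependent))
    return correct / len(mi_values)
```

Sweeping eps over the fixed and adaptive candidate values reproduces the kind of comparison summarized in Table 1.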
4.4 Experiment 4
We checked the effects on M̂I of the number of variables in
the conditioning set and their cardinalities.
In Experiment A, we generated a serial connection,
X1→X2→X3, with cardinalities |X1| = |X3| = 2 and |X2| =