This Is a Publication of
The American Association for Artificial Intelligence
This electronic document has been retrieved from the
American Association for Artificial Intelligence
445 Burgess Drive
Menlo Park, California 94025
(415) 328-3123
(415) 321-4457
info@aaai.org
http://www.aaai.org
(For membership information,
consult our web page)
The material herein is copyrighted material. It may not be
reproduced in any form by any electronic or
mechanical means (including photocopying, recording,
or information storage and retrieval) without permission
in writing from AAAI.
Articles
AI MAGAZINE, Winter 1991

Bayesian Networks without Tears

Eugene Charniak

I give an introduction to Bayesian networks for AI researchers with a limited grounding in probability theory. Over the last few years, this method of reasoning using probabilities has become popular within the AI probability and uncertainty community. Indeed, it is probably fair to say that Bayesian networks are to a large segment of the AI-uncertainty community what resolution theorem proving is to the AI-logic community. Nevertheless, despite what seems to be their obvious importance, the ideas and techniques have not spread much beyond the research community responsible for them. This is probably because the ideas and techniques are not that easy to understand. I hope to rectify this situation by making Bayesian networks more accessible to the probabilistically unsophisticated.

0738-4602/91/$4.00 ©1991 AAAI

Over the last few years, a method of reasoning using probabilities, variously called belief networks, Bayesian networks, knowledge maps, probabilistic causal networks, and so on, has become popular within the AI probability and uncertainty community. This method is best summarized in Judea Pearl's (1988) book, but the ideas are a product of many hands. I adopted Pearl's name, Bayesian networks, on the grounds that the name is completely neutral about the status of the networks (do they really represent beliefs, causality, or what?). Bayesian networks have been applied to problems in medical diagnosis (Heckerman 1990; Spiegelhalter, Franklin, and Bull 1989), map learning (Dean 1990), language understanding (Charniak and Goldman 1989a, 1989b; Goldman 1990), vision (Levitt, Mullin, and Binford 1989), heuristic search (Hansson and Mayer 1989), and so on. It is probably fair to say that Bayesian networks are to a large segment of the AI-uncertainty community what resolution theorem proving is to the AI-logic community. Nevertheless, despite what seems to be their obvious importance, the ideas and techniques have not spread much beyond the research community responsible for them. This is probably because the ideas and techniques are not that easy to understand. I hope to rectify this situation by making Bayesian networks more accessible to the probabilistically unsophisticated. That is, this article tries to make the basic ideas and intuitions accessible to someone with a limited grounding in probability theory (equivalent to what is presented in Charniak and McDermott [1985]).
An Example Bayesian Network
The best way to understand Bayesian networks is to imagine trying to model a situation in which causality plays a role but where our understanding of what is actually going on is incomplete, so we need to describe things probabilistically. Suppose when I go home at night, I want to know if my family is home before I try the doors. (Perhaps the most convenient door to enter is double locked when nobody is home.) Now, often when my wife leaves the house, she turns on an outdoor light. However, she sometimes turns on this light if she is expecting a guest. Also, we have a dog. When nobody is home, the dog is put in the backyard. The same is true if the dog has bowel troubles. Finally, if the dog is in the backyard, I will probably hear her barking (or what I think is her barking), but sometimes I can be confused by other dogs barking. This example, partially inspired by Pearl's (1988) earthquake example, is illustrated in figure 1. There we find a graph not unlike many we see in AI. We might want to use such diagrams to predict what will happen (if my family goes out, the dog goes out) or to infer causes from observed effects (if the light is on and the dog is out, then my family is probably out).

The important thing to note about this example is that the causal connections are not absolute. Often, my family will have left without putting out the dog or turning on a light. Sometimes we can use these diagrams anyway, but in such cases, it is hard to know what to infer when not all the evidence points the same way. Should I assume the family is out if the light is on, but I do not hear the dog? What if I hear the dog, but the light is out? Naturally, if we knew the relevant probabilities, such as P(family-out | light-on, ¬hear-bark), then we would be all set. However, typically, such numbers are not available for all possible combinations of circumstances. Bayesian networks allow us to calculate them from a small set of probabilities, relating only neighboring nodes.
Bayesian networks are directed acyclic graphs (DAGs) (like figure 1), where the nodes are random variables, and certain independence assumptions hold, the nature of which I discuss later. (I assume without loss of generality that the DAG is connected.) Often, as in figure 1, the random variables can be thought of as states of affairs, and the variables have two possible values, true and false. However, this need not be the case. We could, say, have a node denoting the intensity of an earthquake with values no-quake, trembler, rattler, major, and catastrophe. Indeed, the variable values do not even need to be discrete. For example, the value of the variable earthquake might be a Richter scale number. (However, the algorithms I discuss only work for discrete values, so I stick to this case.)

In what follows, I use a sans serif font for the names of random variables, as in earthquake. I use the name of the variable in italics to denote the proposition that the variable takes on some particular value (but where we are not concerned with which one), for example, earthquake. For the special case of Boolean variables (with values true and false), I use the variable name in a sans serif font to denote the proposition that the variable has the value true (for example, family-out). I also show the arrows pointing downward so that "above" and "below" can be understood to indicate arrow direction.

The arcs in a Bayesian network specify the independence assumptions that must hold between the random variables. These independence assumptions determine what probability information is required to specify the probability distribution among the random variables in the network. The reader should note that in informally talking about the DAG, I said that the arcs denote causality, whereas in the Bayesian network, I am saying that they specify things about the probabilities. The next section resolves this conflict.
To specify the probability distribution of a
Bayesian network, one must give the prior
probabilities of all root nodes (nodes with no
predecessors) and the conditional probabilities
[Figure 1: a directed graph over the nodes family-out, bowel-problem, light-on, dog-out, and hear-bark, with arcs family-out → light-on, family-out → dog-out, bowel-problem → dog-out, and dog-out → hear-bark.]
Figure 1. A Causal Graph.
The nodes denote states of affairs, and the arcs can be interpreted as causal connections.
of all nonroot nodes given all possible combinations of their direct predecessors. Thus, figure 2 shows a fully specified Bayesian network corresponding to figure 1. For example, it states that if family members leave our house, they will turn on the outside light 60 percent of the time, but the light will be turned on even when they do not leave 5 percent of the time (say, because someone is expected).
Bayesian networks allow one to calculate the conditional probabilities of the nodes in the network given that the values of some of the nodes have been observed. To take the earlier example, if I observe that the light is on (light-on = true) but do not hear my dog (hear-bark = false), I can calculate the conditional probability of family-out given these pieces of evidence. (For this case, it is .5.) I talk of this calculation as evaluating the Bayesian network (given the evidence). In more realistic cases, the networks would consist of hundreds or thousands of nodes, and they might be evaluated many times as new information comes in. As evidence comes in, it is tempting to think of the probabilities of the nodes changing, but, of course, what is changing is the conditional probability of the nodes given the changing evidence. Sometimes people talk about the belief of the node changing. This way of talking is probably harmless provided that one keeps in mind that here, belief is simply the conditional probability given the evidence.
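For a network as tiny as the one in figure 2, this evaluation can be sketched by brute-force enumeration of the joint distribution. The sketch below is my own illustration (it is not one of the evaluation algorithms discussed later in the article); the node names and numbers are taken from figure 2:

```python
from itertools import product

# Conditional probability tables from figure 2 (true-probabilities only).
p_fo, p_bp = 0.15, 0.01
p_lo = {True: 0.6, False: 0.05}            # P(lo | fo)
p_hb = {True: 0.7, False: 0.01}            # P(hb | do)
p_do = {(True, True): 0.99, (True, False): 0.90,
        (False, True): 0.97, (False, False): 0.3}   # P(do | fo, bp)

def pr(p, value):
    """Probability that a Boolean variable with P(true) = p takes `value`."""
    return p if value else 1.0 - p

def joint(fo, bp, lo, do, hb):
    """P(fo, bp, lo, do, hb): the product of each node given its parents."""
    return (pr(p_fo, fo) * pr(p_bp, bp) * pr(p_lo[fo], lo) *
            pr(p_do[(fo, bp)], do) * pr(p_hb[do], hb))

def query(evidence):
    """P(family-out | evidence), summing the joint over all 2**5 worlds."""
    num = den = 0.0
    for fo, bp, lo, do, hb in product([True, False], repeat=5):
        world = {"fo": fo, "bp": bp, "lo": lo, "do": do, "hb": hb}
        if any(world[k] != v for k, v in evidence.items()):
            continue
        p = joint(fo, bp, lo, do, hb)
        den += p
        if fo:
            num += p
    return num / den

# Light on, no barking heard: the article's answer of roughly .5.
print(round(query({"lo": True, "hb": False}), 3))
```

Enumeration takes time exponential in the number of variables, which is exactly why the evaluation algorithms described later matter for realistic networks.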
In the remainder of this article, I first
describe the independence assumptions
implicit in Bayesian networks and show how
they relate to the causal interpretation of arcs
(Independence Assumptions). I then show
that given these independence assumptions,
the numbers I specified are, in fact, all that
are needed (Consistent Probabilities). Evaluat-
ing Networks describes how Bayesian net-
works are evaluated, and the next section
describes some of their applications.
Independence Assumptions
One objection to the use of probability theory is that the complete specification of a probability distribution requires absurdly many numbers. For example, if there are n binary random variables, the complete distribution is specified by 2^n - 1 joint probabilities. (If you do not know where this 2^n - 1 comes from, wait until the next section, where I define joint probabilities.) Thus, the complete distribution for figure 2 would require 31 values, yet we only specified 10. This savings might not seem great, but if we doubled the
[Figure 2: the figure 1 graph with the nodes abbreviated light-on (lo), family-out (fo), dog-out (do), bowel-problem (bp), and hear-bark (hb).]
Figure 2. A Bayesian Network for the family-out Problem.
I added the prior probabilities for root nodes and the posterior probabilities for nonroots given all possible values of their parents.

P(fo) = .15            P(bp) = .01
P(lo | fo) = .6        P(lo | ¬fo) = .05
P(hb | do) = .7        P(hb | ¬do) = .01
P(do | fo bp) = .99    P(do | fo ¬bp) = .90
P(do | ¬fo bp) = .97   P(do | ¬fo ¬bp) = .3
size of the network by grafting on a copy, as shown in figure 3, 2^n - 1 would be 1023, but we would only need to give 21. Where does this savings come from?
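The bookkeeping behind these counts can be checked in a few lines. This is only a sketch of the arithmetic, with the parent sets read off figure 1: a full joint over n Boolean variables needs 2^n - 1 numbers, while the network needs one number per node per combination of parent values.

```python
# Each node needs 2**len(parents) numbers: one P(node = true | ...)
# per combination of parent values. Parent sets are read off figure 1.
parents = {
    "family-out": [],
    "bowel-problem": [],
    "light-on": ["family-out"],
    "dog-out": ["family-out", "bowel-problem"],
    "hear-bark": ["dog-out"],
}

full_joint = 2 ** len(parents) - 1                      # 31 for 5 variables
network = sum(2 ** len(ps) for ps in parents.values())  # 10 for figure 2
print(full_joint, network)
```

The full joint grows exponentially in the number of variables, while the network's count grows only with the number of nodes and the sizes of their parent sets.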
The answer is that Bayesian networks have built-in independence assumptions. To take a simple case, consider the random variables family-out and hear-bark. Are these variables independent? Intuitively not, because if my family leaves home, then the dog is more likely to be out, and thus, I am more likely to hear it. However, what if I happen to know that the dog is definitely in (or out of) the house? Is hear-bark independent of family-out then? That is, is P(hear-bark | family-out, dog-out) = P(hear-bark | dog-out)? The answer now is yes. After all, my hearing her bark was dependent on her being in or out. Once I know whether she is in or out, then where the family is is of no concern.

We are beginning to tie the interpretation of the arcs as direct causality to their probabilistic interpretation. The causal interpretation of the arcs says that the family being out has a direct causal connection to the dog being out, which, in turn, is directly connected to my hearing her. In the probabilistic interpretation, we adopt the independence assumptions that the causal interpretation suggests. Note that if I had wanted to say that the location of the family was directly relevant to my hearing the dog, then I would have to put another arc directly between the two. Direct relevance would occur, say, if the dog is more likely to bark when the family is away than when it is at home. This is not the case for my dog.
In the rest of this section, I define the independence assumptions in Bayesian networks and then show how they correspond to what one would expect given the interpretation of the arcs as causal. In the next section, I formally show that once one makes these independence assumptions, the probabilities needed are reduced to the ones that I specified (for roots, the priors; for nonroots, the conditionals given immediate predecessors).
First, I give the rule specifying dependence
and independence in Bayesian networks:
In a Bayesian network, a variable a is
[Figure 3: two copies of the figure 1 graph attached to each other; the first copy keeps the names family-out, bowel-problem, light-on, dog-out, and hear-bark, and the second copy's nodes are labeled frog-out, family-lout, night-on, towel-problem, and hear-quark.]
Figure 3. A Network with 10 Nodes.
This illustration is two copies of the graph from figure 1 attached to each other. Nonsense names were given to the nodes in the second copy.
dependent on a variable b given evidence E = {e1 … en} if there is a d-connecting path from a to b given E. (I call E the evidence nodes. E can be empty. It cannot include a or b.) If a is not dependent on b given E, a is independent of b given E.

Note that for any random variable f, it is possible for two variables to be independent of each other given E but dependent given E ∪ {f}, and vice versa (they may be dependent given E but independent given E ∪ {f}). In particular, if we say that two variables a and b are independent of each other, we simply mean that P(a | b) = P(a). It might still be the case that they are not independent given, say, e (that is, P(a | b, e) ≠ P(a | e)).

To connect this definition to the claim that family-out is independent of hear-bark given dog-out, we see when I explain d-connecting that there is no d-connecting path from family-out to hear-bark given dog-out because dog-out, in effect, blocks the path between the two.

To understand d-connecting paths, we need to keep in mind the three kinds of connections between a random variable b and its two immediate neighbors in the path, a and c. The three possibilities are shown in figure 4 and correspond to the possible combinations of arrow directions from b to a and c. In the first case, one node is above b and the other below; in the second case, both are above; and in the third, both are below. (Remember, we assume that arrows in the diagram go from high to low, so going in the direction of the arrow is going down.) We can say that a node b in a path P is linear, converging, or diverging in P depending on which situation it finds itself in according to figure 4.

[Figure 4: node b between a and c in three configurations: a → b → c (linear), a ← b → c (diverging), and a → b ← c (converging).]
Figure 4. The Three Connection Types.
In each case, node b is between a and c in the undirected path between the two.

Now I give the definition of a d-connecting path:

A path from q to r is d-connecting with respect to the evidence nodes E if every interior node n in the path has the property that either
1. it is linear or diverging and not a member of E, or
2. it is converging, and either n or one of its descendants is in E.

In the literature, the term d-separation is more common. Two nodes are d-separated if there is no d-connecting path between them. I find the explanation in terms of d-connecting slightly easier to understand. I go through this definition slowly in a moment, but roughly speaking, two nodes are d-connected if either there is a causal path between them (part 1 of the definition), or there is evidence that renders the two nodes correlated with each other (part 2).

To understand this definition, let's start by pretending that part 2 is not there. Then we would be saying that a d-connecting path must not be blocked by evidence, and there can be no converging interior nodes. We already saw why we want the evidence-blocking restriction. This restriction is what says that once we know about a middle node, we do not need to know about anything further away. What about the restriction on converging nodes? Again, consider figure 2. In this diagram, I am saying that both bowel-problem and family-out can cause dog-out. However, does the probability of bowel-problem depend on that of family-out? No, not really. (We could imagine a case where they were dependent, but this case would be another ball game and another Bayesian network.) Note that the only path between the two is by way of a converging node for this path, namely, dog-out. To put it another way, if two things can cause the same state of affairs and have no other connection, then the two things are independent. Thus, any time we have two potential causes for a state of affairs, we have a converging node. Because one major use of Bayesian networks is deciding which potential cause is the most likely, converging nodes are common.

Now let us consider part 2 in the definition of d-connecting path. Suppose we know that the dog is out (that is, dog-out is a member of E). Now, are family-out and bowel-problem independent? No, even though they were independent of each other when there was no evidence, as I just showed. For example, knowing that the family is at home should raise (slightly) the probability that the dog has bowel problems. Because we eliminated the most likely explanation for the dog being out, less likely explanations become more likely. This situation is covered by part 2. Here, the d-connecting path is from family-out to bowel-problem. It goes through a converging node (dog-out), but dog-out is itself a conditioning node. We would have a similar situation if we did not know that the dog was out but merely heard the barking. In this case, we would not be sure the dog was out, but we do have relevant evidence (which raises the probability), so hear-bark, in effect, connects the two nodes above the converging node. Intuitively, part 2 means that a path can only go through a converging node if we are conditioning on an (indirect) effect of the converging node.
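The definition lends itself to a small path search. The sketch below is my own illustration (the `parents` encoding of figure 2 is an assumption of the sketch, not part of the article): it enumerates undirected paths and applies the two conditions at each interior node.

```python
# A sketch of the d-connection test, applied to the figure 2 graph.
parents = {
    "family-out": [], "bowel-problem": [],
    "light-on": ["family-out"],
    "dog-out": ["family-out", "bowel-problem"],
    "hear-bark": ["dog-out"],
}
children = {n: [] for n in parents}
for n, ps in parents.items():
    for p in ps:
        children[p].append(n)

def descendants(n):
    """All nodes reachable from n by following arcs downward."""
    out, stack = set(), [n]
    while stack:
        for c in children[stack.pop()]:
            if c not in out:
                out.add(c)
                stack.append(c)
    return out

def d_connected(a, b, evidence):
    """True iff some path from a to b d-connects given the evidence nodes.

    A step to a child follows the arc ("fwd"); a step to a parent goes
    against it ("bwd").  An interior node is converging exactly when we
    arrive along an incoming arc and leave along another incoming arc.
    """
    E = set(evidence)

    def interior_ok(n, arrived, leaving):
        converging = (arrived == "fwd" and leaving == "bwd")
        if converging:                      # part 2 of the definition
            return n in E or bool(descendants(n) & E)
        return n not in E                   # part 1: linear or diverging

    def dfs(node, arrived, visited):
        if node == b:
            return True
        steps = ([(c, "fwd") for c in children[node]] +
                 [(p, "bwd") for p in parents[node]])
        for nxt, leaving in steps:
            if nxt in visited:
                continue
            if arrived is not None and not interior_ok(node, arrived, leaving):
                continue
            if dfs(nxt, leaving, visited | {nxt}):
                return True
        return False

    return dfs(a, None, {a})

print(d_connected("family-out", "hear-bark", []))              # causal chain
print(d_connected("family-out", "hear-bark", ["dog-out"]))     # blocked
print(d_connected("family-out", "bowel-problem", []))          # converging
print(d_connected("family-out", "bowel-problem", ["dog-out"])) # explained away
```

The four queries reproduce the article's examples: the chain from family-out to hear-bark d-connects until dog-out is observed, while the two causes of dog-out are d-separated until dog-out (or its descendant hear-bark) is observed.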
Consistent Probabilities
One problem that can plague a naive proba-
bilistic scheme is inconsistent probabilities.
For example, consider a system in which we
have P(a | b) = .7, P(b | a) = .3, and P(b) = .5.
Just eyeballing these equations, nothing looks
amiss, but a quick application of Bayes’s law
shows that these probabilities are not consis-
tent because they require P(a) > 1. By Bayes’s
law,
P(a) P(b | a) / P(b) = P(a | b) ;
so,
P(a) = P(b) P(b | a)/ P(b | a) = .5 * .7 / .3 =
.35 / .3).
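The arithmetic is easy to check mechanically; a two-line sketch:

```python
p_a_given_b, p_b_given_a, p_b = 0.7, 0.3, 0.5

# Rearranging Bayes's law: P(a) = P(a | b) P(b) / P(b | a).
p_a = p_a_given_b * p_b / p_b_given_a
print(p_a)   # larger than 1, so the three numbers are inconsistent
```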
Needless to say, in a system with a lot of such numbers, making sure they are consistent can be a problem, and one system (PROSPECTOR) had to implement special-purpose techniques to handle such inconsistencies (Duda, Hart, and Nilsson 1976). Therefore, it is a nice property of Bayesian networks that if you specify the required numbers (the probability of every node given all possible combinations of its parents), then (1) the numbers will be consistent and (2) the network will uniquely define a distribution. Furthermore, it is not too hard to see that this claim is true. To see it, we must first introduce the notion of a joint distribution.
A joint distribution of a set of random variables v1 … vn is defined as P(v1 … vn) for all values of v1 … vn. That is, for the set of Boolean variables (a, b), we need the probabilities P(a b), P(¬a b), P(a ¬b), and P(¬a ¬b). A joint distribution for a set of random variables gives all the information there is about the distribution. For example, suppose we had the just-mentioned joint distribution for (a, b), and we wanted to compute, say, P(a | b):

P(a | b) = P(a b) / P(b) = P(a b) / (P(a b) + P(¬a b)) .
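For instance, with a made-up joint distribution over (a, b) (the four numbers below are mine, chosen only to illustrate the formula), the conditional falls out directly:

```python
# A made-up joint distribution over two Boolean variables a and b.
joint = {(True, True): 0.4, (False, True): 0.1,
         (True, False): 0.3, (False, False): 0.2}

p_ab = joint[(True, True)]                      # P(a b)
p_b = joint[(True, True)] + joint[(False, True)]  # P(a b) + P(¬a b)

# P(a | b) = P(a b) / (P(a b) + P(¬a b))
p_a_given_b = p_ab / p_b
print(p_a_given_b)
```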
Note that for n Boolean variables, the joint distribution contains 2^n