This Is a Publication of
The American Association for Artificial Intelligence
This electronic document has been retrieved from the
American Association for Artificial Intelligence
445 Burgess Drive
Menlo Park, California 94025
(415) 328-3123
(415) 321-4457
info@aaai.org
http://www.aaai.org
(For membership information,
consult our web page)
The material herein is copyrighted material. It may not be
reproduced in any form by any electronic or
mechanical means (including photocopying, recording,
or information storage and retrieval) without permission
in writing from AAAI.
Articles
AI MAGAZINE, Winter 1991

Bayesian Networks without Tears

Eugene Charniak

I give an introduction to Bayesian networks for AI researchers with a limited grounding in probability theory. Over the last few years, this method of reasoning using probabilities has become popular within the AI probability and uncertainty community. Indeed, it is probably fair to say that Bayesian networks are to a large segment of the AI-uncertainty community what resolution theorem proving is to the AI-logic community. Nevertheless, despite what seems to be their obvious importance, the ideas and techniques have not spread much beyond the research community responsible for them. This is probably because the ideas and techniques are not that easy to understand. I hope to rectify this situation by making Bayesian networks more accessible to the probabilistically unsophisticated.

0738-4602/91/$4.00 ©1991 AAAI

Over the last few years, a method of reasoning using probabilities, variously called belief networks, Bayesian networks, knowledge maps, probabilistic causal networks, and so on, has become popular within the AI probability and uncertainty community. This method is best summarized in Judea Pearl's (1988) book, but the ideas are a product of many hands. I adopted Pearl's name, Bayesian networks, on the grounds that the name is completely neutral about the status of the networks (do they really represent beliefs, causality, or what?). Bayesian networks have been applied to problems in medical diagnosis (Heckerman 1990; Spiegelhalter, Franklin, and Bull 1989), map learning (Dean 1990), language understanding (Charniak and Goldman 1989a, 1989b; Goldman 1990), vision (Levitt, Mullin, and Binford 1989), heuristic search (Hansson and Mayer 1989), and so on. It is probably fair to say that Bayesian networks are to a large segment of the AI-uncertainty community what resolution theorem proving is to the AI-logic community. Nevertheless, despite what seems to be their obvious importance, the ideas and techniques have not spread much beyond the research community responsible for them. This is probably because the ideas and techniques are not that easy to understand. I hope to rectify this situation by making Bayesian networks more accessible to the probabilistically unsophisticated. That is, this article tries to make the basic ideas and intuitions accessible to someone with a limited grounding in probability theory (equivalent to what is presented in Charniak and McDermott [1985]).
An Example Bayesian Network
The best way to understand Bayesian networks is to imagine trying to model a situation in which causality plays a role but where our understanding of what is actually going on is incomplete, so we need to describe things probabilistically. Suppose when I go home at night, I want to know if my family is home before I try the doors. (Perhaps the most convenient door to enter is double locked when nobody is home.) Now, often when my wife leaves the house, she turns on an outdoor light. However, she sometimes turns on this light if she is expecting a guest. Also, we have a dog. When nobody is home, the dog is put in the backyard. The same is true if the dog has bowel troubles. Finally, if the dog is in the backyard, I will probably hear her barking (or what I think is her barking), but sometimes I can be confused by other dogs barking. This example, partially inspired by Pearl's (1988) earthquake example, is illustrated in figure 1. There we find a graph not unlike many we see in AI. We might want to use such diagrams to predict what will happen (if my family goes out, the dog goes out) or to infer causes from observed effects (if the light is on and the dog is out, then my family is probably out).

The important thing to note about this example is that the causal connections are not absolute. Often, my family will have left without putting out the dog or turning on a light. Sometimes we can use these diagrams anyway, but in such cases, it is hard to know what to infer when not all the evidence points the same way. Should I assume the family is out if the light is on, but I do not hear the dog? What if I hear the dog, but the light is out? Naturally, if we knew the relevant probabilities, such as P(family-out | light-on, ¬hear-bark), then we would be all set. However, typically, such numbers are not available for all possible combinations of circumstances. Bayesian networks allow us to calculate them from a small set of probabilities, relating only neighboring nodes.
Bayesian networks are directed acyclic graphs (DAGs) (like figure 1), where the nodes are random variables, and certain independence assumptions hold, the nature of which I discuss later. (I assume without loss of generality that the DAG is connected.) Often, as in figure 1, the random variables can be thought of as states of affairs, and the variables have two possible values, true and false. However, this need not be the case. We could, say, have a node denoting the intensity of an earthquake with values no-quake, trembler, rattler, major, and catastrophe. Indeed, the variable values do not even need to be discrete. For example, the value of the variable earthquake might be a Richter scale number. (However, the algorithms I discuss only work for discrete values, so I stick to this case.)

In what follows, I use a sans serif font for the names of random variables, as in earthquake. I use the name of the variable in italics to denote the proposition that the variable takes on some particular value (but where we are not concerned with which one), for example, earthquake. For the special case of Boolean variables (with values true and false), I use the variable name in a sans serif font to denote the proposition that the variable has the value true (for example, family-out). I also show the arrows pointing downward so that "above" and "below" can be understood to indicate arrow direction.

The arcs in a Bayesian network specify the independence assumptions that must hold between the random variables. These independence assumptions determine what probability information is required to specify the probability distribution among the random variables in the network. The reader should note that in informally talking about the DAG, I said that the arcs denote causality, whereas in the Bayesian network, I am saying that they specify things about the probabilities. The next section resolves this conflict.
To specify the probability distribution of a
Bayesian network, one must give the prior
probabilities of all root nodes (nodes with no
predecessors) and the conditional probabilities
[Figure 1: a directed graph over the nodes family-out, bowel-problem, light-on, dog-out, and hear-bark, with arcs family-out → light-on, family-out → dog-out, bowel-problem → dog-out, and dog-out → hear-bark.]
Figure 1. A Causal Graph.
The nodes denote states of affairs, and the arcs can be interpreted as causal connections.
of all nonroot nodes given all possible combinations of their direct predecessors. Thus, figure 2 shows a fully specified Bayesian network corresponding to figure 1. For example, it states that if family members leave our house, they will turn on the outside light 60 percent of the time, but the light will be turned on even when they do not leave 5 percent of the time (say, because someone is expected).
Bayesian networks allow one to calculate the conditional probabilities of the nodes in the network given that the values of some of the nodes have been observed. To take the earlier example, if I observe that the light is on (light-on = true) but do not hear my dog (hear-bark = false), I can calculate the conditional probability of family-out given these pieces of evidence. (For this case, it is .5.) I talk of this calculation as evaluating the Bayesian network (given the evidence). In more realistic cases, the networks would consist of hundreds or thousands of nodes, and they might be evaluated many times as new information comes in. As evidence comes in, it is tempting to think of the probabilities of the nodes changing, but, of course, what is changing is the conditional probability of the nodes given the changing evidence. Sometimes people talk about the belief of the node changing. This way of talking is probably harmless provided that one keeps in mind that here, belief is simply the conditional probability given the evidence.
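For a network as tiny as the one in figure 2, this evaluation can be sketched by brute-force enumeration of the joint distribution. The sketch below is my own illustration (it is not one of the evaluation algorithms discussed later in the article); the node names and numbers are taken from figure 2:

```python
from itertools import product

# Conditional probability tables from figure 2 (true-probabilities only).
p_fo, p_bp = 0.15, 0.01
p_lo = {True: 0.6, False: 0.05}            # P(lo | fo)
p_hb = {True: 0.7, False: 0.01}            # P(hb | do)
p_do = {(True, True): 0.99, (True, False): 0.90,
        (False, True): 0.97, (False, False): 0.3}   # P(do | fo, bp)

def pr(p, value):
    """Probability that a Boolean variable with P(true) = p takes `value`."""
    return p if value else 1.0 - p

def joint(fo, bp, lo, do, hb):
    """P(fo, bp, lo, do, hb): the product of each node given its parents."""
    return (pr(p_fo, fo) * pr(p_bp, bp) * pr(p_lo[fo], lo) *
            pr(p_do[(fo, bp)], do) * pr(p_hb[do], hb))

def query(evidence):
    """P(family-out | evidence), summing the joint over all 2**5 worlds."""
    num = den = 0.0
    for fo, bp, lo, do, hb in product([True, False], repeat=5):
        world = {"fo": fo, "bp": bp, "lo": lo, "do": do, "hb": hb}
        if any(world[k] != v for k, v in evidence.items()):
            continue
        p = joint(fo, bp, lo, do, hb)
        den += p
        if fo:
            num += p
    return num / den

# Light on, no barking heard: the article's answer of roughly .5.
print(round(query({"lo": True, "hb": False}), 3))
```

Enumeration takes time exponential in the number of variables, which is exactly why the evaluation algorithms described later matter for realistic networks.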
In the remainder of this article, I first
describe the independence assumptions
implicit in Bayesian networks and show how
they relate to the causal interpretation of arcs
(Independence Assumptions). I then show
that given these independence assumptions,
the numbers I specified are, in fact, all that
are needed (Consistent Probabilities). Evaluat-
ing Networks describes how Bayesian net-
works are evaluated, and the next section
describes some of their applications.
Independence Assumptions
One objection to the use of probability theory is that the complete specification of a probability distribution requires absurdly many numbers. For example, if there are n binary random variables, the complete distribution is specified by 2^n - 1 joint probabilities. (If you do not know where this 2^n - 1 comes from, wait until the next section, where I define joint probabilities.) Thus, the complete distribution for figure 2 would require 31 values, yet we only specified 10. This savings might not seem great, but if we doubled the
[Figure 2: the figure 1 graph with the nodes abbreviated light-on (lo), family-out (fo), dog-out (do), bowel-problem (bp), and hear-bark (hb).]
Figure 2. A Bayesian Network for the family-out Problem.
I added the prior probabilities for root nodes and the posterior probabilities for nonroots given all possible values of their parents.

P(fo) = .15            P(bp) = .01
P(lo | fo) = .6        P(lo | ¬fo) = .05
P(hb | do) = .7        P(hb | ¬do) = .01
P(do | fo bp) = .99    P(do | fo ¬bp) = .90
P(do | ¬fo bp) = .97   P(do | ¬fo ¬bp) = .3
size of the network by grafting on a copy, as shown in figure 3, 2^n - 1 would be 1023, but we would only need to give 21. Where does this savings come from?
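The bookkeeping behind these counts can be checked in a few lines. This is only a sketch of the arithmetic, with the parent sets read off figure 1: a full joint over n Boolean variables needs 2^n - 1 numbers, while the network needs one number per node per combination of parent values.

```python
# Each node needs 2**len(parents) numbers: one P(node = true | ...)
# per combination of parent values. Parent sets are read off figure 1.
parents = {
    "family-out": [],
    "bowel-problem": [],
    "light-on": ["family-out"],
    "dog-out": ["family-out", "bowel-problem"],
    "hear-bark": ["dog-out"],
}

full_joint = 2 ** len(parents) - 1                      # 31 for 5 variables
network = sum(2 ** len(ps) for ps in parents.values())  # 10 for figure 2
print(full_joint, network)
```

The full joint grows exponentially in the number of variables, while the network's count grows only with the number of nodes and the sizes of their parent sets.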
The answer is that Bayesian networks have built-in independence assumptions. To take a simple case, consider the random variables family-out and hear-bark. Are these variables independent? Intuitively not, because if my family leaves home, then the dog is more likely to be out, and thus, I am more likely to hear it. However, what if I happen to know that the dog is definitely in (or out of) the house? Is hear-bark independent of family-out then? That is, is P(hear-bark | family-out, dog-out) = P(hear-bark | dog-out)? The answer now is yes. After all, my hearing her bark was dependent on her being in or out. Once I know whether she is in or out, then where the family is is of no concern.

We are beginning to tie the interpretation of the arcs as direct causality to their probabilistic interpretation. The causal interpretation of the arcs says that the family being out has a direct causal connection to the dog being out, which, in turn, is directly connected to my hearing her. In the probabilistic interpretation, we adopt the independence assumptions that the causal interpretation suggests. Note that if I had wanted to say that the location of the family was directly relevant to my hearing the dog, then I would have to put another arc directly between the two. Direct relevance would occur, say, if the dog is more likely to bark when the family is away than when it is at home. This is not the case for my dog.
In the rest of this section, I define the independence assumptions in Bayesian networks and then show how they correspond to what one would expect given the interpretation of the arcs as causal. In the next section, I formally show that once one makes these independence assumptions, the probabilities needed are reduced to the ones that I specified (for roots, the priors; for nonroots, the conditionals given immediate predecessors).
First, I give the rule specifying dependence
and independence in Bayesian networks:
In a Bayesian network, a variable a is
[Figure 3: two copies of the figure 1 graph attached to each other; the first copy keeps the names family-out, bowel-problem, light-on, dog-out, and hear-bark, and the second copy's nodes are labeled frog-out, family-lout, night-on, towel-problem, and hear-quark.]
Figure 3. A Network with 10 Nodes.
This illustration is two copies of the graph from figure 1 attached to each other. Nonsense names were given to the nodes in the second copy.
dependent on a variable b given evidence E = {e1 … en} if there is a d-connecting path from a to b given E. (I call E the evidence nodes. E can be empty. It cannot include a or b.) If a is not dependent on b given E, a is independent of b given E.

Note that for any random variable f, it is possible for two variables to be independent of each other given E but dependent given E ∪ {f}, and vice versa (they may be dependent given E but independent given E ∪ {f}). In particular, if we say that two variables a and b are independent of each other, we simply mean that P(a | b) = P(a). It might still be the case that they are not independent given, say, e (that is, P(a | b, e) ≠ P(a | e)).

To connect this definition to the claim that family-out is independent of hear-bark given dog-out, we see when I explain d-connecting that there is no d-connecting path from family-out to hear-bark given dog-out because dog-out, in effect, blocks the path between the two.

To understand d-connecting paths, we need to keep in mind the three kinds of connections between a random variable b and its two immediate neighbors in the path, a and c. The three possibilities are shown in figure 4 and correspond to the possible combinations of arrow directions from b to a and c. In the first case, one node is above b and the other below; in the second case, both are above; and in the third, both are below. (Remember, we assume that arrows in the diagram go from high to low, so going in the direction of the arrow is going down.) We can say that a node b in a path P is linear, converging, or diverging in P depending on which situation it finds itself in according to figure 4.

[Figure 4: node b between a and c in three configurations: a → b → c (linear), a ← b → c (diverging), and a → b ← c (converging).]
Figure 4. The Three Connection Types.
In each case, node b is between a and c in the undirected path between the two.

Now I give the definition of a d-connecting path:

A path from q to r is d-connecting with respect to the evidence nodes E if every interior node n in the path has the property that either
1. it is linear or diverging and not a member of E, or
2. it is converging, and either n or one of its descendants is in E.

In the literature, the term d-separation is more common. Two nodes are d-separated if there is no d-connecting path between them. I find the explanation in terms of d-connecting slightly easier to understand. I go through this definition slowly in a moment, but roughly speaking, two nodes are d-connected if either there is a causal path between them (part 1 of the definition), or there is evidence that renders the two nodes correlated with each other (part 2).

To understand this definition, let's start by pretending that part 2 is not there. Then we would be saying that a d-connecting path must not be blocked by evidence, and there can be no converging interior nodes. We already saw why we want the evidence-blocking restriction. This restriction is what says that once we know about a middle node, we do not need to know about anything further away. What about the restriction on converging nodes? Again, consider figure 2. In this diagram, I am saying that both bowel-problem and family-out can cause dog-out. However, does the probability of bowel-problem depend on that of family-out? No, not really. (We could imagine a case where they were dependent, but this case would be another ball game and another Bayesian network.) Note that the only path between the two is by way of a converging node for this path, namely, dog-out. To put it another way, if two things can cause the same state of affairs and have no other connection, then the two things are independent. Thus, any time we have two potential causes for a state of affairs, we have a converging node. Because one major use of Bayesian networks is deciding which potential cause is the most likely, converging nodes are common.

Now let us consider part 2 in the definition of d-connecting path. Suppose we know that the dog is out (that is, dog-out is a member of E). Now, are family-out and bowel-problem independent? No, even though they were independent of each other when there was no evidence, as I just showed. For example, knowing that the family is at home should raise (slightly) the probability that the dog has bowel problems. Because we eliminated the most likely explanation for the dog being out, less likely explanations become more likely. This situation is covered by part 2. Here, the d-connecting path is from family-out to bowel-problem. It goes through a converging node (dog-out), but dog-out is itself a conditioning node. We would have a similar situation if we did not know that the dog was out but merely heard the barking. In this case, we would not be sure the dog was out, but we do have relevant evidence (which raises the probability), so hear-bark, in effect, connects the two nodes above the converging node. Intuitively, part 2 means that a path can only go through a converging node if we are conditioning on an (indirect) effect of the converging node.
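The definition lends itself to a small path search. The sketch below is my own illustration (the `parents` encoding of figure 2 is an assumption of the sketch, not part of the article): it enumerates undirected paths and applies the two conditions at each interior node.

```python
# A sketch of the d-connection test, applied to the figure 2 graph.
parents = {
    "family-out": [], "bowel-problem": [],
    "light-on": ["family-out"],
    "dog-out": ["family-out", "bowel-problem"],
    "hear-bark": ["dog-out"],
}
children = {n: [] for n in parents}
for n, ps in parents.items():
    for p in ps:
        children[p].append(n)

def descendants(n):
    """All nodes reachable from n by following arcs downward."""
    out, stack = set(), [n]
    while stack:
        for c in children[stack.pop()]:
            if c not in out:
                out.add(c)
                stack.append(c)
    return out

def d_connected(a, b, evidence):
    """True iff some path from a to b d-connects given the evidence nodes.

    A step to a child follows the arc ("fwd"); a step to a parent goes
    against it ("bwd").  An interior node is converging exactly when we
    arrive along an incoming arc and leave along another incoming arc.
    """
    E = set(evidence)

    def interior_ok(n, arrived, leaving):
        converging = (arrived == "fwd" and leaving == "bwd")
        if converging:                      # part 2 of the definition
            return n in E or bool(descendants(n) & E)
        return n not in E                   # part 1: linear or diverging

    def dfs(node, arrived, visited):
        if node == b:
            return True
        steps = ([(c, "fwd") for c in children[node]] +
                 [(p, "bwd") for p in parents[node]])
        for nxt, leaving in steps:
            if nxt in visited:
                continue
            if arrived is not None and not interior_ok(node, arrived, leaving):
                continue
            if dfs(nxt, leaving, visited | {nxt}):
                return True
        return False

    return dfs(a, None, {a})

print(d_connected("family-out", "hear-bark", []))              # causal chain
print(d_connected("family-out", "hear-bark", ["dog-out"]))     # blocked
print(d_connected("family-out", "bowel-problem", []))          # converging
print(d_connected("family-out", "bowel-problem", ["dog-out"])) # explained away
```

The four queries reproduce the article's examples: the chain from family-out to hear-bark d-connects until dog-out is observed, while the two causes of dog-out are d-separated until dog-out (or its descendant hear-bark) is observed.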
Consistent Probabilities
One problem that can plague a naive proba-
bilistic scheme is inconsistent probabilities.
For example, consider a system in which we
have P(a | b) = .7, P(b | a) = .3, and P(b) = .5.
Just eyeballing these equations, nothing looks
amiss, but a quick application of Bayes’s law
shows that these probabilities are not consis-
tent because they require P(a) > 1. By Bayes’s
law,
P(a) P(b | a) / P(b) = P(a | b) ;
so,
P(a) = P(b) P(b | a)/ P(b | a) = .5 * .7 / .3 =
.35 / .3).
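The arithmetic is easy to check mechanically; a two-line sketch:

```python
p_a_given_b, p_b_given_a, p_b = 0.7, 0.3, 0.5

# Rearranging Bayes's law: P(a) = P(a | b) P(b) / P(b | a).
p_a = p_a_given_b * p_b / p_b_given_a
print(p_a)   # larger than 1, so the three numbers are inconsistent
```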
Needless to say, in a system with a lot of such numbers, making sure they are consistent can be a problem, and one system (PROSPECTOR) had to implement special-purpose techniques to handle such inconsistencies (Duda, Hart, and Nilsson 1976). Therefore, it is a nice property of Bayesian networks that if you specify the required numbers (the probability of every node given all possible combinations of its parents), then (1) the numbers will be consistent and (2) the network will uniquely define a distribution. Furthermore, it is not too hard to see that this claim is true. To see it, we must first introduce the notion of a joint distribution.
A joint distribution of a set of random variables v1 … vn is defined as P(v1 … vn) for all values of v1 … vn. That is, for the set of Boolean variables (a, b), we need the probabilities P(a b), P(¬a b), P(a ¬b), and P(¬a ¬b). A joint distribution for a set of random variables gives all the information there is about the distribution. For example, suppose we had the just-mentioned joint distribution for (a, b), and we wanted to compute, say, P(a | b):

P(a | b) = P(a b) / P(b) = P(a b) / (P(a b) + P(¬a b)) .
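For instance, with a made-up joint distribution over (a, b) (the four numbers below are mine, chosen only to illustrate the formula), the conditional falls out directly:

```python
# A made-up joint distribution over two Boolean variables a and b.
joint = {(True, True): 0.4, (False, True): 0.1,
         (True, False): 0.3, (False, False): 0.2}

p_ab = joint[(True, True)]                      # P(a b)
p_b = joint[(True, True)] + joint[(False, True)]  # P(a b) + P(¬a b)

# P(a | b) = P(a b) / (P(a b) + P(¬a b))
p_a_given_b = p_ab / p_b
print(p_a_given_b)
```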
Note that for n Boolean variables, the joint distribution contains 2^n