We previously discussed the problem of learning the parameters of a Bayesian network whose structure was given to us. But that might not always be the case: we don't always have a domain expert who understands the structure of the domain, the relationships between the variables, well enough to write down a network that we think is good enough. So, in what cases might this happen?

The first is when we actually want to use a network for answering new queries, of whatever kind, for example, new medical queries in a medical domain or other settings. In this case, although there might be some domain expertise, it might not be sufficient to come up with a model that is good enough to use, and we might do better by taking data and actually learning the dependencies that the data tells us are most significant. A second setting is where we don't necessarily even care about using the network; we just want to discover the network. This kind of use model occurs, for example, in scientific or biological data sets, where the goal is to discover the interrelationships between the variables just so that we understand the domain better.

So, let's focus for a moment on the first of these two cases, where our goal is to learn a network for the purpose of using it, and let's think about the kinds of mistakes that might happen and how they affect the ability of the network to apply well to new instances. So, let's imagine that the true network is this one over here, and now let's consider different kinds of mistakes that a learning algorithm might make. In one case, the learning algorithm might learn a network that's missing an arc. On the other hand, it might learn a network to which an arc has been added. Now, of course, one can also have cases where we have a bit of each, but let's think about these two types of errors separately.

If we're missing an arc, the independencies that the learned network tells us are incorrect. That is, relative to the ground truth network, which is this one, which if you remember we called G*, the network that we learned is making independence assumptions that are not correct relative to G*. Conversely, when we have added an arc, we have spurious dependencies, in this case between A and B.

Now, what are the implications of this? If we're missing an arc, then the correct distribution P* cannot be learned. That is, if P* is associated with this network over here, so that G* is a perfect map for P*, then in general there's no way to learn P* using a network that's missing an edge. Now, that might seem very bad, and we would perhaps prefer the error mode on the right, which, even though it has this excess edge, does allow us to correctly learn P*: even with this additional edge over here between A and B, there is a setting of the parameters for which the correct distribution P* can still be estimated on this network structure. So, it might seem that this error mode is actually better than the one where we're missing an arc. But empirically, that turns out to be an oversimplified view. Specifically, the model where we have excess arcs gives rise to an increase in the number of parameters.
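To make the parameter-counting point concrete, here is a minimal sketch. The five binary variables and the three specific structures are hypothetical stand-ins for the G*, missing-arc, and extra-arc networks discussed above; the counting rule itself is the standard one, where a variable with k values and parents that have j joint assignments contributes (k − 1) · j free parameters.

```python
def num_parameters(structure, cardinality):
    """Count free CPD parameters of a discrete Bayesian network.

    structure: dict mapping each variable to a list of its parents.
    cardinality: dict mapping each variable to its number of values.
    Each variable X with parents U contributes (|Val(X)| - 1) * |Val(U)|
    free parameters: one multinomial per joint parent assignment.
    """
    total = 0
    for var, parents in structure.items():
        parent_configs = 1
        for p in parents:
            parent_configs *= cardinality[p]
        total += (cardinality[var] - 1) * parent_configs
    return total

# All variables binary, for simplicity.
card = {v: 2 for v in "ABCDE"}

# Hypothetical stand-in for G*: A and B point to C; C points to D and E.
g_star  = {"A": [], "B": [], "C": ["A", "B"], "D": ["C"], "E": ["C"]}
missing = {"A": [], "B": [], "C": ["A"], "D": ["C"], "E": ["C"]}        # arc B -> C dropped
extra   = {"A": [], "B": ["A"], "C": ["A", "B"], "D": ["C"], "E": ["C"]}  # spurious arc A -> B

for name, g in [("G*", g_star), ("missing arc", missing), ("extra arc", extra)]:
    print(name, num_parameters(g, card))
# Prints 10 for G*, 8 for the missing-arc network, 11 for the extra-arc network:
# every added edge adds parameters that must be estimated from the same data.
```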
The more parameters we have to estimate, the harder it is to estimate them correctly given a limited amount of data. We've already talked about this when we talked about fragmentation: the data fragments as the number of parents increases. And so, one of the important conclusions, as we've already seen, is that this can give rise to worse generalization, which means worse performance on unseen instances. So the fact is that sometimes, having fewer edges can actually generalize better than having more edges, even though the correct distribution cannot be encoded. The tradeoffs between missing arcs and extra arcs are subtle, and it's not always clear which is going to give us the best performance.

So, with that introduction, what is the way in which we typically learn Bayesian network structure from data? The general paradigm, which we will refine in subsequent parts of this segment of the course, is that we define a scoring function that evaluates, for each structure, how well that structure matches the data. So here, we have a data set D, and we have, in this case, three example network structures. We're going to define a scoring function that tells us, for each of these network structures, how good it is relative to the data that we've seen. And then, we have the goal of searching for a network structure that maximizes the score.

And so, we have now turned the learning problem into an optimization problem: an optimization over a combinatorial space, the space of network structures. By defining a score, we get the objective that we're trying to optimize, and now we need to come up with an algorithm that optimizes that score. So, we've separated this out into the score component and the optimization component. And we'll talk separately about each of these in subsequent parts.
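As a concrete illustration of this score-and-search paradigm, here is a minimal sketch, with caveats: it assumes fully observed discrete data stored as a list of dicts, it uses the BIC score as one standard choice of scoring function, and it uses greedy hill-climbing over single-arc changes as one standard choice of search procedure. The function names and data layout are illustrative, not from the course.

```python
import math
from collections import Counter

def bic_score(structure, data):
    """BIC score = max log-likelihood - (log m / 2) * (# free parameters).

    structure: dict mapping each variable to a list of its parents.
    data: list of dicts mapping each variable to its observed value.
    The score decomposes into one term per family (variable + parents),
    which is what makes searching over structures feasible.
    """
    m = len(data)
    score = 0.0
    for var, parents in structure.items():
        joint, marg, values = Counter(), Counter(), set()
        for row in data:
            u = tuple(row[p] for p in parents)  # parent assignment
            joint[(u, row[var])] += 1
            marg[u] += 1
            values.add(row[var])
        # Maximum-likelihood log-likelihood for this family.
        ll = sum(n * math.log(n / marg[u]) for (u, x), n in joint.items())
        # Free parameters: (|Val(var)| - 1) per observed parent assignment
        # (a simplification: unobserved parent assignments are not counted).
        n_params = (len(values) - 1) * len(marg)
        score += ll - 0.5 * math.log(m) * n_params
    return score

def would_cycle(structure, frm, to):
    """True if adding the arc frm -> to would create a directed cycle,
    i.e. if 'to' is already an ancestor of 'frm'."""
    stack, seen = [frm], set()
    while stack:
        v = stack.pop()
        if v == to:
            return True
        seen.add(v)
        stack.extend(p for p in structure[v] if p not in seen)
    return False

def greedy_search(variables, data, max_passes=50):
    """Hill-climb over single-arc additions and deletions, starting from
    the empty network, until no single change improves the score."""
    structure = {v: [] for v in variables}
    best = bic_score(structure, data)
    for _ in range(max_passes):
        improved = False
        for frm in variables:
            for to in variables:
                if frm == to:
                    continue
                cand = {v: list(ps) for v, ps in structure.items()}
                if frm in cand[to]:
                    cand[to].remove(frm)                # try deleting frm -> to
                elif not would_cycle(structure, frm, to):
                    cand[to].append(frm)                # try adding frm -> to
                else:
                    continue
                s = bic_score(cand, data)
                if s > best:
                    structure, best, improved = cand, s, True
        if not improved:
            break
    return structure, best
```

Hill-climbing like this can get stuck in local optima, and the space of network structures is super-exponential in the number of variables, which is why exhaustive search is not an option; the score and the search are each discussed in more detail in the subsequent parts.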