Our final big module in this course is learning a probabilistic graphical model from data. Before we delve into the details of specific learning algorithms, let's think about some of the reasons why we might want to learn a probabilistic graphical model from data, some of the different scenarios in which this learning problem might arise, and how we might go about evaluating the results of our learning algorithm. The setup is that we assume there is some true distribution, typically denoted P*. In many cases, although not always, we also assume that P* was actually generated by a probabilistic graphical model M*, and that assumption allows us to talk about the differences between a learned model and the ground-truth model that generated the distribution. From this distribution P*, we obtain a data set D of instances d1 up to dM, which we assume are sampled from P*. In addition to the data, we may or may not have some amount of domain expertise that allows us to put prior knowledge into the model. In fact, the ability to put in prior knowledge is one of the strengths of probabilistic graphical model learning, compared to a variety of other learning algorithms where this is not always as easily done. Combining elicitation from an expert with learning, we end up with a network that we can then look at and use for different purposes.

To make this a little more concrete, let's look at the different scenarios in the context of Bayesian networks; the issues for Markov networks look fairly similar. In the case of known structure and complete data, we have a network structure that we assume to be correct, and the input data are nice and clean: every variable has a value in every single instance. Our goal is to produce the set of CPDs for the network (see the sketch below). In the case of unknown structure and complete data, we have the same type of data set, but the initial network has no edges, and we now need to infer the edge connectivity as well as the CPDs. Incomplete data arises when some of the variables are not observed in the training data; as we'll see, this can complicate the learning problem quite considerably. And finally, there is unknown structure with incomplete data. In the latent variable case, we observe only three variables, X1, X2, and Y, but our final model has, in addition to X1, X2, and Y, a latent variable H whose values we never observed; we may not even have known of its existence. We want to learn a model that involves not only X1, X2, and Y, but also the variable H.

Now let's think about the reasons why we might want to learn a probabilistic graphical model. The most obvious one is that we want a model we can use the same way we would use one elicited by hand: to answer probabilistic queries, whether conditional probability queries or MAP queries, about new instances that we haven't seen before. Introducing a concept that we'll study in more detail a little later on, the simplest possible metric we might envision for training a PGM is basically: how probable are the instances that we've seen, relative to a given model?
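Before formalizing that metric, here is a concrete picture of the simplest scenario above, known structure and complete data, where the maximum-likelihood CPDs are just normalized counts. This is a minimal Python sketch with made-up data: the tiny two-node network X → Y, the binary values, and all the numbers are purely illustrative.

```python
import numpy as np

# Toy complete data set D for a two-node network X -> Y with binary
# variables. Each row is one fully observed instance (x, y).
data = np.array([
    [0, 0], [0, 1], [0, 0], [1, 1],
    [1, 1], [1, 0], [0, 0], [1, 1],
])

# With known structure and complete data, maximum-likelihood CPDs
# are simply normalized counts.
x_counts = np.bincount(data[:, 0], minlength=2)
p_x = x_counts / x_counts.sum()                      # estimate of P(X)

p_y_given_x = np.zeros((2, 2))                       # estimate of P(Y | X)
for x in (0, 1):
    y_counts = np.bincount(data[data[:, 0] == x, 1], minlength=2)
    p_y_given_x[x] = y_counts / y_counts.sum()

print("P(X):", p_x)
print("P(Y|X):", p_y_given_x)
```

With complete data and a fixed structure, each CPD can be estimated from the relevant counts independently of the others, which is what makes this the easiest of the four scenarios.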
The metric just suggested is called the training set likelihood, and it is formalized as the probability of the data we've seen, our data set D, relative to a given model M: P(D | M). The intuition is that if a model makes the data more likely, that is, it was more likely to have generated this data set, then it is a pretty good hypothesis about the process that generated our data. Opening up this definition, the likelihood turns into the product over the instances of the probability of each individual instance given the candidate model M: P(D | M) = P(d1 | M) × P(d2 | M) × ... × P(dM | M). This assumes that the instances are IID, independent and identically distributed, given the model M.

One important notion that will accompany us throughout this discussion is that while the training set likelihood seems intuitively like a pretty good scoring function for picking a model, it isn't what we actually care about. What we really care about is new data, not the data we've already seen; we care about making conclusions about data we haven't seen. So what we really want to do is evaluate our model on a separate test set. You've already seen the notion of a test set in the context of other learning problems, and the same idea is just as fundamental for PGMs: our evaluation should be based not on the original data set D, but on a new data set D', which gives us a surrogate for what's called generalization performance (see the sketch below).

A related but somewhat different variant of the learning task arises when we have a specific prediction problem that we care about: predicting a particular set of target variables Y from a set of observed variables X. We've seen multiple examples of this, such as image segmentation, where X is the pixels in the image and Y is the predicted class labels, and speech recognition, where X is an acoustic signal and Y is a sequence of phonemes. All of these are cases where we have a particular prediction task. In this setting we often care about a specialized objective, for example pixel-level segmentation accuracy in the context of image segmentation, or word accuracy rate in the context of speech recognition. Even so, it turns out that in many cases it's convenient, for algorithmic and mathematical purposes, to select the model by optimizing the same notion of likelihood, or the conditional likelihood, where we compute the probability of the Ys given the Xs. Although likelihood is not always a perfect surrogate for the specialized objective that we actually care about, it is mathematically convenient, and that's why it's often used. However, it's important to evaluate the model's performance on the true objective over test data, rather than just using likelihood to judge how successful our learning algorithm was.
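As a small illustration of the likelihood computation and of why we report performance on held-out data, here is a hedged Python sketch. The CPD values and the train/test splits are hypothetical numbers for the same two-node X → Y network as above, chosen only to show the shape of the calculation.

```python
import numpy as np

# CPDs of a candidate model M for the two-node network X -> Y
# (hypothetical numbers, e.g. produced by the estimation sketch above).
p_x = np.array([0.5, 0.5])                 # P(X)
p_y_given_x = np.array([[0.75, 0.25],      # P(Y | X=0)
                        [0.25, 0.75]])     # P(Y | X=1)

def log_likelihood(data, p_x, p_y_given_x):
    """log P(D | M) = sum over instances of log P(d_m | M), using the IID assumption."""
    return sum(np.log(p_x[x]) + np.log(p_y_given_x[x, y]) for x, y in data)

train = np.array([[0, 0], [0, 1], [1, 1], [1, 1], [0, 0], [1, 0]])
test  = np.array([[0, 0], [1, 1], [0, 1], [1, 0]])   # held-out D'

# The training-set number is the quantity we optimize; the held-out
# number is the surrogate for generalization performance.
print("train log-likelihood:", log_likelihood(train, p_x, p_y_given_x))
print("test  log-likelihood:", log_likelihood(test,  p_x, p_y_given_x))
```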
A third setting where one might want to use PGM learning is qualitatively quite different. In this case, we might not care about using the model for any particular inference task; rather, what we care about is inferring the structure itself. This is knowledge discovery, or structure discovery, where our goal is to get as close as possible to the generating model M*. Using PGM learning for this task might help us distinguish between direct and indirect dependencies: if we see a correlation between X and Y in the data, we want to infer whether it corresponds to a direct probabilistic interaction between them, or to something mediated by a third variable C, for example. In some cases, when we learn a Bayesian network, we might be able to infer the directionality of the edges and thereby get some intuition regarding causality. And in other cases, when we learn models with latent variables, the existence of those latent variables, their location, and often the way in which their values get assigned to different instances gives us a lot of information about the structure of the domain. In many cases, although not always, we solve this learning problem using the same likelihood-based training objective. We know that likelihood is not a particularly good surrogate for structural accuracy, but from a mathematical and algorithmic perspective it is a very convenient optimization objective, and it is therefore often used in practice, although other ideas exist. However, it is important not to use likelihood, even test set likelihood, as the sole objective for evaluating model performance. In many cases, as we'll see in the context of some examples, the evaluation needs to be done by comparing against whatever limited prior knowledge we have about M*: we can hold out prior knowledge that was not given to the algorithm and see whether the algorithm was able to adequately reconstruct it.

Now, we talked earlier in this module about the fact that the training set likelihood tends to overfit the model. That is, in fact, a general observation: when we select the model M to optimize the training set likelihood, we tend to overfit badly to statistical noise, the random fluctuations that arise when we generate our training set. This happens in several different ways. It happens at the level of parameters, where the parameters fit random noise in the training data; that can be avoided by the use of regularization or priors over the parameters, and we'll see how that gets done. It also happens when we overfit the structure. Specifically, one can show that if we optimize the training set likelihood, then complex structures always win: we will always prefer the most complicated structure that our model class allows. So if we're trying to fit structure, it's important to either bound the model complexity or penalize it, so that we don't learn models that are ridiculously complicated for no good reason (the sketch below illustrates one such penalty).

All of these different choices are called hyperparameters. Hyperparameters include things like the parameter priors or regularization over the parameters and the strength of that regularization; if we impose complexity bounds or complexity penalties, those are hyperparameters too. All of these are things we need to pick before we can actually apply our learning algorithm.
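To make the complexity-penalty point concrete, here is a small sketch of one standard penalized score, the BIC score (not derived in this section), which subtracts from the training log-likelihood a term that grows with the number of free parameters. The log-likelihood values and parameter counts below are made-up numbers chosen only to show how the penalty can reverse the ranking of two candidate structures.

```python
import numpy as np

def bic_score(log_lik, num_params, num_instances):
    """Training log-likelihood minus a complexity penalty that grows
    with the number of free parameters and with log(M)."""
    return log_lik - 0.5 * np.log(num_instances) * num_params

M = 1000  # number of training instances

# Two candidate structures over the same variables: a sparse one and a
# denser one that fits the training data slightly better (made-up numbers).
sparse = bic_score(log_lik=-5200.0, num_params=10, num_instances=M)
dense  = bic_score(log_lik=-5180.0, num_params=60, num_instances=M)

# The denser structure wins on raw training likelihood, but the penalty
# reverses the ranking, discouraging needlessly complicated models.
print("sparse structure score:", sparse)
print("dense  structure score:", dense)
```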
So how do we pick these hyperparameters? It turns out that this decision makes a huge difference, in many cases, to the performance of our learning algorithm. One obvious choice is to pick them on the training set. A few seconds of thought ought to convince us that that is a terrible idea, because we just said that on the training set, the optimal thing to do is to use maximum complexity; if we pick these hyperparameters on the training set, they become totally vacuous. Another obvious choice is to pick them on the test set. That turns out to be another terrible idea, because it makes our performance estimate overly optimistic: we would be choosing these very important parameters so as to optimize performance on our test set. The training set is bad and the test set is bad, so the correct strategy is to use what is called a validation set, a set that is separate from both the training set on the one hand and the test set on the other. A variant on this is to use cross-validation on the training set, where we repeatedly split the training set into a training component and a validation component and use that to pick the hyperparameters (a concrete sketch appears at the end of this section). These are all concepts you have seen before in the context of other learning algorithms, and they are equally important here.

Finally, let's talk about why and when you might want to use PGM learning as opposed to a generic machine learning algorithm. PGM learning is particularly useful when we're trying to make predictions not over a single output variable, such as a binary outcome (the positive class or the negative class), but over structured objects. For example, labeling entire sequences, as when we do sequence labeling in speech recognition or natural language processing, or labeling entire graphs, as in image segmentation, where we have a grid of pixels and we're trying to label all the pixels simultaneously. This allows us to exploit correlations between multiple predicted variables, often giving us significant improvements in performance. A second reason to use PGM learning is that it allows us to incorporate prior knowledge into our model in a way that many other algorithms have difficulty supporting. It is also particularly useful when we're trying to learn a single PGM for multiple different tasks: whereas traditional learning algorithms learn a particular X-to-Y mapping, here we can learn a single graphical model and use it in multiple different ways to answer different types of queries. And finally, the idea of using learning for knowledge discovery is also possible in the context of other learning algorithms, but it is particularly useful in the context of PGMs, because the form of the knowledge is often particularly intuitive.
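Returning to the hyperparameter-selection procedure above, here is a minimal sketch of k-fold cross-validation on the training set, used to pick a pseudocount that stands in for the strength of a parameter prior or regularizer. Everything here is illustrative: the single binary variable, the candidate pseudocount values, and the synthetic data are all assumptions made only for the example.

```python
import numpy as np

def fit_dist(counts, alpha):
    """Estimate a distribution from counts with pseudocount alpha
    (a simple stand-in for a parameter prior / regularizer)."""
    return (counts + alpha) / (counts + alpha).sum()

def fold_log_lik(train_x, val_x, alpha):
    """Fit P(X) on one training fold, score it on the validation fold."""
    counts = np.bincount(train_x, minlength=2)
    p = fit_dist(counts, alpha)
    return np.sum(np.log(p[val_x]))

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.7, size=50)          # toy training data for one binary variable

k = 5
folds = np.array_split(np.arange(len(x)), k)

best_alpha, best_score = None, -np.inf
for alpha in (0.1, 1.0, 5.0, 20.0):        # candidate hyperparameter values
    score = 0.0
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        score += fold_log_lik(x[train_idx], x[val_idx], alpha)
    if score > best_score:                 # keep the alpha with the best validation score
        best_alpha, best_score = alpha, score

print("selected pseudocount:", best_alpha)
```

The key point is that the hyperparameter is chosen only from splits of the training data; the test set is never touched until the final evaluation.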