So far, we have tackled the problem of learning a model's structure or parameters from complete data. We're now going to move to what turns out to be a much harder situation: learning when we have only partially observed data. This arises in a variety of settings. It arises when some variables are simply never observed; they're hidden, or latent. It also occurs when some variables are missing because certain measurements weren't taken. As we'll see, these settings pose significant challenges both in terms of the foundations, that is, defining the learning task in a reasonable way, and from a computational perspective, where the issues that arise in the incomplete data setting are considerably more challenging.

I mentioned latent variables, so let's try to argue why we might care about them. One reason is that latent variables can often give rise to sparser, and therefore easier to learn, models. Imagine that this is my true network G*, where we have three variables X1, X2, X3 leading into the variable H, and then three variables Y1, Y2, Y3 at the bottom. If all variables are binary, this network can be parameterized with seventeen independent parameters. But now imagine that I've decided H is latent, and I'm just going to learn a network over the observable variables, the X's and the Y's. So what is the network that correctly captures the structure of the distribution P over X1, X2, X3, Y1, Y2, and Y3? It turns out that this network, if you think about it, has, because H is not there, an edge from every X to every Y. Furthermore, because the Y's are no longer conditionally independent given the X's (they're only conditionally independent given the H that I don't observe), I also have edges between the Y's directly. So the spaghetti actually turns out to look like this, with a total of 59 parameters in the network (I'll spell out both parameter counts right after this introduction). By dropping this one latent variable, I've created a model that is much harder to learn. Of course, learning a model with latent variables is itself a problematic situation, but it may well be worth the tradeoff.

The other reason we might care about learning latent variables is that they might be interesting in their own right: they might provide us with an interesting characterization of structure in the data. I'll give you details of that in a later module, but for the moment, just as a teaser, imagine that we have a data set of 3D point clouds scanned from a human body, and we would like to discover from that the limb structure of the person to whom the scans correspond. That is, we want to identify clusters in the point cloud that correspond to body parts, and so we want to end up with an output where each point has a latent variable representing which body part it belongs to.

Having motivated why we might care about missing data, let's think about some of the complexities that arise. Imagine that somebody gives us this sequence of coin tosses over here, with question marks that correspond to missing data, and asks how we should treat it. The answer is: if you don't know why these data are missing, you have no idea how to proceed. To understand this, let's consider two different scenarios.
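Before we get to those, as promised, here is the arithmetic behind the two parameter counts. This is just a sanity check, under the assumptions that all variables are binary (so a variable with k binary parents needs 2^k independent parameters) and that, once H is dropped, the Y's are ordered Y1, Y2, Y3, so that Y2 also gets Y1 as a parent and Y3 gets both Y1 and Y2:

    With H (the true network G*):
        3 (one parameter for each of X1, X2, X3) + 2^3 (H given its three parents) + 3 × 2 (each Y given H)
        = 3 + 8 + 6 = 17

    Without H:
        3 (the X's) + 2^3 (Y1 given X1, X2, X3) + 2^4 (Y2 given the X's and Y1) + 2^5 (Y3 given the X's, Y1, and Y2)
        = 3 + 8 + 16 + 32 = 59

Now, the two scenarios.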
The first scenario: an experimenter is asked to toss a coin, and occasionally the coin misses the table and drops on the floor. The experimenter is too tired to crawl under the table to see what happened, so they don't record the value of the coin in the cases where it fell on the floor. In the second scenario, the coin is tossed, but the experimenter doesn't like tails; for some reason, tails give them the heebie-jeebies, and so tails are sometimes not reported. Note that these two cases really should give rise to very different estimation procedures if we are trying to learn from this data set. Specifically, in the first case, we should probably just ignore the question marks and learn from the sequence of observed instances, H, T, H, H, because the fact that the other tosses are missing doesn't tell us anything about their values. In the second case, on the other hand, we can't ignore the missing measurements. We need to learn from the sequence H, T, T, T, H, T, H, because ignoring the missing values is effectively ignoring something that is predominantly or entirely tails, and so we would get incorrect estimates if we just ignored them. So, in order to learn correctly with missing data, we need to consider the mechanism by which the data were made to be missing.

So how do we model the notion of missing data? Imagine that we have a set of random variables X1, ..., Xn that define our model, and in any given instance we may observe some of them but not others. We're going to define a set of observability variables, which are always observed: O_i is one if X_i is observed, and zero otherwise. We always know whether we observed a variable or not, so O_i is always observed. We're now going to add a new set of random variables, also always observed, which we'll call Y_i. Each Y_i has the same value space as X_i, except that there is also an additional value, '?', meaning "I didn't get to observe it." So in the real scenario we get to observe the Y's and the O's, and the X's are not observed. The Y's are a deterministic function of the X's and the O's: Y_i equals X_i when O_i is one, and Y_i equals '?' when O_i is zero. In the cases where O_i is one, I can reconstruct the value of X_i; in the cases where I don't have the observation, I can't. So this is just a way of defining the observability pattern that I have.

With this set of variables, I can now model the two scenarios that we had before. In the first, which corresponds to the coin occasionally falling on the ground, we have a separate model over here that represents our observability pattern: a variable is sometimes observed by chance, and the observed value Y depends on X and on O, but there is no interaction between the value of the coin and whether I end up observing it. By comparison, in the case where the experimenter doesn't like tails, the true value of the X variable affects whether it is observed or not, and so we have an edge from X to O. So, in which of these cases can we ignore the missing data mechanism and focus only on the likelihood of the values I actually get to observe? (The simulation sketch below plays out both cases.)
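To make the two mechanisms concrete, here is a minimal simulation sketch; it is not from the lecture, and the drop and hide probabilities are made up for illustration. It generates coin tosses, applies each missingness mechanism, and then estimates P(heads) by naively ignoring the question marks, which works in the first scenario and is biased in the second:

```python
import random

random.seed(0)

def toss(p_heads=0.5):
    return "H" if random.random() < p_heads else "T"

def scenario_1(n, p_drop=0.2):
    # The coin occasionally falls off the table: whether a toss is recorded
    # is independent of its value (O does not depend on X).
    return [x if random.random() > p_drop else "?" for x in (toss() for _ in range(n))]

def scenario_2(n, p_hide_tails=0.5):
    # The experimenter dislikes tails: tails are sometimes not reported,
    # so whether a toss is recorded depends on its (unobserved) value.
    data = []
    for _ in range(n):
        x = toss()
        data.append("?" if x == "T" and random.random() < p_hide_tails else x)
    return data

def naive_estimate(data):
    # Estimate P(heads) by simply ignoring the question marks.
    observed = [v for v in data if v != "?"]
    return sum(v == "H" for v in observed) / len(observed)

n = 100_000
print(naive_estimate(scenario_1(n)))  # close to the true value 0.5
print(naive_estimate(scenario_2(n)))  # biased upward, roughly 0.5 / (0.5 + 0.25) ~= 0.67
```

The point is exactly the one above: in the first scenario the question marks carry no information about the coin, while in the second they stand in for tosses that are predominantly tails.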
The answer is a notion called missing at random. Missing at random is a way of saying: I can ignore the mechanism for the observability and focus only on the likelihood of what I observed. One can show that, for this purpose, it suffices that the distribution over X, Y, and O has the following property: the observability variables O are independent of the unobserved X's, which we're going to denote H, given the observed values Y, which are my data instances. This means that if you tell me the values that you observed, then the fact that something else may or may not have been observed carries no additional information. This is a bit of a tricky notion, so let's give an example. Imagine that a patient comes into the doctor's office, and the doctor chooses which set of tests to perform; for example, the doctor chooses to perform, or not perform, a chest X-ray. The fact that the doctor didn't choose to perform a chest X-ray probably indicates that the person didn't come in with a deep cough or other symptoms suggestive of tuberculosis or pneumonia, and that is why the test wasn't performed. So the observation, or lack thereof, of a chest X-ray, the fact that a chest X-ray doesn't exist in my patient record, is probably an indication that the patient didn't have tuberculosis or pneumonia. These are not independent, so in that model the missing at random assumption does not hold, because the observability pattern tells me something about the disease, which is the unobserved variable that I care about. On the other hand, suppose my medical record also contains the primary complaint that the patient came in with, for example a broken leg. Then, given that the primary complaint was a broken leg, I already know that the patient likely didn't have tuberculosis or pneumonia, and therefore, given that observed variable, the primary complaint, the observability pattern no longer gives me any information about the variables that I didn't observe. That is the difference between a scenario that is missing at random and one that isn't. For the purposes of our discussion, we're going to make the missing at random assumption from here on.

What's the next complication with incomplete data? It turns out that the likelihood can have multiple global maxima. Intuitively, that's almost obvious: if you have a hidden variable that takes two values, zero and one, those values don't mean anything. We could rename them one and zero and just invert everything, and it would give us a model exactly equivalent to the original, because the names don't mean anything. That immediately means there is a reflection of my likelihood function that occurs when I rename the values. And this is not something that happens just in this case: when we have multiple hidden variables, the problem only becomes worse, because the number of global maxima becomes exponentially large in the number of hidden variables. So now we have a function with exponentially many reflections of itself. It turns out that this can also occur when we have missing data, not just hidden variables: even if all I have are data where only some occurrences of a variable are missing their values, that can give me multiple local and global maxima. (The sketch below checks the relabeling symmetry numerically on a tiny example.)
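Here is that relabeling symmetry checked numerically, as a small sketch with made-up numbers: a naive-Bayes-style model with one hidden binary variable H and two observed binary variables X1 and X2. Swapping the two labels of H, and swapping the corresponding CPD entries, yields exactly the same likelihood for any data set, so parameter settings come in equivalent pairs:

```python
# Hidden-variable model: H -> X1, H -> X2, with H never observed.
# theta = (P(H=1), P(X1=1 | H=0), P(X1=1 | H=1), P(X2=1 | H=0), P(X2=1 | H=1))
def likelihood(theta, data):
    pH1, pX1_H0, pX1_H1, pX2_H0, pX2_H1 = theta
    L = 1.0
    for x1, x2 in data:
        total = 0.0
        for h in (0, 1):  # sum over both possible completions of the hidden H
            pH = pH1 if h == 1 else 1 - pH1
            pX1 = pX1_H1 if h == 1 else pX1_H0
            pX2 = pX2_H1 if h == 1 else pX2_H0
            total += pH * (pX1 if x1 else 1 - pX1) * (pX2 if x2 else 1 - pX2)
        L *= total
    return L

data = [(1, 1), (0, 1), (1, 0), (0, 0), (1, 1)]
theta = (0.3, 0.9, 0.2, 0.8, 0.1)
# Rename H's values 0 <-> 1: P(H=1) becomes 1 - 0.3, and the H=0 / H=1
# columns of each CPD are swapped.
theta_relabeled = (0.7, 0.2, 0.9, 0.1, 0.8)
print(likelihood(theta, data))            # identical value...
print(likelihood(theta_relabeled, data))  # ...so every optimum comes in (at least) a mirrored pair
```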
To understand this in a bit more depth, let's go back and compare the likelihood in the complete data case with the likelihood in the incomplete data case. Here is a simple model with two variables, X and Y, where X is a parent of Y, and three instances. If we just go ahead and write down the complete data likelihood, it has the beautiful form we've already seen before: the product of the probabilities of the three instances, where we've omitted the parameters from the notation for clarity. For the first instance this is θ_{x0}·θ_{y0|x0}, and similarly a product of parameter entries for the second and third instances. The point is that this ends up being a nicely decomposable function of the parameters: a product, which, if we take the log, becomes a sum. As a likelihood, it decomposes by variables, and it decomposes within CPDs.

What about the incomplete data case? Let's make our life a little more complicated: whereas before we had three complete instances, now two of the instances have an incomplete observation of the variable X. Let's write down the likelihood function in this case. The likelihood is now P(y0), for the first data instance, times P(x0, y1), for the second data instance, times another P(y0) for the third; since P(y0) appears twice, we've squared that term over here. And P(y0) is the sum over x of P(x, y0): we have to consider both possible ways of completing the data, for the two values x0 and x1. If we unravel the expression inside the parentheses, it ends up looking like this: θ_{x0}·θ_{y0|x0} + θ_{x1}·θ_{y0|x1}. The important observation about this expression is that it is not a product of parameters of the model, which means we cannot take its log and have it decompose, because the log of a summation doesn't decompose. So the nice decomposition properties of the likelihood function have disappeared in the case of incomplete data. It does not decompose by variables; notice that we have a θ entry for the X variable sitting in the same expression as an entry from the P(Y | X) CPD. It does not decompose within CPDs. And even computing this likelihood function requires that we do a sum-product computation, so it effectively requires a form of probabilistic inference. (The short computation below evaluates this likelihood for concrete, made-up parameter values.)
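Here is that computation carried out numerically, a small sketch for the X → Y example with the instances (?, y0), (x0, y1), (?, y0); the parameter values are made up purely for illustration:

```python
# Made-up parameter values for the CPDs of X and of Y given X.
theta_x0 = 0.6                              # P(X = x0); P(X = x1) = 0.4
theta_y0_given_x = {"x0": 0.3, "x1": 0.8}   # P(Y = y0 | X)

def p_x(x):
    return theta_x0 if x == "x0" else 1 - theta_x0

def p_y_given_x(y, x):
    p = theta_y0_given_x[x]
    return p if y == "y0" else 1 - p

def p_joint(x, y):
    return p_x(x) * p_y_given_x(y, x)

def p_y(y):
    # Marginalizing out the unobserved X means summing over both completions;
    # this sum is exactly what breaks the product form of the likelihood.
    return sum(p_joint(x, y) for x in ("x0", "x1"))

# Data: (?, y0), (x0, y1), (?, y0)  ->  L = P(y0)^2 * P(x0, y1)
likelihood = p_y("y0") * p_joint("x0", "y1") * p_y("y0")
print(likelihood)

# The same thing written out in terms of the parameters:
# (theta_x0 * theta_y0|x0 + theta_x1 * theta_y0|x1)^2 * theta_x0 * theta_y1|x0
print((0.6 * 0.3 + 0.4 * 0.8) ** 2 * 0.6 * 0.7)   # matches
```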
So what do these properties, the ones we saw on this slide and the previous one, imply about the likelihood function? Before, with complete data, our likelihood function had the form of these gray curves over here; this, for example, is a likelihood function in a complete data scenario. When we have incomplete data, we're effectively summing up the probabilities of all possible completions of the unobserved variables, and so the overall likelihood function ends up being a summation of likelihood functions that correspond to the different ways of completing the data, each of them one of these nice, log-concave likelihood functions. But the point is that when you add them all up, it doesn't look so nice at all: it ends up having multiple modes, and it's very much harder to deal with.

The second problem that we have, in addition to multimodality, is that the parameters become correlated with each other. If you remember, in the case of complete data, the likelihood function decomposed as a product of little likelihoods for the different parameters. What happens in an incomplete data scenario? When you look at this network, you can see that when X is not observed, you have an active v-structure: a path that goes from θ_{Y|X} through the observed Y, then through the unobserved X, all the way to θ_X. Intuitively, that suggests there is a correlation, an interaction, between the values that I choose for θ_{Y|X} and for θ_X. And when you think about the intuition for that, it makes perfect sense. If θ_X makes x0 very, very likely, then most of the instances where X is unobserved will be assigned to the x0 case, and that is going to change the estimates of θ_{Y|x0} relative to θ_{Y|x1}, because most of the instances now correspond to x0 rather than x1. Hence the correlation between them. To see that correlation manifesting, let's look at some graphs. What we're seeing here is the interaction between two entries of the CPD θ_{Y|X}: on one axis is θ_{Y|x0}, and on the other is θ_{Y|x1}. What you see is a contour plot of the likelihood function for eight data points and zero missing measurements, and you can see that this is a nice product-form likelihood function with a single peak in the middle and no interaction between the two parameters. But as we get more and more missing measurements, the contour plot starts to deform, and even with three missing measurements you can see there is significant interaction between the value I end up picking for θ_{Y|x1} and the value I end up picking for θ_{Y|x0}. (The small numerical check at the end of this section tests for exactly this kind of interaction.)

So, to summarize: incomplete data is something that arises very often in practice, and it raises multiple challenges. How the missing values were generated, what made them missing, turns out to be very important. Certain components of the model can be unidentifiable, because there are several equally good solutions; if you pick the best one, you had better realize that there are others out there that are equally good. And finally, the complexity of the likelihood function, its multimodality and the coupling between parameters, is another significant complication when trying to deal with incomplete data.
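Finally, here is the small numerical check of the coupling mentioned above, again just a sketch with made-up data for the same X → Y model. If the log-likelihood decomposed into one term depending only on θ_{Y|x0} and another depending only on θ_{Y|x1}, the "interaction" quantity below would be exactly zero; with X missing in some instances, it is not:

```python
import math

def log_likelihood(p_y1_x0, p_y1_x1, theta_x0=0.5):
    # Data for the X -> Y model; None means the value of X is missing.
    data = [("x0", "y1"), ("x1", "y0"), (None, "y1"), (None, "y0"), (None, "y1")]

    def p_joint(x, y):
        px = theta_x0 if x == "x0" else 1 - theta_x0
        py1 = p_y1_x0 if x == "x0" else p_y1_x1
        return px * (py1 if y == "y1" else 1 - py1)

    ll = 0.0
    for x, y in data:
        if x is None:
            # Missing X: sum over both completions (the non-decomposable part).
            ll += math.log(p_joint("x0", y) + p_joint("x1", y))
        else:
            ll += math.log(p_joint(x, y))
    return ll

# For an additively separable function f(a) + g(b), this combination is always zero.
interaction = (log_likelihood(0.2, 0.3) + log_likelihood(0.7, 0.8)
               - log_likelihood(0.2, 0.8) - log_likelihood(0.7, 0.3))
print(interaction)   # nonzero, so the two CPD entries interact in the likelihood
```

If you delete the three instances with missing X, the interaction becomes exactly zero, matching the complete-data contour plot with no interaction between the two parameters.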