We previously discussed the notion of maximum likelihood estimation and showed how it can be applied in the context of directed networks, that is, Bayesian networks. And everything was really nice: you could do maximum likelihood estimation in closed form, and it was all efficient and elegant. Now we're going to look at the other class of networks, Markov networks, and see how maximum likelihood estimation pans out in that context, specifically when we use the log-linear model representation, because it's a lot easier to think about parameter estimation in that setting.

Let's look at what maximum likelihood estimation would look like in a Markov network. First, recall that we use the log-likelihood as our objective. So what would the log-likelihood look like? The probability, under the distribution defined by a set of factors over this network, of a data instance $(a, b, c)$ has the form $\phi_1(a, b) \cdot \phi_2(b, c)$, multiplied by a normalizing constant $1/Z$, where $Z$ is the partition function. If we now take that, put it through the log in order to define the log-likelihood, and then sum it over multiple data instances, we get the following expression: a summation over instances $m$ of $\log \phi_1(a[m], b[m]) + \log \phi_2(b[m], c[m]) - \log Z(\theta)$. And here, importantly, notice that I've included the parameters of the model, $\theta$, as an argument to the partition function, which explains why it's a partition function rather than a partition constant: it is a function of the parameters, and as we change the parameters the partition function changes. That's going to turn out to be very important.

So now let's continue deriving this expression. Just as in the context of Bayesian networks, we can gather like terms and consider, for example, how many times a particular entry of the factor $\phi_1(A, B)$ is used: it's used in every instance where the variable $A$ takes the value little $a$ and the variable $B$ takes the value little $b$. And so we can again accumulate like terms, and we end up with a summation over all possible value combinations, little $a$, little $b$, of the log of that factor entry, $\log \phi_1(a, b)$, multiplied by the count $M[a, b]$. Similarly, we have exactly the same form for the second factor: for every entry of the factor $\phi_2(B, C)$, the number of occurrences is simply the count $M[b, c]$ in our data set. And finally, we have this partition function hanging off the end, which is accumulated once for every data instance.

So this looks great so far, because, as in the context of Bayesian networks, we've decomposed the likelihood function, or in this case the log-likelihood function, into a bunch of terms, each of which involves one parameter and a count of how many times that parameter is used. A beautiful decomposition, and maybe we can just go ahead and optimize each parameter separately. Except that we have that nasty partition function hanging off the end. And if we look at the form of the partition function, it is a sum over all possible values of $A$, $B$, and $C$ of the product $\phi_1(a, b)\,\phi_2(b, c)$.
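To make this concrete, here is a minimal sketch, not from the lecture itself, assuming a hypothetical chain A-B-C of binary variables with the two factors stored as 2x2 tables; it computes the log-likelihood by brute force, and you can see that the partition function sums over every joint assignment, touching both factors at once:

```python
import numpy as np
import itertools

# Minimal sketch, assuming a hypothetical chain A - B - C of binary variables
# with factors phi1(A, B) and phi2(B, C) stored as 2x2 tables of positive entries.

def log_likelihood(phi1, phi2, data):
    """data: iterable of (a, b, c) tuples with values in {0, 1}."""
    # Linear part: each instance contributes log phi1(a,b) + log phi2(b,c).
    ll = sum(np.log(phi1[a, b]) + np.log(phi2[b, c]) for a, b, c in data)
    # Partition function: a sum over ALL joint assignments, so it touches
    # every entry of both factors at once; this is what couples them.
    Z = sum(phi1[a, b] * phi2[b, c]
            for a, b, c in itertools.product([0, 1], repeat=3))
    return ll - len(data) * np.log(Z)

rng = np.random.default_rng(0)
data = [tuple(row) for row in rng.integers(0, 2, size=(100, 3))]
phi1 = np.exp(rng.normal(size=(2, 2)))          # arbitrary positive entries
phi2 = np.exp(rng.normal(size=(2, 2)))
print(log_likelihood(phi1, phi2, data))
```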
And one important observation is that when you stick this summation inside a log, the log doesn't flow through the summation, so you don't get any nice decomposition of the parameters into isolated pieces. And that is the killer, because the partition function is the term that couples the parameters, and so there is no decomposition of the likelihood into separate, independent pieces; there are parts of it that decompose, but the partition function kills that decomposition. As a consequence, it turns out that there is no closed-form solution for this optimization problem, unlike in the case of Bayesian networks, where maximum likelihood estimation had a closed form that you could derive directly from the sufficient statistics, which are the data counts.

To see this in action, here's a picture of this probability distribution. Here we have written the probability distribution as a function of two log-linear features: the indicator function of $a^1, b^1$, which is an entry in the first potential, and the indicator function of $b^0, c^1$, which is an entry in the second. Obviously you could have other features than that, but I can only draw a two-dimensional feature space, so we've focused on just two parameters. What you see here on the two axes are those two parameters, the $\theta$s that are the coefficients of these indicator functions, and the surface is the log-likelihood. There are several things that can be seen from looking at this surface. First, it seems, and we'll show this in a moment, that there is a single global optimum of this function, which occurs right here in the middle of this contour. But we can also see the coupling between the parameters coming up: the surface doesn't decompose nicely as a sum of a function over one of the dimensions plus a function over the second dimension.

So now let's delve into the math a little more and consider what the log-likelihood function looks like in the general case of a log-linear model. Just as a reminder, here is the definition of a log-linear model: it's defined via a sum of a set of coefficients, which are the parameters $\theta_i$, multiplied by a set of features $f_i$, which could be indicator features as we had on the previous slide. We also talked about other forms of features for log-linear models in a previous module, so we're not going to go back there, but any function here would work. We add up all of these log-linear terms, exponentiate, and then divide by the partition function, which makes sure that we have a distribution:
$$P(X; \theta) = \frac{1}{Z(\theta)} \exp\Big\{ \sum_i \theta_i f_i(X) \Big\}.$$
Plugging this into the form of the log-likelihood and applying it to every single data instance, we end up with the following expression, which is what we get after accumulating all the terms that involve a particular parameter $\theta_i$: for each $\theta_i$, we have a summation over data instances $m$ of the value of the feature, $\sum_m f_i(x[m])$, so this is the feature $f_i$ applied to the $m$-th data instance. Now notice that I've changed the arguments on you: previously each feature $f_i$ had a limited scope, a certain subset of the variables, for example the pair $A, B$, and here I've given it the entire assignment as an argument.
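As an illustration of that coupling, here is a hedged sketch, assuming the two indicator features above over hypothetical binary variables A, B, C and some made-up data, that tabulates the log-likelihood on a small grid of $(\theta_1, \theta_2)$ values, i.e., the two-dimensional surface described above:

```python
import numpy as np
import itertools

# Hedged sketch, assuming the two indicator features from the slide over
# hypothetical binary variables A, B, C:
#   f1(x) = 1{A=1, B=1},  f2(x) = 1{B=0, C=1},  P(x) propto exp(t1*f1 + t2*f2).

def features(a, b, c):
    return np.array([float(a == 1 and b == 1), float(b == 0 and c == 1)])

X = list(itertools.product([0, 1], repeat=3))   # all 8 joint assignments
F = np.array([features(*x) for x in X])         # 8 x 2 feature matrix

def log_likelihood(theta, data):
    theta = np.asarray(theta)
    linear = sum(features(*x) @ theta for x in data)
    log_Z = np.log(np.exp(F @ theta).sum())     # this term couples t1 and t2
    return linear - len(data) * log_Z

# Tabulate the surface on a tiny grid; made-up data instances (a, b, c).
data = [(1, 1, 0), (0, 0, 1), (1, 1, 1), (1, 0, 1)]
for t1 in (-1.0, 0.0, 1.0):
    print([round(float(log_likelihood((t1, t2), data)), 3)
           for t2 in (-1.0, 0.0, 1.0)])
```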
This is a shorthand that we use, where we allow the function $f_i$ to ignore arguments that it doesn't care about. So if we give, say, the first indicator feature all three arguments $A$, $B$, and $C$, it's just going to ignore $C$ and look at $A, B$ alone. This is just a convenience that allows us to use a consistent and simpler notation without as many indices. So this is the first term of the log-likelihood, and just as before we have a second term where we accumulate a log partition function for every instance in our data set:
$$\ell(\theta : D) = \sum_i \theta_i \sum_m f_i(x[m]) - M \log Z(\theta).$$
And just as a reminder, the log partition function has the following form: it is a log of, unfortunately, a summation over an exponentially large space of an exponential,
$$\log Z(\theta) = \log \sum_x \exp\Big\{ \sum_i \theta_i f_i(x) \Big\},$$
so it has the log-sum-exp form, a term that you might hear in various contexts.

Now we can do an analysis of what the log partition function looks like, and in order to do that, we're going to derive the first and second derivatives of this log partition function relative to the parameters $\theta_i$. Let's first understand what we're doing here. The partition function, as we said, is a function of the parameters $\theta$. What we have in the first line is the vector of first derivatives, one relative to every parameter $\theta_i$, and in the second line the matrix of second derivatives relative to pairs of coefficients $\theta_i, \theta_j$, known as the Hessian. What the first line of the theorem says is that the first derivative of the log partition function is an expectation, that's what the $E$ stands for, of the feature $f_i$ relative to the distribution $P_\theta$, that is, relative to the distribution that we get by parameterizing the graphical model with the parameters $\theta$:
$$\frac{\partial}{\partial \theta_i} \log Z(\theta) = E_\theta[f_i] = \sum_x P_\theta(x)\, f_i(x).$$
In the same way that we treat $f_i$ as a random variable over the space of assignments $x$, we can consider the covariance of two random variables $f_i(x)$ and $f_j(x)$, and what the second line tells us is that the matrix of second derivatives is exactly the covariance matrix of these random variables defined by the features $f_i$ and $f_j$:
$$\frac{\partial^2}{\partial \theta_i\, \partial \theta_j} \log Z(\theta) = \mathrm{Cov}_\theta[f_i; f_j].$$

Let's go ahead and prove the first of these statements. The second is more complicated, but only in the sense of requiring more lines of derivation; the intuitions aren't considerably different, and there's no point in going through the details. So let's plug through the derivative: we have the partial derivative of $\log Z(\theta)$ relative to $\theta_i$, and the derivative of a log is one over its argument, so we get $\frac{1}{Z(\theta)}$ times the derivative of the expression inside. And we've pushed the derivative into the summation, because the derivative of a sum is the sum of the derivatives, which gives us the next line of the expression. We can now continue and recall that the derivative of an exponential is the exponential itself times the derivative of whatever is in the exponent, and so that gives us the exponential itself times the derivative of the summation $\sum_j \theta_j f_j(x)$ relative to $\theta_i$.
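Here is a small numerical check of the first statement, under the same hypothetical two-feature model as before: it compares a finite-difference estimate of $\partial \log Z / \partial \theta_i$ against $E_\theta[f_i]$ computed by enumeration:

```python
import numpy as np
import itertools

# Numerical check of the first statement for the hypothetical two-feature
# model: d/d(theta_i) log Z(theta) should equal E_theta[f_i].

def features(a, b, c):
    return np.array([float(a == 1 and b == 1), float(b == 0 and c == 1)])

F = np.array([features(*x) for x in itertools.product([0, 1], repeat=3)])

def log_Z(theta):
    return np.log(np.exp(F @ theta).sum())

theta = np.array([0.7, -0.3])
probs = np.exp(F @ theta - log_Z(theta))        # P_theta(x) for all 8 x's
expected = probs @ F                            # E_theta[f_i] by enumeration

eps = 1e-6
for i in range(2):
    e = np.zeros(2)
    e[i] = eps
    fd = (log_Z(theta + e) - log_Z(theta - e)) / (2 * eps)   # finite difference
    print(fd, expected[i])                      # the two columns should match
```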
And we're taking the derivative of $\sum_j \theta_j f_j(x)$ relative to a particular $\theta_i$. The derivative of the term $\theta_j f_j(x)$ is zero if $i \neq j$, and is equal to $f_i(x)$ if $i = j$, because the derivative of $\theta_i$ with respect to itself is one. So that gives me an $f_i(x)$ over here. I can then rearrange the expression just a little bit, moving the $f_i$ into the sum, and what we have here, in the box down at the bottom, is really just the definition of the log-linear model: one over the partition function times the exponential of the weighted sum of features. And so what we end up with is a form that looks like $\sum_x P_\theta(x)\, f_i(x)$, which is exactly what we wrote over there. A similar procedure, just with a lot more algebraic manipulation, gets us the form of the second expression.

So that was the log of the partition function; now let's think about the implications for the log-likelihood. If you remember, the log-likelihood had one term that is linear in the $\theta_i$'s, and an additional term which is $M$ times the log partition function. Now let's think about what some of the implications are. Because the Hessian of the log partition function has the form we just derived, a covariance matrix, which is always positive semidefinite, it turns out that the log partition function is a convex function. What does a convex function mean? It means that it's bowl-shaped, a nice bowl. Or, formally, the definition of convexity says that a function $g$ is convex if
$$g(\alpha x + (1 - \alpha) y) \leq \alpha\, g(x) + (1 - \alpha)\, g(y)$$
for all $\alpha \in [0, 1]$. That means that if you draw the line segment from $(x, g(x))$ to $(y, g(y))$, any point along that line is greater than or equal to the value of $g$ at the corresponding point; that's the accepted formal definition of convexity. And because the log partition function is convex, its negation, which is what we have here, is concave, which means it's an inverted bowl. The other term of this function is linear in $\theta$, and a linear function plus a concave function is concave, so altogether $\ell(\theta)$ is a concave function, which means it has this nice inverted-bowl shape. What are the implications of that? It means that there are no local optima of this function, and it is therefore going to be easy to optimize using techniques such as hill climbing, which climb up this inverted bowl until they hit an optimum, which in that case has to be a global optimum, because the function has no local optima.

So let's think about how to actually perform this maximum likelihood estimation. Let's go back to the log-likelihood function. We're now going to multiply by $1/M$ so that we don't have a scaling issue where the log-likelihood continues to grow as the number of data instances grows, and that gives us the expression
$$\frac{1}{M} \ell(\theta : D) = \sum_i \theta_i \Big( \frac{1}{M} \sum_m f_i(x[m]) \Big) - \log Z(\theta).$$
Note that the $M$ has disappeared as a multiplier of the log partition function, and we now have a $1/M$ term in the linear component as well. And now, if we go back and plug in the derivatives, taking the derivative of this expression relative to a particular parameter $\theta_i$, we have two pieces.
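As a sanity check on the concavity claim, the following sketch, using the same hypothetical two-feature model with made-up data, verifies the defining inequality of concavity at a few random parameter pairs:

```python
import numpy as np
import itertools

# Sanity check on concavity of the scaled log-likelihood L(theta) for the
# hypothetical two-feature model: verify the defining inequality
#   L(a*x + (1-a)*y) >= a*L(x) + (1-a)*L(y)
# at a few random parameter pairs (small tolerance for floating point).

def features(a, b, c):
    return np.array([float(a == 1 and b == 1), float(b == 0 and c == 1)])

F = np.array([features(*x) for x in itertools.product([0, 1], repeat=3)])
data_F = np.array([features(*x) for x in [(1, 1, 0), (0, 0, 1), (1, 0, 1)]])

def L(theta):                                   # (1/M) * log-likelihood
    return data_F.mean(axis=0) @ theta - np.log(np.exp(F @ theta).sum())

rng = np.random.default_rng(1)
for _ in range(5):
    x, y = rng.normal(size=2), rng.normal(size=2)
    for a in (0.25, 0.5, 0.75):
        assert L(a * x + (1 - a) * y) >= a * L(x) + (1 - a) * L(y) - 1e-9
print("concavity check passed")
```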
We have already said that the derivative of the log partition function relative to a parameter $\theta_i$ is the expectation $E_\theta[f_i]$, under $P_\theta$, of the feature that $\theta_i$ multiplies. On the other hand, if we look at the derivative of the linear term relative to $\theta_i$, we can see that what we have there is also an expectation, an empirical expectation: it is the expectation, or average, $E_D[f_i]$ of the feature $f_i$ in the data distribution. And so the derivative of the log-likelihood relative to $\theta_i$ is the difference of two expectations,
$$\frac{\partial}{\partial \theta_i} \frac{1}{M} \ell(\theta : D) = E_D[f_i] - E_\theta[f_i],$$
an empirical expectation minus an expectation in the parametric distribution defined by my current set of parameters $\theta$. One immediate consequence of this is that a parameter setting $\hat{\theta}$ is the maximum likelihood estimate if and only if we have equality of expectations, that is, the expectation relative to the model, $E_{\hat{\theta}}[f_i]$, is equal to the expectation in the data, $E_D[f_i]$. So the expectation in the model has to be equal to the expectation in the data, and that has to hold for every one of the features in my log-linear model.

Now that we've computed the gradient, let's think about how one might use it to actually do the optimization. It turns out that a very commonly used approach to this kind of likelihood optimization, for example in the context of CRFs, is a variant of gradient ascent. So here we have a likelihood surface, which in this case is a fairly nice function, and we're going to use a gradient process where at each point we compute the gradient, take a certain step in that direction, and continue to climb up; the gradient that we computed on the previous slide is what we're using. Now, it turns out that plain vanilla gradient ascent is not actually the most efficient way to go. Typically one uses a variant of gradient ascent called L-BFGS, which is a quasi-Newton method. Very briefly, a Newton method is something that constructs a second-order approximation to the function at each point, as opposed to just a linear approximation, which looks only at the gradient. Since we don't want to pay the computational cost of computing a full second-order, or quadratic, approximation to the function, a quasi-Newton method makes an approximate guess at what the step would be if we were to compute a second-order approximation. We're not going to go into the details of L-BFGS here, but there's plenty of literature on it that you can find.

Now, in either case, for computing the gradient in this context, the critical computational cost is the fact that we need the expected feature counts, and those come up in two places: when we compute the empirical expectation $E_D[f_i]$ of the feature $f_i$ relative to the data, and also in the second term of the derivative, the expectation $E_\theta[f_i]$ relative to the current model with parameter vector $\theta$. This second piece is really the rate-limiting step, because it requires that at each point in the process we conduct inference on the model in order to compute the gradient. And so this is an expensive computation, because it requires that we run inference in a graphical model, which we know to be a not inexpensive step, depending on the complexity of the graphical model.

So, to make this concrete, let's look at an actual example. Let's return to our example of the Ising model, where the energy function over a set of random variables
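Putting the pieces together, here is a minimal sketch of plain gradient ascent on the hypothetical two-feature model; L-BFGS, e.g. via scipy.optimize.minimize on the negated objective, would be the more typical choice in practice. The expensive $E_\theta[f_i]$ term is computed here by brute-force enumeration, standing in for real inference, and at convergence the moment-matching condition holds:

```python
import numpy as np
import itertools

# Minimal sketch: plain gradient ascent for the hypothetical two-feature
# model. In practice one would typically use L-BFGS, e.g. via
# scipy.optimize.minimize(..., method="L-BFGS-B") on the negated objective.

def features(a, b, c):
    return np.array([float(a == 1 and b == 1), float(b == 0 and c == 1)])

F = np.array([features(*x) for x in itertools.product([0, 1], repeat=3)])

data = [(1, 1, 0), (0, 0, 1), (0, 1, 0)]        # made-up instances
empirical = np.array([features(*x) for x in data]).mean(axis=0)   # E_D[f_i]

theta = np.zeros(2)
for step in range(2000):
    # The expensive step: E_theta[f_i] requires inference in the model;
    # here we just enumerate all 8 assignments.
    probs = np.exp(F @ theta)
    probs /= probs.sum()
    model = probs @ F                           # E_theta[f_i]
    theta += 0.5 * (empirical - model)          # gradient ascent step

probs = np.exp(F @ theta); probs /= probs.sum()
print(theta)
print(empirical, probs @ F)   # moment matching: these two should agree
```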
$X_1$ up to $X_n$ is a sum of a bunch of pairwise terms $w_{ij} x_i x_j$ and a bunch of singleton terms $u_i x_i$, and it's an energy function, so it's negated:
$$E(x_1, \ldots, x_n) = -\sum_{i<j} w_{ij}\, x_i x_j - \sum_i u_i x_i.$$
So, plugging into the gradient equation that we had before, relative to a particular parameter, let's now look at what this looks like for the two types of parameters that we have here: the singleton parameters $u_i$ and the pairwise parameters $w_{ij}$.

For the singleton parameters $u_i$, if we just go ahead and plug into this expression, the empirical expectation is simply the average $\frac{1}{M} \sum_m x_i[m]$ over the data instances $m$. So, assuming that each variable $X_i$ has its own parameter $u_i$, we're going to sum the value of the variable $x_i$ over all data instances $m$, and that's going to be our empirical expectation once we average it over all of the data instances. Now what about the model expectation, the expectation relative to $\theta$? Well, here we have two cases: we need to consider the probability that $x_i = 1$ and the probability that $x_i = -1$. The case where $x_i = 1$ makes a contribution of $+1$ to the expectation, and the case where $x_i = -1$ makes a contribution of $-1$. So altogether we have $P_\theta(x_i = 1) - P_\theta(x_i = -1)$, because the two cases have these two different coefficients, $+1$ and $-1$.

A slightly more complicated version of this comes up when we have the pairwise case, the derivative relative to the parameter $w_{ij}$. Here, once again, the empirical expectation is simply the average $\frac{1}{M} \sum_m x_i[m]\, x_j[m]$ over the data instances $m$, and the model term has four different probabilities, corresponding to the four cases: $x_i$ and $x_j$ are both $1$, both $-1$, or one takes the value $1$ and the other $-1$ and vice versa. The two cases $(+1, +1)$ and $(-1, -1)$ have a coefficient of $+1$, because the product $x_i x_j$ is $1$ in those cases, and the other two have coefficients of $-1$, so we end up with the expression
$$P_\theta(x_i = 1, x_j = 1) + P_\theta(x_i = -1, x_j = -1) - P_\theta(x_i = 1, x_j = -1) - P_\theta(x_i = -1, x_j = 1).$$

So, to summarize: we've seen that the partition function in undirected graphical models couples all of the parameters in our likelihood function, and that introduces complexities into this setting that we did not see in the context of Bayesian networks. Specifically, we've seen that there is no closed-form solution for optimizing the likelihood of these undirected graphical models. On the other hand, we have also seen that the problem is a convex optimization problem, which allows it to be solved using gradient ascent, usually via the L-BFGS algorithm, and because of the convexity of the function, we're guaranteed that this is going to give us the global optimum regardless of the initialization. The further piece of bad news, other than the fact that there is no closed-form solution, is that in order to perform this gradient computation, we need to perform probabilistic inference, and we need to do that at each gradient step in order to compute the expected feature counts. We've already seen that inference is a costly thing, and so this really dramatically increases the cost of parameter estimation in Markov networks, say, in comparison to Bayesian networks.
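To ground the Ising case, here is a sketch with hypothetical weights and made-up data for a tiny three-spin model, computing both gradient terms by enumeration; since $x_i \in \{-1, +1\}$, the model expectation $E_\theta[x_i]$ equals $P_\theta(x_i = 1) - P_\theta(x_i = -1)$ automatically:

```python
import numpy as np
import itertools

# Sketch for a tiny Ising model with n = 3 spins taking values in {-1, +1}.
# Energy E(x) = -(sum_{i<j} w_ij x_i x_j + sum_i u_i x_i); P(x) propto exp(-E(x)).
# Weights and data below are made up for illustration.

n = 3
w = {(0, 1): 0.3, (1, 2): -0.2}                 # pairwise parameters w_ij
u = np.array([0.1, -0.4, 0.2])                  # singleton parameters u_i

X = np.array(list(itertools.product([-1, 1], repeat=n)))
score = np.array([sum(wij * x[i] * x[j] for (i, j), wij in w.items()) + u @ x
                  for x in X])                  # -E(x) for every assignment
probs = np.exp(score)
probs /= probs.sum()

data = np.array([[1, 1, -1], [-1, 1, 1], [1, -1, -1]])

# d/du_i of the scaled log-likelihood: (1/M) sum_m x_i[m]  -  E_theta[x_i],
# where E_theta[x_i] = P(x_i = 1) - P(x_i = -1) because x_i is +/-1.
grad_u = data.mean(axis=0) - probs @ X

# d/dw_ij: (1/M) sum_m x_i[m] x_j[m]  -  E_theta[x_i x_j]; the expectation
# weights the (+1,+1) and (-1,-1) cases by +1 and the mixed cases by -1.
grad_w = {(i, j): (data[:, i] * data[:, j]).mean() - probs @ (X[:, i] * X[:, j])
          for (i, j) in w}
print(grad_u)
print(grad_w)
```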
On the other hand, the features whose expectations we are computing are always contained within clusters in a cluster graph, or within cliques in a clique tree, because of the family preservation property. And that implies that a single calibration step suffices for all of the feature expectations at once. So a single calibration, say of the clique tree, is going to be enough for us to compute all of the expected feature counts, and then we can go from there and compute the gradient.
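As a rough illustration of that last point (the clique-tree calibration itself is not shown; a brute-force marginal stands in for the calibrated belief), this sketch shows that once we have the belief over the clique $\{A, B\}$, the expectation of a feature whose scope is $\{A, B\}$ needs nothing else:

```python
import numpy as np
import itertools

# Rough illustration: a feature f1 = 1{A=1, B=1} has scope {A, B}, which by
# family preservation sits inside some clique. Given the calibrated belief
# over that clique, E_theta[f1] needs nothing else. Here a brute-force
# marginal stands in for the belief a single calibration would produce.

theta = np.array([0.5, -0.2])                   # hypothetical parameters

def feats(a, b, c):
    return np.array([float(a == 1 and b == 1), float(b == 0 and c == 1)])

X = list(itertools.product([0, 1], repeat=3))
p = np.array([np.exp(feats(*x) @ theta) for x in X])
p /= p.sum()

belief_AB = np.zeros((2, 2))                    # stand-in for the calibrated belief
for (a, b, c), pi in zip(X, p):
    belief_AB[a, b] += pi

E_f1 = belief_AB[1, 1]                          # expected count from the clique alone
print(E_f1)
```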