Over the past few lectures we've discussed a range of different learning algorithms. We've discussed learning of directed and undirected models, learning of parameters and learning of structure, and learning with complete and with incomplete data. What we're going to do now is take a step back and think about these different methods, how they fit together, and, equally importantly, how one might use one or more of them when encountering a practical problem. That is, what is the loop, or the process, by which we apply a learning algorithm to a learning problem? So first let's reconsider the learning problem from very high up. The learning problem can be viewed in terms of three different components. The first is the hypothesis space, the class of models that we're interested in considering. The second is the objective function that we're trying to optimize when we're applying the learning algorithm. And the third is the optimization algorithm itself, which aims to optimize the objective function to give us some good model. Pretty much all of the learning algorithms that we've discussed can be viewed in terms of these three components, with just different combinations of hypothesis space, objective function, and optimization algorithm. So let's look at each of these components in turn. The hypothesis space is what we're searching over, and there are really two different pieces that we can be searching for: the first is the parameters of the model, and the second is the structure. In some cases we're searching for one, in some for the other, and in many cases we're actually searching for both. Depending on which choice we make, this sets up a different learning problem: searching for parameters is a continuous optimization problem, searching for structure is usually discrete, and when we're searching for both, it's a hybrid search problem.
The second part of the hypothesis space is when we impose a certain set of constraints on what we're searching for. This can be done for multiple reasons. We can constrain the hypothesis space for computational efficiency: a smaller space is often easier to search, and admits different algorithms that might be computationally more efficient. The second reason is that we might restrict the search space to reduce the model capacity, that is, the complexity of the models that we're considering. That can reduce the risk of overfitting, because we're searching over a smaller set of possibilities and are therefore less likely to fit random statistics in the training set. And a third form of constraint is when we have some kind of prior knowledge that we want to encode, which allows us to guide the learning algorithm toward models that are more reasonable than it would find in a totally unconstrained way. The second component is the objective function. Perhaps the most commonly used objective function is a penalized likelihood. Penalized likelihood has a log-likelihood component, which measures the log-likelihood of both our graph structure and the parameters relative to the training set. But that has to be accompanied in most cases by a second term, a regularizer, which serves to guide the learning toward simpler models that are less likely to overfit the data. That regularizer can include one or both of the following. One is a parameter prior, which usually guides the parameters toward smoother models: for example, in the case of MRFs, we used an L2 or L1 regularizer, and in the case of Bayesian networks, we often had some kind of Dirichlet prior. Both of these tend to smooth the parameters and avoid overfitting to the statistics of the training set.
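As a minimal sketch of this kind of objective, here is a penalized log-likelihood for a toy Bernoulli model; the quadratic penalty pulling the parameter toward 0.5 is an illustrative stand-in for the smoothing priors mentioned above, not the lecture's exact regularizer:

```python
import math

def penalized_log_likelihood(theta, data, lam):
    """L2-style penalized log-likelihood for a single Bernoulli parameter.

    theta: probability of observing 1 (0 < theta < 1)
    data:  list of 0/1 observations
    lam:   regularization strength (lam = 0 recovers the plain likelihood)
    """
    ll = sum(math.log(theta) if x == 1 else math.log(1 - theta) for x in data)
    # The regularizer pulls theta toward the "smooth" value 0.5,
    # trading some training-set fit for reduced overfitting.
    penalty = lam * (theta - 0.5) ** 2
    return ll - penalty
```

Maximizing this over `theta` then balances fit to the training statistics against the pull of the regularizer.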
The second form of regularization is a structure complexity penalty, which of course is only relevant if we're also searching over structures; it aims to push us toward models that have fewer parameters or fewer edges, again reducing overfitting. An alternative paradigm, which we also discussed, is what we called the Bayesian score. The Bayesian score is used when we're really searching over graph structures alone and integrating out over the parameters. This is a case where the hypothesis space is really just a space of structures, and the parameters are implicit, or integrated out. The Bayesian score, which is the log of the probability of the graph given the data, is equal to the log of the probability of the data given the graph, which as a reminder is called the marginal likelihood, plus the log of the graph prior. Now, importantly, just as a reminder, the graph prior is a regularizer: log P(G) is a regularizer over graphs. But log P(D | G), that is, the log of the marginal likelihood, incorporates within it a regularization term as well, as we've seen before, and it also serves to reduce the overfitting of the model to the data. The final component is the optimization algorithm, which of course depends both on the space that we're trying to optimize over and on the objective that we're trying to optimize. If we're trying to optimize purely over a continuous space, then in some cases we're very lucky. We showed, for example, that for Bayesian networks with multinomial CPDs, we can actually optimize the likelihood in closed form and find a unique optimum for the parameters. This also happens when we have a penalized likelihood, or even a full Dirichlet distribution over the parameters. In other cases we're not so lucky, and we need to somehow optimize a function for which no closed-form solution is available.
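The closed-form optimum for multinomial CPDs mentioned here is just normalized counts, and a Dirichlet prior amounts to adding pseudo-counts. A small sketch, with illustrative function and argument names:

```python
from collections import Counter

def estimate_cpd(values, num_states, alpha=0.0):
    """Estimate a single multinomial distribution from observed state values.

    alpha = 0 gives the maximum-likelihood estimate (normalized counts);
    alpha > 0 acts like a Dirichlet prior contributing alpha pseudo-counts
    per state, smoothing the parameters toward the uniform distribution.
    """
    counts = Counter(values)
    total = len(values) + alpha * num_states
    return [(counts.get(s, 0) + alpha) / total for s in range(num_states)]
```

For a full Bayesian network one would apply this per parent configuration, but the per-distribution computation is exactly this count-and-normalize step.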
In this case we use gradient ascent, or perhaps some second-order method, that allows us to hill-climb over the likelihood function. We saw that this happens, for example, both for MRF learning and for learning with missing data in Bayesian networks. Of course, gradient ascent in its naive form is a fairly slow method, but there are methods that build on top of it, for example conjugate gradient or L-BFGS, that use similar computations but converge much faster. Finally, there is the expectation maximization algorithm, which is specifically geared toward learning with missing data, that is, optimizing the log-likelihood function in cases where the log-likelihood is multimodal because of the missing data; and we talked about some of the tradeoffs of this method, for example with respect to local optima. If we're trying to do discrete optimization, searching over the structure space, again in some cases we can be lucky. When we're optimizing over tree structures, we saw that a very simple, computationally efficient algorithm, finding a maximum-weight spanning tree, allows us to find the optimal-scoring tree very efficiently, in polynomial time. In other cases the space is more complicated, and we can't do an optimization that's guaranteed to give us the optimal answer. In this case we typically end up doing some kind of local hill climbing, where we do things like add, delete, or reverse edges, and that gradually gets us to a better optimum. There are also algorithms that take larger steps in the space.
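The tree-structure case can be made concrete: score each candidate edge by empirical mutual information and take a maximum-weight spanning tree over those weights. This is a sketch over discrete variables; the Kruskal-style union-find scaffolding is my own, not from the lecture:

```python
import math
from collections import Counter
from itertools import combinations

def mutual_information(samples, i, j):
    """Empirical mutual information between variables i and j of the samples."""
    n = len(samples)
    pij = Counter((s[i], s[j]) for s in samples)
    pi = Counter(s[i] for s in samples)
    pj = Counter(s[j] for s in samples)
    mi = 0.0
    for (a, b), c in pij.items():
        mi += (c / n) * math.log((c / n) / ((pi[a] / n) * (pj[b] / n)))
    return mi

def chow_liu_tree(samples, num_vars):
    """Maximum-weight spanning tree with mutual information as edge weights."""
    edges = sorted(((mutual_information(samples, i, j), i, j)
                    for i, j in combinations(range(num_vars), 2)),
                   reverse=True)
    parent = list(range(num_vars))
    def find(x):  # union-find with path halving, to detect cycles
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    tree = []
    for w, i, j in edges:  # Kruskal: greedily add heaviest non-cycle edges
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.append((i, j))
    return tree
```

Because the likelihood of a tree decomposes over edges, this greedy spanning-tree construction finds the optimal-scoring tree in polynomial time.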
And finally, an interesting combination is when we have to search over both the discrete and the continuous space together, when we're trying to optimize over both parameters and structure. That ends up being quite complicated in many cases, because when you take a step on the discrete side, you also have to optimize the parameters for that new structure before you can score it and figure out whether that was a good move to take. So this tends to be computationally expensive, and there are some tricks one can do to reduce the cost. Now, every learning algorithm that we've discussed has some set of hyperparameters that we need to set before we can run it. For example, if we're using a Dirichlet prior over the parameters, these include the equivalent sample size, and perhaps, more broadly, the prior itself; that's usually a hyperparameter we need to set. If we're doing, say, CRF or MRF learning, we need to set the regularization strength for the L1 or L2 prior. If we're doing expectation maximization, we need to figure out when to stop, and we already talked about the fact that early stopping is a useful strategy because it reduces overfitting. If we're doing structure learning, we need to figure out the strength of the structure penalty: how much do we want to penalize models that are more complex? If we're doing, say, MRF or CRF learning, we have to figure out how many features we want to add, and which of those we should add initially. And when we're doing learning with latent variables, there's the decision of how many values a latent variable ought to take; for example, when we're doing clustering, how many clusters should we consider? Each of these is an important decision that we need to think about, and the question is, where does it come from? Do we just invent these values out of thin air, or where do we get the answers?
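The early-stopping decision for EM can be driven by held-out likelihood: keep iterating only while the validation score meaningfully improves. A generic sketch, where `em_step` and `val_log_likelihood` are placeholders for your model-specific routines:

```python
def run_em_with_early_stopping(params, em_step, val_log_likelihood,
                               max_iters=100, tol=1e-4):
    """Run EM updates, stopping when held-out log-likelihood stops improving.

    em_step(params) -> new params (one E-step + M-step on the training data)
    val_log_likelihood(params) -> scalar score on a held-out validation set
    """
    best = val_log_likelihood(params)
    for _ in range(max_iters):
        candidate = em_step(params)
        score = val_log_likelihood(candidate)
        if score - best < tol:  # no meaningful validation gain: stop early
            break
        params, best = candidate, score
    return params
```

Stopping on the validation score rather than the training likelihood is what gives early stopping its regularizing effect.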
One bad answer, something we shouldn't do, is to estimate these on the training set, that is, to find the hyperparameters that maximize the objective function on the training set. If we do that, we are effectively making a choice of hyperparameters that is going to overfit to the statistics of the training set. So a common strategy, perhaps the one that's most commonly used, is to have a separate validation set on which these hyperparameters are set. That means we train on the training set, then evaluate on the validation set, and we pick the hyperparameters that give us the best evaluation on the validation set, not on the training set. If computational cost allows, it's good to do this with more than a single training set and a single validation set; at that point we usually end up doing something like cross-validation, where we split the training set into a training set and a validation set in several different ways, pick the hyperparameters that give us the best cross-validation scores, and then use those hyperparameters to train on the entire training set. This of course gives rise to the question of what it is that we're actually evaluating. What are some of the evaluation criteria that we might use to figure out whether a model is good or not? One obvious one is log-likelihood on a test set: if we're trying to find a model that gives us good predictive performance on unseen data, then the log-likelihood of the model on the test set is a good way of measuring that. But in many cases that's not the objective we actually care about, and so we might often evaluate the learned model using a task-specific objective, for example segmentation accuracy if we're trying to do image segmentation, or speech recognition accuracy, measured as word error rate, called WER.
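The cross-validation procedure just described might be sketched as follows; `train` and `score` are placeholders for the actual learning and evaluation routines, and the round-robin fold split is one simple choice among many:

```python
def pick_hyperparameter(data, candidates, train, score, num_folds=5):
    """Select the hyperparameter with the best average validation score.

    train(train_data, hp) -> model
    score(model, val_data) -> scalar, higher is better

    Each fold is held out exactly once; the hyperparameter is never chosen
    by its fit to the folds it was trained on.
    """
    folds = [data[i::num_folds] for i in range(num_folds)]
    def cv_score(hp):
        total = 0.0
        for k in range(num_folds):
            val = folds[k]
            train_data = [x for i, f in enumerate(folds) if i != k for x in f]
            total += score(train(train_data, hp), val)
        return total / num_folds
    best = max(candidates, key=cv_score)
    return best, train(data, best)  # retrain on the entire training set
```

Note the final step: once the hyperparameter is chosen, we retrain on all of the training data, exactly as described above.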
A task-specific objective like word error rate is often what we should evaluate on, and if you really want to get a high-performing model, it's useful to think about optimizing the objective that you actually care about, in the context of the task. Finally, when we're trying to do knowledge discovery, it's often useful to think about the extent to which the model that you learned matches prior knowledge. If you're doing clustering, for example, and you have some notion of clusters that you think exist in your data, trying to see whether the clustering algorithm is more or less compatible with those is a good sanity check on the model. Likewise, if we have some notion of edges that ought to exist in the network, trying to see whether the learned structure matches those is also useful. So now let's think about some possible error modes that might occur, how we might diagnose them, and how we might fix them. One error mode that occurs often is underfitting. Underfitting means that the model is not sufficiently expressive to capture the patterns in the data, and we can recognize it when the training and the test performance are both low. In that case, it suggests that we just haven't put enough expressive power into the model; it can't even get good performance on the training set. At that point, we have several possible solutions. We can decrease the amount of regularization, because maybe the regularization is driving us toward a regime where we can't fit the patterns in the training set. We can reduce structure penalties, to allow more edges and more interactions to be learned. And then, a very important one is to add features into the model, often guided by a very targeted error analysis, in which we look at the errors the model is making and say, wait a minute, in this kind of instance I would have expected to see this answer, and I didn't.
So how do we improve the model by adding features, so as to give us better results on this set of instances? The complementary error mode is overfitting. Overfitting can be diagnosed in cases where the training performance is high, often ridiculously high, whereas the test performance is low, which indicates that the model has fit patterns in the training data that are really just statistical noise and don't generalize to other, unseen instances. In this case the solutions are complementary: you increase the regularization, you impose capacity constraints by forcing the model to fall within a restricted subclass, and you reduce the set of features, so as to eliminate or reduce the capacity of the model to overfit. A different error mode happens when our optimization algorithm is not doing a good job of optimizing the objective that we've laid out for it. That happens when the optimization is not converging to a good, or the global, optimum. People tend to associate this error mode with cases where the problem is multimodal, so that we're converging to a poor local optimum, but in fact, depending on how carefully we design our optimization algorithm, this can happen even if the problem is convex; it can happen, for example, when we don't set our learning rates appropriately and we never actually converge to an optimum. A way to diagnose this is to compare models learned with different learning rates and different random initializations, and see whether these models give rise to very different values of the objective function. If they do, it suggests that we should be very careful about how we optimize, and maybe try multiple random starting points or set the learning rate more carefully. Now, this previous analysis called for comparison of the objective function, that is, the objective that we're trying to optimize.
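The underfitting/overfitting diagnosis above can be summarized as a crude rule of thumb; the thresholds here are purely illustrative and should be tuned per task:

```python
def diagnose(train_score, test_score, good=0.9, gap=0.1):
    """Crude fitting diagnosis from train/test performance (higher is better).

    good: rough threshold for "acceptable" performance on this task
    gap:  how much train may exceed test before we suspect overfitting
    """
    if train_score < good and test_score < good:
        return "underfitting"   # can't even fit the training set well
    if train_score - test_score > gap:
        return "overfitting"    # fits training noise, fails to generalize
    return "ok"
```

"Underfitting" points toward less regularization, weaker structure penalties, or more features; "overfitting" points toward the complementary fixes.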
A very different error mode happens when our objective function is not actually a good match for the performance metric that we're trying to optimize. For example, one important thing to check for is the case where we have two models, say learned in different ways as discussed on the previous slide, where the objective function for model one is significantly better than the objective function for model two, yet the performance metric that we care about, say segmentation accuracy, is considerably lower for model one than for model two. This suggests that the objective we're optimizing is not a good surrogate for the performance that we care about, and it happens quite often when we're using a regularized likelihood as an objective, because regularized likelihood is not necessarily a good match for the performance objective that we actually care about. An important mistake that people often make at this point is to say, well, let's try and change the optimization algorithm to give us better results. That is a bad idea: you can't fix a bug in the objective function by breaking the optimization algorithm. The right thing to do is to redesign the objective function to better match the desired performance criterion, so as to get a better alignment between the two. So to summarize, let's think about the typical learning loop that we use when trying to apply machine learning in practice. First, we design the model template, which tells us, for example, the set of random variables that are involved, whether the model is directed or undirected, and perhaps some notion of the kinds of features that we might include. We then select the hyperparameters via cross-validation on the training set, train on the training set with the chosen hyperparameters, and evaluate performance on a held-out set.
We then check which of the error modes that we've discussed on the previous slides, if any, is actually occurring. This gives rise to a process of error analysis: what kinds of errors are we seeing, and how do we redesign the model, the objective function, or the optimization algorithm in order to address those problems? That gives rise to a new model template, and perhaps some other changes, and the cycle then repeats. When we're done, and only when we're done, we can report results on a separate test set, which is not the same as the held-out set. We've effectively used the held-out set to make a lot of modeling choices, so reporting results on it would be misleadingly optimistic in regard to the performance of the learned model on real, unseen data, because the held-out set is not actually unseen data at this point.
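The whole loop described in this summary might be sketched as follows, purely as illustrative scaffolding; all of the function arguments are placeholders for the corresponding stages of your own pipeline:

```python
def learning_loop(design_model, select_hyperparams, train, evaluate,
                  train_set, heldout_set, test_set, max_rounds=10):
    """Sketch of the iterative development loop: design, tune, train, evaluate.

    design_model(previous_model_or_None) -> model template
    select_hyperparams(template, data)   -> hyperparameters (e.g. via CV)
    train(template, data, hp)            -> fitted model
    evaluate(model, data)                -> scalar score, higher is better

    The held-out set is reused across rounds for error analysis, so only the
    final model may be honestly scored on the untouched test set.
    """
    template = design_model(None)
    model, best = None, float("-inf")
    for _ in range(max_rounds):
        hp = select_hyperparams(template, train_set)
        candidate = train(template, train_set, hp)
        score = evaluate(candidate, heldout_set)
        if score <= best:                   # no held-out improvement: stop
            break
        model, best = candidate, score
        template = design_model(candidate)  # error analysis -> new template
    return model, evaluate(model, test_set)
```

The key structural point is in the last line: the test set is touched exactly once, after the loop has finished, which is what keeps the reported number an honest estimate of performance on unseen data.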