Hi, and welcome to the lecture on generalized linear models. In this lecture we're going to change gears quite a bit and switch from linear additive response models to the much broader class of generalized linear models. These models handle some situations that linear models struggle with, though perhaps at the expense of some of the mathematical neatness that exists for linear models. Before I begin talking about generalized linear models, I'd like to review linear models a little bit, their benefits and their tough spots. To me, the linear model is the single most useful applied statistical technique, but it's not without its limitations. The assumption of additivity is hard to justify when you have, for example, binary data; it's strange to think of binary data as arising from an additive model, since the error structure would have to be quite odd in that case. In addition, if your data are strictly positive, then additive Gaussian errors present a problem, because they place positive probability on negative values. Now, that may not be a problem, and I often say it isn't, if you have strictly positive data where the errors look roughly Gaussian and there's very little probability down near the negative values. That's often okay. On the other hand, if the normal distribution is putting a lot of probability on negative values even though you know your response has to be positive, then that's problematic for your model. You might instead try a transformation. A common transformation when your outcome has to be strictly positive is to take the natural log of it, and the natural log is, in my mind, perhaps the most interpretable transformation possible; let's put that one to the side for a moment. There are lots of other transformations available to try to make our data more normal. For binomial data, for instance, people will often take the so-called arcsine square root transformation. That often destroys a great deal of the interpretability of our model coefficients, which is a real problem. The log transformation, I would say, is in a class of its own and in many ways can make the coefficients more interpretable, so we'll discuss it separately. There's another reason to reach for generalized linear models rather than transforming the outcome or doing something approximate with a linear model: it's simply nice and pleasant to have a model on the scale on which the data were recorded, without any transformation, one that really honors the known structure of the data. If we have binary data, a model that really honors the fact that the data are binary and doesn't require us to transform them has a lot of appeal and makes a lot of internal sense. And then I mention here on this last point that the natural log transformation, probably the most common transformation, isn't applicable for negative or zero values. There are some fixes for that, but they harm some of the really nice, interpretable properties of that transformation. Generalized linear models come from a 1972 paper by Nelder and Wedderburn, a famous paper that, if you're a PhD statistician, you've almost certainly read. A generalized linear model has three components. First, the distribution that describes the randomness has to come from a particular family of distributions called an exponential family.
This is a large family of distributions that includes the normal, the binomial, and the Poisson. So that's the random component: the exponential family is the random component. Then the systematic component is the so-called linear predictor. That's the part we're modeling, and we've done this very much already in linear models, where the random component was the errors and the systematic component was the linear combination of covariates and coefficients. Finally, we need some way to connect the two, and the link function connects an important quantity from the exponential family distribution, namely its mean, to the linear predictor. So there are three things we need: the distribution, which for a generalized linear model is an exponential family; the systematic component, which you can think of as the linear predictor, the set of regression variables and their coefficients; and a link function that ties the two together. Okay, so let's try an example, and we'll start with our familiar one, the linear model, the subject we've been covering the entire class up to this point. In this case we assume that Y_i is normal with mean mu_i, and the normal happens to be an exponential family distribution. Then we define the linear predictor, eta_i, to be the collection of covariates x_i times their coefficients beta. And in this case our link function is just the identity link: it says that the mu from the normal distribution is exactly this sum of covariates times coefficients. That leads to the same model we've been talking about all along; we could write it again as Y_i equal to the mean component, the summation of x_i times beta, plus the error component. So we've simply written out the linear model in a different way. Instead of saying the errors are normally distributed, we say the Y is normally distributed, which is a consequence of the other specification; we specify the linear predictor separately; and then we connect the mean from the normal distribution to the linear predictor. You might think this seems like a crazy thing to do, when just writing it out as an additive sum with additive errors seems like such an easy thing to do, but as we move over to different settings, like the Poisson and the binomial settings, it will be quite a bit more apparent why we're doing this. So let's take what is, in my mind, perhaps the most useful variation of generalized linear models: logistic regression. In this setting we have data that are zero/one, so binary, and it doesn't make a lot of sense to assume they come from a normal distribution. The natural, really the only, distribution available to us for coin flips, for zero/one outcomes, is the so-called Bernoulli distribution. So we're going to assume that our outcomes Y_i follow a Bernoulli distribution, where the probability of a head, the so-called expected value of Y_i, is mu_i. We're modeling our data as if it's a bunch of coin flips, only the probability of a head may change from flip to flip, and it's not necessarily 0.5. That probability is given by the parameter mu_i. The linear predictor is still the same: it's just the sum of the covariates times the coefficients. Now, the link function in this case, the most famous and most common one, is the so-called logit (logistic) link function, the log odds.
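Before we get into the logit, here is a minimal sketch in R of the point we just made about the identity link. The data here are simulated, so the variable names and numbers are purely illustrative; the point is only that a Gaussian GLM with the identity link reproduces the ordinary linear model fit from lm.

    # Simulate a simple linear model with additive Gaussian errors
    set.seed(1)
    x <- rnorm(100)
    y <- 1 + 2 * x + rnorm(100)

    # Fit it two ways: ordinary least squares, and as a GLM with identity link
    fit_lm  <- lm(y ~ x)
    fit_glm <- glm(y ~ x, family = gaussian(link = "identity"))

    # The two sets of estimated coefficients agree
    cbind(coef(fit_lm), coef(fit_glm))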
So in the logistic case, the way we get from the mean, the probability of a head, to the linear predictor is to take the log of the odds. The odds are the probability over one minus the probability, so here that's mu_i over one minus mu_i, and we take the natural logarithm of that. What that basically means is that we're transforming our mean, our probability of getting a head, and it's on that transformed scale that the covariates and their coefficients, the standard part of our linear model, appear. So notice: we're transforming the mean of the distribution, not the Y's themselves. That's a big distinction, and that's the neat part of generalized linear models. We're assuming that, if we model our data as coin flips, each coin has a probability of a head, and that probability, transformed in this specific way, relates to our covariates and coefficients. We can also go backwards from the log odds to the mean itself. The inverse logit, which I often call the expit, though I don't know how standard that is, is e to the eta_i over one plus e to the eta_i, and that gets you back to mu_i. So going forward, we take the log of mu over one minus mu, and that gives us eta; going backward, we take e to the eta over one plus e to the eta, and that gets us back to mu. And, by the way, one minus mu, the probability of a tail, is one over one plus e to the eta. With that, we can write out our likelihood as the Bernoulli likelihood, right there like this. And I think you can see that it's through this likelihood, like we talked about in the statistical inference class, that we're going to optimize: we maximize that likelihood to obtain our parameter estimates. Okay, let's go through another example: Poisson regression. Assume that Y_i is Poisson with mean mu_i, where mu_i is the expected value of each of the Poisson random variables; in this case mu_i has to be larger than zero. The Poisson is extremely useful for modeling count data; that's really what it's for. If you have a bunch of non-negative counts that are unbounded, not like binomial counts, which are bounded by the number of coin flips you take, then the Poisson is a very useful model. Suppose you want to count the number of people that show up at a bus stop. You don't have an upper limit on that, or sure, there is some upper limit, but you don't really know what it is, so you might want to model that count as Poisson. Our linear predictor is again the same as it was in every case: it's just the sum of the covariates times their coefficients. The link function in this case is the log link, the most common link function for the Poisson. And remember, we go from the mean, mu, to the linear predictor, eta, by taking the log of the mean. So again, we're not logging the data; we're logging the mean of the distribution the data are assumed to come from. And remember, the inverse of the natural logarithm is exponentiation, so we can go backwards from eta to mu by taking e to the eta.
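Just to make these two link functions concrete, here is a small R sketch with made-up numbers; nothing here comes from a real data set, it only traces the forward and inverse transformations described above.

    # Logit link: mean (a probability) -> linear predictor, and back
    mu  <- 0.7                        # a made-up probability of a head
    eta <- log(mu / (1 - mu))         # logit of mu
    exp(eta) / (1 + exp(eta))         # expit of eta, returns 0.7
    # R's built-ins do the same thing: qlogis(mu) is the logit, plogis(eta) the expit

    # Log link for the Poisson case: mean (a positive rate) -> linear predictor, and back
    lambda <- 4.2                     # a made-up Poisson mean, necessarily positive
    eta2   <- log(lambda)             # log link
    exp(eta2)                         # inverse log link, back to 4.2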
Once we've set up the link this way, we can simply write out what the likelihood is, and again, the way GLMs work is that they obtain the parameter estimates by maximizing that likelihood. I give some technical facts here, which basically say that the likelihood simplifies quite a bit in all of these cases because of the particular link functions we've chosen. But I want to point out this final point, which says that the maximum likelihood estimate looks like the solution to an equation, not unlike least squares. Think way back to our initial lectures on least squares. We found our estimates by minimizing the sum of the squared vertical distances between the fitted line and the outcomes. If you wanted to minimize that function in an automated way, you might take a derivative, so the function would no longer be squared, the two would come down, and finding the root of that derivative just means solving a linear equation. In generalized linear models, maximizing the likelihood gives you a very similar equation that you set equal to zero and solve, and that gives you your estimates; I give it right here. It's really not that different from the linear model case, only there's a set of weights, with a variance in the denominator, that doesn't go away the way it does in the least squares case. Again, this is not for this class; it's just there if you're interested in some of the details of the fitting. The point of this slide is that what's going on is very similar to least squares; it's just that how we get to that point is a little more circuitous. For most people, in most settings, this will all be transparent: you'll mostly concern yourself with the interpretation of your generalized linear model, not with the specifics of how it was fit. There is an important point I'd like to note about generalized linear models, especially for cases like the Poisson and the binomial. For the linear model, remember, we have the assumption of a constant variance. For the Bernoulli case, however, the variance of a coin flip is p times (1 - p), which in the notation we're using here is mu_i times (1 - mu_i). But our mu depends on i, so what we're saying is that the variance depends on which observation you're looking at, unlike the linear model case where the variance is constant across i. The same thing happens in the Poisson case: the variance of a Poisson is its mean, so the variance differs by i. This is a modeling assumption that you can check. If you have Poisson data, say several Poisson observations at the same level of the covariates, so the means should be the same, then the variance of those observations should be roughly equal to their mean. If your data don't exhibit that, that's a problem. So this is an important, practical consideration in generalized linear models: the modeling assumptions often impose a relationship between the mean and the variance, and that relationship may not hold in your specific data set. So what can you do? Well, there is a way to address this, by using a more flexible variance model, even though you give up some of the structure of the fully specified generalized linear model.
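Here is a minimal sketch in R of the mean-variance check just described, again on simulated data with made-up group sizes and means: for Poisson data collected at the same covariate level, the group variances should track the group means. If the variances came out much larger than the means, that would suggest exactly the kind of situation the more flexible quasi variance models discussed next are meant to absorb.

    # Four groups, each with its own (made-up) Poisson mean
    set.seed(2)
    group <- rep(1:4, each = 50)
    mu    <- c(2, 5, 10, 20)[group]
    y     <- rpois(length(mu), mu)

    # For genuinely Poisson data, these two columns should be roughly equal
    cbind(mean = tapply(y, group, mean),
          var  = tapply(y, group, var))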
The standard way to get that flexibility is with the so-called quasi- versions of these families, specified in the family, that is, the distribution, part of the model. We're going to go through lots of examples, but the point is this: in R you'll see that you can fit a Poisson model with the glm function, but there's another option called quasi-Poisson. The same thing with the binomial: you can fit a binomial model, but there's another option called quasi-binomial. What that's referring to is a slightly more flexible variance model, for the case where your data don't adhere to the GLM variance structure. Okay? These are standard options: every statistical language that fits GLMs will also automatically fit these quasi versions with a more flexible variance structure, and here I actually give the equation that gets solved to obtain that fancier, more flexible variance model. Now, some odds and ends about the fitting before we go through the specific cases. We're going to treat the Poisson case and the binomial case separately; we're not going to go through a full treatment of GLMs, just Poisson and binomial. These equations have to be solved iteratively, so unlike the linear model, where you can do straight linear algebra to find the solutions, GLMs actually have to be optimized numerically, which means sometimes the program fails; for example, if you have a lot of zeros and very few ones in a binary regression, these things can happen. But other than that, most of the analysis should feel pretty familiar. If we want to get our predicted response, we just take our estimated coefficients, beta hat, multiply them by our regressors, and that gives us our predicted response. Now notice this will be on the logit scale if you're doing, for example, logistic regression, or on the log scale if you're doing Poisson regression, so you'll have to convert it back to the natural scale if you want it on the same scale as the original data. So if you're modeling coin flips and you take your estimated regression coefficients and come up with predicted responses, those will be on the logit scale, and if you want them back on the scale of the coin flip, between zero and one, you'll need to take the inverse logit. And again, we're going to go through a lot of examples; I just want to outline these facts before we do them. The coefficients are interpreted very similarly to the way our coefficients were interpreted in linear regression: they are the change in the expected response per unit change in the regressor, holding the other regressors constant. The only difference is that now this interpretation is done on the scale of the linear predictor. So in the binomial case it's on the scale of the logit, in the Poisson case it's on the scale of the log mean, and so on. It's a slightly more complicated interpretation, but again, we gain the benefit of modeling our data naturally on its own scale, and we haven't had to transform the outcome at all. For the inference, we also lose the nice collection of closed-form normal inferences; we don't get t-distributions anymore. But largely this is transparent: there's a body of mathematics where statisticians and mathematicians have figured out the right reference distributions for the coefficients.
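To tie the last few points together, here is a minimal sketch in R, again on simulated coin-flip data with made-up names and numbers rather than anything from the lecture's own slides. It shows the glm call, the quasi-binomial alternative, predictions on the logit scale versus the probability scale, and the coefficient table whose tests we turn to next.

    # Simulate binary outcomes whose log odds are linear in x
    set.seed(3)
    x <- rnorm(200)
    p <- exp(1 + 2 * x) / (1 + exp(1 + 2 * x))   # expit of the linear predictor
    y <- rbinom(200, size = 1, prob = p)

    fit  <- glm(y ~ x, family = binomial)        # logistic regression
    fitq <- glm(y ~ x, family = quasibinomial)   # same fit, more flexible variance

    head(predict(fit))                           # on the logit (linear predictor) scale
    head(predict(fit, type = "response"))        # converted back to the probability scale
    summary(fit)$coefficients                    # estimates, standard errors, and tests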
And from the output of your GLM you get things like p-values, and the coefficients are hypothesis tested and interpreted in the same way as in our linear regression; it's just that the background machinery is a little bit harder. One thing I would say, though, is that all of these results are based on asymptotics, which means they require larger sample sizes. So if you have a GLM setting with a very small sample size, it's possible that things like your p-values aren't really applicable. Many of the ideas from linear models carry over to GLMs, and this was just a whirlwind overview of GLMs. For the next two lectures, let's dig into the two most important cases: binomial and Bernoulli regression via logistic regression, and Poisson regression. We're going to spend a lot of time with those, and if you want further material on GLMs, there are some more advanced classes that you can take. Okay, well, thank you for attending this lecture, and I look forward to seeing you in the next one, where we're going to cover logistic regression for binary outcomes.