Hello everyone. Welcome to this week's Foundations of Sports Analytics MOOC. In the previous weeks, we learned to conduct descriptive and summary analysis. We also introduced data visualization in Jupyter notebooks, and we learned to use correlation analysis to detect relationships between variables. Correlation analysis, in particular the correlation coefficient, measures the extent to which the variables cluster around a line. With the correlation coefficient, we can only determine whether two variables are moving in the same or opposite direction. Regression analysis, on the other hand, enables us to identify relationships among variables beyond pure linear relationships. Regressions are the workhorse of data analysis, not only in sports. More importantly, if specified correctly, regression analysis can allow us to identify causal relationships.

In regression analysis, we have a dependent variable or outcome variable, y, and a set of independent variables, x. We are trying to find how the independent variables x affect the dependent variable y. The most fundamental regression analysis is linear regression. In any regression, we try to determine the best-fitting line for our data and use it to understand the relationship among the variables under consideration. If two variables, x and y, exhibit a linear relationship, we can express them in the following equation: y = a + bx. In this equation, y is the dependent variable or outcome variable, x is the independent variable, and both a and b are constants that we want to estimate. Notice that x and y will only fit exactly on a line if they are perfectly linearly related; in other words, the correlation coefficient between these two variables would have to equal either 1 or -1. Since most of the time we do not observe such a perfect linear relationship, we introduce an error term, epsilon, in our specified linear function between x and y. In this case, y = a + bx + epsilon.
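To make the error term concrete, here is a minimal sketch with synthetic data; the intercept, slope, and noise level below are made up for illustration and are not taken from the lecture:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical "true" parameters, chosen only for this illustration
a, b = 0.5, 0.003

x = rng.uniform(100, 300, size=200)        # independent variable
epsilon = rng.normal(0, 0.05, size=200)    # random error term
y = a + b * x + epsilon                    # y = a + bx + epsilon

# The error term should be random, averaging out to roughly zero
print(round(float(epsilon.mean()), 3))
```

The printed average is close to zero, which is exactly the criterion required of the error term for the best-fitting line.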
This error term epsilon does not necessarily mean that there is a mistake in our regression. Instead, it refers to the difference between the actual data point and our estimated regression line. There is an important criterion about this error term epsilon that needs to be satisfied in order to obtain the best-fitting line: we require that the average value of y equals a + bx. In other words, we require the average of the error term to be 0. This means that the error term should be random, or that the data points should be randomly distributed around the linear regression line.

The most common linear regression is the ordinary least squares regression. The goal of all regressions is to find the best-fitting line for our data. In ordinary least squares regression, we try to find this line by minimizing the sum of squares of the error term. To do so, we first take the difference between the actual data point and the value of y based on the regression line. We then take the square of this error term and find the sum of the squared error terms. Finally, we find the parameters a and b that minimize this sum of squared errors.

Now let's turn to our Jupyter notebook, and I will demonstrate how to run regressions in Python. Please open the notebook "Introduction to Regression Analysis." As a first step, we import some useful libraries, including pandas as pd, numpy as np, and matplotlib.pyplot as plt. We will use the NHL teams data set that we compiled in the assignment for Week 2 to demonstrate regression analysis in this lecture. Let's import the two datasets that we cleaned in that assignment: the NHL_Team_Stats dataset and the NHL_Team_R_Stats dataset, the latter of which only contains information for the regular season. At the end of the Week 2 assignment, we observed that there is a linear relationship between the total number of goals for and winning percentage for NHL teams.
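As an aside before the notebook demo, the least-squares recipe just described can be written out directly: for a simple regression, the a and b that minimize the sum of squared errors have a well-known closed form. A minimal sketch with made-up numbers:

```python
import numpy as np

# Made-up sample data for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1])

# Closed-form OLS estimates that minimize the sum of squared errors
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

errors = y - (a + b * x)          # residuals around the fitted line
sse = np.sum(errors ** 2)         # the quantity OLS minimizes

print(round(float(a), 3), round(float(b), 3))  # 0.04 1.0
```

Any other choice of a and b would produce a larger sum of squared errors on this data, which is what "best-fitting line" means in the ordinary least squares sense.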
So we want to use regression analysis to examine this relationship further. To run regressions in Python, we need to introduce the statsmodels library. This is a library that provides functions for estimating many different statistical models, as well as conducting statistical tests; it can also do some statistical data exploration. We will import statsmodels.formula.api as sm, which will allow us to shorten our code later.

Let's run a regression of winning percentage as a function of the total number of goals for during the regular season. That is, in this regression, winning percentage is the dependent variable and the total number of goals for is the explanatory variable. We can use the function "ols" in the statsmodels library to indicate that we want to run an ordinary least squares regression. Inside the "ols" function, we specify a formula inside quotation marks. We write our dependent variable first, in this case 'win_pct', and then our independent variable, 'goals_for'. We use a tilde (~) to separate the dependent variable from the independent variable. We also need to specify the dataset, NHL_Team_R_Stats, because we are only looking at the games during the regular season. We would also like to add the function fit at the end, which will allow us to obtain the estimated coefficients of our regression model. Let's call this regression reg1.

After we run a regression, we can use the summary function to obtain a number of statistics of our regression. We can further use the print function so that the statistics are presented in a nicely structured table. From this results table, we can see that the dependent variable is the winning percentage, and 181 observations were used in this regression. The independent variable presented here is goals_for, and an intercept is also included in the regression. So now let's talk about how to interpret our linear regression results.
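Since the NHL data files are not reproduced here, the same call can be sketched with fabricated stand-in data; the column names win_pct and goals_for mirror the lecture, but the numbers themselves are invented for illustration:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as sm

rng = np.random.default_rng(0)

# Fabricated stand-in for the NHL_Team_R_Stats dataset
goals_for = rng.integers(150, 320, size=181)
win_pct = -0.18 + 0.003 * goals_for + rng.normal(0, 0.03, size=181)
df = pd.DataFrame({'win_pct': win_pct, 'goals_for': goals_for})

# Dependent variable to the left of ~, independent variable to the right;
# .fit() estimates the coefficients by ordinary least squares
reg1 = sm.ols(formula='win_pct ~ goals_for', data=df).fit()
print(reg1.summary())
```

With the real data, the only changes would be the DataFrame passed to the data argument; the formula and the fit/summary calls are the same.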
In a simple linear regression where Y = a + bX, the constant a is the vertical intercept, which estimates the value of the dependent variable Y when the independent variable X is zero. The constant b estimates the slope of the line. We usually call b the estimated coefficient of the independent variable X. This means that b measures the impact of X on Y: when X increases by one, Y is estimated to change by an amount b.

In our first regression, the estimated coefficient on goals_for is 0.003. This means that an additional goal scored by the team is estimated to increase the team's winning percentage by 0.003, or 0.3 percent. The estimate on the intercept is negative 0.1781. This means that without scoring any goals, the winning percentage for a team would be negative 0.1781, or negative 17.81 percent. As we know, the winning percentage cannot be negative. Part of the reason we obtain a negative estimate on the intercept is that in our sample there is not a single observation where a team scored zero goals.

After we run a regression, we would like to learn how well our specified regression function fits our data. There are two measures of particular importance that we will introduce.
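A quick back-of-the-envelope check of this interpretation, plugging the coefficients quoted above (intercept -0.1781, slope 0.003) into the fitted line:

```python
# Coefficient estimates reported in the lecture
intercept = -0.1781
slope = 0.003

def predicted_win_pct(goals_for):
    """Predicted winning percentage from the fitted regression line."""
    return intercept + slope * goals_for

# One additional goal raises the predicted winning percentage by the slope
print(round(predicted_win_pct(201) - predicted_win_pct(200), 3))  # 0.003

# Extrapolating to zero goals gives the (unrealistic) negative intercept
print(predicted_win_pct(0))  # -0.1781
```

This illustrates why the intercept should not be read literally here: it is an extrapolation to a value of goals_for that never occurs in the sample.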