[MUSIC] In the last lecture, I introduced the simple linear regression model. This model captures the linear relationship between an independent variable x and a dependent variable y. In addition, I explained how this model is fitted to data. Now we expand the simple linear model into the multivariable linear regression model. For illustrative purposes, I only show a model with two independent, or explanatory, variables, but there is in principle no limit to how many independent variables one can include in the model.

The figure shows the hypothetical relationship between labor market earnings as the dependent variable, and grade point average (GPA) from high school and years of education as the two independent variables. The idea behind the figure is that there is a positive linear relationship between a person's GPA and earnings: the higher the GPA, the higher the earnings. There is also supposed to be a positive relationship between years of education and earnings: the more years of education, the higher the earnings. A natural consequence is seen when comparing four different persons, two with the same high GPA but varying years of education, and two with the same years of education but varying GPA. The person with both a high GPA and many years of education earns more than those with only a high GPA or only many years of education. Finally, the person who has both the lowest GPA and the fewest years of education also earns the least.

Obviously, when we take the model to data to estimate its parameters, the regression coefficients, only a few observations, if any, lie exactly on the plane that the model defines in the space spanned by the three axes. Therefore, the regression model also has an error term that captures the effect of other variables over and above the effect of GPA and years of education on earnings. Now suppose you have estimated the model with real data and obtained positive values of b1 and b2.
The model then implies that, on average, people with higher GPA and more years of education also have higher earnings compared to people with lower GPA and/or fewer years of education. When there are two independent variables in the model, interpretation becomes more intricate. As we saw before, high values of both z and x imply higher values of y than high values of just x or just z. However, what if x and z are interrelated? We can easily imagine that people with high GPA also get more years of education. So what is the total effect of GPA on earnings, that is, both the effect that goes through getting more years of education with a high GPA, and the effect over and above years of education?

First, we have the simple model of GPA on earnings. The effect here is b. Then we have the multivariable model, which includes the effects of both GPA and years of education on earnings. Here the effect of GPA, beta, is over and above the effect of years of education. Further, GPA and years of education are related through the regression coefficient theta, implying that the covariance between GPA and years of education is not zero. Using the way x and z are related, inserting this for z in the multivariable regression equation, and rearranging, we again obtain a simple linear relationship between x and y. But now we see that the simple effect of x on y is composed of the direct effect, beta, and the indirect effect via z through the parameters gamma and theta.

In order to further understand the mechanism behind the multivariable regression model, consider the diagram. The arrow from x to y indicates a causal mechanism from x to y. The effect is conveyed through the regression coefficient b. The error term denotes other variables influencing y. The arrow indicates causality and implies that when you manipulate x, y changes accordingly. Conversely, when y changes, x does not change. In this sense, we say x causes y.
This does not mean that x is the sole cause of y, but that the direction of change goes from x to y. When you get a high GPA, you also end up with higher earnings. However, higher earnings do not lead to a higher GPA. Now, even though x causes y, the mechanism through which this happens may run through one or several other variables. The figure indicates this. In this figure we have introduced a third variable, z. The variable x affects y both directly and indirectly through z. The direct effect is beta, and the indirect effect is the product of theta and gamma. In order for the indirect effect to reach y, it has to travel through z: first x affects z (theta), and then z affects y (gamma).

Therefore, when you get a higher GPA, two things happen. You get more years of education, z, and you get higher earnings. In addition to the direct effect of a higher GPA on your earnings, beta, you get even higher earnings because your years of education increase (theta), which in turn increases your earnings (gamma). So when we manipulate x, this causes a direct change in z and y, and an indirect change in y through z. The duality between the figure and the mathematical notation of the regression model is shown in the figure by the two equations.

Variables that mediate causal effects are called mediators, so z is a mediator in the figure. The total causal effect of x on y is b. The direct effect, controlling for z or holding z constant, is beta. The indirect effect is theta multiplied by gamma. The total effect is the direct effect plus the indirect effect. The direct effect of GPA on earnings is the effect of having different GPAs for two individuals with the same years of education. So if there is a positive direct effect of GPA on earnings, this implies that of two university graduates, the one with the higher GPA also has the higher earnings. Therefore, if we're interested in the total effect of x on y, we do not want to control for z.
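The decomposition of the total effect into a direct and an indirect part can be checked on simulated data. The sketch below is only an illustration: the coefficient values, the sample size, and the helper names `cov`, `simple_slope`, and `two_var_slopes` are assumptions made for this example, not part of the lecture. It fits the simple model, the multivariable model, and the regression of z on x, and then compares b with beta plus theta times gamma.

```python
import random

def cov(u, v):
    """Sample covariance of two equal-length lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (n - 1)

def simple_slope(x, y):
    """Slope of the simple regression of y on x: cov(x, y) / var(x)."""
    return cov(x, y) / cov(x, x)

def two_var_slopes(x, z, y):
    """Slopes on x and z in the regression of y on both (2x2 normal equations)."""
    sxx, szz, sxz = cov(x, x), cov(z, z), cov(x, z)
    sxy, szy = cov(x, y), cov(z, y)
    det = sxx * szz - sxz ** 2
    return (szz * sxy - sxz * szy) / det, (sxx * szy - sxz * sxy) / det

# Hypothetical "true" coefficients, chosen only for illustration.
beta, gamma, theta = 2.0, 1.5, 0.8
rng = random.Random(0)
n = 20000

x = [rng.gauss(0, 1) for _ in range(n)]                  # GPA (standardized)
z = [theta * xi + rng.gauss(0, 1) for xi in x]           # years of education (mediator)
y = [beta * xi + gamma * zi + rng.gauss(0, 1) for xi, zi in zip(x, z)]  # earnings

b_total = simple_slope(x, y)                  # total effect: z not controlled for
beta_hat, gamma_hat = two_var_slopes(x, z, y) # direct effect of x, and effect of z
theta_hat = simple_slope(x, z)                # how x moves z

print("total effect b:          ", round(b_total, 3))
print("direct + indirect effect:", round(beta_hat + theta_hat * gamma_hat, 3))
```

The two printed numbers agree up to floating-point rounding, because the identity between the simple slope and the sum of direct and indirect parts holds exactly for least-squares estimates, not only on average.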
If we're interested in how GPA from high school, in total, affects earnings later in life, i.e. how different people's lives are affected by different GPAs, we do not want to control for intermediate variables. Therefore, we do not want to control for any intervening z. If we do, we only get to know the direct effect, not the total effect. However, we might want to control for other variables, because the observed, or estimated, total effect might be completely or partly spurious.

We all know famous examples of spurious relations. Both geographically and temporally, there is a very strong correlation between sales of ice cream and the incidence of snake bites. But ice cream does not cause snake bites, nor do snake bites cause consumption of ice cream. In our example with GPA, education, and earnings, we can ask whether GPA really causes persons to get higher earnings, or whether the relationship is really due to some other variable, like innate IQ. People with a high innate IQ presumably have higher GPAs, and probably also higher earnings.

To study this, we'll look at a different version of the previous diagram. Now the arrow between x and z has changed direction. The causation now runs from z to x: z, IQ, causes GPA (theta), GPA causes earnings (beta), and IQ also causes earnings (gamma). The important thing to note here is that if IQ is unobserved, and hence not included in the model, we may erroneously believe that the total effect of GPA is the causal effect of GPA. This is not true: only the direct effect of GPA is a causal effect of GPA. It may become clearer if you look at the equations. The system of equations is the same as before, and we still write z as a function of x, even though the causal relation runs the other way around, from z to x. Substituting for z in terms of x in the top equation and collecting terms shows us how spurious relations may cause wrong causal inference.
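Written out, the substitution looks roughly as follows. The intercepts $\alpha$ and $\delta$ and the error terms $\varepsilon$ and $u$ are notation assumed for this sketch; the lecture itself names only b, beta, gamma, and theta.

```latex
\begin{align}
y &= \alpha + \beta x + \gamma z + \varepsilon, \\
z &= \delta + \theta x + u, \\
y &= (\alpha + \gamma\delta) + (\beta + \gamma\theta)\,x + (\gamma u + \varepsilon).
\end{align}
```

Compared with the simple model, the coefficient on x is $b = \beta + \gamma\theta$, so a regression of y on x alone picks up $\gamma\theta$ in addition to $\beta$.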
Even if beta is zero, there is a relationship between x, GPA, and y, earnings, when we're not controlling for IQ, z. This is because IQ has both an effect on GPA (theta) and an effect on earnings (gamma). Therefore, if we do not include IQ in the model, we may erroneously conclude that a better GPA generates higher earnings because theta and gamma are non-zero, when in reality it does not, because beta is zero. When the causal effect runs from z toward x, and z also affects y, we call z a confounder.

In sum, when z is a confounder, we would like to control for the effect of z to infer the causal effect of x. When z is not available in our data, we cannot do this, but at least we now know what happens if we nevertheless estimate the relationship between y and x, ignoring the effect of z. We may find a significant effect of x on y, namely beta plus theta multiplied by gamma, even though there is no causal effect because beta is zero.

A different way to conceive of this is to look at our equations again. If z is unobserved, the effect of z becomes part of the error term in the relationship between x and y. In addition, from the second equation, we can see that z and x are correlated. So when x and z are correlated, the estimated effect of x on y also involves the effect of the unobserved confounder. In obtaining the causal effect of x on y, it is therefore essential that there are no unobserved confounders or, put differently, that x and the error term are uncorrelated.

When we estimated the slope of the regression model, we saw in the previous lecture that we did so by dividing the covariance between x and y by the variance of x. This slide proves that if the true model is indeed linear and the error term is uncorrelated with x, that is, there are no confounders, then the slope is equal to the causal effect of x on y. This is a really, really neat result. In the absence of confounders, we know how to estimate the causal effect of x on y.
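Both halves of this result can be illustrated with a small simulation. The sketch below is an assumption-laden example rather than anything shown in the lecture: the function name `naive_slope`, the parameter `z_to_x` (how strongly IQ drives GPA), and all numerical values are made up. The causal effect of x on y is set to zero, and the slope cov(x, y) / var(x) is computed once without confounding and once with it.

```python
import random

def cov(u, v):
    """Sample covariance of two equal-length lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (n - 1)

def naive_slope(z_to_x, beta=0.0, gamma=2.0, n=20000, seed=1):
    """Simulate data and return cov(x, y) / var(x), ignoring z entirely.

    z_to_x is the (hypothetical) strength of the effect of z on x;
    beta is the true causal effect of x on y, gamma the effect of z on y.
    """
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]              # unobserved IQ
    x = [z_to_x * zi + rng.gauss(0, 1) for zi in z]      # GPA
    y = [beta * xi + gamma * zi + rng.gauss(0, 1) for xi, zi in zip(x, z)]  # earnings
    return cov(x, y) / cov(x, x)

# No confounding: z does not affect x, so x is uncorrelated with the error term.
slope_no_confounding = naive_slope(z_to_x=0.0)

# Confounding: z affects both x and y, so x is correlated with the error term.
slope_confounded = naive_slope(z_to_x=1.0)

print("slope without confounding:", round(slope_no_confounding, 3))
print("slope with confounding:   ", round(slope_confounded, 3))
```

With these made-up values, cov(x, z) / var(x) is 1/2 in the confounded case, so theta times gamma is 1. The first slope should therefore land near the true causal effect of zero, while the second should land near 1 even though x has no causal effect on y at all.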
How to deal with the very realistic case of confounders is the theme of the following lectures. For now, we just celebrate the new understanding of the relationship between linear regression and causal effects. We also know what happens when there are confounders and the error term and x are correlated: the linear regression coefficient then measures both the causal effect of x, which may be zero, and the effect of the confounders. This is shown in the last equation on this slide. The slope of the regression model is equal to the causal effect of x, b, plus the covariance between x and the error term, which includes the confounder z, divided by the variance of x. We know the expression for this last term: it is the product of theta and gamma from the previous slides. In the next lecture, we'll learn about research designs that allow causal inference in the case of unobserved confounders. Thank you for your attention. I'm looking forward to seeing you next time. [MUSIC]