[MUSIC] Hi, my name is Anders Holm. I'm a professor at the Faculty of Social Sciences, at the Department of Sociology. In this lecture, you'll get an insight into what we more precisely mean by causal effects and how causal effects are measured. Often data just give you correlations: there's a positive correlation between substance abuse and criminal behavior, there's a negative correlation between smoking and longevity, and so on. But does that mean that it's the use of substances that leads to criminal behavior, or is there a third factor, for instance the ability to postpone desirable outcomes, that determines both substance abuse and criminal behavior? In this course we'll learn how to distinguish between statistical and causal relations using quantitative data, and also when that's not possible. Causal effects may warrant decisions on what to do; statistical relationships alone cannot guide decisions. Much social science research uses very complicated theory to explain very complicated mechanisms. That is not the aim when we empirically try to establish causal relations between different variables. We ask very simple questions, and the answer is often very simple. Does class size influence math achievement? Does gender affect promotions? Does travel time affect journey frequency? Does teacher education influence students' performance? Do classroom management skills affect student learning? Does corporal punishment cause depression? We will not ask what causes y (unemployment, learning, promotion, wages, et cetera), but rather: does a particular x (a training program for the unemployed, class size, gender) have an effect on y? So we do not explain all the variation in y, only the bit that might be attributable to x. Furthermore, we are not satisfied with multiple interpretations of research findings. We want to test whether one hypothesis is supported by evidence.
This hypothesis may warrant further investigation, deeper analysis, et cetera, but we would like to apply a research design that is in principle able to either accept or reject a given research hypothesis. More complicated questions concerning policy, social class, substance abuse, and learning are more fluffy in the sense that they're more difficult to measure. Research questions on these topics could be: does poverty affect educational achievement? Does social origin cause social class destination? Does substance abuse cause criminal behavior? Does innate intelligence influence learning? Why more complicated? Well, class size and math achievement are relatively straightforward to measure: you count the number of students in the class, and you give the students a standardized math test. Things like social class and criminal behavior do not have a unique and simple measure. Researchers even disagree as to whether there exists such a thing as innate ability. So while some questions seem very natural to ask because they are part of everyday discussion, no straightforward and generally agreed measurements are available, and the choice of measurement is a discussion in itself. So quite often, precise research questions require the operationalization of more abstract terms into something that's actually measurable, something that materializes in the real world. For example, in the case of poverty, we can use low annual income as a measurement; for social class, an agreed class scheme; for substance abuse, alcohol consumption, soft drug consumption, or hard drug consumption; for learning, grades and test scores as outcomes. As an example, consider the relationship between social class origin, the parents' class, and social class destination, the children's class. We use data from a UK cohort of children born in the mid-1950s. The table shows the relative frequency of ending up in one of three social classes.
The service class, the middle class, and the working class. Each row of the table reflects parents with a different social class, the child's class origin. For instance, if the parents belong to the service class, almost 60% of the children end up in that same class, whereas less than 9% end up in the working class. Conversely, if the parents belonged to the working class, only a third end up in the service class, whereas 20% end up in the working class. So while the relationship between class origin and destination is far from deterministic, the risk of ending up in a particular class seems to depend crucially on class origin. The question, however, is whether this is a causal relationship. To answer the question of whether the relationship between class origin and class destination is causal, we have to think more carefully about what we mean by causal. There are many definitions, but researchers seem to have narrowed down on the concept of the counterfactual, or "what if". The idea here is that if someone manipulates x, then y changes. If I turn the switch and the light goes on, then I tend to believe that turning the switch causes the light to go on. If I replicate this many times and the light goes on every time, I will tend to believe that the relationship is actually deterministic. If the light tends to go on more often when I turn the switch on than when I turn it off, I would tend to think that the relationship still might be causal, but that other factors that I do not control matter too. Now, what if the light sometimes goes on when I turn the switch off? So, in the example, we're interested in whether manipulating class origin changes class destination, or whether the relationship is just spurious, being caused by other factors. The arrows in the figure indicate causal relationships. So, from the figure, we see that there's a causal relationship between class origin and class destination.
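To make the idea of such a mobility table concrete, here is a minimal sketch of how row percentages are computed. The counts below are invented to roughly reproduce the percentages quoted in the lecture; they are not the actual UK cohort data.

```python
# Illustrative origin-by-destination counts (invented numbers, NOT the
# actual UK cohort data): each row is a class origin, each column a
# class destination.
counts = {
    "service": {"service": 589, "middle": 331, "working": 88},
    "middle":  {"service": 417, "middle": 385, "working": 198},
    "working": {"service": 330, "middle": 470, "working": 200},
}

# Convert each row to percentages: the conditional distribution of
# class destination given class origin.
row_pcts = {}
for origin, row in counts.items():
    total = sum(row.values())
    row_pcts[origin] = {dest: 100 * n / total for dest, n in row.items()}
    print(origin, {d: round(p, 1) for d, p in row_pcts[origin].items()})
```

With these made-up counts, almost 60% of service-class children stay in the service class while only a third of working-class children reach it, mirroring the pattern described above.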
But there is also a causal relationship between class origin and parental mental ability, and the arrow indicates that parental mental ability causes parental class, the class origin, and not the other way around. Parental mental ability also causes class destination: parents with better mental ability have children with better mental ability, and children with better mental ability reach higher class destinations. Now imagine that there is no direct causal relationship between class origin and class destination, but that parental mental ability causes both parental class, the class origin, and the class destination. In this case, switching class origin on and off does not change class destination, but switching parental mental ability on and off causes changes in both class origin and class destination. The crucial message is that both causal mechanisms can generate the table of the relationship between class origin and class destination, even though they have completely different interpretations. The previous exposition has somewhat mixed causal mechanisms and our ability to detect causal mechanisms. Causal mechanisms are there whether we can infer them or not, but we use data and statistical methods to detect them. We therefore review one of the workhorses of the analysis of relationships between variables: the linear regression model. We assume some basic familiarity with statistics and regression models, but we will review the linear regression model with one or several independent variables and emphasize the concept of statistical control in terms of regression analysis. In regression models, the variable that we think may cause something is usually called the independent, the explanatory, or the exogenous variable and is denoted x, while the variable being caused by another variable is usually called the dependent or the endogenous variable and is denoted y.
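The confounding story just described can be simulated: let an unobserved common cause (parental mental ability) drive both class origin and class destination, with no direct origin-to-destination effect at all. The two still turn out to be correlated. All quantities below are made-up continuous stand-ins for illustration only.

```python
import random

# Parental mental ability is the common cause; there is NO direct
# origin -> destination effect in this data-generating process.
random.seed(1)

n = 10_000
ability = [random.gauss(0, 1) for _ in range(n)]
origin = [a + random.gauss(0, 1) for a in ability]       # caused by ability only
destination = [a + random.gauss(0, 1) for a in ability]  # caused by ability only

def corr(x, y):
    """Pearson correlation, computed from scratch."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / len(x)
    vx = sum((xi - mx) ** 2 for xi in x) / len(x)
    vy = sum((yi - my) ** 2 for yi in y) / len(y)
    return cov / (vx * vy) ** 0.5

# Origin and destination correlate (around 0.5 here) even though neither
# causes the other: the association comes entirely from the common cause.
print(round(corr(origin, destination), 2))
```

This is exactly why the table alone cannot distinguish the two mechanisms: both produce the same statistical association.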
In the figure, we see two types of relationships between the dependent and the independent variable: a linear relationship and a non-linear relationship. While both are very relevant, we'll confine ourselves to studying linear relationships between x and y. This is only for exposition; non-linear regression models tend to be much more complicated. The linear relationship is captured by a constant a that measures the level of y when the value of x is zero, and another constant, usually referred to as the slope b, that measures the change in y for a one-unit change in x. If, for instance, x is class size and y is test results, then b measures how much test scores are affected by one extra student in the class. Here we might expect b to be negative: more students yield lower test scores. Let's now bring some data to the regression model. The data are shown as dots in the figure. Each dot is a pair of numbers that designates the corresponding x and y values. So if x is a parent's income measured in US dollars, and y is the child's income as an adult, also measured in US dollars, then each dot in the figure is a corresponding pair of parent and child incomes. As can be seen from the figure, there's not a deterministic relationship between the income of parent and child. At best, there's a trend: the larger the value of the parent's income, the larger the value of the child's income. So the slope b is positive. One reason for the non-deterministic relationship between parent and child income could be that factors other than parent income are related to child income: for instance, parental education, parental wealth, parental mental ability, and so forth. In any case, we define the residual term as the difference between the observed child income and what would be implied if child income followed the linear relationship with parent income. The smaller the error, the better the observed data point fits the simple linear model. In fact, a and b are chosen so as to make the sum of squared errors as small as possible.
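The least-squares fit just described can be sketched in a few lines. The income pairs below are invented for illustration; they are not the data shown in the figure.

```python
# Invented parent/child income pairs (USD), standing in for the dots
# in the figure.
xs = [30_000, 45_000, 50_000, 60_000, 80_000, 100_000]  # parent income
ys = [35_000, 40_000, 55_000, 50_000, 75_000, 90_000]   # child income

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n

# Least-squares estimates: b minimizes the sum of squared residuals,
# and the fitted line passes through the point of means.
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
a = my - b * mx

# Residuals: observed child income minus the income implied by the line.
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

print(round(a), round(b, 2))  # b comes out positive: the trend in the figure
```

A by-product of including the intercept a is that the residuals sum to exactly zero, which is one way to check the fit.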
Given that we want to minimize the errors, how would we estimate b, the slope? The slope is important because it indicates how y varies with x: if the slope is zero, changes in x are not related to changes in y. It turns out there's a formula for the slope: the covariance between x and y divided by the variance of x. The slope is a measure of the linear association between x and y, but the measure is sensitive to the scales of both y and x. The slope is related to, but not identical to, the correlation between x and y. If you can calculate the slope from a sample of data, you might also be interested in the precision of this estimate, which is given by the next equation. It says that the precision, or standard error, of the slope is proportional to the residual sum of squares, RSS, and inversely proportional to the number of observations n. This makes sense: the worse the model fits the data, the larger the residual sum of squares, the RSS, and the larger the standard error of the slope; also, the more observations, the smaller the standard error of the slope. Perhaps a little less intuitively, the larger the variance of x, the smaller the standard error of the slope. But this does, in fact, make sense: the more x varies, the better we're able to judge the slope. We can further assess the fit of the model by comparing the residual sum of squares, RSS, to the total sum of squares, TSS. The difference is the variation in y explained by x. Hence, the model fit measure R squared is the percentage of the variation in y that is explained by x. In the next lecture we'll expand the regression model to allow for several independent variables, and I'll show the logic of statistical control, i.e., what happens to the effect of one independent variable when we control for other variables. Thank you for your attention. I'm looking forward to seeing you next time. [MUSIC]
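As a closing illustration, the quantities from this lecture, the slope as covariance over variance, its standard error, and R squared, can be checked numerically on made-up data. The standard error is computed in the usual textbook form with an n - 2 degrees-of-freedom correction, which is an assumption here since the lecture only describes its proportionalities.

```python
# Numerical check of the lecture's formulas on invented data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
sxx = sum((x - mx) ** 2 for x in xs)                         # n * var(x)
cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n   # cov(x, y)

b = cov / (sxx / n)   # slope = covariance(x, y) / variance(x)
a = my - b * mx

rss = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))    # residual sum of squares
tss = sum((y - my) ** 2 for y in ys)                         # total sum of squares

se_b = (rss / (n - 2) / sxx) ** 0.5   # larger RSS -> larger se; larger var(x) -> smaller se
r2 = 1 - rss / tss                    # share of variation in y explained by x

print(round(b, 2), round(se_b, 3), round(r2, 3))
```

On this near-linear toy data the slope is close to one, the standard error is small, and R squared is close to one, matching the intuitions given in the lecture.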