Some years ago, a study found that popes lived longer than artists. The results were taken as plausible and indicative of lifestyle differences: popes had a peaceful life with access to good food and care, while artists were characterized by social instability and high-risk behaviors. Although some artists try really hard to fit this stereotype, this is still quite prejudiced, right? Probably you also have your own thoughts on this. But before speculating more, let me tell you something: you can only become a pope if you are old enough, while an artist can be of any age. Design and data analysis were not well aligned in this study, so the reported result was simply wrong.

In this lecture, we will first show that data by itself does not tell enough, and that you need background information about the study design when performing and interpreting any kind of data analysis. Second, we will discuss two causes of misleading data analysis: confounding and selection bias. We will see that statistics can help you with these problems, but not always.

Suppose we have a dataset where we want to know whether treatment A is better than treatment B with respect to mortality. Here you could simply perform logistic regression with mortality as the outcome and treatment as an explanatory variable. The result: an odds ratio of 0.9 with a 95 percent confidence interval not containing the value one, indicating that the risk of dying under treatment A is lower. This means that treatment A seems better than treatment B. Now here is another dataset with exactly the same variables. However, when you analyze this data, you find an effect in the opposite direction: treatment A now seems harmful. How is this possible? You wouldn't know by just looking at the data. But if I tell you that the left dataset is from an observational study and the right one from a randomized controlled trial, you will likely understand what is happening.
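To make the treatment comparison concrete, here is a minimal sketch of how an odds ratio and its 95 percent Wald confidence interval could be computed from a two-by-two table. The patient counts are invented for illustration and do not come from any study mentioned in the lecture.

```python
import math

def odds_ratio_ci(died_a, survived_a, died_b, survived_b):
    """Odds ratio of dying under treatment A vs B, with a 95% Wald CI."""
    or_val = (died_a * survived_b) / (survived_a * died_b)
    # Standard error of the log odds ratio.
    se = math.sqrt(1/died_a + 1/survived_a + 1/died_b + 1/survived_b)
    lo = math.exp(math.log(or_val) - 1.96 * se)
    hi = math.exp(math.log(or_val) + 1.96 * se)
    return or_val, lo, hi

# Hypothetical counts: treatment A (900 died, 9100 survived),
# treatment B (1000 died, 8000 survived).
or_val, lo, hi = odds_ratio_ci(900, 9100, 1000, 8000)
print(f"OR = {or_val:.2f}, 95% CI ({lo:.2f}, {hi:.2f})")
```

With these made-up counts the whole interval lies below one, so the risk of dying under treatment A appears lower, mirroring the reasoning in the lecture.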
In normal clinical practice, treatment B is given to patients with more severe conditions. This is confounding. Only in a randomized controlled trial can we estimate the true effect of treatment A with a simple univariate data analysis.

There are multiple ways in which a true effect can be masked. We then say that the observed effect, or estimate, is biased. In practice, bias can mean that regression coefficients are too low or too high, or even indicate associations in the wrong direction. You can only identify bias when you know the background of the collected data.

So how can we prevent these biases? Fortunately, statistics can sometimes help. Recall our previous example comparing two treatments using data from an observational study. As we discussed, looking at the crude association would lead to a biased estimate of the treatment effect. This is because severity of the disease is a confounder, and we need to take this into account. You can solve this bias, at least to some extent, using regression: if you use not only treatment as an explanatory variable but also add severity to your model, you will see that the estimated effect of the treatment changes and that the bias has likely been reduced. But keep in mind that you should always ask yourself: is this correction enough? For example, we can use disease staging to measure severity, but this may not completely capture the confounding mechanisms, and other unmeasured factors such as comorbidities can influence both the treatment and the mortality. That is why causality is very difficult, or actually impossible, to claim in observational studies.

Now, have a look at the following example. Obstetricians in a hospital in Alaska found that there were many more complications in deliveries during the winter. It was speculated that lack of sunlight could influence fetal conditions, changing something in the biology, and hence making deliveries more complicated. Wait, can you point out a problem in this study? The answer is simple.
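The effect of taking severity into account can be illustrated with a small numerical sketch. The counts below are invented: mild patients mostly receive treatment A and severe patients mostly receive treatment B, so the crude comparison makes A look protective even though, within each severity stratum, the two treatments perform identically.

```python
def odds_ratio(died_a, survived_a, died_b, survived_b):
    """Odds ratio of dying under treatment A versus treatment B."""
    return (died_a * survived_b) / (survived_a * died_b)

# Invented (died, survived) counts per treatment, split by severity.
mild   = {"A": (40, 760), "B": (10, 190)}    # A given to most mild patients
severe = {"A": (60, 140), "B": (240, 560)}   # B given to most severe patients

# Crude analysis: pool the strata, ignoring severity.
crude = odds_ratio(40 + 60, 760 + 140, 10 + 240, 190 + 560)

# Stratified (adjusted) analysis: compare within each severity level.
or_mild = odds_ratio(*mild["A"], *mild["B"])
or_severe = odds_ratio(*severe["A"], *severe["B"])

print(f"crude OR = {crude:.2f}")   # treatment A looks strongly protective
print(f"mild OR = {or_mild:.1f}, severe OR = {or_severe:.1f}")  # no effect in either stratum
```

The crude odds ratio is about 0.33 while both stratum-specific odds ratios equal 1.0: once severity is held fixed, the apparent benefit of treatment A disappears. This is the same logic as adding severity to a regression model, just done by hand.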
Not all babies are born in the hospital. Winters in Alaska are harsh, with snow everywhere, so many deliveries happen at home, especially in winter. People only come to the hospital if it is really necessary. In winter, only the most difficult deliveries took place at the hospital. So again, the results were biased. Why? Because the sample was not representative.

In the previous lectures of this course, we discussed that statistics are useful, but only when you have a representative sample. Representative samples are random subsets of the population of interest, which is well defined by your research question. With them, if we increase the sample size, we can be confident that our sample summary will come closer and closer to our target parameter. However, if the sample is not a random subset of the population of interest, it does not matter how big it is: you will get biased estimates. The winter affecting delivery complications and the long-lived popes are examples of results based on non-representative samples. We call this selection bias. Sophisticated statistical methods can still help you in some cases of selection bias, but the number of assumptions increases, so a careful and thoughtful discussion is then needed, and sometimes it is just not possible to correct the bias of your study.

In this lecture, we have discussed why data alone does not tell the whole story. We would like to stress that knowing the background of a study is of crucial importance to make any sense of the data and to avoid systematic errors, also called bias. Some forms of bias can be corrected using statistics, but not all. Remember that the presence of unmeasured confounding can never be dismissed in observational studies, and that if selection bias is present, there is often not much you can do. Don't worry if all this was overwhelming. In the next reading activity, you will learn a structured way of thinking about bias, and you will practice identifying it in the practice quiz.
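The point that a non-representative sample stays biased no matter how large it gets can be shown with a short simulation. The numbers are made up: suppose 10 percent of all deliveries are complicated, every complicated delivery reaches the hospital, but only 20 percent of uncomplicated ones do. The hospital-based estimate of the complication rate then settles far above 10 percent, however many deliveries we observe.

```python
import random

random.seed(42)

TRUE_RATE = 0.10         # assumed population complication rate
P_HOSPITAL_IF_OK = 0.20  # assumed share of uncomplicated deliveries seen at hospital

n = 200_000
population = []       # complication indicator for every delivery
hospital_sample = []  # only the deliveries that reach the hospital

for _ in range(n):
    complicated = random.random() < TRUE_RATE
    population.append(complicated)
    # Complicated deliveries always reach the hospital; uncomplicated ones rarely do.
    if complicated or random.random() < P_HOSPITAL_IF_OK:
        hospital_sample.append(complicated)

random_estimate = sum(population) / len(population)
biased_estimate = sum(hospital_sample) / len(hospital_sample)

print(f"whole population: {random_estimate:.3f}")  # close to 0.10
print(f"hospital sample:  {biased_estimate:.3f}")  # far above 0.10, despite large n
```

Increasing `n` only makes the hospital estimate converge more tightly to the wrong value, around 0.36 under these assumptions, which is exactly what selection bias does.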