In the next set of videos, I'm going to talk a little bit about ANOVA, or analysis of variance. So let's dive in. So what is ANOVA, or analysis of variance? It's a suite of statistical methods to determine whether there are differences in means among two or more categorical groups or treatments. In the previous set of videos on A/B testing, we only looked at one group, or maybe two groups, and that was it. Here, we can look at multiple groups. A typical application I have listed is: does an ice-cream product have different sales in the four seasons? So the groups are the seasons: I have ice-cream sales in the spring, ice-cream sales in the summer, in the winter, and in the fall. Are you selling the same amount in each of those four seasons, or do the amounts differ? The method is based on a comparison of variances, not a direct comparison of the means, and that's how you get the name analysis of variance. So first, let's look at the structure of the data, which helps us understand what's going on. Here's a completely randomized design: each y is a data point, so it could be sales. You have different treatments; in the ice-cream example, those are the four seasons. Then, if you look across, what are the average sales for treatment one, or season one, and for seasons 2, 3, and 4? Here, the null hypothesis is that the means for all the groups are the same. In the ice-cream example, the way to think about it is: did we sell the same amount of ice-cream in each of the four seasons? The alternative hypothesis is that at least two of the groups are different. It doesn't mean that all the groups are different; it just means at least two of the groups are different. So you can have three groups where A and B are the same and B and C are different, and that's enough. The other thing to consider is the notation, so let's zero in on that for a moment.
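To make the ice-cream example concrete, here is a small sketch in Python. The sales figures are made up purely for illustration; it just shows the data layout (observations grouped by season) and the two quantities we keep coming back to, the group means and the grand mean:

```python
# Hypothetical ice-cream sales (units) recorded in each of the four seasons.
sales = {
    "spring": [112, 120, 108, 115],
    "summer": [150, 162, 158, 171],
    "fall":   [118, 109, 121, 113],
    "winter": [ 95, 102,  88,  99],
}

# Group (treatment) means: average sales within each season.
group_means = {season: sum(y) / len(y) for season, y in sales.items()}

# Grand mean: the average over every observation, pooling all seasons.
all_obs = [y for ys in sales.values() for y in ys]
grand_mean = sum(all_obs) / len(all_obs)
```

If summer sales really do sit far above the grand mean while the other seasons cluster near it, that is exactly the kind of pattern the ANOVA test is designed to detect.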
Here is mu_i, where i is the group indicator, so group 1, 2, 3, and so on, and t is the total number of groups, so i ranges from 1 to t; n_i is the number of subjects within each group. So: you have groups, how many groups do you have, and how many elements are in each group? Those are the key elements you will want to keep in mind. So what are the assumptions of the ANOVA model? First is normality: the samples for each of the categorical groups are taken from a normally distributed population, what we call the bell-shaped curve. Second is independence: every sample is taken independently of the other samples, so the existence of one element does not affect the existence of another; they're just independent of each other. We also assume that the variances are the same across the groups, which is a key component, and that the dependent variable is continuous. The target variable in the ice-cream example I talked about is the sales of the ice-cream, and that is divided by seasons. So here is the mathematical representation of the ANOVA model. Y_ij is each individual subject: the subject resides in some group i, and this is the jth observation in group i. So person one in group one, person two in group one, person three in group one, etc., and then you go to group two and you have person one in group two, person two in group two, etc. Mu is what's known as the grand mean: if you take all your data points and just calculate the average, that's your grand mean, the average of all the data points. Alpha_i indicates the effect of group i: each of these groups gets a different treatment, and alpha_i is the effect of that treatment. Then e_ij is the random error that occurs for the jth observation in group i. We also assume that e_ij has a normal distribution with a mean of zero and some common, constant variance (or standard deviation), and that the error terms are independent of each other.
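The model y_ij = mu + alpha_i + e_ij can be simulated directly, which is a good way to internalize the assumptions. This sketch uses hypothetical values for the grand mean mu, the group effects alpha_i, and the common error standard deviation:

```python
import random

random.seed(0)

mu = 100.0                                 # grand mean
alpha = {"A": -5.0, "B": 0.0, "C": 5.0}    # hypothetical group effects
sigma = 2.0                                # common error standard deviation
n_per_group = 50

# Draw y_ij = mu + alpha_i + e_ij, with e_ij ~ Normal(0, sigma^2),
# independently for every observation.
data = {
    g: [mu + a + random.gauss(0.0, sigma) for _ in range(n_per_group)]
    for g, a in alpha.items()
}
```

Note how the simulation encodes every assumption from above: normal errors, a shared sigma across groups, independent draws, and a continuous response.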
N_i indicates the total number of observations in group i. So this is the general form of the model. The ANOVA model implies that the jth subject, or the jth response, in group i is normally distributed with a mean equal to the grand mean mu plus alpha_i, the impact of the treatment in that group, and some constant variance for the error terms. The way to think about it is that you have the grand mean, which is some number, and then you move away from the grand mean by some amount, the group effect alpha_i, and that's how you get to the group average. So to get to the group average, first you find the overall average, and then you add or subtract something to get to the group average. Now let's decompose this in terms of sums of squares, and hopefully that will start to clarify things. The key thing to recall is that j is the subject and i is the group. So let me write that down here so we have it handy as we go through these formulas: i is the group, G-R-O-U-P, and j is the number of the subject, or the response. For the total sum of squares, recall that these are just distances. You're taking the sum of the squared distances from y_ij, one data point, to the grand mean. So wherever you are on your graph, you find the grand mean, and that's the distance you're interested in. This distance can be broken down in the following way. Look at this term here, y_ij; that's a data point. And you see this notation y-bar_i.; that means the group mean of group i. So first, this is the distance between the data point and the group mean, and then here you have the grand mean estimate, y-bar.., and here's the group mean again. So to go from y_ij to y-bar.., you take the difference between the grand mean and the group mean, and the difference between the group mean and the individual data point. Data point, to group mean, to grand mean.
This n_i reflects the fact that, within a group, the term (group mean minus grand mean) is the same for every observation, so you have to count it n_i times, once per data point, just to balance out the equation. When n_i is the same for all treatment groups, you can factor out the n, which is done here, and you're left with this equation. So to go from y_ij, the individual data point, to the grand mean, you cover a distance, and that distance equals the distance from the grand mean to the group mean plus the distance from the group mean to the data point, which is this piece here. So in terms of sums of squares, your total sum of squares, the squared distance from your data points to your grand mean, is made up of two parts. The first part is the between-treatment sum of squares, or SST, the sum of squares for treatments. That measures the variability due to the differences in the treatments; you can see it's the distance between your group mean, or treatment mean, and your grand mean. Each group has its own group mean, and how far away is that from the grand mean? Then you multiply by n because there are n subjects to account for. The second part, the sum of squares for error, measures the variability that's not explained by the differences in the treatment means. The treatment explains the difference from the grand mean to the treatment mean, and these are the differences at the individual level: here's the individual subject, y_ij, and its distance from the group mean, and that's your sum of squares for error. So, are the sums of squares significant? The test statistic starts with the mean square for treatments, the between-group mean square.
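The decomposition just described can be checked numerically. This sketch, again with made-up sales numbers, computes the total sum of squares, the treatment (between-group) sum of squares, and the error (within-group) sum of squares, and you can verify that the first equals the sum of the other two:

```python
# Verify SSTotal = SSTreatment + SSE on a small hypothetical data set.
groups = [
    [112, 120, 108, 115],   # season 1
    [150, 162, 158, 171],   # season 2
    [118, 109, 121, 113],   # season 3
]

all_obs = [y for g in groups for y in g]
grand_mean = sum(all_obs) / len(all_obs)
group_means = [sum(g) / len(g) for g in groups]

# SSTotal: squared distance of every observation from the grand mean.
ss_total = sum((y - grand_mean) ** 2 for y in all_obs)

# SSTreatment: n_i times the squared distance of each group mean
# from the grand mean.
ss_treatment = sum(len(g) * (m - grand_mean) ** 2
                   for g, m in zip(groups, group_means))

# SSE: squared distance of each observation from its own group mean.
ss_error = sum((y - m) ** 2
               for g, m in zip(groups, group_means) for y in g)
```

The identity ss_total == ss_treatment + ss_error holds exactly (up to floating-point rounding), which is the algebraic fact the lecture is walking through.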
So you take your sum of squares for treatments and divide by the number of groups minus one, because you've calculated the group means, so that's your degrees of freedom there. Then your mean square error, the within-group mean square, is the sum of squared errors divided by the total number of observations minus k, for its degrees of freedom. Under the null hypothesis for the test, you assume that all the means in the population are the same. So the means for treatments A, B, C, and D are all the same, and if two of them are different, then you know something happened with one of the treatments. The test statistic follows an F distribution; there it is. The numerator has k minus 1 degrees of freedom and the denominator has N minus k degrees of freedom. You can look this up; I will just provide you the statistics. The key thing to note is that it's not a symmetrical distribution like the normal distribution. It lives on the positive end of the number line, ranging from zero to positive infinity, and it's not symmetrical; there's some weight on one side. But as with the normal-distribution interpretation, you want to find the five-percent area. So the shaded region there corresponds to a p-value of 0.05. If you compute an F statistic, which is the ratio of the mean square for treatments to the mean square for error, and that statistic lands way out there, that means more of the total variance is explained by the treatment (the mean square for treatments) than by the mean square for error. That's what makes this number big. To think of it another way, look at this ratio: if the mean square error were really big and the mean square for treatments were small, then the whole ratio becomes small, it moves down the number line, and you'll be away from the rejection region of the null hypothesis.
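Putting the pieces together, here is a sketch that computes the mean squares and the F statistic by hand for the same kind of made-up data, with k groups and N observations in total:

```python
# One-way ANOVA F statistic computed from scratch: F = MST / MSE.
groups = [
    [112, 120, 108, 115],
    [150, 162, 158, 171],
    [118, 109, 121, 113],
]
k = len(groups)                      # number of treatment groups
N = sum(len(g) for g in groups)      # total number of observations

all_obs = [y for g in groups for y in g]
grand_mean = sum(all_obs) / N
group_means = [sum(g) / len(g) for g in groups]

ss_treatment = sum(len(g) * (m - grand_mean) ** 2
                   for g, m in zip(groups, group_means))
ss_error = sum((y - m) ** 2
               for g, m in zip(groups, group_means) for y in g)

mst = ss_treatment / (k - 1)   # mean square for treatments, df = k - 1
mse = ss_error / (N - k)       # mean square for error, df = N - k
f_stat = mst / mse             # large F => treatments explain more variance
```

A large f_stat lands far out in the right tail of the F(k-1, N-k) distribution, exactly the "way out there" situation described above.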
So our statistic is the mean square for treatments divided by the mean square for error, with those degrees of freedom for your F statistic. If your statistic lands way out on the right, so that the p-value is smaller than your significance level, then you can reject the null hypothesis, which says that all the means of the groups are the same. Alpha is your significance level, and generally we use a number like 0.05.
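In practice you rarely compute all of this by hand. Assuming SciPy is available, its f_oneway function runs the whole one-way ANOVA on the raw group data and returns the F statistic and its p-value directly:

```python
from scipy import stats

# The same three hypothetical groups as in the hand computation above.
g1 = [112, 120, 108, 115]
g2 = [150, 162, 158, 171]
g3 = [118, 109, 121, 113]

# SciPy's one-way ANOVA: returns the F statistic and the p-value.
f_stat, p_value = stats.f_oneway(g1, g2, g3)

alpha = 0.05
# Reject the null (all group means equal) when p < alpha,
# i.e. at least two group means differ.
reject_null = p_value < alpha
```

For these numbers group 2 sits far above the others, so the F statistic is large, the p-value is tiny, and we reject the null that all group means are equal.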