Okay. Welcome back, troops. We're going to start talking about Student's, or as I like to say, Gosset's t distribution. The reason I call it Gosset's t distribution is that it's usually called Student's t distribution because Gosset published under the pseudonym Student in 1908, so it was actually Gosset's distribution. He laid the groundwork for the actual distribution, and then I believe that Fisher actually proved some of the finer-scale mathematics. But I wanted to talk for a minute about Gosset, because he's a pretty interesting character in the annals of Statistics. Gosset was a researcher who worked at the Guinness Brewery in Ireland, and when he created the t distribution, he was working for Guinness. At the time, Guinness actually had several really brilliant researchers working for it, and it wouldn't let them publish under their real names. That's how Gosset wound up publishing under a pseudonym and making this sort of landmark discovery as a researcher. It's interesting: the reason he came up with this distribution is that, for him, the central limit theorem was simply not rich enough to describe the problems he was looking at. He was working with small batches in the science of brew making, and it wasn't adequate to assume that things were heading to infinity. So he came up with this distribution, and we're all the more fortunate for it. One thing I really like about Gosset is that whenever you read about him, he was apparently a tremendously nice guy, extremely humble, and he made several major discoveries in Statistics. He came up with one of the first applications of the Poisson distribution. He also rose up pretty high in the Guinness company; he was head brewmaster at, I think, its London brewery by the time he retired. So anyway, he's a really interesting character, and if you get a chance, you should read about him.
At any rate, he came up with this wonderful distribution called, as I would like it to be called, Gosset's t distribution. The t distribution is really used when you have smaller sample sizes. It assumes your data are Gaussian, but it tends to work even if your data are non-Gaussian. The t distribution is indexed by something called degrees of freedom, and it looks like a normal distribution where someone kind of squashed it down at its tip and all the extra mass went out into its tails. It looks more and more like a standard normal as the degrees of freedom get larger and larger. So, how do you get a t distribution? Say you wanted to simulate it on a computer: you would take a standard normal, say Z here, and you would divide it by the square root of an independent Chi-squared divided by its degrees of freedom. So where Z and the Chi-squared here are independent standard normal and Chi-squared random variables, that's how you wind up with a t distribution. How is this useful? On the next slide, we'll look at how we apply this. Let's suppose that X1 to Xn are iid normal mu sigma squared; then X bar minus mu divided by sigma over square root n is, of course, standard normal, right? Because linear combinations of normal random variables are themselves normal, X bar is normal. And because they're iid, we know exactly what the standard deviation of X bar is; it's sigma over square root n, and we know that its mean is mu. So when we shift our non-standard normal by mu and divide it by its standard deviation, sigma over square root n, we get a standard normal. Hopefully this should not be news to you at this point in the class. And then we also know, from earlier in today's lecture, that n minus 1 times S squared over sigma squared is Chi-squared with n minus 1 degrees of freedom.
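As a quick sketch of that construction (the variable names here are my own, not from the slides), you can build t draws from a standard normal and an independent Chi-squared, and compare them against R's built-in rt simulator:

```r
# Simulate Gosset's t from its definition:
# T = Z / sqrt(ChiSq / df), with Z standard normal and
# ChiSq an independent Chi-squared with df degrees of freedom.
set.seed(1)
df <- 5
nSims <- 100000
z <- rnorm(nSims)                    # standard normals
chisq <- rchisq(nSims, df = df)      # independent Chi-squareds
tManual <- z / sqrt(chisq / df)      # t by construction
tDirect <- rt(nSims, df = df)        # R's built-in t simulator
# The two samples should have essentially matching quantiles
round(quantile(tManual, c(0.025, 0.975)), 2)
round(quantile(tDirect, c(0.025, 0.975)), 2)
```

Both sets of quantiles should sit close to qt(c(0.025, 0.975), df = 5), noticeably wider than the standard normal's plus or minus 1.96, which is the squashed-tip, heavy-tails picture in action.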
So, if we take n minus 1 times S squared over sigma squared, divide it by an additional n minus 1, and square root the whole thing, we get S over sigma; we've taken a Chi-squared, divided it by its degrees of freedom, and taken the square root. So S over sigma is the square root of a Chi-squared divided by its degrees of freedom. Therefore, if we take X bar minus mu divided by sigma over square root n, and then divide the whole thing by S over sigma, which, if we do the arithmetic, works out to be X bar minus mu divided by S over square root n, we wind up with a standard normal divided by the square root of a Chi-squared divided by its degrees of freedom. Now, there's one small thing that we're kind of fudging over. We haven't shown that X bar and S are independent, right? They're from the same data, so it doesn't seem obvious that they're independent. They are; it's just not immediately clear, so let's sweep that under the rug. Forget about that for the time being and take my word for it: X bar and S are independent, so this exactly has Gosset's t distribution with n minus 1 degrees of freedom. And notice what we've basically accomplished. We saw previously, in constructing confidence intervals, that X bar minus mu divided by sigma over square root n is a nice kind of pivotal statistic to work with. It's useful for generating confidence intervals, and we'll see that it's useful for doing hypothesis tests. All we've done is replace sigma by S. It's basically saying that we can take the unknown population standard deviation and replace it with the known sample standard deviation, and we get a statistic whose distribution we know, okay? And by the way, this statistic, X bar minus mu divided by S over square root n, also limits to a standard normal as n goes to infinity, which of course it must, since Gosset's t distribution limits to a standard normal as the degrees of freedom go to infinity. If you plot it, it looks more and more like a normal distribution as n goes to infinity.
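A small simulation can make this concrete (a sketch of my own, assuming iid normal data): the statistic X bar minus mu over S over square root n really does follow a t with n minus 1 degrees of freedom, even though sigma appears nowhere in it.

```r
# Check by simulation that (Xbar - mu) / (S / sqrt(n)) is
# t-distributed with n - 1 degrees of freedom.
set.seed(2)
n <- 8; mu <- 10; sigma <- 3
tStats <- replicate(100000, {
  x <- rnorm(n, mean = mu, sd = sigma)
  (mean(x) - mu) / (sd(x) / sqrt(n))  # sigma has been replaced by S
})
# Simulated tail quantiles vs. the theoretical t quantiles
round(quantile(tStats, c(0.025, 0.975)), 2)
round(qt(c(0.025, 0.975), df = n - 1), 2)
```

The simulated quantiles match qt with df = 7, not the standard normal's, which is exactly the small-sample correction Gosset was after.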
So we haven't violated the central limit theorem or anything like that in the process of doing this stuff. Let's actually use this distribution to create a confidence interval. It's a statistic that, under the assumption of normality of the underlying data, does not depend on the parameter mu that we're interested in, and therefore we can use it to create a confidence interval for mu. Let t sub df, alpha be the alphath quantile of the t distribution with df degrees of freedom. So t sub n minus 1, 1 minus alpha over 2 is the upper quantile from the relevant t distribution, and t sub n minus 1, alpha over 2 is the lower quantile. And so this probability statement here, that 1 minus alpha is equal to the probability that this statistic lies between those two quantiles, is then, of course, true, right? The probability that this t random variable lies between the alpha over 2 lower quantile and the 1 minus alpha over 2 upper quantile is exactly 1 minus alpha. Oh, and I should note here, by the way, that because the t distribution is symmetric about zero, the alpha over 2 lower quantile is equal to the negative of the 1 minus alpha over 2 upper quantile. That's why here, instead of writing t sub n minus 1, alpha over 2, I wrote minus t sub n minus 1, 1 minus alpha over 2, and you'll see why I do that in a second. So anyway, this probability statement applies here, so we can just rearrange terms, keeping track of flipping our inequalities around when we multiply by a negative sign. And we get that X bar minus a t quantile times a standard error is less than mu, and X bar plus a t quantile times a standard error is bigger than mu; that random interval contains mu with probability 1 minus alpha. If you look at the form of this interval when I write it out this way, it happens to be X bar plus or minus the upper quantile from the t distribution times the standard error.
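As a quick check of the symmetry fact used above (a tiny sketch of my own), the lower alpha over 2 t quantile really is the negative of the upper 1 minus alpha over 2 quantile:

```r
# The t distribution is symmetric about zero, so
# the alpha/2 quantile equals minus the (1 - alpha/2) quantile.
n <- 10
alpha <- 0.05
lower <- qt(alpha / 2, df = n - 1)      # lower 2.5% quantile
upper <- qt(1 - alpha / 2, df = n - 1)  # upper 97.5% quantile
c(lower, -upper)                        # identical values
```

This is what lets us write the interval compactly as estimate plus or minus the single upper quantile times the standard error.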
And that's why I took only the upper quantile; that way, we can write it as plus or minus. Okay, so that's how we wind up with these intervals: estimate plus or minus quantile times standard error. That's where it comes from. This interval assumes that the data are iid normal, though it's very robust to this assumption. Whenever the data are roughly symmetric and mound shaped, the t confidence interval works amazingly well. And if you have paired observations, people before and after a treatment, for example, you can subtract them and then create a t confidence interval on the difference. So paired observations are often analyzed using this exact confidence interval technique by taking differences, and differences tend to be much more Gaussian-looking; they tend to be nice and symmetric. Then, for large degrees of freedom, the t quantiles become the same as the standard normal quantiles, and so this interval just converges to the same interval that you get from the CLT. Some more notes. For skewed distributions, the spirit of the t interval assumptions is violated. You could probably show that it still works kind of okay, and the reason is that those quantiles, the t sub n minus 1, 1 minus alpha over 2 quantiles, are so far out there. The t distribution is a very heavy-tailed distribution that shoves those quantiles way out, which makes the interval a lot wider, so it tends to work kind of conservatively in a broad variety of settings. But for skewed distributions, you're violating the spirit of the t interval, and you're often better off trying things like taking the natural log of your data, if it's positive, to get it to be more Gaussian-looking before you do a t confidence interval. We'll spend an entire lecture on the consequences of logging data, so you can wait for that.
But, you know, I would just say, for skewed distributions, it kind of violates the intent of the t interval, so maybe consider things like taking logs. I'd also say that for skewed distributions, maybe it doesn't make as much sense to center the interval around the mean, in the way that we're doing with this t interval. Then, the other thing: for discrete data, like binary data, I bet you could do simulation studies and show that the t interval actually works okay. But we have lots of techniques for binary data that make direct use of it, and for those you're better off using, for example, things based on Chi-squares or exact binomial intervals and that sort of thing, because you're so far from the spirit and intent of the t interval that it's not worth using. Regardless, the t interval is an incredibly handy tool, and I'm sure that in some of these cases it probably works fine, but you're so far from the assumptions at that point that you're better off using the other techniques that have been developed for those cases. And that's enough discussion about the t confidence interval; let's go through an example. So maybe take a break, go have a Guinness, and we'll be back in a second. Okay, welcome back. We're going to talk about Gosset's original data, which involve sleep, so try not to fall asleep while we're talking about it. Gosset's original data appeared in this journal called Biometrika, with a k. And Biometrika, interestingly enough, was co-founded by a person called Francis Galton. So Gosset was an interesting character, but if you really want to read up on another absolutely brilliant, interesting character, read up on Galton. He was Charles Darwin's cousin. He invented the term and the concept of regression.
He also invented the term and the concept of correlation, and he invented lots of other things, some good, some bad. He was just generally a rather interesting character. So, at any rate, Biometrika was founded by Francis Galton and colleagues, that is where Gosset's original paper appeared, and that's where the sleep data occurred. The sleep data show the increase in hours slept for ten patients on two sleeping drugs. Now, R treats the data as two groups rather than paired, and I have to admit I haven't taken the time to figure out exactly why there's a discrepancy between R, which treats the data as two groups, and Gosset's Biometrika paper, which treats the data as paired. Anyway, I'm going to treat it exactly like Gosset's data. So here is what it looks like as Gosset's data: we have patients one, two, up to ten, we have the two drugs, and the difference. And here's the code I used to get it that way. The mean is the mean of the differences, and here I just put, in a comment, the value that it comes out to be: 1.58. Remember, these values are increases in hours slept relative to each patient's baseline. The standard deviation of the collection of differences is 1.23. We had 10 subjects, so our confidence interval is the mean plus or minus the t quantile at 0.975, if we want a 95% confidence interval, because remember we put 2.5% in either tail, times the standard error, which is S divided by square root n. That will give you our confidence interval manually. But if you want to go the easier way, R actually has a function to do t confidence intervals, of course, because it's one of the most popular statistical procedures. So you call t.test, where difference is the name of the vector that contains the differences, and the dollar sign grabs the relevant output. In this case, I want the confidence interval, so it's $conf.int.
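Since the transcript only describes the code, here is a minimal, self-contained version of the computation (variable names are my own) using the `sleep` data frame that ships with R, computing the interval manually and then with t.test:

```r
# Gosset's sleep data ships with R as `sleep`: 20 rows, with
# `extra` (increase in hours slept) and `group` (drug 1 or 2).
# Treat it as paired, as in Gosset's Biometrika paper, by differencing.
g1 <- sleep$extra[sleep$group == "1"]
g2 <- sleep$extra[sleep$group == "2"]
difference <- g2 - g1
n <- length(difference)          # 10 subjects
mn <- mean(difference)           # 1.58
s <- sd(difference)              # 1.23
# Manual 95% interval: estimate +/- t quantile times standard error
mn + c(-1, 1) * qt(0.975, df = n - 1) * s / sqrt(n)
# Or let R do it; $conf.int grabs just the interval
t.test(difference)$conf.int      # roughly 0.70 to 2.46
```

The manual interval and the t.test interval agree exactly, since t.test on a single vector is performing precisely this calculation.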
If you omitted the dollar sign when you hit Return, it would give you lots of information, including the confidence interval; here it returns exactly the confidence interval, and you get 0.7 to 2.5, basically. Now, we've talked a lot about likelihoods, so I wanted to talk about how you can use the t distribution to create a likelihood. Remember, we're in this kind of hard setting where the data have two parameters, mu and sigma, so the likelihood is inherently a two-dimensional object. We showed earlier a trick for getting a likelihood for sigma alone. Here I'm going to show another trick to get a likelihood for a single parameter, but the single parameter is a function of the two parameters: mu divided by sigma, which is actually quite an important parameter. Mu divided by sigma is the mean in standard deviation units, so it's a unit-free quantity, and it's often called the effect size; this is a nifty little trick to create a likelihood for the effect size. So if X is normal mu sigma squared, and this Chi-squared random variable is a Chi-squared random variable with df degrees of freedom, then take X divided by sigma and divide it by the square root of the Chi-squared divided by its degrees of freedom. Notice we have not subtracted off mu in the numerator, so X over sigma still has a mean; in this case, its mean is mu over sigma. So we have not taken a standard normal and divided it by the square root of an independent Chi-squared divided by its degrees of freedom; we took a non-standard normal and divided it by the square root of an independent Chi-squared divided by its degrees of freedom. So it can't work out to be a t random variable, because we haven't satisfied the definition of a t random variable. It's what's called a non-central t random variable.
In the specific case when mu is zero, we wind up with an ordinary t random variable. This non-central t random variable also has degrees of freedom, but it has a second parameter called the non-centrality parameter; in this case, the non-centrality parameter is mu over sigma. Just to put this in context: X bar is normal with mean mu and variance sigma squared over n, and n minus 1 times S squared over sigma squared is Chi-squared with n minus 1 degrees of freedom. So square root n times X bar divided by S works out to be a non-central t with non-centrality parameter square root n times mu divided by sigma. At any rate, we can use this fact to create a likelihood for mu over sigma, the effect size, and we can plot a one-dimensional likelihood without having to do any further tricks. After this, I'll talk about how you can do a trick to avoid all of this; figuring out a likelihood for the effect size, or a likelihood for the variance, requires the sort of derivations you'd see in a Math Stat class. What we'll talk about next, the profile likelihood, is a quick way to generate a likelihood in any setting in a fairly automatic way. So anyway, here's how you would create the intervals. Our t statistic is square root n times the mean divided by S, just using the code from the sleep data. The effect size values that we want to plot, let's say, go from 0 to 1, with length 1,000. Our likelihood values are then the t density; R's dt function, the t density function, has an argument ncp, which stands for non-centrality parameter. So we have our t density, we plug in our t statistic, our degrees of freedom are n minus 1, and then we loop over all of our non-centrality effect sizes, and that creates a collection of likelihood values.
Then we want our likelihood to be peaked at one, so instead of figuring out the exact maximum likelihood, we just divide by the maximum over the grid of 1,000 points we searched. We plot our effect size values against our likelihood values, make sure it's a line by doing type equals "l", and draw horizontal lines at 1/8 and 1/16. And we wind up with this plot. This plot is a likelihood, not for mu, but for mu divided by sigma, and it has all of the same interpretations as any likelihood. Higher values of the likelihood indicate better supported values of mu over sigma, lower values indicate worse supported values, and the points where the two horizontal lines intersect the likelihood give likelihood intervals for the effect size. So that's maybe enough about the effect size; it's a fairly specific thing, not something you'll see in many introductory stats textbooks. I just thought I'd give it because it's kind of neat. Next, we'll talk about a way of creating a univariate likelihood to look at when what you have is a multivariate likelihood.
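Here is a runnable version of the effect-size likelihood just described (a sketch of my own that reuses the paired sleep differences; variable names follow the transcript's description):

```r
# Likelihood for the effect size mu / sigma via the non-central t.
# Uses the paired differences from R's built-in sleep data.
difference <- sleep$extra[sleep$group == "2"] - sleep$extra[sleep$group == "1"]
n <- length(difference)
tStat <- sqrt(n) * mean(difference) / sd(difference)  # observed t statistic
esVals <- seq(0, 1, length.out = 1000)                # candidate effect sizes
# dt's ncp argument is the non-centrality parameter sqrt(n) * mu / sigma;
# loop over the candidate effect sizes to get the likelihood values
likVals <- sapply(esVals, function(es) dt(tStat, df = n - 1, ncp = sqrt(n) * es))
likVals <- likVals / max(likVals)                     # normalize the peak to 1
plot(esVals, likVals, type = "l",
     xlab = "effect size mu / sigma", ylab = "likelihood")
abline(h = c(1/8, 1/16))                              # reference cutoffs
```

The effect sizes where the curve sits above a horizontal line form the corresponding likelihood interval, exactly as with the likelihoods seen earlier in the course.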