[MUSIC] Right, welcome back. In these videos I'd like to talk about the vast topic of interpolation and approximation, or more generally, modelling data. So what's the setup? Suppose we have some way of measuring something, so we have a set of data; I'll call them y. In real life they can be multidimensional, but we'll keep things univariate for simplicity, so they're just a set of scalar numbers. These measurements are taken at fixed values of an independent variable, call it x, and we want to model this set of data somehow. From somewhere we have a functional form for our model, which I'll call f, and we believe this model explains the data in some sense. The model is parameterized by some parameters, which I'll call beta. The task is to find the values of the betas for which the model fits the data best.

What does "fit" mean, and what counts as "best"? There are two broad categories. One is that we expect our model to coincide with the measured data exactly; that's interpolation. A different perspective is that we expect our data to be noisy, and what we want to do is filter the signal out of the noise. What is this noise? Basically, we say there are a gazillion different factors which influence our data. We believe we have captured the important ones, and that's our model. The rest we know nothing about and cannot control, so we model them as random variables. That, broadly speaking, is approximation. Formally, we write that our model represents the data plus this noise: y_i = f(x_i; beta) + epsilon_i.

Okay, that's the general idea. Now, to find the best model we need a quantitative measure of what's good and what's bad, so let's discuss some of the most common ways of quantifying that. Probably the simplest and most common one is least squares. The picture is this: I have my model, say a straight line, and I have my data, which deviates from it because of the noise, and I want to minimize the deviations between the model and the measured data. I cannot simply minimize the sum of the deviations themselves, because they have both signs, so instead I minimize the sum of their squares. One possible interpretation of this is again probabilistic: these deviations are, as I was saying, noise, and the noise should have zero expected value at all points, because if the expectation were non-zero we would absorb it into the model. Furthermore, these random, unknown factors produce deviations whose spread is roughly the same for all data points. We then minimize this sum of squares, sometimes known as the chi-square, with respect to beta. How we do that specifically we'll discuss a bit later, but a minimal code sketch of a plain least-squares fit is given below.

For now I want to mention that this is certainly not the only choice. One variation of it arises if we drop the assumption that the standard deviations of the noise variables are the same at all points. They might be very different: say, one point has a very large standard deviation and another has a small one. Then I cannot really tell whether the first point is further away from the curve than the second, because the natural way of measuring distances is in terms of the standard deviations. This situation is usually known under the funny name of heteroscedasticity.
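To make the chi-square minimization above concrete, here is a minimal sketch of a plain (unweighted) least-squares fit, assuming a straight-line model and SciPy's curve_fit routine; the data, the parameter names a and b, and the noise level are all made up for illustration.

```python
# Minimal sketch: fit a straight-line model y = a*x + b to noisy data
# by minimizing the sum of squared deviations (the "chi-square").
import numpy as np
from scipy.optimize import curve_fit

def model(x, a, b):
    """Assumed model f(x; beta) with beta = (a, b)."""
    return a * x + b

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 20)                    # fixed values of the independent variable
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.5, x.size)  # measurements = signal + zero-mean noise

beta, cov = curve_fit(model, x, y)                # least-squares estimate of (a, b)
chi2 = np.sum((y - model(x, *beta))**2)           # the minimized sum of squares
print("fitted parameters:", beta, "chi-square:", chi2)
```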
This naturally leads to a cost function which is again of the least-squares form, only with weights: each deviation is divided by its own standard deviation before squaring, so the weights are the inverse variances. Geometrically it says: let's measure the distance in terms of the expected spread for each point, okay?

A possible alternative is this. I mentioned that you cannot minimize the sum of the differences themselves because they have both signs, so we took the square. But instead of taking the square we can take the absolute value, so we are still minimizing a measure of deviation, namely the sum of the absolute residuals. In general this is neither better nor worse than minimizing the squares. It is a little more complicated numerically, because the absolute value is not differentiable at zero, and that region is a bit problematic for numerics. Still, there is no fundamental reason why this approach, also known as L1 regression or least absolute deviations, should be better or worse than least squares. One property which sometimes makes it attractive is that, in general, L1 regression is less sensitive to outliers. What are outliers? Suppose there is one point which is very, very far from the curve. A least-squares fit will be skewed by this outlier quite a bit, while the L1 regression is much less sensitive. So if you expect your data to contain outliers, and in real life we very often do have them because something goes wrong in our measurements, this can offer some advantages, okay?

Last but not least, I'll mention the so-called total least squares, also known as orthogonal distance regression. Essentially, there is no reason why we must minimize the vertical distances between our measurements and the model curve. We might instead take the orthogonal distances and minimize either their sum or the sum of their squares. This leads to slightly different procedures. It is again numerically a little more complicated and a little less common than least squares, but it can be appropriate, for example, in those cases where not only the measurements have errors but the independent variables are also not known exactly and are themselves noisy.

Okay, so those are some of the possibilities, and there are many others; short code sketches of the weighted, L1, and orthogonal-distance variants follow at the end of this segment. For now, let's focus on the most common one, the minimization of least squares, and that's what I'm going to turn to in the next video. [SOUND]
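Here is a minimal sketch of the weighted (heteroscedastic) least squares described above, assuming the per-point standard deviations sigma are known; it reuses SciPy's curve_fit, whose sigma and absolute_sigma arguments implement exactly this weighting. The model and the noise levels are illustrative assumptions.

```python
# Minimal sketch: weighted least squares, where each residual is divided
# by its own standard deviation before squaring.
import numpy as np
from scipy.optimize import curve_fit

def model(x, a, b):
    return a * x + b

rng = np.random.default_rng(1)
x = np.linspace(0.0, 10.0, 20)
sigma = np.where(x < 5.0, 0.2, 2.0)        # small spread on the left, large on the right
y = 2.0 * x + 1.0 + rng.normal(0.0, sigma)

# curve_fit minimizes sum(((y - model(x, a, b)) / sigma)**2)
beta_w, _ = curve_fit(model, x, y, sigma=sigma, absolute_sigma=True)
beta_u, _ = curve_fit(model, x, y)         # unweighted fit, for comparison
print("weighted:", beta_w, "unweighted:", beta_u)
```

The weighted fit trusts the low-spread points more, which is the geometric picture of measuring each distance in units of its expected spread.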
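To see the robustness to outliers, here is a small sketch comparing a least-squares fit with a least-absolute-deviations (L1) fit on data containing one bad point. It uses scipy.optimize.minimize with a derivative-free method, since the absolute value is not differentiable at zero; the model and the numbers are again only illustrative.

```python
# Minimal sketch: least squares vs least absolute deviations (L1)
# on data with one outlier.
import numpy as np
from scipy.optimize import minimize

def model(beta, x):
    a, b = beta
    return a * x + b

rng = np.random.default_rng(2)
x = np.linspace(0.0, 10.0, 20)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.3, x.size)
y[10] += 25.0                                    # one measurement has gone badly wrong

def sum_of_squares(beta):
    return np.sum((y - model(beta, x))**2)

def sum_of_abs(beta):
    return np.sum(np.abs(y - model(beta, x)))    # not differentiable at zero residual

beta0 = [1.0, 0.0]
beta_l2 = minimize(sum_of_squares, beta0, method="Nelder-Mead").x
beta_l1 = minimize(sum_of_abs, beta0, method="Nelder-Mead").x
print("L2 fit:", beta_l2, "  L1 fit:", beta_l1)  # the L1 fit is typically skewed less by the outlier
```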
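Finally, a sketch of orthogonal distance regression (total least squares), for the case where the independent variable is noisy as well; SciPy ships an interface to ODRPACK in scipy.odr. The straight-line model and the stated uncertainties are assumptions made for illustration.

```python
# Minimal sketch: orthogonal distance regression with scipy.odr,
# appropriate when both x and y carry measurement noise.
import numpy as np
from scipy import odr

def model(beta, x):                  # scipy.odr expects f(beta, x)
    a, b = beta
    return a * x + b

rng = np.random.default_rng(3)
x_true = np.linspace(0.0, 10.0, 20)
x = x_true + rng.normal(0.0, 0.3, x_true.size)   # the independent variable is noisy too
y = 2.0 * x_true + 1.0 + rng.normal(0.0, 0.3, x_true.size)

data = odr.RealData(x, y, sx=0.3, sy=0.3)        # stated uncertainties in x and y
fit = odr.ODR(data, odr.Model(model), beta0=[1.0, 0.0]).run()
print("ODR fit:", fit.beta)
```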