So logistic regression is one of the simplest algorithms one can consider in machine learning, but it is by no means trivial: it is widely used and it provides a nice introduction to the field. It is therefore worthwhile to look at logistic regression a little more deeply and to use it as a model for interpreting what machine learning is doing.

To do that, let's look at a particular problem. Shown here are some examples of handwritten digits from 0 to 9, drawn from a very classic data set called the MNIST data set. We would like to build a predictive model that can look at one of these images and map it to a digit from 0 to 9. As one could imagine, this is very applicable to the analysis of human handwriting; the digits 0 to 9 are just an example, and the same basic construct can be used for the analysis of any handwritten language. The question is, how do we do this within the context of the tools we have at our disposal, particularly logistic regression?

Again, the goal is to take a picture, here a picture of an 8 on the left, and have the machine look at that picture, map it to a prediction, and say that it is an 8. The way we do this is to pixelize the image, which is what you see on the right: each pixel is the intensity of the image at that point in space. Those pixels then correspond to the vector x that we talked about previously and that we will use again.

To make this concrete, remember that the tool we currently have at our disposal is logistic regression, which assumes that the output of our model is binary. So we are going to consider a binary problem: a digit is either a 0 or a 1, and we would like the machine to automatically look at the image and correctly say which one it is. This is a very simple binary task, and of course we can generalize it beyond binary; the binary setting is chosen because it is consistent with the assumptions of the logistic regression model we are currently looking at.

Recall how this works: we are given a set of training data. On the left you see some examples of 0s and 1s, so the training data consists of images together with their true labels. In our logistic regression model, the pixels of the image correspond to the data x. We weight each pixel value by multiplying it by the parameters b1 through bm, we sum them up to get the quantity z, and then z is sent through the sigmoid function. The output is the probability that we are looking at a digit 1 or a digit 0. What we mean by learning is to infer the set of parameters b0 through bm.
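To make this computation concrete, here is a minimal sketch in Python (not part of the original lecture) of the forward pass just described: a flattened image x is weighted by parameters b1 through bm, a bias b0 is added, and the result z is sent through the sigmoid. The function names and the random stand-in values are illustrative assumptions, not learned parameters.

```python
import numpy as np

def sigmoid(z):
    # Squash the weighted sum z into a probability between 0 and 1.
    return 1.0 / (1.0 + np.exp(-z))

def predict_prob_of_one(x, b0, b):
    # x: flattened image pixels (length m); b: weights b1..bm; b0: bias.
    # z = b0 + b1*x1 + ... + bm*xm, then send z through the sigmoid.
    z = b0 + np.dot(b, x)
    return sigmoid(z)

# Toy usage with a hypothetical 28x28 MNIST image flattened to 784 pixels.
x = np.random.rand(784)   # stand-in for a real image
b = np.zeros(784)         # stand-in for learned parameters b1..bm
b0 = 0.0                  # stand-in for the learned bias
p = predict_prob_of_one(x, b0, b)
print(f"P(digit is 1) = {p:.3f}")
```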
To give an example of what this model is doing, this is what is learned when a logistic regression model is trained on those 1s and 0s. To understand it more closely, look at this example: a picture that corresponds to a 0, and the parameters of our model, shown on the right in blue and red. We perform a pointwise multiplication: for every pixel of the picture on the left, we multiply it by the corresponding pixel on the right. We do that multiplication pixel by pixel and then sum everything together. Notice that the only place we get a strong contribution is where the white of the picture intersects the blue of our filter, and that yields a very negative value, which, after being sent through the sigmoid function, gives a very low probability that this image corresponds to a 1.

By contrast, look at the bottom example, where the black-and-white image corresponds to a 1. If we take the picture of a 1 and multiply every pixel of the image by the corresponding pixel of the parameters b, again shown by the blue and red image, the only place with a strong contribution is where the positive components of the parameters b overlap the positive components of the picture. We then get a strong positive value, which is then sent through the sigmoid function and yields a high probability that this is the number 1.

At this point I want to introduce a little notation that will be useful later. Remember, what we are doing is taking the vector xi, which corresponds to our data, a set of numbers, and multiplying each component of xi by the corresponding component of another vector, the vector characterized by the parameters of our model b1, b2 through bm. We can visualize this as follows: the data is a vector xi characterized by a set of pixels, and the parameters of our model form a vector b representing b1 through bm. We multiply every pixel of xi by the corresponding component of b and then sum them all together, which implements the equation at the top. This expression is called an inner product between the vector xi and the vector b. The concise notation, xi with a circled dot b, denotes this inner product, and that compact notation will be useful later when we look at more sophisticated models.

So in this model we are looking at images, some with true label yi = 0 and others with true label yi = 1, with the filter shown at the right. We take the inner product of the data xi with b, which gives us z, and z is sent through the logistic function. One way to think about this is that the parameters b act like a filter, and you can see that the filter looks a lot like a one. What logistic regression is doing is learning a filter of parameters b and then applying that filter to the data. If there is a strong match between the filter and the data, we say there is a high probability that the outcome yi equals 1; if there is a poor match, we say that probability is low. Again, the result is ultimately sent through the logistic function, the sigmoid, which converts it into a probability.

This concept of taking an inner product between the data and a filter and then sending the result through the logistic, or sigmoid, function is at the heart of logistic regression. As we move forward, we will use this concept from logistic regression to build the more sophisticated models that we find in deep learning.
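To make the filter picture concrete, here is a small numerical sketch (not from the lecture): a hypothetical 3x3 filter b, positive where a "1" stroke would be and negative elsewhere, together with two toy images. The pixel-by-pixel multiplication and sum, followed by the sigmoid, produces a high probability when the image matches the filter and a low one otherwise. All values here are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy 3x3 "filter" b: positive (blue) along a vertical stroke, negative (red) elsewhere.
b = np.array([[-1.0,  1.0, -1.0],
              [-1.0,  1.0, -1.0],
              [-1.0,  1.0, -1.0]])

# A toy image resembling a "1" (bright vertical stroke) and one resembling a "0" (bright ring).
image_one  = np.array([[0.0, 1.0, 0.0],
                       [0.0, 1.0, 0.0],
                       [0.0, 1.0, 0.0]])
image_zero = np.array([[1.0, 1.0, 1.0],
                       [1.0, 0.0, 1.0],
                       [1.0, 1.0, 1.0]])

for name, img in [("one", image_one), ("zero", image_zero)]:
    # Inner product: multiply pixel by pixel, then sum everything together.
    z = np.sum(img * b)
    print(f"image of a {name}: z = {z:+.1f}, P(y=1) = {sigmoid(z):.3f}")
```

Running this, the image that matches the filter gives a large positive z and hence a probability near 1, while the mismatched image gives a negative z and a probability near 0, which is exactly the filter-matching intuition described above.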