In this lecture we're going to talk about how to instantiate the vector space model so that we can get a very specific ranking function. This continues our discussion of the vector space model, which is one particular approach to designing a ranking function. We're going to talk about how we use the general framework of the vector space model as guidance to instantiate the framework and derive a specific ranking function, and we're going to cover the simplest instantiation of the framework.

As we discussed in the previous lecture, the vector space model is really a framework, and it leaves many things unspecified. For example, it did not say how we should define the dimensions. It also did not say how we place a document vector in this space, or how we place a query vector in this vector space. And, finally, it did not say how we should measure the similarity between the query vector and the document vector. So you can imagine that, in order to implement this model, we have to say specifically how we compute these vectors. What exactly is xi? And what exactly is yi? This will determine where we place a document vector and where we place a query vector. And, of course, we also need to say exactly what the similarity function should be. So if we can provide a definition of the concepts that define the dimensions and these xi's and yi's, namely the weights of terms for queries and documents, then we will be able to place document vectors and query vectors in a well-defined space. And if we also specify the similarity function, then we'll have a well-defined ranking function.

So let's see how we can do that and think about the instantiation. Actually, I would suggest you pause the lecture at this point and spend a couple of minutes to think about it. Suppose you are asked to implement this idea.
You have come up with the idea of the vector space model, but you still haven't figured out how to compute these vectors exactly, or how to define the similarity function. What would you do? So, think for a couple of minutes, and then proceed.

Let's think about the simplest ways of instantiating this vector space model. First, how do we define the dimensions? Well, the obvious choice is to use each word in our vocabulary to define a dimension. Say there are N words in our vocabulary; then there are N dimensions, with each word defining one dimension. This is basically the bag-of-words representation.

Now let's look at how we place vectors in this space. Again, the simplest strategy here is to use a bit vector to represent both the query and a document. That means each element, xi and yi, will take a value of either 0 or 1. When it's 1, it means the corresponding word is present in the document or in the query. When it's 0, it means the word is absent. So you can imagine that if the user types in a few words as the query, the query vector will have only a few 1's and many, many 0's. The document vector will generally have more 1's, of course, but it will also have many 0's, since the vocabulary is generally very large. Many words don't occur in a particular document at all, and many words occur only occasionally, so a lot of words will be absent from any given document.

So now we have placed the documents and the query in the vector space. Let's look at how we measure the similarity. A commonly used similarity measure here is the dot product. The dot product of two vectors is simply defined as the sum of the products of the corresponding elements of the two vectors. So it's the product of x1 and y1, plus x2 multiplied by y2, and so on, up to xn multiplied by yn; then we take the sum. That's the dot product, and we can write it more generally as a sum over all the dimensions: sim(q, d) = x1*y1 + x2*y2 + ... + xn*yn.
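The bit-vector representation and dot-product similarity just described can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture; the sample vocabulary and texts are made up for the sketch:

```python
def bit_vector(text, vocabulary):
    """Map a text to a 0/1 vector over the vocabulary: 1 if the word occurs, else 0."""
    words = set(text.lower().split())
    return [1 if term in words else 0 for term in vocabulary]

def dot_product(x, y):
    """Similarity = sum of products of corresponding elements."""
    return sum(xi * yi for xi, yi in zip(x, y))

# Hypothetical toy vocabulary and texts, chosen just for illustration.
vocabulary = ["news", "about", "presidential", "campaign", "food", "text"]
q = bit_vector("news about presidential campaign", vocabulary)
d = bit_vector("news about food", vocabulary)
print(dot_product(q, d))  # 2: the document shares "news" and "about" with the query
```

Note that word frequency plays no role here: a word occurring ten times produces the same 1 in the vector as a word occurring once.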
So this is only one of the many different ways of measuring similarity. Now we see that we have defined the dimensions, we have defined the vectors, and we have also defined the similarity function. So we finally have the simplest vector space model, which is based on the bit-vector representation, the dot-product similarity, and the bag-of-words assumption. The formula looks like this. And that's actually a particular retrieval function, a ranking function, right? We can finally implement this function in a programming language and then rank the documents for a query.

Now, at this point you should again pause the lecture to think about how we can interpret this score. We have gone through the process of modeling the retrieval problem using a vector space model, and then we made assumptions about how we place vectors in the vector space and how we define the similarity. So in the end, we've got the specific retrieval function shown here. The next step is to think about whether this retrieval function actually makes sense, right? Can we expect this function to perform well when we use it to rank documents for users' queries? So it's worth thinking about what this value is that we are calculating. In the end, we'll get a number. But what does this number mean? Is it meaningful? Spend a couple of minutes to think about that. And, of course, the general question here is: do you believe this is a good ranking function? Would it actually work well? So, again, think about how to interpret this value. Is it actually meaningful? Does it mean something? It should be related to how well the document matches the query.

So, in order to assess whether this simplest vector space model actually works well, let's look at an example. Here I show some sample documents and a sample query. The query is "news about the presidential campaign". And we have five documents here. They cover different terms in the query.
And if you look at these documents for a moment, you may realize that some documents are probably relevant, and some others are probably not relevant. Now, if I asked you to rank these documents, how would you rank them? This is basically our ideal ranking, where humans examine the documents and then try to rank them. So think for a moment, take a look at this slide, and perhaps pause the lecture.

I think most of you would agree that d4 and d3 are probably better than the others because they really cover the query well: they match news, presidential, and campaign. So it looks like these documents are probably better than the others, and they should be ranked on top. The other three, d2, d1, and d5, are really not relevant. So we can also say d4 and d3 are relevant documents, and d1, d2, and d5 are non-relevant.

Now let's see if our simplest vector space model can do the same, or at least something close. Let's first think about how we actually use this model to score documents. Here I show two documents, d1 and d3, and we also have the query. In the vector space model, of course, we want to first compute the vectors for these documents and the query. I show the vocabulary here as well; these are the N dimensions that we'll be thinking about. So what do you think is the vector for the query? Note that we're assuming we only use 0 and 1 to indicate whether a term is absent or present in the query or in the document; these are 0/1 bit vectors. So what do you think is the query vector? Well, the query has four words here, so for these four words there will be a 1, and for the rest there will be 0's. Now, what about the documents? It's the same. d1 has two of these words, news and about, so there are two 1's there, and the rest are 0's. Similarly for d3. So now that we have the two vectors, let's compute the similarity, and we're going to use the dot product.
So you can see that when we use the dot product, we just multiply the corresponding elements, right? These two will form one product, these two will generate another product, these two yet another product, and so on and so forth. Now you can easily see that if we do that, we actually don't have to care about the zeroes, because whenever we have a zero the product will be zero. So when we take a sum over all these pairs, the zero entries will be gone; as long as one element of a pair is zero, the product is zero. So, in fact, we're just counting how many pairs of 1 and 1 there are. In this case, we have two, so the result will be 2.

So what does that mean? Well, it means this number, the value of this scoring function, is simply the count of how many unique query terms are matched in the document. If a term is matched in the document, there will be two 1's. If it's not, there will be a 0 on the document side. Similarly, if the document has a term that is not in the query, there will be a 0 in the query vector, so those don't count either. As a result, this scoring function basically measures how many unique query terms are matched in a document. This is how we interpret the score.

Now, we can also take a look at d3. In this case, you can see the result is 3, because d3 matched three distinct query words, news, presidential, and campaign, whereas d1 matched only two. In this case, it seems reasonable to rank d3 above d1, and this simplest vector space model indeed does that. So that looks pretty good. However, if we examine this model in detail, we will likely find some problems. So here I'm going to show the scores for all five documents, and you can easily verify they're correct, because we're basically counting the number of unique query terms matched in each document. Note that this measure actually makes sense, right?
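This scoring rule can be checked with a short sketch. The document texts below are hypothetical stand-ins for d1 and d3 (the slide's full texts aren't reproduced here), constructed so that d1 matches two query words and d3 matches three:

```python
def score(query, doc, vocabulary):
    """Bit-vector dot product: counts unique query terms present in the document."""
    q = {w for w in query.lower().split()}
    d = {w for w in doc.lower().split()}
    return sum((t in q) * (t in d) for t in vocabulary)

query = "news about presidential campaign"
vocab = ["news", "about", "presidential", "campaign"]  # only query terms can contribute

d1 = "here is some news about something else"   # hypothetical text: matches news, about
d3 = "news on the presidential campaign trail"  # hypothetical text: matches 3 query words

print(score(query, d1, vocab), score(query, d3, vocab))  # 2 3
```

Restricting the loop to the query's own terms is just an optimization: every non-query dimension multiplies by a 0 in the query vector, so it can never contribute to the sum.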
It basically means that if a document matches more unique query terms, the document will be assumed to be more relevant, and that seems to make sense. The problem is that three documents, d2, d3, and d4, are tied with a score of 3. That's a problem because, if you look at them carefully, it seems that d4 should be ranked above d3: d3 mentions presidential only once, but d4 mentions it multiple times. In d3, presidential might be just an incidental mention, but d4 is clearly about the presidential campaign. Another problem is that d2 and d3 also have the same score. But if you look at the three words that are matched, d2 matched news, about, and campaign, whereas d3 matched news, presidential, and campaign. Intuitively, d3 is the better match, because matching presidential is more important than matching about, even though about and presidential are both in the query. So intuitively we would like d3 to be ranked above d2, but this model doesn't do that. That means this model is still not good enough, and we have to solve these problems.

To summarize, in this lecture we talked about how to instantiate the vector space model. We mainly need to do three things. One is to define the dimensions. The second is to decide how to place documents as vectors in the vector space, and also how to place a query in the vector space as a vector. And the third is to define the similarity between two vectors, particularly between the query vector and the document vector. We also talked about a very simple way to instantiate the vector space model; indeed, it's probably the simplest vector space model we can derive. In this case, we use each word to define a dimension, and we use a 0/1 bit vector to represent a document or a query, so we basically only care about word presence or absence and ignore frequency. And we use the dot product as the similarity function.
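The tie between d3 and d4 follows directly from ignoring frequency: the bit-vector dot product reduces to a set intersection, so repeating a query word changes nothing. The texts below are hypothetical, engineered only to reproduce the tie described above:

```python
def score(query, doc):
    """Bit-vector dot product, written as the set intersection it reduces to."""
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms)

query = "news about presidential campaign"
# Hypothetical stand-ins: d4 stresses "presidential" far more than d3 does.
d3 = "news of the presidential campaign"
d4 = "news of presidential campaign presidential candidates presidential debates"

print(score(query, d3), score(query, d4))  # 3 3 -- tied, despite d4's emphasis
```

Breaking such ties is exactly what term-frequency weighting, covered in the next lecture, is for.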
With such an instantiation, we showed that the scoring function basically scores a document by the number of distinct query words matched in the document. We also showed that such a simple vector space model still doesn't work well, and we need to improve it. That is the topic we're going to cover in the next lecture.