[SOUND] This lecture is about the vector space retrieval model. We're going to give an introduction to its basic idea. In the last lecture, we talked about the different ways of designing a retrieval model, which would give us a different arranging function. In this lecture, we're going to talk about a specific way of designing a ramping function called a vector space retrieval model. And we're going to give a brief introduction to the basic idea. Vector space model is a special case of similarity based models as we discussed before. Which means we assume relevance is roughly similarity, between the document and the query. Now whether is this assumption is true is actually a question. But in order to solve the search problem, we have to convert the vague notion of relevance into a more precise definition that can be implemented with the program analogy. So in this process, we have to make a number of assumptions. This is the first assumption that we make here. Basically, we assume that if a document is more similar to a query than another document. Then the first document will be assumed it will be more relevant than the second one. And this is the basis for ranking documents in this approach. Again, it's questionable whether this is really the best definition for randoms. As we will see later there are other ways to model randoms. The basic idea of vectors for base retrieval model is actually very easy to understand. Imagine a high dimensional space where each dimension corresponds to a term. So here I issue a three dimensional space with three words, programming, library and presidential. So each term here defines one dimension. Now we can consider vectors in this, three dimensional space. And we're going to assume that all our documents and the query will be placed in this vector space. So for example, on document might be represented by this vector, d1. Now this means this document probably covers library and presidential, but it doesn't really talk about programming. What does this mean in terms of representation of document? That just means we're going to look at our document from the perspective of this vector. We're going to ignore everything else. Basically, what we see here is only the vector root condition of the document. Of course, the document has all information. For example, the orders of words are [INAUDIBLE] model and that's because we assume that the [INAUDIBLE] of words will [INAUDIBLE]. So with this presentation you can really see d1 simply suggests a [INAUDIBLE] library. Now this is different from another document which might be recommended as a different vector, d2 here. Now in this case, the document that covers programming and library, but it doesn't talk about presidential. So what does this remind you? Well you can probably guess the topic is likely about program language and the library is software lab library. So this shows that by using this vector space reproduction, we can actually capture the differences between topics of documents. Now you can also imagine there are other vectors. For example, d3 is pointing into that direction, that might be a presidential program. And in fact we can place all the documents in this vector space. And they will be pointing to all kinds of directions. And similarly, we're going to place our query also in this space, as another vector. And then we're going to measure the similarity between the query vector and every document vector. So in this case for example, we can easily see d2 seems to be the closest to this query vector. And therefore, d2 will be rendered above others. So this is basically the main idea of the vector space model. So to be more precise, vector space model is a framework. In this framework, we make the following assumptions. First, we represent a document and query by a term vector. So here a term can be any basic concept. For example, a word or a phrase or even n gram of characters. Those are just sequence of characters inside a word. Each term is assumed that will be defined by one dimension. Therefore n terms in our vocabulary, we define N-dimensional space. A query vector would consist of a number of elements corresponding to the weights on different terms. Each document vector is also similar. It has a number of elements and each value of each element is indicating the weight of the corresponding term. Here, you can see, we assume there are N dimensions. Therefore, they are N elements each corresponding to the weight on the particular term. So the relevance in this case will be assumed to be the similarity between the two vectors. Therefore, our ranking function is also defined as the similarity between the query vector and document vector. Now if I ask you to write a program to implement this approach in a search engine. You would realize that this was far from clear. We haven't said a lot of things in detail, therefore it's impossible to actually write the program to implement this. That's why I said, this is a framework. And this has to be refined in order to actually suggest a particular ranking function that you can implement on a computer. So what does this framework not say? Well, it actually hasn't said many things that would be required in order to implement this function. First, it did not say how we should define or select the basic concepts exactly. We clearly assume the concepts are orthogonal. Otherwise, there will be redundancy. For example, if two synonyms or somehow distinguish it as two different concepts. Then they would be defining two different dimensions and that would clearly cause redundancy here. Or all the emphasizing of matching this concept, because it would be as if you match the two dimensions when you actually matched one semantic concept. Secondly, it did not say how we exactly should place documents and the query in this space. Basically that show you some examples of query and document vectors. But where exactly should the vector for a particular document point to? So this is equivalent to how to define the term weights? How do you compute the lose element values in those vectors? This is a very important question, because term weight in the query vector indicates the importance of term. So depending on how you assign the weight, you might prefer some terms to be matched over others. Similarly, the total word in the document is also very meaningful. It indicates how well the term characterizes the document. If you got it wrong then you clearly don't represent this document accurately. Finally, how to define the similarity measure is also not given. So these questions must be addressed before we can have a operational function that we can actually implement using a program language. So how do we solve these problems is the main topic of the next lecture. [MUSIC]