In the previous module, we trained a model to perform text classification, discussed the problems with treating words as categorical features, and demonstrated how embedding layers can be added to models to improve the quality of representations and their usefulness. However, the embedding layers that we trained required labeled data, because we trained them as part of our classification model. Labeled data is expensive and precious, and often we simply don't have enough of it. In the previous course on image models, we talked about a number of strategies for what to do when you don't have enough data. Which of those were techniques for dealing with data scarcity? Everything except ensembling is a valid technique for dealing with data scarcity. In this module, we'll talk about how one of these techniques, transfer learning, applies to natural language. You'll learn how researchers in other disciplines have historically constructed embeddings of words without training on a supervised task; about recent techniques called GloVe and Word2Vec that are inspired by those approaches; how you can easily make use of pre-trained Word2Vec embeddings using TensorFlow Hub; and how your task and the amount of data you have determine how you should use Word2Vec and GloVe in your model.

There's a long history of people quantifying word meaning and constructing some sort of embedding. When I say quantifying word meaning, I mean mapping words to numbers. What's interesting is that if you consider the extent to which these approaches have relied on domain knowledge, you can see a story similar to the one we've discussed about feature engineering within machine learning. Whereas early on researchers attempted to impose their beliefs about words on the representations they created, over time they ceded this power to the models themselves.

The effort to represent words quantitatively began in the 1950s in the psychology community. Psychologists acted just like ML practitioners did and took a prescriptive view of the way words varied. After coming up with a set of 50 dimensions, each of which was a scale between two adjectives, they asked a set of human subjects to rate words along each one. For example, one of the scales they came up with was small to large. They then averaged the ratings to get their vector representations. For example, for the word "polite", the subjects gave it a 4.9 on a seven-point scale from angular to rounded. But human subjects are expensive, and this process was hard to scale.

Beginning in the late 1980s, researchers began exploring methods of creating numerical representations of word meaning that didn't require any human labeling. At the core of their approaches was an idea called the distributional hypothesis, which states that the meaning of a word can be found in its usage. It's an idea that is very intuitive. When you hear an unfamiliar word in conversation, you look to see how it was used to figure out its meaning, and every other word becomes evidence. Suppose you hear that the child mooped in the yard: you'll know the syntactic role that mooped is playing immediately. Then you see child, and you think, what sorts of things do children do? Then you think, what types of things do people do in yards? When you intersect all these pieces of evidence, you arrive at a best guess for the meaning of the word. One of the first of these approaches came from researchers who were trying to solve the problem of ranking documents relative to a query.
Their approach was called Latent Semantic Analysis, and it involved two steps. The first step was to compile a term-document matrix: a table whose rows are terms, whose columns are documents, and whose entries are the frequency with which a particular word occurs in a particular document. One thing you could do is take a row of this matrix, called a term vector, and treat it as the representation of that word. However, researchers didn't do that. Why are term vectors poor word embeddings? There are two reasons. First, the term vectors weren't of high enough quality, because they stemmed from the limited sample of documents that the researchers had access to. Think about the similarity between two terms that don't co-occur in the sample: they will look completely dissimilar, even though those two terms might co-occur in a document outside the sample. Second, think about the size of these vectors: they grow with the number of documents in the sample. Consequently, researchers wanted a lower-dimensional, higher-quality set of vectors.

So what they did was use a technique from linear algebra called matrix factorization. How this works is beyond the scope of this course, but there are two things that are important for you to know. First, matrix factorization takes a matrix like the term-document matrix and creates two matrices, called factors, that can be treated as lower-dimensional representations of the two domains, which in this case are terms and documents. Multiplying these two smaller matrices results in an approximation of the original matrix. Second, matrix factorization is useful in a variety of machine learning scenarios; we'll use it again to build recommendation systems in the next course. Let's say our term-document matrix is called X. The idea is to find two factor matrices, U and V, such that the difference between the product of the two factors and the original matrix is as small as possible. We refer to this sort of error as reconstruction error. The greater the number of dimensions in our factors U and V, the more information they contain, and thus the closer their product will be to the original term-document matrix X. What that meant for researchers was that they had a way to trade off quality against usefulness.

Later, researchers would change this approach so that instead of a term-document matrix, they created a term-term matrix, where every value corresponds to the number of times two words co-occurred. They constructed these matrices by sliding a window over a corpus and treating the words that appeared in the window at the same time as co-occurring. This was consistent with the idea that the context necessary for understanding a word is located in its immediate surroundings, more so than in the document in which it occurs.
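To make those two steps concrete, here is a minimal sketch, not taken from the lecture, that builds a term-document matrix for a toy three-document corpus and factors it with a truncated singular value decomposition. The corpus, the choice of rank k, and the use of NumPy's SVD routine as the factorization method are illustrative assumptions, not the exact procedure the researchers used.

import numpy as np

# Toy corpus: three tiny "documents" (an assumption for illustration).
docs = [
    "the child played in the yard",
    "the child read a book",
    "a dog played in the yard",
]

# Step 1: term-document matrix X, rows = terms, columns = documents,
# entries = how often each term occurs in each document.
vocab = sorted({w for d in docs for w in d.split()})
X = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        X[vocab.index(w), j] += 1

# Step 2: factor X into two low-dimensional matrices whose product
# approximates X, here via a rank-k truncated SVD.
k = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_k = U[:, :k] * s[:k]   # one k-dimensional vector per term
V_k = Vt[:k, :]          # one k-dimensional vector per document

# Reconstruction error: how far the product of the factors is from X.
reconstruction_error = np.linalg.norm(X - U_k @ V_k)
print("term vector for 'child':", U_k[vocab.index("child")])
print("reconstruction error:", reconstruction_error)

Increasing k shrinks the reconstruction error but makes the term vectors larger, which is the quality-versus-usefulness trade-off described above; the same counting loop could be pointed at a sliding window instead of whole documents to produce a term-term co-occurrence matrix.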