Word2Vec

The performance of ML algorithms depends on the quality of the input data. That is, we want the input representation to capture the relevant properties of the input itself. Problems with input representation are especially evident in the case of discrete features. The classical approach for dealing with discrete features is one-hot encoding.

When we are working with text, the text itself is a collection of discrete features. The problem with one-hot encodings quickly becomes evident: the set of unique words is very large, which leads to a very sparse, high-dimensional representation of words. Moreover, such a representation is not very meaningful on its own.
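As a quick illustration, a one-hot encoding represents each word as a vector of zeros with a single one at the word's index. A minimal sketch with a toy vocabulary (the names here are illustrative only):

import numpy

vocabulary = ["i", "like", "deep", "learning", "nlp"]
word_to_id = {word: idx for idx, word in enumerate(vocabulary)}

def one_hot(word):
    # |V|-dimensional vector with a single 1 at the word's index
    vector = numpy.zeros(len(vocabulary))
    vector[word_to_id[word]] = 1.0
    return vector

one_hot("deep")  # array([0., 0., 1., 0., 0.])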

History of Word Embeddings

An alternative representation that captures more semantic information is the co-occurrence matrix. Consider a corpus that consists of the following three sentences:

  • I like deep learning.
  • I like NLP.
  • I enjoy flying.

Define the context as the central word plus its immediate neighbours. Then we construct the following co-occurrence matrix:

counts     I   like  enjoy  deep  learning  NLP  flying  .
I          0   2     1      0     0         0    0       0
like       2   0     0      1     0         1    0       0
enjoy      1   0     0      0     0         0    1       0
deep       0   1     0      0     1         0    0       0
learning   0   0     0      1     0         0    0       1
NLP        0   1     0      0     0         0    0       1
flying     0   0     1      0     0         0    0       1
.          0   0     0      0     1         1    1       0
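A minimal sketch of how such a matrix can be assembled from the tokenized sentences (symmetric window of one neighbour on each side; variable names are illustrative):

import numpy

sentences = [
    ["I", "like", "deep", "learning", "."],
    ["I", "like", "NLP", "."],
    ["I", "enjoy", "flying", "."],
]
vocabulary = sorted({word for sentence in sentences for word in sentence})
word_to_id = {word: idx for idx, word in enumerate(vocabulary)}

counts = numpy.zeros((len(vocabulary), len(vocabulary)), dtype=int)
for sentence in sentences:
    for i, word in enumerate(sentence):
        # immediate left and right neighbours
        for j in (i - 1, i + 1):
            if 0 <= j < len(sentence):
                counts[word_to_id[word], word_to_id[sentence[j]]] += 1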

Each row of the matrix can be used as a representation of the corresponding word. As opposed to a one-hot encoding, these rows are distributed representations of words. When we have a large corpus to fill this matrix, the representations become meaningful enough to perform semantic clustering of the words. After applying some transformations to such a matrix, Rohde et al. (2005) were able to perform hierarchical clustering of words.

Source: Rohde et al (2005)

Further analysis shows that semantic similarities emerge in co-occurrence representations.

Source: Rohde et al (2005)

The dimensionality of the co-occurrence matrix is very high. A common way to reduce it is by applying SVD or other dimensionality reduction techniques (a sketch is given after the list below). After reducing the dimensionality, we get distributed representations of words. However:

  • Dimensionality reduction is expensive to compute for large matrices. Wikipedia alone has several million unique tokens, and this does not account for n-gram tokens like "Innopolis University".
  • It is hard to add new words.
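For reference, reducing the dimensionality of the co-occurrence matrix with a truncated SVD might look like this (a sketch that reuses the counts matrix built above; N is the target dimensionality):

# full SVD, then keep the top-N singular directions
U, S, Vt = numpy.linalg.svd(counts.astype(float), full_matrices=False)

N = 2
word_vectors = U[:, :N] * S[:N]  # one dense N-dimensional row per word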

These problems disappear if we learn distributed representations without constructing the co-occurrence matrix first. This is what Word2Vec (Skip-Gram) tries to achieve.

Relevant “Historic” Articles:

  • Learning representations by back-propagating errors. (Rumelhart et al., 1986)
  • A neural probabilistic language model (Bengio et al., 2003)
  • NLP (almost) from Scratch (Collobert & Weston, 2008)
  • A recent, even simpler and faster model: word2vec (Mikolov et al. 2013)

Word2Vec

The main idea
Capture co-occurrence relationships indirectly during training, without storing the co-occurrence matrix. Word2Vec implements the Skip-Gram model.
Goal
Predict the surrounding words in the context window.
Objective
Maximize the log probability of the context words given the central word of the context.

Objective

Language Model

A statistical language model is a probability distribution over sequences of words. Given such a sequence, say of length T, it assigns a probability \(P(w_{1},\ldots ,w_{T})\) to the whole sequence.

The log-likelihood can be written as \[L = \log p(w_1, \ldots, w_T)\]

During training, our objective is to maximize the probability of the observed text. Such a joint probability is very hard to compute directly. It is more practical to work with the factorized representation \[L = \log \prod_{i=1}^T p(w_i \mid w_{i-1}, \ldots, w_1) = \sum_{i=1}^T \log p(w_i \mid w_{i-1}, \ldots, w_1)\]

Here each word depends on all the previous words in the sequence. The next step in simplifying the objective would be to reduce the number of variables in the conditional distribution. For example, we can assume that the next word is fully determined by the last \(k\) observed words. There could be other ways to perform this factorization.
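For instance, under such a \(k\)-th order assumption the factorized objective becomes \[L \approx \sum_{i=1}^T \log p(w_i \mid w_{i-k}, \ldots, w_{i-1})\]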

The reason we want to factorize the probability distribution is that a sum of logarithms is much easier to compute, and more numerically stable, than a product of many small probabilities.

Skip-Gram Language Model

Assume the text corpus \(\mathcal{C} = [w_1, w_2, ..., w_T]\), where \(T\) is the number of tokens in the corpus. \(V\) is the vocabulary of the corpus, i.e. the set of unique words, and \(\vert V \vert\) is the size of the vocabulary. The context is defined as a word together with its \(m\) adjacent neighbours on the right and \(m\) adjacent neighbours on the left. Overall, the size of the context window is \(2m+1\). An example of context windows is shown below.

Source

Assume a text is a collection of independent contexts, and the contexts are conditioned on their central words. The log probability of the corpus is \[L = \sum_{t=1}^T \log \sum_{w \in V} p(C_t|w)\, p(w)\]

where \(C_t\) is the context at position \(t\). This objective requires knowledge of the prior \(p(w)\), which we would have to estimate from the corpus. Additionally, this model is hard to factorize because of the log outside the summation. Instead of going through these troubles, Skip-Gram relaxes this language model and defines the objective as the expected conditional log-probability of the context \[\begin{aligned} J(\theta) &= \mathbb{E}_{w\sim P_\mathcal{C}} \log p_\theta(C|w) \\ &= \frac{1}{T}\sum_{t=1}^T \log p_\theta(C|w_t) \end{aligned}\]

where \(P_\mathcal{C}\) is the distribution of words in the current corpus. Now assume that the words in the context are independent of each other given the central word of the context, and we get the Skip-Gram objective \[\begin{aligned} J(\theta) &= \frac{1}{T}\sum_{t=1}^T \log p_\theta(C|w_t) \\ &= \frac{1}{T}\sum_{t=1}^T \log \left( \prod_{-m \leq j \leq m, j \neq 0} p_\theta(w_{t+j}|w_t) \right) \\ &= \frac{1}{T}\sum_{t=1}^T \sum_{-m \leq j \leq m, j \neq 0} \log p_\theta(w_{t+j}|w_t) \end{aligned}\]

To summarize, our assumptions are:

  • The corpus is a collection of independent contexts
  • A context is specified by its central word
  • Words in the context are independent of each other given the central word

There is a question of how to handle the boundary cases at the beginning and the end of the corpus. For practical purposes, any reasonable way of handling these situations is fine.
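As an illustration, the training pairs (central word, context word) can be generated with a sliding window that simply truncates the context at the corpus boundaries (a sketch; truncation is just one acceptable way to handle the boundaries):

def skipgram_pairs(tokens, m):
    # (central word, context word) pairs; the window is truncated at the boundaries
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(max(0, t - m), min(len(tokens), t + m + 1)):
            if j != t:
                pairs.append((center, tokens[j]))
    return pairs

skipgram_pairs(["I", "like", "deep", "learning", "."], m=2)[:4]
# [('I', 'like'), ('I', 'deep'), ('like', 'I'), ('like', 'deep')]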

NN Model

Definition

The Skip-Gram model is implemented as a neural network. Define the input to the network as \(\mathbf{x} \in \mathbb{R}^{ \vert V \vert}\), the output as \(\mathbf{y} \in \mathbb{R}^{\vert V \vert}\), and the hidden layer as \(\mathbf{h} \in \mathbb{R}^{N}\). The conditional probability \(p(w_y \vert w_x)\) is calculated as \[\hat{\mathbf{y}}(\mathbf{x}) = \text{Softmax}(W' \text{proj}(W\mathbf{x}))\]

The corresponding NN architecture is presented below.

Source: Backpropagation for Word2Vec

Here, \(\mathbf{x}\) and \(\mathbf{y}\) are one-hot encodings of the corresponding words in the vocabulary. The first layer is called the projection layer because it projects from \(\mathbb{R}^{\vert V \vert}\) to \(\mathbb{R}^{N}\).
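A minimal numpy sketch of this forward pass (here W is stored as \(\vert V \vert \times N\) and W_prime as \(N \times \vert V \vert\), so that multiplying by a one-hot \(\mathbf{x}\) reduces to selecting a row of W; the sizes are illustrative):

import numpy

V, N = 10000, 300  # vocabulary size and hidden (embedding) dimensionality
W = numpy.random.randn(V, N) * 0.01        # projection layer weights
W_prime = numpy.random.randn(N, V) * 0.01  # output layer weights

def softmax(z):
    e = numpy.exp(z - z.max())
    return e / e.sum()

def forward(word_id):
    h = W[word_id]               # projection: row selection instead of a matrix product
    return softmax(h @ W_prime)  # estimated probabilities, one per vocabulary word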

Loss

Since \(\hat{y}_i\) is the estimate of the probability of the i-th word in the vocabulary, the context probability is calculated as \[p(C|w_t) = \prod_{-m \leq j \leq m, j \neq 0} \hat{y}_{t+j}(\mathbf{x}^{(w_t)})\]

where \(\mathbf{x}^{(w_t)}\) is the one-hot encoding with 1 at the position of the word \(w_t\) in the vocabulary, and \(\hat{y}_{t+j}\) denotes the predicted probability of the word observed at position \(t+j\). The cross-entropy loss for this case would be \[l = -\frac{1}{|T|} \sum_{t}^{T} \frac{1}{|C|} \sum_{-m \leq j \leq m, j \neq 0} \sum_{i=1}^{|V|} y_{i,j} \log \hat{y}_i(\mathbf{x}^{(w_t)})\]

where \(\vert C \vert\) is the number of words in the context, and \(y_{i,j}\) is the correct label \[y_{i,j}=\begin{cases} 1, & \text{if the word at context position } j \text{ is the } i\text{-th word in the vocabulary}\\ 0, & \text{otherwise}. \end{cases}\]
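For a single central word, this loss reduces to averaging the negative log-probabilities of the true context words. A sketch building on the forward function above (context_ids are the vocabulary ids of the words in the window):

def context_loss(center_id, context_ids):
    # cross-entropy over one context window
    y_hat = forward(center_id)
    return -numpy.mean(numpy.log(y_hat[context_ids]))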

Alternative Definition of Conditional Probability

Remember that \(W\) is the projection matrix and \(W'\) is the matrix of weights for the output layer of the neural network. An alternative and equivalent way to define the conditional probability is \[p(w_o|w_i) = \frac{\exp(u_{w_o}^T v_{w_i})}{\sum_{w'=1}^{|V|}\exp(u_{w'}^T v_{w_i})}\]

where \(v\) is the corresponding column of the projection matrix \(W^T\), and \(u\) is the corresponding column of the output layer weight matrix \(W'\).

Embeddings

By the end of training we will have two sets of representations for words:

  • The projection matrix \(W\). Weights in this matrix are referred to as IN embeddings
  • The weights of the output layer \(W'\), referred to as OUT embeddings

Problem with Classical Skip-Gram

For the classical Skip-Gram, the softmax is a computational bottleneck, especially for large vocabularies. For every \(t\) in the outer summation of the cost function, we compute gradients for the whole matrix \(W'\) but only for a single row of \(W\). That slows down the updates of \(W\), which is the matrix that ultimately determines what \(W'\) should be.

Negative Sampling

New Objective

Instead of computing the whole softmax, we try to classify whether two words appear in the same context. The corpus provides us only with information about the positive examples. We sample negative examples from some noise distribution \(P_n\). Thus, for every word position in the corpus we have a set of positive examples defined by the context, and a set of negative examples sampled from \(P_n\).

Given that \(k\) is the number of negative samples, the objective takes the following form \(J(\theta) = \frac{1}{T} \sum_{t=1}^T \left( \sum_{-m \leq j \leq m, j \neq 0} \log\sigma(u_{w_j}^T v_{w_t}) + \sum_{i=1}^{k} \mathbb{E}_{w \sim P_n(w)} \left[ \log\sigma(-u_{w}^T v_{w_t}) \right] \right)\)

The expectation can be estimated with the Monte Carlo method: \(\mathbb{E}_{w \sim P_n(w)} \left[ \log\sigma(-u_{w}^T v_{w_t}) \right] \approx \frac{1}{|S|} \sum_{w_i \in S} \log\sigma(-u_{w_i}^T v_{w_t})\), where \(S\) is a sample from \(P_n(w)\). For the current model, assume the sample size is equal to 1.

Negative sampling is shown to have a positive impact on accuracy on the analogical reasoning task.

Noise Distribution

The noise distribution \(P_n\) is a hyperparameter of the model. The original paper found that the best results on analogical reasoning are achieved when \[P_n(w) \propto U(w)^\frac{3}{4}\]

where \(U\) is the unigram probability (proportional to the frequency count). Raising it to the power of 3/4 reduces the probability weight of frequent words and boosts the probability of infrequent ones. This way, negative sampling is more representative of the whole vocabulary.

Negative sampling can be implemented with the help of numpy.random.choice

import numpy

# word_frequencies, word_ids and k are assumed to be given
unigram = word_frequencies / numpy.sum(word_frequencies)

# raise to the power 3/4 and renormalize
modified_unigram = numpy.power(unigram, 3 / 4)
modified_unigram_weights = modified_unigram / numpy.sum(modified_unigram)

# draw k negative word ids
negative_ids = numpy.random.choice(word_ids, k, p=modified_unigram_weights)

You also need to implement a check that the input word does not appear in the list of negative words.
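One simple way is to redraw until none of the sampled ids coincide with the current positive words (a sketch reusing the variables from the block above):

def sample_negatives(k, exclude_ids):
    # redraw the whole sample while it contains any of the positive (input/context) words
    negatives = numpy.random.choice(word_ids, k, p=modified_unigram_weights)
    while numpy.isin(negatives, exclude_ids).any():
        negatives = numpy.random.choice(word_ids, k, p=modified_unigram_weights)
    return negatives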

Loss

We relaxed the problem from multiclass classification to binary classification of tuples. A tuple consists of a pair of words. The classification model should return 1 when the words appear in the same context, and 0 for negative examples. The new loss is \[loss = - \frac{1}{T} \sum_{t}^{T} \frac{1}{|C|+k} \left[ \sum_{-m \leq j \leq m, j \neq 0} \log \sigma(u_{w_j}^T v_{w_t}) + \sum_{i=1}^k \log \left(1 - \sigma(u_{w_i}^T v_{w_t}) \right) \right]\]

where \(w_i\) are the negative samples, and \(\vert C \vert\) represents the number of positive examples per word. You can see that the summation inside the square brackets is similar to a binary cross-entropy loss, where positive pairs of words have the label 1 and negative pairs have the label 0. This computation can be assisted with sigmoid_cross_entropy_with_logits.
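For one central word, the bracketed term can be computed directly from the IN and OUT vectors. A sketch in numpy (v_center is \(v_{w_t}\), U_context stacks the \(u_{w_j}\) of the positive context words, U_negative stacks the \(u_{w_i}\) of the negative samples):

import numpy

def sigmoid(z):
    return 1.0 / (1.0 + numpy.exp(-z))

def negative_sampling_loss(v_center, U_context, U_negative):
    # positive pairs should score high, negative pairs low
    positive_term = numpy.log(sigmoid(U_context @ v_center)).sum()
    negative_term = numpy.log(1.0 - sigmoid(U_negative @ v_center)).sum()
    return -(positive_term + negative_term) / (len(U_context) + len(U_negative))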

Subsampling

In very large corpora, the most frequent words can easily occur hundreds of millions of times (e.g., “in”, “the”, and “a”). Such words usually provide less information value than the rare words. For example, while the Skip-gram model benefits from observing the co-occurrences of “France” and “Paris”, it benefits much less from observing the frequent co-occurrences of “France” and “the”, as nearly every word co-occurs frequently within a sentence with “the”. This idea can also be applied in the opposite direction; the vector representations of frequent words do not change significantly after training on several million examples.

When discussing the noise distribution for negative samples, we observed that the method works better when frequent words are suppressed. This is true not only for negative samples, but also for positive ones. To balance the word distribution in the training data (positive examples), each word \(w_t\) in the training set is discarded with probability

\[P(w_t) = 1 - \sqrt{\frac{\alpha}{f(w_t)}}\]

where \(\alpha\) is a threshold value (\(10^{-5}\) for a dataset with one billion tokens), and \(f(w_t)\) is the frequency of the word \(w_t\) in the corpus.
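Computed from the corpus statistics, the discard probability for a single word might look like this (a sketch; f_w_t stands for \(f(w_t)\) and alpha for \(\alpha\)):

alpha = 1e-5
p_w_t = 1.0 - numpy.sqrt(alpha / f_w_t)  # discard probability of word w_t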

The code that checks whether a word should be discarded would look like this

# discard w_t with probability p_w_t
should_discard = numpy.random.rand() < p_w_t

Word2Vec Analysis

Analogy

The main evaluation metric in the Word2Vec paper was the analogy task. The most interesting finding is that the word representations exhibit a linear compositional structure. Thus, a vector calculated as \(v_{Beijing} - v_{China} + v_{Russia}\) is approximately equal to \(v_{Moscow}\).

Source: Distributed Representations of Words and Phrases and their Compositionality
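As a sketch, such an analogy query can be answered with a nearest-neighbour search around the composed vector (embeddings is assumed to be a dict from word to vector; cosine similarity is a common choice):

import numpy

def cosine(a, b):
    return a @ b / (numpy.linalg.norm(a) * numpy.linalg.norm(b))

def analogy(a, b, c, embeddings):
    # answer "a is to b as c is to ?", e.g. analogy("China", "Beijing", "Russia", embeddings)
    query = embeddings[b] - embeddings[a] + embeddings[c]
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(embeddings[w], query))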

Compositionality

One of the artifacts of the learned vectors is their capacity for linear composition.

The original paper has the following description of compositionality

The additive property of the vectors can be explained by inspecting the training objective. The word vectors are in a linear relationship with the inputs to the softmax nonlinearity. As the word vectors are trained to predict the surrounding words in the sentence, the vectors can be seen as representing the distribution of the context in which a word appears. These values are related logarithmically to the probabilities computed by the output layer, so the sum of two word vectors is related to the product of the two context distributions. The product works here as the AND function: words that are assigned high probabilities by both word vectors will have high probability, and the other words will have low probability. Thus, if “Volga River” appears frequently in the same sentence together with the words “Russian” and “river”, the sum of these two word vectors will result in such a feature vector that is close to the vector of “Volga River”.

IN and OUT vectors

After training the model, one ends up with two sets of vectors: IN and OUT (matrices \(W\) and \(W'\)). The most common approach is to average them. Some research has explored the relationship between the IN and OUT projections, where \(IN = Wx\) and \(OUT = W'^Tx\).

Define IN-IN as a KNN query where we use a word's IN representation to search for the closest IN representations, and IN-OUT as a KNN query where we use the IN projection to look for the closest points among the OUT projections.

In the table below, we can see that KNN queries within a single projection matrix return closely related words. In the first example, both the IN-IN and OUT-OUT queries return a list of universities. However, the result of the IN-OUT query consists of topically related words.

Source: Mitra et al. (2016)

Drawbacks of Skip-Gram Vectors

  • Hard to interpret
  • Only one representation per word, even though many words have several meanings
  • Cannot handle out-of-vocabulary (OOV) words

Regular vectors

  • GloVe: has performance similar to Word2Vec
  • FastText: fast training, takes morphological information into account, can handle OOV words

Multiprototype vectors

  • AdaGram: each word can be associated with several meanings (vectors). The number of meanings is inferred from the corpus

Contextual Embeddings

  • ULMFiT, ELMo: embed words together with their context. Better for downstream tasks such as text generation

Clusters for years, names, and chemistry terms

© 2022. Vitaly Romanov
