One of the differences between NLP (Natural Language Processing) and other machine learning tasks is the vector representation of the input.
For example, an RGB image can be represented as a $28 \times 28 \times 3$ tensor.
Then how about text?
How can you represent sentences like “I eat an apple” or “I ate an apple”?
In this post, I will introduce how to represent words by embedding them into a vector space.
A sentence can then be represented as a combination of words.
Word representation
To process sentences on a machine, we need to convert sentences into vectors.
One-hot representation
One of the most widely used word representations is the one-hot representation:
a word is labeled as a binary vector that is all zeros except at the index assigned to that word.
Using this representation you can distinguish the words, but you can’t calculate a meaningful distance between them. The word distance between man and woman should be closer than the word distance between woman and orange.
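As a rough sketch (using a tiny, made-up vocabulary), one-hot vectors can be built with numpy; the sketch also shows why distances are not informative, since every pair of distinct one-hot vectors is exactly the same distance apart.

```python
import numpy as np

# Toy vocabulary: word -> integer index (made up for illustration).
vocab = {"man": 0, "woman": 1, "king": 2, "queen": 3, "apple": 4, "orange": 5}

def one_hot(word, vocab):
    """Return a binary vector that is all zeros except at the word's index."""
    v = np.zeros(len(vocab))
    v[vocab[word]] = 1.0
    return v

man = one_hot("man", vocab)
woman = one_hot("woman", vocab)
orange = one_hot("orange", vocab)

# Every pair of different words is equally far apart: sqrt(2).
print(np.linalg.norm(man - woman))     # 1.414...
print(np.linalg.norm(woman - orange))  # 1.414...
```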
Word embedding
The idea of word embedding is to featurize the word vector: each word is represented by a dense vector of learned features.
A word is then described by these featurized factors, and words with similar meanings end up with similar representations.
\[\begin{array}{r|rrrrrr} \text{} & man & woman & king & queen & apple & orange \\ \hline \text{Gender} & -1 & 1 & -0.96 & 0.97 & 0.00 & 0.01 \\ \text{Royal} & 0.01 & 0.02 & 0.93 & 0.95 & 0.00 & 0.01 \\ \text{Age} & 0.03 & 0.00 & 0.70 & 0.69 & 0.03 & -0.02 \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ \text{Food} & 0.09 & 0.01 & 0.01 & 0.02 & 0.95 & 0.96 \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ \end{array}\]
Normally, t-SNE is one of the effective ways to visualize the embedded words, for example by projecting $300D \to 2D$.
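For example, here is a minimal sketch of such a visualization with scikit-learn's t-SNE; the `words` list and the random `embeddings` matrix are stand-ins for real learned vectors.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Hypothetical inputs: 300-dimensional embeddings for a handful of words.
words = ["man", "woman", "king", "queen", "apple", "orange"]
embeddings = np.random.randn(len(words), 300)  # stand-in for trained embeddings

# Project 300D -> 2D. perplexity must be smaller than the number of samples.
coords = TSNE(n_components=2, perplexity=3, random_state=0).fit_transform(embeddings)

plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), w in zip(coords, words):
    plt.annotate(w, (x, y))
plt.show()
```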
Transfer learning and word embeddings
- Learn word embeddings from a large text corpus (or download a pre-trained embedding online; see the sketch after this list)
- Transfer the embedding to a new task with a smaller training set
- (Optional) Continue to fine-tune the word embeddings with the new data set
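A minimal sketch of the first step, assuming a pre-trained GloVe text file such as `glove.6B.300d.txt` (one word followed by its 300 values per line) has already been downloaded; the file name and path are assumptions.

```python
import numpy as np

def load_glove(path):
    """Parse a GloVe-style text file: each line is `word v1 v2 ... v300`."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return embeddings

# Hypothetical path to the downloaded file.
word_to_vec = load_glove("glove.6B.300d.txt")
print(word_to_vec["king"].shape)  # (300,)
```

The loaded vectors can then be used to initialize the embedding layer of the new task's model and, optionally, fine-tuned on the new data.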
Analogies
The distance between man and woman is similar to the distance between king and queen.
\[\begin{align} e_{man} - e_{woman} &= \begin{bmatrix} -2 \\ 0 \\ \vdots \\ 0 \end{bmatrix} \\ e_{king} - e_{queen} &= \begin{bmatrix} -2 \\ 0 \\ \vdots \\ 0 \end{bmatrix} \end{align}\]
Linguistic regularities in continuous space word representations
$e_{man} - e_{woman} \approx e_{king} - e_{?}$
The missing word can be found by searching for the word $w$ that maximizes the similarity between $e_w$ and $e_{king} - e_{man} + e_{woman}$, for example with
\[sim(u, v) = u^Tv\]
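A minimal sketch of this search, using a hypothetical `word_to_vec` dictionary and cosine similarity (the inner product of length-normalized vectors, which is more common in practice than the raw inner product above):

```python
import numpy as np

def cosine_sim(u, v):
    """Cosine similarity: inner product of the length-normalized vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def solve_analogy(a, b, c, word_to_vec):
    """Find w such that a : b is like c : w, i.e. argmax sim(e_w, e_c - e_a + e_b)."""
    target = word_to_vec[c] - word_to_vec[a] + word_to_vec[b]
    candidates = (w for w in word_to_vec if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine_sim(word_to_vec[w], target))

# Hypothetical embeddings; with real trained vectors this should return "queen".
word_to_vec = {w: np.random.randn(300) for w in ["man", "woman", "king", "queen", "apple"]}
print(solve_analogy("man", "woman", "king", word_to_vec))
```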
Embedding matrix
Consider an embedding matrix $E$ of size $300 \times 10{,}000$, i.e. a 300-dimensional embedding with a 10,000-word vocabulary. The embedding $e_j$ for the $j$-th word is obtained by multiplying $E$ with the one-hot vector $O_j$:
\[E \cdot O_{j} = \begin{bmatrix} a_{1,1} & \dots & a_{1,j} & \dots & a_{1, 10000} \\ a_{2,1} & \dots & a_{2,j} & \dots & a_{2, 10000} \\ \vdots & & \vdots & & \vdots \\ a_{300,1} & \dots & a_{300,j} & \dots & a_{300, 10000} \end{bmatrix} \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 1 \\ \vdots \\ 0 \end{bmatrix} = e_{j}\]
But multiplying with a one-hot vector is not efficient.
In practice, we use a specialized lookup function to fetch the embedding column directly instead of multiplying with a one-hot vector.
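A small numpy sketch of the difference, with a random $300 \times 10{,}000$ matrix standing in for a trained embedding matrix:

```python
import numpy as np

V, N = 10000, 300            # vocabulary size, embedding dimension
E = np.random.randn(N, V)    # embedding matrix, one column per word
j = 1234                     # index of the word we want

# Inefficient: a full matrix-vector product just to select one column.
o_j = np.zeros(V)
o_j[j] = 1.0
e_j_slow = E @ o_j

# Efficient: directly look up the j-th column.
e_j_fast = E[:, j]

print(np.allclose(e_j_slow, e_j_fast))  # True
```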
Language model
How can we use word embedding for a language model?
The embedding of an input word is fed into a softmax layer over the vocabulary, and the output $\hat y$ is the probability that, if you randomly pick a word near “ants”, that word is “car”.
Word2vec
Let’s look at word embedding in more detail with the word2vec model.
There are two ways to train word2vec:
- CBOW
- Skip gram
Before explaining word2vec with multiple context words, let’s look at the details of the one-word-context case.
One-word context
- $V$: Vocabulary size
- $N$: Hidden layer size
- $h$: hidden layer vector ($N$-dimensional)
- $W$: input to hidden weight matrix
- $v_{w_I}$: the row of $W$ corresponding to the input word $w_I$, i.e. the vector representation of the input word
- $v^\prime_{w_j}$: $j$-th column of the matrix $W^\prime$
- $W^\prime $: hidden to output weight matrix
- $E$: loss function
- $EH$: $N$-dim vector, it is the sum of the output vectors of all words in vocabulary, weighted by their prediction error
- $e$: prediction error
Forward propagation
\[\begin{align} & h = W^T x = W^T_{(k, \cdot)} = v^T_{w_I} \\ & u_j = {v^\prime_{w_j}}^T h \\ & y_j = softmax(u_j)\\ \\ & x_k \xrightarrow[{W_{V \times N}}]{} h_i \xrightarrow[W^{\prime}_{N \times V}]{} u_j \xrightarrow[softmax]{} y_j\\ \\ & P(w_j \vert w_I) = y_j = \frac{ \exp({v^{\prime}_{w_j}}^Tv_{w_I}) } { \sum_{j^\prime=1}^V \exp\left( {v^\prime_{w_{j^\prime}}}^T v_{w_I} \right) } \end{align}\]
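A minimal numpy sketch of this forward pass for a single input word at index `k`; the random weight matrices are purely illustrative.

```python
import numpy as np

V, N = 10000, 300                        # vocabulary size, hidden (embedding) size
W = np.random.randn(V, N) * 0.01         # input -> hidden weights
W_prime = np.random.randn(N, V) * 0.01   # hidden -> output weights

def forward(k):
    """Forward pass of the one-word-context model for input word index k."""
    h = W[k]                  # h = W^T x selects row k of W, i.e. v_{w_I}
    u = W_prime.T @ h         # u_j = v'_{w_j}^T h for every output word j
    y = np.exp(u - u.max())   # softmax (shifted for numerical stability)
    y /= y.sum()
    return h, u, y

h, u, y = forward(k=42)
print(y.shape, y.sum())       # (10000,) 1.0
```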
Backward propagation
Define the loss function for the derivation:
\[\begin{align} \max p(w_O \vert w_I) &= \max y_{j^*} \\ &= \max \log y_{j^*} \\ \log y_{j^*} &= u_{j^*} - \log \sum_{j^\prime = 1}^V \exp(u_{j^\prime}) := -E \end{align}\]
where $j^*$ is the index of the actual output word in the output layer.
Update $W^\prime$
\[\begin{align} & t_j= \begin{cases} 1 & \text{where }j = j^* \\ 0 &\text{otherwise} \\ \end{cases}\\ \\ &\frac{\partial E}{\partial u_j} = y_j - t_j := e_j\\ \\ & \frac{\partial E}{\partial w^\prime_{ij}} = \frac{\partial E}{\partial u_j} \cdot \frac{\partial u_j}{\partial w^\prime_{ij}} = e_j \cdot h_i \\ \\ & {w^\prime_{ij}}^{(new)} = {w^\prime_{ij}}^{(old)} - \eta \cdot e_j \cdot h_{i} \\ &\text{or} \\ & {v^\prime_{w_{j}}}^{(new)} = {v^\prime_{w_j}}^{(old)} - \eta \cdot e_j \cdot h &\text{for }j=1,2,\dots, V \\ \end{align}\]
Update $W$
\[\begin{align} & \frac{\partial E}{\partial h_i} = \sum_{j=1}^V \frac{\partial E}{\partial u_j} \cdot \frac{\partial u_j}{\partial h_i}= \sum_{j=1}^V e_j \cdot w_{ij}^\prime := EH_i \\ & \frac{\partial E}{\partial w_{ki}} = \frac{\partial E}{\partial h_i} \cdot \frac{\partial h_i}{\partial w_{ki}} = EH_i \cdot x_k\\ & \frac{\partial E}{\partial W} = x \otimes EH = x \cdot EH^T \\ & v^{(new)}_{w_I} = v^{(old)}_{w_I} - \eta EH^T \end{align}\]
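A minimal sketch of one SGD step that follows these update equations; it repeats the forward pass from the previous sketch so it is self-contained, and all values are random and illustrative.

```python
import numpy as np

V, N, eta = 10000, 300, 0.025            # vocabulary size, hidden size, learning rate
W = np.random.randn(V, N) * 0.01         # input -> hidden weights
W_prime = np.random.randn(N, V) * 0.01   # hidden -> output weights

def sgd_step(k, j_star):
    """One update for the training pair (input word k, actual output word j_star)."""
    # Forward pass, as in the previous sketch.
    h = W[k]
    u = W_prime.T @ h
    y = np.exp(u - u.max())
    y /= y.sum()

    t = np.zeros(V)
    t[j_star] = 1.0
    e = y - t                            # prediction error e_j = y_j - t_j

    EH = W_prime @ e                     # EH_i = sum_j e_j * w'_{ij}
    W_prime[:] -= eta * np.outer(h, e)   # w'_{ij} <- w'_{ij} - eta * e_j * h_i
    W[k] -= eta * EH                     # v_{w_I} <- v_{w_I} - eta * EH

sgd_step(k=42, j_star=7)
```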
CBOW (Continuous Bag-of-Words) Model
When the sentence is below and the selected target word is brown:
The quick [brown] fox jumps over the lazy dog.
With a window size of 2, the two words on the left and the two words on the right of the target word are selected as context words.
Context | Target |
---|---|
the | brown |
quick | brown |
fox | brown |
jumps | brown |
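A small sketch that generates these (context, target) pairs from the tokenized sentence with a window size of 2, listing each context word separately as in the table:

```python
def cbow_pairs(tokens, window=2):
    """Yield (context_word, target_word) pairs: surrounding words predict the center word."""
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield tokens[j], target

sentence = "the quick brown fox jumps over the lazy dog".split()
print([p for p in cbow_pairs(sentence) if p[1] == "brown"])
# [('the', 'brown'), ('quick', 'brown'), ('fox', 'brown'), ('jumps', 'brown')]
```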
Skip gram Model
When the sentence is below and the selected context word is brown:
The quick [brown] fox jumps over the lazy dog.
With a window size of 2, the two words on the left and the two words on the right of the context word are selected as target words.
Context | Target |
---|---|
brown | the |
brown | quick |
brown | fox |
brown | jumps |
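The same idea in reverse yields the skip-gram pairs, where the selected context (center) word predicts each nearby word:

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center_word, target_word) pairs: the center word predicts its neighbours."""
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]

sentence = "the quick brown fox jumps over the lazy dog".split()
print([p for p in skipgram_pairs(sentence) if p[0] == "brown"])
# [('brown', 'the'), ('brown', 'quick'), ('brown', 'fox'), ('brown', 'jumps')]
```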
The goal of skip-gram is to learn embeddings by predicting the nearby words, here the two words behind and the two words ahead, from the selected context word.
Model
- Vocabulary size: 10K
- Embedding size: 300
Normally, we can use a softmax loss function to train the embedding matrix.
But the problem with softmax classification is that the denominator requires a sum over the entire 10,000-word vocabulary, so it is slow to compute.
Optimization
With a huge vocabulary, the training process is too slow, because each training pair updates the full softmax over the vocabulary even though it only carries information about a single pair of words.
Negative sampling
Negative sampling is the idea of additionally training on a few sampled words that are not related to the context.
Sampling negative cases improves training time, because only the positive word and the few sampled negative words are updated instead of the whole vocabulary.
Context | Word | is_target |
---|---|---|
orange | juice | 1 |
orange | king | 0 |
orange | fox | 0 |
orange | jumps | 0 |
If we don’t apply negative sampling, we can only train on the pair (orange, juice). But with negative sampling we can also train on the pairs (orange, king), (orange, fox), and (orange, jumps) with a negative label. This increases the distance between orange and the words king, fox, and jumps.
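A minimal sketch of building the rows above, where `k` negative words are drawn for each positive pair; for simplicity the negatives are sampled uniformly here, whereas the original word2vec implementation samples from the unigram distribution raised to the 3/4 power.

```python
import random

# Toy vocabulary, made up for illustration.
vocab = ["orange", "juice", "king", "fox", "jumps", "apple", "man", "woman"]

def with_negative_samples(context, positive_word, k=3):
    """Return (context, word, is_target) rows: one positive pair plus k negative pairs."""
    rows = [(context, positive_word, 1)]
    negatives = [w for w in vocab if w not in (context, positive_word)]
    # Note: word2vec samples negatives from the unigram distribution ** 0.75,
    # not uniformly; uniform sampling keeps this sketch simple.
    for w in random.sample(negatives, k):
        rows.append((context, w, 0))
    return rows

for row in with_negative_samples("orange", "juice"):
    print(row)
```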
Subsampling
The most frequent words (the, and, or, ...) may not be helpful for training word contexts, because these words appear in most sentences.
\[P(w_i) = \left(\sqrt{\frac{z(w_i)}{\epsilon}} + 1\right) \times \frac{\epsilon}{z(w_i)}\]
- $z(w_i)$ is the frequency of the word $w_i$, i.e. the fraction of the corpus made up of $w_i$
- $\epsilon$ can be $0.01, 0.001, \dots$
$P(w_i)$ is the probability of keeping (sampling) the word $w_i$ in the training data.
In a code implementation, we keep the word $w_i$ when a random number in $[0, 1)$ is smaller than $P(w_i)$, so very frequent words are often dropped.
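A minimal sketch of this subsampling rule, assuming `z` maps each word to its corpus frequency; the toy frequencies below are made up.

```python
import math
import random

eps = 0.001  # the sampling parameter epsilon

def keep_prob(z_wi, eps=eps):
    """P(w_i): probability of keeping word w_i, given its corpus frequency z(w_i)."""
    return (math.sqrt(z_wi / eps) + 1) * eps / z_wi

def subsample(tokens, z):
    """Randomly drop frequent words; z[w] is the fraction of the corpus that is w."""
    return [w for w in tokens if random.random() < keep_prob(z[w])]

# Toy frequencies: "the" is very common, "fox" is rare (assumed values).
z = {"the": 0.05, "quick": 0.002, "brown": 0.001, "fox": 0.0005}
print(keep_prob(z["the"]))  # about 0.16, so "the" is usually dropped
print(keep_prob(z["fox"]))  # greater than 1, so "fox" is always kept
print(subsample("the quick brown fox".split(), z))
```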
Conclusion
This post covered the basic idea of word vector representation.
The goal of a word vector is to represent a word as a dense vector, so that similar vectors correspond to words with similar meanings.
Word embedding is also a good way to avoid the curse of dimensionality that comes with one-hot vectors.