Embeddings have gained a lot of popularity in the Deep Learning world, since they outperform many traditional data representations. One of the most prominent applications is in Natural Language Processing: word embeddings have proven to give better results than the good old discrete bag-of-words representations (e.g. one-hot encoding). And not only in NLP; thanks to their powerful representation of high-dimensional data, embeddings have spread to other domains such as e-commerce and web search. For instance, Airbnb embeds its click data to represent its listings, and feeds those embeddings into its pricing tools and into real-time personalization in its search algorithm.
Alright, now that you have a sense of how powerful and exciting this technique is, let’s dive into its meat and bones.
Image pixels, audio spectrograms, and words are normally encoded as “discrete atomic symbols” (essentially entries in a lookup table), but this representation has big shortcomings: the resulting vectors are sparse and huge (one dimension per symbol in the vocabulary), and they carry no notion of similarity between symbols.
That’s when embeddings come to the rescue.
Embeddings map words into a “continuous vector space” where semantically similar words end up close together. This representation builds on the assumption that words frequently appearing together in a sentence probably share some statistical dependency.
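To make “close together” concrete, here is a tiny NumPy sketch (the dense vectors are made-up toy values, not trained embeddings): one-hot vectors are all mutually orthogonal, so every pair of distinct words looks equally unrelated, while dense embeddings let cosine similarity reflect semantic relatedness.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# One-hot encoding: every pair of distinct words has similarity 0.
one_hot = {"king": np.eye(5)[0], "queen": np.eye(5)[1], "carrot": np.eye(5)[2]}
print(cosine(one_hot["king"], one_hot["queen"]))   # 0.0
print(cosine(one_hot["king"], one_hot["carrot"]))  # 0.0

# Toy dense embeddings (hand-made, 3 dimensions): related words score higher.
emb = {
    "king":   np.array([0.9, 0.8, 0.1]),
    "queen":  np.array([0.85, 0.75, 0.2]),
    "carrot": np.array([0.1, 0.05, 0.9]),
}
print(cosine(emb["king"], emb["queen"]))   # close to 1
print(cosine(emb["king"], emb["carrot"]))  # much smaller
```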
d-Dimensional Embeddings:
In a deep network, the embedding layer is just another hidden layer, with one unit per embedding dimension. Skip-gram is a good model to walk through in detail.
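One way to see why the embedding layer is “just another hidden layer”: multiplying a one-hot input by the input-to-hidden weight matrix simply selects one row of that matrix, and that row is the word’s d-dimensional embedding. A minimal sketch with toy sizes and random weights:

```python
import numpy as np

vocab_size, embedding_size = 10, 4               # toy sizes
W = np.random.rand(vocab_size, embedding_size)   # input -> hidden weights

word_index = 3
one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1.0

# Matrix multiplication with a one-hot vector == row lookup.
assert np.allclose(one_hot @ W, W[word_index])
```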
Let’s take text data as an example.
Training Procedure (a minimal code sketch follows this list):

- Training data: slide a window over the text to generate (context, target) pairs; in Skip-gram, each (target word, single context word) pair is one training sample.
- Input node: the target word, one-hot encoded over the vocabulary.
- Input weights: the matrix between the input and hidden layers, of size [vocabulary size, embedding size]; its rows are the word embeddings we are after.
- Hidden layer: each node corresponds to a dimension (latent feature); with a one-hot input, the hidden activations are simply the embedding of the input word.
- Output layer: a softmax vector (same length as the vocabulary, i.e. as the input) of probabilities, one per candidate context word.
- Optimization function: while a simple MLP network uses MSE, Skip-gram uses the negative log-likelihood of a word given its context $h$: $-\log \Pr(w \mid h)$.
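Below is a minimal NumPy sketch of that procedure on a toy corpus with a window size of 1. It uses the naive full softmax for clarity; the efficiency tricks discussed later (e.g. negative sampling) replace exactly this part.

```python
import numpy as np

corpus = "the quick brown fox jumps over the lazy dog".split()
vocab = sorted(set(corpus))
word2id = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 8                       # vocabulary size, embedding size

# 1. Generate (target, context) training pairs with window size 1.
pairs = []
for i, w in enumerate(corpus):
    for j in (i - 1, i + 1):
        if 0 <= j < len(corpus):
            pairs.append((word2id[w], word2id[corpus[j]]))

# 2. Two weight matrices: input->hidden (the embeddings) and hidden->output.
rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, D))
W_out = rng.normal(scale=0.1, size=(D, V))

def loss(target_id, context_id):
    h = W_in[target_id]                    # hidden layer = embedding lookup
    scores = h @ W_out                     # one score per vocabulary word
    probs = np.exp(scores) / np.exp(scores).sum()   # softmax
    return -np.log(probs[context_id])      # negative log-likelihood

print(sum(loss(t, c) for t, c in pairs) / len(pairs))
```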
After training:
Build a dictionary (a.k.a. lookup table) that maps each feature (in this case, a user) to the indices of the movies they interacted with in the past.
Ex: {user1: [1, 38, 802], user2: [63, 982, 789]}
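As a small illustration (the lookup table and the embedding matrix below are made up, and averaging item embeddings is just one simple way to turn such a table into a user vector):

```python
import numpy as np

# Hypothetical lookup table: user -> indices of movies they interacted with.
user_to_movies = {"user1": [1, 38, 802], "user2": [63, 982, 789]}

num_movies, embedding_size = 1000, 16
movie_embeddings = np.random.rand(num_movies, embedding_size)  # stand-in for trained vectors

# One simple option: represent a user as the average of their movies' embeddings.
user_vectors = {
    user: movie_embeddings[indices].mean(axis=0)
    for user, indices in user_to_movies.items()
}
print(user_vectors["user1"].shape)  # (16,)
```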
Some common methods to speed up training are:

- Negative Sampling, which causes each training sample to update only a small percentage of the model’s weights.

Word2Vec, which is known for being computationally efficient, comes in two flavors, compared below (a quick usage sketch follows the table):

| | CBOW | Skip-Gram |
|---|---|---|
| Main idea | Predict the target word from its context words | Predict the context words from the target word |
| Training unit | Treats an entire context as one observation (=> smooths over a lot of distributional information) | Treats each context-target pair as one observation |
| Illustration | (CBOW architecture diagram) | (Skip-Gram architecture diagram) |
| Advantages | 1. Good for small datasets 2. Low on memory 3. Probabilistic in nature | 1. Good for larger datasets 2. Can capture two semantics for a single word, i.e. it will have two vector representations of “Apple” 3. Combined with negative sub-sampling, generally outperforms every other method |
| Disadvantages | 1. Takes the average over all contexts of a word 2. Can take forever to train if not properly optimized | Slower to train than CBOW, since each context-target pair is its own training sample |
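For a hands-on comparison, the sketch below trains both flavors with Gensim (assuming the Gensim 4.x API: `sg=0` selects CBOW, `sg=1` selects Skip-gram, and `negative=5` turns on negative sampling with 5 noise words). The corpus here is a toy example, so the resulting vectors are not meaningful; it only shows the knobs.

```python
from gensim.models import Word2Vec

# A tiny toy corpus: a list of tokenized sentences.
sentences = [
    "the quick brown fox jumps over the lazy dog".split(),
    "the dog sleeps while the fox runs".split(),
]

cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1,
                sg=0, negative=5, epochs=50)        # CBOW
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1,
                    sg=1, negative=5, epochs=50)    # Skip-gram

print(cbow.wv.most_similar("fox", topn=3))
print(skipgram.wv.most_similar("fox", topn=3))
```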
How do we turn the output scores into probabilities? The table below contrasts the standard softmax classifier with the noise classifier used by Word2Vec (a cost comparison in code follows the table):

| | Softmax Classifier | Noise Classifier |
|---|---|---|
| Concept | 1. Uses maximum likelihood to maximize Pr(next word given previous words). 2. Uses the softmax function to squash $\text{score}(w_t, h)$ into the scale [0, 1]. | Uses logistic regression to discriminate the real target word from $k$ noise words. |
| Objective | Maximize the log-likelihood of $(w_t, h)$ on the training set by maximizing $J_{ML} = \log P(w_t \mid h)$ | Maximize $J_{NEG}$ by assigning high probability to the real word and low probability to noise words (Negative Sampling) |
| Formula | $P(w_t \mid h) = \frac{\exp(\text{score}(w_t, h))}{\sum_{w' \in V} \exp(\text{score}(w', h))}$, $J_{ML} = \log P(w_t \mid h)$ | $J_{NEG} = \log Q_\theta(D = 1 \mid w_t, h) + k\,\mathbb{E}_{\tilde{w} \sim P_{noise}}\left[\log Q_\theta(D = 0 \mid \tilde{w}, h)\right]$ |
| Illustration | (diagram: softmax over the full vocabulary) | (diagram: binary classification against sampled noise words) |
| Computation | Expensive: has to normalize the score over the entire vocabulary $V$ for the current context $h$ at every training step | Efficient: scales only with the number of noise words $k$, not the whole vocabulary $V$ |
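To make the Computation row concrete, here is a rough NumPy sketch of the scoring work in one training step under each scheme (toy sizes, random weights): the full softmax touches all V output vectors, while negative sampling touches only 1 + k of them.

```python
import numpy as np

V, D, k = 50_000, 128, 5                      # vocab size, embedding size, noise words
rng = np.random.default_rng(0)
W_out = rng.normal(scale=0.01, size=(V, D))   # output (context) word vectors
h = rng.normal(size=D)                        # hidden state / target embedding
target = 42

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Full softmax: score against every word in the vocabulary (O(V) per step).
scores = W_out @ h                            # shape (V,)
softmax_loss = -(scores[target] - np.log(np.exp(scores).sum()))

# Negative sampling: score the real word plus k sampled noise words (O(k) per step).
noise = rng.integers(0, V, size=k)
neg_loss = -np.log(sigmoid(W_out[target] @ h)) \
           - np.log(sigmoid(-(W_out[noise] @ h))).sum()

print(softmax_loss, neg_loss)
```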
(ongoing)