A powerful representation of high-dimensional data: Embeddings

Why Embeddings?

Embeddings have gained a lot of popularity in the Deep Learning world, since they outperform many traditional data representations. One of the most prominent applications of embeddings is in Natural Language Processing: word embeddings have produced superior results compared to the good old discrete bag-of-words (e.g. one-hot encoding). And not only in NLP; thanks to their powerful representation of high-dimensional data, embeddings have spread to other domains such as e-commerce and web search. For instance, Airbnb embeds its click data to represent its listings, and feeds those embeddings into its pricing tools and into real-time personalization in its search ranking.
Alright, now that you have a sense of how powerful and exciting this technique is, let's dive into its meat and bones.

Traditional Vector Representation of High-dimensional Data

Image pixels, audio spectrograms, and words are normally encoded as "discrete atomic symbols" (essentially IDs in a lookup table), but this representation has big shortcomings: the vectors are huge and extremely sparse (one dimension per distinct symbol), and any two distinct symbols are equally far apart, so the encoding carries no notion of similarity between, say, "cat" and "dog".
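To make that contrast concrete, here is a minimal sketch with a hypothetical five-word vocabulary (the words and the 3-dimensional embedding size are made up for illustration):

```python
import numpy as np

# Hypothetical toy vocabulary: each word is a "discrete atomic symbol".
vocab = ["cat", "dog", "car", "truck", "apple"]
word2idx = {w: i for i, w in enumerate(vocab)}

# One-hot encoding: a V-dimensional vector with a single 1.
one_hot_cat = np.eye(len(vocab))[word2idx["cat"]]   # [1, 0, 0, 0, 0]
one_hot_dog = np.eye(len(vocab))[word2idx["dog"]]   # [0, 1, 0, 0, 0]

# Any two distinct one-hot vectors are orthogonal, so the encoding says
# nothing about how related "cat" and "dog" are.
print(one_hot_cat @ one_hot_dog)                    # always 0.0

# An embedding table maps each word to a small dense vector instead.
d = 3
embedding_table = np.random.randn(len(vocab), d)    # V x d, learned in practice
print(embedding_table[word2idx["cat"]])             # dense 3-d vector
```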

That’s when embeddings come to the rescue.

Featurized Vector Space Representation: Word Embedding

A. What is Word Embedding?

Word embeddings embed words in a continuous vector space where semantically similar words are mapped close together. This representation assumes that words that frequently appear together in a sentence probably share some statistical dependency.
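As a rough sketch of what "mapped close together" means, closeness in that space is usually measured with cosine similarity. The 3-dimensional vectors below are made up for illustration; real ones come out of training:

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity: ~1.0 for vectors pointing the same way, ~0 for unrelated ones.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up embeddings purely for illustration.
vectors = {
    "king":  np.array([0.90, 0.80, 0.10]),
    "queen": np.array([0.85, 0.82, 0.15]),
    "apple": np.array([0.10, 0.20, 0.90]),
}

print(cosine(vectors["king"], vectors["queen"]))  # high: semantically close
print(cosine(vectors["king"], vectors["apple"]))  # low: semantically distant
```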

B. Questions you might ask about Word Embeddings

1. How do embeddings represent multi-dimensional data?

d-dimensional embeddings: each item is mapped to a point in a d-dimensional space, so its representation is just d real numbers, one coordinate per dimension, and each dimension can be thought of as capturing one latent aspect of the data.
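A minimal sketch of that idea, with a hypothetical set of four movies and hand-picked (not learned) coordinates in a d = 2 space:

```python
import numpy as np

# Hypothetical example: 4 movies embedded in a d = 2 dimensional space.
movies = ["Shrek", "The Dark Knight", "Memento", "Frozen"]

# In practice these coordinates are learned; here they are hand-picked so that
# the two animated films sit close together and the two thrillers do too.
embeddings = np.array([
    [-0.9,  0.8],   # Shrek
    [ 0.8, -0.7],   # The Dark Knight
    [ 0.7, -0.9],   # Memento
    [-0.8,  0.9],   # Frozen
])

# "Representing" an item is just reading off its d coordinates.
print(dict(zip(movies, embeddings.tolist())))
```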

2. How is the original data compressed into the embedding space?

In a deep network, the embedding layer is just another hidden layer, with one unit per embedding dimension. Let's take skip-gram as a detailed example.
Using text data as an example, the training procedure roughly goes like this:

1. Slide a window over the corpus to generate (target word, context word) training pairs.
2. Feed the target word (as a one-hot index) into the network; the hidden layer, i.e. the embedding layer, projects it down to d dimensions.
3. The softmax output layer tries to predict the context word, and the error is backpropagated to update both weight matrices.

After training, the input-to-hidden weight matrix is kept as the embedding table: its rows are the learned word vectors.
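The sketch below mirrors those steps in NumPy on a toy corpus, using the full (expensive) softmax and made-up sizes; it is illustrative only, not the original post's implementation:

```python
import numpy as np

# Toy corpus and vocabulary (illustrative only).
corpus = "the quick brown fox jumps over the lazy dog".split()
vocab = sorted(set(corpus))
word2idx = {w: i for i, w in enumerate(vocab)}
V, d, window = len(vocab), 8, 2

# 1. Slide a window over the text to build (target, context) training pairs.
pairs = []
for i in range(len(corpus)):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if j != i:
            pairs.append((word2idx[corpus[i]], word2idx[corpus[j]]))

# 2. The embedding layer is just a V x d weight matrix: one row per word.
rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, d))    # input -> hidden (the embeddings)
W_out = rng.normal(scale=0.1, size=(d, V))   # hidden -> output (softmax weights)

lr = 0.05
for epoch in range(50):
    for t, c in pairs:
        h = W_in[t]                           # hidden layer = embedding lookup
        scores = h @ W_out                    # unnormalized scores over the vocab
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()                  # full softmax (expensive for big V)
        grad = probs.copy()
        grad[c] -= 1.0                        # d(cross-entropy)/d(scores)
        grad_h = W_out @ grad                 # gradient w.r.t. the embedding
        W_out -= lr * np.outer(h, grad)       # 3. backpropagate and update
        W_in[t] -= lr * grad_h

# After training, row t of W_in is the d-dimensional vector for word t.
print(W_in[word2idx["fox"]])
```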

3. The input matrix of users vs. movies is extremely sparse and inefficient in terms of memory. How do you exploit the sparsity and find a better representation for the input matrix?

Build a dictionary (a.k.a. lookup table) that maps each feature (in this case each user) to the indices of the movies they interacted with in the past.
Ex: {user1: [1, 38, 802], user2: [63, 982, 789]}
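A short sketch of building such a lookup table in Python; the user IDs and movie indices below are made up for illustration:

```python
from collections import defaultdict

# Only the observed interactions are stored, not the full dense user x movie matrix.
interactions = [          # (user_id, movie_index) pairs actually observed
    ("user1", 1), ("user1", 38), ("user1", 802),
    ("user2", 63), ("user2", 982), ("user2", 789),
]

lookup = defaultdict(list)
for user, movie in interactions:
    lookup[user].append(movie)

print(dict(lookup))
# {'user1': [1, 38, 802], 'user2': [63, 982, 789]}
# Memory scales with the number of interactions, not with |users| x |movies|.
```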

4. Word2Vec requires a huge network (its weight matrix has one row per input word and one column per embedding feature). Is there any way to reduce the work done on that weight matrix for faster convergence?

Some common methods are: subsampling frequent words, negative sampling (update only a handful of noise words per step instead of the whole output layer), and hierarchical softmax. The first of these is sketched below.
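A sketch of subsampling frequent words, using the keep-probability formula described in resource [2] (the threshold 0.001 is word2vec's default `sample` parameter; the example fractions are made up):

```python
import math

def keep_probability(word_fraction, sample=1e-3):
    # P(keep w) = (sqrt(z / sample) + 1) * sample / z, where z is the fraction
    # of corpus tokens that are this word and `sample` is the subsampling
    # threshold (0.001 by default in word2vec).
    z = word_fraction
    return (math.sqrt(z / sample) + 1) * sample / z

# A very frequent word (say 6% of all tokens, like "the") is kept only ~15%
# of the time, while a rarer word (0.01% of tokens) is effectively always kept.
print(keep_probability(0.06))     # ~0.146
print(keep_probability(0.0001))   # ~13.2, i.e. capped at "always keep"
```

During preprocessing, each occurrence of a word w is dropped with probability 1 - keep_probability(z_w) before the context windows are built, which shrinks the number of training pairs dominated by very frequent words.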

5. How do you know the optimal number of embedding dimensions?

There is no exact answer: it is a hyperparameter that trades expressiveness against overfitting and training cost. A commonly quoted rule of thumb is to start around the fourth root of the number of distinct categories and tune from there by validation.

C. Theories behind VSM

| | CBOW | Skip-Gram |
|---|---|---|
| Main idea | predicts the target word from its context words | predicts the context words from the target word |
| Training unit | treats an entire context as one observation (which smooths over a lot of distributional information) | treats each context-target pair as one observation |
| Advantages | 1. good for small datasets 2. low on memory 3. probabilistic in nature | 1. good for larger datasets 2. can capture two semantics for a single word, i.e. it can end up with two vector representations of "Apple" 3. combined with negative sampling, it generally outperforms the other methods |
| Disadvantages | 1. takes the average over all contexts of a word 2. can take forever to train if not properly optimized | needs a larger dataset to work well |
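In practice you rarely implement either variant from scratch. With gensim (assuming the 4.x API, which the original post does not mention), switching between the two is a single flag; the toy sentences below are made up for illustration:

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences.
sentences = [
    ["embeddings", "map", "words", "to", "dense", "vectors"],
    ["similar", "words", "end", "up", "close", "together"],
]

# sg=0 -> CBOW (predict the target from its context).
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

# sg=1 -> skip-gram (predict the context from the target);
# negative=5 enables negative sampling with 5 noise words per example.
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, negative=5)

print(skipgram.wv["words"][:5])   # first 5 dimensions of the learned vector
```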

Scaling up with Noise-Contrastive Training

| | Softmax classifier | Noise classifier |
|---|---|---|
| Concept | 1. uses maximum likelihood to maximize $P(w_t \mid h)$, the probability of the next word $w_t$ given the previous words (history $h$). 2. uses the softmax function to squash the score $s(w_t, h)$ into the $[0, 1]$ range | uses logistic regression to discriminate the real target word $w_t$ from $k$ noise words $\tilde{w}$ |
| Objective | maximize the log-likelihood of $(w_t, h)$ on the training set, i.e. maximize $J_{ML}$ | maximize $J_{NEG}$ by assigning high probability to the real word and low probability to the noise words (negative sampling) |
| Formula | $J_{ML} = \log P(w_t \mid h) = s(w_t, h) - \log \sum_{w' \in V} \exp\, s(w', h)$ | $J_{NEG} = \log Q_\theta(D{=}1 \mid w_t, h) + k\, \mathbb{E}_{\tilde{w} \sim P_{noise}}\!\left[\log Q_\theta(D{=}0 \mid \tilde{w}, h)\right]$ |
| Computation | expensive: the score has to be normalized over the entire vocabulary $V$ for the current context $h$ at every training step | efficient: scales only with the $k$ noise words, not with the whole vocabulary $V$ |
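A NumPy sketch of the noise-contrastive idea in its negative-sampling flavour, with randomly generated vectors standing in for learned ones: the loss for one training pair touches only the true word and the k sampled noise words, never the whole vocabulary.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_objective(h, true_vec, noise_vecs):
    # Negative-sampling objective for one training pair:
    #   log sigmoid(s(w_t, h)) + sum_i log sigmoid(-s(w~_i, h)),
    # where the score s is a dot product. Only k + 1 scores are computed,
    # instead of one score per word in the vocabulary.
    pos = np.log(sigmoid(true_vec @ h))
    neg = sum(np.log(sigmoid(-(n @ h))) for n in noise_vecs)
    return pos + neg

rng = np.random.default_rng(0)
d, k = 8, 5                        # embedding size and number of noise words
h = rng.normal(size=d)             # hidden vector for the current input word
w_true = rng.normal(size=d)        # output vector of the real target word
w_noise = rng.normal(size=(k, d))  # output vectors of k sampled noise words

print(neg_objective(h, w_true, w_noise))  # maximized during training
```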

Skip-gram model

Noise-Contrastive Estimation (NCE)

(ongoing)

Resources

[1] http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
[2] http://mccormickml.com/2017/01/11/word2vec-tutorial-part-2-negative-sampling/
[3] https://www.tensorflow.org/tutorials/representation/word2vec
[4] https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/
[5] https://developers.google.com/machine-learning/crash-course/embeddings/video-lecture