Embeddings have gained a lot of popularity in the Deep Learning world, since they outperform many traditional data representations. One of the most prominent applications is in Natural Language Processing: word embeddings have proven to give better results than the good old discrete bag-of-words representations (e.g. one-hot encoding). And not only in NLP; thanks to their powerful representation of high-dimensional data, embeddings have spread to other domains such as e-commerce and web search. For instance, Airbnb embeds its click data to represent its listings, and feeds those embeddings into its pricing tools and into real-time personalization in its search algorithm.
Alright, now that you have a sense of how powerful and exciting this technique is, let’s dive into its meat and bones.
Image pixels, audio spectrograms, and words are normally encoded as “discrete atomic symbols” (essentially entries in a lookup table), but this representation has big shortcomings: the resulting vectors are sparse and huge (one dimension per symbol in the vocabulary), and they carry no notion of similarity between symbols.
That’s when embeddings come to the rescue.
Embeddings map words into a “continuous vector space” where semantically similar words end up close together. This representation builds on the assumption that words frequently appearing together in a sentence probably share some statistical dependency.
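To make “close together” concrete, here is a tiny NumPy sketch (the dense vectors are made-up toy values, not trained embeddings): one-hot vectors are all mutually orthogonal, so every pair of distinct words looks equally unrelated, while dense embeddings let cosine similarity reflect semantic relatedness.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# One-hot encoding: every pair of distinct words has similarity 0.
one_hot = {"king": np.eye(5)[0], "queen": np.eye(5)[1], "carrot": np.eye(5)[2]}
print(cosine(one_hot["king"], one_hot["queen"]))   # 0.0
print(cosine(one_hot["king"], one_hot["carrot"]))  # 0.0

# Toy dense embeddings (hand-made, 3 dimensions): related words score higher.
emb = {
    "king":   np.array([0.9, 0.8, 0.1]),
    "queen":  np.array([0.85, 0.75, 0.2]),
    "carrot": np.array([0.1, 0.05, 0.9]),
}
print(cosine(emb["king"], emb["queen"]))   # close to 1
print(cosine(emb["king"], emb["carrot"]))  # much smaller
```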
d-Dimensional Embeddings:
In a deep network, the embedding layer is just another hidden layer, with one unit per embedding dimension. Skip-gram is a good model to walk through in detail.
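One way to see why the embedding layer is “just another hidden layer”: multiplying a one-hot input by the input-to-hidden weight matrix simply selects one row of that matrix, and that row is the word’s d-dimensional embedding. A minimal sketch with toy sizes and random weights:

```python
import numpy as np

vocab_size, embedding_size = 10, 4               # toy sizes
W = np.random.rand(vocab_size, embedding_size)   # input -> hidden weights

word_index = 3
one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1.0

# Matrix multiplication with a one-hot vector == row lookup.
assert np.allclose(one_hot @ W, W[word_index])
```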
Let’s take text data as an example.
Training Procedure (a minimal code sketch follows this list):

- Training data: slide a window over the text to generate (context, target) pairs; in Skip-gram, each (target word, single context word) pair is one training sample.
- Input node: the target word, one-hot encoded over the vocabulary.
- Input weights: the matrix between the input and hidden layers, of size [vocabulary size, embedding size]; its rows are the word embeddings we are after.
- Hidden layer: each node corresponds to a dimension (latent feature); with a one-hot input, the hidden activations are simply the embedding of the input word.
- Output layer: a softmax vector (same length as the vocabulary, i.e. as the input) of probabilities, one per candidate context word.
- Optimization function: while a simple MLP network uses MSE, Skip-gram uses the negative log-likelihood of a word given its context $h$: $-\log \Pr(w \mid h)$.
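Below is a minimal NumPy sketch of that procedure on a toy corpus with a window size of 1. It uses the naive full softmax for clarity; the efficiency tricks discussed later (e.g. negative sampling) replace exactly this part.

```python
import numpy as np

corpus = "the quick brown fox jumps over the lazy dog".split()
vocab = sorted(set(corpus))
word2id = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 8                       # vocabulary size, embedding size

# 1. Generate (target, context) training pairs with window size 1.
pairs = []
for i, w in enumerate(corpus):
    for j in (i - 1, i + 1):
        if 0 <= j < len(corpus):
            pairs.append((word2id[w], word2id[corpus[j]]))

# 2. Two weight matrices: input->hidden (the embeddings) and hidden->output.
rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, D))
W_out = rng.normal(scale=0.1, size=(D, V))

def loss(target_id, context_id):
    h = W_in[target_id]                    # hidden layer = embedding lookup
    scores = h @ W_out                     # one score per vocabulary word
    probs = np.exp(scores) / np.exp(scores).sum()   # softmax
    return -np.log(probs[context_id])      # negative log-likelihood

print(sum(loss(t, c) for t, c in pairs) / len(pairs))
```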
After training:
Build a dictionary (a.k.a. lookup table) that maps each feature (in this case, a user) to the indices of the movies they interacted with in the past.
Ex: {user1: [1, 38, 802], user2: [63, 982, 789]}
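As a small illustration (the lookup table and the embedding matrix below are made up, and averaging item embeddings is just one simple way to turn such a table into a user vector):

```python
import numpy as np

# Hypothetical lookup table: user -> indices of movies they interacted with.
user_to_movies = {"user1": [1, 38, 802], "user2": [63, 982, 789]}

num_movies, embedding_size = 1000, 16
movie_embeddings = np.random.rand(num_movies, embedding_size)  # stand-in for trained vectors

# One simple option: represent a user as the average of their movies' embeddings.
user_vectors = {
    user: movie_embeddings[indices].mean(axis=0)
    for user, indices in user_to_movies.items()
}
print(user_vectors["user1"].shape)  # (16,)
```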
Some common methods to speed up training are:

- Negative Sampling, which causes each training sample to update only a small percentage of the model’s weights.

Word2Vec, which is known for being computationally efficient, comes in two flavors, compared below (a quick usage sketch follows the table):

| | CBOW | Skip-Gram |
|---|---|---|
| Main idea | Predict the target word from its context words | Predict the context words from the target word |
| Training unit | Treats an entire context as one observation (=> smooths over a lot of distributional information) | Treats each context-target pair as one observation |
| Illustration | (CBOW architecture diagram) | (Skip-Gram architecture diagram) |
| Advantages | 1. Good for small datasets 2. Low on memory 3. Probabilistic in nature | 1. Good for larger datasets 2. Can capture two semantics for a single word, i.e. it will have two vector representations of “Apple” 3. Combined with negative sub-sampling, generally outperforms every other method |
| Disadvantages | 1. Takes the average over all contexts of a word 2. Can take forever to train if not properly optimized | Slower to train than CBOW, since each context-target pair is its own training sample |
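For a hands-on comparison, the sketch below trains both flavors with Gensim (assuming the Gensim 4.x API: `sg=0` selects CBOW, `sg=1` selects Skip-gram, and `negative=5` turns on negative sampling with 5 noise words). The corpus here is a toy example, so the resulting vectors are not meaningful; it only shows the knobs.

```python
from gensim.models import Word2Vec

# A tiny toy corpus: a list of tokenized sentences.
sentences = [
    "the quick brown fox jumps over the lazy dog".split(),
    "the dog sleeps while the fox runs".split(),
]

cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1,
                sg=0, negative=5, epochs=50)        # CBOW
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1,
                    sg=1, negative=5, epochs=50)    # Skip-gram

print(cbow.wv.most_similar("fox", topn=3))
print(skipgram.wv.most_similar("fox", topn=3))
```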
How do we turn the output scores into probabilities? The table below contrasts the standard softmax classifier with the noise classifier used by Word2Vec (a cost comparison in code follows the table):

| | Softmax Classifier | Noise Classifier |
|---|---|---|
| Concept | 1. Uses maximum likelihood to maximize Pr(next word given previous words). 2. Uses the softmax function to squash $\text{score}(w_t, h)$ into the scale [0, 1]. | Uses logistic regression to discriminate the real target word from $k$ noise words. |
| Objective | Maximize the log-likelihood of $(w_t, h)$ on the training set by maximizing $J_{ML} = \log P(w_t \mid h)$ | Maximize $J_{NEG}$ by assigning high probability to the real word and low probability to noise words (Negative Sampling) |
| Formula | $P(w_t \mid h) = \frac{\exp(\text{score}(w_t, h))}{\sum_{w' \in V} \exp(\text{score}(w', h))}$, $J_{ML} = \log P(w_t \mid h)$ | $J_{NEG} = \log Q_\theta(D = 1 \mid w_t, h) + k\,\mathbb{E}_{\tilde{w} \sim P_{noise}}\left[\log Q_\theta(D = 0 \mid \tilde{w}, h)\right]$ |
| Illustration | (diagram: softmax over the full vocabulary) | (diagram: binary classification against sampled noise words) |
| Computation | Expensive: has to normalize the score over the entire vocabulary $V$ for the current context $h$ at every training step | Efficient: scales only with the number of noise words $k$, not the whole vocabulary $V$ |
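To make the Computation row concrete, here is a rough NumPy sketch of the scoring work in one training step under each scheme (toy sizes, random weights): the full softmax touches all V output vectors, while negative sampling touches only 1 + k of them.

```python
import numpy as np

V, D, k = 50_000, 128, 5                      # vocab size, embedding size, noise words
rng = np.random.default_rng(0)
W_out = rng.normal(scale=0.01, size=(V, D))   # output (context) word vectors
h = rng.normal(size=D)                        # hidden state / target embedding
target = 42

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Full softmax: score against every word in the vocabulary (O(V) per step).
scores = W_out @ h                            # shape (V,)
softmax_loss = -(scores[target] - np.log(np.exp(scores).sum()))

# Negative sampling: score the real word plus k sampled noise words (O(k) per step).
noise = rng.integers(0, V, size=k)
neg_loss = -np.log(sigmoid(W_out[target] @ h)) \
           - np.log(sigmoid(-(W_out[noise] @ h))).sum()

print(softmax_loss, neg_loss)
```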
(ongoing)