by Roopal Garg
Deep Learning is a buzzword that gets thrown around a lot these days. It’s thought of as the “next big thing”, one that has already turned many heads and won over many who initially dismissed it as a bubble. It’s an active area of research, and its applications are being explored in just about anything and everything you can think of.
Deep Learning and Neural Networks are not new; they have been around for decades, but the lack of accessible, affordable computational power and of available data was a major bottleneck. More sophisticated algorithms, cheaper computational power from GPUs, and data literally flowing in from all directions have led to what can be called a renaissance for Deep Learning.
The major advantage that Neural Networks provide over traditional learning algorithms can be described in terms of performance as the amount of data increases:
With increasing amounts of data, traditional learning algorithms eventually plateau, as if they cannot learn any more, while Neural Networks keep getting better as the data and their own size grow.
I assume you have some basic understanding of how Neural Networks work, what the significance of a Neuron is and how layers of these neurons can be connected together to build complicated and powerful architectures. If not, I would highly recommend reading this blog first to get some idea.
Deep Learning for NLP is relatively new compared to its usage in, say, Computer Vision, which employs it to process images and videos. Before we dive into how DL works for NLP, let’s try and think about how the brain probably interprets text.
Think about the following sentences:
1. “Hi, what’s up?” – The sentence, though very simple, has some overall meaning that the brain maps to, say, a point in its infinite space of understanding.
2. “Hi, how are you?” – Most of us will agree that these two sentences are very similar in meaning and context, so the brain potentially has a very similar mapping for the two, i.e., it maps the points representing their meanings very close to each other.
3. “Trying to understand DL!” – This sentence has a meaning very different from the previous two, so we can think of its mapping as sitting somewhat apart from the previous two points, the distance representing the difference in the overall meaning of the sentences.
In order to interpret and represent the mapping the brain just did, we would say that it has an infinite-dimensional space, i.e., our brains are capable of processing and understanding an infinite number of minutely distinct concepts (recalling them is a different matter altogether), and each concept, sentence, document, or anything else that carries meaning can be represented as a point in it. Each such point would be an infinite-dimensional vector, i.e., a list of infinitely many numbers.
Let’s trim that thought a bit and say that this particular brain can only represent concepts in a 5-dimensional vector space (not too smart, huh!). That means every point becomes a 5-dimensional vector: a way of embedding data.
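To make that concrete, here is a toy sketch of the idea. The 5-dimensional vectors below are entirely made up for illustration; the point is only that similar sentences map to nearby points:

```python
import numpy as np

# Hypothetical 5-dimensional "meaning" vectors for the three sentences above;
# the numbers are invented purely to illustrate embedding sentences as points.
hi_whats_up    = np.array([0.9, 0.1, 0.0, 0.2, 0.7])
hi_how_are_you = np.array([0.8, 0.2, 0.1, 0.3, 0.6])
trying_dl      = np.array([0.1, 0.9, 0.8, 0.0, 0.1])

def distance(a, b):
    """Euclidean distance between two meaning vectors."""
    return np.linalg.norm(a - b)

# The two greetings sit close together; the DL sentence sits far away.
print(distance(hi_whats_up, hi_how_are_you))  # small
print(distance(hi_whats_up, trying_dl))       # large
```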
The smallest unit that carries meaning in text is the word. Let’s try to understand how we embed words.
Simply put, word embeddings let us represent words as vectors. But these are not random vectors: the aim is to represent words such that similar words, or words used in similar contexts, end up close to each other, while antonyms end up far apart in the vector space.
From the diagram above:
- cat and dog: both cute animals, both can be pets, 2 eyes, 4 legs, one cute little nose! Different in their own way but similar in a lot of ways.
- Audi and BMW: both powerful expensive German automobile companies
- USC and UCLA: both premier universities located in Los Angeles
The words within each pair (cat and dog) are very similar to each other and thus should be mapped close together, while the pairs themselves ((cat, dog) vs. (Audi, BMW)) are very different from each other and thus are mapped apart.
Word embeddings also learn relations such as:
- KING – MAN + WOMAN ⇒ QUEEN
- PARIS – FRANCE + ENGLAND ⇒ LONDON
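These relations are plain vector arithmetic. A minimal sketch with hand-picked 2-dimensional toy vectors (real word2vec/GloVe embeddings only satisfy the analogy approximately, and live in hundreds of dimensions):

```python
import numpy as np

# Toy 2-D embeddings chosen by hand so the analogy works exactly:
# dimension 0 ~ "royalty", dimension 1 ~ "maleness".
emb = {
    "king":  np.array([0.9, 0.9]),
    "queen": np.array([0.9, 0.1]),
    "man":   np.array([0.1, 0.9]),
    "woman": np.array([0.1, 0.1]),
}

result = emb["king"] - emb["man"] + emb["woman"]

# Find the vocabulary word closest to the resulting vector.
closest = min(emb, key=lambda w: np.linalg.norm(emb[w] - result))
print(closest)  # queen
```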
There are a few techniques for learning these embeddings given a big enough corpus (e.g., English Wikipedia), the most prevalent being word2vec and GloVe (and fastText, a relatively newer one).
word2vec comes from the house of Google and has two flavors to it:
1. Continuous Bag of Words, aka CBOW:
The aim is to fill in the missing word given its neighboring context. From the example: given “When”, “in”, “____”, “speak”, “French”, the algorithm learns that “France” is the obvious choice.
2. Skip-gram:
Given a word, predict its context. So from the example: given “France”, predict “When”, “in”, “speak”, “French” as its neighboring words.
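To make the skip-gram setup concrete, here is a sketch of how (target, context) training pairs could be generated from the example sentence; the window size of 2 words on each side is an arbitrary choice (it is a hyperparameter in practice):

```python
# Generate skip-gram (target, context) pairs from a tokenized sentence.
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, target in enumerate(tokens):
        # Every word within `window` positions of the target is a context word.
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

tokens = ["when", "in", "france", "speak", "french"]
# For the target "france", the model must learn to predict each neighbor.
print([c for t, c in skipgram_pairs(tokens) if t == "france"])
# ['when', 'in', 'speak', 'french']
```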
Let’s see what’s really happening in the above skip-gram diagram:
- Input Layer:
- Well, we want to learn how to represent words, so words should be the input. But we can’t feed a word in string form to a neural network!
- The way we represent individual words is through a unique index mapping, i.e., each word has a unique index. If we have, say, V distinct words, then our objective is to learn a representation of these V words/indexes in the form of a D-dimensional vector each.
- We one-hot encode the word indexes, i.e., each word, instead of being an index, now becomes a V-dimensional vector of zeroes with a 1 only at the index it represents.
- So “France” is represented by something like: [0, 0, 1, 0, 0, …, 0] (a 1×V vector), where the index for “France” is 2.
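A quick sketch of one-hot encoding, using a hypothetical 5-word vocabulary (real vocabularies run to tens of thousands of words):

```python
import numpy as np

# Hypothetical toy vocabulary; each word gets a unique index.
vocab = ["when", "in", "france", "speak", "french"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word, V):
    vec = np.zeros(V)               # a 1*V vector of zeroes...
    vec[word_to_index[word]] = 1.0  # ...with a 1 at the word's index
    return vec

print(one_hot("france", len(vocab)))  # [0. 0. 1. 0. 0.]
```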
- Projection Layer:
- Our vocabulary size is V and we want to learn a D-dimensional representation for each word in the vocabulary, so the projection layer is a V×D matrix.
- Output Layer:
- This layer takes the output of the projection layer and creates a probability distribution over the V words using a softmax function. The learning phase tunes the projection layer such that, eventually, words like “When”, “in”, “speak”, “French” have a high probability compared to most other words in V when “France” is the input.
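Putting the three layers together, here is a minimal sketch of the skip-gram forward pass with made-up toy sizes (V = 5, D = 3) and random, untrained weights:

```python
import numpy as np

# One-hot input -> projection (V*D) -> output weights (D*V) -> softmax over V.
V, D = 5, 3
rng = np.random.default_rng(42)
projection = rng.standard_normal((V, D))      # the V*D projection layer
output_weights = rng.standard_normal((D, V))  # the D*V output layer

def softmax(z):
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

x = np.zeros(V)
x[2] = 1.0                          # one-hot input for the word at index 2
hidden = x @ projection             # picks out that word's D-dim embedding
probs = softmax(hidden @ output_weights)  # probability for each of the V words

print(probs.shape)  # (5,)
```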
After the training phase, it’s the projection layer that is picked up and used as the word embeddings for the V words. The projection layer simply becomes a lookup table where the i-th row holds the embedding of the word with index i.
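The lookup works because multiplying a one-hot vector by the projection matrix just selects one of its rows, as this small sketch (again with made-up toy sizes) shows:

```python
import numpy as np

V, D = 5, 3  # toy sizes: 5 words, 3-dimensional embeddings
rng = np.random.default_rng(0)
projection = rng.standard_normal((V, D))  # stands in for the learned V*D matrix

# One-hot multiplication and direct row indexing give the same vector,
# which is why the trained matrix is used directly as a lookup table.
one_hot_word = np.array([0, 0, 1, 0, 0])  # word with index 2
via_matmul = one_hot_word @ projection
via_lookup = projection[2]

print(np.allclose(via_matmul, via_lookup))  # True
```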
GloVe comes from the house of Stanford. The researchers there argue that ratios of word-word co-occurrence probabilities have the potential to encode some form of meaning.
From the example:
Target words: ice, steam
Probe words: solid, gas, water, fashion
Now, as you would expect, ice co-occurs with solid far more often than with gas; similarly, steam co-occurs more frequently with gas than with solid; and both ice and steam co-occur frequently with water (a shared property) and seldom with an unrelated word like fashion.
Taking the ratios, as in the 3rd row of the diagram, really starts to clear out the noise from non-discriminative words like water and fashion here. Really large values correlate well with properties of the numerator (ice) and really small values correlate well with properties of the denominator (steam). The authors thus demonstrated how simple ratios can represent meaning, thermodynamic phases in this case: solid with ice and gas with steam!
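A quick sketch of this effect, using co-occurrence probabilities of roughly the magnitudes reported for this example in the GloVe paper (Pennington et al., 2014):

```python
# P(probe | target) for the two target words; the ratio row carries the signal.
p_given_ice   = {"solid": 1.9e-4, "gas": 6.6e-5, "water": 3.0e-3, "fashion": 1.7e-5}
p_given_steam = {"solid": 2.2e-5, "gas": 7.8e-4, "water": 2.2e-3, "fashion": 1.8e-5}

for probe in p_given_ice:
    ratio = p_given_ice[probe] / p_given_steam[probe]
    print(f"{probe:8s} P(k|ice)/P(k|steam) = {ratio:.2f}")

# solid -> large ratio (ice-like), gas -> small ratio (steam-like),
# water and fashion -> close to 1 (uninformative for this pair).
```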
The usage is the same as word2vec; it’s just a different take on generating the V×D projection layer.
Word embeddings can be considered the building blocks for using Neural Networks to do NLP. They are learned with an unsupervised technique and thus can be trained on any corpus without the need for human annotation. They provide a nice starting point for training any Neural Network that takes text as its input (you will have to convert the text to indices first), since they capture similarity and relations like the ones we saw in the examples above.
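A minimal sketch of that text-to-indices step, assuming a simple whitespace tokenizer and an `<unk>` index for out-of-vocabulary words (both are illustrative choices, not part of any particular library):

```python
# Build a word-to-index vocabulary from a toy corpus, reserving index 0
# for unknown words, then convert sentences to index sequences.
corpus = ["when in france speak french", "trying to understand dl"]

vocab = {"<unk>": 0}
for sentence in corpus:
    for word in sentence.split():
        vocab.setdefault(word, len(vocab))

def to_indices(sentence):
    # Unknown words fall back to the <unk> index.
    return [vocab.get(w, vocab["<unk>"]) for w in sentence.split()]

print(to_indices("when in france"))   # [1, 2, 3]
print(to_indices("bonjour france"))   # [0, 3]
```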
Putting it all in one line:
Given a word, we represent it with an index and learn to represent it with a D-dimensional vector, such that the mapping captures some sort of relation and similarity between words.
The next post will pick up from here and talk about how to process sentences and documents using Recurrent Neural Networks (RNNs).