Understanding Attention In Transformers

Author(s): Shashank Bhushan

Originally published on Towards AI.

Introduction

Transformers are everywhere in machine learning nowadays. What started as a novel architecture for sequence-to-sequence language tasks such as translation and question answering can now be found in virtually all ML domains, from computer vision to audio to recommendation. While a transformer has multiple components, the core piece is undoubtedly its use of the attention mechanism. In this post we will start by going over what attention is and how transformers use it. Next, we will go over some theoretical reasoning for why it works and scales so well. Finally, we will look at some shortcomings of the original Transformer proposal and potential improvements.

What Is Attention

Suppose you are given the sentence “Today’s date went well” and need to figure out what the word date means. As the word itself has many meanings (fruit, calendar date, etc.), there isn’t a universal meaning that can be used. Instead, we have to rely on the context, or in other words attend to the context, “went well”, to understand that it probably means a romantic date. Now let’s see how we would mathematically a) find these context words and b) use them to arrive at the correct word meaning.

First, we break the sentence down into words or tokens (in practical applications the tokens are generally sub-words) and replace each word with its corresponding embedding representation from a pre-learned system. If you are not sure what embeddings are, just think of them as an n-dimensional vector that semantically represents a word. These representations also maintain relational properties between words. For example, the distance between the representations of King and Queen would be roughly the same as the distance between Man and Woman.

Now let’s get back to attention. Given that we have these embeddings that capture semantic information, one way to find the context words would be to compute the similarity between the word embeddings. Words with high similarity are likely to appear together in text, making them the right candidates to provide the contextual information. The similarity can be computed using functions such as cosine similarity or the dot product. Once we have computed the similarity of the target word with all the words in the sentence (including the target word itself), we can take a weighted sum of the word embeddings, using the similarities as the weights, to get an updated embedding for the target word.

If it is not clear why a weighted sum would work, think of the initial embedding of the target word “date” as an average representation that captures all the different meanings of the word; we want to move this representation in a direction that is more aligned with its current context. The embedding similarities tell us how much each word should affect the final direction of “date”’s embedding, and the weighted sum moves the embedding in the appropriate direction.

Note: The similarity weights should be normalized so that they sum to 1 before the weighted sum is taken.

How Is Attention Used in Transformers

Now we are ready to examine how attention is defined in the original Transformer paper, Attention Is All You Need:

Attention(Q, K, V) = softmax(QKᵗ / √dₖ) V

Here Q, K, and V are each an N×D matrix, called the query, key, and value matrix respectively, where N is the number of tokens/words and each token is represented by a D-dimensional vector. The equation also contains a scaling factor, √dₖ, and supports optional masking, both of which I am ignoring for simplicity.
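To make the equation concrete, here is a minimal NumPy sketch of scaled dot-product attention (the function and variable names are mine, for illustration). Setting Q, K, and V to the same embedding matrix reproduces the similarity-and-weighted-sum procedure from the previous section.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q Kᵗ / √d) V for N x D matrices Q, K, and V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # N x N pairwise similarities
    scores -= scores.max(axis=-1, keepdims=True)    # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax: each row sums to 1
    return weights @ V                              # weighted sum of the value vectors

# With Q = K = V set to the token embeddings, each output row is the updated
# embedding of one token after attending to every word in the sentence.
```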
While the equation may look very different from what we just walked through, it does the same thing. Assume Q, K, and V are all the same N×D matrix. Then QKᵗ is a matrix operation that performs the similarity computation for all pairs of tokens at once. Its output is an N×N matrix in which the ith row holds the similarity of the ith token with every token in the sentence. Using the earlier example of “Today’s date went well”, row 1 (with 0-indexed rows) would store the similarity of the word “date” with all the other words in the sentence. The softmax then normalizes these similarity values. Finally, the matrix multiplication of the softmax output with V computes the weighted sum of the embeddings for all the tokens/words at once.

So why are there three different matrices, and what is their significance? The query, key, and value matrices are transformed versions of the input embedding matrix: before being passed to the attention mechanism, the input embeddings go through three separate linear projections. The reason for doing this is to add more learning capacity to the attention mechanism. The terminology is borrowed from database systems: Q (query) represents the items for which we want to compute attention, K (keys) represents the items over which attention is computed (the keys in the database), and V (values) holds the values that are aggregated into the output.

Multi-Head Attention

Each transformer block has multiple attention mechanisms, or heads, running in parallel. The outputs of the individual heads are concatenated together and then run through a linear layer to generate the final output. The figure below shows the overall setup; a code sketch of this wiring appears at the end of the post.

[Figure: Multi-head attention. Source: Attention Is All You Need]

There are two important things to call out about the multi-head setup:

1. Each attention head has its own projection matrices that create its Q, K, and V matrices. This allows each attention head to focus on different properties.

2. The per-token embedding dimension d in the Q, K, and V matrices is smaller than the original embedding dimension E, such that d = E/n, where n is the number of attention heads. This is why the outputs of the attention heads are concatenated, making the output dimension E again. The reason for doing this is to ensure that multi-head attention is computationally equivalent to a single attention head whose embedding size is E.

What Makes Transformers Powerful

Unlike a Convolutional or Recurrent Neural Network, Transformers, or rather the attention setup, does not […]
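As promised above, here is a minimal NumPy sketch of the multi-head setup, reusing the scaled_dot_product_attention function from the earlier sketch. The names and the per-head weight lists are illustrative assumptions rather than the paper's notation; real implementations typically fuse the per-head projections into single weight matrices.

```python
import numpy as np

def multi_head_attention(X, Wq, Wk, Wv, Wo):
    """X: N x E token embeddings.
    Wq, Wk, Wv: lists holding one E x (E // n) projection matrix per head.
    Wo: E x E linear layer applied after the heads are concatenated."""
    heads = []
    for wq, wk, wv in zip(Wq, Wk, Wv):
        # Each head projects the input into its own smaller E // n subspace...
        Q, K, V = X @ wq, X @ wk, X @ wv
        heads.append(scaled_dot_product_attention(Q, K, V))
    # ...and concatenating the n heads restores the original dimension E.
    return np.concatenate(heads, axis=-1) @ Wo

# Example shapes: 4 tokens, embedding size E = 8, n = 2 heads of size 4 each.
rng = np.random.default_rng(0)
N, E, n = 4, 8, 2
X = rng.normal(size=(N, E))
Wq = [rng.normal(size=(E, E // n)) for _ in range(n)]
Wk = [rng.normal(size=(E, E // n)) for _ in range(n)]
Wv = [rng.normal(size=(E, E // n)) for _ in range(n)]
Wo = rng.normal(size=(E, E))
assert multi_head_attention(X, Wq, Wk, Wv, Wo).shape == (N, E)
```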
