Author(s): Saif Ali Kheraj

Originally published on Towards AI.

As large language models become more prevalent, it is essential that we study and understand attention, which plays a central role in both Transformers and language models. First, let us build a better understanding of the sequence-to-sequence encoder-decoder network. After that, we will proceed to the most important part, the attention model, and examine it in greater detail.

Traditional Sequence to Sequence: Encoder-Decoder Network

Let us take a particular translation and see how it is represented in the traditional seq-to-seq model. Traditional sequence-to-sequence models face difficulties because of their fixed context: the encoder's last hidden state vector is extremely critical, since it must capture the complete input representation, which is then used during the decoding process. Let's look at each component of the diagram below.

Figure 1 by Author: Traditional Seq to Seq with Fixed Context

I have not shown all of the encoder cells in the diagram above, but the general idea is that we use only the encoder's last hidden state and pass it to the decoder, which produces the translation of the English sentence. As shown in the diagram, the encoder's final hidden state is intended to encapsulate all of the information from the English input sequence in a single fixed-length vector. This final hidden state is then processed further to generate the output sequence during decoding.

The issue, and a major problem, is that this does not scale well. The entire input sequence must be compressed into a single vector, which results in information loss, particularly for longer sequences; longer sequences therefore lead to decreased performance.

More advanced architectures, such as the Transformer (used in BERT, GPT, and so on), do not rely on a single hidden state vector to transfer information between the encoder and the decoder. Instead, they use mechanisms such as attention to allow each step of the decoder to access the entire encoder output, thus addressing the context limitation.

In this post, we will go over the basic concepts of attention mechanisms, such as alignment scores and attention weights, which the decoder uses to accurately predict the next word by focusing on the right hidden vectors of the input sequence in the encoder. We will cover the fundamentals of scaled dot products, teacher forcing, and pre-attention decoders, as well as the connections between them.

Is using all the hidden states a solution?

One alternative to using only the last hidden state is to pass all of the hidden states to the decoder and combine them with some sort of point-wise addition. But this introduces another problem: the network still does not know which word in the encoder to focus on more.

Figure 2 by Author: Trying Point Wise Addition

Attention is all you need!!

Alignment Scores

The alignment score measures the similarity between each encoder hidden state and the decoder's hidden state. Let me give you an example. Here the source sentence is "It's time for coffee" and the target sentence is "C'est l'heure du café" in French. Let's call the hidden state for "It's" h1, for "time" h2, for "for" h3, and for "coffee" h4. We will calculate alignment scores step by step for each word in the sentence.
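Before we continue, here is a minimal NumPy sketch of the idea, not the article's exact setup. The hidden state values and the decoder state s_prev are made-up toy vectors, and a plain dot product is used as the similarity score, which is just one common scoring choice (the feedforward scorer described in the next section is another):

```python
import numpy as np

# Toy encoder hidden states for "It's", "time", "for", "coffee".
# The values are invented purely for illustration.
h = np.array([
    [0.9, 0.1, 0.0],   # h1: "It's"
    [0.2, 0.8, 0.1],   # h2: "time"
    [0.1, 0.3, 0.2],   # h3: "for"
    [0.0, 0.2, 0.9],   # h4: "coffee"
])

# Toy decoder hidden state just before predicting the first French word.
s_prev = np.array([1.0, 0.2, 0.1])

# Alignment scores: one similarity value per encoder hidden state.
scores = h @ s_prev
print(scores)  # a higher score = more relevant to the next output word
```

The output is one raw score per input word; the next step is to turn those scores into a probability distribution.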
When translating or predicting a specific word, the decoder examines all of the encoder's hidden states and attempts to determine which English words are most relevant for producing the first word in French, "C'est".

Figure 3 by Author: Adding Attention Mechanism

The scores in green are the normalized alignment scores after applying softmax, and this is what attention is all about. To predict the first word, "C'est", the decoder must decide which English word to focus on more. As we can see in this example, the first output word should focus most on "It's" (probability 0.8). These probabilities are referred to as attention weights, and the weights shown here are for translating the first word. Attention weights are denoted αij, with i indexing the decoder (output word) and j indexing the encoder (input word). The figure above shows the attention weights α1j.

Alignment scores are essentially a scoring system the model uses to determine which words in the input sentence should be prioritized when generating each word in the output sentence. The attention mechanism ensures that the translation is contextually appropriate even when the sentence structure varies between languages.

Now that you have some intuition, let us see how attention weights are calculated.

Figure 4 by Author: Attention weights calculation

The above is a very simple architecture for the attention mechanism. Each encoder hidden state hj represents an input word of the English sentence (h1, h2, h3, and h4), and si-1 is the decoder's previous hidden state. Both are fed into a feedforward neural network, whose outputs are passed through a softmax for weight normalization. To summarize, the softmax converts alignment scores into weights that quantify the importance of each encoder state to the decoder's current state: αij is the attention weight for the jth input word's influence on the ith output word, and hj is the jth encoder hidden state.

These weights are used to build a context vector for the decoder. The context vector for the current output word is computed by multiplying each encoder hidden state hj by its corresponding attention weight αij and summing the results, i.e., ci = Σj αij hj, where ci is the context vector for the ith word of the output sequence.

Figure 5 by Author: Context Vector (https://arxiv.org/pdf/1409.0473.pdf)

Now that we have the context vector, let's call the first one c1 and the initial decoder hidden state s0. We combine the two by concatenation, followed by tanh or another non-linear activation function. This combination provides a rich signal for generating the first word in the target language.

What is our learning here? When translating a sentence from English to French using machine translation, the model does not always translate each word in the correct […]
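The following sketch puts the whole pipeline together: the feedforward (additive) scorer, the softmax, the context vector, and the concatenate-plus-tanh combination. It follows the formulation in the Bahdanau et al. (2014) paper linked above, but the matrices W_a, U_a, v_a, and W_c are random stand-ins here; in a real model they are learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d = 3  # toy hidden size

# Toy encoder hidden states h1..h4 and previous decoder state,
# reusing the made-up values from the earlier sketch.
h = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.1],
    [0.1, 0.3, 0.2],
    [0.0, 0.2, 0.9],
])
s_prev = np.array([1.0, 0.2, 0.1])

# Feedforward (additive) scorer, as in Bahdanau et al. 2014:
# e_ij = v_a . tanh(W_a s_{i-1} + U_a h_j).
W_a = rng.normal(size=(d, d))
U_a = rng.normal(size=(d, d))
v_a = rng.normal(size=d)
scores = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ hj) for hj in h])

# Softmax turns alignment scores into attention weights alpha_1j.
alpha = softmax(scores)

# Context vector c1 = sum_j alpha_1j * h_j (weighted sum of encoder states).
c1 = alpha @ h

# Combine context with the decoder state: concatenate, then tanh.
# W_c is another hypothetical learned matrix.
W_c = rng.normal(size=(d, 2 * d))
combined = np.tanh(W_c @ np.concatenate([c1, s_prev]))

print("attention weights:", alpha)
print("context vector:  ", c1)
print("combined signal: ", combined)
```

With trained weights, alpha would concentrate on the relevant source word (as in the 0.8 on "It's" above), so the context vector c1 would be dominated by that word's hidden state.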