Author(s): Saif Ali Kheraj

Originally published on Towards AI.

As large language models become more prevalent, it is essential that we study and understand attention, which plays a central role in both Transformers and language models. First, let us build a better understanding of the sequence-to-sequence encoder-decoder network. After that, we will proceed to the most important part, the attention model, and examine it in greater detail.

Traditional Sequence to Sequence: Encoder-Decoder Network

Let us take a particular translation and see how it is represented in the traditional seq-to-seq model. Traditional sequence-to-sequence models face difficulties because of their fixed context: the encoder's last hidden state vector is extremely critical, since it must capture the complete input representation, which is then used during the decoding process. Let's look at each component of the diagram below.

Figure 1 by Author: Traditional Seq to Seq with Fixed Context

I have not shown all of the encoder cells in the diagram above, but the general idea is that we use only the encoder's last hidden state and pass it to the decoder, which produces the translation of the English sentence. As shown in the diagram, the encoder's final hidden state is intended to encapsulate all of the information from the English input sequence in a single fixed-length vector. This final hidden state is then processed further to generate the output sequence during decoding.

The issue, and a major problem, is that this does not scale well. The entire input sequence must be compressed into a single vector, which results in information loss, particularly for longer sequences; longer sequences therefore lead to decreased performance.

More advanced architectures, such as the Transformer (used in BERT, GPT, and so on), do not rely on a single hidden state vector to transfer information between the encoder and the decoder. Instead, they use mechanisms such as attention to allow each step of the decoder to access the entire encoder output, thus addressing the context limitation.

In this post, we will go over the basic concepts of attention mechanisms, such as alignment scores and attention weights, which the decoder uses to accurately predict the next word by focusing on the right hidden vectors of the input sequence in the encoder. We will cover the fundamentals of scaled dot products, teacher forcing, and pre-attention decoders, as well as the connections between them.

Is using all the hidden states a solution?

One alternative to using only the last hidden state is to pass all of the hidden states to the decoder and combine them with some sort of point-wise addition. But this introduces another problem: the network still does not know which word in the encoder to focus on more.

Figure 2 by Author: Trying Point Wise Addition

Attention is all you need!!

Alignment Scores

The alignment score measures the similarity between each encoder hidden state and the decoder's hidden state. Let me give you an example. Here the source sentence is "It's time for coffee" and the target sentence is "C'est l'heure du café" in French. Let's call the hidden state for "It's" h1, for "time" h2, for "for" h3, and for "coffee" h4. We will calculate alignment scores step by step for each word in the sentence.
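Before we continue, here is a minimal NumPy sketch of the idea, not the article's exact setup. The hidden state values and the decoder state s_prev are made-up toy vectors, and a plain dot product is used as the similarity score, which is just one common scoring choice (the feedforward scorer described in the next section is another):

```python
import numpy as np

# Toy encoder hidden states for "It's", "time", "for", "coffee".
# The values are invented purely for illustration.
h = np.array([
    [0.9, 0.1, 0.0],   # h1: "It's"
    [0.2, 0.8, 0.1],   # h2: "time"
    [0.1, 0.3, 0.2],   # h3: "for"
    [0.0, 0.2, 0.9],   # h4: "coffee"
])

# Toy decoder hidden state just before predicting the first French word.
s_prev = np.array([1.0, 0.2, 0.1])

# Alignment scores: one similarity value per encoder hidden state.
scores = h @ s_prev
print(scores)  # a higher score = more relevant to the next output word
```

The output is one raw score per input word; the next step is to turn those scores into a probability distribution.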
When translating or predicting a specific word, the decoder examines all of the encoder's hidden states and attempts to determine which English words are most relevant for producing the first word in French, "C'est".

Figure 3 by Author: Adding Attention Mechanism

The scores in green are the normalized alignment scores after applying softmax, and this is what attention is all about. To predict the first word, "C'est", the decoder must decide which English word to focus on more. As we can see in this example, the first output word should focus most on "It's" (probability 0.8). These probabilities are referred to as attention weights, and the weights shown here are for translating the first word. Attention weights are denoted αij, with i indexing the decoder (output word) and j indexing the encoder (input word). The figure above shows the attention weights α1j.

Alignment scores are essentially a scoring system the model uses to determine which words in the input sentence should be prioritized when generating each word in the output sentence. The attention mechanism ensures that the translation is contextually appropriate even when the sentence structure varies between languages.

Now that you have some intuition, let us see how attention weights are calculated.

Figure 4 by Author: Attention weights calculation

The above is a very simple architecture for the attention mechanism. Each encoder hidden state hj represents an input word of the English sentence (h1, h2, h3, and h4), and si-1 is the decoder's previous hidden state. Both are fed into a feedforward neural network, whose outputs are passed through a softmax for weight normalization. To summarize, the softmax converts alignment scores into weights that quantify the importance of each encoder state to the decoder's current state: αij is the attention weight for the jth input word's influence on the ith output word, and hj is the jth encoder hidden state.

These weights are used to build a context vector for the decoder. The context vector for the current output word is computed by multiplying each encoder hidden state hj by its corresponding attention weight αij and summing the results, i.e., ci = Σj αij hj, where ci is the context vector for the ith word of the output sequence.

Figure 5 by Author: Context Vector (https://arxiv.org/pdf/1409.0473.pdf)

Now that we have the context vector, let's call the first one c1 and the initial decoder hidden state s0. We combine the two by concatenation, followed by tanh or another non-linear activation function. This combination provides a rich signal for generating the first word in the target language.

What is our learning here? When translating a sentence from English to French using machine translation, the model does not always translate each word in the correct […]
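The following sketch puts the whole pipeline together: the feedforward (additive) scorer, the softmax, the context vector, and the concatenate-plus-tanh combination. It follows the formulation in the Bahdanau et al. (2014) paper linked above, but the matrices W_a, U_a, v_a, and W_c are random stand-ins here; in a real model they are learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d = 3  # toy hidden size

# Toy encoder hidden states h1..h4 and previous decoder state,
# reusing the made-up values from the earlier sketch.
h = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.1],
    [0.1, 0.3, 0.2],
    [0.0, 0.2, 0.9],
])
s_prev = np.array([1.0, 0.2, 0.1])

# Feedforward (additive) scorer, as in Bahdanau et al. 2014:
# e_ij = v_a . tanh(W_a s_{i-1} + U_a h_j).
W_a = rng.normal(size=(d, d))
U_a = rng.normal(size=(d, d))
v_a = rng.normal(size=d)
scores = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ hj) for hj in h])

# Softmax turns alignment scores into attention weights alpha_1j.
alpha = softmax(scores)

# Context vector c1 = sum_j alpha_1j * h_j (weighted sum of encoder states).
c1 = alpha @ h

# Combine context with the decoder state: concatenate, then tanh.
# W_c is another hypothetical learned matrix.
W_c = rng.normal(size=(d, 2 * d))
combined = np.tanh(W_c @ np.concatenate([c1, s_prev]))

print("attention weights:", alpha)
print("context vector:  ", c1)
print("combined signal: ", combined)
```

With trained weights, alpha would concentrate on the relevant source word (as in the 0.8 on "It's" above), so the context vector c1 would be dominated by that word's hidden state.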