Author(s): Ingo Nowitzky

Originally published on Towards AI.

For the past two years, ChatGPT and Large Language Models (LLMs) in general have been the big thing in artificial intelligence. Many articles about how to use them, prompt engineering, and the logic behind them have been published. Nevertheless, when I started familiarizing myself with the algorithm of LLMs, the so-called transformer, I had to go through many different sources to feel like I really understood the topic.

In this article, I want to summarize my understanding of Large Language Models. I will explain conceptually how LLMs calculate their responses step by step, go deep into the attention mechanism, and demonstrate the inner workings in a code example. So, let’s get started!

Table of contents

Part 1: Concept of transformers
1.1 Introduction to Transformers
1.2 Tokenization
1.3 Word Embedding
1.4 Positional Encoding
1.5 Attention Mechanism
1.6 Layer Norm
1.7 Feed Forward
1.8 Softmax
1.9 Multinomial

Part 2: Implementation in code
2.1 Data Preparation
2.2 Tokenization
2.3 Data Feeder Function
2.4 Attention Head
2.5 Multi-head Attention
2.6 Feed Forward of Attention Block
2.7 Attention Block
2.8 Transformer Class
2.9 Instantiate the Transformer
2.10 Model Training
2.11 Generate new Tokens

Part 1: Concept of Transformers

1.1 Introduction to Transformers

We cannot discuss the topic of Large Language Models without citing the famous paper “Attention Is All You Need”, published by Vaswani et al. in 2017. In this paper, the group of researchers introduced the attention mechanism and the transformer architecture that sparked the revolution in generative AI we experience today. Originally, the paper addressed machine translation and introduced an encoder-decoder structure.

Fig. 1.1.1: Transformer architecture introduced by Vaswani et al. | left original, right with explanation by author

Fig. 1.1.1 shows, on the left, the transformer as published in the paper; on the right, I have marked the encoder and the decoder parts. In machine translation, the source language is encoded by the encoder and decoded into the target language by the decoder. In contrast, ChatGPT has a decoder-only architecture. Therefore, in the following, we will ignore the left side and fully concentrate on the decoder.

Before I start explaining the transformer, we need to recall that ChatGPT generates its output in a loop, one token after the other. Let’s assume we input the words “ChatGPT writes…” (yes, I know this context is unrealistically short). ChatGPT might output the token “…one” in the first cycle. The initial words plus the first output build the context for the second generation cycle, so the input is “ChatGPT writes one…”. Now, ChatGPT might output “…word”, which is concatenated to the existing context and fed in again. This loop continues until the generated output is a stop token, which indicates that the response has reached its end and the generation loop pauses until the next user interaction.

Fig. 1.1.2: ChatGPT generates its output in a loop, one token after the other | image by author
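To make this generation loop concrete before we open the box, here is a minimal Python sketch. The function names, the stop token string, and the scripted toy “model” are assumptions for illustration only; the real next-token prediction is the transformer pass that the rest of this article explains.

```python
# Minimal sketch of the autoregressive generation loop from Fig. 1.1.2.
# The names and the toy "model" are illustrative assumptions only.

STOP_TOKEN = "<stop>"

def toy_next_token(context: list[str]) -> str:
    """Stand-in for the transformer: replays the example from Fig. 1.1.2."""
    script = {2: "one", 3: "word", 4: STOP_TOKEN}
    return script.get(len(context), STOP_TOKEN)

def generate(context: list[str], max_new_tokens: int = 10) -> list[str]:
    for _ in range(max_new_tokens):
        next_token = toy_next_token(context)   # one full transformer pass
        if next_token == STOP_TOKEN:           # stop token ends the loop
            break
        context = context + [next_token]       # output becomes new input
    return context

print(generate(["ChatGPT", "writes"]))  # ['ChatGPT', 'writes', 'one', 'word']
```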
Now, the big question is: what happens inside the magic box denoted “ChatGPT” in Fig. 1.1.2? How does the algorithm conclude which token to output next? This is exactly the question we will answer in this article.

Fig. 1.1.3 shows the processing steps of the transformer in sequence and is an alternative illustration to the one in the original paper (Fig. 1.1.1). I prefer this image because it allows me to better structure the explanation.

Fig. 1.1.3: Transformer architecture as used in ChatGPT | image by author

In Fig. 1.1.3, we see the input to the transformer on the bottom left (the token sequence “ChatGPT writes…”) and the output of the transformer on the top right, which is “…one”. What happens between input and output?

On the left side of Fig. 1.1.3, we find some preprocessing steps: tokenization, word embedding, and positional encoding. We will study these steps right after this introduction. In the middle, we see the so-called attention block. This is where the context of the words and sentences is processed. The attention block is the magic of ChatGPT and the reason why the bot’s outputs are so convincing. On the right side of Fig. 1.1.3, we see that the output of the attention block is normalized (“Layer Norm”), fed into a neural network (“Feed Forward”), softmaxed, and finally run through a multinomial distribution. Later, we will see that with these four steps we calculate, for every token in our vocabulary, the probability of being the next output, and that we sample the actual output from the multinomial distribution according to those probabilities. But be patient; we will study this in the required detail later in the article. For now, we accept that the output of this process is the token “one”. With this overview in mind, let us go through the processing steps one by one in the next chapters.

1.2 Tokenization

Tokens are the basic building blocks for text processing in Large Language Models. The process of splitting text into tokens is called tokenization. Depending on the tokenization model, the resulting tokens can look quite different: some models split text into words, others into subwords or characters. Independent of the granularity, tokenization models also include punctuation marks and special tokens like <start> and <stop> that control the LLM’s response to a user interaction. The basic idea of tokenization is to split the processed text into a potentially large but limited number of tokens the LLM knows.

Fig. 1.2.1: Tokenization | image by author

Fig. 1.2.1 shows a simple example. The context “Let’s go in the garden” is split into the seven tokens “let”, “ ’ ”, “s”, “go”, “in”, “the”, “garden”. These tokens are known to the LLM and are represented by internal numbers for further processing. Vice versa, when the LLM produces its output, it determines the next token through probabilities and composes the output sentences from the tokens of several generation cycles.
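To illustrate the idea, here is a minimal word-level tokenizer for exactly this example. The vocabulary, the regular expression, and the token IDs are made up for this sketch; real LLMs typically use learned subword tokenizers such as byte pair encoding, and Part 2 of this article returns to tokenization in code.

```python
import re

# Toy word-level tokenizer for the example from Fig. 1.2.1.
# Vocabulary and token IDs are made up for illustration only.
vocab = ["let", "'", "s", "go", "in", "the", "garden"]
token_to_id = {tok: i for i, tok in enumerate(vocab)}
id_to_token = {i: tok for tok, i in token_to_id.items()}

def tokenize(text: str) -> list[int]:
    # Split into lowercase words and single punctuation marks.
    tokens = re.findall(r"[a-zA-Z]+|[^\sa-zA-Z]", text.lower())
    return [token_to_id[tok] for tok in tokens]

ids = tokenize("Let's go in the garden")
print(ids)                            # [0, 1, 2, 3, 4, 5, 6]
print([id_to_token[i] for i in ids])  # ['let', "'", 's', 'go', 'in', 'the', 'garden']
```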
1.3 Word Embedding

So far, we have seen that the tokenizer splits the input sentences into tokens. Next, word embedding translates the tokens into large vectors with usually several hundred or several thousand dimensions, depending on the chosen model. […]
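As a small sketch of what such an embedding lookup can look like, the snippet below uses PyTorch’s nn.Embedding to map token IDs to dense vectors. The vocabulary size and embedding dimension are arbitrary example values, not those of any particular model.

```python
import torch
import torch.nn as nn

vocab_size = 7       # the seven tokens from the toy example above
embedding_dim = 64   # real models use hundreds or thousands of dimensions

# A trainable lookup table: one vector per token in the vocabulary.
embedding = nn.Embedding(vocab_size, embedding_dim)

token_ids = torch.tensor([0, 1, 2, 3, 4, 5, 6])  # "let", "'", "s", ...
vectors = embedding(token_ids)
print(vectors.shape)  # torch.Size([7, 64]): one 64-dim vector per token
```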