From Pixels to Words: How Do Models Understand? 🤝🤝

Author(s): JAIGANESAN

Originally published on Towards AI.

From the pixels of images to the words of language, explore how multimodal AI models bridge diverse data types through sophisticated embedding communication. 👾

Photo by Andy Kelly on Unsplash

In this article, we'll dive into the world of multimodal models, where numerical representations of different data types come together to achieve a common goal. Specifically, we'll explore how image feature representations are understood through text descriptions. My main objective is to examine how embeddings from different modalities are used to achieve the objective of a model or use case.

As a bilingual person, I can understand and translate both Tamil and English. How can I do this? It's because I've learned the words, meanings, and semantic and syntactic representations of both languages. For me, both languages are just a means of communication, and I can switch between them effortlessly. Similarly, multimodal models learn from modalities like image, text, audio, and video, and everything is represented as vectors, or numerical representations. After all, AI is built on mathematical concepts. AI is not magic, it's math ✌️

If you're unfamiliar with how transformer models work, I recommend checking out my previous article, "Large Language Model (LLM): In and Out," to understand the topic completely.

Large Language Model (LLM) 🤖: In and Out. Delving into the Architecture of LLM: Unraveling the Mechanics Behind Large Language Models like GPT, LLAMA, etc. (pub.towardsai.net)

Let's dive into the world of multi-modality and explore how ML models represent and understand different modalities. In this article, we'll explore three key concepts in multimodal models: 😁

👉 Joint embedding space,
👉 Cross-attention,
👉 Concatenation and fusion.

Let's start with the joint embedding space 🐎

1. Joint Embedding Space: A Shared Space for Images and Text ✌️

Image 1: CLIP architecture. Source: https://arxiv.org/pdf/2103.00020

Image 1, left side → the CLIP architecture, where the text encoder is a transformer encoder and the image encoder is a Vision Transformer. T_1, T_2, T_3, …, T_N represent the text embeddings and I_1, I_2, I_3, …, I_N represent the image embeddings. By changing the last linear layer in the text and image encoders, we get vectors of the same size (e.g., 1024 dimensions). These text and image vectors carry the information of the text and the image, respectively.

You might be aware of OpenAI's CLIP [1] (Contrastive Language-Image Pre-training) model (Image 1), which is built upon the idea of a shared embedding space into which both images and their textual descriptions are projected. This means that each image and its corresponding textual description are mapped to the same embedding space. By mapping images and text into the same space, CLIP enables direct comparison and combination of their representations. Images and their corresponding text are expected to have similar embeddings if they are semantically related (e.g., an image of a car and the text "a car").

Image 2: Joint embedding space at the beginning of CLIP training. Created by the author

Image 3: Joint embedding space after CLIP training. Created by the author

1.1 How CLIP Maps Images and Text to a Shared Embedding Space

CLIP uses a transformer architecture for both the text and image modalities. Text descriptions are converted into embedding vectors by a transformer encoder, while images are converted into embedding vectors by a Vision Transformer (ViT). CLIP is trained with a contrastive learning objective, where the model learns to bring similar image-text pairs closer together in the embedding space while pushing dissimilar pairs farther apart.

The text encoder and the Vision Transformer convert text descriptions and images into numerical representations using their respective architectures. The shared embedding space requires that the text description vectors and the Vision Transformer output vectors have the same dimension.

You might be familiar with cosine similarity, which measures the similarity between two vectors. CLIP uses the same idea, but in the opposite direction: instead of merely measuring similarity, it optimizes the encoders so that matching pairs become similar. Before training, the text description vector and the image feature vector of a matching pair may sit in very different locations in the joint multi-dimensional space, as shown in Image 2. The training objective is to pull the vectors of similar pairs as close together as possible and to push dissimilar pairs apart. We achieve this by labeling similar pairs such as (T_1, I_1), (T_2, I_2), …, (T_N, I_N) with 1, and dissimilar pairs such as (I_1, T_2), (I_1, T_3), (T_1, I_3), … with -1. With this objective and loss function, both the text encoder and the vision encoder are tweaked to produce highly similar vectors for matching image-text pairs, as shown in Image 3. In this way, the model learns the representations of images and text.

This training encourages the model to understand the semantic relationships between images and their associated text. Although the vectors from different modalities may not be identical, they will be very close in the joint embedding space. Once trained, CLIP can perform tasks such as cross-modal retrieval, where it retrieves images based on textual queries or vice versa, and its ability to generalize to unseen data depends on how well it has captured the semantic relationships in the joint embedding space.
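To make the shared space tangible, here is a minimal sketch of querying a pretrained CLIP model. It assumes the Hugging Face transformers library, the openai/clip-vit-base-patch32 checkpoint, and a local file car.jpg; these are my own choices for illustration, not details from the CLIP paper or this article.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load a pretrained CLIP checkpoint (transformer text encoder + ViT image encoder).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("car.jpg")  # hypothetical local image
captions = ["a photo of a car", "a photo of a cat", "a photo of a dog"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Projected vectors in the joint embedding space (same dimension for both modalities).
image_embeds = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
text_embeds = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)

# Cosine similarity between the image and each caption; higher means more similar.
similarity = image_embeds @ text_embeds.T   # shape: (1, 3)
best = similarity.argmax(dim=-1)
print(similarity)
print(captions[best.item()])                # expected: "a photo of a car"
```

Because both encoders project into the same space, a single matrix product gives the image-to-text similarities; retrieving captions for an image (or images for a text query) is simply a matter of ranking these scores.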
2. Cross Attention: Relating Information Between Different Modalities 🐊

2.1 Cross Attention in the Vanilla Transformer

Image 4: Source: Attention Is All You Need research paper. Edited by the author

As we know, the Attention Is All You Need [2] paper clearly illustrates the cross-attention mechanism (Image 4). The vanilla transformer has two components: the encoder and the decoder. The encoder deals with one modality (English), while the decoder deals with another (French). This architecture is essentially a language translator, and we can build models to translate sentences from English to French, or from any language to any other.

Note: During training, the virtual tokens [SOS] and [EOS] in Image 4 are not universally necessary for all types of training. This is because the model learns from the context of the training data and calculates its loss based on the sequence itself. It doesn't need explicit markers to show where […]
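To make the mechanism concrete, here is a minimal sketch of a single-head cross-attention block (my own simplified PyTorch illustration, not code from the paper): the queries come from the decoder states, for example the French tokens generated so far, while the keys and values come from the encoder output, for example the encoded English sentence.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttention(nn.Module):
    """Single-head cross-attention: queries from the decoder,
    keys and values from the encoder (no masking, no multi-head)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model)  # projects decoder states to queries
        self.w_k = nn.Linear(d_model, d_model)  # projects encoder states to keys
        self.w_v = nn.Linear(d_model, d_model)  # projects encoder states to values
        self.scale = d_model ** 0.5

    def forward(self, decoder_states, encoder_states):
        # decoder_states: (batch, target_len, d_model)
        # encoder_states: (batch, source_len, d_model)
        q = self.w_q(decoder_states)
        k = self.w_k(encoder_states)
        v = self.w_v(encoder_states)

        # How much each target position attends to each source position.
        scores = q @ k.transpose(-2, -1) / self.scale   # (batch, target_len, source_len)
        weights = F.softmax(scores, dim=-1)
        return weights @ v                              # (batch, target_len, d_model)

# Toy usage: batch of 1, source length 7, target length 5, model width 16.
cross_attn = CrossAttention(d_model=16)
encoder_out = torch.randn(1, 7, 16)   # stand-in for encoded source tokens
decoder_in = torch.randn(1, 5, 16)    # stand-in for decoder-side states
print(cross_attn(decoder_in, encoder_out).shape)  # torch.Size([1, 5, 16])
```

In a multimodal model, the encoder states could just as well be image-patch embeddings from a vision encoder; the decoder's text tokens would then attend to them in exactly the same way.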
