From Pixels to Words: How Do Models Understand? 🤝🤝

Author(s): JAIGANESAN

Originally published on Towards AI.

From the pixels of images to the words of language, explore how multimodal AI models bridge diverse data types through sophisticated embedding communication. 👾

Photo by Andy Kelly on Unsplash

In this article, we'll dive into the world of multimodal models, where numerical representations of different data types come together to achieve a common goal. Specifically, we'll explore how image feature representations are understood through text descriptions. My main objective is to examine how embeddings from different modalities are used to achieve the objective of a model or use case.

As a bilingual person, I can understand and translate both Tamil and English. How can I do this? It's because I've learned the words, meanings, and semantic and syntactic representations of both languages. For me, both languages are just a means of communication, and I can switch between them effortlessly. Similarly, multimodal models learn from modalities like image, text, audio, and video, and everything is represented as vectors, or numerical representations. After all, AI is built on mathematical concepts. AI is not magic, it's math ✌️

If you're unfamiliar with how transformer models work, I recommend checking out my previous article, "Large Language Model (LLM): In and Out," to understand the topic completely.

Large Language Model (LLM) 🤖: In and Out. Delving into the Architecture of LLM: Unraveling the Mechanics Behind Large Language Models like GPT, LLAMA, etc. (pub.towardsai.net)

Let's dive into the world of multi-modality and explore how ML models represent and understand different modalities. In this article, we'll explore three key concepts in multimodal models: 😁

👉 Joint embedding space,
👉 Cross-attention,
👉 Concatenation and fusion.

Let's start with the joint embedding space 🐎

1. Joint Embedding Space: A Shared Space for Images and Text ✌️

Image 1: CLIP architecture. Source: https://arxiv.org/pdf/2103.00020

Image 1, left side → the CLIP architecture, where the text encoder is a transformer encoder and the image encoder is a Vision Transformer. T_1, T_2, T_3, …, T_N represent the text embeddings and I_1, I_2, I_3, …, I_N represent the image embeddings. By changing the last linear layer in the text and image encoders, we get vectors of the same size (e.g., 1024 dimensions). These text and image vectors carry the information of the text and the image, respectively.

You might be aware of OpenAI's CLIP [1] (Contrastive Language-Image Pre-training) model (Image 1), which is built upon the idea of a shared embedding space into which both images and their textual descriptions are projected. This means that each image and its corresponding textual description are mapped to the same embedding space. By mapping images and text into the same space, CLIP enables direct comparison and combination of their representations. Images and their corresponding text are expected to have similar embeddings if they are semantically related (e.g., an image of a car and the text "a car").

Image 2: Joint embedding space at the beginning of CLIP training. Created by the author

Image 3: Joint embedding space after CLIP training. Created by the author

1.1 How CLIP Maps Images and Text to a Shared Embedding Space

CLIP uses a transformer architecture for both the text and image modalities. Text descriptions are converted into embedding vectors by a transformer encoder, while images are converted into embedding vectors by a Vision Transformer (ViT). CLIP is trained with a contrastive learning objective, where the model learns to bring similar image-text pairs closer together in the embedding space while pushing dissimilar pairs farther apart.

The text encoder and the Vision Transformer convert text descriptions and images into numerical representations using their respective architectures. The shared embedding space requires that the text description vectors and the Vision Transformer output vectors have the same dimension.

You might be familiar with cosine similarity, which measures the similarity between two vectors. CLIP uses the same idea, but in the opposite direction: instead of merely measuring similarity, it optimizes the encoders so that matching pairs become similar. Before training, the text description vector and the image feature vector of a matching pair may sit in very different locations in the joint multi-dimensional space, as shown in Image 2. The training objective is to pull the vectors of similar pairs as close together as possible and to push dissimilar pairs apart. We achieve this by labeling similar pairs such as (T_1, I_1), (T_2, I_2), …, (T_N, I_N) with 1, and dissimilar pairs such as (I_1, T_2), (I_1, T_3), (T_1, I_3), … with -1. With this objective and loss function, both the text encoder and the vision encoder are tweaked to produce highly similar vectors for matching image-text pairs, as shown in Image 3. In this way, the model learns the representations of images and text.

This training encourages the model to understand the semantic relationships between images and their associated text. Although the vectors from different modalities may not be identical, they will be very close in the joint embedding space. Once trained, CLIP can perform tasks such as cross-modal retrieval, where it retrieves images based on textual queries or vice versa, and its ability to generalize to unseen data depends on how well it has captured the semantic relationships in the joint embedding space.
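To make the shared space tangible, here is a minimal sketch of querying a pretrained CLIP model. It assumes the Hugging Face transformers library, the openai/clip-vit-base-patch32 checkpoint, and a local file car.jpg; these are my own choices for illustration, not details from the CLIP paper or this article.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load a pretrained CLIP checkpoint (transformer text encoder + ViT image encoder).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("car.jpg")  # hypothetical local image
captions = ["a photo of a car", "a photo of a cat", "a photo of a dog"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Projected vectors in the joint embedding space (same dimension for both modalities).
image_embeds = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
text_embeds = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)

# Cosine similarity between the image and each caption; higher means more similar.
similarity = image_embeds @ text_embeds.T   # shape: (1, 3)
best = similarity.argmax(dim=-1)
print(similarity)
print(captions[best.item()])                # expected: "a photo of a car"
```

Because both encoders project into the same space, a single matrix product gives the image-to-text similarities; retrieving captions for an image (or images for a text query) is simply a matter of ranking these scores.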
2. Cross Attention: Relating Information Between Different Modalities 🐊

2.1 Cross Attention in the Vanilla Transformer

Image 4: Source: Attention Is All You Need research paper. Edited by the author

As we know, the Attention Is All You Need [2] paper clearly illustrates the cross-attention mechanism (Image 4). The vanilla transformer has two components: the encoder and the decoder. The encoder deals with one modality (English), while the decoder deals with another (French). This architecture is essentially a language translator, and we can build models to translate sentences from English to French, or from any language to any other.

Note: During training, the virtual tokens [SOS] and [EOS] in Image 4 are not universally necessary for all types of training. This is because the model learns from the context of the training data and calculates its loss based on the sequence itself. It doesn't need explicit markers to show where […]
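To make the mechanism concrete, here is a minimal sketch of a single-head cross-attention block (my own simplified PyTorch illustration, not code from the paper): the queries come from the decoder states, for example the French tokens generated so far, while the keys and values come from the encoder output, for example the encoded English sentence.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttention(nn.Module):
    """Single-head cross-attention: queries from the decoder,
    keys and values from the encoder (no masking, no multi-head)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model)  # projects decoder states to queries
        self.w_k = nn.Linear(d_model, d_model)  # projects encoder states to keys
        self.w_v = nn.Linear(d_model, d_model)  # projects encoder states to values
        self.scale = d_model ** 0.5

    def forward(self, decoder_states, encoder_states):
        # decoder_states: (batch, target_len, d_model)
        # encoder_states: (batch, source_len, d_model)
        q = self.w_q(decoder_states)
        k = self.w_k(encoder_states)
        v = self.w_v(encoder_states)

        # How much each target position attends to each source position.
        scores = q @ k.transpose(-2, -1) / self.scale   # (batch, target_len, source_len)
        weights = F.softmax(scores, dim=-1)
        return weights @ v                              # (batch, target_len, d_model)

# Toy usage: batch of 1, source length 7, target length 5, model width 16.
cross_attn = CrossAttention(d_model=16)
encoder_out = torch.randn(1, 7, 16)   # stand-in for encoded source tokens
decoder_in = torch.randn(1, 5, 16)    # stand-in for decoder-side states
print(cross_attn(decoder_in, encoder_out).shape)  # torch.Size([1, 5, 16])
```

In a multimodal model, the encoder states could just as well be image-patch embeddings from a vision encoder; the decoder's text tokens would then attend to them in exactly the same way.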
