Exploring the Landscape of Generative AI and the Emergence of Multimodal AI
Being in the data science field means you live and breathe AI, data, and LLMs (Large Language Models), or at least know the hype and buzz around them. A few years back, my girlfriend asked me what Gen AI and ChatGPT were, since she saw them everywhere on social media and the web. I jokingly replied, “You’d better catch the AI train, or your job will be replaced by these AI models.” Remembering that today, I thought I would check Google Trends for the keywords Generative AI, Multimodal AI, and LLMs.
The trends (over the last two years, in this case) reveal a steady rise in interest in generative AI, reflecting its growing impact and the popularity of tools like OpenAI’s DALL-E and ChatGPT, and Google’s Gemini. Multimodal AI shows significant spikes, indicating bursts of innovation and public fascination with its ability to integrate various data types. LLMs maintain consistently high interest, underscoring their foundational role in advancing AI capabilities. Together, these trends highlight the dynamic and rapidly evolving landscape of AI technologies, which continue to drive innovation and capture widespread attention.
But what exactly is generative AI, and why has it captured so much interest? Let’s delve deeper.
Introduction to Generative AI
Generative AI refers to artificial intelligence systems that can create new content, such as writing text, drawing images, composing music, and more, by learning from the examples they have been trained on. Over the past few years, generative AI has advanced from producing simple outputs to highly sophisticated and realistic content. This progress has been driven by innovations in deep learning and neural networks, leading to models like OpenAI’s GPT series for text and DALL-E, which can generate detailed and complex images from textual descriptions.
However, generative AI is not a new concept — it has been around for decades. One of the earliest examples is ELIZA, created in 1966 by Joseph Weizenbaum. ELIZA was a simple chatbot that could mimic human conversation by matching patterns in the input text and generating appropriate responses. Another early example is AARON, developed in the 1970s by Harold Cohen, which could create original artworks autonomously. These early systems were rudimentary compared to today’s standards but were groundbreaking at the time, showing the potential of AI to generate content.
In the 1980s, David Cope’s Experiments in Musical Intelligence (EMI) composed music that mimicked the style of famous classical composers. Together with ELIZA and AARON, it showed that AI could simulate conversation, create visual art, and generate music. These early generative AI systems paved the way for the sophisticated models we have today, highlighting the long-standing interest and progress in this field.
The Evolution to Multimodal AI
Multimodal AI is the next step in the evolution of generative AI. It refers to AI systems capable of simultaneously understanding and generating content across multiple data types, such as text, images, audio, and video. Integrating various modalities allows for more comprehensive and contextually aware outputs, enhancing the AI’s versatility and application scope.
Initially, generative AI systems were primarily focused on a single type of data, such as text, images, or audio. For example, early models like GPT-2 and GPT-3 excelled in text generation, while models like DALL-E specialized in generating images from textual descriptions.
As AI technology progressed, researchers recognized the potential of combining multiple data types to create more sophisticated and contextually aware systems. This realization led to the development of multimodal AI, which can simultaneously understand and generate content across various modalities, including text, images, audio, and video.
The integration of different modalities allows multimodal AI to perform tasks that were previously unattainable with unimodal systems. For instance, a multimodal AI can generate a descriptive paragraph based on an image, create a video from a script, or produce music that complements a video clip. This capability enhances the AI’s versatility and broadens its application scope in fields such as entertainment, education, healthcare, and more.
One of the key breakthroughs in multimodal AI was the development of models like CLIP (Contrastive Language-Image Pre-Training) by OpenAI. These models can understand and relate text and images, paving the way for more integrated and interactive AI systems. Additionally, models like GPT-4 and GPT-4o have pushed the boundaries further by handling multiple data types within a single framework, providing more seamless and comprehensive outputs.
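To make the text-and-image idea concrete, here is a minimal sketch of how a CLIP-style model can score how well a few captions match an image. It assumes the Hugging Face transformers and Pillow libraries and the publicly released openai/clip-vit-base-patch32 checkpoint; the blank placeholder image and the captions are purely illustrative.

```python
# Minimal sketch: scoring image-text similarity with a public CLIP checkpoint.
# Assumes the Hugging Face `transformers` and `Pillow` libraries are installed.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Placeholder image; in practice you would load your own photo here.
image = Image.new("RGB", (224, 224), color="white")
captions = [
    "a dog running on a beach",
    "a bowl of fruit on a table",
    "a city skyline at night",
]

# The processor tokenizes the text and resizes/normalizes the image.
inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image holds the similarity of the image to each caption;
# softmax turns the scores into a probability-like ranking.
probs = outputs.logits_per_image.softmax(dim=-1)
for caption, p in zip(captions, probs[0].tolist()):
    print(f"{p:.2f}  {caption}")
```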
Another notable breakthrough in the field is Google’s Gemini, which was built from the ground up to be a multimodal model. Gemini can process and combine various types of information, including text, images, audio, video, and code. This capability allows it to perform tasks such as generating music based on visual prompts or assessing homework with a nuanced understanding of different data types (AI and Gemini featured heavily at Google I/O 2024).
Understanding Multimodal AI
Multimodal AI is like a digital juggler, deftly handling multiple types of data — text, images, audio, and video — all at once to create a unified understanding and output. This digital juggling act is called data fusion, and it can be done at different stages of the process: early, mid, or late fusion. Let’s break it down in a fun and simple way.
Early Fusion: The Smoothie Approach
Think of early fusion as throwing all your ingredients — bananas, strawberries, spinach — into a blender right from the start. Here, multimodal AI combines the different data types at the very beginning and processes the mixed data as a single unit, like a smoothie, so the modalities are integrated from the get-go.
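As a rough illustration, here is a minimal PyTorch sketch of early fusion. It assumes we already have fixed-size feature vectors for each modality (say, 512-dimensional text and image embeddings); the layer sizes and class count are arbitrary placeholders, not a reference implementation.

```python
import torch
import torch.nn as nn

class EarlyFusionModel(nn.Module):
    """Early fusion: concatenate the modalities up front (blend the smoothie),
    then let a single network process the mixed representation."""
    def __init__(self, text_dim=512, image_dim=512, num_classes=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(text_dim + image_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, text_feats, image_feats):
        fused = torch.cat([text_feats, image_feats], dim=-1)  # mix at the start
        return self.net(fused)

# Hypothetical pre-computed embeddings for a batch of 4 examples.
text_feats = torch.randn(4, 512)
image_feats = torch.randn(4, 512)
logits = EarlyFusionModel()(text_feats, image_feats)  # shape: (4, 10)
```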
Mid-Fusion: The Stir Fry Method
Mid-fusion is akin to making a stir fry. You start by cooking each ingredient — chicken, veggies, sauce — separately until they’re halfway done, then mix them all together to finish cooking. In multimodal AI, this means processing each data type separately at first and then combining them during the intermediate processing stage. This method allows initial specialization before integrating the data for a more nuanced result.
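Continuing the same sketchy PyTorch setup, mid-fusion can look like the following: each modality gets its own small encoder, and the intermediate features are concatenated and processed together. Again, the dimensions and layers are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class MidFusionModel(nn.Module):
    """Mid fusion: each modality is partially processed by its own encoder
    (cook the ingredients separately), then the intermediate features are
    combined and processed together to finish the job."""
    def __init__(self, text_dim=512, image_dim=512, hidden=128, num_classes=10):
        super().__init__()
        self.text_encoder = nn.Sequential(nn.Linear(text_dim, hidden), nn.ReLU())
        self.image_encoder = nn.Sequential(nn.Linear(image_dim, hidden), nn.ReLU())
        self.head = nn.Sequential(
            nn.Linear(hidden * 2, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, text_feats, image_feats):
        t = self.text_encoder(text_feats)    # modality-specific processing
        v = self.image_encoder(image_feats)
        fused = torch.cat([t, v], dim=-1)    # combine midway through
        return self.head(fused)

logits = MidFusionModel()(torch.randn(4, 512), torch.randn(4, 512))  # (4, 10)
```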
Late Fusion: The Dinner Plate Strategy
Imagine late fusion as serving a dinner plate where each item — steak, mashed potatoes, and green beans — is cooked independently and only comes together at the end, right before serving. For multimodal AI, this means processing text, images, audio, and video separately and combining their outputs at the final stage. Each modality is fully processed on its own, and the integration happens at the end to provide a comprehensive final output.
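In the same illustrative spirit, late fusion keeps a separate model per modality and only merges their final predictions, here by simple averaging; weighted averaging or voting are common alternatives. The shapes and layer sizes are placeholders.

```python
import torch
import torch.nn as nn

class LateFusionModel(nn.Module):
    """Late fusion: each modality is processed end-to-end by its own model,
    and only the final predictions are combined (plated together at the end)."""
    def __init__(self, text_dim=512, image_dim=512, num_classes=10):
        super().__init__()
        self.text_model = nn.Sequential(
            nn.Linear(text_dim, 128), nn.ReLU(), nn.Linear(128, num_classes))
        self.image_model = nn.Sequential(
            nn.Linear(image_dim, 128), nn.ReLU(), nn.Linear(128, num_classes))

    def forward(self, text_feats, image_feats):
        text_logits = self.text_model(text_feats)     # independent prediction
        image_logits = self.image_model(image_feats)  # independent prediction
        return (text_logits + image_logits) / 2       # combine only at the end

logits = LateFusionModel()(torch.randn(4, 512), torch.randn(4, 512))  # (4, 10)
```

In practice, the choice between these three strategies trades off how much cross-modal interaction the model can learn against how easy it is to train, debug, and swap out the pipeline for each individual modality.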
Wrapping Up
Multimodal AI’s ability to fuse different data types means it can offer richer and more contextually aware outputs. It’s like having a personal assistant who can read your emails and interpret your facial expressions during a video call, listen to the tone of your voice, and even understand the charts you’ve drawn on a whiteboard. Integrating all these modalities makes multimodal AI systems more versatile and capable, much like a Swiss Army knife of AI capabilities.
So, next time you see an AI marvel effortlessly juggling various data types, remember it’s all thanks to the clever process of data fusion — whether it’s blending, stirring, or plating, it’s all about integrating different flavours of data into one harmonious dish!