
Multi-Modal AI Models Complete Guide: GPT-4V, Claude 3, Gemini and Beyond

Introduction

The next evolution in artificial intelligence isn't just about better text processing; it's about AI that can see, hear, and understand the world as humans do. Multi-modal AI models can process and generate text, images, audio, and video, unlocking applications that text-only models cannot address.

In this comprehensive guide, we’ll explore everything about multi-modal AI: how these models work, what they can do, how to use them, and how to build multi-modal applications.


What is Multi-Modal AI?

Understanding Multi-Modality

Multi-modal AI refers to models that can process and generate multiple types of data:

| Modality | Input | Output |
|----------|-------|--------|
| Text     | ✅    | ✅     |
| Images   | ✅    | ✅     |
| Audio    | ✅    | ✅     |
| Video    | ✅    | ✅     |
| Code     | ✅    | ✅     |

Why Multi-Modal Matters

| Capability             | Single-Modal | Multi-Modal |
|------------------------|--------------|-------------|
| Image Analysis         | ❌           | ✅          |
| Voice Conversation     | ❌           | ✅          |
| Document Understanding | Limited      | Full        |
| Cross-Modal Reasoning  | ❌           | ✅          |

Leading Multi-Modal Models

GPT-4V (Vision)

OpenAI’s vision capabilities:

from openai import OpenAI

client = OpenAI()

# Image understanding
response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://example.com/image.jpg"
                }
            }
        ]
    }]
)

print(response.choices[0].message.content)
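Public URLs aren't the only option: the same `image_url` field also accepts base64 data URLs, which is handy for local files. A minimal helper (the function name is my own):

```python
import base64

def to_data_url(path_or_bytes, mime="image/jpeg"):
    """Encode an image file or raw bytes as a data URL
    suitable for the image_url field above."""
    if isinstance(path_or_bytes, bytes):
        data = path_or_bytes
    else:
        with open(path_or_bytes, "rb") as f:
            data = f.read()
    b64 = base64.b64encode(data).decode("utf-8")
    return f"data:{mime};base64,{b64}"
```

Usage: `{"type": "image_url", "image_url": {"url": to_data_url("photo.jpg")}}`.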

Claude 3 (Vision)

Anthropic’s vision models:

import base64
import anthropic

client = anthropic.Anthropic()

# base64_image: a base64-encoded JPEG string, e.g.
# base64_image = base64.b64encode(open("photo.jpg", "rb").read()).decode()

response = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/jpeg",
                    "data": base64_image
                }
            },
            {
                "type": "text",
                "text": "Describe this image in detail."
            }
        ]
    }]
)

Gemini 1.5 Pro

Google’s multi-modal powerhouse:

import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

model = genai.GenerativeModel('gemini-1.5-pro')

# Multi-turn with images
response = model.generate_content([
    "What's the difference between these two charts?",
    {"mime_type": "image/jpeg", "data": image1_bytes},
    {"mime_type": "image/jpeg", "data": image2_bytes}
])

print(response.text)

Image Understanding

Analyzing Screenshots

# Analyze a screenshot
response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What UI elements do you see in this screenshot?"
            },
            {
                "type": "image_url",
                "image_url": {"url": screenshot_url}
            }
        ]
    }]
)

Extracting Data from Documents

# Extract structured data from image
response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": """Extract the following fields from this invoice:
                - Invoice Number
                - Date
                - Total Amount
                - Vendor Name
                - Line Items (description, quantity, price)"""
            },
            {
                "type": "image_url",
                "image_url": {"url": invoice_image}
            }
        ]
    }]
)
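The reply above arrives as free-form text. A common pattern is to ask the model to answer in JSON and then extract it defensively, since models sometimes wrap JSON in prose or code fences. A sketch (the helper name is my own):

```python
import json
import re

def parse_json_reply(text):
    """Pull the first JSON object out of a model reply,
    tolerating surrounding prose or ```json fences."""
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if not match:
        raise ValueError("no JSON object found in reply")
    return json.loads(match.group(0))
```

Pairing this with a prompt like "Respond with a single JSON object" makes the extraction step far more reliable.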

Code Generation from Screenshots

# Generate code from wireframe
response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Generate HTML and CSS for this UI design"
            },
            {
                "type": "image_url",
                "image_url": {"url": wireframe_image}
            }
        ]
    }]
)

Image Generation

DALL-E 3

OpenAI’s image generator:

# Generate image from text
response = client.images.generate(
    model="dall-e-3",
    prompt="A futuristic city with flying cars and neon lights, cinematic view",
    size="1024x1024",
    quality="standard",
    n=1
)

image_url = response.data[0].url

Stable Diffusion

Open-source alternative:

# Using Diffusers library
from diffusers import StableDiffusionPipeline
import torch

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

image = pipe(
    prompt="An astronaut riding a horse in space",
    num_inference_steps=25
).images[0]

Midjourney (via API)

# Note: Midjourney has no official public API; this sketch assumes a
# third-party proxy service with a hypothetical endpoint
import requests

response = requests.post(
    "https://api.midjourney.com/v1/imagine",
    headers={"Authorization": f"Bearer {api_key}"},
    json={
        "prompt": "cyberpunk city at night, neon lights",
        "version": "v6"
    }
)

image_url = response.json()["image_url"]

Video Understanding

Processing Video Frames

import base64

import cv2
import numpy as np

def extract_frames(video_path, max_frames=16):
    """Extract evenly spaced frames from video"""
    cap = cv2.VideoCapture(video_path)
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    
    frame_indices = np.linspace(
        0, total_frames - 1, max_frames, dtype=int
    )
    
    frames = []
    for idx in frame_indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ret, frame = cap.read()
        if ret:
            # Convert to base64
            _, buffer = cv2.imencode('.jpg', frame)
            frames.append(base64.b64encode(buffer).decode())
    
    cap.release()
    return frames
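The sampling logic in `extract_frames` can be isolated as a pure function, which makes the even-spacing behavior easy to verify without OpenCV (a hypothetical equivalent of the `np.linspace` call):

```python
def frame_indices(total_frames, max_frames=16):
    """Evenly spaced frame indices, always including the first and last frame."""
    if total_frames <= max_frames:
        return list(range(total_frames))
    step = (total_frames - 1) / (max_frames - 1)
    return [round(i * step) for i in range(max_frames)]
```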

Analyzing Video Content

# Analyze extracted frames
def analyze_video(video_path):
    frames = extract_frames(video_path)
    
    # Send frames to GPT-4V
    content = [{"type": "text", 
                "text": "Describe what's happening in this video"}]
    
    for frame in frames:
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{frame}"}
        })
    
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": content}]
    )
    
    return response.choices[0].message.content

Audio Processing

Whisper for Transcription

import whisper

model = whisper.load_model("large-v3")

# Transcribe audio
result = model.transcribe(
    "audio_file.mp3",
    language="en",
    task="transcribe"
)

print(result["text"])
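Beyond `result["text"]`, Whisper's `transcribe` also returns a `segments` list with `start`/`end` timestamps. A small sketch for turning segments into caption-style lines (helper names are my own):

```python
def format_timestamp(seconds):
    """Format seconds as HH:MM:SS.mmm, as used in caption files."""
    h = int(seconds // 3600)
    m = int(seconds % 3600 // 60)
    s = seconds % 60
    return f"{h:02d}:{m:02d}:{s:06.3f}"

def segments_to_lines(segments):
    """Render Whisper segments as timestamped text lines."""
    return [
        f"[{format_timestamp(seg['start'])} -> {format_timestamp(seg['end'])}] "
        f"{seg['text'].strip()}"
        for seg in segments
    ]
```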

Audio Understanding with GPT-4o

# Using GPT-4o for audio understanding
# Note: Requires OpenAI's audio API

response = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "input_audio",
                "input_audio": {
                    "data": audio_base64,
                    "format": "wav"
                }
            },
            {
                "type": "text",
                "text": "What's this audio about?"
            }
        ]
    }]
)

Building Multi-Modal Apps

Complete Chat App

class MultimodalChatbot:
    def __init__(self):
        self.client = OpenAI()
    
    def chat(self, message, attachments=None):
        # Build content array
        content = [{"type": "text", "text": message}]
        
        # Add images
        if attachments:
            for att in attachments:
                if att.type == "image":
                    content.append({
                        "type": "image_url",
                        "image_url": {"url": att.url}
                    })
        
        response = self.client.chat.completions.create(
            model="gpt-4-turbo",
            messages=[{"role": "user", "content": content}]
        )
        
        return response.choices[0].message.content

Document Processing Pipeline

class DocumentProcessor:
    def __init__(self):
        self.client = OpenAI()
    
    def process_document(self, file_path):
        # Convert PDF page to image (requires pdf2image)
        images = convert_pdf_to_images(file_path)
        
        all_text = []
        for i, img in enumerate(images):
            # Analyze each page
            response = self.client.chat.completions.create(
                model="gpt-4-turbo",
                messages=[{
                    "role": "user",
                    "content": [
                        {
                            "type": "text",
                            "text": f"Extract all text from this document page {i+1}"
                        },
                        {
                            "type": "image_url",
                            "image_url": {"url": img}
                        }
                    ]
                }]
            )
            all_text.append(response.choices[0].message.content)
        
        return "\n\n".join(all_text)

Use Cases

1. Product Identification

# Product identification
def identify_product(image_url):
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", 
                 "text": "What product is in this image? "
                         "Provide name, brand, and where to buy."},
                {"type": "image_url", 
                 "image_url": {"url": image_url}}
            ]
        }]
    )
    return response.choices[0].message.content

2. Accessibility

# Describe images for visually impaired
def describe_for_accessibility(image_url):
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Provide a detailed, accessible description "
                         "of this image for someone who cannot see it."},
                {"type": "image_url",
                 "image_url": {"url": image_url}}
            ]
        }]
    )
    return response.choices[0].message.content

3. Quality Control

# Product defect detection
def check_quality(product_image):
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Analyze this product image for defects. "
                         "List any issues you find."},
                {"type": "image_url",
                 "image_url": {"url": product_image}}
            ]
        }]
    )
    return response.choices[0].message.content

Best Practices

Image Handling

# Optimize images for API
import base64
import io

from PIL import Image

def optimize_for_api(image_path, max_size=(1024, 1024)):
    img = Image.open(image_path)
    
    # Downscale (aspect-preserving) if either dimension exceeds the limit
    if img.size[0] > max_size[0] or img.size[1] > max_size[1]:
        img.thumbnail(max_size, Image.LANCZOS)
    
    # Convert to RGB
    if img.mode != 'RGB':
        img = img.convert('RGB')
    
    # Save as JPEG
    buffer = io.BytesIO()
    img.save(buffer, format='JPEG', quality=85)
    
    return base64.b64encode(buffer.getvalue()).decode()
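The aspect-preserving downscale that `thumbnail` performs can be sanity-checked with plain arithmetic; a hypothetical helper approximating its target size:

```python
def fit_within(size, max_size=(1024, 1024)):
    """Target dimensions after an aspect-preserving downscale
    (no-op when the image already fits)."""
    w, h = size
    scale = min(max_size[0] / w, max_size[1] / h, 1.0)
    return (int(w * scale), int(h * scale))
```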

Token Management

# Estimate tokens for a high-detail image (OpenAI pricing)
import math

def estimate_image_tokens(width, height):
    # 170 tokens per 512x512 tile, plus a flat 85 base tokens
    # (ignores OpenAI's pre-scaling step for very large images)
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return tiles * 170 + 85

# 1024x1024 -> 4 tiles -> 765 tokens


Conclusion

Multi-modal AI represents the next stage of artificial intelligence. Models that can see, hear, and read unlock tasks that text-only systems simply cannot perform.

Key takeaways:

  1. Production-ready - GPT-4V, Claude, Gemini are highly capable
  2. Use cases are vast - From accessibility to enterprise automation
  3. APIs are accessible - Easy to integrate
  4. Best practices matter - Image optimization, token management
  5. Open source is growing - LLaVA and others are improving

Whether you’re building accessibility tools, visual search, or document processing, multi-modal AI provides the capabilities you need.
