
Multi-Modal AI Models Complete Guide: GPT-4V, Claude 3, Gemini and Beyond

Introduction

The next evolution in artificial intelligence isn't just about better text processing; it's about AI that can see, hear, and understand the world as humans do. Multi-modal AI models can process and generate text, images, audio, and video, unlocking applications that text-only models cannot address.

In this comprehensive guide, we’ll explore everything about multi-modal AI: how these models work, what they can do, how to use them, and how to build multi-modal applications.


What is Multi-Modal AI?

Understanding Multi-Modality

Multi-modal AI refers to models that can process and generate multiple types of data:

| Modality | Input | Output |
|----------|-------|--------|
| Text     | ✅    | ✅     |
| Images   | ✅    | ✅     |
| Audio    | ✅    | ✅     |
| Video    | ✅    | ✅     |
| Code     | ✅    | ✅     |

Why Multi-Modal Matters

| Capability             | Single-Modal | Multi-Modal |
|------------------------|--------------|-------------|
| Image Analysis         | ❌           | ✅          |
| Voice Conversation     | ❌           | ✅          |
| Document Understanding | Limited      | Full        |
| Cross-Modal Reasoning  | ❌           | ✅          |

Leading Multi-Modal Models

GPT-4V (Vision)

OpenAI’s vision capabilities:

from openai import OpenAI

client = OpenAI()

# Image understanding
response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://example.com/image.jpg"
                }
            }
        ]
    }]
)

print(response.choices[0].message.content)
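Public URLs aren't the only option: the same `image_url` field also accepts base64 data URLs, which is handy for local files. A minimal helper (the function name is my own):

```python
import base64

def to_data_url(path_or_bytes, mime="image/jpeg"):
    """Encode an image file or raw bytes as a data URL
    suitable for the image_url field above."""
    if isinstance(path_or_bytes, bytes):
        data = path_or_bytes
    else:
        with open(path_or_bytes, "rb") as f:
            data = f.read()
    b64 = base64.b64encode(data).decode("utf-8")
    return f"data:{mime};base64,{b64}"
```

Usage: `{"type": "image_url", "image_url": {"url": to_data_url("photo.jpg")}}`.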

Claude 3 (Vision)

Anthropic’s vision models:

import base64
import anthropic

client = anthropic.Anthropic()

# base64_image: a base64-encoded JPEG string, e.g.
# base64_image = base64.b64encode(open("photo.jpg", "rb").read()).decode()

response = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/jpeg",
                    "data": base64_image
                }
            },
            {
                "type": "text",
                "text": "Describe this image in detail."
            }
        ]
    }]
)

Gemini 1.5 Pro

Google’s multi-modal powerhouse:

import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

model = genai.GenerativeModel('gemini-1.5-pro')

# Multi-turn with images
response = model.generate_content([
    "What's the difference between these two charts?",
    {"mime_type": "image/jpeg", "data": image1_bytes},
    {"mime_type": "image/jpeg", "data": image2_bytes}
])

print(response.text)

Image Understanding

Analyzing Screenshots

# Analyze a screenshot
response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What UI elements do you see in this screenshot?"
            },
            {
                "type": "image_url",
                "image_url": {"url": screenshot_url}
            }
        ]
    }]
)

Extracting Data from Documents

# Extract structured data from image
response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": """Extract the following fields from this invoice:
                - Invoice Number
                - Date
                - Total Amount
                - Vendor Name
                - Line Items (description, quantity, price)"""
            },
            {
                "type": "image_url",
                "image_url": {"url": invoice_image}
            }
        ]
    }]
)
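The reply above arrives as free-form text. A common pattern is to ask the model to answer in JSON and then extract it defensively, since models sometimes wrap JSON in prose or code fences. A sketch (the helper name is my own):

```python
import json
import re

def parse_json_reply(text):
    """Pull the first JSON object out of a model reply,
    tolerating surrounding prose or ```json fences."""
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if not match:
        raise ValueError("no JSON object found in reply")
    return json.loads(match.group(0))
```

Pairing this with a prompt like "Respond with a single JSON object" makes the extraction step far more reliable.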

Code Generation from Screenshots

# Generate code from wireframe
response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Generate HTML and CSS for this UI design"
            },
            {
                "type": "image_url",
                "image_url": {"url": wireframe_image}
            }
        ]
    }]
)

Image Generation

DALL-E 3

OpenAI’s image generator:

# Generate image from text
response = client.images.generate(
    model="dall-e-3",
    prompt="A futuristic city with flying cars and neon lights, cinematic view",
    size="1024x1024",
    quality="standard",
    n=1
)

image_url = response.data[0].url

Stable Diffusion

Open-source alternative:

# Using Diffusers library
from diffusers import StableDiffusionPipeline
import torch

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

image = pipe(
    prompt="An astronaut riding a horse in space",
    num_inference_steps=25
).images[0]

Midjourney (via API)

# Note: Midjourney has no official public API; this sketch assumes a
# third-party proxy service with a hypothetical endpoint
import requests

response = requests.post(
    "https://api.midjourney.com/v1/imagine",
    headers={"Authorization": f"Bearer {api_key}"},
    json={
        "prompt": "cyberpunk city at night, neon lights",
        "version": "v6"
    }
)

image_url = response.json()["image_url"]

Video Understanding

Processing Video Frames

import base64

import cv2
import numpy as np

def extract_frames(video_path, max_frames=16):
    """Extract evenly spaced frames from video"""
    cap = cv2.VideoCapture(video_path)
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    
    frame_indices = np.linspace(
        0, total_frames - 1, max_frames, dtype=int
    )
    
    frames = []
    for idx in frame_indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ret, frame = cap.read()
        if ret:
            # Convert to base64
            _, buffer = cv2.imencode('.jpg', frame)
            frames.append(base64.b64encode(buffer).decode())
    
    cap.release()
    return frames
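The sampling logic in `extract_frames` can be isolated as a pure function, which makes the even-spacing behavior easy to verify without OpenCV (a hypothetical equivalent of the `np.linspace` call):

```python
def frame_indices(total_frames, max_frames=16):
    """Evenly spaced frame indices, always including the first and last frame."""
    if total_frames <= max_frames:
        return list(range(total_frames))
    step = (total_frames - 1) / (max_frames - 1)
    return [round(i * step) for i in range(max_frames)]
```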

Analyzing Video Content

# Analyze extracted frames
def analyze_video(video_path):
    frames = extract_frames(video_path)
    
    # Send frames to GPT-4V
    content = [{"type": "text", 
                "text": "Describe what's happening in this video"}]
    
    for frame in frames:
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{frame}"}
        })
    
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": content}]
    )
    
    return response.choices[0].message.content

Audio Processing

Whisper for Transcription

import whisper

model = whisper.load_model("large-v3")

# Transcribe audio
result = model.transcribe(
    "audio_file.mp3",
    language="en",
    task="transcribe"
)

print(result["text"])
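Beyond `result["text"]`, Whisper's `transcribe` also returns a `segments` list with `start`/`end` timestamps. A small sketch for turning segments into caption-style lines (helper names are my own):

```python
def format_timestamp(seconds):
    """Format seconds as HH:MM:SS.mmm, as used in caption files."""
    h = int(seconds // 3600)
    m = int(seconds % 3600 // 60)
    s = seconds % 60
    return f"{h:02d}:{m:02d}:{s:06.3f}"

def segments_to_lines(segments):
    """Render Whisper segments as timestamped text lines."""
    return [
        f"[{format_timestamp(seg['start'])} -> {format_timestamp(seg['end'])}] "
        f"{seg['text'].strip()}"
        for seg in segments
    ]
```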

Audio Understanding with GPT-4o

# Using GPT-4o for audio understanding
# Note: Requires OpenAI's audio API

response = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "input_audio",
                "input_audio": {
                    "data": audio_base64,
                    "format": "wav"
                }
            },
            {
                "type": "text",
                "text": "What's this audio about?"
            }
        ]
    }]
)

Building Multi-Modal Apps

Complete Chat App

class MultimodalChatbot:
    def __init__(self):
        self.client = OpenAI()
    
    def chat(self, message, attachments=None):
        # Build content array
        content = [{"type": "text", "text": message}]
        
        # Add images
        if attachments:
            for att in attachments:
                if att.type == "image":
                    content.append({
                        "type": "image_url",
                        "image_url": {"url": att.url}
                    })
        
        response = self.client.chat.completions.create(
            model="gpt-4-turbo",
            messages=[{"role": "user", "content": content}]
        )
        
        return response.choices[0].message.content

Document Processing Pipeline

class DocumentProcessor:
    def __init__(self):
        self.client = OpenAI()
    
    def process_document(self, file_path):
        # Convert PDF page to image (requires pdf2image)
        images = convert_pdf_to_images(file_path)
        
        all_text = []
        for i, img in enumerate(images):
            # Analyze each page
            response = self.client.chat.completions.create(
                model="gpt-4-turbo",
                messages=[{
                    "role": "user",
                    "content": [
                        {
                            "type": "text",
                            "text": f"Extract all text from this document page {i+1}"
                        },
                        {
                            "type": "image_url",
                            "image_url": {"url": img}
                        }
                    ]
                }]
            )
            all_text.append(response.choices[0].message.content)
        
        return "\n\n".join(all_text)

Use Cases

1. Product Identification

# Product identification
def identify_product(image_url):
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", 
                 "text": "What product is in this image? "
                         "Provide name, brand, and where to buy."},
                {"type": "image_url", 
                 "image_url": {"url": image_url}}
            ]
        }]
    )
    return response.choices[0].message.content

2. Accessibility

# Describe images for visually impaired
def describe_for_accessibility(image_url):
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Provide a detailed, accessible description "
                         "of this image for someone who cannot see it."},
                {"type": "image_url",
                 "image_url": {"url": image_url}}
            ]
        }]
    )
    return response.choices[0].message.content

3. Quality Control

# Product defect detection
def check_quality(product_image):
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Analyze this product image for defects. "
                         "List any issues you find."},
                {"type": "image_url",
                 "image_url": {"url": product_image}}
            ]
        }]
    )
    return response.choices[0].message.content

Best Practices

Image Handling

# Optimize images for API
import base64
import io

from PIL import Image

def optimize_for_api(image_path, max_size=(1024, 1024)):
    img = Image.open(image_path)
    
    # Downscale (aspect-preserving) if either dimension exceeds the limit
    if img.size[0] > max_size[0] or img.size[1] > max_size[1]:
        img.thumbnail(max_size, Image.LANCZOS)
    
    # Convert to RGB
    if img.mode != 'RGB':
        img = img.convert('RGB')
    
    # Save as JPEG
    buffer = io.BytesIO()
    img.save(buffer, format='JPEG', quality=85)
    
    return base64.b64encode(buffer.getvalue()).decode()
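The aspect-preserving downscale that `thumbnail` performs can be sanity-checked with plain arithmetic; a hypothetical helper approximating its target size:

```python
def fit_within(size, max_size=(1024, 1024)):
    """Target dimensions after an aspect-preserving downscale
    (no-op when the image already fits)."""
    w, h = size
    scale = min(max_size[0] / w, max_size[1] / h, 1.0)
    return (int(w * scale), int(h * scale))
```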

Token Management

# Estimate tokens for a high-detail image (OpenAI pricing)
import math

def estimate_image_tokens(width, height):
    # 170 tokens per 512x512 tile, plus a flat 85 base tokens
    # (ignores OpenAI's pre-scaling step for very large images)
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return tiles * 170 + 85

# 1024x1024 -> 4 tiles -> 765 tokens


Conclusion

Multi-modal AI represents the next stage of artificial intelligence. Models that can see, hear, and read unlock tasks that text-only systems simply cannot perform.

Key takeaways:

  1. Production-ready - GPT-4V, Claude, Gemini are highly capable
  2. Use cases are vast - From accessibility to enterprise automation
  3. APIs are accessible - Easy to integrate
  4. Best practices matter - Image optimization, token management
  5. Open source is growing - LLaVA and others are improving

Whether you’re building accessibility tools, visual search, or document processing, multi-modal AI provides the capabilities you need.
