Introduction
The next evolution in artificial intelligence isn't just about better text processing: it's about AI that can see, hear, and understand the world as humans do. Multi-modal AI models can process and generate text, images, audio, and video, opening up possibilities that text-only models simply cannot reach.
In this comprehensive guide, we’ll explore everything about multi-modal AI: how these models work, what they can do, how to use them, and how to build multi-modal applications.
What is Multi-Modal AI?
Understanding Multi-Modality
Multi-modal AI refers to models that can process and generate multiple types of data:
| Modality | Input | Output |
|---|---|---|
| Text | ✅ | ✅ |
| Images | ✅ | ✅ |
| Audio | ✅ | ✅ |
| Video | ✅ | ❌ |
| Code | ✅ | ✅ |
Why Multi-Modal Matters
| Capability | Single-Modal (Text) | Multi-Modal |
|---|---|---|
| Image Analysis | ❌ | ✅ |
| Voice Conversation | ❌ | ✅ |
| Document Understanding | Limited | Full |
| Cross-Modal Reasoning | ❌ | ✅ |
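The vision APIs used throughout this guide all share the same request shape: a "content" array mixing text parts and image parts. A small helper in the OpenAI style can build that array (the function name is our own, not part of any SDK):

```python
def build_content(text, image_urls=()):
    """Build an OpenAI-style multi-modal content array:
    one text part followed by zero or more image parts."""
    content = [{"type": "text", "text": text}]
    for url in image_urls:
        content.append({"type": "image_url", "image_url": {"url": url}})
    return content
```

Every image-understanding example below is a variation on this structure, so a helper like this keeps application code short.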
Leading Multi-Modal Models
GPT-4V (Vision)
OpenAI’s vision capabilities:
```python
from openai import OpenAI

client = OpenAI()

# Image understanding
response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://example.com/image.jpg"
                }
            }
        ]
    }]
)
print(response.choices[0].message.content)
```
Claude 3 (Vision)
Anthropic’s vision models:
```python
import anthropic
import base64

client = anthropic.Anthropic()

# Load and base64-encode the image
with open("image.jpg", "rb") as f:
    base64_image = base64.b64encode(f.read()).decode()

response = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/jpeg",
                    "data": base64_image
                }
            },
            {
                "type": "text",
                "text": "Describe this image in detail."
            }
        ]
    }]
)
print(response.content[0].text)
```
Gemini 1.5 Pro
Google’s multi-modal powerhouse:
```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel('gemini-1.5-pro')

# Multi-turn with images (image1_bytes/image2_bytes are raw JPEG bytes)
response = model.generate_content([
    "What's the difference between these two charts?",
    {"mime_type": "image/jpeg", "data": image1_bytes},
    {"mime_type": "image/jpeg", "data": image2_bytes}
])
print(response.text)
```
Image Understanding
Analyzing Screenshots
```python
# Analyze a screenshot
response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What UI elements do you see in this screenshot?"
            },
            {
                "type": "image_url",
                "image_url": {"url": screenshot_url}
            }
        ]
    }]
)
```
Extracting Data from Documents
```python
# Extract structured data from image
response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": """Extract the following fields from this invoice:
                - Invoice Number
                - Date
                - Total Amount
                - Vendor Name
                - Line Items (description, quantity, price)"""
            },
            {
                "type": "image_url",
                "image_url": {"url": invoice_image}
            }
        ]
    }]
)
```
Code Generation from Screenshots
```python
# Generate code from wireframe
response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Generate HTML and CSS for this UI design"
            },
            {
                "type": "image_url",
                "image_url": {"url": wireframe_image}
            }
        ]
    }]
)
```
Image Generation
DALL-E 3
OpenAI’s image generator:
```python
# Generate image from text
response = client.images.generate(
    model="dall-e-3",
    prompt="A futuristic city with flying cars and neon lights, cinematic view",
    size="1024x1024",
    quality="standard",
    n=1
)
image_url = response.data[0].url
```
Stable Diffusion
Open-source alternative:
```python
# Using the Diffusers library (SDXL checkpoints need the XL pipeline class)
from diffusers import StableDiffusionXLPipeline
import torch

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16
)
pipe = pipe.to("cuda")
image = pipe(
    prompt="An astronaut riding a horse in space",
    num_inference_steps=25
).images[0]
```
Midjourney (via API)
```python
# Note: Midjourney has no official public API; this sketch assumes a
# third-party proxy service exposing an "imagine" endpoint
import requests

response = requests.post(
    "https://api.midjourney.com/v1/imagine",
    headers={"Authorization": f"Bearer {api_key}"},
    json={
        "prompt": "cyberpunk city at night, neon lights",
        "version": "v6"
    }
)
image_url = response.json()["image_url"]
```
Video Understanding
Processing Video Frames
```python
import base64

import cv2
import numpy as np

def extract_frames(video_path, max_frames=16):
    """Extract evenly spaced frames from a video as base64 JPEGs."""
    cap = cv2.VideoCapture(video_path)
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frame_indices = np.linspace(
        0, total_frames - 1, max_frames, dtype=int
    )
    frames = []
    for idx in frame_indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ret, frame = cap.read()
        if ret:
            # Encode frame as JPEG, then base64
            _, buffer = cv2.imencode('.jpg', frame)
            frames.append(base64.b64encode(buffer).decode())
    cap.release()
    return frames
```
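The sampling step above leans on `np.linspace`; the same evenly spaced index selection can be sketched in pure Python, which is handy if NumPy isn't available (the function name is our own):

```python
def evenly_spaced_indices(total_frames, max_frames=16):
    """Pick max_frames indices evenly spread across [0, total_frames - 1],
    mirroring np.linspace(0, total_frames - 1, max_frames, dtype=int)."""
    if total_frames <= max_frames:
        return list(range(total_frames))
    step = (total_frames - 1) / (max_frames - 1)
    return [int(i * step) for i in range(max_frames)]
```

The first and last frames are always included, so the sample covers the full clip rather than just its beginning.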
Analyzing Video Content
```python
# Analyze extracted frames
def analyze_video(video_path):
    frames = extract_frames(video_path)
    # Send frames to GPT-4V
    content = [{"type": "text",
                "text": "Describe what's happening in this video"}]
    for frame in frames:
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{frame}"}
        })
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": content}]
    )
    return response.choices[0].message.content
```
Audio Processing
Whisper for Transcription
```python
import whisper

model = whisper.load_model("large-v3")

# Transcribe audio
result = model.transcribe(
    "audio_file.mp3",
    language="en",
    task="transcribe"
)
print(result["text"])
```
Audio Understanding with GPT-4o
```python
# Using GPT-4o for audio understanding
# Note: requires OpenAI's audio API; audio_base64 is the
# base64-encoded contents of a WAV file
response = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "input_audio",
                "input_audio": {
                    "data": audio_base64,
                    "format": "wav"
                }
            },
            {
                "type": "text",
                "text": "What's this audio about?"
            }
        ]
    }]
)
```
Building Multi-Modal Apps
Complete Chat App
```python
class MultimodalChatbot:
    def __init__(self):
        self.client = OpenAI()

    def chat(self, message, attachments=None):
        # Build content array
        content = [{"type": "text", "text": message}]
        # Add images
        if attachments:
            for att in attachments:
                if att.type == "image":
                    content.append({
                        "type": "image_url",
                        "image_url": {"url": att.url}
                    })
        response = self.client.chat.completions.create(
            model="gpt-4-turbo",
            messages=[{"role": "user", "content": content}]
        )
        return response.choices[0].message.content
```
Document Processing Pipeline
```python
class DocumentProcessor:
    def __init__(self):
        self.client = OpenAI()

    def process_document(self, file_path):
        # Convert each PDF page to an image data URL;
        # convert_pdf_to_images is a helper you supply
        # (e.g. built on pdf2image)
        images = convert_pdf_to_images(file_path)
        all_text = []
        for i, img in enumerate(images):
            # Analyze each page
            response = self.client.chat.completions.create(
                model="gpt-4-turbo",
                messages=[{
                    "role": "user",
                    "content": [
                        {
                            "type": "text",
                            "text": f"Extract all text from page {i+1} of this document."
                        },
                        {
                            "type": "image_url",
                            "image_url": {"url": img}
                        }
                    ]
                }]
            )
            all_text.append(response.choices[0].message.content)
        return "\n\n".join(all_text)
```
Use Cases
1. Visual Search
```python
# Product identification
def identify_product(image_url):
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "What product is in this image? "
                         "Provide name, brand, and where to buy."},
                {"type": "image_url",
                 "image_url": {"url": image_url}}
            ]
        }]
    )
    return response.choices[0].message.content
```
2. Accessibility
```python
# Describe images for visually impaired users
def describe_for_accessibility(image_url):
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Provide a detailed, accessible description "
                         "of this image for someone who cannot see it."},
                {"type": "image_url",
                 "image_url": {"url": image_url}}
            ]
        }]
    )
    return response.choices[0].message.content
```
3. Quality Control
```python
# Product defect detection
def check_quality(product_image):
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Analyze this product image for defects. "
                         "List any issues you find."},
                {"type": "image_url",
                 "image_url": {"url": product_image}}
            ]
        }]
    )
    return response.choices[0].message.content
```
Best Practices
Image Handling
```python
# Optimize images before sending them to the API
import base64
import io

from PIL import Image

def optimize_for_api(image_path, max_size=(1024, 1024)):
    img = Image.open(image_path)
    # Resize if either dimension exceeds the limit
    if img.size[0] > max_size[0] or img.size[1] > max_size[1]:
        img.thumbnail(max_size, Image.LANCZOS)
    # Convert to RGB (handles palette and alpha-channel images)
    if img.mode != 'RGB':
        img = img.convert('RGB')
    # Re-encode as JPEG and return base64
    buffer = io.BytesIO()
    img.save(buffer, format='JPEG', quality=85)
    return base64.b64encode(buffer.getvalue()).decode()
```
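To pass a locally optimized image to the chat API, the base64 string is typically wrapped in a data URL, the same form the video-analysis example uses. A minimal helper (the function name is our own):

```python
import base64

def to_data_url(image_bytes, mime_type="image/jpeg"):
    """Wrap raw image bytes in a data URL usable as an image_url value."""
    b64 = base64.b64encode(image_bytes).decode()
    return f"data:{mime_type};base64,{b64}"
```

The result can be dropped straight into `{"type": "image_url", "image_url": {"url": ...}}` without hosting the image anywhere.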
Token Management
```python
import math

# Estimate tokens for high-detail images:
# OpenAI splits the image into 512x512 tiles at ~170 tokens each,
# plus a fixed base cost of 85 tokens
def estimate_image_tokens(width, height):
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return tiles * 170 + 85
```
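As a sanity check on the tile heuristic (512px tiles at roughly 170 tokens each plus an 85-token base, an approximation of OpenAI's documented high-detail pricing), a parameterized version makes the arithmetic easy to verify:

```python
import math

def high_detail_tokens(width, height, tile=512, per_tile=170, base=85):
    # Count tiles with ceiling division, then apply per-tile and base costs
    tiles = math.ceil(width / tile) * math.ceil(height / tile)
    return tiles * per_tile + base
```

A 1024x1024 image spans 4 tiles, so the estimate is 4 * 170 + 85 = 765 tokens; a 512x512 image is a single tile at 255 tokens.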
External Resources
Models
- OpenAI GPT-4V
- Claude Vision
- Gemini
- LLaVA (open source)
Conclusion
Multi-modal AI represents a major step forward in artificial intelligence. Models that can see, hear, and reason across modalities open up applications, from accessibility tools to document automation, that text-only systems cannot address.
Key takeaways:
- Production-ready - GPT-4V, Claude, Gemini are highly capable
- Use cases are vast - From accessibility to enterprise automation
- APIs are accessible - Easy to integrate
- Best practices matter - Image optimization, token management
- Open source is growing - LLaVA and others are improving
Whether you’re building accessibility tools, visual search, or document processing, multi-modal AI provides the capabilities you need.