Introduction
The AI revolution has moved beyond text. Modern AI systems can see, hear, speak, and create across modalities. In 2025, multimodal AI is enabling entirely new categories of applications, from AI assistants that can see your screen to systems that understand video content. This guide surveys the multimodal AI landscape and shows how to build applications that process multiple types of data.
Understanding Multimodal AI
What Is Multimodal AI?
┌───────────────────────────────────────────────────────┐
│              Multimodal AI Capabilities               │
├───────────────────────────────────────────────────────┤
│                                                       │
│  Input Modalities:                                    │
│  ┌────────┐  ┌────────┐  ┌────────┐  ┌────────┐       │
│  │  Text  │  │ Image  │  │ Audio  │  │ Video  │       │
│  └────────┘  └────────┘  └────────┘  └────────┘       │
│                                                       │
│  Output Modalities:                                   │
│  ┌────────┐  ┌────────┐  ┌────────┐  ┌────────┐       │
│  │  Text  │  │ Image  │  │ Audio  │  │ Video  │       │
│  └────────┘  └────────┘  └────────┘  └────────┘       │
│                                                       │
│  Cross-Modal:                                         │
│  • Image → Text    (Captioning, OCR)                  │
│  • Text  → Image   (Generation)                       │
│  • Audio → Text    (Transcription)                    │
│  • Video → Summary (Understanding)                    │
└───────────────────────────────────────────────────────┘
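The cross-modal tasks above can be captured as a small routing table, which is handy when dispatching requests to the right model. The task names here are illustrative, not from any particular API:

```python
# Map each cross-modal task to its (input, output) modality pair
CROSS_MODAL_TASKS = {
    "captioning": ("image", "text"),
    "ocr": ("image", "text"),
    "generation": ("text", "image"),
    "transcription": ("audio", "text"),
    "video_summary": ("video", "text"),
}

def route(task):
    """Look up the (input, output) modality pair for a task."""
    return CROSS_MODAL_TASKS[task]

route("transcription")  # -> ("audio", "text")
```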
Key Concepts
# Multimodal AI concepts
concepts:
  - name: "Vision-Language Models (VLM)"
    description: "Models that understand images and text together"
    examples: ["GPT-4V", "Claude 3", "Llama 3 Vision"]
  - name: "Image Generation"
    description: "Create images from text descriptions"
    examples: ["DALL-E 3", "Midjourney v6", "Stable Diffusion 3"]
  - name: "Audio AI"
    description: "Speech recognition, generation, and analysis"
    examples: ["Whisper", "ElevenLabs", "MusicGen"]
  - name: "Video Understanding"
    description: "Analyze and generate video content"
    examples: ["Runway", "Pika", "Luma Dream Machine"]
Vision-Language Models
Using VLM APIs
# GPT-4o for image understanding
import base64

from openai import OpenAI

client = OpenAI()

def encode_image(image_path):
    """Read an image file and base64-encode it."""
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

def analyze_image(image_path):
    """Analyze an image with GPT-4o."""
    base64_image = encode_image(image_path)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "What's in this image?"},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{base64_image}"
                        }
                    }
                ]
            }
        ],
        max_tokens=300
    )
    return response.choices[0].message.content

def extract_text_from_image(image_path):
    """Extract text from an image using GPT-4o."""
    # Note: the API does not accept file:// URLs; images must be
    # sent as public URLs or base64 data URLs.
    base64_image = encode_image(image_path)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Extract all text from this image exactly as it appears."},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}
                    }
                ]
            }
        ]
    )
    return response.choices[0].message.content
Building VLM Applications
# Document processing with a VLM
import json

class DocumentProcessor:
    def __init__(self, vlm_client):
        self.client = vlm_client

    def process_document(self, document_image):
        """Extract structured information from a document image."""
        # Extract text
        text = self.extract_text(document_image)
        # Identify document type
        doc_type = self.classify_document(document_image)
        # Extract type-specific fields
        if doc_type == "invoice":
            fields = self.extract_invoice_fields(document_image)
        elif doc_type == "receipt":
            fields = self.extract_receipt_fields(document_image)
        elif doc_type == "form":
            fields = self.extract_form_fields(document_image)
        else:
            fields = {}
        return {
            'type': doc_type,
            'text': text,
            'fields': fields,
            'confidence': fields.get('confidence', 0)
        }

    def extract_invoice_fields(self, image):
        """Extract invoice-specific fields"""
        prompt = """Extract these fields from the invoice:
        - Invoice number
        - Date
        - Vendor name
        - Total amount
        - Line items (description, quantity, unit price)
        Return as JSON:"""
        response = self.client.analyze(image, prompt)
        return json.loads(response)
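Model-extracted JSON should be validated before it reaches downstream systems, since VLMs sometimes omit fields. A minimal check; the snake_case field names are an assumption about how the model keys its JSON response:

```python
# Required keys we expect in the parsed invoice JSON (assumed naming)
REQUIRED_INVOICE_FIELDS = {"invoice_number", "date", "vendor_name", "total_amount"}

def missing_invoice_fields(fields):
    """Return the set of required invoice fields absent from a parsed response."""
    return REQUIRED_INVOICE_FIELDS - set(fields)

# Example: a response missing the vendor name
parsed = {"invoice_number": "INV-001", "date": "2025-01-15", "total_amount": 99.0}
missing_invoice_fields(parsed)  # -> {"vendor_name"}
```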
Image Generation
Text-to-Image
# DALL-E 3 image generation
from openai import OpenAI

client = OpenAI()

def generate_image(prompt, size="1024x1024", quality="standard"):
    """Generate an image from a text prompt."""
    response = client.images.generate(
        model="dall-e-3",
        prompt=prompt,
        size=size,
        quality=quality,
        n=1
    )
    return {
        'url': response.data[0].url,
        'revised_prompt': response.data[0].revised_prompt
    }

# Example prompts for better results
prompts = {
    "product_photo": """
        Professional product photography of a sleek wireless headphone
        on a minimalist white background, studio lighting,
        high detail, 4k quality, commercial photography style
    """,
    "illustration": """
        Flat illustration style, vector art of a developer
        working at a desk, modern minimalist style,
        warm color palette, clean lines
    """,
    "concept_art": """
        Sci-fi concept art of a futuristic cityscape,
        flying vehicles, neon lights, cinematic lighting,
        epic scale, blade runner aesthetic
    """
}
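Reusable style fragments like those above can also be composed programmatically, so a consistent look is applied across many subjects. A small sketch; the style strings are condensed versions of the examples, not a library API:

```python
# Reusable style suffixes (illustrative, condensed from the prompt examples)
STYLES = {
    "product_photo": "studio lighting, high detail, commercial photography style",
    "illustration": "flat vector art, minimalist, warm color palette, clean lines",
}

def build_prompt(subject, style):
    """Combine a subject with a reusable style suffix."""
    return f"{subject}, {STYLES[style]}"

build_prompt("wireless headphones on a white background", "product_photo")
# -> "wireless headphones on a white background, studio lighting, high detail, commercial photography style"
```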
Image Editing with AI
# Image editing with the DALL-E 2 edits endpoint
# (as of this writing, images.edit and images.create_variation only support dall-e-2)
def edit_image(image_path, mask_path, instructions):
    """Edit the masked regions of an image."""
    with open(image_path, "rb") as image, open(mask_path, "rb") as mask:
        response = client.images.edit(
            model="dall-e-2",
            image=image,
            mask=mask,
            prompt=instructions,
            n=1,
            size="1024x1024"
        )
    return response.data[0].url

# Variations
def generate_variations(original_image, num_variations=3):
    """Generate variations of an image."""
    with open(original_image, "rb") as image:
        response = client.images.create_variation(
            model="dall-e-2",
            image=image,
            n=num_variations,
            size="1024x1024"
        )
    return [img.url for img in response.data]
Audio AI
Speech Recognition
# Whisper for transcription
import whisper

def transcribe_audio(audio_path, language=None):
    """Transcribe audio to text with Whisper."""
    model = whisper.load_model("base")
    result = model.transcribe(
        audio_path,
        language=language,  # None = auto-detect
        verbose=True
    )
    return {
        'text': result['text'],
        'segments': result['segments'],
        'language': result['language']
    }

# Transcribe with timestamps
def transcribe_with_timestamps(audio_path):
    """Get word-level timestamps (words are nested inside each segment)."""
    model = whisper.load_model("medium")
    result = model.transcribe(audio_path, word_timestamps=True)
    return [
        {
            'word': word['word'],
            'start': word['start'],
            'end': word['end']
        }
        for segment in result['segments']
        for word in segment['words']
    ]
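Segment output is often repackaged as subtitles. A minimal SRT formatter, assuming Whisper-style segment dicts with `start`, `end` (seconds), and `text`:

```python
def to_srt(segments):
    """Format Whisper-style segments as an SRT subtitle string."""
    def stamp(seconds):
        # SRT timestamps look like HH:MM:SS,mmm
        ms = int(round(seconds * 1000))
        h, rem = divmod(ms, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02}:{m:02}:{s:02},{ms:03}"

    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{stamp(seg['start'])} --> {stamp(seg['end'])}\n{seg['text'].strip()}"
        )
    return "\n\n".join(blocks)

to_srt([{'start': 0.0, 'end': 2.5, 'text': ' Hello world'}])
# -> "1\n00:00:00,000 --> 00:00:02,500\nHello world"
```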
Speech Generation
# ElevenLabs for speech synthesis (elevenlabs SDK v0.x API)
from elevenlabs import generate, save, clone

def text_to_speech(text, voice="Rachel", model="eleven_multilingual_v2"):
    """Generate speech from text."""
    audio = generate(
        text=text,
        voice=voice,
        model=model
    )
    return audio

# Use a custom voice
def clone_voice(audio_samples, name="my_voice"):
    """Create a custom voice from audio sample files."""
    voice = clone(
        name=name,
        description="Custom voice",
        files=audio_samples
    )
    return voice.voice_id

# Generate with the custom voice
def speak_with_voice(text, voice_id):
    """Generate speech with a cloned voice."""
    audio = generate(
        text=text,
        voice=voice_id,
        model="eleven_multilingual_v2"
    )
    save(audio, "output.mp3")
    return "output.mp3"
Music Generation
# MusicGen (Meta's audiocraft library) for music creation
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

def generate_music(prompt, duration=30):
    """Generate music from a text description."""
    model = MusicGen.get_pretrained('facebook/musicgen-small')
    model.set_generation_params(duration=duration)
    wav = model.generate([prompt])  # batch of one prompt -> [1, channels, samples]
    audio_write('output', wav[0].cpu(), model.sample_rate)
    return 'output.wav'

# Example prompts
music_prompts = {
    "corporate": "Upbeat corporate background music, positive, professional, modern",
    "cinematic": "Epic cinematic trailer music, orchestral, powerful, dramatic",
    "lofi": "Chill lo-fi hip hop beat, relaxed, calm, study music",
    "electronic": "Energetic electronic dance music, high BPM, festival style"
}
Video AI
Video Generation
# Runway for video generation (illustrative sketch: method names follow the
# official SDK's image_to_video task; check the Runway API docs for details)
from runwayml import RunwayML

def generate_video(image_url, prompt, model="gen3a_turbo"):
    """Start a video generation task from a source image and text prompt."""
    client = RunwayML()
    task = client.image_to_video.create(
        model=model,
        prompt_image=image_url,
        prompt_text=prompt
    )
    return task.id  # poll the task until it completes

# Video to video
def style_transfer_video(video_path, style):
    """Apply style transfer to a video.

    NOTE: illustrative pseudocode; see Runway's API docs for the
    current video-to-video endpoint and parameters.
    """
    client = RunwayML()
    result = client.videos.style_transfer(
        video=video_path,
        style=style
    )
    return result
Video Understanding
# Video analysis with multimodal models
def analyze_video(video_path, questions):
    """Analyze a video by sampling frames and querying a VLM."""
    # Extract frames
    frames = extract_frames(video_path, fps=1)
    # Analyze each frame
    analysis = []
    for frame in frames:
        frame_analysis = vlm_client.analyze_image(
            frame['path'],
            question=questions
        )
        analysis.append({
            'timestamp': frame['timestamp'],
            'analysis': frame_analysis
        })
    # Summarize
    summary = summarize_video_analysis(analysis)
    return {
        'summary': summary,
        'frames': analysis
    }

def extract_frames(video_path, fps=1):
    """Extract frames from a video at roughly `fps` frames per second."""
    import cv2
    video = cv2.VideoCapture(video_path)
    video_fps = video.get(cv2.CAP_PROP_FPS)
    frame_interval = max(1, int(video_fps / fps))  # guard against fps > video_fps
    frames = []
    frame_count = 0
    while True:
        ret, frame = video.read()
        if not ret:
            break
        if frame_count % frame_interval == 0:
            timestamp = frame_count / video_fps
            # Save the sampled frame to disk
            path = f"/tmp/frame_{timestamp:.2f}.jpg"
            cv2.imwrite(path, frame)
            frames.append({'path': path, 'timestamp': timestamp})
        frame_count += 1
    video.release()
    return frames
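The sampling logic reduces to choosing frame indices, and isolating it as a pure function (an illustrative helper, not part of OpenCV) makes it easy to unit-test without video files:

```python
def sample_indices(total_frames, video_fps, target_fps):
    """Return the frame indices kept when sampling down to target_fps."""
    interval = max(1, int(video_fps / target_fps))
    return list(range(0, total_frames, interval))

sample_indices(100, 30.0, 1)  # every 30th frame -> [0, 30, 60, 90]
```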
Building Multimodal Applications
Unified Multimodal Pipeline
# Combined multimodal pipeline
class MultimodalPipeline:
    def __init__(self):
        self.vision = VisionClient()
        self.audio = AudioClient()
        self.llm = LLMClient()

    def process(self, inputs):
        """Process multiple input modalities"""
        results = {}
        # Process each input
        for input_data in inputs:
            modality = input_data['type']
            if modality == 'image':
                results['image'] = self.vision.analyze(
                    input_data['content'],
                    input_data.get('question', 'Describe this image')
                )
            elif modality == 'audio':
                results['audio'] = self.audio.transcribe(
                    input_data['content']
                )
            elif modality == 'text':
                results['text'] = input_data['content']
        # Combine results with the LLM
        combined = self.combine_modalities(results)
        # Generate response
        response = self.llm.generate(
            prompt=f"Based on all inputs: {combined}",
            context=results
        )
        return response

    def combine_modalities(self, results):
        """Combine results from different modalities"""
        summary_parts = []
        if 'image' in results:
            summary_parts.append(f"Image shows: {results['image']}")
        if 'audio' in results:
            summary_parts.append(f"Audio contains: {results['audio']}")
        if 'text' in results:
            summary_parts.append(f"Text says: {results['text']}")
        return " | ".join(summary_parts)
Multimodal RAG
# Multimodal retrieval
class MultimodalRAG:
    def __init__(self):
        self.text_store = VectorStore()
        self.image_store = VectorStore()
        self.audio_store = VectorStore()
        self.vision = VisionClient()
        self.audio = AudioClient()
        self.llm = LLMClient()

    def index_content(self, content_path):
        """Index various content types"""
        if content_path.endswith(('.jpg', '.png', '.jpeg')):
            # Image: index a text description of it
            description = self.vision.describe(content_path)
            self.image_store.add(content_path, description)
        elif content_path.endswith(('.mp3', '.wav')):
            # Audio: index the transcript
            transcript = self.audio.transcribe(content_path)
            self.audio_store.add(content_path, transcript)
        else:
            # Text
            with open(content_path) as f:
                self.text_store.add(content_path, f.read())

    def query(self, question):
        """Query across all modalities"""
        # Search each store
        text_results = self.text_store.search(question, k=3)
        image_results = self.image_store.search(question, k=3)
        audio_results = self.audio_store.search(question, k=3)
        # Combine context
        context = self.build_context(
            text=text_results,
            images=image_results,
            audio=audio_results
        )
        # Generate answer
        answer = self.llm.generate(
            prompt=question,
            context=context
        )
        return answer
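When the per-store results carry similarity scores, a simple way to merge them before building context is score-sorted fusion. A sketch, assuming each result is a `(doc_id, score)` pair; real vector stores return richer objects, but the idea is the same:

```python
def fuse_results(*result_lists, k=5):
    """Merge scored results from several stores, highest score first."""
    merged = [item for results in result_lists for item in results]
    merged.sort(key=lambda pair: pair[1], reverse=True)
    return merged[:k]

fuse_results(
    [("doc.txt", 0.9), ("notes.txt", 0.4)],  # text store hits
    [("chart.png", 0.7)],                    # image store hits
    k=2,
)  # -> [("doc.txt", 0.9), ("chart.png", 0.7)]
```

A fixed score cutoff across modalities can be misleading if the stores use different embedding models, so per-store normalization is worth considering.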
Use Cases
Visual Question Answering
# VQA application
import base64

def visual_qa(image_path, question):
    """Answer questions about images"""
    client = OpenAI()
    # The API expects a public URL or a base64 data URL, not a file:// path
    with open(image_path, "rb") as f:
        base64_image = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}
                    }
                ]
            }
        ]
    )
    return response.choices[0].message.content

# Example applications
questions = {
    "product": "What product is shown? What are its key features?",
    "chart": "What data does this chart show? Extract all values.",
    "diagram": "Explain this architecture diagram.",
    "screenshot": "What is happening in this app screenshot?"
}
Content Moderation
# Multimodal content moderation
class ContentModerator:
    def __init__(self):
        self.text_moderator = TextModerationClient()
        self.image_moderator = ImageModerationClient()

    def moderate(self, content):
        """Check content for policy violations"""
        results = {'text': None, 'image': None}
        if 'text' in content:
            results['text'] = self.text_moderator.check(content['text'])
        if 'image' in content:
            results['image'] = self.image_moderator.check(content['image'])
        # Overall decision
        violations = [
            r for r in results.values()
            if r and r.get('violated')
        ]
        return {
            'allowed': len(violations) == 0,
            'violations': violations,
            'details': results
        }
Common Pitfalls
1. Ignoring Modalities
Wrong:
# Only process text, ignore images
def process_feedback(feedback):
    return analyze_text(feedback.text)
Correct:
# Process all modalities
def process_feedback(feedback):
    text_analysis = analyze_text(feedback.text)
    image_analysis = analyze_image(feedback.screenshots)
    return combine_analyses(text_analysis, image_analysis)
2. Token Limits
Wrong:
# Send entire video to model
analyze_video("2_hour_video.mp4")
# Fails - too many tokens
Correct:
# Sample key frames
frames = extract_key_frames("2_hour_video.mp4", num_frames=20)
analyze_frames(frames)
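A quick budget check helps pick the frame count before calling the API. A sketch; the per-image token figure is an assumption to illustrate the arithmetic, so verify the actual cost against your provider's docs:

```python
TOKENS_PER_FRAME = 765  # assumed rough cost per high-detail image; provider-specific

def max_frames_for_budget(context_window, reserved_for_text=2000):
    """How many frames fit in the context window after reserving room for text?"""
    return max(0, (context_window - reserved_for_text) // TOKENS_PER_FRAME)

max_frames_for_budget(128_000)  # -> 164
```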
3. Not Handling Errors
Wrong:
# Assume all inputs are valid
result = vlm.analyze(image_path)
Correct:
# Handle various error cases
try:
    result = vlm.analyze(image_path)
except InvalidImageError:
    return {"error": "Invalid image format"}
except RateLimitError:
    return {"error": "Rate limited, try again later"}
except TimeoutError:
    return {"error": "Processing timeout"}
Key Takeaways
- Multimodal AI combines strengths - Best results from using multiple modalities
- Vision-language models are versatile - Text, images, documents, screenshots
- Image generation has practical uses - Marketing, prototyping, design
- Audio AI enables accessibility - Transcription, translation, voice cloning
- Video is the frontier - Understanding and generation are advancing rapidly
- Build unified pipelines - Process all modalities together for best results
- Consider costs - Multimodal APIs can be expensive at scale
External Resources
APIs and Services
- OpenAI Vision - GPT-4V
- Anthropic Vision - Claude with vision
- ElevenLabs - Voice AI
- Runway - Video AI
Open Source
- LLaVA - Open-source VLM
- Stable Diffusion - Image generation
- Whisper - Transcription
- MusicGen - Music generation