Introduction
WebRTC (Web Real-Time Communication) enables direct peer-to-peer audio, video, and data exchange in browsers and mobile applications without plugins. By 2026, WebRTC powers billions of video call minutes, live streaming sessions, and — the fastest-growing segment — AI voice agents. The protocol stack that was designed for browser-to-browser calls has become the transport layer for real-time AI: voice agents that listen, think, and speak with sub-500ms latency.
This guide covers WebRTC architecture, the protocols that make it work (ICE, STUN, TURN, RTP, SDP, DTLS), production deployment patterns with SFUs (mediasoup, LiveKit, Janus), the new WHIP/WHEP standards for broadcasting, WebRTC for AI voice agents, and comparison with the emerging WebTransport + WebCodecs stack.
For foundational knowledge of related protocols, see the WebSocket Protocol Guide and the WebTransport Protocol Guide.
What Is WebRTC?
WebRTC is a collection of W3C APIs and IETF protocols for real-time communication. Unlike WebSockets, which provide a single TCP stream, WebRTC runs primarily over UDP (falling back to TCP when UDP is blocked), with built-in NAT traversal, encryption, and codec negotiation.
Core Components
MediaStream (getUserMedia) — Access camera and microphone. Returns audio and video tracks that can be fed directly into a PeerConnection.
RTCPeerConnection — The central object that manages the full lifecycle of a peer-to-peer connection: ICE candidate gathering, DTLS key exchange, SRTP media encryption, and bandwidth estimation.
RTCDataChannel — Arbitrary data exchange over SCTP, with configurable reliability (ordered/unordered, retransmit limits, partial reliability).
Browser Support (2026)
| Browser | WebRTC | WHIP | WHEP | DataChannel | Insertable Streams |
|---|---|---|---|---|---|
| Chrome | Full | Full | Full | Full | Full |
| Firefox | Full | Full | Full | Full | Full |
| Safari | Full | Full (15+) | Full (15+) | Full | Partial |
| Edge | Full | Full | Full | Full | Full |
| Node.js | Via libraries | Via libraries | Via libraries | Via libraries | N/A |
WebRTC is part of WebKit’s Interop 2026 focus areas, having reached 91.6% cross-browser pass rate in 2025. The remaining long tail of interop issues is being addressed through the Interop project.
Architecture Overview
Protocol Stack
Application Layer
|
WebRTC API (JavaScript)
|
Signaling (WebSocket/HTTP)
|
SDP (Session Description)
|
ICE (Interactive Connectivity)
|
STUN / TURN
|
DTLS (Datagram Transport Layer)
|
SRTP (Secure RTP) | SCTP (Data Channel)
|
UDP (preferred) or TCP (fallback)
Connection Flow
Peer A               Signaling               Peer B
  |                      |                      |
  |---- Offer (SDP) ---->|                      |
  |                      |---- Offer (SDP) ---->|
  |                      |                      |
  |                      |<--- Answer (SDP) ----|
  |<--- Answer (SDP) ----|                      |
  |                      |                      |
  |<== ICE Candidates ==>|<== ICE Candidates ==>|
  |                      |                      |
  |<============ DTLS Handshake ===============>|
  |                      |                      |
  |<========== Media Streams (SRTP) ===========>|
  |                      |                      |
  |<============== Data Channel ===============>|
Session Description Protocol (SDP)
SDP describes multimedia sessions for negotiation between peers. It defines the codecs, encryption keys, network addresses, and media formats each peer supports.
SDP Structure
v=0
o=- 7027864585432175469 2 IN IP4 127.0.0.1
s=-
t=0 0
a=group:BUNDLE 0 1 2
a=extmap-allow-mixed
a=msid-semantic: WMS *
m=audio 9 UDP/TLS/RTP/SAVPF 111 103 104 9 0 8 106 105 13 110 112 126 97 98
c=IN IP4 0.0.0.0
a=rtcp:9 IN IP4 0.0.0.0
a=ice-ufrag:xyz123
a=ice-pwd:abcdefghijklmnopqrstuvwxyz
a=rtpmap:111 opus/48000/2
a=rtcp-fb:111 transport-cc
a=fmtp:111 minptime=10;useinbandfec=1
Key Attributes
o= (Origin) — Session identifier and version. Changes when the session is modified.
m= (Media Description) — Media type (audio/video/application), port, transport protocol, and list of payload type identifiers for supported codecs.
a=rtpmap — Maps payload type numbers to codec names and clock rates (e.g., 111 opus/48000/2 means payload type 111 is Opus at 48kHz stereo).
a=fmtp — Codec-specific parameters (e.g., useinbandfec=1 enables Opus in-band forward error correction).
a=ice-ufrag / a=ice-pwd — ICE credentials used to authenticate connectivity checks.
a=fingerprint — SHA-256 hash of the DTLS certificate, used for key verification.
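To make the a=rtpmap attribute concrete, here is a minimal sketch that builds a payload-type lookup table from an SDP string. The function name `parseRtpmap` and the default of one channel when the channel count is omitted are illustrative assumptions, not part of any WebRTC API.

```javascript
// Build a payload-type -> codec map from the a=rtpmap lines of an SDP blob.
// parseRtpmap is a hypothetical helper name used only for this example.
function parseRtpmap(sdp) {
  const codecs = new Map();
  for (const line of sdp.split(/\r?\n/)) {
    // a=rtpmap:<payload type> <encoding name>/<clock rate>[/<channels>]
    const match = /^a=rtpmap:(\d+) ([^/\s]+)\/(\d+)(?:\/(\d+))?/.exec(line.trim());
    if (!match) continue;
    codecs.set(Number(match[1]), {
      name: match[2],
      clockRate: Number(match[3]),
      // Assumption: treat a missing channel count as 1
      channels: match[4] ? Number(match[4]) : 1
    });
  }
  return codecs;
}

const sampleSdp = [
  'm=audio 9 UDP/TLS/RTP/SAVPF 111 0',
  'a=rtpmap:111 opus/48000/2',
  'a=rtpmap:0 PCMU/8000'
].join('\r\n');
console.log(parseRtpmap(sampleSdp).get(111));
```

A lookup like this is handy when logging which codec a negotiated payload type refers to.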
Creating Offer/Answer
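// Create an SDP offer. offerToReceiveAudio/offerToReceiveVideo are legacy options;
// adding tracks or transceivers before createOffer() is the current approach.
const offer = await peerConnection.createOffer({
  offerToReceiveAudio: true,
  offerToReceiveVideo: true
});
await peerConnection.setLocalDescription(offer);
// Send the offer via signaling, then apply the answer received from the remote peer
await peerConnection.setRemoteDescription(answer);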
Interactive Connectivity Establishment (ICE)
ICE is the protocol that enables peers to discover and establish direct network paths through NATs and firewalls. It collects candidate addresses, prioritizes them, and tests each pair for connectivity.
ICE Candidates
a=candidate:1 1 UDP 2130706431 192.168.1.100 49170 typ host
a=candidate:2 1 UDP 1694498815 203.0.113.50 49171 typ srflx raddr 192.168.1.100 rport 49170
a=candidate:3 1 UDP 847250719 10.0.0.50 49172 typ relay raddr 203.0.113.50 rport 49173
Each candidate contains: foundation ID, component ID (1=RTP, 2=RTCP), transport protocol, priority (higher = better), IP address, port, and type.
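The field layout described above can be sketched as a small parser. This is an illustrative helper (the name `parseCandidate` is an assumption); in a browser the `RTCIceCandidate` object already exposes these fields.

```javascript
// Split an SDP candidate line into the fields listed above.
function parseCandidate(line) {
  const parts = line.replace(/^a=/, '').replace(/^candidate:/, '').trim().split(/\s+/);
  const [foundation, component, protocol, priority, address, port, typKeyword, type, ...rest] = parts;
  if (typKeyword !== 'typ') throw new Error('not a candidate line: ' + line);
  const candidate = {
    foundation,
    component: Number(component),
    protocol: protocol.toLowerCase(),
    priority: Number(priority),
    address,
    port: Number(port),
    type
  };
  // Remaining tokens are key/value extensions such as raddr and rport.
  for (let i = 0; i + 1 < rest.length; i += 2) {
    candidate[rest[i]] = rest[i + 1];
  }
  return candidate;
}

console.log(parseCandidate(
  'a=candidate:2 1 UDP 1694498815 203.0.113.50 49171 typ srflx raddr 192.168.1.100 rport 49170'
));
```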
Candidate Types
host — Local network address (highest priority). Works when both peers are on the same LAN.
srflx (Server Reflexive) — Public IP discovered via STUN. Works for most residential NATs.
prflx (Peer Reflexive) — Public IP learned dynamically from a peer’s connectivity check.
relay — TURN relay address (lowest priority). Used when all other paths fail.
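The priority values in the host and srflx examples above follow the formula recommended by the ICE specification: priority = 2^24 × type preference + 2^8 × local preference + (256 − component ID). The sketch below uses the specification's recommended type preferences and assumes a local preference of 65535; implementations are free to choose other values, so real candidates (relay candidates in particular) may not match.

```javascript
// Recommended type preferences from the ICE specification.
const TYPE_PREFERENCE = { host: 126, prflx: 110, srflx: 100, relay: 0 };

function candidatePriority(type, componentId, localPreference = 65535) {
  const typePref = TYPE_PREFERENCE[type];
  if (typePref === undefined) throw new Error(`unknown candidate type: ${type}`);
  // Use multiplication rather than << to avoid 32-bit signed overflow.
  return typePref * 2 ** 24 + localPreference * 2 ** 8 + (256 - componentId);
}

console.log(candidatePriority('host', 1));  // 2130706431
console.log(candidatePriority('srflx', 1)); // 1694498815
```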
ICE Configuration
const pc = new RTCPeerConnection({
  iceServers: [
    { urls: 'stun:stun.l.google.com:19302' },
    {
      urls: 'turn:turn.example.com:3478',
      username: 'user',
      credential: 'pass'
    }
  ],
  iceCandidatePoolSize: 10
});
// ICE candidate events
pc.onicecandidate = (event) => {
  if (event.candidate) {
    sendToPeer({ type: 'candidate', candidate: event.candidate });
  }
};
pc.oniceconnectionstatechange = () => {
  console.log('ICE state:', pc.iceConnectionState);
  // Success path: new -> checking -> connected -> completed
  // Failure paths: checking -> failed, or connected -> disconnected -> failed
};
// Call this for each remote candidate received over the signaling channel
await pc.addIceCandidate(candidate);
ICE State Machine
stateDiagram-v2
[*] --> New
New --> Checking
Checking --> Connected
Checking --> Failed
Connected --> Completed
Connected --> Disconnected
Completed --> Disconnected
Disconnected --> Checking : ICE restart
Disconnected --> Failed
Failed --> [*]
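One way to act on these states is a small policy function that maps each `iceConnectionState` value to a recovery action. The action names below are hypothetical labels for this sketch; only the state names and `pc.restartIce()` come from the WebRTC API.

```javascript
// Map an ICE connection state to a suggested application action.
function actionForIceState(state) {
  switch (state) {
    case 'new':
    case 'checking':
      return 'wait';
    case 'connected':
    case 'completed':
      return 'none';
    case 'disconnected':
      return 'start-timer';  // often transient; give it a few seconds to recover
    case 'failed':
      return 'restart-ice';  // e.g. call pc.restartIce() and renegotiate
    case 'closed':
      return 'teardown';
    default:
      throw new Error(`unexpected ICE state: ${state}`);
  }
}

console.log(actionForIceState('failed')); // restart-ice
```

Treating `disconnected` as a timer rather than an immediate restart avoids renegotiating on brief network hiccups.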
STUN Protocol
STUN (Session Traversal Utilities for NAT, RFC 8489) helps peers discover their public IP addresses and port mappings. A STUN server is stateless — it simply echoes back the source address it sees.
Client                    STUN Server
  |                            |
  |---- Binding Request ------>|
  |     (transaction ID)       |
  |                            |
  |<--- Binding Response ------|
  |     (XOR-MAPPED-ADDRESS)   |
  |     (MAPPED-ADDRESS)       |
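import os
import socket
import struct

STUN_SERVER = 'stun.l.google.com'
STUN_PORT = 19302
MAGIC_COOKIE = 0x2112A442  # RFC 8489 magic cookie

def stun_binding_request():
    # 20-byte header: type (0x0001 = Binding Request), length 0, cookie, 96-bit transaction ID
    transaction_id = os.urandom(12)
    request = struct.pack('!HHI', 0x0001, 0, MAGIC_COOKIE) + transaction_id
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(3)
        sock.sendto(request, (STUN_SERVER, STUN_PORT))
        response, _ = sock.recvfrom(512)
    # Walk the attributes after the header looking for XOR-MAPPED-ADDRESS (0x0020)
    pos = 20
    while pos + 4 <= len(response):
        attr_type, attr_len = struct.unpack_from('!HH', response, pos)
        if attr_type == 0x0020:
            family, xport, xaddr = struct.unpack_from('!xBHI', response, pos + 4)
            if family == 0x01:  # IPv4: un-XOR port and address with the magic cookie
                port = xport ^ (MAGIC_COOKIE >> 16)
                return socket.inet_ntoa(struct.pack('!I', xaddr ^ MAGIC_COOKIE)), port
        pos += 4 + attr_len + (-attr_len % 4)  # attribute values are padded to 4 bytes
    return None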
Google operates a public STUN server at stun.l.google.com:19302 that is free to use and suitable for development. Production deployments should operate their own STUN infrastructure or use a commercial TURN provider.
TURN Protocol
TURN (Traversal Using Relays around NAT, RFC 8656, which obsoletes RFC 5766) provides relay servers when direct peer-to-peer connections are impossible. In 2026, this is especially critical for AI voice agents. In human video calls only ~15-20% of connections need TURN, whereas AI voice agents always connect through server infrastructure with no P2P fallback, so in deployments that route agent media through relays, 100% of traffic hits TURN servers.
When TURN Is Needed
- Symmetric NAT (both peers behind carrier-grade NAT)
- Firewall blocks UDP entirely (TCP/TLS TURN fallback)
- Enterprise proxy environments that restrict outbound traffic
- AI voice agents requiring guaranteed low-latency connections
TURN Server Setup with coturn
# Install coturn (battle-tested TURN server)
apt-get install coturn
# /etc/turnserver.conf
listening-port=3478
tls-listening-port=5349
realm=example.com
fingerprint
lt-cred-mech
user=app1:securepass
total-quota=100
bps-capacity=0
log-file=/var/log/turn/turn.log
Production TURN Cost Modeling
For AI voice agent deployments, TURN costs follow a different model than traditional WebRTC:
| Factor | Human Video Calls | AI Voice Agents |
|---|---|---|
| TURN traffic share | 15-20% of connections | 100% of connections |
| Average call duration | 5-30 minutes | 2-10 minutes |
| Bandwidth per stream | 1-4 Mbps (video) | 64-128 Kbps (audio only) |
| Cost driver | Concurrent video streams | Concurrent sessions × duration |
Deploy TURN servers in 3+ regions with anycast routing for lowest latency. Use coturn (the standard open-source implementation) or LiveKit’s bundled TURN for managed deployments.
RTP and SRTP
RTP (Real-time Transport Protocol) carries media streams — audio encoded with Opus, video encoded with VP8, H.264, VP9, or AV1. SRTP encrypts the RTP payload using AES, negotiated automatically via DTLS-SRTP during the handshake.
RTP Packet Header
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|V=2|P|X|  CC   |M|     PT      |        Sequence Number        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                           Timestamp                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           Synchronization Source (SSRC) identifier            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                            Payload                            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
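The fixed 12-byte header shown above can be decoded with a `DataView`. This is a minimal sketch that ignores CSRC entries and header extensions; the function name `parseRtpHeader` is an assumption for this example.

```javascript
// Decode the fixed 12-byte RTP header from a Uint8Array (network byte order).
function parseRtpHeader(bytes) {
  if (bytes.length < 12) throw new Error('RTP header needs at least 12 bytes');
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const b0 = view.getUint8(0);
  const b1 = view.getUint8(1);
  return {
    version: b0 >> 6,
    padding: (b0 & 0x20) !== 0,
    extension: (b0 & 0x10) !== 0,
    csrcCount: b0 & 0x0f,
    marker: (b1 & 0x80) !== 0,
    payloadType: b1 & 0x7f,
    sequenceNumber: view.getUint16(2), // DataView reads big-endian by default
    timestamp: view.getUint32(4),
    ssrc: view.getUint32(8)
  };
}

// Example packet: version 2, payload type 111, sequence 1, timestamp 960
const packet = new Uint8Array([0x80, 0x6f, 0x00, 0x01, 0x00, 0x00, 0x03, 0xc0, 0x12, 0x34, 0x56, 0x78]);
console.log(parseRtpHeader(packet));
```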
Codec Support in 2026
| Codec | Type | Bitrate Range | Browser Support | Notes |
|---|---|---|---|---|
| Opus | Audio | 6-510 Kbps | All | Default, 48kHz, inband FEC, DTX |
| VP8 | Video | 100 Kbps - 2 Mbps | All | Fallback, mandatory to implement |
| H.264 | Video | 100 Kbps - 5 Mbps | All | Hardware encode/decode on most devices |
| VP9 | Video | 100 Kbps - 8 Mbps | Chrome, Firefox, Edge | 30% better compression than VP8 |
| AV1 | Video | 80 Kbps - 4 Mbps | Chrome, Firefox, Edge (partial) | 50% better than VP8, limited hardware encode |
AV1 adoption is progressing but not yet dominant in 2026. VP8 and H.264 remain the safest choices for broad compatibility. As WebRTC expert Tsahi Levent-Levi notes, AV1 may not reach dominant status until 2028.
WebRTC Codec Negotiation
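// Prefer AV1 for video when the browser supports it
const transceiver = pc.getTransceivers()
  .find(t => t.receiver.track.kind === 'video');
// setCodecPreferences reorders the codecs offered in SDP; feature-detect it first
const codecs = RTCRtpReceiver.getCapabilities('video')?.codecs ?? [];
const av1 = codecs.filter(c => c.mimeType.toLowerCase() === 'video/av1');
if (transceiver?.setCodecPreferences && av1.length > 0) {
  // Move AV1 entries to the front without duplicating them
  const rest = codecs.filter(c => c.mimeType.toLowerCase() !== 'video/av1');
  transceiver.setCodecPreferences([...av1, ...rest]);
}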
WHIP and WHEP: WebRTC Broadcasting Standards
WHIP (WebRTC-HTTP Ingestion Protocol, RFC 9725, published March 2025) and WHEP (WebRTC-HTTP Egress Protocol, IETF draft) define simple HTTP-based signaling for WebRTC broadcasting. They replace RTMP with a standards-based, browser-native ingestion and distribution pipeline.
WHIP for Ingestion
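// WHIP: Ingest a WebRTC stream via HTTP POST
async function whipPublish(stream, whipEndpoint) {
  const pc = new RTCPeerConnection({
    iceServers: [{ urls: 'stun:stun.l.google.com:19302' }]
  });
  stream.getTracks().forEach(track => pc.addTrack(track, stream));
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);
  // POST SDP offer to WHIP endpoint
  const response = await fetch(whipEndpoint, {
    method: 'POST',
    headers: { 'Content-Type': 'application/sdp' },
    body: offer.sdp
  });
  // A successful WHIP POST returns 201 Created with the SDP answer as the body
  if (!response.ok) {
    pc.close();
    throw new Error(`WHIP publish failed: HTTP ${response.status}`);
  }
  // The Location header identifies the session resource; send DELETE to it to stop publishing
  const resourceUrl = response.headers.get('Location');
  const answerSdp = await response.text();
  await pc.setRemoteDescription({ type: 'answer', sdp: answerSdp });
  return { pc, resourceUrl };
}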
WHEP for egress follows the same pattern but for subscribing to a stream. Combined, WHIP + WHEP enable end-to-end WebRTC broadcasting at CDN scale without RTMP translation. Cloudflare Stream was the first major service to support both WHIP and WHEP, followed by Janus, mediasoup, and LiveKit.
WebRTC for AI Voice Agents
The fastest-growing WebRTC use case in 2026 is AI voice agents. Unlike traditional video calls, voice agents require WebRTC for every connection — there is no P2P fallback, and 100% of traffic flows through infrastructure.
AI Voice Agent Architecture
// AI voice agent using WebRTC as transport.
// createAgentAudioTrack() and handleControl() are application-defined and not shown.
class VoiceAgent {
  // turnCredential is passed in because process.env is unavailable in browsers
  constructor(agentEndpoint, turnCredential) {
    this.endpoint = agentEndpoint;
    this.turnCredential = turnCredential;
    this.pc = null;
  }
  async connect() {
    this.pc = new RTCPeerConnection({
      iceServers: [
        { urls: 'stun:stun.l.google.com:19302' },
        { urls: 'turn:turn.example.com:3478',
          username: 'agent', credential: this.turnCredential }
      ]
    });
    // Outgoing audio from agent
    const audioTrack = await this.createAgentAudioTrack();
    this.pc.addTrack(audioTrack);
    // Incoming audio from user
    this.pc.ontrack = (event) => {
      this.handleUserAudio(event.streams[0]);
    };
    // Data channel for control messages
    const dc = this.pc.createDataChannel('control', {
      ordered: true
    });
    dc.onmessage = (e) => this.handleControl(JSON.parse(e.data));
    const offer = await this.pc.createOffer();
    await this.pc.setLocalDescription(offer);
    // Send offer to agent server via signaling
  }
  async handleUserAudio(stream) {
    // Stream to ASR (speech-to-text) pipeline
    const audioContext = new AudioContext();
    const source = audioContext.createMediaStreamSource(stream);
    // Route to Whisper / Deepgram / etc.
  }
}
Key Differences: AI Agent vs Human Calling
| Dimension | Human Video Call | AI Voice Agent |
|---|---|---|
| TURN requirement | 15-20% of calls | 100% of calls |
| Target latency | <500ms mouth-to-ear | <300ms for natural conversation |
| Audio processing | None (pass-through) | ASR → LLM → TTS pipeline |
| Barge-in support | Not needed | Required — detect interruptions within 300ms |
| State management | Per-call state | Session persists across media stream |
| Server-side libraries | Optional | Required (pion/Go, aiortc/Python, werift/Node.js) |
Barge-In (Interruption Handling)
Full-duplex WebRTC enables natural interruption. When the user speaks while the agent is talking:
- The agent detects speech via Voice Activity Detection (VAD) on the incoming audio stream
- It stops playback immediately and buffers the user’s utterance
- The ASR processes the utterance, LLM generates a response, TTS streams it back
- Total interruption cycle targets <300ms for natural feel
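The detection step in the list above can be sketched with a simple energy-based VAD. This is an illustrative stand-in, not a production detector: real agents typically use a trained VAD model, and the threshold, frame count, and class name `BargeInDetector` are assumptions chosen for the example.

```javascript
// RMS energy of one frame of 16-bit PCM samples, normalized to 0..1.
function frameRms(samples) {
  let sumSquares = 0;
  for (const s of samples) sumSquares += (s / 32768) ** 2;
  return Math.sqrt(sumSquares / samples.length);
}

// Signals barge-in after `minSpeechFrames` consecutive loud frames
// while the agent is speaking. With 20 ms frames, 3 frames is 60 ms.
class BargeInDetector {
  constructor({ threshold = 0.05, minSpeechFrames = 3 } = {}) {
    this.threshold = threshold;
    this.minSpeechFrames = minSpeechFrames;
    this.speechRun = 0;
  }
  // Returns true when agent playback should stop.
  push(samples, agentIsSpeaking) {
    this.speechRun = frameRms(samples) >= this.threshold ? this.speechRun + 1 : 0;
    return agentIsSpeaking && this.speechRun >= this.minSpeechFrames;
  }
}

const detector = new BargeInDetector();
const loudFrame = new Int16Array(320).fill(8000); // 20 ms at 16 kHz
console.log(detector.push(loudFrame, true));
```

Requiring several consecutive frames trades a few tens of milliseconds of latency for fewer false triggers on clicks and coughs.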
WebRTC Server Libraries for AI Agents
| Library | Language | Strengths |
|---|---|---|
| Pion | Go | Most popular, production-grade, powers LiveKit |
| aiortc | Python | Best for AI/ML integration, SRTP-to-PCM bridge |
| werift | TypeScript/Node.js | TypeScript native, good for JS stacks |
| LiveKit | Go | Full SFU platform with AI agent framework, E2EE |
RTCDataChannel
Data channels provide arbitrary data exchange between peers over SCTP (Stream Control Transmission Protocol). They support configurable reliability modes critical for different use cases.
// Configurable data channel
const dc = pc.createDataChannel('game-state', {
  ordered: false,     // Allow out-of-order delivery
  maxRetransmits: 2   // Retry at most twice, then drop
});
dc.onopen = () => {
  dc.send(JSON.stringify({ type: 'join', room: 'lobby' }));
};
dc.onmessage = (event) => {
  console.log('Received:', event.data);
};
// Reliability modes (maxRetransmits and maxPacketLifeTime are mutually exclusive):
// - ordered: true, neither limit set      -> reliable, ordered (file transfer)
// - ordered: false, neither limit set     -> reliable, unordered (chat)
// - ordered: false, maxRetransmits: 2     -> partially reliable, unordered (game positions)
// - ordered: true, maxPacketLifeTime: 100 -> partially reliable, ordered, 100 ms time bound
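In the WebRTC API a data channel is fully reliable only when neither `maxRetransmits` nor `maxPacketLifeTime` is set; setting either one (including `maxRetransmits: 0`) makes delivery partially reliable, and setting both is rejected. The helper below is a hypothetical sketch that classifies an options object accordingly.

```javascript
// Describe the delivery guarantees implied by RTCDataChannel init options.
// describeChannel is an illustrative helper, not part of the WebRTC API.
function describeChannel({ ordered = true, maxRetransmits, maxPacketLifeTime } = {}) {
  if (maxRetransmits !== undefined && maxPacketLifeTime !== undefined) {
    throw new TypeError('maxRetransmits and maxPacketLifeTime are mutually exclusive');
  }
  const limited = maxRetransmits !== undefined || maxPacketLifeTime !== undefined;
  return `${limited ? 'partially reliable' : 'reliable'}, ${ordered ? 'ordered' : 'unordered'}`;
}

console.log(describeChannel({}));                                    // reliable, ordered
console.log(describeChannel({ ordered: false, maxRetransmits: 0 })); // partially reliable, unordered
```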
Security
DTLS Fingerprint Verification
WebRTC mandates encryption via DTLS-SRTP. The DTLS fingerprint from the remote peer’s certificate can be verified against a known value obtained through the signaling channel:
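// Verify remote fingerprint once the connection (including DTLS) is established.
// Certificates are exposed on RTCDtlsTransport in browsers that implement getRemoteCertificates().
pc.onconnectionstatechange = async () => {
  if (pc.connectionState !== 'connected') return;
  const dtls = pc.getSenders()[0]?.transport;
  const certs = dtls?.getRemoteCertificates?.() ?? []; // DER-encoded ArrayBuffers
  if (certs.length > 0) {
    // Hash the certificate and format it like the a=fingerprint attribute
    const digest = await crypto.subtle.digest('SHA-256', certs[0]);
    const fingerprint = [...new Uint8Array(digest)]
      .map(b => b.toString(16).padStart(2, '0').toUpperCase())
      .join(':');
    verifyAgainstExpected(fingerprint); // application-defined comparison
  }
};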
Insertable Streams (E2EE)
The Insertable Streams API (W3C, supported in Chrome and LiveKit) allows applications to encrypt media frames before they reach the WebRTC stack, providing end-to-end encryption that the SFU cannot decrypt:
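// Insertable Streams for custom E2EE (Chrome's createEncodedStreams variant;
// the RTCPeerConnection must be created with { encodedInsertableStreams: true })
const sender = pc.getSenders()[0];
const streams = sender.createEncodedStreams();
const encryptor = new TransformStream({
  transform: (frame, controller) => {
    // Encrypt frame with application-layer key;
    // encrypt() and sessionKey are application-defined
    frame.data = encrypt(frame.data, sessionKey);
    controller.enqueue(frame);
  }
});
// Frames must be piped back into the sender, otherwise no media is sent
streams.readable.pipeThrough(encryptor).pipeTo(streams.writable);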
Best Practices
- Verify ICE candidates originate from authorized STUN/TURN servers to prevent SSRF
- Use short-lived TURN credentials with time-based expiry
- Limit DataChannel message sizes to 256KB (SCTP fragmentation)
- Use Insertable Streams for sensitive applications requiring true E2EE
- Monitor ICE connection state and implement ICE restart on network transitions
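For the short-lived TURN credentials mentioned above, a common approach is the shared-secret scheme that coturn supports in its `use-auth-secret` mode: the username carries an expiry timestamp and the password is an HMAC of that username. The sketch below assumes a Node.js backend and a coturn server configured with a matching `static-auth-secret` (which differs from the `lt-cred-mech` static user shown in the earlier config); the function name and parameters are illustrative.

```javascript
const { createHmac } = require('node:crypto');

// Time-limited TURN credentials:
//   username   = "<unix expiry>:<user id>"
//   credential = base64(HMAC-SHA1(shared secret, username))
function turnCredentials(userId, sharedSecret, ttlSeconds = 3600, nowMs = Date.now()) {
  const expiry = Math.floor(nowMs / 1000) + ttlSeconds;
  const username = `${expiry}:${userId}`;
  const credential = createHmac('sha1', sharedSecret).update(username).digest('base64');
  return { username, credential };
}

// Hand the result to the client for use in its iceServers entry.
const creds = turnCredentials('alice', 'replace-with-shared-secret');
console.log(creds.username);
```

Because the server can recompute the HMAC from the username alone, no per-user credential store is needed, and leaked credentials stop working once the expiry passes.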
Scalability Architectures
Mesh (P2P)
Every peer connects to every other peer. Simple, but limited to 2-4 participants in practice: each peer uploads N-1 streams, and the total number of connections grows as O(n²).
SFU (Selective Forwarding Unit)
A central server forwards streams without decoding. Each peer sends one stream and receives N-1. Scales to 10-100+ participants. The 2026 default for any room above 4 peers.
MCU (Multipoint Control Unit)
Central server decodes, mixes, and re-encodes streams. Provides composite output but requires significant CPU. Mostly legacy — SFUs have replaced MCUs for most use cases.
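The difference between mesh and SFU load can be made concrete with a little arithmetic. The sketch below counts streams per peer and transport connections for a room of n participants, assuming every participant publishes one stream and subscribes to all others; the function and field names are illustrative.

```javascript
// Per-peer stream counts and total connections for a room of n participants.
function topologyLoad(n) {
  return {
    mesh: {
      uploadsPerPeer: n - 1,           // one copy sent to every other peer
      downloadsPerPeer: n - 1,
      connections: (n * (n - 1)) / 2   // one connection per pair of peers
    },
    sfu: {
      uploadsPerPeer: 1,               // a single upstream to the server
      downloadsPerPeer: n - 1,
      connections: n                   // one connection per peer to the SFU
    }
  };
}

console.log(topologyLoad(4));
console.log(topologyLoad(10));
```

Download counts are the same in both topologies; the SFU's advantage is that each peer's upload stays constant as the room grows.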
SFU Comparison: mediasoup vs LiveKit vs Janus
| Feature | mediasoup | LiveKit | Janus |
|---|---|---|---|
| Language | C++ (Node.js/Rust API) | Go | C |
| Simulcast | Full | Full | Plugin-dependent |
| SVC | Full (VP9/AV1) | Full | Limited |
| AI agent support | Manual integration | Built-in agent framework | Manual |
| E2EE (Insertable Streams) | Manual | First-class | Manual |
| Recording | Via FFmpeg | Built-in egress | Plugin |
| Kubernetes | Manual | Official Helm charts | Manual |
| Learning curve | Steep (low-level API) | Moderate | Moderate |
| Best for | Custom pipelines, large rooms | AI agents, quick production | SIP interop, legacy |
flowchart TB
subgraph SFU["SFU Architecture"]
P1[Publisher] -->|Single upstream| SFU_CORE
SFU_CORE -->|Forward| C1[Consumer 1]
SFU_CORE -->|Forward| C2[Consumer 2]
SFU_CORE -->|Forward| C3[Consumer N]
end
subgraph Mesh["Mesh (P2P) Architecture"]
M1[Peer A] -->|N-1 streams| M2[Peer B]
M1 -->|N-1 streams| M3[Peer C]
M2 -->|N-1 streams| M1
M2 -->|N-1 streams| M3
M3 -->|N-1 streams| M1
M3 -->|N-1 streams| M2
end
Production SFU Deployment
# Deploy mediasoup with Docker
docker run -d --name mediasoup \
-p 3000:3000 \
-p 40000-40010:40000-40010/udp \
-e MEDIASOUP_LISTEN_IP=0.0.0.0 \
-e MEDIASOUP_ANNOUNCED_IP=<your-public-ip> \
your-mediasoup-image
# LiveKit with Helm
helm repo add livekit https://helm.livekit.io
helm upgrade --install my-server livekit/livekit-server \
--set keys.api=your-api-key \
--set keys.secret=your-secret
WebRTC vs WebTransport
WebTransport (built on QUIC + HTTP/3) is emerging as a complementary technology for specific use cases. The combination of WebTransport + WebCodecs + WebAssembly is considered a “dark horse” stack that could disrupt traditional WebRTC in client-server media pipelines.
| Dimension | WebRTC | WebTransport |
|---|---|---|
| Architecture | Peer-to-peer + SFU | Client-server |
| Media pipeline | Built-in (codec negotiation, echo cancellation) | Build with WebCodecs |
| Transport | SCTP + SRTP over DTLS | QUIC streams + datagrams |
| Browser support | 100% of browsers | Baseline since March 2026 |
| Unreliable data | Partial (SCTP partial reliability) | Native datagrams |
| Connection setup | ICE + STUN/TURN (3-5 RTTs) | HTTP/3 (0-1 RTT) |
| Worker support | Limited DataChannel in workers | Full WebTransport in workers |
| Server complexity | High (ICE, TURN, signaling) | Lower (direct HTTP/3 server) |
| Best for | Video/audio calls, P2P | Cloud gaming, live streaming, AI inference |
As of 2026, Media over QUIC (MoQ, IETF draft-17) combines elements of both — using WebTransport as transport with a pub-sub media model. Production MoQ deployments exist in controlled environments, but universal browser-based MoQ is a 2026-2027 story.
Production Considerations
TURN Capacity Planning
For AI voice agent deployments, allocate TURN bandwidth based on concurrent sessions rather than peak call minutes. Each audio-only session consumes ~64-128 Kbps. A server handling 1,000 concurrent agents therefore needs roughly 64-128 Mbps of relay capacity in each direction (about 100 Mbps at the midpoint), before any headroom.
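The sizing arithmetic above can be written as a one-line estimator. This is a rough sketch: the function name and the optional `headroom` multiplier are assumptions, and the result is per direction because a relay both receives and forwards each stream.

```javascript
// Relay bandwidth (Mbps, per direction) for a number of concurrent audio sessions.
function relayCapacityMbps(concurrentSessions, kbpsPerSession, headroom = 1.0) {
  return (concurrentSessions * kbpsPerSession * headroom) / 1000;
}

console.log(relayCapacityMbps(1000, 64));  // 64
console.log(relayCapacityMbps(1000, 128)); // 128
```

Passing a `headroom` above 1 (for example 1.3) leaves margin for bursts and protocol overhead.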
Connection Quality Monitoring
// Monitor WebRTC connection quality
setInterval(async () => {
  const stats = await pc.getStats();
  stats.forEach(report => {
    if (report.type === 'inbound-rtp') {
      console.log(`Packets lost: ${report.packetsLost}`);
      console.log(`Jitter: ${report.jitter}s`);
    }
    // Round-trip time is reported on the nominated candidate pair, not on inbound-rtp
    if (report.type === 'candidate-pair' && report.nominated) {
      console.log(`Round trip: ${report.currentRoundTripTime}s`);
    }
  });
}, 5000);
Simulcast and SVC
Simulcast encodes the same video at multiple resolutions and bitrates. The SFU selects the appropriate layer per receiver based on bandwidth. SVC (Scalable Video Coding) encodes a base layer with enhancement layers for temporal and spatial scalability. VP9 and AV1 support SVC; VP8 and H.264 use simulcast.
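The per-receiver layer selection an SFU performs can be sketched as picking the highest simulcast layer whose bitrate fits the receiver's estimated bandwidth. The layer table below (rids, widths, bitrates) is a made-up example, and real SFUs add hysteresis and keyframe handling that this sketch omits.

```javascript
// Hypothetical simulcast layers published by a sender.
const LAYERS = [
  { rid: 'q', width: 320,  maxBitrate: 150_000 },
  { rid: 'h', width: 640,  maxBitrate: 500_000 },
  { rid: 'f', width: 1280, maxBitrate: 1_500_000 }
];

// Pick the highest layer that fits the receiver's available bandwidth (bits per second).
function selectLayer(layers, availableBps) {
  const sorted = [...layers].sort((a, b) => a.maxBitrate - b.maxBitrate);
  let chosen = sorted[0]; // always forward at least the lowest layer
  for (const layer of sorted) {
    if (layer.maxBitrate <= availableBps) chosen = layer;
  }
  return chosen;
}

console.log(selectLayer(LAYERS, 600_000).rid); // h
```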
Conclusion
WebRTC has matured from a browser-based video call protocol into the foundational transport for real-time communication. In 2026, three trends define its evolution: the rise of AI voice agents that use WebRTC as their transport layer, the standardization of WHIP/WHEP for broadcasting, and the emergence of WebTransport + MoQ as complementary technologies for specific use cases. Understanding the full protocol stack — ICE for connectivity, STUN/TURN for NAT traversal, SRTP for media encryption, and SDP for negotiation — remains essential for building production-grade real-time applications.
Resources
- WebRTC W3C Specification — Official W3C API spec
- RFC 5245 - ICE — Interactive Connectivity Establishment
- RFC 5766 - TURN — Traversal Using Relays around NAT
- RFC 8489 - STUN — Session Traversal Utilities for NAT
- RFC 9725 - WHIP — WebRTC-HTTP Ingestion Protocol (March 2025)
- LiveKit — Open-source SFU with AI agent framework
- mediasoup — Low-level C++ SFU with Node.js/Rust API
- Pion WebRTC — Go WebRTC implementation for AI agents
- Cloudflare Stream WHIP/WHEP — Standards-based WebRTC broadcasting
- Interop 2026: WebRTC — WebKit’s cross-browser WebRTC focus area