
WebRTC Protocols: Real-Time Communication Architecture 2026

Created: March 11, 2026 Larry Qu 15 min read

Introduction

WebRTC (Web Real-Time Communication) enables direct peer-to-peer audio, video, and data exchange in browsers and mobile applications without plugins. By 2026, WebRTC powers billions of video call minutes, live streaming sessions, and — the fastest-growing segment — AI voice agents. The protocol stack that was designed for browser-to-browser calls has become the transport layer for real-time AI: voice agents that listen, think, and speak with sub-500ms latency.

This guide covers WebRTC architecture, the protocols that make it work (ICE, STUN, TURN, RTP, SDP, DTLS), production deployment patterns with SFUs (mediasoup, LiveKit, Janus), the new WHIP/WHEP standards for broadcasting, WebRTC for AI voice agents, and comparison with the emerging WebTransport + WebCodecs stack.

For foundational knowledge of related protocols, see the WebSocket Protocol Guide and the WebTransport Protocol Guide.

What Is WebRTC?

WebRTC is a collection of W3C APIs and IETF protocols for real-time communication. Unlike WebSockets, which provide a single TCP stream, WebRTC operates over UDP with built-in NAT traversal, encryption, and codec negotiation.

Core Components

MediaStream (getUserMedia) — Access camera and microphone. Returns audio and video tracks that can be fed directly into a PeerConnection.

RTCPeerConnection — The central object that manages the full lifecycle of a peer-to-peer connection: ICE candidate gathering, DTLS key exchange, SRTP media encryption, and bandwidth estimation.

RTCDataChannel — Arbitrary data exchange over SCTP, with configurable reliability (ordered/unordered, retransmit limits, partial reliability).

Browser Support (2026)

| Browser | WebRTC | WHIP | WHEP | DataChannel | Insertable Streams |
| --- | --- | --- | --- | --- | --- |
| Chrome | Full | Full | Full | Full | Full |
| Firefox | Full | Full | Full | Full | Full |
| Safari | Full | Full (15+) | Full (15+) | Full | Partial |
| Edge | Full | Full | Full | Full | Full |
| Node.js | Via libraries | Via libraries | Via libraries | Via libraries | N/A |

WebRTC is part of WebKit’s Interop 2026 focus areas, having reached 91.6% cross-browser pass rate in 2025. The remaining long tail of interop issues is being addressed through the Interop project.

Architecture Overview

Protocol Stack

Application Layer
       |
WebRTC API (JavaScript)
       |
Signaling (WebSocket/HTTP)
       |
SDP (Session Description)
       |
ICE (Interactive Connectivity Establishment)
       |
STUN / TURN
       |
DTLS (Datagram Transport Layer Security)
       |
SRTP (Secure RTP)  |  SCTP (Data Channel)
       |
UDP (preferred) or TCP (fallback)

Connection Flow

Peer A                    Signaling                 Peer B
  |                         |                        |
  |------ Offer (SDP) ----->|                        |
  |                         |------ Offer (SDP) ---->|
  |                         |                        |
  |                         |<----- Answer (SDP) ----|
  |<----- Answer (SDP) -----|                        |
  |                         |                        |
  |<==== ICE Candidates ===>|<=== ICE Candidates ===>|
  |                         |                        |
  |<================ DTLS Handshake ================>|
  |                         |                        |
  |<============= Media Streams (SRTP) =============>|
  |                         |                        |
  |<============= Data Channel (SCTP) ==============>|

Session Description Protocol (SDP)

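SDP describes multimedia sessions for negotiation between peers. It lists the codecs and media formats each peer supports, along with ICE credentials, network addresses, and the DTLS certificate fingerprint. In WebRTC the media keys themselves are never carried in SDP; they are derived during the DTLS handshake.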

SDP Structure

v=0
o=- 7027864585432175469 2 IN IP4 127.0.0.1
s=-
t=0 0
a=group:BUNDLE 0 1 2
a=extmap-allow-mixed
a=msid-semantic: WMS *
m=audio 9 UDP/TLS/RTP/SAVPF 111 103 104 9 0 8 106 105 13 110 112 126 97 98
c=IN IP4 0.0.0.0
a=rtcp:9 IN IP4 0.0.0.0
a=ice-ufrag:xyz123
a=ice-pwd:abcdefghijklmnopqrstuvwxyz
a=rtpmap:111 opus/48000/2
a=rtcp-fb:111 transport-cc
a=fmtp:111 minptime=10;useinbandfec=1

Key Attributes

o= (Origin) — Session identifier and version. Changes when the session is modified.

m= (Media Description) — Media type (audio/video/application), port, transport protocol, and list of payload type identifiers for supported codecs.

a=rtpmap — Maps payload type numbers to codec names and clock rates (e.g., 111 opus/48000/2 means payload type 111 is Opus at 48kHz stereo).

a=fmtp — Codec-specific parameters (e.g., useinbandfec=1 enables Opus in-band forward error correction).

a=ice-ufrag / a=ice-pwd — ICE credentials used to authenticate connectivity checks.

a=fingerprint — SHA-256 hash of the DTLS certificate, used for key verification.
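To make the payload-type mapping concrete, here is a minimal Python sketch that extracts `a=rtpmap` entries from an SDP blob. The `SAMPLE_SDP` string and the `parse_rtpmap` helper are illustrative names, not part of any WebRTC library, and the parser ignores everything except `a=rtpmap` lines.

```python
import re

# Trimmed-down audio section modelled on the SDP structure shown earlier
# (illustrative sample, not captured from a real session).
SAMPLE_SDP = """\
m=audio 9 UDP/TLS/RTP/SAVPF 111 0 8
a=rtpmap:111 opus/48000/2
a=rtpmap:0 PCMU/8000
a=rtpmap:8 PCMA/8000
a=fmtp:111 minptime=10;useinbandfec=1
"""

# a=rtpmap:<payload type> <encoding name>/<clock rate>[/<channels>]
RTPMAP = re.compile(r"^a=rtpmap:(\d+) ([^/]+)/(\d+)(?:/(\d+))?$")


def parse_rtpmap(sdp):
    """Return {payload_type: {"name", "clock_rate", "channels"}}."""
    codecs = {}
    for line in sdp.splitlines():
        match = RTPMAP.match(line.strip())
        if match:
            pt, name, rate, channels = match.groups()
            codecs[int(pt)] = {
                "name": name,
                "clock_rate": int(rate),
                # The channel count is optional and defaults to 1 (mono).
                "channels": int(channels) if channels else 1,
            }
    return codecs


codecs = parse_rtpmap(SAMPLE_SDP)
print(codecs[111])
```

A production parser would also need to handle `a=fmtp` and `a=rtcp-fb` lines and multiple `m=` sections; this sketch only shows how the number in the `m=` line resolves to a codec.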

Creating Offer/Answer

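// Create SDP offer
// (offerToReceiveAudio/Video are legacy options; addTransceiver() is the modern equivalent)
const offer = await peerConnection.createOffer({
    offerToReceiveAudio: true,
    offerToReceiveVideo: true
});

await peerConnection.setLocalDescription(offer);

// Send the offer over your signaling channel, then apply the answer it returns.
// `answer` is the { type: 'answer', sdp } object received from the remote peer.
await peerConnection.setRemoteDescription(answer);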

Interactive Connectivity Establishment (ICE)

ICE is the protocol that enables peers to discover and establish direct network paths through NATs and firewalls. It collects candidate addresses, prioritizes them, and tests each pair for connectivity.

ICE Candidates

a=candidate:1 1 UDP 2130706431 192.168.1.100 49170 typ host
a=candidate:2 1 UDP 1694498815 203.0.113.50 49171 typ srflx raddr 192.168.1.100 rport 49170
a=candidate:3 1 UDP 847250719 10.0.0.50 49172 typ relay raddr 203.0.113.50 rport 49173

Each candidate contains: foundation ID, component ID (1=RTP, 2=RTCP), transport protocol, priority (higher = better), IP address, port, and type.
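As a rough illustration of that field layout, the following Python sketch splits a candidate line into named fields. `parse_candidate` is a hypothetical helper written for this example; it assumes well-formed input and does no validation.

```python
def parse_candidate(line):
    """Split an a=candidate line into named fields (no validation)."""
    body = line.split(":", 1)[1]          # drop the "a=candidate" prefix
    parts = body.split()
    fields = {
        "foundation": parts[0],
        "component": int(parts[1]),       # 1 = RTP, 2 = RTCP
        "transport": parts[2].upper(),
        "priority": int(parts[3]),
        "ip": parts[4],
        "port": int(parts[5]),
    }
    # The remaining tokens come in key/value pairs: typ, raddr, rport, ...
    extras = parts[6:]
    for key, value in zip(extras[0::2], extras[1::2]):
        fields[key] = value
    return fields


cand = parse_candidate(
    "a=candidate:2 1 UDP 1694498815 203.0.113.50 49171 "
    "typ srflx raddr 192.168.1.100 rport 49170"
)
print(cand["typ"], cand["ip"], cand["port"])
```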

Candidate Types

host — Local network address (highest priority). Works when both peers are on the same LAN.

srflx (Server Reflexive) — Public IP discovered via STUN. Works for most residential NATs.

prflx (Peer Reflexive) — Public IP learned dynamically from a peer’s connectivity check.

relay — TURN relay address (lowest priority). Used when all other paths fail.
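The ordering of these types comes from the candidate priority formula in the ICE specification (RFC 8445). The sketch below computes it in Python using the type preferences the RFC recommends (host 126, prflx 110, srflx 100, relay 0). The default `local_preference` of 65535 is an assumption for a single-interface host; implementations are free to choose other preference values, so real candidates will not always match these numbers.

```python
# Type preferences recommended by RFC 8445; implementations may use others.
TYPE_PREFERENCE = {"host": 126, "prflx": 110, "srflx": 100, "relay": 0}


def candidate_priority(cand_type, component=1, local_preference=65535):
    """priority = 2^24 * type pref + 2^8 * local pref + (256 - component ID)."""
    return (
        (TYPE_PREFERENCE[cand_type] << 24)
        + (local_preference << 8)
        + (256 - component)
    )


for cand_type in ("host", "prflx", "srflx", "relay"):
    print(cand_type, candidate_priority(cand_type))
```

With these defaults the host and srflx results reproduce the priorities 2130706431 and 1694498815 from the sample candidate lines.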

ICE Configuration

const pc = new RTCPeerConnection({
    iceServers: [
        { urls: 'stun:stun.l.google.com:19302' },
        {
            urls: 'turn:turn.example.com:3478',
            username: 'user',
            credential: 'pass'
        }
    ],
    iceCandidatePoolSize: 10
});

// ICE candidate events
pc.onicecandidate = (event) => {
    if (event.candidate) {
        sendToPeer({ type: 'candidate', candidate: event.candidate });
    }
};

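pc.oniceconnectionstatechange = () => {
    console.log('ICE state:', pc.iceConnectionState);
    // Success: new -> checking -> connected -> completed
    // Failure: new -> checking -> failed
    // Network loss after connecting: connected -> disconnected -> failed
};

// Apply each candidate received from the remote peer via signaling
await pc.addIceCandidate(candidate);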

ICE State Machine

stateDiagram-v2
    [*] --> New
    New --> Checking
    Checking --> Connected
    Checking --> Failed
    Connected --> Completed
    Connected --> Disconnected
    Completed --> Disconnected
    Disconnected --> Checking : ICE restart
    Disconnected --> Failed
    Failed --> [*]
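One way to use this state machine in application code is as a transition table for sanity-checking logged state sequences. The Python sketch below mirrors only the arrows drawn in the diagram; `ICE_TRANSITIONS` and `is_valid_path` are names invented for the example. Real browsers can also move from disconnected back to connected when connectivity recovers on its own, which the diagram does not show.

```python
# Allowed transitions, copied from the state diagram (lower-cased).
ICE_TRANSITIONS = {
    "new": {"checking"},
    "checking": {"connected", "failed"},
    "connected": {"completed", "disconnected"},
    "completed": {"disconnected"},
    "disconnected": {"checking", "failed"},   # checking = ICE restart
    "failed": set(),
}


def is_valid_path(states):
    """True if every consecutive pair of states is an allowed transition."""
    return all(
        nxt in ICE_TRANSITIONS[cur] for cur, nxt in zip(states, states[1:])
    )


print(is_valid_path(["new", "checking", "connected", "completed"]))
print(is_valid_path(["new", "checking", "disconnected"]))
```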

STUN Protocol

STUN (Session Traversal Utilities for NAT, RFC 8489) helps peers discover their public IP addresses and port mappings. A STUN server is stateless — it simply echoes back the source address it sees.

Client                    STUN Server
  |                          |
  |--- Binding Request ----->|
  |    (transaction ID)       |
  |                          |
  |<-- Binding Response -----|
  |    (XOR-MAPPED-ADDRESS)  |
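  |    (MAPPED-ADDRESS)      |

A minimal Binding Request in Python:

import os
import socket
import struct

STUN_SERVER = 'stun.l.google.com'
STUN_PORT = 19302

def stun_binding_request():
    # 20-byte STUN header: message type, length, magic cookie, transaction ID
    msg_type = 0x0001                 # Binding Request
    msg_length = 0                    # no attributes follow the header
    magic_cookie = 0x2112A442         # RFC 8489 magic cookie
    transaction_id = os.urandom(12)   # random 96-bit transaction ID
    request = struct.pack('!HHI', msg_type, msg_length, magic_cookie) + transaction_id

    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(3)
        sock.sendto(request, (STUN_SERVER, STUN_PORT))
        response, addr = sock.recvfrom(512)

    # The server echoes the transaction ID; reject anything that does not match
    if response[8:20] != transaction_id:
        raise ValueError('STUN transaction ID mismatch')

    # The XOR-MAPPED-ADDRESS attribute in the response holds the public address
    return response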
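The response still has to be decoded to recover the public address. The following Python sketch walks the attribute list and un-XORs the XOR-MAPPED-ADDRESS value as described in RFC 8489. It handles IPv4 only, and `parse_xor_mapped_address` and `build_response` are hypothetical helpers for this example; the latter fabricates a synthetic response so the decoder can be exercised without network access.

```python
import socket
import struct

MAGIC_COOKIE = 0x2112A442
XOR_MAPPED_ADDRESS = 0x0020


def parse_xor_mapped_address(response):
    """Return (ip, port) from a STUN Binding response, or None (IPv4 only)."""
    msg_len = struct.unpack_from("!H", response, 2)[0]
    offset, end = 20, 20 + msg_len            # attributes follow the header
    while offset + 4 <= end:
        attr_type, attr_len = struct.unpack_from("!HH", response, offset)
        value = response[offset + 4 : offset + 4 + attr_len]
        if attr_type == XOR_MAPPED_ADDRESS and len(value) >= 8 and value[1] == 0x01:
            x_port, x_addr = struct.unpack_from("!HI", value, 2)
            port = x_port ^ (MAGIC_COOKIE >> 16)      # XOR with top 16 bits
            ip = socket.inet_ntoa(struct.pack("!I", x_addr ^ MAGIC_COOKIE))
            return ip, port
        offset += 4 + attr_len + (-attr_len % 4)      # 32-bit alignment
    return None


def build_response(ip, port):
    """Synthetic Binding success response carrying one XOR-MAPPED-ADDRESS."""
    x_port = port ^ (MAGIC_COOKIE >> 16)
    x_addr = struct.unpack("!I", socket.inet_aton(ip))[0] ^ MAGIC_COOKIE
    attr = struct.pack("!HHBBHI", XOR_MAPPED_ADDRESS, 8, 0, 0x01, x_port, x_addr)
    header = struct.pack("!HHI", 0x0101, len(attr), MAGIC_COOKIE) + b"\x00" * 12
    return header + attr


mapped = parse_xor_mapped_address(build_response("203.0.113.50", 49171))
print(mapped)
```

The XOR step exists because some NATs rewrite any bytes in the payload that look like the mapped address; obfuscating the value with the magic cookie prevents that.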

Google operates a public STUN server at stun.l.google.com:19302 that is free to use and suitable for development. Production deployments should operate their own STUN infrastructure or use a commercial TURN provider.

TURN Protocol

TURN (Traversal Using Relays around NAT, RFC 8656, which obsoletes RFC 5766) provides relay servers for cases where direct peer-to-peer connections are impossible. In 2026, this is especially critical for AI voice agents: unlike human video calls, where only ~15-20% of connections need TURN, AI voice agents always connect through infrastructure with no P2P fallback, meaning 100% of traffic hits TURN servers.

When TURN Is Needed

  • Symmetric NAT (both peers behind carrier-grade NAT)
  • Firewall blocks UDP entirely (TCP/TLS TURN fallback)
  • Enterprise proxy environments that restrict outbound traffic
  • AI voice agents requiring guaranteed low-latency connections

TURN Server Setup with coturn

# Install coturn (battle-tested TURN server)
apt-get install coturn

# /etc/turnserver.conf
listening-port=3478
tls-listening-port=5349
realm=example.com
fingerprint
lt-cred-mech
user=app1:securepass
total-quota=100
bps-capacity=0
log-file=/var/log/turn/turn.log

Production TURN Cost Modeling

For AI voice agent deployments, TURN costs follow a different model than traditional WebRTC:

| Factor | Human Video Calls | AI Voice Agents |
| --- | --- | --- |
| TURN traffic share | 15-20% of connections | 100% of connections |
| Average call duration | 5-30 minutes | 2-10 minutes |
| Bandwidth per stream | 1-4 Mbps (video) | 64-128 Kbps (audio only) |
| Cost driver | Concurrent video streams | Concurrent sessions × duration |

Deploy TURN servers in 3+ regions with anycast routing for lowest latency. Use coturn (the standard open-source implementation) or LiveKit’s bundled TURN for managed deployments.

RTP and SRTP

RTP (Real-time Transport Protocol) carries media streams — audio encoded with Opus, video encoded with VP8, H.264, VP9, or AV1. SRTP encrypts the RTP payload using AES, negotiated automatically via DTLS-SRTP during the handshake.

RTP Packet Header

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|V=2|P|X|  CC   |M|     PT     |       Sequence Number         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                           Timestamp                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           Synchronization Source (SSRC) identifier           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                          Payload                              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
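To show how the bit fields in the diagram map onto bytes, here is a small Python sketch that decodes the 12-byte fixed RTP header. `parse_rtp_header` is an illustrative helper; it ignores CSRC entries and header extensions, and the sample packet is synthetic.

```python
import struct


def parse_rtp_header(packet):
    """Decode the 12-byte fixed RTP header (CSRCs and extensions ignored)."""
    if len(packet) < 12:
        raise ValueError("RTP packet shorter than the 12-byte fixed header")
    b0, b1, seq, timestamp, ssrc = struct.unpack_from("!BBHII", packet, 0)
    return {
        "version": b0 >> 6,               # V: always 2
        "padding": bool(b0 & 0x20),       # P
        "extension": bool(b0 & 0x10),     # X
        "csrc_count": b0 & 0x0F,          # CC
        "marker": bool(b1 & 0x80),        # M
        "payload_type": b1 & 0x7F,        # PT, resolved via a=rtpmap
        "sequence_number": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
    }


# Synthetic packet: V=2, PT=111, seq=4242, timestamp=960, plus 4 payload bytes.
sample = struct.pack("!BBHII", 0x80, 111, 4242, 960, 0x1234ABCD) + b"\x00" * 4
header = parse_rtp_header(sample)
print(header)
```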

Codec Support in 2026

| Codec | Type | Bitrate Range | Browser Support | Notes |
| --- | --- | --- | --- | --- |
| Opus | Audio | 6-510 Kbps | All | Default, 48kHz, inband FEC, DTX |
| VP8 | Video | 100 Kbps - 2 Mbps | All | Fallback, mandatory to implement |
| H.264 | Video | 100 Kbps - 5 Mbps | All | Hardware encode/decode on most devices |
| VP9 | Video | 100 Kbps - 8 Mbps | Chrome, Firefox, Edge | 30% better compression than VP8 |
| AV1 | Video | 80 Kbps - 4 Mbps | Chrome, Firefox, Edge (partial) | 50% better than VP8, limited hardware encode |

AV1 adoption is progressing but not yet dominant in 2026. VP8 and H.264 remain the safest choices for broad compatibility. As WebRTC expert Tsahi Levent-Levi notes, AV1 may not reach dominant status until 2028.

WebRTC Codec Negotiation

// Prefer AV1 when available, keeping the remaining codecs as fallbacks
const transceiver = pc.getTransceivers()[0];
// setCodecPreferences() must be called before createOffer()/createAnswer()
const codecs = RTCRtpReceiver.getCapabilities('video').codecs;
const av1 = codecs.filter(c => c.mimeType === 'video/AV1');
const others = codecs.filter(c => c.mimeType !== 'video/AV1');
if (av1.length > 0) {
    // List every AV1 entry first without duplicating any codec
    transceiver.setCodecPreferences([...av1, ...others]);
}

WHIP and WHEP: WebRTC Broadcasting Standards

WHIP (WebRTC-HTTP Ingestion Protocol, RFC 9725, published March 2025) and WHEP (WebRTC-HTTP Egress Protocol, IETF draft) define simple HTTP-based signaling for WebRTC broadcasting. They replace RTMP with a standards-based, browser-native ingestion and distribution pipeline.

WHIP for Ingestion

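// WHIP: Ingest a WebRTC stream via HTTP POST
async function whipPublish(stream, whipEndpoint) {
    const pc = new RTCPeerConnection({
        iceServers: [{ urls: 'stun:stun.l.google.com:19302' }]
    });

    stream.getTracks().forEach(track => pc.addTrack(track, stream));

    const offer = await pc.createOffer();
    await pc.setLocalDescription(offer);

    // POST SDP offer to WHIP endpoint
    // (add an Authorization: Bearer header if the server requires one)
    const response = await fetch(whipEndpoint, {
        method: 'POST',
        headers: { 'Content-Type': 'application/sdp' },
        body: offer.sdp
    });

    // A successful WHIP POST returns 201 Created with the SDP answer as the body.
    // The Location response header names the session resource; DELETE it to stop publishing.
    if (response.status !== 201) {
        throw new Error(`WHIP publish failed: HTTP ${response.status}`);
    }

    const answerSdp = await response.text();
    await pc.setRemoteDescription({ type: 'answer', sdp: answerSdp });

    return pc;
}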

WHEP for egress follows the same pattern but for subscribing to a stream. Combined, WHIP + WHEP enable end-to-end WebRTC broadcasting at CDN scale without RTMP translation. Cloudflare Stream was the first major service to support both WHIP and WHEP, followed by Janus, mediasoup, and LiveKit.

WebRTC for AI Voice Agents

The fastest-growing WebRTC use case in 2026 is AI voice agents. Unlike traditional video calls, voice agents require WebRTC for every connection — there is no P2P fallback, and 100% of traffic flows through infrastructure.

AI Voice Agent Architecture

// AI voice agent using WebRTC as transport
class VoiceAgent {
    constructor(agentEndpoint) {
        this.endpoint = agentEndpoint;
        this.pc = null;
    }

    async connect() {
        this.pc = new RTCPeerConnection({
            iceServers: [
                { urls: 'stun:stun.l.google.com:19302' },
                { urls: 'turn:turn.example.com:3478',
                  username: 'agent', credential: process.env.TURN_SECRET }
            ]
        });

        // Outgoing audio from agent
        const audioTrack = await this.createAgentAudioTrack();
        this.pc.addTrack(audioTrack);

        // Incoming audio from user
        this.pc.ontrack = (event) => {
            this.handleUserAudio(event.streams[0]);
        };

        // Data channel for control messages
        const dc = this.pc.createDataChannel('control', {
            ordered: true
        });
        dc.onmessage = (e) => this.handleControl(JSON.parse(e.data));

        const offer = await this.pc.createOffer();
        await this.pc.setLocalDescription(offer);
        // Send offer to agent server via signaling
    }

    async handleUserAudio(stream) {
        // Stream to ASR (speech-to-text) pipeline
        const audioContext = new AudioContext();
        const source = audioContext.createMediaStreamSource(stream);
        // Route to Whisper / Deepgram / etc.
    }
}

Key Differences: AI Agent vs Human Calling

| Dimension | Human Video Call | AI Voice Agent |
| --- | --- | --- |
| TURN requirement | 15-20% of calls | 100% of calls |
| Target latency | <500ms mouth-to-ear | <300ms for natural conversation |
| Audio processing | None (pass-through) | ASR → LLM → TTS pipeline |
| Barge-in support | Not needed | Required: detect interruptions within 300ms |
| State management | Per-call state | Session persists across media stream |
| Server-side libraries | Optional | Required (pion/Go, aiortc/Python, werift/Node.js) |

Barge-In (Interruption Handling)

Full-duplex WebRTC enables natural interruption. When the user speaks while the agent is talking:

  1. The agent detects speech via Voice Activity Detection (VAD) on the incoming audio stream
  2. It stops playback immediately and buffers the user’s utterance
  3. The ASR processes the utterance, LLM generates a response, TTS streams it back
  4. Total interruption cycle targets <300ms for natural feel
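The first two steps can be sketched with a simple energy-based detector. The Python below is an assumption-heavy illustration: it expects 16-bit little-endian mono PCM in 20 ms frames, uses a fixed RMS threshold in place of a real VAD model, and the `BargeInDetector` class and its parameters are invented for the example.

```python
import math
import struct


def rms(frame):
    """RMS amplitude of a frame of 16-bit little-endian mono PCM."""
    n = len(frame) // 2
    if n == 0:
        return 0.0
    samples = struct.unpack(f"<{n}h", frame[: n * 2])
    return math.sqrt(sum(s * s for s in samples) / n)


class BargeInDetector:
    """Signals an interruption after several consecutive loud frames."""

    def __init__(self, threshold=500.0, min_speech_frames=3):
        self.threshold = threshold                  # assumed RMS speech level
        self.min_speech_frames = min_speech_frames  # 3 x 20 ms = 60 ms
        self.speech_run = 0

    def process(self, frame, agent_speaking):
        """Return True when agent playback should stop."""
        if rms(frame) >= self.threshold:
            self.speech_run += 1
        else:
            self.speech_run = 0
        return agent_speaking and self.speech_run >= self.min_speech_frames


# 20 ms at 16 kHz = 320 samples per frame.
silence = bytes(640)
loud = struct.pack("<320h", *([4000] * 320))

detector = BargeInDetector()
results = [detector.process(f, agent_speaking=True)
           for f in (silence, loud, loud, loud)]
print(results)
```

Requiring a short run of loud frames trades a few tens of milliseconds of latency for fewer false triggers on clicks and echo; production systems typically use a trained VAD model instead of a fixed threshold.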

WebRTC Server Libraries for AI Agents

| Library | Language | Strengths |
| --- | --- | --- |
| Pion | Go | Most popular, production-grade, powers LiveKit |
| aiortc | Python | Best for AI/ML integration, SRTP-to-PCM bridge |
| werift | TypeScript/Node.js | TypeScript native, good for JS stacks |
| LiveKit | Go | Full SFU platform with AI agent framework, E2EE |

RTCDataChannel

Data channels provide arbitrary data exchange between peers over SCTP (Stream Control Transmission Protocol). They support configurable reliability modes critical for different use cases.

// Configurable data channel
const dc = pc.createDataChannel('game-state', {
    ordered: false,          // Allow out-of-order delivery
    maxRetransmits: 2,       // Retry at most twice, then drop
});

dc.onopen = () => {
    dc.send(JSON.stringify({ type: 'join', room: 'lobby' }));
};

dc.onmessage = (event) => {
    console.log('Received:', event.data);
};

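// Reliability modes (maxRetransmits and maxPacketLifeTime are mutually exclusive):
// - ordered: true,  no limits set          -> reliable, ordered (file transfer; the default)
// - ordered: false, no limits set          -> reliable, unordered (chat)
// - ordered: false, maxRetransmits: 0      -> unreliable, unordered (fire-and-forget)
// - ordered: false, maxRetransmits: 2      -> partially reliable, unordered (game positions)
// - ordered: true,  maxPacketLifeTime: 100 -> partially reliable, dropped after 100 ms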

Security

DTLS Fingerprint Verification

WebRTC mandates encryption via DTLS-SRTP. The DTLS fingerprint from the remote peer’s certificate can be verified against a known value obtained through the signaling channel:

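// Verify the remote DTLS certificate once the transport is connected.
// The certificates live on RTCDtlsTransport, not on RTCPeerConnection.
const dtls = pc.getSenders()[0]?.transport;   // null until negotiation completes
if (dtls) {
    dtls.onstatechange = async () => {
        if (dtls.state !== 'connected') return;
        // getRemoteCertificates() returns DER-encoded certificates as ArrayBuffers
        const [cert] = dtls.getRemoteCertificates();
        if (!cert) return;
        const digest = await crypto.subtle.digest('SHA-256', cert);
        const fingerprint = Array.from(new Uint8Array(digest))
            .map(b => b.toString(16).padStart(2, '0').toUpperCase())
            .join(':');
        verifyAgainstExpected(fingerprint);   // application-defined comparison
    };
}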

Insertable Streams (E2EE)

The Insertable Streams API (W3C, supported in Chrome and LiveKit) allows applications to encrypt media frames before they reach the WebRTC stack, providing end-to-end encryption that the SFU cannot decrypt:

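// Insertable Streams for custom E2EE
// (Chrome's legacy API: requires new RTCPeerConnection({ encodedInsertableStreams: true }))
const sender = pc.getSenders()[0];
const { readable, writable } = sender.createEncodedStreams();
const encoder = new TransformStream({
    transform: (frame, controller) => {
        // Encrypt frame with application-layer key
        // (encrypt() and sessionKey are application-defined)
        frame.data = encrypt(frame.data, sessionKey);
        controller.enqueue(frame);
    }
});
// Pipe the transformed frames back to the sender, otherwise no media is sent
readable.pipeThrough(encoder).pipeTo(writable);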

Best Practices

  • Verify ICE candidates originate from authorized STUN/TURN servers to prevent SSRF
  • Use short-lived TURN credentials with time-based expiry
  • Keep DataChannel messages small: 16 KiB is the safest size across browsers, and 256KB is a common upper limit advertised for a single SCTP message
  • Use Insertable Streams for sensitive applications requiring true E2EE
  • Monitor ICE connection state and implement ICE restart on network transitions
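Short-lived TURN credentials are commonly issued with the shared-secret scheme that coturn enables through its `use-auth-secret` and `static-auth-secret` options: the username embeds an expiry timestamp, and the password is an HMAC-SHA1 of that username. The Python sketch below assumes that scheme; `turn_credentials`, the secret, and the user ID are placeholders.

```python
import base64
import hashlib
import hmac
import time


def turn_credentials(shared_secret, user_id, ttl_seconds=3600, now=None):
    """Build time-limited TURN credentials (shared-secret REST scheme)."""
    issued = int(now if now is not None else time.time())
    username = f"{issued + ttl_seconds}:{user_id}"   # "<expiry>:<user id>"
    digest = hmac.new(
        shared_secret.encode(), username.encode(), hashlib.sha1
    ).digest()
    return {
        "username": username,
        "credential": base64.b64encode(digest).decode(),
    }


# Placeholder secret and user; `now` is pinned so the output is reproducible.
creds = turn_credentials("s3cret", "alice", ttl_seconds=600, now=1_700_000_000)
print(creds["username"])
```

The signaling server would hand these values to the client for its `iceServers` entry, so the long-term secret never leaves the backend and a leaked credential expires on its own.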

Scalability Architectures

Mesh (P2P)

Every peer connects to every other peer. Simple, but limited to 2-4 participants: each peer must upload N-1 copies of its stream, and the total number of connections grows as O(n²).

SFU (Selective Forwarding Unit)

A central server forwards streams without decoding. Each peer sends one stream and receives N-1. Scales to 10-100+ participants. The 2026 default for any room above 4 peers.

MCU (Multipoint Control Unit)

Central server decodes, mixes, and re-encodes streams. Provides composite output but requires significant CPU. Mostly legacy — SFUs have replaced MCUs for most use cases.
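A quick back-of-the-envelope comparison shows why mesh stops scaling. This Python sketch counts media streams for a room of `n` participants, assuming every participant publishes exactly one stream and subscribes to everyone else; `mesh_streams` and `sfu_streams` are illustrative helpers.

```python
def mesh_streams(n):
    """Stream counts for a full mesh of n peers."""
    return {
        "uplinks_per_peer": n - 1,        # one copy per remote peer
        "downlinks_per_peer": n - 1,
        "total_streams": n * (n - 1),
    }


def sfu_streams(n):
    """Stream counts when n peers publish once to an SFU."""
    return {
        "uplinks_per_peer": 1,            # single upload to the SFU
        "downlinks_per_peer": n - 1,
        "total_streams": n + n * (n - 1), # n uploads plus forwarded copies
    }


for n in (2, 4, 10):
    print(n, mesh_streams(n)["uplinks_per_peer"], sfu_streams(n)["uplinks_per_peer"])
```

The download side is the same in both designs; the difference is the per-peer upload, which is usually the scarcer resource on consumer connections.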

SFU Comparison: mediasoup vs LiveKit vs Janus

| Feature | mediasoup | LiveKit | Janus |
| --- | --- | --- | --- |
| Language | C++ (Node.js/Rust API) | Go | C |
| Simulcast | Full | Full | Plugin-dependent |
| SVC | Full (VP9/AV1) | Full | Limited |
| AI agent support | Manual integration | Built-in agent framework | Manual |
| E2EE (Insertable Streams) | Manual | First-class | Manual |
| Recording | Via FFmpeg | Built-in egress | Plugin |
| Kubernetes | Manual | Official Helm charts | Manual |
| Learning curve | Steep (low-level API) | Moderate | Moderate |
| Best for | Custom pipelines, large rooms | AI agents, quick production | SIP interop, legacy |

flowchart TB
    subgraph SFU["SFU Architecture"]
        P1[Publisher] -->|Single upstream| SFU_CORE
        SFU_CORE -->|Forward| C1[Consumer 1]
        SFU_CORE -->|Forward| C2[Consumer 2]
        SFU_CORE -->|Forward| C3[Consumer N]
    end

    subgraph Mesh["Mesh (P2P) Architecture"]
        M1[Peer A] -->|N-1 streams| M2[Peer B]
        M1 -->|N-1 streams| M3[Peer C]
        M2 -->|N-1 streams| M1
        M2 -->|N-1 streams| M3
        M3 -->|N-1 streams| M1
        M3 -->|N-1 streams| M2
    end

Production SFU Deployment

# Deploy mediasoup with Docker
docker run -d --name mediasoup \
  -p 3000:3000 \
  -p 40000-40010:40000-40010/udp \
  -e MEDIASOUP_LISTEN_IP=0.0.0.0 \
  -e MEDIASOUP_ANNOUNCED_IP=<your-public-ip> \
  your-mediasoup-image

# LiveKit with Helm
helm repo add livekit https://helm.livekit.io
helm upgrade --install my-server livekit/livekit-server \
  --set keys.api=your-api-key \
  --set keys.secret=your-secret

WebRTC vs WebTransport

WebTransport (built on QUIC + HTTP/3) is emerging as a complementary technology for specific use cases. The combination of WebTransport + WebCodecs + WebAssembly is considered a “dark horse” stack that could disrupt traditional WebRTC in client-server media pipelines.

| Dimension | WebRTC | WebTransport |
| --- | --- | --- |
| Architecture | Peer-to-peer + SFU | Client-server |
| Media pipeline | Built-in (codec negotiation, echo cancellation) | Build with WebCodecs |
| Transport | SCTP + SRTP over DTLS | QUIC streams + datagrams |
| Browser support | 100% of browsers | Baseline since March 2026 |
| Unreliable data | Partial (SCTP partial reliability) | Native datagrams |
| Connection setup | ICE + STUN/TURN (3-5 RTTs) | HTTP/3 (0-1 RTT) |
| Worker support | Limited DataChannel in workers | Full WebTransport in workers |
| Server complexity | High (ICE, TURN, signaling) | Lower (direct HTTP/3 server) |
| Best for | Video/audio calls, P2P | Cloud gaming, live streaming, AI inference |

As of 2026, Media over QUIC (MoQ, IETF draft-17) combines elements of both — using WebTransport as transport with a pub-sub media model. Production MoQ deployments exist in controlled environments, but universal browser-based MoQ is a 2026-2027 story.

Production Considerations

TURN Capacity Planning

For AI voice agent deployments, allocate TURN bandwidth based on concurrent sessions rather than peak call minutes. Each audio-only stream consumes ~64-128 Kbps, so a server relaying 1,000 concurrent agent sessions needs roughly 64-128 Mbps per media direction (about 100 Mbps at the midpoint), before RTP/UDP/IP and TURN framing overhead.
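The arithmetic behind that estimate can be written out as a small helper. This Python sketch is only a sizing aid under stated assumptions: `relay_capacity_mbps` is a hypothetical function, `streams_per_session` counts how many audio streams the relay forwards per session, and `overhead` is a fractional allowance for packet headers that you would need to measure for your own deployment.

```python
def relay_capacity_mbps(sessions, kbps_per_stream,
                        streams_per_session=1, overhead=0.0):
    """Relay bandwidth in Mbps for a number of concurrent sessions."""
    payload_kbps = sessions * kbps_per_stream * streams_per_session
    return payload_kbps * (1 + overhead) / 1000


low = relay_capacity_mbps(1000, 64)
high = relay_capacity_mbps(1000, 128)
both_directions = relay_capacity_mbps(1000, 100, streams_per_session=2)
print(low, high, both_directions)
```

Counting both the user-to-agent and agent-to-user streams doubles the figure, which is worth keeping in mind when comparing against a provider's metered egress.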

Connection Quality Monitoring

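// Monitor WebRTC connection quality
setInterval(async () => {
    const stats = await pc.getStats();
    stats.forEach(report => {
        if (report.type === 'inbound-rtp') {
            console.log(`Packets lost: ${report.packetsLost}`);
            console.log(`Jitter: ${report.jitter}s`);
        }
        // Round-trip time comes from the remote peer's receiver reports,
        // so it is exposed on remote-inbound-rtp rather than inbound-rtp
        if (report.type === 'remote-inbound-rtp') {
            console.log(`Round trip: ${report.roundTripTime}s`);
        }
    });
}, 5000);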

Simulcast and SVC

Simulcast encodes the same video at multiple resolutions and bitrates. The SFU selects the appropriate layer per receiver based on bandwidth. SVC (Scalable Video Coding) encodes a base layer with enhancement layers for temporal and spatial scalability. VP9 and AV1 support SVC; VP8 and H.264 use simulcast.
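The per-receiver layer choice an SFU makes can be approximated as "pick the highest layer that fits the estimated bandwidth". The Python sketch below illustrates that idea only; the layer names, bitrates, and the 85% headroom factor are assumptions for the example, and real SFUs such as mediasoup or LiveKit use their own, more elaborate heuristics.

```python
# Hypothetical simulcast layers: (rid, target bitrate in kbps), ascending.
LAYERS = [("q", 150), ("h", 500), ("f", 1500)]


def select_layer(available_kbps, layers=LAYERS, headroom=0.85):
    """Return the rid of the highest layer that fits within the budget."""
    budget = available_kbps * headroom    # leave room for audio and bursts
    chosen = layers[0]                    # always fall back to the lowest
    for layer in layers:
        if layer[1] <= budget:
            chosen = layer
    return chosen[0]


for estimate in (2000, 700, 100):
    print(estimate, select_layer(estimate))
```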

Conclusion

WebRTC has matured from a browser-based video call protocol into the foundational transport for real-time communication. In 2026, three trends define its evolution: the rise of AI voice agents that use WebRTC as their transport layer, the standardization of WHIP/WHEP for broadcasting, and the emergence of WebTransport + MoQ as complementary technologies for specific use cases. Understanding the full protocol stack — ICE for connectivity, STUN/TURN for NAT traversal, SRTP for media encryption, and SDP for negotiation — remains essential for building production-grade real-time applications.
