Shadow AI Complete Guide 2026: Detection Scripts, Policy Templates, and Governance Framework

Created: March 2, 2026 Larry Qu 5 min read

Introduction

Shadow AI — the use of AI tools and services without explicit IT or security approval — has become one of the most significant enterprise risks in 2026. Employees using consumer ChatGPT, Claude, or Gemini for work tasks can expose sensitive data to external servers, violate compliance requirements (GDPR, HIPAA, PCI DSS), and create blind spots in security monitoring.

This guide provides concrete tools for detection and governance: Python scripts for identifying Shadow AI traffic in network logs, firewall rules for discovering and controlling AI service access, a data classification and DLP integration pattern, a YAML policy template for AI governance, and a risk assessment framework for evaluating AI tools.

Detecting Shadow AI with Network Traffic Analysis

Discovering AI Service Endpoints from Network Logs

The following script analyzes firewall or proxy logs to identify connections to known AI service providers:

from datetime import datetime, timedelta

# Known AI service API endpoints (updated Q2 2026)
AI_ENDPOINTS = {
    "api.openai.com": "OpenAI API",
    "api.anthropic.com": "Anthropic Claude API",
    "generativelanguage.googleapis.com": "Google Gemini API",
    "api.deepseek.com": "DeepSeek API",
    "chat.openai.com": "ChatGPT Web",
    "claude.ai": "Claude Web",
    "chat.deepseek.com": "DeepSeek Chat Web",
    "copilot.microsoft.com": "Microsoft Copilot",
    "api.together.xyz": "Together AI",
    "router.huggingface.co": "Hugging Face Inference",
}

def analyze_dns_logs(log_path: str, hours: int = 24) -> dict:
    """Scan DNS/proxy logs for connections to AI service endpoints.

    Expected log format: 'timestamp client_ip domain status bytes'
    Returns a dict of {domain: {count, unique_clients, total_bytes}}
    """
    cutoff = datetime.now() - timedelta(hours=hours)
    findings = {}

    with open(log_path) as f:
        for line in f:
            parts = line.strip().split()
            if len(parts) < 3:
                continue

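            try:
                # Timestamps are assumed to be ISO 8601 in naive local time
                timestamp = datetime.fromisoformat(parts[0])
            except ValueError:
                continue  # skip lines that do not start with an ISO timestamp
            if timestamp < cutoff:
                continue

            # Match the domain field exactly or as a subdomain instead of
            # searching the whole line for a substring
            host = parts[2].lower()
            for endpoint, name in AI_ENDPOINTS.items():
                if host == endpoint or host.endswith("." + endpoint):
                    domain = endpoint
                    client_ip = parts[1]
                    # Fields: timestamp client_ip domain status bytes, so bytes is parts[4]
                    bytes_transferred = int(parts[4]) if len(parts) > 4 and parts[4].isdigit() else 0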

                    if domain not in findings:
                        findings[domain] = {
                            "service": name,
                            "count": 0,
                            "unique_clients": set(),
                            "total_bytes": 0
                        }
                    findings[domain]["count"] += 1
                    findings[domain]["unique_clients"].add(client_ip)
                    findings[domain]["total_bytes"] += bytes_transferred

    # Convert sets to counts for display
    for domain in findings:
        findings[domain]["unique_clients"] = len(findings[domain]["unique_clients"])

    return findings

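# Usage: python shadow_ai_detect.py /var/log/squid/access.log
# Lines must follow the format in the docstring above; Squid's native
# access.log layout (epoch timestamps) has to be converted first.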
if __name__ == "__main__":
    import sys
    log_path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/squid/access.log"
    results = analyze_dns_logs(log_path)

    print(f"{'Service':25s} {'Requests':>10s} {'Clients':>10s} {'Data (MB)':>10s}")
    print("-" * 55)
    for domain, data in sorted(results.items(), key=lambda x: x[1]["count"], reverse=True):
        mb = data["total_bytes"] / 1_000_000
        print(f"{data['service']:25s} {data['count']:>10d} "
              f"{data['unique_clients']:>10d} {mb:>10.1f}")
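
The script above expects a simplified five-field log line, while its usage comment points at a Squid log. As a minimal sketch, assuming Squid's native access.log layout (time, elapsed, client, code/status, bytes, method, URL, and so on), the hypothetical helper `parse_squid_line` below extracts the fields the detector needs. The `ProxyRecord` type and function name are illustrative and not part of the original script. In this layout the bytes field is the size delivered to the client, so it approximates download volume rather than upload volume.

```python
from datetime import datetime
from typing import NamedTuple, Optional
from urllib.parse import urlsplit


class ProxyRecord(NamedTuple):
    """Fields the detector needs (hypothetical helper type)."""
    timestamp: datetime
    client_ip: str
    host: str
    bytes_sent: int


def parse_squid_line(line: str) -> Optional[ProxyRecord]:
    """Parse one line in Squid's native access.log layout.

    Assumed fields: time elapsed client code/status bytes method URL ...
    Returns None for lines that do not fit that layout.
    """
    parts = line.split()
    if len(parts) < 7:
        return None
    try:
        timestamp = datetime.fromtimestamp(float(parts[0]))  # epoch seconds
        size = int(parts[4])
    except ValueError:
        return None
    method, url = parts[5], parts[6]
    if method == "CONNECT":
        # HTTPS tunnels are logged as host:port, not as a full URL
        host = url.rsplit(":", 1)[0]
    else:
        host = urlsplit(url).hostname or ""
    return ProxyRecord(timestamp, parts[2], host.lower(), size)


if __name__ == "__main__":
    sample = ("1772445600.123    250 10.0.0.5 TCP_TUNNEL/200 4821 "
              "CONNECT api.openai.com:443 - HIER_DIRECT/104.18.7.192 -")
    print(parse_squid_line(sample))
```

Each returned record can be rewritten as a `timestamp client_ip domain status bytes` line, or the lookup loop can be adapted to consume `ProxyRecord` values directly.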

Firewall Rules to Discover AI Traffic

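Use these iptables rules to log traffic to known AI endpoints without blocking it (discovery phase). The rules match the IP addresses resolved when the script runs, so rerun it periodically: CDN-hosted endpoints change addresses often.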

# Create a new chain for AI service logging
iptables -N AI_SERVICES

# Log connections to known AI endpoints
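for endpoint in api.openai.com api.anthropic.com generativelanguage.googleapis.com \
                api.deepseek.com chat.openai.com claude.ai copilot.microsoft.com; do
    # dig may print CNAMEs before addresses, so keep only IPv4 A records
    # and add a rule for every address, not just the first line of output
    for ip in $(dig +short A "$endpoint" | grep -E '^[0-9]+(\.[0-9]+){3}$'); do
        # --log-uid is omitted: it only applies to locally generated packets
        iptables -A AI_SERVICES -d "$ip" -j LOG --log-prefix "SHADOW_AI: "
    done
done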

# Apply to forward chain
iptables -A FORWARD -j AI_SERVICES

# View discovered connections
grep SHADOW_AI /var/log/kern.log | grep -o 'DST=[0-9.]*' | sort | uniq -c | sort -rn | head -20
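
The one-line pipeline above counts destinations only. As a rough sketch, assuming the kernel writes the usual `SRC=` and `DST=` key/value fields in each LOG line, the hypothetical function `summarize_shadow_ai_log` below counts (client, destination) pairs so that the internal hosts generating the traffic are visible as well. The function name and sample line are illustrative.

```python
import re
from collections import Counter
from typing import Iterable

# Kernel LOG lines carry key=value fields such as SRC=10.0.0.5 DST=104.18.7.192
LOG_FIELD = re.compile(r"\b(SRC|DST)=(\S+)")


def summarize_shadow_ai_log(lines: Iterable[str],
                            prefix: str = "SHADOW_AI: ") -> Counter:
    """Count (source, destination) pairs in log lines that carry the prefix."""
    pairs: Counter = Counter()
    for line in lines:
        if prefix not in line:
            continue
        fields = dict(LOG_FIELD.findall(line))
        if "SRC" in fields and "DST" in fields:
            pairs[(fields["SRC"], fields["DST"])] += 1
    return pairs


if __name__ == "__main__":
    sample = [
        "Mar  2 10:00:01 gw kernel: SHADOW_AI: IN=eth1 OUT=eth0 "
        "SRC=10.0.0.5 DST=104.18.7.192 LEN=60 PROTO=TCP SPT=51234 DPT=443",
    ]
    for (src, dst), count in summarize_shadow_ai_log(sample).most_common(20):
        print(f"{count:6d}  {src} -> {dst}")
```

Passing an open file handle for `/var/log/kern.log` as `lines` gives the same top-20 view, grouped by client.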

For sustained monitoring, use nftables with a named set:

# /etc/nftables/ai-endpoints.conf
table inet shadow_ai {
    set ai_endpoints {
        type ipv4_addr
        flags interval  # required for CIDR prefixes
        elements = {
            104.18.0.0/16,  # Cloudflare range (shared; OpenAI is one of many sites behind it)
            13.107.0.0/16   # Microsoft range (shared; Copilot is one of several services)
        }
    }

    chain log_ai_traffic {
        type filter hook forward priority 0; policy accept;
        ip daddr @ai_endpoints log prefix "SHADOW_AI: " group 0
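        # Log only new HTTPS connections that are not already covered by the AI set
        ip daddr != @ai_endpoints tcp dport 443 ct state new log prefix "HTTPS_TO_UNKNOWN: " group 0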
    }
}
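
Static CIDR entries drift as providers move between CDN addresses. As a sketch under stated assumptions, the hypothetical helper `render_nft_add_element` below turns a list of resolved addresses into an `add element` statement for the `shadow_ai` table and `ai_endpoints` set defined above. How the addresses are obtained (for example from `dig` output or `socket.getaddrinfo`) is left out, and the helper name is illustrative.

```python
import ipaddress
from typing import Iterable


def render_nft_add_element(addresses: Iterable[str],
                           table: str = "shadow_ai",
                           set_name: str = "ai_endpoints") -> str:
    """Build an nftables 'add element' statement from IPv4 address strings.

    Invalid or non-IPv4 entries are skipped; duplicates are removed.
    Returns an empty string when nothing valid remains.
    """
    valid = set()
    for addr in addresses:
        try:
            valid.add(ipaddress.IPv4Address(addr.strip()))
        except ValueError:
            continue  # not an IPv4 address (CNAME line, IPv6, garbage)
    if not valid:
        return ""
    joined = ", ".join(str(ip) for ip in sorted(valid))
    return f"add element inet {table} {set_name} {{ {joined} }}"


if __name__ == "__main__":
    print(render_nft_add_element(["104.18.7.192", "10.0.0.1", "example.net."]))
```

The resulting statement can be written to a file and loaded with `nft -f`, which keeps the refresh step separate from the ruleset itself.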

Data Loss Prevention Integration

Configure DLP rules to detect sensitive data being sent to AI services. This example uses a regex-based scanner for common sensitive patterns:

import re

SENSITIVE_PATTERNS = {
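    "SSN": r"\b\d{3}-\d{2}-\d{4}\b",
    "Credit Card": r"\b\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}\b",
    "Email": r"\b[\w.+-]+@[\w.-]+\.\w+\b",
    # Non-capturing groups so re.findall returns the whole match, not just the group
    "API Key": r"(?i)(?:api[_-]?key|secret|token)[\s:=]+['\"]?[a-zA-Z0-9_\-]{16,}['\"]?",
    # RFC 1918 ranges: 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16
    "Internal IP": r"\b(?:10\.\d{1,3}|192\.168|172\.(?:1[6-9]|2\d|3[01]))\.\d{1,3}\.\d{1,3}\b",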
}

def scan_request_for_sensitive_data(request_body: str) -> list:
    """Scan HTTP request body for sensitive patterns before it reaches an AI API."""
    findings = []
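    for pattern_name, pattern in SENSITIVE_PATTERNS.items():
        # finditer + group(0) yields the full matched text even when a
        # pattern contains capturing groups (re.findall would return the group)
        matches = [m.group(0) for m in re.finditer(pattern, request_body)]
        if matches:
            findings.append({
                "type": pattern_name,
                "count": len(matches),
                # Keep only a short prefix so the finding does not re-leak the value
                "example": matches[0][:4] + "..."
            })
    return findings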

# Example: intercept at proxy level
def check_outbound_request(url: str, body: str) -> bool:
    """Returns False if request should be blocked (contains sensitive data)."""
    for endpoint in ["api.openai.com", "api.anthropic.com"]:
        if endpoint in url:
            findings = scan_request_for_sensitive_data(body)
            if findings:
                print(f"BLOCKED: Sensitive data to {url}: {findings}")
                return False
    return True
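
A 16-digit regex also matches order numbers, tracking IDs, and similar values. One common way to cut those false positives is a Luhn checksum on each candidate. The sketch below is an assumed refinement, not part of the scanner above; `passes_luhn` and `find_card_numbers` are illustrative names, and the candidate pattern simply mirrors the "Credit Card" regex.

```python
import re
from typing import List

# Same shape as the "Credit Card" pattern used by the scanner above
CARD_CANDIDATE = re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}\b")


def passes_luhn(number: str) -> bool:
    """Return True if the digits in `number` satisfy the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 13:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_card_numbers(text: str) -> List[str]:
    """Return regex candidates that also pass the Luhn check."""
    return [m.group(0) for m in CARD_CANDIDATE.finditer(text)
            if passes_luhn(m.group(0))]


if __name__ == "__main__":
    body = "order 1234567812345678, card 4111-1111-1111-1111"
    print(find_card_numbers(body))
```

The checksum only filters random digit strings; it cannot tell a live card number from a published test number, so matches still warrant review.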

Policy Template (YAML)

# ai-governance-policy.yaml — Enterprise AI usage policy
policy:
  version: "1.0"
  effective_date: "2026-05-01"
  owner: "CISO & AI Governance Committee"

  # Tier 1: Approved (usable only with the data classifications listed for each tool)
  approved_tools:
    - provider: anthropic
      services:
        - claude-sonnet-4-20250514
        - claude-opus-4-20250514
      allowed_data_classifications: ["public", "internal", "confidential"]
      requires_mfa: true
      data_retention: "per contract (API prompts are not used for training; confirm retention terms separately)"

    - provider: openai
      services:
        - gpt-5.5
        - gpt-5.4
      allowed_data_classifications: ["public", "internal"]
      requires_mfa: true
      requires_data_classification_header: true

  # Tier 2: Conditional — requires business justification
  conditional_tools:
    - provider: deepseek
      services:
        - deepseek-v4-pro
      allowed_data_classifications: ["public"]
      requires_business_justification: true
      additional_controls:
        - data_masking_required
        - audit_logging_required

  # Tier 3: Prohibited
  prohibited_tools:
    - provider: "*"  # Any unlisted provider
    - provider: consumer_services
      services:
        - "chatgpt-free-tier"  # No data protection guarantees
        - "claude-free-tier"

  # Data classification rules
  data_classification:
    public:
      description: "Information that can be shared publicly"
      allowed_ai_tiers: ["approved", "conditional"]
    internal:
      description: "Internal business data, not for public"
      allowed_ai_tiers: ["approved"]
    confidential:
      description: "Customer PII, financial data, trade secrets"
      allowed_ai_tiers: ["approved"]
      requires_dlp_scan: true
    restricted:
      description: "Regulated data (HIPAA, PCI, GDPR Article 9)"
      allowed_ai_tiers: []  # No AI tool may process restricted data

  # Detection and response
  monitoring:
    network_detection: true
    endpoint_detection: true
    dlp_integration: true
    review_frequency: "weekly"
    escalation: "within 4 hours for confidential data exposure"
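
To show how the template's two layers interact, here is a minimal sketch of an allow/deny check. It assumes the policy has already been loaded into plain dictionaries that mirror the `data_classification` and tool sections above; the names `TIER_RULES`, `TOOLS`, and `is_use_allowed` are hypothetical. A request passes only when the tool's tier is permitted for the classification and the tool itself lists that classification. The business-justification step for conditional tools is not modeled.

```python
# Mirrors data_classification.*.allowed_ai_tiers in the template
TIER_RULES = {
    "public": {"approved", "conditional"},
    "internal": {"approved"},
    "confidential": {"approved"},
    "restricted": set(),  # no AI tool may process restricted data
}

# Mirrors approved_tools / conditional_tools (provider level only)
TOOLS = {
    "anthropic": {"tier": "approved",
                  "allowed": {"public", "internal", "confidential"}},
    "openai": {"tier": "approved", "allowed": {"public", "internal"}},
    "deepseek": {"tier": "conditional", "allowed": {"public"}},
}


def is_use_allowed(provider: str, classification: str) -> bool:
    """Return True if the policy permits sending this data class to the provider."""
    tool = TOOLS.get(provider)
    if tool is None:
        return False  # unlisted providers fall under the prohibited tier
    allowed_tiers = TIER_RULES.get(classification, set())
    return tool["tier"] in allowed_tiers and classification in tool["allowed"]


if __name__ == "__main__":
    for provider, data_class in [("anthropic", "confidential"),
                                 ("openai", "confidential"),
                                 ("deepseek", "internal")]:
        print(provider, data_class, is_use_allowed(provider, data_class))
```

Checking both layers matters because a tool can sit in the approved tier and still exclude a classification, as the `openai` entry does for confidential data.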

Risk Assessment Framework

Use this scorecard to evaluate any new AI tool:

| Criterion | Weight | Score (1-5) | Notes |
|---|---|---|---|
| Data handling transparency | 20% | | Does the provider publish data processing details? |
| API data retention policy | 20% | | Do they train on submitted data? |
| Encryption (in transit + at rest) | 15% | | TLS 1.3 minimum, encryption at rest |
| SOC 2 / ISO 27001 certification | 15% | | Independent security audit |
| Data residency options | 10% | | Can data stay in your region? |
| MFA / SSO support | 10% | | Enterprise authentication |
| Contractual data protection | 10% | | DPA, BAA for healthcare |

Weighted score (sum of weight × score across all criteria): below 2.0 = Prohibited, 2.0 to below 3.5 = Conditional, 3.5 to 5.0 = Approved
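
The scorecard arithmetic can be sketched as a weighted average. The block below assumes each criterion is scored from 1 to 5, and that the band boundaries fall at 2.0 and 3.5 with the boundary value belonging to the higher band; that boundary choice is an assumption. The dictionary keys are illustrative shorthand for the criteria in the table.

```python
# Weights in percent, taken from the scorecard; they sum to 100
WEIGHTS = {
    "data_handling_transparency": 20,
    "api_data_retention_policy": 20,
    "encryption": 15,
    "soc2_iso27001": 15,
    "data_residency": 10,
    "mfa_sso": 10,
    "contractual_protection": 10,
}


def weighted_score(scores: dict) -> float:
    """Return the weighted average of per-criterion scores (each 1 to 5)."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the scorecard criteria")
    for name, value in scores.items():
        if not 1 <= value <= 5:
            raise ValueError(f"{name} must be between 1 and 5")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


def classify(score: float) -> str:
    """Map a weighted score to a policy tier (boundary handling is assumed)."""
    if score < 2.0:
        return "Prohibited"
    if score < 3.5:
        return "Conditional"
    return "Approved"


if __name__ == "__main__":
    example = {
        "data_handling_transparency": 4,
        "api_data_retention_policy": 3,
        "encryption": 5,
        "soc2_iso27001": 4,
        "data_residency": 2,
        "mfa_sso": 5,
        "contractual_protection": 3,
    }
    total = weighted_score(example)
    print(f"{total:.2f} -> {classify(total)}")
```

For the example scores, the weighted sum is 80 + 60 + 75 + 60 + 20 + 50 + 30 = 375, which gives 3.75 and lands in the Approved band.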
