Introduction
Shadow AI — the use of AI tools and services without explicit IT or security approval — has become one of the most significant enterprise risks in 2026. Employees using consumer ChatGPT, Claude, or Gemini for work tasks can expose sensitive data to external servers, violate compliance requirements (GDPR, HIPAA, PCI DSS), and create blind spots in security monitoring.
This guide provides concrete tools for detection and governance: Python scripts for identifying Shadow AI traffic in network logs, firewall rules for discovering and controlling AI service access, a data classification and DLP integration pattern, a YAML policy template for AI governance, and a risk assessment framework for evaluating AI tools.
Detecting Shadow AI with Network Traffic Analysis
Discovering AI Service Endpoints from Network Logs
The following script analyzes firewall or proxy logs to identify connections to known AI service providers:

    import sys
    from datetime import datetime, timedelta

    # Known AI service endpoints (updated Q2 2026)
    AI_ENDPOINTS = {
        "api.openai.com": "OpenAI API",
        "api.anthropic.com": "Anthropic Claude API",
        "generativelanguage.googleapis.com": "Google Gemini API",
        "api.deepseek.com": "DeepSeek API",
        "chat.openai.com": "ChatGPT Web (legacy domain)",
        "chatgpt.com": "ChatGPT Web",
        "claude.ai": "Claude Web",
        "chat.deepseek.com": "DeepSeek Chat Web",
        "copilot.microsoft.com": "Microsoft Copilot",
        "api.together.xyz": "Together AI",
        "router.huggingface.co": "Hugging Face Inference",
    }


    def match_endpoint(domain: str):
        """Return the known endpoint equal to `domain` or a parent of it, else None."""
        for endpoint in AI_ENDPOINTS:
            if domain == endpoint or domain.endswith("." + endpoint):
                return endpoint
        return None


    def analyze_dns_logs(log_path: str, hours: int = 24) -> dict:
        """Scan DNS/proxy logs for connections to AI service endpoints.

        Expected log format: 'timestamp client_ip domain status bytes'
        (whitespace-separated, ISO 8601 timestamp).
        Returns a dict of {domain: {service, count, unique_clients, total_bytes}}.
        """
        cutoff = datetime.now() - timedelta(hours=hours)
        findings = {}
        with open(log_path) as f:
            for line in f:
                parts = line.split()
                if len(parts) < 3:
                    continue
                try:
                    timestamp = datetime.fromisoformat(parts[0])
                except ValueError:
                    continue  # skip headers and malformed lines
                if timestamp.tzinfo is not None:
                    # Convert to naive local time so it compares with datetime.now()
                    timestamp = timestamp.astimezone().replace(tzinfo=None)
                if timestamp < cutoff:
                    continue
                # Match on the domain field only, not anywhere in the line
                domain = match_endpoint(parts[2].lower().rstrip("."))
                if domain is None:
                    continue
                client_ip = parts[1]
                # bytes is the fifth field; the fourth is the status
                has_bytes = len(parts) > 4 and parts[4].isdigit()
                bytes_transferred = int(parts[4]) if has_bytes else 0
                entry = findings.setdefault(domain, {
                    "service": AI_ENDPOINTS[domain],
                    "count": 0,
                    "unique_clients": set(),
                    "total_bytes": 0,
                })
                entry["count"] += 1
                entry["unique_clients"].add(client_ip)
                entry["total_bytes"] += bytes_transferred
        # Convert sets to counts for display
        for entry in findings.values():
            entry["unique_clients"] = len(entry["unique_clients"])
        return findings


    # Usage: python shadow_ai_detect.py /var/log/squid/access.log
    # Note: the log must already be in the format described in the docstring.
    # Squid's native access.log format uses a different field layout.
    if __name__ == "__main__":
        log_path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/squid/access.log"
        results = analyze_dns_logs(log_path)
        print(f"{'Service':25s} {'Requests':>10s} {'Clients':>10s} {'Data (MB)':>10s}")
        print("-" * 58)
        for domain, data in sorted(results.items(), key=lambda x: x[1]["count"], reverse=True):
            mb = data["total_bytes"] / 1_000_000
            print(f"{data['service']:25s} {data['count']:>10d} "
                  f"{data['unique_clients']:>10d} {mb:>10.1f}")

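The script's docstring expects lines shaped like 'timestamp client_ip domain status bytes', while the usage line points at a Squid access.log. Squid's native format begins with a Unix epoch timestamp and orders its fields differently, so a conversion step is needed first. The following is a minimal sketch of such a converter; the function name `squid_native_to_fields` is hypothetical, and it assumes the native layout `time elapsed client code/status bytes method URL ...` with HTTPS tunnels logged as `CONNECT host:port`.

```python
from datetime import datetime
from urllib.parse import urlsplit


def squid_native_to_fields(line: str):
    """Convert one native-format Squid access.log line to
    (timestamp_iso, client_ip, domain, status, bytes), or None if malformed.
    """
    parts = line.split()
    if len(parts) < 7:
        return None
    try:
        # Squid logs seconds since the epoch; convert to naive local time
        timestamp = datetime.fromtimestamp(float(parts[0]))
        size = int(parts[4])
    except ValueError:
        return None
    client_ip = parts[2]
    # The fourth field looks like TCP_TUNNEL/200; keep the HTTP status part
    status = parts[3].rpartition("/")[2]
    method, url = parts[5], parts[6]
    if method == "CONNECT":
        # HTTPS tunnels are logged as host:port
        domain = url.rsplit(":", 1)[0]
    else:
        domain = urlsplit(url).hostname or ""
    return timestamp.isoformat(), client_ip, domain.lower(), status, size


sample = ("1714000000.123    250 10.0.0.5 TCP_TUNNEL/200 5321 "
          "CONNECT api.openai.com:443 - HIER_DIRECT/104.18.7.192 -")
print(squid_native_to_fields(sample))
```

Joining the returned tuple with spaces yields a line in the format the detection script expects. Without TLS inspection, Squid only sees the CONNECT host, which is enough for domain-level discovery.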
Firewall Rules to Discover AI Traffic
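Use these iptables rules to log traffic to known AI endpoints without blocking it (discovery phase). The rules match the IP addresses each hostname resolves to when the script runs. Because these services sit behind CDNs whose addresses change and are shared with unrelated sites, treat the results as approximate and refresh the rules periodically:

    # Create a new chain for AI service logging
    iptables -N AI_SERVICES

    # Log new connections to every IPv4 address each endpoint currently resolves to.
    # dig +short can also print CNAME targets, so keep only lines that look like IPv4 addresses.
    # --log-uid is omitted: it only applies to locally generated packets, not forwarded ones.
    for endpoint in api.openai.com api.anthropic.com generativelanguage.googleapis.com \
                    api.deepseek.com chat.openai.com claude.ai copilot.microsoft.com; do
        for ip in $(dig +short A "$endpoint" | grep -E '^[0-9]+(\.[0-9]+){3}$'); do
            iptables -A AI_SERVICES -d "$ip" -m conntrack --ctstate NEW \
                -j LOG --log-prefix "SHADOW_AI: "
        done
    done

    # Apply to the FORWARD chain (traffic routed through this gateway)
    iptables -A FORWARD -j AI_SERVICES

    # View discovered connections, grouped by source and destination address
    grep SHADOW_AI /var/log/kern.log | grep -oE 'SRC=[0-9.]+ DST=[0-9.]+' \
        | sort | uniq -c | sort -rn | head -20
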
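For reporting beyond a one-line shell pipeline, the logged entries can be summarized in Python. The sketch below assumes the standard kernel LOG target layout, in which each line contains the `SHADOW_AI: ` prefix followed by `SRC=<address> DST=<address>` fields; the function name `count_flows` is hypothetical.

```python
import re
from collections import Counter

# Matches the prefix set by --log-prefix, then the SRC/DST pair the LOG target emits
LOG_LINE = re.compile(r"SHADOW_AI: .*?\bSRC=(?P<src>\S+) DST=(?P<dst>\S+)")


def count_flows(lines):
    """Count logged connections per (source, destination) address pair."""
    flows = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match:
            flows[(match["src"], match["dst"])] += 1
    return flows


sample_lines = [
    "May  1 10:00:00 gw kernel: SHADOW_AI: IN=eth1 OUT=eth0 SRC=10.0.0.5 DST=104.18.7.192 LEN=60 PROTO=TCP SPT=51514 DPT=443",
    "May  1 10:00:07 gw kernel: SHADOW_AI: IN=eth1 OUT=eth0 SRC=10.0.0.5 DST=104.18.7.192 LEN=60 PROTO=TCP SPT=51520 DPT=443",
    "May  1 10:01:00 gw kernel: SHADOW_AI: IN=eth1 OUT=eth0 SRC=10.0.0.9 DST=160.79.104.10 LEN=60 PROTO=TCP SPT=40000 DPT=443",
    "May  1 10:02:00 gw kernel: unrelated message",
]
for (src, dst), count in count_flows(sample_lines).most_common():
    print(f"{count:5d}  {src} -> {dst}")
```

In practice the input would be the lines of `/var/log/kern.log`; the sample addresses here are illustrative only.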
For sustained monitoring, use nftables with a named set:
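
    # /etc/nftables/ai-endpoints.conf
    table inet shadow_ai {
        set ai_endpoints {
            type ipv4_addr
            flags interval    # required for CIDR prefixes; no timeout, so entries persist
            elements = {
                104.18.0.0/16,    # Cloudflare range (OpenAI CDN), shared with many unrelated sites
                13.107.0.0/16,    # Microsoft range (Copilot), shared with other Microsoft services
            }
        }
        chain log_ai_traffic {
            type filter hook forward priority 0; policy accept;
            # "group 0" sends records to nfnetlink_log (read by a tool such as ulogd2), not the kernel log
            ip daddr @ai_endpoints ct state new log prefix "SHADOW_AI: " group 0
            # Log remaining HTTPS connections separately so unknown destinations can be reviewed
            ip daddr != @ai_endpoints tcp dport 443 ct state new log prefix "HTTPS_TO_UNKNOWN: " group 0
        }
    }
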
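Maintaining the `elements` list of the named set by hand is error-prone, since a mistyped prefix makes the whole ruleset fail to load. One possible approach, sketched below, is to validate and merge the CIDR ranges with Python's `ipaddress` module before rendering them. The function name `render_nft_elements` and the input list are hypothetical.

```python
import ipaddress


def render_nft_elements(cidrs):
    """Validate IPv4 CIDR strings, merge adjacent or overlapping ranges,
    and render them as an nftables `elements = { ... }` fragment.

    Raises ValueError if a prefix is malformed or has host bits set.
    """
    networks = [ipaddress.ip_network(cidr, strict=True) for cidr in cidrs]
    ipv4_only = [net for net in networks if net.version == 4]
    collapsed = ipaddress.collapse_addresses(ipv4_only)
    body = ",\n".join(f"    {net}" for net in collapsed)
    return "elements = {\n" + body + "\n}"


# The two /17 halves are merged into the single /16 used in the set above
print(render_nft_elements(["104.18.0.0/17", "104.18.128.0/17", "13.107.0.0/16"]))
```

The rendered fragment still has to be placed inside the set definition and loaded with `nft -f`; this sketch only covers validation and formatting.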
Data Loss Prevention Integration
Configure DLP rules to detect sensitive data being sent to AI services. Request bodies are encrypted in transit, so this check has to run where the plaintext is visible, for example at a TLS-inspecting proxy or an API gateway. This example uses a regex-based scanner for common sensitive patterns:

    import re
    from urllib.parse import urlsplit

    # Patterns use non-capturing groups so that each match is the full matched text
    SENSITIVE_PATTERNS = {
        "SSN": r"\b\d{3}-\d{2}-\d{4}\b",
        "Credit Card": r"\b\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}\b",
        "Email": r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b",
        "API Key": r"(?i)(?:api[_-]?key|secret|token)[\s:=]+['\"]?[a-zA-Z0-9_\-]{16,}['\"]?",
        # RFC 1918 ranges: 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16
        "Internal IP": r"\b(?:10\.\d{1,3}|192\.168|172\.(?:1[6-9]|2\d|3[01]))\.\d{1,3}\.\d{1,3}\b",
    }

    # Hosts whose outbound requests are scanned
    MONITORED_HOSTS = {"api.openai.com", "api.anthropic.com"}


    def scan_request_for_sensitive_data(request_body: str) -> list:
        """Scan HTTP request body for sensitive patterns before it reaches an AI API."""
        findings = []
        for pattern_name, pattern in SENSITIVE_PATTERNS.items():
            matches = [m.group(0) for m in re.finditer(pattern, request_body)]
            if matches:
                first = matches[0]
                findings.append({
                    "type": pattern_name,
                    "count": len(matches),
                    # Truncate long examples before they are logged
                    "example": first[:20] + "..." if len(first) > 20 else first,
                })
        return findings


    # Example: intercept at proxy level
    def check_outbound_request(url: str, body: str) -> bool:
        """Return False if the request should be blocked (contains sensitive data).

        `url` must include the scheme (https://...) so the hostname can be parsed.
        """
        host = urlsplit(url).hostname or ""
        if host in MONITORED_HOSTS:
            findings = scan_request_for_sensitive_data(body)
            if findings:
                print(f"BLOCKED: Sensitive data to {url}: {findings}")
                return False
        return True

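A 16-digit regex on its own flags order numbers, tracking IDs, and other digit runs as credit cards. A common refinement is to confirm candidates with the Luhn checksum, which valid card numbers satisfy. The sketch below reuses the "Credit Card" pattern from the scanner above; the helper names `luhn_valid` and `find_card_numbers` are hypothetical additions, not part of the original scanner.

```python
import re

# Same shape as the "Credit Card" pattern in the scanner above
CARD_CANDIDATE = re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 13:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def find_card_numbers(text: str) -> list:
    """Return regex candidates that also pass the Luhn checksum."""
    return [m.group(0) for m in CARD_CANDIDATE.finditer(text)
            if luhn_valid(m.group(0))]


body = "card 4111 1111 1111 1111, order 1234 5678 9012 3456"
print(find_card_numbers(body))
```

Here `4111 1111 1111 1111` is a widely used test card number that passes the checksum, while the order number does not, so only the former is reported.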
Policy Template (YAML)

    # ai-governance-policy.yaml: enterprise AI usage policy
    policy:
      version: "1.0"
      effective_date: "2026-05-01"
      owner: "CISO & AI Governance Committee"

      # Tier 1: Approved (only for the data classifications listed per provider)
      # Model IDs are examples; confirm them against each provider's current model list
      approved_tools:
        - provider: anthropic
          services:
            - claude-sonnet-4-20250514
            - claude-opus-4-20250514
          allowed_data_classifications: ["public", "internal", "confidential"]
          requires_mfa: true
          data_retention: "per DPA; API inputs are not used for training by default"
        - provider: openai
          services:
            - gpt-5.5
            - gpt-5.4
          allowed_data_classifications: ["public", "internal"]
          requires_mfa: true
          requires_data_classification_header: true

      # Tier 2: Conditional (requires business justification)
      conditional_tools:
        - provider: deepseek
          services:
            - deepseek-v4-pro
          allowed_data_classifications: ["public"]
          requires_business_justification: true
          additional_controls:
            - data_masking_required
            - audit_logging_required

      # Tier 3: Prohibited
      prohibited_tools:
        - provider: "*"  # Any unlisted provider
        - provider: consumer_services
          services:
            - "chatgpt-free-tier"  # No data protection guarantees
            - "claude-free-tier"

      # Data classification rules
      data_classification:
        public:
          description: "Information that can be shared publicly"
          allowed_ai_tiers: ["approved", "conditional"]
        internal:
          description: "Internal business data, not for public release"
          allowed_ai_tiers: ["approved"]
        confidential:
          description: "Customer PII, financial data, trade secrets"
          allowed_ai_tiers: ["approved"]
          requires_dlp_scan: true
        restricted:
          description: "Regulated data (HIPAA, PCI, GDPR Article 9)"
          allowed_ai_tiers: []  # No AI tool may process restricted data

      # Detection and response
      monitoring:
        network_detection: true
        endpoint_detection: true
        dlp_integration: true
        review_frequency: "weekly"
        escalation: "within 4 hours for confidential data exposure"

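A policy file is most useful when a gateway or proxy can evaluate it automatically. The sketch below shows one way a request could be checked against the tiers above. To stay self-contained it mirrors the relevant policy fields in a Python dict rather than parsing the YAML; the names `POLICY` and `evaluate_request` and the three return values are assumptions made for illustration.

```python
# Provider -> allowed data classifications, mirroring the policy template above
POLICY = {
    "approved": {
        "anthropic": {"public", "internal", "confidential"},
        "openai": {"public", "internal"},
    },
    "conditional": {
        "deepseek": {"public"},
    },
}


def evaluate_request(provider: str, classification: str) -> str:
    """Return 'allow', 'allow_with_justification', or 'deny'."""
    # Restricted data may not be processed by any AI tool
    if classification == "restricted":
        return "deny"
    if classification in POLICY["approved"].get(provider, set()):
        return "allow"
    if classification in POLICY["conditional"].get(provider, set()):
        return "allow_with_justification"
    # Unlisted providers and disallowed classifications fall through to deny
    return "deny"


for provider, classification in [
    ("anthropic", "confidential"),
    ("openai", "confidential"),
    ("deepseek", "public"),
    ("unlisted-vendor", "public"),
]:
    print(provider, classification, "->", evaluate_request(provider, classification))
```

Denying by default matches the wildcard entry under `prohibited_tools`: any provider that is not explicitly listed is treated as prohibited.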
Risk Assessment Framework
Use this scorecard to evaluate any new AI tool:
| Criterion | Weight | Score (1-5) | Notes |
|---|---|---|---|
| Data handling transparency | 20% | | Does the provider publish data processing details? |
| API data retention policy | 20% | | Do they train on submitted data? |
| Encryption (in transit + at rest) | 15% | | TLS 1.3 minimum, encryption at rest |
| SOC 2 / ISO 27001 certification | 15% | | Independent security audit |
| Data residency options | 10% | | Can data stay in your region? |
| MFA / SSO support | 10% | | Enterprise authentication |
| Contractual data protection | 10% | | DPA, BAA for healthcare |
Weighted score (sum of weight × score across all criteria): below 2.0 = Prohibited, 2.0 to below 3.5 = Conditional, 3.5 or above = Approved
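The scorecard reduces to a weighted average followed by a threshold lookup. The sketch below computes it with the weights from the table. The criterion keys and function names are hypothetical, and it assumes a score that lands exactly on a boundary (2.0 or 3.5) falls into the higher tier.

```python
# Weights from the scorecard; they sum to 1.0
WEIGHTS = {
    "data_handling_transparency": 0.20,
    "api_data_retention": 0.20,
    "encryption": 0.15,
    "certification": 0.15,
    "data_residency": 0.10,
    "mfa_sso": 0.10,
    "contractual_protection": 0.10,
}


def weighted_score(scores: dict) -> float:
    """Combine per-criterion scores (1-5) into a weighted total."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    for name in WEIGHTS:
        if not 1 <= scores[name] <= 5:
            raise ValueError(f"{name} must be between 1 and 5")
    total = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    # Round to avoid floating-point noise at the tier boundaries
    return round(total, 2)


def tier(score: float) -> str:
    """Map a weighted score to a policy tier (boundaries go to the higher tier)."""
    if score < 2.0:
        return "prohibited"
    if score < 3.5:
        return "conditional"
    return "approved"


example = {
    "data_handling_transparency": 4,
    "api_data_retention": 5,
    "encryption": 4,
    "certification": 3,
    "data_residency": 2,
    "mfa_sso": 4,
    "contractual_protection": 3,
}
score = weighted_score(example)
print(score, tier(score))
```

For the example tool the total is 0.8 + 1.0 + 0.6 + 0.45 + 0.2 + 0.4 + 0.3 = 3.75, which places it in the approved tier.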
Resources
- OWASP AI Security and Privacy Guide — Risk assessment frameworks
- NIST AI Risk Management Framework — Enterprise AI governance standards
- Anthropic Enterprise Data Protection — API data handling policy
- OpenAI Data Privacy (API) — No training on API data
- Cloudflare AI Gateway — AI traffic monitoring and DLP