Input Validation & Sanitization in Python: Securing Your Applications
Every application accepts input from users. That input might be a username, a search query, a file upload, or data from an API. Each piece of input is a potential attack vector. Attackers exploit improperly handled input to inject malicious code, steal data, or compromise systems.
Input validation and sanitization are your first line of defense. They’re not optional security features; they’re fundamental requirements for any application that accepts user input.
This guide explores input validation and sanitization comprehensively, showing you how to implement them effectively in Python.
Understanding the Risks
What Happens Without Validation?
Consider a simple login form:
# DANGEROUS: No validation
username = request.form.get('username')
password = request.form.get('password')
query = f"SELECT * FROM users WHERE username='{username}' AND password='{password}'"
result = db.execute(query)
An attacker could submit:
- Username: admin' --
- Password: anything
The query becomes:
SELECT * FROM users WHERE username='admin' --' AND password='anything'
The -- comments out the password check, allowing login without a password. This is SQL injection.
Common Vulnerabilities
SQL Injection: Injecting SQL code through input fields
# Vulnerable
query = f"SELECT * FROM users WHERE id={user_id}"
# Attack: user_id = "1 OR 1=1"
# Result: SELECT * FROM users WHERE id=1 OR 1=1
# Returns all users!
Cross-Site Scripting (XSS): Injecting JavaScript that runs in users’ browsers
# Vulnerable
comment = request.form.get('comment')
html = f"<p>{comment}</p>"
# Attack: comment = "<script>alert('hacked')</script>"
# Result: <p><script>alert('hacked')</script></p>
# Script runs in browser!
Command Injection: Injecting shell commands
# Vulnerable
filename = request.form.get('filename')
os.system(f"convert {filename} output.jpg")
# Attack: filename = "image.jpg; rm -rf /"
# Result: convert image.jpg; rm -rf /
# Deletes files!
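One common way to avoid this class of bug is to skip the shell entirely and pass the arguments as a list, so the filename reaches the program as a single argument. The sketch below is illustrative only: it uses the current Python interpreter as a stand-in for convert so that it runs anywhere, and echo_argument is a hypothetical helper name.

```python
import subprocess
import sys

def echo_argument(filename):
    """Pass filename to a child process as one argument, without a shell."""
    # Stand-in for an external tool such as convert: the child process
    # simply prints the first argument it received.
    child_code = "import sys; print(sys.argv[1])"
    result = subprocess.run(
        [sys.executable, "-c", child_code, filename],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()

received = echo_argument("image.jpg; rm -rf /")
print(received)  # image.jpg; rm -rf /
```

The child sees the whole string, semicolon included, as a filename, so nothing after the semicolon is executed. With a real tool the call would look like subprocess.run(["convert", filename, "output.jpg"], check=True). You would still want to validate the filename, since a value starting with a dash could be read as an option.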
Path Traversal: Accessing files outside intended directory
# Vulnerable
filename = request.form.get('filename')
with open(f"uploads/{filename}") as f:
    return f.read()
# Attack: filename = "../../etc/passwd"
# Result: opens uploads/../../etc/passwd, a file outside the uploads directory
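A typical defense is to resolve the requested path and confirm it still sits inside the upload directory before opening it. The following is a minimal sketch, assuming Python 3.9 or newer for Path.is_relative_to; resolve_upload_path is a hypothetical helper, and a temporary directory stands in for the real uploads folder.

```python
import tempfile
from pathlib import Path

def resolve_upload_path(base_dir, filename):
    """Return the resolved path for filename, or raise if it escapes base_dir."""
    base = Path(base_dir).resolve()
    candidate = (base / filename).resolve()
    # is_relative_to() requires Python 3.9 or newer
    if not candidate.is_relative_to(base):
        raise ValueError("Path escapes the upload directory")
    return candidate

with tempfile.TemporaryDirectory() as uploads:
    inside = resolve_upload_path(uploads, "photo.jpg")
    print(inside.name)  # photo.jpg
    try:
        resolve_upload_path(uploads, "../../etc/passwd")
    except ValueError as exc:
        print(f"Rejected: {exc}")
```

Resolving first matters: comparing raw strings would miss sequences like "a/../../b" that only reveal their target once the ".." segments are collapsed.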
Validation vs Sanitization
These terms are often confused, but they’re different:
Validation
Validation checks if input meets expected criteria. It answers: “Is this input acceptable?”
# Validation: Check if email looks valid
import re

def is_valid_email(email):
    """Check if email format is valid"""
    pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    return re.match(pattern, email) is not None

# Valid
is_valid_email("user@example.com")  # True
# Invalid
is_valid_email("not-an-email")  # False
Sanitization
Sanitization cleans or escapes input to make it safe. It answers: “How do I make this input safe?”
# Sanitization: Escape HTML special characters
import html

def sanitize_html(text):
    """Escape HTML special characters"""
    return html.escape(text)

# Input: <script>alert('xss')</script>
# Output: &lt;script&gt;alert(&#x27;xss&#x27;)&lt;/script&gt;
# Safe to display in HTML
When to Use Each
- Validation: Reject invalid input (e.g., reject non-numeric input for age)
- Sanitization: Accept input but make it safe (e.g., escape HTML in user comments)
Often you use both:
import html

def process_user_comment(comment):
    """Validate and sanitize user comment"""
    # Validation: Check length
    if not comment or len(comment) > 1000:
        raise ValueError("Comment must be 1-1000 characters")
    # Sanitization: Escape HTML
    safe_comment = html.escape(comment)
    return safe_comment
Practical Validation Techniques
Technique 1: Type Checking
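def validate_age(age):
    """Validate that age is an integer between 0 and 150"""
    try:
        age_int = int(age)
    except (ValueError, TypeError):
        raise ValueError("Age must be a valid integer")
    # The range check sits outside the try block so that its message
    # is not overwritten by the except clause
    if age_int < 0 or age_int > 150:
        raise ValueError("Age must be between 0 and 150")
    return age_int

# Valid
validate_age("25")  # 25
# Invalid
validate_age("abc")  # ValueError: Age must be a valid integer
validate_age("-5")  # ValueError: Age must be between 0 and 150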
Technique 2: Regular Expressions
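import re

def validate_phone(phone):
    """Validate phone number format and return it without separators"""
    # Strip common separators so "+1-555-123-4567" is checked as "+15551234567"
    digits = re.sub(r'[\s\-().]', '', phone)
    pattern = r'^\+?1?\d{9,15}$'
    if re.match(pattern, digits):
        return digits
    raise ValueError("Invalid phone number format")

# Valid
validate_phone("+1-555-123-4567")  # Returns "+15551234567"
validate_phone("5551234567")  # Returns "5551234567"
# Invalid
validate_phone("123")  # ValueError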
Technique 3: Using the validators Library
import validators

# Email validation
if validators.email("user@example.com"):
    print("Valid email")
# URL validation
if validators.url("https://example.com"):
    print("Valid URL")
# IP address validation
if validators.ipv4("192.168.1.1"):
    print("Valid IPv4")
# Domain validation
if validators.domain("example.com"):
    print("Valid domain")
Technique 4: Allowlisting vs Denylisting
Allowlisting (preferred): Only accept known-good values
# Good: Allowlist
ALLOWED_ROLES = {'admin', 'user', 'moderator'}

def validate_role(role):
    """Validate role is in allowed list"""
    if role not in ALLOWED_ROLES:
        raise ValueError(f"Invalid role. Must be one of {ALLOWED_ROLES}")
    return role

validate_role('admin')  # OK
validate_role('hacker')  # ValueError
Denylisting (not recommended): Reject known-bad values
# Bad: Denylist (incomplete)
FORBIDDEN_WORDS = {'admin', 'root', 'system'}

def validate_username(username):
    """Validate username doesn't contain forbidden words"""
    if any(word in username.lower() for word in FORBIDDEN_WORDS):
        raise ValueError("Username contains forbidden words")
    return username

# Lowercasing catches 'ADMIN', but an attacker could still use
# 'adm1n', 'a-d-m-i-n', or 'r00t'
# Denylist is incomplete!
Technique 5: String Validation
import re

def validate_username(username):
    """Validate username format"""
    # Check length
    if not username or len(username) < 3 or len(username) > 20:
        raise ValueError("Username must be 3-20 characters")
    # Check characters (ASCII letters, digits, and underscore only);
    # fullmatch also rejects a trailing newline, which '$' would let through
    if not re.fullmatch(r'[a-zA-Z0-9_]+', username):
        raise ValueError("Username can only contain letters, numbers, and underscores")
    # Check doesn't start with number
    if username[0].isdigit():
        raise ValueError("Username cannot start with a number")
    return username

validate_username("john_doe")  # OK
validate_username("123invalid")  # ValueError
validate_username("user@name")  # ValueError
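One subtlety with anchored patterns is worth seeing in isolation: in Python, `$` also matches just before a trailing newline, so a pattern checked with re.match can accept input that ends in "\n". The short comparison below is a sketch; the two helper names are hypothetical.

```python
import re

def accepts_with_match(value):
    """Anchor with ^ and $ and check with re.match."""
    return re.match(r'^[a-zA-Z0-9_]+$', value) is not None

def accepts_with_fullmatch(value):
    """Require the whole string to match with re.fullmatch."""
    return re.fullmatch(r'[a-zA-Z0-9_]+', value) is not None

tricky = "john_doe\n"
print(accepts_with_match(tricky))      # True: '$' matches before the final newline
print(accepts_with_fullmatch(tricky))  # False: the newline is not in the allowed set
```

If a stray newline would cause trouble downstream (in log files, HTTP headers, or CSV exports, for example), re.fullmatch or the `\Z` anchor is the safer choice.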
Practical Sanitization Techniques
Technique 1: HTML Escaping
import html

def sanitize_html_text(text):
    """Escape HTML special characters"""
    return html.escape(text)

# Input: <script>alert('xss')</script>
# Output: &lt;script&gt;alert(&#x27;xss&#x27;)&lt;/script&gt;
# Safe to display in HTML
user_input = "<script>alert('xss')</script>"
html_output = f"<p>{sanitize_html_text(user_input)}</p>"
Technique 2: SQL Parameterization
import sqlite3
from contextlib import closing

# DANGEROUS: String concatenation
# query = f"SELECT * FROM users WHERE username='{username}'"

# SAFE: Parameterized query
def get_user(username):
    """Get user by username safely"""
    # closing() releases the connection when the block exits
    with closing(sqlite3.connect('database.db')) as conn:
        cursor = conn.cursor()
        # Use ? placeholders
        cursor.execute("SELECT * FROM users WHERE username=?", (username,))
        return cursor.fetchone()

# Parameterization prevents SQL injection
get_user("admin' --")  # Treated as literal string, not SQL
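To see the difference in practice, you can run both styles against a throwaway in-memory database. This is an illustrative sketch: the table layout and the 'secret' password are made up for the demo, and real systems should never store plaintext passwords.

```python
import sqlite3
from contextlib import closing

with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute("CREATE TABLE users (username TEXT, password TEXT)")
    conn.execute("INSERT INTO users VALUES ('admin', 'secret')")

    payload = "admin' --"

    # String formatting: the payload rewrites the query and matches admin
    unsafe_sql = (
        f"SELECT username FROM users "
        f"WHERE username='{payload}' AND password='wrong'"
    )
    unsafe_rows = conn.execute(unsafe_sql).fetchall()

    # Placeholders: the payload is compared as a literal username
    safe_rows = conn.execute(
        "SELECT username FROM users WHERE username=? AND password=?",
        (payload, "wrong"),
    ).fetchall()

print(unsafe_rows)  # [('admin',)]
print(safe_rows)    # []
```

The formatted query logs the attacker in with a wrong password, while the parameterized query finds no user literally named "admin' --".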
Technique 3: URL Encoding
from urllib.parse import quote

def sanitize_url_parameter(param):
    """Encode parameter for safe URL inclusion"""
    return quote(param, safe='')

# Input: "hello world & special chars"
# Output: "hello%20world%20%26%20special%20chars"
# Safe to include in URL
user_input = "hello world & special chars"
url = f"https://example.com/search?q={sanitize_url_parameter(user_input)}"
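When a URL carries several parameters, the standard library's urlencode can build the whole query string, which avoids quoting each value by hand. A small sketch follows; build_search_url and the page parameter are hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

def build_search_url(base_url, params):
    """Append params to base_url as an encoded query string."""
    return f"{base_url}?{urlencode(params)}"

url = build_search_url(
    "https://example.com/search",
    {"q": "hello world & special chars", "page": 2},
)
print(url)
# https://example.com/search?q=hello+world+%26+special+chars&page=2

# Decoding the query string recovers the original values
print(parse_qs(urlsplit(url).query))
```

Note that urlencode encodes spaces as "+" (form encoding) rather than "%20"; both are decoded the same way in a query string.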
Technique 4: Using bleach for HTML
import bleach

def sanitize_user_html(html_content):
    """Allow only safe HTML tags"""
    allowed_tags = {'p', 'br', 'strong', 'em', 'u', 'a'}
    allowed_attributes = {'a': ['href', 'title']}
    return bleach.clean(
        html_content,
        tags=allowed_tags,
        attributes=allowed_attributes,
        strip=True
    )

# Input: <p>Hello <script>alert('xss')</script></p>
# Output: <p>Hello alert('xss')</p>
# Script tags removed; with strip=True their contents remain as inert text
# Input: <p>Check <a href="https://example.com">this</a></p>
# Output: <p>Check <a href="https://example.com">this</a></p>
# Safe link preserved
Technique 5: File Upload Validation
import os
from pathlib import Path

def validate_file_upload(file, allowed_extensions, max_size_mb=5):
    """Validate uploaded file"""
    # Check file exists
    if not file or file.filename == '':
        raise ValueError("No file selected")
    # Check file extension
    filename = file.filename
    ext = os.path.splitext(filename)[1].lower()
    if ext not in allowed_extensions:
        raise ValueError(f"File type not allowed. Allowed: {allowed_extensions}")
    # Check file size
    file.seek(0, os.SEEK_END)
    file_size = file.tell()
    file.seek(0)
    if file_size > max_size_mb * 1024 * 1024:
        raise ValueError(f"File too large. Max size: {max_size_mb}MB")
    # Sanitize filename (prevent path traversal)
    safe_filename = Path(filename).name
    if safe_filename != filename:
        raise ValueError("Invalid filename")
    return safe_filename

# Usage
try:
    safe_name = validate_file_upload(
        file,
        allowed_extensions={'.jpg', '.png', '.gif'},
        max_size_mb=5
    )
except ValueError as e:
    print(f"Upload error: {e}")
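An extension check only looks at the name the client chose, so it can be useful to also compare the first bytes of the file against the signature expected for that extension. The sketch below covers the three extensions from the usage example; IMAGE_SIGNATURES and content_matches_extension are hypothetical names, and io.BytesIO objects stand in for uploaded files.

```python
import io

# Leading bytes ("magic numbers") of the allowed image formats
IMAGE_SIGNATURES = {
    '.png': (b'\x89PNG\r\n\x1a\n',),
    '.jpg': (b'\xff\xd8\xff',),
    '.gif': (b'GIF87a', b'GIF89a'),
}

def content_matches_extension(stream, ext):
    """Return True if the stream starts with a signature expected for ext."""
    signatures = IMAGE_SIGNATURES.get(ext.lower())
    if signatures is None:
        return False
    header = stream.read(8)
    stream.seek(0)  # rewind so the caller can still save the file
    return header.startswith(signatures)

png_upload = io.BytesIO(b'\x89PNG\r\n\x1a\n' + b'\x00' * 16)
fake_upload = io.BytesIO(b'<?php echo "not an image"; ?>')
print(content_matches_extension(png_upload, '.png'))   # True
print(content_matches_extension(fake_upload, '.png'))  # False
```

A matching signature does not prove the file is harmless, but it cheaply rejects files whose content plainly contradicts their extension.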
Real-World Scenarios
Scenario 1: User Registration
import hashlib
import os
import re

def validate_registration(username, email, password):
    """Validate user registration input"""
    errors = []
    # Validate username
    if not username or len(username) < 3:
        errors.append("Username must be at least 3 characters")
    if not re.match(r'^[a-zA-Z0-9_]+$', username or ''):
        errors.append("Username can only contain letters, numbers, and underscores")
    # Validate email
    if not re.match(r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$', email or ''):
        errors.append("Invalid email format")
    # Validate password
    password = password or ''
    if len(password) < 8:
        errors.append("Password must be at least 8 characters")
    if not re.search(r'[A-Z]', password):
        errors.append("Password must contain uppercase letter")
    if not re.search(r'[0-9]', password):
        errors.append("Password must contain number")
    if errors:
        raise ValueError("; ".join(errors))
    # Hash password (never store plaintext!)
    # A single SHA-256 pass is too fast to resist brute force, so use a
    # salted, deliberately slow key-derivation function instead
    salt = os.urandom(16)
    password_hash = hashlib.pbkdf2_hmac('sha256', password.encode(), salt, 600_000)
    return {
        'username': username,
        'email': email.lower(),
        'password_salt': salt.hex(),
        'password_hash': password_hash.hex()
    }

# Valid
try:
    user = validate_registration("john_doe", "user@example.com", "SecurePass123")
    print("Registration successful")
except ValueError as e:
    print(f"Registration error: {e}")
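Storing a password hash is only half of the job; at login you need to recompute the hash with the stored salt and compare the two in constant time. The sketch below uses only the standard library. The iteration count is an assumed work factor, and hash_password and verify_password are hypothetical helper names. A fast unsalted hash such as a single SHA-256 pass is not suitable here.

```python
import hashlib
import hmac
import os

ITERATIONS = 600_000  # assumed work factor; tune for your hardware

def hash_password(password, salt=None):
    """Return (salt, digest) using salted PBKDF2-HMAC-SHA256."""
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac(
        'sha256', password.encode('utf-8'), salt, ITERATIONS
    )
    return salt, digest

def verify_password(password, salt, expected_digest):
    """Recompute the digest and compare it in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected_digest)

salt, stored = hash_password("SecurePass123")
print(verify_password("SecurePass123", salt, stored))  # True
print(verify_password("WrongPass123", salt, stored))   # False
```

hmac.compare_digest avoids leaking, through timing differences, how many leading bytes of a guess were correct. Dedicated libraries for bcrypt or Argon2 are common alternatives if you can add a dependency.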
Scenario 2: Search Query
import re

def validate_search_query(query):
    """Validate and normalize search query"""
    # Remove leading/trailing whitespace
    query = query.strip()
    # Check length
    if not query or len(query) > 200:
        raise ValueError("Search query must be 1-200 characters")
    # Reject characters that could cause issues:
    # allow only alphanumeric, spaces, and basic punctuation
    if not re.match(r'^[a-zA-Z0-9\s\-\.\,\!\?]+$', query):
        raise ValueError("Search query contains invalid characters")
    return query

# Usage
try:
    safe_query = validate_search_query(user_input)
    # Use safe_query in database search
except ValueError as e:
    print(f"Invalid search: {e}")
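Validation does not replace parameterization: even a validated query should reach the database through a placeholder. The sketch below shows one way a search term might be used in a LIKE query with SQLite. The articles table, escape_like, and search_titles are hypothetical. The wildcard escaping is defensive, since the validator above already rejects % and _, but it keeps results correct if the allowed character set is ever relaxed.

```python
import sqlite3
from contextlib import closing

def escape_like(term, escape_char="\\"):
    """Escape LIKE wildcards so that they match literally."""
    return (
        term.replace(escape_char, escape_char * 2)
            .replace("%", escape_char + "%")
            .replace("_", escape_char + "_")
    )

def search_titles(conn, term):
    """Return titles containing term, using a parameterized LIKE query."""
    pattern = f"%{escape_like(term)}%"
    rows = conn.execute(
        "SELECT title FROM articles WHERE title LIKE ? ESCAPE '\\' ORDER BY title",
        (pattern,),
    ).fetchall()
    return [row[0] for row in rows]

with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute("CREATE TABLE articles (title TEXT)")
    conn.executemany(
        "INSERT INTO articles VALUES (?)",
        [("100% Python",), ("Intro to Python",), ("snake_case tips",), ("snakeXcase",)],
    )
    python_hits = search_titles(conn, "python")
    percent_hits = search_titles(conn, "100%")
    underscore_hits = search_titles(conn, "snake_case")

print(python_hits)      # ['100% Python', 'Intro to Python']
print(percent_hits)     # ['100% Python']
print(underscore_hits)  # ['snake_case tips']
```

Without the escaping step, the underscore in "snake_case" would act as a single-character wildcard and also match "snakeXcase".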
Scenario 3: API Request
import re
from typing import Any, Dict

EMAIL_PATTERN = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'

def validate_api_request(data: Dict[str, Any], schema: Dict[str, Any]) -> Dict[str, Any]:
    """Validate API request against schema"""
    validated = {}
    for field, rules in schema.items():
        if field not in data:
            if rules.get('required'):
                raise ValueError(f"Missing required field: {field}")
            continue
        value = data[field]
        field_type = rules.get('type')
        # Type validation
        if field_type == 'string':
            if not isinstance(value, str):
                raise ValueError(f"{field} must be string")
            max_len = rules.get('max_length', 1000)
            if len(value) > max_len:
                raise ValueError(f"{field} exceeds max length {max_len}")
            validated[field] = value.strip()
        elif field_type == 'integer':
            try:
                value = int(value)
            except (ValueError, TypeError):
                raise ValueError(f"{field} must be integer")
            min_val = rules.get('min')
            max_val = rules.get('max')
            if min_val is not None and value < min_val:
                raise ValueError(f"{field} must be >= {min_val}")
            if max_val is not None and value > max_val:
                raise ValueError(f"{field} must be <= {max_val}")
            validated[field] = value
        elif field_type == 'email':
            # Check the type first: re.match raises TypeError on non-strings
            if not isinstance(value, str) or not re.match(EMAIL_PATTERN, value):
                raise ValueError(f"{field} must be valid email")
            validated[field] = value.lower()
    return validated

# Usage (inside a request handler; request_data is the parsed request body)
schema = {
    'name': {'type': 'string', 'required': True, 'max_length': 100},
    'age': {'type': 'integer', 'required': True, 'min': 0, 'max': 150},
    'email': {'type': 'email', 'required': True},
}
try:
    validated = validate_api_request(request_data, schema)
except ValueError as e:
    return {'error': str(e)}, 400
Best Practices
1. Validate Early, Validate Often
# Good: Validate at entry point
def handle_request(request):
    try:
        username = validate_username(request.form.get('username'))
        email = validate_email(request.form.get('email'))
    except ValueError as e:
        return {'error': str(e)}, 400
    # Use validated data
    create_user(username, email)
2. Use Parameterized Queries
# Bad
query = f"SELECT * FROM users WHERE id={user_id}"
# Good
cursor.execute("SELECT * FROM users WHERE id=?", (user_id,))
3. Implement Rate Limiting
from functools import wraps
import time

def rate_limit(max_calls=10, time_window=60):
    """Rate limit decorator"""
    def decorator(func):
        calls = []

        @wraps(func)
        def wrapper(*args, **kwargs):
            now = time.time()
            # Remove old calls outside time window
            calls[:] = [call for call in calls if call > now - time_window]
            if len(calls) >= max_calls:
                raise ValueError("Rate limit exceeded")
            calls.append(now)
            return func(*args, **kwargs)
        return wrapper
    return decorator

@rate_limit(max_calls=5, time_window=60)
def api_endpoint():
    return "Success"
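The decorator above keeps one shared list of calls, so every caller draws from the same budget. In a web application you would usually want a separate budget per client. The sketch below tracks timestamps per client key; SlidingWindowLimiter is a hypothetical class, the client key (an IP address or user ID, say) is an assumption, and state lives in process memory only, so it does not coordinate across multiple workers.

```python
import time
from collections import defaultdict, deque

class SlidingWindowLimiter:
    """Allow at most max_calls per client within time_window seconds."""

    def __init__(self, max_calls=5, time_window=60, clock=time.monotonic):
        self.max_calls = max_calls
        self.time_window = time_window
        self.clock = clock  # injectable so tests can control time
        self._calls = defaultdict(deque)

    def allow(self, client_id):
        now = self.clock()
        calls = self._calls[client_id]
        # Drop timestamps that have fallen out of the window
        while calls and calls[0] <= now - self.time_window:
            calls.popleft()
        if len(calls) >= self.max_calls:
            return False
        calls.append(now)
        return True

limiter = SlidingWindowLimiter(max_calls=2, time_window=60)
print(limiter.allow("203.0.113.7"))   # True
print(limiter.allow("203.0.113.7"))   # True
print(limiter.allow("203.0.113.7"))   # False: third call inside the window
print(limiter.allow("198.51.100.4"))  # True: a different client has its own budget
```

For deployments with several processes or machines, the counters would need to live in shared storage instead of a local dictionary.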
4. Log Security Events
import logging

logger = logging.getLogger(__name__)

def process_user_input(user_input):
    """Process input with logging"""
    try:
        validated = validate_input(user_input)
        return validated
    except ValueError as e:
        # Log suspicious activity
        logger.warning(f"Invalid input attempt: {user_input[:50]}")
        raise
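Log messages are themselves an output channel, so rejected input deserves the same care before it is written. If raw input containing line breaks is logged, an attacker can forge extra log entries. The sketch below is one possible approach, with safe_for_log as a hypothetical helper that truncates the value and replaces non-printable characters with visible escapes.

```python
import logging

logger = logging.getLogger(__name__)

def safe_for_log(value, max_length=50):
    """Truncate value and make control characters visible."""
    text = str(value)[:max_length]
    # Non-printable characters (including \r and \n) are replaced with
    # their escaped form, so one input always stays on one log line
    return ''.join(ch if ch.isprintable() else repr(ch)[1:-1] for ch in text)

forged = "bob\nWARNING admin logged in"
cleaned = safe_for_log(forged)
logger.warning("Invalid input attempt: %s", cleaned)
print(cleaned)  # bob\nWARNING admin logged in  (a literal backslash and n)
```

Passing the value as a logging argument rather than formatting it into the message also lets the logging module skip the formatting work when the level is disabled.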
5. Use Security Headers
# In Flask/Django
def add_security_headers(response):
    """Add security headers to response"""
    response.headers['X-Content-Type-Options'] = 'nosniff'
    response.headers['X-Frame-Options'] = 'DENY'
    # X-XSS-Protection is deprecated; '0' disables the legacy browser
    # filter, and Content-Security-Policy provides the actual protection
    response.headers['X-XSS-Protection'] = '0'
    response.headers['Content-Security-Policy'] = "default-src 'self'"
    return response
Conclusion
Input validation and sanitization are not optional; they’re essential security practices. Every piece of user input is a potential attack vector. By implementing proper validation and sanitization, you protect your application and users.
Key takeaways:
- Validation checks if input meets expected criteria
- Sanitization makes input safe for use
- Use both together for comprehensive protection
- Allowlist known-good values instead of denylisting bad ones
- Parameterize queries to prevent SQL injection
- Escape output to prevent XSS
- Validate file uploads to prevent path traversal
- Log security events to detect attacks
- Use libraries like validators and bleach for common tasks
- Test thoroughly with malicious input
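The last takeaway can be sketched as a small payload check: collect the attack strings used throughout this guide and confirm that a validator rejects every one of them. The validator below is a compact, hypothetical variant of the username check, included so that the example is self-contained.

```python
import re

def is_valid_username(username):
    """Allowlist check: 3-20 ASCII letters, digits, or underscores."""
    return (
        isinstance(username, str)
        and re.fullmatch(r'[a-zA-Z0-9_]{3,20}', username) is not None
    )

MALICIOUS_INPUTS = [
    "admin' --",                      # SQL injection
    "<script>alert('xss')</script>",  # XSS
    "image.jpg; rm -rf /",            # command injection
    "../../etc/passwd",               # path traversal
    "john_doe\n",                     # trailing newline
    "",                               # empty input
    None,                             # missing field
    "a" * 10_000,                     # oversized input
]

rejected = [p for p in MALICIOUS_INPUTS if not is_valid_username(p)]
print(f"{len(rejected)} of {len(MALICIOUS_INPUTS)} payloads rejected")
# 8 of 8 payloads rejected
```

In a real project the same list would typically live in your test suite and grow whenever a new bypass is reported.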
Security is not a feature; it’s a requirement. Make input validation and sanitization a core part of your development process. Your users will thank you.