Skip to main content
โšก Calmops

Input Validation & Sanitization in Python: Securing Your Applications

Input Validation & Sanitization in Python: Securing Your Applications

Every application accepts input from users. That input might be a username, a search query, a file upload, or data from an API. Each piece of input is a potential attack vector. Attackers exploit improperly handled input to inject malicious code, steal data, or compromise systems.

Input validation and sanitization are your first line of defense. They’re not optional security featuresโ€”they’re fundamental requirements for any application that accepts user input.

This guide explores input validation and sanitization comprehensively, showing you how to implement them effectively in Python.

Understanding the Risks

What Happens Without Validation?

Consider a simple login form:

# DANGEROUS: No validation
username = request.form.get('username')
password = request.form.get('password')

query = f"SELECT * FROM users WHERE username='{username}' AND password='{password}'"
result = db.execute(query)

An attacker could submit:

  • Username: admin' --
  • Password: anything

The query becomes:

SELECT * FROM users WHERE username='admin' --' AND password='anything'

The -- comments out the password check, allowing login without a password. This is SQL injection.

Common Vulnerabilities

SQL Injection: Injecting SQL code through input fields

# Vulnerable
query = f"SELECT * FROM users WHERE id={user_id}"

# Attack: user_id = "1 OR 1=1"
# Result: SELECT * FROM users WHERE id=1 OR 1=1
# Returns all users!

Cross-Site Scripting (XSS): Injecting JavaScript that runs in users’ browsers

# Vulnerable
comment = request.form.get('comment')
html = f"<p>{comment}</p>"

# Attack: comment = "<script>alert('hacked')</script>"
# Result: <p><script>alert('hacked')</script></p>
# Script runs in browser!

Command Injection: Injecting shell commands

# Vulnerable
filename = request.form.get('filename')
os.system(f"convert {filename} output.jpg")

# Attack: filename = "image.jpg; rm -rf /"
# Result: convert image.jpg; rm -rf /
# Deletes files!

Path Traversal: Accessing files outside intended directory

# Vulnerable
filename = request.form.get('filename')
with open(f"uploads/{filename}") as f:
    return f.read()

# Attack: filename = "../../etc/passwd"
# Result: opens /etc/passwd

Validation vs Sanitization

These terms are often confused, but they’re different:

Validation

Validation checks if input meets expected criteria. It answers: “Is this input acceptable?”

# Validation: Check if email looks valid
def is_valid_email(email):
    """Check if email format is valid"""
    import re
    pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    return re.match(pattern, email) is not None

# Valid
is_valid_email("[email protected]")  # True

# Invalid
is_valid_email("not-an-email")  # False

Sanitization

Sanitization cleans or escapes input to make it safe. It answers: “How do I make this input safe?”

# Sanitization: Escape HTML special characters
def sanitize_html(text):
    """Escape HTML special characters"""
    import html
    return html.escape(text)

# Input: <script>alert('xss')</script>
# Output: &lt;script&gt;alert('xss')&lt;/script&gt;
# Safe to display in HTML

When to Use Each

  • Validation: Reject invalid input (e.g., reject non-numeric input for age)
  • Sanitization: Accept input but make it safe (e.g., escape HTML in user comments)

Often you use both:

def process_user_comment(comment):
    """Validate and sanitize user comment"""
    # Validation: Check length
    if not comment or len(comment) > 1000:
        raise ValueError("Comment must be 1-1000 characters")
    
    # Sanitization: Escape HTML
    safe_comment = html.escape(comment)
    
    return safe_comment

Practical Validation Techniques

Technique 1: Type Checking

def validate_age(age):
    """Validate that age is a positive integer"""
    try:
        age_int = int(age)
        if age_int < 0 or age_int > 150:
            raise ValueError("Age must be between 0 and 150")
        return age_int
    except ValueError:
        raise ValueError("Age must be a valid integer")

# Valid
validate_age("25")  # 25

# Invalid
validate_age("abc")  # ValueError
validate_age("-5")   # ValueError

Technique 2: Regular Expressions

import re

def validate_phone(phone):
    """Validate phone number format"""
    pattern = r'^\+?1?\d{9,15}$'
    if re.match(pattern, phone):
        return phone
    raise ValueError("Invalid phone number format")

# Valid
validate_phone("+1-555-123-4567")  # Valid
validate_phone("5551234567")       # Valid

# Invalid
validate_phone("123")              # Invalid

Technique 3: Using the validators Library

import validators

# Email validation
if validators.email("[email protected]"):
    print("Valid email")

# URL validation
if validators.url("https://example.com"):
    print("Valid URL")

# IP address validation
if validators.ipv4("192.168.1.1"):
    print("Valid IPv4")

# Domain validation
if validators.domain("example.com"):
    print("Valid domain")

Technique 4: Allowlisting vs Denylisting

Allowlisting (preferred): Only accept known-good values

# Good: Allowlist
ALLOWED_ROLES = {'admin', 'user', 'moderator'}

def validate_role(role):
    """Validate role is in allowed list"""
    if role not in ALLOWED_ROLES:
        raise ValueError(f"Invalid role. Must be one of {ALLOWED_ROLES}")
    return role

validate_role('admin')   # OK
validate_role('hacker')  # ValueError

Denylisting (not recommended): Reject known-bad values

# Bad: Denylist (incomplete)
FORBIDDEN_WORDS = {'admin', 'root', 'system'}

def validate_username(username):
    """Validate username doesn't contain forbidden words"""
    if any(word in username.lower() for word in FORBIDDEN_WORDS):
        raise ValueError("Username contains forbidden words")
    return username

# Attacker could use: 'adm1n', 'ADMIN', 'a-d-m-i-n'
# Denylist is incomplete!

Technique 5: String Validation

def validate_username(username):
    """Validate username format"""
    # Check length
    if not username or len(username) < 3 or len(username) > 20:
        raise ValueError("Username must be 3-20 characters")
    
    # Check characters (alphanumeric and underscore only)
    if not re.match(r'^[a-zA-Z0-9_]+$', username):
        raise ValueError("Username can only contain letters, numbers, and underscores")
    
    # Check doesn't start with number
    if username[0].isdigit():
        raise ValueError("Username cannot start with a number")
    
    return username

validate_username("john_doe")    # OK
validate_username("123invalid")  # ValueError
validate_username("user@name")   # ValueError

Practical Sanitization Techniques

Technique 1: HTML Escaping

import html

def sanitize_html_text(text):
    """Escape HTML special characters"""
    return html.escape(text)

# Input: <script>alert('xss')</script>
# Output: &lt;script&gt;alert('xss')&lt;/script&gt;

# Safe to display in HTML
html_output = f"<p>{sanitize_html_text(user_input)}</p>"

Technique 2: SQL Parameterization

import sqlite3

# DANGEROUS: String concatenation
# query = f"SELECT * FROM users WHERE username='{username}'"

# SAFE: Parameterized query
def get_user(username):
    """Get user by username safely"""
    conn = sqlite3.connect('database.db')
    cursor = conn.cursor()
    
    # Use ? placeholders
    cursor.execute("SELECT * FROM users WHERE username=?", (username,))
    return cursor.fetchone()

# Parameterization prevents SQL injection
get_user("admin' --")  # Treated as literal string, not SQL

Technique 3: URL Encoding

from urllib.parse import quote, unquote

def sanitize_url_parameter(param):
    """Encode parameter for safe URL inclusion"""
    return quote(param, safe='')

# Input: "hello world & special chars"
# Output: "hello%20world%20%26%20special%20chars"

# Safe to include in URL
url = f"https://example.com/search?q={sanitize_url_parameter(user_input)}"

Technique 4: Using bleach for HTML

import bleach

def sanitize_user_html(html_content):
    """Allow only safe HTML tags"""
    allowed_tags = ['p', 'br', 'strong', 'em', 'u', 'a']
    allowed_attributes = {'a': ['href', 'title']}
    
    return bleach.clean(
        html_content,
        tags=allowed_tags,
        attributes=allowed_attributes,
        strip=True
    )

# Input: <p>Hello <script>alert('xss')</script></p>
# Output: <p>Hello </p>
# Script tag removed!

# Input: <p>Check <a href="https://example.com">this</a></p>
# Output: <p>Check <a href="https://example.com">this</a></p>
# Safe link preserved

Technique 5: File Upload Validation

import os
from pathlib import Path

def validate_file_upload(file, allowed_extensions, max_size_mb=5):
    """Validate uploaded file"""
    # Check file exists
    if not file or file.filename == '':
        raise ValueError("No file selected")
    
    # Check file extension
    filename = file.filename
    ext = os.path.splitext(filename)[1].lower()
    if ext not in allowed_extensions:
        raise ValueError(f"File type not allowed. Allowed: {allowed_extensions}")
    
    # Check file size
    file.seek(0, os.SEEK_END)
    file_size = file.tell()
    file.seek(0)
    
    if file_size > max_size_mb * 1024 * 1024:
        raise ValueError(f"File too large. Max size: {max_size_mb}MB")
    
    # Sanitize filename (prevent path traversal)
    safe_filename = Path(filename).name
    if safe_filename != filename:
        raise ValueError("Invalid filename")
    
    return safe_filename

# Usage
try:
    safe_name = validate_file_upload(
        file,
        allowed_extensions={'.jpg', '.png', '.gif'},
        max_size_mb=5
    )
except ValueError as e:
    print(f"Upload error: {e}")

Real-World Scenarios

Scenario 1: User Registration

import re
import hashlib

def validate_registration(username, email, password):
    """Validate user registration input"""
    errors = []
    
    # Validate username
    if not username or len(username) < 3:
        errors.append("Username must be at least 3 characters")
    if not re.match(r'^[a-zA-Z0-9_]+$', username):
        errors.append("Username can only contain letters, numbers, and underscores")
    
    # Validate email
    if not re.match(r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$', email):
        errors.append("Invalid email format")
    
    # Validate password
    if len(password) < 8:
        errors.append("Password must be at least 8 characters")
    if not re.search(r'[A-Z]', password):
        errors.append("Password must contain uppercase letter")
    if not re.search(r'[0-9]', password):
        errors.append("Password must contain number")
    
    if errors:
        raise ValueError("; ".join(errors))
    
    # Hash password (never store plaintext!)
    password_hash = hashlib.sha256(password.encode()).hexdigest()
    
    return {
        'username': username,
        'email': email.lower(),
        'password_hash': password_hash
    }

# Valid
try:
    user = validate_registration("john_doe", "[email protected]", "SecurePass123")
    print("Registration successful")
except ValueError as e:
    print(f"Registration error: {e}")

Scenario 2: Search Query

def validate_search_query(query):
    """Validate and sanitize search query"""
    # Remove leading/trailing whitespace
    query = query.strip()
    
    # Check length
    if not query or len(query) > 200:
        raise ValueError("Search query must be 1-200 characters")
    
    # Remove special characters that could cause issues
    # Allow only alphanumeric, spaces, and basic punctuation
    if not re.match(r'^[a-zA-Z0-9\s\-\.\,\!\?]+$', query):
        raise ValueError("Search query contains invalid characters")
    
    return query

# Usage
try:
    safe_query = validate_search_query(user_input)
    # Use safe_query in database search
except ValueError as e:
    print(f"Invalid search: {e}")

Scenario 3: API Request

import json
from typing import Any, Dict

def validate_api_request(data: Dict[str, Any], schema: Dict[str, Any]) -> Dict[str, Any]:
    """Validate API request against schema"""
    validated = {}
    
    for field, rules in schema.items():
        if field not in data:
            if rules.get('required'):
                raise ValueError(f"Missing required field: {field}")
            continue
        
        value = data[field]
        field_type = rules.get('type')
        
        # Type validation
        if field_type == 'string':
            if not isinstance(value, str):
                raise ValueError(f"{field} must be string")
            max_len = rules.get('max_length', 1000)
            if len(value) > max_len:
                raise ValueError(f"{field} exceeds max length {max_len}")
            validated[field] = value.strip()
        
        elif field_type == 'integer':
            try:
                value = int(value)
            except (ValueError, TypeError):
                raise ValueError(f"{field} must be integer")
            min_val = rules.get('min')
            max_val = rules.get('max')
            if min_val is not None and value < min_val:
                raise ValueError(f"{field} must be >= {min_val}")
            if max_val is not None and value > max_val:
                raise ValueError(f"{field} must be <= {max_val}")
            validated[field] = value
        
        elif field_type == 'email':
            if not re.match(r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$', value):
                raise ValueError(f"{field} must be valid email")
            validated[field] = value.lower()
    
    return validated

# Usage
schema = {
    'name': {'type': 'string', 'required': True, 'max_length': 100},
    'age': {'type': 'integer', 'required': True, 'min': 0, 'max': 150},
    'email': {'type': 'email', 'required': True},
}

try:
    validated = validate_api_request(request_data, schema)
except ValueError as e:
    return {'error': str(e)}, 400

Best Practices

1. Validate Early, Validate Often

# Good: Validate at entry point
def handle_request(request):
    try:
        username = validate_username(request.form.get('username'))
        email = validate_email(request.form.get('email'))
    except ValueError as e:
        return {'error': str(e)}, 400
    
    # Use validated data
    create_user(username, email)

2. Use Parameterized Queries

# Bad
query = f"SELECT * FROM users WHERE id={user_id}"

# Good
cursor.execute("SELECT * FROM users WHERE id=?", (user_id,))

3. Implement Rate Limiting

from functools import wraps
import time

def rate_limit(max_calls=10, time_window=60):
    """Rate limit decorator"""
    def decorator(func):
        calls = []
        
        @wraps(func)
        def wrapper(*args, **kwargs):
            now = time.time()
            # Remove old calls outside time window
            calls[:] = [call for call in calls if call > now - time_window]
            
            if len(calls) >= max_calls:
                raise ValueError("Rate limit exceeded")
            
            calls.append(now)
            return func(*args, **kwargs)
        
        return wrapper
    return decorator

@rate_limit(max_calls=5, time_window=60)
def api_endpoint():
    return "Success"

4. Log Security Events

import logging

logger = logging.getLogger(__name__)

def process_user_input(user_input):
    """Process input with logging"""
    try:
        validated = validate_input(user_input)
        return validated
    except ValueError as e:
        # Log suspicious activity
        logger.warning(f"Invalid input attempt: {user_input[:50]}")
        raise

5. Use Security Headers

# In Flask/Django
def add_security_headers(response):
    """Add security headers to response"""
    response.headers['X-Content-Type-Options'] = 'nosniff'
    response.headers['X-Frame-Options'] = 'DENY'
    response.headers['X-XSS-Protection'] = '1; mode=block'
    response.headers['Content-Security-Policy'] = "default-src 'self'"
    return response

Conclusion

Input validation and sanitization are not optionalโ€”they’re essential security practices. Every piece of user input is a potential attack vector. By implementing proper validation and sanitization, you protect your application and users.

Key takeaways:

  • Validation checks if input meets expected criteria
  • Sanitization makes input safe for use
  • Use both together for comprehensive protection
  • Allowlist known-good values instead of denylisting bad ones
  • Parameterize queries to prevent SQL injection
  • Escape output to prevent XSS
  • Validate file uploads to prevent path traversal
  • Log security events to detect attacks
  • Use libraries like validators and bleach for common tasks
  • Test thoroughly with malicious input

Security is not a featureโ€”it’s a requirement. Make input validation and sanitization a core part of your development process. Your users will thank you.

Comments