Data comes in many formats. Whether you’re consuming an API, processing spreadsheets, or parsing configuration files, you’ll encounter JSON, CSV, and XML. Understanding how to work with these formats efficiently is essential for any Python developer.
This guide covers all three formats, showing you how to read, write, parse, and manipulate data in each. By the end, you’ll know which format to use for different scenarios and how to handle each one effectively.
## Understanding Data Formats
Before diving into code, let’s understand when to use each format:
- JSON: Lightweight, human-readable, ideal for APIs and configuration. Best for nested data structures.
- CSV: Simple tabular format, widely supported, perfect for spreadsheets and data analysis.
- XML: Hierarchical, self-describing, good for complex documents and enterprise systems.
## Part 1: Working with JSON

### What is JSON?

JSON (JavaScript Object Notation) is a lightweight data format that’s easy for humans to read and for machines to parse. It’s the de facto standard for web APIs and modern data interchange.

### Reading JSON Files

```python
import json

# Read JSON from a file
with open('data.json', 'r') as file:
    data = json.load(file)

# Access the data
print(data)
print(data['name'])
print(data['users'][0]['email'])
```
### Writing JSON Files

```python
import json

# Create data structure
data = {
    'name': 'My Application',
    'version': '1.0.0',
    'users': [
        {'id': 1, 'name': 'Alice', 'email': '[email protected]'},
        {'id': 2, 'name': 'Bob', 'email': '[email protected]'}
    ],
    'settings': {
        'debug': True,
        'timeout': 30
    }
}

# Write to file
with open('output.json', 'w') as file:
    json.dump(data, file, indent=2)

# Or get JSON string
json_string = json.dumps(data, indent=2)
print(json_string)
```
### Parsing JSON Strings

```python
import json

# JSON string from API response
json_string = '''
{
    "status": "success",
    "data": {
        "user_id": 123,
        "username": "alice",
        "created_at": "2025-12-16"
    }
}
'''

# Parse JSON string
data = json.loads(json_string)
print(data['data']['username'])  # Output: alice
```
### Practical Example: Processing API Response

```python
import json
from datetime import datetime

def process_api_response(json_response):
    """Process and validate API response"""
    try:
        data = json.loads(json_response)

        # Validate required fields
        if 'status' not in data or 'data' not in data:
            raise ValueError("Invalid API response structure")

        # Extract and transform data
        users = []
        for user in data['data'].get('users', []):
            users.append({
                'id': user['id'],
                'name': user['name'],
                'email': user['email'],
                'joined': datetime.fromisoformat(user['created_at'])
            })

        return {
            'status': data['status'],
            'user_count': len(users),
            'users': users
        }
    except json.JSONDecodeError as e:
        print(f"Invalid JSON: {e}")
        return None
    except (KeyError, ValueError) as e:
        print(f"Data validation error: {e}")
        return None

# Example usage
api_response = '''
{
    "status": "success",
    "data": {
        "users": [
            {"id": 1, "name": "Alice", "email": "[email protected]", "created_at": "2025-01-01"},
            {"id": 2, "name": "Bob", "email": "[email protected]", "created_at": "2025-01-02"}
        ]
    }
}
'''

result = process_api_response(api_response)
if result:
    print(f"Processed {result['user_count']} users")
```
### Handling Special Data Types

```python
import json
from datetime import datetime
from decimal import Decimal

# Custom JSON encoder for special types
class CustomEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, datetime):
            return obj.isoformat()
        elif isinstance(obj, Decimal):
            return float(obj)
        elif isinstance(obj, set):
            return list(obj)
        return super().default(obj)

# Data with special types
data = {
    'timestamp': datetime.now(),
    'price': Decimal('19.99'),
    'tags': {'python', 'json', 'tutorial'}
}

# Serialize with custom encoder
json_string = json.dumps(data, cls=CustomEncoder, indent=2)
print(json_string)
```
### JSON Best Practices

- Validate data: Always check for required fields and correct types
- Handle errors: Use try-except for JSON parsing
- Use indent: Make JSON readable with `indent=2` or `indent=4`
- Be careful with large files: Stream large JSON files instead of loading all at once
- Use custom encoders: For non-standard types like datetime or Decimal
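The streaming bullet deserves a sketch. One common pattern is JSON Lines (one complete JSON object per line), which lets you parse a single record at a time with nothing but the standard `json` module; the filename below is illustrative:

```python
import json

# Write a sample JSON Lines file: one complete JSON object per line
with open('events.jsonl', 'w') as f:
    for i in range(3):
        f.write(json.dumps({'id': i, 'type': 'click'}) + '\n')

# Stream it back: only one record is held in memory at a time
count = 0
with open('events.jsonl') as f:
    for line in f:
        record = json.loads(line)
        count += 1

print(count)
```

This works because each line is independently valid JSON; for a single huge JSON document you would need an incremental parser instead.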
## Part 2: Working with CSV

### What is CSV?

CSV (Comma-Separated Values) is a simple format for tabular data. It’s widely used for spreadsheets, data exports, and data analysis.

### Reading CSV Files

```python
import csv

# Read CSV file
with open('data.csv', 'r') as file:
    reader = csv.reader(file)
    for row in reader:
        print(row)  # Each row is a list

# Read with header
with open('data.csv', 'r') as file:
    reader = csv.DictReader(file)
    for row in reader:
        print(row)  # Each row is a dictionary
```
### Writing CSV Files

```python
import csv

# Write CSV file
data = [
    ['Name', 'Email', 'Age'],
    ['Alice', '[email protected]', 30],
    ['Bob', '[email protected]', 25],
    ['Charlie', '[email protected]', 35]
]

with open('output.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerows(data)

# Write with dictionaries
data_dicts = [
    {'name': 'Alice', 'email': '[email protected]', 'age': 30},
    {'name': 'Bob', 'email': '[email protected]', 'age': 25},
    {'name': 'Charlie', 'email': '[email protected]', 'age': 35}
]

with open('output.csv', 'w', newline='') as file:
    fieldnames = ['name', 'email', 'age']
    writer = csv.DictWriter(file, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(data_dicts)
```
### Practical Example: Processing Sales Data

```python
import csv
from collections import defaultdict

def analyze_sales_data(filename):
    """Analyze sales data from CSV file"""
    sales_by_region = defaultdict(float)
    sales_by_product = defaultdict(int)
    total_sales = 0
    try:
        with open(filename, 'r') as file:
            reader = csv.DictReader(file)
            for row in reader:
                region = row['region']
                product = row['product']
                amount = float(row['amount'])
                sales_by_region[region] += amount
                sales_by_product[product] += 1
                total_sales += amount
        return {
            'total_sales': total_sales,
            'by_region': dict(sales_by_region),
            'by_product': dict(sales_by_product)
        }
    except FileNotFoundError:
        print(f"File {filename} not found")
        return None
    except (KeyError, ValueError) as e:
        print(f"Error processing data: {e}")
        return None

# Example: Create sample CSV
sample_data = [
    ['region', 'product', 'amount'],
    ['North', 'Widget A', '1000'],
    ['North', 'Widget B', '1500'],
    ['South', 'Widget A', '800'],
    ['South', 'Widget C', '2000'],
    ['East', 'Widget B', '1200']
]

with open('sales.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(sample_data)

# Analyze the data
results = analyze_sales_data('sales.csv')
if results:
    print(f"Total Sales: ${results['total_sales']}")
    print(f"By Region: {results['by_region']}")
    print(f"By Product: {results['by_product']}")
```
### Handling Different Delimiters

```python
import csv

# Tab-separated values
with open('data.tsv', 'r') as file:
    reader = csv.reader(file, delimiter='\t')
    for row in reader:
        print(row)

# Semicolon-separated (common in Europe)
with open('data.csv', 'r') as file:
    reader = csv.reader(file, delimiter=';')
    for row in reader:
        print(row)

# Custom delimiter
with open('data.txt', 'r') as file:
    reader = csv.reader(file, delimiter='|')
    for row in reader:
        print(row)
```
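If you don’t know the delimiter ahead of time, the standard library’s `csv.Sniffer` can often detect it from a sample. A sketch (Sniffer is heuristic, so verify it against your own data):

```python
import csv
import io

sample = "name;email\nAlice;alice@example.com\nBob;bob@example.com\n"

# Sniffer inspects the sample and guesses the dialect (delimiter, quoting)
dialect = csv.Sniffer().sniff(sample)
print(dialect.delimiter)

# Use the detected dialect to parse
reader = csv.reader(io.StringIO(sample), dialect)
rows = list(reader)
print(rows)
```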
### CSV Best Practices

- Always use `newline=''`: When opening CSV files for writing
- Use DictReader/DictWriter: For better readability with headers
- Handle encoding: Specify encoding if needed: `open(file, encoding='utf-8')`
- Quote fields with special characters: Use `quoting=csv.QUOTE_MINIMAL`
- Validate data types: Convert strings to appropriate types (int, float, etc.)
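The quoting bullet in action: with `csv.QUOTE_MINIMAL` (the default), the writer quotes only fields that contain the delimiter, the quote character, or a newline. A sketch using an in-memory buffer:

```python
import csv
import io

buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL)

# Only the second field needs quotes: it contains the delimiter
writer.writerow(['Alice', '123 Main St, Apt 4'])

print(buffer.getvalue().strip())  # Alice,"123 Main St, Apt 4"
```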
## Part 3: Working with XML

### What is XML?

XML (eXtensible Markup Language) is a hierarchical format for structured data. It’s self-describing and widely used in enterprise systems and configuration files.

### Reading XML Files

```python
import xml.etree.ElementTree as ET

# Parse XML file
tree = ET.parse('data.xml')
root = tree.getroot()

# Access elements
for child in root:
    print(child.tag, child.attrib, child.text)

# Find specific elements
users = root.findall('user')
for user in users:
    name = user.find('name').text
    email = user.find('email').text
    print(f"{name}: {email}")
```
### Writing XML Files

```python
import xml.etree.ElementTree as ET

# Create XML structure
root = ET.Element('users')

# Add users
for user_data in [
    {'id': 1, 'name': 'Alice', 'email': '[email protected]'},
    {'id': 2, 'name': 'Bob', 'email': '[email protected]'}
]:
    user = ET.SubElement(root, 'user', id=str(user_data['id']))
    name = ET.SubElement(user, 'name')
    name.text = user_data['name']
    email = ET.SubElement(user, 'email')
    email.text = user_data['email']

# Write to file
tree = ET.ElementTree(root)
tree.write('output.xml', encoding='utf-8', xml_declaration=True)

# Or get XML string
xml_string = ET.tostring(root, encoding='unicode')
print(xml_string)
```
### Parsing XML Strings

```python
import xml.etree.ElementTree as ET

xml_string = '''
<response>
    <status>success</status>
    <data>
        <user id="1">
            <name>Alice</name>
            <email>[email protected]</email>
        </user>
        <user id="2">
            <name>Bob</name>
            <email>[email protected]</email>
        </user>
    </data>
</response>
'''

# Parse XML string
root = ET.fromstring(xml_string)

# Extract data
status = root.find('status').text
users = root.findall('.//user')
for user in users:
    user_id = user.get('id')
    name = user.find('name').text
    email = user.find('email').text
    print(f"User {user_id}: {name} ({email})")
```
### Practical Example: Processing Configuration File

```python
import xml.etree.ElementTree as ET

def load_config(filename):
    """Load configuration from XML file"""
    try:
        tree = ET.parse(filename)
        root = tree.getroot()
        config = {}

        # Parse database settings
        db = root.find('database')
        if db is not None:
            config['database'] = {
                'host': db.find('host').text,
                'port': int(db.find('port').text),
                'name': db.find('name').text,
                'user': db.find('user').text
            }

        # Parse server settings
        server = root.find('server')
        if server is not None:
            config['server'] = {
                'host': server.find('host').text,
                'port': int(server.find('port').text),
                'debug': server.find('debug').text.lower() == 'true'
            }

        # Parse features
        features = root.find('features')
        if features is not None:
            config['features'] = {}
            for feature in features.findall('feature'):
                name = feature.get('name')
                enabled = feature.text.lower() == 'true'
                config['features'][name] = enabled

        return config
    except ET.ParseError as e:
        print(f"XML parsing error: {e}")
        return None
    except (AttributeError, ValueError) as e:
        print(f"Configuration error: {e}")
        return None

# Example: Create sample XML config
config_xml = '''<?xml version="1.0" encoding="UTF-8"?>
<config>
    <database>
        <host>localhost</host>
        <port>5432</port>
        <name>myapp</name>
        <user>admin</user>
    </database>
    <server>
        <host>0.0.0.0</host>
        <port>8000</port>
        <debug>true</debug>
    </server>
    <features>
        <feature name="caching">true</feature>
        <feature name="logging">true</feature>
        <feature name="analytics">false</feature>
    </features>
</config>
'''

with open('config.xml', 'w') as f:
    f.write(config_xml)

# Load and display config
config = load_config('config.xml')
if config:
    print("Database:", config['database'])
    print("Server:", config['server'])
    print("Features:", config['features'])
```
### XPath Queries

```python
import xml.etree.ElementTree as ET

xml_string = '''
<library>
    <book id="1">
        <title>Python Basics</title>
        <author>John Doe</author>
        <year>2020</year>
    </book>
    <book id="2">
        <title>Advanced Python</title>
        <author>Jane Smith</author>
        <year>2022</year>
    </book>
</library>
'''

root = ET.fromstring(xml_string)

# Find all books
all_books = root.findall('.//book')
print(f"Total books: {len(all_books)}")

# Find books by year
for book in root.findall(".//book[year='2022']"):
    title = book.find('title').text
    print(f"Book from 2022: {title}")

# Get book by ID
book_1 = root.find(".//book[@id='1']")
if book_1 is not None:
    title = book_1.find('title').text
    print(f"Book 1: {title}")
```
### XML Best Practices
- Use ElementTree: It’s part of the standard library and sufficient for most use cases
- Handle namespaces: Use namespace-aware parsing for complex XML
- Validate structure: Check for required elements before accessing
- Use try-except: Handle parsing errors gracefully
- Consider lxml: For advanced XML processing and validation
- Pretty print: Format XML output for readability
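The pretty-print bullet can be handled with the standard library alone: `xml.etree.ElementTree.indent()` (available in Python 3.9+) rewrites a tree in place, inserting whitespace between elements. A sketch:

```python
import xml.etree.ElementTree as ET

root = ET.Element('users')
user = ET.SubElement(root, 'user', id='1')
ET.SubElement(user, 'name').text = 'Alice'

# indent() mutates the tree, adding newlines and indentation (Python 3.9+)
ET.indent(root, space='  ')
xml_pretty = ET.tostring(root, encoding='unicode')
print(xml_pretty)
```

On older Python versions, `xml.dom.minidom`'s `toprettyxml()` is a common alternative.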
## Comparison: JSON vs CSV vs XML
| Feature | JSON | CSV | XML |
|---|---|---|---|
| Readability | High | High | Medium |
| Nested Data | Excellent | Poor | Excellent |
| Tabular Data | Awkward | Excellent | Awkward |
| File Size | Small | Small | Large |
| Parsing Speed | Fast | Fast | Slower |
| Schema Support | Yes (JSON Schema) | No | Yes (XSD) |
| Best For | APIs, Config | Spreadsheets, Data | Documents, Config |
| Learning Curve | Easy | Easy | Medium |
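To make the table concrete, here is the same record serialized in all three formats (an illustrative sketch; the field names and values are invented):

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

record = {'name': 'Alice', 'email': 'alice@example.com'}

# JSON: compact key-value notation
as_json = json.dumps(record)

# CSV: header row plus one data row
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=['name', 'email'])
writer.writeheader()
writer.writerow(record)
as_csv = buf.getvalue()

# XML: every field becomes a tagged element, hence the larger size
elem = ET.Element('user')
for key, value in record.items():
    ET.SubElement(elem, key).text = value
as_xml = ET.tostring(elem, encoding='unicode')

for label, text in [('JSON', as_json), ('CSV', as_csv.strip()), ('XML', as_xml)]:
    print(f"{label} ({len(text)} chars): {text}")
```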
## Choosing the Right Format
Use JSON when:
- Building APIs or web services
- Storing configuration files
- Working with nested data structures
- You need human-readable format with small file size
Use CSV when:
- Working with tabular data
- Exporting/importing spreadsheets
- Data analysis and reporting
- You need simplicity and wide compatibility
Use XML when:
- Working with complex hierarchical data
- You need schema validation
- Building enterprise systems
- You need self-describing data with metadata
## Common Pitfalls and Solutions
### JSON Pitfalls

```python
import json
from datetime import datetime

# ❌ Pitfall: Trying to serialize non-serializable objects
data = {'date': datetime.now()}  # datetime is not JSON serializable
# json.dumps(data)  # Raises TypeError

# ✅ Solution: Use a custom encoder or convert to a string first
data = {'date': datetime.now().isoformat()}
json.dumps(data)
```
### CSV Pitfalls

```python
import csv

# ❌ Pitfall: Forgetting the newline='' parameter
# with open('file.csv', 'w') as f:  # Wrong!
#     writer = csv.writer(f)

# ✅ Solution: Always use newline=''
with open('file.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Name', 'Email'])
```
### XML Pitfalls

```python
import xml.etree.ElementTree as ET

# A document with a default namespace (illustrative URI)
root = ET.fromstring('<data xmlns="http://example.com/schema"><user>Alice</user></data>')

# ❌ Pitfall: Not handling namespaces
# root.findall('user')  # Won't find elements under a default namespace

# ✅ Solution: Use namespace-aware queries
namespaces = {'ns': 'http://example.com/schema'}
print(root.findall('ns:user', namespaces))
```
## Conclusion
JSON, CSV, and XML are essential data formats for modern Python development. Each has strengths and appropriate use cases:
- JSON is ideal for APIs and configuration with its simplicity and readability
- CSV excels at tabular data and spreadsheet integration
- XML provides structure and validation for complex hierarchical data
Key takeaways:
- Choose the right format for your specific use case
- Always validate data when parsing from external sources
- Handle errors gracefully with try-except blocks
- Use standard library modules for basic needs
- Consider performance for large datasets
- Document your data structures for maintainability
By mastering these formats, you’ll be equipped to handle data in virtually any Python project. Start with the format that matches your current needs, and expand your skills as you encounter new requirements.