
Object Storage and Data Lakes: Architecture, Patterns, and Best Practices

Introduction

Object storage is the foundation for modern cloud data architectures. It provides scalable, durable storage for any amount of data, from small configuration files to massive data lakes containing petabytes of information. Understanding object storage capabilities enables building architectures that are cost-effective, performant, and manageable.

Data lakes extend object storage to create centralized repositories that store data in its native format. They enable analytics, machine learning, and data processing at scale. Understanding data lake architecture is essential for organizations leveraging data as a strategic asset.

This comprehensive guide examines object storage and data lakes across major cloud providers. We explore storage types, lifecycle management, data lake patterns, and integration with analytics platforms. Whether building your first storage architecture or optimizing existing implementations, this guide provides the knowledge necessary for success.

Understanding Object Storage

Object storage manages data as objects, each with unique identifiers, metadata, and content.
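The object model can be pictured as a small Python structure (illustrative only; `StorageObject` is a toy type for this article, not a real SDK class):

```python
from dataclasses import dataclass, field

@dataclass
class StorageObject:
    key: str                                       # unique identifier within a bucket
    data: bytes                                    # opaque content
    metadata: dict = field(default_factory=dict)   # user-defined key/value pairs

obj = StorageObject(
    key="bronze/source=website/events.json",
    data=b'{"event": "page_view"}',
    metadata={"content-type": "application/json", "classification": "internal"},
)
print(obj.key)
```

Unlike block or file storage, there is no hierarchy or in-place update: each object is addressed by its key and replaced as a whole.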

Object Storage Characteristics

  • Scalability: Virtually unlimited storage capacity
  • Durability: High durability through distributed storage
  • Cost-effectiveness: Pay only for storage used
  • Accessibility: RESTful API access from anywhere
  • Metadata: Rich metadata for categorization

Object Storage vs. Block/File Storage

Type             Use Case                                  Characteristics
Object Storage   Unstructured data, archives, data lakes   REST API, infinite scale
Block Storage    Databases, applications                   Low latency, high IOPS
File Storage     Shared file systems                       NFS/SMB protocols

Amazon S3

S3 is the foundational object storage service in AWS.

S3 Bucket Configuration

# S3 Bucket
resource "aws_s3_bucket" "data_lake" {
  bucket = "company-data-lake"
  
  tags = {
    Environment = "production"
    DataClassification = "internal"
  }
}

resource "aws_s3_bucket_versioning" "data_lake" {
  bucket = aws_s3_bucket.data_lake.id
  
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "data_lake" {
  bucket = aws_s3_bucket.data_lake.id
  
  rule {
    apply_server_side_encryption_by_default {
      kms_master_key_id = aws_kms_key.s3.arn
      sse_algorithm     = "aws:kms"
    }
  }
}

S3 Storage Classes

# S3 Intelligent-Tiering
resource "aws_s3_bucket_intelligent_tiering_configuration" "main" {
  bucket = aws_s3_bucket.data_lake.id
  name   = "entire-bucket"
  
  # The Frequent and Infrequent Access tiers are automatic; only the
  # archive tiers take an explicit tiering configuration.
  tiering {
    access_tier = "ARCHIVE_ACCESS"
    days        = 90
  }

  tiering {
    access_tier = "DEEP_ARCHIVE_ACCESS"
    days        = 180
  }
}

S3 Lifecycle Policies

# S3 Lifecycle Rule
resource "aws_s3_bucket_lifecycle_configuration" "data_lake" {
  bucket = aws_s3_bucket.data_lake.id
  
  rule {
    id     = "transition-to-glacier"
    status = "Enabled"

    # An empty filter applies the rule to every object in the bucket
    filter {}

    transition {
      days          = 30
      storage_class = "STANDARD_IA"
    }
    
    transition {
      days          = 90
      storage_class = "GLACIER"
    }
    
    transition {
      days          = 365
      storage_class = "DEEP_ARCHIVE"
    }
    
    expiration {
      days = 2555  # 7 years
    }
  }
  
  rule {
    id     = "delete-old-versions"
    status = "Enabled"

    filter {}

    noncurrent_version_expiration {
      noncurrent_days = 90
    }
  }
}
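The rule above walks data through progressively cheaper tiers. One way to sanity-check such a schedule is a tiny helper that maps object age to the storage class the rule would have applied (a sketch mirroring the thresholds in the Terraform rule; not an AWS API):

```python
# Thresholds mirror the lifecycle rule above, newest-first
TRANSITIONS = [
    (2555, "EXPIRED"),       # 7 years: object deleted
    (365, "DEEP_ARCHIVE"),
    (90, "GLACIER"),
    (30, "STANDARD_IA"),
]

def expected_storage_class(age_days: int) -> str:
    """Return the storage class the lifecycle rule implies for an object age."""
    for threshold, storage_class in TRANSITIONS:
        if age_days >= threshold:
            return storage_class
    return "STANDARD"

print(expected_storage_class(10))    # STANDARD
print(expected_storage_class(120))   # GLACIER
```

Encoding the schedule once in code like this makes it easy to assert in tests that lifecycle changes never, say, delete data before its retention period.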

Azure Blob Storage

Azure Blob Storage provides scalable object storage for Azure.

Blob Storage Configuration

# Create storage account
New-AzStorageAccount `
    -Name "datalakestorage" `
    -ResourceGroupName "rg-data" `
    -Location "eastus" `
    -SkuName "Standard_LRS" `
    -Kind "StorageV2" `
    -EnableHierarchicalNamespace $true

# Create container (kept private; the context authenticates with the signed-in account)
$ctx = New-AzStorageContext -StorageAccountName "datalakestorage" -UseConnectedAccount
New-AzStorageContainer -Name "datalake" -Context $ctx -Permission Off

Azure Data Lake Storage Gen2

# Hierarchical namespace was enabled at account creation above; it cannot
# simply be toggled on later with Set-AzStorageAccount.

# Create filesystem
$filesystem = New-AzDataLakeGen2FileSystem `
    -Context $ctx `
    -Name "analytics"

# Upload file (-FileSystem takes the filesystem name)
$item = Set-AzDataLakeGen2Item `
    -Context $ctx `
    -FileSystem "analytics" `
    -Path "raw/data.csv" `
    -Source "local-data.csv" `
    -Permission "rwxr-x---"

Google Cloud Storage

Cloud Storage provides object storage across GCP.

Bucket Configuration

# Create bucket
gsutil mb -l us-central1 gs://company-data-lake/

# Set default storage class
gsutil defstorageclass -c STANDARD gs://company-data-lake/

# Enable versioning
gsutil versioning set on gs://company-data-lake/

# Set lifecycle management
gsutil lifecycle set lifecycle-config.json gs://company-data-lake/

Lifecycle Configuration

{
  "rule": [
    {
      "action": {
        "type": "SetStorageClass",
        "storageClass": "NEARLINE"
      },
      "condition": {
        "age": 30,
        "matchesPrefix": ["logs/", "archive/"]
      }
    },
    {
      "action": {
        "type": "SetStorageClass",
        "storageClass": "COLDLINE"
      },
      "condition": {
        "age": 90
      }
    },
    {
      "action": {
        "type": "Delete"
      },
      "condition": {
        "age": 365
      }
    }
  ]
}
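A configuration like this is easy to get subtly wrong, e.g. a Delete rule that fires before a transition ever takes effect. A small sketch that validates the ordering (the dict mirrors the JSON config above):

```python
# Mirrors the GCS lifecycle config shown above
config = {
    "rule": [
        {"action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
         "condition": {"age": 30}},
        {"action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
         "condition": {"age": 90}},
        {"action": {"type": "Delete"}, "condition": {"age": 365}},
    ]
}

def check_lifecycle(config: dict) -> bool:
    """True if every storage-class transition fires before any deletion."""
    transition_ages = [r["condition"]["age"] for r in config["rule"]
                       if r["action"]["type"] == "SetStorageClass"]
    delete_ages = [r["condition"]["age"] for r in config["rule"]
                   if r["action"]["type"] == "Delete"]
    return all(t < d for t in transition_ages for d in delete_ages)

print(check_lifecycle(config))  # True
```

Running a check like this in CI before applying lifecycle changes catches rules that would silently delete data meant to be archived.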

Data Lake Architecture

Data lakes store data in its native format, enabling diverse analytics workloads.

Data Lake Patterns

graph TB
    subgraph "Ingestion Layer"
        Stream[Streaming Data]
        Batch[Batch Data]
        API[API Data]
    end
    
    subgraph "Storage Layer"
        Bronze[Bronze - Raw]
        Silver[Silver - Cleaned]
        Gold[Gold - Aggregated]
    end
    
    subgraph "Processing Layer"
        Spark[Spark]
        Athena[Athena/Query]
        ML[ML Training]
    end
    
    subgraph "Consumption Layer"
        BI[BI Dashboards]
        Apps[API/Apps]
        Notebook[Notebooks]
    end
    
    Stream --> Bronze
    Batch --> Bronze
    API --> Bronze
    
    Bronze --> Spark
    Spark --> Silver
    Silver --> Gold
    
    Gold --> Athena
    Gold --> BI
    Gold --> ML
    Gold --> Notebook
    Gold --> Apps

Bronze/Silver/Gold Architecture

# Data Lake folder structure
s3://datalake/
├── bronze/                    # Raw, immutable data
│   ├── source=website/
│   │   └── year=2026/
│   │       └── month=03/
│   │           └── day=05/
│   │               └── events.json
│   ├── source=api/
│   └── source=mobile/
│
├── silver/                   # Cleaned, validated data
│   ├── events/
│   │   └── year=2026/
│   └── users/
│
├── gold/                     # Business-level aggregations
│   ├── daily_metrics/
│   ├── user_segments/
│   └── ml_features/
│
└── glue_tables/             # Schema definitions
    ├── events/
    └── users/
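The Hive-style `key=value` partition paths above can be generated consistently with a small helper (a sketch; `bronze_path` is a name invented for this article):

```python
from datetime import date

def bronze_path(source: str, d: date, filename: str) -> str:
    """Build a Hive-style partition key matching the bronze layout above."""
    return (f"bronze/source={source}/"
            f"year={d.year}/month={d.month:02d}/day={d.day:02d}/{filename}")

print(bronze_path("website", date(2026, 3, 5), "events.json"))
# bronze/source=website/year=2026/month=03/day=05/events.json
```

Keeping zero-padded months and days matters: query engines that prune partitions by string prefix treat `month=3` and `month=03` as different partitions.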

Data Ingestion

Batch Ingestion

# Python - S3 batch upload
import boto3
import os

def upload_directory(bucket_name, local_dir, prefix=''):
    s3 = boto3.client('s3')
    
    for root, _dirs, files in os.walk(local_dir):
        for file in files:
            local_path = os.path.join(root, file)
            relative_path = os.path.relpath(local_path, local_dir)
            # Use forward slashes in the object key regardless of OS
            s3_path = f"{prefix}/{relative_path.replace(os.sep, '/')}"
            
            print(f"Uploading {local_path} to {s3_path}")
            s3.upload_file(local_path, bucket_name, s3_path)

# Usage
upload_directory('my-bucket', '/data/raw', 'bronze/source=website')

Streaming Ingestion

# Kinesis Data Firehose to S3
import boto3
import json

firehose = boto3.client('firehose')

def send_to_firehose(stream_name, data):
    response = firehose.put_record(
        DeliveryStreamName=stream_name,
        Record={
            'Data': json.dumps(data) + '\n'
        }
    )
    return response

# Example: Send application events
events = [
    {'event': 'page_view', 'page': '/home', 'user': 'user1'},
    {'event': 'click', 'button': 'buy', 'user': 'user2'},
]

for event in events:
    send_to_firehose('events-stream', event)
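`put_record` makes one API call per record. For higher throughput, Firehose also offers `put_record_batch`, which accepts up to 500 records per call; a minimal batching sketch (the stream name is a placeholder):

```python
import json

def chunk(records, size=500):
    """Split records into batches respecting the Firehose 500-record limit."""
    for i in range(0, len(records), size):
        yield records[i:i + size]

def send_batched(firehose, stream_name, events):
    for batch in chunk(events):
        firehose.put_record_batch(
            DeliveryStreamName=stream_name,
            Records=[{'Data': json.dumps(e) + '\n'} for e in batch]
        )

# The chunking logic alone:
batches = list(chunk(list(range(1200))))
print([len(b) for b in batches])  # [500, 500, 200]
```

In production the response's `FailedPutCount` should also be checked and failed records retried, since `put_record_batch` can partially succeed.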

AWS Glue for ETL

# AWS Glue Job
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext

args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Read from bronze
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="datalake",
    table_name="bronze_events"
)

# Transform
transformed = ApplyMapping.apply(
    frame=dyf,
    mappings=[
        ("event_id", "string", "event_id", "string"),
        ("timestamp", "long", "timestamp", "timestamp"),
        ("user_id", "string", "user_id", "string"),
        ("properties", "map", "properties", "map")
    ]
)

# Write to silver
glueContext.write_dynamic_frame.from_options(
    frame=transformed,
    connection_type="s3",
    connection_options={"path": "s3://datalake/silver/events/"},
    format="parquet"
)

Data Access and Querying

Querying with Amazon Athena

-- Create table for raw JSON data
CREATE EXTERNAL TABLE IF NOT EXISTS bronze_events (
  event_id STRING,
  event_type STRING,
  `timestamp` BIGINT,
  user_id STRING,
  properties STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://datalake/bronze/events/';

-- Query with data transformation (Athena/Presto syntax)
SELECT 
    date_format(from_unixtime("timestamp" / 1000), '%Y-%m-%d') AS event_date,
    event_type,
    COUNT(*) AS event_count,
    COUNT(DISTINCT user_id) AS unique_users
FROM bronze_events
WHERE "timestamp" > to_unixtime(TIMESTAMP '2026-01-01') * 1000
GROUP BY date_format(from_unixtime("timestamp" / 1000), '%Y-%m-%d'), event_type
ORDER BY event_date DESC;

Querying with BigQuery

-- Create external table
CREATE EXTERNAL TABLE datasource.events
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://datalake/gold/events/*.parquet']
);

-- Query external data
SELECT 
    DATE(event_timestamp) as event_date,
    event_type,
    COUNT(*) as events,
    COUNT(DISTINCT user_id) as users
FROM datasource.events
WHERE event_timestamp >= TIMESTAMP('2026-01-01')
GROUP BY 1, 2
ORDER BY events DESC;

Security and Access Control

S3 Bucket Policies

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "RestrictPublicAccess",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::my-bucket",
        "arn:aws:s3:::my-bucket/*"
      ],
      "Condition": {
        "Bool": {
          "aws:SecureTransport": "false"
        }
      }
    },
    {
      "Sid": "AllowReadFromAnalytics",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::123456789012:role/AnalyticsRole"
      },
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::my-bucket/analytics/*",
        "arn:aws:s3:::my-bucket"
      ]
    }
  ]
}

Access Control Lists

# S3 ACL configuration (legacy mechanism - new buckets disable ACLs by
# default; prefer bucket policies and IAM for access control)
import boto3

s3 = boto3.client('s3')

s3.put_bucket_acl(
    Bucket='my-bucket',
    AccessControlPolicy={
        'Grants': [
            {
                'Grantee': {
                    'Type': 'CanonicalUser',
                    'ID': 'partner-canonical-user-id'
                },
                'Permission': 'READ'
            },
            {
                'Grantee': {
                    'Type': 'CanonicalUser',
                    'ID': 'owner-canonical-user-id'
                },
                'Permission': 'FULL_CONTROL'
            }
        ],
        'Owner': {
            'DisplayName': 'owner',
            'ID': 'owner-canonical-user-id'
        }
    }
)

Cost Optimization

Storage Cost Analysis

# Analyze storage costs
import boto3
from datetime import datetime, timezone

ce = boto3.client('ce')

def analyze_storage_costs():
    response = ce.get_cost_and_usage(
        TimePeriod={
            'Start': '2026-01-01',
            'End': '2026-02-01'
        },
        Granularity='DAILY',
        Metrics=['UnblendedCost', 'UsageQuantity'],
        GroupBy=[
            {'Type': 'DIMENSION', 'Key': 'SERVICE'},
            {'Type': 'TAG', 'Key': 'Environment'}
        ],
        Filter={
            'Dimensions': {
                'Key': 'SERVICE',
                'Values': ['Amazon Simple Storage Service']
            }
        }
    )
    
    return response['ResultsByTime']

# Identify optimization opportunities
def find_stale_objects(bucket):
    s3 = boto3.client('s3')
    paginator = s3.get_paginator('list_objects_v2')
    now = datetime.now(timezone.utc)
    
    total_size = 0
    old_objects = []
    
    for page in paginator.paginate(Bucket=bucket):
        if 'Contents' not in page:
            continue
            
        for obj in page['Contents']:
            total_size += obj['Size']
            # Flag objects not modified in 90 days (listings expose
            # LastModified, not last access time)
            age_days = (now - obj['LastModified']).days
            if age_days > 90:
                old_objects.append({
                    'Key': obj['Key'],
                    'Size': obj['Size'],
                    'DaysOld': age_days
                })
    
    return {
        'total_size_gb': total_size / (1024**3),
        'old_objects': old_objects
    }
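The output of an analysis like this can feed a simple savings estimate. A sketch using illustrative per-GB-month prices (the numbers below are assumptions for the example, not quoted AWS rates):

```python
# Assumed per-GB-month prices for illustration only
PRICE_PER_GB = {
    "STANDARD": 0.023,
    "STANDARD_IA": 0.0125,
    "GLACIER": 0.004,
    "DEEP_ARCHIVE": 0.00099,
}

def monthly_cost(size_gb: float, storage_class: str) -> float:
    """Estimated monthly storage cost under the assumed prices."""
    return size_gb * PRICE_PER_GB[storage_class]

# Moving 10 TB of stale data from STANDARD to GLACIER under these prices:
savings = monthly_cost(10_240, "STANDARD") - monthly_cost(10_240, "GLACIER")
print(round(savings, 2))
```

Note that storage price is only part of the picture: archive classes add retrieval fees, minimum storage durations, and per-request charges, so small-object-heavy workloads may not benefit.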

Conclusion

Object storage and data lakes form the foundation of modern cloud data architectures. Understanding storage classes, lifecycle policies, and data lake patterns enables building cost-effective, scalable data platforms.

Key practices include implementing appropriate lifecycle policies to move data to cheaper storage classes, using the bronze/silver/gold architecture for data organization, securing access through bucket policies and IAM, and regularly analyzing storage usage for optimization opportunities.

As organizations generate increasing amounts of data, investing in robust storage architecture pays dividends through improved analytics capabilities, machine learning readiness, and cost efficiency.

