Introduction
Big data technologies enable processing massive datasets that exceed traditional database capabilities. From batch processing to real-time analytics, understanding these tools is essential for modern data engineering.
What Is Big Data
The Three Vs
- Volume: Massive amounts of data
- Velocity: High-speed data generation
- Variety: Different data types and sources
Two more Vs were later added:
- Veracity: Data quality
- Value: Extracting useful insights
When Big Data Is Needed
- Datasets exceeding single machine storage
- Need for parallel processing
- Real-time analytics requirements
- Unstructured data processing
Hadoop Ecosystem
Core Components
HDFS (Hadoop Distributed File System):
# Start HDFS
start-dfs.sh
# Put file
hdfs dfs -put data.txt /input/
# List files
hdfs dfs -ls /
# Read file
hdfs dfs -cat /input/data.txt
MapReduce:
# Word count mapper
def mapper(key, value):
    for word in value.split():
        yield (word, 1)
# Word count reducer
def reducer(key, values):
    yield (key, sum(values))
# Submit job
hadoop jar wordcount.jar WordCount /input /output
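The mapper and reducer above can be exercised without a cluster. Below is a toy driver that imitates the map, shuffle, and reduce phases in plain Python (the functions are restated so the sketch is self-contained; this is the programming model only, not Hadoop's actual runtime):

```python
from collections import defaultdict

def mapper(key, value):
    # Emit (word, 1) for every word in the input line
    for word in value.split():
        yield (word, 1)

def reducer(key, values):
    # Sum all counts emitted for one word
    yield (key, sum(values))

def run_job(lines):
    # Map phase: apply the mapper to each (line_number, line) pair
    mapped = [pair for i, line in enumerate(lines) for pair in mapper(i, line)]
    # Shuffle phase: group all values by key
    groups = defaultdict(list)
    for k, v in mapped:
        groups[k].append(v)
    # Reduce phase: one reducer call per key
    return dict(pair for k, vs in groups.items() for pair in reducer(k, vs))

counts = run_job(["big data is big", "data at scale"])
# counts == {'big': 2, 'data': 2, 'is': 1, 'at': 1, 'scale': 1}
```

The shuffle step is the expensive part on a real cluster: Hadoop sorts and moves intermediate pairs across the network so that all values for one key land on the same reducer.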
Hadoop Ecosystem Tools
| Tool | Purpose |
|---|---|
| Hive | SQL-like queries |
| Pig | Data flow scripting |
| HBase | NoSQL database |
| Spark | In-memory processing |
| Flink | Stream processing |
Apache Spark
Why Spark
- Up to 100x faster than MapReduce for workloads that fit in memory
- In-memory processing
- Easy API (Python, Scala, Java, R)
- Unified batch and streaming
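Much of Spark's speed advantage comes from lazy evaluation: transformations only record a plan, and nothing executes until an action is called, so the engine can fuse steps and keep intermediate data in memory. A toy Python sketch of the idea (this mimics the API shape, not Spark's implementation):

```python
class LazyDataset:
    def __init__(self, data, ops=None):
        self._data = data
        self._ops = ops or []  # recorded transformations, not yet executed

    def map(self, fn):
        # Transformation: record the step, do no work yet
        return LazyDataset(self._data, self._ops + [("map", fn)])

    def filter(self, pred):
        return LazyDataset(self._data, self._ops + [("filter", pred)])

    def collect(self):
        # Action: run the whole recorded plan in a single streamed pass
        out = self._data
        for kind, fn in self._ops:
            out = map(fn, out) if kind == "map" else filter(fn, out)
        return list(out)

ds = LazyDataset(range(10)).map(lambda x: x * 2).filter(lambda x: x > 10)
result = ds.collect()  # [12, 14, 16, 18]
```

Because the plan is only materialized at `collect()`, no intermediate list is ever built for the `map` step, which is the same reason Spark avoids writing intermediate results to disk between stages.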
Spark Architecture
Driver
  ↓
Cluster Manager (YARN, Mesos, Kubernetes)
  ↓
Executors (Worker Nodes)
Spark DataFrames
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("Analytics") \
    .getOrCreate()
# Read data
df = spark.read.csv("data.csv", header=True)
# Show schema
df.printSchema()
# Transformations
result = df.filter(df.age > 21) \
    .groupBy("city") \
    .count() \
    .orderBy("count", ascending=False)
# Write output
result.write.parquet("output/")
Spark SQL
# Create temporary view
df.createOrReplaceTempView("users")
# SQL query
results = spark.sql("""
SELECT city, COUNT(*) as count
FROM users
WHERE age > 21
GROUP BY city
ORDER BY count DESC
""")
Structured Streaming
from pyspark.sql.functions import explode, split
# Read streaming data
streaming_df = spark.readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .load()
# Process
word_counts = streaming_df \
    .select(explode(split("value", " ")).alias("word")) \
    .groupBy("word") \
    .count()
# Write stream
query = word_counts.writeStream \
    .outputMode("complete") \
    .format("console") \
    .start()
query.awaitTermination()
Data Lakes
What Is a Data Lake
A data lake is a central repository that stores raw data, in its native format, at any scale.
Architecture
┌─────────────────────────────────────┐
│            Data Sources             │
│     (DB, Logs, APIs, IoT, Files)    │
└──────────────────┬──────────────────┘
                   ↓
┌─────────────────────────────────────┐
│           Ingestion Layer           │
│         (Kafka, Flume, NiFi)        │
└──────────────────┬──────────────────┘
                   ↓
┌─────────────────────────────────────┐
│            Storage Layer            │
│         (S3, HDFS, GCS, ADLS)       │
└──────────────────┬──────────────────┘
                   ↓
┌─────────────────────────────────────┐
│          Processing Layer           │
│         (Spark, Flink, dbt)         │
└──────────────────┬──────────────────┘
                   ↓
┌─────────────────────────────────────┐
│            Output Layer             │
│       (Data Warehouse, Apps)        │
└─────────────────────────────────────┘
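In the storage layer, data is conventionally laid out under partitioned key prefixes (Hive-style `year=/month=` paths) so query engines can skip whole prefixes at read time. A small sketch of building such keys (the bucket and field names here are illustrative, not from the text above):

```python
import json
from collections import defaultdict

def partition_key(record, prefix="s3://lake/events"):
    # Hive-style partition path derived from the record's timestamp;
    # engines prune any prefix that can't match the query's filter
    return f"{prefix}/year={record['ts'][:4]}/month={record['ts'][5:7]}/part-0000.json"

records = [
    {"ts": "2026-01-15", "event": "click"},
    {"ts": "2026-01-20", "event": "view"},
    {"ts": "2026-02-01", "event": "click"},
]

# Group raw records by their destination partition
files = defaultdict(list)
for r in records:
    files[partition_key(r)].append(json.dumps(r))

# Two partitions result: .../year=2026/month=01/ and .../year=2026/month=02/
```

A query filtered to `month = 02` would then only read one of the two prefixes, which is the main lever for keeping scans cheap on commodity object storage.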
Data Lake vs Data Warehouse
| Aspect | Data Lake | Data Warehouse |
|---|---|---|
| Data Type | Raw, any format | Structured |
| Users | Data scientists | Business analysts |
| Schema | On-read | On-write |
| Cost | Lower (commodity storage) | Higher |
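The schema row in the table is worth making concrete: a warehouse validates the schema when data is written (schema-on-write), while a lake stores raw bytes and applies a schema only when the data is read. A plain-Python sketch of schema-on-read over raw JSON lines (the field names are illustrative):

```python
import json

# Raw lines land in the lake exactly as produced;
# inconsistent records are accepted at write time
raw_lines = [
    '{"user": "a", "age": 30}',
    '{"user": "b"}',                # missing field: still stored
    '{"user": "c", "age": "n/a"}',  # wrong type: still stored
]

def read_with_schema(lines):
    # Schema-on-read: validate and coerce only at consumption time
    for line in lines:
        rec = json.loads(line)
        age = rec.get("age")
        yield {"user": rec["user"], "age": age if isinstance(age, int) else None}

rows = list(read_with_schema(raw_lines))
# rows == [{'user': 'a', 'age': 30},
#          {'user': 'b', 'age': None},
#          {'user': 'c', 'age': None}]
```

The trade-off is exactly the one in the table: writes are cheap and flexible, but every reader must carry its own validation logic, which is why lakes suit exploratory data-science work and warehouses suit standardized reporting.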
Stream Processing
What Is Streaming
Stream processing handles data continuously as it arrives, in near real time, rather than in periodic batches.
Apache Kafka
import json
from kafka import KafkaProducer, KafkaConsumer
# Producer
producer = KafkaProducer(
    bootstrap_servers=['localhost:9092'],
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)
producer.send('events', {'event': 'click', 'user': '123'})
# Consumer
consumer = KafkaConsumer(
    'events',
    bootstrap_servers=['localhost:9092'],
    group_id='my-group',
    value_deserializer=lambda m: json.loads(m.decode('utf-8'))
)
for message in consumer:
    process(message.value)  # application-specific handler
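Note that the serializer and deserializer above must mirror each other: Kafka itself only moves opaque bytes, so the producer and consumer agree on the encoding out of band. The round trip can be checked in isolation, without a broker:

```python
import json

# The same lambdas used in the producer/consumer configuration above
value_serializer = lambda v: json.dumps(v).encode('utf-8')
value_deserializer = lambda m: json.loads(m.decode('utf-8'))

event = {'event': 'click', 'user': '123'}
wire_bytes = value_serializer(event)   # the bytes that actually hit the topic
assert value_deserializer(wire_bytes) == event
```

Keeping this encoding contract explicit (or moving it into a schema registry with Avro/Protobuf) is what prevents silent consumer breakage when producers change their payload format.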
Kafka Streams
Kafka Streams is a JVM-only client library (Java/Scala); there is no official Python port, and kafka-python does not provide a KafkaStreams class. The same transform-and-produce topology in Java:
// Transform and produce
StreamsBuilder builder = new StreamsBuilder();
builder.stream("input-topic")
       .mapValues(value -> value.toString().toUpperCase())
       .to("output-topic");
Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-app");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start();
Cloud Big Data Services
AWS
- EMR: Managed Hadoop/Spark
- Athena: SQL on S3
- Redshift: Data warehouse
- Kinesis: Streaming
Google Cloud
- Dataproc: Managed Spark/Hadoop
- BigQuery: Serverless data warehouse
- Dataflow: Stream/batch processing
Azure
- HDInsight: Managed Hadoop
- Synapse: Analytics workspace
- Databricks: Spark platform
ETL and Data Pipelines
Batch ETL
# Airflow DAG example
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator
dag = DAG('etl_pipeline', start_date=datetime(2026, 1, 1))
extract = PythonOperator(
    task_id='extract',
    python_callable=extract_data,
    dag=dag)
transform = PythonOperator(
    task_id='transform',
    python_callable=transform_data,
    dag=dag)
load = PythonOperator(
    task_id='load',
    python_callable=load_data,
    dag=dag)
extract >> transform >> load
Streaming ETL
# Using Apache Flink (PyFlink DataStream API)
from pyflink.common import Time
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.window import TumblingEventTimeWindows
env = StreamExecutionEnvironment.get_execution_environment()
env.set_parallelism(1)
# FlumeSource/Sink are placeholders: substitute real connector classes
data = env.add_source(FlumeSource())
result = data \
    .map(lambda x: process(x)) \
    .filter(lambda x: x.is_valid) \
    .key_by(lambda x: x.category) \
    .window(TumblingEventTimeWindows.of(Time.minutes(5))) \
    .sum(1)
result.add_sink(Sink())
env.execute()
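The 5-minute tumbling window in the Flink job above can be understood with a toy sketch: each event's timestamp is floored to its window start, and values are aggregated per (key, window) pair. This illustrates the windowing arithmetic only, not Flink's runtime (which also handles event-time watermarks and late data):

```python
from collections import Counter

WINDOW_MS = 5 * 60 * 1000  # 5-minute tumbling windows

def window_start(ts_ms):
    # Floor the event timestamp to its window boundary
    return ts_ms - (ts_ms % WINDOW_MS)

# (key, event-time in ms) pairs, illustrative data
events = [
    ("clicks", 10_000),    # falls in the 0s..300s window
    ("clicks", 290_000),   # same window
    ("clicks", 310_000),   # next window (300s..600s)
    ("views",  20_000),
]

counts = Counter((key, window_start(ts)) for key, ts in events)
# ("clicks", 0) -> 2, ("clicks", 300000) -> 1, ("views", 0) -> 1
```

Tumbling windows never overlap, so every event belongs to exactly one window; sliding windows relax that by letting windows overlap at a fixed slide interval.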
Conclusion
Big data technologies solve processing problems at scale. Start by understanding your data and requirements, then choose tools to match. Spark has become a de facto default for general-purpose processing, while managed cloud services remove much of the operational burden.