
Understanding Big Data Technologies

Introduction

Big data technologies enable processing massive datasets that exceed traditional database capabilities. From batch processing to real-time analytics, understanding these tools is essential for modern data engineering.

What Is Big Data

The Three Vs

  • Volume: Massive amounts of data
  • Velocity: High-speed data generation
  • Variety: Different data types and sources

Two more Vs were added later:

  • Veracity: Data quality
  • Value: Extracting useful insights

When Big Data Is Needed

  • Datasets exceeding single machine storage
  • Need for parallel processing
  • Real-time analytics requirements
  • Unstructured data processing
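As a rough illustration of the first point, a back-of-envelope estimate (all numbers hypothetical) shows how quickly event data outgrows a single server:

```python
# Hypothetical workload: estimate raw storage needs for one year of events
events_per_day = 500_000_000   # 500M events/day
bytes_per_event = 1_000        # ~1 KB per JSON event
retention_days = 365

total_bytes = events_per_day * bytes_per_event * retention_days
total_tb = total_bytes / 1e12
print(total_tb)  # 182.5 TB -- well past a single server's disks
```

At that scale, both storage and processing must be spread across many machines, which is exactly what HDFS and Spark provide.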

Hadoop Ecosystem

Core Components

HDFS (Hadoop Distributed File System):

# Start HDFS
start-dfs.sh

# Put file
hdfs dfs -put data.txt /input/

# List files
hdfs dfs -ls /

# Read file
hdfs dfs -cat /input/data.txt

MapReduce:

# Word count mapper
def mapper(key, value):
    for word in value.split():
        yield (word, 1)

# Word count reducer
def reducer(key, values):
    yield (key, sum(values))

# Submit job
hadoop jar wordcount.jar WordCount /input /output
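The map, shuffle, and reduce phases above can be simulated locally in plain Python. A minimal sketch (the `run_mapreduce` driver and its in-memory shuffle are illustrative, not part of Hadoop):

```python
from collections import defaultdict

def mapper(key, value):
    for word in value.split():
        yield (word, 1)

def reducer(key, values):
    yield (key, sum(values))

def run_mapreduce(records, mapper, reducer):
    # Map phase: emit (key, value) pairs from each input record
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)  # shuffle: group values by key
    # Reduce phase: combine all values collected for each key
    results = {}
    for key, values in groups.items():
        for out_key, out_value in reducer(key, values):
            results[out_key] = out_value
    return results

counts = run_mapreduce([(1, "big data big ideas")], mapper, reducer)
print(counts)  # {'big': 2, 'data': 1, 'ideas': 1}
```

Hadoop runs the same three phases, but distributes the map and reduce tasks across the cluster and shuffles intermediate pairs over the network.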

Hadoop Ecosystem Tools

Tool  | Purpose
------|----------------------
Hive  | SQL-like queries
Pig   | Data flow scripting
HBase | NoSQL database
Spark | In-memory processing
Flink | Stream processing

Apache Spark

Why Spark

  • Up to 100x faster than MapReduce for in-memory workloads
  • In-memory processing
  • Easy API (Python, Scala, Java, R)
  • Unified batch and streaming

Spark Architecture

Driver
  ↓
Cluster Manager (YARN, Mesos, Kubernetes)
  ↓
Executors (Worker Nodes)

Spark DataFrames

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Analytics") \
    .getOrCreate()

# Read data
df = spark.read.csv("data.csv", header=True)

# Show schema
df.printSchema()

# Transformations
result = df.filter(df.age > 21) \
    .groupBy("city") \
    .count() \
    .orderBy("count", ascending=False)

# Write output
result.write.parquet("output/")

Spark SQL

# Create temporary view
df.createOrReplaceTempView("users")

# SQL query
results = spark.sql("""
    SELECT city, COUNT(*) as count 
    FROM users 
    WHERE age > 21 
    GROUP BY city 
    ORDER BY count DESC
""")

Structured Streaming

from pyspark.sql.functions import explode, split

# Read streaming data from a socket source
streaming_df = spark.readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .load()

# Process: split each line into words and keep a running count
word_counts = streaming_df \
    .select(explode(split("value", " ")).alias("word")) \
    .groupBy("word") \
    .count()

# Write stream
query = word_counts.writeStream \
    .format("console") \
    .start()

query.awaitTermination()

Data Lakes

What Is a Data Lake

Central repository storing raw data at any scale.

Architecture

┌─────────────────────────────────────┐
│           Data Sources              │
│    (DB, Logs, APIs, IoT, Files)     │
└──────────────┬──────────────────────┘
               ↓
┌─────────────────────────────────────┐
│          Ingestion Layer            │
│       (Kafka, Flume, NiFi)          │
└──────────────┬──────────────────────┘
               ↓
┌─────────────────────────────────────┐
│           Storage Layer             │
│       (S3, HDFS, GCS, ADLS)         │
└──────────────┬──────────────────────┘
               ↓
┌─────────────────────────────────────┐
│          Processing Layer           │
│       (Spark, Flink, dbt)           │
└──────────────┬──────────────────────┘
               ↓
┌─────────────────────────────────────┐
│           Output Layer              │
│      (Data Warehouse, Apps)         │
└─────────────────────────────────────┘

Data Lake vs Data Warehouse

Aspect    | Data Lake                 | Data Warehouse
----------|---------------------------|------------------
Data Type | Raw, any format           | Structured
Users     | Data scientists           | Business analysts
Schema    | On-read                   | On-write
Cost      | Lower (commodity storage) | Higher
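The Schema row is the key operational difference. A minimal schema-on-read sketch in plain Python (the JSON lines and the `read_with_schema` helper are hypothetical), assuming raw events were written to the lake unvalidated:

```python
import json

# Raw events land in the lake as-is -- no schema is enforced at write time
raw_lines = [
    '{"user": "123", "age": 34, "city": "Oslo"}',
    '{"user": "456", "city": "Lima"}',   # missing field: still stored
]

# Schema-on-read: each consumer applies its own schema at query time,
# filling in defaults for fields a record lacks
def read_with_schema(lines, schema):
    for line in lines:
        record = json.loads(line)
        yield {field: record.get(field, default) for field, default in schema.items()}

rows = list(read_with_schema(raw_lines, {"user": None, "age": 0}))
print(rows)  # [{'user': '123', 'age': 34}, {'user': '456', 'age': 0}]
```

A warehouse would instead reject or coerce the second record at load time (schema-on-write), which keeps queries simple but makes ingestion stricter and more expensive.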

Stream Processing

What Is Streaming

Processing data in real-time as it arrives.
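The contrast with batch can be shown in a few lines of plain Python (no framework; `stream_counts` is illustrative): state is updated per event, so a current answer exists after every arrival instead of only after a full dataset scan:

```python
from collections import Counter

def stream_counts(events):
    # State is updated per event; a result snapshot is available after
    # every arrival, not only once the whole dataset has been read
    counts = Counter()
    for event in events:
        counts[event] += 1
        yield dict(counts)  # snapshot of the running aggregate

snapshots = list(stream_counts(["click", "view", "click"]))
print(snapshots[-1])  # {'click': 2, 'view': 1}
```

Real stream processors like Kafka Streams and Flink add what this sketch omits: distributed state, fault tolerance, and event-time windowing.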

Apache Kafka

import json

from kafka import KafkaProducer, KafkaConsumer

# Producer
producer = KafkaProducer(
    bootstrap_servers=['localhost:9092'],
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

producer.send('events', {'event': 'click', 'user': '123'})

# Consumer
consumer = KafkaConsumer(
    'events',
    bootstrap_servers=['localhost:9092'],
    group_id='my-group',
    value_deserializer=lambda m: json.loads(m.decode('utf-8'))
)

for message in consumer:
    process(message.value)

Kafka Streams

Kafka Streams is a Java/Scala library; there is no official Python client, and the kafka-python package has no KafkaStreams class. The equivalent pattern in Python is a consume-transform-produce loop:

from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    'input-topic',
    bootstrap_servers=['localhost:9092']
)
producer = KafkaProducer(bootstrap_servers=['localhost:9092'])

# Transform each record and produce it to the output topic
for message in consumer:
    transformed = message.value.decode('utf-8').upper()
    producer.send('output-topic', transformed.encode('utf-8'))

Cloud Big Data Services

AWS

  • EMR: Managed Hadoop/Spark
  • Athena: SQL on S3
  • Redshift: Data warehouse
  • Kinesis: Streaming

Google Cloud

  • Dataproc: Managed Spark/Hadoop
  • BigQuery: Serverless data warehouse
  • Dataflow: Stream/batch processing

Azure

  • HDInsight: Managed Hadoop
  • Synapse: Analytics workspace
  • Databricks: Spark platform

ETL and Data Pipelines

Batch ETL

# Airflow DAG example
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# extract_data, transform_data, and load_data are assumed defined elsewhere
dag = DAG('etl_pipeline', start_date=datetime(2026, 1, 1))

extract = PythonOperator(
    task_id='extract',
    python_callable=extract_data,
    dag=dag)

transform = PythonOperator(
    task_id='transform',
    python_callable=transform_data,
    dag=dag)

load = PythonOperator(
    task_id='load',
    python_callable=load_data,
    dag=dag)

extract >> transform >> load

Streaming ETL

# Using Apache Flink's PyFlink DataStream API
from pyflink.common import Time
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.window import TumblingEventTimeWindows

env = StreamExecutionEnvironment.get_execution_environment()
env.set_parallelism(1)

# Attach a source connector (e.g. Kafka); `source` is assumed built elsewhere
data = env.add_source(source)

# process, combine, and sink are likewise placeholders for pipeline-specific logic
result = data \
    .map(lambda x: process(x)) \
    .filter(lambda x: x.is_valid) \
    .key_by(lambda x: x.category) \
    .window(TumblingEventTimeWindows.of(Time.minutes(5))) \
    .reduce(lambda a, b: combine(a, b))

result.add_sink(sink)
env.execute('streaming_etl')

Conclusion

Big data technologies solve problems at scale. Start with understanding your data and requirements, then choose appropriate tools. Spark has become the default choice, but cloud services simplify operations significantly.

