Introduction
Big data technologies enable processing massive datasets that exceed traditional database capabilities. From batch processing to real-time analytics, understanding these tools is essential for modern data engineering.
What Is Big Data
The Three Vs
- Volume: Massive amounts of data
- Velocity: High-speed data generation
- Variety: Different data types and sources
Two more Vs were later added:
- Veracity: Data quality
- Value: Extracting useful insights
When Big Data Is Needed
- Datasets exceeding single machine storage
- Need for parallel processing
- Real-time analytics requirements
- Unstructured data processing
Hadoop Ecosystem
Core Components
HDFS (Hadoop Distributed File System):
# Start HDFS
start-dfs.sh
# Put file
hdfs dfs -put data.txt /input/
# List files
hdfs dfs -ls /
# Read file
hdfs dfs -cat /input/data.txt
MapReduce:
# Word count mapper
def mapper(key, value):
    for word in value.split():
        yield (word, 1)
# Word count reducer
def reducer(key, values):
    yield (key, sum(values))
# Submit job
hadoop jar wordcount.jar WordCount /input /output
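The mapper and reducer above can be exercised without a cluster. Below is a toy driver that imitates the map, shuffle, and reduce phases in plain Python (the functions are restated so the sketch is self-contained; this is the programming model only, not Hadoop's actual runtime):

```python
from collections import defaultdict

def mapper(key, value):
    # Emit (word, 1) for every word in the input line
    for word in value.split():
        yield (word, 1)

def reducer(key, values):
    # Sum all counts emitted for one word
    yield (key, sum(values))

def run_job(lines):
    # Map phase: apply the mapper to each (line_number, line) pair
    mapped = [pair for i, line in enumerate(lines) for pair in mapper(i, line)]
    # Shuffle phase: group all values by key
    groups = defaultdict(list)
    for k, v in mapped:
        groups[k].append(v)
    # Reduce phase: one reducer call per key
    return dict(pair for k, vs in groups.items() for pair in reducer(k, vs))

counts = run_job(["big data is big", "data at scale"])
# counts == {'big': 2, 'data': 2, 'is': 1, 'at': 1, 'scale': 1}
```

The shuffle step is the expensive part on a real cluster: Hadoop sorts and moves intermediate pairs across the network so that all values for one key land on the same reducer.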
Hadoop Ecosystem Tools
| Tool | Purpose |
|---|---|
| Hive | SQL-like queries |
| Pig | Data flow scripting |
| HBase | NoSQL database |
| Spark | In-memory processing |
| Flink | Stream processing |
Apache Spark
Why Spark
- Up to 100x faster than MapReduce for workloads that fit in memory
- In-memory processing
- Easy API (Python, Scala, Java, R)
- Unified batch and streaming
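Much of Spark's speed advantage comes from lazy evaluation: transformations only record a plan, and nothing executes until an action is called, so the engine can fuse steps and keep intermediate data in memory. A toy Python sketch of the idea (this mimics the API shape, not Spark's implementation):

```python
class LazyDataset:
    def __init__(self, data, ops=None):
        self._data = data
        self._ops = ops or []  # recorded transformations, not yet executed

    def map(self, fn):
        # Transformation: record the step, do no work yet
        return LazyDataset(self._data, self._ops + [("map", fn)])

    def filter(self, pred):
        return LazyDataset(self._data, self._ops + [("filter", pred)])

    def collect(self):
        # Action: run the whole recorded plan in a single streamed pass
        out = self._data
        for kind, fn in self._ops:
            out = map(fn, out) if kind == "map" else filter(fn, out)
        return list(out)

ds = LazyDataset(range(10)).map(lambda x: x * 2).filter(lambda x: x > 10)
result = ds.collect()  # [12, 14, 16, 18]
```

Because the plan is only materialized at `collect()`, no intermediate list is ever built for the `map` step, which is the same reason Spark avoids writing intermediate results to disk between stages.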
Spark Architecture
Driver
  ↓
Cluster Manager (YARN, Mesos, Kubernetes)
  ↓
Executors (Worker Nodes)
Spark DataFrames
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("Analytics") \
    .getOrCreate()
# Read data
df = spark.read.csv("data.csv", header=True)
# Show schema
df.printSchema()
# Transformations
result = df.filter(df.age > 21) \
    .groupBy("city") \
    .count() \
    .orderBy("count", ascending=False)
# Write output
result.write.parquet("output/")
Spark SQL
# Create temporary view
df.createOrReplaceTempView("users")
# SQL query
results = spark.sql("""
SELECT city, COUNT(*) as count
FROM users
WHERE age > 21
GROUP BY city
ORDER BY count DESC
""")
Structured Streaming
from pyspark.sql.functions import explode, split
# Read streaming data
streaming_df = spark.readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .load()
# Process
word_counts = streaming_df \
    .select(explode(split("value", " ")).alias("word")) \
    .groupBy("word") \
    .count()
# Write stream
query = word_counts.writeStream \
    .outputMode("complete") \
    .format("console") \
    .start()
query.awaitTermination()
Data Lakes
What Is a Data Lake
A data lake is a central repository that stores raw data, in its native format, at any scale.
Architecture
┌─────────────────────────────────────┐
│            Data Sources             │
│     (DB, Logs, APIs, IoT, Files)    │
└──────────────────┬──────────────────┘
                   ↓
┌─────────────────────────────────────┐
│           Ingestion Layer           │
│         (Kafka, Flume, NiFi)        │
└──────────────────┬──────────────────┘
                   ↓
┌─────────────────────────────────────┐
│            Storage Layer            │
│         (S3, HDFS, GCS, ADLS)       │
└──────────────────┬──────────────────┘
                   ↓
┌─────────────────────────────────────┐
│          Processing Layer           │
│         (Spark, Flink, dbt)         │
└──────────────────┬──────────────────┘
                   ↓
┌─────────────────────────────────────┐
│            Output Layer             │
│       (Data Warehouse, Apps)        │
└─────────────────────────────────────┘
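In the storage layer, data is conventionally laid out under partitioned key prefixes (Hive-style `year=/month=` paths) so query engines can skip whole prefixes at read time. A small sketch of building such keys (the bucket and field names here are illustrative, not from the text above):

```python
import json
from collections import defaultdict

def partition_key(record, prefix="s3://lake/events"):
    # Hive-style partition path derived from the record's timestamp;
    # engines prune any prefix that can't match the query's filter
    return f"{prefix}/year={record['ts'][:4]}/month={record['ts'][5:7]}/part-0000.json"

records = [
    {"ts": "2026-01-15", "event": "click"},
    {"ts": "2026-01-20", "event": "view"},
    {"ts": "2026-02-01", "event": "click"},
]

# Group raw records by their destination partition
files = defaultdict(list)
for r in records:
    files[partition_key(r)].append(json.dumps(r))

# Two partitions result: .../year=2026/month=01/ and .../year=2026/month=02/
```

A query filtered to `month = 02` would then only read one of the two prefixes, which is the main lever for keeping scans cheap on commodity object storage.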
Data Lake vs Data Warehouse
| Aspect | Data Lake | Data Warehouse |
|---|---|---|
| Data Type | Raw, any format | Structured |
| Users | Data scientists | Business analysts |
| Schema | On-read | On-write |
| Cost | Lower (commodity storage) | Higher |
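The schema row in the table is worth making concrete: a warehouse validates the schema when data is written (schema-on-write), while a lake stores raw bytes and applies a schema only when the data is read. A plain-Python sketch of schema-on-read over raw JSON lines (the field names are illustrative):

```python
import json

# Raw lines land in the lake exactly as produced;
# inconsistent records are accepted at write time
raw_lines = [
    '{"user": "a", "age": 30}',
    '{"user": "b"}',                # missing field: still stored
    '{"user": "c", "age": "n/a"}',  # wrong type: still stored
]

def read_with_schema(lines):
    # Schema-on-read: validate and coerce only at consumption time
    for line in lines:
        rec = json.loads(line)
        age = rec.get("age")
        yield {"user": rec["user"], "age": age if isinstance(age, int) else None}

rows = list(read_with_schema(raw_lines))
# rows == [{'user': 'a', 'age': 30},
#          {'user': 'b', 'age': None},
#          {'user': 'c', 'age': None}]
```

The trade-off is exactly the one in the table: writes are cheap and flexible, but every reader must carry its own validation logic, which is why lakes suit exploratory data-science work and warehouses suit standardized reporting.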
Stream Processing
What Is Streaming
Stream processing handles data continuously as it arrives, in near real time, rather than in periodic batches.
Apache Kafka
import json
from kafka import KafkaProducer, KafkaConsumer
# Producer
producer = KafkaProducer(
    bootstrap_servers=['localhost:9092'],
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)
producer.send('events', {'event': 'click', 'user': '123'})
# Consumer
consumer = KafkaConsumer(
    'events',
    bootstrap_servers=['localhost:9092'],
    group_id='my-group',
    value_deserializer=lambda m: json.loads(m.decode('utf-8'))
)
for message in consumer:
    process(message.value)  # application-specific handler
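Note that the serializer and deserializer above must mirror each other: Kafka itself only moves opaque bytes, so the producer and consumer agree on the encoding out of band. The round trip can be checked in isolation, without a broker:

```python
import json

# The same lambdas used in the producer/consumer configuration above
value_serializer = lambda v: json.dumps(v).encode('utf-8')
value_deserializer = lambda m: json.loads(m.decode('utf-8'))

event = {'event': 'click', 'user': '123'}
wire_bytes = value_serializer(event)   # the bytes that actually hit the topic
assert value_deserializer(wire_bytes) == event
```

Keeping this encoding contract explicit (or moving it into a schema registry with Avro/Protobuf) is what prevents silent consumer breakage when producers change their payload format.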
Kafka Streams
Kafka Streams is a JVM-only client library (Java/Scala); there is no official Python port, and kafka-python does not provide a KafkaStreams class. The same transform-and-produce topology in Java:
// Transform and produce
StreamsBuilder builder = new StreamsBuilder();
builder.stream("input-topic")
       .mapValues(value -> value.toString().toUpperCase())
       .to("output-topic");
Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-app");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start();
Cloud Big Data Services
AWS
- EMR: Managed Hadoop/Spark
- Athena: SQL on S3
- Redshift: Data warehouse
- Kinesis: Streaming
Google Cloud
- Dataproc: Managed Spark/Hadoop
- BigQuery: Serverless data warehouse
- Dataflow: Stream/batch processing
Azure
- HDInsight: Managed Hadoop
- Synapse: Analytics workspace
- Databricks: Spark platform
ETL and Data Pipelines
Batch ETL
# Airflow DAG example
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator
dag = DAG('etl_pipeline', start_date=datetime(2026, 1, 1))
extract = PythonOperator(
    task_id='extract',
    python_callable=extract_data,
    dag=dag)
transform = PythonOperator(
    task_id='transform',
    python_callable=transform_data,
    dag=dag)
load = PythonOperator(
    task_id='load',
    python_callable=load_data,
    dag=dag)
extract >> transform >> load
Streaming ETL
# Using Apache Flink (PyFlink DataStream API)
from pyflink.common import Time
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.window import TumblingEventTimeWindows
env = StreamExecutionEnvironment.get_execution_environment()
env.set_parallelism(1)
# FlumeSource/Sink are placeholders: substitute real connector classes
data = env.add_source(FlumeSource())
result = data \
    .map(lambda x: process(x)) \
    .filter(lambda x: x.is_valid) \
    .key_by(lambda x: x.category) \
    .window(TumblingEventTimeWindows.of(Time.minutes(5))) \
    .sum(1)
result.add_sink(Sink())
env.execute()
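The 5-minute tumbling window in the Flink job above can be understood with a toy sketch: each event's timestamp is floored to its window start, and values are aggregated per (key, window) pair. This illustrates the windowing arithmetic only, not Flink's runtime (which also handles event-time watermarks and late data):

```python
from collections import Counter

WINDOW_MS = 5 * 60 * 1000  # 5-minute tumbling windows

def window_start(ts_ms):
    # Floor the event timestamp to its window boundary
    return ts_ms - (ts_ms % WINDOW_MS)

# (key, event-time in ms) pairs, illustrative data
events = [
    ("clicks", 10_000),    # falls in the 0s..300s window
    ("clicks", 290_000),   # same window
    ("clicks", 310_000),   # next window (300s..600s)
    ("views",  20_000),
]

counts = Counter((key, window_start(ts)) for key, ts in events)
# ("clicks", 0) -> 2, ("clicks", 300000) -> 1, ("views", 0) -> 1
```

Tumbling windows never overlap, so every event belongs to exactly one window; sliding windows relax that by letting windows overlap at a fixed slide interval.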
Conclusion
Big data technologies solve processing problems at scale. Start by understanding your data and requirements, then choose tools to match. Spark has become a de facto default for general-purpose processing, while managed cloud services remove much of the operational burden.