⚡ Calmops

Streaming Architecture: Building Real-Time Data Pipelines with Kafka and Flink

Introduction

Modern applications require processing data in real-time rather than in batch jobs. From fraud detection to personalized recommendations, organizations need immediate insights from their data streams. Streaming architecture, powered by Apache Kafka and Apache Flink, enables building robust real-time data pipelines that process millions of events per second with fault tolerance and exactly-once semantics.

This article explores streaming architecture patterns, the roles of Kafka and Flink, implementation strategies, and best practices for building production-ready real-time systems.

Understanding Streaming Architecture

Batch vs. Stream Processing

Traditional batch processing collects data over time and processes it in scheduled jobs. While simple, this approach introduces latency: data collected today might not be available until tomorrow.

Stream processing handles data as it arrives:

  • Event-by-event processing: Each event is processed individually
  • Windowing: Events are grouped into time windows for aggregations
  • Continuous computation: Pipelines run 24/7 without scheduled jobs
  • Low latency: Results are available in milliseconds to seconds

Core Components

A streaming architecture typically includes:

  1. Event Sources: Databases, IoT devices, application logs, user interactions
  2. Message Broker: Kafka serves as the backbone for event transport
  3. Stream Processing Engine: Flink processes and transforms streams
  4. State Storage: RocksDB, Kafka Streams state stores
  5. Sink Systems: Data warehouses, databases, notification systems

Apache Kafka: The Event Backbone

Apache Kafka is a distributed event streaming platform that serves as the central nervous system for real-time data architectures.

Core Concepts

Topics and Partitions

  • Topics are named streams of events
  • Partitions provide parallelism and fault tolerance
  • Each partition is an ordered, immutable sequence of events
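Key-based routing is what makes partition ordering useful: events with the same key always land on the same partition. The sketch below illustrates the idea with `String.hashCode()`; Kafka's default partitioner actually hashes the serialized key bytes with murmur2, so this is a simplified stand-in, not Kafka's real code.

```java
// Simplified sketch of key-based partition assignment. Kafka's default
// partitioner uses murmur2 over the serialized key bytes; String.hashCode()
// is substituted here purely for illustration.
public class PartitionSketch {
    static int partitionFor(String key, int numPartitions) {
        // Mask off the sign bit so the result is a valid partition index.
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        int partitions = 12;
        // The same key always maps to the same partition, which is what
        // preserves per-key event ordering.
        int p1 = partitionFor("customer-42", partitions);
        int p2 = partitionFor("customer-42", partitions);
        System.out.println(p1 == p2);                   // true: deterministic
        System.out.println(p1 >= 0 && p1 < partitions); // true: valid index
    }
}
```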

Producers and Consumers

  • Producers publish events to topics
  • Consumers subscribe to topics and process events
  • Consumer groups enable parallel processing with coordination

Replication and Fault Tolerance

  • Events are replicated across brokers
  • ISR (In-Sync Replicas) ensure durability
  • Automatic failover maintains availability
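Broker-side replication only guarantees durability if producers ask for it. The sketch below assembles the standard kafka-clients producer settings that pair with replication (`acks=all`, idempotence); it only builds the `Properties` object, since actually constructing a `KafkaProducer` would require the kafka-clients dependency and a reachable cluster.

```java
import java.util.Properties;

// Producer-side settings that complement broker replication. Assembling the
// Properties needs no broker; passing them to a KafkaProducer would require
// the kafka-clients dependency and a running cluster.
public class DurableProducerConfig {
    static Properties durableProps(String bootstrapServers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        // Wait for all in-sync replicas to acknowledge each write.
        props.put("acks", "all");
        // Idempotent producer avoids duplicates when retries fire.
        props.put("enable.idempotence", "true");
        props.put("retries", Integer.toString(Integer.MAX_VALUE));
        return props;
    }

    public static void main(String[] args) {
        Properties props = durableProps("localhost:9092");
        System.out.println(props.getProperty("acks")); // all
    }
}
```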

Kafka Architecture

[Producers] → [Kafka Cluster] → [Consumers]
                  ↓
            [Partitions]
                  ↓
            [Replicas]

Performance Characteristics

Metric        Value
Throughput    Millions of events/second
Latency       < 5 ms end-to-end
Durability    Configurable replication (3x default)
Scalability   Horizontal partition scaling

Apache Flink: The Stream Processor

Apache Flink is a distributed stream processing framework that provides sophisticated event-time processing, windowing, and exactly-once guarantees.

Core Capabilities

Event Time Processing

  • Processes events based on occurrence time, not arrival time
  • Watermarks handle late-arriving events
  • Window functions operate on event time
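The watermark rule itself is simple arithmetic: the watermark trails the highest event timestamp seen so far by a fixed tolerance. This is a minimal re-implementation of the behavior of Flink's `WatermarkStrategy.forBoundedOutOfOrderness`, written standalone for illustration rather than using the Flink API.

```java
// Minimal sketch of the bounded-out-of-orderness rule: the watermark lags
// the maximum observed event timestamp by a fixed delay, and never moves
// backwards when late events arrive.
public class WatermarkSketch {
    private long maxTimestamp = Long.MIN_VALUE;
    private final long maxOutOfOrdernessMs;

    WatermarkSketch(long maxOutOfOrdernessMs) {
        this.maxOutOfOrdernessMs = maxOutOfOrdernessMs;
    }

    // Called for each event; returns the current watermark.
    long onEvent(long eventTimestampMs) {
        maxTimestamp = Math.max(maxTimestamp, eventTimestampMs);
        return maxTimestamp - maxOutOfOrdernessMs;
    }

    public static void main(String[] args) {
        WatermarkSketch wm = new WatermarkSketch(10_000); // 10s tolerance
        System.out.println(wm.onEvent(100_000)); // 90000
        // A late event does not move the watermark backwards.
        System.out.println(wm.onEvent(95_000));  // 90000
        System.out.println(wm.onEvent(120_000)); // 110000
    }
}
```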

Windowing

  • Tumbling windows: Fixed-size, non-overlapping
  • Sliding windows: Fixed-size, overlapping
  • Session windows: Based on activity gaps
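Window assignment for the first two types is pure arithmetic, sketched below: a tumbling window of size s containing timestamp t starts at t - (t % s), and a sliding window of size s with slide d places each event into s/d overlapping windows. This mirrors what Flink's window assigners compute, but is written standalone for illustration.

```java
// Sketch of window-assignment arithmetic for tumbling and sliding windows.
public class WindowAssignment {
    // Start of the tumbling window of the given size containing timestampMs.
    static long tumblingWindowStart(long timestampMs, long sizeMs) {
        return timestampMs - (timestampMs % sizeMs);
    }

    // Start timestamps of every sliding window containing timestampMs.
    static long[] slidingWindowStarts(long timestampMs, long sizeMs, long slideMs) {
        int count = (int) (sizeMs / slideMs); // windows per event
        long[] starts = new long[count];
        long lastStart = timestampMs - (timestampMs % slideMs);
        for (int i = 0; i < count; i++) {
            starts[i] = lastStart - i * slideMs;
        }
        return starts;
    }

    public static void main(String[] args) {
        long fiveMin = 5 * 60 * 1000;
        // An event at t=7:03 (423000 ms) falls in the 7:00-7:05 tumbling window.
        System.out.println(tumblingWindowStart(423_000, fiveMin)); // 300000
        // With size 10 min and slide 1 min, each event is in 10 windows.
        System.out.println(slidingWindowStarts(423_000, 600_000, 60_000).length); // 10
    }
}
```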

Exactly-Once Semantics

  • Checkpointing ensures exactly-once processing
  • Two-phase commit for sinks supporting it
  • Stateful operators maintain processing state

[Source] → [Transformation] → [Window] → [Sink]
              ↓
        [State Backend]
              ↓
        [Checkpointing]

Key APIs

DataStream API

DataStream<String> input = env.addSource(new FlinkKafkaConsumer<>(...));
DataStream<Order> parsed = input
    .map(json -> parseOrder(json))
    .keyBy(Order::getCustomerId)
    .window(TumblingEventTimeWindows.of(Time.minutes(5)))
    .sum("totalAmount");

Table API / SQL

SELECT 
    TUMBLE_START(event_time, INTERVAL '5' MINUTE) AS window_start,
    customer_id,
    SUM(total_amount) AS total
FROM orders
GROUP BY 
    TUMBLE(event_time, INTERVAL '5' MINUTE),
    customer_id

Building Real-Time Pipelines

Pattern 1: Event Aggregation

Continuously aggregate metrics across time windows:

DataStream<Metric> metrics = env.addSource(metricsSource);
metrics
    .keyBy(Metric::getSensorId)
    .window(SlidingEventTimeWindows.of(Time.minutes(10), Time.minutes(1)))
    .reduce((a, b) -> {
        a.setValue(a.getValue() + b.getValue());
        return a;
    })
    .addSink(new MetricsSink());

Pattern 2: Pattern Detection

Detect sequences of events matching patterns:

Pattern<Transaction, ?> pattern = Pattern.<Transaction>begin("start")
    .where(evt -> evt.getAmount() > 10000)
    .next("suspicious")
    .where(evt -> evt.getLocation() != null && 
                  !evt.getLocation().equals(evt.getPreviousLocation()))
    .within(Time.minutes(1));

DataStream<Alert> alerts = CEP.pattern(transactions, pattern)
    .select(match -> new Alert(
        "Possible fraud: " + match.get("suspicious").get(0).getCustomerId()));

Pattern 3: Stream Joins

Join multiple streams:

DataStream<Order> orders = env.addSource(orderSource);
DataStream<Inventory> inventory = env.addSource(inventorySource);

DataStream<OrderConfirmation> confirmed = orders
    .join(inventory)
    .where(Order::getProductId)
    .equalTo(Inventory::getProductId)
    .window(TumblingEventTimeWindows.of(Time.minutes(5)))
    .apply((Order o, Inventory i, Collector<OrderConfirmation> out) -> {
        // Only confirm orders the current inventory can fulfill.
        if (o.getQuantity() <= i.getAvailable()) {
            out.collect(new OrderConfirmation(o, i));
        }
    });

Pattern 4: Real-Time ETL

Continuous data movement with transformations:

DataStream<String> raw = env.addSource(kafkaSource);
raw
    .map(JSON::parseObject)
    .filter(obj -> obj.getString("status").equals("active"))
    .map(obj -> {
        obj.remove("sensitive_data");
        obj.put("processed_at", System.currentTimeMillis());
        return obj.toJSONString();
    })
    .addSink(kafkaSink);

Streaming with AI/ML Integration

Feature Store Integration

Stream processing can compute ML features in real-time:

DataStream<UserEvent> events = env.addSource(userEventSource);
events
    .keyBy(UserEvent::getUserId)
    .flatMap(new FeatureExtractor(featureStore))
    .addSink(featureStoreSink);

Real-Time Inference

Process predictions as events flow through the system:

DataStream<PredictionRequest> requests = env.addSource(requestSource);
DataStream<PredictionResult> predictions = requests
    .map(req -> {
        req.setFeatures(featureStore.getFeatures(req.getUserId()));
        return mlModel.predict(req);
    })
    .filter(result -> result.getConfidence() > 0.8);
predictions.addSink(notificationSink);

The Flink Agents project enables running AI agents at scale:

  • Long-running, system-triggered AI agents
  • Integration with LLMs, tools, and emerging protocols
  • Enterprise-grade Agentic AI deployment

Handling Stateful Streaming

State Backends

Flink manages state using pluggable backends:

Backend                        Use Case
HashMapStateBackend            Small state held on the JVM heap; fastest access
EmbeddedRocksDBStateBackend    Large state spilled to local disk; supports incremental checkpoints

Checkpointing

Enable fault tolerance with periodic checkpoints:

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(60000); // Checkpoint every minute
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30000);
env.getCheckpointConfig().setCheckpointTimeout(300000);
env.getCheckpointConfig().setTolerableCheckpointFailureNumber(3);

Savepoints

Create savepoints for deployment and migration:

flink savepoint <job-id> <target-directory>

Scaling Considerations

Kafka Partition Strategy

Choose partition keys based on query patterns:

  • User-based: user_id for per-user processing
  • Geographic: region_id for regional aggregation
  • Composite: Combine keys for balanced distribution
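A composite key can be sketched as simple string concatenation before hashing: combining region and user id spreads a hot region across partitions while still keeping each user's events on one partition. As before, `String.hashCode()` stands in for Kafka's actual murmur2 partitioner, and the key format is illustrative.

```java
// Sketch of a composite partition key. Concatenating region and user id
// spreads a busy region across partitions while preserving per-user ordering.
public class CompositeKey {
    static String compositeKey(String regionId, String userId) {
        return regionId + ":" + userId;
    }

    static int partitionFor(String key, int numPartitions) {
        // Illustration only; Kafka's default partitioner hashes with murmur2.
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        int partitions = 12;
        int pA = partitionFor(compositeKey("eu-west", "user-1"), partitions);
        int pB = partitionFor(compositeKey("eu-west", "user-2"), partitions);
        // Different users in the same region may land on different partitions,
        // so one busy region no longer saturates a single partition.
        System.out.println(pA >= 0 && pA < partitions); // true
        System.out.println(pB >= 0 && pB < partitions); // true
    }
}
```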

Flink Parallelism

Scale processing by adjusting parallelism:

env.setParallelism(4); // Global parallelism
source.setParallelism(8); // Source parallelism

Capacity Planning

Estimate required resources:

Throughput        Partitions   Flink Task Managers   Memory
10K events/sec    10-20        2-4                   16GB
100K events/sec   50-100       8-16                  64GB
1M events/sec     200+         32+                   256GB+
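Estimates like those in the table reduce to simple division once you pick per-unit capacities. The sketch below assumes roughly 1,000 events/sec per partition and 10,000 events/sec per task manager; both figures are illustrative assumptions chosen to be consistent with the table, not vendor benchmarks.

```java
// Back-of-the-envelope capacity sizing. The per-unit capacities are
// assumptions for illustration, not measured benchmarks.
public class CapacityEstimate {
    static final long EVENTS_PER_PARTITION = 1_000;      // assumed
    static final long EVENTS_PER_TASK_MANAGER = 10_000;  // assumed

    static long partitionsNeeded(long eventsPerSec) {
        // Ceiling division without floating point.
        return Math.max(1, (eventsPerSec + EVENTS_PER_PARTITION - 1) / EVENTS_PER_PARTITION);
    }

    static long taskManagersNeeded(long eventsPerSec) {
        // Keep at least two task managers for failover headroom.
        return Math.max(2, (eventsPerSec + EVENTS_PER_TASK_MANAGER - 1) / EVENTS_PER_TASK_MANAGER);
    }

    public static void main(String[] args) {
        System.out.println(partitionsNeeded(100_000));   // 100
        System.out.println(taskManagersNeeded(100_000)); // 10
    }
}
```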

Best Practices

  1. Design for late data: Use watermarks with allowed lateness
  2. Monitor processing lag: Track event-time vs. processing-time gap
  3. Backpressure handling: Design for burst traffic
  4. State size management: Regularly clear unnecessary state
  5. Schema evolution: Plan for schema changes in events
  6. Testing: Use Flink’s testing utilities for stream jobs
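The lag metric from practice 2 is just the gap between wall-clock time and the event's own timestamp; when it grows, the pipeline is falling behind. A minimal sketch, with the 30-second alert threshold being an arbitrary illustrative value:

```java
// Sketch of the event-time vs. processing-time gap: lag is wall-clock "now"
// minus the event's own timestamp, floored at zero.
public class LagMonitor {
    static long lagMs(long eventTimestampMs, long nowMs) {
        return Math.max(0, nowMs - eventTimestampMs);
    }

    public static void main(String[] args) {
        long now = 1_700_000_060_000L;
        long eventTs = 1_700_000_000_000L; // event produced 60s earlier
        long lag = lagMs(eventTs, now);
        System.out.println(lag);          // 60000
        // An arbitrary 30s threshold for illustration.
        System.out.println(lag > 30_000); // true: would trigger a lag alert
    }
}
```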

Challenges and Solutions

Challenge: Out-of-Order Events

Solution: Use event-time processing with watermarks:

// Event time is the default since Flink 1.12; attach the watermark
// strategy directly to the stream rather than the environment.
DataStream<Event> events = env
    .addSource(eventSource)
    .assignTimestampsAndWatermarks(
        WatermarkStrategy
            .<Event>forBoundedOutOfOrderness(Duration.ofSeconds(10))
            .withTimestampAssigner((event, timestamp) -> event.getTimestamp()));

Challenge: Exactly-Once Sinks

Solution: Use transactional sinks with two-phase commit:

KafkaSink<String> sink = KafkaSink.<String>builder()
    .setBootstrapServers("localhost:9092")
    .setRecordSerializer(KafkaRecordSerializationSchema.builder()
        .setTopic("orders-out") // example topic name
        .setValueSerializationSchema(new SimpleStringSchema())
        .build())
    .setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
    .setTransactionalIdPrefix("flink-")
    .build();

Challenge: State Explosion

Solution: Implement state TTL and regular cleanup:

StateTtlConfig ttlConfig = StateTtlConfig.newBuilder(Duration.ofHours(24))
    .setUpdateType(StateTtlConfig.UpdateType.OnReadAndWrite)
    .setStateVisibility(StateTtlConfig.StateVisibility.ReturnExpiredIfNotCleanedUp)
    .build();

Conclusion

Streaming architecture with Kafka and Flink enables organizations to process data in real-time at massive scale. Kafka provides the durable, scalable event backbone while Flink offers sophisticated stream processing capabilities. Together, they power real-time analytics, ML pipelines, and event-driven applications that transform raw events into actionable insights in milliseconds.
