Introduction
Modern applications require processing data in real-time rather than in batch jobs. From fraud detection to personalized recommendations, organizations need immediate insights from their data streams. Streaming architecture, powered by Apache Kafka and Apache Flink, enables building robust real-time data pipelines that process millions of events per second with fault tolerance and exactly-once semantics.
This article explores streaming architecture patterns, the roles of Kafka and Flink, implementation strategies, and best practices for building production-ready real-time systems.
Understanding Streaming Architecture
Batch vs. Stream Processing
Traditional batch processing collects data over time and processes it in scheduled jobs. While simple, this approach introduces latency: data collected today might not be available until tomorrow.
Stream processing handles data as it arrives:
- Event-by-event processing: Each event is processed individually
- Windowing: Events are grouped into time windows for aggregations
- Continuous computation: Pipelines run 24/7 without scheduled jobs
- Low latency: Results are available in milliseconds to seconds
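The contrast can be sketched in plain Java: a streaming consumer updates a running aggregate on every arriving event, so the result is queryable immediately instead of after the next scheduled batch run (the class and method names here are illustrative):

```java
import java.util.List;

// Sketch: a streaming consumer keeps a running aggregate that is
// up to date after every event, instead of waiting for a batch job.
public class RunningSum {
    private long total = 0;

    // Called once per arriving event; the result is immediately queryable.
    public long onEvent(long value) {
        total += value;
        return total;
    }

    public static void main(String[] args) {
        RunningSum sum = new RunningSum();
        for (long v : List.of(5L, 3L, 7L)) {
            System.out.println("running total = " + sum.onEvent(v));
        }
    }
}
```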
Core Components
A streaming architecture typically includes:
- Event Sources: Databases, IoT devices, application logs, user interactions
- Message Broker: Kafka serves as the backbone for event transport
- Stream Processing Engine: Flink processes and transforms streams
- State Storage: RocksDB, Kafka Streams state stores
- Sink Systems: Data warehouses, databases, notification systems
Apache Kafka: The Event Backbone
Apache Kafka is a distributed event streaming platform that serves as the central nervous system for real-time data architectures.
Core Concepts
Topics and Partitions
- Topics are named streams of events
- Partitions provide parallelism and fault tolerance
- Each partition is an ordered, immutable sequence of events
Producers and Consumers
- Producers publish events to topics
- Consumers subscribe to topics and process events
- Consumer groups enable parallel processing with coordination
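How a consumer group splits partition ownership among its members can be sketched in pure Java. This toy round-robin assignment stands in for Kafka's real assignors (range, round-robin, sticky), which live in the client library:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: split numPartitions across numConsumers, round-robin.
// Each partition is owned by exactly one consumer in the group.
public class PartitionAssignment {
    public static List<List<Integer>> assign(int numPartitions, int numConsumers) {
        List<List<Integer>> result = new ArrayList<>();
        for (int c = 0; c < numConsumers; c++) result.add(new ArrayList<>());
        for (int p = 0; p < numPartitions; p++) {
            result.get(p % numConsumers).add(p); // partition p -> consumer p mod N
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(assign(6, 2)); // [[0, 2, 4], [1, 3, 5]]
    }
}
```

Note how a group with more consumers than partitions leaves some consumers idle, which is why partition count caps consumer-group parallelism.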
Replication and Fault Tolerance
- Events are replicated across brokers
- ISR (In-Sync Replicas) ensure durability
- Automatic failover maintains availability
Kafka Architecture
[Producers] → [Kafka Cluster] → [Consumers]
                   |
              [Partitions]
                   |
               [Replicas]
Performance Characteristics
| Metric | Value |
|---|---|
| Throughput | Millions of events/second |
| Latency | < 5ms end-to-end |
| Durability | Configurable replication (3x default) |
| Scalability | Horizontal partition scaling |
Apache Flink: Stream Processing Power
Apache Flink is a distributed stream processing framework that provides sophisticated event-time processing, windowing, and exactly-once guarantees.
Core Capabilities
Event Time Processing
- Processes events based on occurrence time, not arrival time
- Watermarks handle late-arriving events
- Window functions operate on event time
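The watermark rule behind bounded out-of-orderness can be shown with a few lines of arithmetic: the watermark trails the maximum event timestamp seen so far by the allowed lateness, and a late event never moves it backwards. This is a minimal stand-alone sketch of the idea, not Flink's actual `WatermarkGenerator`:

```java
// Sketch of the bounded-out-of-orderness watermark rule:
// watermark = max event timestamp seen so far - allowed out-of-orderness.
public class BoundedWatermark {
    private final long maxOutOfOrdernessMs;
    private long maxTimestamp = Long.MIN_VALUE;

    public BoundedWatermark(long maxOutOfOrdernessMs) {
        this.maxOutOfOrdernessMs = maxOutOfOrdernessMs;
    }

    // Called per event; returns the current watermark.
    public long onEvent(long eventTimestampMs) {
        maxTimestamp = Math.max(maxTimestamp, eventTimestampMs);
        return maxTimestamp - maxOutOfOrdernessMs;
    }

    public static void main(String[] args) {
        BoundedWatermark wm = new BoundedWatermark(10_000);
        System.out.println(wm.onEvent(100_000)); // 90000
        System.out.println(wm.onEvent(95_000));  // still 90000: late event does not move the watermark back
    }
}
```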
Windowing
- Tumbling windows: Fixed-size, non-overlapping
- Sliding windows: Fixed-size, overlapping
- Session windows: Based on activity gaps
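The difference between tumbling and sliding windows comes down to the assignment arithmetic: a tumbling window claims each event exactly once, while a sliding window claims it once per slide step that covers it. A sketch of that arithmetic, mirroring what Flink's window assigners compute internally:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: map an event timestamp to the start time(s) of the
// window(s) that contain it.
public class WindowAssign {
    // Tumbling: exactly one window per event.
    public static long tumblingStart(long ts, long sizeMs) {
        return ts - (ts % sizeMs);
    }

    // Sliding: one window per slide step whose range covers the event.
    public static List<Long> slidingStarts(long ts, long sizeMs, long slideMs) {
        List<Long> starts = new ArrayList<>();
        long lastStart = ts - (ts % slideMs);
        for (long s = lastStart; s > ts - sizeMs; s -= slideMs) {
            starts.add(s);
        }
        return starts;
    }

    public static void main(String[] args) {
        System.out.println(tumblingStart(12_345, 5_000));         // 10000
        System.out.println(slidingStarts(12_345, 10_000, 5_000)); // [10000, 5000]
    }
}
```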
Exactly-Once Semantics
- Checkpointing ensures exactly-once processing
- Two-phase commit for sinks supporting it
- Stateful operators maintain processing state
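The two-phase commit idea can be modeled in a few lines: writes are staged in a pending transaction and only become visible when a checkpoint completes, so a crash before the checkpoint leaks nothing and replayed events land exactly once. This is a toy model of the protocol, not Flink's `TwoPhaseCommitSinkFunction`:

```java
import java.util.ArrayList;
import java.util.List;

// Toy two-phase-commit sink: stage on write, publish on checkpoint.
public class TwoPhaseSink {
    private final List<String> pending = new ArrayList<>();   // pre-committed, invisible
    private final List<String> committed = new ArrayList<>(); // durable, visible

    public void write(String record) { pending.add(record); } // phase 1: stage

    public void onCheckpointComplete() {                      // phase 2: commit
        committed.addAll(pending);
        pending.clear();
    }

    public void onFailure() { pending.clear(); }              // roll back staged writes

    public List<String> visible() { return committed; }

    public static void main(String[] args) {
        TwoPhaseSink sink = new TwoPhaseSink();
        sink.write("a");
        sink.write("b");
        sink.onFailure();            // crash before checkpoint: nothing published
        sink.write("a");             // events replayed after recovery
        sink.write("b");
        sink.onCheckpointComplete();
        System.out.println(sink.visible()); // [a, b] exactly once
    }
}
```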
Flink Architecture
[Source] → [Transformation] → [Window] → [Sink]
                  |
           [State Backend]
                  |
          [Checkpointing]
Key APIs
DataStream API
DataStream<String> input = env.addSource(new FlinkKafkaConsumer<>(...));
DataStream<Order> parsed = input
.map(json -> parseOrder(json))
.keyBy(Order::getCustomerId)
.window(TumblingEventTimeWindows.of(Time.minutes(5)))
.sum("totalAmount");
Table API / SQL
SELECT
TUMBLE_START(event_time, INTERVAL '5' MINUTE) AS window_start,
customer_id,
SUM(total_amount) AS total
FROM orders
GROUP BY
TUMBLE(event_time, INTERVAL '5' MINUTE),
customer_id
Building Real-Time Pipelines
Pattern 1: Event Aggregation
Continuously aggregate metrics across time windows:
DataStream<Metric> metrics = env.addSource(metricsSource);
metrics
.keyBy(Metric::getSensorId)
.window(SlidingEventTimeWindows.of(Time.minutes(10), Time.minutes(1)))
.reduce((a, b) -> {
a.setValue(a.getValue() + b.getValue());
return a;
})
.addSink(new MetricsSink());
Pattern 2: Pattern Detection
Detect sequences of events matching patterns:
Pattern<Transaction, ?> pattern = Pattern.<Transaction>begin("start")
    // where() takes a condition object, not a bare lambda;
    // SimpleCondition.of is available from Flink 1.16 onward
    .where(SimpleCondition.of(evt -> evt.getAmount() > 10000))
    .next("suspicious")
    .where(SimpleCondition.of(evt -> evt.getLocation() != null &&
        !evt.getLocation().equals(evt.getPreviousLocation())))
    .within(Time.minutes(1));
DataStream<Alert> alerts = CEP.pattern(transactions, pattern)
    // select() receives the whole match: a map from pattern name to events
    .select(match -> new Alert("Possible fraud: " +
        match.get("start").get(0).getCustomerId()));
Pattern 3: Stream Joins
Join multiple streams:
DataStream<Order> orders = env.addSource(orderSource);
DataStream<Inventory> inventory = env.addSource(inventorySource);
DataStream<OrderConfirmation> confirmed = orders
    .join(inventory)
    .where(Order::getProductId)
    .equalTo(Inventory::getProductId)
    .window(TumblingEventTimeWindows.of(Time.minutes(5)))
    // Joined streams expose apply(), not filter()/map(); check stock inside the join
    .apply((order, item) -> new OrderConfirmation(order, item,
        order.getQuantity() <= item.getAvailable()))
    .filter(OrderConfirmation::isFulfillable);
Pattern 4: Real-Time ETL
Continuous data movement with transformations:
DataStream<String> raw = env.addSource(kafkaSource);
raw
.map(JSON::parseObject)
.filter(obj -> "active".equals(obj.getString("status"))) // null-safe if "status" is missing
.map(obj -> {
obj.remove("sensitive_data");
obj.put("processed_at", System.currentTimeMillis());
return obj.toJSONString();
})
.addSink(kafkaSink);
Streaming with AI/ML Integration
Feature Store Integration
Stream processing can compute ML features in real-time:
DataStream<UserEvent> events = env.addSource(userEventSource);
events
.keyBy(UserEvent::getUserId)
.flatMap(new FeatureExtractor(featureStore))
.addSink(featureStoreSink);
Real-Time Inference
Process predictions as events flow through the system:
DataStream<PredictionRequest> requests = env.addSource(requestSource);
DataStream<PredictionResult> predictions = requests
.map(req -> {
req.setFeatures(featureStore.getFeatures(req.getUserId()));
return mlModel.predict(req);
})
.filter(result -> result.getConfidence() > 0.8)
.addSink(notificationSink);
Flink for Agentic AI
The Flink Agents project enables running AI agents at scale:
- Long-running, system-triggered AI agents
- Integration with LLMs, tools, and emerging protocols
- Enterprise-grade Agentic AI deployment
Handling Stateful Streaming
State Backends
Flink manages state using pluggable backends:
| Backend | Use Case |
|---|---|
| HashMapStateBackend | Small state held on the JVM heap; fastest access |
| EmbeddedRocksDBStateBackend | Large state on local disk (RocksDB); supports incremental checkpoints |
Checkpointing
Enable fault tolerance with periodic checkpoints:
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(60000); // Checkpoint every minute
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30000);
env.getCheckpointConfig().setCheckpointTimeout(300000);
env.getCheckpointConfig().setTolerableCheckpointFailureNumber(3);
Savepoints
Create savepoints for deployment and migration:
flink savepoint <job-id> <target-directory>
Scaling Considerations
Kafka Partition Strategy
Choose partition keys based on query patterns:
- User-based: user_id for per-user processing
- Geographic: region_id for regional aggregation
- Composite: Combine keys for balanced distribution
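Key-based routing is what makes these strategies work: the same key always hashes to the same partition, so per-user ordering holds. A simplified sketch (Kafka's default partitioner actually uses murmur2 over the serialized key bytes; `hashCode()` here is a stand-in):

```java
// Sketch of key-based partitioning: identical keys always land on the
// same partition. Kafka's real partitioner uses murmur2 over key bytes;
// String.hashCode() is a simplification for illustration.
public class KeyPartitioner {
    public static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & 0x7fffffff) % numPartitions; // mask keeps it non-negative
    }

    public static void main(String[] args) {
        int p1 = partitionFor("user-42", 12);
        int p2 = partitionFor("user-42", 12);
        System.out.println(p1 == p2); // true: stable routing per key
    }
}
```

A caveat worth noting: changing the partition count changes where keys land, so plan partition counts ahead rather than resizing a keyed topic casually.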
Flink Parallelism
Scale processing by adjusting parallelism:
env.setParallelism(4); // Global parallelism
source.setParallelism(8); // Source parallelism
Capacity Planning
Estimate required resources:
| Throughput | Partitions | Flink Task Managers | Memory |
|---|---|---|---|
| 10K events/sec | 10-20 | 2-4 | 16GB |
| 100K events/sec | 50-100 | 8-16 | 64GB |
| 1M events/sec | 200+ | 32+ | 256GB+ |
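The partition counts in the table can be reproduced with back-of-envelope arithmetic. The per-partition rate of ~1,000 events/sec and the 1.5x headroom factor below are illustrative assumptions; real numbers depend heavily on message size, replication settings, and hardware:

```java
// Back-of-envelope: partitions = ceil(target rate * headroom / per-partition rate).
// The per-partition rate is an assumed figure, not a Kafka constant.
public class CapacityEstimate {
    public static int partitionsNeeded(long eventsPerSec, long perPartitionRate, double headroom) {
        return (int) Math.ceil(eventsPerSec * headroom / (double) perPartitionRate);
    }

    public static void main(String[] args) {
        System.out.println(partitionsNeeded(10_000, 1_000, 1.5));  // 15, within the 10-20 row
        System.out.println(partitionsNeeded(100_000, 1_000, 1.0)); // 100
    }
}
```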
Best Practices
- Design for late data: Use watermarks with allowed lateness
- Monitor processing lag: Track event-time vs. processing-time gap
- Backpressure handling: Design for burst traffic
- State size management: Regularly clear unnecessary state
- Schema evolution: Plan for schema changes in events
- Testing: Use Flink’s testing utilities for stream jobs
Challenges and Solutions
Challenge: Out-of-Order Events
Solution: Use event-time processing with watermarks:
// Event time is the default since Flink 1.12; no setStreamTimeCharacteristic call is needed.
// Watermarks are assigned on the stream, not on the environment.
DataStream<Event> events = env.addSource(eventSource);
DataStream<Event> withWatermarks = events.assignTimestampsAndWatermarks(
    WatermarkStrategy
        .<Event>forBoundedOutOfOrderness(Duration.ofSeconds(10))
        .withTimestampAssigner((event, timestamp) -> event.getTimestamp())
);
Challenge: Exactly-Once Sinks
Solution: Use transactional sinks with two-phase commit:
KafkaSink<String> sink = KafkaSink.<String>builder()
    .setBootstrapServers("localhost:9092")
    .setRecordSerializer(KafkaRecordSerializationSchema.builder()
        .setTopic("output-topic") // placeholder topic name
        .setValueSerializationSchema(new SimpleStringSchema())
        .build())
    .setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
    .setTransactionalIdPrefix("flink-")
    .build();
Challenge: State Explosion
Solution: Implement state TTL and regular cleanup:
StateTtlConfig ttlConfig = StateTtlConfig.newBuilder(Duration.ofHours(24))
    .setUpdateType(StateTtlConfig.UpdateType.OnReadAndWrite)
    .setStateVisibility(StateTtlConfig.StateVisibility.ReturnExpiredIfNotCleanedUp)
    .build();
stateDescriptor.enableTimeToLive(ttlConfig); // attach the TTL to a state descriptor to take effect
Resources
- Apache Kafka Documentation
- Apache Flink Documentation
- Confluent Kafka Resources
- Flink ML Documentation
- Kafka Streams API
Conclusion
Streaming architecture with Kafka and Flink enables organizations to process data in real-time at massive scale. Kafka provides the durable, scalable event backbone while Flink offers sophisticated stream processing capabilities. Together, they power real-time analytics, ML pipelines, and event-driven applications that transform raw events into actionable insights in milliseconds.