101 Guide

Streaming vs. Batch Processing: When to Use What

A practical guide to batch and streaming data processing — how they differ, when to use each, and how modern lakehouses combine both patterns.

by Alphyn.ai Engineering Team · 9 min read

Every data platform moves data from point A to point B. The difference is when that movement happens.

Batch vs. Streaming in Plain Language

Batch Processing — collect, then process in bulk

  • Run at scheduled intervals — hourly, nightly, weekly
  • Process large volumes at once
  • Nightly ETL, end-of-day reports, monthly aggregations
  • Tools: Spark, traditional ETL, stored procedures

Stream Processing — process data as it arrives

  • Event by event, continuously, no waiting
  • Sub-second to seconds latency
  • Fraud detection, real-time dashboards, IoT monitoring
  • Tools: Apache Flink, Kafka Streams, Spark Structured Streaming
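
The difference between the two models can be sketched in a few lines of plain Python. This is a minimal illustration, not how Spark or Flink are implemented; the event data and the names `batch_totals` and `StreamingTotals` are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample events: (account_id, amount). Not from any real system.
EVENTS = [("a", 10.0), ("b", 5.0), ("a", 2.5), ("b", 1.0)]


def batch_totals(events):
    """Batch: wait until all events are collected, then aggregate in one pass."""
    totals = defaultdict(float)
    for account, amount in events:
        totals[account] += amount
    return dict(totals)


class StreamingTotals:
    """Streaming: keep running state and update it as each event arrives."""

    def __init__(self):
        self.totals = defaultdict(float)

    def on_event(self, account, amount):
        self.totals[account] += amount
        # The up-to-date total is available immediately after every event.
        return self.totals[account]


stream = StreamingTotals()
running = [stream.on_event(account, amount) for account, amount in EVENTS]
```

Both approaches end with the same totals. The batch function only has an answer once the whole input is available, while the streaming class has a current answer after every event, at the cost of keeping state alive between events.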

CDC (Change Data Capture) — stream database changes

  • Capture every INSERT, UPDATE, DELETE from a source database
  • Keep the lakehouse in sync with Oracle, PostgreSQL, SAP, and others
  • Essential for migration and real-time replication
  • Tools: Debezium, GoldenGate, Qlik Replicate

Batch = doing laundry. You collect dirty clothes all week, then wash everything on Sunday. Efficient per load, but you wait.

Streaming = a conveyor belt at a factory. Each item gets processed the moment it arrives. No waiting, but you need the belt running continuously.

CDC = a security camera on your source database. It records every change as it happens and streams that footage to your lakehouse.
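
To make the CDC idea concrete, here is a minimal sketch that replays a change log against an in-memory replica. The event shape (`op`, `key`, `row`) is a simplified assumption for illustration and is not the actual envelope emitted by Debezium or GoldenGate.

```python
def apply_change(replica, event):
    """Apply one change event to an in-memory replica keyed by primary key."""
    op, key = event["op"], event["key"]
    if op in ("INSERT", "UPDATE"):
        replica[key] = event["row"]   # upsert the latest row image
    elif op == "DELETE":
        replica.pop(key, None)        # tolerate deletes for unknown keys
    else:
        raise ValueError(f"unknown operation: {op}")
    return replica


# Hypothetical change log, in source commit order.
CHANGES = [
    {"op": "INSERT", "key": 1, "row": {"name": "Ada", "balance": 100}},
    {"op": "INSERT", "key": 2, "row": {"name": "Bob", "balance": 50}},
    {"op": "UPDATE", "key": 1, "row": {"name": "Ada", "balance": 80}},
    {"op": "DELETE", "key": 2},
]

replica = {}
for change in CHANGES:
    apply_change(replica, change)
```

After the replay, the replica holds only the latest version of row 1. Applying events in commit order matters: replaying the same log out of order could resurrect a deleted row or keep a stale balance.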


When to Use Each Pattern

| Use Case | What You Need | Why |
|---|---|---|
| Nightly financial reports | Batch | Data only needs to be current as of close-of-business |
| Real-time fraud detection | Streaming | Every second of delay = money lost |
| Dashboard refreshing every hour | Batch is fine | Hourly latency is acceptable; no need for streaming complexity |
| Dashboard with sub-minute data | Streaming | Users expect near-live numbers |
| Keeping lakehouse in sync with Oracle | CDC (streaming) | Capture every transaction as it happens in the source |
| Historical analysis, ML model training | Batch | Working with terabytes of historical data; throughput matters, not latency |
| IoT sensor monitoring | Streaming | Thousands of events per second, need immediate anomaly detection |
| Regulatory reporting (end of day) | Batch | Regulators want a point-in-time snapshot, not a live feed |
| Migrating from Oracle to lakehouse | CDC + Batch | Batch for the initial load, CDC to keep the lakehouse in sync until cutover |
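
The rule of thumb behind these rows can be written as a small helper. This is a sketch under stated assumptions: the function name, its parameters, and the 60-second threshold (taken loosely from the "sub-minute" row) are illustrative, not a formal rule.

```python
STREAMING_THRESHOLD_SECONDS = 60  # assumption: "sub-minute" freshness means streaming


def choose_pattern(max_staleness_seconds, source_is_database=False,
                   needs_initial_load=False):
    """Suggest an ingestion pattern from the freshness the use case tolerates."""
    if max_staleness_seconds >= STREAMING_THRESHOLD_SECONDS:
        return "Batch"
    if source_is_database:
        # Database changes are captured with CDC; a migration also needs a
        # one-time bulk copy of the existing data.
        return "CDC + Batch" if needs_initial_load else "CDC (streaming)"
    return "Streaming"


nightly_report = choose_pattern(24 * 3600)
fraud_detection = choose_pattern(1)
oracle_sync = choose_pattern(5, source_is_database=True)
oracle_migration = choose_pattern(5, source_is_database=True, needs_initial_load=True)
```

The point of the helper is that the tolerated staleness drives the decision, not the technology. Anything that can wait a minute or more is treated here as a batch candidate.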

Most enterprises need both batch and streaming. The question is never "batch or streaming?" — it is "what is the ratio, and can your platform handle both without buying two separate products?"


The Modern Pattern: Streaming + Iceberg

The industry is converging on a standard architecture: stream data into Apache Iceberg tables, query them with SQL engines, maintain them with Spark.

The Streaming-to-Iceberg Pipeline:

Source DB (Oracle, PG, MySQL)
  → CDC / Flink (capture changes)
  → Iceberg Tables (equality-delete files)
  → SQL Engines (Impala, StarRocks, Trino)
  → Dashboards (BI tools, reports)

Background Maintenance (Batch):

Spark (compaction, cleanup)
  → Iceberg Tables (merge delete files, optimize)

Here is what each piece does:

  • Flink + CDC (Debezium): Captures every change from source databases and writes it into Iceberg tables as a stream of events. An update writes the new row version to a data file and records the row's key in an "equality-delete" file, which is Iceberg's way of saying "any older row with this key is no longer current."
  • Iceberg Tables: The single source of truth. Both streaming-ingested data and batch-loaded data land in the same tables, with ACID transactions, time travel, and schema evolution.
  • SQL Engines (Impala, StarRocks, Trino): Query the Iceberg tables with standard SQL. No need to know whether the data arrived via batch or streaming.
  • Spark (maintenance): Runs in the background to compact the small files created by streaming, merge equality-delete files, and keep query performance sharp.
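
How equality deletes behave at read time can be modeled in plain Python. This is a toy model, not Iceberg's implementation: the classes `DataFile` and `EqualityDeleteFile` and the `id` key column are hypothetical. It follows one rule from the Iceberg table spec, namely that an equality delete applies only to data files with a strictly lower sequence number.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    seq: int          # commit sequence number
    rows: tuple       # tuple of row dicts


@dataclass(frozen=True)
class EqualityDeleteFile:
    seq: int
    keys: frozenset   # primary-key values marked as deleted


def read_table(data_files, delete_files, key="id"):
    """Merge-on-read: drop rows removed by a later equality delete."""
    live = []
    for data in data_files:
        for row in data.rows:
            deleted = any(
                d.seq > data.seq and row[key] in d.keys for d in delete_files
            )
            if not deleted:
                live.append(row)
    return live


# Commit 1 inserts two rows; commit 2 updates id=1 (delete old + write new).
data_files = [
    DataFile(seq=1, rows=({"id": 1, "status": "new"}, {"id": 2, "status": "new"})),
    DataFile(seq=2, rows=({"id": 1, "status": "paid"},)),
]
delete_files = [EqualityDeleteFile(seq=2, keys=frozenset({1}))]

rows = read_table(data_files, delete_files)
```

The reader sees only the current version of each row, but it pays for that by checking every data row against the outstanding delete files. That per-query cost is why background compaction matters for streaming tables.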

How the Alphyn Lakehouse Handles Both

Alphyn ships both batch and streaming built in, with unified Iceberg storage underneath.

Batch stack

  • Spark 4.0.1 — ETL, ML pipelines, Iceberg maintenance (compaction, cleanup)
  • Impala + LPSQL — Batch stored procedures migrated from Oracle PL/SQL
  • Flex Loader (Alphyn DR) — Batch extraction from source systems (Oracle, SAP, files)
  • Airflow — Orchestration and scheduling of all batch workflows

Streaming / CDC stack

  • Alphyn ASM (Analytical Stream Manager) — Built on Apache Flink + Debezium
  • CDC from: Oracle, PostgreSQL, MS SQL, MySQL, SAP, MongoDB
  • Writes to Iceberg — Streaming data lands in the same tables as batch data
  • Equality-delete optimization — Alphyn's Iceberg innovation for efficient streaming writes

What is the equality-delete optimization?

When Flink streams CDC updates into Iceberg, each update adds an entry to a small equality-delete file and writes the new row version to a new data file, and every checkpoint commits another set of these files. At scale, thousands of tiny delete files pile up and degrade query performance, because readers must apply them at query time. Stock Spark struggles to compact them efficiently. Alphyn's Iceberg fork includes optimized equality-delete compaction that keeps streaming tables performant, a differentiator that other Iceberg platforms struggle to match.
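
The effect of compaction on file counts can be illustrated with a toy model. This is a sketch of the general idea only and makes no claim about how Alphyn's fork works; the `compact` function, the tuple-based file layout, and the assumption of no concurrent writers are all illustrative.

```python
def compact(data_files, delete_files, key="id"):
    """Rewrite many small files into one data file with deletes already applied.

    data_files:   list of (seq, [row, ...])
    delete_files: list of (seq, {key_value, ...})
    Returns (new_data_files, new_delete_files).
    Assumes no other writer commits while the rewrite runs.
    """
    live = []
    for seq, rows in data_files:
        for row in rows:
            removed = any(
                dseq > seq and row[key] in keys for dseq, keys in delete_files
            )
            if not removed:
                live.append(row)
    all_seqs = [seq for seq, _ in data_files] + [seq for seq, _ in delete_files]
    new_seq = max(all_seqs, default=0)
    return [(new_seq, live)], []


# One insert followed by 100 streamed updates to the same key.
data_files = [(1, [{"id": 1, "v": 0}])]
delete_files = []
for seq in range(2, 102):
    delete_files.append((seq, {1}))
    data_files.append((seq, [{"id": 1, "v": seq - 1}]))

files_before = len(data_files) + len(delete_files)
new_data, new_deletes = compact(data_files, delete_files)
files_after = len(new_data) + len(new_deletes)
```

In this example 201 small files collapse into a single data file holding the one live row. In stock Iceberg, this kind of rewrite is typically run from Spark through the `rewrite_data_files` procedure.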


Vendor Comparison: Streaming + Batch Support

Confluent (Kafka)

Verdict: Streaming only

Best-in-class event streaming infrastructure. Apache Kafka for message transport, ksqlDB for stream processing, Flink-based stream processing recently added.

  • Strength: Gold standard for event streaming and Kafka management
  • Gap: Not a data platform — no batch, no SQL analytics, no data lake. Must pair with Databricks, Snowflake, or similar.

Databricks

Verdict: Full batch + streaming

Spark Structured Streaming for stream processing. Delta Live Tables for unified batch + streaming ETL pipelines. Auto Loader for incremental file ingestion.

  • Strength: Unified Spark-based batch + streaming, excellent auto-scaling
  • Gap: Cloud-only (no on-prem). No procedural SQL. Delta Lake format (now also supports Iceberg). Expensive at scale.

Snowflake

Verdict: Near-real-time (not true streaming)

Snowpipe for near-real-time ingestion (micro-batch). Snowpipe Streaming for lower-latency ingestion. Dynamic Tables for incremental materialized views.

  • Strength: Simple, fully managed, good enough for near-real-time use cases
  • Gap: Not true streaming (seconds-to-minutes latency). Cloud-only. Consumption-based pricing adds up fast.
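
The latency gap between micro-batch and event-at-a-time processing can be sketched numerically. The 30-second flush interval and the arrival times below are hypothetical and are not Snowflake settings; the function only models the waiting time a buffer adds.

```python
import math


def micro_batch_latencies(arrival_times, flush_interval):
    """Waiting time per event when buffered events are flushed on a fixed interval.

    An event arriving at time t becomes visible at the next flush boundary
    at or after t.
    """
    latencies = []
    for t in arrival_times:
        visible_at = math.ceil(t / flush_interval) * flush_interval
        latencies.append(visible_at - t)
    return latencies


arrivals = [1, 15, 29, 31, 59]  # hypothetical arrival times in seconds
latencies = micro_batch_latencies(arrivals, flush_interval=30)
```

An event that arrives just after a flush waits almost the full interval, so the worst-case latency approaches the flush interval itself. With event-at-a-time processing there is no buffer wait, and latency is bounded by processing time instead.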

Cloudera CDP

Verdict: Full batch + streaming

Spark for batch, Flink for streaming (Cloudera Stream Processing), Kafka, NiFi for data flow. Full Hadoop-era stack modernized.

  • Strength: Complete stack — batch, streaming, Kafka, Flink, NiFi all included
  • Gap: Complex to operate. Expensive licensing. Kubernetes wrapper (not native). Heavy operational overhead. Not Iceberg-first.

Starburst (Trino)

Verdict: Query only — no streaming

Query engine only. Can read from Kafka topics via Trino connector (read-only, not stream processing). No CDC. No stream-to-Iceberg pipeline.

  • Strength: Can federate queries across streaming and batch sources
  • Gap: No processing, no CDC, no ingestion pipeline. Must buy and integrate a separate streaming platform.

Dremio

Verdict: Query only — no streaming

Query engine with Arctic (Iceberg catalog) for table management. No native streaming or CDC. Must pair with external Flink/Kafka/Debezium.

  • Strength: Good Iceberg catalog (Arctic) and query acceleration
  • Gap: Analytics only — no data ingestion, no streaming, no processing. Similar gap to Starburst.

ClickHouse

Verdict: Kafka ingestion only

Kafka engine for consuming from Kafka topics. MaterializedView for continuous aggregation. Very fast ingestion and real-time aggregation.

  • Strength: Extremely fast ingestion and aggregation from Kafka
  • Gap: No CDC, no Flink-equivalent. Limited to Kafka input. No procedural SQL. Native storage is its own MergeTree format, not Iceberg.

CelerData (StarRocks)

Verdict: Kafka ingestion only

StarRocks routine load from Kafka topics. Fast real-time analytics on ingested data. Single-engine platform.

  • Strength: Fast real-time analytics on Kafka-ingested data
  • Gap: No CDC, no Flink, limited to Kafka input. No batch ETL or procedural SQL.

Oracle

Verdict: Mature CDC, proprietary

GoldenGate for CDC (expensive add-on). Full integration with Oracle DB ecosystem. Mature but locked-in.

  • Strength: GoldenGate is battle-tested CDC for Oracle sources
  • Gap: Extremely expensive. Oracle-only ecosystem. Proprietary everything. GoldenGate is a separate product with its own licensing.

Teradata

Verdict: No modern streaming

QueryGrid for cross-system queries. Mature batch analytics, legacy streaming capabilities.

  • Strength: Mature analytics engine for batch workloads
  • Gap: No modern streaming. No Flink/Kafka native integration. Appliance model. No open-format lakehouse.

Quick Comparison Matrix

| Vendor | Batch | Streaming | CDC Built-In | Iceberg | On-Prem |
|---|---|---|---|---|---|
| Alphyn | Spark, Impala, LPSQL | Flink (ASM) | Yes (Debezium) | Native | Yes |
| Confluent | No | Kafka, ksqlDB, Flink | Debezium connectors | No | Yes |
| Databricks | Spark | Structured Streaming | Via partners | Supported | No |
| Snowflake | Yes | Snowpipe (micro-batch) | No | Supported | No |
| Cloudera CDP | Spark, Hive | Flink, NiFi | NiFi-based | Supported | Yes |
| Starburst | Query only | No | No | Query only | Yes |
| Dremio | Query only | No | No | Arctic catalog | Yes |
| ClickHouse | Yes | Kafka consumer | No | No | Yes |
| CelerData | Limited | Kafka consumer | No | Read only | Yes |
| Oracle | Yes | No | GoldenGate ($$$) | No | Yes |
| Teradata | Yes | Limited | No | No | Yes (appliance) |

Why an Integrated Stack Wins

When an organization buys a query-only engine (Starburst, Dremio) or a streaming-only platform (Confluent), it still has to buy, integrate, and operate the missing pieces. That means multiple vendors, multiple contracts, and multiple support queues, plus the inevitable finger-pointing: "it broke between Flink and Trino and neither vendor will own it."

An integrated stack avoids this entirely:

  • Unified Iceberg storage — Streaming data and batch data land in the same tables. No data silos. One source of truth.
  • Unified security — Ranger policies apply to all data, whether it arrived via CDC, Spark ETL, or manual load. No security gaps between components.
  • Equality-delete optimization — An optimized Iceberg fork efficiently handles the small-file problem that streaming creates. Other platforms either struggle with this or require expensive manual tuning.
  • No integration tax — ASM (Flink) writes to the same Iceberg tables that Impala, StarRocks, and Spark read. No connectors, no glue code, no "data pipeline engineering."
  • On-prem and air-gapped — Unlike Databricks and Snowflake, a self-hosted lakehouse runs where the data lives. Streaming is included, not a cloud add-on.

Questions Worth Asking

The following questions are useful for understanding whether an existing architecture has a streaming gap — and what closing that gap would unlock.

"How do you get data from your operational systems — Oracle, SAP, PostgreSQL — into your analytics platform today? Is it a nightly extract, or something more continuous?"

"Is there a delay between when a transaction happens in your source system and when it appears in reports? How long is that delay — and is it acceptable?"

"Do you have any real-time requirements today? Fraud detection, live dashboards, IoT monitoring, compliance alerts?"

"Are you running Kafka or any event streaming infrastructure today? If so, what consumes from it?"

"If you could get your Oracle data into the lakehouse within seconds of a transaction, what use cases would that unlock for you?"

That last question is particularly revealing — it shifts the conversation from "do we need streaming?" to "what would we do with it?" and typically surfaces high-value use cases that haven't been pursued simply because the current platform can't handle them.
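
The delay raised in the second question can be measured rather than guessed. The sketch below assumes each replicated record carries two timestamps, here called `committed_at` (source transaction time) and `loaded_at` (arrival in the analytics platform); both field names and the sample values are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import median


def replication_lags(records):
    """Seconds between the source commit and arrival in the analytics platform."""
    return [(r["loaded_at"] - r["committed_at"]).total_seconds() for r in records]


base = datetime(2024, 1, 1, 12, 0, 0)
records = [
    {"committed_at": base, "loaded_at": base + timedelta(seconds=2)},
    {"committed_at": base, "loaded_at": base + timedelta(seconds=5)},
    {"committed_at": base, "loaded_at": base + timedelta(hours=8)},
]

lags = replication_lags(records)
typical_lag = median(lags)
worst_lag = max(lags)
```

Looking at both the median and the maximum is a deliberate choice: a pipeline can look healthy on typical lag while a nightly extract still leaves some records hours behind.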

