Why a Lakehouse Cannot Be Built Without Spark

Apache Spark remains essential for Lakehouse architectures — from Iceberg table maintenance and fault-tolerant ETL to distributed data integration.

by Alphyn.ai Engineering Team·15 min read

The Lakehouse concept is actively promoted as the "golden mean" between a Data Lake and a Data Warehouse. It promises to combine flexible data storage, advanced analytics, and transactional guarantees in a unified architecture built on modern open table formats such as Iceberg, which has become a de facto standard for building Data Lakehouses. This article examines a fundamental question: why can a full-featured Lakehouse not be built without Spark?

We discuss the role Spark plays in the Lakehouse approach, which tasks it handles better than others, its close relationship with Iceberg, and why alternatives often fall short of the required level of universality, scalability, and reliability in a large production environment. We also explain why in Alphyn Lakehouse we use Spark as the engine for maintaining Iceberg tables and as the tool for migrating data into the Lakehouse.

Why Many People Dislike Spark

Nothing in the world is perfect — and Spark is no exception. What are the characteristics that often cause frustration among members of the data community? Let's go through them in order:

  1. Cumbersome session configuration for the environment in which queries/computations run. With something like Impala/Trino/StarRocks, the instance is usually already running and preconfigured for the user. All they need to do is write a SQL query and get the data. The user can configure their session if they wish, but it is not required at all.

    With Spark things are different. Every application (whether a scheduled process or a Jupyter session) is a separate instance that must be spun up, and the user needs to configure a session for it. Questions arise: how many cores does my application need? How much memory? What should I set for spark.dynamicAllocation.maxExecutors? Wait, what is dynamicAllocation? Do I enable the Comet plugin for the application? If so, how much off-heap memory should I allocate? The natural reaction is: "Stop this torture, I'm an analyst, I want to analyze data and create business value, not rack my brain over how to efficiently launch my application."

  2. Speed. Spark is not the fastest engine for executing analytical queries, and there are many reasons for this.

  3. The engine is poorly suited for fast, simple queries. In analytics there are often situations where you need to quickly select, look at, and draw conclusions from data. By the time a Spark application launches on the cluster, you could have already analyzed the query result on Impala, walked from home to the office, and verbally presented your conclusions to your manager. The root cause is the heavy startup cost of drivers and executors. Each Spark application is a separate instance requiring its own resource allocation on the cluster. While the Cluster Manager in Kubernetes decides which pods to run on which nodes... With other engines, the application instance is shared across all users and already running — just connect and start writing queries.

  4. Debugging problematic applications. Spark History Server was invented to help with this, and it contains an enormous amount of information about Spark applications. Not just a lot — a LOT. On first visit you can get lost for a long time, and a true understanding of what is there, where it comes from, and why requires extended hands-on practice with Spark.
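
The configuration questions from item 1 can be made concrete with a small sketch. The helper below is hypothetical, and its default values are arbitrary placeholders rather than recommendations. The property names are standard Spark settings; the Comet plugin class name is taken from the Comet documentation as we understand it, so treat it as an assumption to verify against your Comet version.

```python
def build_spark_conf(executor_cores=4, executor_memory_gb=8, max_executors=10,
                     enable_comet=False, off_heap_gb=4):
    """Return the Spark properties for one application as a plain dict.

    All defaults are illustrative placeholders, not tuning advice.
    """
    conf = {
        "spark.executor.cores": str(executor_cores),
        "spark.executor.memory": f"{executor_memory_gb}g",
        # Dynamic allocation lets Spark add and remove executors with the load.
        "spark.dynamicAllocation.enabled": "true",
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
    }
    if enable_comet:
        # Assumption: plugin class name as given in the Comet documentation.
        conf["spark.plugins"] = "org.apache.spark.CometPlugin"
        # Comet keeps its data off-heap, so off-heap memory must be sized explicitly.
        conf["spark.memory.offHeap.enabled"] = "true"
        conf["spark.memory.offHeap.size"] = f"{off_heap_gb}g"
    return conf


def to_submit_args(conf):
    """Render the dict as repeated `--conf key=value` arguments for spark-submit."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


if __name__ == "__main__":
    print(" ".join(to_submit_args(build_spark_conf(enable_comet=True))))
```

Even this reduced version forces five decisions on the user before a single query runs, which is exactly the friction described above.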

You can't build a production-grade Lakehouse without Spark — but you also can't use only Spark. The key is knowing where it excels and where other engines take over.

Why Spark Is Necessary for Building a Data Lakehouse

Lakehouse is a modern data management architecture that combines the benefits of a Data Lake and a Data Warehouse and includes business intelligence and machine learning capabilities on stored data. Its distinguishing characteristics include:

  • Use of open table data formats (Iceberg, Delta, Hudi)
  • Separation of compute and storage resources
  • Support for batch/streaming workloads
  • Unified data governance center

To realize these principles, a reliable and scalable engine is needed. And looking at existing Data Lakehouse solutions, it is clear that Spark has long since de facto become one of the standard frameworks for distributed big data processing. Spark is widely used in Data Lake implementations as well as in many Lakehouse platforms: Databricks, Alphyn Lakehouse, AWS Glue, Google Dataproc, and others use Spark as one of their tools for working with data.

It is worth noting that among table formats, Iceberg has become the most widespread on the market. Some open table formats (OTFs) are closely associated with a specific vendor (hello, Delta from Databricks). Some OTFs fall short of Iceberg in terms of performance and applicability. Additionally, although Iceberg is an independent Apache project, Spark is one of its most mature integrations: the table maintenance actions and stored procedures ship in Iceberg's Spark module, so Iceberg has the largest number of working features on this engine.

Most Effective Maintenance of Iceberg Tables

Apache Iceberg is an open table data format that, along with writing data files, includes a metadata writing and tracking mechanism that enables:

  • ACID transactional compliance
  • Schema evolution and hidden partitioning
  • Accelerated table scanning (by skipping irrelevant data files)

Every action on a table (its creation, data insertion, deletion, etc.) creates a new snapshot in the metadata describing which files represent the table at a specific point in time. As actions accumulate, both data files and metadata files grow in number. To achieve maximum performance when working with a table, it must be maintained in a timely and correct manner:

  • Delete old unnecessary data and its relevant metadata
  • Merge (compact) small data files into larger ones (optionally applying optimization techniques such as z-ordering)
  • Find and delete data files not referenced by any snapshot (remove orphan files), and so on
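
As a rough illustration, the three maintenance steps above map onto Iceberg's Spark stored procedures (expire_snapshots, rewrite_data_files, remove_orphan_files). The sketch below only builds the CALL statements as strings; the helper name, the catalog name, and the table name are hypothetical, and the statements would be passed to spark.sql(...) in a session with an Iceberg catalog configured.

```python
def maintenance_statements(catalog, table, older_than, retain_last=5,
                           zorder_columns=None, rewrite_options=None):
    """Build Iceberg maintenance CALL statements for Spark SQL (sketch only)."""
    # 1. Drop old snapshots and the metadata that belongs to them.
    expire = (
        f"CALL {catalog}.system.expire_snapshots("
        f"table => '{table}', older_than => TIMESTAMP '{older_than}', "
        f"retain_last => {retain_last})"
    )

    # 2. Compact small files; optionally sort with z-ordering.
    rewrite_args = [f"table => '{table}'"]
    if zorder_columns:
        columns = ",".join(zorder_columns)
        rewrite_args.append("strategy => 'sort'")
        rewrite_args.append(f"sort_order => 'zorder({columns})'")
    else:
        rewrite_args.append("strategy => 'binpack'")
    if rewrite_options:
        pairs = ", ".join(f"'{k}', '{v}'" for k, v in rewrite_options.items())
        rewrite_args.append(f"options => map({pairs})")
    joined_args = ", ".join(rewrite_args)
    rewrite = f"CALL {catalog}.system.rewrite_data_files({joined_args})"

    # 3. Remove files that no snapshot references any more.
    orphans = (
        f"CALL {catalog}.system.remove_orphan_files("
        f"table => '{table}', older_than => TIMESTAMP '{older_than}')"
    )
    return [expire, rewrite, orphans]


if __name__ == "__main__":
    # "lake" and "sales.orders" are made-up names for illustration.
    for statement in maintenance_statements(
        "lake", "sales.orders", "2024-01-01 00:00:00",
        zorder_columns=["customer_id", "order_date"],
        rewrite_options={"max-concurrent-file-group-rewrites": "4",
                         "partial-progress.enabled": "true"},
    ):
        print(statement)
```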

To perform these important procedures, it is desirable to have maximum control over the process with fine-grained configuration. Spark provides exactly this control.

Spark allows representing maintenance as a set of independent tasks over file groups. Instead of a monolithic rewrite of the entire table, we form file groups (e.g., within partitions or by file size/count) and process them in parallel (the degree of parallelism is set by the max-concurrent-file-group-rewrites option). This approach offers several advantages:

  • You can divide the load (maintain only part of a table per run, limit the volume of rewritten data or the number of deleted objects)
  • Flexibly control parallelism (to avoid overloading the Data Lakehouse)
  • Improve fault tolerance through Spark retries at the individual group level (a failure in one task does not cause the entire maintenance job to fail and allows safe process restart), as well as through the partial-progress.enabled configuration, which allows committing rewrite progress in parts — specifically per individual file group
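
The idea of independent file groups with partial progress can be modeled in a few lines of plain Python. This is a toy model, not Iceberg's actual planner: the greedy grouping rule, the function names, and the in-memory "commit" lists are all assumptions made for illustration.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def plan_file_groups(files, max_group_bytes):
    """Greedily pack files into groups, never mixing partitions.

    files: iterable of (partition, path, size_bytes) tuples.
    Returns a list of (partition, [paths]) groups.
    """
    by_partition = defaultdict(list)
    for partition, path, size in files:
        by_partition[partition].append((path, size))

    groups = []
    for partition in sorted(by_partition):
        current, current_bytes = [], 0
        for path, size in sorted(by_partition[partition]):
            # Start a new group once the size budget would be exceeded.
            if current and current_bytes + size > max_group_bytes:
                groups.append((partition, current))
                current, current_bytes = [], 0
            current.append(path)
            current_bytes += size
        if current:
            groups.append((partition, current))
    return groups


def rewrite_groups(groups, rewrite_fn, max_concurrent=2):
    """Rewrite groups in parallel; a failed group does not undo the others."""
    committed, failed = [], []
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        futures = [(group, pool.submit(rewrite_fn, group)) for group in groups]
        for group, future in futures:
            try:
                future.result()
                committed.append(group)   # analogous to a per-group commit
            except Exception:
                failed.append(group)      # left for the next maintenance run
    return committed, failed
```

A failed group stays in the `failed` list and can be picked up by the next run, while the groups that succeeded are already committed; this mirrors the intent of partial progress without claiming to reproduce its implementation.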

In our experience, no other engine gives us such fine-grained control over table maintenance.

Providing a Reliable Lakehouse Integration Layer

Integrating data from external systems (such as Postgres, Oracle, Greenplum, Kafka, company portals, and other corporate sources) into the Lakehouse is a typical and important task. A good solution should provide the ability to work with many sources out of the box, add custom data sources from scratch when needed, process (transform) the extracted data as necessary, and guarantee reliability and repeatability of loads.

It is also worth noting separately that the solution must be able to work with enormous data volumes, up to terabytes. Among the available engines, Spark is the most suitable for this task thanks to its architecture, its ecosystem of existing connectors, and the ability to add new ones. Among the architectural characteristics, the fault-tolerance mechanism deserves special mention: it helps ensure the reliability of extractions and computations.

Distributed Data Extraction from Sources

To extract data from relational databases, Spark uses the standard interface for database interaction, JDBC (Java Database Connectivity).

The extraction process works as follows:

  1. A JDBC connection is established to the source
  2. The query to the source is split into several parallel parts, provided the partitioning options (partitionColumn, lowerBound, upperBound, numPartitions) are configured; otherwise the whole result is read through a single connection
  3. Each part becomes a separate Spark task, which runs its own SQL query via JDBC on one of the application's executors and fetches the results

Because the extraction is broken down into independent, non-overlapping queries that are distributed across the executors, the data is extracted in parallel.
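
To show how one query turns into several independent ones, here is a simplified model of the range splitting that the JDBC source performs when partitionColumn, lowerBound, upperBound, and numPartitions are set. The function is a hypothetical approximation: Spark's own stride calculation handles rounding, dates, and timestamps in more detail. Note that the bounds only define the split points and do not filter rows, which is why the first and last predicates are open-ended.

```python
def jdbc_partition_predicates(column, lower_bound, upper_bound, num_partitions):
    """Return one WHERE predicate per partition for an integer column."""
    if num_partitions < 2 or lower_bound >= upper_bound:
        raise ValueError("need at least 2 partitions and lower_bound < upper_bound")

    # Width of each range; at least 1 so that split points stay distinct.
    stride = max(1, (upper_bound - lower_bound) // num_partitions)
    boundaries = [lower_bound + stride * i for i in range(1, num_partitions)]

    # First range is open below and also collects NULLs.
    predicates = [f"{column} < {boundaries[0]} OR {column} IS NULL"]
    for low, high in zip(boundaries, boundaries[1:]):
        predicates.append(f"{column} >= {low} AND {column} < {high}")
    # Last range is open above, so rows beyond upper_bound are still read.
    predicates.append(f"{column} >= {boundaries[-1]}")
    return predicates


if __name__ == "__main__":
    for predicate in jdbc_partition_predicates("id", 0, 100, 4):
        print(f"SELECT * FROM source_table WHERE {predicate}")
```

Each generated statement is what a single task would send to the database, so the number of partitions also bounds the number of concurrent connections opened against the source.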

Diversity of Connectors and the Ability to Add Custom Ones

Spark supports a huge number of sources, including:

  1. Relational databases (Oracle, Postgres)
  2. Cloud and object storage (MinIO, AWS S3)
  3. NoSQL stores (Cassandra, Elasticsearch)
  4. Streaming sources (Kafka, Kinesis, socket)

If you have a special source that Spark doesn't natively support but you want to extract data from, you can write a custom connector. This can be done in Java/Scala and, starting with Spark 4.0, also in Python.
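
A minimal sketch of a Python connector, assuming the Python Data Source API from pyspark.sql.datasource in Spark 4.0. The "portal" source, its schema, and the generated rows are invented for illustration. The fallback base classes exist only so the snippet can be read and run without PySpark 4 installed; they are not part of Spark.

```python
try:
    from pyspark.sql.datasource import DataSource, DataSourceReader
except ImportError:
    # Stand-ins used only when PySpark 4.0+ is not available.
    class DataSource:
        def __init__(self, options):
            self.options = options

    class DataSourceReader:
        pass


class PortalDataSource(DataSource):
    """Hypothetical source that would read documents from a company portal."""

    @classmethod
    def name(cls):
        # Short name used in spark.read.format(...)
        return "portal"

    def schema(self):
        return "id INT, title STRING"

    def reader(self, schema):
        return PortalReader(self.options)


class PortalReader(DataSourceReader):
    def __init__(self, options):
        # Options arrive as strings, exactly as passed via .option(...)
        self.rows = int(options.get("rows", "3"))

    def read(self, partition):
        # A real connector would call the portal's API here;
        # this sketch just generates placeholder rows.
        for i in range(self.rows):
            yield (i, f"document-{i}")


# With a Spark 4.0 session the source would be used roughly like this:
#   spark.dataSource.register(PortalDataSource)
#   spark.read.format("portal").option("rows", "5").load()
```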

Supporting Heavy ETL

If you need not only to extract data but also to transform it further, fault-tolerance becomes very important. Every organization has critical scheduled computations that must meet a specific SLA.

Imagine a process that collects data from multiple sources over hours and builds a data mart for critical reporting (for example, AML/CFT compliance). If a failure occurs in a task thirty minutes before the application finishes, you very much want the application not to crash and require a restart from the beginning — but instead to resume from exactly where the failure occurred (or a couple of steps before it).

This is precisely the fault-tolerance capability that Spark provides. Yes, it is not the fastest computation tool, but in our experience it is the most reliable for long-running jobs, and also the most transparent (no other engine we have worked with produces as many varied logs during operation).

Fault-tolerance is achieved through several features:

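  1. Data lineage tracking during computations. Each DataFrame is a unique object that resulted from applying certain transformations to other DataFrames (or was created by reading data from a source such as a file). Because Spark knows this chain of transformations, it can recompute only the lost partitions after a failure instead of rerunning the whole computation.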

  2. Landing intermediate shuffle files. If a problem occurs at some stage and the computation attempt fails, Spark does not necessarily need to restart the entire job from the beginning. It can take shuffle files from the previous stage and restart only the last failed stage. This is especially useful for long-running batch loads (or other user computations).

  3. Data spill to disk. If Spark runs out of RAM, data begins to "spill" to disk. Although spilling to disk is not a good practice in itself, it is one of the tools for ensuring extraction reliability. In Trino, for example, any spill is disabled by default (using spill in Trino is considered bad practice, since Trino is primarily designed for fast interactive queries), and all data is stored and processed entirely in RAM. If there isn't enough memory — the extraction query fails.
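
The second mechanism, reusing the output of completed stages, can be illustrated with a toy model. This is not how Spark's scheduler is implemented: the function name and the single `saved_output` variable (standing in for shuffle files on disk) are assumptions made to show the principle that only the failed stage is rerun.

```python
def run_with_stage_retries(stages, source, max_attempts=3):
    """Run stages in order, retrying a failed stage from the saved input.

    stages: list of callables, each taking the previous stage's output.
    Returns (final_result, attempts_per_stage).
    """
    saved_output = source           # stand-in for persisted shuffle files
    attempts_per_stage = []
    for stage in stages:
        for attempt in range(1, max_attempts + 1):
            try:
                result = stage(saved_output)
                break
            except RuntimeError:
                # Earlier stages are not rerun: their output is still saved.
                if attempt == max_attempts:
                    raise
        attempts_per_stage.append(attempt)
        saved_output = result       # "land" the output for the next stage
    return saved_output, attempts_per_stage
```

If the last stage of an hours-long pipeline fails once, the cost in this model is one extra attempt of that stage rather than a restart of the entire job, which is the behavior the SLA example above relies on.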

In addition, Spark scales its applications more predictably and safely than other engines thanks to its architecture. Each Spark application deploys as a separate, independent working unit in the cluster. If our application fails for some reason — that's the way it goes. But only our application failed, not the entire service. There is no situation here where a poorly written SQL query in Spark can bring down all running Spark applications on the cluster.

Unlike Spark, Trino and Impala function as persistent daemons with many users running simultaneously. There is always the possibility that an awkward query could take down the entire service. Every query from any user executes within the shared thread and memory pool (or its fraction — the resource group) belonging to these persistent processes. And if one query consumes too much memory, in the best case it will simply consume all cluster resources, and in the worst case cause an OutOfMemory error or freeze the entire node.

In other words, Spark provides a greater degree of isolation for individual applications/queries compared to other engines. This isolation combined with the fault-tolerance mechanism is what allows us to build truly reliable pipelines for computing critical and heavy ETL processes.

Additional Advantages of Spark

Spark includes not only batch data processing but also a number of other useful components that are absent in other engines. For example, on Spark we can solve ML-domain tasks.

Spark MLlib was originally designed for distributed data processing (as were other Spark modules). Using the built-in module for ML tasks provides the following advantages:

  • The ability to train models on extremely large datasets (hundreds of gigabytes of data)
  • Data does not need to be transferred to a separate environment — it is stored in the Lakehouse storage, and Spark MLlib works within that same Lakehouse
  • Model training itself is implemented as distributed computation, natively achieving parallelization

Spark ML functions are used for various transformations in standard ETL processes (standard scaler, minmax scaler for feature normalization), or when there is a genuine need to train a model on a massive dataset (for example, training gradient boosting to forecast sales for all SKUs across all stores in a retail chain).

Although Spark ML has far fewer features than Python ML libraries and its documentation is far from comprehensive, the ability to train models on enormous datasets is a distinctive advantage of Spark over other data processing engines that can genuinely prove useful in day-to-day work.

Real-time data processing is a common task today, so it is important for a Data Lakehouse to be able to handle it. Spark enables near-real-time data processing in a micro-batch manner using the Structured Streaming module. With streaming, data can be loaded from streaming sources such as Kafka or sockets. The streaming module is integrated with the other components and uses a unified compute engine. This means streaming and batch data can be processed using the same APIs, reducing the complexity of developing and maintaining applications.

As an example of this versatility, in our Alphyn Lakehouse platform we needed to implement an audit of Spark application execution based on Spark History Server logs saved to S3. Key challenges included the dynamic log structure and the presence of non-standard events whose structure is not known in advance and may change over time. Using Spark, already embedded in the platform, allowed us to quickly arrive at a ready solution without introducing additional components or complicating the architecture.

Spark's Weaknesses — and How They Are Being Addressed

In the section on disadvantages we identified slow driver/executor startup. Starting with Spark 3.4, however, Spark Connect is available: a modern client-server architecture. In the standard implementation, the client (for example, a Jupyter notebook) connects to Spark directly within the same environment. The client itself consumes resources locally, and the client and cluster must have identical library versions. With Spark Connect, a thin client is created that allows the user to work with a remote Spark cluster. A logical query execution plan is built from the code, serialized with Protocol Buffers, and sent over gRPC to the Spark Connect Server. The Connect Server then interprets the plan and executes it. Results are returned as Apache Arrow record batches. Because the server is a long-running process, a user who connects to it does not have to wait for a new driver to start. This protocol opens enormous possibilities for extending and broadening Spark usage: beyond the obvious use cases of remote interactive analytics (for analysts or data scientists), it also enables lightweight Spark integration in various microservices.
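
On the client side, connecting to a Spark Connect server comes down to a connection string with the sc:// scheme. The helper below is a hypothetical convenience that assembles such a string; the host name is made up, 15002 is the default Spark Connect port, and the parameter names (use_ssl, token, user_id) follow the connection string format as we understand it from the Spark Connect documentation.

```python
def connect_url(host, port=15002, token=None, use_ssl=False, user_id=None):
    """Build a Spark Connect connection string of the form sc://host:port/;k=v."""
    params = []
    if use_ssl:
        params.append("use_ssl=true")
    if token:
        params.append(f"token={token}")
    if user_id:
        params.append(f"user_id={user_id}")

    url = f"sc://{host}:{port}"
    if params:
        # Parameters follow the path separator and are separated by semicolons.
        url += "/;" + ";".join(params)
    return url


if __name__ == "__main__":
    # "spark-connect.example.internal" is a placeholder host.
    print(connect_url("spark-connect.example.internal", user_id="analyst"))

# With PySpark 3.4+ and the Connect client installed, the thin client is created with:
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.remote(connect_url("spark-connect.example.internal")).getOrCreate()
```

The client process needs no local JVM or cluster configuration, which is what makes this approach attractive for notebooks and microservices.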

Spark has a large community that actively develops it and continuously adds new features. For example, Spark 4.0 made a significant leap forward in the further development of Spark SQL functionality and other components. New features substantially expand the user experience and enable new things that are not available in other engines. A couple of notable capabilities from the new version:

  • SQL pipe syntax (the |> operator) for data queries. A query is written as a linear chain of steps that starts from a table. This looks unconventional to users familiar with SQL, but the feature may be useful for people who don't know SQL at all.
  • Creating complex SQL scripts in Spark using loops and conditional execution. The introduction of this capability brings Spark closer to traditional (and familiar) relational databases in terms of user experience. Developers can now implement complex logic in SQL, as in something like Postgres.
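
To give a feel for the pipe syntax from the first bullet, the sketch below assembles a pipe query from a list of steps. The helper, the orders table, and its columns are hypothetical; the operators used (WHERE, AGGREGATE ... GROUP BY, ORDER BY, LIMIT) are part of the pipe syntax in Spark 4.0, and the resulting string would be passed to spark.sql(...).

```python
def pipe_query(source_table, steps):
    """Compose a Spark SQL pipe query: FROM <table> followed by |> steps."""
    lines = [f"FROM {source_table}"]
    lines += [f"|> {step}" for step in steps]
    return "\n".join(lines)


if __name__ == "__main__":
    # Table and column names are invented for illustration.
    query = pipe_query(
        "orders",
        [
            "WHERE amount > 100",
            "AGGREGATE SUM(amount) AS total GROUP BY customer_id",
            "ORDER BY total DESC",
            "LIMIT 10",
        ],
    )
    print(query)
```

Each step reads top to bottom in the order it is applied, unlike a classic SELECT statement, where the FROM clause appears after the projection.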

The list of significant changes introduced in Spark 4.0 is actually very long and deserves a dedicated article.

Can You Build a Lakehouse Without Spark — or with Spark Alone?

If you really want to, you can of course build a Data Lakehouse without Spark. But in that case you will need to cover the tasks that Spark handles out of the box with an entire zoo of alternative solutions, since no single alternative engine covers all of them on its own.

Undoubtedly, Spark has broad functionality and impressive distinctive capabilities that make it a reliable and powerful engine necessary for the operation of a Data Lakehouse. However, there are use cases where other engines perform much better than Spark.

In particular, this refers to executing interactive ad hoc queries whose results must be returned to the user as quickly as possible. Here, engines like Impala/StarRocks leave Spark far behind in execution speed, since they were specifically designed for such tasks (processing all data in RAM, pushing filter and projection operations closer to the data source, parallelizing computations, etc.). Speaking of Trino, its ability to execute federated queries — where within a single script you can work with many databases simultaneously — is a clear advantage for analyst work. It is genuinely convenient when you can access multiple sources from one place without additional complexity.

Spark is a necessary part of Lakehouse platforms and opens the doors to reliable ETL processes, data integration, and a broad spectrum of available data operations. But one should not forget: to achieve the most effective solution to heterogeneous data tasks, it is worth taking the best from the world of engines. This is why in Alphyn Lakehouse we do not force users to use a specific engine for data computations. In our platform, Spark serves as the mandatory engine for Iceberg maintenance and as the default engine for data ingestion, and we continue to develop this integration ourselves.
