Spark SQL Scripting: Procedural Extension for Apache Spark 4
Until recently, implementing complex multi-step logic in the Apache Spark ecosystem required stepping outside of declarative SQL. Orchestrating sequential calls, computing intermediate variables, and branching logic required external programming languages such as Python (PySpark) or Scala, as well as additional tools.
Spark SQL Scripting, which became available starting with version 4, radically changes this approach, representing a procedural extension of classic Spark SQL. Developers can now write full multi-step scenarios directly at the SQL artifact level, embedding control logic within them.
In this publication, we will explore the capabilities of Spark Scripting in practice.
Let us first identify typical use cases where using Spark SQL Scripting functionality will be a valuable aid:
- Batch DDL/DML processing sequences — Sequential data preparation statements and execution of data transformations, creation or recreation of tables, populating them with new computations, updating objects in the target warehouse layer (SCD1 and SCD2 scenarios);
- Data mart preparation and computation — Implementing classic loading steps from the staging layer into target tables with mandatory control of intermediate states;
- Data Quality / Sanity checks — Computing quality metrics directly in the course of script execution, comparing them with threshold values, and deciding: continue the process or stop with an error;
- Runbook operations — Launching scheduled data infrastructure maintenance scenarios manually or automatically (for example, cleaning up temporary objects or collecting system statistics).
As a rule, all these cases use dynamic SQL, which we will not single out as a separate scenario — since it is often precisely for generating executable SQL code based on variables or metadata that we use any procedural extensions.
The ability to use Spark Scripting is controlled by a configuration flag:
spark.sql.scripting.enabled
Spark SQL Script Structure
Any script within the procedural extension of Spark SQL is a so-called compound statement. This means that a logical block is always strictly bounded by the keywords BEGIN ... END.
Inside this block, two logical parts are traditionally distinguished:
- Declaration section — Here the developer can declare local variables, define conditions, and specify error handlers.
- Script body — The actual sequence of statements.
The script body allows a wide spectrum of constructs:
- Conditional branching
IFandCASE; - Loop constructs
LOOP,WHILE,REPEAT, and the specializedFORloop for iterating over rows of a SQL query result; - Loop control operators:
LEAVE(analogous to break in traditional languages) andITERATE(analogous to continue); - Standard DDL and DML commands (e.g.,
CREATE,DROP,INSERT); - Regular
SELECTqueries, which can return result sets to the calling side; - Variable assignment constructs
SETandSET VAR; - Dynamic execution mechanism
EXECUTE IMMEDIATE; - Nested
BEGIN ... ENDblocks for scope isolation.
Critical architectural aspect: A compound statement executes in NOT ATOMIC mode. In practice this means that if one of the statements inside the block fails, previously successfully executed steps are not automatically rolled back. Logic must be designed with an absolute emphasis on idempotency and safety. If the scenario fails midway, you will have to either manually compensate for the changes or design processing steps such that rerunning them does not cause duplication or data corruption.
Another important limitation is that (as of April 2026) the CREATE STORED PROCEDURE command for subsequent invocation via a CALL statement is not yet present.
Working with Variables
Procedural SQL is unthinkable without mechanisms for saving intermediate state. Spark SQL Scripting offers two levels of variable handling.
Two types of variables are distinguished:
- Local variables — They are declared strictly inside a
BEGIN ... ENDblock using the keywordDECLAREand exist only during the execution of that block. If you use nestedBEGIN ... ENDblocks, they create their own local scopes. It is important to remember about "shadowing": if an inner block declares a variable with the same name as the outer block, the inner declaration will override the outer value for the duration of the sub-block; - Session variables — If you need to pass values between physically different SQL queries or scripts within the same Spark session, session variables come to the rescue:
- Declaration:
DECLARE VARIABLE ... - Changing value:
SET VAR ...
- Declaration:
Note on syntax: The SET command without the keyword VAR applies to Spark configuration settings (SQLConf) and cannot work with SQL variables.
Practical Examples
To independently reproduce the practical examples, create a test object with the following script:
DROP TABLE IF EXISTS demo_orders;
CREATE TABLE demo_orders (
order_id BIGINT,
customer_id BIGINT,
region STRING,
amount DECIMAL(12,2),
order_ts TIMESTAMP,
status STRING, -- NEW / OK / CANCELLED
ingestion_dt DATE
) USING parquet;
INSERT INTO demo_orders VALUES
(1001, 501, 'EU', 10.00, TIMESTAMP '2026-03-01 10:00:00', 'NEW', DATE '2026-03-01'),
(1002, 501, 'EU', 35.50, TIMESTAMP '2026-03-01 11:00:00', 'NEW', DATE '2026-03-01'),
(1003, 502, 'EU', 120.00, TIMESTAMP '2026-03-01 12:00:00', 'OK', DATE '2026-03-01'),
(1004, 503, 'US', 5.00, TIMESTAMP '2026-03-01 10:30:00', 'NEW', DATE '2026-03-01'),
(1005, 503, 'US', 60.00, TIMESTAMP '2026-03-01 13:00:00', 'OK', DATE '2026-03-01'),
(1006, 504, 'US', 15.00, TIMESTAMP '2026-03-02 09:00:00', 'OK', DATE '2026-03-02'),
(1007, 505, 'APAC',200.00, TIMESTAMP '2026-03-02 10:00:00', 'OK', DATE '2026-03-02'),
(1008, 506, 'APAC', 12.00, TIMESTAMP '2026-03-02 11:00:00', 'NEW', DATE '2026-03-02'),
(1009, 507, 'APAC', 90.00, TIMESTAMP '2026-03-03 08:00:00', 'OK', DATE '2026-03-03'),
(1010, 508, 'EU', 1.00, TIMESTAMP '2026-03-03 09:00:00', 'NEW', DATE '2026-03-03');
Session Variable Parameters
Scenario: the same script is launched with different parameters (date/region/threshold) without modifying the text.
Launch parameters (declare once per session, then only SET VAR):
DECLARE VARIABLE p_dt DATE DEFAULT DATE '2026-03-01';
DECLARE VARIABLE p_region STRING DEFAULT 'EU';
DECLARE VARIABLE p_min_amount DECIMAL(12,2) DEFAULT 50.00;
SET VAR p_dt = DATE '2026-03-02';
SET VAR p_region = 'APAC';
SET VAR p_min_amount = 10.00;
Script:
BEGIN
DECLARE v_dt DATE DEFAULT session.p_dt;
DECLARE v_region STRING DEFAULT session.p_region;
DECLARE v_min DECIMAL(12,2) DEFAULT session.p_min_amount;
SELECT
v_dt AS ingestion_dt,
v_region AS region,
v_min AS min_amount,
COUNT(*) AS cnt,
CAST(COALESCE(SUM(amount), 0) AS DECIMAL(12,2)) AS sum_amount
FROM demo_orders
WHERE ingestion_dt = v_dt
AND region = v_region
AND amount >= v_min;
END;
Computing Batch Metrics
Scenario: compute metrics for a batch for the passed date.
BEGIN
DECLARE v_dt DATE DEFAULT session.p_dt;
DECLARE v_cnt BIGINT;
DECLARE v_sum DECIMAL(12,2);
SET v_cnt = (SELECT COUNT(*) FROM demo_orders WHERE ingestion_dt = v_dt);
SET v_sum = (
SELECT CAST(COALESCE(SUM(amount), 0) AS DECIMAL(12,2))
FROM demo_orders
WHERE ingestion_dt = v_dt
);
SELECT v_dt AS ingestion_dt, v_cnt AS rows_cnt, v_sum AS amount_sum;
END;
In complex queries, a conflict can arise between variable names and column names in tables. By default, Spark tries to resolve a name as a column or alias in a table. Only if such a name is not found will it be interpreted as a variable.
To avoid accidental logical errors and collisions, it is recommended to explicitly qualify variables with prefixes:
session.<variable_name>system.session.<variable_name>
Limitation: session variables cannot be used in persistent (stored in the metastore) objects — for example in persisted views, default column expressions, or generated columns. Their lifecycle is bound only to the current active session.
Dynamic SQL
EXECUTE IMMEDIATE executes a SQL statement formed as a string.
Supports:
USING— parameter binding (positional or by name);INTO— writing the result of a single-row query into variables:- 0 rows returns NULL;
- More than 1 row returns an error;
INTOis allowed only for queries (not for DDL/DML).
Important limitations:
- Nested
EXECUTE IMMEDIATEis forbidden; EXECUTE IMMEDIATEcannot be passed a SQL Script: the string must not containBEGIN ... END.
If you need to substitute a dynamic table name, schema, or function name into a query, never do it through ordinary string concatenation. Use the built-in function:
IDENTIFIER(strExpr)— a mechanism for safe identifier templating, reducing the risks of syntax errors and malicious injections.
EXECUTE IMMEDIATE + USING + INTO
Scenario: dynamic query with value binding + obtaining scalars into variables.
BEGIN
DECLARE v_region STRING DEFAULT 'EU';
DECLARE v_min DECIMAL(12,2) DEFAULT 20.00;
DECLARE v_cnt BIGINT;
DECLARE v_sum DECIMAL(12,2);
EXECUTE IMMEDIATE
'SELECT COUNT(*) FROM demo_orders WHERE region = ? AND amount >= ?'
INTO v_cnt
USING v_region, v_min;
EXECUTE IMMEDIATE
'SELECT CAST(COALESCE(SUM(amount), 0) AS DECIMAL(12,2))
FROM demo_orders
WHERE region = :r AND amount >= :m'
INTO v_sum
USING (v_region AS r, v_min AS m);
SELECT v_region AS region, v_min AS min_amount, v_cnt AS cnt, v_sum AS sum_amount;
END;
IDENTIFIER(strExpr) — Safe Identifier Substitution
Scenario: table name is given as a string (safe identifier substitution).
BEGIN
DECLARE v_table STRING DEFAULT 'demo_orders';
DECLARE v_cnt BIGINT;
SET v_cnt = (SELECT COUNT(*) FROM IDENTIFIER(v_table));
SELECT v_table AS table_name, v_cnt AS row_count;
END;
Control Constructs: Branching and Loops
Control constructs are the most important functionality, enabling responses to data "on the fly." Together with dynamic SQL, this is one of the most sought-after features of procedural extensions.
Inside blocks, standard conditional branching operators are available:
IF ... THEN ... ELSEIF ... ELSE ... END IFCASE ... END CASE
Data Quality Gate
Scenario: DataQuality gate on a batch: if there are negative amounts — stop; otherwise — proceed.
BEGIN
DECLARE v_dt DATE DEFAULT session.p_dt;
DECLARE v_bad BIGINT;
DECLARE v_decision STRING;
SET v_bad = (SELECT COUNT(*) FROM demo_orders WHERE ingestion_dt = v_dt AND amount < 0);
IF v_bad = 0 THEN
SET v_decision = 'PROCEED';
ELSEIF v_bad < 10 THEN
SET v_decision = 'PROCEED_WITH_WARNING';
ELSE
SET v_decision = 'STOP';
END IF;
SELECT v_dt AS ingestion_dt, v_bad AS bad_rows, v_decision AS decision;
END;
CASE Statement — Fee Rate by Region
Scenario: CASE statement selects fee rate by region (parameter session.p_region); CASE expression marks rows by order size.
BEGIN
DECLARE v_region STRING DEFAULT session.p_region;
DECLARE v_fee_rate DECIMAL(6,4);
CASE v_region
WHEN 'EU' THEN SET v_fee_rate = CAST(0.0200 AS DECIMAL(6,4));
WHEN 'US' THEN SET v_fee_rate = CAST(0.0300 AS DECIMAL(6,4));
WHEN 'APAC' THEN SET v_fee_rate = CAST(0.0500 AS DECIMAL(6,4));
ELSE SET v_fee_rate = CAST(0.0100 AS DECIMAL(6,4));
END CASE;
SELECT
order_id,
region,
amount,
CASE
WHEN amount >= 100 THEN 'HIGH'
WHEN amount >= 50 THEN 'MEDIUM'
ELSE 'LOW'
END AS amount_band,
CAST(amount * v_fee_rate AS DECIMAL(12,2)) AS fee
FROM demo_orders
WHERE region = v_region
ORDER BY order_id;
END;
For loop processing, the following constructs are available:
LOOP ... END LOOP— infinite loop requiring explicit exit;WHILE <condition> DO ... END WHILE— pre-condition loop;REPEAT ... UNTIL <condition> END REPEAT— post-condition loop;FOR [var AS] <query> DO ... END FOR— a powerful tool for iterating over a result set of rows.
For flexible iteration control, use:
LEAVE <label>— immediate exit from a labeled block or loop;ITERATE <label>— advance to the next iteration of a labeled loop.
LOOP with ITERATE and LEAVE
Scenario: scan orders by order_id, skip "small" amounts < 10 (ITERATE), exit when accumulated threshold is reached (LEAVE).
BEGIN
DECLARE v_threshold DECIMAL(12,2) DEFAULT CAST(250 AS DECIMAL(12,2));
DECLARE v_sum DECIMAL(12,2) DEFAULT CAST(0 AS DECIMAL(12,2));
DECLARE v_last BIGINT DEFAULT 0;
DECLARE v_next BIGINT;
DECLARE v_amt DECIMAL(12,2);
scan: LOOP
SET v_next = (SELECT MIN(order_id) FROM demo_orders WHERE order_id > v_last);
IF v_next IS NULL THEN LEAVE scan; END IF;
SET v_amt = (SELECT amount FROM demo_orders WHERE order_id = v_next);
SET v_last = v_next;
IF v_amt < 10 THEN
ITERATE scan; -- continue
END IF;
SET v_sum = v_sum + v_amt;
IF v_sum >= v_threshold THEN
LEAVE scan; -- break
END IF;
END LOOP scan;
SELECT v_last AS last_order_id, v_sum AS accumulated_sum;
END;
WHILE Loop Over Date Range
Scenario: iterate over a date range (parameters) and compute total volume.
Range parameters:
DECLARE VARIABLE p_from_dt DATE DEFAULT DATE '2026-03-01';
DECLARE VARIABLE p_to_dt DATE DEFAULT DATE '2026-03-03';
SET VAR p_from_dt = DATE '2026-03-01';
SET VAR p_to_dt = DATE '2026-03-03';
Script:
BEGIN
DECLARE v_dt DATE DEFAULT session.p_from_dt;
DECLARE v_end DATE DEFAULT session.p_to_dt;
DECLARE v_days INT DEFAULT 0;
DECLARE v_total DECIMAL(12,2) DEFAULT CAST(0 AS DECIMAL(12,2));
DECLARE v_day_sum DECIMAL(12,2);
WHILE v_dt <= v_end DO
SET v_day_sum = (
SELECT CAST(COALESCE(SUM(amount), 0) AS DECIMAL(12,2))
FROM demo_orders
WHERE ingestion_dt = v_dt
);
SET v_total = v_total + v_day_sum;
SET v_days = v_days + 1;
SET v_dt = date_add(v_dt, 1);
END WHILE;
SELECT v_days AS days_processed, v_total AS total_amount;
END;
REPEAT Loop — Polling with Attempt Limit
Scenario: poll with attempt limit — wait for "NEW >= target", otherwise exit on limit.
BEGIN
DECLARE v_target BIGINT DEFAULT 10; -- intentionally more than exists in data
DECLARE v_attempt INT DEFAULT 0;
DECLARE v_max_attempts INT DEFAULT 4;
DECLARE v_cnt BIGINT DEFAULT 0;
DECLARE v_reason STRING;
REPEAT
SET v_attempt = v_attempt + 1;
SET v_cnt = (SELECT COUNT(*) FROM demo_orders WHERE status = 'NEW');
UNTIL v_cnt >= v_target OR v_attempt >= v_max_attempts
END REPEAT;
IF v_cnt >= v_target THEN
SET v_reason = 'FOUND';
ELSE
SET v_reason = 'MAX_ATTEMPTS';
END IF;
SELECT v_cnt AS new_cnt, v_attempt AS attempts, v_reason AS stop_reason;
END;
FOR Loop — Iteration Over Result Set
Scenario: iterate over regions and compute sum for orders with amount >= min.
BEGIN
DECLARE v_min DECIMAL(12,2) DEFAULT CAST(10 AS DECIMAL(12,2));
DECLARE v_regions INT DEFAULT 0;
DECLARE v_total DECIMAL(12,2) DEFAULT CAST(0 AS DECIMAL(12,2));
FOR r AS
SELECT DISTINCT region
FROM demo_orders
ORDER BY region
DO
SET v_total = v_total + (
SELECT CAST(COALESCE(SUM(amount), 0) AS DECIMAL(12,2))
FROM demo_orders
WHERE region = r.region AND amount >= v_min
);
SET v_regions = v_regions + 1;
END FOR;
SELECT v_regions AS regions_cnt, v_min AS min_amount, v_total AS total_amount;
END;
Exception Handling and System Handlers
To prevent scripts from failing silently or leaving locked resources and dirty data, you can describe error handlers.
The construct DECLARE EXIT HANDLER FOR allows reacting to failures. Such a handler intercepts the error, executes the defined logic, and completely stops execution of the current compound statement.
Handlers can be configured for:
- Specific named error classes;
- SQLSTATE standards;
- The generic exception
SQLEXCEPTION; - The
NOT FOUNDcondition (useful when reading data where the absence of rows is an acceptable business scenario).
DIVIDE_BY_ZERO Handler
Scenario: intercept division by zero and return a clear result instead of crashing.
-- To guarantee that division by 0 is an error (not NULL), enable ANSI mode.
SET spark.sql.ansi.enabled = true;
BEGIN
DECLARE EXIT HANDLER FOR DIVIDE_BY_ZERO
h: BEGIN
SELECT 'FAILED' AS status, 'DIVIDE_BY_ZERO' AS error;
END h;
-- provoke the error
SELECT 1 / 0;
SELECT 'OK' AS status; -- will not execute
END;
SQLEXCEPTION Handler — Any Error
Scenario: any error produces a clear final result.
BEGIN
DECLARE v_step STRING DEFAULT 'start';
DECLARE EXIT HANDLER FOR SQLEXCEPTION
h: BEGIN
SELECT 'FAILED' AS status, v_step AS failed_step;
END h;
SET v_step = 'compute';
SELECT 1 / 0; -- error for demonstration
SET v_step = 'never';
SELECT 'OK' AS status;
END;
Per-Iteration Error Isolation in FOR Loop
Scenario: an error on one loop element does not fail the entire run — count warnings and continue.
BEGIN
DECLARE v_warn INT DEFAULT 0;
DECLARE v_total DECIMAL(12,2) DEFAULT CAST(0 AS DECIMAL(12,2));
FOR r AS
SELECT DISTINCT region FROM demo_orders ORDER BY region
DO
maybe_fail: BEGIN
DECLARE EXIT HANDLER FOR SQLEXCEPTION
h: BEGIN
SET v_warn = v_warn + 1;
END h;
IF r.region = 'APAC' THEN
SELECT 1 / 0; -- "break" one step
END IF;
END maybe_fail;
SET v_total = v_total + (
SELECT CAST(COALESCE(SUM(amount), 0) AS DECIMAL(12,2))
FROM demo_orders
WHERE region = r.region
);
END FOR;
SELECT v_total AS sum_amount, v_warn AS warnings;
END;
NOT ATOMIC Demonstration — Successful Steps Are Not Rolled Back
Scenario: row was inserted, then an error occurs — the insertion is already visible. Successful steps are not rolled back automatically.
BEGIN
DECLARE EXIT HANDLER FOR SQLEXCEPTION
h: BEGIN
SELECT COUNT(*) AS test_rows
FROM demo_orders
WHERE status = 'TEST_FAIL';
END h;
INSERT INTO demo_orders VALUES
(9999, 999, 'EU', 0.01, current_timestamp(), 'TEST_FAIL', current_date());
SELECT 1 / 0; -- error
END;
Nested Blocks
Working with nested BEGIN ... END blocks in Spark SQL Scripting is not only a way to make code "pretty" — it is a way to manage error isolation, visibility control, and structural code clarity.
Nested blocks allow visually and logically dividing a script into stages: initialization block, transformation block, cleanup block. Blocks can be given labels (e.g., step_1: BEGIN ... END step_1). This makes debugging easier and clarifies at exactly which stage a failure or delay occurred.
Variable Shadowing in Nested Blocks
Scenario: demonstrate that a variable in a nested BEGIN...END can shadow an outer one without breaking the outer context.
BEGIN
DECLARE v_region STRING DEFAULT 'EU';
DECLARE v_outer_sum DECIMAL(12,2);
DECLARE v_inner_sum DECIMAL(12,2);
SET v_outer_sum = (
SELECT CAST(COALESCE(SUM(amount), 0) AS DECIMAL(12,2))
FROM demo_orders
WHERE region = v_region AND status IN ('NEW','OK','CANCELLED')
);
inner_block: BEGIN
DECLARE v_region STRING DEFAULT 'APAC'; -- shadows outer v_region
SET v_inner_sum = (
SELECT CAST(COALESCE(SUM(amount), 0) AS DECIMAL(12,2))
FROM demo_orders
WHERE region = v_region AND status IN ('NEW','OK','CANCELLED')
);
END inner_block;
SELECT
v_outer_sum AS eu_sum,
v_inner_sum AS apac_sum;
END;
Practical Recommendations and Architectural Patterns
Based on the architectural characteristics of Spark SQL Scripting, we have identified a set of development rules in practice:
- Remember that in the current version (April 2026) there is no ability to save a procedure for later invocation via the
CALLcommand. It is worth architecturally planning for code storage; - Design with NOT ATOMIC in mind — forget classic transactions. Your operations must be idempotent: use
CREATE IF NOT EXISTS,INSERT OVERWRITEinstead of ordinary inserts, and the atomicMERGEoperator; - Isolate variable scopes. When working in large scripts with nested blocks, avoid duplicating variable names so as not to get confused in the context override logic;
- Strictly separate templating and value passing. Never concatenate dynamic SQL strings manually. Use strictly
IDENTIFIER()for objects andUSINGfor parameters; - Control cardinality in
EXECUTE IMMEDIATE INTO. If you are querying a scalar value into a variable viaINTO, guarantee at the SQL level that no more than one row is returned (e.g., using aggregation orLIMIT 1); - Remember the nature of distributed systems. Loops are excellent for orchestration tasks (iterate over a list of dates or a finite list of values), but they are categorically not suitable for row-by-row processing of large data volumes — we are dealing with a distributed system, not an operational OLTP DBMS. For heavy computations, always use classic relational operations (JOIN, GROUP BY), leaving loops only for control structures.
Spark SQL Scripting is not just syntactic sugar — it is an evolutionary step toward bridging the classic functionality of analytical DBMSs (such as Oracle PL/SQL, MS SQL Server T-SQL) with the power of Apache Spark distributed computing. Using Scripting allows data engineers to assemble processing pipelines in "pure SQL," without resorting to third-party components and development languages, thereby reducing the code base and lowering the entry barrier for data analysts.