Spark SQL Scripting: Procedural Extension for Apache Spark 4

Until recently, implementing complex multi-step logic in the Apache Spark ecosystem required stepping outside of declarative SQL. Orchestrating sequential calls, computing intermediate variables, and branching logic required external programming languages such as Python (PySpark) or Scala, as well as additional tools.

Spark SQL Scripting, which became available starting with version 4, radically changes this approach, representing a procedural extension of classic Spark SQL. Developers can now write full multi-step scenarios directly at the SQL artifact level, embedding control logic within them.

In this publication, we will explore the capabilities of Spark Scripting in practice.

Let us first identify typical use cases where using Spark SQL Scripting functionality will be a valuable aid:

Batch DDL/DML processing sequences — Sequential data preparation statements and execution of data transformations, creation or recreation of tables, populating them with new computations, updating objects in the target warehouse layer (SCD1 and SCD2 scenarios);
Data mart preparation and computation — Implementing classic loading steps from the staging layer into target tables with mandatory control of intermediate states;
Data Quality / Sanity checks — Computing quality metrics directly in the course of script execution, comparing them with threshold values, and deciding: continue the process or stop with an error;
Runbook operations — Launching scheduled data infrastructure maintenance scenarios manually or automatically (for example, cleaning up temporary objects or collecting system statistics).

As a rule, all these cases use dynamic SQL, which we will not single out as a separate scenario — since it is often precisely for generating executable SQL code based on variables or metadata that we use any procedural extensions.

The ability to use Spark Scripting is controlled by a configuration flag:

spark.sql.scripting.enabled

Spark SQL Script Structure

Any script within the procedural extension of Spark SQL is a so-called compound statement. This means that a logical block is always strictly bounded by the keywords BEGIN ... END.

Inside this block, two logical parts are traditionally distinguished:

Declaration section — Here the developer can declare local variables, define conditions, and specify error handlers.
Script body — The actual sequence of statements.

The script body allows a wide spectrum of constructs:

Conditional branching IF and CASE;
Loop constructs LOOP, WHILE, REPEAT, and the specialized FOR loop for iterating over rows of a SQL query result;
Loop control operators: LEAVE (analogous to break in traditional languages) and ITERATE (analogous to continue);
Standard DDL and DML commands (e.g., CREATE, DROP, INSERT);
Regular SELECT queries, which can return result sets to the calling side;
Variable assignment constructs SET and SET VAR;
Dynamic execution mechanism EXECUTE IMMEDIATE;
Nested BEGIN ... END blocks for scope isolation.

Critical architectural aspect: A compound statement executes in NOT ATOMIC mode. In practice this means that if one of the statements inside the block fails, previously successfully executed steps are not automatically rolled back. Logic must be designed with an absolute emphasis on idempotency and safety. If the scenario fails midway, you will have to either manually compensate for the changes or design processing steps such that rerunning them does not cause duplication or data corruption.

Another important limitation is that (as of April 2026) the CREATE STORED PROCEDURE command for subsequent invocation via a CALL statement is not yet present.

Working with Variables

Procedural SQL is unthinkable without mechanisms for saving intermediate state. Spark SQL Scripting offers two levels of variable handling.

Two types of variables are distinguished:

Local variables — They are declared strictly inside a BEGIN ... END block using the keyword DECLARE and exist only during the execution of that block. If you use nested BEGIN ... END blocks, they create their own local scopes. It is important to remember about "shadowing": if an inner block declares a variable with the same name as the outer block, the inner declaration will override the outer value for the duration of the sub-block;
Session variables — If you need to pass values between physically different SQL queries or scripts within the same Spark session, session variables come to the rescue:
- Declaration: DECLARE VARIABLE ...
- Changing value: SET VAR ...

Note on syntax: The SET command without the keyword VAR applies to Spark configuration settings (SQLConf) and cannot work with SQL variables.

Practical Examples

To independently reproduce the practical examples, create a test object with the following script:

DROP TABLE IF EXISTS demo_orders;

CREATE TABLE demo_orders (
  order_id      BIGINT,
  customer_id   BIGINT,
  region        STRING,
  amount        DECIMAL(12,2),
  order_ts      TIMESTAMP,
  status        STRING,          -- NEW / OK / CANCELLED
  ingestion_dt  DATE
) USING parquet;

INSERT INTO demo_orders VALUES
  (1001, 501, 'EU',   10.00, TIMESTAMP '2026-03-01 10:00:00', 'NEW', DATE '2026-03-01'),
  (1002, 501, 'EU',   35.50, TIMESTAMP '2026-03-01 11:00:00', 'NEW', DATE '2026-03-01'),
  (1003, 502, 'EU',  120.00, TIMESTAMP '2026-03-01 12:00:00', 'OK',  DATE '2026-03-01'),
  (1004, 503, 'US',    5.00, TIMESTAMP '2026-03-01 10:30:00', 'NEW', DATE '2026-03-01'),
  (1005, 503, 'US',   60.00, TIMESTAMP '2026-03-01 13:00:00', 'OK',  DATE '2026-03-01'),
  (1006, 504, 'US',   15.00, TIMESTAMP '2026-03-02 09:00:00', 'OK',  DATE '2026-03-02'),
  (1007, 505, 'APAC',200.00, TIMESTAMP '2026-03-02 10:00:00', 'OK',  DATE '2026-03-02'),
  (1008, 506, 'APAC', 12.00, TIMESTAMP '2026-03-02 11:00:00', 'NEW', DATE '2026-03-02'),
  (1009, 507, 'APAC', 90.00, TIMESTAMP '2026-03-03 08:00:00', 'OK',  DATE '2026-03-03'),
  (1010, 508, 'EU',    1.00, TIMESTAMP '2026-03-03 09:00:00', 'NEW', DATE '2026-03-03');

Session Variable Parameters

Scenario: the same script is launched with different parameters (date/region/threshold) without modifying the text.

Launch parameters (declare once per session, then only SET VAR):

DECLARE VARIABLE p_dt DATE DEFAULT DATE '2026-03-01';
DECLARE VARIABLE p_region STRING DEFAULT 'EU';
DECLARE VARIABLE p_min_amount DECIMAL(12,2) DEFAULT 50.00;

SET VAR p_dt = DATE '2026-03-02';
SET VAR p_region = 'APAC';
SET VAR p_min_amount = 10.00;

Script:

BEGIN
  DECLARE v_dt DATE DEFAULT session.p_dt;
  DECLARE v_region STRING DEFAULT session.p_region;
  DECLARE v_min DECIMAL(12,2) DEFAULT session.p_min_amount;

  SELECT
    v_dt AS ingestion_dt,
    v_region AS region,
    v_min AS min_amount,
    COUNT(*) AS cnt,
    CAST(COALESCE(SUM(amount), 0) AS DECIMAL(12,2)) AS sum_amount
  FROM demo_orders
  WHERE ingestion_dt = v_dt
    AND region = v_region
    AND amount >= v_min;
END;

Computing Batch Metrics

Scenario: compute metrics for a batch for the passed date.

BEGIN
  DECLARE v_dt DATE DEFAULT session.p_dt;
  DECLARE v_cnt BIGINT;
  DECLARE v_sum DECIMAL(12,2);

  SET v_cnt = (SELECT COUNT(*) FROM demo_orders WHERE ingestion_dt = v_dt);
  SET v_sum = (
    SELECT CAST(COALESCE(SUM(amount), 0) AS DECIMAL(12,2))
    FROM demo_orders
    WHERE ingestion_dt = v_dt
  );
  SELECT v_dt AS ingestion_dt, v_cnt AS rows_cnt, v_sum AS amount_sum;
END;

In complex queries, a conflict can arise between variable names and column names in tables. By default, Spark tries to resolve a name as a column or alias in a table. Only if such a name is not found will it be interpreted as a variable.

To avoid accidental logical errors and collisions, it is recommended to explicitly qualify variables with prefixes:

session.<variable_name>
system.session.<variable_name>

Limitation: session variables cannot be used in persistent (stored in the metastore) objects — for example in persisted views, default column expressions, or generated columns. Their lifecycle is bound only to the current active session.

Dynamic SQL

EXECUTE IMMEDIATE executes a SQL statement formed as a string.

Supports:

USING — parameter binding (positional or by name);
INTO — writing the result of a single-row query into variables:
- 0 rows returns NULL;
- More than 1 row returns an error;
- INTO is allowed only for queries (not for DDL/DML).

Important limitations:

Nested EXECUTE IMMEDIATE is forbidden;
EXECUTE IMMEDIATE cannot be passed a SQL Script: the string must not contain BEGIN ... END.

If you need to substitute a dynamic table name, schema, or function name into a query, never do it through ordinary string concatenation. Use the built-in function:

IDENTIFIER(strExpr) — a mechanism for safe identifier templating, reducing the risks of syntax errors and malicious injections.

EXECUTE IMMEDIATE + USING + INTO

Scenario: dynamic query with value binding + obtaining scalars into variables.

BEGIN
  DECLARE v_region STRING DEFAULT 'EU';
  DECLARE v_min DECIMAL(12,2) DEFAULT 20.00;
  DECLARE v_cnt BIGINT;
  DECLARE v_sum DECIMAL(12,2);
  EXECUTE IMMEDIATE
    'SELECT COUNT(*) FROM demo_orders WHERE region = ? AND amount >= ?'
    INTO v_cnt
    USING v_region, v_min;
  EXECUTE IMMEDIATE
    'SELECT CAST(COALESCE(SUM(amount), 0) AS DECIMAL(12,2))
       FROM demo_orders
      WHERE region = :r AND amount >= :m'
    INTO v_sum
    USING (v_region AS r, v_min AS m);
  SELECT v_region AS region, v_min AS min_amount, v_cnt AS cnt, v_sum AS sum_amount;
END;

IDENTIFIER(strExpr) — Safe Identifier Substitution

Scenario: table name is given as a string (safe identifier substitution).

BEGIN
  DECLARE v_table STRING DEFAULT 'demo_orders';
  DECLARE v_cnt BIGINT;
  SET v_cnt = (SELECT COUNT(*) FROM IDENTIFIER(v_table));
  SELECT v_table AS table_name, v_cnt AS row_count;
END;

Control Constructs: Branching and Loops

Control constructs are the most important functionality, enabling responses to data "on the fly." Together with dynamic SQL, this is one of the most sought-after features of procedural extensions.

Inside blocks, standard conditional branching operators are available:

IF ... THEN ... ELSEIF ... ELSE ... END IF
CASE ... END CASE

Data Quality Gate

Scenario: DataQuality gate on a batch: if there are negative amounts — stop; otherwise — proceed.

BEGIN

  DECLARE v_dt DATE DEFAULT session.p_dt;
  DECLARE v_bad BIGINT;
  DECLARE v_decision STRING;

  SET v_bad = (SELECT COUNT(*) FROM demo_orders WHERE ingestion_dt = v_dt AND amount < 0);
  IF v_bad = 0 THEN
    SET v_decision = 'PROCEED';
  ELSEIF v_bad < 10 THEN
    SET v_decision = 'PROCEED_WITH_WARNING';
  ELSE
    SET v_decision = 'STOP';
  END IF;
  SELECT v_dt AS ingestion_dt, v_bad AS bad_rows, v_decision AS decision;

END;

CASE Statement — Fee Rate by Region

Scenario: CASE statement selects fee rate by region (parameter session.p_region); CASE expression marks rows by order size.

BEGIN

  DECLARE v_region STRING DEFAULT session.p_region;
  DECLARE v_fee_rate DECIMAL(6,4);

  CASE v_region
    WHEN 'EU'   THEN SET v_fee_rate = CAST(0.0200 AS DECIMAL(6,4));
    WHEN 'US'   THEN SET v_fee_rate = CAST(0.0300 AS DECIMAL(6,4));
    WHEN 'APAC' THEN SET v_fee_rate = CAST(0.0500 AS DECIMAL(6,4));
    ELSE            SET v_fee_rate = CAST(0.0100 AS DECIMAL(6,4));
  END CASE;
  SELECT
    order_id,
    region,
    amount,
    CASE
      WHEN amount >= 100 THEN 'HIGH'
      WHEN amount >= 50  THEN 'MEDIUM'
      ELSE 'LOW'
    END AS amount_band,
    CAST(amount * v_fee_rate AS DECIMAL(12,2)) AS fee
  FROM demo_orders
  WHERE region = v_region
  ORDER BY order_id;

END;

For loop processing, the following constructs are available:

LOOP ... END LOOP — infinite loop requiring explicit exit;
WHILE <condition> DO ... END WHILE — pre-condition loop;
REPEAT ... UNTIL <condition> END REPEAT — post-condition loop;
FOR [var AS] <query> DO ... END FOR — a powerful tool for iterating over a result set of rows.

For flexible iteration control, use:

LEAVE <label> — immediate exit from a labeled block or loop;
ITERATE <label> — advance to the next iteration of a labeled loop.

LOOP with ITERATE and LEAVE

Scenario: scan orders by order_id, skip "small" amounts < 10 (ITERATE), exit when accumulated threshold is reached (LEAVE).

BEGIN
  DECLARE v_threshold DECIMAL(12,2) DEFAULT CAST(250 AS DECIMAL(12,2));
  DECLARE v_sum DECIMAL(12,2) DEFAULT CAST(0 AS DECIMAL(12,2));
  DECLARE v_last BIGINT DEFAULT 0;
  DECLARE v_next BIGINT;
  DECLARE v_amt  DECIMAL(12,2);

scan: LOOP

    SET v_next = (SELECT MIN(order_id) FROM demo_orders WHERE order_id > v_last);
    IF v_next IS NULL THEN LEAVE scan; END IF;
    SET v_amt  = (SELECT amount FROM demo_orders WHERE order_id = v_next);
    SET v_last = v_next;
    IF v_amt < 10 THEN
      ITERATE scan;   -- continue
    END IF;
    SET v_sum = v_sum + v_amt;
    IF v_sum >= v_threshold THEN
      LEAVE scan;     -- break
    END IF;

  END LOOP scan;

  SELECT v_last AS last_order_id, v_sum AS accumulated_sum;

END;

WHILE Loop Over Date Range

Scenario: iterate over a date range (parameters) and compute total volume.

Range parameters:

DECLARE VARIABLE p_from_dt DATE DEFAULT DATE '2026-03-01';
DECLARE VARIABLE p_to_dt   DATE DEFAULT DATE '2026-03-03';

SET VAR p_from_dt = DATE '2026-03-01';
SET VAR p_to_dt   = DATE '2026-03-03';

Script:

BEGIN
  DECLARE v_dt  DATE DEFAULT session.p_from_dt;
  DECLARE v_end DATE DEFAULT session.p_to_dt;

  DECLARE v_days INT DEFAULT 0;
  DECLARE v_total DECIMAL(12,2) DEFAULT CAST(0 AS DECIMAL(12,2));
  DECLARE v_day_sum DECIMAL(12,2);

  WHILE v_dt <= v_end DO
    SET v_day_sum = (
      SELECT CAST(COALESCE(SUM(amount), 0) AS DECIMAL(12,2))
      FROM demo_orders
      WHERE ingestion_dt = v_dt
    );
    SET v_total = v_total + v_day_sum;
    SET v_days = v_days + 1;
    SET v_dt = date_add(v_dt, 1);

  END WHILE;

  SELECT v_days AS days_processed, v_total AS total_amount;
END;

REPEAT Loop — Polling with Attempt Limit

Scenario: poll with attempt limit — wait for "NEW >= target", otherwise exit on limit.

BEGIN
  DECLARE v_target BIGINT DEFAULT 10;      -- intentionally more than exists in data
  DECLARE v_attempt INT DEFAULT 0;
  DECLARE v_max_attempts INT DEFAULT 4;
  DECLARE v_cnt BIGINT DEFAULT 0;
  DECLARE v_reason STRING;

  REPEAT

    SET v_attempt = v_attempt + 1;
    SET v_cnt = (SELECT COUNT(*) FROM demo_orders WHERE status = 'NEW');
  UNTIL v_cnt >= v_target OR v_attempt >= v_max_attempts

  END REPEAT;

  IF v_cnt >= v_target THEN
    SET v_reason = 'FOUND';
  ELSE
    SET v_reason = 'MAX_ATTEMPTS';
  END IF;

  SELECT v_cnt AS new_cnt, v_attempt AS attempts, v_reason AS stop_reason;

END;

FOR Loop — Iteration Over Result Set

Scenario: iterate over regions and compute sum for orders with amount >= min.

BEGIN

  DECLARE v_min DECIMAL(12,2) DEFAULT CAST(10 AS DECIMAL(12,2));
  DECLARE v_regions INT DEFAULT 0;
  DECLARE v_total DECIMAL(12,2) DEFAULT CAST(0 AS DECIMAL(12,2));

  FOR r AS
    SELECT DISTINCT region
    FROM demo_orders
    ORDER BY region
  DO
    SET v_total = v_total + (
      SELECT CAST(COALESCE(SUM(amount), 0) AS DECIMAL(12,2))
      FROM demo_orders
      WHERE region = r.region AND amount >= v_min
    );

    SET v_regions = v_regions + 1;
  END FOR;

  SELECT v_regions AS regions_cnt, v_min AS min_amount, v_total AS total_amount;

END;

Exception Handling and System Handlers

To prevent scripts from failing silently or leaving locked resources and dirty data, you can describe error handlers.

The construct DECLARE EXIT HANDLER FOR allows reacting to failures. Such a handler intercepts the error, executes the defined logic, and completely stops execution of the current compound statement.

Handlers can be configured for:

Specific named error classes;
SQLSTATE standards;
The generic exception SQLEXCEPTION;
The NOT FOUND condition (useful when reading data where the absence of rows is an acceptable business scenario).

DIVIDE_BY_ZERO Handler

Scenario: intercept division by zero and return a clear result instead of crashing.

-- To guarantee that division by 0 is an error (not NULL), enable ANSI mode.
SET spark.sql.ansi.enabled = true;

BEGIN
  DECLARE EXIT HANDLER FOR DIVIDE_BY_ZERO

  h: BEGIN
    SELECT 'FAILED' AS status, 'DIVIDE_BY_ZERO' AS error;
  END h;

  -- provoke the error
  SELECT 1 / 0;
  SELECT 'OK' AS status;  -- will not execute
END;

SQLEXCEPTION Handler — Any Error

Scenario: any error produces a clear final result.

BEGIN

  DECLARE v_step STRING DEFAULT 'start';
  DECLARE EXIT HANDLER FOR SQLEXCEPTION

  h: BEGIN

    SELECT 'FAILED' AS status, v_step AS failed_step;

  END h;

  SET v_step = 'compute';
  SELECT 1 / 0;  -- error for demonstration
  SET v_step = 'never';
  SELECT 'OK' AS status;

END;

Per-Iteration Error Isolation in FOR Loop

Scenario: an error on one loop element does not fail the entire run — count warnings and continue.

BEGIN

  DECLARE v_warn INT DEFAULT 0;
  DECLARE v_total DECIMAL(12,2) DEFAULT CAST(0 AS DECIMAL(12,2));

  FOR r AS
    SELECT DISTINCT region FROM demo_orders ORDER BY region
  DO
    maybe_fail: BEGIN
      DECLARE EXIT HANDLER FOR SQLEXCEPTION
      h: BEGIN
        SET v_warn = v_warn + 1;
      END h;
      IF r.region = 'APAC' THEN
        SELECT 1 / 0;   -- "break" one step
      END IF;
    END maybe_fail;
    SET v_total = v_total + (
      SELECT CAST(COALESCE(SUM(amount), 0) AS DECIMAL(12,2))
      FROM demo_orders
      WHERE region = r.region
    );
  END FOR;

  SELECT v_total AS sum_amount, v_warn AS warnings;

END;

NOT ATOMIC Demonstration — Successful Steps Are Not Rolled Back

Scenario: row was inserted, then an error occurs — the insertion is already visible. Successful steps are not rolled back automatically.

BEGIN

  DECLARE EXIT HANDLER FOR SQLEXCEPTION

  h: BEGIN
    SELECT COUNT(*) AS test_rows
    FROM demo_orders
    WHERE status = 'TEST_FAIL';
  END h;

  INSERT INTO demo_orders VALUES
    (9999, 999, 'EU', 0.01, current_timestamp(), 'TEST_FAIL', current_date());
  SELECT 1 / 0;  -- error

END;

Nested Blocks

Working with nested BEGIN ... END blocks in Spark SQL Scripting is not only a way to make code "pretty" — it is a way to manage error isolation, visibility control, and structural code clarity.

Nested blocks allow visually and logically dividing a script into stages: initialization block, transformation block, cleanup block. Blocks can be given labels (e.g., step_1: BEGIN ... END step_1). This makes debugging easier and clarifies at exactly which stage a failure or delay occurred.

Variable Shadowing in Nested Blocks

Scenario: demonstrate that a variable in a nested BEGIN...END can shadow an outer one without breaking the outer context.

BEGIN

  DECLARE v_region STRING DEFAULT 'EU';
  DECLARE v_outer_sum DECIMAL(12,2);
  DECLARE v_inner_sum DECIMAL(12,2);

  SET v_outer_sum = (
    SELECT CAST(COALESCE(SUM(amount), 0) AS DECIMAL(12,2))
    FROM demo_orders
    WHERE region = v_region AND status IN ('NEW','OK','CANCELLED')
  );
  inner_block: BEGIN
    DECLARE v_region STRING DEFAULT 'APAC';  -- shadows outer v_region
    SET v_inner_sum = (
      SELECT CAST(COALESCE(SUM(amount), 0) AS DECIMAL(12,2))
      FROM demo_orders
      WHERE region = v_region AND status IN ('NEW','OK','CANCELLED')
    );
  END inner_block;

  SELECT
    v_outer_sum AS eu_sum,
    v_inner_sum AS apac_sum;

END;

Practical Recommendations and Architectural Patterns

Based on the architectural characteristics of Spark SQL Scripting, we have identified a set of development rules in practice:

Remember that in the current version (April 2026) there is no ability to save a procedure for later invocation via the CALL command. It is worth architecturally planning for code storage;
Design with NOT ATOMIC in mind — forget classic transactions. Your operations must be idempotent: use CREATE IF NOT EXISTS, INSERT OVERWRITE instead of ordinary inserts, and the atomic MERGE operator;
Isolate variable scopes. When working in large scripts with nested blocks, avoid duplicating variable names so as not to get confused in the context override logic;
Strictly separate templating and value passing. Never concatenate dynamic SQL strings manually. Use strictly IDENTIFIER() for objects and USING for parameters;
Control cardinality in EXECUTE IMMEDIATE INTO. If you are querying a scalar value into a variable via INTO, guarantee at the SQL level that no more than one row is returned (e.g., using aggregation or LIMIT 1);
Remember the nature of distributed systems. Loops are excellent for orchestration tasks (iterate over a list of dates or a finite list of values), but they are categorically not suitable for row-by-row processing of large data volumes — we are dealing with a distributed system, not an operational OLTP DBMS. For heavy computations, always use classic relational operations (JOIN, GROUP BY), leaving loops only for control structures.

Spark SQL Scripting is not just syntactic sugar — it is an evolutionary step toward bridging the classic functionality of analytical DBMSs (such as Oracle PL/SQL, MS SQL Server T-SQL) with the power of Apache Spark distributed computing. Using Scripting allows data engineers to assemble processing pipelines in "pure SQL," without resorting to third-party components and development languages, thereby reducing the code base and lowering the entry barrier for data analysts.

Spark SQL Scripting: Procedural Extension for Apache Spark 4

Spark SQL Scripting: Procedural Extension for Apache Spark 4

Spark SQL Script Structure

Working with Variables

Practical Examples

Session Variable Parameters

Computing Batch Metrics

Dynamic SQL

EXECUTE IMMEDIATE + USING + INTO

IDENTIFIER(strExpr) — Safe Identifier Substitution

Control Constructs: Branching and Loops

Data Quality Gate

CASE Statement — Fee Rate by Region

LOOP with ITERATE and LEAVE

WHILE Loop Over Date Range

REPEAT Loop — Polling with Attempt Limit

FOR Loop — Iteration Over Result Set

Exception Handling and System Handlers

DIVIDE_BY_ZERO Handler

SQLEXCEPTION Handler — Any Error

Per-Iteration Error Isolation in FOR Loop

NOT ATOMIC Demonstration — Successful Steps Are Not Rolled Back

Nested Blocks

Variable Shadowing in Nested Blocks

Practical Recommendations and Architectural Patterns

Get the latest posts in your inbox

Continue Reading

Why a Lakehouse Cannot Be Built Without Spark

Open Table Formats — Iceberg vs Paimon — Practical Experience

Terabytes of Data from Teradata to Trino — Efficient Transfer Method