Data Security and Compliance 101: What Every Data Platform Needs

Data platforms hold the crown jewels of every organization: customer records, financial transactions, personally identifiable information (PII), health records, and proprietary analytics. If you can query it, someone in compliance is worrying about who else can query it.

In regulated industries — banking, government, healthcare, telecom — security is not optional. It is a mandatory checkbox that gets evaluated before features, performance, or price. Three things drive this reality:

CISOs and compliance teams have veto power. They may not choose the platform, but they can block any platform that fails their review. If a system can't answer their questions, it doesn't make it through procurement — regardless of how fast its queries run.
Regulated industries are where enterprise-scale deployments live. Banks, government agencies, and telecoms require the highest standard of security as a baseline condition.
Security breaches are career-ending events. Nobody gets fired for choosing the more secure option. Decision-makers are risk-averse on security for very personal reasons.

A platform that can't demonstrate security controls loses the deal regardless of features.

Security Concepts in Plain Language

Here are the core security terms that come up in any serious data platform evaluation.

Authentication

Proving your identity before you get access. Think of it as the security badge at the building entrance — you swipe your card, and the system confirms you're a real employee.

In practice: Keycloak with OIDC/OAuth2, integrating with Active Directory, LDAP, and any corporate SSO.
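
To make the "badge check" concrete, here is a minimal sketch of the claim checks a service might run on an OIDC ID token. It assumes the token's signature has already been verified by a proper OIDC/JWT library, which this sketch does not do. The `iss`, `aud`, and `exp` claim names are standard JWT claims; the issuer URL, audience value, and the `claims_are_valid` helper are hypothetical.

```python
import time


def claims_are_valid(claims, expected_issuer, expected_audience, now=None):
    """Check issuer, audience, and expiry on already-verified token claims."""
    now = time.time() if now is None else now
    aud = claims.get("aud")
    # The "aud" claim may be a single string or a list of strings.
    audiences = aud if isinstance(aud, list) else [aud]
    exp = claims.get("exp")
    return (
        claims.get("iss") == expected_issuer
        and expected_audience in audiences
        and isinstance(exp, (int, float))
        and exp > now
    )


# Hypothetical values for illustration only.
ISSUER = "https://keycloak.example/realms/data"
claims = {"iss": ISSUER, "aud": "data-platform", "exp": 2_000_000_000}
print(claims_are_valid(claims, ISSUER, "data-platform", now=1_700_000_000))  # True
```

A token that fails any one of these checks should be rejected outright rather than partially trusted.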

Authorization (RBAC)

Once you're in the building, which rooms can you enter? Role-Based Access Control means your access is defined by your role — "analyst," "data engineer," "admin" — not configured per-person.

In practice: Apache Ranger policies, centrally managed and applied across all query engines.
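
The role-based idea can be sketched in a few lines of Python. The role names come from the text above; the permission strings and the `is_allowed` helper are hypothetical and do not reflect Ranger's actual policy model.

```python
# Hypothetical role-to-permission mapping. A real deployment would define
# these as policies in a central policy engine, not in application code.
ROLE_PERMISSIONS = {
    "analyst": {"sales.orders:select"},
    "data engineer": {"sales.orders:select", "sales.orders:insert"},
    "admin": {"sales.orders:select", "sales.orders:insert", "sales.orders:drop"},
}


def is_allowed(user_roles, permission):
    """Return True if any of the user's roles grants the permission."""
    return any(
        permission in ROLE_PERMISSIONS.get(role, set()) for role in user_roles
    )


print(is_allowed(["analyst"], "sales.orders:select"))  # True
print(is_allowed(["analyst"], "sales.orders:drop"))    # False
```

Because access hangs off the role rather than the person, onboarding a new analyst means assigning one role instead of copying a list of individual grants.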

Row-Level Security

Different users see different rows in the same table. A regional manager in one city only sees that city's data; a manager in another city sees only theirs. Same table, different views — automatically enforced.

In practice: Ranger row-level filters applied consistently across StarRocks, Impala, Trino, and Spark.
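
As a rough illustration of what a row-level filter does, the sketch below applies a per-user city filter to an in-memory table. The table, the user-to-city assignment, and the `visible_rows` helper are all hypothetical; a real policy engine applies the equivalent filter inside the query engine, before results leave it.

```python
# Hypothetical in-memory table and user-to-city assignment.
SALES = [
    {"city": "Springfield", "amount": 120},
    {"city": "Riverton", "amount": 340},
    {"city": "Springfield", "amount": 75},
]
USER_CITY = {"manager_a": "Springfield", "manager_b": "Riverton"}


def visible_rows(user, rows):
    """Return only the rows whose city matches the user's assigned city."""
    city = USER_CITY.get(user)
    if city is None:
        return []  # deny by default: unknown users see nothing
    return [row for row in rows if row["city"] == city]


print(len(visible_rows("manager_a", SALES)))  # 2
print(len(visible_rows("manager_b", SALES)))  # 1
```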

Column-Level Masking

Sensitive columns — SSN, salary, email, phone number — are masked or hidden based on role. An analyst sees ***-**-1234 while an authorized HR user sees the full value.

In practice: Ranger column masking policies, with the same policy enforced across all engines.
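
A minimal sketch of the masking behavior described above, assuming a mask that keeps the last four digits. The `hr` role name and both helper functions are hypothetical; the sample SSN is made up.

```python
def mask_ssn(ssn):
    """Keep the last four digits and replace every other digit with '*'."""
    total_digits = sum(ch.isdigit() for ch in ssn)
    digits_seen = 0
    out = []
    for ch in ssn:
        if ch.isdigit():
            digits_seen += 1
            # Only the final four digits stay readable.
            out.append(ch if digits_seen > total_digits - 4 else "*")
        else:
            out.append(ch)  # separators such as '-' pass through unchanged
    return "".join(out)


def read_ssn(ssn, role):
    """Return the full value for the authorized role, a masked value otherwise."""
    return ssn if role == "hr" else mask_ssn(ssn)


print(read_ssn("123-45-1234", "analyst"))  # ***-**-1234
print(read_ssn("123-45-1234", "hr"))       # 123-45-1234
```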

Encryption at Rest

Data files stored on disk are encrypted. If someone steals a hard drive, they get gibberish. This is a compliance checkbox in virtually every regulation.

In practice: Storage-level encryption via S3/MinIO server-side encryption and Kubernetes Secrets encryption.

Encryption in Transit (TLS/mTLS)

Data moving between components — browser to server, engine to storage, service to service — is encrypted. Nobody can intercept it by sniffing the network.

In practice: TLS everywhere via cert-manager with auto-rotating certificates; mTLS between internal services.
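
To show what the "mutual" in mTLS means at the socket level, here is a small sketch using Python's standard `ssl` module. It is not how cert-manager works; it only illustrates a server that refuses clients without a trusted certificate. The `build_mtls_server_context` helper and its file-path parameters are hypothetical.

```python
import ssl


def build_mtls_server_context(cert_file=None, key_file=None, ca_file=None):
    """Build a server-side TLS context that requires a client certificate."""
    ctx = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    # CERT_REQUIRED makes the handshake fail unless the client presents a
    # certificate signed by a trusted CA. This is the "mutual" part of mTLS.
    ctx.verify_mode = ssl.CERT_REQUIRED
    if cert_file and key_file:
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    if ca_file:
        ctx.load_verify_locations(cafile=ca_file)
    return ctx


ctx = build_mtls_server_context()  # no files loaded in this sketch
print(ctx.verify_mode == ssl.CERT_REQUIRED)  # True
```

In a real service the certificate, key, and CA bundle paths would point at files issued and rotated by the platform's certificate tooling.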

Audit Trail

Every query, every data access attempt, every login is logged. When an auditor asks "who accessed this table in the last 90 days?" — you have the answer in seconds.

In practice: Full audit trail written to OpenSearch — every query and access attempt, searchable and exportable.
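
The auditor's 90-day question reduces to a filter over audit events. The sketch below runs it over an in-memory list; the event field names and the `who_accessed` helper are assumptions for illustration and are not an OpenSearch query.

```python
from datetime import datetime, timedelta, timezone


def who_accessed(events, table, now, days=90):
    """Return the sorted, de-duplicated users who accessed `table` in the window."""
    cutoff = now - timedelta(days=days)
    return sorted(
        {e["user"] for e in events if e["table"] == table and e["time"] >= cutoff}
    )


# Hypothetical audit events; the field names are assumptions for this sketch.
NOW = datetime(2024, 6, 30, tzinfo=timezone.utc)
EVENTS = [
    {"user": "alice", "table": "customers", "time": NOW - timedelta(days=3)},
    {"user": "bob", "table": "customers", "time": NOW - timedelta(days=120)},
    {"user": "carol", "table": "orders", "time": NOW - timedelta(days=1)},
    {"user": "alice", "table": "customers", "time": NOW - timedelta(days=40)},
]

print(who_accessed(EVENTS, "customers", NOW))  # ['alice']
```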

Data Sovereignty

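Many laws require certain data to stay within specific national or regional borders, or restrict how it may be transferred abroad. EU personal data is subject to GDPR transfer restrictions, and GCC data must comply with local sovereignty requirements. The platform must be able to guarantee that data never leaves the jurisdiction.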

In practice: On-premise Kubernetes — data never leaves the customer's data center. No cloud call-home.

Air-Gap Deployment

The system operates with absolutely zero internet connectivity. No outbound network access. No license phone-home. No telemetry. No update checks. The platform runs completely isolated from the outside world. This is mandatory for government, defense, and critical infrastructure.

In practice: Air-gapped by default. Helm charts plus a private container registry are enough to run fully offline. No special tooling needed.

The Regulatory Landscape

Security requirements don't emerge in a vacuum — they are driven by specific regulations. Here is what's driving the conversation across industries.

Banking and Finance — Basel III, PCI DSS, Central Bank Regulations

Basel III mandates operational risk management including data security. PCI DSS governs any system touching payment card data — strict access control, encryption, audit trail. Every country's central bank adds local requirements on top.

Healthcare — HIPAA and Local Equivalents

HIPAA requires "minimum necessary" access to patient data, full audit trails, encryption, and breach notification. Other countries have similar medical data protection laws. Row-level security is critical — a doctor should only see their own patients' records.

Government and Defense — Air-Gap Mandatory, Sovereign Data

Classified and sensitive government systems must be air-gapped. No internet connectivity, period. Data sovereignty is non-negotiable. This immediately eliminates all SaaS vendors and most cloud-dependent platforms.

EU / International — GDPR

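Europe's General Data Protection Regulation covers restrictions on transferring personal data outside the EU/EEA, the right to erasure (organizations must be able to find and delete a person's data), consent tracking, and breach notification to the supervisory authority within 72 hours of becoming aware of a breach. Fines reach up to EUR 20 million or 4% of global annual turnover, whichever is higher.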

GCC / Middle East — Data Localization and Sovereignty

Countries across the GCC region are enacting strict data sovereignty laws. The UAE's Federal Decree-Law on Data Protection, Saudi Arabia's PDPL, and similar frameworks require personal data to remain within national borders and mandate specific technical security controls. Air-gap capability and on-premise deployment are frequently required for government and financial sector systems.

Telecom — Subscriber Data Protection

Telecom operators hold subscriber data (call records, location, usage patterns) under strict regulation. Lawful intercept readiness is mandatory in most jurisdictions. Access control and audit trail on subscriber data are non-negotiable.

Cross-Industry — SOX (Sarbanes-Oxley)

Companies publicly traded in the United States must ensure the integrity of their financial reporting data. Access controls on financial data, an audit trail of who changed what, and segregation of duties are required. This applies to any data platform that feeds financial reports.

Cross-Industry — ISO 27001

The international standard for information security management systems (ISMS). Many enterprises require ISO 27001 certification from their vendors, or require that the platforms they deploy fit within their own certified controls. It covers access control, encryption, monitoring, and incident response.

The Multi-Engine Security Challenge

Modern data platforms often run multiple query engines: a fast OLAP engine for dashboards, a SQL engine for ad-hoc queries, a distributed engine for federation, a big data engine for ETL. Each engine was built by a different project, with a different security model.

When a deployment runs multiple data tools without a shared policy layer, security must be maintained separately in each one:

Define a row-level policy in StarRocks — does it apply in Trino? No. Different policy engine.
Create a role in Impala — does it exist in Spark? No. Separate user management.
Mask a column in one engine — a user queries the same data through another engine and sees the unmasked value.
Audit trail? Four separate logs in four different formats.

The result: security policies inevitably drift. Engine A has the updated policy, Engine B still has last month's version. A compliance auditor finds the gap.
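
Drift of this kind can be detected mechanically by comparing a fingerprint of each engine's effective policy. The sketch below hashes a canonical JSON form of each policy and groups engines that agree. The policy structure and both helpers are hypothetical; real engines store policies in their own formats, which would first need to be normalized.

```python
import hashlib
import json


def policy_fingerprint(policy):
    """Hash a canonical JSON form so key order does not affect the result."""
    canonical = json.dumps(policy, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_drift(policies_by_engine):
    """Group engines by policy fingerprint; more than one group means drift."""
    groups = {}
    for engine, policy in policies_by_engine.items():
        groups.setdefault(policy_fingerprint(policy), []).append(engine)
    return sorted(sorted(engines) for engines in groups.values())


# Hypothetical policies: Impala still carries an older version.
POLICIES = {
    "starrocks": {"table": "customers", "mask": ["ssn"], "version": 2},
    "trino": {"mask": ["ssn"], "table": "customers", "version": 2},
    "impala": {"table": "customers", "mask": [], "version": 1},
}

print(find_drift(POLICIES))  # [['impala'], ['starrocks', 'trino']]
```

A single group in the result means every engine enforces the same policy; anything else is the gap an auditor would find.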

Consider a building with four entrances, each managed by a different security company. They each have their own badge system, their own access lists, their own visitor logs. Somebody gets terminated? You have to call all four companies to revoke access. Miss one, and you have a breach. That's what multi-engine security looks like without a unified policy layer.

Unified Security Architecture

The key architectural principle for securing a multi-engine data platform is one security policy, enforced across all engines. Define a policy once and it is automatically applied everywhere.

A robust implementation has three layers that work together:

Layer 1 — Authentication: Keycloak

Keycloak serves as the central identity provider. It handles login, SSO, and multi-factor authentication, and it integrates with whatever the organization already uses: Active Directory, LDAP, OIDC, or SAML. One login gives access to the entire platform. No separate credentials per engine.

Layer 2 — Authorization: Apache Ranger

Apache Ranger is the central policy engine. All access control policies — table access, row-level security, column masking — are defined once in Ranger's UI and enforced everywhere.

Layer 3 — Engine Integration: Profile Parsers and Policy Mappers

Open-source Ranger doesn't natively understand every query engine. The gap is bridged with custom components:

Component	What It Does	Why It Matters
Profile Parsers	Read engine-specific settings and configuration, convert them into Ranger-compatible policy definitions	Ranger always has an accurate picture of what resources exist in each engine
Policy Mappers	Convert Ranger policies into engine-specific ACLs for StarRocks, Impala, Trino, and Spark	One policy in Ranger = four engine-specific enforcement points. No drift, no gaps.
Keycloak–Ranger User Sync	Automatically syncs users, groups, and roles from Keycloak to Ranger	New user in Active Directory automatically appears in Ranger with correct group membership. No manual sync.
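
The fan-out performed by a policy mapper can be sketched as follows. The source does not describe the mappers' actual output formats, so this sketch emits one neutral record per engine rather than real engine syntax. The `RowFilterPolicy` class, the `map_policy` helper, and the filter expression are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RowFilterPolicy:
    """One centrally defined row-filter policy (hypothetical shape)."""
    table: str
    group: str
    condition: str


ENGINES = ("starrocks", "impala", "trino", "spark")


def map_policy(policy, engines=ENGINES):
    """Fan one central policy out into one enforcement record per engine."""
    return {
        engine: {
            "engine": engine,
            "resource": policy.table,
            "principal": policy.group,
            "row_filter": policy.condition,
        }
        for engine in engines
    }


policy = RowFilterPolicy("sales.orders", "regional_managers", "city = 'Springfield'")
records = map_policy(policy)
print(sorted(records))  # ['impala', 'spark', 'starrocks', 'trino']
```

Because every record derives from the same policy object, the engines cannot disagree unless a mapper run is skipped, which is the property the table above describes as "no drift, no gaps."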

Encryption and Certificates

TLS everywhere via Kubernetes cert-manager. Certificates are automatically generated, rotated, and renewed — no manual certificate management. mTLS between internal services means even internal network traffic is encrypted.

Unified Audit Trail

Every query from every engine flows to a single destination — one place to search, one place to export for auditors. A query like "show me every access to the customers table in the last 90 days across all engines" returns a single, coherent result.
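
One way to picture the single destination is as a merge of per-engine event streams into one time-ordered timeline. The sketch below assumes each engine's log is already sorted by time and already uses a common event shape, which in practice requires a normalization step not shown here. The field names and the `unified_timeline` helper are hypothetical.

```python
import heapq


def unified_timeline(logs_by_engine):
    """Merge per-engine logs, each already sorted by time, into one ordered list."""
    tagged_streams = [
        [dict(event, engine=engine) for event in events]
        for engine, events in logs_by_engine.items()
    ]
    return list(heapq.merge(*tagged_streams, key=lambda event: event["time"]))


# Hypothetical events. Timestamps are ISO 8601 UTC strings in one fixed
# format, so comparing them as text gives chronological order.
LOGS = {
    "trino": [{"time": "2024-06-01T09:00:00Z", "user": "alice", "table": "customers"}],
    "spark": [
        {"time": "2024-06-01T08:30:00Z", "user": "etl_job", "table": "customers"},
        {"time": "2024-06-01T10:15:00Z", "user": "etl_job", "table": "orders"},
    ],
}

timeline = unified_timeline(LOGS)
print([event["engine"] for event in timeline])  # ['spark', 'trino', 'spark']
```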

Air-Gap by Default

A properly architected on-premise deployment requires zero outbound network access. No license servers to phone home to. No telemetry. No cloud dependencies. This isn't a special configuration — it should be the default deployment model.

Security Capability Comparison

How common data platform options compare on security capabilities:

Capability	Unified Kubernetes Platform (e.g., Alphyn)	Oracle	Teradata	Snowflake	Cloudera	Starburst
Unified Auth Across Engines	Yes — single IdP + policy engine across 4 engines	Single engine only	Single engine only	Single engine only	Ranger, but per-service config	Trino only
RBAC	Yes — centralized	Yes	Yes	Yes	Yes	Yes
Row-Level Security	Yes — unified across engines	Yes (VPD)	Yes	Yes	Yes — Ranger	Yes
Column Masking	Yes — unified across engines	Yes (Redaction)	Limited	Yes	Yes — Ranger	Yes
Encryption at Rest	Yes	Yes (TDE)	Yes	Yes	Yes	Yes
Encryption in Transit	Yes — TLS/mTLS everywhere, auto-rotating certs	Yes	Yes	Yes	Configurable, not default	Yes
Audit Trail	Yes — unified, single destination	Yes (Unified Audit)	Yes	Yes	Per-service logs	Query log only
Air-Gap Deployment	Yes — default model	Yes — very high cost	Yes — very high cost	No — SaaS only	Partial — some components need connectivity	Possible, not default
SSO / OIDC	Yes — native	Yes	Yes	Yes	Yes	Yes
Data Sovereignty (On-Prem)	Yes — Kubernetes on-prem	Yes — appliance on-prem	Yes — appliance on-prem	No — cloud regions only	Yes — on-prem	On-prem option available

The key observation: most vendors can check individual security boxes. The differentiator for multi-engine platforms is unified enforcement — one policy, one audit trail, one identity provider. When a deployment uses multiple tools managed independently, it has multiple security silos.

Air-Gap Deployments

Air-gap is often the first filter in government, defense, critical infrastructure, and banking in certain jurisdictions. If a platform can't operate air-gapped, it is out of the conversation before it starts.

Who requires air-gap:

Government and defense — classified systems, national security data
Critical infrastructure — power grids, transportation, water systems
Banking — in regulated jurisdictions (GCC, parts of Asia, Central Asia), core banking data platforms must be air-gapped
Healthcare — some hospital networks and research systems

What air-gap means for vendor selection:

Vendor	Air-Gap Capability	Reality
Snowflake	Impossible	SaaS-only. No on-premise deployment option. Eliminated immediately.
Databricks	Impossible	Cloud-native SaaS. No air-gap path. Eliminated immediately.
On-Prem Kubernetes Platforms	Default	Helm charts + private container registry. Standard Kubernetes air-gap pattern. No license server call-home. Air-gap is the default, not an add-on.
Oracle Exadata	Possible, expensive	On-prem appliance can be air-gapped. But proprietary hardware, massive cost, and Oracle licensing in air-gap require special arrangements.
Teradata VantageCore	Possible, expensive	On-prem hardware appliance can be air-gapped. Same story as Oracle — very high cost, proprietary infrastructure.
Cloudera CDP Private	Partial	Can run on-prem, but some components expect internet access for updates, license validation, or management console connectivity. Not air-gap by default.
Starburst / Dremio	Possible, not default	Can be deployed on-prem on Kubernetes. Air-gap is possible but not the designed-for deployment model. License management may need special handling.

A Kubernetes-based platform doesn't need special "air-gap mode" because there's nothing to turn off. There is no telemetry to disable, no license server to redirect, no cloud dependency to work around. It's just Kubernetes, Helm charts, and container images. If you can run Kubernetes, you can run it — air-gapped or not.
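
A simple pre-deployment check for this pattern is to confirm that every container image reference points at the private registry. The sketch below does that with a prefix check. The registry hostname, the image names, and the `external_images` helper are hypothetical.

```python
def external_images(images, private_registry):
    """Return image references that are not hosted on the private registry."""
    prefix = private_registry.rstrip("/") + "/"
    return [image for image in images if not image.startswith(prefix)]


# Hypothetical registry and image references.
REGISTRY = "registry.internal.example:5000"
IMAGES = [
    "registry.internal.example:5000/platform/trino:1.0",
    "registry.internal.example:5000/platform/ranger:1.0",
    "docker.io/library/busybox:1.36",
]

print(external_images(IMAGES, REGISTRY))  # ['docker.io/library/busybox:1.36']
```

An empty result is a necessary condition for running offline, though not a sufficient one: the images themselves must also make no outbound calls at runtime.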

Key Questions for Evaluating Any Data Platform on Security

When evaluating a data platform's security posture, these questions surface the most important gaps:

On data sovereignty and air-gap: Does the organization have air-gap or data sovereignty requirements? Are there specific jurisdictions where data must physically reside? This immediately filters the competitive field: SaaS vendors are eliminated the moment the answer is yes.

On policy consistency across tools: How are security policies managed across different data tools? If an access policy changes in one tool, how does that change propagate to other tools? Most environments will reveal that policies are managed separately per tool — the multi-engine drift problem described above.

On compliance history: When was the last compliance audit? Were there findings related to data access controls? Audit findings are pain points with remediation budget attached. A unified audit trail and consistent policy enforcement directly address the most common findings.

On data segmentation: Do different teams need to see different subsets of the same data? How is that enforced today? Most organizations handle this through application logic or manual views — fragile, error-prone, and hard to audit. Policy-driven, engine-agnostic row-level security is the robust alternative.