Skip to content

schubergphilis/sqe

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1,182 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

SQE: Sovereign Query Engine

License

An Iceberg-first SQL server that scales. Run SQE embedded as one binary on your laptop, or distributed across a cluster of stateless workers behind an Arrow Flight SQL or Trino HTTP endpoint. Same SQL surface, same Iceberg semantics, same identity model.

SQE is a Rust-based SQL query engine for Apache Iceberg tables, built on DataFusion 53.1 and iceberg-rust. Every query runs as the authenticated user. No service account. No shared root.

# Point SQE at AWS S3 Tables (managed Iceberg) and run it as a SQL server.
cat > sqe.toml <<'EOF'
[catalog]
type             = "s3tables"
table_bucket_arn = "arn:aws:s3tables:us-east-1:ACCOUNT:bucket/sales"
EOF

cargo run --release --bin sqe-coordinator -- sqe.toml &
cargo run --bin sqe-cli -- --host localhost --port 50051

# Iceberg time travel + manifest-derived stats + per-query identity.
sqe> SELECT customer_id, sum(amount)
  ...> FROM s3tables.sales.orders FOR TIMESTAMP AS OF '2026-04-01'
  ...> WHERE region = 'EU' GROUP BY customer_id;

# Snapshot history straight from the metadata.
sqe> SELECT snapshot_id, committed_at FROM s3tables.sales."orders$snapshots";

Why it is cool

Top-five on the public Iceberg matrix. 167/189. 88.4%. Only non-Spark engine in the top five. Per-cell breakdown in docs/iceberg-matrix.md; side-by-side against 20 other engines in docs/iceberg-matrix-compare.md.

Wins six of seven benchmark suites against Trino 465 at SF1. TPC-H, SSB, TPC-DS, TPC-C, TPC-E, TPC-BB, ClickBench. 222 of 222 queries pass. Tables and method below.

One binary scales from CLI to cluster. sqe-cli --embedded is a DuckDB-class single-process engine with the same SQL surface as the distributed coordinator. Persistent SQLite-backed Iceberg catalogs at ~/.sqe/warehouse/ survive restarts. Cross-catalog joins across multiple --catalog NAME=PATH mounts, plus runtime mounts via SQL ATTACH against any of the six supported backends (REST, Glue, S3 Tables, HMS, JDBC, SQLite).

Multi-catalog and multi-cloud, in one engine. Apache Polaris, Project Nessie, Unity Catalog OSS, AWS Glue (native SDK), AWS S3 Tables (native SDK), Hive Metastore, JDBC (Postgres, MySQL, SQLite), and Hadoop storage-only. Object stores: S3 (with endpoint override for Ceph, R2, Garage, MinIO), Azure ADLS, GCS, local filesystem, HuggingFace hf://.

Identity flows end to end. OIDC password grant. The user's bearer token is passed through to Polaris and S3 on every query. Row filters and column masks via OPA or Cedar are enforced at the LogicalPlan layer before the optimizer touches it. No information leakage. PostgreSQL-style RLS semantics.

Lineage shipped. Coordinator emits OpenLineage 2-0-2 events with column-level lineage on writes. File and HTTP sinks. Disk-spool fallback for collector outages. Off by default. docs/book/src/operations/openlineage.md.

How SQE differs

SQE Trino DuckDB
Embedded mode (one binary, no cluster) yes no yes
Distributed mode (coordinator + workers) yes yes no
Iceberg V2 + V3 read + write native V2 + partial V3 extension, read-only
Per-query OIDC bearer passthrough yes service account only n/a (single-tenant)
OPA / Cedar policy at LogicalPlan yes no no
Multi-catalog in one engine 7 backends one at a time per-extension
Wire protocols Arrow Flight SQL + Trino HTTP Trino HTTP extension
Runtime Rust binary, no JVM JVM C++ binary
Cold start sub-second tens of seconds sub-second
OpenLineage emitter native, column-level plugin no

Two longer comparison docs trace the lineage of these positions:

Performance receipts (SF1, vs Trino 465)

Suite SQE Trino Speedup Pass
TPC-H (22) 19.3s 26.6s 2.3x 22/22
SSB (13) 7.6s 8.3s 1.1x 13/13
TPC-DS (99) 57.1s 39.7s 1.4x slower 93/99
TPC-C (8 read) 0.45s 3.4s 9.6x 8/8
TPC-E (11) 10.4s 138.8s 7.8x 11/11
TPC-BB (10) 36.9s 323.6s 5.5x 10/10
ClickBench (43) 1.7s 6.3s 4.6x 43/43

SQE wins six of seven suites. The TPC-DS regression is concentrated in TPC-DS Q72, where DataFusion's lack of full CBO with NDV statistics costs SQE 13x against Trino. The story of that one query is docs/blog/2026-04-16-our-nemesis-q72.md. The May 2026 numbers reflect manifest-derived column statistics flowing into DataFusion's optimizer for the first time, plus Path B+B-2 runtime-filter pushdown (docs/features/runtime-filter-pushdown.md).

Run your own:

BENCH_SCALE=1 ./scripts/benchmark-test.sh --compare-trino tpch tpcds ssb clickbench

Architecture

Client (JDBC / Flight SQL / HTTP)
        |
        v
   +-----------+     OIDC Provider
   |Coordinator|<-- (Keycloak, Auth0,
   |           |     Okta, or any IdP)
   | DataFusion|
   |  + Policy |---> OPA / Cedar
   +-----+-----+
         | Bearer token passthrough
    +----+----+
    v    v    v
   +---++---++---+   Stateless workers
   | W1|| W2|| W3|  (distributed mode)
   +---++---++---+
    |     |     |
    v     v     v
   +-----------+
   |  Polaris  |---> S3-compatible storage
   |REST Catalog|
   +-----------+

Detailed Mermaid diagrams (query pipeline, crate dependencies, caching layers, distributed execution, write path) in docs/architecture.md.

Get started

Five-minute walkthrough covering all seven catalog backends with sample TOML and verification queries: QUICKSTART.md.

Embedded mode (one binary, no cluster)

cargo install --path crates/sqe-cli
sqe-cli --embedded                  # persistent warehouse at ~/.sqe/warehouse/

sqe> SELECT * FROM '/data/sales.parquet' LIMIT 5;
sqe> SELECT * FROM read_csv('s3://bucket/orders.tsv.gz');
sqe> SELECT * FROM 'hf://datasets/squad/plain_text/train-00000-of-00001.parquet' LIMIT 5;
sqe> SELECT * FROM read_delta('/data/delta/sales', version => '5');

sqe> CREATE SECRET partner (TYPE bearer, TOKEN 'eyJ...');
sqe> ATTACH 'http://catalog.example.com/api/catalog' AS partner_cat
       (TYPE iceberg_rest, WAREHOUSE 'analytics', SECRET partner);
sqe> SELECT * FROM partner_cat.sales.orders LIMIT 10;

Full embedded reference: docs/cli-embedded.md. Runtime ATTACH / SECRET reference: docs/book/src/operations/catalogs.md.

Cluster mode (Polaris + S3 + SQE locally)

docker compose -f docker-compose.test.yml up -d
./scripts/bootstrap-test.sh
cargo run --release --bin sqe-coordinator -- tests/sqe-test.toml

# Connect with the CLI
cargo run --bin sqe-cli -- --host localhost --port 50051 --username root --protocol flight

sqe> SHOW CATALOGS;
sqe> SELECT * FROM test_warehouse.default.my_table LIMIT 10;

Same binary against external infrastructure (Glue, S3 Tables, HMS, JDBC, Hadoop): see QUICKSTART.md and docs/catalogs.md.

Docker, Kubernetes, TLS, and auth provider setup: docs/deployment.md.

Documentation

The reference docs:

Doc What
Architecture Mermaid diagrams across the engine
Deployment Docker Compose, K8s, TLS, auth providers, monitoring
Iceberg Matrix Per-cell SQE coverage on the public scoreboard
Iceberg Matrix Comparison V2/V3 side-by-side against 20 engines
Trino Compatibility SQL function and feature matrix vs Trino
DuckDB Comparison Symmetry between SQE and DuckDB on the embedded side
Embedded CLI Reference All flags, dot-commands, TVFs, catalog backends, storage backends
SQL Feature Comparison SQE vs Trino vs Spark SQL vs DuckDB across windows, aggregates, DML, Iceberg, file-format TVFs
SQL Reference (book) Every function, statement, operator, TVF, CALL procedure, GRANT extension, with origin and Trino / Snowflake / Spark / DuckDB alias columns
Catalog Backends Per-backend TOML, credentials, verification queries
Storage Backends S3, R2, MinIO/Ceph, Azure ADLS Gen2, Google Cloud Storage, HTTPS, hf://
Operations: OpenLineage Lineage emit, sinks, troubleshooting
Benchmark history Per-suite, per-scale, per-query plots over time
Roadmap Full feature checklist
Security Audit 43 findings, all resolved

The book

SQE's design and development journey is documented in "Sovereign by Design: Building a Production Query Engine on DataFusion".

Twenty chapters across five parts. Roughly 370 pages. The story of choosing DataFusion, surviving the Iceberg fork rebase, lifting the matrix from 31% to 88%, building the embedded mode that turned out to be a DuckDB-shaped surprise, and shipping column-level lineage. Source in docs/ebook/. Build with cd docs/ebook && make.

A few chapters worth reading first:

Blog

Engineering posts that double as design rationale:

Post Topic
Why We Replaced Trino with Rust The decision to build SQE
Five Layers of Caching and an 8.8x Speedup Caching strategy across the stack
Security Hardening: 43 Findings Production audit
DataFusion 53 and the Iceberg Fork DF 53 upgrade and vendoring decision
Our Nemesis: TPC-DS Q72 The one query we cannot beat
Why a Public Iceberg Matrix Beats Vendor Spec Sheets A scoreboard for the lakehouse
SQE Talks to Five Catalogs Now The live verification phase plus AWS SigV4
How We Accidentally Created a DuckDB V8 to V12: file-format TVFs, hf://, Delta, smarter read_csv
Shipping OpenLineage Column-level lineage from idea to merged MR

Full archive in docs/blog/.

Crate structure

Crate Purpose
sqe-core Shared types, config (TOML), errors
sqe-sql SQL parser, statement classifier, GRANT/REVOKE
sqe-auth Pluggable auth chain (10 providers), token cache
sqe-catalog Iceberg REST client, caching, scan execution
sqe-policy Policy enforcement (passthrough, OPA)
sqe-planner Plan splitting, star-schema reorder, join strategy
sqe-coordinator Flight SQL server, query handler, Trino HTTP
sqe-worker Stateless DataFusion executor
sqe-cli Interactive SQL client (cluster + embedded modes)
sqe-metrics Prometheus, OpenTelemetry, audit logger
sqe-lineage OpenLineage 2-0-2 emitter; column-level lineage
sqe-trino-compat Trino wire protocol
sqe-bench Benchmark suite (7 suites, 222 queries)

Tech stack

Component Technology
Language Rust
Query Engine Apache DataFusion 53.1
Table Format Apache Iceberg V2 / V3
Catalogs Polaris, Nessie, Unity Catalog OSS, AWS Glue, AWS S3 Tables, Hive Metastore, JDBC, Hadoop
Wire Protocols Arrow Flight SQL + Trino HTTP
Storage S3, Ceph, R2, ADLS Gen2, GCS, local filesystem, HuggingFace hf://
Observability OpenTelemetry, Prometheus, OpenLineage 2-0-2
License Apache 2.0

Contributing

Issues, pull requests, and how to run tests: CONTRIBUTING.md.

License

Apache License 2.0. See LICENSE.

About

Sovereign Query Engine

Resources

License

Contributing

Security policy

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages