DuckDB

Crate: crates/duckdb · Binary: datapress-duckdb

DataPress wraps the bundled DuckDB library (libduckdb-sys) and exposes its query engine over the standard HTTP API.

Highlights

Battle-tested SQL. Full SQL surface, mature optimiser, robust type coercion, well-understood NULL semantics.
Eager or lazy, your choice. By default each dataset is materialised into an in-memory DuckDB table at startup. Set lazy = true to register the dataset as a view instead: the view streams parquet on demand from local files or S3 URLs, with predicate and projection pushdown into DuckDB's parquet reader and no resident copy.
httpfs + delta. DuckDB autoloads httpfs and delta extensions when the dataset URL requires them.
Arrow IPC. Paged /query?format=arrow responses and full /query/stream exports write DuckDB's native query_arrow batches into the HTTP response stream; no JSON round-trip on the server side.
Experimental Quack server. Opt into DuckDB's Quack remote protocol with [server.quack] to let DuckDB clients attach to the same in-process database over quack:localhost.
Transactional reload. Dataset reload uses DuckDB's ACID transaction path (CREATE OR REPLACE TABLE ... AS SELECT ...), so failed reloads leave the existing table live. See Operations › Dataset reload.
Small binary. No DataFusion plan trees, no in-memory chunk store, no equality index — just DuckDB.
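
The eager and lazy modes in the list above differ mainly in which object DuckDB registers for a dataset. The sketch below is an assumption about the shape of those statements, not DataPress's actual code: registration_sql and the two quoting helpers are hypothetical names, and only the CREATE OR REPLACE TABLE ... AS SELECT ... form is confirmed by this page.

```python
def quote_ident(name: str) -> str:
    """Quote a SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def quote_literal(value: str) -> str:
    """Quote a SQL string literal, doubling any embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def registration_sql(name: str, location: str, lazy: bool = False) -> str:
    """Build the statement that registers one parquet dataset (hypothetical helper).

    Eager (default): a table, so rows are copied into DuckDB once.
    Lazy: a view, so every query scans the parquet source again.
    """
    kind = "VIEW" if lazy else "TABLE"
    return (
        f"CREATE OR REPLACE {kind} {quote_ident(name)} AS "
        f"SELECT * FROM read_parquet({quote_literal(location)})"
    )


# Eager: materialised at startup and again on reload.
print(registration_sql("accidents", "data/accidents/*.parquet"))
# Lazy: nothing is read until a query touches the view.
print(registration_sql("accidents", "s3://bucket/accidents/*.parquet", lazy=True))
```

Because the eager statement is a single CREATE OR REPLACE, a reload that fails partway through leaves the previous table in place, which matches the transactional reload behaviour described above.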

Trade-offs

No equality index. Every eq / in predicate runs through DuckDB's SQL optimiser. That's still fast (zone maps, parallel hash join), but the in-memory O(1) row-id lookup the DataFusion backend offers is not available here.
[dataset.index] is ignored. The DataFusion-specific block in datasets.toml doesn't apply.
Lazy datasets are streamed from source via a view, so they trade the resident table's RAM for a parquet scan on each query.

S3 reads

For S3 parquet and Delta datasets, DataPress loads DuckDB's httpfs extension and creates a temporary DuckDB S3 secret for each dataset. The secret is scoped to the dataset bucket, for example s3://bucket. That scope reliably matches single files, globs, partitioned paths, and reloads without leaking credentials across buckets.
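
To illustrate why a bucket-level scope covers all of those cases, here is a minimal sketch that derives the scope from a dataset location. bucket_scope is a hypothetical helper, not a DataPress function, and the example URLs are made up.

```python
from urllib.parse import urlsplit


def bucket_scope(location: str) -> str:
    """Return the bucket-level prefix (s3://bucket) for an S3 URL."""
    parts = urlsplit(location)
    if parts.scheme != "s3" or not parts.netloc:
        raise ValueError(f"not an S3 URL: {location!r}")
    # Everything after the bucket name (keys, globs, partitions) is dropped.
    return f"s3://{parts.netloc}"


for url in (
    "s3://warehouse/exports/2024.parquet",                       # single file
    "s3://warehouse/exports/*.parquet",                          # glob
    "s3://warehouse/exports/year=2024/month=01/part-0.parquet",  # partitioned path
):
    print(bucket_scope(url))
```

All three locations reduce to the same prefix, so one secret per dataset applies to every object the dataset can reach while staying out of other buckets.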

When no inline access_key_id / secret_access_key pair is configured, DataPress asks DuckDB to use the AWS environment/profile chain (env;config). This avoids accidentally probing instance metadata from local or S3-compatible deployments, which can otherwise show up as a confusing 503 from read_parquet.
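
The choice between inline keys and the credential chain can be sketched as follows. credential_options is a hypothetical helper; the option names (KEY_ID, SECRET, PROVIDER, CHAIN) are DuckDB S3 secret parameters, but how DataPress assembles them internally is an assumption, as is the treatment of a half-configured key pair.

```python
from typing import Optional


def credential_options(
    access_key_id: Optional[str] = None,
    secret_access_key: Optional[str] = None,
) -> dict:
    """Pick DuckDB secret options for inline keys or the env/profile chain."""
    if access_key_id and secret_access_key:
        return {"KEY_ID": access_key_id, "SECRET": secret_access_key}
    # No complete inline pair (assumption: a half-configured pair also lands
    # here). Restrict the chain to environment variables and AWS config or
    # profile files so that instance metadata is never probed.
    return {"PROVIDER": "credential_chain", "CHAIN": "env;config"}


print(credential_options())
print(credential_options("example-key-id", "example-secret"))
```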

DataPress config uses addressing_style = "virtual" | "path". For DuckDB, DataPress translates virtual to DuckDB's URL_STYLE 'vhost' and path to URL_STYLE 'path'.

For S3-compatible endpoints such as MinIO, R2, or Wasabi, set endpoint, addressing_style, and allow_http the same way as for the DataFusion backend:

[[dataset]]
name = "warehouse"
source.kind = "parquet"
source.location = "s3://warehouse/exports/*.parquet"

[dataset.s3]
region = "us-east-1"
endpoint = "http://minio.local:9000"
addressing_style = "path"
allow_http = true
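
As a rough illustration of how a [dataset.s3] table like the one above could map onto DuckDB S3 secret options, consider the sketch below. s3_secret_options is a hypothetical helper. The virtual/path translation comes from this page; splitting the endpoint into ENDPOINT plus USE_SSL, and rejecting an http endpoint without allow_http, are assumptions about behaviour this page does not spell out. The endpoint is assumed to include its scheme, as in the config above.

```python
from urllib.parse import urlsplit

# DataPress addressing_style -> DuckDB URL_STYLE
URL_STYLES = {"virtual": "vhost", "path": "path"}


def s3_secret_options(s3: dict) -> dict:
    """Translate a [dataset.s3] table into DuckDB S3 secret options."""
    options = {}
    if "region" in s3:
        options["REGION"] = s3["region"]
    if "endpoint" in s3:
        endpoint = urlsplit(s3["endpoint"])
        if endpoint.scheme == "http" and not s3.get("allow_http", False):
            # Assumed guard: plain HTTP must be opted into explicitly.
            raise ValueError("plain-HTTP endpoint requires allow_http = true")
        # DuckDB takes host[:port] in ENDPOINT; the scheme becomes USE_SSL.
        options["ENDPOINT"] = endpoint.netloc
        options["USE_SSL"] = endpoint.scheme != "http"
    if "addressing_style" in s3:
        options["URL_STYLE"] = URL_STYLES[s3["addressing_style"]]
    return options


print(s3_secret_options({
    "region": "us-east-1",
    "endpoint": "http://minio.local:9000",
    "addressing_style": "path",
    "allow_http": True,
}))
```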

When to pick DuckDB

You want SQL semantics you trust and rich type coverage.
You need fast startup on huge datasets: with lazy = true there is no full scan at boot.
You query datasets that don't fit in RAM (register them with lazy = true so no resident copy is kept).
You want DuckDB-native clients, such as the DuckDB CLI, to attach to the running DataPress process via Quack.
You don't need sub-millisecond point lookups on indexed columns.

Quack remote protocol

Quack is DuckDB's experimental remote protocol. DataPress starts it only when explicitly configured:

[server]
backend = "duckdb"

[server.quack]
enabled = true
uri = "quack:localhost"
token = "analytics-token"
read_only = true

The Quack server starts after DataPress registers datasets, so remote clients can query the same tables as the HTTP API. By default DataPress keeps Quack on localhost and installs a read-only authorization hook. For non-local exposure, set allow_other_hostname = true and place a TLS-terminating reverse proxy in front of the Quack port.

DuckDB CLI example:

INSTALL quack;
LOAD quack;

ATTACH 'quack:localhost' AS datapress (TOKEN 'analytics-token');
FROM datapress.accidents LIMIT 10;

For any host other than localhost, Quack defaults to HTTPS. When the server is reached over plain HTTP (development, or before a TLS proxy is in place), add DISABLE_SSL true:

ATTACH 'quack:remote_ip' AS remote_db (TOKEN 'analytics-token', DISABLE_SSL true);
FROM remote_db.accidents LIMIT 10;
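
When several clients or environments need their own ATTACH statement, it can help to generate them. The sketch below only reproduces the two statement shapes shown above; attach_sql is a hypothetical helper, and it deliberately leaves the DISABLE_SSL decision to the caller instead of guessing from the host name.

```python
def quote_literal(value: str) -> str:
    """Quote a SQL string literal, doubling any embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def attach_sql(host: str, alias: str, token: str, disable_ssl: bool = False) -> str:
    """Build a Quack ATTACH statement in the form used in the examples above."""
    options = [f"TOKEN {quote_literal(token)}"]
    if disable_ssl:
        # Only for servers reached over plain HTTP, e.g. before a TLS proxy exists.
        options.append("DISABLE_SSL true")
    return f"ATTACH {quote_literal('quack:' + host)} AS {alias} ({', '.join(options)});"


print(attach_sql("localhost", "datapress", "analytics-token"))
print(attach_sql("remote_ip", "remote_db", "analytics-token", disable_ssl=True))
```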

When to skip DuckDB

You need sub-millisecond eq / in lookups on indexed columns.
You want zero-copy Arrow access into the resident chunks from in-process Rust (the DataFusion backend uses native RecordBatch).