Arrow + DataFusion¶

Crate: crates/datafusion · Binary: datapress-datafusion

DataPress holds the dataset as Arrow RecordBatches in process memory and queries it with Apache DataFusion.

Highlights¶

In-memory columnar. Data is loaded once into Arrow chunks; every request operates on resident memory — no parquet read on the hot path.
Equality index. A per-column value → [row ids] map (see Configuration › Indexing) backs O(1) resolution of eq / in predicates. Combined predicates on multiple indexed columns merge sorted row-id lists without touching DataFusion.
Arrow IPC. Paged /query?format=arrow responses and full /query/stream exports write Arrow batches into the HTTP response stream. Resident no-filter streams can reuse existing batches directly; SQL fallback paths may still collect DataFusion execution batches before DataPress encodes them.
Lazy mode. lazy = true streams the dataset from disk / S3 instead of materialising it into RAM. Parquet registers a ListingTable; delta registers deltalake's own DataFusion provider (transaction log read once for the file list, then row groups stream per query). DataFusion handles projection & predicate pushdown for datasets too big to materialise.
Delta tables. kind = "delta" reads the table (local or S3) via the deltalake crate. By default it materialises into the resident Arrow chunks; with lazy = true it is streamed via the deltalake DataFusion provider instead.
Hot reload. POST /api/v1/datasets/{name}/reload swaps the resident chunks atomically using an ArcSwap double buffer; queries in flight see the old data, queries arriving after the swap see the new. See Operations › Dataset reload.

Trade-offs¶

Startup cost. Materialising terabytes of parquet at boot is expensive in both time and RAM. Use lazy = true or mode = "none" for those cases.
RAM-bound. Non-lazy datasets must fit in process memory (including index maps).
Wide-schema indexing. Auto-indexing 200+ columns concurrently can blow up memory — switch to mode = "list".

When to pick DataFusion¶

You need sub-millisecond eq / in point lookups on indexed columns (dashboards, search-as-you-type).
The dataset fits in RAM (or you're happy to run lazy mode).
You want zero-copy Arrow all the way out to the client.
You need atomic hot reload of a dataset without dropping in-flight queries.

When to skip DataFusion¶

The dataset is too large for RAM and you don't want the lazy-mode trade-offs.
You need DuckDB-specific SQL features.
Startup time / memory footprint of the index matter more than point-lookup latency.