Arrow + DataFusion¶
Crate: crates/datafusion · Binary: datapress-datafusion
DataPress holds the dataset as Arrow RecordBatches in process memory
and queries it with Apache DataFusion.
Highlights¶
- In-memory columnar. Data is loaded once into Arrow chunks; every request operates on resident memory — no parquet read on the hot path.
- Equality index. A per-column
value → [row ids]map (see Configuration › Indexing) backsO(1)resolution ofeq/inpredicates. Combined predicates on multiple indexed columns merge sorted row-id lists without touching DataFusion. - Arrow IPC. Paged
/query?format=arrowresponses and full/query/streamexports write Arrow batches into the HTTP response stream. Resident no-filter streams can reuse existing batches directly; SQL fallback paths may still collect DataFusion execution batches before DataPress encodes them. - Lazy parquet mode.
lazy = trueregisters aListingTablepointing at parquet files; DataFusion handles projection & predicate pushdown for datasets too big to materialise. - Hot reload.
POST /api/v1/datasets/{name}/reloadswaps the resident chunks atomically using anArcSwapdouble buffer; queries in flight see the old data, queries arriving after the swap see the new. See Operations › Dataset reload.
Trade-offs¶
- Startup cost. Materialising terabytes of parquet at boot is
expensive in both time and RAM. Use
lazy = trueormode = "none"for those cases. - RAM-bound. Non-lazy datasets must fit in process memory (including index maps).
- Wide-schema indexing. Auto-indexing 200+ columns concurrently
can blow up memory — switch to
mode = "list".
When to pick DataFusion¶
- You need sub-millisecond
eq/inpoint lookups on indexed columns (dashboards, search-as-you-type). - The dataset fits in RAM (or you're happy to run lazy mode).
- You want zero-copy Arrow all the way out to the client.
- You need atomic hot reload of a dataset without dropping in-flight queries.
When to skip DataFusion¶
- The dataset is too large for RAM and you don't want the lazy-mode trade-offs.
- You need DuckDB-specific SQL features.
- Startup time / memory footprint of the index matter more than point-lookup latency.