Dataset reload¶
POST /api/v1/datasets/{name}/reload rebuilds a configured dataset from
its existing source and publishes the new contents without a service
restart. The endpoint is admin-only: it requires X-Admin-Token to match
ADMIN_TOKEN, or a bearer token with the configured reload scope when
OIDC auth is enabled.
curl -s -X POST \
-H "X-Admin-Token: $ADMIN_TOKEN" \
http://localhost:8080/api/v1/datasets/accidents/reload | jq
# { "dataset": "accidents", "rows": 7728394, "elapsed_ms": 1842 }
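The admin check described above can be sketched as a small predicate. This is a minimal illustration, not DataPress's actual handler: `Credentials`, `AuthConfig`, and `may_reload` are hypothetical names, and the bearer token is assumed to have been validated already so that only its scopes remain to be checked.

```rust
/// Hypothetical view of the credentials carried by a reload request.
pub struct Credentials<'a> {
    /// Value of the X-Admin-Token header, if the client sent one.
    pub admin_header: Option<&'a str>,
    /// Scopes of an already-validated bearer token, if one was presented.
    pub bearer_scopes: Option<Vec<&'a str>>,
}

/// Hypothetical auth settings; field names are illustrative.
pub struct AuthConfig<'a> {
    pub admin_token: &'a str,
    pub oidc_enabled: bool,
    pub reload_scope: &'a str,
}

/// Returns true if either admin path described in the text is satisfied.
pub fn may_reload(cfg: &AuthConfig, creds: &Credentials) -> bool {
    // Path 1: X-Admin-Token matches ADMIN_TOKEN.
    let header_ok = creds
        .admin_header
        .map_or(false, |token| token == cfg.admin_token);
    // Path 2: OIDC is enabled and the bearer token carries the reload scope.
    let scope_ok = cfg.oidc_enabled
        && creds
            .bearer_scopes
            .as_ref()
            .map_or(false, |scopes| scopes.iter().any(|s| *s == cfg.reload_scope));
    header_ok || scope_ok
}
```

The plain `==` comparison keeps the sketch short; a real service would likely compare tokens in constant time.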
Reloads are serialized per dataset name, so two reloads of accidents
queue behind each other. Reloads of different datasets may run in
parallel. If a reload fails, the previously published dataset stays live.
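One way to get this behavior is a lock per dataset name, with publication happening only after a successful build. The sketch below assumes that design; `ReloadLocks` and its methods are hypothetical names, and a plain `Mutex<HashMap<..>>` stands in for whatever structure holds the published datasets.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical registry holding one reload lock per dataset name.
#[derive(Default)]
pub struct ReloadLocks {
    locks: Mutex<HashMap<String, Arc<Mutex<()>>>>,
}

impl ReloadLocks {
    /// Returns the lock for `name`, creating it on first use.
    /// The registry lock is held only for this lookup.
    fn lock_for(&self, name: &str) -> Arc<Mutex<()>> {
        let mut map = self.locks.lock().unwrap();
        map.entry(name.to_string()).or_default().clone()
    }

    /// Runs `build` while holding the lock for `name`, then publishes the result.
    /// Reloads of the same name wait for each other; other names are not blocked.
    pub fn reload<T, E>(
        &self,
        name: &str,
        published: &Mutex<HashMap<String, Arc<T>>>,
        build: impl FnOnce() -> Result<T, E>,
    ) -> Result<(), E> {
        let lock = self.lock_for(name);
        let _guard = lock.lock().unwrap();
        // On error we return early, so `published` still holds the old value.
        let fresh = build()?;
        published
            .lock()
            .unwrap()
            .insert(name.to_string(), Arc::new(fresh));
        Ok(())
    }
}
```

Because the published map is touched only after `build` succeeds, a failed reload leaves the previous entry in place.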
DataFusion backend¶
For materialized DataFusion datasets, reload uses a service-level double-buffer:
- The backend reads the dataset source and builds a fresh DatasetState off to the side, including Arrow RecordBatch chunks and any equality indexes.
- The new provider is registered in the shared SessionContext.
- An ArcSwap publication step swaps the dataset snapshot map.
- Requests that already captured the old Arc<DatasetState> keep running against it; new requests see the new state.
- The old Arrow buffers are freed once the last in-flight request drops its reference.
This gives zero-downtime publication and failure safety, at a memory cost: while a reload is building, the old and new copies of the dataset coexist. For large materialized datasets, peak RSS can approach twice the dataset size plus index overhead. Lazy DataFusion datasets avoid resident Arrow buffers, but they still re-register their table provider and publish new metadata through the same snapshot mechanism.
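The snapshot-swap idea can be illustrated with the standard library alone. In this sketch a `RwLock<Arc<..>>` stands in for ArcSwap so that the example has no external dependencies, `DatasetState` is reduced to a vector of rows, and `Published` is a hypothetical name. It is not the DataPress implementation and leaves out the SessionContext registration step.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Illustrative stand-in for the real DatasetState; only row data is modeled.
pub struct DatasetState {
    pub rows: Vec<u64>,
}

type Snapshot = HashMap<String, Arc<DatasetState>>;

/// Holds the currently published snapshot map.
pub struct Published {
    current: RwLock<Arc<Snapshot>>,
}

impl Published {
    pub fn new() -> Self {
        Self { current: RwLock::new(Arc::new(Snapshot::new())) }
    }

    /// A request captures an Arc to one dataset and keeps it until it finishes.
    pub fn get(&self, name: &str) -> Option<Arc<DatasetState>> {
        self.current.read().unwrap().get(name).cloned()
    }

    /// Publishes a state that was built off to the side.
    /// Only Arc pointers are copied into the new map, not dataset contents.
    pub fn publish(&self, name: &str, fresh: DatasetState) {
        let mut slot = self.current.write().unwrap();
        let mut next: Snapshot = (**slot).clone();
        next.insert(name.to_string(), Arc::new(fresh));
        *slot = Arc::new(next);
    }
}
```

A request that called `get` before `publish` keeps reading its old `Arc<DatasetState>`; the old state is dropped when the last such reference goes away, which is why both copies are resident for a time.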
DuckDB backend¶
DuckDB does not need DataPress to hold a second full Arrow copy of the dataset. Reload is delegated to DuckDB as an ACID transaction with:
or the equivalent scan for Delta/S3 sources. DuckDB executes that as a transactional catalog/table replacement: if the source read or table creation fails, the existing table remains available; if it succeeds, the replacement becomes visible atomically to later queries. In-flight queries continue against the snapshot they started with, using DuckDB's own transaction and MVCC semantics.
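One statement shape that fits this description is DuckDB's `CREATE OR REPLACE TABLE ... AS SELECT`. The sketch below builds such a statement for a Parquet source. It is an assumption made for illustration, not necessarily the statement DataPress issues; `reload_sql` and the two quoting helpers are hypothetical names.

```rust
/// Quotes a SQL identifier by wrapping it in double quotes and doubling any inside.
fn quote_ident(name: &str) -> String {
    format!("\"{}\"", name.replace('"', "\"\""))
}

/// Quotes a SQL string literal by wrapping it in single quotes and doubling any inside.
fn quote_literal(value: &str) -> String {
    format!("'{}'", value.replace('\'', "''"))
}

/// Builds an assumed table-replacement statement for a Parquet source.
/// The statement text is an illustration, not taken from DataPress.
pub fn reload_sql(table: &str, parquet_path: &str) -> String {
    format!(
        "CREATE OR REPLACE TABLE {} AS SELECT * FROM read_parquet({})",
        quote_ident(table),
        quote_literal(parquet_path)
    )
}
```

Quoting the table name and path keeps the generated statement well-formed when either contains quote characters.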
After DuckDB publishes the replacement table, DataPress refreshes the
cached schema and row count used by /schema and /api/v1/datasets.
Those metadata maps are small and are swapped under short-lived Rust
locks. The heavy data path is handled by DuckDB rather than by a
DataPress-owned double buffer.
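The metadata refresh can be pictured as a short write-locked update of a small map. This sketch assumes a single map keyed by dataset name, although the text above mentions more than one map; `DatasetMeta` and `MetaCache` are hypothetical names.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

/// Illustrative cached metadata for one dataset.
#[derive(Clone, Debug, PartialEq)]
pub struct DatasetMeta {
    pub columns: Vec<String>,
    pub row_count: u64,
}

/// Hypothetical cache backing the schema and dataset-listing endpoints.
#[derive(Default)]
pub struct MetaCache {
    by_name: RwLock<HashMap<String, DatasetMeta>>,
}

impl MetaCache {
    /// Replaces the cached entry; the write lock is held only for the insert.
    pub fn refresh(&self, name: &str, fresh: DatasetMeta) {
        self.by_name.write().unwrap().insert(name.to_string(), fresh);
    }

    /// Returns a copy of the cached entry so the read lock is released immediately.
    pub fn get(&self, name: &str) -> Option<DatasetMeta> {
        self.by_name.read().unwrap().get(name).cloned()
    }
}
```

The new metadata is computed before `refresh` is called, so the lock covers only a map insert and readers are blocked very briefly.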
The practical result at the HTTP API level is similar: clients see either the old dataset or the new dataset, never a partially loaded one. The resource profile differs: DuckDB relies on its own engine and buffer manager, whereas materialized DataFusion temporarily keeps both the old and new Arrow data resident in process memory.