Register datasets at runtime
DataPress can add new datasets to a running server without a restart. Three admin-only endpoints cover the workflow:
| Endpoint | Purpose |
|---|---|
| POST /api/v1/datasets | Register one new dataset from a JSON body. |
| POST /api/v1/datasets/persist | Append a dataset's [[dataset]] block to datasets.toml so it survives a restart. |
| POST /api/v1/config/reload | Re-read datasets.toml and register every newly-added [[dataset]]. |
All three require the same permission as dataset reload: an
X-Admin-Token header matching ADMIN_TOKEN, or a bearer token carrying the
configured reload scope when OIDC auth is enabled.
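The two credential styles can be wrapped in one small helper. This is a minimal sketch, not part of DataPress: `admin_headers` is a hypothetical name, and `DATAPRESS_BEARER_TOKEN` is an assumed environment variable chosen for illustration (only `ADMIN_TOKEN` appears in this page).

```python
import os


def admin_headers(env=None):
    """Build request headers for the admin endpoints.

    Prefers a bearer token (OIDC) when one is supplied, otherwise falls
    back to the shared X-Admin-Token header.
    """
    env = os.environ if env is None else env
    # Assumption: DATAPRESS_BEARER_TOKEN is a made-up variable name for this sketch.
    bearer = env.get("DATAPRESS_BEARER_TOKEN")
    if bearer:
        return {"Authorization": f"Bearer {bearer}"}
    token = env.get("ADMIN_TOKEN")
    if token:
        return {"X-Admin-Token": token}
    raise RuntimeError("no admin credential found in the environment")
```

Passing a plain dict as `env` makes the helper easy to exercise without touching the real environment.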
Datasets registered through POST /api/v1/datasets are held in memory only. They answer queries
immediately, but disappear on restart unless you also persist them (or add
them to datasets.toml and hot-reload).
Register a single dataset
POST /api/v1/datasets takes a JSON body with the same shape as a
[[dataset]] block (see Configuration › Datasets).
The server validates the config, opens the source, and makes the dataset
queryable, then returns a summary of the new dataset.
import os
import requests

resp = requests.post(
    "http://localhost:8080/api/v1/datasets",
    headers={"X-Admin-Token": os.environ["ADMIN_TOKEN"]},
    json={
        "name": "events",
        "source": {"kind": "parquet", "location": "/data/events/*.parquet"},
    },
)
resp.raise_for_status()
print(resp.json())  # {'name': 'events', 'rows': 1240333, 'columns': 12}
The full body accepts every field a [[dataset]] block does:
{
  "name": "events",
  "source": { "kind": "parquet", "location": "s3://lake/events/*.parquet" },
  "columns": ["id", "ts", "kind", "payload"],
  "dict_encode": true,
  "lazy": false,
  "index": { "mode": "auto", "columns": [], "max_cardinality": 100000 },
  "s3": {
    "region": "us-east-1",
    "endpoint": "http://minio.local:9000",
    "addressing_style": "path",
    "allow_http": true,
    "partitioning": "hive"
  }
}
A 400 is returned when the name is already registered, the name uses
characters outside [A-Za-z0-9_.-], or the source cannot be opened
(not found, access denied, empty).
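The name rule is cheap to check on the client before sending a request. The sketch below uses the character set quoted above; `is_valid_dataset_name` is a hypothetical helper, and rejecting empty names is an assumption rather than documented behavior.

```python
import re

# Character set taken from the 400 rule above; requiring at least one
# character is an assumption made for this sketch.
_NAME_RE = re.compile(r"[A-Za-z0-9_.-]+")


def is_valid_dataset_name(name):
    """Return True when name uses only characters from [A-Za-z0-9_.-]."""
    return isinstance(name, str) and _NAME_RE.fullmatch(name) is not None
```

A client-side check like this only saves a round trip. The server remains the authority, and it can still answer 400 for a duplicate name or a source that cannot be opened.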
Persist a dataset to the config file
Registration alone is in-memory. To make a dataset survive a restart, append
its [[dataset]] block to the datasets.toml the server was started from:
curl -s -X POST \
  -H "X-Admin-Token: $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "events",
    "source": { "kind": "parquet", "location": "/data/events/*.parquet" }
  }' \
  http://localhost:8080/api/v1/datasets/persist | jq
# { "persisted": true, "path": "/etc/datapress/datasets.toml" }
The body is the same DatasetConfig shape as POST /datasets. This only
works when the server was started from a config file — a server
configured in-process (for example through the Python bindings)
has no file to append to and returns 400.
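For orientation, here is roughly what a JSON body looks like once written as a [[dataset]] block. This is an illustrative sketch only: the exact text the server appends is not specified on this page, `to_dataset_toml` is a hypothetical helper, and it handles just strings, booleans, integers, flat lists, and one level of nested tables.

```python
import json


def _toml_value(value):
    # bool must be checked before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, int):
        return str(value)
    if isinstance(value, str):
        # JSON-style quoting is adequate for ordinary text in a TOML basic string.
        return json.dumps(value, ensure_ascii=False)
    if isinstance(value, list):
        return "[" + ", ".join(_toml_value(v) for v in value) + "]"
    raise TypeError(f"unsupported value: {value!r}")


def to_dataset_toml(config):
    """Render one dataset config dict as a [[dataset]] block (sketch)."""
    scalars = {k: v for k, v in config.items() if not isinstance(v, dict)}
    tables = {k: v for k, v in config.items() if isinstance(v, dict)}
    lines = ["[[dataset]]"]
    lines += [f"{k} = {_toml_value(v)}" for k, v in scalars.items()]
    for table, fields in tables.items():
        # Nested objects become sub-tables of the current array element.
        lines.append("")
        lines.append(f"[dataset.{table}]")
        lines += [f"{k} = {_toml_value(v)}" for k, v in fields.items()]
    return "\n".join(lines) + "\n"
```

Top-level keys are emitted before any sub-table because TOML assigns every key that follows a `[dataset.source]` header to that sub-table.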
Register, then persist
A common pattern is to POST /datasets first, confirm the dataset loads
and answers queries, then POST /datasets/persist with the same body to
keep it. The explorer UI does exactly this from its
Register tab.
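The register-then-persist pattern can be scripted as well. The sketch below is an assumption-laden outline rather than a DataPress client: `register_then_persist` is a hypothetical function, the HTTP call is injected as `post(path, body)` returning `(status_code, parsed_json)`, and any status of 400 or above is treated as failure.

```python
def register_then_persist(config, post, check=None):
    """Register a dataset, optionally verify it, then persist the same body.

    post(path, body) must return (status_code, parsed_json).
    check(summary) returns True when the dataset looks right; None skips it.
    """
    status, summary = post("/api/v1/datasets", config)
    if status >= 400:
        raise RuntimeError(f"register failed with HTTP {status}: {summary}")
    if check is not None and not check(summary):
        # The dataset stays registered in memory; it is just not written to the file.
        return {"registered": True, "persisted": False, "summary": summary}
    status, result = post("/api/v1/datasets/persist", config)
    if status >= 400:
        raise RuntimeError(f"persist failed with HTTP {status}: {result}")
    return {"registered": True, "persisted": True, "summary": summary}
```

With requests, `post` can be a thin wrapper that calls `requests.post(base_url + path, headers=..., json=body)` and returns `(resp.status_code, resp.json())`. Injecting it keeps the flow testable without a running server.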
Hot-reload the config file
If you edit datasets.toml on disk — for example a data pipeline appends a new
[[dataset]] when it publishes fresh files — trigger a hot reload to pick up
the additions without a restart:
curl -s -X POST \
  -H "X-Admin-Token: $ADMIN_TOKEN" \
  http://localhost:8080/api/v1/config/reload | jq
# {
#   "registered": ["events", "orders"],
#   "skipped": ["accidents"],
#   "errors": []
# }
The server re-reads and validates the file, then registers every
[[dataset]] whose name is not already registered. The response reports the outcome per dataset:
- registered: datasets that were newly added and loaded.
- skipped: datasets that were already live (left untouched).
- errors: datasets that failed to load, each as { "dataset", "error" }. One bad dataset does not abort the others, so the response can report a partial success with 200.
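Because a 200 response can still carry failures, callers should inspect errors rather than rely on the status code. A minimal sketch of that check, using only the response fields described above (`summarize_reload` is a hypothetical helper):

```python
def summarize_reload(body):
    """Turn a /config/reload response body into (ok, message)."""
    registered = body.get("registered", [])
    skipped = body.get("skipped", [])
    errors = body.get("errors", [])
    parts = [f"{len(registered)} registered", f"{len(skipped)} skipped"]
    if errors:
        failed = ", ".join(f"{e['dataset']}: {e['error']}" for e in errors)
        parts.append(f"{len(errors)} failed ({failed})")
    # A 200 response can still carry failures, so success means an empty error list.
    return (not errors, "; ".join(parts))
```

A pipeline hook might log the message unconditionally and fail the job only when `ok` is false.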
What hot-reload does not do:
- It does not rebuild datasets that already exist. Use POST /datasets/{name}/reload to refresh a dataset whose files changed under the same name.
- It does not re-apply server-level settings (port, workers, [sql], [auth], …). Those still require a restart.
- It does not remove datasets that were deleted from the file.
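These rules amount to a comparison by name between what is live and what is in the file. The following toy model illustrates the documented behavior. It is not the server's implementation, and `plan_reload` is a hypothetical name.

```python
def plan_reload(live, in_file):
    """Model which names a hot reload would register, skip, or leave live.

    live: names currently registered on the server.
    in_file: names of the [[dataset]] blocks in datasets.toml, in file order.
    """
    live_set = set(live)
    file_set = set(in_file)
    return {
        "register": [n for n in in_file if n not in live_set],
        "skip": [n for n in in_file if n in live_set],
        # Live datasets missing from the file are not removed by a reload.
        "still_live": [n for n in live if n not in file_set],
    }
```

For example, with `accidents` and `old` live and a file listing `accidents`, `events`, and `orders`, the model registers `events` and `orders`, skips `accidents`, and leaves `old` in place.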
Automating on new data
Because the trigger is a single authenticated POST, any external event loop
can drive it — a Kafka consumer, a cron job, or a pipeline post-publish hook:
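import os
import requests

def refresh_datapress():
    # Ask the server to re-read datasets.toml and register any new datasets.
    r = requests.post(
        "http://localhost:8080/api/v1/config/reload",
        headers={"X-Admin-Token": os.environ["ADMIN_TOKEN"]},
        timeout=30,  # avoid hanging the calling job if the server is unresponsive
    )
    r.raise_for_status()
    added = r.json()["registered"]
    if added:
        print(f"registered new datasets: {added}")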
Explorer UI
When the explorer is enabled, its Register
tab provides a form for the same workflow: fill in the name, source kind,
location, optional projection / index / S3 settings, and submit. On success
the tab shows the live row and column counts plus the exported
[[dataset]] TOML, and — when the server was loaded from a config file —
a button to persist the block to it.
Security
These endpoints mutate server state and read from arbitrary local paths or
object-store URLs you supply, so they are gated behind the admin/reload
permission. Keep ADMIN_TOKEN secret (or enforce OIDC reload scopes), and put
the server behind TLS before exposing the admin surface beyond localhost. See
Authentication for the OIDC path.
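Automation scripts can apply the same advice before they send a token. The guard below is a sketch under stated assumptions: `is_safe_admin_url` is a hypothetical helper, and it treats only `localhost`, `127.0.0.1`, and `::1` as loopback.

```python
from urllib.parse import urlsplit

# Assumption: these three spellings are the only loopback hosts we accept.
_LOOPBACK_HOSTS = {"localhost", "127.0.0.1", "::1"}


def is_safe_admin_url(url):
    """Return True when sending an admin token to url is reasonable.

    HTTPS is always accepted; plain HTTP only for loopback hosts.
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        return True
    return parts.scheme == "http" and parts.hostname in _LOOPBACK_HOSTS
```

A client could call this before attaching X-Admin-Token and refuse to proceed when it returns False.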