First server

1. A parquet file

Drop a parquet file somewhere reachable by the process. For the examples in these docs we use the Kaggle US accidents 2016–2023 dataset:

ls data/us_accidents/march_2023.parquet

A directory of *.parquet files, a glob, or an s3://... URL also works; see Configuration › Datasets.
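
Before pointing the server at a local file, it can be useful to confirm that the path exists and really is Parquet. The sketch below is not part of the project; `looks_like_parquet` is a hypothetical helper that relies only on the Parquet format's `PAR1` magic bytes at the start and end of a file. It assumes a plain local file with an unencrypted footer, so it does not cover directories, globs, or s3:// URLs.

```python
from pathlib import Path

# Unencrypted Parquet files start and end with these four bytes.
PARQUET_MAGIC = b"PAR1"


def looks_like_parquet(path: str) -> bool:
    """Return True if `path` is a local file framed by the Parquet magic bytes.

    Hypothetical helper for illustration; it checks the framing only and
    does not validate the footer metadata or any column data.
    """
    p = Path(path)
    # Smallest possible file: magic (4) + footer length (4) + magic (4).
    if not p.is_file() or p.stat().st_size < 12:
        return False
    with p.open("rb") as fh:
        head = fh.read(4)
        fh.seek(-4, 2)  # 2 == os.SEEK_END: jump to the last four bytes
        tail = fh.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC
```

Calling `looks_like_parquet("data/us_accidents/march_2023.parquet")` before starting a backend catches a wrong path or a mislabeled CSV early.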

2. `datasets.toml`

A minimal config:

[server]
listen = "127.0.0.1"
port   = 8080

[[dataset]]
name = "accidents"

[dataset.source]
kind     = "parquet"
location = "data/us_accidents/march_2023.parquet"

Save it as datasets.toml in the working directory. Override the path with the DATASETS_CONFIG env var if you keep it elsewhere.

3. Run a backend

DuckDB

task run:duckdb
# or, without taskfile:
RUST_LOG=info ./target/release/datapress-duckdb

Arrow + DataFusion

task run:datafusion
# or:
RUST_LOG=info ./target/release/datapress-datafusion

Python

import asyncio
from datap_rs.datapress import DataPress, DataPressConfig, DatasetConfig

async def main() -> None:
    ds = DatasetConfig(
        name="accidents",
        source="data/us_accidents/march_2023.parquet",
        format="parquet",
    )
    cfg = DataPressConfig(backend="duckdb", port=8000)
    await DataPress(cfg, datasets=[ds]).run()

asyncio.run(main())

Startup logs print the bind address, worker count, route table, and a summary line including the active backend and shutdown grace period.

4. Talk to it

curl http://localhost:8080/api/v1/datasets
curl http://localhost:8080/healthz
curl http://localhost:8080/readyz
curl http://localhost:8080/version

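The Python example in step 3 listens on port 8000, so substitute that port if you started the server that way. See the quick tour for a walkthrough of every route.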
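
The same checks can be scripted, for example to wait for the server in a test harness. The sketch below uses only the standard library and the four routes listed above. The helper names are hypothetical, and treating HTTP 200 from /readyz as "ready" is an assumption about the server's behavior, not something this page states.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, List, Optional

# Routes taken from the curl commands in this section.
ROUTES = ["/api/v1/datasets", "/healthz", "/readyz", "/version"]


def build_urls(host: str = "localhost", port: int = 8080) -> List[str]:
    """Build one URL per route for the given host and port."""
    return [f"http://{host}:{port}{route}" for route in ROUTES]


def probe(url: str, timeout: float = 2.0) -> Optional[int]:
    """Return the HTTP status code, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with a 4xx/5xx status
    except OSError:
        return None  # connection refused, timeout, DNS failure, ...


def wait_until_ready(
    url: str,
    attempts: int = 10,
    delay: float = 0.5,
    probe_fn: Callable[[str], Optional[int]] = probe,
) -> bool:
    """Poll `url` until it returns 200 (assumed to mean ready) or attempts run out."""
    for _ in range(attempts):
        if probe_fn(url) == 200:
            return True
        time.sleep(delay)
    return False
```

A typical use is `wait_until_ready("http://localhost:8080/readyz")` followed by a `probe` call on each entry of `build_urls()`. The `probe_fn` parameter exists so the polling loop can be exercised without a running server.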
