Indexing¶
The DataFusion backend builds an in-memory value → [row ids] map for each
indexed column at startup, so that eq and in predicates resolve in O(1) per
value rather than by scanning the column.
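
To make the lookup structure concrete, here is a minimal Python sketch of a value → [row ids] map. It illustrates the idea only and is not the backend's actual implementation; the names build_index, lookup_eq and lookup_in are hypothetical.

```python
from collections import defaultdict


def build_index(column):
    """Map each distinct value to the list of row ids where it occurs."""
    index = defaultdict(list)
    for row_id, value in enumerate(column):
        index[value].append(row_id)
    return dict(index)


def lookup_eq(index, value):
    """eq predicate: a single hash lookup, independent of the row count."""
    return index.get(value, [])


def lookup_in(index, values):
    """in predicate: one hash lookup per listed value, merged in row order."""
    row_ids = set()
    for value in values:
        row_ids.update(index.get(value, []))
    return sorted(row_ids)


# Hypothetical sample column, used only for illustration.
state = ["CA", "TX", "CA", "NY", "TX"]
index = build_index(state)
```

With this sample column, lookup_eq(index, "CA") returns the row ids 0 and 2 without touching the other rows.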
DataFusion only
DuckDB ignores the [dataset.index] block entirely — its own
optimiser handles equality filters via zone maps and parallel
hash/vector ops.
Reference¶
| Field | Default | Meaning |
|---|---|---|
| mode | auto | auto, none, or list. |
| columns | [] | Explicit column list. Required for mode = "list". |
| max_cardinality | 100000 | Auto mode: stop indexing a column once distinct values exceed this. |
Index-eligible Arrow types: Utf8 (including dictionary-encoded),
Boolean, signed integers (Int8/Int16/Int32/Int64). Floats,
temporals and binary columns always go through SQL.
mode = "auto" (default)¶
Indexes every eligible column whose distinct-value count does not exceed
max_cardinality. Columns are built in parallel, and a column's index is
abandoned as soon as it exceeds the cap.
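
The build-and-abandon behaviour can be sketched as follows. This is an illustrative Python model under stated assumptions, not the backend's code: the function names are hypothetical, a column with exactly max_cardinality distinct values is assumed to be kept, and a thread pool stands in for whatever parallelism the backend really uses.

```python
from concurrent.futures import ThreadPoolExecutor


def build_capped_index(column, max_cardinality=100_000):
    """Build a value -> [row ids] map, or return None once the cap is exceeded."""
    index = {}
    for row_id, value in enumerate(column):
        rows = index.get(value)
        if rows is None:
            if len(index) >= max_cardinality:
                # A new distinct value would push the count past the cap:
                # abandon this column and discard the partial map.
                return None
            rows = index[value] = []
        rows.append(row_id)
    return index


def build_auto_indexes(table, max_cardinality=100_000):
    """Index every column of `table` concurrently, dropping abandoned ones."""
    with ThreadPoolExecutor() as pool:
        results = pool.map(
            lambda column: build_capped_index(column, max_cardinality),
            table.values(),
        )
        built = dict(zip(table.keys(), results))
    return {name: idx for name, idx in built.items() if idx is not None}
```

Abandoning early means a high-cardinality column costs at most max_cardinality map entries before it is dropped, rather than a full pass that is thrown away afterwards.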
Wide schemas (≳ 50 columns)
Auto mode can exhaust memory on wide tables. The index keys are
heap-allocated Strings, so hundreds of maps building concurrently can
easily reach tens of GB. For wide tables, switch to mode = "list" and
name only the columns you actually filter on.
mode = "none"¶
All predicates go through DataFusion SQL (still vectorised and multi-threaded). Use this when:
- the dataset is wide and you don't have a fixed query pattern,
- startup time matters more than first-query latency,
- you mostly filter on ranges or LIKE patterns (the index doesn't help with those).
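
The split between index lookups and SQL can be summarised in a few lines. The sketch below is a simplified Python model, not the backend's dispatcher; route_predicate and the operator names other than eq and in are hypothetical.

```python
def route_predicate(column, op, indexed_columns):
    """Return which path a predicate would take in this simplified model."""
    # Only equality-style predicates on an indexed column can use the map.
    if op in ("eq", "in") and column in indexed_columns:
        return "index"
    # Ranges, LIKE, and any predicate on an unindexed column go through SQL.
    return "sql"


# mode = "none" corresponds to an empty set of indexed columns.
no_index = set()
with_index = {"state", "severity"}
```

With an empty set of indexed columns every predicate takes the SQL path, which is what mode = "none" amounts to in this model.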
mode = "list"¶
Best for wide tables with a known query pattern. Only the listed
columns get an index; max_cardinality is ignored.

    [[dataset]]
    name = "us_accidents"

    [dataset.source]
    kind = "parquet"
    location = "data/us_accidents/*.parquet"

    [dataset.index]
    mode = "list"
    columns = ["state", "severity", "weather_condition", "city"]

An empty columns list with mode = "list" is caught at startup.
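
A startup check of this kind might look like the Python sketch below. It assumes the [dataset.index] block has already been parsed into a dict; validate_index_config and both error messages are hypothetical and do not reproduce the backend's real output.

```python
def validate_index_config(index_cfg):
    """Validate a parsed [dataset.index] block and return (mode, columns)."""
    mode = index_cfg.get("mode", "auto")      # documented default
    columns = index_cfg.get("columns", [])    # documented default
    if mode not in ("auto", "none", "list"):
        # Hypothetical message, for illustration only.
        raise ValueError(f"unknown index mode: {mode!r}")
    if mode == "list" and not columns:
        # Hypothetical message, for illustration only.
        raise ValueError('mode = "list" requires a non-empty columns list')
    return mode, columns
```

Failing at startup rather than at query time keeps a misconfigured dataset from silently running with no index at all.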
Lazy datasets¶
Setting lazy = true skips index building entirely; predicate pushdown is
delegated to DataFusion's Parquet reader instead.