Indexing

The DataFusion backend builds an in-memory value → [row ids] map at startup so that eq and in predicates resolve in O(1).
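To make the idea concrete, here is a minimal sketch of a value → [row ids] map and of how eq and in predicates could be answered from it. It is an illustration in Python, not the backend's actual implementation; the function names build_index, lookup_eq and lookup_in are hypothetical.

```python
from collections import defaultdict


def build_index(values):
    """Map each distinct value to the ascending list of row ids where it occurs."""
    index = defaultdict(list)
    for row_id, value in enumerate(values):
        index[value].append(row_id)
    return dict(index)


def lookup_eq(index, value):
    """Rows matching `column = value`: a single dictionary lookup."""
    return index.get(value, [])


def lookup_in(index, candidates):
    """Rows matching `column IN (...)`: one lookup per candidate, then a union."""
    rows = set()
    for value in candidates:
        rows.update(index.get(value, []))
    return sorted(rows)


# Hypothetical sample column.
state = ["CA", "TX", "CA", "NY", "TX", "CA"]
index = build_index(state)

print(lookup_eq(index, "CA"))          # [0, 2, 5]
print(lookup_in(index, ["TX", "NY"]))  # [1, 3, 4]
```

Each lookup costs one hash probe regardless of table size, which is where the O(1) claim comes from; the price is the memory held by the map and the time spent building it at startup.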

DataFusion only

DuckDB ignores the [dataset.index] block entirely — its own optimiser handles equality filters via zone maps and parallel hash/vector ops.

Reference

Field            Default  Meaning
mode             auto     auto, none, or list.
columns          []       Explicit column list. Required for mode = "list".
max_cardinality  100000   Auto mode: stop indexing a column once distinct values exceed this.

Index-eligible Arrow types: Utf8 (including dictionary-encoded), Boolean, signed integers (Int8/Int16/Int32/Int64). Floats, temporals and binary columns always go through SQL.

mode = "auto" (default)

Indexes every eligible column whose distinct-value count does not exceed max_cardinality. Columns are indexed in parallel, and a column's index is abandoned as soon as its distinct-value count exceeds the cap.

[dataset.index]
mode            = "auto"
max_cardinality = 50_000     # tighten the cap if RAM is tight
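The cap check can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the backend's code: build_capped_index and build_auto_indexes are hypothetical names, and a thread pool stands in for whatever parallelism the backend actually uses.

```python
from concurrent.futures import ThreadPoolExecutor


def build_capped_index(values, max_cardinality):
    """Build a value -> [row ids] map, or return None once the cap is exceeded."""
    index = {}
    for row_id, value in enumerate(values):
        rows = index.get(value)
        if rows is None:
            # A new distinct value: abandon if adding it would exceed the cap.
            if len(index) >= max_cardinality:
                return None
            rows = index[value] = []
        rows.append(row_id)
    return index


def build_auto_indexes(columns, max_cardinality):
    """Index every column concurrently and keep only those within the cap."""
    with ThreadPoolExecutor() as pool:
        futures = {
            name: pool.submit(build_capped_index, values, max_cardinality)
            for name, values in columns.items()
        }
        results = {name: future.result() for name, future in futures.items()}
    return {name: idx for name, idx in results.items() if idx is not None}


# Hypothetical sample data: "description" has 4 distinct values, above the cap of 3.
columns = {
    "severity": [2, 3, 2, 4],
    "description": ["a", "b", "c", "d"],
}
indexes = build_auto_indexes(columns, max_cardinality=3)
print(sorted(indexes))  # ['severity']
```

Abandoning early means a high-cardinality column costs at most max_cardinality map entries before it is dropped, which is why lowering the cap also lowers peak memory during startup.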

Wide schemas (≳ 50 columns)

Auto mode can exhaust memory on wide tables. The index keys are heap-allocated Strings, and with hundreds of maps being built concurrently, memory use can easily reach tens of GB. For wide tables, switch to mode = "list" and name the columns you actually filter on.

mode = "none"

All predicates go through DataFusion SQL (still vectorised and multi-threaded). Use this when:

  • the dataset is wide and you don't have a fixed query pattern,
  • startup time matters more than first-query latency,
  • you mostly filter on ranges / LIKE (the index doesn't help those).

[dataset.index]
mode = "none"

mode = "list"

Best for wide tables with a known query pattern. Only the listed columns get an index; max_cardinality is ignored.

[[dataset]]
name = "us_accidents"

[dataset.source]
kind     = "parquet"
location = "data/us_accidents/*.parquet"

[dataset.index]
mode    = "list"
columns = ["state", "severity", "weather_condition", "city"]

An empty columns list with mode = "list" is caught at startup:

dataset 'foo': index.mode = "list" requires a non-empty index.columns
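A startup check of this kind could look like the sketch below. It is a hypothetical Python illustration: validate_index_config is not part of the project, the config is passed as an already-parsed dictionary, and only the defaults and the error text come from this page.

```python
def validate_index_config(dataset_name, index_cfg):
    """Apply the documented defaults and reject an empty column list in list mode."""
    mode = index_cfg.get("mode", "auto")      # documented default: auto
    columns = index_cfg.get("columns", [])    # documented default: []
    if mode == "list" and not columns:
        raise ValueError(
            f"dataset '{dataset_name}': index.mode = \"list\" "
            "requires a non-empty index.columns"
        )
    return mode, columns


try:
    validate_index_config("foo", {"mode": "list", "columns": []})
except ValueError as err:
    print(err)
```

Failing at startup rather than at query time keeps a misconfigured dataset from silently running without any index.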

Lazy datasets

lazy = true skips index building entirely — predicate pushdown is delegated to DataFusion's parquet reader instead.