Raw SQL¶
The structured /query endpoint covers projection,
predicates, sorting, grouping, and pagination with a JSON body. When you
need the full expressiveness of SQL — window functions, CASE,
sub-selects, arithmetic, string functions — use the raw-SQL endpoint:
It is disabled by default. Raw SQL is a much larger attack surface than the structured query API, so you opt in explicitly and the server parses and validates every statement before any engine sees it.
Phase 1: one dataset per query
Today a statement may reference exactly one registered dataset — no cross-dataset joins yet. The validation gate is built so that raising this limit is the only change needed to allow joins later; see Roadmap.
Enabling the endpoint¶
Add a [sql] block to your config (see
Configuration):
[sql]
enabled = true # default false — endpoint returns 404 when off
max_rows = 100000 # hard cap on rows returned by one query
From the Python API, set the equivalent fields on
DataPressConfig:
from datap_rs.datapress import DataPressConfig
cfg = DataPressConfig(
backend="datafusion",
port=8000,
sql_enabled=True, # exposes POST /api/v1/sql
sql_max_rows=100_000,
)
| Field | Default | Notes |
|---|---|---|
enabled |
false |
When false, the route responds 404 so its presence isn't even revealed. |
max_rows |
100000 |
Server-side hard cap. The result is wrapped in an outer LIMIT so this always applies. |
While disabled, POST /api/v1/sql returns 404 Not Found — identical to
an unmounted route — so probing for it leaks nothing.
Request body¶
{
"sql": "SELECT state, COUNT(*) AS n FROM accidents GROUP BY state ORDER BY n DESC",
"max_rows": 500
}
| Field | Type | Required | Notes |
|---|---|---|---|
sql |
string | yes | A single read-only SELECT / WITH … SELECT, or a DESCRIBE/DESC <table>, referencing one dataset. |
max_rows |
integer | no | Client row cap. Clamped into [1, [sql].max_rows]; it can never raise the server cap. Omit to use the server cap. |
The dataset is named directly in the SQL FROM clause using its
configured name (the same slug used in /api/v1/datasets/{name}/...).
Matching is case-insensitive.
Response¶
data is the result set as a JSON array of row objects; max_rows echoes
the effective row cap that was applied. Column types follow the engine's
inferred output schema.
Arrow IPC¶
Like the structured /query endpoint, the response is
content-negotiated. Ask for Arrow and you get an Arrow IPC
stream instead of the JSON envelope — proper
typed columns, no JSON stringification, and the body is streamed as it is
encoded. The same [sql].max_rows cap still applies.
Opt in with either the Accept header or a ?format=arrow query param:
The response carries Content-Type: application/vnd.apache.arrow.stream
and an X-Max-Rows header echoing the applied cap.
import io
import requests
import pyarrow.ipc as ipc
resp = requests.post(
"http://localhost:8080/api/v1/sql",
headers={"Accept": "application/vnd.apache.arrow.stream"},
json={"sql": "SELECT state, COUNT(*) AS n FROM accidents GROUP BY state"},
)
resp.raise_for_status()
table = ipc.open_stream(io.BytesIO(resp.content)).read_all()
df = table.to_pandas()
There is no separate paging for raw SQL: a statement returns a single
result bounded by max_rows, so the Arrow stream already delivers the
whole (capped) result in one response.
Examples¶
A CTE name is local to the query and is not treated as a dataset,
so this still references only accidents:
Window functions run over a single dataset. This ranks states by accident count without a self-join:
Use CASE to derive categories on the fly, then aggregate by them:
Several CTEs can be chained and joined together. None of the CTE names
(by_state, totals) count as datasets, so the query still references
only accidents:
A CTE feeds a window function to keep, per state, only the worst severity rows above the state's own average:
Table-less queries reference zero datasets and are always allowed.
What is rejected¶
The shared validation gate runs identically for the DuckDB and DataFusion
backends. A request is rejected with 400 Bad Request when the statement:
- is not a single read-only statement — multiple statements, or
anything other than
SELECT/WITH … SELECT/DESCRIBE/DESC(noINSERT,UPDATE,DELETE,CREATE,DROP,ALTER,COPY,ATTACH,INSTALL,PRAGMA,EXPLAIN, …); - references an unknown table — every relation must be a registered dataset (or a CTE defined in the same query);
- references more than one dataset (Phase 1 limit);
- uses a file-reading or external-access function in any position —
read_parquet,read_csv,read_json,read_text,read_blob,glob,parquet_scan, and similar are denied even in scalar position (e.g.SELECT read_text('/etc/passwd')).
// 400 — DML is not allowed
{ "error": "only read-only SELECT and DESCRIBE statements are allowed" }
// 400 — more than one statement
{ "error": "exactly one SQL statement is allowed" }
// 400 — unknown / file-function table
{ "error": "could not parse SQL: ..." }
// 400 — too many datasets (Phase 1)
{ "error": "this endpoint allows at most 1 dataset(s) per query; the statement references 2" }
See Reference › Errors for the full status-code table.
Security model¶
- Off by default. No
[sql]block, no endpoint. - Parse-then-allowlist. Statements are parsed with
sqlparserand every referenced relation is checked against the set of registered datasets before execution — the engine never sees an unvalidated string. - No file access. File-reading table and scalar functions are denied, so a query can't escape the configured datasets to read arbitrary paths or URLs.
- Bounded results. Every query is wrapped in an outer
LIMIT([sql].max_rows), so a runawaySELECTcan't stream unbounded rows. - Read scopes apply. When authentication is
enabled, the endpoint enforces the same
readscopes as the structured query API.
The legacy un-versioned alias POST /api/sql is also mounted and behaves
identically.
Roadmap¶
The validation gate already tracks which datasets a statement touches
and enforces a configurable maximum (Phase 1 passes 1). Cross-dataset
joins become available by raising that bound — the allowlist, file-function
denial, single-statement, and read-only guarantees all stay in force. No
isolated per-dataset connections are used, so a multi-dataset JOIN is an
additive change rather than a rewrite.