Spatial Indexing Internals

DuckDB’s spatial indexing model diverges from the index-heavy approach of traditional RDBMSs. The spatial extension supports a persistent R-tree index (created with CREATE INDEX ... USING RTREE) on GEOMETRY columns. Without an explicit R-tree, a filtered query falls back to a sequential scan that evaluates the spatial predicate on every row, and spatial joins are planned as hash or nested-loop joins chosen from cardinality estimates. This vectorized, columnar approach avoids index maintenance overhead but shifts the computational burden to the scan phase when no persistent index exists. For a comprehensive overview of the execution pipeline, consult the DuckDB Spatial Architecture & Fundamentals. Production workloads require explicit tuning of memory allocation, thread parallelism, and geometry serialization to prevent degenerate full-table scans.

Index Architecture & Predicate Pushdown

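The spatial extension constructs an R-tree over minimum bounding rectangles (MBRs) extracted from GEOMETRY columns. During query planning, spatial predicates that imply intersection (for example ST_Intersects, ST_Contains, ST_Within) trigger a two-phase evaluation: a fast MBR overlap filter followed by exact topology validation. When one argument of the predicate is a constant known at planning time, the optimizer can replace the table scan with an R-tree index scan, reducing I/O before the expensive exact checks execute. Distance predicates such as ST_DWithin do not imply intersection, so they may not qualify for the index scan.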

-- Create a persistent R-tree index
CREATE INDEX idx_parcels_geom ON parcels USING RTREE (geom);

-- Verify the planner uses the index
EXPLAIN SELECT * FROM parcels
WHERE ST_Intersects(geom, ST_GeomFromText('POLYGON((0 0, 10 0, 10 10, 0 10, 0 0))'));

Typical EXPLAIN output when the index is used (simplified, in execution order):

graph TD
  S["RTREE_INDEX_SCAN<br/>parcels · idx_parcels_geom"] --> F["FILTER<br/>ST_Intersects(geom, …)"]
  F --> P["PROJECTION<br/>*"]
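The two-phase evaluation described above can be illustrated outside the database. The following is a minimal pure-Python sketch, not DuckDB's implementation: the helper names (mbr, in_box, point_in_polygon, two_phase_filter) are hypothetical, the rows are points rather than arbitrary geometries, and the exact test is a simple ray-casting check that does not treat boundary points specially.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def mbr(ring: List[Point]) -> Box:
    """Minimum bounding rectangle of a closed ring."""
    xs = [p[0] for p in ring]
    ys = [p[1] for p in ring]
    return (min(xs), min(ys), max(xs), max(ys))


def in_box(p: Point, box: Box) -> bool:
    """Phase 1: cheap bounding-box test."""
    return box[0] <= p[0] <= box[2] and box[1] <= p[1] <= box[3]


def point_in_polygon(p: Point, ring: List[Point]) -> bool:
    """Phase 2: exact test by ray casting along +x (boundary cases ignored)."""
    x, y = p
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def two_phase_filter(points: List[Point], ring: List[Point]):
    """Return (MBR candidates, exact matches) for a query polygon."""
    box = mbr(ring)
    candidates = [p for p in points if in_box(p, box)]
    matches = [p for p in candidates if point_in_polygon(p, ring)]
    return candidates, matches


# A right triangle whose MBR is the square [0, 10] x [0, 10]
triangle = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (0.0, 0.0)]
points = [(1.0, 1.0), (9.0, 9.0), (20.0, 20.0), (3.0, 4.0)]
candidates, matches = two_phase_filter(points, triangle)
```

Here (9, 9) survives the MBR phase but fails the exact test, which is the kind of false positive the second phase exists to remove.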

Performance Trade-off: The MBR filter is highly selective for well-distributed geometries but degrades toward a full scan when bounding boxes overlap heavily or when the dataset lacks spatial locality. A persistent RTREE index lets the planner perform an index scan; without one, every row is read and tested against the predicate. Materializing inputs in Hilbert-sorted order keeps MBR selectivity high for range queries.
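To make the Hilbert ordering concrete, here is a minimal sketch of the classic coordinate-to-curve-index conversion. It is an illustration only and does not reproduce the extension's own ST_Hilbert function; hilbert_index and hilbert_key are hypothetical names, and the 16-bit grid resolution and the non-zero extent of the bounds are assumptions.

```python
def hilbert_index(order: int, x: int, y: int) -> int:
    """Map grid cell (x, y) on a 2^order x 2^order grid to its Hilbert curve index."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the right orientation
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def hilbert_key(cx: float, cy: float, bounds, order: int = 16) -> int:
    """Sort key for a coordinate inside bounds = (xmin, ymin, xmax, ymax).

    Assumes the bounds have non-zero width and height.
    """
    xmin, ymin, xmax, ymax = bounds
    cells = (1 << order) - 1
    gx = int((cx - xmin) / (xmax - xmin) * cells)
    gy = int((cy - ymin) / (ymax - ymin) * cells)
    return hilbert_index(order, gx, gy)


# Order a handful of MBR centers along the curve
centers = [(9.0, 9.0), (0.5, 0.5), (9.5, 0.5), (0.5, 9.5)]
extent = (0.0, 0.0, 10.0, 10.0)
ordered = sorted(centers, key=lambda c: hilbert_key(c[0], c[1], extent))
```

Consecutive curve indices always map to adjacent grid cells, which is why rows sorted by this key tend to sit next to their spatial neighbors on disk.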

Storage Tiers & Materialization Behavior

Without a persistent R-tree, bounding boxes are computed per query. For high-throughput pipelines, persisting geometry in GeoParquet format with spatially sorted rows preserves locality and reduces parsing latency. DuckDB's buffer manager keeps working data in memory up to the memory_limit setting and spills larger intermediates to the configured temp_directory. Review In-Memory vs Disk Storage for tiering thresholds and spill-to-disk behavior.

-- Configure runtime memory and thread allocation for spatial workloads
SET memory_limit = '8GB';
SET threads = 12;

-- Materialize with a coarse spatial sort (by MBR lower-left corner) to improve scan locality
COPY (
    SELECT * FROM parcels
    ORDER BY ST_XMin(geom), ST_YMin(geom)
) TO 'parcels_sorted.parquet' (FORMAT PARQUET);

-- Subsequent reads leverage preserved MBR locality without recomputing sort order
SELECT COUNT(*) FROM read_parquet('parcels_sorted.parquet')
WHERE ST_Intersects(geom, ST_GeomFromText('POLYGON((0 0, 10 0, 10 10, 0 10, 0 0))'));
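Why sorted rows help can be shown with a small simulation. The sketch below is a conceptual illustration under stated assumptions, not a model of DuckDB's Parquet reader: it splits rows into fixed-size chunks (standing in for row groups), computes one bounding box per chunk, and counts how many chunks a query window overlaps. Whether a given reader actually skips chunks based on such statistics depends on the engine and file metadata; chunk_boxes, overlaps, and chunks_touched are hypothetical helpers.

```python
import random


def chunk_boxes(points, chunk_size):
    """Bounding box (xmin, ymin, xmax, ymax) of each consecutive chunk of rows."""
    boxes = []
    for i in range(0, len(points), chunk_size):
        chunk = points[i:i + chunk_size]
        xs = [p[0] for p in chunk]
        ys = [p[1] for p in chunk]
        boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return boxes


def overlaps(a, b):
    """True when two boxes share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def chunks_touched(points, window, chunk_size=64):
    """Number of chunks whose bounding box overlaps the query window."""
    return sum(overlaps(box, window) for box in chunk_boxes(points, chunk_size))


# 1,024 points on a 32 x 32 grid, split into 16 chunks of 64 rows
grid = [(x, y) for x in range(32) for y in range(32)]
window = (0, 0, 3, 3)

sorted_rows = sorted(grid)  # same ordering as ORDER BY x, then y
shuffled_rows = grid[:]
random.Random(0).shuffle(shuffled_rows)  # no spatial locality

sorted_hits = chunks_touched(sorted_rows, window)
shuffled_hits = chunks_touched(shuffled_rows, window)
print(f"sorted: {sorted_hits} chunks, shuffled: {shuffled_hits} chunks")
```

With sorted input only the chunks covering the leftmost columns overlap the window, while shuffled chunks each span most of the extent and nearly all of them have to be read.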

Diagnostic Boundary: Monitor execution metrics via EXPLAIN ANALYZE. If a sequential scan of the table consumes more than 80% of execution time and no index scan operator appears in the plan, the optimizer bypassed the R-tree. Common causes are a missing index, a predicate the index does not support, or a predicate argument that is not constant at planning time.

Geometry Serialization & Index Key Generation

Spatial indexes operate on MBR coordinates extracted from serialized binary representations. DuckDB normalizes GEOMETRY objects to WKB during index traversal. Understanding the overhead of geometry conversion is critical when tuning join performance. Refer to Understanding ST_Geometry vs WKB for binary layout specifics. Coordinate reference system alignment directly impacts MBR accuracy; mismatched projections cause incorrect bounding boxes and silent join failures. See CRS Mapping & Transformations for projection normalization strategies.

import time

import duckdb

con = duckdb.connect()
# Install and load the spatial extension, then build an indexed copy of the data
con.execute("INSTALL spatial; LOAD spatial;")
con.execute("CREATE TABLE test AS SELECT * FROM read_parquet('parcels_sorted.parquet');")
con.execute("CREATE INDEX idx_test_geom ON test USING RTREE (geom);")

# Time a query that can use the R-tree (the polygon argument is a constant).
# Run the same query before CREATE INDEX to obtain the sequential-scan baseline.
start = time.perf_counter()
rows = con.execute("""
    SELECT count(*)
    FROM test
    WHERE ST_Intersects(geom, ST_GeomFromText('POLYGON((0 0, 10 0, 10 10, 0 10, 0 0))'))
""").fetchone()[0]
elapsed = time.perf_counter() - start
print(f"Indexed scan: {rows} rows in {elapsed:.3f}s")

Performance Trade-off: WKB parsing adds roughly 15–30% CPU overhead compared to raw numeric column scans, though the exact figure depends on geometry complexity and hardware. Pre-extracting bounds into separate x_min, y_min, x_max, y_max columns eliminates runtime WKB extraction at the cost of storage duplication. For GeoJSON ingestion, ST_Read() loads features directly into GEOMETRY values. Consult the GeoParquet specification for columnar encoding best practices and the OGC Simple Features standard for binary compliance requirements.
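To show what extracting an MBR from a serialized geometry involves, here is a minimal sketch that reads the bounds straight out of standard 2D WKB using only the Python standard library. It handles Point, LineString, and Polygon and rejects everything else (including Z/M and EWKB variants); wkb_mbr is a hypothetical helper, and this is not how DuckDB parses geometries internally.

```python
import struct


def wkb_mbr(wkb: bytes):
    """Return (xmin, ymin, xmax, ymax) for a 2D WKB Point, LineString, or Polygon."""
    endian = "<" if wkb[0] == 1 else ">"  # byte-order flag: 1 = little-endian
    (geom_type,) = struct.unpack_from(endian + "I", wkb, 1)
    offset = 5
    xs, ys = [], []

    def read_points(count, pos):
        # Each coordinate pair is two 8-byte doubles
        for _ in range(count):
            x, y = struct.unpack_from(endian + "dd", wkb, pos)
            xs.append(x)
            ys.append(y)
            pos += 16
        return pos

    if geom_type == 1:  # Point
        read_points(1, offset)
    elif geom_type == 2:  # LineString: point count, then points
        (n,) = struct.unpack_from(endian + "I", wkb, offset)
        read_points(n, offset + 4)
    elif geom_type == 3:  # Polygon: ring count, then (point count, points) per ring
        (rings,) = struct.unpack_from(endian + "I", wkb, offset)
        offset += 4
        for _ in range(rings):
            (n,) = struct.unpack_from(endian + "I", wkb, offset)
            offset = read_points(n, offset + 4)
    else:
        raise ValueError(f"unsupported WKB geometry type: {geom_type}")
    return (min(xs), min(ys), max(xs), max(ys))


# Little-endian WKB for POLYGON((0 0, 10 0, 10 10, 0 10, 0 0))
ring = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0), (0.0, 0.0)]
SQUARE_WKB = (
    struct.pack("<BII", 1, 3, 1)
    + struct.pack("<I", len(ring))
    + b"".join(struct.pack("<dd", x, y) for x, y in ring)
)
bounds = wkb_mbr(SQUARE_WKB)
```

Storing the four resulting values as plain numeric columns is the pre-extraction strategy described above: the scan can then compare doubles without touching the geometry bytes.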

Diagnostic Boundaries & Production Tuning

Symptom	Root Cause	Resolution
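Scan cardinality ≈ table row count in `EXPLAIN ANALYZE` despite a selective predicate	R-tree bypassed due to missing index, non-constant predicate argument, or unsorted input	Build `USING RTREE` index; filter against a constant geometry; materialize sorted GeoParquet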
`ST_Intersects` latency spikes on large datasets	WKB deserialization contention across threads	Pre-extract MBR columns; reduce `threads` to physical core count; batch queries
Incorrect spatial join results	CRS mismatch between source tables	Normalize projections using `ST_Transform` prior to materialization
`Out of Memory` during spatial join	R-tree build exceeds available memory	Set `temp_directory` so the engine can spill, or partition input data

Execution Checklist:

Verify EXPLAIN output contains an RTREE_INDEX_SCAN operator in place of the sequential table scan when an R-tree index exists.
Confirm memory_limit is sufficient for the working set; monitor spill via SELECT * FROM duckdb_temporary_files();.
Align thread count with physical cores (SET threads = <n>) to avoid context-switching penalties.
Enforce explicit ST_Transform in ETL pipelines to prevent CRS drift during key generation.
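The first checklist item can be automated with a small text check on the plan. The sketch below is an assumption-laden helper, not an official API: classify_plan is a hypothetical name, and the operator labels it searches for (RTREE_INDEX_SCAN for the index path, SEQ_SCAN or TABLE_SCAN for the sequential path) are assumed spellings that may differ between DuckDB versions.

```python
def classify_plan(plan_text: str) -> str:
    """Classify an EXPLAIN plan as 'index scan', 'sequential scan', or 'unknown'.

    The operator names below are assumptions about how the plan is printed;
    adjust them to match the output of the DuckDB version in use.
    """
    text = plan_text.upper()
    if "RTREE_INDEX_SCAN" in text:
        return "index scan"
    if "SEQ_SCAN" in text or "TABLE_SCAN" in text:
        return "sequential scan"
    return "unknown"


# Hand-written plan fragments used only to exercise the helper
indexed_plan = "PROJECTION\nFILTER ST_Intersects(geom, ...)\nRTREE_INDEX_SCAN parcels"
sequential_plan = "PROJECTION\nFILTER ST_Intersects(geom, ...)\nSEQ_SCAN parcels"

print(classify_plan(indexed_plan))
print(classify_plan(sequential_plan))
```

In practice the plan text would come from running EXPLAIN on the query through the client library and joining the returned text columns into one string before passing it to the helper; the exact result shape is worth checking for the client version in use.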

DuckDB’s spatial indexing prioritizes execution-time vectorization and supports persistent R-tree structures for high-selectivity workloads. Aligning storage locality, memory allocation, and serialization formats keeps selective spatial queries fast as data volumes grow. Continuous monitoring of EXPLAIN ANALYZE output, together with consistent CRS and serialization handling, prevents degenerate query plans in production.
