Spatial Indexing Internals
DuckDB’s spatial indexing model diverges from the always-on overlay indexes of traditional RDBMSs. The spatial extension supports a persistent R-tree index (created with CREATE INDEX ... USING RTREE) on GEOMETRY columns. Without an explicit R-tree, filter queries fall back to a sequential scan and spatial joins to a hash or nested-loop join, with ephemeral bounding-box filters constructed from row statistics. This vectorized, columnar approach avoids background index maintenance but shifts the computational burden to the scan phase when no persistent index exists. For a comprehensive overview of the execution pipeline, consult the DuckDB Spatial Architecture & Fundamentals. Production workloads require explicit tuning of memory allocation, thread parallelism, and geometry serialization to prevent degenerate full-table scans.
Index Architecture & Predicate Pushdown
The spatial extension constructs an R-tree over minimum bounding rectangles (MBRs) extracted from GEOMETRY columns. During query planning, spatial predicates (ST_Intersects, ST_Contains, ST_DWithin) trigger a two-phase evaluation: a fast MBR overlap filter followed by exact topology validation. The optimizer pushes the MBR filter into the TABLE_SCAN operator, reducing I/O before expensive exact checks execute.
-- Create a persistent R-tree index
CREATE INDEX idx_parcels_geom ON parcels USING RTREE (geom);
-- Verify the planner uses the index
EXPLAIN SELECT * FROM parcels
WHERE ST_Intersects(geom, ST_GeomFromText('POLYGON((0 0, 10 0, 10 10, 0 10, 0 0))'));
Typical EXPLAIN Output:
graph TD
    S["TABLE_SCAN<br/>parcels"] --> F["FILTER · MBR<br/>ST_Intersects(geom, …)"]
    F --> P["PROJECTION<br/>*"]
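The two-phase evaluation described above can be illustrated outside the database. The following is a minimal pure-Python sketch, not DuckDB's implementation: the helper names (mbr, mbr_contains, point_in_polygon, two_phase_contains) and the sample polygons are hypothetical, and the exact check uses a point-in-polygon test as a stand-in for a containment predicate.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Polygon = List[Point]  # ring of vertices; the last vertex connects back to the first


def mbr(poly: Polygon) -> Tuple[float, float, float, float]:
    """Minimum bounding rectangle as (xmin, ymin, xmax, ymax)."""
    xs = [x for x, _ in poly]
    ys = [y for _, y in poly]
    return min(xs), min(ys), max(xs), max(ys)


def mbr_contains(box, pt: Point) -> bool:
    """Phase 1: cheap rectangle test."""
    xmin, ymin, xmax, ymax = box
    return xmin <= pt[0] <= xmax and ymin <= pt[1] <= ymax


def point_in_polygon(pt: Point, poly: Polygon) -> bool:
    """Phase 2: exact test via ray casting (even-odd rule)."""
    x, y = pt
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def two_phase_contains(polys: List[Polygon], pt: Point):
    """Return (MBR candidates, exact hits) as lists of polygon indexes."""
    candidates = [i for i, p in enumerate(polys) if mbr_contains(mbr(p), pt)]
    hits = [i for i in candidates if point_in_polygon(pt, polys[i])]
    return candidates, hits


# Hypothetical sample data
polygons = [
    [(0, 0), (10, 0), (0, 10)],                 # triangle; its MBR covers (0,0)-(10,10)
    [(0, 0), (10, 0), (10, 10), (0, 10)],       # square
    [(20, 20), (30, 20), (30, 30), (20, 30)],   # far from the query point
]
candidates, hits = two_phase_contains(polygons, (8.0, 8.0))
print("MBR candidates:", candidates, "exact hits:", hits)
```

The triangle passes the rectangle test but fails the exact test, which is the false-positive case the second phase exists to remove; the distant square is rejected without any exact geometry work.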
Performance Trade-off: The MBR filter is highly selective for well-distributed geometries but degrades to a full scan when bounding boxes exhibit high overlap or when the dataset lacks spatial locality. A persistent RTREE index lets the planner perform an index scan. Without one, it falls back to a hash or nested-loop join based on cardinality estimates. Materializing inputs in Hilbert-sorted order keeps MBR selectivity high for range queries.
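To make the Hilbert-ordering idea concrete, here is a small pure-Python sketch of computing a Hilbert-curve key for a point and sorting by it. It uses the classic iterative grid-to-curve mapping; the function names (hilbert_index, hilbert_key), the 16-bit grid resolution, and the sample centroids are assumptions for illustration and do not reflect how the spatial extension computes its own ordering.

```python
def hilbert_index(order: int, x: int, y: int) -> int:
    """Map grid cell (x, y) on a 2^order x 2^order grid to its Hilbert-curve distance."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve is oriented consistently
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def hilbert_key(pt, bounds, order: int = 16) -> int:
    """Quantize a point within bounds (xmin, ymin, xmax, ymax) and return its curve key."""
    xmin, ymin, xmax, ymax = bounds
    cells = (1 << order) - 1
    gx = int((pt[0] - xmin) / (xmax - xmin) * cells)
    gy = int((pt[1] - ymin) / (ymax - ymin) * cells)
    return hilbert_index(order, gx, gy)


# Hypothetical geometry centroids and dataset extent
centroids = [(9.0, 9.0), (0.5, 0.5), (9.5, 0.5), (0.5, 9.5)]
bounds = (0.0, 0.0, 10.0, 10.0)
ordered = sorted(centroids, key=lambda p: hilbert_key(p, bounds))
print(ordered)
```

Consecutive keys always map to neighbouring grid cells, so rows sorted this way tend to land in the same row groups as their spatial neighbours, which is what keeps per-group bounding boxes tight.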
Storage Tiers & Materialization Behavior
Without a persistent R-tree, bounding-box structures are computed per query. For high-throughput pipelines, persisting geometry in GeoParquet format with spatially sorted rows preserves locality and reduces parsing latency. The storage engine dynamically switches between vectorized buffers and memory-mapped files based on dataset size and memory_limit configuration. Review In-Memory vs Disk Storage for tiering thresholds and spill-to-disk behavior.
-- Configure runtime memory and thread allocation for spatial workloads
SET memory_limit = '8GB';
SET threads = 12;
-- Materialize sorted by bounding-box minimum, a simple approximation of spatial clustering
-- (a Hilbert-curve sort key preserves locality better when one is available)
COPY (
SELECT * FROM parcels
ORDER BY ST_XMin(geom), ST_YMin(geom)
) TO 'parcels_sorted.parquet' (FORMAT PARQUET);
-- Subsequent reads leverage preserved MBR locality without recomputing sort order
SELECT COUNT(*) FROM read_parquet('parcels_sorted.parquet')
WHERE ST_Intersects(geom, ST_GeomFromText('POLYGON((0 0, 10 0, 10 10, 0 10, 0 0))'));
Diagnostic Boundary: Monitor execution metrics via EXPLAIN ANALYZE. If TABLE_SCAN consumes >80% of execution time and the plan contains no FILTER (MBR) node, the optimizer most likely bypassed spatial indexing, for example because of insufficient memory for the R-tree build or missing sort order.
Geometry Serialization & Index Key Generation
Spatial indexes operate on MBR coordinates extracted from serialized binary representations. DuckDB normalizes GEOMETRY objects to WKB during index traversal. Understanding the overhead of geometry conversion is critical when tuning join performance. Refer to Understanding ST_Geometry vs WKB for binary layout specifics. Coordinate reference system alignment directly impacts MBR accuracy; mismatched projections cause incorrect bounding boxes and silent join failures. See CRS Mapping & Transformations for projection normalization strategies.
import time

import duckdb

con = duckdb.connect()
# Install and load the spatial extension (GEOMETRY type, ST_* functions, RTREE index)
con.execute("INSTALL spatial; LOAD spatial;")
con.execute("CREATE TABLE test AS SELECT * FROM read_parquet('parcels_sorted.parquet');")
con.execute("CREATE INDEX idx_test_geom ON test USING RTREE (geom);")

# Time the indexed query; run the same query before CREATE INDEX for a sequential-scan baseline
start = time.perf_counter()
rows = con.execute("""
SELECT count(*)
FROM test
WHERE ST_Intersects(geom, ST_GeomFromText('POLYGON((0 0, 10 0, 10 10, 0 10, 0 0))'))
""").fetchone()[0]
elapsed = time.perf_counter() - start
print(f"Indexed scan: {rows} rows in {elapsed:.3f}s")
Performance Trade-off: WKB parsing adds ~15–30% CPU overhead compared to raw numeric column scans, though the exact figure depends on geometry complexity and hardware. Pre-parsing geometries into separate x_min, y_min, x_max, y_max columns eliminates runtime WKB extraction at the cost of storage duplication. For GeoJSON ingestion, ST_Read() parses features into GEOMETRY values automatically. Consult the GeoParquet specification for columnar encoding best practices and the OGC Simple Features standard for binary compliance requirements.
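The cost of extracting an MBR from a binary geometry is easier to see with the byte layout in hand. The sketch below parses a 2D polygon in standard OGC WKB using only the struct module and returns its bounding box; it assumes plain WKB (not DuckDB's internal GEOMETRY representation), handles only polygon type 3, and the helper names polygon_to_wkb and wkb_polygon_mbr are hypothetical.

```python
import struct


def polygon_to_wkb(rings):
    """Encode a 2D polygon as little-endian OGC WKB (byte order 1, geometry type 3)."""
    out = struct.pack("<BII", 1, 3, len(rings))
    for ring in rings:
        out += struct.pack("<I", len(ring))
        for x, y in ring:
            out += struct.pack("<dd", x, y)
    return out


def wkb_polygon_mbr(wkb: bytes):
    """Walk every vertex of a WKB polygon and return (xmin, ymin, xmax, ymax)."""
    endian = "<" if wkb[0] == 1 else ">"
    geom_type, n_rings = struct.unpack_from(endian + "II", wkb, 1)
    if geom_type != 3:
        raise ValueError("this sketch only handles 2D WKB polygons")
    offset = 9  # 1 byte order + 4 type + 4 ring count
    xs, ys = [], []
    for _ in range(n_rings):
        (n_points,) = struct.unpack_from(endian + "I", wkb, offset)
        offset += 4
        for _ in range(n_points):
            x, y = struct.unpack_from(endian + "dd", wkb, offset)
            offset += 16
            xs.append(x)
            ys.append(y)
    return min(xs), min(ys), max(xs), max(ys)


# The query polygon used throughout this section
wkb = polygon_to_wkb([[(0, 0), (10, 0), (10, 10), (0, 10), (0, 0)]])
print(len(wkb), wkb_polygon_mbr(wkb))
```

Every vertex has to be decoded to find the extremes, so the work grows with vertex count; four precomputed numeric columns replace that walk with a constant-size comparison.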
Diagnostic Boundaries & Production Tuning
| Symptom | Root Cause | Resolution |
|---|---|---|
| rows_scanned ≈ rows_returned in EXPLAIN ANALYZE | R-tree bypassed due to missing index or unsorted input | Build USING RTREE index; materialize sorted GeoParquet |
| ST_Intersects latency spikes on large datasets | WKB deserialization contention across threads | Pre-extract MBR columns; reduce threads to physical core count; batch queries |
| Incorrect spatial join results | CRS mismatch between source tables | Normalize projections using ST_Transform prior to materialization |
| Out of Memory during spatial join | R-tree build exceeds available memory | Set temp_directory so the engine can spill, or partition input data |
Execution Checklist:
- Verify EXPLAIN output contains FILTER (MBR) before TABLE_SCAN when an R-tree index exists.
- Confirm memory_limit is sufficient for the working set; monitor spill via SELECT * FROM duckdb_temporary_files();.
- Align thread count with physical cores (SET threads = <n>) to avoid context-switching penalties.
- Enforce explicit ST_Transform in ETL pipelines to prevent CRS drift during key generation.
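The CRS item is worth a concrete illustration, because a mismatch produces empty results rather than an error. The sketch below compares a bounding box in degrees (EPSG:4326) against one in metres (EPSG:3857) using the standard spherical Web Mercator forward formula; the coordinates and helper names are hypothetical, and in a real pipeline the projection would be done with ST_Transform rather than by hand.

```python
import math

EARTH_RADIUS_M = 6378137.0  # sphere radius used by EPSG:3857


def lonlat_to_web_mercator(lon: float, lat: float):
    """Forward spherical Web Mercator projection (degrees to metres)."""
    x = EARTH_RADIUS_M * math.radians(lon)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat) / 2))
    return x, y


def boxes_overlap(a, b) -> bool:
    """MBR overlap test on (xmin, ymin, xmax, ymax) tuples."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def project_box(box):
    """Project a lon/lat box corner-wise (valid here because each axis maps monotonically)."""
    xmin, ymin = lonlat_to_web_mercator(box[0], box[1])
    xmax, ymax = lonlat_to_web_mercator(box[2], box[3])
    return xmin, ymin, xmax, ymax


# Hypothetical data: the query window lies inside the parcel's extent
parcel_lonlat = (13.3, 52.4, 13.5, 52.6)                     # degrees, EPSG:4326
query_mercator = project_box((13.35, 52.45, 13.45, 52.55))   # metres, EPSG:3857

mismatched = boxes_overlap(parcel_lonlat, query_mercator)
aligned = boxes_overlap(project_box(parcel_lonlat), query_mercator)
print("mixed CRS overlap:", mismatched, "| same CRS overlap:", aligned)
```

With mixed units the rectangles are millions of units apart, so the MBR filter discards a true match before the exact predicate ever runs.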
DuckDB’s spatial indexing prioritizes execution-time vectorization and supports persistent R-tree structures for high-selectivity workloads. Aligning storage locality, memory allocation, and serialization formats keeps spatial queries fast as datasets grow. Monitoring EXPLAIN ANALYZE metrics and normalizing CRS and geometry encoding before materialization help prevent degenerate query plans in production.