Optimizing Point-in-Polygon Queries in DuckDB

Point-in-polygon (PIP) evaluation in DuckDB Spatial degrades predictably when geometry serialization overhead, missing spatial indexing, and unoptimized predicate ordering intersect. At scale, naive ST_Within or ST_Contains evaluations trigger full Cartesian scans, exhausting memory and saturating I/O bandwidth. This guide isolates execution bottlenecks and provides deterministic configuration paths for sub-second PIP resolution across analytical workloads.

Root-Cause Analysis of Execution Degradation

DuckDB’s vectorized execution engine processes geometries as GEOMETRY objects stored internally as WKB. Without a spatial index, every point is evaluated against every polygon vertex, forcing O(N×M)O(N \times M) complexity. The primary failure modes observed in production pipelines are:

  • Missing bounding box pre-filtering: Exact ST_Contains evaluation bypasses cheap bbox intersection checks, forcing unnecessary vertex traversal and SIMD pipeline stalls.
  • Predicate evaluation misalignment: Placing spatial functions only in a WHERE clause (rather than the ON clause of a join) can prevent the optimizer from applying spatial index routing.
  • Unbounded memory allocation during joins: DuckDB materializes geometry objects in memory before applying spatial predicates, causing heap spikes during large joins without explicit memory limits.

Deterministic Configuration & Memory Boundaries

Before rewriting query logic, enforce strict memory boundaries and disable non-essential runtime overhead. DuckDB Spatial does not auto-index geometry columns; create an R-tree index manually for effective predicate pushdown.

INSTALL spatial;
LOAD spatial;
SET memory_limit = '4GB';
SET threads = 8;
SET preserve_insertion_order = false;
SET enable_progress_bar = false;

For datasets exceeding 10M rows, partition by spatial grid or materialize bounding box columns to align I/O patterns with columnar cache boundaries.

Spatial Pre-Filtering & Index Materialization

Create an R-tree index on the boundary table to enable index-assisted bounding-box filtering. Envelope columns are not required as a separate step — the R-tree index on the native GEOMETRY column handles MBR filtering automatically.

-- Create an R-tree index for fast bounding-box lookups
CREATE INDEX idx_boundaries_geom ON boundaries USING RTREE (geom);

-- Verify the index exists
SELECT index_name, table_name, index_type
FROM duckdb_indexes()
WHERE table_name = 'boundaries';

Index materialization caps memory residency, prevents OOM during large spatial joins, and enables the query planner to push down bbox intersection checks before exact geometry evaluation. Reference implementations for proximity-based filtering patterns are documented in Spatial Joins & Proximity Filters.

Optimized Execution Pattern

Consider a baseline PIP query scanning 5M GPS points against 10K administrative boundaries:

-- Naive: triggers a nested loop join with no index utilization
SELECT p.id, p.lat, p.lon, b.name
FROM points p
JOIN boundaries b ON ST_Contains(b.geom, ST_Point(p.lon, p.lat));

This triggers a nested loop join with full geometry evaluation. Optimization requires explicit bbox filtering before exact evaluation, a pattern documented in Modern Spatial SQL Query Patterns where predicate ordering dictates memory residency:

-- Optimized: bbox filter in ON clause triggers index scan; exact check in WHERE
SELECT p.id, p.lon, p.lat, b.name
FROM points p
JOIN boundaries b
  ON b.geom && ST_Point(p.lon, p.lat)  -- Stage 1: bbox filter (R-tree index)
WHERE ST_Contains(b.geom, ST_Point(p.lon, p.lat));  -- Stage 2: exact check

The && operator in the ON clause leverages the R-tree index, reducing the candidate set by 85–99% before exact topology evaluation.

Diagnostic Queries & Plan Validation

Validate optimization impact using deterministic profiling:

EXPLAIN ANALYZE
SELECT p.id, b.name
FROM points p
JOIN boundaries b
  ON b.geom && ST_Point(p.lon, p.lat)
WHERE ST_Contains(b.geom, ST_Point(p.lon, p.lat));

Monitor key metrics in the EXPLAIN ANALYZE output: actual vs estimated row counts and operator timing. Vectorized aggregations on filtered result sets should execute within 200ms for datasets under 500k candidate rows. If execution time exceeds baseline, verify index utilization via EXPLAIN output and confirm && operator placement in the join condition.

Geometry Validation & Fallback Routing

Invalid geometries or topology errors cause silent query failures. Implement validation gates before PIP evaluation:

SELECT id, geom
FROM boundaries
WHERE NOT ST_IsValid(geom);

For malformed inputs, apply tolerance-based correction or fallback routing:

-- Repair invalid polygons before indexing
UPDATE boundaries SET geom = ST_MakeValid(geom) WHERE NOT ST_IsValid(geom);

-- Re-build the index after repairs
DROP INDEX IF EXISTS idx_boundaries_geom;
CREATE INDEX idx_boundaries_geom ON boundaries USING RTREE (geom);

-- Tolerance-based spatial join for GPS noise (500mm buffer in a metric CRS)
SELECT p.id, b.name
FROM points p
JOIN boundaries b
  ON b.geom && ST_Point(p.lon, p.lat)
WHERE ST_DWithin(b.geom, ST_Point(p.lon, p.lat), 0.0001); -- ~11m at equator in EPSG:4326

When memory limits are breached despite pre-filtering, implement chunked execution using row_number() partitioning or materialized intermediate tables. For persistent OOM conditions, aggregate points to a coarse grid cell first:

-- Grid-based pre-aggregation: count points per cell, then join cells to polygons
WITH grid_counts AS (
    SELECT
        floor(lon / 0.01) AS cell_x,
        floor(lat / 0.01) AS cell_y,
        count(*) AS point_count
    FROM points
    GROUP BY cell_x, cell_y
)
SELECT
    g.cell_x, g.cell_y, g.point_count, b.name
FROM grid_counts g
JOIN boundaries b
    ON b.geom && ST_Point(g.cell_x * 0.01 + 0.005, g.cell_y * 0.01 + 0.005)
WHERE ST_Contains(b.geom, ST_Point(g.cell_x * 0.01 + 0.005, g.cell_y * 0.01 + 0.005));

This reduces exact PIP evaluations by an order of magnitude while preserving analytical accuracy.