Understanding ST_Geometry vs WKB

The choice between ST_Geometry (DuckDB’s native GEOMETRY type, provided by the spatial extension) and raw Well-Known Binary (WKB stored as BLOB) dictates vectorized execution paths, memory allocation strategies, and spatial index construction overhead. GEOMETRY is DuckDB’s strongly typed spatial object; the engine recognizes it in predicate pushdown, index construction, and SIMD-accelerated bounding box evaluation. A raw BLOB column containing WKB bytes is opaque to the planner: every spatial operation requires an explicit ST_GeomFromWKB() decode first, which incurs heap allocations per row.
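
To make the per-row decode cost concrete, the following is a minimal Python sketch of what decoding a single 2D WKB point involves. It is an illustration of the WKB byte layout only, not DuckDB's implementation; the function name decode_wkb_point is hypothetical, and the sketch handles only Point geometries.

```python
import struct

def decode_wkb_point(wkb: bytes) -> tuple:
    """Decode a 2D WKB Point into an (x, y) tuple."""
    # Byte 0 is the byte-order flag: 0 = big-endian, 1 = little-endian.
    order = "<" if wkb[0] == 1 else ">"
    # Bytes 1-4 hold the geometry type code; 1 means Point.
    (geom_type,) = struct.unpack_from(order + "I", wkb, 1)
    if geom_type != 1:
        raise ValueError(f"expected WKB Point (type 1), got type {geom_type}")
    # Bytes 5-20 hold x and y as two 64-bit floats.
    x, y = struct.unpack_from(order + "dd", wkb, 5)
    return x, y

# Build a little-endian WKB point for (-73.98, 40.75), then decode it again.
sample = struct.pack("<BIdd", 1, 1, -73.98, 40.75)
print(decode_wkb_point(sample))  # prints (-73.98, 40.75)
```

Every row of a BLOB column has to pass through a step like this before a spatial predicate can run, which is the work a native GEOMETRY column avoids repeating.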

Memory Layout & Vectorized Execution

DuckDB’s columnar engine processes GEOMETRY columns natively through its spatial extension type. Coordinates are stored in contiguous, cache-aligned buffers, enabling SIMD-accelerated bounding box evaluation. A raw BLOB (WKB) column forces a decode pass per row. In high-cardinality joins or spatial predicates (ST_Intersects, ST_Contains), WKB triggers repeated heap allocations for temporary coordinate arrays.

-- WKB path: incurs decode overhead per predicate evaluation
SELECT id FROM parcels WHERE ST_Intersects(ST_GeomFromWKB(wkb_col), ST_Point(-73.98, 40.75));

-- Native GEOMETRY path: the planner recognizes the type and applies bounding-box pruning
SELECT id FROM parcels WHERE ST_Intersects(geom_col, ST_Point(-73.98, 40.75));

The query planner recognizes native GEOMETRY columns and pushes bounding box filters directly into the scan operator. BLOB columns bypass this optimization, materializing full rows before spatial evaluation. For datasets exceeding 10M rows, this manifests as a 3–5× increase in peak memory and sustained CPU utilization.
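
The idea behind bounding-box pruning can be sketched in a few lines of Python. This is a conceptual illustration under simplifying assumptions (each parcel is a list of (x, y) vertices, and the names BBox, bbox_of, and candidates are hypothetical), not a description of DuckDB's scan operator.

```python
from typing import NamedTuple

class BBox(NamedTuple):
    xmin: float
    ymin: float
    xmax: float
    ymax: float

def bbox_of(coords):
    """Compute the axis-aligned bounding box of a list of (x, y) vertices."""
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return BBox(min(xs), min(ys), max(xs), max(ys))

def bbox_contains_point(box, x, y):
    """Cheap test: four comparisons, no geometry decoding."""
    return box.xmin <= x <= box.xmax and box.ymin <= y <= box.ymax

def candidates(parcels, x, y):
    """Return ids whose bounding box contains the point.

    Survivors still need the exact (more expensive) intersection test.
    """
    return [pid for pid, ring in parcels.items()
            if bbox_contains_point(bbox_of(ring), x, y)]

parcels = {
    1: [(-74.0, 40.7), (-73.9, 40.7), (-73.9, 40.8), (-74.0, 40.8)],
    2: [(-118.3, 34.0), (-118.2, 34.0), (-118.2, 34.1), (-118.3, 34.1)],
}
print(candidates(parcels, -73.98, 40.75))  # prints [1]
```

Only rows that survive the cheap comparison proceed to the exact predicate, so the earlier this filter runs in the plan, the fewer geometries need full evaluation.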

Spatial Index Construction & Query Planning

R-tree index builds need a bounding box for every row. Native GEOMETRY feeds directly into the R-tree builder without intermediate serialization. A BLOB column must first be materialized as a GEOMETRY column, because the R-tree can only index GEOMETRY columns.

-- Promote a raw WKB BLOB column to native GEOMETRY, then build an R-tree index.
-- DuckDB has no ALTER TABLE ... ADD GENERATED column; add a plain column and populate it.
ALTER TABLE parcels ADD COLUMN geom_col GEOMETRY;
UPDATE parcels SET geom_col = ST_GeomFromWKB(wkb_col);
CREATE INDEX idx_parcels_geom ON parcels USING RTREE (geom_col);

See Spatial Indexing Internals for R-tree construction details and query plan validation patterns.

In-Memory vs Disk Storage & I/O Patterns

DuckDB’s buffer pool manages GEOMETRY vectors differently than raw BLOB columns. GEOMETRY columns benefit from columnar compression and direct memory mapping during spill-to-disk operations. BLOB columns remain opaque to the storage engine, preventing predicate pushdown and forcing full-page reads during out-of-core execution. When memory_limit is constrained, BLOB-heavy workloads trigger aggressive spilling, increasing disk I/O latency by 40–70%.

Mitigation requires explicit materialization before analytical joins:

CREATE TABLE parcels_opt AS
SELECT * EXCLUDE (wkb_col), ST_GeomFromWKB(wkb_col) AS geom_col FROM parcels;

CRS Mapping, Transformations, and Drift Troubleshooting

Neither DuckDB’s GEOMETRY nor raw WKB carries an inline SRID — DuckDB tracks no per-geometry CRS, so a layer’s projection must be tracked out-of-band (column or table metadata). Mixing layers in different CRSs produces silent coordinate drift or ST_Transform failures.

Diagnostic query for CRS drift detection (by coordinate range, since there is no ST_SRID):

-- Geographic (lon/lat) data sits within ±180/±90; projected metric data does not.
SELECT
    ST_XMin(geom_col) BETWEEN -180 AND 180
      AND ST_YMin(geom_col) BETWEEN -90 AND 90 AS looks_geographic,
    COUNT(*) AS row_count,
    MIN(ST_XMin(geom_col)) AS min_x,
    MAX(ST_XMax(geom_col)) AS max_x
FROM parcels
GROUP BY looks_geographic;
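
The same range heuristic can be expressed outside the database, for example to check extents before loading. The sketch below is a hedged illustration: looks_geographic and the sample extents are hypothetical, and unlike the query above it checks the maximum coordinates as well as the minimums. Projected data that happens to sit near the origin can still pass, so treat the result as a hint rather than proof.

```python
from collections import Counter

def looks_geographic(xmin, ymin, xmax, ymax):
    """True when the extent fits inside the lon/lat range of +/-180 by +/-90."""
    return -180 <= xmin and xmax <= 180 and -90 <= ymin and ymax <= 90

# Hypothetical extents as (xmin, ymin, xmax, ymax): two in degrees, one in metres.
extents = [
    (-74.0, 40.7, -73.9, 40.8),
    (-118.3, 34.0, -118.2, 34.1),
    (-8237642.0, 4970241.0, -8226510.0, 4984896.0),
]

counts = Counter(looks_geographic(*e) for e in extents)
print(counts[True], counts[False])  # prints 2 1
```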

Fallback routing for mixed-projection ingestion:

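-- Standardize to EPSG:4326 during ingestion (transform from the known source CRS).
-- always_xy := true keeps the output in lon/lat (x, y) order; without it, EPSG:4326 follows the authority's lat/lon axis order.
INSERT INTO parcels_clean (id, geom_col)
SELECT id, ST_Transform(geom_col, 'EPSG:3857', 'EPSG:4326', always_xy := true)
FROM parcels_raw
WHERE ST_IsValid(geom_col);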
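
For a sanity check on transformed values, the EPSG:3857 to EPSG:4326 conversion can be reproduced with the standard spherical Web Mercator formulas. The following Python sketch is an independent cross-check under that assumption, not the code path ST_Transform uses; the function names are hypothetical.

```python
import math

# Sphere radius used by Web Mercator (equal to the WGS 84 semi-major axis, in metres).
EARTH_RADIUS_M = 6378137.0

def web_mercator_to_lonlat(x, y):
    """Convert EPSG:3857 metres to (lon, lat) in degrees."""
    lon = math.degrees(x / EARTH_RADIUS_M)
    lat = math.degrees(2.0 * math.atan(math.exp(y / EARTH_RADIUS_M)) - math.pi / 2.0)
    return lon, lat

def lonlat_to_web_mercator(lon, lat):
    """Convert (lon, lat) in degrees to EPSG:3857 metres."""
    x = EARTH_RADIUS_M * math.radians(lon)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4.0 + math.radians(lat) / 2.0))
    return x, y

# Round-trip a point: the result should match the input to within float error.
x, y = lonlat_to_web_mercator(-73.98, 40.75)
print(web_mercator_to_lonlat(x, y))
```

Spot-checking a handful of rows against a calculation like this is one way to catch a swapped axis order or a wrong source CRS early.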

GeoParquet Parsing & GeoJSON Ingestion Pipelines

GeoParquet files store geometry as WKB in a binary column described by the file’s "geo" metadata when written by compliant writers (e.g., pyarrow + geopandas). With the spatial extension loaded, DuckDB’s read_parquet converts such columns to GEOMETRY on read. GeoJSON ingestion via st_read() produces GEOMETRY columns directly. Manual ingestion via read_json_auto requires an explicit ST_GeomFromGeoJSON() call, which adds JSON tokenization overhead before the geometry is constructed.

-- Direct GeoParquet scan: with the spatial extension loaded, the geometry column is read as GEOMETRY
SELECT id, geom FROM read_parquet('s3://bucket/data.parquet');

-- Optimized GeoJSON ingestion via st_read (preferred for well-formed GeoJSON)
SELECT id, geom FROM st_read('s3://bucket/data.geojson');

-- Manual GeoJSON extraction when features are in a generic JSON column
COPY (
    SELECT
        id,
        ST_GeomFromGeoJSON(geojson_col) AS geom_col
    FROM read_json_auto('s3://bucket/data.json', columns={'id': 'INTEGER', 'geojson_col': 'VARCHAR'})
) TO 's3://bucket/data_optimized.parquet' (FORMAT PARQUET);
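
The two-stage cost of the manual path (parse JSON text, then build a binary geometry) can be illustrated with a small Python sketch. It is a simplified illustration that handles only GeoJSON Point geometries and emits little-endian WKB; the function name geojson_point_to_wkb is hypothetical and unrelated to how ST_GeomFromGeoJSON is implemented.

```python
import json
import struct

def geojson_point_to_wkb(text: str) -> bytes:
    """Parse a GeoJSON Point string and encode it as little-endian 2D WKB."""
    # Stage 1: JSON tokenization and parsing of the text payload.
    obj = json.loads(text)
    if obj.get("type") != "Point":
        raise ValueError("this sketch handles only GeoJSON Point geometries")
    x, y = obj["coordinates"][:2]
    # Stage 2: binary encoding (byte order 1, type code 1 = Point, then x and y).
    return struct.pack("<BIdd", 1, 1, float(x), float(y))

wkb = geojson_point_to_wkb('{"type": "Point", "coordinates": [-73.98, 40.75]}')
print(len(wkb))  # prints 21
```

Doing this conversion once at ingestion and persisting the result, as the COPY statement above does, avoids paying the parsing stage on every query.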

Enterprise Deployment & Access Control

DuckDB has no GRANT/role system. Enforce access at the boundary: expose only curated views (dropping the raw wkb_col), attach the database read-only for analysts, and rely on filesystem permissions.

-- Expose a curated view with only the native GEOMETRY column
CREATE VIEW parcels_secured AS
SELECT id, geom_col FROM parcels;

-- Attach read-only for analysts
ATTACH 'production.duckdb' AS prod (READ_ONLY);

Diagnostic Queries & Fallback Routing

Identify WKB-induced bottlenecks before deployment:

-- Detect BLOB columns that should be promoted to native GEOMETRY
SELECT table_name, column_name, data_type
FROM duckdb_columns()
WHERE data_type = 'BLOB'
  AND column_name ILIKE '%wkb%';

-- Verify index utilization during spatial scan
EXPLAIN ANALYZE
SELECT id FROM parcels WHERE ST_Intersects(geom_col, ST_Point(-73.98, 40.75));

Fallback routing when spatial indexes fail to materialize:

  1. Verify the predicate is selective enough for the planner to choose the R-tree (inspect EXPLAIN).
  2. Compare against a sequential scan by dropping the index temporarily.
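  3. Run ANALYZE parcels; to refresh statistics.
  4. If memory pressure persists, partition by a coordinate-derived grid cell, for example (floor(ST_X(ST_Centroid(geom_col)) / cell), floor(ST_Y(ST_Centroid(geom_col)) / cell)) for a chosen cell size, and process one partition at a time. Using the centroid keeps the expression valid for polygons, since ST_X and ST_Y apply to points.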
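
The grid-cell partitioning in the last step can be sketched as follows. This Python illustration assumes each feature has already been reduced to one representative (x, y) point; the names grid_cell and partition_by_cell and the cell size of 1.0 are hypothetical choices for the example.

```python
import math
from collections import defaultdict

def grid_cell(x, y, cell):
    """Map a coordinate to its integer grid cell for the given cell size."""
    return (math.floor(x / cell), math.floor(y / cell))

def partition_by_cell(points, cell):
    """Group feature ids by grid cell so each group can be processed separately."""
    buckets = defaultdict(list)
    for pid, (x, y) in points.items():
        buckets[grid_cell(x, y, cell)].append(pid)
    return dict(buckets)

points = {1: (-73.98, 40.75), 2: (-73.91, 40.78), 3: (-118.25, 34.05)}
print(partition_by_cell(points, 1.0))
# prints {(-74, 40): [1, 2], (-119, 34): [3]}
```

A smaller cell size yields more, smaller partitions, which lowers peak memory per batch at the cost of more passes.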

Configuration Reference & Tuning Parameters

Parameter | Default | Recommended for Spatial Workloads | Effect
preserve_insertion_order | true | false | Unlocks parallel, out-of-order scans and aggregation
memory_limit | 80% of RAM | 75% of host RAM | Prevents aggressive WKB decode spilling
threads | number of CPU cores | number of physical cores | Maximizes parallel coordinate evaluation
enable_http_metadata_cache | false | true | Caches remote file metadata for repeated S3/HTTP reads

Architecture-level tuning must align with DuckDB Spatial Architecture & Fundamentals to ensure planner optimizations propagate through the execution pipeline. Always validate spatial type consistency before production deployment.