In-Memory vs Disk Storage: Tactical Patterns for DuckDB Spatial Workflows
DuckDB’s vectorized execution engine defaults to aggressive in-memory processing, but geospatial workloads routinely violate that assumption. Complex polygon intersections, unbounded point clouds, and high-resolution rasters will exhaust RAM if query design, memory limits, and spill-to-disk thresholds are not explicitly managed. The decision between in-memory and disk-backed execution is not architectural dogma; it is a configuration and query-optimization problem. Understanding how DuckDB Spatial Architecture & Fundamentals manages buffer pools, operator fusion, and temporary storage is mandatory for production stability.
Memory Allocation & Thread Configuration
DuckDB Spatial inherits the core engine’s memory management model: it allocates contiguous blocks for columnar vectors, spills intermediate results to a temporary directory when thresholds are breached, and parallelizes spatial operators across available cores. In production, bound these behaviors explicitly:
-- Production memory & thread configuration
SET memory_limit = '8GB';
SET threads = 8;
SET temp_directory = '/mnt/duckdb-temp/spatial';
SET enable_progress_bar = true;
The memory_limit acts as a hard ceiling. When spatial joins or aggregations approach this limit, DuckDB automatically switches to external merge-sort and hash-join spilling. However, spatial operators (e.g., ST_Intersects, ST_DWithin) materialize bounding boxes and geometry arrays before evaluation. If your working set exceeds ~60% of memory_limit, expect I/O-bound execution. Monitor spill behavior using EXPLAIN ANALYZE:
EXPLAIN ANALYZE
SELECT a.id, b.zone_name
FROM parcels a
JOIN zoning b ON ST_Intersects(a.geom, b.geom)
WHERE b.zone_type = 'commercial';
Representative output for a disk-spilled execution:
| Operator | Timing (ms) | Rows Produced | Memory (MB) | Spill (MB) |
|---|---|---|---|---|
| Projection | 12.4 | 15,200 | 0.0 | 0.0 |
| External Hash Join | 145.2 | 15,200 | 412.5 | 890.0 |
| Table Scan (parcels) | 18.1 | 45,000 | 0.0 | 0.0 |
| Table Scan (zoning) | 14.3 | 12,500 | 0.0 | 0.0 |
Inspect Physical Operators for Hash Join vs External Hash Join. The presence of External confirms disk spilling. If spilling occurs on every execution, increase temp_directory throughput (NVMe recommended) or pre-filter geometries using bounding box predicates before invoking exact topology functions.
Format-Specific Ingestion & Memory Footprints
GeoJSON ingestion is notoriously memory-inefficient. The format’s nested JSON structure forces row-by-row parsing, inflating memory usage by 3–5× compared to columnar alternatives. For production pipelines, convert GeoJSON to GeoParquet upstream or use DuckDB’s st_read with explicit column selection:
-- Direct GeoJSON ingestion via st_read (preferred for well-formed files)
CREATE OR REPLACE TABLE parcels_stream AS
SELECT geom, properties->>'name' AS name
FROM st_read('s3://bucket/parcels.geojson');
GeoParquet parsing leverages columnar compression, dictionary encoding, and predicate pushdown. The spatial extension reads geometry columns as binary WKB, deferring deserialization until spatial operators execute. This deferred evaluation reduces peak RAM by up to 70% during filtering phases. Detailed implementation specifics are covered in GeoParquet Parsing, which outlines how the extension handles coordinate precision and metadata retention during columnar reads.
Spatial Indexing Internals & Query Execution
DuckDB supports a persistent R-tree index (created with CREATE INDEX ... USING RTREE) on GEOMETRY columns. Without an explicit R-tree, the engine falls back to a hash or nested-loop join. For high-throughput pipelines, persisting geometry in GeoParquet format with Hilbert-sorted rows preserves spatial locality and reduces parsing latency for bounding-box selective queries.
To optimize disk-spilled joins, enforce a two-phase filter:
- Phase 1 (Disk-Friendly): Bounding-box pre-filter with the
&&operator (orST_DWithinwith coarse thresholds). - Phase 2 (In-Memory/Spilled): Exact topology checks (
ST_Intersects,ST_Contains).
import duckdb
conn = duckdb.connect()
conn.execute("INSTALL spatial; LOAD spatial;")
conn.execute("SET memory_limit = '4GB';")
conn.execute("SET temp_directory = '/mnt/duckdb_spill';")
# Sort by Hilbert index to cluster spatially proximate records,
# minimizing cross-partition I/O during disk spills.
conn.execute("""
CREATE TABLE indexed_parcels AS
SELECT * FROM parcels
ORDER BY ST_Hilbert(geom, ST_Extent(geom) OVER ());
""")
CRS Mapping, Transformations & Drift Troubleshooting
On-the-fly coordinate reference system transformations impose significant CPU and memory overhead. Each ST_Transform call materializes intermediate coordinate arrays before topology evaluation. When joining datasets with mismatched CRS definitions, normalize both sides to a common CRS before the join — DuckDB has no per-table SRID and cannot detect mismatches automatically.
-- Normalize both layers to a common CRS explicitly
CREATE OR REPLACE TABLE parcels_norm AS
SELECT * EXCLUDE (geom), ST_Transform(geom, 'EPSG:3857', 'EPSG:4326') AS geom
FROM parcels;
CREATE OR REPLACE TABLE zoning_norm AS
SELECT * EXCLUDE (geom), ST_Transform(geom, 'EPSG:3857', 'EPSG:4326') AS geom
FROM zoning;
If CRS drift occurs (e.g., legacy EPSG codes vs WKT strings), spatial joins degrade to full Cartesian products. Validate that both layers share a CRS by checking coordinate ranges before joining. Refer to CRS Mapping & Transformations for authoritative guidance on projection caching and transformation pipeline optimization.
Raster Workloads & Large Object Handling
DuckDB has no native raster type or TIFF reader; rasters must be converted to tabular/vector form (pixel points or tiles) before ingestion. A single 10,000×10,000 3-band export can exceed ~900MB uncompressed, so chunk the conversion upstream and load the result like any large vector table:
-- Convert rasters to tabular form first (e.g. GDAL: gdal2xyz / gdal_translate
-- → Parquet), then load the pixel points and bound memory as usual.
CREATE OR REPLACE TABLE raster_points AS
SELECT band, ST_Point(x, y) AS geom, value
FROM read_parquet('s3://bucket/ortho_pixels/*.parquet');
For enterprise deployments, isolate raster processing in dedicated worker pools and cap raster memory allocation independently of vector workloads. Comprehensive thresholds and spill configurations are documented in Memory Limits for Large Raster Data. External raster processing standards, such as those defined by the Open Geospatial Consortium (OGC) GeoTIFF specification, should guide your tiling and compression strategies.
Enterprise Deployment & Security Patterns
Production DuckDB Spatial instances require strict isolation boundaries. The temp_directory must reside on a dedicated, encrypted volume with restricted POSIX permissions. Multi-tenant environments should leverage DuckDB’s read-only database attachments and explicit ATTACH statements to prevent cross-tenant data leakage.
-- Secure multi-tenant attachment
ATTACH 'tenant_a.db' AS tenant_a (READ_ONLY);
ATTACH 'tenant_b.db' AS tenant_b (READ_ONLY);
-- DuckDB has no GRANT/role system; READ_ONLY attachments plus per-file OS
-- permissions provide tenant isolation.
CLI initialization for enterprise workloads requires deterministic environment variables and hardened connection strings. The operational baseline is established in Setting Up DuckDB Spatial CLI, which details secure credential injection and audit logging. For Python integrations, enforce connection pooling and explicit transaction boundaries using the official DuckDB Python API documentation.
Diagnostic Boundaries & Decision Matrix
The choice between in-memory and disk execution should follow explicit thresholds, not heuristics. Use the following diagnostic matrix to route workloads:
graph TD
A["Spatial workload"] --> B{"Dataset < 15 GB<br/>and memory use < 60%?"}
B -->|"yes"| C["In-memory<br/>vectorized execution"]
B -->|"no"| D["Set temp_directory (NVMe)"]
D --> E["Disk-backed spilling<br/>external hash join / sort"]
| Workload Characteristic | In-Memory Execution | Disk-Spill Execution |
|---|---|---|
| Dataset Size | < 15GB uncompressed | > 15GB or unbounded |
| Join Type | Point-to-Polygon, <1M rows | Polygon-to-Polygon, complex topology |
| CRS State | Uniform, pre-validated | Mixed, requires ST_Transform |
| Temp Storage IOPS | N/A | NVMe, >50k IOPS |
| Memory Utilization | < 60% of memory_limit |
> 60%, triggers external operators |
When EXPLAIN ANALYZE reports External Hash Join or External Order By, verify that temp_directory throughput matches the query’s spill rate. If I/O wait exceeds 40%, reduce threads to prevent lock contention on the spill buffer, or refactor the query to apply spatial filters earlier in the execution pipeline.
Conclusion
DuckDB Spatial’s execution model is highly adaptive, but production stability requires explicit configuration boundaries. By aligning memory limits, ingestion formats, and spatial indexing strategies with your hardware profile, you can deterministically route workloads between in-memory vectorized execution and disk-backed spilling. Monitor spill operators, enforce CRS consistency, and isolate raster workloads to maintain predictable latency and throughput at scale.