GeoJSON Ingestion
GeoJSON ingestion in production environments is fundamentally an I/O and memory-bound operation. DuckDB Spatial resolves this through a vectorized, zero-copy JSON parser that materializes coordinate arrays directly into contiguous columnar buffers. Understanding the underlying DuckDB Spatial Architecture & Fundamentals is required before scaling ingestion pipelines, as the engine bypasses traditional row-by-row deserialization in favor of SIMD-optimized coordinate streaming. This page documents execution-ready patterns, execution plan tuning, and diagnostic boundaries for data engineering and GIS workflows.
Production Ingestion Pipeline & Configuration
The recommended ingestion path for well-formed GeoJSON is st_read(), which uses the GDAL/OGR vector reader and returns a table with a geom column typed as GEOMETRY. GeoJSON compliance follows RFC 7946: The GeoJSON Format, which mandates WGS84 coordinates and a strict FeatureCollection structure.
-- Configure execution boundaries before ingestion
SET threads = 8;
SET memory_limit = '16GB';
SET preserve_insertion_order = false;
SET max_temp_directory_size = '100GB';
-- Ingest with st_read (recommended for well-formed GeoJSON files)
CREATE OR REPLACE TABLE raw_geojson AS
SELECT
properties->>'id' AS id,
properties->>'name' AS name,
properties->>'category' AS category,
geom
FROM st_read('s3://data-lake/ingest/parcels_2024.geojson');
-- Enforce geometry validity early to prevent downstream query failures
UPDATE raw_geojson
SET geom = ST_MakeValid(geom)
WHERE NOT ST_IsValid(geom);
For non-standard JSON where features are stored as individual JSON objects (one per line or nested), use read_json_auto with manual geometry extraction:
CREATE OR REPLACE TABLE raw_geojson_manual AS
SELECT
(properties->>'id')::BIGINT AS id,
(properties->>'name')::VARCHAR AS name,
st_geomfromgeojson(json_extract(data, '$.geometry')::VARCHAR) AS geom
FROM read_json_auto('s3://data-lake/ingest/parcels_2024.json', maximum_object_size=10485760);
Python Integration Pattern When orchestrating via Python, disable automatic progress reporting and bind parameters to prevent injection and optimize query caching:
import duckdb
con = duckdb.connect(config={'threads': 8, 'memory_limit': '16GB'})
con.execute("INSTALL spatial; LOAD spatial;")
con.execute("SET preserve_insertion_order = false;")
# Parameterized ingestion for repeatable pipelines
query = """
CREATE OR REPLACE TABLE raw_geojson AS
SELECT properties->>'id' AS id, geom
FROM st_read($1)
"""
con.execute(query, ['s3://data-lake/ingest/parcels_2024.geojson'])
Execution Plan Analysis & Diagnostics
Run EXPLAIN ANALYZE against the ingestion query to validate operator behavior:
EXPLAIN ANALYZE
CREATE OR REPLACE TABLE raw_geojson AS
SELECT properties->>'name' AS name, geom
FROM st_read('s3://data-lake/ingest/parcels_2024.geojson');
Key Operator Thresholds:
ST_Read/READ_CSV_AUTO: The scan operator. Forst_read, confirm row count matches the expected feature count.PhysicalProjection: Confirms column pruning. Selecting only required columns avoids full JSON tree traversal, reducing peak RSS by 30–50% on property-rich features.- Spatial predicates applied in a
WHEREclause after ingestion are evaluated post-materialization on the resulting table.
Diagnostic Boundary: If the query shows spill behavior in EXPLAIN ANALYZE, adjust SET memory_limit upward or partition the source file. DuckDB’s spill behavior is deterministic; monitor SELECT * FROM duckdb_temporary_files(); to avoid I/O thrashing on network-attached storage.
Storage Topology & Memory Boundaries
GeoJSON ingestion defaults to in-memory materialization, which is optimal for datasets under 4GB. For larger payloads, explicit disk-backed storage prevents OOM termination and enables concurrent analytical workloads. Refer to In-Memory vs Disk Storage for engine-level allocation strategies.
Production Disk-Backed Pattern:
-- Attach a persistent database file for final storage
ATTACH '/mnt/warehouse/warehouse.duckdb' AS warehouse;
CREATE OR REPLACE TABLE warehouse.raw_geojson AS
SELECT geom, properties->>'name' AS name
FROM st_read('s3://data-lake/ingest/parcels_2024.geojson');
CRS Handling & Transformation GeoJSON strictly uses EPSG:4326 (WGS84, lon/lat). If downstream analytics require projected coordinates (e.g., EPSG:3857 or a local UTM zone), apply transformations immediately after ingestion to avoid repeated projection overhead during query execution. Consult CRS Mapping & Transformations for coordinate system resolution logic and precision loss mitigation.
-- Project to metric coordinate system post-ingestion
CREATE OR REPLACE TABLE parcels_metric AS
SELECT
id, name, category,
st_transform(geom, 'EPSG:4326', 'EPSG:32633') AS geom
FROM raw_geojson;
Performance Trade-offs & Tuning Guidelines
| Configuration | Performance Impact | Diagnostic Trigger |
|---|---|---|
preserve_insertion_order = true |
Disables parallel scan. Ingestion latency increases 2–4×. | High single-thread CPU utilization during scan. |
threads > physical_cores |
Thread thrashing and L3 cache contention. Peak RSS spikes. | CPU utilization < 70% with high context switches. |
ST_MakeValid() on 100% of rows |
CPU-bound validation overhead (~15–25% of ingestion time). | Apply only when ST_IsValid(geom) returns > 2% invalid. |
temp_directory unset (no spill target) |
OOM termination on payloads > memory_limit. |
duckdb.OutOfMemoryException during scan. |
Execution Boundaries:
- Memory Pressure: If
EXPLAIN ANALYZEshows spill nodes, partition the source GeoJSON by feature count or geographic bounding box before ingestion. - Invalid Geometry Rate: When
ST_IsValidreturns > 5% invalid features, isolate them into a quarantine table rather than applyingST_MakeValidglobally. This preserves query determinism and prevents topology corruption. - Cloud Storage: For S3 access, configure credentials via
CREATE SECRETor environment variables. UseSET s3_region = 'us-east-1'if auto-detection fails.
Validation & Error Handling
Production pipelines must enforce strict geometry contracts. Use ST_IsEmpty and ST_IsValid to filter malformed features before indexing or spatial joins.
-- Quarantine invalid geometries for manual review
CREATE OR REPLACE TABLE geojson_quarantine AS
SELECT id, name, geom,
CASE WHEN ST_IsEmpty(geom) THEN 'empty'
WHEN NOT ST_IsValid(geom) THEN 'invalid'
END AS failure_reason
FROM raw_geojson
WHERE NOT ST_IsValid(geom) OR ST_IsEmpty(geom);
-- Drop quarantined rows from primary table
DELETE FROM raw_geojson WHERE id IN (SELECT id FROM geojson_quarantine);
Monitor failure_reason distributions to identify upstream serialization errors. Common patterns include self-intersecting polygons, duplicate vertices, or non-planar rings. Resolve at the source ETL layer when possible to avoid repeated compute costs during ingestion.