GeoJSON Ingestion

GeoJSON ingestion in production environments is fundamentally an I/O and memory-bound operation. DuckDB Spatial resolves this through a vectorized, zero-copy JSON parser that materializes coordinate arrays directly into contiguous columnar buffers. Understanding the underlying DuckDB Spatial Architecture & Fundamentals is required before scaling ingestion pipelines, as the engine bypasses traditional row-by-row deserialization in favor of SIMD-optimized coordinate streaming. This page documents execution-ready patterns, execution plan tuning, and diagnostic boundaries for data engineering and GIS workflows.

Production Ingestion Pipeline & Configuration

The recommended ingestion path for well-formed GeoJSON is st_read(), which uses the GDAL/OGR vector reader and returns a table with a geom column typed as GEOMETRY. GeoJSON compliance follows RFC 7946: The GeoJSON Format, which mandates WGS84 coordinates and a strict FeatureCollection structure.

-- Configure execution boundaries before ingestion
SET threads = 8;
SET memory_limit = '16GB';
SET preserve_insertion_order = false;
SET max_temp_directory_size = '100GB';

-- Ingest with st_read (recommended for well-formed GeoJSON files)
CREATE OR REPLACE TABLE raw_geojson AS
SELECT
    properties->>'id' AS id,
    properties->>'name' AS name,
    properties->>'category' AS category,
    geom
FROM st_read('s3://data-lake/ingest/parcels_2024.geojson');

-- Enforce geometry validity early to prevent downstream query failures
UPDATE raw_geojson
SET geom = ST_MakeValid(geom)
WHERE NOT ST_IsValid(geom);

For non-standard JSON where features are stored as individual JSON objects (one per line or nested), use read_json_auto with manual geometry extraction:

CREATE OR REPLACE TABLE raw_geojson_manual AS
SELECT
    (properties->>'id')::BIGINT AS id,
    (properties->>'name')::VARCHAR AS name,
    st_geomfromgeojson(json_extract(data, '$.geometry')::VARCHAR) AS geom
FROM read_json_auto('s3://data-lake/ingest/parcels_2024.json', maximum_object_size=10485760);

Python Integration Pattern When orchestrating via Python, disable automatic progress reporting and bind parameters to prevent injection and optimize query caching:

import duckdb

con = duckdb.connect(config={'threads': 8, 'memory_limit': '16GB'})
con.execute("INSTALL spatial; LOAD spatial;")
con.execute("SET preserve_insertion_order = false;")

# Parameterized ingestion for repeatable pipelines
query = """
    CREATE OR REPLACE TABLE raw_geojson AS
    SELECT properties->>'id' AS id, geom
    FROM st_read($1)
"""
con.execute(query, ['s3://data-lake/ingest/parcels_2024.geojson'])

Execution Plan Analysis & Diagnostics

Run EXPLAIN ANALYZE against the ingestion query to validate operator behavior:

EXPLAIN ANALYZE
CREATE OR REPLACE TABLE raw_geojson AS
SELECT properties->>'name' AS name, geom
FROM st_read('s3://data-lake/ingest/parcels_2024.geojson');

Key Operator Thresholds:

  • ST_Read / READ_CSV_AUTO: The scan operator. For st_read, confirm row count matches the expected feature count.
  • PhysicalProjection: Confirms column pruning. Selecting only required columns avoids full JSON tree traversal, reducing peak RSS by 30–50% on property-rich features.
  • Spatial predicates applied in a WHERE clause after ingestion are evaluated post-materialization on the resulting table.

Diagnostic Boundary: If the query shows spill behavior in EXPLAIN ANALYZE, adjust SET memory_limit upward or partition the source file. DuckDB’s spill behavior is deterministic; monitor SELECT * FROM duckdb_temporary_files(); to avoid I/O thrashing on network-attached storage.

Storage Topology & Memory Boundaries

GeoJSON ingestion defaults to in-memory materialization, which is optimal for datasets under 4GB. For larger payloads, explicit disk-backed storage prevents OOM termination and enables concurrent analytical workloads. Refer to In-Memory vs Disk Storage for engine-level allocation strategies.

Production Disk-Backed Pattern:

-- Attach a persistent database file for final storage
ATTACH '/mnt/warehouse/warehouse.duckdb' AS warehouse;
CREATE OR REPLACE TABLE warehouse.raw_geojson AS
SELECT geom, properties->>'name' AS name
FROM st_read('s3://data-lake/ingest/parcels_2024.geojson');

CRS Handling & Transformation GeoJSON strictly uses EPSG:4326 (WGS84, lon/lat). If downstream analytics require projected coordinates (e.g., EPSG:3857 or a local UTM zone), apply transformations immediately after ingestion to avoid repeated projection overhead during query execution. Consult CRS Mapping & Transformations for coordinate system resolution logic and precision loss mitigation.

-- Project to metric coordinate system post-ingestion
CREATE OR REPLACE TABLE parcels_metric AS
SELECT
    id, name, category,
    st_transform(geom, 'EPSG:4326', 'EPSG:32633') AS geom
FROM raw_geojson;

Performance Trade-offs & Tuning Guidelines

Configuration Performance Impact Diagnostic Trigger
preserve_insertion_order = true Disables parallel scan. Ingestion latency increases 2–4×. High single-thread CPU utilization during scan.
threads > physical_cores Thread thrashing and L3 cache contention. Peak RSS spikes. CPU utilization < 70% with high context switches.
ST_MakeValid() on 100% of rows CPU-bound validation overhead (~15–25% of ingestion time). Apply only when ST_IsValid(geom) returns > 2% invalid.
temp_directory unset (no spill target) OOM termination on payloads > memory_limit. duckdb.OutOfMemoryException during scan.

Execution Boundaries:

  1. Memory Pressure: If EXPLAIN ANALYZE shows spill nodes, partition the source GeoJSON by feature count or geographic bounding box before ingestion.
  2. Invalid Geometry Rate: When ST_IsValid returns > 5% invalid features, isolate them into a quarantine table rather than applying ST_MakeValid globally. This preserves query determinism and prevents topology corruption.
  3. Cloud Storage: For S3 access, configure credentials via CREATE SECRET or environment variables. Use SET s3_region = 'us-east-1' if auto-detection fails.

Validation & Error Handling

Production pipelines must enforce strict geometry contracts. Use ST_IsEmpty and ST_IsValid to filter malformed features before indexing or spatial joins.

-- Quarantine invalid geometries for manual review
CREATE OR REPLACE TABLE geojson_quarantine AS
SELECT id, name, geom,
       CASE WHEN ST_IsEmpty(geom) THEN 'empty'
            WHEN NOT ST_IsValid(geom) THEN 'invalid'
       END AS failure_reason
FROM raw_geojson
WHERE NOT ST_IsValid(geom) OR ST_IsEmpty(geom);

-- Drop quarantined rows from primary table
DELETE FROM raw_geojson WHERE id IN (SELECT id FROM geojson_quarantine);

Monitor failure_reason distributions to identify upstream serialization errors. Common patterns include self-intersecting polygons, duplicate vertices, or non-planar rings. Resolve at the source ETL layer when possible to avoid repeated compute costs during ingestion.