Batch vs Stream Geospatial Processing

For spatial workloads on AWS, GCP, and Azure, the right processing paradigm depends on two concrete numbers: how often data arrives and how stale a result your users can tolerate. Batch suits workloads where data arrives in bulk every few minutes to hours and results need to be accurate rather than instant. Streaming is mandatory when continuous feeds — GPS pings, vessel transponders, IoT environmental sensors — must produce geofence alerts or routing updates within seconds.
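
As a rough illustration of that two-number rule, the sketch below picks a paradigm from the arrival interval and the staleness tolerance. The function name and the 60-second cut-off are assumptions made for this example, not values taken from any platform.

```python
def choose_paradigm(arrival_interval_s: float,
                    max_staleness_s: float,
                    streaming_threshold_s: float = 60.0) -> str:
    """Return 'stream' or 'batch' for a workload.

    streaming_threshold_s is a hypothetical cut-off: streaming only pays off
    when data arrives continuously AND results must be fresher than this.
    """
    needs_fresh_results = max_staleness_s < streaming_threshold_s
    arrives_continuously = arrival_interval_s < streaming_threshold_s
    if needs_fresh_results and arrives_continuously:
        return "stream"
    # Either results can wait, or data arrives too rarely for streaming to help.
    return "batch"


gps_feed = choose_paradigm(arrival_interval_s=2, max_staleness_s=5)
hourly_scenes = choose_paradigm(arrival_interval_s=3600, max_staleness_s=86_400)
```

A GPS feed that pings every 2 seconds and must alert within 5 seconds lands on streaming; hourly scene drops with a one-day tolerance land on batch.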

This page covers the decision criteria, exact platform limits, step-by-step implementation patterns, failure modes, and cost arithmetic for both paradigms within the broader Event-Driven Geospatial Processing Patterns framework.

Why the Processing Paradigm Matters for Spatial Workloads

Spatial data has properties that make the batch/stream decision harder than it is for tabular data:

File sizes are orders of magnitude larger. A single Sentinel-2 scene is 800 MB–1.2 GB; a full AIS day-log for the North Sea can exceed 40 GB. Loading either into a serverless function’s memory takes time that erodes every latency budget.
Operations are CPU-bound and non-trivial to parallelise. Spatial joins, topology validation, and raster reprojection hold the Python GIL or saturate a single vCPU. Function memory allocation controls vCPU share on AWS Lambda and GCP Cloud Functions — under-provisioned functions stall.
Library initialisation is slow. GDAL driver registration, pyproj datum file loading, and shapely GEOS initialisation each add 300–900 ms of cold-start overhead before any spatial work begins. This cold-start cost is negligible in a batch job but catastrophic for a streaming function that must process a 50 ms event window.
CRS drift is silent. A coordinate system mismatch between a batch pipeline’s output and a streaming layer’s reference dataset produces geometrically wrong results with no runtime error.

The diagram below shows how data arrival patterns map to processing paradigms and then to platform compute choices:

Why This Constraint Matters for Geospatial Workloads

Batch: Memory and Throughput Are the Bottlenecks

Batch spatial jobs involve large contiguous operations: raster mosaicking, spatial joins across millions of features, topology validation, and compliance report generation. Without a spatial index, a join compares every feature pair, so its cost grows with the product of the two feature counts; raster operations grow linearly with band count. An unconstrained spatial join between a 5 M-feature road network and a 2 M-feature admin boundary dataset can require 12–20 GB of working memory before geopandas writes a single row.

Partitioning by spatial index (H3, S2, or quadtree cells) is the primary mitigation. Chunked I/O for large satellite imagery covers the pattern in detail for multi-band Sentinel-2 data, but the same tile-boundary logic applies to vector joins. Each chunk must be self-contained within its bounding box to prevent cross-shard topology errors.
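
A minimal sketch of grid partitioning, assuming a plain 2-degree lon/lat grid rather than H3 or S2 (both need third-party libraries). The helper names `cell_id` and `partition_points` are hypothetical.

```python
import math
from collections import defaultdict

CHUNK_DEGREES = 2.0  # assumed cell size in decimal degrees


def cell_id(lon: float, lat: float, size: float = CHUNK_DEGREES) -> str:
    """Return an ID built from the lower-left corner of the cell owning the point."""
    minx = math.floor(lon / size) * size
    miny = math.floor(lat / size) * size
    return f"{minx:.1f}_{miny:.1f}"


def partition_points(points, size: float = CHUNK_DEGREES) -> dict:
    """Group (lon, lat) pairs by grid cell so each chunk can be processed alone."""
    cells = defaultdict(list)
    for lon, lat in points:
        cells[cell_id(lon, lat, size)].append((lon, lat))
    return dict(cells)


chunks = partition_points([(4.9, 52.37), (5.1, 52.1), (-0.1, 51.5)])
```

Points have exactly one owner cell. Lines and polygons that cross a cell boundary need an explicit rule, for example assignment by a representative point or duplication into every cell they touch followed by de-duplication after the join.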

Stream: Cold-Start Latency and State Management Are the Bottlenecks

A streaming function that processes a continuous vessel transponder feed cannot afford 900 ms of GDAL driver registration on every invocation. Cold-start mapping for Python GDAL shows that shared-library resolution alone accounts for 400–700 ms of the initialisation budget. For streaming workloads, keeping functions warm via provisioned concurrency or a long-polling consumer model takes this overhead off the per-event path, because initialisation runs once per container rather than once per event.

State management is the second bottleneck. Geofence evaluation requires a reference polygon dataset to be resident in the function or in a low-latency cache. Loading a 500 MB admin-boundary GeoPackage from S3 on every invocation adds seconds. Loading the polygons once at function warm-up (from ElastiCache, Memorystore, or Azure Cache for Redis, falling back to object storage) and keeping them in module-level memory for subsequent events reduces per-event lookup latency to single-digit milliseconds.
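
The warm-cache idea can be sketched with the standard library alone. `WarmCache` is a hypothetical helper; the `loader` argument stands in for whatever fetches the polygons (Redis, then S3), and the injectable clock exists only to make the expiry logic easy to test.

```python
import time
from typing import Any, Callable, Optional


class WarmCache:
    """Hold one value in process memory and reload it after ttl_s seconds."""

    def __init__(self, loader: Callable[[], Any], ttl_s: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic):
        self._loader = loader
        self._ttl_s = ttl_s
        self._clock = clock
        self._value: Optional[Any] = None
        self._loaded_at: Optional[float] = None

    def get(self) -> Any:
        now = self._clock()
        expired = self._loaded_at is None or now - self._loaded_at >= self._ttl_s
        if expired:
            self._value = self._loader()  # slow path: remote fetch
            self._loaded_at = now
        return self._value                # fast path: plain memory read


# Demonstration with a fake clock and a loader that counts its calls.
load_calls = []
fake_now = [0.0]
cache = WarmCache(lambda: load_calls.append(1) or len(load_calls),
                  ttl_s=10, clock=lambda: fake_now[0])
first, second = cache.get(), cache.get()
fake_now[0] = 10.0
third = cache.get()
```

Instantiating the cache at module level means every event handled by a warm container takes the fast path; only the first event after a cold start or a TTL expiry pays for the remote fetch.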

Platform-by-Platform Limits

The decision between paradigms is also a decision about which platform constraints bind first:

Constraint	AWS Lambda	GCP Cloud Functions 2nd gen	Azure Functions (Consumption)
Max timeout	15 minutes	60 minutes (HTTP); 9 minutes (event-driven)	10 minutes
Max memory	10 GB	32 GB	1.5 GB
Ephemeral storage (`/tmp`)	10 GB (configurable)	Varies by instance class	~500 MB
Deployment package (unzipped)	250 MB	1 GB (unzipped source)	500 MB (zip)
Default concurrency limit	1,000 per region (soft limit)	3,000 per region	200 per App
Batch trigger	S3 event / EventBridge cron / SQS	Cloud Storage notification / Cloud Scheduler	Blob trigger / Timer trigger
Stream trigger	Kinesis / SQS FIFO	Pub/Sub push	Event Hubs / Service Bus
Billing unit	1 ms	100 ms	100 ms

Impact on geospatial workloads: Azure’s 1.5 GB memory ceiling rules it out for in-memory spatial joins on large datasets unless you pre-filter aggressively. GCP’s 60-minute timeout accommodates heavy raster batch jobs that would time out on AWS Lambda, but it applies to HTTP-triggered functions only; event-driven 2nd gen functions are capped at 9 minutes, so long jobs need an HTTP entry point such as a Cloud Scheduler call. AWS Lambda’s configurable 10 GB /tmp is critical for intermediate GeoTIFF extraction; managing /tmp storage limits for GeoTIFF extraction explains the configuration steps and pitfalls.
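
One way to use the table is as a pre-deployment check. The sketch below encodes only the timeout and memory rows; the numbers are copied from the table and should be verified against current provider documentation, and the 80% headroom factor and all names are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    timeout_s: int
    memory_gb: float


# Copied from the table above; the GCP timeout is the HTTP-triggered figure.
PLATFORM_LIMITS = {
    "aws_lambda": Limits(timeout_s=15 * 60, memory_gb=10.0),
    "gcp_functions_gen2": Limits(timeout_s=60 * 60, memory_gb=32.0),
    "azure_consumption": Limits(timeout_s=10 * 60, memory_gb=1.5),
}


def platforms_that_fit(duration_s: float, memory_gb: float, headroom: float = 0.8):
    """List platforms where the job stays under headroom x each limit."""
    return [
        name
        for name, lim in PLATFORM_LIMITS.items()
        if duration_s <= lim.timeout_s * headroom
        and memory_gb <= lim.memory_gb * headroom
    ]


join_job = platforms_that_fit(duration_s=90, memory_gb=4)
raster_job = platforms_that_fit(duration_s=1800, memory_gb=4)
```

A 90-second, 4 GB spatial join fits AWS and GCP but not the Azure Consumption plan; a 30-minute raster job leaves only GCP.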

Step-by-Step Implementation

Batch Pipeline: Spatial Join with Chunked Dispatch

python

import os
import json
import boto3
import pyarrow as pa
import pyarrow.parquet as pq
import geopandas as gpd
from shapely.geometry import box

# Explicit environment — never rely on ambient paths
os.environ.setdefault("GDAL_DATA", "/opt/share/gdal")
os.environ.setdefault("PROJ_LIB", "/opt/share/proj")
os.environ.setdefault("LD_LIBRARY_PATH", "/opt/lib")

S3_BUCKET = os.environ["GEO_BUCKET"]
INPUT_PREFIX = os.environ["INPUT_PREFIX"]   # e.g. "raw/admin-boundaries/"
OUTPUT_PREFIX = os.environ["OUTPUT_PREFIX"] # e.g. "processed/joined/"
CHUNK_DEGREES = 2.0  # Spatial partition size in decimal degrees

s3 = boto3.client("s3")


def list_spatial_chunks(bbox_global=(-180, -90, 180, 90)):
    """Generate 2-degree grid cells covering the bounding box."""
    minx, miny, maxx, maxy = bbox_global
    x = minx
    while x < maxx:
        y = miny
        while y < maxy:
            yield (x, y, x + CHUNK_DEGREES, y + CHUNK_DEGREES)
            y += CHUNK_DEGREES
        x += CHUNK_DEGREES


def process_chunk(chunk_bbox: tuple, reference_key: str) -> dict:
    """
    Load only the features intersecting chunk_bbox from S3,
    run a spatial join against a reference dataset,
    and write GeoParquet output.
    """
    minx, miny, maxx, maxy = chunk_bbox
    bbox_geom = box(minx, miny, maxx, maxy)
    chunk_id = f"{minx:.1f}_{miny:.1f}"

    # Idempotency check: skip if output already exists
    output_key = f"{OUTPUT_PREFIX}chunk={chunk_id}/part.parquet"
    try:
        s3.head_object(Bucket=S3_BUCKET, Key=output_key)
        return {"chunk_id": chunk_id, "status": "skipped"}
    except s3.exceptions.ClientError as exc:
        # Only a 404 means "not written yet"; re-raise 403s, throttling, etc.
        if exc.response["Error"]["Code"] not in ("404", "NoSuchKey", "NotFound"):
            raise

    # Stream features within bounding box only
    features_url = f"s3://{S3_BUCKET}/{INPUT_PREFIX}features.gpkg"
    features_gdf = gpd.read_file(
        features_url,
        bbox=(minx, miny, maxx, maxy),
        engine="pyogrio",
    )
    if features_gdf.empty:
        return {"chunk_id": chunk_id, "status": "empty"}

    # Normalize CRS at ingestion boundary
    features_gdf = features_gdf.to_crs("EPSG:4326")

    # Load reference dataset (small; fits in memory)
    ref_url = f"s3://{S3_BUCKET}/{reference_key}"
    ref_gdf = gpd.read_file(ref_url, bbox=(minx, miny, maxx, maxy), engine="pyogrio")
    ref_gdf = ref_gdf.to_crs("EPSG:4326")

    joined = gpd.sjoin(features_gdf, ref_gdf, how="left", predicate="intersects")

    # Write GeoParquet: GeoDataFrame.to_parquet keeps the geometry column and
    # adds the "geo" metadata. It forwards the target to
    # pyarrow.parquet.write_table, which accepts an in-memory output stream.
    buf = pa.BufferOutputStream()
    joined.to_parquet(buf, compression="zstd")

    s3.put_object(
        Bucket=S3_BUCKET,
        Key=output_key,
        Body=buf.getvalue().to_pybytes(),
        ContentType="application/octet-stream",
    )
    del features_gdf, ref_gdf, joined  # Drop references so memory can be freed before the next chunk
    return {"chunk_id": chunk_id, "status": "written", "key": output_key}


def lambda_handler(event, context):
    reference_key = event["reference_key"]
    results = [process_chunk(bbox, reference_key) for bbox in list_spatial_chunks()]
    return {"processed": len(results), "results": results}
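
The handler above walks every grid cell inside one invocation, which only fits the timeout for small extents. A fan-out variant can be sketched as follows, assuming an SQS queue feeds worker invocations. The message layout and function names are hypothetical, and `send_batch` is an injected callable so that the batching logic runs without AWS access.

```python
import json
from itertools import islice


def chunk_messages(bboxes, reference_key):
    """Yield one JSON message body per chunk bounding box."""
    for minx, miny, maxx, maxy in bboxes:
        yield json.dumps({"bbox": [minx, miny, maxx, maxy],
                          "reference_key": reference_key})


def batched_entries(bodies, batch_size=10):
    """Group bodies into lists shaped like SQS SendMessageBatch entries (max 10)."""
    it = iter(bodies)
    while True:
        group = list(islice(it, batch_size))
        if not group:
            return
        yield [{"Id": str(i), "MessageBody": body} for i, body in enumerate(group)]


def dispatch(bboxes, reference_key, send_batch):
    """Call send_batch(entries) once per group; return the number of messages sent."""
    sent = 0
    for entries in batched_entries(chunk_messages(bboxes, reference_key)):
        send_batch(entries)
        sent += len(entries)
    return sent


# Dry run: collect the batches in a list instead of sending them.
demo_batches = []
demo_bboxes = [(float(x), 0.0, x + 2.0, 2.0) for x in range(0, 50, 2)]
demo_sent = dispatch(demo_bboxes, "ref.gpkg", demo_batches.append)
```

In a deployment, `send_batch` would wrap `sqs.send_message_batch(QueueUrl=..., Entries=entries)`, and each worker would parse one message and call `process_chunk(tuple(msg["bbox"]), msg["reference_key"])`.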

Validation command — verify output chunk counts match expected grid coverage:

python

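import os
import boto3

S3_BUCKET = os.environ["GEO_BUCKET"]
OUTPUT_PREFIX = os.environ["OUTPUT_PREFIX"]

# Paginate: a single list_objects_v2 call returns at most 1,000 keys
paginator = boto3.client("s3").get_paginator("list_objects_v2")
pages = paginator.paginate(Bucket=S3_BUCKET, Prefix=OUTPUT_PREFIX)
written = [o["Key"] for p in pages for o in p.get("Contents", []) if o["Key"].endswith(".parquet")]
# Upper bound: a global 2-degree grid has 180 x 90 = 16,200 cells; empty cells write nothing
print(f"Output chunks: {len(written)}")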

Stream Pipeline: Geofence Evaluation on Vessel Transponder Events

python

import os
import json
import base64
import redis
import pickle
import hashlib
import boto3
import geopandas as gpd
from shapely.geometry import Point

os.environ.setdefault("GDAL_DATA", "/opt/share/gdal")
os.environ.setdefault("PROJ_LIB", "/opt/share/proj")
os.environ.setdefault("LD_LIBRARY_PATH", "/opt/lib")

REDIS_HOST = os.environ["REDIS_HOST"]
REDIS_PORT = int(os.environ.get("REDIS_PORT", "6379"))
GEOFENCE_S3_KEY = os.environ["GEOFENCE_S3_KEY"]
GEO_BUCKET = os.environ["GEO_BUCKET"]
ALERT_TOPIC_ARN = os.environ["ALERT_TOPIC_ARN"]

_redis = None
_geofences = None  # Module-level warm cache


def get_redis():
    global _redis
    if _redis is None:
        _redis = redis.Redis(host=REDIS_HOST, port=REDIS_PORT, decode_responses=False)
    return _redis


def load_geofences():
    """Load geofence polygons once per warm container; cache the pickled GeoDataFrame in Redis."""
    global _geofences
    if _geofences is not None:
        return _geofences

    r = get_redis()
    cached = r.get("geofences:v1")
    if cached:
        # pickle.loads is only safe because this function is the sole writer of the key
        _geofences = pickle.loads(cached)
        return _geofences

    s3 = boto3.client("s3")
    obj = s3.get_object(Bucket=GEO_BUCKET, Key=GEOFENCE_S3_KEY)
    gdf = gpd.read_file(obj["Body"], engine="pyogrio").to_crs("EPSG:4326")
    _geofences = gdf[["zone_id", "zone_name", "geometry"]].copy()
    r.set("geofences:v1", pickle.dumps(_geofences), ex=3600)  # 1-hour TTL
    return _geofences


def dedup_key(event_id: str) -> str:
    return f"processed:{hashlib.sha256(event_id.encode()).hexdigest()[:16]}"


def process_record(record: dict) -> dict | None:
    """
    Evaluate a single AIS position report against loaded geofences.
    Returns an alert dict if the vessel position falls inside a monitored zone.
    """
    r = get_redis()
    event_id = record["message_id"]

    # Idempotent deduplication: the broker delivers at least once
    if r.set(dedup_key(event_id), "1", nx=True, ex=300) is None:
        return None  # Already processed

    lon = float(record["longitude"])
    lat = float(record["latitude"])
    pt = Point(lon, lat)

    geofences = load_geofences()
    hits = geofences[geofences.geometry.contains(pt)]
    if hits.empty:
        return None

    return {
        "vessel_mmsi": record["mmsi"],
        "zones": hits["zone_id"].tolist(),
        "lon": lon,
        "lat": lat,
        "event_id": event_id,
    }


def lambda_handler(event, context):
    sns = boto3.client("sns")
    alerts = []

    for kinesis_record in event["Records"]:
        # Lambda delivers Kinesis record data base64-encoded
        payload = json.loads(base64.b64decode(kinesis_record["kinesis"]["data"]))
        alert = process_record(payload)
        if alert:
            sns.publish(
                TopicArn=ALERT_TOPIC_ARN,
                Message=json.dumps(alert),
                Subject=f"Vessel {alert['vessel_mmsi']} entered zone",
            )
            alerts.append(alert)

    return {"alert_count": len(alerts)}

Measurement and Verification

Batch Pipeline Metrics

Monitor these CloudWatch / Cloud Monitoring metric names after a batch run:

Metric	Namespace / Signal	Healthy range
`Duration`	`AWS/Lambda`	Below 80% of timeout (< 12 min for 15-min timeout)
`memory_utilization`	`LambdaInsights` (requires Lambda Insights)	Below 85% of allocated memory
`Errors`	`AWS/Lambda`	0 for idempotent re-runs
`ConcurrentExecutions`	`AWS/Lambda`	Below reserved concurrency limit
`S3 PutObject latency`	`AWS/S3`	< 200 ms p99 for < 50 MB writes

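Benchmark script: count the chunks written and time the listing call. Divide the count by the batch run's wall-clock time to get chunk throughput:

python

import os
import time
import boto3

S3_BUCKET = os.environ["GEO_BUCKET"]
OUTPUT_PREFIX = os.environ["OUTPUT_PREFIX"]

s3 = boto3.client("s3")
start = time.monotonic()

# Paginate: a single list_objects_v2 call returns at most 1,000 keys
pages = s3.get_paginator("list_objects_v2").paginate(Bucket=S3_BUCKET, Prefix=OUTPUT_PREFIX)
chunk_count = sum(
    1 for page in pages for o in page.get("Contents", []) if o["Key"].endswith(".parquet")
)

elapsed = time.monotonic() - start
print(f"Chunks written: {chunk_count}, listing latency: {elapsed:.3f}s")
# Expected: > 500 chunks/invocation for a 2-degree global grid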

Stream Pipeline Metrics

Metric	AWS signal	Healthy range
Iterator age	`AWS/Kinesis: GetRecords.IteratorAgeMilliseconds`	< 5,000 ms p99
Processing duration	`AWS/Lambda: Duration`	< 500 ms p99
Throttles	`AWS/Lambda: Throttles`	0
Dedup key count	Custom metric: number of `processed:*` keys, sampled with `SCAN` (`KEYS` blocks Redis on large keyspaces)	Stable; not growing unboundedly

Failure Modes and Debugging

1. OOM Kill During Spatial Join (`Runtime exited with error: signal: killed`)

Cause: The spatial join loaded both datasets fully before pre-filtering, so feature count × geometry complexity exceeded allocated memory.

Fix: Apply bounding-box pre-filter using bbox= argument in gpd.read_file() before the join. If the reference dataset is large, partition it by the same spatial index as the feature dataset.

2. Iterator Age Grows Continuously (Stream Lag Accumulates)

Cause: Records arrive on each Kinesis shard faster than the Lambda function can consume them. Usually caused by cold-start overhead or geofence cache misses on every invocation.

Fix: Enable provisioned concurrency to eliminate cold starts. Pre-warm the Redis geofence cache via a scheduled ping function. See reducing Python GDAL cold starts with provisioned concurrency for the configuration pattern.
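
Whether a shard can keep up is simple arithmetic, sketched below. The function names are hypothetical; `parallelization_factor` mirrors the Kinesis event source mapping setting of the same name (1 to 10 concurrent batches per shard, default 1).

```python
def shard_capacity(batch_size: int, batch_duration_s: float,
                   parallelization_factor: int = 1) -> float:
    """Records per second that one shard's consumers can sustain."""
    return parallelization_factor * batch_size / batch_duration_s


def lag_growth_per_s(arrival_per_s: float, capacity_per_s: float) -> float:
    """Unprocessed records accumulating each second (0 when the consumer keeps up)."""
    return max(0.0, arrival_per_s - capacity_per_s)


capacity = shard_capacity(batch_size=100, batch_duration_s=0.5)
backlog_rate = lag_growth_per_s(arrival_per_s=250, capacity_per_s=capacity)
```

With batches of 100 records taking 0.5 s each, one shard sustains 200 records per second, so a 250 records-per-second feed falls behind by 50 records every second until the batch duration drops, the parallelization factor rises, or the stream is resharded.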

3. Duplicate Spatial Aggregates in Batch Output (`DuplicateKeyError` or inflated counts)

Cause: S3 event notification fired twice for the same upload (at-least-once delivery). Batch function wrote a partial output then retried without checking for an existing artifact.

Fix: Add the head_object idempotency check shown in the batch code above. For database sinks, use upsert semantics (INSERT ... ON CONFLICT DO UPDATE) keyed on the spatial chunk ID.
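
The upsert approach can be demonstrated with SQLite from the standard library, used here purely as a stand-in for whatever database the pipeline writes to (the `ON CONFLICT` clause needs SQLite 3.24 or newer; PostgreSQL accepts the same form). The table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE chunk_aggregates (chunk_id TEXT PRIMARY KEY, feature_count INTEGER)"
)


def upsert_chunk(conn, chunk_id: str, feature_count: int) -> None:
    """Insert the aggregate, or overwrite it if this chunk was already written."""
    conn.execute(
        """
        INSERT INTO chunk_aggregates (chunk_id, feature_count)
        VALUES (?, ?)
        ON CONFLICT (chunk_id) DO UPDATE SET feature_count = excluded.feature_count
        """,
        (chunk_id, feature_count),
    )


upsert_chunk(conn, "4.0_52.0", 120)
upsert_chunk(conn, "4.0_52.0", 120)  # duplicate delivery of the same chunk
rows = conn.execute("SELECT chunk_id, feature_count FROM chunk_aggregates").fetchall()
```

Because the write is keyed on the chunk ID, a duplicate delivery overwrites the row instead of adding a second one, so counts stay correct however many times the event fires.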

4. CRS Mismatch Between Batch and Stream Outputs (`UserWarning: CRS mismatch between the CRS of left geometries and the CRS of right geometries`)

Cause: Batch pipeline normalised to EPSG:32632 (UTM) for accurate distance computation; streaming function left data in EPSG:4326. geopandas only warns about the mismatch, so the downstream spatial join still runs and returns empty or wrong matches.

Fix: Enforce a canonical CRS contract at the ingestion boundary of both pipelines. Document the target CRS in an environment variable (TARGET_CRS=EPSG:4326) and validate it in CI with a pyproj round-trip test against known control points.
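
A cheap runtime guard can complement the CI round-trip test. The sketch below uses only the standard library and assumes the contract is EPSG:4326; `check_crs_contract` is a hypothetical helper, and the range check catches projected coordinates (such as UTM metres) leaking into a lon/lat stream, not subtler datum shifts.

```python
import os

TARGET_CRS = os.environ.get("TARGET_CRS", "EPSG:4326")


def check_crs_contract(declared_crs: str, coords, target_crs: str = TARGET_CRS) -> None:
    """Raise ValueError if the declared CRS or coordinate ranges break the contract."""
    if declared_crs.upper() != target_crs.upper():
        raise ValueError(f"expected {target_crs}, got {declared_crs}")
    if target_crs.upper() == "EPSG:4326":
        for x, y in coords:
            if not (-180.0 <= x <= 180.0 and -90.0 <= y <= 90.0):
                raise ValueError(f"({x}, {y}) is outside lon/lat bounds; projected data?")


def violates(declared_crs, coords) -> bool:
    """Small wrapper for demonstration: True when the contract check fails."""
    try:
        check_crs_contract(declared_crs, coords, target_crs="EPSG:4326")
    except ValueError:
        return True
    return False
```

Calling the check at the ingestion boundary of both pipelines turns a silent mismatch into an immediate, attributable error.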

5. Dead-Letter Queue Backlog Grows for Malformed Payloads (`JSONDecodeError` / `KeyError: 'longitude'`)

Cause: Upstream telemetry device firmware bug produces malformed AIS sentences; Lambda raises an unhandled exception and the message lands in the dead-letter queue.

Fix: Wrap field access in explicit validation and emit structured error logs with the raw payload encoded as a base64 string. Route validation failures to a DLQ-specific remediation Lambda rather than blocking the main shard.
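
A minimal validation sketch, assuming the field names used by the stream example (`message_id`, `mmsi`, `longitude`, `latitude`). The function name, error labels, and error-dict layout are hypothetical.

```python
import base64
import json

REQUIRED_FIELDS = ("message_id", "mmsi", "longitude", "latitude")


def validate_position(raw: bytes):
    """Return (record, None) if valid, else (None, error_dict) for structured logging."""

    def fail(reason: str):
        # Keep the raw payload recoverable without breaking the log format.
        return None, {"error": reason, "raw_b64": base64.b64encode(raw).decode("ascii")}

    try:
        record = json.loads(raw)
    except (json.JSONDecodeError, UnicodeDecodeError):
        return fail("invalid_json")
    if not isinstance(record, dict):
        return fail("not_an_object")
    missing = [f for f in REQUIRED_FIELDS if f not in record]
    if missing:
        return fail("missing:" + ",".join(missing))
    try:
        lon, lat = float(record["longitude"]), float(record["latitude"])
    except (TypeError, ValueError):
        return fail("non_numeric_position")
    if not (-180.0 <= lon <= 180.0 and -90.0 <= lat <= 90.0):
        return fail("position_out_of_range")
    return record, None


good, good_err = validate_position(
    b'{"message_id": "a1", "mmsi": 244660000, "longitude": 4.9, "latitude": 52.37}'
)
bad, bad_err = validate_position(b"!AIVDM,garbled")
```

The handler can then log `bad_err` as JSON and forward the raw bytes to the remediation path while continuing with the rest of the batch.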

Cost and Scaling Considerations

Batch: Cost Is Dominated by Compute Duration × Memory

AWS Lambda bills duration in 1 ms increments, multiplied by memory allocation. A 10 GB function costs roughly 10× more per millisecond than a 1 GB function. For CPU-bound spatial operations, memory allocation also controls the vCPU share: a 10 GB function gets access to up to six vCPUs, which accelerates multi-threaded GDAL operations. The right memory setting is the one that minimises duration × memory while staying within the timeout.

Back-of-envelope for a 10,000-chunk global batch job:

Memory: 4 GB per function ($0.0000166667 per GB-second on x86)
Duration: 90 s average per chunk
Cost per chunk: 4 GB × 90 s × $0.0000166667 ≈ $0.0060
Total: 10,000 × $0.0060 ≈ $60 (plus about $0.05 in S3 PUT requests and $0.002 in Lambda request charges)

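Compare this against EC2 r6g.xlarge ($0.2016/hr, 4 vCPUs, 32 GiB): the same 250 function-hours of work (10,000 × 90 s) would take about 62.5 instance-hours, or roughly $12.60, if each instance sustained four chunks in parallel at the same per-chunk duration, which is an optimistic assumption. EC2 is cheaper per unit of compute, but it brings capacity management overhead and cannot fan out instantly.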
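
The batch arithmetic can be reproduced with a few lines. The rates below are the published x86 on-demand Lambda prices at the time of writing and are assumptions to re-check against current pricing; the function name is hypothetical.

```python
LAMBDA_USD_PER_GB_SECOND = 0.0000166667  # assumed x86 on-demand rate
LAMBDA_USD_PER_REQUEST = 0.20 / 1_000_000  # assumed request price


def lambda_batch_cost(chunks: int, memory_gb: float, seconds_per_chunk: float) -> float:
    """Compute-plus-request cost in USD, assuming one invocation per chunk."""
    compute = chunks * memory_gb * seconds_per_chunk * LAMBDA_USD_PER_GB_SECOND
    requests = chunks * LAMBDA_USD_PER_REQUEST
    return compute + requests


cost_4gb = lambda_batch_cost(chunks=10_000, memory_gb=4, seconds_per_chunk=90)
print(f"10,000 chunks at 4 GB x 90 s: ${cost_4gb:.2f}")
```

With these rates the 10,000-chunk job comes to roughly $60 of Lambda charges. Re-running the function with measured durations at different memory sizes is a quick way to find the setting that minimises duration × memory.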

Stream: Cost Is Dominated by Invocation Count × Shard Hours

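Kinesis charges $0.015 per shard-hour, so a 10-shard stream running continuously costs $3.60/day in shard fees alone (10 × 24 × $0.015). Lambda cost depends heavily on batching: at 1,000 events/second, 50 ms per invocation, and 256 MB allocated, delivering about 100 records per invocation works out to roughly 864,000 invocations and $0.35/day, whereas one invocation per event would cost around $35/day. With batching, stream processing costs about $3.95/day vs a hypothetical batch run every 5 minutes costing $0.50/day, roughly an 8× premium for real-time delivery.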
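
A small calculator makes the effect of batching on streaming cost explicit. The default rates are assumptions based on published on-demand prices (shard-hour, Lambda GB-second, Lambda request) and exclude Kinesis PUT payload charges; the function name is hypothetical.

```python
SECONDS_PER_DAY = 86_400


def stream_cost_per_day(shards: int, events_per_s: float, records_per_invocation: int,
                        invocation_s: float, memory_gb: float,
                        shard_hour_usd: float = 0.015,
                        gb_second_usd: float = 0.0000166667,
                        request_usd: float = 0.20 / 1_000_000) -> dict:
    """Daily shard and Lambda cost in USD for a continuously running stream."""
    invocations = events_per_s * SECONDS_PER_DAY / records_per_invocation
    shard_fees = shards * 24 * shard_hour_usd
    lambda_fees = invocations * (invocation_s * memory_gb * gb_second_usd + request_usd)
    return {"shards": shard_fees, "lambda": lambda_fees, "total": shard_fees + lambda_fees}


batched = stream_cost_per_day(10, 1_000, records_per_invocation=100,
                              invocation_s=0.05, memory_gb=0.25)
per_event = stream_cost_per_day(10, 1_000, records_per_invocation=1,
                                invocation_s=0.05, memory_gb=0.25)
```

Under these assumptions the Lambda share is about $0.35/day with 100-record batches and about $35/day with one invocation per event, so the batch size on the event source mapping is the main cost lever after shard count.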

When streaming cost is justified: real-time geofence alerting, dynamic route re-optimisation, and live vessel traffic dashboards where a 5-minute data lag causes measurable business impact. For anything where 5-minute staleness is acceptable, run batch on a cron and save the premium.

Hybrid Architecture

The most cost-efficient pattern for most geospatial platforms is a hybrid: a streaming layer at minimal shard count handles only geofence alerts and anomaly detection, while a batch pipeline handles all heavy computation (spatial joins, mosaic generation, compliance exports). Both layers write to the same GeoParquet data lake under different prefixes, so downstream tools can query either freshness tier, and S3 and GCS event triggers can gate the next pipeline stage on new object arrival.

Frequently Asked Questions

Can I mix batch and streaming in the same spatial pipeline?

Yes. A common pattern runs a streaming layer for real-time alerts and geofence triggers while a nightly batch job reconciles historical data, fills gaps, and regenerates authoritative spatial aggregates. The two layers share the same object storage sink but write to different prefixes.

What spatial operations are unsafe in a streaming context?

Topology validation, spatial joins across large reference datasets, and raster mosaicking all require loading substantial data into memory and are poor fits for per-event streaming functions. Move these to batch steps triggered after windowed aggregation.

How do I handle CRS drift between batch and stream outputs?

Normalise to a canonical CRS at the ingestion boundary of both pipelines — EPSG:4326 for interchange, a UTM projection for distance-accurate computation. Validate pyproj transformer chains against known control points in CI to catch regressions before they reach production.

When to Use Batch vs Streaming for Real-Time AIS Tracking — latency thresholds and partition strategies for vessel telemetry
SQS and Pub/Sub Queue Routing Strategies — fan-out patterns and backpressure handling for spatial event queues
Implementing Dead-Letter Queues for Failed Vector Jobs — remediation patterns for malformed spatial payloads
Chunked I/O for Large Satellite Imagery — tile-boundary partitioning for multi-band raster batch jobs
Cold-Start Mapping for Python GDAL — initialisation sequence and warm-up strategies that apply equally to streaming functions
