SQS and Pub/Sub Queue Routing Strategies for Geospatial Pipelines

Message routing determines throughput, cost, and fault tolerance in serverless geospatial pipelines. The single most impactful decision is whether routing logic lives in the broker (GCP Pub/Sub subscription filters, Azure Service Bus topic rules) or in your own code (AWS SQS with a Lambda dispatcher). Spatial payloads — GeoTIFF metadata envelopes, Shapefile upload events, vector feature batches — carry attributes like CRS, format, bounding box, and priority that make content-based routing far more effective than round-robin or random distribution.

This page covers the three core routing architectures (content-based, priority-tiered, and geo-partitioned), exact platform quotas for all three clouds, step-by-step Python implementations, and the failure modes that most commonly break spatial routing in production.

Why Routing Architecture Matters for Geospatial Workloads

Spatial data is heterogeneous in a way that most queue documentation ignores. A CRS reprojection job for a single GeoJSON feature consumes ~50 MB of memory and completes in milliseconds. A raster mosaic job for a 12-band Sentinel-2 tile set can demand 8–10 GB of memory and run for several minutes — right up against AWS Lambda’s 15-minute timeout and 10 GB memory ceiling. Routing both jobs to the same consumer pool means either the memory ceiling is set for the largest job (wasting cost on small jobs) or the smallest job’s timeout evicts the large job mid-run.

Misrouted messages also interact badly with ephemeral storage limits in AWS Lambda: a raster worker that exhausts /tmp mid-tile will nack the message repeatedly until it lands in a Dead Letter Queue, silently stalling downstream catalog updates.

The routing layer is also where you intercept events that arrive from S3 and GCS upload triggers for Shapefiles. The trigger fires per object; the router inspects object metadata and decides whether to dispatch to a vector topology queue, a raster ingestion queue, or a metadata-only indexer — before any compute-intensive GDAL driver initialisation runs.
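As a rough illustration of that pre-GDAL triage, the sketch below maps an uploaded object's key to a processing_mode using only its file extension. The extension sets and the function name classify_upload are assumptions made for this example, not part of any cloud SDK; a production router would probably also consult object metadata such as content type or size.

```python
from pathlib import PurePosixPath

# Hypothetical extension sets; adjust to the formats your pipeline accepts.
RASTER_EXTS = {".tif", ".tiff", ".jp2", ".vrt"}
VECTOR_EXTS = {".shp", ".geojson", ".gpkg", ".kml"}


def classify_upload(object_key: str) -> str:
    """Return 'raster', 'vector', or 'metadata' for an object key."""
    ext = PurePosixPath(object_key).suffix.lower()
    if ext in RASTER_EXTS:
        return "raster"
    if ext in VECTOR_EXTS:
        return "vector"
    # Sidecar files (.prj, .dbf, .xml) and unknown types go to the indexer.
    return "metadata"
```

Because the check touches only the key string, it can run in a very small function before any heavier worker is involved.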

Platform-by-Platform Limits

Every routing decision has hard platform constraints that directly affect queue throughput, message size, and retry behaviour for spatial workloads.

Constraint	AWS SQS	GCP Pub/Sub	Azure Service Bus
Max message size	256 KB	10 MB per message	256 KB (Standard) / 100 MB (Premium)
Broker-side filtering	None on standard queues; EventBridge rules required	Native subscription filters (`attributes.key = "value"`)	Native topic subscription rules (SQL filter or correlation filter)
Throughput ceiling	Unlimited (standard); 3,000 msg/s (FIFO with batching)	10,000 msg/s per topic (quota-adjustable)	1,000–5,000 msg/s depending on tier
Max message retention	14 days	7 days (extendable to 31 days)	14 days (Premium)
Visibility timeout range	0 s – 12 hours	N/A (ack deadline: 10 s – 600 s)	Lock duration: 0 – 5 minutes
Dead Letter Queue	Separate SQS queue; `maxReceiveCount` 1–1000	Dead-letter topic per subscription	Built-in dead-letter subqueue
Message ordering	FIFO queues only (strict); standard = best-effort	Ordered delivery with ordering keys	Sessions (Premium)
Attribute/label support	Up to 10 message attributes	Up to 100 key-value string attributes	Up to 64 KB of custom properties
Pricing model	Per request (64 KB chunks)	Per GB data volume	Per operation

Geospatial impact: GeoTIFF metadata envelopes typically stay well under 256 KB, but chunked I/O for large satellite imagery can generate per-chunk messages that each carry tile coordinates, band indices, and output URIs. On SQS, keep these attribute payloads under 256 KB by storing the full spatial extent as a URI reference rather than inline WKT geometry.
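A producer can check the size before sending rather than discovering the limit from a rejected request. The following is a minimal sketch that approximates message size as the UTF-8 body plus attribute names and values; the helper names are hypothetical, and the real SQS accounting also counts attribute data types, so treat the result as a lower bound.

```python
import json

SQS_MAX_BYTES = 256 * 1024  # per-message limit quoted in the table above


def envelope_size_bytes(body: dict, attributes: dict[str, str]) -> int:
    """Approximate size: UTF-8 JSON body plus attribute names and values."""
    size = len(json.dumps(body).encode("utf-8"))
    for name, value in attributes.items():
        size += len(name.encode("utf-8")) + len(value.encode("utf-8"))
    return size


def fits_in_sqs(body: dict, attributes: dict[str, str]) -> bool:
    """Return True when the approximate size is within the SQS limit."""
    return envelope_size_bytes(body, attributes) <= SQS_MAX_BYTES
```

If fits_in_sqs returns False, the usual remedy is the one described above: write the geometry to object storage and send only its URI.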

Step-by-Step Implementation

Step 1 — Standardise the Message Envelope

Every spatial message must carry routing attributes at the top level of the message metadata (not buried inside the JSON body). The broker reads attributes; it does not parse JSON bodies for filtering.

Mandatory envelope fields:

source_uri — object storage URI (e.g. s3://bucket/path/scene.tif)
crs — EPSG code string (e.g. EPSG:32632)
processing_mode — one of vector, raster, or metadata
priority — one of high, normal, or low
job_id — UUID for idempotency tracking
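One way to keep these fields consistent across producers is a small envelope type that validates the enumerated values and exposes the routing attributes separately from the body. This is a standard-library sketch only; the class name SpatialEnvelope and its methods are assumptions for illustration.

```python
import uuid
from dataclasses import asdict, dataclass, field

VALID_MODES = {"vector", "raster", "metadata"}
VALID_PRIORITIES = {"high", "normal", "low"}


@dataclass(frozen=True)
class SpatialEnvelope:
    source_uri: str
    crs: str
    processing_mode: str
    priority: str = "normal"
    # A fresh UUID per job supports idempotency tracking downstream.
    job_id: str = field(default_factory=lambda: str(uuid.uuid4()))

    def __post_init__(self) -> None:
        if self.processing_mode not in VALID_MODES:
            raise ValueError(f"unknown processing_mode: {self.processing_mode!r}")
        if self.priority not in VALID_PRIORITIES:
            raise ValueError(f"unknown priority: {self.priority!r}")

    def routing_attributes(self) -> dict[str, str]:
        """String-only attributes that a broker can filter on."""
        return {
            "crs": self.crs,
            "processing_mode": self.processing_mode,
            "priority": self.priority,
            "job_id": self.job_id,
        }

    def body(self) -> dict:
        """Full payload for the message body."""
        return asdict(self)
```

Rejecting an unknown mode at construction time means a miscased value such as "Raster" fails at the producer instead of being routed to a fallback queue.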

Step 2 — AWS SQS: Lambda-Based Dispatcher

AWS SQS does not filter at the broker; the router runs in a Lambda that receives the raw S3 event (or EventBridge event) and selects a destination queue based on envelope attributes.

python

import boto3
import json
import logging
import os
from botocore.exceptions import ClientError

logger = logging.getLogger(__name__)
sqs = boto3.client("sqs", region_name=os.environ["AWS_REGION"])

# Queue URLs injected via environment — never hard-coded ARNs
QUEUE_MAP = {
    "vector":   os.environ["VECTOR_QUEUE_URL"],
    "raster":   os.environ["RASTER_QUEUE_URL"],
    "metadata": os.environ["METADATA_QUEUE_URL"],
}
# Priority-tiered queues for raster (high-priority gets shorter visibility timeout)
PRIORITY_RASTER_QUEUES = {
    "high":   os.environ["RASTER_HIGH_QUEUE_URL"],
    "normal": os.environ["RASTER_QUEUE_URL"],
    "low":    os.environ["RASTER_LOW_QUEUE_URL"],
}

def select_queue(payload: dict) -> str:
    mode = payload.get("processing_mode", "metadata").lower()
    if mode == "raster":
        priority = payload.get("priority", "normal").lower()
        return PRIORITY_RASTER_QUEUES.get(priority, PRIORITY_RASTER_QUEUES["normal"])
    return QUEUE_MAP.get(mode, QUEUE_MAP["metadata"])


def route_spatial_message(payload: dict) -> dict:
    """
    Dispatch one spatial job message to the correct SQS queue.
    Returns a dict with the status, SQS MessageId, and queue URL; re-raises ClientError.
    """
    queue_url = select_queue(payload)

    # Attributes travel alongside the body so consumers can read routing
    # fields without parsing JSON. SQS allows at most 10 per message.
    message_attrs = {
        "crs":             {"DataType": "String", "StringValue": payload.get("crs", "EPSG:4326")},
        "processing_mode": {"DataType": "String", "StringValue": payload.get("processing_mode", "metadata")},
        "priority":        {"DataType": "String", "StringValue": payload.get("priority", "normal")},
        "job_id":          {"DataType": "String", "StringValue": payload["job_id"]},
    }

    try:
        response = sqs.send_message(
            QueueUrl=queue_url,
            MessageBody=json.dumps(payload),
            MessageAttributes=message_attrs,
            # DelaySeconds=0 is the default; increase for scheduled batch jobs
            DelaySeconds=0,
        )
        logger.info(
            "routed message",
            extra={
                "job_id":    payload["job_id"],
                "queue":     queue_url,
                "message_id": response["MessageId"],
            },
        )
        return {"status": "ok", "message_id": response["MessageId"], "queue": queue_url}
    except ClientError as exc:
        logger.error("SQS send failed for job %s: %s", payload.get("job_id"), exc)
        raise


def lambda_handler(event: dict, context) -> dict:
    results = []
    for record in event.get("Records", []):
        # SQS-style records carry a JSON string in "body"; EventBridge-style
        # records carry an already-parsed dict in "detail".
        raw = record.get("body", record.get("detail", "{}"))
        payload = json.loads(raw) if isinstance(raw, str) else raw
        results.append(route_spatial_message(payload))
    return {"dispatched": results}

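Set the queue URL variables in your Lambda configuration rather than relying on defaults. AWS_REGION is a reserved variable that the Lambda runtime populates itself, so set it explicitly only when running the dispatcher outside Lambda:

code

AWS_REGION=eu-west-1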
VECTOR_QUEUE_URL=https://sqs.eu-west-1.amazonaws.com/123456789012/geo-vector
RASTER_QUEUE_URL=https://sqs.eu-west-1.amazonaws.com/123456789012/geo-raster
RASTER_HIGH_QUEUE_URL=https://sqs.eu-west-1.amazonaws.com/123456789012/geo-raster-high
RASTER_LOW_QUEUE_URL=https://sqs.eu-west-1.amazonaws.com/123456789012/geo-raster-low
METADATA_QUEUE_URL=https://sqs.eu-west-1.amazonaws.com/123456789012/geo-metadata
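The dispatcher above sends one message per API call. When a single invocation produces many jobs for the same queue, grouping them into batches may reduce the request count. The sketch below only builds the batch entries and leaves the network call out; batch_entries is a hypothetical helper, and the limit of 10 entries per batch is the SQS SendMessageBatch limit.

```python
import json
from typing import Iterator

SQS_BATCH_LIMIT = 10  # SendMessageBatch accepts at most 10 entries per call


def batch_entries(payloads: list[dict]) -> Iterator[list[dict]]:
    """Yield lists of SendMessageBatch entries, at most 10 per list."""
    for start in range(0, len(payloads), SQS_BATCH_LIMIT):
        chunk = payloads[start:start + SQS_BATCH_LIMIT]
        yield [
            {
                # Id only needs to be unique within a single batch request.
                "Id": str(index),
                "MessageBody": json.dumps(payload),
            }
            for index, payload in enumerate(chunk)
        ]
```

Each yielded list can be passed as Entries to sqs.send_message_batch for one destination queue, so payloads need to be grouped by queue first. The response lists per-entry failures under "Failed", which a caller should inspect and retry.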

Step 3 — GCP Pub/Sub: Broker-Side Subscription Filters

GCP Pub/Sub evaluates subscription-level filter expressions before delivering messages. A single ingest topic fans out to multiple subscriptions; each subscription sees only the messages matching its filter — no routing Lambda required.

python

from google.cloud import pubsub_v1
from google.api_core.exceptions import GoogleAPIError
import json
import logging
import os

logger = logging.getLogger(__name__)
publisher = pubsub_v1.PublisherClient()
TOPIC_PATH = publisher.topic_path(
    os.environ["GCP_PROJECT_ID"],
    os.environ["PUBSUB_INGEST_TOPIC"],
)


def publish_spatial_event(payload: dict) -> str:
    """
    Publish a spatial job message with routing attributes.
    GCP Pub/Sub evaluates these attributes at subscription time —
    each subscription's filter runs broker-side; consumers only
    receive messages that match.
    """
    data = json.dumps(payload).encode("utf-8")

    # All attribute values must be strings for filter evaluation.
    attributes = {
        "crs":             payload.get("crs", "EPSG:4326"),
        "processing_mode": payload.get("processing_mode", "metadata"),
        "priority":        payload.get("priority", "normal"),
        "job_id":          payload["job_id"],
    }

    try:
        future = publisher.publish(TOPIC_PATH, data=data, **attributes)
        message_id = future.result(timeout=10)  # block up to 10 s
        logger.info("published job=%s msg=%s", payload["job_id"], message_id)
        return message_id
    except (GoogleAPIError, TimeoutError) as exc:
        logger.error("Pub/Sub publish failed for job %s: %s", payload.get("job_id"), exc)
        raise

Create the subscriptions with filter expressions using gcloud:

bash

# Raster subscription: receives normal- and low-priority raster messages.
# High-priority messages are excluded so they are not delivered twice.
gcloud pubsub subscriptions create geo-raster-sub \
  --topic=geospatial-ingest \
  --message-filter='attributes.processing_mode = "raster" AND attributes.priority != "high"' \
  --ack-deadline=300 \
  --dead-letter-topic=geo-dlq \
  --max-delivery-attempts=5

# High-priority raster subscription (separate consumer pool).
# Pub/Sub requires --max-delivery-attempts to be between 5 and 100.
gcloud pubsub subscriptions create geo-raster-high-sub \
  --topic=geospatial-ingest \
  --message-filter='attributes.processing_mode = "raster" AND attributes.priority = "high"' \
  --ack-deadline=120 \
  --dead-letter-topic=geo-dlq \
  --max-delivery-attempts=5

# Vector topology subscription
gcloud pubsub subscriptions create geo-vector-sub \
  --topic=geospatial-ingest \
  --message-filter='attributes.processing_mode = "vector"' \
  --ack-deadline=60 \
  --dead-letter-topic=geo-dlq \
  --max-delivery-attempts=5

Set the required environment variables:

code

GCP_PROJECT_ID=my-geo-project
PUBSUB_INGEST_TOPIC=geospatial-ingest
GOOGLE_APPLICATION_CREDENTIALS=/run/secrets/sa-key.json

Step 4 — Azure Service Bus: Topic Subscription Rules

Azure Service Bus topics support SQL-like filter rules per subscription. This mirrors GCP’s model: publish once to a topic, multiple subscriptions receive filtered subsets.

python

from azure.servicebus import ServiceBusClient, ServiceBusMessage
import json
import logging
import os

logger = logging.getLogger(__name__)
CONNECTION_STR = os.environ["SERVICEBUS_CONNECTION_STR"]
TOPIC_NAME = os.environ["SERVICEBUS_TOPIC"]


def publish_to_service_bus(payload: dict) -> None:
    """
    Publish a spatial message with application properties for rule-based routing.
    Azure Service Bus subscription rules evaluate these properties SQL-style.
    """
    with ServiceBusClient.from_connection_string(CONNECTION_STR) as client:
        with client.get_topic_sender(topic_name=TOPIC_NAME) as sender:
            msg = ServiceBusMessage(
                json.dumps(payload),
                application_properties={
                    "processing_mode": payload.get("processing_mode", "metadata"),
                    "priority":        payload.get("priority", "normal"),
                    "crs":             payload.get("crs", "EPSG:4326"),
                    "job_id":          payload["job_id"],
                },
            )
            sender.send_messages(msg)
            logger.info("sent job=%s to topic=%s", payload["job_id"], TOPIC_NAME)

Create subscriptions with SQL filter rules:

bash

# Raster subscription
az servicebus topic subscription create \
  --resource-group geo-rg \
  --namespace-name geo-bus \
  --topic-name geospatial-ingest \
  --name raster-sub

az servicebus topic subscription rule create \
  --resource-group geo-rg \
  --namespace-name geo-bus \
  --topic-name geospatial-ingest \
  --subscription-name raster-sub \
  --name raster-rule \
  --filter-sql-expression "processing_mode = 'raster'"

Set the required environment variable:

code

SERVICEBUS_CONNECTION_STR=Endpoint=sb://geo-bus.servicebus.windows.net/;SharedAccessKeyName=...
SERVICEBUS_TOPIC=geospatial-ingest

Step 5 — Dead Letter Queue Wiring

All three platforms require explicit DLQ configuration. On SQS, the DLQ is a separate queue associated via a redrive policy. Proper DLQ handling is detailed in Implementing Dead Letter Queues for Failed Vector Jobs.

python

import boto3
import json

sqs = boto3.client("sqs")

def attach_dlq(source_queue_url: str, dlq_arn: str, max_receive_count: int = 5) -> None:
    """
    Attach a Dead Letter Queue to a source SQS queue.
    max_receive_count=5 gives five delivery attempts before messages
    are moved to the DLQ for forensic analysis.
    """
    sqs.set_queue_attributes(
        QueueUrl=source_queue_url,
        Attributes={
            "RedrivePolicy": json.dumps({
                "deadLetterTargetArn": dlq_arn,
                "maxReceiveCount": str(max_receive_count),
            })
        },
    )

Step 6 — Idempotency Guard at the Consumer

Network partitions can cause the same message to be delivered more than once. Track job_id in a fast store to prevent duplicate CRS reprojections or topology checks:

python

import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.resource("dynamodb")
JOBS_TABLE = dynamodb.Table("geo-processed-jobs")


def is_already_processed(job_id: str) -> bool:
    """
    Attempt to write job_id with a conditional expression.
    Returns True if the job was already recorded (duplicate); False if new.
    """
    try:
        JOBS_TABLE.put_item(
            Item={"job_id": job_id},
            ConditionExpression="attribute_not_exists(job_id)",
        )
        return False  # new job — proceed
    except ClientError as exc:
        if exc.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return True  # duplicate — skip
        raise
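One consequence of recording job_id before the work runs is that a job which crashes after the write would be treated as a duplicate on redelivery. The sketch below shows one possible way to handle that: claim the job, run it, and release the claim if processing raises. InMemoryJobStore is a hypothetical stand-in for the DynamoDB table so the control flow can be run locally; with DynamoDB the release step would be a delete_item call.

```python
from typing import Callable


class InMemoryJobStore:
    """Hypothetical stand-in for the DynamoDB table, for local illustration."""

    def __init__(self) -> None:
        self._claimed: set[str] = set()

    def claim(self, job_id: str) -> bool:
        """Return True if this call claimed the job, False if already claimed."""
        if job_id in self._claimed:
            return False
        self._claimed.add(job_id)
        return True

    def release(self, job_id: str) -> None:
        """Drop the claim so a redelivered message can be processed."""
        self._claimed.discard(job_id)


def process_once(store: InMemoryJobStore, job_id: str, work: Callable[[], None]) -> str:
    """Run work at most once per job_id; release the claim if work fails."""
    if not store.claim(job_id):
        return "skipped"
    try:
        work()
    except Exception:
        store.release(job_id)
        raise
    return "processed"
```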

Measurement and Verification

Use these metrics to confirm routing is working correctly:

AWS SQS:

python

import boto3
from datetime import datetime, timedelta, timezone

cw = boto3.client("cloudwatch")

def get_queue_age(queue_name: str, window_minutes: int = 60) -> float:
    """Return the maximum ApproximateAgeOfOldestMessage (seconds) over the last window."""
    end = datetime.now(timezone.utc)
    start = end - timedelta(minutes=window_minutes)
    resp = cw.get_metric_statistics(
        Namespace="AWS/SQS",
        MetricName="ApproximateAgeOfOldestMessage",
        Dimensions=[{"Name": "QueueName", "Value": queue_name}],
        StartTime=start,
        EndTime=end,
        Period=60,
        Statistics=["Maximum"],
    )
    points = resp.get("Datapoints", [])
    return max((p["Maximum"] for p in points), default=0.0)

Key CloudWatch metric names to alert on:

ApproximateAgeOfOldestMessage: sustained growth signals a consumer stall
ApproximateNumberOfMessagesNotVisible: high values indicate slow consumers or visibility timeout expiry
ApproximateNumberOfMessagesVisible on the DLQ: non-zero means failures are accumulating (NumberOfMessagesSent does not count messages that the redrive policy moves to a DLQ)

GCP Pub/Sub:

bash

# Check oldest unacked message age (seconds) per subscription
gcloud monitoring metrics list \
  --filter="metric.type=pubsub.googleapis.com/subscription/oldest_unacked_message_age"

Set an alerting policy on subscription/oldest_unacked_message_age > 300 seconds as a proxy for routing lag.

Expected output after correct routing:

ApproximateAgeOfOldestMessage stays below your processing SLA (e.g. < 120 s for high-priority raster)
DLQ depth remains at 0 during steady-state operations
Consumer error rate on each queue is < 0.5% (malformed payloads caught by schema validation at the producer)

Failure Modes and Debugging

1 — Messages land in wrong queue (silent misrouting)

Signature: Raster consumers report KeyError: 'band_count' because they are receiving metadata-only messages. Vector consumers time out because large raster payloads exceed Lambda memory.

Root cause: The processing_mode attribute is missing or cased inconsistently (Raster vs raster). SQS does not validate attribute values; the router silently falls back to a default queue.

Fix: Validate and normalise processing_mode before routing. Add a schema check at the producer stage using pydantic:

python

from pydantic import BaseModel, field_validator
from typing import Literal

class SpatialMessage(BaseModel):
    source_uri: str
    crs: str
    processing_mode: Literal["vector", "raster", "metadata"]
    priority: Literal["high", "normal", "low"] = "normal"
    job_id: str

    @field_validator("processing_mode", "priority", mode="before")
    @classmethod
    def lowercase_enum(cls, v):
        # mode="before" runs ahead of type checks, so v may not be a string yet.
        return v.lower() if isinstance(v, str) else v

2 — SQS visibility timeout expires mid-processing

Signature: ApproximateNumberOfMessagesNotVisible spikes; the same job_id appears in logs twice within minutes. DLQ fills up faster than expected.

Root cause: GDAL cold-start time plus raster I/O exceeds the queue’s visibility timeout. The broker assumes the consumer crashed and redelivers the message to another worker.

Fix: Set visibility timeout to at least 2× the 99th-percentile processing time, up to the 12-hour SQS maximum. For raster jobs where processing time is unpredictable, extend the timeout programmatically from inside the consumer:

python

import boto3

sqs = boto3.client("sqs")

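def extend_visibility(queue_url: str, receipt_handle: str, extra_seconds: int = 120) -> None:
    """Reset the visibility timeout before it expires during long raster operations.

    The new value counts from the time of this call rather than being added
    to the time remaining, so call it periodically while work continues.
    """
    sqs.change_message_visibility(
        QueueUrl=queue_url,
        ReceiptHandle=receipt_handle,
        VisibilityTimeout=extra_seconds,
    )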

The cold start sequence for Python GDAL can add 8–12 seconds before GDAL registers its first driver — factor this into your baseline visibility timeout.
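The sizing rule above (at least 2× the 99th-percentile processing time, plus cold-start allowance, capped at 12 hours) can be sketched as a small calculation. The function name and the nearest-rank percentile method are choices made for this example; the 12-second default mirrors the upper end of the cold-start range quoted above.

```python
import math

SQS_MAX_VISIBILITY_S = 12 * 60 * 60  # 12-hour SQS maximum


def recommended_visibility_timeout(durations_s: list[float], cold_start_s: float = 12.0) -> int:
    """Return 2x the p99 duration plus cold-start time, capped at the SQS maximum."""
    if not durations_s:
        raise ValueError("need at least one observed duration")
    ordered = sorted(durations_s)
    # Nearest-rank 99th percentile: the smallest value with >= 99% of samples at or below it.
    rank = math.ceil(len(ordered) * 99 / 100)
    p99 = ordered[rank - 1]
    timeout = math.ceil(2 * p99 + cold_start_s)
    return min(timeout, SQS_MAX_VISIBILITY_S)
```

Feeding it recent job durations from logs gives a starting value for the queue attribute; jobs with very long tails still benefit from the programmatic extension shown earlier.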

3 — GCP Pub/Sub subscription filter drops messages silently

Signature: Published message count is higher than total delivered count across all subscriptions. Some messages are never processed and disappear without reaching the DLQ.

Root cause: A message matches no subscription filter. GCP Pub/Sub silently discards messages that no active subscription claims. This is different from SQS, where unrouted messages sit in the source queue until they expire.

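Fix: Create a catch-all subscription with no filter (--message-filter omitted) so that every message is delivered somewhere. An unfiltered subscription receives every message on the topic, including ones that other subscriptions also matched, so its consumer should acknowledge messages whose processing_mode is already handled and log or re-queue the rest. Attach a dead-letter topic so that messages the catch-all consumer cannot handle are retained (Pub/Sub requires --max-delivery-attempts to be between 5 and 100):

bash

gcloud pubsub subscriptions create geo-catch-all-sub \
  --topic=geospatial-ingest \
  --dead-letter-topic=geo-unrouted-dlq \
  --max-delivery-attempts=5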
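Before deploying filters, it can help to check offline which attribute combinations would match no subscription at all. The sketch below uses hand-written Python predicates as simplified stand-ins for subscription filters keyed on processing_mode; it is not a parser for the Pub/Sub filter language, and the names EXAMPLE_FILTERS and unrouted are illustrative.

```python
from typing import Callable

Attributes = dict[str, str]
Predicate = Callable[[Attributes], bool]

# Simplified stand-ins for subscription filters (hypothetical, hand-maintained).
EXAMPLE_FILTERS: dict[str, Predicate] = {
    "raster": lambda a: a.get("processing_mode") == "raster",
    "vector": lambda a: a.get("processing_mode") == "vector",
}


def matching_subscriptions(attributes: Attributes, filters: dict[str, Predicate]) -> list[str]:
    """Return the names of all filters that accept these attributes."""
    return [name for name, accepts in filters.items() if accepts(attributes)]


def unrouted(samples: list[Attributes], filters: dict[str, Predicate]) -> list[Attributes]:
    """Return the sample attribute sets that no filter accepts."""
    return [a for a in samples if not matching_subscriptions(a, filters)]
```

Running unrouted over a list of representative envelopes in a unit test surfaces gaps such as a mode with no subscription or a miscased value.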

4 — Azure Service Bus message lock expiry

Signature: MessageLockLostException in consumer logs. Processing appears to complete but catalog entries are not written.

Root cause: The lock duration (default 60 seconds, maximum 5 minutes) expires before a long-running GDAL operation completes. The broker redelivers the message to another consumer, which may find intermediate output in /tmp in an inconsistent state.

Fix: Raise the subscription's lock duration towards the 5-minute maximum and, for raster workloads that can run longer than that, renew the lock mid-processing:

python

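import threading


def renew_lock_periodically(receiver, msg, interval_seconds: int = 55) -> threading.Event:
    """Renew the message lock in the background until the returned event is set."""
    stop = threading.Event()

    def renew() -> None:
        # wait() returns False on timeout (keep renewing) and True once stop is set.
        while not stop.wait(interval_seconds):
            receiver.renew_message_lock(msg)

    threading.Thread(target=renew, daemon=True).start()
    return stop  # call stop.set() after completing or abandoning the message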

5 — FIFO queue throughput bottleneck

Signature: ApproximateNumberOfMessagesVisible grows steadily during peak ingest. NumberOfMessagesSent exceeds NumberOfMessagesDeleted by more than 2×.

Root cause: SQS FIFO queues are capped at 3,000 msg/s with batching. High-volume tile events from a large Sentinel-2 scene delivery can easily exceed this, especially when batch vs. stream processing decisions result in per-tile events rather than per-scene batches.

Fix: Use a standard (non-FIFO) queue for tile-level parallelism; reserve FIFO only for workflows that require strict ordering (e.g. sequential cadastral boundary edits). Partition FIFO queues by MessageGroupId (one per geographic region or per source sensor) to maximise parallel delivery within ordering groups.
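As one possible way to derive a per-region MessageGroupId, the sketch below buckets a longitude/latitude pair (for example a tile centroid) into a fixed-size grid cell and uses the cell as the group. The function name, the 10-degree default, and the ID format are assumptions for illustration; any stable region key would serve the same purpose.

```python
import math


def message_group_id(lon: float, lat: float, cell_deg: int = 10) -> str:
    """Derive a MessageGroupId from the grid cell containing (lon, lat)."""
    if not (-180 <= lon <= 180 and -90 <= lat <= 90):
        raise ValueError("coordinates out of range")
    # Shift to non-negative ranges so cell indices start at zero.
    col = math.floor((lon + 180) / cell_deg)
    row = math.floor((lat + 90) / cell_deg)
    return f"cell-{cell_deg}d-{col}-{row}"
```

Messages in the same cell keep their relative order, while different cells can be consumed in parallel. Smaller cells give more parallelism at the cost of ordering only within a smaller area.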

Cost and Scaling Considerations

AWS SQS: Pricing is per request, with each 64 KB chunk of payload billed as one request. A message close to the 256 KB limit therefore costs 4 request units, while a typical routing envelope of a few kilobytes costs 1. At 10 million raster routing events per month with small envelopes, routing cost is approximately $4 (well below the compute cost). The real cost driver is the Lambda dispatcher, so keep it small (128 MB, minimal timeout) since it performs no GDAL operations.
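The chunk arithmetic can be written out as a short calculation. This sketch counts send requests only (receives and deletes are billed separately), and the price constant is an assumption inferred from the roughly $4 per 10 million figure above; check current pricing before relying on it.

```python
import math

CHUNK_BYTES = 64 * 1024
# Assumed USD price per million standard-queue requests; verify against current pricing.
ASSUMED_USD_PER_MILLION = 0.40


def billed_requests(payload_bytes: int) -> int:
    """Number of billed requests for one message: one per 64 KB chunk, minimum 1."""
    return max(1, math.ceil(payload_bytes / CHUNK_BYTES))


def monthly_send_cost(messages: int, payload_bytes: int,
                      usd_per_million: float = ASSUMED_USD_PER_MILLION) -> float:
    """Estimated monthly cost in USD of sending `messages` of the given size."""
    return messages * billed_requests(payload_bytes) / 1_000_000 * usd_per_million
```

Under these assumptions a 2 KB envelope is billed as a single request, which is why keeping geometry out of the message body matters for cost as well as for the size limit.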

GCP Pub/Sub: Priced per GB of message data processed (both publish and deliver). Subscription filter evaluation is included at no extra cost. If your message envelopes average 2 KB and you route 10 million events, you pay for ~20 GB of data volume per direction (~$0.06 each way at standard pricing).

Azure Service Bus: Standard tier prices per million operations (~$0.01). Premium tier is a fixed hourly charge per messaging unit regardless of volume — economical only above ~5 million operations per month. Use Premium for raster workloads that need long lock durations, sessions, or large messages (up to 100 MB).

Scale-out behaviour: Both SQS and Pub/Sub scale horizontally without configuration up to their documented throughput ceilings. AWS Lambda auto-scales consumers up to the account concurrency limit (default 1,000); set a reserved concurrency on each queue’s consumer to prevent one workload type from starving others. For raster workers, cap concurrency at floor(available_memory_budget / per_worker_peak_memory) to avoid OOM evictions at scale.
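The concurrency cap for raster workers is a single integer division. The helper below is a hypothetical illustration of that formula, with both arguments in megabytes.

```python
def max_raster_concurrency(memory_budget_mb: int, per_worker_peak_mb: int) -> int:
    """Return floor(memory budget / per-worker peak memory)."""
    if per_worker_peak_mb <= 0:
        raise ValueError("per_worker_peak_mb must be positive")
    return memory_budget_mb // per_worker_peak_mb
```

A result of 0 means the budget cannot hold even one worker at peak, which is a signal to raise the budget or reduce per-worker memory before scaling out.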

When to prefer this approach vs. alternatives:

Use queue-based routing (this pattern) when jobs are discrete, independently retryable, and take 1–15 minutes each
Use chunked streaming I/O instead of a queue for continuous sensor feeds where per-event queuing overhead exceeds processing time
Use AWS Step Functions or GCP Workflows for stateful multi-step pipelines where routing depends on intermediate output (e.g. CRS detection from a partial read of the file)

Frequently Asked Questions

Can AWS SQS filter messages by content without EventBridge?

Standard SQS queues have no native broker-side filtering. The producer or an intermediary Lambda must inspect the payload and dispatch to the correct queue URL. EventBridge adds content-based routing rules that evaluate JSON path expressions and fan out to multiple SQS queues without custom code.

What is the maximum message size for spatial payloads in SQS vs Pub/Sub?

SQS caps each message at 256 KB. GCP Pub/Sub allows up to 10 MB per message and 10 MB per publish request. Azure Service Bus Standard allows 256 KB; Premium allows up to 100 MB. Store actual raster or vector files in object storage and pass only the URI and metadata in the queue message.

How do FIFO queues affect throughput for geospatial workloads?

SQS FIFO queues are capped at 3,000 messages per second with batching (300 without). For high-volume tile ingestion this becomes a bottleneck. Use FIFO only when strict ordering is required — for example, sequential cadastral boundary edits — and partition by MessageGroupId to maximise parallelism within the ordering constraint.
