Lambda Concurrency with SQS: The Gotchas Nobody Warns You About
Practical lessons from running SQS-triggered Lambda functions in production — batch sizes, MaximumConcurrency, reserved concurrency, and the failure modes that catch teams off guard.
When you first wire an SQS queue to a Lambda function, everything looks clean. Messages come in, Lambda processes them, life is good. Then you hit production traffic and discover that the defaults are not your friends.
This post covers the concurrency controls I've learned to set correctly after running SQS-triggered Lambdas at scale — and the failure modes that aren't obvious from the docs.
The default behaviour will surprise you
Out of the box, Lambda's SQS event source mapping starts with 5 concurrent batches and scales up aggressively. AWS can spin up to 1,000 concurrent executions (or your account limit) remarkably fast. If your Lambda writes to a database, calls an external API, or does anything with a finite connection pool, you'll hit rate limits before you realise what's happening.
The first gotcha: Lambda scales to match the queue depth, not your downstream capacity.
Batch size is not just about throughput
Most teams set batchSize to 10 and move on. But batch size interacts with concurrency in ways that matter.
A batch size of 10 with 100 concurrent executions means 1,000 messages in flight simultaneously. If each message takes 500ms, the handler works through its batch in parallel, and each message is one standard DynamoDB write of up to 1 KB, that's 2,000 write capacity units per second from this one consumer alone.
The second gotcha: batch size × concurrency = your actual downstream load. Model this before you deploy.
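That multiplication is worth encoding as a pre-deploy sanity check. A minimal sketch of the model (all names are mine, not an AWS API; the throughput figure assumes parallel processing within each batch, e.g. via Promise.all):

```typescript
// Back-of-envelope model for batch size x concurrency. Illustrative only;
// plug in your own integration's numbers.
interface LoadModel {
  batchSize: number;         // messages per invocation
  concurrency: number;       // concurrent Lambda executions
  secondsPerMessage: number; // processing time per message
}

// Messages in flight at any instant.
function messagesInFlight(m: LoadModel): number {
  return m.batchSize * m.concurrency;
}

// Messages completed per second, assuming each execution processes its
// batch in parallel. Divide by batchSize if you process sequentially.
function parallelThroughputPerSecond(m: LoadModel): number {
  return messagesInFlight(m) / m.secondsPerMessage;
}

const model: LoadModel = { batchSize: 10, concurrency: 100, secondsPerMessage: 0.5 };
console.log(messagesInFlight(model));            // 1000 messages in flight
console.log(parallelThroughputPerSecond(model)); // 2000 messages per second
```

If each of those 2,000 messages per second is one standard DynamoDB write of up to 1 KB, that is the 2,000 WCU/s figure from the example above.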
MaximumConcurrency is the control you actually want
Added in January 2023, MaximumConcurrency on the event source mapping is the cleanest way to cap how many Lambda instances process from a specific queue. This is different from reserved concurrency on the function itself.
```typescript
import { Duration } from 'aws-cdk-lib';
import { SqsEventSource } from 'aws-cdk-lib/aws-lambda-event-sources';

new SqsEventSource(queue, {
  batchSize: 5,
  maxConcurrency: 20, // valid range is 2-1000
  maxBatchingWindow: Duration.seconds(5),
});
```
Why this matters: reserved concurrency on the function affects all invocation sources. If the same Lambda is triggered by API Gateway and SQS, setting reserved concurrency to 20 means your API callers compete with your queue consumers. MaximumConcurrency scopes the limit to just the SQS trigger.
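For contrast, here is what the function-wide knob looks like in CDK (a sketch assuming aws-cdk-lib v2; `scope`, the runtime, and the asset path are placeholders, not from any real stack):

```typescript
import * as lambda from 'aws-cdk-lib/aws-lambda';
import { Construct } from 'constructs';

declare const scope: Construct; // stands in for your stack or construct

// Reserved concurrency caps the WHOLE function: 20 concurrent executions
// total, shared by API Gateway, SQS, and every other trigger attached to it.
const handler = new lambda.Function(scope, 'Handler', {
  runtime: lambda.Runtime.NODEJS_20_X,
  handler: 'index.handler',
  code: lambda.Code.fromAsset('dist'),
  reservedConcurrentExecutions: 20,
});
```

Because this limit lives on the function rather than the event source mapping, it is the wrong tool when you only want to throttle the queue consumer.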
The partial batch failure trap
By default, if any message in a batch fails, the entire batch goes back to the queue. Every message gets reprocessed, including the ones that succeeded. If you're not idempotent, you've just introduced duplicates.
The fix is reportBatchItemFailures. Return the IDs of the failed messages and only those get retried:
```typescript
import type { SQSEvent, SQSBatchResponse, SQSBatchItemFailure } from 'aws-lambda';

export const handler = async (event: SQSEvent): Promise<SQSBatchResponse> => {
  const failures: SQSBatchItemFailure[] = [];
  for (const record of event.Records) {
    try {
      await processMessage(record);
    } catch (error) {
      // Only this message will be retried; the rest of the batch is deleted.
      failures.push({ itemIdentifier: record.messageId });
    }
  }
  return { batchItemFailures: failures };
};
```
This is one of those settings that should arguably be the default but isn't.
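Even with reportBatchItemFailures, SQS standard queues are at-least-once: a message that processed successfully can still be redelivered. A minimal in-memory sketch of an idempotency guard keyed on messageId (`processOnce` is my name, not a library API; in production you would back this with a durable store such as a DynamoDB conditional write, not a Set):

```typescript
const processed = new Set<string>();

// Returns true if the work ran, false if this messageId was already handled.
async function processOnce(
  messageId: string,
  work: () => Promise<void>
): Promise<boolean> {
  if (processed.has(messageId)) return false; // duplicate delivery: skip
  await work();                               // run the side effect
  processed.add(messageId);                   // mark done only after success
  return true;
}
```

Note the ordering: marking the message as processed only after `work()` succeeds means a failure is safely retried, at the cost of a possible duplicate if the process crashes between the write and the mark.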
What I set on every SQS-Lambda integration now
After enough production incidents, I've landed on a checklist:
- `MaximumConcurrency` set to match downstream capacity, not Lambda's desire to scale
- `batchSize` between 1 and 10 depending on processing time per message
- `maxBatchingWindow` of 5-10 seconds to let small batches fill up during low traffic
- `reportBatchItemFailures` always enabled
- DLQ configured with `maxReceiveCount` of 3-5 before messages move to dead letter
- Alarms on DLQ depth — if messages are landing here, something is wrong
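Most of the queue-side items on that checklist can be expressed directly in CDK. A sketch assuming aws-cdk-lib v2; the construct IDs and the 60-second visibility timeout are illustrative:

```typescript
import { Duration } from 'aws-cdk-lib';
import * as sqs from 'aws-cdk-lib/aws-sqs';
import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';
import { Construct } from 'constructs';

declare const scope: Construct; // stands in for your stack or construct

const dlq = new sqs.Queue(scope, 'WorkDlq');

const queue = new sqs.Queue(scope, 'WorkQueue', {
  // AWS recommends a visibility timeout of at least 6x the function timeout.
  visibilityTimeout: Duration.seconds(60),
  deadLetterQueue: { queue: dlq, maxReceiveCount: 3 },
});

// Alarm the moment anything lands in the DLQ.
new cloudwatch.Alarm(scope, 'DlqDepthAlarm', {
  metric: dlq.metricApproximateNumberOfMessagesVisible(),
  threshold: 1,
  evaluationPeriods: 1,
});
```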
The underlying principle: Lambda wants to scale. Your job is to tell it where to stop.
This is part of a series on event-driven patterns I've used in production. Next up: DynamoDB single-table design for multi-tenant platforms.