Event-Driven Platform on AWS

# the business problem

Your services have started to hold each other hostage

A scaleup that started lean — one service, one database, one team — eventually reaches a point where the architecture becomes the bottleneck. A new feature in billing needs data from fulfilment. Notifications need to know about orders. Analytics needs to track everything.

The instinct is to connect them directly: API calls between services, shared databases, synchronous responses. It works at first. Then one service slows down and every caller that depends on it queues behind it. An outage in one part of the product takes down the whole thing. Teams can't ship independently because their services are tightly wired together.

The cost is real: engineering velocity drops, on-call escalations increase, and customers experience failures that ripple across unrelated features. The product can't scale because the architecture can't.

# the solution

Services that react to events, not to each other

Instead of calling each other directly, services publish events to a central stream, and every service that cares about an event subscribes to it independently. When an order is placed, billing processes it, fulfilment picks it up, notifications fire, and analytics records it. Each consumer works independently and at its own pace, without knowing the others exist.
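
To make the publish/subscribe idea concrete, here is a minimal in-memory sketch, not Kafka itself. The names `Event`, `EventLog`, and the `order.placed` event type are hypothetical and chosen only for illustration. It shows the one property that matters: the producer appends to a log, and each subscriber keeps its own read position.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    type: str      # e.g. "order.placed" (hypothetical event name)
    payload: dict


@dataclass
class EventLog:
    """Append-only log; each subscriber tracks its own read position."""
    events: list = field(default_factory=list)
    offsets: dict = field(default_factory=dict)  # subscriber name -> next index

    def publish(self, event: Event) -> None:
        # The producer only appends; it holds no list of subscribers to call.
        self.events.append(event)

    def poll(self, subscriber: str, max_events: int = 10) -> list:
        # A subscriber seen for the first time starts at the beginning.
        start = self.offsets.get(subscriber, 0)
        batch = self.events[start:start + max_events]
        self.offsets[subscriber] = start + len(batch)
        return batch


log = EventLog()
log.publish(Event("order.placed", {"order_id": 1}))
log.publish(Event("order.placed", {"order_id": 2}))

billing_batch = log.poll("billing")                     # reads both events
analytics_batch = log.poll("analytics", max_events=1)   # reads one, at its own pace
```

Billing being ahead of analytics has no effect on either of them, and the publisher never learns that either exists.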

This blueprint delivers a managed event streaming platform on AWS, built for production scale with predictable costs and without the operational overhead of running your own Kafka cluster. Services become genuinely independent: each can be deployed, scaled, or fail without affecting the others. Failed processing can be retried automatically. Events still within the topic's retention window can be replayed to fix bugs or onboard a new consumer without touching the producer.
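
Retry and replay can be sketched without any Kafka client. The helper `process_with_retry` and the sample events below are hypothetical. A real consumer would usually add backoff between attempts and publish failures to a dead-letter topic rather than a Python list.

```python
def process_with_retry(events, handler, max_attempts=3):
    """Run handler on each event; park events that keep failing."""
    dead_letters = []
    for event in events:
        for attempt in range(1, max_attempts + 1):
            try:
                handler(event)
                break
            except Exception as exc:
                if attempt == max_attempts:
                    # Give up on this event but keep processing the rest.
                    dead_letters.append((event, repr(exc)))
    return dead_letters


# A retained log is an ordered sequence that any consumer can re-read.
retained_log = [{"order_id": 1, "total": 40}, {"order_id": 2, "total": None}]

invoiced = []


def bill(event):
    if event["total"] is None:
        raise ValueError("missing total")
    invoiced.append(event["order_id"])


dead = process_with_retry(retained_log, bill)

# Replay: a consumer added later reads the same log from the start.
# The producer is not involved and does not need to resend anything.
report = []
process_with_retry(retained_log, lambda e: report.append(e["order_id"]))
```

The malformed order ends up in `dead` after three attempts, while the reporting consumer added afterwards still sees both orders.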

The architecture diagram below shows a typical production deployment — producers publishing to topic partitions, consumer services reading at their own pace, with a relational store for the business data and a cache layer for high-read paths.

Architecture diagram

Diagram to be added — Producers → Event Stream → Consumer Services → Data Store
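
One detail the producer-to-partition arrow hides is how an event lands in a partition. A common approach is to hash a key such as the order ID, so that all events for one order stay in one partition and keep their order. The sketch below assumes that approach. `zlib.crc32` is a stand-in for whatever hash the Kafka client actually uses, and the partition count and keys are made up.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    # Stable hash of the key, reduced to a partition index.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


NUM_PARTITIONS = 6  # assumed value for illustration
partitions = {p: [] for p in range(NUM_PARTITIONS)}

events = [
    ("order-17", "placed"),
    ("order-42", "placed"),
    ("order-17", "paid"),
    ("order-17", "shipped"),
]
for key, status in events:
    partitions[partition_for(key, NUM_PARTITIONS)].append((key, status))

# Every event for order-17 sits in one partition, in publish order.
target = partitions[partition_for("order-17", NUM_PARTITIONS)]
order_17 = [status for key, status in target if key == "order-17"]
```

Ordering is only guaranteed within a partition, which is why the choice of key matters more than the choice of hash.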

# key decisions

The four choices that define this blueprint

Every blueprint involves decisions that only look obvious in hindsight. Here's what was chosen, why, and what was ruled out.

Decision 01

Event streaming vs direct service calls

Chosen

Event streaming — producers publish without knowing who's listening. Consumers subscribe without the producer ever needing to change.

Ruled out

Direct REST calls between services. Creates tight coupling — when a downstream service is slow or unavailable, it blocks the upstream caller.
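
The coupling cost can be put into rough numbers. The sketch below assumes each service is available 99.9% of the time and that failures are independent. Both are illustrative assumptions, not measurements from any real system.

```python
def chain_availability(per_service: float, depth: int) -> float:
    """Availability of a request that needs `depth` services to all respond."""
    return per_service ** depth


# Compare a single service with synchronous chains of three and five.
for depth in (1, 3, 5):
    print(depth, f"{chain_availability(0.999, depth):.4f}")
```

Every synchronous hop multiplies in another chance of failure, and latency adds up the same way. With an event stream between services, the caller depends only on the stream being available.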

Decision 02

Managed Kafka vs self-hosted

Chosen

Amazon MSK (managed Kafka). Automatic patching, built-in replication, no cluster management. Lets the team focus on building, not operating Kafka.

Ruled out

Self-hosted Kafka on EC2. At scaleup stage, maintaining a Kafka cluster can take roughly half a senior engineer's time. That's rarely the right trade-off.

Decision 03

Long-running consumers vs serverless functions

Chosen

Long-running consumer services on ECS Fargate. Kafka consumers hold persistent broker connections and manage their own offsets, which suits an always-on process better than a short-lived function invocation.

Ruled out

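AWS Lambda for Kafka consumers. Lambda can consume from MSK through an event source mapping, but cold starts add latency under bursty load, and polling and offset commits are handled by the service rather than by your code. Lambda is a better fit for simple SQS queues than for high-throughput Kafka topics.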
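
The offset management mentioned in this decision usually comes down to one rule: commit only after the handler has succeeded. The sketch below illustrates that rule with a hypothetical `Consumer` interface and an in-memory stand-in. Real Kafka client libraries expose different method names and richer records, and a production loop would run until shutdown rather than until the log is drained.

```python
from typing import Callable, Optional, Protocol


class Consumer(Protocol):
    """Hypothetical minimal interface; real Kafka clients differ in detail."""
    def poll(self) -> Optional[dict]: ...
    def commit(self, offset: int) -> None: ...


class InMemoryConsumer:
    """Test double standing in for a Kafka client."""
    def __init__(self, records: list, committed: int = 0):
        self.records = records
        self.position = committed    # next record to hand out
        self.committed = committed   # the value that survives a restart

    def poll(self) -> Optional[dict]:
        if self.position >= len(self.records):
            return None
        record = {"offset": self.position, "value": self.records[self.position]}
        self.position += 1
        return record

    def commit(self, offset: int) -> None:
        self.committed = offset


def run(consumer: Consumer, handle: Callable[[str], None]) -> None:
    while (record := consumer.poll()) is not None:
        handle(record["value"])
        # Commit only after the handler succeeds, so a crash mid-handler
        # causes a re-read (at-least-once) instead of a silent skip.
        consumer.commit(record["offset"] + 1)


records = ["a", "b", "c"]
seen = []


def flaky(value):
    # Fails the first time it sees "b", then succeeds.
    if value == "b" and "b-failed" not in seen:
        seen.append("b-failed")
        raise RuntimeError("crash")
    seen.append(value)


first = InMemoryConsumer(records)
try:
    run(first, flaky)
except RuntimeError:
    pass  # simulate the process dying

# Restart from the last committed offset.
second = InMemoryConsumer(records, committed=first.committed)
run(second, flaky)
```

The restarted consumer picks up at "b", not "c", because the crash happened before the commit. The price of this safety is that handlers must tolerate seeing the same event twice.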

Decision 04

Infrastructure as code approach

Chosen

Terraform with modular structure. State management, team-familiar syntax, reusable modules for MSK, ECS, and networking. Easily version-controlled alongside application code.

Ruled out

Manual AWS console configuration. Non-reproducible, impossible to audit, and creates snowflake environments that can't be recreated reliably. Never the right call.

# fit assessment

Is this the right blueprint for you?

This fits when

Multiple services need to react to the same event — e.g. order placed triggers billing, fulfilment, and notification independently

You're processing 10,000+ events per day and synchronous handling is becoming a performance bottleneck

You need a reliable audit trail of state changes — for compliance, debugging, or analytics

Teams need to ship independently without coordinating deploys across service boundaries

This doesn't fit when

You're processing fewer than 1,000 events per day — a simple job queue is cheaper and far easier to operate

You're still finding product-market fit — this level of infrastructure adds operational complexity before you need it

Your team has no streaming or Kafka experience — the learning curve is real and requires investment to do safely

You run a single-service architecture — event-driven adds complexity with no meaningful benefit at that scale
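
Daily totals hide bursts, so it can help to translate the thresholds above into per-second rates before deciding which list applies. The sketch below does that conversion. The `peak_factor` of 20 is an assumption for illustration, and the right value depends entirely on your traffic shape.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_rate(events_per_day: int) -> float:
    """Mean events per second over a full day."""
    return events_per_day / SECONDS_PER_DAY


def peak_rate(events_per_day: int, peak_factor: float) -> float:
    # peak_factor is an assumption: how many times busier the peak is
    # than the daily average.
    return average_rate(events_per_day) * peak_factor


avg = average_rate(10_000)
peak = peak_rate(10_000, peak_factor=20)
```

Sizing against `peak`, not `avg`, gives a more honest picture of whether synchronous handling is the real bottleneck.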

# complexity & cost signal

What to expect before you commit

Rough signals only — actual numbers depend on your traffic, team, and configuration. Use these to decide whether to scope further.

Time to production

4–6 weeks

From kickoff to a production-grade deployment with monitoring and runbooks

Team required

2–3 engineers

1 platform/DevOps engineer + 1–2 backend engineers for consumer services

Operational complexity

Medium–High

Consumer lag monitoring, dead-letter queues, and offset management need ongoing attention
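
Consumer lag is the gap between the newest offset in a partition and the offset the consumer group has committed. A minimal sketch of the calculation follows. The offsets and the alert threshold are illustrative values. In practice the inputs would come from the Kafka admin API or from MSK's metrics.

```python
def consumer_lag(end_offsets: dict, committed_offsets: dict) -> dict:
    """Per-partition lag: newest offset in the log minus the committed offset."""
    return {
        partition: end - committed_offsets.get(partition, 0)
        for partition, end in end_offsets.items()
    }


end_offsets = {0: 1200, 1: 950, 2: 1010}   # illustrative values
committed = {0: 1200, 1: 700, 2: 1000}

lag = consumer_lag(end_offsets, committed)
total_lag = sum(lag.values())

LAG_ALERT_THRESHOLD = 200                  # assumed threshold
lagging = [p for p, n in lag.items() if n > LAG_ALERT_THRESHOLD]
```

Watching lag per partition is worth the extra detail, because a single stuck partition can hide behind a healthy-looking total.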

AWS cost (rough)

£900–2,500/mo

At moderate scale. MSK broker cost is the main driver, and it scales with broker count and instance size

Implement this blueprint

Want this built for your team?

We adapt these blueprints to your stack, team size, and budget. If your services are starting to hold each other back, let's design the solution together.

Book a discovery call →