Event-Driven Platform on AWS
When your services can no longer afford to wait for each other — how to build a platform where every part of your product reacts to what's happening in real time, without coupling teams or systems together.
Your services have started to hold each other hostage
A scaleup that started lean — one service, one database, one team — eventually reaches a point where the architecture becomes the bottleneck. A new feature in billing needs data from fulfilment. Notifications need to know about orders. Analytics needs to track everything.
The instinct is to connect them directly: API calls between services, shared databases, synchronous responses. It works at first. Then one service goes slow and everything downstream queues behind it. An outage in one part of the product takes down the whole thing. Teams can't ship independently because their services are tightly wired together.
The cost is real: engineering velocity drops, on-call escalations increase, and customers experience failures that ripple across unrelated features. The product can't scale because the architecture can't.
Services that react to events, not to each other
Instead of calling each other directly, services publish events to a central stream, and every service that cares about an event subscribes independently. When an order is placed, billing processes it, fulfilment picks it up, notifications fire, and analytics records it. Each does so independently and at its own pace, without knowing the others exist.
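To make the fan-out concrete, here is a minimal in-memory sketch in Python. It is not Kafka and not part of the blueprint itself: `EventLog`, `publish`, and `poll` are hypothetical names chosen for illustration, and a plain list stands in for one topic partition. What it tries to show is the property that matters: each consumer group keeps its own offset, so a slow reader never holds back a fast one.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class EventLog:
    """Append-only in-memory log standing in for a single topic partition."""
    events: list = field(default_factory=list)
    # Each consumer group tracks its own read position (offset) in the log.
    offsets: dict = field(default_factory=lambda: defaultdict(int))

    def publish(self, event: dict) -> int:
        """Append an event and return its offset. The producer knows nothing about consumers."""
        self.events.append(event)
        return len(self.events) - 1

    def poll(self, group: str, max_events: int = 10) -> list:
        """Return the next unread events for one group and advance only that group's offset."""
        start = self.offsets[group]
        batch = self.events[start:start + max_events]
        self.offsets[group] = start + len(batch)
        return batch


log = EventLog()
log.publish({"type": "order_placed", "order_id": "A1"})
log.publish({"type": "order_placed", "order_id": "A2"})

billing_batch = log.poll("billing")                         # billing reads both events
fulfilment_batch = log.poll("fulfilment", max_events=1)     # fulfilment is slower and reads one
```

A group that subscribes later, say `analytics`, would start at offset 0 and receive both events with no change on the producer side.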
This blueprint delivers a managed event streaming platform on AWS, built for production scale with predictable costs and without the operational overhead of running your own Kafka cluster. Services become genuinely independent: each can be deployed, scaled, or fail without affecting anything else. Failed processing can be retried automatically, and events can be replayed to fix bugs or onboard a new consumer without touching the producer.
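Retry and replay can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the blueprint's implementation: the event list stands in for a retained topic, `process_with_retry` and `consume` are hypothetical helpers, and the billing handlers and their bug are made up. Events that still fail after the retry budget are set aside (the role a dead-letter queue plays), and once the bug is fixed the consumer simply reads the same events again from offset 0.

```python
def process_with_retry(event, handler, max_attempts=3):
    """Call handler(event) up to max_attempts times; return True on success."""
    for _ in range(max_attempts):
        try:
            handler(event)
            return True
        except Exception:
            continue  # a real consumer would back off before retrying
    return False


def consume(events, handler, start_offset=0, max_attempts=3):
    """Process events from start_offset; return (next_offset, dead_letters)."""
    dead_letters = []
    for offset in range(start_offset, len(events)):
        if not process_with_retry(events[offset], handler, max_attempts):
            # Park the failure and keep going instead of blocking the stream.
            dead_letters.append((offset, events[offset]))
    return len(events), dead_letters


events = [{"order_id": "A1", "total": 20}, {"order_id": "A2", "total": None}]

invoiced = []

def buggy_billing(event):
    # Hypothetical bug: raises TypeError when total is missing.
    invoiced.append((event["order_id"], event["total"] * 1))

next_offset, dead_letters = consume(events, buggy_billing)

replayed = []

def fixed_billing(event):
    # Fixed handler treats a missing total as zero.
    replayed.append((event["order_id"], event["total"] or 0))

# Replay from the beginning; the producer is not involved at all.
_, dead_after_fix = consume(events, fixed_billing, start_offset=0)
```

Replaying means handlers see some events more than once, so in practice they need to be idempotent, for example by keying writes on the order ID.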
The architecture diagram below shows a typical production deployment — producers publishing to topic partitions, consumer services reading at their own pace, with a relational store for the business data and a cache layer for high-read paths.
Diagram to be added — Producers → Event Stream → Consumer Services → Data Store
The four choices that define this blueprint
Every blueprint involves decisions that are only obvious in hindsight. Here's what was chosen, why, and what was ruled out.
Event streaming vs direct service calls
Chosen: Event streaming. Producers publish without knowing who's listening, and consumers subscribe without the producer ever needing to change.
Ruled out: Direct REST calls between services. These create tight coupling: when a downstream service is slow or unavailable, it blocks the upstream caller.
Managed Kafka vs self-hosted
Chosen: Amazon MSK (managed Kafka). Automatic patching, built-in replication, no cluster management. Lets the team focus on building, not operating Kafka.
Ruled out: Self-hosted Kafka on EC2. At scaleup stage, maintaining a Kafka cluster takes roughly half of a senior engineer's time. That's rarely the right trade-off.
Long-running consumers vs serverless functions
Chosen: Long-running consumer services on ECS Fargate. Kafka consumers hold persistent broker connections and manage their own offsets, which suits a long-lived process far better than short-lived function invocations.
Ruled out: AWS Lambda for Kafka consumers. Cold starts and short-lived invocations work against sustained throughput at volume. Lambda is a good choice for simple SQS queues, not for high-throughput Kafka topics.
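The offset management a long-running consumer is responsible for can be sketched as a poll loop that commits only after a message has been processed. This is a simplified Python illustration, not a Kafka client: `FakePartition` and `run_consumer` are hypothetical stand-ins, and a real service would use a Kafka client library and poll indefinitely. The point is the ordering: because the commit follows the handler, a crash mid-message means the restarted task picks that message up again (at-least-once delivery).

```python
class FakePartition:
    """In-memory stand-in for one partition plus one group's committed offset."""

    def __init__(self, messages):
        self.messages = list(messages)
        self.committed = 0  # next offset the group will resume from

    def fetch(self, offset):
        """Return the message at offset, or None if the consumer has caught up."""
        return self.messages[offset] if offset < len(self.messages) else None

    def commit(self, next_offset):
        self.committed = next_offset


def run_consumer(partition, handler, max_polls):
    """Poll loop: process a message, then commit. Resumes from the committed offset."""
    position = partition.committed
    for _ in range(max_polls):
        message = partition.fetch(position)
        if message is None:
            break  # a real service would keep polling instead of exiting
        handler(message)             # if this raises, the offset is NOT committed
        position += 1
        partition.commit(position)   # commit only after successful processing


partition = FakePartition(["m0", "m1", "m2"])
seen = []
crash_once = {"armed": True}

def handler(message):
    # Simulate a one-off crash while handling the second message.
    if message == "m1" and crash_once["armed"]:
        crash_once["armed"] = False
        raise RuntimeError("simulated crash")
    seen.append(message)

try:
    run_consumer(partition, handler, max_polls=10)
except RuntimeError:
    pass  # the task dies here; the orchestrator would restart it

offset_after_crash = partition.committed
run_consumer(partition, handler, max_polls=10)  # the restart resumes at the committed offset
```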
Infrastructure as code approach
Chosen: Terraform with a modular structure. Managed state, syntax the team already knows, and reusable modules for MSK, ECS, and networking. Easily version-controlled alongside application code.
Ruled out: Manual AWS console configuration. Non-reproducible, hard to audit, and prone to creating snowflake environments that can't be recreated reliably. Never the right call.
Is this the right blueprint for you?
This fits when
Multiple services need to react to the same event — e.g. order placed triggers billing, fulfilment, and notification independently
You're processing 10,000+ events per day and synchronous handling is becoming a performance bottleneck
You need a reliable audit trail of state changes — for compliance, debugging, or analytics
Teams need to ship independently without coordinating deploys across service boundaries
This doesn't fit when
You're processing fewer than 1,000 events per day — a simple job queue is cheaper and far easier to operate
You're still finding product-market fit — this level of infrastructure adds operational complexity before you need it
Your team has no streaming or Kafka experience — the learning curve is real and requires investment to do safely
You run a single-service architecture — event-driven adds complexity with no meaningful benefit at that scale
What to expect before you commit
Rough signals only — actual numbers depend on your traffic, team, and configuration. Use these to decide whether to scope further.
Time to production: 4–6 weeks
From kickoff to a production-grade deployment with monitoring and runbooks
Team: 2–3 engineers
1 platform/DevOps engineer + 1–2 backend engineers for consumer services
Operational overhead: Medium–High
Consumer lag monitoring, dead-letter queues, and offset management need ongoing attention
Monthly cost: £900–2,500/mo
At moderate scale. MSK broker cost is the main driver and scales with the number of brokers and their instance size
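Consumer lag, one of the items flagged above as needing ongoing attention, is simple to compute once you have the numbers: for each partition it is the latest offset in the log minus the offset the consumer group has committed. The Python sketch below assumes those offsets have already been fetched from the cluster; `consumer_lag`, the sample offsets, and the alert threshold are all made up for illustration.

```python
def consumer_lag(end_offsets, committed_offsets):
    """Per-partition lag = latest offset in the log minus the group's committed offset."""
    return {
        partition: end - committed_offsets.get(partition, 0)
        for partition, end in end_offsets.items()
    }


# Illustrative values only: partition number -> offset.
end_offsets = {0: 1500, 1: 1480, 2: 1510}
committed = {0: 1500, 1: 1200, 2: 1505}

lag = consumer_lag(end_offsets, committed)
total_lag = sum(lag.values())

ALERT_THRESHOLD = 100  # hypothetical; tune to your traffic and latency targets
should_alert = total_lag > ALERT_THRESHOLD
```

Lag that keeps growing on one partition, as on partition 1 here, usually points at a slow or stuck consumer rather than a traffic spike.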
Implement this blueprint
Want this built for your team?
We adapt these blueprints to your stack, team size, and budget. If your services are starting to hold each other back, let's design the solution together.