Telemetry for half a million devices, in real time
We split high-frequency sensor ingestion, processing, and query into independent containerised services on a Kafka event bus, each scaling to its own load. The result: sub-second dashboards at less than a tenth of the previous cost per device.
- Microservices
- Containers
- Kafka
- Time-series
- 520k devices
- 1.2M msg/min ingest
- −91% cost / device
Each service has its own pipeline and ships on its own cadence: a fix to the ingestion service deploys without touching query or processing. Every pipeline promotes through the same dev → staging → prod environments, with a manual approval gate before prod.
Northwind ran a fleet of grid sensors on a single monolithic time-series cluster that was both expensive and perpetually nearly full. Ingestion, alerting, and operator queries all contended for the same machines, so the cluster had to be provisioned for the sum of every peak at once — headroom that sat idle most of the day. Worse, the coupling meant a flood of inbound telemetry starved the read path: operators waited minutes for a dashboard to load during exactly the demand spikes when they needed it instantly.
We broke the monolith into services drawn along the natural seams of the workload, each a container on ECS/EKS that scales to its own load. An ingestion service terminates connections from the half-million-device fleet and does nothing else — validate, normalise, publish. It writes onto a Kafka event bus (MSK), which decouples the firehose of inbound telemetry from everything downstream and absorbs spikes as buffered backlog rather than dropped messages. A processing-and-alerting service consumes that stream, computes rolling aggregates, and raises threshold alerts; a separate query service owns the read path and serves operator dashboards from pre-rolled summaries. Each writes to a time-series store sized for its own access pattern. Because ingestion and query no longer share a machine, a telemetry surge scales the ingestion fleet horizontally without ever touching dashboard latency — and every service is deployed, versioned, and scaled independently.
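To make the split concrete, here is a minimal sketch of the two stages either side of the bus: the ingestion step (validate, normalise) and the processing step (rolling aggregate, threshold alert). It is an illustration under stated assumptions, not Northwind's implementation. The payload fields (`device_id`, `ts`, `value`), the 60-second window, and the threshold of 100 are all hypothetical, and a plain Python iterable stands in for the Kafka topic so the example runs without a broker.

```python
from collections import deque
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Reading:
    device_id: str
    ts: float      # seconds since epoch (assumed payload field)
    value: float   # sensor value in arbitrary units (assumed payload field)


def normalise(raw: dict) -> Optional[Reading]:
    """Ingestion step: validate a raw payload, return None if it is malformed."""
    try:
        device_id = str(raw["device_id"]).strip().lower()
        ts = float(raw["ts"])
        value = float(raw["value"])
    except (KeyError, TypeError, ValueError):
        return None
    if not device_id or ts < 0 or value != value:  # value != value catches NaN
        return None
    return Reading(device_id, ts, value)


class RollingMean:
    """Mean of one device's readings over the last `window_s` seconds."""

    def __init__(self, window_s: float) -> None:
        self.window_s = window_s
        self._items: deque = deque()
        self._total = 0.0

    def add(self, r: Reading) -> float:
        self._items.append(r)
        self._total += r.value
        # Evict readings that have fallen out of the window.
        while self._items and self._items[0].ts <= r.ts - self.window_s:
            self._total -= self._items.popleft().value
        return self._total / len(self._items)


def process(stream: Iterable[dict], window_s: float = 60.0,
            threshold: float = 100.0) -> list:
    """Processing step: consume the stream and collect threshold alerts."""
    windows: dict = {}
    alerts = []
    for raw in stream:
        reading = normalise(raw)
        if reading is None:
            continue  # malformed payloads are skipped, never published
        window = windows.setdefault(reading.device_id, RollingMean(window_s))
        mean = window.add(reading)
        if mean > threshold:
            alerts.append((reading.device_id, reading.ts, mean))
    return alerts


# Stand-in for messages read from the bus.
stream = [
    {"device_id": " S1 ", "ts": 0, "value": 90},
    {"device_id": "s1", "ts": 30, "value": 120},
    {"bad": 1},
    {"device_id": "s1", "ts": 90, "value": 80},
]
alerts = process(stream)
```

In a real deployment the two functions would live in separate services, with `normalise` running before the publish and `process` running in a consumer. Keeping per-device state in one consumer, as this sketch does, assumes that all readings for a given device arrive at the same consumer.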
The platform now absorbs more than a million messages a minute from over five hundred thousand devices, and each service scales on its own autoscaler: the ingestion fleet expands with the surge while the query service holds steady. Operator dashboards render in under a second during peak load, because the read path no longer competes with ingestion for capacity. And cost per device fell by ninety-one percent: instead of one over-provisioned cluster paying for the sum of every peak, each right-sized service scales up only when its own load demands it and back down once that load passes.
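The cost argument can be checked with simple arithmetic. The sketch below compares a fixed cluster sized for the sum of every service's peak against services that each run only the capacity their own load needs. The load profiles are invented for illustration and are not Northwind's figures, so the resulting saving is not the 91 percent reported above. It only shows why peaks that do not coincide make the fixed cluster expensive.

```python
def monolith_capacity_hours(profiles: dict) -> int:
    """Fixed cluster sized for the sum of every peak, running in every slot."""
    slots = len(next(iter(profiles.values())))
    return sum(max(load) for load in profiles.values()) * slots


def autoscaled_capacity_hours(profiles: dict) -> int:
    """Each service runs only the capacity its own load needs in each slot."""
    return sum(sum(load) for load in profiles.values())


# Hypothetical load per time slot, in arbitrary capacity units.
profiles = {
    "ingestion": [2, 2, 10, 2],
    "processing": [1, 1, 6, 1],
    "query": [1, 4, 1, 1],
}

fixed = monolith_capacity_hours(profiles)     # (10 + 6 + 4) * 4 slots
scaled = autoscaled_capacity_hours(profiles)  # 16 + 9 + 7
saving = 1 - scaled / fixed
```

The gap widens as peaks get sharper and less aligned, since the fixed cluster pays for every peak in every slot while the autoscaled services pay for a peak only while it lasts.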