Case study

Telemetry for half a million devices, in real time

Split high-frequency sensor ingestion, processing, and query into independent containerised services on a Kafka event bus, each scaling to its own load — sub-second dashboards at a tenth of the previous cost.

Northwind Energy · 2024 · Illustrative

Microservices
Containers
Kafka
Time-series

By the numbers

Devices: 520k
Ingest: 1.2M msg/min
Cost / device: −91%

How it's built

Telemetry pipeline: microservices · containers

500k devices: MQTT / HTTPS

Ingestion service: ECS · autoscaled

Event bus: Kafka · MSK

Processing service: ECS

Alerting service: ECS

Time-series store: Cassandra

Query API: ECS

Dashboards

Scale

Each service scales to its own load — an ingest spike never throttles dashboards.

Throughput

1.2M messages/min sustained across the fleet.
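
For a sense of scale, the headline figures can be converted into per-second and per-device rates. This is plain arithmetic on the numbers quoted in this case study (1.2M messages/min, 520k devices), and it assumes traffic is spread evenly across the fleet and over time, which real telemetry will not be.

```python
# Back-of-envelope conversion of the headline figures.
# Assumption: messages are spread evenly across devices and over time.
MESSAGES_PER_MIN = 1_200_000
DEVICES = 520_000

messages_per_sec = MESSAGES_PER_MIN / 60
per_device_per_min = MESSAGES_PER_MIN / DEVICES
seconds_between_messages = 60 / per_device_per_min

print(f"{messages_per_sec:,.0f} msg/s fleet-wide")
print(f"{per_device_per_min:.2f} msg/min per device")
print(f"one message every {seconds_between_messages:.0f} s per device")
```

On these averages the bus sees about 20,000 messages a second, with each device reporting roughly every 26 seconds.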

Latency

Sub-second fleet dashboards.

Why microservices

Ingest, processing, and query have different scaling shapes and deploy cadences — independent services keep them decoupled.

Delivery & environments: CI/CD

Source: git push · per service

CI: lint · typecheck · test

Build: Docker build → ECR

Deploy: ECS · rolling (per service)

dev → staging → prod

Each service has its own pipeline and ships on its own cadence — a fix to the ingestion service deploys without touching query or processing. All promote through the same dev → staging → prod environments behind a manual approval before prod.
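
The promotion rule described above can be sketched as a small state machine. This is a minimal illustration, not the actual pipeline configuration: the `Release` type, the `promote` function, and the version string are hypothetical names introduced here.

```python
from dataclasses import dataclass, field

# Promotion order taken from the text: dev -> staging -> prod.
ENVIRONMENTS = ["dev", "staging", "prod"]


@dataclass
class Release:
    service: str   # each service has its own pipeline, e.g. "ingestion"
    version: str   # hypothetical version label
    deployed_to: list[str] = field(default_factory=list)


def promote(release: Release, approved_for_prod: bool = False) -> str:
    """Deploy the release to the next environment in order.

    Raises PermissionError if the next stop is prod and no manual
    approval was given, and ValueError if it is already in prod.
    """
    next_index = len(release.deployed_to)
    if next_index >= len(ENVIRONMENTS):
        raise ValueError(f"{release.service} {release.version} is already in prod")
    target = ENVIRONMENTS[next_index]
    if target == "prod" and not approved_for_prod:
        raise PermissionError("prod deploy requires manual approval")
    release.deployed_to.append(target)
    return target


# A fix to one service moves through the environments on its own.
fix = Release(service="ingestion", version="1.4.2")
promote(fix)                          # dev
promote(fix)                          # staging
promote(fix, approved_for_prod=True)  # prod, only with approval
```

Because each `Release` carries its own state, promoting the ingestion fix says nothing about where query or processing currently sit.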

01 The problem

Northwind ran a fleet of grid sensors on a single monolithic time-series cluster that was both expensive and perpetually nearly full. Ingestion, alerting, and operator queries all contended for the same machines, so the cluster had to be provisioned for the sum of every peak at once — headroom that sat idle most of the day. Worse, the coupling meant a flood of inbound telemetry starved the read path: operators waited minutes for a dashboard to load during exactly the demand spikes when they needed it instantly.

02 The approach

We broke the monolith into services drawn along the natural seams of the workload, each a container on ECS that scales to its own load. An ingestion service terminates connections from the half-million-device fleet and does nothing else: validate, normalise, publish. It writes onto a Kafka event bus (MSK), which decouples the firehose of inbound telemetry from everything downstream and absorbs spikes as buffered backlog rather than dropped messages. A processing-and-alerting service consumes that stream, computes rolling aggregates, and raises threshold alerts; a separate query service owns the read path and serves operator dashboards from pre-rolled summaries. Each writes to a time-series store sized for its own access pattern. Because ingestion and query no longer share a machine, a telemetry surge scales the ingestion fleet horizontally without ever touching dashboard latency, and every service is deployed, versioned, and scaled independently.
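
The flow described above (validate, normalise, publish on one side; consume, aggregate, alert on the other) can be sketched in a few dozen lines. This is a minimal illustration under stated assumptions, not Northwind's code: an in-memory queue stands in for the Kafka topic, and the payload fields (`device_id`, `ts`, `value`, `unit`), the millivolt normalisation, and the rolling-mean alert rule are all hypothetical.

```python
from collections import defaultdict, deque
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Reading:
    device_id: str
    timestamp: float  # seconds since epoch
    value: float      # normalised unit (hypothetical: volts)


class InMemoryBus:
    """Stand-in for the Kafka topic: buffers a backlog instead of dropping."""

    def __init__(self) -> None:
        self._backlog: deque = deque()

    def publish(self, reading: Reading) -> None:
        self._backlog.append(reading)

    def poll(self) -> Optional[Reading]:
        return self._backlog.popleft() if self._backlog else None


def ingest(raw: dict, bus: InMemoryBus) -> bool:
    """Validate, normalise, publish. Returns False for rejected payloads."""
    try:
        device_id = str(raw["device_id"]).strip()
        timestamp = float(raw["ts"])
        value = float(raw["value"])
    except (KeyError, TypeError, ValueError):
        return False
    if not device_id:
        return False
    if raw.get("unit") == "mV":  # hypothetical: some devices report millivolts
        value /= 1000.0
    bus.publish(Reading(device_id, timestamp, value))
    return True


class Processor:
    """Consumes the stream, keeps a rolling mean per device, records alerts."""

    def __init__(self, window: int, threshold: float) -> None:
        self.threshold = threshold
        self._windows = defaultdict(lambda: deque(maxlen=window))
        self.alerts: list = []

    def drain(self, bus: InMemoryBus) -> None:
        while (reading := bus.poll()) is not None:
            window = self._windows[reading.device_id]
            window.append(reading.value)
            mean = sum(window) / len(window)
            if mean > self.threshold:
                self.alerts.append((reading.device_id, mean))


bus = InMemoryBus()
payloads = [
    {"device_id": "s-1", "ts": 0, "value": 230.0},
    {"device_id": "s-1", "ts": 1, "value": 250_000, "unit": "mV"},
    {"device_id": "", "ts": 2, "value": 1.0},  # rejected: empty device id
]
accepted = [ingest(p, bus) for p in payloads]

processor = Processor(window=2, threshold=235.0)
processor.drain(bus)
```

The ingestion function never waits on the processor: it only appends to the bus, so a burst of payloads grows the backlog rather than blocking producers, which is the decoupling the event bus provides.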

03 The outcome

The platform now absorbs over a million messages a minute across five hundred thousand devices, and each service rides its own autoscaler — the ingestion fleet expands with the surge while the query service holds steady. Operator dashboards render in under a second during peak load, because the read path no longer competes with ingestion for capacity. And cost per device fell ninety-one percent: instead of one over-provisioned cluster paying for the sum of every peak, each right-sized service scales up only when its own load demands it and back down the moment it passes.
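
The cost argument, one cluster sized for the sum of every peak versus services that each follow their own load, can be made concrete with a toy comparison. The load figures, the "capacity unit", and the 25% headroom factor below are invented for illustration; they are not Northwind's data and are not meant to reproduce the ninety-one percent figure.

```python
import math

# Hypothetical load per service over four time slots, in arbitrary
# capacity units. Illustrative numbers only.
LOAD = {
    "ingestion": [2, 2, 8, 3],
    "processing": [1, 1, 4, 2],
    "query": [1, 6, 2, 1],
}


def monolith_capacity(load: dict) -> int:
    """One shared cluster sized for the sum of every peak, held for every slot."""
    slots = len(next(iter(load.values())))
    return sum(max(series) for series in load.values()) * slots


def autoscaled_capacity(load: dict, headroom: float = 1.25) -> int:
    """Each service follows its own load in each slot, plus a headroom factor."""
    return sum(
        math.ceil(units * headroom)
        for series in load.values()
        for units in series
    )


static_total = monolith_capacity(LOAD)
scaled_total = autoscaled_capacity(LOAD)
print(f"static: {static_total} unit-slots, autoscaled: {scaled_total} unit-slots")
```

With these made-up profiles the static cluster pays for 72 unit-slots against 47 for the autoscaled services. The gap widens as peaks become shorter and less aligned across services.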

05 Start a project

Let’s write the next system into being

Tell us what you’re building. We’ll reply within two business days with a frank read on scope, shape, and whether we’re the right studio for it.
