Async Batch Processing for Manufacturing Telemetry and OEE Pipelines

Async batch processing is the stage of Ingestion & Cleaning Workflows that accumulates discrete telemetry payloads into deterministic, time- or event-bound windows, cleans them in place, and computes Overall Equipment Effectiveness (OEE) without blocking upstream ingestion. This page focuses on one narrow concern: how to seal a batch correctly so that every downstream metric is reproducible. Manufacturing networks deliver telemetry under variable sampling rates, intermittent RF conditions, and asynchronous PLC scan cycles, so the batch boundary — not the database — is where ordering, completeness, and temporal correctness must be guaranteed. Get the window contract wrong and availability windows drift, micro-stops inflate, and replayed data produces different numbers than the original run.

Window contract at a glance

A batch flush is governed by an explicit contract. The accumulator seals a window when any trigger fires, and every sealed batch carries enough metadata to be replayed byte-for-byte.

Window strategy	Flush trigger	Typical setting	Best for
Time-based (tumbling)	wall-clock elapsed since last flush	5 min / 15 min	OEE reporting cadence, shift rollups
Count-based	buffered events reach `max_size`	500 events	high, steady throughput lines
State-transition	`RUN` → `STOP` / `MAINTENANCE` seen	per-asset	downtime boundary precision
Hybrid	first of timeout, count, or transition	all three armed	bursty mixed-rate fleets
Grace period	window stays open after seal trigger	2 s	absorbing out-of-order late packets
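
The replay metadata that a sealed batch carries could be sketched as below. The `SealedBatchEnvelope` class, its field names, and the `seal_envelope` helper are hypothetical illustrations, not part of the pipeline described here; the sketch assumes each event is a JSON-serializable dict with `timestamp` and `sequence_id` keys.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class SealedBatchEnvelope:
    """Hypothetical metadata stored alongside a sealed batch."""
    asset_id: str
    window_start: float      # earliest source timestamp in the batch
    window_end: float        # latest source timestamp in the batch
    trigger: str             # "timeout", "count", or "transition"
    event_count: int
    content_sha256: str      # digest of the canonical, ordered payload list


def seal_envelope(asset_id: str, trigger: str,
                  events: list[dict]) -> SealedBatchEnvelope:
    """Order events by source time and fingerprint their canonical JSON form."""
    if not events:
        raise ValueError("cannot seal an empty batch")
    ordered = sorted(events, key=lambda e: (e["timestamp"], e["sequence_id"]))
    # sort_keys and fixed separators make the serialization independent of
    # dict insertion order, so the digest depends only on the content.
    canonical = json.dumps(ordered, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return SealedBatchEnvelope(
        asset_id=asset_id,
        window_start=ordered[0]["timestamp"],
        window_end=ordered[-1]["timestamp"],
        trigger=trigger,
        event_count=len(ordered),
        content_sha256=digest,
    )
```

Comparing the digest of a replayed batch with the stored one is a cheap way to confirm that a replay reproduced the original input, whatever order the packets arrived in.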

Core concept and design contract

The foundation of async batch processing is how telemetry windows are defined, materialized, and flushed. Factory-floor constraints dictate that batch boundaries respect PLC scan cycles, edge gateway memory limits, and shift-change timestamps. Time-based windows align with the 5-minute or 15-minute intervals used for OEE reporting, and count-based windows trigger when a fixed number of events accumulate. Hybrid windowing arms every trigger at once and flushes a partial batch on whichever fires first: the count limit, the timeout, or a detected state transition.

Three rules form the design contract the accumulator must honor:

Ordering by source time, not arrival time. Each payload is timestamped at the edge and tagged with a unique asset identifier following PLC tag standardization so that one accumulator can serve heterogeneous assets. The buffer is sorted on source timestamps before sealing, because OEE state aggregation depends on contiguous state sequences rather than network arrival order.
Bounded lateness via a grace period. Network jitter and broker failover cause out-of-order delivery. A configurable grace period keeps the window open briefly after its seal trigger so late-arriving packets land in the correct batch instead of polluting the next one. This mirrors event-time watermarking in stream processors: events later than the watermark are dropped to a side channel, never silently merged.
Per-asset isolation. A stalled or chatty asset must never delay or starve another. Accumulators are keyed per asset (or per ISA-95 work unit) so backpressure is local, not global.

These rules align with the at-least-once delivery semantics of QoS 1 for discrete state transitions: duplicates are expected, so the batch must deduplicate on (asset_id, sequence_id) before computing counters.
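
One way to read the lateness and dedup rules together is as a per-event classification step. The sketch below is illustrative only: `Disposition` and `classify` are hypothetical names, the window is treated as the half-open source-time interval `[window_start, window_end)`, and arrival and source timestamps are assumed to share one drift-corrected clock.

```python
from enum import Enum


class Disposition(Enum):
    ACCEPT = "accept"            # belongs to the open window and arrived in time
    LATE = "late"                # belongs to a sealed window: send to side channel
    NEXT_WINDOW = "next_window"  # source time is past the end of the open window
    DUPLICATE = "duplicate"      # QoS 1 redelivery of an already-seen event


def classify(asset_id: str, sequence_id: int, source_ts: float,
             arrival_ts: float, window_start: float, window_end: float,
             grace_period: float, seen: set[tuple[str, int]]) -> Disposition:
    """Decide what to do with one event relative to the currently open window."""
    key = (asset_id, sequence_id)
    if key in seen:
        return Disposition.DUPLICATE
    seen.add(key)
    if source_ts >= window_end:
        return Disposition.NEXT_WINDOW
    watermark = window_end + grace_period   # last arrival time still admitted
    if source_ts < window_start or arrival_ts > watermark:
        return Disposition.LATE
    return Disposition.ACCEPT
```

Keeping the decision in a pure function makes the contract easy to unit-test separately from the asyncio plumbing.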

Implementation

Python automation builders implement these windows with asyncio queues and a sliding buffer. Each incoming payload is enqueued, drained into a per-asset buffer, sorted on its edge timestamp, and the buffer is sealed when the flush contract is satisfied. The accumulator below handles the 80% case: a single asset stream with hybrid time/count flushing and a grace period for late packets.

import asyncio
import time
import logging
from dataclasses import dataclass, field
from typing import Callable, Awaitable

logger = logging.getLogger("async_batch")


@dataclass(slots=True)
class TelemetryPayload:
    asset_id: str
    timestamp: float       # edge-generated epoch seconds (UTC)
    sequence_id: int       # monotonic per-asset counter for dedup
    state: str             # e.g. RUN / STOP / MAINTENANCE
    value: float


@dataclass
class AsyncBatchAccumulator:
    """Per-asset accumulator with hybrid flush and late-arrival grace period."""
    flush_cb: Callable[[list[TelemetryPayload]], Awaitable[None]]
    max_size: int = 500
    timeout_sec: float = 30.0
    grace_period: float = 2.0
    _buffer: list[TelemetryPayload] = field(default_factory=list)
    _seen: set[int] = field(default_factory=set)
    _last_flush: float = field(default_factory=time.monotonic)
    _queue: asyncio.Queue[TelemetryPayload] = field(default_factory=asyncio.Queue)

    async def ingest(self, payload: TelemetryPayload) -> None:
        await self._queue.put(payload)

    async def run(self) -> None:
        while True:
            try:
                payload = await asyncio.wait_for(self._queue.get(), timeout=1.0)
                if payload.sequence_id in self._seen:
                    continue                       # QoS 1 duplicate — drop
                self._seen.add(payload.sequence_id)
                self._buffer.append(payload)
            except asyncio.TimeoutError:
                pass                               # idle tick: still re-check flush

            if self._should_flush():
                await self._seal_and_flush()

    def _should_flush(self) -> bool:
        if not self._buffer:
            return False
        if len(self._buffer) >= self.max_size:
            return True
        if time.monotonic() - self._last_flush >= self.timeout_sec:
            return True
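        # State-transition trigger: seal once the buffer spans both a producing
        # and a down state. Checking only the newest payload would re-trigger a
        # flush on every packet of a long STOP run.
        down_states = {"STOP", "MAINTENANCE"}
        states = {p.state for p in self._buffer}
        if states & down_states and states - down_states:
            return True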
        return False

    async def _seal_and_flush(self) -> None:
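        # Grace period: keep the window open for `grace_period` seconds and admit
        # whatever arrives. Simplification: arrivals are not checked against the
        # window's source-time bounds, so a packet for the next window can land here.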
        deadline = time.monotonic() + self.grace_period
        while time.monotonic() < deadline:
            try:
                late = self._queue.get_nowait()
                if late.sequence_id not in self._seen:
                    self._seen.add(late.sequence_id)
                    self._buffer.append(late)
            except asyncio.QueueEmpty:
                await asyncio.sleep(0.05)
        # Seal: order by source time so OEE sees a contiguous state sequence.
        batch = sorted(self._buffer, key=lambda p: p.timestamp)
        self._buffer.clear()
        self._seen.clear()
        self._last_flush = time.monotonic()
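        try:
            await self.flush_cb(batch)             # hand sealed batch downstream
        except Exception:
            # Nothing here writes to a DLQ: the buffer is already cleared, so the
            # caller of run() must catch this and persist or dead-letter the batch.
            logger.exception("batch flush failed; %d payloads undelivered", len(batch))
            raise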

The flush_cb hands the sealed, ordered batch to the cleaning and OEE stages described below. Because sorted() returns a new list, the buffer can be cleared as soon as the batch is sealed, and ingest() stays non-blocking while the (potentially slow) downstream callback runs: new payloads wait in the queue until run() resumes draining it.

Cleaning and quality gates inside the batch

Once a batch is sealed it enters the cleaning phase, where pipelines enforce data-quality thresholds before any metric is computed. Raw telemetry frequently contains missing intervals from sensor polling failures, gateway reboots, or electromagnetic interference; availability math cannot proceed until those voids are reconstructed. Reconstruction is delegated to gap-filling algorithms that interpolate linearly between known states, forward-fill machine status codes, or apply cycle-time heuristics — all operating deterministically so replayed batches produce identical OEE.
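
Forward-filling machine status codes, one of the strategies named above, can be sketched as follows. This is a minimal illustration rather than the pipeline's gap-filling routine: the function name, the fixed sampling `step`, the `max_gap` cutoff, and the `"UNKNOWN"` placeholder state are all assumptions.

```python
def forward_fill_states(samples: list[tuple[float, str]], step: float,
                        max_gap: float) -> list[tuple[float, str, bool]]:
    """Fill missing grid points by carrying the last known state forward.

    Returns (timestamp, state, filled) tuples; `filled` marks synthetic points.
    Points further than `max_gap` seconds from the last real sample are
    labelled "UNKNOWN" instead of inheriting a possibly stale state.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    ordered = sorted(samples)
    out: list[tuple[float, str, bool]] = []
    for idx, (ts, state) in enumerate(ordered):
        out.append((ts, state, False))
        if idx + 1 == len(ordered):
            break
        next_ts = ordered[idx + 1][0]
        k = 1
        # Multiply rather than accumulate so rounding error does not build up.
        while ts + k * step < next_ts - 1e-9:
            filled_ts = ts + k * step
            carried = state if filled_ts - ts <= max_gap else "UNKNOWN"
            out.append((filled_ts, carried, True))
            k += 1
    return out
```

The output depends only on the input samples and the two parameters, which is the property that lets a replayed batch produce the same OEE.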

Simultaneously, sensor noise and transient electrical faults inject spurious readings that skew performance calculations. The batch runs outlier detection methods — rolling Z-score, interquartile-range filtering, or Hampel identifiers — in memory over the sealed buffer, flagging anomalies before they reach the OEE state machine. Flagged values are set to NaN so the gap-filling routine, not the outlier filter, owns replacement.

import numpy as np


def rolling_zscore_mask(values: np.ndarray, window: int = 10,
                        threshold: float = 3.0) -> tuple[np.ndarray, np.ndarray]:
    """Flag outliers with a rolling Z-score; flagged points become NaN.

    Returns the cleaned array and a boolean mask of detected outliers.
    Detection only; imputation is deferred to the gap-filling stage.
    """
    cleaned = values.astype(float).copy()
    mask = np.zeros(len(values), dtype=bool)
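    # Statistics come from the `window` points *before* i. If point i were part
    # of its own window, the population Z-score could never exceed
    # sqrt(window - 1), which is exactly 3.0 for window=10, so nothing would flag.
    for i in range(window, len(values)):
        win = values[i - window: i]
        mu, sigma = win.mean(), win.std()
        z = abs(values[i] - mu) / (sigma + 1e-9)   # epsilon avoids /0 on flat signals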
        if z > threshold:
            mask[i] = True
            cleaned[i] = np.nan
    return cleaned, mask
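
The Hampel identifier mentioned above swaps the mean and standard deviation for the median and the median absolute deviation (MAD), which makes it less sensitive to the outliers it is hunting. A dependency-free sketch, assuming a centered window of `half_window` points on each side and the conventional 1.4826 scale factor for Gaussian data (the function name and defaults are illustrative):

```python
from statistics import median


def hampel_mask(values: list[float], half_window: int = 3,
                n_sigmas: float = 3.0) -> list[bool]:
    """Return True at every index whose value deviates from the local median
    by more than n_sigmas scaled MADs. Detection only; no values are replaced."""
    scale = 1.4826  # makes MAD a consistent estimate of sigma for Gaussian data
    mask = [False] * len(values)
    for i in range(len(values)):
        lo = max(0, i - half_window)
        hi = min(len(values), i + half_window + 1)
        window = values[lo:hi]
        med = median(window)
        mad = median(abs(v - med) for v in window)
        if abs(values[i] - med) > n_sigmas * scale * mad:
            mask[i] = True
    return mask
```

On a perfectly flat window the MAD is zero, so any point that differs from the median at all is flagged. Whether that is desirable depends on the sensor's quantization.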

Edge devices and PLCs rarely stay synchronized with enterprise NTP/PTP servers, so before sealing, the batch applies clock drift correction: each edge timestamp is offset against a trusted reference and a linear drift model is applied across the window. Skipping this step produces phantom state transitions and misaligned shift boundaries that no amount of downstream cleaning can recover.
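
A linear drift model of this kind can be sketched with two synchronization points. The function below is an assumption-laden illustration: it supposes the offset between reference time and edge time (`reference - edge`) was measured at two edge timestamps bracketing the window, and it interpolates linearly between them.

```python
from typing import Callable


def make_drift_corrector(edge_t0: float, offset0: float,
                         edge_t1: float, offset1: float) -> Callable[[float], float]:
    """Build a function mapping an edge timestamp to reference time.

    offset0 and offset1 are (reference - edge) measured at edge_t0 and edge_t1.
    """
    if edge_t1 <= edge_t0:
        raise ValueError("sync points must be in increasing edge-time order")
    rate = (offset1 - offset0) / (edge_t1 - edge_t0)   # seconds of drift per second

    def correct(edge_ts: float) -> float:
        return edge_ts + offset0 + rate * (edge_ts - edge_t0)

    return correct
```

For example, with a clock that loses one second over a 100-second window, `make_drift_corrector(0.0, 0.0, 100.0, 1.0)(50.0)` moves the midpoint forward by about half a second.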

OEE computation on deterministic boundaries

OEE is the product of Availability, Performance, and Quality, and async batch processing exists so that each factor is calculated over deterministic, non-overlapping windows. Availability follows the canonical form:

$A = \frac{\text{Planned Production Time} - \text{Downtime}}{\text{Planned Production Time}}$

Availability derives from state-duration aggregation (RUN vs STOP vs MAINTENANCE), Performance compares actual cycle times against the ideal cycle time, and Quality removes scrap or rework events flagged during cleaning. The batch processor materializes a ShiftOEE record by iterating the cleaned, time-aligned sequence; state transitions are validated against ISA-95 equipment models, and the exact factor definitions are pinned by OEE formula validation so that two engineers computing the same shift get the same number. Partial batches at shift boundaries are explicitly carried forward or truncated according to shift boundary logic, eliminating the metric drift that ad-hoc boundary handling introduces.
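
As a worked illustration, the sketch below derives downtime from an ordered state sequence and multiplies the three factors. Availability follows the formula above; the Performance and Quality expressions are the commonly used forms (ideal cycle time times total count over run time, and good count over total count) and are assumptions here, since the authoritative definitions live in OEE formula validation. All function names are hypothetical.

```python
def downtime_seconds(events: list[tuple[float, str]], window_end: float,
                     run_states: frozenset = frozenset({"RUN"})) -> float:
    """Sum the time spent in non-run states, given source-time-ordered
    (timestamp, state) events. Each state lasts until the next event."""
    total = 0.0
    boundaries = events[1:] + [(window_end, "")]
    for (ts, state), (next_ts, _) in zip(events, boundaries):
        if state not in run_states:
            total += next_ts - ts
    return total


def compute_oee(planned_s: float, downtime_s: float, ideal_cycle_s: float,
                total_count: int, good_count: int) -> dict[str, float]:
    """Return availability, performance, quality, and their product."""
    run_time = planned_s - downtime_s
    availability = run_time / planned_s if planned_s > 0 else 0.0
    performance = ideal_cycle_s * total_count / run_time if run_time > 0 else 0.0
    quality = good_count / total_count if total_count > 0 else 0.0
    return {
        "availability": availability,
        "performance": performance,
        "quality": quality,
        "oee": availability * performance * quality,
    }
```

With a 900-second window containing a 120-second stop, a 2-second ideal cycle, 312 parts produced, and 234 good parts, the factors are roughly 0.867, 0.8, and 0.75, for an OEE of 0.52.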

Edge cases and failure modes

Factory networks are inherently unstable; gateway reboots, PLC communication drops, and sensor calibration cycles are expected operational events, not exceptions. A robust accumulator degrades gracefully rather than failing catastrophically:

PLC heartbeat loss. When a controller stops responding, the pipeline transitions the asset to a COMMUNICATION_LOSS state, forward-fills the last known good value with a quality flag, and continues OEE aggregation with a reduced-confidence marker — rather than treating the gap as zero production.
Late packets beyond the grace period. Packets arriving after the window seals are routed to a side channel keyed by their target window, never merged into the live batch, so sealed metrics stay immutable.
Duplicate delivery (QoS 1). At-least-once brokers redeliver on reconnect; dedup on (asset_id, sequence_id) prevents double-counted parts and inflated performance numbers.
Scan-cycle misalignment. When a PLC scan cycle is slower than the count-based trigger, the time-based and state-transition triggers must still seal the window so a slow asset does not buffer indefinitely.
Broker failover mid-window. In-flight buffers should spill to local NVMe or object storage and replay on reconnect; the accumulator’s idempotent sealing makes replayed batches safe.

@dataclass
class PLCStateMonitor:
    heartbeat_timeout: float = 5.0
    _last_seen: dict[str, float] = field(default_factory=dict)

    def update(self, asset_id: str, ts: float) -> None:
        self._last_seen[asset_id] = ts

    def is_live(self, asset_id: str, now: float) -> bool:
        return (now - self._last_seen.get(asset_id, 0.0)) <= self.heartbeat_timeout
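
The heartbeat-loss handling described in the list above (switch to COMMUNICATION_LOSS, forward-fill the last good value, attach a quality flag) could be sketched like this. The record layout, the `quality` field values, and the function name are assumptions made for illustration.

```python
def insert_comm_loss(samples: list[dict], heartbeat_timeout: float,
                     window_end: float) -> list[dict]:
    """Insert a synthetic COMMUNICATION_LOSS record wherever the silence after
    a sample exceeds heartbeat_timeout.

    `samples` must be ordered by source time and carry "timestamp", "state",
    and "value" keys. Real samples are tagged quality="good"; synthetic ones
    carry the last known value and quality="forward_filled".
    """
    out: list[dict] = []
    for i, sample in enumerate(samples):
        out.append({**sample, "quality": "good"})
        next_ts = samples[i + 1]["timestamp"] if i + 1 < len(samples) else window_end
        if next_ts - sample["timestamp"] > heartbeat_timeout:
            out.append({
                "timestamp": sample["timestamp"] + heartbeat_timeout,
                "state": "COMMUNICATION_LOSS",
                "value": sample["value"],       # last known good value
                "quality": "forward_filled",    # reduced-confidence marker
            })
    return out
```

Downstream aggregation can then decide how to weight `forward_filled` intervals instead of silently counting them as zero production.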

Verification and testing

A batch pipeline is only trustworthy if its determinism is provable. The single most valuable test feeds the same payloads in shuffled arrival order and asserts the sealed batch — and the resulting OEE — is identical, proving that ordering and dedup work end to end.

import pytest  # @pytest.mark.asyncio requires the pytest-asyncio plugin


@pytest.mark.asyncio
async def test_seal_is_order_invariant():
    sealed: list[list[TelemetryPayload]] = []

    async def capture(batch): sealed.append(batch)

    base = [TelemetryPayload("LINE-1.PRESS-3", 1000.0 + i, i, "RUN", float(i))
            for i in range(5)]
    base.append(TelemetryPayload("LINE-1.PRESS-3", 1005.0, 5, "STOP", 0.0))

    acc = AsyncBatchAccumulator(flush_cb=capture, max_size=100, grace_period=0.1)
    runner = asyncio.create_task(acc.run())
    for p in reversed(base):                 # arrive out of order
        await acc.ingest(p)
    await acc.ingest(base[2])                 # duplicate sequence_id
    await asyncio.sleep(1.0)                  # allow STOP-triggered flush
    runner.cancel()

    assert len(sealed) == 1
    ts = [p.timestamp for p in sealed[0]]
    assert ts == sorted(ts)                   # source-time order restored
    assert len(sealed[0]) == 6                # duplicate dropped

Beyond unit tests, confirm behavior against live infrastructure: inspect broker logs for redelivery counts that should match your dedup-drop metric, and validate sealed batches with a TSDB query that checks no window overlaps and no gaps exceed a tolerance, for example in TimescaleDB:

-- Detect overlapping or missing 15-minute OEE windows per asset
SELECT asset_id, window_start,
       window_start - lag(window_end) OVER (PARTITION BY asset_id
                                            ORDER BY window_start) AS gap
FROM shift_oee
ORDER BY asset_id, window_start;

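A positive gap flags a dropped batch and a negative gap flags overlapping windows; contiguous windows produce a gap of zero, and the first row for each asset is NULL because it has no predecessor.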
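
The same continuity check can run in a unit test without a database. A small sketch, assuming windows are `(window_start, window_end)` pairs in epoch seconds for a single asset; the function name and the `tolerance` parameter are illustrative.

```python
def audit_windows(windows: list[tuple[float, float]],
                  tolerance: float = 0.0) -> list[tuple[float, str, float]]:
    """Compare each window's start with the previous window's end.

    Returns (window_start, kind, seconds) findings, where kind is "gap" when
    data is missing between windows and "overlap" when windows intersect.
    """
    findings: list[tuple[float, str, float]] = []
    ordered = sorted(windows)
    for (_prev_start, prev_end), (start, _end) in zip(ordered, ordered[1:]):
        delta = start - prev_end
        if delta > tolerance:
            findings.append((start, "gap", delta))
        elif delta < -tolerance:
            findings.append((start, "overlap", -delta))
    return findings
```

An empty result means every window starts where the previous one ended, within the tolerance.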

Performance and scale considerations

As sensor density grows across production lines, single-process accumulators become bottlenecks. Horizontal scaling decouples ingestion from cleaning and computation by routing sealed batches to a task queue and a worker pool. The companion recipe, using Celery for high-throughput MQTT ingestion, shows how to fan sealed batches into dedicated queues, apply backpressure via broker prefetch limits, and scale workers on queue depth or CPU. Setting task_acks_late=True and task_reject_on_worker_lost=True defers the broker acknowledgement until the task has finished and requeues a batch whose worker process dies mid-task, which prevents metric loss during pod restarts or network partitions. Redelivery means the OEE task can run more than once for the same batch, so it must be idempotent. With Redis or RabbitMQ as broker, this pattern can support thousands of concurrent asset streams at sub-second flush latency.

Memory is bounded by max_size times the number of live assets plus the dedup set, so prefer count-based flushing on high-rate lines to cap buffer growth, and clear the dedup set on every seal (as above) to keep it from growing without limit. Persisting sealed batches to a time-series database with a deterministic write key (asset_id + window_start) makes ingestion idempotent under replay, and per-asset partitioning keeps query and retention costs proportional to fleet size. Production hardening means structured logging, distributed tracing, and Prometheus metrics for batch latency, queue depth, gap-fill ratio, and outlier-rejection rate, with dead-letter queues capturing malformed payloads and alerts firing on sustained queue growth or PLC connectivity degradation.
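
The deterministic write key can be illustrated with an in-memory stand-in for the time-series store. `IdempotentStore` is a hypothetical class, not a real database client; in a real TSDB the same effect would come from an upsert on a unique `(asset_id, window_start)` constraint.

```python
class IdempotentStore:
    """In-memory stand-in for a store keyed by (asset_id, window_start)."""

    def __init__(self) -> None:
        self._rows: dict[tuple[str, float], dict] = {}

    def upsert(self, asset_id: str, window_start: float, record: dict) -> bool:
        """Write the record under its deterministic key.

        Returns True when the key is new and False when a replay overwrote an
        existing row, so callers can count replays without creating duplicates.
        """
        key = (asset_id, window_start)
        is_new = key not in self._rows
        self._rows[key] = record
        return is_new

    def __len__(self) -> int:
        return len(self._rows)
```

Replaying a sealed batch then leaves the row count unchanged, which is what makes at-least-once delivery from the task queue safe.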

Related

Ingestion & Cleaning Workflows — parent overview of the full ingestion and cleaning pipeline
Using Celery for high-throughput MQTT ingestion — scaling sealed batches across a distributed worker pool
Clock drift correction — temporal alignment applied before a batch is sealed
Gap-filling algorithms — deterministic reconstruction of missing intervals
Outlier detection methods — in-batch anomaly filtering before OEE computation
OEE formula validation — pinning the Availability, Performance, and Quality definitions the batch computes
