Downtime Classification & OEE Calculation Pipelines for Industrial Telemetry

Overall Equipment Effectiveness is only as trustworthy as the machine-state stream it is computed from, yet most failures in OEE reporting originate long before the multiplication of Availability, Performance, and Quality. They originate in the classification layer, where raw PLC signals are turned into discrete production and downtime intervals. Naive approaches that forward unconditioned digital inputs straight into a time-series database collapse under real factory conditions: microstops bleed into availability counters, shift handoffs fragment production runs, daylight-saving transitions double-count intervals, and asynchronous MES events arrive out of order. This guide covers the engineering patterns that turn noisy telemetry into deterministic, auditable downtime classification and OEE math. It is one of the three core sections of this site, alongside Core Architecture & Data Mapping and Ingestion & Cleaning Workflows; start from the site index for the full pipeline.

The classification layer as an ordered chain: raw tags are resolved to canonical states, microstops are aggregated, intervals are sliced to shifts, and only then does OEE math run behind a validation gate before publishing — with failed or missing data routed to a fallback that reconciles back into the metric.

The subsystems below run as an ordered chain. Each one defines a contract, ships a production-grade implementation, and then enumerates the failure modes that break it on a live floor. Upstream signal hygiene from PLC tag standardization and time-series database sync is assumed; the classification layer cannot recover information that ingestion has already corrupted.

Deterministic State Resolution from Raw Telemetry

Raw machine telemetry arrives as discrete digital tags, analog process variables, and high-frequency encoder pulses. The first contract this layer must satisfy is deterministic state resolution: identical inputs must always resolve to the same canonical state, regardless of arrival order or evaluation host. PLCs typically expose a limited set of status bits (Running, Faulted, Idle, Maintenance), but these rarely map cleanly to the Availability, Performance, and Quality dimensions OEE requires. A robust pipeline implements a finite state machine (FSM) at the edge or in the ingestion layer to normalize these signals into canonical manufacturing states.

The edge-side finite state machine. Maintenance and fault guards take precedence over run and idle conditions, so a planned window suppresses unplanned-downtime counting and a fault overrides an idle reading — the determinism that every downstream OEE metric depends on.

The contract requires debouncing noisy transitions, applying hysteresis to analog thresholds, and establishing a deterministic precedence order when multiple status bits assert simultaneously. A fault condition must override an idle state, and a planned maintenance window must suppress unplanned downtime counters. The transition from raw tag arrays to structured state events is where event-to-downtime mapping becomes the critical control point for every downstream metric.

from dataclasses import dataclass
from datetime import datetime
from enum import Enum

class MachineState(Enum):
    RUNNING = "RUNNING"
    IDLE = "IDLE"
    FAULTED = "FAULTED"
    PLANNED_MAINTENANCE = "PLANNED_MAINTENANCE"
    UNPLANNED_DOWNTIME = "UNPLANNED_DOWNTIME"

@dataclass
class StateTransition:
    asset_id: str
    timestamp: datetime
    previous_state: MachineState
    current_state: MachineState
    raw_tags: dict

def resolve_state(raw_tags: dict, current_state: MachineState) -> MachineState:
    """Deterministic FSM with explicit precedence hierarchy."""
    if raw_tags.get("maintenance_mode"):
        return MachineState.PLANNED_MAINTENANCE
    if raw_tags.get("fault_active"):
        return MachineState.FAULTED
    if raw_tags.get("running_command") and raw_tags.get("motor_current", 0) > 0.5:
        return MachineState.RUNNING
    if raw_tags.get("idle_timer", 0) > 30:
        return MachineState.IDLE
    return current_state  # Hold previous state on ambiguous input

Edge cases and failure modes. Two-state PLCs that expose only a Running bit cannot distinguish planned from unplanned stops, so the FSM must enrich state with an external maintenance calendar before classification. Chattering inputs (a fault bit toggling every PLC scan) will flood the transition log unless transitions are debounced, for example by committing a new state only after it has persisted for a minimum dwell; holding the previous state on ambiguous input, as shown above, covers only the case where no guard matches. Analog thresholds without hysteresis (a single motor_current > 0.5 comparison) flap continuously when the signal sits near the boundary; widen to a dual-threshold band (enter at 0.5, exit at 0.3). Finally, the resolver must be a pure function of (raw_tags, current_state) so the same event replayed from a dead-letter queue yields an identical state, with no dependence on wall-clock time inside the function body.
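
The dual-threshold band and the debounce step can be sketched as two small pure functions. This is a minimal illustration, not the resolver used above: the names motor_is_running and debounce, and the three-sample dwell, are assumptions introduced here; only the 0.5 and 0.3 thresholds come from the text.

```python
from typing import Optional, Tuple

ENTER_AMPS = 0.5  # enter threshold quoted in the text
EXIT_AMPS = 0.3   # exit threshold quoted in the text


def motor_is_running(current_amps: float, was_running: bool) -> bool:
    """Hysteresis: switch on above ENTER_AMPS, switch off only below EXIT_AMPS."""
    if was_running:
        # Already running: a dip into the 0.3 to 0.5 band does not flip the flag.
        return current_amps >= EXIT_AMPS
    # Not running: the signal must clear the higher threshold to switch on.
    return current_amps > ENTER_AMPS


def debounce(
    candidate: str,
    committed: str,
    pending: Optional[str],
    pending_count: int,
    min_samples: int = 3,  # hypothetical dwell, counted in samples rather than wall-clock time
) -> Tuple[str, Optional[str], int]:
    """Commit a new state only after it repeats for min_samples consecutive samples.

    Returns the updated (committed, pending, pending_count) triple, so the caller
    owns all state and the function stays pure.
    """
    if candidate == committed:
        # The input agrees with the committed state: discard any pending change.
        return committed, None, 0
    count = pending_count + 1 if candidate == pending else 1
    if count >= min_samples:
        return candidate, None, 0
    return committed, candidate, count
```

Counting samples instead of reading a clock keeps both functions replayable: the same sequence of samples always produces the same sequence of committed states.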

Microstop Aggregation & Threshold Engineering

High-speed packaging lines and CNC machining centers generate hundreds of sub-second stops that, logged indiscriminately, artificially depress availability. The contract for this subsystem is to separate process-induced microstops from genuine macrostops that require operator intervention, without discarding the true production losses in between. The mechanism is a configurable duration threshold combined with state-persistence logic.

A common pattern implements a sliding window that aggregates consecutive idle or fault transitions below a defined duration — typically thirty to one hundred twenty seconds depending on asset class. When the aggregated duration exceeds the threshold, the pipeline promotes the sequence to a classified downtime event. This prevents telemetry noise from triggering false alerts while preserving real losses. Correct threshold tuning for microstops ensures that cycle-to-cycle variation does not corrupt the baseline availability window.

# downtime_thresholds.yaml
microstop_config:
  max_duration_sec: 45
  aggregation_window_sec: 120
  state_filters: ["IDLE", "FAULTED"]
  promotion_rule: "SUM_DURATION > max_duration_sec"
  fallback_behavior: "LOG_TO_METRICS_ONLY"

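import pandas as pd

def aggregate_microstops(events_df: pd.DataFrame, threshold_sec: int = 45) -> pd.DataFrame:
    """Merge consecutive non-running events per asset; keep stops above the threshold.

    Each row is a state transition, so a stop lasts until the next event for that asset.
    """
    df = events_df.sort_values(["asset_id", "timestamp"]).copy()
    df["is_running"] = df["state"] == "RUNNING"
    # End of each event = timestamp of the asset's next event (NaT for the last row,
    # so a stop that is still open is under-counted or dropped until it closes).
    df["next_ts"] = df.groupby("asset_id")["timestamp"].shift(-1)
    # Counting RUNNING rows per asset gives every row of one stop the same group id.
    df["group_id"] = df["is_running"].astype(int).groupby(df["asset_id"]).cumsum()

    aggregated = (
        df[~df["is_running"]]
        .groupby(["asset_id", "group_id"])
        .agg(start=("timestamp", "min"), end=("next_ts", "max"))
        .reset_index()
    )
    aggregated["duration"] = (aggregated["end"] - aggregated["start"]).dt.total_seconds()
    return aggregated[aggregated["duration"] > threshold_sec]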

Edge cases and failure modes. A single global threshold is wrong: a conveyor motor and a high-speed cartoner have microstop profiles that differ by an order of magnitude, so thresholds belong per asset class, not per site. The cumsum grouping above assumes a clean monotonic timestamp index — if clock drift correction has not run upstream, out-of-order rows will split one stop into two groups and under-count duration. Aggregation that spans a shift boundary will attribute a stop to the wrong period unless slicing (next section) runs first. And promoting a stop to a macrostop must be idempotent: replaying the same window must not emit a second downtime event with a fresh ID.
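
Per-asset-class thresholds and idempotent promotion can be combined in a few lines. The sketch below is illustrative only: the class names, the threshold values, and the helpers threshold_for, downtime_event_id, and promote are assumptions, and the stop dictionary simply mirrors the asset_id, start, and duration columns produced by the aggregation above.

```python
import hashlib
from datetime import datetime, timezone
from typing import Optional

# Hypothetical per-asset-class thresholds in seconds; the values are illustrative only.
THRESHOLDS_SEC = {"conveyor": 120, "cartoner": 12}
DEFAULT_THRESHOLD_SEC = 45  # same value as max_duration_sec in the YAML above


def threshold_for(asset_class: str) -> int:
    """Look up the class threshold, falling back to the default."""
    return THRESHOLDS_SEC.get(asset_class, DEFAULT_THRESHOLD_SEC)


def downtime_event_id(asset_id: str, start: datetime) -> str:
    """Derive the ID from content so a replay of the same stop yields the same ID.

    `start` must be timezone-aware; it is normalized to UTC before hashing.
    """
    start_utc = start.astimezone(timezone.utc).isoformat()
    digest = hashlib.sha256(f"{asset_id}|{start_utc}".encode("utf-8")).hexdigest()
    return digest[:16]


def promote(stop: dict, asset_class: str, emitted_ids: set) -> Optional[dict]:
    """Return a downtime event for a new macrostop, or None if below threshold or already emitted."""
    if stop["duration"] <= threshold_for(asset_class):
        return None
    event_id = downtime_event_id(stop["asset_id"], stop["start"])
    if event_id in emitted_ids:
        return None  # replayed window: the event already exists
    emitted_ids.add(event_id)
    return {"event_id": event_id, **stop}
```

Because the ID is a hash of the asset and the stop's start instant rather than a generated UUID, replaying a window cannot mint a second event for the same stop. In a real deployment the emitted_ids set would live in durable storage.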

Temporal Alignment & Shift Boundary Handling

Production windows rarely align with UTC midnight. Shift handoffs, timezone conversions, and daylight-saving transitions introduce fragmentation that corrupts daily OEE rollups. The contract here is to slice continuous telemetry into discrete, auditable production windows without truncating active states or double-counting overlapping intervals.

The implementation requires timezone-aware datetime handling, explicit shift-calendar definitions, and boundary reconciliation. When an event spans two shifts, the pipeline splits the duration and assigns each fragment to the correct operational period. Robust shift boundary logic prevents metric leakage during DST changes and keeps MES production orders aligned with PLC telemetry windows.
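
To see why timezone-aware handling alone is not enough, the sketch below compares wall-clock subtraction with UTC-normalized subtraction across the 3 November 2024 fall-back transition in America/Chicago. It uses only the standard library; the helper names wall_clock_seconds and elapsed_seconds are introduced here for illustration.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

CHICAGO = ZoneInfo("America/Chicago")


def wall_clock_seconds(start: datetime, end: datetime) -> float:
    """Plain subtraction; when both values share one tzinfo, Python compares wall-clock fields only."""
    return (end - start).total_seconds()


def elapsed_seconds(start: datetime, end: datetime) -> float:
    """True elapsed time: normalize both instants to UTC before subtracting."""
    start_utc = start.astimezone(timezone.utc)
    end_utc = end.astimezone(timezone.utc)
    return (end_utc - start_utc).total_seconds()


# A stop that begins before the repeated hour and ends after it.
stop_start = datetime(2024, 11, 3, 0, 30, tzinfo=CHICAGO)  # 00:30 CDT
stop_end = datetime(2024, 11, 3, 2, 30, tzinfo=CHICAGO)    # 02:30 CST

print(wall_clock_seconds(stop_start, stop_end))  # 7200.0, two hours on the wall clock
print(elapsed_seconds(stop_start, stop_end))     # 10800.0, three hours actually elapsed
```

The one-hour difference is exactly the repeated local hour, which is why duration arithmetic is safest on UTC instants even when shift boundaries are defined in local time.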

from zoneinfo import ZoneInfo
import pandas as pd

def slice_to_shifts(
    events_df: pd.DataFrame,
    shift_calendar: list[dict],
    tz_name: str = "America/Chicago",
) -> pd.DataFrame:
    """Split events crossing shift boundaries into discrete production windows.

    Event start/end must be timezone-aware; shift start/end are local times in tz_name.
    """
    tz = ZoneInfo(tz_name)
    results = []

    for _, event in events_df.iterrows():
        start, end = event["start"].astimezone(tz), event["end"].astimezone(tz)
        for shift in shift_calendar:
            shift_start = pd.Timestamp(shift["start"], tz=tz)
            shift_end = pd.Timestamp(shift["end"], tz=tz)

            # Clip the event to the shift; an empty intersection fails the guard below.
            overlap_start = max(start, shift_start)
            overlap_end = min(end, shift_end)

            if overlap_start < overlap_end:
                results.append({
                    "asset_id": event["asset_id"],
                    "shift_id": shift["id"],
                    "start": overlap_start,
                    "end": overlap_end,
                    "duration_sec": (overlap_end - overlap_start).total_seconds(),
                    "state": event["state"]
                })
    return pd.DataFrame(results)

Edge cases and failure modes. During a fall-back DST transition the local wall clock repeats one hour; naive local-time arithmetic miscounts that hour, so all interval math must operate on absolute instants (UTC or timezone-aware timestamps), with local time used only to define shift boundaries and for display. In a spring-forward transition one local hour does not exist, so a shift boundary that falls inside it must be resolved explicitly in the calendar; any zero-length or negative overlap that results is dropped by the overlap_start < overlap_end guard. An event that crosses three shifts (a weekend outage spanning shift changes) must fan out into one fragment per shift, not be clipped to the first match; the inner loop above handles this only if the calendar enumerates every shift in range. Overlapping shift definitions will double-assign duration, and gaps between shifts will silently drop it, so the calendar must be validated for both before slicing.
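
A calendar check of that kind might look like the following. It is a sketch under stated assumptions: validate_shift_calendar is a hypothetical helper, each shift is a dict with id, start, and end keys as in the slicing code, and start and end are already timezone-aware datetimes.

```python
from datetime import datetime, timezone


def validate_shift_calendar(shifts: list) -> list:
    """Return a list of problems; an empty list means the calendar is contiguous."""
    problems = []
    ordered = sorted(shifts, key=lambda s: s["start"])

    # Each shift must have a positive length.
    for shift in ordered:
        if shift["start"] >= shift["end"]:
            problems.append(f"{shift['id']}: start is not before end")

    # Each shift must begin exactly where the previous one ended.
    for prev, nxt in zip(ordered, ordered[1:]):
        if nxt["start"] < prev["end"]:
            problems.append(f"{prev['id']}/{nxt['id']}: overlap")
        elif nxt["start"] > prev["end"]:
            problems.append(f"{prev['id']}/{nxt['id']}: gap")
    return problems
```

Treating shifts as half-open intervals, where one shift's end equals the next shift's start, is what lets the strict overlap_start < overlap_end guard assign a boundary instant to exactly one shift.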

OEE Calculation & Metric Validation

OEE is the product of three bounded ratios. The formula looks trivial, but production implementations fail when denominators approach zero, negative durations slip through, or scrap events misalign with cycle counts. The contract is that no component may leave this stage outside the closed interval $[0, 1]$, and the published OEE must be reproducible from the stored inputs.

$\text{Availability} = \frac{\text{Planned Production Time} - \text{Downtime}}{\text{Planned Production Time}}$

$\text{Performance} = \frac{\text{Ideal Cycle Time} \times \text{Total Count}}{\text{Operating Time}}$

$\text{Quality} = \frac{\text{Good Count}}{\text{Total Count}}$

$\text{OEE} = \text{Availability} \times \text{Performance} \times \text{Quality}$

Each component must be clamped between $0.0$ and $1.0$, with explicit handling for missing telemetry and maintenance exclusions. Rigorous OEE formula validation catches upstream corruption before it reaches executive reporting, and the overlapping-maintenance-window case shows how planned exclusions must be subtracted from planned time rather than added to downtime.
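
As a worked illustration with hypothetical figures (an 8-hour shift with 28,800 s of planned production time, 3,600 s of downtime and therefore 25,200 s of operating time, a 1.5 s ideal cycle, 15,000 total units, and 14,700 good units), the four formulas evaluate as follows:

$\text{Availability} = \frac{28800 - 3600}{28800} = 0.875$

$\text{Performance} = \frac{1.5 \times 15000}{25200} = \frac{22500}{25200} \approx 0.8929$

$\text{Quality} = \frac{14700}{15000} = 0.98$

$\text{OEE} = 0.875 \times \frac{22500}{25200} \times 0.98 = 0.765625 \approx 0.7656$

Multiplying the already-rounded components instead ($0.875 \times 0.8929 \times 0.98 \approx 0.7657$) shifts the last digit, which is why rounding belongs only at the final step.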

def calculate_oee(
    planned_time_sec: float,
    operating_time_sec: float,
    ideal_cycle_time_sec: float,
    total_count: int,
    good_count: int
) -> dict:
    """Return Availability, Performance, Quality, and OEE, each clamped to [0, 1]."""
    def clamp(ratio: float) -> float:
        return min(1.0, max(0.0, ratio))

    # An unscheduled asset yields a defined zero rather than a ZeroDivisionError.
    if planned_time_sec <= 0:
        return {"availability": 0.0, "performance": 0.0, "quality": 0.0, "oee": 0.0}

    # Operating time is planned time minus downtime, matching the Availability formula.
    availability = clamp(operating_time_sec / planned_time_sec)
    performance = (
        clamp((ideal_cycle_time_sec * total_count) / operating_time_sec)
        if operating_time_sec > 0 else 0.0
    )
    quality = clamp(good_count / total_count) if total_count > 0 else 0.0

    oee = availability * performance * quality
    return {
        "availability": round(availability, 4),
        "performance": round(performance, 4),
        "quality": round(quality, 4),
        "oee": round(oee, 4)
    }

Edge cases and failure modes. A performance ratio above $1.0$ is a signal, not a rounding artifact: it means the configured ideal cycle time is slower than the line actually runs, so clamping must be paired with an alert rather than silently capping. Quality computed before scrap reconciliation closes will drift as late MES rework events arrive, so late-arriving counts must trigger a recomputation of the affected window. Float division compounds IEEE 754 representation error across millions of cycles; for ledger-grade reporting, accumulate counts as integers and operating time in whole seconds, applying precision and rounding limits only at the final ratio. A zero planned_time_sec (an unscheduled asset) must yield a defined zero, never a ZeroDivisionError that drops the asset from the rollup.
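
One way to pair the clamp with an alert while keeping the arithmetic exact is to hold the inputs as integers and form the ratio once. The sketch below uses the standard library's Fraction for that purpose; the function name performance_with_alert and the choice of milliseconds for the ideal cycle time are assumptions made for illustration.

```python
from fractions import Fraction
from typing import Tuple


def performance_with_alert(
    ideal_cycle_time_ms: int,
    total_count: int,
    operating_time_sec: int,
) -> Tuple[float, bool]:
    """Return (clamped performance, over_unity flag) from integer inputs.

    The flag is True when the unclamped ratio exceeds 1, which suggests the
    configured ideal cycle time is slower than the line really runs.
    """
    if operating_time_sec <= 0 or total_count < 0 or ideal_cycle_time_ms < 0:
        return 0.0, False
    # Exact rational: ideal time spent producing (ms) over operating time (ms).
    raw = Fraction(ideal_cycle_time_ms * total_count, operating_time_sec * 1000)
    over_unity = raw > 1
    clamped = min(raw, Fraction(1))
    # Convert to float exactly once, at the very end.
    return float(clamped), over_unity
```

A caller would publish the clamped value and route the flag to whatever alerting channel the site uses, so an over-unity reading prompts a review of the ideal cycle time instead of disappearing into the cap.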

Pipeline Reliability & Fallback Routing

Telemetry pipelines in industrial environments operate under hostile conditions: intermittent cellular backhaul, PLC watchdog resets, and clock drift across edge nodes. The contract for this subsystem is that no message is silently lost and no transient outage produces a permanent gap in the OEE record. Production architectures meet it with idempotent ingestion, dead-letter-queue routing, and explicit state reconciliation.

When a message fails schema validation or arrives out of order, the pipeline routes it to a DLQ with exponential-backoff retry rather than dropping it. For missing telemetry windows, a fallback strategy interpolates the known state from the last valid snapshot and flags the interval as ESTIMATED in downstream metrics, drawing on the same gap-filling algorithms used in the cleaning stage. Aligning timestamps to UTC and enforcing monotonic clock progression prevents duplicate state transitions during network-partition recovery. Reference implementations should follow the ISA-95 asset hierarchy and use Python’s datetime module for timezone-aware boundary enforcement.
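
The fallback for missing windows can be sketched as a carry-forward of the last valid state plus a reconciliation step. This is a minimal illustration under assumptions: fill_gap and reconcile are hypothetical helpers, and storing ESTIMATED in a quality_flag field is one possible representation, not something the pipeline above prescribes.

```python
from datetime import datetime, timezone


def fill_gap(last_valid: dict, gap_start: datetime, gap_end: datetime) -> dict:
    """Carry the last known state across a telemetry gap and mark it ESTIMATED."""
    if gap_start >= gap_end:
        raise ValueError("gap must have positive duration")
    return {
        "asset_id": last_valid["asset_id"],
        "state": last_valid["state"],
        "start": gap_start,
        "end": gap_end,
        "quality_flag": "ESTIMATED",
    }


def reconcile(estimate: dict, late_actual: list) -> list:
    """Replace an estimate with late real intervals only if they cover the gap exactly."""
    ordered = sorted(late_actual, key=lambda rec: rec["start"])
    if not ordered:
        return [estimate]
    covers = (
        ordered[0]["start"] == estimate["start"]
        and ordered[-1]["end"] == estimate["end"]
    )
    contiguous = all(a["end"] == b["start"] for a, b in zip(ordered, ordered[1:]))
    # Partial coverage keeps the estimate, so the window never loses duration.
    return ordered if covers and contiguous else [estimate]
```

Keeping the estimate until the late data covers the whole gap means the window's total duration stays constant through reconciliation.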

import time
from typing import Callable

def resilient_ingest(payload: dict, validator: Callable, dlq_client, max_retries: int = 3):
    """Idempotent ingestion with fallback routing and exponential backoff."""
    attempt = 0
    last_error: str = "unknown"
    while attempt < max_retries:
        attempt += 1
        try:
            if not validator(payload):
                raise ValueError("Schema validation failed")
            # Simulate write to time-series DB / message broker
            return {"status": "committed", "timestamp": time.time()}
        except ValueError as exc:
            # Validation is deterministic: retrying the same payload cannot succeed.
            last_error = str(exc)
            break
        except Exception as exc:
            last_error = str(exc)
            if attempt < max_retries:
                # Back off only when another attempt will follow.
                time.sleep(min(2 ** attempt, 30))

    dlq_client.publish(
        topic="telemetry.dlq",
        payload=payload,
        metadata={"error": last_error, "attempts": attempt},
    )
    return {"status": "routed_to_dlq", "fallback": "estimated_state"}

Edge cases and failure modes. Blocking time.sleep retries inside an async event loop will stall every other asset’s stream; in production this logic belongs behind asyncio.sleep or a separate worker pool. An ESTIMATED flag that is never cleared poisons trend analysis, so reconciliation must replace estimates with real values once late telemetry arrives. Exponential backoff without a jitter term synchronizes retries across hundreds of edge nodes after a broker failover, producing a thundering herd; add randomized jitter to the delay. And idempotency requires a stable dedup key (asset + sequence + state-transition hash) — without it, partition-recovery replays inflate downtime counts.
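
The jitter term can be added with a few lines. The sketch below uses the common "full jitter" variant, in which the delay is drawn uniformly between zero and the capped exponential ceiling; backoff_delay and its parameter names are assumptions for illustration, and the 30-second cap mirrors the one in the ingestion code.

```python
import random
from typing import Optional


def backoff_delay(
    attempt: int,
    base_sec: float = 1.0,
    cap_sec: float = 30.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Full-jitter backoff: a uniform draw from [0, min(cap_sec, base_sec * 2**attempt)]."""
    source = rng if rng is not None else random
    ceiling = min(cap_sec, base_sec * (2 ** attempt))
    return source.uniform(0.0, ceiling)
```

Passing a seeded random.Random makes the delays reproducible in tests, while production callers can omit it. In an asyncio pipeline the returned value would be handed to asyncio.sleep, never to time.sleep.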

Validation Gates Before Metrics Exit This Stage

No classified interval or OEE record should reach a dashboard, data lake, or MES feed until it passes a deterministic gate set. These checks are intentionally boring and absolute — they exist to fail fast and route to the DLQ rather than publish a plausible-but-wrong number.

Schema compliance. Required fields present, enum-valid states, asset_id resolvable to an ISA-95 equipment node. Reject unknown states rather than coercing them to IDLE.
Temporal bounds. Every interval has start < end, no negative duration, and timestamps fall within a configurable tolerance of the edge clock (for example, ±5 minutes) to reject drift-corrupted rows.
Value sanity. Availability, Performance, and Quality each lie in $[0, 1]$ after clamping; counts are non-negative integers; good_count <= total_count.
Continuity. The sum of state durations in a shift window equals the window length minus any ESTIMATED gap, so silent dropouts surface as a continuity error rather than inflated availability.
Quality flags. Intervals derived from BAD or UNCERTAIN source samples are excluded from aggregation, not averaged in, mirroring the outlier-detection masking applied during ingestion.

def validate_interval(rec: dict) -> tuple[bool, str]:
    """Deterministic gate set; returns (ok, reason) for DLQ routing."""
    if rec["start"] >= rec["end"]:
        return False, "non_positive_duration"
    if not 0.0 <= rec.get("availability", 0.0) <= 1.0:
        return False, "availability_out_of_bounds"
    if rec.get("good_count", 0) > rec.get("total_count", 0):
        return False, "good_exceeds_total"
    if rec.get("quality_flag") in {"BAD", "UNCERTAIN"}:
        return False, "excluded_quality_flag"
    return True, "ok"
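
The continuity gate is not covered by validate_interval because it operates on a whole shift window, not on a single record. A minimal sketch follows, assuming each interval carries a duration_sec value and an optional boolean estimated field; the function name, field names, and tolerance parameter are hypothetical.

```python
from typing import Tuple


def check_continuity(intervals: list, window_sec: int, tolerance_sec: int = 0) -> Tuple[bool, str]:
    """Check that measured durations equal the window length minus ESTIMATED gaps."""
    measured = sum(i["duration_sec"] for i in intervals if not i.get("estimated", False))
    estimated = sum(i["duration_sec"] for i in intervals if i.get("estimated", False))
    # Positive shortfall: time is missing. Negative shortfall: intervals overlap.
    shortfall = window_sec - estimated - measured
    if shortfall > tolerance_sec:
        return False, "continuity_gap"
    if shortfall < -tolerance_sec:
        return False, "continuity_overlap"
    return True, "ok"
```

Returning the same (ok, reason) pair as validate_interval lets both gates share one DLQ routing path.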

Engineering Constraints & Known Limits

The patterns above are bounded by the physics of factory networks and the limits of floating-point arithmetic. Treat these as design constraints, not surprises.

Floating-point accumulation. Summing per-cycle operating time as floats accrues IEEE 754 error over a 24/7 run; keep durations in integer seconds or Decimal and divide once at the end. The same caveat governs precision and rounding limits across the wider architecture.
Clock drift and skew. Edge oscillators drift tens of milliseconds per hour. Classification depends on a monotonic, synchronized index, so NTP/PTP sync and clock drift correction are prerequisites, not optional polish — sub-second microstop thresholds are meaningless against an unsynchronized clock.
Network partitioning. During backhaul loss, edge buffers must spill locally and replay in order on recovery; the classifier must treat replayed events idempotently so reconnection does not duplicate downtime.
Backpressure. When downstream sinks slow, the aggregation window must apply bounded buffering and shed to disk rather than grow unbounded in memory; pair this with the async batch patterns in ingestion and cleaning workflows.
Late and out-of-order events. MES scrap and rework events routinely arrive minutes after the cycle, so OEE windows must remain reopenable for a defined grace period and recompute deterministically when corrected.
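
The reopenable-window constraint can be illustrated with a small quality window that accepts late scrap corrections only within a grace period. This is a sketch under assumptions: the QualityWindow class, its field names, and the 15-minute grace value are hypothetical, and the current time is passed in explicitly so a replay produces the same result.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class QualityWindow:
    window_end: datetime
    total_count: int
    good_count: int
    grace: timedelta = timedelta(minutes=15)  # hypothetical grace period

    def is_open(self, now: datetime) -> bool:
        """The window stays reopenable until window_end plus the grace period."""
        return now <= self.window_end + self.grace

    def apply_late_scrap(self, scrap_count: int, now: datetime) -> bool:
        """Move late-reported scrap out of good_count; return False if rejected."""
        if not self.is_open(now):
            return False
        if scrap_count < 0 or scrap_count > self.good_count:
            return False
        self.good_count -= scrap_count
        return True

    @property
    def quality(self) -> float:
        """Recomputed from the stored integer counts on every read."""
        return self.good_count / self.total_count if self.total_count > 0 else 0.0
```

Because quality is derived from the stored counts and never cached, a correction accepted inside the grace period changes the published ratio deterministically, and a rejected one leaves it untouched for separate handling.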

Related

Event-to-Downtime Mapping: turning raw state transitions into classified downtime events.
Threshold Tuning for Microstops: choosing per-asset duration thresholds.
Shift Boundary Logic: slicing telemetry into timezone- and DST-safe windows.
OEE Formula Validation: deterministic reconciliation of Availability, Performance, and Quality.
Core Architecture & Data Mapping and Ingestion & Cleaning Workflows: the upstream sections this classification layer depends on.
