Event-to-Downtime Mapping for OEE-Accurate Machine State Pipelines

Event-to-downtime mapping is the deterministic translation layer that turns raw machine-state transitions into the structured availability intervals that OEE math depends on, and it is one of the core subsystems of Downtime Classification & OEE Calculation. PLC tags, SCADA alarms, and edge-gateway payloads almost never arrive pre-classified; they show up as high-frequency boolean flags, numeric thresholds, and discrete state codes that must be temporally aligned and logically validated before any Availability number is trustworthy. This page focuses on that single concern, converting events into auditable downtime windows, and assumes that upstream signal hygiene from PLC tag standardization and time-series database sync is already in place. Get the mapping wrong and sensor jitter, misaligned shift handoffs, and overlapping maintenance windows silently corrupt every downstream metric.

The pipeline below runs as an ordered chain: ingest and normalize, resolve canonical states, window events into intervals, slice them at shift boundaries, deduplicate overlaps, then emit an idempotent downtime ledger. Each stage defines a contract, ships a production-grade implementation, and then enumerates the failure modes that break it on a live floor.

The mapping layer as an ordered chain: raw tags are time-aligned, resolved to canonical states, windowed past the microstop threshold, then sliced to shifts and de-overlapped by priority before an idempotent ledger emits the validated downtime that OEE availability consumes — with physically impossible transitions rejected into a quarantine queue.

The canonical signal-to-event mapping contract this layer enforces is summarized below. Every raw input must resolve to exactly one canonical state, and every state carries a downtime classification that OEE consumes directly.

Raw signal (example tag)	Canonical state	Counts against	OEE dimension
`MachineRunning = 1`, `motor_current > 0.5 A`	`Running`	— (uptime)	Availability + Performance
`MachineRunning = 0`, no fault, idle timer < threshold	`Idle`	Microstop / minor stop	Availability
`EStopActive`, `ConveyorFault`, `MotorOverload`	`Fault`	Unplanned downtime	Availability
`MaintenanceMode = 1` (scheduled window)	`Maintenance`	Planned downtime	Excluded from loaded time
`LineCleared`, end-of-run signal	`Stopped`	No-demand / not scheduled	Excluded

Core concept and design contract

Factory networks run on heterogeneous clock sources. PLCs typically broadcast at 100–500 ms scan rates, edge aggregators batch payloads at 1–5 second intervals, and MES systems operate on enterprise NTP-synced servers. Non-monotonic timestamps and timezone drift introduce phantom state transitions that corrupt availability metrics, so the first contract is strict UTC normalization with monotonic timestamp alignment per asset. Late-arriving or out-of-order packets must be handled via watermarking, not silently discarded; the same clock drift correction discipline used in ingestion applies verbatim here.

The second contract is deterministic state resolution: identical inputs must always resolve to the same canonical state, regardless of arrival order or evaluation host. Raw boolean states (MachineRunning, ConveyorBlocked, EStopActive) are mapped to a finite state machine (FSM) that defines the only physically valid transitions. Invalid jumps — such as Running directly to Maintenance without an intermediate Idle or Fault — are flagged as data-quality faults rather than treated as real machine events. This aligns with the ISA-95 separation of Level 1/2 device signals from the Level 3 MES events that consume them, and with the ISA-88 notion of well-defined state models per equipment module.

The canonical state model: Idle is the entry hub, Fault counts as unplanned downtime, Maintenance as planned, and Stopped as excluded time. Only these transitions are physically valid — a jump such as Running → Maintenance is rejected as a data-quality fault rather than recorded as a real machine event.

Any transition not in this graph (for example Running → Maintenance directly) is treated as a data-quality fault rather than a real machine event and is routed to a quarantine table for review. The FSM also encodes a deterministic precedence order: when multiple status bits assert in the same scan, a safety-critical Fault overrides Idle, and an active Maintenance window suppresses unplanned-downtime counters.

import pandas as pd
from datetime import timedelta

def normalize_and_validate_telemetry(df: pd.DataFrame) -> pd.DataFrame:
    """Normalize timestamps, forward-fill missing states, and validate FSM transitions."""
    # Parse to UTC first so the sort below is chronological, not lexicographic
    df = df.copy()
    df["timestamp"] = pd.to_datetime(df["timestamp"], utc=True)
    # Enforce monotonic timestamps per asset
    df = df.sort_values(["asset_id", "timestamp"]).reset_index(drop=True)

    # Forward-fill constrained by max acceptable gap (e.g., 2x cycle time)
    max_gap = timedelta(seconds=10)
    df["state"] = df.groupby("asset_id")["state"].ffill(limit=int(max_gap.total_seconds() / 0.5))

    # FSM validation matrix (simplified)
    valid_transitions = {
        "Running": {"Idle", "Fault", "Stopped"},
        "Idle": {"Running", "Fault", "Maintenance"},
        "Fault": {"Idle", "Maintenance"},
        "Maintenance": {"Idle", "Running"},
        "Stopped": {"Idle", "Maintenance"}
    }

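    df["prev_state"] = df.groupby("asset_id")["state"].shift(1)

    def _is_valid(row: pd.Series) -> bool:
        prev, cur = row["prev_state"], row["state"]
        # The first sample per asset and repeated samples of the same state
        # are not transitions, so they are always valid
        if pd.isna(prev) or prev == cur:
            return True
        return cur in valid_transitions.get(prev, set())

    df["is_valid_transition"] = df.apply(_is_valid, axis=1)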
    return df

Implementation

With states normalized and validated, the core implementation handles the 80% case: turning discrete state changes into continuous downtime intervals, then slicing those intervals against the production schedule.

Event windowing and microstop filtering

When a Fault or Idle state activates, the pipeline opens a downtime window and records the start timestamp; the window closes at the next state change, typically a return to Running or a handoff to Maintenance. Industrial environments frequently exhibit rapid state oscillation from sensor bounce, mechanical hysteresis, or control-loop tuning: a machine may cycle between Running and Idle dozens of times per minute, fragmenting downtime records and artificially inflating loss counts. To prevent fragmentation, the pipeline applies a minimum-duration threshold before committing an event to the ledger. Events below the threshold are classified as operational noise or microstops depending on business rules; choosing that boundary correctly is the entire subject of threshold tuning for microstops.

def window_events_and_filter_microstops(df: pd.DataFrame, min_duration_sec: float = 30.0) -> pd.DataFrame:
    """Group consecutive identical states into windows and filter microstops."""
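    # Windowing assumes chronological order within each asset
    df = df.sort_values(["asset_id", "timestamp"]).reset_index(drop=True)
    # Create group IDs for consecutive identical states
    df["state_group"] = (df["state"] != df.groupby("asset_id")["state"].shift()).cumsum()

    windows = df.groupby(["asset_id", "state_group"]).agg(
        start=("timestamp", "min"),
        last_seen=("timestamp", "max"),
        state=("state", "first")
    ).reset_index()

    # A state lasts until the next state begins; the final window per asset
    # can only extend to its last observed sample
    next_start = windows.groupby("asset_id")["start"].shift(-1)
    windows["end"] = next_start.fillna(windows["last_seen"])
    windows = windows.drop(columns=["last_seen"])

    windows["duration_sec"] = (windows["end"] - windows["start"]).dt.total_seconds()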

    # Filter out microstops below threshold
    windows = windows[windows["duration_sec"] >= min_duration_sec].copy()
    windows["event_id"] = windows.apply(lambda r: f"{r['asset_id']}_{r['start'].isoformat()}", axis=1)
    return windows.drop(columns=["state_group"])
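
The filter above discards short windows outright. Where business rules need to keep them as a separate category, a three-way split by duration is one option. This is a minimal sketch, not part of the pipeline above: the function name and both thresholds (`noise_max_sec`, `microstop_max_sec`) are hypothetical and would come from plant policy.

```python
def classify_stop(duration_sec: float,
                  noise_max_sec: float = 2.0,
                  microstop_max_sec: float = 30.0) -> str:
    """Label a non-running window by duration (thresholds are illustrative)."""
    if duration_sec < 0:
        raise ValueError("duration_sec must be non-negative")
    if duration_sec < noise_max_sec:
        return "noise"        # sensor bounce; excluded from loss counts
    if duration_sec < microstop_max_sec:
        return "microstop"    # tracked separately, not a ledger downtime event
    return "downtime"         # long enough to commit to the downtime ledger
```

Keeping the label instead of dropping the row lets microstop totals be reported without polluting the ledger.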

Shift boundary resolution and schedule alignment

Downtime events rarely respect shift boundaries. A fault triggered at 14:58 during Shift A and cleared at 15:05 during Shift B must be split proportionally or attributed according to plant policy; failing to resolve cross-boundary events causes metric leakage, double-counting, or unallocated production time. The window is intersected with a master shift calendar, clipped to each shift’s start/end, and each resulting segment inherits the shift ID, crew, and planned-production status. This deterministic intersection depends on robust shift boundary logic, including the DST and timezone handling that calendar arithmetic quietly gets wrong. Planned breaks, changeovers, and scheduled maintenance are cross-referenced against the production calendar to separate planned from unplanned losses.

import numpy as np

def align_to_shift_boundaries(windows: pd.DataFrame, shift_calendar: pd.DataFrame) -> pd.DataFrame:
    """Split downtime windows at shift boundaries and attribute to correct shifts."""
    # Convert shift calendar to intervals for efficient intersection
    shifts = pd.IntervalIndex.from_arrays(
        pd.to_datetime(shift_calendar["start"], utc=True),
        pd.to_datetime(shift_calendar["end"], utc=True),
        closed="left"
    )

    aligned = []
    for _, row in windows.iterrows():
        window_start = row["start"]
        window_end = row["end"]

        # Find overlapping shifts
        overlap_indices = shifts.overlaps(pd.Interval(window_start, window_end, closed="left"))
        if not overlap_indices.any():
            continue

        for idx in np.where(overlap_indices)[0]:
            shift_start = shifts[idx].left
            shift_end = shifts[idx].right

            clip_start = max(window_start, shift_start)
            clip_end = min(window_end, shift_end)

            if clip_start < clip_end:
                aligned.append({
                    "event_id": row["event_id"],
                    "asset_id": row["asset_id"],
                    "shift_id": shift_calendar.iloc[idx]["shift_id"],
                    "state": row["state"],
                    "start": clip_start,
                    "end": clip_end,
                    "duration_sec": (clip_end - clip_start).total_seconds()
                })
    return pd.DataFrame(aligned)

Edge cases and failure modes

Real factories break the happy path constantly. The mapping layer must anticipate these failure classes explicitly rather than discovering them in a quarterly OEE audit.

Cascading simultaneous alarms. An E-Stop typically cascades into ConveyorFault, MotorOverload, and SafetyGateOpen within the same scan. If each alarm opens an independent downtime window, overlapping intervals inflate total downtime. Overlap resolution requires a deterministic priority matrix — safety-critical states (E-Stop, GuardOpen) supersede operational faults, which supersede maintenance states. The pipeline merges overlapping windows by selecting the highest-priority state for each time slice so that exactly one downtime event exists per interval. The full interval-arithmetic treatment lives in calculating OEE with overlapping maintenance windows.
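
The slice-by-slice merge described above can be sketched with a boundary sweep. This is an illustrative version under stated assumptions: the numeric ranks in `PRIORITY` and the `(start_sec, end_sec, alarm_name)` tuple layout are hypothetical, chosen only to mirror the safety > operational fault > maintenance order.

```python
from typing import List, Tuple

# Hypothetical ranks (higher wins), mirroring the order described above
PRIORITY = {"EStop": 3, "GuardOpen": 3, "ConveyorFault": 2,
            "MotorOverload": 2, "Maintenance": 1}

Interval = Tuple[float, float, str]  # (start_sec, end_sec, alarm_name)

def resolve_overlaps(intervals: List[Interval]) -> List[Interval]:
    """Return non-overlapping slices, each owned by the highest-priority alarm."""
    # Every start or end is a potential slice boundary
    cuts = sorted({t for start, end, _ in intervals for t in (start, end)})
    resolved: List[Interval] = []
    for left, right in zip(cuts, cuts[1:]):
        # Alarms that cover the whole slice [left, right)
        active = [name for start, end, name in intervals
                  if start <= left and end >= right]
        if not active:
            continue  # no alarm in this slice, so it is not downtime
        # Highest rank wins; the name breaks ties so the result is deterministic
        winner = max(active, key=lambda n: (PRIORITY.get(n, 0), n))
        if resolved and resolved[-1][2] == winner and resolved[-1][1] == left:
            # Extend the previous slice instead of emitting a fragment
            resolved[-1] = (resolved[-1][0], right, winner)
        else:
            resolved.append((left, right, winner))
    return resolved
```

Because the output slices never overlap, their durations sum to the union of the inputs, which is what keeps a cascade from inflating total downtime.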

Out-of-order and late-arriving packets. Edge buffers replaying after a network partition deliver events whose timestamps predate the current watermark. Discarding them undercounts downtime; appending them blindly produces non-monotonic sequences that break windowing. Use event-time processing with a bounded grace period and trigger incremental window corrections.
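
A bounded grace period can be sketched as a small event-time buffer. The class below is a hypothetical illustration, not a specific stream-processing API: `WatermarkBuffer`, its field names, and the 900-second default are all assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Event = Tuple[float, str]  # (event_time_sec, state)

@dataclass
class WatermarkBuffer:
    """Event-time buffer with a bounded grace period (illustrative only)."""
    grace_sec: float = 900.0                  # assumed default grace period
    max_event_time: float = float("-inf")
    accepted: List[Event] = field(default_factory=list)
    too_late: List[Event] = field(default_factory=list)

    @property
    def watermark(self) -> float:
        # Events older than this are treated as final
        return self.max_event_time - self.grace_sec

    def offer(self, event: Event) -> bool:
        """Accept the event if it falls inside the grace period."""
        event_time, _ = event
        if event_time < self.watermark:
            self.too_late.append(event)       # keep for review, never drop silently
            return False
        self.accepted.append(event)
        self.max_event_time = max(self.max_event_time, event_time)
        return True

    def ordered(self) -> List[Event]:
        # Re-sort by event time so windowing sees a monotonic sequence
        return sorted(self.accepted, key=lambda e: e[0])
```

Late events inside the grace period are re-ordered before windowing; anything older lands in `too_late` so it can trigger a correction or an audit rather than vanish.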

PLC scan-cycle misalignment. A 250 ms scan PLC feeding a pipeline that resamples to 100 ms creates phantom one-cell gaps. Forward-fill must be bounded to roughly twice the scan cycle so a genuine dropout is never papered over as uptime.

Sensor degradation masquerading as state churn. A failing proximity sensor produces a storm of zero-duration Fault → Idle oscillations. These pass FSM validation individually but are physically implausible in aggregate, so distribution-level anomaly checks (transition frequency, duration histograms) must run alongside per-row validation.
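
One simple distribution-level check is to count transitions per asset in fixed time buckets and flag buckets that exceed a limit. The sketch below assumes transitions arrive as `(asset_id, transition_time_sec)` pairs; the function name and the default limit of 20 per minute are hypothetical and should be tuned per asset.

```python
from collections import Counter
from typing import Dict, List, Tuple

def flag_transition_storms(
    transitions: List[Tuple[str, float]],   # (asset_id, transition_time_sec)
    bucket_sec: float = 60.0,
    max_per_bucket: int = 20,               # hypothetical limit; tune per asset
) -> Dict[str, List[int]]:
    """Return, per asset, the bucket indices whose transition count exceeds the limit."""
    counts: Counter = Counter()
    for asset_id, t in transitions:
        counts[(asset_id, int(t // bucket_sec))] += 1
    flagged: Dict[str, List[int]] = {}
    # Sorted iteration keeps the bucket lists in chronological order
    for (asset_id, bucket), n in sorted(counts.items()):
        if n > max_per_bucket:
            flagged.setdefault(asset_id, []).append(bucket)
    return flagged
```

A flagged bucket is a prompt to inspect the sensor, not proof of failure; a duration histogram would be a natural second signal.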

Once overlaps are resolved, the validated downtime feeds the availability term of the OEE formula. Availability is defined as:

$\text{Availability} = \frac{\text{Planned Production Time} - \text{Validated Downtime}}{\text{Planned Production Time}}$

The mapping layer is responsible only for producing the Validated Downtime term correctly; the full multiplication and its sanity bounds are the subject of OEE formula validation.
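
As a worked instance of the formula, the helper below computes availability from the two terms and rejects inputs that cannot be right. The function name and its guard conditions are illustrative assumptions, not part of the pipeline above.

```python
def availability(planned_production_sec: float, validated_downtime_sec: float) -> float:
    """Availability = (planned production time - validated downtime) / planned production time."""
    if planned_production_sec <= 0:
        raise ValueError("planned production time must be positive")
    if not 0 <= validated_downtime_sec <= planned_production_sec:
        # Downtime beyond planned time usually means overlaps were not resolved
        raise ValueError("validated downtime must lie within planned production time")
    return (planned_production_sec - validated_downtime_sec) / planned_production_sec

# 8-hour shift (28,800 s) with 2,880 s of validated downtime
print(availability(28_800, 2_880))  # 0.9
```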

Verification and testing

Mapping logic is deterministic, which makes it directly unit-testable. The most valuable tests assert that synthetic event streams with known downtime produce exactly the expected windows after microstop filtering and shift slicing.

import pandas as pd

def test_microstop_filtered_but_real_fault_kept():
    """A 5s blip is dropped; a 90s fault survives windowing."""
    df = pd.DataFrame({
        "asset_id": ["A"] * 4,
        "timestamp": pd.to_datetime([
            "2026-06-26T08:00:00Z", "2026-06-26T08:00:05Z",   # 5s Idle blip
            "2026-06-26T08:10:00Z", "2026-06-26T08:11:30Z",   # 90s Fault
        ], utc=True),
        "state": ["Idle", "Running", "Fault", "Running"],
    })
    windows = window_events_and_filter_microstops(df, min_duration_sec=30.0)
    downtime = windows[windows["state"].isin(["Idle", "Fault"])]
    assert set(downtime["state"]) == {"Fault"}        # microstop dropped
    assert downtime["duration_sec"].iloc[0] == 90.0   # real fault preserved

Beyond unit tests, reconcile the emitted ledger against the source of truth. Cross-check calculated availability against MES production logs and flag any deviation above 2% for audit. A periodic reconciliation query against the time-series database sync layer catches drift between the raw tag history and the derived ledger:

-- Sum derived Fault downtime per asset/shift vs. raw fault-state dwell time.
-- Run on a TimescaleDB hypertable; alert when the relative error exceeds 2%.
SELECT d.asset_id,
       d.shift_id,
       SUM(d.duration_sec)                         AS ledger_downtime_sec,
       r.raw_fault_sec,
       ABS(SUM(d.duration_sec) - r.raw_fault_sec)
           / NULLIF(r.raw_fault_sec, 0)            AS rel_error
FROM downtime_ledger d
JOIN (
    SELECT asset_id,
           time_bucket('8 hours', ts) AS shift_bucket,
           SUM(EXTRACT(EPOCH FROM dwell)) AS raw_fault_sec
    FROM raw_state_dwell
    WHERE state = 'Fault'
    GROUP BY asset_id, shift_bucket
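) r ON r.asset_id = d.asset_id
   -- Match each ledger row to its own bucket; assumes shifts coincide with the
   -- 8-hour buckets and the ledger stores the segment start in "start"
   AND r.shift_bucket = time_bucket('8 hours', d."start")
-- Compare like with like: the raw side only counts Fault dwell time
WHERE d.state = 'Fault'
GROUP BY d.asset_id, d.shift_id, r.raw_fault_sec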
HAVING ABS(SUM(d.duration_sec) - r.raw_fault_sec) / NULLIF(r.raw_fault_sec, 0) > 0.02;

Performance and scale considerations

High-frequency deployments ingest millions of state samples per minute across hundreds of assets, so the mapping layer must be both idempotent and partition-friendly.

Watermarking and late-data handling. Use event-time processing with a configurable watermark (e.g., 15-minute grace period). Late packets trigger incremental window updates rather than full recomputation of a shift.
Idempotent writes. Generate a deterministic event_id hash from asset_id, state, and start_timestamp. Upserts then make pipeline retries safe and prevent duplicate ledger rows.
Partitioning strategy. Store the downtime ledger partitioned by asset_id and date. This keeps OEE rollups to a bounded time range and limits compute during shift-report generation.
Anomaly detection. Continuously monitor transition frequency and duration distributions; spikes in Fault → Idle oscillation or a sudden burst of zero-duration windows signal sensor degradation or PLC ladder-logic errors before they distort a monthly report.

import hashlib

def upsert_downtime_ledger(df: pd.DataFrame, target_table: str) -> None:
    """Idempotent write with conflict resolution."""
    df["event_hash"] = df.apply(
        lambda r: hashlib.sha256(f"{r['asset_id']}_{r['state']}_{r['start'].isoformat()}".encode()).hexdigest(),
        axis=1
    )
    # Use database-native MERGE/UPSERT to handle late-arriving corrections
    # Example: DuckDB, PostgreSQL, or Delta Lake MERGE INTO ...
    pass
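
The stub above leaves the write itself open. As one possible shape, the sketch below uses the standard-library `sqlite3` module and SQLite's `ON CONFLICT ... DO UPDATE` clause (available in SQLite 3.24 and later). The table layout, the `upsert_rows` name, and the choice to overwrite only `duration_sec` are assumptions for illustration; a production target such as PostgreSQL or Delta Lake would use its own MERGE/UPSERT syntax.

```python
import hashlib
import sqlite3
from typing import Iterable, Tuple

Row = Tuple[str, str, str, float]  # (asset_id, state, start_iso, duration_sec)

def event_hash(asset_id: str, state: str, start_iso: str) -> str:
    # Same key recipe as the stub: asset, state, and start timestamp
    return hashlib.sha256(f"{asset_id}_{state}_{start_iso}".encode()).hexdigest()

def upsert_rows(conn: sqlite3.Connection, rows: Iterable[Row]) -> None:
    """Insert new events; overwrite the duration when the same event is replayed."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS downtime_ledger (
               event_hash TEXT PRIMARY KEY,
               asset_id TEXT, state TEXT, start_ts TEXT, duration_sec REAL)"""
    )
    conn.executemany(
        """INSERT INTO downtime_ledger VALUES (?, ?, ?, ?, ?)
           ON CONFLICT(event_hash) DO UPDATE SET duration_sec = excluded.duration_sec""",
        [(event_hash(a, s, t), a, s, t, d) for a, s, t, d in rows],
    )
    conn.commit()
```

Replaying the same batch leaves one row per event, and a late correction for an existing event updates its duration in place.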

Event-to-downtime mapping is not a one-time configuration but a continuously calibrated system. Establish feedback loops between the pipeline and floor operators: when downtime classifications consistently mismatch operator logs, revisit threshold configuration, sensor calibration, and PLC ladder logic. Automate availability-variance alerts, enforce strict schema validation at ingestion, and keep the state-transition matrix version-controlled. Treated as a deterministic, auditable process, telemetry mapping yields reliable availability metrics, accurate performance tracking, and OEE numbers an engineer can defend in an audit.

Related

Downtime Classification & OEE Calculation — parent section and end-to-end pipeline overview
Calculating OEE with overlapping maintenance windows — interval arithmetic for merged downtime
Threshold tuning for microstops — choosing the minimum-duration boundary
Shift boundary logic — slicing intervals across crews and DST
OEE formula validation — bounding and reconciling the final metric
PLC tag standardization — upstream signal hygiene this layer depends on
