Gap-Filling Algorithms for Manufacturing Telemetry and OEE Pipelines

Gap filling is the stage of Ingestion & Cleaning Workflows that reconstructs missing samples in a telemetry stream deterministically, so that downstream Overall Equipment Effectiveness (OEE) math is computed over a continuous, auditable series rather than a series riddled with holes. This page focuses on one narrow concern: how to decide whether a hole may be filled, which method respects the underlying sensor physics, and how to record every synthesized value so it never silently inflates or deflates an availability, performance, or quality figure. Industrial streams arrive with discontinuities from network partitioning, edge gateway buffer overflows, PLC scan-rate mismatches, and intermittent cellular backhaul — and the wrong fill turns a genuine machine fault into a smooth, invisible line.

A production-grade gap-filling strategy preserves physical process constraints, never extrapolates past the last verified reading, halts at process boundaries such as shift changes and recipe changeovers, and tags every interpolated value for the OEE engine to weight. Get this wrong and a thirty-minute outage on a spindle-load tag becomes a confident straight line that hides downtime and corrupts the availability ratio.

Algorithm selection matrix

Gap filling is not a one-size-fits-all operation. The mathematical treatment must align with sensor physics and process dynamics, and the duration of the hole is the first discriminator.

Gap duration	Signal type	Method	Rationale
≤ 5 s	Continuous analog (temperature, pressure, flow)	Linear interpolation	Steady-state processes follow predictable trajectories; O(n) and exactly reproducible
5–60 s	Continuous analog with known inertia	Cubic spline or Akima	Preserves curvature for thermodynamic or fluid-dynamics signals
Any	Discrete state (machine status, valve position, alarm flag)	Forward-fill, bounded horizon	State persistence is physically valid; prevents artificial state toggling
> 5 min	Any	Flag as `NaN` / `MISSING`	Long gaps exceed process predictability; filling violates audit compliance
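
The matrix can be read as a small dispatch function. The sketch below is one possible encoding, not the pipeline's actual selector: `SignalClass`, `select_fill_method`, and the returned method names are hypothetical. It also makes two assumptions the table does not state: the "> 5 min" row takes precedence over the discrete-state row, and any duration and signal combination the matrix does not list stays MISSING.

```python
from enum import Enum


class SignalClass(Enum):
    CONTINUOUS = "continuous"                    # temperature, pressure, flow
    CONTINUOUS_INERTIAL = "continuous_inertial"  # analog with known inertia
    DISCRETE = "discrete"                        # machine status, valve position, alarm flag


def select_fill_method(gap_seconds: float, signal: SignalClass) -> str:
    """Map one gap onto "linear", "akima", "ffill" or "missing"."""
    if gap_seconds > 300:
        # > 5 min: flag, never fill. ASSUMPTION: this row wins over all others.
        return "missing"
    if signal is SignalClass.DISCRETE:
        return "ffill"      # the bounded horizon is enforced by the caller
    if gap_seconds <= 5:
        return "linear"
    if gap_seconds <= 60 and signal is SignalClass.CONTINUOUS_INERTIAL:
        return "akima"
    # ASSUMPTION: combinations the matrix does not cover are left MISSING.
    return "missing"
```

Defaulting unlisted cases to "missing" keeps the selector conservative: a gap is filled only when a row of the matrix explicitly permits it.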

Boundary enforcement is non-negotiable: interpolation must never extrapolate beyond the last known valid reading, and it must stop at shift boundaries, recipe changeovers, and machine state transitions. The detailed mechanics for constrained continuous variables — vectorized execution, the closed-form value, and horizon limits — are documented in implementing linear interpolation for missing sensor values, which this page generalizes across signal classes.

Core concept and design contract

Two prerequisites must hold before any fill executes, and both are upstream stages in the same workflow. First, the series must be temporally harmonized: uncorrected oscillator drift between PLCs, OPC UA servers, and the cloud broker misaligns the fill window, so forward-fill bleeds across unrelated production cycles. Establishing a synchronized, monotonic time index is the job of clock drift correction, which enforces NTP/PTP alignment and rejects timestamp inversion at the edge gateway. Second, anomalies must already be quarantined: interpolating across an unflagged spike or a stuck-at-zero fault propagates corruption into the feature store, so outlier detection methods run first, masking statistical anomalies and physics-impossible readings as NaN so the fill treats them as gaps rather than anchors.

With a clean, aligned canvas, the fill itself obeys a five-rule contract:

Classify before filling. Each gap is bucketed by duration, by signal class (continuous analog vs. discrete state, derived from PLC tag standardization metadata), and by operational context. The bucket selects the method from the matrix above; nothing is filled by default.
Bounded horizon. A maximum fill length caps how many consecutive missing samples may be synthesized. Beyond the horizon the series stays MISSING, because a model cannot honestly reconstruct minutes of a fluctuating process.
No extrapolation. Interpolation requires a verified anchor on both sides. Leading and trailing gaps — before the first or after the last real reading — are never invented.
Boundary isolation. A gap that spans a shift change, maintenance window, or recipe transition is truncated at the boundary. The boundaries themselves come from shift boundary logic and the line’s event table; a fill must respect them or it will smear one ISA-95 work-unit’s state into the next.
Determinism and auditability. Re-running the pipeline on identical raw input must produce bitwise-identical output. This rules out stochastic imputation (KNN, random forest, Kalman smoothing with random initialization) in the deterministic path, and it requires every synthesized value to carry an is_interpolated flag.
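
The first three rules all depend on knowing where each gap starts, how long it is, and whether it has an anchor on both sides. A minimal sketch of such a gap inventory follows. It assumes the series is already on a regular scan-cycle grid; `find_gaps` and its column names are hypothetical.

```python
import numpy as np
import pandas as pd


def find_gaps(series: pd.Series, scan_cycle_ms: int = 1000) -> pd.DataFrame:
    """List every run of consecutive NaNs in a series on a regular grid.

    One row per gap: first and last missing position, number of missing
    samples, duration in seconds, and whether verified anchors exist on
    both sides (interior) or the gap touches an edge of the series.
    """
    missing = series.isna().to_numpy()
    # Pad with False so runs touching either edge still get a start and an end.
    padded = np.concatenate(([False], missing, [False])).astype(int)
    edges = np.diff(padded)
    starts = np.flatnonzero(edges == 1)       # first missing position of each run
    ends = np.flatnonzero(edges == -1) - 1    # last missing position of each run
    n_missing = ends - starts + 1
    return pd.DataFrame({
        "start_pos": starts,
        "end_pos": ends,
        "n_missing": n_missing,
        "duration_s": n_missing * scan_cycle_ms / 1000.0,
        "is_interior": (starts > 0) & (ends < len(series) - 1),
    })
```

Each row can then be bucketed by `duration_s` and rejected outright when `is_interior` is false, before any value is synthesized.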

This contract maps directly onto the at-least-once delivery model of QoS 1 for discrete state transitions: duplicates are deduplicated upstream, but the residual holes from genuinely lost packets are exactly what this stage repairs.

Implementation

IIoT pipelines process high-frequency telemetry in asynchronous, memory-constrained environments, so the implementation pairs a pure, typed fill function with a chunked async runner. The fill function below handles the 80% case — a single continuous-analog tag with a state mask, a hard horizon, and explicit edge-boundary clamping that prevents any extrapolation.

import asyncio
import logging
import pandas as pd
from typing import Optional

logging.basicConfig(level=logging.INFO, format="%(asctime)s | %(levelname)s | %(message)s")
logger = logging.getLogger("gap_fill")


def apply_gap_filling(
    df: pd.DataFrame,
    value_col: str,
    time_col: str,
    scan_cycle_ms: int = 1000,
    max_gap_seconds: int = 30,
    state_mask_col: Optional[str] = None,
) -> pd.DataFrame:
    """Deterministic, horizon-bounded gap filling with boundary enforcement.

    Fills only interior gaps of at most ``max_gap_seconds`` worth of samples
    (longer gaps stay entirely NaN, they are not partially filled), never
    extrapolates past the first/last verified sample, and refuses to fill
    across a change in the discrete state mask (e.g. RUN -> STOP). Emits an
    ``is_interpolated`` audit column so OEE computation can weight synthesized
    values. Assumes ``df`` is already resampled onto the scan-cycle grid.
    """
    df = df.copy()
    df[time_col] = pd.to_datetime(df[time_col], utc=True)
    df = df.set_index(time_col).sort_index()

    # Maximum consecutive samples we are allowed to synthesize.
    max_consecutive = int(max_gap_seconds * 1000 // scan_cycle_ms)

    was_missing = df[value_col].isna()

    # Label each NaN run together with the verified sample that closes it, then
    # measure the run. pandas' ``limit`` would fill the first N samples of a
    # longer gap, so over-horizon gaps are rejected by run length instead.
    run_id = (~was_missing)[::-1].cumsum()[::-1].to_numpy()
    gap_len = was_missing.astype(int).groupby(run_id).transform("sum")
    fillable = was_missing & (gap_len <= max_consecutive)

    # Boundary isolation: refuse any gap in which a missing row or the closing
    # anchor carries a different state from the row before it. Missing state
    # values inherit the last known state.
    if state_mask_col and state_mask_col in df.columns:
        state = df[state_mask_col].ffill()
        changed = state.ne(state.shift())
        fillable &= ~changed.groupby(run_id).transform("any")

    filled = df[value_col].interpolate(
        method="linear",
        limit_area="inside",          # never extrapolate at the series edges
    )
    # Keep real samples and permitted fills; everything else reverts to NaN.
    df[value_col] = filled.where(~was_missing | fillable)
    df["is_interpolated"] = was_missing & df[value_col].notna()

    n_filled = int(df["is_interpolated"].sum())
    logger.info("Filled %d of %d missing samples on %s", n_filled,
                int(was_missing.sum()), value_col)
    return df.reset_index()


async def process_chunk(chunk: pd.DataFrame, out: asyncio.Queue) -> None:
    """Async worker: fill one resampled chunk and enqueue the result."""
    try:
        result = apply_gap_filling(
            chunk, value_col="spindle_load", time_col="ts_aligned",
            scan_cycle_ms=1000, max_gap_seconds=30, state_mask_col="machine_state",
        )
        await out.put(result)
    except Exception:                  # never let one bad chunk kill the run
        logger.exception("chunk fill failed")
        await out.put(None)            # sentinel for the collector


async def run_pipeline(chunks: list[pd.DataFrame]) -> pd.DataFrame:
    out: asyncio.Queue = asyncio.Queue()
    workers = [asyncio.create_task(process_chunk(c, out)) for c in chunks]
    results = [r for _ in chunks if (r := await out.get()) is not None]
    await asyncio.gather(*workers)
    return pd.concat(results, ignore_index=True)

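Note the reliance on limit_area="inside" rather than manual edge clamping: it is the pandas mechanism that keeps leading and trailing gaps NaN, which is the rule the contract demands. The horizon needs more care. The pandas limit argument caps how many consecutive NaNs are filled, so on its own it fills the first samples of an over-long gap instead of leaving the whole gap MISSING; rejecting such gaps outright requires a run-length check. The chunked runner keeps memory bounded when reconstructing multi-day archives, and because each chunk is sealed by async batch processing before it reaches this stage, the fill operates on contiguous, source-time-ordered windows. For non-linear signals, swap method="linear" for "akima" or "cubicspline"; both remain deterministic and honor limit_area, but both require SciPy and, unlike "linear", interpolate against the index values rather than treating samples as equally spaced.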
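
The difference between capping a fill and refusing a gap is easy to miss, so the following toy comparison may help. It is a sketch on a six-sample series; the variable names and the `max_run` horizon are illustrative only.

```python
import numpy as np
import pandas as pd

s = pd.Series([0.0, np.nan, np.nan, np.nan, np.nan, 5.0])

# limit=2 fills the first two NaNs of the four-sample gap and leaves the rest,
# producing a partially synthesized gap.
partial = s.interpolate(method="linear", limit=2, limit_area="inside")

# Run-length mask: only NaN runs of at most max_run samples may be filled.
max_run = 2
missing = s.isna()
run_id = (~missing).cumsum()            # NaNs share the id of the preceding sample
run_len = missing.astype(int).groupby(run_id).transform("sum")
fillable = missing & (run_len <= max_run)

# Interpolate everything interior, then keep only real samples and allowed fills.
strict = s.interpolate(method="linear", limit_area="inside").where(~missing | fillable)
```

Here `partial` contains two synthesized values followed by two NaNs, while `strict` leaves all four positions NaN because the run exceeds the horizon.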

Edge cases and failure modes

Real factories break the happy path in specific, repeatable ways:

Filling across a state transition. Without the state-mask guard, a gap straddling a RUN → STOP edge interpolates a phantom decline that the OEE engine reads as a slow stop rather than an instantaneous one, blurring the downtime boundary that shift boundary logic is supposed to keep crisp.
Clock jitter creating false gaps. Residual drift after correction can make diff() report a sub-sample “gap” that is really jitter. Resample to the scan-cycle grid first; never infer gaps from raw arrival deltas.
PLC scan-cycle misalignment. A tag sampled at 1 s and resampled onto a 250 ms grid legitimately has three “missing” intermediate points after every sample that are not gaps at all. Define the grid from the tag’s declared scan cycle, not a global default, or you will synthesize values that never went missing.
Stuck-at-zero masquerading as data. A frozen sensor reports a valid-looking constant, so it is never a NaN and is never filled — it is an outlier problem, which is why outlier detection methods must run before this stage to convert the stuck run into gaps.
Floating-point reproducibility. Interpolation is IEEE 754 arithmetic; summing offsets in a different order across chunk boundaries can shift a value in the last ULP. Pin the dtype, fill whole gaps inside a single chunk (never split a gap across two chunks), and consult precision and rounding limits before comparing filled series for equality.
Backward bleed into pre-startup. A machine powered on mid-window has a leading NaN run that must stay NaN; limit_area="inside" enforces this, but a naive bfill() will happily invent pre-startup readings.
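
The jitter and scan-cycle cases share one remedy: detect gaps on the tag's own grid instead of on raw arrival deltas. A minimal sketch, assuming a sorted, unique, timezone-aware DatetimeIndex; `snap_to_scan_grid` is a hypothetical helper, and the half-cycle tolerance is an assumption rather than a rule from this page.

```python
import pandas as pd


def snap_to_scan_grid(series: pd.Series, scan_cycle_ms: int) -> pd.Series:
    """Reindex a jittery, time-indexed series onto its declared scan-cycle grid.

    Each grid point takes the nearest raw sample within half a scan cycle.
    Grid points with no sample that close stay NaN and count as real gaps.
    """
    series = series.sort_index()
    freq = f"{scan_cycle_ms}ms"
    grid = pd.date_range(series.index[0].round(freq),
                         series.index[-1].round(freq), freq=freq)
    # ASSUMPTION: half a cycle is an acceptable jitter tolerance for this tag.
    tolerance = pd.Timedelta(milliseconds=scan_cycle_ms) / 2
    return series.reindex(grid, method="nearest", tolerance=tolerance)
```

With a 1 s scan cycle, samples arriving at 0.000 s, 1.030 s, 1.980 s and 4.010 s land on grid points 0, 1, 2 and 4, and only grid point 3 is reported as missing.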

Verification and testing

A fill stage is only trustworthy if its determinism and its boundary rules are provable. The highest-value test asserts that an interior gap is filled, a too-long gap is left MISSING, edges are never extrapolated, and a gap across a state change is refused.

import numpy as np
import pandas as pd


def test_gap_fill_contract():
    idx = pd.date_range("2026-06-26T06:00:00Z", periods=14, freq="1s")
    val = [0.0, 1.0, np.nan, 3.0,           # 1 s interior gap -> fill to 2.0
           np.nan, np.nan, 6.0,             # 2 s gap straddling RUN->STOP -> refused
           np.nan, np.nan, np.nan, np.nan,  # 4 s gap > 3 s horizon -> stays MISSING
           11.0, 12.0, np.nan]              # trailing gap -> stays NaN
    state = ["RUN"] * 5 + ["STOP"] * 9      # transition falls inside the 2 s gap
    df = pd.DataFrame({"ts_aligned": idx, "spindle_load": val,
                       "machine_state": state})

    out = apply_gap_filling(df, "spindle_load", "ts_aligned",
                            scan_cycle_ms=1000, max_gap_seconds=3,
                            state_mask_col="machine_state")
    s = out["spindle_load"]

    assert s.iloc[2] == 2.0                 # interior gap filled
    assert out["is_interpolated"].iloc[2]   # and flagged
    assert s.iloc[4:6].isna().all()         # gap across RUN->STOP refused
    assert s.iloc[7:11].isna().all()        # over-horizon gap left whole, not partial
    assert pd.isna(s.iloc[13])              # trailing edge never extrapolated
    assert out["is_interpolated"].sum() == 1
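
Determinism itself can be checked by hashing the output of two runs over the same raw input. The sketch below uses `pd.util.hash_pandas_object` with `hashlib`; `frame_digest` is a hypothetical helper, and the inline fill stands in for whatever fill function is under test.

```python
import hashlib

import numpy as np
import pandas as pd


def frame_digest(df: pd.DataFrame) -> str:
    """SHA-256 over the per-row hashes of a frame, index included."""
    row_hashes = pd.util.hash_pandas_object(df, index=True).to_numpy()
    return hashlib.sha256(row_hashes.tobytes()).hexdigest()


raw = pd.DataFrame({"spindle_load": [0.0, np.nan, 2.0, np.nan]})

# Two independent runs over identical raw input must hash identically.
first = raw.interpolate(method="linear", limit_area="inside")
second = raw.copy().interpolate(method="linear", limit_area="inside")
assert frame_digest(first) == frame_digest(second)
```

Storing the digest alongside the algorithm version gives a cheap regression signal: a changed digest on unchanged input means the fill is no longer reproducible.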

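Beyond unit tests, validate the materialized series in the time-series database. The TimescaleDB query below aggregates the is_interpolated flag per asset and eight-hour bucket and returns only the buckets whose synthesized fraction exceeds a 5% budget, so it can feed an alert directly. The buckets follow time_bucket's default alignment, so they coincide with real shifts only if shift starts fall on those boundaries. The query does not by itself prove that no fill exceeded the horizon; that requires checking the length of consecutive synthesized runs.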

-- Synthesized-sample ratio per asset per shift; alert if it exceeds budget
SELECT asset_id,
       time_bucket('8 hours', ts) AS shift,
       avg(is_interpolated::int) AS synth_ratio
FROM telemetry_clean
GROUP BY asset_id, shift
HAVING avg(is_interpolated::int) > 0.05
ORDER BY synth_ratio DESC;

A high synth_ratio is a data-quality signal in its own right: it usually means a flaky gateway or a sensor approaching failure, not a fill bug.
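
The horizon check that the ratio query cannot express can be done on the exported flag column. A minimal sketch, assuming one asset's rows in time order; `longest_filled_run` is a hypothetical helper.

```python
import pandas as pd


def longest_filled_run(is_interpolated: pd.Series) -> int:
    """Length, in samples, of the longest run of consecutive synthesized rows."""
    flags = is_interpolated.astype(bool)
    run_id = (~flags).cumsum()            # every real sample starts a new run id
    runs = flags.astype(int).groupby(run_id).sum()
    return int(runs.max()) if len(runs) else 0
```

Comparing the result against the configured horizon in samples (for example 30 at a 1 s scan cycle and a 30 s horizon) flags any fill that ran too long. It will not detect a long gap that was only partially filled, which needs the original gap lengths as well.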

Performance and scale considerations

Across a fleet of millions of asynchronous streams, the fill must stay O(n) per series and bounded in memory. Vectorized pandas/NumPy interpolation is the only acceptable path; row-wise Python iteration collapses throughput, and hand-rolled threaded loops make the order of floating-point operations, and therefore the last bits of the result, harder to guarantee. Two deployment patterns cover most plants. Stream-first: apply sliding-window interpolation at the edge gateway for real-time HMI dashboards, accepting that only short interior gaps can be filled because future anchors are not yet available. Batch reconciliation: run horizon-bounded interpolation nightly against the raw Parquet or Delta Lake tables, overwriting only is_interpolated = True records so the write is idempotent under replay.

Memory is bounded by chunk size times the number of live assets, so partition by asset_id and never split a single gap across chunk boundaries — doing so breaks both correctness (no anchor on one side) and floating-point reproducibility. Persist gap-duration distributions, fill rates, and the algorithm version into a lineage table so the synthesized fraction is queryable for audit and so method changes are traceable. Finally, the fill must stay honest to the OEE math: availability is

$A = \frac{\text{Planned Production Time} - \text{Downtime}}{\text{Planned Production Time}}$

and synthesized intervals should be excluded from the measurement basis unless process engineering signs off. Coordinate the fill flag with OEE formula validation so a weighted confidence_score decays with gap duration — treating gap filling as a constrained, auditable transformation rather than a statistical convenience is what keeps OEE, predictive maintenance, and digital-twin analytics defensible.
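
A short worked example shows how much a hidden outage moves the availability figure. The sketch below implements the formula above directly. The linear shape of `confidence_score` is an assumption made for illustration; the page only states that the score decays with gap duration, and both function names are hypothetical.

```python
def availability(planned_s: float, downtime_s: float) -> float:
    """A = (planned production time - downtime) / planned production time."""
    if planned_s <= 0:
        raise ValueError("planned production time must be positive")
    return (planned_s - downtime_s) / planned_s


def confidence_score(gap_seconds: float, horizon_seconds: float) -> float:
    """ASSUMED linear decay: 1.0 for a zero-length gap, 0.0 at the fill horizon."""
    if horizon_seconds <= 0:
        raise ValueError("horizon must be positive")
    return max(0.0, 1.0 - gap_seconds / horizon_seconds)


# An 8 h shift (28 800 s) with a 30 min (1 800 s) outage.
honest = availability(28_800, 1_800)   # outage counted as downtime
masked = availability(28_800, 0)       # outage interpolated over and read as running
```

With the outage counted, availability is 27 000 / 28 800 = 0.9375. If a fill smooths the same thirty minutes into apparent run time, the figure becomes 1.0, which is the inflation the audit flag exists to prevent.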

Related

Ingestion & Cleaning Workflows — parent overview of the full ingestion and cleaning pipeline
Implementing linear interpolation for missing sensor values — the vectorized, boundary-safe mechanics for continuous tags
Clock drift correction — temporal alignment that must run before any fill
Outlier detection methods — quarantining anomalies so they become gaps, not anchors
Async batch processing — sealing the contiguous windows this stage fills
OEE formula validation — pinning how synthesized intervals weight Availability, Performance, and Quality
