Time-Series Database Sync for Manufacturing Telemetry and OEE Workflows

Time-series database synchronization is the stage where volatile edge measurements become durable, time-aligned, queryable history — and it is the persistence concern within Core Architecture & Data Mapping. Get the sync contract wrong and every downstream availability and cycle-time number inherits the defect: out-of-order writes fragment sliding windows, duplicate events inflate counters, and clock drift smears event boundaries until the published OEE no longer matches what the line actually did. This page defines the write contract a manufacturing time-series database (TSDB) must honor, a production ingestion implementation, the failure modes that surface on real factory floors, how to verify the data landed correctly, and how the design holds up at billions of rows.

Architecture: the sync write contract

A TSDB sync layer is a pipeline of deterministic transforms between two boundaries. Its input is a stream of standardized telemetry tuples emerging from the MQTT topic hierarchy — each carrying a canonical tag built during PLC tag standardization, a source timestamp, a value, and an OPC UA quality code. Its output is a row in a partitioned hypertable that is UTC-normalized, schema-valid, and idempotent on (tag_id, ts). The invariant binding the two: replaying the same payload any number of times must leave the series unchanged, and no payload may be written with a timestamp the pipeline cannot vouch for.

Stage	Responsibility	Hard constraint
Decode	Parse topic + payload into a typed record	Reject unparseable topics; never fabricate fields
Normalize	Coerce time to UTC, cast types, clamp precision	All `ts` are timezone-aware UTC; values pass `decimal` quantization
Validate	Schema + quality + temporal-bound check	Drift beyond budget and bad-quality codes are quarantined, not stored
Write	Idempotent batched upsert	`ON CONFLICT (tag_id, ts) DO NOTHING`
Rollup	Continuous aggregate for OEE windows	Aggregates derive only from validated rows

Because the canonical tag_id becomes the TSDB partition/segment key, the naming topology chosen upstream is also the storage topology here — aligning the address to ISA-95 levels (Site → Area → Work Center → Work Unit) keeps per-line OEE queries bounded to a small set of series.

Core concept and design contract

The contract has four non-negotiable rules, each defending a specific OEE computation downstream.

1. Timestamps belong to the edge, normalized to UTC at the ingestion boundary. The instant a measurement was generated is the only timestamp that carries process meaning; the instant it was received is an artifact of network latency. Assigning ingestion time silently shifts every event by the queueing delay, which corrupts the downtime intervals feeding the availability ratio:

$\text{Availability} = \frac{\text{Planned Production Time} - \text{Downtime}}{\text{Planned Production Time}}$

The pipeline therefore preserves the edge timestamp and converts it to UTC once, at the boundary — never per query.

2. Writes are idempotent. MQTT QoS 1 guarantees at-least-once delivery, so duplicate payloads are a certainty, not an exception. A unique constraint on (tag_id, ts) plus ON CONFLICT ... DO NOTHING makes redelivery harmless. Relying on the broker for exactly-once (QoS 2) is slower and still does not protect against application-level retries.

3. Quality codes are first-class. An OPC UA StatusCode in the Bad range (0x8000_0000+) means the controller itself distrusts the value; persisting it as if it were Good poisons aggregates. The schema keeps quality as a stored column so analysts can filter, and the validator quarantines values whose quality falls outside the accepted range.

4. Out-of-window data is quarantined, not silently dropped or blindly trusted. A timestamp far outside the allowable clock-drift budget signals a desynchronized edge clock; it goes to a dead-letter queue (DLQ) for inspection rather than fragmenting the series.
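Rule 1 can be made concrete with a small calculation. The sketch below assumes an 8-hour planned production window and one five-minute stop, and it invents two queueing delays (1 s on the stop event, 40 s on the resume event) purely for illustration. It compares the availability computed from edge timestamps with the one computed from ingestion timestamps.

```python
from datetime import datetime, timedelta, timezone


def availability(planned: timedelta, downtime: timedelta) -> float:
    """Availability = (planned production time - downtime) / planned production time."""
    return (planned - downtime) / planned


planned = timedelta(hours=8)  # assumed shift length

# Edge timestamps: the line stopped at 10:00:00 and resumed at 10:05:00 UTC.
stop_edge = datetime(2026, 6, 26, 10, 0, 0, tzinfo=timezone.utc)
resume_edge = datetime(2026, 6, 26, 10, 5, 0, tzinfo=timezone.utc)

# Hypothetical queueing delays: the resume event sat behind a broker backlog.
stop_ingest = stop_edge + timedelta(seconds=1)
resume_ingest = resume_edge + timedelta(seconds=40)

edge_downtime = resume_edge - stop_edge
ingest_downtime = resume_ingest - stop_ingest

edge_availability = availability(planned, edge_downtime)
ingest_availability = availability(planned, ingest_downtime)

# The difference between the two delays shows up as phantom downtime.
phantom = ingest_downtime - edge_downtime
print(f"edge={edge_availability:.4%} ingest={ingest_availability:.4%} phantom={phantom}")
```

A constant delay would leave the interval length intact; it is the variation in delay between the two events that distorts the downtime and therefore the ratio.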

Implementation: normalization, validation, and idempotent write

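Construct ingestion as explicit, typed transforms. The normalization routine below validates with pydantic, anchors every timestamp to UTC, routes malformed payloads to the dead-letter queue, and writes in a single idempotent batch with asyncpg. Precision control (see precision and rounding limits) is not part of this routine; it is applied to values before they reach it, as shown under micro-oscillation noise: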

import asyncpg
from pydantic import BaseModel, Field, ValidationError, field_validator
from typing import Optional
from datetime import datetime, timezone


class TelemetryRecord(BaseModel):
    tag_id: str                                  # canonical ISA-95 address
    timestamp: datetime
    value: float
    quality: int = Field(ge=0, le=192)           # accepted band; assumes a compact gateway-mapped code, not a raw 32-bit StatusCode
    source_ip: Optional[str] = None

    @field_validator("timestamp")
    @classmethod
    def to_utc(cls, ts: datetime) -> datetime:
        # Naive timestamps are rejected: tz context must survive from the edge.
        if ts.tzinfo is None:
            raise ValueError("timestamp must be timezone-aware")
        return ts.astimezone(timezone.utc)


async def ingest_batch(pool: asyncpg.Pool, payloads: list[dict]) -> int:
    validated: list[TelemetryRecord] = []
    for p in payloads:
        try:
            validated.append(TelemetryRecord(**p))
        except ValidationError as exc:
            log_to_dlq(p, error=str(exc))         # malformed → dead-letter, never stored
            continue

    if not validated:
        return 0

    async with pool.acquire() as conn:
        async with conn.transaction():
            await conn.executemany(
                """
                INSERT INTO telemetry_raw (tag_id, ts, value, quality, source_ip)
                VALUES ($1, $2, $3, $4, $5)
                ON CONFLICT (tag_id, ts) DO NOTHING
                """,
                [(r.tag_id, r.timestamp, r.value, r.quality, r.source_ip)
                 for r in validated],
            )
    return len(validated)                         # rows submitted; duplicates skipped by ON CONFLICT are still counted
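The listing calls `log_to_dlq` without showing it. One possible shape for that helper, offered as an assumption and not as the project's actual dead-letter sink, is an append-only JSON Lines file. The file name `telemetry_dlq.jsonl` and the `path` parameter are hypothetical; a production DLQ might equally be a broker topic or a database table.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Any

DLQ_PATH = Path("telemetry_dlq.jsonl")  # hypothetical location


def log_to_dlq(payload: dict[str, Any], error: str, path: Path = DLQ_PATH) -> None:
    """Append one rejected payload, with the rejection reason, as a JSON line."""
    entry = {
        "rejected_at": datetime.now(timezone.utc).isoformat(),
        "error": error,
        "payload": payload,
    }
    with path.open("a", encoding="utf-8") as fh:
        # default=str keeps non-JSON types (datetime, Decimal) from raising here
        fh.write(json.dumps(entry, default=str) + "\n")
```

Keeping the original payload next to the error text lets an operator replay a record after the upstream fault (for example a naive timestamp) has been corrected.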

The standardized topic carries the routing metadata the writer needs to pick a partition tier or schema. Parse it once into structured fields rather than re-splitting strings per query:

import re

TOPIC_PATTERN = re.compile(
    r"^manufacturing/(?P<plant>[^/]+)/(?P<line>[^/]+)/"
    r"(?P<station>[^/]+)/(?P<metric>[^/]+)/(?P<tag>[^/]+)$"
)


def parse_mqtt_metadata(topic: str) -> dict[str, str]:
    match = TOPIC_PATTERN.match(topic)
    if not match:
        raise ValueError(f"invalid topic structure: {topic}")
    return match.groupdict()

The persistence target is a TimescaleDB hypertable: time-partitioned for write throughput, with a continuous aggregate that pre-computes the per-minute rollups OEE windows read, and a compression policy that reclaims storage once chunks go cold:

CREATE TABLE telemetry_raw (
    ts        TIMESTAMPTZ      NOT NULL,
    tag_id    VARCHAR(64)      NOT NULL,
    value     DOUBLE PRECISION,
    quality   SMALLINT,
    source_ip INET,
    PRIMARY KEY (tag_id, ts)              -- enforces idempotency
);

SELECT create_hypertable('telemetry_raw', 'ts',
                         chunk_time_interval => INTERVAL '7 days');

CREATE MATERIALIZED VIEW telemetry_1min_agg
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 minute', ts) AS bucket,
       tag_id,
       avg(value)  AS avg_val,
       max(value)  AS max_val,
       min(value)  AS min_val,
       count(*)    AS sample_count
FROM telemetry_raw
WHERE quality <= 192                      -- aggregate only Good-quality samples
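GROUP BY bucket, tag_id;

-- Schedule incremental refresh; the offsets below are illustrative starting points
SELECT add_continuous_aggregate_policy('telemetry_1min_agg',
    start_offset      => INTERVAL '1 hour',
    end_offset        => INTERVAL '1 minute',
    schedule_interval => INTERVAL '1 minute');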

ALTER TABLE telemetry_raw SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'tag_id',
    timescaledb.compress_orderby   = 'ts DESC'
);
SELECT add_compression_policy('telemetry_raw', INTERVAL '14 days');
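The `quality <= 192` filter and the `SMALLINT` column only work if the gateway maps quality into a compact code. A raw OPC UA StatusCode is a 32-bit unsigned value whose two most significant bits carry the severity, so it would not fit that column or that comparison. If raw codes are forwarded instead, a classifier along the following lines could sit in the validator. The names `Severity` and `status_severity` are hypothetical, and the sketch assumes the severity layout from the OPC UA specification (00 Good, 01 Uncertain, 10 Bad, 11 reserved).

```python
from enum import Enum


class Severity(Enum):
    GOOD = 0b00
    UNCERTAIN = 0b01
    BAD = 0b10


def status_severity(status_code: int) -> Severity:
    """Classify a raw 32-bit OPC UA StatusCode by its two most significant bits."""
    if not 0 <= status_code <= 0xFFFFFFFF:
        raise ValueError("StatusCode must fit in 32 unsigned bits")
    bits = status_code >> 30
    if bits == 0b11:
        raise ValueError("severity bits 11 are reserved")
    return Severity(bits)


def is_storable(status_code: int) -> bool:
    """Accept only Good samples; Uncertain and Bad go to quarantine."""
    return status_severity(status_code) is Severity.GOOD
```

Whether Uncertain samples are stored or quarantined is a policy choice; the sketch takes the stricter option.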

Edge cases and failure modes

Real factories break this pipeline in predictable ways. Each needs an explicit defense.

Edge clock drift. Acquisition nodes run independent hardware clocks that drift relative to the plant reference, fragmenting time-series continuity. Anchor every gateway to a plant NTP source (chrony handles the step-and-slew corrections factory clocks need better than legacy ntpd) and validate incoming timestamps against a drift budget at the boundary. Because the check compares the edge timestamp with the current time, run it when a message is first received, not when a buffered batch is replayed after an outage; otherwise a recovered backlog would be quarantined wholesale. The full configuration is covered in syncing edge timestamps with NTP servers, and cross-plant offset reconciliation in clock drift correction.

from datetime import datetime, timedelta, timezone

MAX_CLOCK_DRIFT = timedelta(seconds=2.5)      # tune to network jitter + scan cycle


def timestamp_in_window(edge_ts: datetime) -> bool:
    now = datetime.now(timezone.utc)
    return abs(now - edge_ts) <= MAX_CLOCK_DRIFT
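The check above reads the wall clock internally, which makes it awkward to test and ties it to the moment of the call. A variant that takes the receipt time as an argument is sketched here; `split_by_drift` and its `received_at` parameter are hypothetical names, and the budget is the same 2.5 s assumed in the listing.

```python
from datetime import datetime, timedelta, timezone

MAX_CLOCK_DRIFT = timedelta(seconds=2.5)  # same budget as the listing above

Record = tuple[str, datetime]  # (tag_id, edge timestamp)


def split_by_drift(
    records: list[Record],
    received_at: datetime,
    budget: timedelta = MAX_CLOCK_DRIFT,
) -> tuple[list[Record], list[Record]]:
    """Partition records into (accepted, quarantined) by distance from receipt time."""
    accepted: list[Record] = []
    quarantined: list[Record] = []
    for tag_id, edge_ts in records:
        if edge_ts.tzinfo is None:
            # A naive timestamp cannot be compared safely; treat it as untrusted.
            quarantined.append((tag_id, edge_ts))
        elif abs(received_at - edge_ts) <= budget:
            accepted.append((tag_id, edge_ts))
        else:
            quarantined.append((tag_id, edge_ts))
    return accepted, quarantined
```

Passing `received_at` explicitly also lets a replay path supply the original receipt time of a buffered message and not the time of the replay.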

Micro-oscillation noise. Sensors emit IEEE 754 floats with excess precision; sub-threshold flutter inflates row counts and triggers false state changes. Apply a deadband and quantize with decimal before the write so binary rounding artifacts never reach aggregation:

from decimal import Decimal, ROUND_HALF_UP, InvalidOperation

PRECISION = Decimal("0.01")
DEADBAND = Decimal("0.05")


def deadband_round(current: float, last_stored: Decimal | None) -> Decimal | None:
    try:
        cur = Decimal(str(current))            # str() avoids binary-expansion noise
    except InvalidOperation:
        return None
    if not cur.is_finite():
        return None                            # NaN/Infinity would raise in compare or quantize
    if last_stored is not None and abs(cur - last_stored) < DEADBAND:
        return None                            # suppress: no meaningful change
    return cur.quantize(PRECISION, rounding=ROUND_HALF_UP)
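The deadband function needs the last stored value for each tag, which the listing leaves to the caller. A minimal way to hold that state is sketched below. `DeadbandFilter` is a hypothetical helper that keeps the state in process memory; a real deployment would also need to decide how that state survives a restart.

```python
from decimal import Decimal, ROUND_HALF_UP
from typing import Optional

PRECISION = Decimal("0.01")
DEADBAND = Decimal("0.05")


class DeadbandFilter:
    """Tracks the last stored value per tag and suppresses sub-deadband changes."""

    def __init__(self) -> None:
        self._last: dict[str, Decimal] = {}

    def accept(self, tag_id: str, value: float) -> Optional[Decimal]:
        """Return the quantized value to store, or None to suppress the sample."""
        cur = Decimal(str(value))
        if not cur.is_finite():
            return None                      # NaN / Infinity are never stored
        last = self._last.get(tag_id)
        if last is not None and abs(cur - last) < DEADBAND:
            return None                      # flutter: keep the previous row
        stored = cur.quantize(PRECISION, rounding=ROUND_HALF_UP)
        self._last[tag_id] = stored          # compare future samples to what was stored
        return stored


flt = DeadbandFilter()
readings = [72.4, 72.43, 72.46, float("nan")]
stored = [flt.accept("temp_pv", r) for r in readings]
```

The first sample for a tag is always stored, a 0.03 move is suppressed, and a 0.06 move from the stored value passes.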

Broker failover and backpressure. A reconnecting consumer replays in-flight QoS 1 messages — handled by the idempotent write — but a TSDB stall during peak shift can drown the consumer. Wrap writes in bounded retry with jittered backoff, and when the database stays unreachable, spill to a local SQLite buffer and replay on recovery rather than dropping data or growing memory without limit:

import asyncpg
import structlog
from tenacity import (retry, stop_after_attempt, wait_random_exponential,
                      retry_if_exception_type)
from asyncpg.exceptions import (ConnectionDoesNotExistError,
                                TooManyConnectionsError)

log = structlog.get_logger()


@retry(
    stop=stop_after_attempt(5),
    wait=wait_random_exponential(multiplier=1, max=30),  # exponential backoff with random jitter
    retry=retry_if_exception_type(
        (ConnectionDoesNotExistError, TooManyConnectionsError)),
    reraise=True,
)
async def resilient_write(pool: asyncpg.Pool, query: str, params: tuple) -> None:
    try:
        async with pool.acquire() as conn:
            await conn.execute(query, *params)
    except Exception as exc:
        log.error("db_write_failed", error=str(exc))
        raise
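The spill-to-SQLite fallback mentioned above is not shown in the listing. A minimal sketch using only the standard library follows. `SpillBuffer`, its table layout, and the file name are hypothetical. Rows are removed only after the caller acknowledges a successful replay, which is safe because the idempotent insert makes a repeated replay harmless.

```python
import json
import sqlite3
from typing import Optional


class SpillBuffer:
    """Durable local FIFO for rows that could not be written to the TSDB."""

    def __init__(self, path: str = "telemetry_spill.db") -> None:  # hypothetical path
        self._db = sqlite3.connect(path)
        with self._db:
            self._db.execute(
                "CREATE TABLE IF NOT EXISTS spill ("
                "id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT NOT NULL)"
            )

    def spill(self, rows: list[tuple]) -> None:
        """Persist one failed batch in a single transaction."""
        with self._db:
            self._db.executemany(
                "INSERT INTO spill (payload) VALUES (?)",
                [(json.dumps(row, default=str),) for row in rows],
            )

    def peek(self, limit: int = 500) -> tuple[Optional[int], list[list]]:
        """Return the oldest rows without deleting them, plus the last id read."""
        fetched = self._db.execute(
            "SELECT id, payload FROM spill ORDER BY id LIMIT ?", (limit,)
        ).fetchall()
        last_id = fetched[-1][0] if fetched else None
        return last_id, [json.loads(payload) for _, payload in fetched]

    def ack(self, last_id: int) -> None:
        """Delete rows up to last_id once the replayed write has succeeded."""
        with self._db:
            self._db.execute("DELETE FROM spill WHERE id <= ?", (last_id,))
```

Timestamps come back from the buffer as ISO strings, so a replay path would pass the rows through the same validation model again before writing them.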

PLC scan-cycle misalignment. A 100 ms scan cycle sampling a 50 ms event will alias; the TSDB faithfully stores whatever the gateway sampled. Resolve this at acquisition (deadband-on-change publishing) rather than expecting the database to reconstruct missing transitions — that is the job of the gap-filling algorithms layer, not the sync write.
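The aliasing effect is easy to reproduce with a toy model. The sketch below treats the event as a rectangular pulse and the scan as an instantaneous read at fixed intervals. Both are simplifying assumptions, and `sample_signal` is a hypothetical helper.

```python
def sample_signal(pulse_start_ms: int, pulse_len_ms: int,
                  scan_ms: int, horizon_ms: int) -> list[int]:
    """Return the 0/1 value a scan would observe at each scan instant."""
    pulse_end_ms = pulse_start_ms + pulse_len_ms
    return [
        1 if pulse_start_ms <= t < pulse_end_ms else 0
        for t in range(0, horizon_ms, scan_ms)
    ]


# A 50 ms pulse starting at t = 120 ms falls entirely between the 100 ms and 200 ms scans.
missed = sample_signal(pulse_start_ms=120, pulse_len_ms=50, scan_ms=100, horizon_ms=400)

# The same pulse starting at t = 180 ms straddles the 200 ms scan and is observed once.
caught = sample_signal(pulse_start_ms=180, pulse_len_ms=50, scan_ms=100, horizon_ms=400)
```

Whether the event is recorded depends only on its phase relative to the scan, which is why no amount of downstream processing can recover it from the stored series.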

Verification and testing

Idempotency and UTC normalization are the two properties most worth a regression test, because both fail silently in production. A unit test asserting that a double-delivered batch leaves row count unchanged catches the highest-impact regression:

from datetime import datetime, timezone

import pytest


@pytest.mark.asyncio
async def test_double_delivery_is_idempotent(pool):
    payload = {
        "tag_id": "PLT01/L03/WELD_B/ROBOT_A1/analog/temp_pv",
        "timestamp": "2026-06-26T08:00:00+02:00",   # CEST → must store as 06:00Z
        "value": 72.4, "quality": 0,
    }
    assert await ingest_batch(pool, [payload]) == 1
    await ingest_batch(pool, [payload])             # QoS 1 redelivery
    row = await pool.fetchrow(
        "SELECT count(*) AS n, min(ts) AS ts FROM telemetry_raw WHERE tag_id=$1",
        payload["tag_id"])
    assert row["n"] == 1                            # no duplicate
    # asyncpg returns timestamptz values as UTC-aware datetimes, so a zero-offset
    # check always passes; compare against the expected instant instead.
    assert row["ts"] == datetime(2026, 6, 26, 6, 0, tzinfo=timezone.utc)

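Confirm the data landed correctly against the persisted series. The continuity and freshness query below lists every tag on one line that reported during the last shift but whose latest sample is now more than 30 seconds old. That pattern points to a stalled gateway, a dropped subscription, or an edge clock that has drifted far enough for its samples to be quarantined: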

-- Continuity + freshness check for one line over the last shift
SELECT tag_id,
       count(*)                                  AS samples,
       max(ts)                                   AS latest,
       now() - max(ts)                           AS staleness
FROM telemetry_raw
WHERE tag_id LIKE 'PLT01/L03/%'
  AND ts > now() - INTERVAL '8 hours'
GROUP BY tag_id
HAVING now() - max(ts) > INTERVAL '30 seconds'   -- flag stale series
ORDER BY staleness DESC;

Cross-checking the rolled-up sample counts against an independent line tally is the same reconciliation discipline applied during OEE formula validation: if the numbers disagree, the sync layer — not the math — is usually at fault.

Performance and scale considerations

A single line can generate millions of rows per day, so the sync layer lives or dies on partitioning and write discipline.

Batch writes, bound the queue. executemany over 500–1000 records amortizes round trips; cap in-flight batches so a TSDB stall produces backpressure (and triggers the spill buffer) rather than unbounded memory growth. This is the same throughput contract enforced in async batch processing.
Size chunks to the working set. A 7-day chunk_time_interval keeps recent, hot chunks in memory while letting compression reclaim cold ones; chunks that are too small explode planning overhead, too large defeat compression and retention granularity.
Read from continuous aggregates, not raw. OEE dashboards querying telemetry_1min_agg touch orders of magnitude fewer rows than scanning telemetry_raw; the materialized rollup is refreshed incrementally, not recomputed.
Control series cardinality. Because tag_id is the segment key, long or unstable canonical names multiply index size across the full retention horizon — another reason to enforce short, stable addresses at PLC tag standardization.
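Tier retention by topology. TimescaleDB compression and retention policies are set per hypertable and act on whole time chunks, so a high-rate vibration sensor and a once-a-minute temperature point can carry different retention only if they live in separate hypertables (for example one per rate tier) or if a scheduled job deletes old rows by tag_id prefix. Keeping tag_id aligned to ISA-95 Site/Line makes either option a simple prefix rule.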
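The first point in the list above, batching behind a bounded queue, can be sketched with `asyncio.Queue`. The names `batcher` and `write_batch` are hypothetical, and `None` is used as a shutdown sentinel by assumption. Producers call `await queue.put(record)`, which blocks once `maxsize` is reached, so a slow writer slows the producers down and memory use stays bounded.

```python
import asyncio
from typing import Any, Awaitable, Callable


async def batcher(
    queue: asyncio.Queue,
    write_batch: Callable[[list[Any]], Awaitable[None]],
    batch_size: int = 500,
) -> None:
    """Drain a bounded queue into batches of at most batch_size records."""
    while True:
        item = await queue.get()              # wait for at least one record
        if item is None:                      # sentinel: shut down
            return
        batch = [item]
        stop = False
        while len(batch) < batch_size:
            try:
                nxt = queue.get_nowait()      # take whatever is already queued
            except asyncio.QueueEmpty:
                break
            if nxt is None:
                stop = True
                break
            batch.append(nxt)
        await write_batch(batch)              # producers stay blocked while this runs
        if stop:
            return


async def demo() -> list[list[int]]:
    queue: asyncio.Queue = asyncio.Queue(maxsize=4)   # bound on queued records
    written: list[list[int]] = []

    async def fake_write(batch: list[int]) -> None:   # stand-in for the DB write
        written.append(list(batch))

    task = asyncio.create_task(batcher(queue, fake_write, batch_size=3))
    for i in range(7):
        await queue.put(i)                    # blocks whenever the queue is full
    await queue.put(None)
    await task
    return written
```

In a real consumer `write_batch` would wrap the idempotent insert, and a failure there would be the point at which a spill buffer takes over.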

When these defenses hold together — UTC-normalized edge timestamps, idempotent writes, quality-aware aggregates, and partitioning that matches the plant topology — the persisted series is a faithful record of the floor, and the OEE computed from it reflects production reality rather than infrastructure artifacts.

Related

Core Architecture & Data Mapping — the parent overview for this subsystem.
Syncing edge timestamps with NTP servers — chrony configuration and drift budgeting for edge gateways.
MQTT Topic Hierarchies — the namespace that feeds the canonical tag_id into the writer.
PLC Tag Standardization — the canonical address that becomes the TSDB partition key.
Precision & Rounding Limits — IEEE 754 boundaries enforced before persistence.
Async Batch Processing — the throughput and backpressure contract for high-rate writes.

For implementation reference, see the TimescaleDB hypertable documentation and the Python decimal module.
