Cross-venue reconciliation: designing a matching engine that tolerates divergence

Alpha Equations · 22 April 2026 · 11 min read

Contents

The three divergence classes are not the same problem
The canonical key is firm-generated, not borrowed from either side
Bi-temporal fields separate when a trade happened from when it was recorded
Append-only fill events project state; they do not store it
The escalation contract routes by class, not by severity
Prior approaches treat divergence as error

A cross-venue matching engine achieves near-real-time auditability through explicit tolerance across three structural divergence classes: order-ID shape, timestamp skew, and partial-fill fragmentation. Every exception the engine cannot resolve deterministically routes through a tiered escalation chain. This post covers the engine's canonical-key design, its bi-temporal timestamp model, its append-only fill-event log and pair-matching semantics, the three exception conditions, and the shape of the handoff contract between the deterministic tier and the agent above it. Connectivity and post-trade workflows are out of scope. The engine operates on the firm's own-capital execution only, not on third-party order flow.

The three divergence classes are not the same problem

When a trading system and an execution venue both record the same fill, the two records are never identical. Three structural differences separate the problem into three distinct territories, each of which requires a different tolerance mechanism. A design that applies a single matching rule across all three produces both false positives and false negatives simultaneously.

The first class is order-ID shape divergence. The FIX execution-report lifecycle[1] distinguishes three identity fields: ClOrdID, which is client-assigned and changes on each replace or cancel request; OrderID, which is venue-assigned and stable through the order's lifecycle; and ExecID, which is a per-fill identifier. A single logical order can therefore accumulate many ClOrdID values across a replace chain, while the venue assigns one OrderID to the order and a separate ExecID to each fill. Neither side's native identifier is usable as a single key across the full lifecycle: ClOrdID mutates, and OrderID and ExecID exist only once the venue has responded. ISO 20022 formalises the same problem at a different layer: the ClientTransactionIdentification field threads execution identity through to settlement confirmation, but no standard prescribes how to normalise cross-venue identifier shapes before that layer is reached.

The second class is timestamp skew. Both sides record event timestamps, but they record them from different clocks. The trading system's ingest clock and the venue's matching-engine clock diverge by a measurement uncertainty that is real and bounded, not random noise. RFC 5905[2] reports that typical secondary servers and clients on a fast local network achieve offset accuracy within a few hundred microseconds; primary servers achieve tens of microseconds. Those bounds are small but non-zero, and they are asymmetric: propagation delay on the upstream path differs from the downstream path, so both sides observe the same event from slightly different positions on the time axis. Treating that difference as a data-quality error to be cleaned before matching ignores the physics. It is a measurement uncertainty that the matching engine must model.
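The bounded, asymmetric nature of that uncertainty can be illustrated with the on-wire offset and delay calculation that RFC 5905 defines over four timestamps. The sketch below is a minimal illustration only: the function name and the sample path delays are hypothetical, and all values are in microseconds.

```python
def ntp_offset_and_delay(t1, t2, t3, t4):
    """RFC 5905 on-wire calculation: returns (offset, round_trip_delay).

    t1: request sent (local clock)     t2: request received (peer clock)
    t3: reply sent (peer clock)        t4: reply received (local clock)
    """
    offset = ((t2 - t1) + (t3 - t4)) / 2
    delay = (t4 - t1) - (t3 - t2)
    return offset, delay


# Hypothetical exchange with an asymmetric path (microseconds).
TRUE_OFFSET = 100                      # peer clock runs 100 us ahead
UPSTREAM, DOWNSTREAM, PROCESSING = 300, 100, 50

t1 = 0
t2 = t1 + UPSTREAM + TRUE_OFFSET       # arrival, read on the peer clock
t3 = t2 + PROCESSING                   # departure, read on the peer clock
t4 = t3 - TRUE_OFFSET + DOWNSTREAM     # arrival, read on the local clock

offset, delay = ntp_offset_and_delay(t1, t2, t3, t4)
error = abs(offset - TRUE_OFFSET)      # caused purely by path asymmetry
```

With these numbers the estimated offset differs from the true offset by half the difference between the two path delays, and that error never exceeds half the measured round-trip delay. That is the sense in which the uncertainty is bounded rather than random.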

The third class is partial-fill fragmentation. A large order may fill across many smaller exchange executions over an extended window. The trading system may record each leg individually while the venue summarises them into one aggregated fill event; the inverse pattern, one trading-system record matching many exchange fill reports, is equally common. Neither case is an error on either side. Both are correct representations of the same underlying execution at different levels of granularity. Snapshot comparison cannot resolve this class: only a representation of fill state that accumulates over time without losing the individual events makes the fragmentation visible as a structural fact rather than a missing record.

These three structural differences determine the architecture of the canonical key that the matching engine builds before any comparison takes place.

The canonical key is firm-generated, not borrowed from either side

The matching engine requires a stable identity for every logical order across its full lifecycle. Neither side's native identifier is safe as the canonical anchor. The trading-system identifier mutates across a replace or cancel chain — the OrigClOrdID chain in FIX terminology links the history, but the chain itself is not a single stable key. The venue identifier arrives asynchronously: the venue's OrderID may not be present in the trading system's internal record at the moment the first reconciliation comparison runs, because the venue acknowledgement has not yet arrived.

A firm-generated canonical reconciliation key — onto which both sides' identifiers are mapped as views — is the only design that survives the full order lifecycle, including replacement chains, partial fills, and late venue acknowledgements. The key is generated at order creation time, before any venue interaction occurs. It does not encode venue fields, timestamp fragments, or price information. Its independence from both sides' data is what makes it stable when either side's identifier changes shape.

The mapping layer is a deterministic one-to-many projection. One canonical key maps to many ClOrdID values across a replace chain and to many ExecID values across a partial-fill sequence. The reverse mapping is always many-to-one: any identifier or fill event on either side resolves to exactly one canonical parent. The matching engine's comparison always starts from the canonical key, never from either side's native identifier.

The key-mapping layer handles order-ID shape divergence entirely. A ClOrdID replace chain on the trading-system side and an OrderID on the venue side both project onto the same canonical parent before any comparison runs. The matching engine sees two ChildFill events with the same parent key. It does not see two records with different identifier shapes.
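A minimal sketch of such a mapping layer follows. It is an illustration under stated assumptions, not the engine's implementation: the `KeyRegistry` class, its method names, the use of a random UUID as the canonical key, and the sample identifier values are all hypothetical.

```python
import uuid
from typing import Dict, Optional, Tuple


class KeyRegistry:
    """Maps source-side identifiers onto firm-generated canonical keys."""

    def __init__(self) -> None:
        # (id_kind, id_value) -> canonical key; many native ids, one parent.
        self._by_native: Dict[Tuple[str, str], str] = {}

    def new_order(self) -> str:
        # Generated at order creation, before any venue interaction.
        # A random UUID encodes no venue, timestamp, or price information.
        return uuid.uuid4().hex

    def bind(self, canonical_key: str, id_kind: str, id_value: str) -> None:
        existing = self._by_native.setdefault((id_kind, id_value), canonical_key)
        if existing != canonical_key:
            raise ValueError(f"{id_kind}={id_value} is bound to another key")

    def resolve(self, id_kind: str, id_value: str) -> Optional[str]:
        # None means no parent is registered for this identifier.
        return self._by_native.get((id_kind, id_value))


registry = KeyRegistry()
key = registry.new_order()
registry.bind(key, "ClOrdID", "A-1")     # original order
registry.bind(key, "ClOrdID", "A-2")     # replace request: new ClOrdID
registry.bind(key, "OrderID", "V-9001")  # venue acknowledgement, arrives later
registry.bind(key, "ExecID", "E-1")      # one ExecID per fill
```

Every identifier bound above resolves to the same parent, so a comparison can start from `key` regardless of which side produced the record. An identifier that resolves to `None` corresponds to a fill with no registered parent.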

With the key fixed, the time problem requires its own model.

Bi-temporal fields separate when a trade happened from when it was recorded

Every fill event on both sides of the reconciliation carries two independent time dimensions. Valid time is when the venue reports the event occurred — the execution timestamp in the fill report. Transaction time is when the firm's ingest layer received and stored the event. These are not the same field. Conflating them produces false divergence signals: a fill that arrived late at the ingest layer will appear to differ from the venue's record if only one timestamp is compared, even when the underlying execution times agree exactly.

The bi-temporal model makes this distinction formal. Jensen and Snodgrass's consensus glossary[3] defines valid time as when a fact is true in the modelled reality and transaction time as when the database stores it. SQL:2011 formalises both dimensions as first-class temporal attributes of a table, with SYSTEM_TIME managing the transaction axis and an application-declared PERIOD managing the valid-time axis. A reconciliation engine that compares two records at the same valid-time slice can distinguish a genuine execution-time difference from an ingestion-delay artefact — because the two dimensions are stored and queried independently.

The tolerance window on the valid-time comparison is not a static constant. It is a dynamic interval bounded by measured clock-synchronisation dispersion: the maximum expected offset between the firm's clock and the venue's clock at the moment of the event. RFC 5905 provides the physical grounding: the dispersion bound is a function of the round-trip delay and the jitter on the synchronisation path, not an arbitrary guess. A tighter measured dispersion produces a narrower tolerance window; a wider dispersion widens it. The window is sized as a small-integer multiple of the measured dispersion, so it tracks the actual clock relationship rather than a fixed engineering constant.
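The sizing rule can be sketched in a few lines. The function names and the default multiple of 3 are assumptions made for illustration; the text above says only that the multiple is a small integer.

```python
from datetime import datetime, timedelta, timezone


def tolerance_window(measured_dispersion: timedelta, multiple: int = 3) -> timedelta:
    """Window that scales with the measured clock dispersion."""
    # The multiple of 3 is a placeholder, not a value from the source.
    return multiple * measured_dispersion


def within_tolerance(valid_a: datetime, valid_b: datetime, window: timedelta) -> bool:
    """Compare valid times only; transaction time plays no part here."""
    return abs(valid_a - valid_b) <= window


base = datetime(2026, 4, 22, 9, 30, 0, tzinfo=timezone.utc)
firm_valid = base
venue_valid = base + timedelta(microseconds=500)

wide = tolerance_window(timedelta(microseconds=200))    # 600 us window
narrow = tolerance_window(timedelta(microseconds=100))  # 300 us window
```

The same 500-microsecond gap falls inside the window when dispersion is measured at 200 microseconds and outside it when dispersion tightens to 100 microseconds.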

Late or out-of-order fill events are not exceptions under this model. When a fill arrives after its pair, the projection is rebuilt from the log and the pair-matching test is re-run against the bi-temporal window. If the pair is found within the tolerance interval, the match closes. Bi-temporal semantics handle reordering natively, without requiring the matching engine to maintain a separate late-arrival queue or a corrective reprocessing path.

Figure: state diagram of the ChildFill lifecycle, with transitions labelled valid-time, transaction-time, and grace-window. Ingest assigns both temporal axes; the pair-matching gate tests valid-time alignment within the tolerance window; grace-window expiry is the only path to a missing-pair exception.

Append-only fill events project state; they do not store it

The standard approach to tracking a partial fill is an in-place CumQty accumulator: each fill event updates a running total, and the current quantity filled is always available as a single field. That design loses the individual fill events. When a late or out-of-order fill arrives, the accumulator has no history to reconcile against — the fill is either applied in-place (which may be wrong) or flagged as an exception (which may be unnecessary). The accumulator makes reordering a structural problem.

An append-only ChildFill event log avoids this. Each fill is an immutable event; order state is a projection folded on demand from the log, not an authoritative record mutated by each arrival. Martin Fowler's event-sourcing formulation[4] captures this as recording all changes to application state as a sequence of events from which state can be reconstructed by replay. Jay Kreps[5] formalises the corollary: two identical deterministic processes starting from the same state and receiving the same inputs in the same order produce the same output. The ChildFill log applies this to both sides simultaneously — the trading system's fill stream and the venue's fill stream are each append-only logs whose projections are compared rather than merged.
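The difference from an in-place accumulator can be shown with a reduced event record. This is a sketch under assumptions: `FillEvent` carries only the fields needed for the illustration, and the names, quantities, and timestamps are hypothetical.

```python
from decimal import Decimal
from typing import Iterable, List, NamedTuple


class FillEvent(NamedTuple):
    exec_id: str
    side: str            # "trading-system" | "exchange"
    qty: Decimal
    valid_time_us: int   # assumption: epoch microseconds


def project_cum_qty(log: Iterable[FillEvent], side: str) -> Decimal:
    """Fold cumulative filled quantity for one side from the log."""
    return sum((e.qty for e in log if e.side == side), Decimal("0"))


def project_legs(log: Iterable[FillEvent], side: str) -> List[str]:
    """Individual legs survive in the projection, ordered by valid time."""
    legs = sorted((e for e in log if e.side == side), key=lambda e: e.valid_time_us)
    return [e.exec_id for e in legs]


# Two trading-system legs, logged out of order, against one aggregated venue fill.
log = (
    FillEvent("T-2", "trading-system", Decimal("40"), 2_000),
    FillEvent("X-1", "exchange", Decimal("100"), 2_100),
    FillEvent("T-1", "trading-system", Decimal("60"), 1_000),
)

ts_qty = project_cum_qty(log, "trading-system")
venue_qty = project_cum_qty(log, "exchange")
```

Both sides project to the same cumulative quantity even though one side holds two legs and the other holds one, and the result does not depend on the order in which events were appended.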

The ChildFill event schema carries seven fields. The following pseudocode shows the record shape and the three invariant conditions that the matching engine evaluates on each pair-matching attempt:

// ChildFill — immutable fill event record
ChildFill {
  parent_reconciliation_key  // firm-generated; links to logical parent order
  child_exec_id              // source-side execution identifier; not canonical
  qty                        // fill quantity for this execution leg
  price                      // execution price for this leg
  source_side                // "trading-system" | "exchange"
  valid_time                 // when the venue reports the event occurred
  transaction_time           // when the firm's ingest layer stored this record
}

// Exception invariants: evaluated on each pair-matching attempt; Condition 2 can fire only once the grace window has elapsed
INVARIANT {
  // Condition 1: fill arrived for a key with no parent order registered
  IF parent_reconciliation_key NOT IN order_registry
    RAISE exception_class: unknown-key

  // Condition 2: no matching ChildFill found on the opposite source_side
  // within the bi-temporal tolerance window after grace window has elapsed
  IF pair NOT FOUND WITHIN tolerance_window AFTER grace_window
    RAISE exception_class: missing-pair

  // Condition 3: sum of qty on one source_side exceeds parent's authorised qty
  IF SUM(qty WHERE source_side = S) > parent_order.authorised_qty
    RAISE exception_class: over-fill
}

The three conditions are exhaustive. A ChildFill that satisfies none of them either matches its pair — closing the reconciliation — or remains open inside the grace window. Late arrivals that close before the grace window expires are invisible to the exception path. An event that produces no pair by window-close generates a missing-pair exception and nothing else. The matching engine has no fourth condition, no soft-flag path, and no silent-discard branch.
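One way to render the record and the three conditions as runnable code is sketched below. It is an illustration, not the engine's implementation: the `check_invariants` function, the shape of `order_registry` (canonical key to authorised quantity), the integer-microsecond timestamps, and the `pair_found` and `grace_elapsed` flags supplied by the caller are all assumptions.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Dict, Iterable, List


@dataclass(frozen=True)          # frozen: a fill event is immutable
class ChildFill:
    parent_reconciliation_key: str
    child_exec_id: str
    qty: Decimal
    price: Decimal
    source_side: str             # "trading-system" | "exchange"
    valid_time: int              # assumption: epoch microseconds
    transaction_time: int        # assumption: epoch microseconds


def check_invariants(fill: ChildFill,
                     order_registry: Dict[str, Decimal],
                     log: Iterable[ChildFill],
                     pair_found: bool,
                     grace_elapsed: bool) -> List[str]:
    """Return every exception class that applies to `fill` (possibly none)."""
    key = fill.parent_reconciliation_key
    # Condition 1: no parent registered; nothing further can be checked.
    if key not in order_registry:
        return ["unknown-key"]
    raised = []
    # Condition 2: no opposite-side pair once the grace window has elapsed.
    if grace_elapsed and not pair_found:
        raised.append("missing-pair")
    # Condition 3: same-side quantity exceeds the parent's authorised quantity.
    side_qty = sum((f.qty for f in log
                    if f.parent_reconciliation_key == key
                    and f.source_side == fill.source_side), Decimal("0"))
    if side_qty > order_registry[key]:
        raised.append("over-fill")
    return raised


registry = {"K1": Decimal("100")}
f1 = ChildFill("K1", "E1", Decimal("60"), Decimal("10.5"), "exchange", 1_000, 1_500)
f2 = ChildFill("K1", "E2", Decimal("50"), Decimal("10.5"), "exchange", 2_000, 2_500)
orphan = ChildFill("K9", "E3", Decimal("10"), Decimal("10.5"), "exchange", 3_000, 3_500)
```

An empty list means the fill has either matched or is still open inside the grace window; the sketch has no fourth outcome.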

A trading-system ChildFill pairs with a venue ChildFill when both share the same parent reconciliation key and their valid-time fields fall within the tolerance window derived from measured clock dispersion. Quantity and price are secondary confirmation fields, not primary matching keys — because price representation varies between venue APIs even for the same execution, and quantity rounding conventions differ across systems.
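The pairing rule can be sketched as follows. The reduced `Fill` record, the function names, and the choice to prefer the candidate closest in valid time when several qualify are assumptions for illustration; the text above does not specify a tie-breaking rule.

```python
from decimal import Decimal
from typing import NamedTuple, Optional, Sequence


class Fill(NamedTuple):
    key: str             # parent reconciliation key
    side: str            # "trading-system" | "exchange"
    valid_time_us: int   # assumption: epoch microseconds
    qty: Decimal
    price: Decimal


def find_pair(fill: Fill, candidates: Sequence[Fill], window_us: int) -> Optional[Fill]:
    """Primary match: same parent key, opposite side, valid time within window."""
    eligible = [
        c for c in candidates
        if c.key == fill.key
        and c.side != fill.side
        and abs(c.valid_time_us - fill.valid_time_us) <= window_us
    ]
    # Assumption: prefer the candidate nearest in valid time.
    return min(eligible,
               key=lambda c: abs(c.valid_time_us - fill.valid_time_us),
               default=None)


def secondary_confirmation(a: Fill, b: Fill) -> bool:
    """Quantity and price confirm a match; they never create one."""
    # Decimal compares by numeric value, so "10.50" equals "10.5".
    return a.qty == b.qty and a.price == b.price


ts = Fill("K1", "trading-system", 1_000_000, Decimal("100"), Decimal("10.50"))
venue = Fill("K1", "exchange", 1_000_400, Decimal("100"), Decimal("10.5"))
unrelated = Fill("K2", "exchange", 1_000_000, Decimal("100"), Decimal("10.50"))
```

Comparing prices as `Decimal` values rather than strings is one way to absorb representation differences such as trailing zeros; rounding conventions would need their own explicit handling.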

Exceptions that survive the grace window travel upward through the escalation chain.

The escalation contract routes by class, not by severity

Every exception the matching engine cannot resolve deterministically carries three fields when it is emitted: a three-class classification drawn from the unknown-key, missing-pair, and over-fill taxonomy; a value representing the fill quantity and price in the exception; and a risk-state delta encoding how the unresolved divergence changes the engine's internal state model. The routing rule is keyed on all three fields. The tier that receives the exception is determined by whether a deterministic resolution rule exists for that combination — not by the monetary value of the exception alone.

The deterministic tier handles exceptions with a known resolution path. An unknown-key exception indicates a mapping-layer failure — either the key was not registered before the fill arrived, or the ingest layer received a fill with no corresponding parent. The first case has a deterministic fix; the second does not and is routed upward. A missing-pair exception may resolve if the venue's fill report arrives in the next ingest cycle; the matching engine re-evaluates open exceptions before routing. An over-fill exception is always routed immediately — no rule can authorise a quantity exceeding the parent order's authorised fill.

The reconciliation agent tier receives exceptions for which no deterministic rule exists: identifier-mapping ambiguities, non-standard venue encodings, and multi-leg assemblies that require contextual reasoning. The agent's action space is constrained: it can suggest a match, suggest a break, or escalate. It cannot act on the internal reconciliation ledger unilaterally unless the exception is on a deterministic resolution path, carries a low risk-state delta, and falls below the value threshold that policy assigns to unilateral action. That constraint is what makes the agent tier auditable: the agent's suggestion is an input to the resolution, not the resolution itself, except in the bounded class of cases where policy permits unilateral closure.

Human confirmation is required when transaction size, transaction risk, or exception complexity crosses the policy threshold. The threshold values are policy, not mechanism. The invariant the mechanism enforces is unconditional: every unresolved divergence emits; every emitted exception raises risk state until resolved. An exception the agent cannot close is promoted to human review. It does not fall back to the matched state. It does not fall back to the open state without a record. Every path through the system ends with an event in the immutable exception log.
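The routing described above can be sketched as a pure function over the three emitted fields. Everything named here is hypothetical: the `Policy` thresholds are placeholder numbers, `has_deterministic_rule` stands in for the rule lookup, and exception complexity is left out of the sketch because the text does not say how it is measured.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Policy:
    # Threshold values are policy, not mechanism; these are placeholders.
    value_threshold: Decimal = Decimal("10000")
    risk_delta_threshold: Decimal = Decimal("0.1")


@dataclass(frozen=True)
class ReconException:
    exception_class: str          # "unknown-key" | "missing-pair" | "over-fill"
    value: Decimal                # quantity times price of the fill in question
    risk_state_delta: Decimal
    has_deterministic_rule: bool  # assumption: result of a rule lookup


def route(exc: ReconException, policy: Policy) -> str:
    """Pick the tier that receives the exception."""
    # Over-fill never stays in the deterministic tier: no rule can authorise it.
    if exc.has_deterministic_rule and exc.exception_class != "over-fill":
        return "deterministic"
    if (exc.value > policy.value_threshold
            or exc.risk_state_delta > policy.risk_delta_threshold):
        return "human"
    return "agent"


def agent_may_close_unilaterally(exc: ReconException, policy: Policy) -> bool:
    """All three conditions must hold before the agent can act on the ledger."""
    return (exc.has_deterministic_rule
            and exc.risk_state_delta <= policy.risk_delta_threshold
            and exc.value <= policy.value_threshold)


policy = Policy()
small_gap = ReconException("missing-pair", Decimal("500"), Decimal("0.01"), True)
over_fill = ReconException("over-fill", Decimal("500"), Decimal("0.01"), True)
large_unknown = ReconException("unknown-key", Decimal("50000"), Decimal("0.01"), False)
```

Value alone never selects the deterministic tier in this sketch: a small exception with no resolution rule still goes to the agent, and a small over-fill still leaves the deterministic tier.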

Thompson et al.'s Disruptor architecture (2011) demonstrates that a single-writer ordered event log makes application state entirely derivable from its input-event sequence. Pat Helland, writing at CIDR 2015[6], extends the principle: the log is the truth; downstream state is a cache of a subset of the log. The exception log here is immutable by the same reasoning. An exception appended to the log cannot be removed; it can only be closed by a subsequent resolution event, which is itself appended. The full reconciliation history is derivable from the log alone.
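A minimal sketch of an exception log with these properties follows. The `ExceptionLog` class, its method names, and the event tuple layout are hypothetical; the point is only that open exceptions are derived by folding the log, and that closing an exception appends an event rather than deleting one.

```python
from typing import Dict, List, Tuple

Event = Tuple[str, str, str]  # (kind, exception_id, detail)


class ExceptionLog:
    """Append-only log; the set of open exceptions is a projection of it."""

    def __init__(self) -> None:
        self._events: List[Event] = []

    @property
    def events(self) -> Tuple[Event, ...]:
        # Read-only view: callers cannot remove or rewrite history.
        return tuple(self._events)

    def emit(self, exception_id: str, exception_class: str) -> None:
        self._events.append(("raised", exception_id, exception_class))

    def resolve(self, exception_id: str, resolution: str) -> None:
        if exception_id not in self.open_exceptions():
            raise KeyError(f"{exception_id} is not open")
        # Closing is itself an appended event.
        self._events.append(("resolved", exception_id, resolution))

    def open_exceptions(self) -> Dict[str, str]:
        """Fold the log: raised adds an entry, resolved removes it."""
        open_now: Dict[str, str] = {}
        for kind, exception_id, detail in self._events:
            if kind == "raised":
                open_now[exception_id] = detail
            else:
                open_now.pop(exception_id, None)
        return open_now


exc_log = ExceptionLog()
exc_log.emit("e1", "unknown-key")
exc_log.emit("e2", "missing-pair")
exc_log.resolve("e1", "parent registered late")
```

After these three calls the log holds three events and one open exception, and the history of the closed one remains readable from the log.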

Prior approaches treat divergence as error

The published literature on reconciliation — across both the payments domain and the trade-execution domain — treats divergence as an error condition that the system corrects. Rule-based matching engines route exceptions by confidence score: a high-confidence match closes automatically; a low-confidence match requires human review. AI-augmented architectures extend this by using a learned model to raise the confidence threshold for the ambiguous middle band. In both cases, the implicit assumption is that divergence is an undesirable signal — the matching engine's goal is to eliminate it.

The insight the three-class taxonomy embeds is different. Divergence in the order-ID shape class is structural: it is produced by the FIX lifecycle's own design, which requires ClOrdID to change on each replace request. It cannot be eliminated; it must be normalised before matching begins. Divergence in the timestamp class is physical: it is produced by the bounded offset between two independent clocks. It cannot be eliminated; it must be modelled with a tolerance window grounded in the synchronisation bounds. Divergence in the partial-fill class is representational: it is produced by the legitimate difference between granular fill events and summary fill aggregations. It cannot be eliminated; it must be resolved by a fill-event model that accumulates across both levels of granularity simultaneously.

None of these three classes is an error. Each is an expected property of the environment. The confidence-score escalation model in the published literature conflates these three causes into a single dimension, confidence, which obscures why each exception fires. An exception routed to the AI tier because its confidence score is below threshold carries no information about whether the low confidence is caused by an ID normalisation failure, a clock offset, or a fill-count mismatch. The resolution path is different for each cause, and the agent that receives an undifferentiated low-confidence exception must rediscover the cause before it can act.

The three-class taxonomy makes cause explicit at emission time. The agent tier receives an exception already labelled with its class. The resolution path for a missing-pair exception is structurally different from the resolution path for an unknown-key exception, so the agent's action space is pruned before it begins reasoning. This is the framing the published literature does not carry.

Event-sourced architectures in the published literature (Fowler's event-sourcing formulation, Kreps's state machine replication principle) establish the log-as-truth foundation the ChildFill model extends. The extension is applying that principle to two event streams simultaneously and comparing their projections rather than merging their records. The bi-temporal model, widely cited as a storage technique, functions here as a reconciliation operator: comparing two systems' projections at the same valid-time slice makes divergence visible as a temporal gap rather than a value mismatch. That reframing is not in the published literature.

The no-silent-discard invariant holds by design: every unresolved divergence emits, every emitted exception raises risk state, and every raised risk state persists until a resolution event closes it. The residual the invariant cannot close is the grace window itself. Between the moment the matching engine first determines that a pair has not arrived and the moment the grace window expires and the missing-pair exception fires, there is an interval, sized as a small-integer multiple of the measured clock dispersion, during which the reconciliation state carries open items that are neither matched nor excepted. During that interval the system is working correctly: the grace window is open, the pair-matching test is running on each ingest cycle, and no exception has fired because no exception condition has been met. There is no event to observe, no signal to emit. The gap is the necessary cost of tolerating divergence at all, and it has no engineering fix that does not reintroduce a different class of false positive.

References
