Transition Matching: A Unified Framework via Discrete-Time Markov Transitions

Background and Motivation

Flow Matching is a generative model based on continuous-time deterministic ODEs that learns smooth trajectories from noise to data. However, this formulation has several limitations:

  • Dependence on continuous time: Since actual inference discretizes using methods like the Euler method, a trade-off between the number of steps and accuracy is unavoidable
  • Deterministic trajectories: Only a single trajectory exists for each noise point, precluding stochastic exploration
  • Disconnection from other paradigms: It is formulated separately from diffusion models (stochastic) and autoregressive models (discrete, causal)

Shaul et al. (2025) proposed Transition Matching to simultaneously resolve these limitations through a unified framework. The core question is:

Can diffusion models, Flow Matching, and autoregressive models be unified as discrete-time Markov transitions?

Transition Matching answers this question affirmatively. By formulating the generative process as a sequence of stochastic transition kernels, it achieves a flexible framework that encompasses all three paradigms.

General Framework of Transition Matching

Generation via Markov Transition Kernels

Transition Matching formulates the generative process as a sequence of Markov transition kernels at discrete times \(t = 0, 1, \ldots, T\). Specifically, the generative process to be learned takes the following form:

\[ p^\theta(X_0, X_1, \ldots, X_T) = p(X_0) \prod_{t=0}^{T-1} p^\theta_{t+1|t}(X_{t+1} | X_t) \tag{1}\]

Here, \(p(X_0)\) is the noise distribution (e.g., standard normal distribution) and \(p^\theta_{t+1|t}\) is the transition kernel parameterized by learnable parameters \(\theta\). Generation starts from \(X_0 \sim p(X_0)\) and sequentially samples \(X_{t+1} \sim p^\theta_{t+1|t}(\cdot | X_t)\).
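The sampling recursion in Eq. (1) can be sketched as a simple ancestral-sampling loop. This is a minimal illustration, with a placeholder Gaussian `transition_kernel` standing in for the learned kernel \(p^\theta_{t+1|t}\):

```python
import numpy as np

def transition_kernel(x_t, t, rng):
    """Hypothetical stand-in for the learned kernel p^theta_{t+1|t}:
    here, a Gaussian centered at the current state."""
    return x_t + 0.1 * rng.standard_normal(x_t.shape)

def generate(T, dim, rng):
    """Ancestral sampling of Eq. (1): X_0 ~ p(X_0), then X_{t+1} ~ p^theta_{t+1|t}."""
    x = rng.standard_normal(dim)          # X_0 from the noise distribution
    for t in range(T):                    # t = 0, ..., T-1
        x = transition_kernel(x, t, rng)  # one Markov transition
    return x                              # X_T, a sample from the model

rng = np.random.default_rng(0)
sample = generate(T=16, dim=8, rng=rng)
```

Any kernel family could be plugged in here; the variants below differ precisely in how this kernel is parameterized.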

Supervising Process

The learning objective is to imitate a supervising process \(q\) whose terminal distribution \(q_T\) is the data distribution. The supervising process is defined by the following joint distribution:

\[ q(X_0, X_1, \ldots, X_T) = p(X_0) \prod_{t=0}^{T-1} q_{t+1|t}(X_{t+1} | X_t) \]

This supervising process is typically constructed from an interpolation path connecting noise and data (e.g., linear interpolation \(X_t = (1-\alpha_t) X_0 + \alpha_t X_T\)).
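A trajectory of the supervising process under such a linear interpolation can be sketched as follows; the schedule \(\alpha_t = t/T\) is an assumption for illustration (any monotone schedule with \(\alpha_0 = 0\), \(\alpha_T = 1\) works):

```python
import numpy as np

def supervising_trajectory(x0, xT, T):
    """States of the supervising process under linear interpolation
    between a noise sample x0 and a data sample xT, with the
    illustrative schedule alpha_t = t / T."""
    alphas = np.arange(T + 1) / T
    return np.stack([(1 - a) * x0 + a * xT for a in alphas])

rng = np.random.default_rng(0)
x0 = rng.standard_normal(4)   # noise sample
xT = rng.standard_normal(4)   # data sample
traj = supervising_trajectory(x0, xT, T=10)
```

The trajectory starts exactly at the noise sample and ends exactly at the data sample, which is what makes its terminal distribution the data distribution.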

Loss Function

Learning is performed by minimizing the divergence between the supervising process transition kernels and the model’s transition kernels at each time step:

\[ \mathcal{L}(\theta) = \sum_{t=0}^{T-1} \mathbb{E}_{q(X_t)} \left[ D\left( q_{t+1|t}(\cdot | X_t) \,\|\, p^\theta_{t+1|t}(\cdot | X_t) \right) \right] \tag{2}\]

Here, \(D\) is a divergence measure between probability distributions (such as KL divergence). A crucial point is that the transition kernels at each time step can be matched independently, which is the source of the framework’s flexibility.
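As a concrete instance of Eq. (2): when both kernels are isotropic Gaussians with a shared variance, the KL divergence reduces in closed form to a scaled squared distance between their means. A minimal sketch of that special case:

```python
import numpy as np

def gaussian_kl(mu_q, mu_p, sigma):
    """KL( N(mu_q, sigma^2 I) || N(mu_p, sigma^2 I) ).
    With a shared isotropic variance, the divergence reduces to
    ||mu_q - mu_p||^2 / (2 sigma^2)."""
    return np.sum((mu_q - mu_p) ** 2) / (2.0 * sigma ** 2)

# Identical means give zero divergence; it grows quadratically otherwise.
mu = np.ones(3)
zero_kl = gaussian_kl(mu, mu, sigma=0.5)
```

In this Gaussian case, minimizing the per-step divergence is equivalent to regressing the model's mean onto the supervising kernel's mean, which is what makes the training objective tractable.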

Kernel Parameterization via Latent Variables

Directly approximating the supervising process transition kernel \(q_{t+1|t}\) can be difficult in some cases. Transition Matching introduces a latent variable \(Y\) to parameterize the kernel:

\[ q_{t+1|t}(X_{t+1} | X_t) = \int q_{t+1|t,Y}(X_{t+1} | X_t, Y) \, q_{Y|t}(Y | X_t) \, dY \]

Different choices of latent variable \(Y\) give rise to different variants. This design freedom underpins the versatility of Transition Matching.
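The mixture form above suggests a two-stage ancestral sampler: draw \(Y \sim q_{Y|t}(\cdot \mid X_t)\), then \(X_{t+1} \sim q_{t+1|t,Y}(\cdot \mid X_t, Y)\). A schematic sketch, with both components replaced by placeholder Gaussians (not the paper's actual choices):

```python
import numpy as np

def sample_next(x_t, rng):
    """Two-stage sampling of the mixture kernel:
    first Y ~ q_{Y|t}(. | X_t), then X_{t+1} ~ q_{t+1|t,Y}(. | X_t, Y).
    Both conditionals are illustrative Gaussians."""
    y = x_t + rng.standard_normal(x_t.shape)                      # latent draw
    x_next = x_t + 0.1 * y + 0.01 * rng.standard_normal(x_t.shape)  # transition given Y
    return x_next

rng = np.random.default_rng(0)
x1 = sample_next(np.zeros(4), rng)
```

Marginalizing the latent draw recovers exactly the integral above, so the two-stage sampler and the kernel define the same transition distribution.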

```mermaid
flowchart LR
    X0["X₀<br/>(noise)"] -->|"p^θ_{1|0}"| X1["X₁"]
    X1 -->|"p^θ_{2|1}"| X2["X₂"]
    X2 -->|"..."| XT["X_T<br/>(data)"]

    style X0 fill:#f9f,stroke:#333
    style XT fill:#9f9,stroke:#333
```
Figure 1: The Transition Matching framework: Markov transitions from noise to data
  • Supervising process: Target \(q_{t+1|t}(X_{t+1} | X_t)\)
  • Learning: Imitate with \(p^\theta_{t+1|t}(X_{t+1} | X_t)\)
  • Loss: Minimize \(D(q_{t+1|t} \| p^\theta_{t+1|t})\) at each time step independently

DTM: Difference Transition Matching

Formulation

DTM (Difference Transition Matching) is the most fundamental variant of Transition Matching and serves as a natural discrete-time generalization of Flow Matching.

It adopts the difference \(Y = X_T - X_0\) (the difference between data and noise) as the latent variable. This means that at each transition step, the model predicts the “direction” from noise to data. The transition kernel is defined as:

\[ p^\theta_{t+1|t}(X_{t+1} | X_t) = \mathcal{N}\left(X_{t+1}; X_t + (\alpha_{t+1} - \alpha_t) f_\theta(X_t, t), \sigma_t^2 I \right) \]

Here, \(f_\theta(X_t, t)\) is a neural network that predicts the difference (the direction of \(X_T - X_0\)), \(\alpha_t\) is the interpolation schedule parameter, and \(\sigma_t\) is the magnitude of stochastic noise.
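One DTM transition then amounts to an Euler-style update plus Gaussian noise. A minimal sketch, where `f_pred` stands in for the network output \(f_\theta(X_t, t)\):

```python
import numpy as np

def dtm_step(x_t, f_pred, alpha_t, alpha_next, sigma_t, rng):
    """One DTM transition:
    X_{t+1} ~ N( X_t + (alpha_{t+1} - alpha_t) * f_theta(X_t, t), sigma_t^2 I ).
    `f_pred` is the (precomputed) network output for this state and time."""
    mean = x_t + (alpha_next - alpha_t) * f_pred
    return mean + sigma_t * rng.standard_normal(x_t.shape)

rng = np.random.default_rng(0)
x_next = dtm_step(np.zeros(3), np.ones(3),
                  alpha_t=0.0, alpha_next=0.5, sigma_t=0.1, rng=rng)
```

Setting `sigma_t = 0` makes the step deterministic, which is the connection to a Flow Matching Euler step.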

Theoretical Relationship with Flow Matching

A key theorem establishes the relationship between DTM and FM.

Theorem: The conditional expectation of one DTM step coincides with a Flow Matching Euler step. That is, since \(f_\theta(X_t, t)\) is deterministic given \(X_t\):

\[ \mathbb{E}\left[X_{t+1} \mid X_t\right] = X_t + (\alpha_{t+1} - \alpha_t) f_\theta(X_t, t) \]

This is structurally identical to the FM Euler discretization:

\[ z_{t+\Delta t} = z_t + \Delta t \cdot v_\theta(z_t, t) \]

Furthermore, in the limit \(T \to \infty\) (infinitely many time steps, so the step size vanishes), the DTM process converges to the continuous-time FM process.

This theorem has two important implications:

  • DTM is theoretically justified as a rigorous discrete-time version of FM
  • At finite steps, DTM has an additional stochastic noise term \(\sigma_t\) compared to FM
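The theorem can be checked empirically: averaging many stochastic DTM steps recovers the deterministic Euler step. A toy sketch with a hypothetical difference predictor `f` (any function of the state works, since the identity holds conditionally on \(X_t\)):

```python
import numpy as np

rng = np.random.default_rng(0)
x_t = np.array([1.0, -2.0])
f = lambda x, t: -x                   # toy stand-in for f_theta
alpha_t, alpha_next, sigma = 0.25, 0.3125, 0.05

# Deterministic FM Euler step for this state.
euler = x_t + (alpha_next - alpha_t) * f(x_t, 0)

# Many stochastic DTM steps from the same state: same mean, extra noise.
steps = euler + sigma * rng.standard_normal((100_000, 2))
empirical_mean = steps.mean(axis=0)
```

The empirical mean of the stochastic steps matches the Euler step up to Monte Carlo error, which is exactly the content of the theorem.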

Additionally, the paper provides a new elementary proof of FM’s marginal velocity field. While the original FM formulation required somewhat complex arguments for proving the existence and uniqueness of the marginal velocity field, the Transition Matching framework yields a more direct and transparent proof.

Backbone-Head Architecture

The practical implementation of DTM adopts the Backbone-Head architecture. This is a critically important design from the perspective of computational efficiency.

```mermaid
flowchart TB
    Input["Input X₀"] --> Backbone["Backbone (heavy)<br/>e.g., UNet, DiT<br/>(run once per sample)"]
    Backbone -->|"shared features"| H1["Head 1<br/>(t=1)"]
    Backbone -->|"shared features"| H2["Head 2<br/>(t=2)"]
    Backbone -->|"shared features"| H3["Head 3<br/>(t=3)"]
    Backbone -->|"shared features"| HT["Head T<br/>(t=T)"]

    style Backbone fill:#ffd,stroke:#333
    style H1 fill:#dff,stroke:#333
    style H2 fill:#dff,stroke:#333
    style H3 fill:#dff,stroke:#333
    style HT fill:#dff,stroke:#333
```
Figure 2: Backbone-Head architecture: Run the heavy Backbone once, then predict transitions at each time step with lightweight Heads


The key points of this architecture are:

  • Backbone: A heavy network such as UNet or DiT that extracts common features from the input. It is run only once per sample
  • Head: Lightweight networks specialized for each time step \(t\) that predict transitions from the Backbone’s output
  • Speedup: While conventional FM requires 128 Backbone forward passes, DTM requires only 16. This corresponds to a 7x speedup
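The division of labor in Figure 2 can be sketched with toy linear layers; the names, shapes, and single tanh nonlinearity are illustrative assumptions, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy weights: one heavy backbone matrix, one light head matrix per step.
D, H, T = 64, 8, 16
W_backbone = rng.standard_normal((D, D)) / np.sqrt(D)  # run once per sample
W_heads = rng.standard_normal((T, H, D)) / np.sqrt(D)  # one per time step

x0 = rng.standard_normal(D)
features = np.tanh(W_backbone @ x0)      # single heavy forward pass

# Every step reuses the shared features; only the cheap head runs per step.
predictions = [W_heads[t] @ features for t in range(T)]
```

The cost structure is the point: one \(O(D^2)\) backbone pass is amortized over \(T\) heads that each cost only \(O(HD)\), which is where the reduction in backbone forward passes comes from.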

ARTM: Autoregressive Transition Matching

ARTM (Autoregressive Transition Matching) is a variant that incorporates the structure of autoregressive models into Transition Matching.

Independent Linear Processes and Causal Structure

In ARTM, an independent linear process is defined for each token position \(i\):

\[ X_t^{(i)} = (1 - \alpha_t) X_0^{(i)} + \alpha_t X_T^{(i)} \]

Here, \(X_t^{(i)}\) is the state at time \(t\) for token position \(i\). The crucial point is that this process has a causal structure. That is, the transition at position \(i\) depends only on information from positions \(1, \ldots, i-1\).

This naturally integrates the left-to-right sequential generation structure of autoregressive models with the noise-to-data transformation structure of Flow Matching. The velocity at each token position is learned independently, and causal masks control the flow of information.
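The causal constraint is typically enforced with a lower-triangular attention mask, as in Transformer decoders. A minimal sketch (the convention that a position may also attend to itself is an implementation assumption):

```python
import numpy as np

def causal_mask(n):
    """Lower-triangular attention mask: position i may attend only to
    positions up to and including i, so information flows left to right."""
    return np.tril(np.ones((n, n), dtype=bool))

mask = causal_mask(4)
# Row i is True exactly at columns j <= i.
```

Applying this mask in attention (setting disallowed logits to \(-\infty\)) is what makes the per-position transitions depend only on earlier positions.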

Relationship with Autoregressive Models

ARTM can be interpreted as an extension of discrete-token autoregressive generation to continuous space. When the number of tokens is fixed to 1 and the number of steps is \(T=1\), it degenerates to a single step of a standard autoregressive model.

FHTM: Full History Transition Matching

FHTM (Full History Transition Matching) is the most expressive variant and occupies an important position in integration with LLM architectures.

Access to Full History

While DTM and ARTM predict the next state based only on the current state \(X_t\), FHTM has access to the full history \(X_0, X_1, \ldots, X_t\):

\[ p^\theta_{t+1|0:t}(X_{t+1} | X_0, X_1, \ldots, X_t) \]

This means abandoning the Markov property in exchange for utilizing richer contextual information.

Training with Teacher-Forcing

FHTM training uses teacher-forcing. This is the standard training technique for autoregressive models in natural language processing, where the true history from the supervising process is used as input during training:

\[ \mathcal{L}_{\text{FHTM}}(\theta) = \sum_{t=0}^{T-1} \mathbb{E}_{q(X_0, \ldots, X_t)} \left[ D\left( q_{t+1|0:t}(\cdot | X_0, \ldots, X_t) \,\|\, p^\theta_{t+1|0:t}(\cdot | X_0, \ldots, X_t) \right) \right] \]

Teacher-forcing ensures efficient and stable training. At inference time, the model uses its own generated history sequentially.
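Teacher-forcing can be sketched as follows: the model is always conditioned on the true history from the supervising process, never on its own samples. Here `predict_mean` is a hypothetical stand-in for the model, and squared error stands in for the divergence \(D\) under a fixed-variance Gaussian kernel:

```python
import numpy as np

def teacher_forced_loss(trajectory, predict_mean):
    """Teacher-forcing: condition on the TRUE history X_0..X_t from the
    supervising process and score the model's prediction of X_{t+1}.
    Squared error plays the role of D for a fixed-variance Gaussian."""
    loss = 0.0
    for t in range(len(trajectory) - 1):
        history = trajectory[: t + 1]   # X_0, ..., X_t (ground truth)
        pred = predict_mean(history)    # model's mean for X_{t+1}
        loss += np.sum((trajectory[t + 1] - pred) ** 2)
    return loss / (len(trajectory) - 1)

# Toy trajectory with constant increments of 0.25 per step.
traj = np.linspace(0.0, 1.0, 5)[:, None] * np.ones((1, 3))
perfect = teacher_forced_loss(traj, lambda h: h[-1] + 0.25)
```

Because every step's input is ground truth, all steps can be scored in parallel during training; only at inference does the model consume its own history.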

Innovation as a Fully Causal Model

The most notable achievement of FHTM is that it was the first fully causal model to surpass Flow Matching. Previously, generative models with causal structures were thought to be inferior to bidirectional models. FHTM overturned this conventional wisdom, demonstrating that leveraging full history information more than compensates for the constraints of causal structure.

FHTM can be directly implemented with standard LLM architectures (Transformer decoder). The reasons are as follows:

  • Causal mask: FHTM’s causal structure perfectly aligns with the autoregressive causal mask of LLMs
  • Teacher-forcing: The standard LLM training technique can be applied directly
  • Representation as token sequences: The state \(X_t\) at each time step can be treated as a token and processed as a temporal sequence

This compatibility opens the possibility of seamlessly integrating text and image generation. For example, after autoregressive generation of text tokens, the same model with the same architecture could perform step-by-step refinement of images. The practical advantage of being able to directly leverage the massive LLM ecosystem (optimization methods, inference engines, hardware support) is also significant.

Experimental Results

DTM Image Generation Performance

DTM was trained on a 350M-sample Shutterstock dataset and evaluated on text-conditioned image generation. The evaluation metrics cover both image quality and prompt alignment.

| Metric | DTM (16 steps) | FM (128 steps) | Notes |
|---|---|---|---|
| CLIPScore | Surpasses | Baseline | Text-image alignment |
| PickScore | Surpasses | Baseline | Human preference-based evaluation |
| ImageReward | Surpasses | Baseline | Reward model score |
| Aesthetics | Surpasses | Baseline | Aesthetic quality |
| Backbone forwards | 16 | 128 | 7x speedup |

DTM surpasses 128-step FM across all metrics with only 16 Backbone forward passes. This demonstrates the effectiveness of the Backbone-Head architecture.

FHTM Performance

FHTM as a fully causal model showed noteworthy results in the following respects:

  • Surpassing FM: The first instance of a causal model outperforming a bidirectional model
  • LLM architecture: Demonstrated implementability with a standard Transformer decoder
  • Effectiveness of teacher-forcing: Confirmed training stability and efficiency

Comparison of the Three Variants

Summarizing the positioning of each variant:

| Variant | Latent Variable \(Y\) | Structure | Key Advantage |
|---|---|---|---|
| DTM | \(X_T - X_0\) (difference) | Markov | Discrete-time version of FM, 7x speedup |
| ARTM | Independent linear processes | Causal | Bridge between AR models and FM |
| FHTM | Full history | Fully causal | First to surpass FM, LLM-compatible |

Significance and Positioning

The contribution of Transition Matching extends beyond individual performance improvements. This framework provides a unified perspective for treating three generative model paradigms that have previously developed independently:

  • Diffusion models: Can be expressed as stochastic transition kernels
  • Flow Matching: Recoverable as the \(T \to \infty\) limit of DTM
  • Autoregressive models: Subsumed as special cases of ARTM and FHTM

This unification not only deepens theoretical understanding but also opens new design spaces in practice. For example, it becomes possible to mix stochastic and deterministic transitions, or to adopt causal structure for some steps and bidirectional structure for others.

The fact that FHTM is compatible with LLM architectures is particularly significant as an important step toward unified text and image generation. In the context of one-step generation, it occupies a complementary position to the other methods discussed in the main document, approaching the shared goal of high-quality generation with fewer steps from a unique angle of discrete-time, stochastic formulation.

Byproduct: A New Elementary Proof of the Marginal Velocity Field

As a theoretical byproduct of Transition Matching, a new elementary proof of Flow Matching’s marginal velocity field has been obtained.

In the original Flow Matching formulation, deriving the marginal velocity field \(u_t(x)\) from the conditional velocity field \(u_t(x | x_1)\) required going through the theory of probability flows or the continuity equation.

In the Transition Matching framework, starting from discrete-time transition kernels and taking the \(T \to \infty\) limit directly yields the existence and expression of the marginal velocity field. Specifically:

\[ u_t(x) = \lim_{\Delta t \to 0} \frac{1}{\Delta t} \mathbb{E}\left[X_{t+\Delta t} - X_t \mid X_t = x\right] \]

This proof provides the perspective of understanding continuous-time FM as the limit of discrete time, placing the theoretical foundations of Flow Matching on firmer ground.
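The limit can be illustrated numerically for a single conditioning pair: under the linear path \(X_t = (1-t) X_0 + t X_1\) (unit-time schedule), the finite-difference quotient equals \(X_1 - X_0\) for every step size, so the limit is immediate in this toy deterministic case:

```python
import numpy as np

# Toy conditioning pair for the linear interpolation path.
x0 = np.array([0.0, 2.0])
x1 = np.array([1.0, -1.0])
path = lambda t: (1 - t) * x0 + t * x1

# Finite-difference quotient (X_{t+dt} - X_t) / dt at an interior time.
t, dt = 0.3, 1e-4
velocity_fd = (path(t + dt) - path(t)) / dt
```

Conditioned on a single pair the velocity is constant, \(X_1 - X_0\); the marginal velocity field arises from averaging this conditional velocity over all pairs consistent with \(X_t = x\), which is what the expectation in the limit above expresses.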