Equilibrium Matching: Generative Modeling with Implicit Energy-Based Models

Homepage: https://raywang4.github.io/equilibrium_matching/
Paper link: https://arxiv.org/abs/2510.02300
Code: https://github.com/raywang4/EqM

Background

Diffusion and Flow Matching models learn non-equilibrium dynamics. Let $f$ denote the generative model, $x$ a real image sampled from the training data, $\epsilon$ the added Gaussian noise, and $t \in [0, 1]$ the time step. Flow Matching (FM) learns to match the conditional velocity along the linear path $x_t = t x + (1 - t)\epsilon$ connecting noise and data. During sampling, FM starts from pure Gaussian noise and iteratively denoises the current sample using the velocity predicted by $f$. The FM objective is

\begin{equation} \label{eq:FM_obj} L_{FM} = (f(x_t, t) - (x - \epsilon))^2. \end{equation}
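As a concrete illustration, here is a minimal PyTorch-style sketch of one FM training step under the linear-path convention above; the function names and shapes are illustrative, not the paper's code.

```python
import torch

def flow_matching_loss(f, x):
    """One FM training step: regress f(x_t, t) onto the conditional velocity x - eps."""
    eps = torch.randn_like(x)                                            # Gaussian noise endpoint
    t = torch.rand(x.shape[0], *([1] * (x.dim() - 1)), device=x.device)  # t ~ U[0, 1], broadcastable
    x_t = t * x + (1 - t) * eps                                          # linear interpolation noise -> data
    target = x - eps                                                     # conditional velocity along the path
    return ((f(x_t, t) - target) ** 2).mean()                            # MSE, cf. the FM objective above
```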

Method

Equilibrium Matching (EqM) constructs an energy landscape in which ground-truth samples sit in the valleys and noise sits at the peaks; the landscape is built by adding noise to real data. As the paper puts it, EqM learns a time-invariant gradient field that is compatible with an underlying energy function, eliminating time/noise conditioning and fixed-horizon integrators. Conceptually, EqM's gradient vanishes on the data manifold and grows toward noise, yielding an equilibrium landscape in which ground-truth samples are stationary points. Flow Matching learns a time-varying velocity that only converges to ground truths at the final timestep, whereas EqM learns a time-invariant gradient landscape that always converges to ground-truth data points.

The objective function is very similar to FM's:

\begin{equation} \label{eq:EqM_obj} L_{EqM} = (f(x_\gamma) - (x - \epsilon)\,c(\gamma))^2. \end{equation}

To explain this objective, the paper first constructs an energy landscape in which the target gradient at the ground-truth samples is zero, i.e., they lie in valleys, and then defines a corruption scheme. Here $\gamma$ is an interpolation factor sampled uniformly from $[0, 1]$, and the model learns the path from high-energy points (noise) to low-energy points (real data). The intermediate interpolated sample is constructed as $x_{\gamma} = \gamma x + (1 - \gamma)\epsilon$, where $\epsilon$ is the Gaussian noise. As stated in the paper, “Unlike t in FM, our γ is implicit and not seen by the model.” The goal is to define a target gradient at these intermediate samples $x_{\gamma}$ that matches an implicit energy landscape: the target $(x - \epsilon)c(\gamma)$ points from noise toward data, and its magnitude $c(\gamma)$ shrinks to zero at the data ($c(1) = 0$) so that ground-truth samples are stationary points.
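A matching sketch of the EqM objective, under the same illustrative conventions as the FM snippet above; note that $f$ receives only $x_\gamma$, with no time or noise conditioning, and $c$ is any gradient-magnitude schedule with $c(1) = 0$ (the linear decay used here is an illustrative choice, not necessarily the paper's default).

```python
import torch

def equilibrium_matching_loss(f, x, c=lambda g: 1 - g):
    """One EqM training step: regress f(x_gamma) onto (x - eps) * c(gamma).

    c is an illustrative linear-decay schedule with c(1) = 0; the paper
    discusses how to construct this gradient-magnitude function.
    """
    eps = torch.randn_like(x)                                                # high-energy noise endpoint
    gamma = torch.rand(x.shape[0], *([1] * (x.dim() - 1)), device=x.device)  # gamma ~ U[0, 1]
    x_gamma = gamma * x + (1 - gamma) * eps                                  # interpolated sample
    target = (x - eps) * c(gamma)                                            # gradient target, zero at gamma = 1
    return ((f(x_gamma) - target) ** 2).mean()                               # gamma itself is never given to f
```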

The paper gives further details on how to construct the gradient-magnitude function $c(\gamma)$ (in principle, any schedule that decays to zero at the data works; the linear decay $c(\gamma) = 1 - \gamma$ above is one simple instance) and on how the model can learn an explicit energy.

In contrast to FM/diffusion, inference is simpler: because the learned gradient field is time-invariant, a ‘Gradient Descent Sampling’ procedure can be used, starting from noise and repeatedly stepping along the predicted gradient, as sketched below.
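A hedged sketch of such a sampler; the step count and step size are illustrative hyperparameters, not values from the paper.

```python
import torch

@torch.no_grad()
def gradient_descent_sample(f, shape, steps=100, step_size=0.02, device="cpu"):
    """Sample by descending the implicit energy; f points from noise toward data."""
    x = torch.randn(shape, device=device)   # start at a high-energy point (pure noise)
    for _ in range(steps):
        x = x + step_size * f(x)            # step along the learned gradient toward the data manifold
    return x
```

Because there is no fixed integration horizon, the number of steps can be chosen freely at inference time.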

All the details are worth reading in the paper.

The training and inference pseudocode:
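A minimal end-to-end sketch combining the pieces above; the network class, data loader, optimizer settings, and sample shape are illustrative assumptions, not the paper's code.

```python
import torch

model = MyGradientNet()                          # hypothetical network predicting f(x)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

# Training: minimize the EqM objective over batches of real images.
for x in dataloader:                             # hypothetical data loader of image batches
    loss = equilibrium_matching_loss(model, x)   # sketch defined above
    opt.zero_grad()
    loss.backward()
    opt.step()

# Inference: gradient descent sampling from pure noise.
samples = gradient_descent_sample(model, shape=(16, 3, 32, 32))
```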