Scaling Rectified Flow Transformers for High-Resolution Image Synthesis


Abstract

Diffusion models create data from noise by inverting the forward paths of data towards noise and have emerged as a powerful generative modeling technique for high-dimensional, perceptual data such as images and videos.

Rectified flow is a recent generative model formulation that connects data and noise in a straight line. Despite its better theoretical properties and conceptual simplicity, it is not yet decisively established as standard practice. In this work, we improve existing noise sampling techniques for training rectified flow models by biasing them towards perceptually relevant scales. Through a large-scale study, we demonstrate the superior performance of this approach compared to established diffusion formulations for high-resolution text-to-image synthesis.

Additionally, we present a novel transformer-based architecture for text-to-image generation that uses separate weights for the two modalities and enables a bidirectional flow of information between image and text tokens, improving text comprehension, typography, and human preference ratings. We demonstrate that this architecture follows predictable scaling trends and correlates lower validation loss with improved text-to-image synthesis as measured by various metrics and human evaluations. Our largest models outperform state-of-the-art models. Stability AI is considering making experimental data, code, and model weights publicly available.

Introduction

  1. In recent years, diffusion models have become the de-facto approach for generating high-resolution images and videos from natural language inputs with impressive generalization capabilities.
  2. Accordingly, research on formulations for more efficient training and/or faster sampling of these models has increased.
  3. While specifying a forward path from data to noise leads to efficient training, it also raises the question of which path to choose. This choice can have important implications for sampling. For example, a forward process that fails to remove all noise from the data can lead to a discrepancy in training and test distribution and result in artifacts such as gray image samples (Lin et al., 2024). Importantly, the choice of the forward process also influences the learned backward process and, thus, the sampling efficiency. While curved paths require many integration steps to simulate the process, a straight path could be simulated with a single step and is less prone to error accumulation. Since each step corresponds to an evaluation of the neural network, this has a direct impact on the sampling speed.
  4. A particular choice for the forward path is a so-called Rectified Flow (Liu et al., 2022; Albergo & Vanden-Eijnden, 2022; Lipman et al., 2023), which connects data and noise on a straight line. Despite its appealing properties, this formulation has not yet become standard practice. In this work, we address this by introducing a re-weighting of the noise scales in rectified flow models, similar to noise-predictive diffusion models (Ho et al., 2020); see the timestep-sampling sketch after this list. Through a large-scale study, we compare our new formulation to existing diffusion formulations and demonstrate its benefits.
  5. We show that the widely used approach for text-to-image synthesis, where a fixed text representation is fed directly into the model (e.g., via cross-attention (Vaswani et al., 2017; Rombach et al., 2022)), is not ideal, and present a new architecture that incorporates learnable streams for both image and text tokens, which enables a two-way flow of information between them. We combine this with our improved rectified flow formulation and investigate its scalability. We demonstrate a predictable scaling trend in the validation loss and show that a lower validation loss correlates strongly with improved automatic and human evaluations.
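
The re-weighting of noise scales mentioned in item 4 can be illustrated with a short sketch. The following is a minimal, hypothetical example (the function name and the `loc`/`scale` parameters are illustrative, not the exact configuration from our experiments) of biasing training timesteps toward intermediate, perceptually relevant noise levels via a logit-normal density:

```python
import torch

def sample_timesteps_logit_normal(batch_size, loc=0.0, scale=1.0, device="cpu"):
    """Draw training timesteps t in (0, 1) from a logit-normal density.

    Sampling u ~ N(loc, scale) and mapping it through a sigmoid concentrates
    t around intermediate noise levels, biasing training toward perceptually
    relevant scales rather than the extremes t -> 0 (nearly clean data) and
    t -> 1 (nearly pure noise).
    """
    u = torch.randn(batch_size, device=device) * scale + loc
    return torch.sigmoid(u)
```

Shifting `loc` moves the emphasis toward noisier or cleaner timesteps, while `scale` controls how strongly the density concentrates around intermediate values.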

Core contributions

  1. We conduct a large-scale, systematic study on different diffusion model and rectified flow formulations to identify the best setting. For this purpose, we introduce new noise samplers for rectified flow models that improve performance over previously known samplers.
  2. We devise a novel, scalable architecture for text-to-image synthesis that allows bi-directional mixing between text and image token streams within the network (see the joint-attention sketch after this list). We show its benefits compared to established backbones such as UViT (Hoogeboom et al., 2023) and DiT (Peebles & Xie, 2023).
  3. We perform a scaling study of our model and demonstrate that it follows predictable scaling trends. We show that a lower validation loss correlates strongly with improved text-to-image performance assessed via metrics such as T2I-CompBench (Huang et al., 2023), GenEval (Ghosh et al., 2023), and human ratings. We make results, code, and model weights publicly available.
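
As an illustration of the bi-directional mixing in contribution 2, the sketch below shows one possible joint-attention layer with modality-specific projection weights. It is a simplified, hypothetical module (the full architecture additionally uses normalization, modulation, and per-modality MLP blocks), not a complete description of our model:

```python
import torch
import torch.nn as nn

class JointAttentionSketch(nn.Module):
    """Separate projection weights per modality, one shared attention op.

    Text and image tokens are projected with their own weights, concatenated
    into a single sequence for attention (so information flows both ways),
    then split and projected back per modality.
    """

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.qkv_img = nn.Linear(dim, 3 * dim)
        self.qkv_txt = nn.Linear(dim, 3 * dim)
        self.proj_img = nn.Linear(dim, dim)
        self.proj_txt = nn.Linear(dim, dim)

    def forward(self, img_tokens, txt_tokens):
        b, n_img, d = img_tokens.shape
        n_txt = txt_tokens.shape[1]
        h = self.num_heads

        # Modality-specific projections, then concatenation along the sequence axis.
        qkv = torch.cat([self.qkv_img(img_tokens), self.qkv_txt(txt_tokens)], dim=1)
        q, k, v = qkv.chunk(3, dim=-1)
        q, k, v = (x.view(b, n_img + n_txt, h, d // h).transpose(1, 2) for x in (q, k, v))

        # Joint attention over the combined sequence: every image token can
        # attend to every text token and vice versa.
        out = torch.nn.functional.scaled_dot_product_attention(q, k, v)
        out = out.transpose(1, 2).reshape(b, n_img + n_txt, d)

        img_out, txt_out = out[:, :n_img], out[:, n_img:]
        return self.proj_img(img_out), self.proj_txt(txt_out)
```

Because all tokens participate in a single attention operation, text representations are updated by image features and vice versa, in contrast to cross-attention onto a fixed text representation.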

Simulation-Free Training of Flows

We consider generative models that define a mapping from samples \(x_1\) of a noise distribution \(p_1\) to samples \(x_0\) of a data distribution \(p_0\) in terms of an ordinary differential equation (ODE),
\[ d y_t = v_{\Theta}\left(y_t, t\right) \, d t , \tag{1} \]
where the velocity \(v\) is parameterized by the weights \(\Theta\) of a neural network. Prior work by Chen et al. (2018) suggested directly solving Equation (1) via differentiable ODE solvers. However, this process is computationally expensive, especially for large network architectures that parameterize \(v_{\Theta}\left(y_t, t\right)\). A more efficient alternative is to directly regress a vector field \(u_t\) that generates a probability path between \(p_0\) and \(p_1\). To construct such a \(u_t\), we define a forward process, corresponding to a probability path \(p_t\) between \(p_0\) and \(p_1=\mathcal{N}(0, I)\), as
\[ z_t = a_t x_0 + b_t \epsilon \quad \text{where } \epsilon \sim \mathcal{N}(0, I) . \]
For \(a_0=1\), \(b_0=0\), \(a_1=0\), and \(b_1=1\), the marginals
\[ p_t\left(z_t\right)=\mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)} \, p_t\left(z_t \mid \epsilon\right) \]
are consistent with the data and noise distributions.
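
To make the simulation-free training idea concrete, the following is a minimal, hypothetical training-loss sketch assuming the straight-path choice \(a_t = 1 - t\) and \(b_t = t\), for which the conditional velocity generating the path is \(\epsilon - x_0\); `velocity_model` is a placeholder for any network predicting \(v_{\Theta}(z_t, t)\):

```python
import torch

def rectified_flow_loss(velocity_model, x0, t):
    """Simulation-free (conditional flow matching) loss for a straight path.

    With z_t = (1 - t) * x0 + t * eps, the conditional velocity generating this
    path is d z_t / d t = eps - x0, so the network is regressed directly onto
    that target without solving the ODE during training.
    """
    eps = torch.randn_like(x0)                      # eps ~ N(0, I)
    t_ = t.view(-1, *([1] * (x0.dim() - 1)))        # broadcast t over data dims
    z_t = (1.0 - t_) * x0 + t_ * eps                # straight-line interpolation
    target = eps - x0                               # conditional velocity u_t
    pred = velocity_model(z_t, t)                   # v_Theta(z_t, t)
    return torch.mean((pred - target) ** 2)
```

At sampling time, the learned velocity field is integrated from \(t = 1\) (noise) to \(t = 0\) (data) with any ODE solver; the straighter the learned path, the fewer integration steps are needed.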