SDXL

Abstract

  • SDXL leverages a three times larger UNet backbone: more attention blocks and a larger cross-attention context as SDXL uses a second text encoder

  • We design multiple novel conditioning schemes and train SDXL on multiple aspect ratios

  • We introduce a refinement model that improves the visual fidelity of samples generated by SDXL via a post-hoc image-to-image technique.

Introduction

We train the final model, SDXL, in a multi-stage procedure. SDXL uses the autoencoder from Sec. 2.4 and a discrete-time diffusion schedule [14, 45] with 1000 steps.

  1. First, we pretrain a base model on an internet dataset for 600,000 optimization steps at a resolution of \(256 \times 256\) pixels and a batch size of 2048, using size and crop-conditioning.
  2. We continue training on \(512 \times 512\) pixel images for another 200,000 optimization steps and finally utilize multi-aspect training in combination with an offset-noise [11, 25] level of 0.05 to train the model on different aspect ratios of \(\sim 1024 \times 1024\) pixel area.
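
The offset-noise level of 0.05 mentioned in step 2 can be illustrated with a minimal sketch; the function name and tensor shapes below are illustrative assumptions, not code from the paper.

```python
import torch

def sample_offset_noise(latents: torch.Tensor, offset_level: float = 0.05) -> torch.Tensor:
    """Standard Gaussian noise plus a small per-channel constant offset."""
    noise = torch.randn_like(latents)
    # One random offset per (sample, channel), broadcast over all spatial positions.
    # This makes it easier for the model to shift the overall brightness of a sample.
    offset = torch.randn(latents.shape[0], latents.shape[1], 1, 1, device=latents.device)
    return noise + offset_level * offset

# Usage: replace the plain torch.randn_like(latents) call in a diffusion training loop.
latents = torch.randn(4, 4, 128, 128)  # dummy SDXL-sized latents
noise = sample_offset_noise(latents)
```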

UNet Structure

\[ \begin{aligned} &\text{Table 1: Comparison of SDXL and older Stable Diffusion models.}\\ &\begin{array}{lccc} \hline \text{Model} & \text{SDXL} & \text{SD 1.4/1.5} & \text{SD 2.0/2.1} \\ \hline \text{\# of UNet params} & 2.6\,\mathrm{B} & 860\,\mathrm{M} & 865\,\mathrm{M} \\ \text{Transformer blocks} & [0, 2, 10] & [1, 1, 1, 1] & [1, 1, 1, 1] \\ \text{Channel mult.} & [1, 2, 4] & [1, 2, 4, 4] & [1, 2, 4, 4] \\ \text{Text encoder} & \text{CLIP ViT-L \& OpenCLIP ViT-bigG} & \text{CLIP ViT-L} & \text{OpenCLIP ViT-H} \\ \text{Context dim.} & 2048 & 768 & 1024 \\ \text{Pooled text emb.} & \text{OpenCLIP ViT-bigG} & \text{N/A} & \text{N/A} \\ \hline \end{array} \end{aligned} \]

We concatenate the penultimate outputs of the two text encoders (CLIP ViT-L and OpenCLIP ViT-bigG) along the channel axis, yielding the cross-attention context dimension of 2048 (768 + 1280) listed in Table 1. In addition, the pooled text embedding from OpenCLIP ViT-bigG is used as an extra conditioning signal.
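
A rough illustration of this concatenation, using dummy tensors in place of the real encoder outputs (the hidden sizes 768 and 1280 correspond to CLIP ViT-L and OpenCLIP ViT-bigG; batch size and sequence length are placeholders):

```python
import torch

batch, seq_len = 2, 77  # typical token sequence length; placeholder batch size

# Stand-ins for the penultimate hidden states of the two text encoders.
clip_l_hidden = torch.randn(batch, seq_len, 768)      # CLIP ViT-L
open_clip_hidden = torch.randn(batch, seq_len, 1280)  # OpenCLIP ViT-bigG

# Concatenate along the channel axis -> cross-attention context of dim 2048.
context = torch.cat([clip_l_hidden, open_clip_hidden], dim=-1)
print(context.shape)  # torch.Size([2, 77, 2048])

# Pooled embedding from OpenCLIP ViT-bigG, used as an additional conditioning.
pooled = torch.randn(batch, 1280)
```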

Latent resolution (VAE downsampling factor 8):

  • SDXL: 128 × 128
  • SD 1.5: 64 × 64
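
These latent sizes can be checked with the publicly released SDXL VAE via diffusers; this is a sketch under the assumption of that library and checkpoint, not code from the paper:

```python
import torch
from diffusers import AutoencoderKL

# Load the SDXL VAE (published separately as "stabilityai/sdxl-vae").
vae = AutoencoderKL.from_pretrained("stabilityai/sdxl-vae")

# A dummy 1024x1024 RGB image scaled to [-1, 1].
image = torch.randn(1, 3, 1024, 1024)

with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample() * vae.config.scaling_factor

print(latents.shape)  # torch.Size([1, 4, 128, 128]) -> downsampling factor 8
```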

Micro-Conditioning

\[ \begin{array}{ccccccc} \hline \text { Height } & \text { Width } & \text { Aspect Ratio } & & \text { Height } & \text { Width} & \text { Aspect Ratio } \\ 512 & 2048 & 0.25 & & 1024 & 1024 & 1.0 \\ 512 & 1984 & 0.26 & & 1024 & 960 & 1.07 \\ 512 & 1920 & 0.27 & & 1088 & 960 & 1.13 \\ 512 & 1856 & 0.28 & & 1088 & 896 & 1.21 \\ 576 & 1792 & 0.32 & & 1152 & 896 & 1.29 \\ 576 & 1728 & 0.33 & & 1152 & 832 & 1.38 \\ 576 & 1664 & 0.35 & & 1216 & 832 & 1.46 \\ 640 & 1600 & 0.4 & & 1280 & 768 & 1.67 \\ 640 & 1536 & 0.42 & & 1344 & 768 & 1.75 \\ 704 & 1472 & 0.48 & & 1408 & 704 & 2.0 \\ 704 & 1408 & 0.5 & & 1472 & 704 & 2.09 \\ 704 & 1344 & 0.52 & & 1536 & 640 & 2.4 \\ 768 & 1344 & 0.57 & & 1600 & 640 & 2.5 \\ 768 & 1280 & 0.6 & & 1664 & 576 & 2.89 \\ 832 & 1216 & 0.68 & & 1728 & 576 & 3.0 \\ 832 & 1152 & 0.72 & & 1792 & 576 & 3.11 \\ 896 & 1152 & 0.78 & & 1856 & 512 & 3.62 \\ 896 & 1088 & 0.82 & & 1920 & 512 & 3.75 \\ 960 & 1088 & 0.88 & & 1984 & 512 & 3.88 \\ 960 & 1024 & 0.94 & & 2048 & 512 & 4.0 \\ \hline \end{array} \]
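
The table above lists the height/width buckets (pixel area close to \(1024^2\), in multiples of 64) used for the multi-aspect training described further below. As a rough sketch of how an image might be assigned to the closest bucket (only a few buckets are listed, and the nearest-ratio selection rule is an assumption for illustration, not the paper's rule):

```python
BUCKETS = [(512, 2048), (768, 1344), (1024, 1024), (1344, 768), (2048, 512)]

def closest_bucket(height: int, width: int) -> tuple:
    """Pick the bucket whose aspect ratio is closest to the image's ratio."""
    ratio = height / width
    return min(BUCKETS, key=lambda hw: abs(hw[0] / hw[1] - ratio))

print(closest_bucket(720, 1280))   # landscape photo -> (768, 1344)
print(closest_bucket(1080, 1080))  # square image    -> (1024, 1024)
```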

  • Conditioning the Model on Image Size:

    We provide the original (i.e., before any rescaling) height and width of the images as an additional conditioning to the model, \(\mathbf{c}_{\text{size}}=\left(h_{\text{original}}, w_{\text{original}}\right)\). Each component is independently embedded using a Fourier feature encoding, and these encodings are concatenated into a single vector that we feed into the model by adding it to the timestep embedding.

    At inference time, a user can then set the desired apparent resolution of the image via this size conditioning (a code sketch covering all three micro-conditionings follows this list).

  • Conditioning the Model on Cropping Parameters (complementary):

    SD 1.5 and SD 2.1 may generate awkwardly cropped subjects (e.g., heads cut off). This is because a typical data-processing pipeline (i) resizes an image so that its shortest side matches the desired target size and then (ii) randomly crops the image along the longer axis. While random cropping is a natural form of data augmentation, it can leak into the generated samples, causing the detrimental effects described above.

    To fix this problem, we propose another simple yet effective conditioning method: During data loading, we uniformly sample crop coordinates \(c_{\text{top}}\) and \(c_{\text{left}}\) (integers specifying the amount of pixels cropped from the top-left corner along the height and width axes, respectively) and feed them into the model as conditioning parameters via Fourier feature embeddings, similar to the size conditioning described above. The concatenated embedding \(\mathbf{c}_{\text{crop}}\) is then used as an additional conditioning parameter.

    Given that in our experience large scale datasets are, on average, object-centric, we set \(\left(c_{\text {top }}, c_{\text {left }}\right)=\) \((0,0)\) during inference and thereby obtain object-centered samples from the trained model.

  • Multi-Aspect Training (complementary):

    While the common output resolutions for text-to-image models are square images of 512 × 512 or 1024 × 1024 pixels, we argue that this is a rather unnatural choice, given the widespread distribution and use of landscape (e.g., 16:9) or portrait format screens. We follow common practice [31] and partition the data into buckets of different aspect ratios, where we keep the pixel count as close to \(1024^2\) pixels as possible, varying height and width accordingly in multiples of 64.

    During optimization, a training batch is composed of images from the same bucket, and we alternate between bucket sizes for each training step. Additionally, the model receives the bucket size (i.e., the target size) as a conditioning, represented as a tuple of integers \(\mathbf{c}_{\text{ar}}=\left(h_{\text{tgt}}, w_{\text{tgt}}\right)\), which is embedded into a Fourier space in analogy to the size- and crop-conditionings described above.

    In practice, we apply multi-aspect training as a finetuning stage after pretraining the model at a fixed aspect-ratio and resolution and combine it with the conditioning techniques introduced in Sec. 2.2 via concatenation along the channel axis.

    Note that crop-conditioning and multi-aspect training are complementary operations, and crop-conditioning then only works within the bucket boundaries (usually 64 pixels).
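
Putting the three micro-conditionings together, here is a minimal sketch of the Fourier-feature embedding and concatenation (the embedding dimension of 256 per scalar and the helper names are assumptions; the projection into the timestep embedding is omitted):

```python
import math
import torch

def fourier_embed(x: torch.Tensor, dim: int = 256) -> torch.Tensor:
    """Sinusoidal (Fourier-feature) embedding of one scalar per batch element."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    args = x.float()[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)  # (B, dim)

def micro_conditioning(h_orig, w_orig, c_top, c_left, h_tgt, w_tgt, dim: int = 256):
    """Embed c_size, c_crop and c_ar and concatenate them into one vector.

    In the model this vector would be projected and added to the timestep
    embedding; that projection is omitted here.
    """
    scalars = [h_orig, w_orig, c_top, c_left, h_tgt, w_tgt]
    embeddings = [fourier_embed(s, dim) for s in scalars]  # 6 x (B, dim)
    return torch.cat(embeddings, dim=-1)                   # (B, 6 * dim)

# Example: original image 768x1024, no cropping, 1024x1024 target bucket.
B = 2
cond = micro_conditioning(
    torch.full((B,), 768), torch.full((B,), 1024),   # c_size = (h_orig, w_orig)
    torch.zeros(B), torch.zeros(B),                  # c_crop = (c_top, c_left)
    torch.full((B,), 1024), torch.full((B,), 1024),  # c_ar   = (h_tgt, w_tgt)
)
print(cond.shape)  # torch.Size([2, 1536])
```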

Improved Autoencoder

We train the same autoencoder architecture used for the original Stable Diffusion at a larger batch size (256 vs. 9) and additionally track the weights with an exponential moving average (EMA). The resulting autoencoder outperforms the original model in all evaluated reconstruction metrics, see Tab. 3. We use this autoencoder for all of our experiments.
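
A minimal sketch of EMA weight tracking as used here (the decay value is an assumption; the paper does not state it):

```python
import copy
import torch

class EMA:
    """Keep a shadow copy of the model whose weights are an EMA of the live weights."""

    def __init__(self, model: torch.nn.Module, decay: float = 0.9999):
        self.decay = decay
        self.shadow = copy.deepcopy(model).eval()
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model: torch.nn.Module):
        for ema_p, p in zip(self.shadow.parameters(), model.parameters()):
            # ema = decay * ema + (1 - decay) * current
            ema_p.lerp_(p, 1.0 - self.decay)
```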