SDXL-Lightning

Abstract

We propose a diffusion distillation method that achieves a new state of the art in one-step/few-step 1024px text-to-image generation based on SDXL. Our method combines progressive and adversarial distillation to achieve a balance between quality and mode coverage. In this paper, we discuss the theoretical analysis, discriminator design, model formulation, and training techniques. We open-source our distilled SDXL-Lightning models both as LoRA and full UNet weights.
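For reference, the released checkpoints can be used as drop-in modifications of the SDXL base model. The following is a minimal inference sketch with Hugging Face diffusers; the repository and checkpoint file names are assumptions for illustration and may differ from the actual release.

```python
import torch
from diffusers import StableDiffusionXLPipeline, EulerDiscreteScheduler

# Load the standard SDXL base pipeline (fp16 for GPU inference).
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")

# Apply the distilled LoRA weights on top of the base UNet.
# Repository and file names below are illustrative assumptions.
pipe.load_lora_weights(
    "ByteDance/SDXL-Lightning",
    weight_name="sdxl_lightning_4step_lora.safetensors",
)
pipe.fuse_lora()

# Few-step sampling: the distilled model targets very low step counts,
# and classifier-free guidance is typically disabled (guidance_scale=0).
pipe.scheduler = EulerDiscreteScheduler.from_config(
    pipe.scheduler.config, timestep_spacing="trailing"
)
image = pipe(
    "a photo of an astronaut riding a horse",
    num_inference_steps=4,
    guidance_scale=0,
).images[0]
image.save("output.png")
```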

Introduction

Conceptually, diffusion-based generation involves a probability flow that gradually transports samples between the noise distribution and the data distribution. Formally, this flow can be expressed as an ordinary differential equation (ODE). In practice, generating a high-quality sample requires more than 50 inference steps.
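For concreteness, one common form of this probability flow ODE, written in score-based notation (which may differ from the formulation used later in this paper), is:

```latex
% Probability flow ODE: a deterministic trajectory whose marginals
% match those of the forward diffusion process.
\[
\frac{\mathrm{d}x_t}{\mathrm{d}t}
  = f(x_t, t) - \tfrac{1}{2}\, g(t)^2 \, \nabla_{x_t} \log p_t(x_t),
\]
% where f and g are the drift and diffusion coefficients of the forward
% process, and the score \nabla_{x_t} \log p_t(x_t) is approximated by
% the learned network.
```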

Various approaches to reducing the number of inference steps have been explored. Prior works propose better ODE solvers to account for the curved nature of the flow [19, 30, 34, 35, 64, 78]. Others propose formulations that make the flow straighter [29, 31]. Nonetheless, these approaches generally still require more than 20 inference steps.
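As a rough illustration of the solver-based line of work, the sketch below contrasts a first-order Euler step with a second-order Heun step for integrating the flow. Here `velocity_fn` stands in for a learned network predicting dx/dt; it is an assumption of this sketch, not the interface used in this paper.

```python
import torch

def euler_step(x, t, t_next, velocity_fn):
    """First-order (Euler) step: accurate only if the flow is nearly
    straight over [t, t_next], which is why many small steps are needed."""
    v = velocity_fn(x, t)
    return x + (t_next - t) * v

def heun_step(x, t, t_next, velocity_fn):
    """Second-order (Heun) step: evaluates the velocity at both ends of the
    interval and averages them, partially correcting for curvature at the
    cost of an extra network evaluation."""
    v1 = velocity_fn(x, t)
    x_pred = x + (t_next - t) * v1
    v2 = velocity_fn(x_pred, t_next)
    return x + (t_next - t) * 0.5 * (v1 + v2)

def sample(velocity_fn, shape, num_steps=50, step_fn=heun_step):
    """Sampling loop sketch: integrate the flow from t=1 (noise) to t=0 (data)
    over a fixed timestep grid."""
    x = torch.randn(shape)
    ts = torch.linspace(1.0, 0.0, num_steps + 1)
    for t, t_next in zip(ts[:-1], ts[1:]):
        x = step_fn(x, t, t_next, velocity_fn)
    return x
```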

Our method combines the best of both worlds: progressive distillation [54] and adversarial distillation [58]. Progressive distillation ensures that the distilled model follows the same probability flow and retains the same mode coverage as the original model. However, progressive distillation with a mean squared error (MSE) loss produces blurry results at fewer than 8 inference steps; we provide a theoretical analysis of this failure mode in the paper.
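The sketch below illustrates the plain MSE progressive-distillation objective discussed above, assuming a velocity-prediction parameterization and Euler steps; it is illustrative only and not the exact training objective of this work.

```python
import torch
import torch.nn.functional as F

def progressive_distillation_loss(student, teacher, x_t, t, dt):
    """One progressive-distillation target (sketch): the teacher takes two
    half-size ODE steps, and the student must reach the same endpoint in a
    single step. `student` and `teacher` are assumed to predict dx/dt."""
    with torch.no_grad():
        # Teacher: two consecutive steps of size dt/2.
        x_mid = x_t + (dt / 2) * teacher(x_t, t)
        x_target = x_mid + (dt / 2) * teacher(x_mid, t + dt / 2)
    # Student: one step of size dt toward the same endpoint.
    x_student = x_t + dt * student(x_t, t)
    # Plain MSE; averaging over plausible endpoints is what causes blur at
    # very low step counts, motivating the adversarial loss described next.
    return F.mse_loss(x_student, x_target)
```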

To mitigate this issue, we apply an adversarial loss at every stage of the distillation, striking a balance between quality and mode coverage.
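The following sketch shows one way such an adversarial term can be combined with the distillation target, using a standard non-saturating GAN loss; the discriminator interface and the parameterization are assumptions for illustration, not this paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def adversarial_distillation_losses(student, teacher, discriminator, x_t, t, dt):
    """Sketch: the discriminator tries to tell the student's one-step
    prediction apart from the teacher's two-step prediction, and the student
    is trained to fool it (non-saturating GAN losses)."""
    with torch.no_grad():
        # Teacher target from two half-size steps, as in progressive distillation.
        x_mid = x_t + (dt / 2) * teacher(x_t, t)
        x_teacher = x_mid + (dt / 2) * teacher(x_mid, t + dt / 2)
    x_student = x_t + dt * student(x_t, t)

    # Discriminator loss: real = teacher prediction, fake = student prediction.
    d_real = discriminator(x_teacher, t)
    d_fake = discriminator(x_student.detach(), t)
    loss_d = F.softplus(-d_real).mean() + F.softplus(d_fake).mean()

    # Generator (student) loss: fool the discriminator.
    loss_g = F.softplus(-discriminator(x_student, t)).mean()
    return loss_g, loss_d
```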