Taming Transformers for High-Resolution Image Synthesis
Introduction
We demonstrate how combining the effectiveness of the inductive bias of CNNs with the expressivity of transformers enables them to model and thereby synthesize high-resolution images.
CNN vs Transformers
- CNNs: strong inductive prior, designed to exploit prior knowledge about strong local correlations within images
- Transformers: no built-in inductive prior, free to learn complex relationships among their inputs
How to obtain an effective and expressive model?
Problem: low-level image structure is well described by CNNs, but this locality bias ceases to be effective at higher semantic levels.
Method:
- use a convolutional approach to efficiently learn a codebook of context-rich visual parts
- use a transformer to learn the long-range interactions within these compositions
- use an adversarial approach to ensure that the dictionary of local parts captures perceptually important local structure
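The first step above hinges on replacing each continuous CNN feature with its nearest codebook entry, so that an image becomes a grid of discrete indices the transformer can model autoregressively. A minimal NumPy sketch of that nearest-neighbour lookup follows; the shapes, names, and random toy data are illustrative assumptions, not the paper's actual VQGAN implementation (which learns the codebook end-to-end with straight-through gradients).

```python
import numpy as np

def quantize(features, codebook):
    """Replace each feature vector with its nearest codebook entry.

    features: (H, W, D) grid of continuous CNN features (toy stand-in)
    codebook: (K, D) dictionary of learned visual parts (toy stand-in)
    Returns the quantized grid and the (H, W) index map that the
    transformer would model as a sequence.
    """
    H, W, D = features.shape
    flat = features.reshape(-1, D)                                    # (H*W, D)
    # squared L2 distance from every location to every codebook entry
    d = ((flat[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)      # (H*W, K)
    idx = d.argmin(axis=1)                                            # nearest entry index
    return codebook[idx].reshape(H, W, D), idx.reshape(H, W)

# toy usage with a random feature grid and codebook
rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 4, 8))     # hypothetical 4x4 feature grid, D=8
book = rng.normal(size=(16, 8))        # hypothetical codebook of K=16 parts
z_q, indices = quantize(feats, book)   # z_q: (4, 4, 8), indices: (4, 4)
```

The index map is what makes the transformer stage tractable: instead of attending over every pixel of a high-resolution image, it attends over a short sequence of codebook indices.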
