PuLID

Introduces both contrastive alignment loss and accurate ID loss to keep ID fidelity.

GitHub


Abstract

For ID customization, existing methods fall into two branches:

  • Methods that fine-tune certain parameters for each ID:
    • Textual Inversion
    • DreamBooth
    • LoRA
    • Multi-Concept Customization of Text-to-Image Diffusion
  • Methods that forgo fine-tuning for each ID:
    • Tuning-free multi-subject image generation with localized attention
    • IP-Adapter
    • PhotoVerse
    • Face0: Instantaneously conditioning a text-to-image model on a face
    • PhotoMaker
    • When StyleGAN meets Stable Diffusion: a W+ adapter for personalized image generation
    • InstantID

Introduction

Current Challenges

  1. Insertion of ID disrupts the original model’s behavior

    Ideally, only ID-related aspects should be altered. While IP-Adapter, InstantID, and PhotoMaker have shown the ability to perform stylized ID generation, notable style degradation occurs when compared with images generated before ID insertion.

    Retain the ability of the original T2I model to follow prompts.

    • Enhancing the encoder: IP-Adapter-FaceID shifts from CLIP feature extraction to face ID embeddings from ArcFace to obtain more abstract and ID-relevant information (the paper notes the ID fidelity is still not high enough). InstantID adds an ID & Landmark ControlNet for more effective modulation (which compromises some degree of editability and flexibility).
    • Improving the dataset: PhotoMaker constructs datasets grouped by ID, with several images per ID (this demands significant effort, and the effect on non-celebrities may be limited).
  2. Lack of ID fidelity

    A common remedy is to introduce an ID loss within diffusion training. However, due to the iterative denoising nature of diffusion models, obtaining an accurate \(x_0\) requires multiple steps.

    Predicting \(x_0\) directly from the current timestep and then calculating the ID loss fails when the current timestep is large: the predicted \(x_0\) is often noisy and flawed.

    PortraitBooth calculates the ID loss only at less noisy timesteps, which ignores the loss in the early (noisy) steps and thereby limits its overall effectiveness.
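A toy sketch of why the one-step \(x_0\) prediction breaks down at noisy timesteps (this is an illustration under the standard DDPM parameterization, not PuLID code):

```python
import torch

# One-step x0 prediction used by prior ID-loss methods:
#   x0_hat = (x_t - sqrt(1 - alpha_bar_t) * eps) / sqrt(alpha_bar_t)
# At large t, alpha_bar_t is small, so any error in the predicted noise eps
# is amplified, and x0_hat becomes too flawed for a face-recognition ID loss.

def predict_x0(x_t: torch.Tensor, eps: torch.Tensor, alpha_bar_t: float) -> torch.Tensor:
    a = torch.tensor(alpha_bar_t)
    return (x_t - torch.sqrt(1 - a) * eps) / torch.sqrt(a)

torch.manual_seed(0)
x0 = torch.randn(1, 3, 8, 8)
eps_true = torch.randn_like(x0)
eps_pred = eps_true + 0.1 * torch.randn_like(x0)  # imperfect noise prediction

errors = {}
for alpha_bar in (0.9, 0.01):  # small t vs. large (noisy) t
    x_t = (alpha_bar ** 0.5) * x0 + ((1 - alpha_bar) ** 0.5) * eps_true
    errors[alpha_bar] = (predict_x0(x_t, eps_pred, alpha_bar) - x0).abs().mean().item()
# the same eps error corrupts x0_hat far more at the noisy timestep
```

The amplification factor is \(\sqrt{(1-\bar\alpha_t)/\bar\alpha_t}\), which explains why PortraitBooth restricts the ID loss to less noisy stages.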

Proposal

Introduce a Lightning T2I branch alongside the standard diffusion-denoising training branch. The Lightning T2I branch can generate high-quality images from pure noise within a limited and manageable number of steps.

  • minimize the influence on the original model’s behavior
  • naturally extract its face embedding and calculate an accurate ID loss
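The branch structure can be sketched as a few-step sampler (the hypothetical `denoiser` stands in for the accelerated SDXL-Lightning UNet; the real noise schedule and guidance are omitted):

```python
import torch

# Minimal structural sketch of the Lightning T2I branch: starting from pure
# noise x_T, a distilled model reaches a clean x0 within ~4 steps, so an
# accurate face embedding and ID loss can be computed on the result.

def lightning_t2i(denoiser, x_T: torch.Tensor, id_cond, txt_cond, steps: int = 4) -> torch.Tensor:
    x = x_T
    for t in reversed(range(steps)):           # e.g. 4 distilled timesteps
        x = denoiser(x, t, id_cond, txt_cond)  # each call yields a cleaner sample
    return x  # accurate x0, suitable for a face-recognition model

# Toy denoiser that halves the sample each step (illustration only)
toy_denoiser = lambda x, t, c_id, c_txt: 0.5 * x
x_T = torch.randn(1, 3, 8, 8)
x0 = lightning_t2i(toy_denoiser, x_T, None, None)
```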

Contributions

  • We propose a tuning-free method, namely, PuLID, which preserves high ID similarity while mitigating the impact on the original model’s behavior.

  • We introduce a Lightning T2I branch alongside the regular diffusion branch. Within this branch, we incorporate a contrastive alignment loss and ID loss to minimize the contamination of ID information on the original model while ensuring fidelity. Compared to the current mainstream approaches that improve the ID encoder or datasets, we offer a new perspective and training paradigm.

  • Experiments show that our method achieves SOTA performance in terms of both ID fidelity and editability. Moreover, compared to existing methods, our ID information is less invasive to the model, making our method more flexible for practical applications.

Related work

Notably, models with higher ID fidelity often cause more significant disruptions to the behavior of the original model.

  • advanced sampling methods: DDIM, DPM-solver++, DPS

    • Recent distill-based works:
      • Progressive adversarial diffusion distillation.
      • Latent consistency models: Synthesizing high-resolution images with few-step inference.
      • Adversarial diffusion distillation

    In this study, the Lightning T2I training branch we introduce leverages the SDXL-Lightning [23] acceleration technology (SDXL-Lightning: Progressive Adversarial Diffusion Distillation).

  • ID Loss

    To improve ID fidelity, ID loss is employed in previous works [18,3], motivated by its effectiveness in prior GAN-based works [35,45]. However, in these methods, \(x_0\) is typically directly predicted from the current timestep using a single step, often resulting in noisy and flawed images. Such images are not ideal for the face recognition models [6], as they are trained on real-world images. PortraitBooth [29] alleviates this issue by only applying ID loss at less noisy stages, which ignores such loss in the early steps, thereby limiting its overall effectiveness. Diffswap [54] obtains a better predicted \(x_0\) by employing two steps instead of just one, even though this estimation still contains noisy artifacts. In our work, with the introduced Lightning T2I training branch, we can calculate ID loss in a more accurate setting.

Basic setting

Currently, tuning-free ID customization methods generally face a challenge: the embedding of the ID disrupts the behavior of the original model.

In the conventional ID customization diffusion training process, as formulated in Eq. 1, the ID condition \(C_{id}\) is usually cropped from the target image \(x_0\) [50, 44]. In this scenario, the ID condition aligns completely with the prompt and the UNet features, implying the ID condition does not contaminate the T2I diffusion model during training. This essentially forms a reconstruction task. So, to better reconstruct \(x_0\) (or predict the noise \(\epsilon\)), the model makes the utmost effort to use all the information in the ID features (which likely contain ID-irrelevant information), and it biases the training parameters towards the dataset distribution, typically the realistic portrait domain. Consequently, during testing, when we provide a prompt that conflicts or is misaligned with the ID condition, such as altering ID attributes or changing styles, these methods tend to fail, because there is a disparity between the testing and training settings.

  • ID insertion: via IP-Adapter-style cross-attention
  • Embedding: concatenate the face ID embedding with the CLIP image embedding, then use an MLP to map them into 5 tokens.
  • Contrastive Alignment: construct two paths; one path is conditioned only by the prompt, while the other employs both the ID and the prompt as conditions.
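The embedding step above can be sketched as follows (the module name, hidden sizes, and the 2048-d token dimension are assumptions for illustration, not the actual PuLID module):

```python
import torch
import torch.nn as nn

# Sketch: concatenate the 512-d face ID embedding with the 768-d CLIP image
# embedding and map the result into 5 tokens via an MLP.

class IDTokenMapper(nn.Module):
    def __init__(self, id_dim=512, clip_dim=768, token_dim=2048, num_tokens=5):
        super().__init__()
        self.num_tokens = num_tokens
        self.token_dim = token_dim
        self.mlp = nn.Sequential(
            nn.Linear(id_dim + clip_dim, token_dim * num_tokens),
            nn.GELU(),
            nn.Linear(token_dim * num_tokens, token_dim * num_tokens),
        )

    def forward(self, id_embed, clip_embed):
        x = torch.cat([id_embed, clip_embed], dim=-1)       # [B, 1280]
        x = self.mlp(x)                                      # [B, 5 * 2048]
        return x.view(-1, self.num_tokens, self.token_dim)   # [B, 5, 2048]

mapper = IDTokenMapper()
tokens = mapper(torch.randn(1, 512), torch.randn(1, 768))
```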

Loss

  • Alignment Loss

    The response of the cross-attention (computed from Q) should remain the same with and without the ID condition.

    • semantic alignment loss: can be interpreted as the response of the UNet features to the prompt. If embedding the ID does not affect the original model’s behavior, the response of the UNet features to the prompt should be similar in both paths: \[ \mathcal{L}_{\text {align-sem }}=\left\|\operatorname{Softmax}\left(\frac{K Q_{\text {tid }}^T}{\sqrt{d}}\right) Q_{\text {tid }}-\operatorname{Softmax}\left(\frac{K Q_t^T}{\sqrt{d}}\right) Q_t\right\|_2 \]

    • layout alignment loss: \[ \mathcal{L}_{\text {align-layout }}=\left\|Q_{\text {tid }}-Q_t\right\|_2 . \]

    • add together \[ \mathcal{L}_{\text {align }}=\lambda_{\text {align-sem }} \mathcal{L}_{\text {align-sem }}+\lambda_{\text {align-layout }} \mathcal{L}_{\text {align-layout }}, \]

  • ID Loss

    Directly predicting \(x_0\) in a single step from \(\epsilon\) produces a flawed, noisy image, which makes the ID loss inaccurate.

    In this study, thanks to the introduced Lightning T2I branch, the above issue can be fundamentally resolved. Firstly, we can swiftly generate an accurate \(x_0\) conditioned on the ID from pure noise within 4 steps. \[ \mathcal{L}_{\text {id }}=1-\operatorname{CosSim}\left(\phi\left(C_{i d}\right), \phi\left(\operatorname{L}-\mathrm{T2I}\left(x_T, C_{i d}, C_{t x t}\right)\right)\right), \]

  • Total Loss \[ \mathcal{L}=\mathcal{L}_{\text {diff }}+\mathcal{L}_{\text {align }}+\lambda_{\text {id }} \mathcal{L}_{\text {id }} . \]
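The three losses above can be sketched with toy tensors (in the real model, Q/K come from the UNet cross-attention layers and \(\phi\) is a face-recognition encoder; the \(\lambda\) weights here are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

# Q_tid / Q_t: UNet queries with and without the ID condition; K: prompt keys.

def semantic_align_loss(Q_tid, Q_t, K):
    d = K.shape[-1]
    # response of the features to the prompt: Softmax(K Q^T / sqrt(d)) Q
    r_tid = torch.softmax(K @ Q_tid.transpose(-1, -2) / d ** 0.5, dim=-1) @ Q_tid
    r_t = torch.softmax(K @ Q_t.transpose(-1, -2) / d ** 0.5, dim=-1) @ Q_t
    return (r_tid - r_t).norm(p=2)

def layout_align_loss(Q_tid, Q_t):
    return (Q_tid - Q_t).norm(p=2)

def id_loss(id_embed, gen_face_embed):
    # 1 - cosine similarity between the reference and generated face embeddings
    return 1 - F.cosine_similarity(id_embed, gen_face_embed, dim=-1).mean()

# Toy shapes: 64 feature tokens, 77 prompt tokens, dim 32.
Q_tid, Q_t = torch.randn(1, 64, 32), torch.randn(1, 64, 32)
K = torch.randn(1, 77, 32)
l_align = 1.0 * semantic_align_loss(Q_tid, Q_t, K) + 1.0 * layout_align_loss(Q_tid, Q_t)
l_id = id_loss(torch.randn(1, 512), torch.randn(1, 512))
l_diff = torch.tensor(0.1)  # stands in for the regular diffusion loss
total = l_diff + l_align + 1.0 * l_id
```

Note that both alignment losses vanish exactly when the ID condition leaves the queries unchanged (\(Q_{tid} = Q_t\)), which is the intended "no contamination" behavior.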

Experiments

  • ID encoder: antelopev2

  • CLIP Image encoder: EVA-CLIP

  • Dataset captioning: BLIP-2

  • Training process

    1. First stage: train the model with the conventional diffusion loss \(L_{diff}\).
    2. Second stage: resume from the first-stage model and train with the ID loss \(L_{id}\) together with \(L_{diff}\).
    3. Third stage: add the alignment loss \(L_{align}\).
  • test set: Unsplash-50

  • Comparison

    When compared to SOTA methods such as IP-Adapter and InstantID, PuLID tends to achieve higher ID fidelity while creating less disruption to the original model.

    As the disruption to the model decreases, PuLID can accurately replicate the lighting (1st column), style (4th column), and even layout (5th column) of the original model.

    PuLID also possesses respectable prompt-editing capabilities.

  • Limitations

    • demands more CUDA memory
    • ID loss impacts image quality to some extent, such as causing blurriness in faces
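The three training stages listed above can be sketched as a simple loss schedule (stage semantics from the notes; the stage boundaries and weights are illustrative assumptions):

```python
import torch

def total_loss(stage: int, l_diff, l_id, l_align, lambda_id: float = 1.0):
    if stage == 1:  # stage 1: conventional diffusion loss only
        return l_diff
    if stage == 2:  # stage 2: resume and add the ID loss
        return l_diff + lambda_id * l_id
    return l_diff + l_align + lambda_id * l_id  # stage 3: add alignment loss

# Toy loss values for illustration
l_diff, l_id, l_align = torch.tensor(0.5), torch.tensor(0.3), torch.tensor(0.2)
losses = [total_loss(s, l_diff, l_id, l_align) for s in (1, 2, 3)]
```

Staging the losses this way lets the model first learn ID insertion, then fidelity, and only finally constrain contamination.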

Source code

Get ID Embedding

Use antelopev2 to get the ID embedding.

image_bgr = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)
# get antelopev2 embedding
face_info = self.app.get(image_bgr)
...
id_ante_embedding = face_info['embedding']  # (512,), unsqueeze(0) to (1, 512)
...
(figures: origin image; antelopev2 detection result)

Parsing

When parsing, the ears are removed (treated as background).

# using facexlib to detect and align face
self.face_helper.read_image(image_bgr)
self.face_helper.get_face_landmarks_5(only_center_face=True)
self.face_helper.align_warp_face()
if len(self.face_helper.cropped_faces) == 0:
    raise RuntimeError('facexlib align face fail')
align_face = self.face_helper.cropped_faces[0]

# parsing
input = img2tensor(align_face, bgr2rgb=True).unsqueeze(0) / 255.0
input = input.to(self.device)
parsing_out = self.face_helper.face_parse(
    normalize(input, [0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
)[0]
parsing_out = parsing_out.argmax(dim=1, keepdim=True)
bg_label = [0, 16, 18, 7, 8, 9, 14, 15]
bg = sum(parsing_out == i for i in bg_label).bool()
white_image = torch.ones_like(input)
# keep only the face features: background becomes white, face becomes grayscale
face_features_image = torch.where(bg, white_image, self.to_gray(input))

(figures: input; face_features_image)

Concatenate the ID embedding from antelopev2 with the CLIP embedding from the ViT.

# transform image before sending to eva-clip-vit
face_features_image = resize(
    face_features_image, self.clip_vision_model.image_size, InterpolationMode.BICUBIC
)
face_features_image = normalize(face_features_image, self.eva_transform_mean, self.eva_transform_std)
id_cond_vit, id_vit_hidden = self.clip_vision_model(
    face_features_image, return_all_features=False, return_hidden=True, shuffle=False
)  # shapes: [1, 768], 5 * [1, 577, 1024]

# L2-normalize the CLIP embedding
id_cond_vit_norm = torch.norm(id_cond_vit, 2, 1, True)
id_cond_vit = torch.div(id_cond_vit, id_cond_vit_norm)

id_cond = torch.cat([id_ante_embedding, id_cond_vit], dim=-1)  # [1, 1280] = 512 + 768
id_cond_list.append(id_cond)
id_vit_hidden_list.append(id_vit_hidden)
id_cond = torch.stack(id_cond_list, dim=1)  # [1, 1, 1280]
id_vit_hidden = id_vit_hidden_list[0]  # the ViT hidden states of the primary image
for i in range(1, len(image_list)):
    for j, x in enumerate(id_vit_hidden_list[i]):
        # concatenate the ViT hidden states of the other images
        id_vit_hidden[j] = torch.cat([id_vit_hidden[j], x], dim=1)

id_embedding = self.id_adapter(id_cond, id_vit_hidden)  # inputs: [1, 1, 1280], 5 * [1, 577, 1024]
uncond_id_embedding = self.id_adapter(id_uncond, id_vit_hidden_uncond)

# return id_embedding
return uncond_id_embedding, id_embedding