[Paper Review by an Ordinary Undergrad] DiT (ICCV 2023 oral) & MM-DiT (ICML 2024)

junseok-rh 2025. 7. 23. 16:51

Paper : https://arxiv.org/abs/2212.09748

Scalable Diffusion Models with Transformers

Paper : https://arxiv.org/abs/2403.03206

Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

1. DiT

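DiT is a diffusion model built on the ViT architecture. It replaces the U-Net backbone commonly used in latent diffusion models with a transformer that operates on latent patches.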

Patchify

DiT Block Design

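Besides the noisy input image, a diffusion model also has to take other conditions such as the timestep, a class label, or natural language. After experimenting with several conditioning schemes, the DiT paper settles on adaLN-Zero. In adaLN-Zero, the MLP that regresses the modulation parameters is initialized so that it outputs $0$ for every $\alpha$. Since $\alpha$ scales each sublayer's output right before the residual connection, every DiT block starts out as the identity function.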
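To make the "starts as identity" behavior of adaLN-Zero concrete, here is a minimal NumPy sketch. It is not the paper's implementation: `AdaLNZeroBlock`, `w_inner`, `w_mod`, and `b_mod` are hypothetical names, the inner sublayer is a single linear map standing in for attention or the MLP, and I zero-initialize the whole modulation projection (not only the $\alpha$ part) as a simplifying assumption.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # LayerNorm without learnable affine parameters, over the last axis.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

class AdaLNZeroBlock:
    """Toy residual block with adaLN-Zero conditioning (illustration only)."""

    def __init__(self, dim, rng):
        # Stand-in for the attention / MLP sublayer (assumption: one linear map).
        self.w_inner = rng.normal(0.0, 0.02, size=(dim, dim))
        # Projection that regresses (shift, scale, alpha) from the condition c.
        # Zero init means shift = scale = alpha = 0 at the start of training.
        self.w_mod = np.zeros((dim, 3 * dim))
        self.b_mod = np.zeros(3 * dim)

    def __call__(self, x, c):
        # x: (tokens, dim), c: (dim,) condition embedding (e.g. timestep + class)
        shift, scale, alpha = np.split(c @ self.w_mod + self.b_mod, 3)
        h = layer_norm(x) * (1.0 + scale) + shift  # adaptive layer norm
        h = h @ self.w_inner                       # sublayer
        return x + alpha * h                       # alpha gates the residual branch

rng = np.random.default_rng(0)
block = AdaLNZeroBlock(dim=8, rng=rng)
x = rng.normal(size=(4, 8))
c = rng.normal(size=8)
print(np.allclose(block(x, c), x))  # True: the block is the identity at init
```

Because $\alpha$ is zero at initialization, the residual branch contributes nothing, so the block passes its input through unchanged until training moves the modulation weights away from zero.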

Transformer Decoder

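The output of the final DiT block is a sequence of image tokens, which has to be decoded into an output noise prediction and an output diagonal covariance prediction, each with the same shape as the original spatial input. The DiT paper uses a standard linear decoder: it applies the final layer norm (adaptive if using adaLN) and linearly decodes each token into a $p \times p \times 2C$ tensor, where $p$ is the patch size and $C$ is the number of channels of the spatial input. Finally, the decoded tokens are rearranged into the original spatial layout.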
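The decode-and-rearrange step can be sketched in NumPy as follows. This is only an illustration under my own assumptions: the function names (`patchify`, `unpatchify`, `linear_decode`) are hypothetical, the layer norm is the non-adaptive variant, tokens are ordered row-major over the patch grid, and the first $C$ output channels are taken as the noise prediction with the remaining $C$ as the covariance prediction. `patchify` is included only to check that `unpatchify` really inverts it.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def patchify(latent, p):
    # latent: (C, H, W) -> tokens: (H/p * W/p, p * p * C)
    c, h, w = latent.shape
    gh, gw = h // p, w // p
    x = latent.reshape(c, gh, p, gw, p)
    x = x.transpose(1, 3, 2, 4, 0)          # (gh, gw, p, p, C)
    return x.reshape(gh * gw, p * p * c)

def unpatchify(tokens, grid_h, grid_w, p, out_ch):
    # tokens: (grid_h * grid_w, p * p * out_ch) -> (out_ch, grid_h * p, grid_w * p)
    x = tokens.reshape(grid_h, grid_w, p, p, out_ch)
    x = x.transpose(4, 0, 2, 1, 3)          # (out_ch, grid_h, p, grid_w, p)
    return x.reshape(out_ch, grid_h * p, grid_w * p)

def linear_decode(tokens, weight, bias, grid_h, grid_w, p, C):
    # Layer norm, then one linear layer mapping each token to p * p * 2C values.
    decoded = layer_norm(tokens) @ weight + bias
    out = unpatchify(decoded, grid_h, grid_w, p, 2 * C)
    # Assumed channel order: noise prediction first, covariance second.
    return out[:C], out[C:]

rng = np.random.default_rng(0)
C, p, d = 4, 2, 16                          # latent channels, patch size, hidden size
grid = 32 // p                              # 32x32 latent -> 16x16 patch grid
tokens = rng.normal(size=(grid * grid, d))  # stand-in for the final DiT block output
weight = rng.normal(0.0, 0.02, size=(d, p * p * 2 * C))
bias = np.zeros(p * p * 2 * C)
noise, cov = linear_decode(tokens, weight, bias, grid, grid, p, C)
print(noise.shape, cov.shape)               # (4, 32, 32) (4, 32, 32)
```

The factor of 2 in $2C$ is what lets a single linear layer produce both predictions at once; they are separated only after the tokens are back in the spatial layout.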

2. MM-DiT

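MM-DiT builds on DiT. It feels like a model made of two DiT branches: the text and image embeddings are processed separately, each with its own weights, and the two token sequences are concatenated only for the attention computation, so attention is computed jointly over both modalities. The paper itself does not spend much space on the architecture. As I understood it, the paper mainly deals with how best to sample the noise (that is, how noise levels, or timesteps, are sampled during training) and which forward process formulation to use, and reports experimentally which choices gave the best results.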
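The "separate weights, joint attention" idea can be sketched as below. This is a simplified single-head NumPy illustration, not the paper's code: `joint_attention`, `w_txt`, and `w_img` are hypothetical names, the text-then-image concatenation order is my assumption, and the per-stream modulation, output projections, and MLPs of a full block are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def joint_attention(txt, img, w_txt, w_img):
    # txt: (n_txt, d), img: (n_img, d)
    # w_txt / w_img: dicts with "q", "k", "v" matrices of shape (d, d),
    # i.e. each modality has its own projection weights.
    q = np.concatenate([txt @ w_txt["q"], img @ w_img["q"]], axis=0)
    k = np.concatenate([txt @ w_txt["k"], img @ w_img["k"]], axis=0)
    v = np.concatenate([txt @ w_txt["v"], img @ w_img["v"]], axis=0)
    # One attention over the concatenated sequence: every token attends to
    # both text and image tokens.
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    out = attn @ v
    # Split back so each stream continues through its own branch.
    n_txt = txt.shape[0]
    return out[:n_txt], out[n_txt:]

rng = np.random.default_rng(0)
d = 8
def make_weights():
    return {name: rng.normal(size=(d, d)) / np.sqrt(d) for name in ("q", "k", "v")}

w_txt, w_img = make_weights(), make_weights()
txt = rng.normal(size=(5, d))    # 5 text tokens
img = rng.normal(size=(16, d))   # 16 image tokens
txt_out, img_out = joint_attention(txt, img, w_txt, w_img)
print(txt_out.shape, img_out.shape)  # (5, 8) (16, 8)
```

The attention step is the only place where the two streams exchange information here, so changing the text tokens changes the image outputs even though the projection weights are never shared.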

3. SiT

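As an aside, SiT, a variant of DiT, also focuses less on the model architecture itself, much like MM-DiT above. Instead it experiments with design choices such as:

whether to predict the vector field or the score function
whether to train in continuous time or discrete time
how to define the forward process

and, as I understood it, shows which of these designs performs well.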
