[Paper Review by an Ordinary Undergrad] DragAnything: Motion Control for Anything using Entity Representation (ECCV 2024)

junseok-rh 2024. 11. 12. 14:43

Paper : https://arxiv.org/abs/2403.07420

Project Page : https://weijiawu.github.io/draganything_page/

1. Introduction

Controllable video generation poses additional inherent challenges: it requires not only spatial content manipulation but also precise temporal motion control.

Recently, trajectory-based motion control has proven to be a user-friendly and efficient solution for controllable video generation. Representative examples are DragNUWA and MotionCtrl, both of which made significant contributions to the field. However, they overlooked an important question: "Can a single point on the target truly represent the target?"

A single pixel point cannot represent an entire object, so dragging a single pixel point cannot precisely control the corresponding object. To address this issue, two concepts need to be made clear.

What entity. Identifying the specific region or object to be dragged.
How to drag. How to achieve dragging of only the selected region.

The first problem can be solved with interactive segmentation. The second is the hard one; to achieve precise motion control for any entity, the paper proposes a new Entity Representation.

Inspired by prior work that uses latent features, the paper proposes DragAnything, which uses the latent features of a diffusion model to represent each entity. Based on the coordinate indices of the entity mask, the corresponding semantic features can be extracted from the diffusion feature of the first frame. These features are then used to represent the entity, and entity-level motion control is achieved by manipulating the spatial position of the corresponding latent feature.

The paper's contributions are as follows.

2. Methodology

2.1 Task Formulation and Motivation

Task Formulation

The trajectory-based video generation task requires a model that synthesizes a video based on a given motion trajectory. Given the point trajectory $(x_1, y_1), \cdots, (x_L, y_L)$, a conditional denoising autoencoder $\epsilon_\theta(z,c)$ is used to generate a video that follows the motion trajectory. Here the condition $c$ contains the trajectory points, the first frame of the video, and the entity mask of the first frame.

Motivation

The earlier trajectory-based works DragNUWA and MotionCtrl directly manipulate the corresponding pixels or pixel regions using the provided trajectory coordinates or their derivatives. These approaches overlook an important problem: "the provided trajectory points may not fully represent the entity we intend to control." As a result, dragging these points may not control the object's motion correctly.

The paper ran toy experiments and obtained the following insights.

Based on these two insights, the paper proposes Entity Representation, which extracts the latent feature of the object to be controlled and uses it as that object's representation.

2.2 Architecture

The overall architecture is shown in the image above.

2.3 Entity Semantic Representation Extraction

The method builds on two representations: the entity representation and its corresponding 2D Gaussian representation.

Entity Representation Extraction

(1) Given the first frame $\mathbf{I} \in \mathbb{R}^{H \times W \times 3}$ and its corresponding entity mask $\mathbf{M}$, the latent noise $\mathbf{x}$ of the image is obtained through diffusion inversion.

(2) The corresponding latent diffusion feature $\mathcal{F} \in \mathbb{R}^{H \times W \times C}$ is extracted through the denoising U-Net $\epsilon_\theta$.

(3) With the diffusion feature $\mathcal{F}$, the entity embeddings are obtained by indexing $\mathcal{F}$ at the coordinates given by the entity mask $\mathbf{M}$.

(4) Average pooling then yields the final embeddings $\{ e_1, e_2, \cdots, e_k \}$.

(5) To associate the entity embeddings with their trajectory points, the embeddings are inserted into a zero matrix $\mathbf{E} \in \mathbb{R}^{H \times W \times C}$ at the positions given by the trajectory sequence points.

As in the image above, given the center coordinates $\{ (x^1, y^1), \cdots, (x^k, y^k) \}$ of the entities in the first frame, Co-Tracker tracks these points and produces the corresponding motion trajectories $\{ \{(x^1_i, y^1_i) \}^L_{i=1}, \cdots, \{ (x^k_i, y^k_i) \}^L_{i=1} \}$. From these, the corresponding entity representation $\{ \mathbf{\hat{E}_i} \}^L_{i=1}$ is obtained.
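Steps (3) to (5) can be sketched in a few lines of NumPy. This is only a minimal illustration under my own assumptions: the function names `entity_embedding` and `entity_representation` are hypothetical, the feature map is a toy array rather than a real diffusion feature, and each embedding is written to a single pixel per frame (the label generation described later fills a circle instead).

```python
import numpy as np


def entity_embedding(features, mask):
    """Average-pool the feature vectors at the pixels selected by the mask."""
    # features: (H, W, C) feature map; mask: (H, W) boolean entity mask
    return features[mask].mean(axis=0)  # -> (C,)


def entity_representation(embeddings, trajectories, height, width):
    """Scatter each entity embedding onto a zero map at its trajectory points."""
    # embeddings: (k, C); trajectories: (k, L, 2) integer (x, y) positions
    num_entities, num_frames, _ = trajectories.shape
    channels = embeddings.shape[1]
    rep = np.zeros((num_frames, height, width, channels), dtype=embeddings.dtype)
    for e in range(num_entities):
        for i in range(num_frames):
            x, y = trajectories[e, i]
            if 0 <= x < width and 0 <= y < height:  # ignore points outside the frame
                rep[i, y, x] = embeddings[e]
    return rep


# Toy example: a 4x5 feature map with 3 channels and one 2x2 entity.
H, W, C = 4, 5, 3
features = np.arange(H * W * C, dtype=np.float64).reshape(H, W, C)
mask = np.zeros((H, W), dtype=bool)
mask[1:3, 1:3] = True

emb = entity_embedding(features, mask)
traj = np.array([[[1, 1], [2, 1], [3, 2]]])  # one entity, three frames
rep = entity_representation(emb[None, :], traj, H, W)
print(emb, rep.shape)
```

Boolean indexing `features[mask]` returns one row per masked pixel, so the mean over axis 0 is exactly the average pooling of step (4).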

2D Gaussian Representation Extraction

Pixels closer to the center of an entity are generally more important. The paper therefore wants the proposed entity representation to focus on the central region while reducing the weight of edge pixels, and the 2D Gaussian representation reinforces this effectively. As in the image above, from the trajectories $\{ \{ (x^1_i, y^1_i) \}^L_{i=1}, \cdots, \{ (x^k_i, y^k_i) \}^L_{i=1} \}$ and the radii $\{ r^1, \cdots, r^k \}$, the corresponding 2D Gaussian distribution representation trajectory sequence $\{ \mathbf{G_i} \}^L_{i=1}$ is obtained. It is then processed by the encoder $\mathcal{E}$ and merged with the entity representation.
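To make the Gaussian map concrete, here is a rough NumPy sketch that renders one heatmap per frame from the trajectories and radii. The mapping from radius to standard deviation (radius = 3 sigma) and the pixelwise max used to combine entities are my own assumptions, and the helper names are hypothetical; the post does not specify these details.

```python
import numpy as np


def gaussian_map(center, radius, height, width):
    """2D Gaussian heatmap with peak value 1.0 at center = (x, y)."""
    ys, xs = np.mgrid[0:height, 0:width]
    cx, cy = center
    sigma = max(float(radius), 1e-6) / 3.0  # assumption: radius spans about 3 sigma
    dist_sq = (xs - cx) ** 2 + (ys - cy) ** 2
    return np.exp(-dist_sq / (2.0 * sigma ** 2))


def gaussian_trajectory(trajectories, radii, height, width):
    """One heatmap per frame, combining all entities with a pixelwise max."""
    # trajectories: (k, L, 2) (x, y) positions; radii: (k,)
    num_entities, num_frames, _ = trajectories.shape
    maps = np.zeros((num_frames, height, width))
    for e in range(num_entities):
        for i in range(num_frames):
            g = gaussian_map(trajectories[e, i], radii[e], height, width)
            maps[i] = np.maximum(maps[i], g)
    return maps


traj = np.array([[[2, 2], [3, 2], [4, 3]]])  # one entity, three frames
radii = np.array([3.0])
maps = gaussian_trajectory(traj, radii, height=6, width=8)
print(maps.shape)
```

The weight is highest at the entity center and decays toward the edge, which matches the stated goal of down-weighting edge pixels.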

Encoder for Entity Representation and 2D Gaussian Map

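As described above, an encoder $\mathcal{E}$ is used to encode the entity representation and the 2D Gaussian map into latent features. $\mathcal{E}$ consists of four blocks, each made of two convolution layers and a SiLU activation function. The two encoders share the same structure and differ only in the number of channels in the first block, which changes when the two representations have different channel counts. The outputs of the two encoders are added to the latent noise as follows:

$$\{ \mathbf{\hat{R}_i} \}^L_{i=1} = \mathcal{E}(\{ \mathbf{\hat{E}_i} \}^L_{i=1}) + \mathcal{E}(\{ \mathbf{G_i} \}^L_{i=1}) + \{ \mathbf{Z_i} \}^L_{i=1} \tag{2}$$

where $\mathbf{Z_i}$ is the latent noise of the $i$-th frame.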

The resulting $\{ \mathbf{\hat{R}_i} \}^L_{i=1}$ is passed through the encoder of the denoising 3D U-Net, which yields four features. As in ControlNet, these features are added to the denoising 3D U-Net and act as latent condition signals.

2.4 Training and Inference

Ground Truth Label Generation

Compute the incircle of the entity mask to obtain its center $(x,y)$ and radius $r$.
Use Co-Tracker starting from $(x,y)$ to obtain the trajectory points $\{ (x_i, y_i) \}^L_{i=1}$.
From the trajectory points and the radius, obtain the trajectory of the 2D Gaussian.
Fill each circle along the circle trajectory with the entity embedding to obtain the trajectory of the Entity Representation.
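The first step, finding the incircle of a mask, can be approximated on a pixel grid as the mask pixel that is farthest from any background pixel. The brute-force sketch below is my own illustration (the helper name `mask_incircle` is hypothetical) and is only practical for small masks; for real images a distance transform such as `scipy.ndimage.distance_transform_edt` would be the usual choice.

```python
import numpy as np


def mask_incircle(mask):
    """Approximate incircle of a binary mask: returns ((x, y), radius)."""
    # Pad with background so pixels on the image border count as mask edges.
    padded = np.pad(np.asarray(mask, dtype=bool), 1, constant_values=False)
    fg = np.argwhere(padded)   # (n, 2) rows of (y, x) inside the mask
    bg = np.argwhere(~padded)  # (m, 2) rows of (y, x) outside the mask
    if len(fg) == 0:
        raise ValueError("mask is empty")
    # Distance from every mask pixel to its nearest background pixel.
    diff = fg[:, None, :] - bg[None, :, :]
    nearest = np.sqrt((diff ** 2).sum(axis=-1)).min(axis=1)
    best = int(nearest.argmax())
    y, x = fg[best] - 1  # undo the padding offset
    return (int(x), int(y)), float(nearest[best])


mask = np.zeros((7, 7), dtype=bool)
mask[1:6, 1:6] = True  # a 5x5 square entity
center, radius = mask_incircle(mask)
print(center, radius)
```

The radius here is measured to the center of the nearest background pixel, so it is a grid approximation rather than an exact geometric incircle.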

Loss Function

Here $\mathbf{M}$ is the mask of the entities in each frame. The MSE loss is restricted by this mask so that gradients are backpropagated only through the regions that should be optimized.
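A mask-restricted MSE can be sketched as below. This is an illustration of the idea only, not the paper's exact loss: the function name `masked_mse`, the tensor layout, and the normalization by the number of masked elements are all my assumptions.

```python
import numpy as np


def masked_mse(noise_pred, noise_target, mask):
    """Mean squared error computed only where mask == 1."""
    # noise_pred, noise_target: (L, H, W, C); mask: (L, H, W) with 1 inside entities
    m = mask[..., None].astype(noise_pred.dtype)   # broadcast over channels
    sq_err = (noise_pred - noise_target) ** 2 * m  # errors outside the mask become 0
    denom = max(float(m.sum()) * noise_pred.shape[-1], 1.0)  # avoid division by zero
    return float(sq_err.sum() / denom)


pred = np.zeros((1, 2, 2, 1))
target = np.zeros((1, 2, 2, 1))
target[0, 0, 0, 0] = 2.0    # error inside the mask
target[0, 1, 1, 0] = 10.0   # error outside the mask, should be ignored
mask = np.zeros((1, 2, 2))
mask[0, 0, 0] = 1.0

loss = masked_mse(pred, target, mask)
print(loss)
```

Because the squared error is multiplied by the mask before summing, pixels outside the entity contribute neither to the loss value nor to its gradient.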

Inference of User-Trajectory Interaction

At inference, the user clicks the region to control and then drags a pixel. From this input, DragAnything generates the video.

3. Experiments

3.1 Experiment Settings

Evaluation Metrics

FID: visual quality
FVD: temporal coherence
ObjMC: object motion control $\rightarrow$ the Euclidean distance between the predicted and ground-truth trajectories
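Read literally, ObjMC is an average point-to-point distance between two trajectories. The sketch below assumes both trajectories are already available as arrays of (x, y) points; how the predicted trajectory is recovered from a generated video is outside this example, and the function name `objmc` is hypothetical.

```python
import numpy as np


def objmc(pred_traj, gt_traj):
    """Mean Euclidean distance between predicted and ground-truth points."""
    pred = np.asarray(pred_traj, dtype=float)  # (..., 2) array of (x, y)
    gt = np.asarray(gt_traj, dtype=float)
    if pred.shape != gt.shape:
        raise ValueError("trajectories must have the same shape")
    return float(np.linalg.norm(pred - gt, axis=-1).mean())


pred = [[0, 0], [3, 4]]
gt = [[0, 0], [0, 0]]
score = objmc(pred, gt)
print(score)
```

Lower is better: a score of 0 means the predicted trajectory coincides with the ground truth at every frame.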

3.2 Comparisons with SOTA methods

Evaluation of Video Quality

The FID scores in Table 1 show that DragAnything outperforms DragNUWA. Qualitative results for DragAnything's video quality are as follows.

Evaluation of Temporal Coherence

The FVD scores in Table 1 show that DragAnything also has better temporal coherence than DragNUWA.

Evaluation of Object Motion

The ObjMC scores in Table 1 show that DragAnything has better motion control performance than DragNUWA. Qualitative results are shown below.

User Study for Motion Control and Video Quality

3.3 Ablation Studies

Effect of Entity Representation $\mathbf{\hat{E}}$

The effect of the Entity Representation $\mathbf{\hat{E}}$ is checked by comparing results with and without it in Eq. (2) above. Since $\mathbf{\hat{E}}$ mainly affects object motion in the generated video, only ObjMC needs to be compared. Table 2 shows that ObjMC improves.

Effect of 2D Gaussian Representation

The effect of the 2D Gaussian Representation was examined in the same way. As Table 2 shows, ObjMC improves when the Gaussian representation is added. The best result is obtained when both the entity representation and the 2D Gaussian representation are used.

Effect of Loss Mask $\mathbf{M}$

Table 3 shows the ablation results for the loss mask $\mathbf{M}$. The results show that applying $\mathbf{M}$ helps.

3.4 Discussion for Various Motion Control

4. Limitation & Bad Case Analysis

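As a limitation shared by trajectory-based motion control methods, control is restricted to 2D, so motions in a 3D scene such as turning around or precise body rotation cannot be handled.
The method is bounded by the performance of the foundation model and cannot generate scenes with very large motions, as in the image above.

In the examples in the image above, the dinosaur ends up with five legs or moves unnaturally, and the eagle's wings come out blurry. The paper suggests this may be caused by excessive motion that exceeds the generation capability of the foundation model.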
