SPACE

SPACE: Speech-driven Portrait Animation with Controllable Expression


We present SPACE, a method for generating high-resolution, expressive videos with realistic head pose, using just speech and a single image. It uses a multi-stage approach, combining the controllability of facial landmarks with the high-quality synthesis power of a pretrained face generator. SPACE also allows control over emotions and their intensities. Our method outperforms prior methods on objective metrics for image quality and facial motion, and is strongly preferred by users in pairwise comparisons.

Speech-driven animation of a portrait, with control over the output pose, emotions, and intensities of expressions

Pose	Generated	Transferred	Generated	Generated
Emotion	Neutral	Neutral	Happy	Surprise


Pose	Generated	Transferred	Generated	Generated
Emotion	Neutral	Neutral	Sad	Fear

Summary

SPACE is a powerful tool for animating a facial image using just speech input.

Existing methods perform poorly when the input face has large head rotations. Some require tight face crops, or animate just the lips. SPACE animates the entire face and can even generate realistic head pose sequences.

SPACE offers fine-grained control over its outputs: head pose, emotion label and intensity, blinking, and eye gaze.

Unlike prior work, SPACE produces high-quality photorealistic and temporally stable outputs at \(512\times512\) resolution by using a pretrained face-vid2vid generator.

Citation

@article{gururani2022SPACE,
  title={{SPACE: Speech-driven Portrait Animation with Controllable Expression}},
  author={Siddharth Gururani and Arun Mallya and Ting-Chun Wang and Rafael Valle and Ming-Yu Liu},
  journal={arXiv preprint arXiv:2211.09809},
  year={2022}
}

Input image	PC-AVS	MakeItTalk	Wav2Lip	SPACE (ours)






Input	Intermediate predictions			Final
Single Image	Normalized facial landmarks	Posed facial landmarks	Latent face-vid2vid keypoints	Animated output
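
The stages listed above can be read as a simple chain, where each stage consumes the previous stage's output. The following is a minimal sketch of that data flow only, not the authors' implementation: the class name `SpacePipelineSketch`, its field names, and every callable signature are hypothetical, and the actual networks behind each stage are not described on this page.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class SpacePipelineSketch:
    """Hypothetical container wiring the four stages named on this page."""

    # (image, speech) -> normalized facial landmarks, one set per frame (assumed signature)
    speech_to_landmarks: Callable[[Any, Any], Any]
    # normalized landmarks -> posed landmarks; the page shows pose that is
    # either generated or transferred, so that choice is left to this callable
    add_pose: Callable[[Any], Any]
    # posed landmarks -> latent face-vid2vid keypoints (assumed signature)
    landmarks_to_keypoints: Callable[[Any], Any]
    # (image, latent keypoints) -> animated output frames (assumed signature)
    render: Callable[[Any, Any], Any]

    def animate(self, image: Any, speech: Any) -> Any:
        normalized = self.speech_to_landmarks(image, speech)
        posed = self.add_pose(normalized)
        keypoints = self.landmarks_to_keypoints(posed)
        return self.render(image, keypoints)


# Toy stand-ins that only record the order in which the stages run.
pipeline = SpacePipelineSketch(
    speech_to_landmarks=lambda image, speech: ["normalized landmarks"],
    add_pose=lambda lm: lm + ["posed landmarks"],
    landmarks_to_keypoints=lambda lm: lm + ["latent keypoints"],
    render=lambda image, kp: kp + ["animated output"],
)
trace = pipeline.animate("portrait.png", "speech.wav")
```

Keeping each stage behind its own callable mirrors the intermediate predictions shown in the table, so any one of them could be inspected or swapped without touching the others.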

Input image	0.5 Happy	1.0 Happy

Input image	0.5 Angry	1.0 Angry
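
One simple way to picture an emotion label paired with an intensity such as "0.5 Happy" is a one-hot label vector scaled by the intensity. This is an illustrative assumption, not the conditioning SPACE actually uses: the function `emotion_condition`, the label ordering, and the [0, 1] intensity range are hypothetical, and the label list contains only the emotions that appear on this page.

```python
from typing import List

# Labels that appear on this page; the full label set used by SPACE is not stated here.
EMOTIONS = ("neutral", "happy", "sad", "angry", "fear", "surprise")


def emotion_condition(label: str, intensity: float) -> List[float]:
    """Hypothetical encoding: a one-hot emotion vector scaled by intensity."""
    if label not in EMOTIONS:
        raise ValueError(f"unknown emotion: {label!r}")
    # Assumed range; the page only shows intensities of 0.5 and 1.0.
    if not 0.0 <= intensity <= 1.0:
        raise ValueError("intensity must lie in [0, 1]")
    vec = [0.0] * len(EMOTIONS)
    vec[EMOTIONS.index(label)] = intensity
    return vec


half_happy = emotion_condition("happy", 0.5)
full_angry = emotion_condition("angry", 1.0)
```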


Siddharth Gururani

Arun Mallya

Ting-Chun Wang

Rafael Valle

Ming-Yu Liu

NVIDIA



Input image	Blinking	Gaze change