3DiFACE: Synthesizing and Editing Holistic 3D Facial Animation

1 Max Planck Institute for Intelligent Systems, Tübingen, Germany,

2 Technical University of Darmstadt,

3 Mesh Labs, Microsoft, Cambridge, UK.



3DiFACE is a novel diffusion-based method for synthesizing and editing holistic 3D facial animation from an audio sequence. With it, one can synthesize a diverse set of facial animations (top), seamlessly edit facial animations between two or more user-specified keyframes, and extrapolate motion from past motion (bottom).


Creating an animation of a specific person with audio-synced lip motion, realistic head motion, and support for editing via artist-defined keyframes challenges existing speech-driven 3D facial animation methods. In particular, editing 3D facial animation is a complex and time-consuming task carried out by highly skilled animators. Moreover, most existing works overlook the inherent one-to-many relationship between speech and facial motion: multiple plausible lip and head animations can sync with the same audio input. To this end, we present 3DiFACE, a novel method for holistic speech-driven 3D facial animation that produces diverse plausible lip and head motions for a single audio input, while also allowing editing via keyframing and interpolation. 3DiFACE is a lightweight audio-conditioned diffusion model that can be fine-tuned to generate personalized 3D facial animation from only a short video of the subject. Specifically, we leverage the viseme-level diversity in our training corpus to train a fully-convolutional diffusion model that produces diverse sequences for a single audio input. Additionally, we employ a modified guided motion diffusion to enable head-motion synthesis and editing via masking. Through quantitative and qualitative evaluations, we demonstrate that our method can generate and edit diverse holistic 3D facial animations for a single audio input, with control over the trade-off between fidelity and diversity.


Proposed Method


Our goal is to synthesize and edit holistic 3D facial animation given an input audio signal. To achieve this, we model facial and head motion with two diffusion-based networks, motivated by the observation that face motion is highly correlated with the speech signal, whereas head motion is only loosely correlated and thus requires a longer temporal context and, hence, a different training scheme (and data).
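To make the two-network design concrete, the sketch below shows a generic DDPM-style ancestral sampling loop with a denoiser that predicts the clean motion (x0) at each step. The placeholder denoisers, feature dimensions, and noise schedule are illustrative assumptions for this sketch; the actual 3DiFACE networks are learned, audio-conditioned, fully-convolutional diffusion models.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder stand-ins for the two learned denoisers (assumptions for this
# sketch): both simply predict a zero "rest" motion instead of a trained output.
def face_denoiser(x_t, t, audio_feat):
    return np.zeros_like(x_t)  # predicted clean face motion (placeholder)

def head_denoiser(x_t, t, audio_feat):
    return np.zeros_like(x_t)  # predicted clean head motion (placeholder)

def ddpm_sample(denoiser, audio_feat, shape, T=50):
    """Simplified DDPM ancestral sampling with an x0-predicting denoiser."""
    betas = np.linspace(1e-4, 0.02, T)     # illustrative linear noise schedule
    alphas = 1.0 - betas
    a_bar = np.cumprod(alphas)
    x = rng.standard_normal(shape)         # start from pure Gaussian noise
    for t in reversed(range(T)):
        x0 = denoiser(x, t, audio_feat)    # network predicts clean motion
        if t == 0:
            return x0
        # Posterior mean/variance of q(x_{t-1} | x_t, x0)
        c0 = np.sqrt(a_bar[t - 1]) * betas[t] / (1 - a_bar[t])
        ct = np.sqrt(alphas[t]) * (1 - a_bar[t - 1]) / (1 - a_bar[t])
        var = betas[t] * (1 - a_bar[t - 1]) / (1 - a_bar[t])
        x = c0 * x0 + ct * x + np.sqrt(var) * rng.standard_normal(shape)

audio = np.zeros((100, 64))                       # placeholder audio features
face = ddpm_sample(face_denoiser, audio, (100, 64))  # assumed motion dims
head = ddpm_sample(head_denoiser, audio, (100, 3))   # e.g. per-frame head pose
```

Running the two samplers independently reflects the design choice above: the face model can condition tightly on per-frame audio, while the head model can use a different context length and training data.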


A user can control the synthesis process by either modifying an existing sequence or employing keyframes to direct the synthesis outcome.
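Keyframe-directed editing of this kind is commonly implemented as masked (inpainting-style) diffusion sampling: at each denoising step, frames the user has pinned are forward-diffused to the current noise level and overwrite the sample, while unpinned frames are synthesized freely. The sketch below illustrates that idea with a placeholder denoiser and an assumed noise schedule; it is a simplified stand-in, not the paper's modified guided motion diffusion.

```python
import numpy as np

rng = np.random.default_rng(0)

def zero_denoiser(x_t, t):
    return np.zeros_like(x_t)  # placeholder x0-predicting denoiser (assumption)

def edit_with_keyframes(denoiser, x_key, mask, T=50):
    """Masked diffusion editing: mask==1 frames are clamped to user keyframes,
    mask==0 frames are synthesized. Simplified sketch only."""
    betas = np.linspace(1e-4, 0.02, T)
    alphas = 1.0 - betas
    a_bar = np.cumprod(alphas)
    x = rng.standard_normal(x_key.shape)
    for t in reversed(range(T)):
        # Forward-diffuse the keyframes to the current noise level t ...
        noisy_key = (np.sqrt(a_bar[t]) * x_key
                     + np.sqrt(1 - a_bar[t]) * rng.standard_normal(x_key.shape))
        # ... and overwrite the known (pinned) frames before denoising.
        x = mask * noisy_key + (1 - mask) * x
        x0 = denoiser(x, t)
        if t > 0:
            c0 = np.sqrt(a_bar[t - 1]) * betas[t] / (1 - a_bar[t])
            ct = np.sqrt(alphas[t]) * (1 - a_bar[t - 1]) / (1 - a_bar[t])
            var = betas[t] * (1 - a_bar[t - 1]) / (1 - a_bar[t])
            x = c0 * x0 + ct * x + np.sqrt(var) * rng.standard_normal(x.shape)
        else:
            x = x0
    return mask * x_key + (1 - mask) * x  # keyframes reproduced exactly

# Usage: pin the first and last of 60 frames and in-fill the rest.
x_key = np.zeros((60, 6))
x_key[0], x_key[-1] = 1.0, -1.0
mask = np.zeros((60, 1))
mask[0], mask[-1] = 1.0, 1.0
edited = edit_with_keyframes(zero_denoiser, x_key, mask)
```

The same masking mechanism supports interpolation between multiple keyframes (several pinned frames) and extrapolation from past motion (pinning a prefix of observed frames).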

Holistic motion synthesis comparison

Lip motion synthesis comparison

Lip motion diversity comparison

Ablation: Editing



BibTeX

        title={3DiFACE: Diffusion-based Speech-driven 3D Facial Animation and Editing},
        author={Balamurugan Thambiraja and Sadegh Aliakbarian and Darren Cosker and Justus Thies},