Continued from -
SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation

From the above observation, we propose SadTalker, a stylized audio-driven talking-head video generation system through implicit 3D coefficient modulation.
Challenges - unnatural head movement, distorted expression, and identity modification
—> We argue that these issues mainly stem from learning from coupled 2D motion fields
Problems in 3D - explicitly using 3D information also suffers from stiff expression and incoherent video.
SadTalker generates 3D motion coefficients (head pose, expression) of the 3DMM from audio and implicitly modulates a novel 3D-aware face render for talking-head generation.
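To make the two-branch design concrete, here is a minimal PyTorch sketch (all names are hypothetical, and plain GRU regressors stand in for the paper's actual ExpNet and PoseVAE): audio features feed two separate branches that predict expression and pose coefficients, which would then modulate the face render.

```python
import torch
import torch.nn as nn

class TalkingHeadPipeline(nn.Module):
    """Hypothetical sketch of the SadTalker-style flow:
    audio -> (expression, head pose) 3DMM coefficients -> face render."""

    def __init__(self, audio_dim=80, exp_dim=64, pose_dim=6):
        super().__init__()
        # ExpNet-like branch: audio features -> expression coefficients
        self.exp_net = nn.GRU(audio_dim, exp_dim, batch_first=True)
        # PoseVAE-like branch, simplified here to a regressor;
        # the real model samples head pose from a conditional VAE
        self.pose_net = nn.GRU(audio_dim, pose_dim, batch_first=True)

    def forward(self, audio_feats):
        # audio_feats: (batch, frames, audio_dim), e.g. mel-spectrogram frames
        exp_coeffs, _ = self.exp_net(audio_feats)    # (B, T, exp_dim)
        pose_coeffs, _ = self.pose_net(audio_feats)  # (B, T, pose_dim)
        # These per-frame coefficients would modulate a 3D-aware face render
        return torch.cat([exp_coeffs, pose_coeffs], dim=-1)

pipeline = TalkingHeadPipeline()
mel = torch.randn(1, 100, 80)   # 100 audio frames for one clip
coeffs = pipeline(mel)
print(coeffs.shape)             # torch.Size([1, 100, 70])
```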
To learn realistic motion coefficients, we explicitly model the connections between audio and each type of motion coefficient individually. We present ExpNet to learn accurate facial expression from audio by distilling both coefficients and 3D-rendered faces. As for the head pose, we design PoseVAE, a conditional VAE, to synthesize head motion in different styles. { PoseVAE: a conditional VAE that generates head motion }
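A minimal conditional-VAE sketch in the spirit of PoseVAE (layer sizes, the per-frame audio conditioning, and all names are my assumptions, not the paper's implementation):

```python
import torch
import torch.nn as nn

class PoseCVAE(nn.Module):
    """Sketch of a conditional VAE for head pose (assumed dims)."""

    def __init__(self, pose_dim=6, cond_dim=80, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(pose_dim + cond_dim, 128), nn.ReLU(),
            nn.Linear(128, 2 * latent_dim),  # outputs mean and log-variance
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + cond_dim, 128), nn.ReLU(),
            nn.Linear(128, pose_dim),
        )
        self.latent_dim = latent_dim

    def forward(self, pose, cond):
        # pose: (B, pose_dim) per frame; cond: (B, cond_dim) audio features
        stats = self.encoder(torch.cat([pose, cond], dim=-1))
        mu, logvar = stats.chunk(2, dim=-1)
        # Reparameterization trick: sample z while keeping gradients
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        recon = self.decoder(torch.cat([z, cond], dim=-1))
        return recon, mu, logvar

    def sample(self, cond):
        # At inference, draw z from the prior; different z -> different styles
        z = torch.randn(cond.shape[0], self.latent_dim)
        return self.decoder(torch.cat([z, cond], dim=-1))
```

Training would minimize a reconstruction loss plus the KL divergence between N(mu, exp(logvar)) and the standard normal prior; sampling different z in `sample` is what yields head motion in different styles.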
Finally, the generated 3D motion coefficients are mapped to the unsupervised 3D keypoints space of the proposed face render to synthesize the final video.

Recent works also aim to generate a realistic talking-face video containing other related motions, e.g., head pose.
Their methods mainly introduce 2D motion fields via landmarks [52] and latent warping [39, 40]. However, the quality of the generated videos is still unnatural and restricted by the preference pose [17, 51], mouth blur [30], identity modification [39, 40], and distorted faces. { 2 approaches - landmarks and latent warping; earlier work focused only on lip sync, now head pose is also wanted }
Generating a natural-looking talking-head video involves many challenges since the connections between audio and different motions differ: lip movement has the strongest connection with audio, but the same audio can be spoken with different head poses and eye blinks.
Thus, previous facial landmark-based methods [2, 52] and 2D flow-based audio-to-expression networks [39, 40] may generate distorted faces since head motion and expression are not fully disentangled in their representations. Another popular type of method is latent-based face animation. { not very good }
Our observation is that the 3D facial model contains a highly decoupled representation and can be used to learn each type of motion individually.
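This decoupling is visible in the standard 3DMM formulation this line of work builds on (coefficient dimensions vary by implementation; the Deng et al. model commonly uses 80-dim identity and 64-dim expression coefficients): the face shape is a mean face plus independent identity and expression terms,

S = \bar{S} + \alpha\, U_{id} + \beta\, U_{exp}

where \bar{S} is the mean shape, U_{id} and U_{exp} are identity and expression bases, and head pose is handled separately as a rigid transform \rho = [r, t] with rotation r and translation t. Since \alpha, \beta, and \rho parameterize the face independently, audio can drive expression (\beta) and pose (\rho) with separate networks while identity (\alpha) stays fixed from the source image.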