Continued from -
SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation

From the above observation, we propose SadTalker, a stylized audio-driven talking-head video generation system through implicit 3D coefficient modulation.
Challenges - unnatural head movement, distorted expression, and identity modification
—> We argue that these issues mainly stem from learning from coupled 2D motion fields
Problems in 3D - explicitly using 3D information also suffers from stiff expression and incoherent video.
SadTalker generates 3D motion coefficients (head pose, expression) of the 3DMM from audio and implicitly modulates a novel 3D-aware face render for talking-head generation.
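To make the two-branch design concrete, here is a minimal PyTorch sketch (all names are hypothetical, and plain GRU regressors stand in for the paper's actual ExpNet and PoseVAE): audio features feed two separate branches that predict expression and pose coefficients, which would then modulate the face render.

```python
import torch
import torch.nn as nn

class TalkingHeadPipeline(nn.Module):
    """Hypothetical sketch of the SadTalker-style flow:
    audio -> (expression, head pose) 3DMM coefficients -> face render."""

    def __init__(self, audio_dim=80, exp_dim=64, pose_dim=6):
        super().__init__()
        # ExpNet-like branch: audio features -> expression coefficients
        self.exp_net = nn.GRU(audio_dim, exp_dim, batch_first=True)
        # PoseVAE-like branch, simplified here to a regressor;
        # the real model samples head pose from a conditional VAE
        self.pose_net = nn.GRU(audio_dim, pose_dim, batch_first=True)

    def forward(self, audio_feats):
        # audio_feats: (batch, frames, audio_dim), e.g. mel-spectrogram frames
        exp_coeffs, _ = self.exp_net(audio_feats)    # (B, T, exp_dim)
        pose_coeffs, _ = self.pose_net(audio_feats)  # (B, T, pose_dim)
        # These per-frame coefficients would modulate a 3D-aware face render
        return torch.cat([exp_coeffs, pose_coeffs], dim=-1)

pipeline = TalkingHeadPipeline()
mel = torch.randn(1, 100, 80)   # 100 audio frames for one clip
coeffs = pipeline(mel)
print(coeffs.shape)             # torch.Size([1, 100, 70])
```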
To learn realistic motion coefficients, we explicitly model the connections between audio and each type of motion coefficient individually. We present ExpNet to learn accurate facial expression from audio by distilling both coefficients and 3D-rendered faces. As for the head pose, we design PoseVAE, a conditional VAE, to synthesize head motion in different styles. { PoseVAE: a conditional VAE that generates head motion }
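A minimal conditional-VAE sketch in the spirit of PoseVAE (layer sizes, the per-frame audio conditioning, and all names are my assumptions, not the paper's implementation):

```python
import torch
import torch.nn as nn

class PoseCVAE(nn.Module):
    """Sketch of a conditional VAE for head pose (assumed dims)."""

    def __init__(self, pose_dim=6, cond_dim=80, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(pose_dim + cond_dim, 128), nn.ReLU(),
            nn.Linear(128, 2 * latent_dim),  # outputs mean and log-variance
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + cond_dim, 128), nn.ReLU(),
            nn.Linear(128, pose_dim),
        )
        self.latent_dim = latent_dim

    def forward(self, pose, cond):
        # pose: (B, pose_dim) per frame; cond: (B, cond_dim) audio features
        stats = self.encoder(torch.cat([pose, cond], dim=-1))
        mu, logvar = stats.chunk(2, dim=-1)
        # Reparameterization trick: sample z while keeping gradients
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        recon = self.decoder(torch.cat([z, cond], dim=-1))
        return recon, mu, logvar

    def sample(self, cond):
        # At inference, draw z from the prior; different z -> different styles
        z = torch.randn(cond.shape[0], self.latent_dim)
        return self.decoder(torch.cat([z, cond], dim=-1))
```

Training would minimize a reconstruction loss plus the KL divergence between N(mu, exp(logvar)) and the standard normal prior; sampling different z in `sample` is what yields head motion in different styles.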
Finally, the generated 3D motion coefficients are mapped to the unsupervised 3D keypoints space of the proposed face render to synthesize the final video.

Recent works also aim to generate a realistic talking-face video containing other related motions, e.g., head pose.
Their methods mainly introduce 2D motion fields via landmarks [52] and latent warping [39, 40]. However, the quality of the generated videos is still unnatural and restricted by the preference pose [17, 51], mouth blur [30], identity modification [39, 40], and distorted faces. { 2 approaches - landmarks and latent warping; earlier work focused only on lip sync, now head pose is also wanted }
Generating a natural-looking talking-head video involves many challenges since the connections between audio and different motions differ: lip movement has the strongest connection with audio, but the same audio can be spoken with different head poses and eye blinks.
Thus, previous facial landmark-based methods [2, 52] and 2D flow-based audio-to-expression networks [39, 40] may generate distorted faces since head motion and expression are not fully disentangled in their representations. Another popular type of method is latent-based face animation. { not very good }
Our observation is that the 3D facial model contains a highly decoupled representation and can be used to learn each type of motion individually.
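This decoupling is visible in the standard 3DMM formulation this line of work builds on (coefficient dimensions vary by implementation; the Deng et al. model commonly uses 80-dim identity and 64-dim expression coefficients): the face shape is a mean face plus independent identity and expression terms,

S = \bar{S} + \alpha\, U_{id} + \beta\, U_{exp}

where \bar{S} is the mean shape, U_{id} and U_{exp} are identity and expression bases, and head pose is handled separately as a rigid transform \rho = [r, t] with rotation r and translation t. Since \alpha, \beta, and \rho parameterize the face independently, audio can drive expression (\beta) and pose (\rho) with separate networks while identity (\alpha) stays fixed from the source image.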