Although existing speech-driven talking face generation methods have achieved significant progress, they remain far from real-world deployment because they require avatar-specific training and produce unstable lip movements.
To address these issues, we propose GSmoothFace, a novel two-stage generalized talking face generation model guided by a fine-grained 3D face model, which synthesizes smooth lip dynamics while preserving the speaker's identity. GSmoothFace consists of an Audio to Expression Prediction (A2EP) module and a Target Adaptive Face Translation (TAFT) module. Specifically, the A2EP module predicts expression parameters synchronized with the driving speech: a transformer captures the long-term audio context, and the parameters are supervised through fine-grained 3D facial vertices, yielding accurate and smooth lip-synchronization. The TAFT module, built on Morphology Augmented Face Blending (MAFB), then takes the predicted expression parameters and a target video as inputs and modifies the facial region of the target video without distorting the background content. Because TAFT exploits the identity appearance and background context already present in the target video, the model generalizes to different speakers without retraining (a minimal sketch of the two-stage pipeline is given below).
Both quantitative and qualitative experiments confirm the superiority of our method in terms of realism, lip-synchronization, and visual quality.
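To make the two-stage flow above concrete, the following is a minimal illustrative sketch of how the modules fit together, not the authors' implementation; the names `a2ep`, `taft`, `audio_feats`, and the tensor shapes are hypothetical placeholders.

```python
def generate_talking_face(audio_feats, target_frames, identity_vec, a2ep, taft):
    """Hypothetical two-stage pipeline: driving audio -> expressions -> translated frames.

    audio_feats:   (B, T, D_a) per-frame features of the driving speech
    target_frames: (B, T, 3, H, W) frames of the target (identity) video
    identity_vec:  (B, D_id) speaker identity embedding
    a2ep, taft:    callables standing in for the two trained modules
    """
    # Stage 1 (A2EP): predict per-frame 3DMM expression parameters from the audio.
    expressions = a2ep(audio_feats, identity_vec)   # (B, T, D_exp)

    # Stage 2 (TAFT): blend the predicted expressions into the target frames and
    # translate the blended result into photo-realistic output frames.
    return taft(expressions, target_frames)         # (B, T, 3, H, W)
```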
(a) A2EP. Given the driving audio, the history of predicted expressions, and an identity vector, the expression parameters are predicted by a weighted-attention transformer in an auto-regressive manner. Note that the expressions are supervised at the fine-grained vertex level, as shown in (c).
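As a rough illustration of such an auto-regressive predictor, the sketch below uses a standard PyTorch transformer decoder that attends over the audio context while consuming previously predicted expressions together with the identity vector; all layer sizes and names are assumptions, and the weighted-attention variant and vertex-level loss from (c) are simplified away.

```python
import torch
import torch.nn as nn

class A2EPSketch(nn.Module):
    """Illustrative auto-regressive audio-to-expression predictor (not the paper's exact model)."""

    def __init__(self, audio_dim=80, id_dim=64, exp_dim=64, d_model=256, n_layers=4):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, d_model)         # project audio features to model width
        self.exp_proj = nn.Linear(exp_dim + id_dim, d_model)    # embed (past expression, identity) pairs
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.out = nn.Linear(d_model, exp_dim)                  # regress 3DMM expression parameters

    def forward(self, audio_feats, identity_vec, num_frames):
        # audio_feats: (B, T_audio, audio_dim); identity_vec: (B, id_dim)
        memory = self.audio_proj(audio_feats)                   # long-term audio context for cross-attention
        B = audio_feats.size(0)
        history = torch.zeros(B, 1, self.out.out_features,
                              device=audio_feats.device)        # start token: neutral expression
        preds = []
        for _ in range(num_frames):
            # Condition each step on the identity vector and all previously predicted expressions.
            idv = identity_vec.unsqueeze(1).expand(-1, history.size(1), -1)
            tgt = self.exp_proj(torch.cat([history, idv], dim=-1))
            h = self.decoder(tgt, memory)
            next_exp = self.out(h[:, -1:])                      # expression for the next frame
            preds.append(next_exp)
            history = torch.cat([history, next_exp], dim=1)     # feed the prediction back auto-regressively
        return torch.cat(preds, dim=1)                          # (B, num_frames, exp_dim)
```

In the paper, the predicted parameters are additionally passed through the 3D face model so that supervision can be applied on the reconstructed facial vertices, i.e., the vertex-level supervision shown in (c); that loss is omitted from the sketch.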
(b) TAFT. Taking as inputs the predicted expression parameters and the original 3DMM parameters reconstructed from the target video by a pre-trained state-of-the-art method (R-Net), the simple yet effective, fully differentiable Morphology Augmented Face Blending (MAFB) module produces blended images. The blended images, combined with reference images containing the identity information, are then fed to a generator network to synthesize a smooth, photo-realistic talking face video. This pipeline generalizes to unseen speakers without retraining.
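The data flow in (b) can be summarized with the short sketch below; `renderer`, `generator`, and the mask-based blend are hypothetical placeholders standing in for the differentiable MAFB operator and the generator network, not their actual implementations.

```python
def taft_step(exp_pred, params_3dmm, target_frame, reference_frames, renderer, generator):
    """Illustrative TAFT forward pass for a single frame (placeholder components).

    exp_pred:         (B, D_exp) expression parameters predicted by A2EP
    params_3dmm:      dict of 3DMM parameters (identity, pose, texture, lighting, ...)
                      reconstructed from the target frame by the pre-trained face model
    target_frame:     (B, 3, H, W) frame supplying the background and identity appearance
    reference_frames: (B, K, 3, H, W) extra frames carrying identity information
    renderer:         differentiable 3DMM renderer returning a face image and a face mask
    generator:        image-to-image network producing the final photo-realistic frame
    """
    # Swap the reconstructed expression for the audio-driven prediction, keeping
    # identity, pose, texture, and lighting estimated from the target frame.
    params = dict(params_3dmm, expression=exp_pred)

    # Differentiably render the re-expressed face and blend it onto the target frame,
    # leaving the background untouched (stand-in for the MAFB module).
    rendered_face, face_mask = renderer(params)
    blended = face_mask * rendered_face + (1.0 - face_mask) * target_frame

    # The generator refines the blended frame into a photo-realistic output,
    # conditioned on reference images that carry the identity information.
    return generator(blended, reference_frames)
```

Because only the rendered facial region is rewritten, the background and the speaker's appearance come directly from the target video, which is what allows the same trained model to be applied to unseen speakers without retraining.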