1
Zhang Z, Zhang J, Mai W. VPT: Video portraits transformer for realistic talking face generation. Neural Netw 2025; 184:107122. [PMID: 39799718] [DOI: 10.1016/j.neunet.2025.107122]
Abstract
Talking face generation is a promising approach in various domains, such as digital assistants, video editing, and virtual video conferences. Previous work on audio-driven talking faces focused primarily on the synchronization between audio and video. However, existing methods still have limitations in synthesizing photo-realistic video with high identity preservation, audiovisual synchronization, and facial details such as blink movements. To address these problems, a novel talking face generation framework with controllable blink movements, termed the video portraits transformer (VPT), is proposed and applied. It separates video generation into two stages: an audio-to-landmark stage and a landmark-to-face stage. In the audio-to-landmark stage, a transformer encoder serves as the generator, predicting the full set of facial landmarks from the given audio and a continuous eye aspect ratio (EAR) signal. In the landmark-to-face stage, a video-to-video (vid-to-vid) network translates the landmarks into realistic talking face videos. Moreover, to imitate real blink movements during inference, a transformer-based spontaneous blink generation module is devised to generate the EAR sequence. Extensive experiments demonstrate that VPT produces photo-realistic talking face videos with natural blink movements, and that the spontaneous blink generation module generates blinks whose duration distribution and frequency are close to those of real blinks.
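For illustration only, the following is a minimal sketch of how the audio-to-landmark stage described above could look, assuming per-frame audio features plus a scalar EAR value are mapped to 2D landmark coordinates by a Transformer encoder. All module names, dimensions, and the PyTorch framing are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class AudioToLandmark(nn.Module):
    """Illustrative sketch: Transformer encoder mapping audio + EAR to landmarks."""
    def __init__(self, audio_dim=80, n_landmarks=68, d_model=256, n_layers=4, n_heads=8):
        super().__init__()
        # project per-frame audio features plus the 1-D EAR signal into the model dimension
        self.in_proj = nn.Linear(audio_dim + 1, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # regress (x, y) coordinates for every landmark at every frame
        self.out_proj = nn.Linear(d_model, n_landmarks * 2)

    def forward(self, audio_feats, ear):
        # audio_feats: (B, T, audio_dim); ear: (B, T) eye aspect ratio per frame
        x = torch.cat([audio_feats, ear.unsqueeze(-1)], dim=-1)
        h = self.encoder(self.in_proj(x))
        return self.out_proj(h).view(h.shape[0], h.shape[1], -1, 2)  # (B, T, n_landmarks, 2)

# The predicted landmark sequence would then condition a separate vid-to-vid
# generator (not shown) that renders the final photo-realistic frames.
```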
Affiliation(s)
- Zhijun Zhang
- School of Automation Science and Engineering, South China University of Technology, China; Key Laboratory of Autonomous Systems and Network Control, Ministry of Education, China; Jiangxi Thousand Talents Plan, Nanchang University, China; College of Computer Science and Engineering, Jishou University, China; Guangdong Artificial Intelligence and Digital Economy Laboratory (Pazhou Lab), China; Shaanxi Provincial Key Laboratory of Industrial Automation, School of Mechanical Engineering, Shaanxi University of Technology, Hanzhong, China; School of Information Science and Engineering, Changsha Normal University, Changsha, China; School of Automation Science and Engineering and the Institute of Artificial Intelligence and Automation, Guangdong University of Petrochemical Technology, Maoming, China; Key Laboratory of Large-Model Embodied-Intelligent Humanoid Robot (2024KSYS004), China; The Institute for Super Robotics (Huangpu), Guangzhou, China.
- Jian Zhang
- School of Automation Science and Engineering, South China University of Technology, China; The Institute for Super Robotics (Huangpu), Guangzhou, China.
- Weijian Mai
- School of Automation Science and Engineering, South China University of Technology, China.
2
Qu L, Shang J, Han X, Fu H. ReenactArtFace: Artistic Face Image Reenactment. IEEE Transactions on Visualization and Computer Graphics 2024; 30:4080-4092. [PMID: 37028041] [DOI: 10.1109/tvcg.2023.3253184]
Abstract
Large-scale datasets and deep generative models have enabled impressive progress in human face reenactment. Existing solutions for face reenactment have focused on processing real face images through facial landmarks by generative models. Different from real human faces, artistic human faces (e.g., those in paintings, cartoons, etc.) often involve exaggerated shapes and various textures. Therefore, directly applying existing solutions to artistic faces often fails to preserve the characteristics of the original artistic faces (e.g., face identity and decorative lines along face contours) due to the domain gap between real and artistic faces. To address these issues, we present ReenactArtFace, the first effective solution for transferring the poses and expressions from human videos to various artistic face images. We achieve artistic face reenactment in a coarse-to-fine manner. First, we perform 3D artistic face reconstruction, which reconstructs a textured 3D artistic face through a 3D morphable model (3DMM) and a 2D parsing map from an input artistic image. The 3DMM can not only rig the expressions better than facial landmarks but also robustly render images under different poses/expressions as coarse reenactment results. However, these coarse results suffer from self-occlusions and lack contour lines. Second, we perform artistic face refinement using a personalized conditional generative adversarial network (cGAN) fine-tuned on the input artistic image and the coarse reenactment results. For high-quality refinement, we propose a contour loss to supervise the cGAN to faithfully synthesize contour lines. Quantitative and qualitative experiments demonstrate that our method achieves better results than the existing solutions.
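For illustration, here is a minimal sketch of the contour-loss idea described above: a reconstruction penalty restricted to a contour mask (e.g., derived from the 2D parsing map) so the refinement network is pushed to reproduce decorative lines. The mask source, the L1 formulation, and the weighting are assumptions rather than the paper's exact definition.

```python
import torch.nn.functional as F

def contour_loss(fake, real, contour_mask, weight=10.0):
    # fake, real: (B, 3, H, W) images; contour_mask: (B, 1, H, W) with 1 on contour pixels
    per_pixel = F.l1_loss(fake, real, reduction="none")   # (B, 3, H, W)
    masked = per_pixel * contour_mask                      # keep only contour regions
    # normalize by the number of contour pixels (per channel) to keep the scale stable
    return weight * masked.sum() / (contour_mask.sum() * fake.shape[1] + 1e-8)
```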
3
Chai Y, Shao T, Weng Y, Zhou K. Personalized Audio-Driven 3D Facial Animation via Style-Content Disentanglement. IEEE Transactions on Visualization and Computer Graphics 2024; 30:1803-1820. [PMID: 37015450] [DOI: 10.1109/tvcg.2022.3230541]
Abstract
We present a learning-based approach for generating 3D facial animations with the motion style of a specific subject from arbitrary audio inputs. The subject style is learned from a video clip (1-2 minutes) either downloaded from the Internet or captured through an ordinary camera. Traditional methods often require many hours of the subject's video to learn a robust audio-driven model and are thus unsuitable for this task. Recent research efforts aim to train a model from video collections of a few subjects but ignore the discrimination between the subject style and underlying speech content within facial motions, leading to inaccurate style or articulation. To solve the problem, we propose a novel framework that disentangles subject-specific style and speech content from facial motions. The disentanglement is enabled by two novel training mechanisms. One is two-pass style swapping between two random subjects, and the other is joint training of the decomposition network and audio-to-motion network with a shared decoder. After training, the disentangled style is combined with arbitrary audio inputs to generate stylized audio-driven 3D facial animations. Compared with state-of-the-art methods, our approach achieves better results both qualitatively and quantitatively, especially in difficult cases such as bilabial plosive and bilabial nasal phonemes.
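As a rough illustration of the two-pass style-swapping mechanism mentioned above, the sketch below swaps the style codes of two subjects, decodes, re-encodes, and swaps back, with a reconstruction loss encouraging style-content disentanglement. The encoder/decoder interfaces and the loss form are assumptions, not the authors' training code.

```python
import torch.nn.functional as F

def style_swap_step(style_enc, content_enc, decoder, motion_a, motion_b):
    # motion_a, motion_b: facial-motion sequences of two randomly paired subjects
    s_a, s_b = style_enc(motion_a), style_enc(motion_b)        # subject-specific styles
    c_a, c_b = content_enc(motion_a), content_enc(motion_b)    # speech content
    # first pass: decode with swapped styles
    fake_ab = decoder(s_b, c_a)   # A's content rendered in B's style
    fake_ba = decoder(s_a, c_b)   # B's content rendered in A's style
    # second pass: swap back; the originals should be recovered if the
    # two factors are cleanly disentangled
    rec_a = decoder(style_enc(fake_ba), content_enc(fake_ab))
    rec_b = decoder(style_enc(fake_ab), content_enc(fake_ba))
    return F.l1_loss(rec_a, motion_a) + F.l1_loss(rec_b, motion_b)
```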
4
Wang C, Chai M, He M, Chen D, Liao J. Cross-Domain and Disentangled Face Manipulation With 3D Guidance. IEEE Transactions on Visualization and Computer Graphics 2023; 29:2053-2066. [PMID: 34982684] [DOI: 10.1109/tvcg.2021.3139913]
Abstract
Face image manipulation via three-dimensional guidance has been widely applied in various interactive scenarios due to its semantically-meaningful understanding and user-friendly controllability. However, existing 3D-morphable-model-based manipulation methods are not directly applicable to out-of-domain faces, such as non-photorealistic paintings, cartoon portraits, or even animals, mainly due to the formidable difficulties in building the model for each specific face domain. To overcome this challenge, we propose, as far as we know, the first method to manipulate faces in arbitrary domains using human 3DMM. This is achieved through two major steps: 1) disentangled mapping from 3DMM parameters to the latent space embedding of a pre-trained StyleGAN2 [1] that guarantees disentangled and precise controls for each semantic attribute; and 2) cross-domain adaptation that bridges domain discrepancies and makes human 3DMM applicable to out-of-domain faces by enforcing a consistent latent space embedding. Experiments and comparisons demonstrate the superiority of our high-quality semantic manipulation method on a variety of face domains with all major 3D facial attributes controllable - pose, expression, shape, albedo, and illumination. Moreover, we develop an intuitive editing interface to support user-friendly control and instant feedback. Our project page is https://cassiepython.github.io/cddfm3d/index.html.
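A minimal sketch of step (1) above, assuming an MLP that maps 3DMM coefficients to an offset added to a StyleGAN2 W+ latent code; the parameter dimension, layer sizes, and offset formulation are illustrative assumptions, not the paper's architecture.

```python
import torch.nn as nn

class ParamToLatent(nn.Module):
    """Illustrative sketch: 3DMM parameters -> offset in a StyleGAN2 W+ space."""
    def __init__(self, param_dim=257, n_styles=18, w_dim=512, hidden=1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(param_dim, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, n_styles * w_dim),
        )
        self.n_styles, self.w_dim = n_styles, w_dim

    def forward(self, params, w_plus):
        # params: (B, param_dim) 3DMM coefficients; w_plus: (B, n_styles, w_dim) latent code
        delta = self.mlp(params).view(-1, self.n_styles, self.w_dim)
        return w_plus + delta  # edited latent, fed to the frozen StyleGAN2 generator
```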
5
Zhang C, Ni S, Fan Z, Li H, Zeng M, Budagavi M, Guo X. 3D Talking Face With Personalized Pose Dynamics. IEEE Transactions on Visualization and Computer Graphics 2023; 29:1438-1449. [PMID: 34606458] [DOI: 10.1109/tvcg.2021.3117484]
Abstract
Recently, we have witnessed a boom in applications for 3D talking face generation. However, most existing 3D face generation methods can only generate 3D faces with a static head pose, which is inconsistent with how humans perceive faces. Only a few works address head pose generation, and even these ignore the attribute of personality. In this article, we propose a unified audio-driven approach to endow 3D talking faces with personalized pose dynamics. To achieve this goal, we establish an original person-specific dataset that provides corresponding head poses and face shapes for each video. Our framework is composed of two separate modules: PoseGAN and PGFace. Given input audio, PoseGAN first produces a head pose sequence for the 3D head, and PGFace then utilizes the audio and pose information to generate natural face models. Combining these two parts yields a 3D talking head with dynamic head movement. Experimental evidence indicates that our method can generate person-specific head pose sequences that are in sync with the input audio and that best match the human experience of talking heads.
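For illustration, the sketch below shows how the two modules described above could compose: a pose generator maps audio features to a head-pose sequence, and a face generator consumes the same audio plus the poses to produce per-frame face geometry. The interfaces and tensor shapes stand in for PoseGAN and PGFace and are assumptions, not the authors' code.

```python
import torch.nn as nn

class TalkingHeadPipeline(nn.Module):
    """Illustrative composition of an audio-to-pose module and a pose-aware face module."""
    def __init__(self, pose_generator: nn.Module, face_generator: nn.Module):
        super().__init__()
        self.pose_generator = pose_generator  # audio (B, T, D) -> head poses (B, T, 6)
        self.face_generator = face_generator  # audio + poses -> vertices (B, T, V, 3)

    def forward(self, audio_feats):
        poses = self.pose_generator(audio_feats)            # rotation + translation per frame
        vertices = self.face_generator(audio_feats, poses)  # pose-aware 3D face shapes
        return poses, vertices
```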
6
Afzal S, Ghani S, Hittawe MM, Rashid SF, Knio OM, Hadwiger M, Hoteit I. Visualization and Visual Analytics Approaches for Image and Video Datasets: A Survey. ACM Transactions on Interactive Intelligent Systems 2023. [DOI: 10.1145/3576935]
Abstract
Image and video data analysis has become an increasingly important research area with applications in different domains such as security surveillance, healthcare, augmented and virtual reality, video and image editing, activity analysis and recognition, synthetic content generation, distance education, telepresence, remote sensing, sports analytics, art, non-photorealistic rendering, search engines, and social media. Recent advances in Artificial Intelligence (AI) and particularly deep learning have sparked new research challenges and led to significant advancements, especially in image and video analysis. These advancements have also resulted in significant research and development in other areas such as visualization and visual analytics, and have created new opportunities for future lines of research. In this survey paper, we present the current state of the art at the intersection of visualization and visual analytics, and image and video data analysis. We categorize the visualization papers included in our survey based on different taxonomies used in visualization and visual analytics research. We review these papers in terms of task requirements, tools, datasets, and application areas. We also discuss insights based on our survey results, trends and patterns, the current focus of visualization research, and opportunities for future research.
Affiliation(s)
- Shehzad Afzal
- King Abdullah University of Science & Technology, Saudi Arabia
- Sohaib Ghani
- King Abdullah University of Science & Technology, Saudi Arabia
- Omar M Knio
- King Abdullah University of Science & Technology, Saudi Arabia
- Markus Hadwiger
- King Abdullah University of Science & Technology, Saudi Arabia
- Ibrahim Hoteit
- King Abdullah University of Science & Technology, Saudi Arabia
7
Expression-tailored talking face generation with adaptive cross-modal weighting. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2022.09.025]