We propose a joint training scheme for an any-to-one voice conversion (VC) system and LPCNet to improve the naturalness, speaker similarity, and intelligibility of the converted speech. Recent advances in neural vocoders, such as LPCNet, have enabled the production of more natural and clear speech. However, the other components of a typical VC system, such as the conversion model, are often designed independently and trained with separate strategies that are not directly aligned with the vocoder's training objective, which prevents LPCNet's full potential from being exploited. We address this problem by jointly training the conversion model and LPCNet. To accurately capture the linguistic content of a given utterance, we use speaker-independent (SI) features derived from an automatic speech recognition (ASR) model trained on a mixed-language speech corpus. A conversion model then maps the SI features to the acoustic representations used as input features to LPCNet. The possibility of synthesizing cross-language speech with the proposed approach is also explored in this paper. Experimental results show that the proposed model achieves real-time VC, unlocks the full potential of LPCNet, and outperforms the state of the art.
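The abstract above describes a pipeline of ASR-derived SI features, a conversion model, and LPCNet trained under a single objective. The sketch below is only a minimal PyTorch-style illustration of that joint objective, assuming a hypothetical differentiable `vocoder` module and a 20-dimensional acoustic feature target (18 Bark cepstra plus 2 pitch parameters, as LPCNet conventionally uses); it is not the authors' implementation.

```python
# Hypothetical sketch: the conversion model maps speaker-independent (SI) ASR
# features to the acoustic features consumed by a neural vocoder, and both
# networks are updated with one combined loss. Names and shapes are illustrative.
import torch
import torch.nn as nn

class ConversionModel(nn.Module):
    def __init__(self, si_dim=256, acoustic_dim=20, hidden=512):
        super().__init__()
        self.rnn = nn.LSTM(si_dim, hidden, num_layers=2,
                           batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, acoustic_dim)

    def forward(self, si_feats):            # (B, T, si_dim)
        h, _ = self.rnn(si_feats)
        return self.proj(h)                 # (B, T, acoustic_dim)

def joint_step(conv_model, vocoder, si_feats, target_acoustic, target_wave,
               optimizer, alpha=1.0):
    """One joint optimization step over conversion model + neural vocoder."""
    pred_acoustic = conv_model(si_feats)
    wave = vocoder(pred_acoustic)                          # differentiable vocoder forward pass
    feat_loss = nn.functional.l1_loss(pred_acoustic, target_acoustic)
    wave_loss = nn.functional.l1_loss(wave, target_wave)   # stand-in for the vocoder's own loss
    loss = feat_loss + alpha * wave_loss
    optimizer.zero_grad()
    loss.backward()                                        # gradients flow into both networks
    optimizer.step()
    return loss.item()
```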
Face beautification: Beyond makeup transfer
Xudong Liu, Ruizhe Wang, Hao Peng, and
3 more authors
Facial appearance plays an important role in our social lives. Subjective perception of women’s beauty depends on various face-related (e.g., skin, shape, hair) and environmental (e.g., makeup, lighting, angle) factors. Similar to cosmetic surgery in the physical world, virtual face beautification is an emerging field with many open issues to be addressed. Inspired by the latest advances in style-based synthesis and face beauty prediction, we propose a novel framework for face beautification. Given a reference face with a high beauty score, our GAN-based architecture translates an inquiry face into a sequence of beautified face images with the referenced beauty style and the target beauty score values. To achieve this objective, we propose to integrate both a style-based beauty representation (extracted from the reference face) and beauty score prediction (trained on the SCUT-FBP database) into the beautification process. Unlike makeup transfer, our approach targets many-to-many (instead of one-to-one) translation, where multiple outputs can be defined by different references with various beauty scores. Extensive experimental results are reported to demonstrate the effectiveness and flexibility of the proposed face beautification framework. To support reproducible research, the source code accompanying this work will be made publicly available on GitHub.
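As a rough illustration of the many-to-many translation described above, the snippet below conditions a hypothetical generator on a style code from the reference face and a ramp of target beauty scores to produce a beautified sequence; `generator`, `style_encoder`, and `score_predictor` are placeholder modules, not the paper's components.

```python
# Hedged sketch of producing a beautification *sequence*: condition a generator
# on a style code from the reference face and on a ramp of target beauty scores.
import torch

def beautify_sequence(generator, style_encoder, score_predictor,
                      query, reference, steps=5):
    style = style_encoder(reference)              # beauty style of the reference
    start = score_predictor(query).item()         # current beauty score of the query
    target = score_predictor(reference).item()    # reference's (higher) score
    scores = torch.linspace(start, target, steps)
    return [generator(query, style, s.view(1, 1)) for s in scores]
```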
2021
Sparse Feature Representation Learning for Deep Face Gender Transfer
Xudong Liu, Ruizhe Wang, Hao Peng, and
3 more authors
In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), 2021
Why do people think Tom Hanks and Juliette Lewis look alike? Can we modify the gender appearance of a face image without changing its identity information? Is there any specific feature responsible for the perception of femininity/masculinity in a given face image? These questions are appealing from both computer vision and visual perception perspectives. To shed light upon them, we propose to develop a GAN-based approach to face gender transfer and study the relevance of learned feature representations to face gender perception. Our key contributions include: 1) an architecture design with specially tailored loss functions in the feature space for face gender transfer; 2) the introduction of a novel probabilistic gender mask to facilitate achieving both the objectives of gender transfer and identity preservation; and 3) identification of sparse features (approximately 20 out of 256) uniquely responsible for face gender perception. Extensive experimental results are reported to demonstrate not only the superiority of the proposed face gender transfer technique (in terms of the visual quality of reconstructed images) but also the effectiveness of gender feature representation learning (in terms of the high correlation between the learned sparse features and the perceived gender information). Our findings seem to corroborate a hypothesis in the psychology literature about the independence between face recognizability and gender classifiability. We expect this work will stimulate more computational studies of face perception, including race, age, attractiveness, and trustworthiness.
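The snippet below is an illustrative sketch of the idea of a probabilistic mask confining edits to a sparse subset of a 256-dimensional feature vector; the `gender_direction` vector, the mask parameterization, and the sparsity penalty are assumptions for illustration, not the paper's exact formulation.

```python
# Illustrative sketch (not the authors' code) of a soft mask over a 256-dim
# feature vector that confines editing to a sparse set of gender-relevant
# channels while leaving identity-bearing channels untouched.
import torch

def masked_gender_edit(feat, gender_direction, mask_logits, strength=1.0):
    """
    feat:             (B, 256) latent features of the input face
    gender_direction: (256,)   a learned direction separating masculine/feminine codes
    mask_logits:      (256,)   learnable logits; sigmoid gives per-channel edit probability
    """
    mask = torch.sigmoid(mask_logits)          # soft, near-binary after training
    edit = strength * gender_direction         # move along the gender direction
    return feat + mask * edit                  # only a few channels should end up active

def sparsity_penalty(mask_logits, weight=1e-3):
    """L1 penalty encouraging the mask to stay sparse (few active channels)."""
    return weight * torch.sigmoid(mask_logits).sum()
```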
2020
Learning 3D Faces from Photo-Realistic Facial Synthesis
Ruizhe Wang, Chih-Fan Chen, Hao Peng, and
2 more authors
In International Conference on 3D Vision (3DV), 2020
We present an approach to efficiently learn an accurate and complete 3D face model from a single image. Previous methods rely heavily on 3D Morphable Models to populate the facial shape space, as well as an over-simplified shading model for image formation. By contrast, our method directly augments a large set of 3D faces from a compact collection of facial scans and employs a high-quality rendering engine to synthesize the corresponding photo-realistic facial images. We first use a deep neural network to regress vertex coordinates from the given image and then refine them with a non-rigid deformation process to more accurately capture local shape similarity. We have conducted extensive experiments to demonstrate the superiority of the proposed approach on 2D-to-3D facial shape inference, especially its excellent generalization to real-world selfie images.
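As a minimal sketch of the direct vertex regression mentioned above (under assumptions: a ResNet-18 backbone and an illustrative vertex count, neither taken from the paper):

```python
# Minimal sketch of 2D-to-3D shape regression: a CNN backbone maps a face image
# directly to per-vertex 3D coordinates. Backbone, vertex count, and losses are
# placeholders, not the authors' exact configuration.
import torch.nn as nn
import torchvision.models as models

class VertexRegressor(nn.Module):
    def __init__(self, num_vertices=5023):            # illustrative mesh size
        super().__init__()
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Linear(backbone.fc.in_features, num_vertices * 3)
        self.backbone = backbone
        self.num_vertices = num_vertices

    def forward(self, image):                          # (B, 3, 224, 224)
        v = self.backbone(image)
        return v.view(-1, self.num_vertices, 3)        # (B, V, 3) vertex coordinates
```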
2019
arXiv
Digital twin: Acquiring high-fidelity 3D avatar from a single image
Ruizhe Wang, Chih-Fan Chen, Hao Peng, and
3 more authors
We present an approach to generate a high-fidelity 3D face avatar with a high-resolution UV texture map from a single image. To estimate the face geometry, we use a deep neural network to directly predict the vertex coordinates of the 3D face model from the given image. The 3D face geometry is further refined by a non-rigid deformation process to more accurately capture facial landmarks before texture projection. A key novelty of our approach is to train the shape regression network on facial images synthetically generated using a high-quality rendering engine. Moreover, our shape estimator fully leverages the discriminative power of deep facial identity features learned from millions of facial images. We have conducted extensive experiments to demonstrate the superiority of our optimized 2D-to-3D rendering approach, especially its excellent generalization to real-world selfie images. Our proposed system for rendering 3D avatars from 2D images has a wide range of applications, from virtual/augmented reality (VR/AR) and telepsychiatry to human-computer interaction and social networks.
Show, attend, and translate: Unsupervised image translation with self-regularization and attention
Chao Yang, Taehwan Kim, Ruizhe Wang, and
2 more authors
Image translation between two domains is a class of problems aiming to learn a mapping from an input image in the source domain to an output image in the target domain. It has been applied to numerous applications, such as data augmentation, domain adaptation, and unsupervised training. When paired training data are not accessible, image translation becomes an ill-posed problem. We constrain the problem with the assumption that the translated image needs to be perceptually similar to the original image while also appearing to be drawn from the new domain, and propose a simple yet effective image translation model consisting of a single generator trained with a self-regularization term and an adversarial term. We further notice that existing image translation techniques are agnostic to the subjects of interest and often introduce unwanted changes or artifacts to the input. Thus, we propose to add an attention module that predicts an attention map to guide the image translation process. The module learns to attend to key parts of the image while keeping everything else unaltered, essentially avoiding undesired artifacts or changes. Extensive experiments and evaluations show that our model, while being simpler, achieves significantly better performance than existing image translation methods.
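The schematic below sketches the two-term objective and attention-gated output described above; the module names (`generator`, `attention_net`, `perceptual`, `discriminator`) and the non-saturating adversarial form are placeholder assumptions rather than the authors' exact formulation.

```python
# Schematic: the translated image should stay perceptually close to the input
# (self-regularization) while an adversarial term pushes it toward the target
# domain, and a predicted attention map confines changes to relevant regions.
import torch

def translate_with_attention(generator, attention_net, x):
    """Blend generator output and input using a predicted attention map in [0, 1]."""
    a = torch.sigmoid(attention_net(x))          # (B, 1, H, W) attention map
    y = generator(x)                             # raw translated image
    return a * y + (1 - a) * x                   # untouched regions copy the input

def translation_loss(perceptual, discriminator, x, y, lambda_sr=10.0):
    """Self-regularization (perceptual) term + non-saturating adversarial term."""
    sr = perceptual(x, y)                                          # perceptual distance
    adv = -torch.log(torch.sigmoid(discriminator(y)) + 1e-8).mean()
    return lambda_sr * sr + adv
```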
Understanding beauty via deep facial features
Xudong Liu, Tao Li, Hao Peng, and
3 more authors
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2019
The concept of beauty has been debated by philosophers and psychologists for centuries, but most definitions are subjective and metaphysical, and deficient in accuracy, generality, and scalability. In this paper, we present a novel study on mining beauty semantics of facial attributes based on big data, in an attempt to objectively construct descriptions of beauty in a quantitative manner. We first deploy a deep Convolutional Neural Network (CNN) to extract facial attributes, and then investigate correlations between these features and attractiveness on two large-scale datasets labelled with beauty scores. Not only do we discover the secrets of beauty verified by statistical significance tests, our findings also align perfectly with existing psychological studies that, e.g., a small nose, high cheekbones, and femininity contribute to attractiveness. We further apply these high-level representations to original images via a generative adversarial network (GAN). Beauty enhancements after synthesis are visually compelling and statistically convincing, as verified by a user survey of 10,000 data points.
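A hedged sketch of the correlation analysis described above, assuming CNN attribute activations and human beauty ratings are already available as arrays (the dataset loading and attribute definitions from the paper are not shown here):

```python
# Illustrative analysis step: per-attribute Pearson correlations with
# attractiveness, keeping only statistically significant attributes.
import numpy as np
from scipy import stats

def significant_attributes(features, beauty_scores, names, alpha=0.01):
    """
    features:      (N, D) attribute activations for N faces
    beauty_scores: (N,)   human-rated attractiveness
    names:         list of D attribute names
    """
    results = []
    for d in range(features.shape[1]):
        r, p = stats.pearsonr(features[:, d], beauty_scores)
        if p < alpha:
            results.append((names[d], r, p))
    return sorted(results, key=lambda t: -abs(t[1]))   # strongest correlations first
```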
2018
ESTHER: Extremely Simple Image Translation Through Self-Regularization
Chao Yang, Taehwan Kim, Ruizhe Wang, and
2 more authors
In The British Machine Vision Conference (BMVC), 2018
Image translation between two domains is a class of problems where the goal is to learn the mapping from an input image in the source domain to an output image in the target domain. It has important applications such as data augmentation, domain adaptation, and unsupervised training. When paired training data are not accessible, the mapping between the two domains is highly under-constrained and we are faced with an ill-posed task. Existing approaches tackling this challenge usually make assumptions and introduce prior constraints. For example, CycleGAN assumes cycle-consistency while UNIT assumes a shared latent space between the two domains. We argue that none of these assumptions explicitly guarantees that the learned mapping is the desired one. Taking a step back, we observe that most image translations are based on the intuitive requirement that the translated image needs to be perceptually similar to the original image and also appear to come from the new domain. Based on this observation, we propose an extremely simple yet effective image translation approach, which consists of a single generator and is trained with a self-regularization term and an adversarial term. We further propose an adaptive method to search for the best weight between the two terms. Extensive experiments and evaluations show that our model is significantly more cost-effective and can be trained under budget, yet easily achieves better performance than other methods on a broad range of tasks and applications.
2016
Capturing dynamic textured surfaces of moving targets
Ruizhe Wang, Lingyu Wei, Etienne Vouga, and
4 more authors
In Proceedings of the European Conference on Computer Vision (ECCV), 2016
We present an end-to-end system for reconstructing complete, watertight, and textured models of moving subjects such as clothed humans and animals, using only three or four handheld sensors. The heart of our framework is a new pairwise registration algorithm that minimizes, using a particle swarm strategy, an alignment error metric based on mutual visibility and occlusion. We show that this algorithm reliably registers partial scans with as little as 15% overlap without requiring any initial correspondences, and outperforms alternative global registration algorithms. This registration algorithm allows us to reconstruct moving subjects from free-viewpoint video produced by consumer-grade sensors, without extensive sensor calibration, a constrained capture volume, expensive arrays of cameras, or templates of the subject geometry.
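Below is a generic particle-swarm search over 6-DoF rigid-transform parameters, included only to illustrate the kind of derivative-free optimization named above; the visibility/occlusion alignment error `align_error` is left as a user-supplied function, and nothing here is taken from the paper's implementation.

```python
# Generic PSO over 6-DoF transforms (3 rotation angles + 3 translations), assuming
# a user-supplied alignment error align_error(params, scan_a, scan_b).
import numpy as np

def pso_register(align_error, scan_a, scan_b, n_particles=64, iters=100,
                 bounds=np.array([np.pi, np.pi, np.pi, 1.0, 1.0, 1.0])):
    rng = np.random.default_rng(0)
    x = rng.uniform(-bounds, bounds, size=(n_particles, 6))      # particle positions
    v = np.zeros_like(x)                                         # particle velocities
    pbest = x.copy()
    pbest_err = np.array([align_error(p, scan_a, scan_b) for p in x])
    gbest = pbest[pbest_err.argmin()].copy()

    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, 1))
        v = 0.7 * v + 1.5 * r1 * (pbest - x) + 1.5 * r2 * (gbest - x)
        x = np.clip(x + v, -bounds, bounds)
        err = np.array([align_error(p, scan_a, scan_b) for p in x])
        improved = err < pbest_err
        pbest[improved], pbest_err[improved] = x[improved], err[improved]
        gbest = pbest[pbest_err.argmin()].copy()
    return gbest                                                  # best transform parameters
```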
2015
Surface Oriented Traverse for robust instance detection in RGB-D
Ruizhe Wang, Gérard G Medioni, and Wenyi Zhao
In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2015
We address the problem of robust instance detection in RGB-D images in the presence of noisy data, clutter, partial occlusion, and large pose variation. We extract contour points from the depth image, construct a Surface Oriented Traverse (SOT) feature for each contour point, and classify each point as either belonging or not belonging to the instance of interest. Starting from each contour point, its SOT feature is constructed by traversing and uniformly sampling along an oriented geodesic path on the object surface. After classification, all contour points vote for an instance-specific saliency map, from which the instance of interest is finally localized. Compared with holistic template-based and learning-based methods, our method inherits the advantages of feature-based methods in dealing with clutter, partial occlusion, and large pose variation. Furthermore, our method does not require accurate 3D models or high-quality laser scan data as input and instead takes noisy data from commodity 3D sensors. Experimental results on the public RGB-D Object Dataset and our FindMe RGB-D Dataset demonstrate the effectiveness and robustness of the proposed instance detection algorithm.
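To convey the flavor of a per-contour-point traverse descriptor, here is a deliberately simplified image-plane version that steps inward from a contour point and samples relative depth values; the actual SOT feature traverses a geodesic path on the object surface, which this sketch does not reproduce.

```python
# Simplified illustration: build a descriptor by stepping inward from a contour
# point across the depth image and sampling depth values at uniform intervals.
import numpy as np

def traverse_feature(depth, start, direction, n_samples=16, step=3):
    """
    depth:     (H, W) depth image in meters (0 = missing)
    start:     (row, col) contour point
    direction: unit 2D vector pointing into the object
    """
    h, w = depth.shape
    samples = []
    r, c = float(start[0]), float(start[1])
    d0 = depth[int(start[0]), int(start[1])]
    for _ in range(n_samples):
        r += step * direction[0]
        c += step * direction[1]
        if not (0 <= r < h and 0 <= c < w):
            samples.append(0.0)                     # ran off the image
            continue
        samples.append(depth[int(r), int(c)] - d0)  # depth relative to the contour point
    return np.array(samples)
```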
SIGGRAPH Talks
Blendshapes from commodity RGB-D sensors
Dan Casas, Oleg Alexander, Andrew W Feng, and
6 more authors
Creating and animating a realistic 3D human face is an important task in computer graphics. The capability of capturing the 3D face of a human subject and reanimating it quickly will find many applications in games, training simulations, and interactive 3D graphics. We demonstrate a system to capture photorealistic 3D faces and generate blendshape models automatically using only a single commodity RGB-D sensor. Our method can rapidly generate a set of expressive facial poses from a single depth sensor, such as a Microsoft Kinect version 1, and requires no artistic expertise to process those scans. The system takes only a matter of seconds to capture and produce a 3D facial pose and only a few minutes of processing time to transform it into a blendshape-compatible model. Our main contributions include an end-to-end pipeline for capturing and generating face blendshape models automatically, and a registration method that solves dense correspondences between two face scans by utilizing facial landmark detection and optical flow. We demonstrate the effectiveness of the proposed method by capturing different human subjects and puppeteering their 3D faces in an animation system with real-time facial performance retargeting.
I3D
Rapid photorealistic blendshapes from commodity RGB-D sensors
Dan Casas, Oleg Alexander, Andrew W Feng, and
6 more authors
In Proceedings of the 19th Symposium on Interactive 3D Graphics and Games (I3D), 2015
Creating and animating a realistic 3D human face has been an important task in computer graphics. The capability of capturing the 3D face of a human subject and reanimating it quickly will find many applications in games, training simulations, and interactive 3D graphics. In this paper, we propose a system to capture photorealistic 3D faces and generate blendshape models automatically using only a single commodity RGB-D sensor. Our method can rapidly generate a set of expressive facial poses from a single Microsoft Kinect and requires no artistic expertise on the part of the capture subject. The system takes only a matter of seconds to capture and produce a 3D facial pose and only requires four minutes of processing time to transform it into a blendshape model. Our main contributions include an end-to-end pipeline for capturing and generating face blendshape models automatically, and a registration method that solves dense correspondences between two face scans by utilizing facial landmark detection and optical flow. We demonstrate the effectiveness of the proposed method by capturing 3D facial models of different human subjects and puppeteering their models in an animation system with real-time facial performance retargeting.
2014
SIGGRAPH Talks
Rapid avatar capture and simulation using commodity depth sensors
Andrew Feng, Ari Shapiro, Ruizhe Wang, and
3 more authors
We demonstrate a method of acquiring a 3D model of a human using commodity scanning hardware and then controlling that 3D figure in a simulated environment in only a few minutes. The model acquisition requires four static poses taken at 90° angles relative to each other. The 3D model is then given a skeleton and smooth binding information necessary for control and simulation. The 3D models that are captured are suitable for use in applications where recognition and distinction among characters by shape, form, or clothing is important, such as small group or crowd simulations or other socially oriented applications. Because of the speed at which a human figure can be captured and the low hardware requirements, this method can be used to capture, track, and model human figures as their appearances change over time.
IEEE VR
Automatic acquisition and animation of virtual avatars
Ari Shapiro, Andrew Feng, Ruizhe Wang, and
3 more authors
The USC Institute for Creative Technologies will demonstrate a pipeline for automatic reconstruction and animation of lifelike 3D avatars acquired by rotating the user’s body in front of a single Microsoft Kinect sensor. Based on a fusion of state-of-the-art techniques in computer vision, graphics, and animation, this approach can produce a fully rigged character model suitable for real-time virtual environments in less than four minutes.
3D modeling from wide baseline range scans using contour coherence
Ruizhe Wang, Jongmoo Choi, and Gérard Medioni
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014
Registering two or more range scans is a fundamental problem with applications to 3D modeling. While this problem is well addressed by existing techniques such as ICP when the views overlap significantly and a good initialization is available, no satisfactory solution exists for wide baseline registration. We propose a novel approach that leverages contour coherence and allows us to align two wide baseline range scans with limited overlap from a poor initialization. Inspired by ICP, we maximize the contour coherence by building robust corresponding pairs on apparent contours and minimizing their distances in an iterative fashion. We use contour coherence within a multi-view rigid registration framework, which enables the reconstruction of accurate and complete 3D models from as few as four frames. We further extend the method to handle articulation, which allows us to model articulated objects such as the human body. Experimental results on both synthetic and real data demonstrate the effectiveness and robustness of our contour coherence based registration approach for wide baseline range scans and for 3D modeling.
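For context, the loop below is a compact generic ICP-style registration (nearest-neighbour correspondences plus a closed-form Kabsch solve), standing in for the iterative build-correspondences/minimize-distances scheme described above; it omits the apparent-contour extraction and visibility reasoning that are the paper's actual contribution.

```python
# Generic ICP sketch: alternate nearest-neighbour matching with a closed-form
# rigid alignment (Kabsch/SVD). Points here stand in for contour samples.
import numpy as np
from scipy.spatial import cKDTree

def best_rigid(src, dst):
    """Closed-form rotation/translation aligning src to dst (Kabsch algorithm)."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    H = (src - mu_s).T @ (dst - mu_d)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                 # guard against reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = mu_d - R @ mu_s
    return R, t

def icp(src, dst, iters=30):
    tree = cKDTree(dst)
    R_total, t_total = np.eye(3), np.zeros(3)
    cur = src.copy()
    for _ in range(iters):
        _, idx = tree.query(cur)             # nearest-neighbour correspondences
        R, t = best_rigid(cur, dst[idx])
        cur = cur @ R.T + t
        R_total, t_total = R @ R_total, R @ t_total + t
    return R_total, t_total
```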
2013
Wireless Health
Monitoring mobility disorders at home using 3D visual sensors and mobile sensors
Farnoush B Kashani, Gerard Medioni, Khanh Nguyen, and
8 more authors
In Proceedings of the 4th Conference on Wireless Health, 2013
In this paper, we present PoCM2 (Point-of-Care Mobility Monitoring), a generic and extensible at-home mobility evaluation and monitoring system. PoCM2 uses both 3D visual sensors (such as the Microsoft Kinect) and mobile sensors (i.e., internal and external sensors embedded in or connected to a mobile device such as a smartphone) for complementary data acquisition, as well as a series of analytics that allow evaluation of both archived and real-time mobility data. We demonstrate the performance of PoCM2 with a specific application developed for freeze detection and quantification from Parkinson’s Disease mobility data, as an approach to estimate the medication level of PD patients and potentially recommend adjustments.
Home monitoring musculo-skeletal disorders with a single 3d sensor
Ruizhe Wang, Gérard Medioni, Carolee Winstein, and
1 more author
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2013
We address the problem of automated quantitative evaluation of musculo-skeletal disorders using a 3D sensor. This enables a non-invasive home monitoring system that extracts and analyzes the subject’s motion symptoms and provides clinical feedback. The subject is asked to perform several clinically validated standardized tests (e.g., sit-to-stand, repeated several times) in front of a 3D sensor to generate a sequence of skeletons (i.e., locations of 3D joints). While the complete sequence consists of multiple repeated Skeletal Action Units (SAUs) (e.g., sit-to-stand, one repetition), we generate a single robust Representative Skeletal Action Unit (RSAU) that encodes the subject’s most consistent spatio-temporal motion pattern. Based on the RSAU, we extract a series of clinical measurements (e.g., step size, swing level of the hand) that are crucial for prescription and rehabilitation plan design. In this paper, we propose a Temporal Alignment Spatial Summarization (TASS) method to decouple the complex spatio-temporal information of multiple SAUs. Experimental results from people with Parkinson’s Disease (PD) and people without Parkinson’s Disease (non-PD) demonstrate the effectiveness of our methodology, which opens the way for many related applications.
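The fragment below only gestures at the two stages named above, temporal alignment followed by spatial summarization, using simple linear resampling and a per-joint median; the actual TASS method is more sophisticated, so treat this purely as an illustration of the idea.

```python
# Rough sketch: align each repetition in time (linear resampling to a common
# length) and then summarize per frame and joint with a robust statistic.
import numpy as np

def resample_sau(sau, length=100):
    """sau: (T, J, 3) joint positions for one repetition; resample to `length` frames."""
    t_old = np.linspace(0.0, 1.0, sau.shape[0])
    t_new = np.linspace(0.0, 1.0, length)
    out = np.empty((length,) + sau.shape[1:])
    for j in range(sau.shape[1]):
        for k in range(3):
            out[:, j, k] = np.interp(t_new, t_old, sau[:, j, k])
    return out

def representative_sau(saus, length=100):
    """Median over temporally aligned repetitions -> one representative unit."""
    aligned = np.stack([resample_sau(s, length) for s in saus])   # (N, length, J, 3)
    return np.median(aligned, axis=0)
```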
2012
Accurate full body scanning from a single fixed 3d camera
Ruizhe Wang, Jongmoo Choi, and Gerard Medioni
In Second International Conference on 3D Imaging, Modeling, Processing, Visualization & Transmission, 2012
3D body modeling has been a long-studied topic in computer vision and computer graphics. While several solutions have been proposed using either multiple sensors or a moving sensor, we propose an approach in which the user turns, in a natural motion, in front of a fixed, low-cost 3D camera. This opens the door to a wide range of applications where scanning is performed at home. Our scanning system can be easily set up and the instructions are straightforward to follow. We propose an articulated, part-based cylindrical representation for the body model, and show that an accurate 3D shape can be automatically estimated from four key views detected in a depth video sequence. The registration between the four key views is performed in a top-bottom-top manner that fully accounts for kinematic constraints. We validate our approach on a large number of users, and compare its accuracy to that of a reference laser scan. We show that even with a simplified model (five cylinders), an average error of 5 mm can be consistently achieved.
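As a toy illustration of the part-based cylindrical representation, the helper below estimates a single cylinder radius for one body segment from points already assigned to it; the segmentation, key-view detection, and registration steps described above are assumed to have been done and are not sketched here.

```python
# Toy illustration: estimate one cylinder radius per body segment as the robust
# average distance of the segment's points to its bone axis.
import numpy as np

def segment_radius(points, joint_a, joint_b):
    """points: (N, 3) points assigned to the segment between joints a and b."""
    axis = joint_b - joint_a
    axis = axis / np.linalg.norm(axis)
    rel = points - joint_a
    along = rel @ axis                                 # projection onto the bone axis
    radial = rel - np.outer(along, axis)               # component perpendicular to the axis
    return np.median(np.linalg.norm(radial, axis=1))   # robust radius estimate
```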