Posted by Tal Remez, Software Engineer, Google Research, and Michael Hassid, Software Engineer Intern, Google Research

Recent years have seen a tremendous increase in the creation and serving of video content to users across the world, in a variety of languages and over numerous platforms. The process of creating high-quality content can include several stages, from video capturing and captioning to video and audio editing. In some cases, dialogue is re-recorded in a studio (referred to as dialog replacement, post-sync, or dubbing) in order to achieve high quality and replace original audio that might have been recorded in noisy conditions. However, the dialog replacement process can be difficult and tedious because the newly recorded audio needs to be well synced with the video, often requiring several edits to match the exact timing of mouth movements.

In "More than Words: In-the-Wild Visually-Driven Prosody for Text-to-Speech", we present a proof-of-concept visually-driven text-to-speech model, called VDTTS, that automates the dialog replacement process. Given a text and the original video frames of the speaker, VDTTS is trained to generate the corresponding speech. As opposed to standard visual speech recognition models, which focus on the mouth region, we detect and crop full faces using MediaPipe to avoid potentially excluding information pertinent to the speaker's delivery (a sketch of this step appears at the end of this post). This gives the VDTTS model enough information to generate speech that matches the video while also recovering aspects of prosody, such as timing and emotion. Despite not being explicitly trained to generate speech that is synchronized to the input video, the learned model still does so.

*Given a text and video frames of a speaker, VDTTS generates speech with prosody that matches the video signal.*

The VDTTS model resembles Tacotron at its core and has four main components: (1) text and video encoders that process the inputs, (2) a multi-source attention mechanism that connects the encoders to a decoder, (3) a spectrogram decoder that incorporates the speaker embedding (similarly to VoiceFilter) and produces mel-spectrograms (a compressed representation of audio in the frequency domain), and (4) a frozen, pretrained neural vocoder that produces waveforms from the mel-spectrograms.

*Text and video encoders process the inputs, and a multi-source attention mechanism then connects these to a decoder that produces mel-spectrograms. A vocoder then produces waveforms from the mel-spectrograms to generate speech as an output.*

We train VDTTS using video and text pairs from LSVSR in which the text corresponds to the exact words spoken by a person in a video. Throughout our testing, we have determined that VDTTS cannot generate speech for arbitrary text, making it less prone to misuse (e.g., the generation of fake content).

To showcase the unique strength of VDTTS, in this post we have selected two inference examples from the VoxCeleb2 test dataset and compare the performance of VDTTS to that of a standard text-to-speech (TTS) model. In both examples, the video frames provide prosody and word-timing clues, visual information that is not available to the TTS model. In the first example, the speaker talks at a particular pace that can be seen as periodic gaps in the ground-truth mel-spectrogram.
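To make the full-face cropping step concrete, here is a minimal sketch using MediaPipe's face-detection solution. The function name, crop logic, and confidence threshold are illustrative assumptions, not the preprocessing code used for VDTTS:

```python
import mediapipe as mp

mp_face_detection = mp.solutions.face_detection

def crop_full_faces(frames, min_confidence=0.5):
    """Crop the full face (not just the mouth region) from each RGB frame.

    `frames` is an iterable of HxWx3 uint8 RGB arrays. This is a sketch,
    not the authors' code.
    """
    crops = []
    with mp_face_detection.FaceDetection(
            model_selection=0, min_detection_confidence=min_confidence) as detector:
        for frame in frames:
            results = detector.process(frame)
            if not results.detections:
                crops.append(None)  # no face found in this frame
                continue
            # Relative bounding box of the first detected face.
            box = results.detections[0].location_data.relative_bounding_box
            h, w, _ = frame.shape
            x0, y0 = max(int(box.xmin * w), 0), max(int(box.ymin * h), 0)
            x1 = min(int((box.xmin + box.width) * w), w)
            y1 = min(int((box.ymin + box.height) * h), h)
            crops.append(frame[y0:y1, x0:x1])
    return crops
```

Cropping the full face rather than the mouth alone keeps cues such as facial expression available to the video encoder, which is the motivation described above.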
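The four-component pipeline can likewise be sketched schematically. The PyTorch snippet below is a heavily simplified, non-autoregressive stand-in: every module choice, dimension, and name is an assumption for illustration, the frozen vocoder is left out as an external module, and the real architecture follows Tacotron as specified in the paper:

```python
import torch
import torch.nn as nn

class VDTTSSketch(nn.Module):
    """Schematic of the four VDTTS components; sizes are illustrative."""

    def __init__(self, vocab_size=100, d_model=256, video_dim=512, n_mels=80):
        super().__init__()
        # (1) Text and video encoders that process the inputs.
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.text_encoder = nn.LSTM(d_model, d_model, batch_first=True)
        self.video_encoder = nn.LSTM(video_dim, d_model, batch_first=True)
        # (2) Multi-source attention: the decoder attends to both encoders.
        self.text_attn = nn.MultiheadAttention(d_model, 4, batch_first=True)
        self.video_attn = nn.MultiheadAttention(d_model, 4, batch_first=True)
        # (3) Spectrogram decoder conditioned on a speaker embedding.
        self.decoder = nn.LSTM(3 * d_model, d_model, batch_first=True)
        self.to_mel = nn.Linear(d_model, n_mels)

    def forward(self, text_ids, video_feats, speaker_emb, n_frames):
        text_h, _ = self.text_encoder(self.text_embed(text_ids))
        video_h, _ = self.video_encoder(video_feats)
        # One query per output mel frame (greatly simplified: a real
        # autoregressive decoder would feed back its previous output).
        queries = video_h[:, :n_frames, :]
        text_ctx, _ = self.text_attn(queries, text_h, text_h)
        video_ctx, _ = self.video_attn(queries, video_h, video_h)
        spk = speaker_emb.unsqueeze(1).expand(-1, n_frames, -1)
        dec_h, _ = self.decoder(torch.cat([text_ctx, video_ctx, spk], dim=-1))
        mel = self.to_mel(dec_h)
        # (4) A frozen, pretrained vocoder would turn `mel` into a waveform.
        return mel
```

Using one attention module per encoder mirrors the idea of multi-source attention: at each output frame the decoder can weigh text content and video timing independently.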
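Finally, since mel-spectrograms appear twice above, as the decoder's output and as the visualization in which the speaker's pacing shows up as periodic gaps, the following sketch computes one with librosa and flags near-silent frames. The silence threshold and helper name are assumptions for illustration:

```python
import librosa
import numpy as np

def mel_and_pauses(wav_path, sr=16000, n_mels=80, silence_db=-50.0):
    """Compute a log-mel spectrogram and flag low-energy frames (pauses).

    Speaker pauses show up as near-silent columns: the "periodic gaps"
    visible in the ground-truth mel-spectrogram of the first example.
    """
    y, sr = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    mel_db = librosa.power_to_db(mel, ref=np.max)  # log compression, typical in TTS
    frame_energy_db = mel_db.max(axis=0)           # loudest mel bin per frame
    is_pause = frame_energy_db < silence_db        # True where the frame is near silence
    return mel_db, is_pause
```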