A Deep Learning Framework for Efficient High-Fidelity Speech Synthesis: StyleTTS
Keywords:
Deep Learning, Artificial Intelligence, GAN, Open-source, Audio Synthesis, Text-to-Speech

Abstract
As we transition into the age of Artificial Intelligence (AI), one of its most remarkable achievements is the ability to talk and engage with human beings. An integral part of this capability, referred to as speech synthesis, is making the computer sound more human. Generative adversarial networks (GANs) have emerged as effective generative models and now largely dominate the image generation domain, yet their potential in the audio domain remains largely untapped. Because of their highly parallelizable structure, GANs can produce hours of audio within seconds. Moreover, their explicit modelling of a latent space affords a degree of artistic control as well. In this paper, we propose the Style Text-to-Speech (StyleTTS) model, which adapts the image-based StyleGAN architecture to efficiently generate high-fidelity speech. The model takes a character string and generates the corresponding speech; in this paper, we present results for the digits zero through nine. We also compare our results with earlier TTS approaches and GAN models and report gains from the newer architecture. We intend to release our pre-trained GAN to the open-source community in the form of a library. Upon release, it will have been trained on audio samples of English statements spoken by various speakers. The library will be designed so that researchers can easily extend it with more data; it aims to be simple for practitioners, and fast and robust in industrial deployments.
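The core idea described above, a conditioned generator mapping a character string to a waveform, can be illustrated with the following minimal sketch. All names here (`StyleTTSGenerator`, `synthesize`) are hypothetical illustrations rather than the released library's API, and the "generator" is a stand-in with random linear weights so the sketch runs end to end; a real model would use trained StyleGAN-style convolutional layers.

```python
import numpy as np


class StyleTTSGenerator:
    """Toy stand-in for a character-conditioned GAN generator.

    Hypothetical API for illustration only: a latent noise vector is
    combined with a per-character conditioning embedding and passed
    through a linear "generator" to produce one waveform chunk per
    character.
    """

    VOCAB = "0123456789"  # the paper's experiments cover digits zero-nine

    def __init__(self, latent_dim=128, samples_per_char=2048, seed=0):
        rng = np.random.default_rng(seed)
        self.latent_dim = latent_dim
        self.samples_per_char = samples_per_char
        # One conditioning embedding per character class.
        self.embed = rng.standard_normal((len(self.VOCAB), latent_dim))
        # Stand-in generator weights: latent vector -> waveform chunk.
        self.w = rng.standard_normal((latent_dim, samples_per_char)) * 0.01

    def synthesize(self, text, seed=None):
        """Return a float waveform in [-1, 1] for a digit string."""
        rng = np.random.default_rng(seed)
        chunks = []
        for ch in text:
            z = rng.standard_normal(self.latent_dim)  # per-character latent noise
            cond = self.embed[self.VOCAB.index(ch)]   # conditioning vector
            chunk = np.tanh((z + cond) @ self.w)      # squash samples to [-1, 1]
            chunks.append(chunk)
        # Chunks are independent, so a real GAN can emit them in parallel.
        return np.concatenate(chunks)


gen = StyleTTSGenerator()
wave = gen.synthesize("042")
print(wave.shape)  # one fixed-length chunk per input character
```

Because each character's chunk depends only on its own latent vector and embedding, the per-character work is embarrassingly parallel, which is the property the abstract credits for GANs generating hours of audio in seconds.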