A Deep Learning Framework for Efficient High-Fidelity Speech Synthesis: StyleTTS

Authors

  • Ather Fawaz, Department of Computer Science, FAST-NUCES, Lahore, Pakistan
  • Mubariz Barkat Ali, Department of Computer Science, FAST-NUCES, Lahore, Pakistan
  • Muhammad Adan, Department of Computer Science, FAST-NUCES, Lahore, Pakistan
  • Malik Mujtaba, Department of Computer Science, FAST-NUCES, Lahore, Pakistan
  • Aamir Wali, Department of Computer Science, FAST-NUCES, Lahore, Pakistan

Keywords

Deep Learning, Artificial Intelligence, GAN, Open-source, Audio-synthesis, Text-to-Speech

Abstract

As we transition into the age of Artificial Intelligence (AI), one of its most remarkable feats is the ability to talk and engage with human beings. An integral part of this task, referred to as speech synthesis, is making the computer sound more human. Generative adversarial networks (GANs) have emerged as effective generative models and now more or less dominate the image generation domain, but their potential in the audio domain remains largely untapped. Owing to their highly parallelizable structure, GANs can produce hours of audio within seconds. Moreover, because they model a latent space, they can afford some artistic control as well. In this paper, we propose the Style text-to-speech (StyleTTS) model, which uses the image-based StyleGAN architecture to efficiently generate high-fidelity speech. The model takes a character string and generates the corresponding speech; here we present results for the digits zero through nine. We also compare our results with older TTS approaches and GAN models and report gains from the newer architecture. We intend to provide our pre-trained GAN to the open-source community in the form of a library. Upon release, it will have been trained on audio samples of English statements spoken by various speakers. The library will be designed so that researchers can easily extend it with more data. It will be simple for practitioners to use, and fast and robust in industrial deployments.
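The interface the abstract describes (a character string in, a waveform out of a GAN generator) can be sketched as follows. This is a toy illustration only: the embedding and generator below are hypothetical stand-ins chosen for self-containment, not the StyleTTS architecture, and the sample rate is an assumption the abstract does not state.

```python
import zlib
import numpy as np

SAMPLE_RATE = 16_000  # assumed output rate; not specified in the abstract


def embed_text(text: str, dim: int = 64) -> np.ndarray:
    """Map a character string to a fixed-size latent vector (toy embedding).

    A real TTS model would use a learned text encoder; here we just seed a
    random vector deterministically from the string so the sketch runs."""
    seed = zlib.crc32(text.encode("utf-8"))  # stable across runs
    rng = np.random.default_rng(seed)
    return rng.standard_normal(dim)


def toy_generator(latent: np.ndarray, seconds: float = 1.0) -> np.ndarray:
    """Stand-in for a GAN generator: expand a latent vector into a waveform.

    A trained generator would use learned (transposed-)convolutions; this toy
    version mixes a few sinusoids whose frequencies come from the latent,
    which is enough to show the string-in / audio-out shape of the interface."""
    t = np.linspace(0.0, seconds, int(SAMPLE_RATE * seconds), endpoint=False)
    freqs = 200.0 + 50.0 * np.abs(latent[:4])  # a few pseudo-formants
    wave = sum(np.sin(2 * np.pi * f * t) for f in freqs) / len(freqs)
    return wave.astype(np.float32)


def synthesize(text: str) -> np.ndarray:
    """Character string -> 1-second mono waveform in [-1, 1]."""
    return toy_generator(embed_text(text))
```

Because every output sample depends only on the latent and the time index, generation of this kind is embarrassingly parallel, which is the property the abstract credits for GANs producing hours of audio within seconds.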

Published

2021-01-11

How to Cite

Ather Fawaz, Mubariz Barkat Ali, Muhammad Adan, Malik Mujtaba, & Aamir Wali. (2021). A Deep Learning Framework for Efficient High-Fidelity Speech Synthesis: StyleTTS. IKSP Journal of Computer Science and Engineering, 1(1). Retrieved from https://iksp.org/journals/index.php/ijcse/article/view/100