Abstract
1- INTRODUCTION
2- RELATED WORK
3- METHOD
4- EVALUATION
5- CONCLUSION
References
Abstract
We introduce a perceptually motivated approach to bandwidth expansion for speech. Our method pairs a new 3-way split variant of the FFTNet neural vocoder structure with a perceptual loss function, combining objectives from both the time and frequency domains. Mean opinion score tests show that it outperforms baseline methods from both domains, even for extreme bandwidth expansion.
INTRODUCTION
This paper introduces a deep learning-based method for bandwidth expansion of human speech. The goal of bandwidth expansion (BWE), also called “bandwidth extension” or “audio super-resolution,” is to expand the frequency range of an input audio signal. Its traditional application is telephony, where the bandwidth of speech may be limited to below 4 kHz and BWE aims to render muffled speech more intelligible [1].
Newer audio synthesis tasks, such as text-to-speech (TTS) and consumer digital media creation, call for more extreme BWE, such as to 44.1 kHz or 48 kHz. In WaveNet-like applications, for example, speech is synthesized at a low sampling rate for efficiency reasons [2], and BWE may be applied to the synthesized audio to improve the listening experience. In another use case, many consumers record speech on low-bandwidth devices, such as consumer-grade microphones, and would like higher-resolution versions of their recordings for podcasts or other artistic purposes. In these applications, the input bandwidth might not be as low as that of telephone transmission, but rather around 8 kHz.
Our objective is to super-resolve speech to high-definition audio. In our experiments, we convert 8 kHz to 44.1 kHz, although these rates are just parameters of the method. By expanding beyond 16 kHz, we emphasize not intelligibility, as in traditional BWE, but high perceptual quality and a sense of presence in the recording, since the extreme upper bands carry information beyond speech content, including the finer details of the speaker’s voice and environment.
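To make the problem setting concrete, the following minimal sketch illustrates why conventional resampling alone does not solve BWE. It is not the method proposed in this paper. It assumes that the 8 kHz and 44.1 kHz figures are sampling rates, uses a toy two-tone signal in place of speech, and relies on hypothetical helper names (`fft_resample`, `high_band_fraction`) and an idealized FFT-based resampler chosen only for illustration.

```python
import numpy as np

SR_HIGH = 44_100  # target rate in Hz (assumed to be a sampling rate)
SR_LOW = 8_000    # input rate in Hz (assumed to be a sampling rate)


def fft_resample(x, n_out):
    """Resample x to n_out samples by truncating or zero-padding its spectrum."""
    n_in = len(x)
    spectrum = np.fft.rfft(x)
    n_bins_out = n_out // 2 + 1
    resized = np.zeros(n_bins_out, dtype=complex)
    n_keep = min(len(spectrum), n_bins_out)
    resized[:n_keep] = spectrum[:n_keep]
    # Rescale so that amplitudes are preserved across the length change.
    return np.fft.irfft(resized, n=n_out) * (n_out / n_in)


def high_band_fraction(x, sr, cutoff_hz):
    """Fraction of the signal energy at or above cutoff_hz."""
    power = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
    return power[freqs >= cutoff_hz].sum() / power.sum()


# Toy one-second "wideband" signal: one tone below 4 kHz and one above it.
t = np.arange(SR_HIGH) / SR_HIGH
wideband = np.sin(2 * np.pi * 1_000 * t) + np.sin(2 * np.pi * 10_000 * t)

# Simulated low-bandwidth input: content above 4 kHz is discarded.
narrowband = fft_resample(wideband, SR_LOW)

# Naive upsampling restores the sampling rate but not the lost upper band.
naive_upsampled = fft_resample(narrowband, SR_HIGH)

print("high-band energy, original:", high_band_fraction(wideband, SR_HIGH, 4_000))
print("high-band energy, naive upsampling:", high_band_fraction(naive_upsampled, SR_HIGH, 4_000))
```

In this toy case, about half of the original signal's energy lies above 4 kHz, while the naively upsampled signal has essentially none there. That missing upper band is the content a BWE method has to synthesize.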