For non-intrusive speech quality assessment, we treat the mean opinion score (MOS) of a speech signal as a latent variable and propose a latent MOS network (LaMOSNet) to estimate it. During training, LaMOSNet consists of two parts in series: the first part provides the latent estimate, i.e. the MOS of an input speech signal, and the second part provides an estimate of the score given by a particular judge. Only the first part is used at test time. We address two inherent aspects of training, limited data and noisy data, using stochastic gradient noise and a student-teacher type of training motivated by semi-supervised learning. We show that LaMOSNet provides good performance on the Voice Conversion Challenge 2018 dataset and state-of-the-art correlation performance on the Voice Conversion Challenge 2016 dataset.
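The two-part structure described above can be illustrated with a minimal sketch. It is not the LaMOSNet architecture, which the abstract does not specify. The linear feature map, the clamp to a 1-to-5 MOS scale, the additive per-judge bias, and all function and variable names are hypothetical and chosen only to show how a latent MOS stage can feed a judge-specific stage that is dropped at test time.

```python
# Toy illustration only: every name and modelling choice here is an assumption,
# not the method from the paper.

def latent_mos(features, weights, bias):
    """First part: map utterance features to a latent MOS estimate.

    Assumes a plain linear map and a 1-to-5 MOS scale (both hypothetical).
    """
    raw = sum(w * x for w, x in zip(weights, features)) + bias
    return min(5.0, max(1.0, raw))


def judge_score(mos, judge_id, judge_bias):
    """Second part: turn the latent MOS into the score one judge might give.

    Assumes each judge differs from the latent MOS by an additive offset.
    Unknown judges get an offset of zero.
    """
    return mos + judge_bias.get(judge_id, 0.0)


# Hypothetical parameters and input features.
weights = [0.5, 0.25]
bias = 2.0
judge_bias = {"judge_a": -0.5, "judge_b": 0.25}
features = [2.0, 2.0]

# Training time: both parts run in series, so the output can be compared
# with the rating that a specific judge actually gave.
train_prediction = judge_score(latent_mos(features, weights, bias), "judge_a", judge_bias)

# Test time: only the first part is used, giving the MOS estimate itself.
test_prediction = latent_mos(features, weights, bias)
```

With these toy values the test-time output is the latent MOS alone, while the training-time output is shifted by the offset of the chosen judge, which mirrors the split between the two parts described above.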
Part of ISBN 9789464593600
QC 20231214