kth.se Publications
Choosing only the best voice imitators: Top-K many-to-many voice conversion with StarGAN
CVBLab, Instituto Universitario de Investigación en Tecnología Centrada en el Ser Humano (HUMAN-tech), Universitat Politècnica de València, 46022 Valencia, Spain.
CVBLab, Instituto Universitario de Investigación en Tecnología Centrada en el Ser Humano (HUMAN-tech), Universitat Politècnica de València, 46022 Valencia, Spain; ValgrAI - Valencian Graduate School and Research Network for Artificial Intelligence, Universitat Politècnica de València, 46022 Valencia, Spain.
KTH, School of Electrical Engineering and Computer Science (EECS), Human Centered Technology, Media Technology and Interaction Design, MID. ORCID iD: 0000-0002-1244-881x
CVBLab, Instituto Universitario de Investigación en Tecnología Centrada en el Ser Humano (HUMAN-tech), Universitat Politècnica de València, 46022 Valencia, Spain.
2024 (English) In: Speech Communication, ISSN 0167-6393, E-ISSN 1872-7182, Vol. 156, article id 103022. Article in journal (Refereed). Published.
Abstract [en]

Voice conversion systems have become increasingly important as the use of voice technology grows. Deep learning techniques, specifically generative adversarial networks (GANs), have enabled significant progress in the creation of synthetic media, including the field of speech synthesis. One of the most recent examples, StarGAN-VC, uses a single pair of generator and discriminator to convert voices between multiple speakers. However, the training stability of GANs can be an issue. The Top-K methodology, which trains the generator using only the best K generated samples that “fool” the discriminator, has been applied to image tasks and simple GAN architectures. In this work, we demonstrate that the Top-K methodology can improve the quality and stability of converted voices in a state-of-the-art voice conversion system like StarGAN-VC. We also explore the optimal time to implement the Top-K methodology and how to reduce the value of K during training. Through both quantitative and qualitative studies, it was found that the Top-K methodology leads to quicker convergence and better conversion quality compared to regular or vanilla training. In addition, human listeners perceived the samples generated using Top-K as more natural and were more likely to believe that they were produced by a human speaker. The results of this study demonstrate that the Top-K methodology can effectively improve the performance of deep learning-based voice conversion systems.
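The core of the Top-K methodology described above is a small change to the generator update: of the N samples generated in a batch, only the K that the discriminator rated highest (those that best "fool" it) contribute to the generator loss, and K can be annealed downward during training. The sketch below illustrates that selection step only; it is not the paper's StarGAN-VC implementation, and the linear annealing schedule in `anneal_k` is an illustrative assumption, not the schedule the authors used.

```python
import numpy as np

def topk_generator_loss(disc_scores, k):
    """Top-K generator loss: keep only the k generated samples the
    discriminator rated highest, and average the non-saturating loss
    -log D(G(z)) over those k samples alone.

    disc_scores: discriminator outputs D(G(z)) in (0, 1), shape (N,).
    """
    scores = np.asarray(disc_scores, dtype=float)
    top_k = np.sort(scores)[-k:]  # the k highest-scoring fakes
    return float(-np.mean(np.log(top_k)))

def anneal_k(step, total_steps, batch_size, k_min):
    """A hypothetical schedule: decay k linearly from the full batch
    size down to k_min over the course of training."""
    frac = step / total_steps
    return max(k_min, int(round(batch_size * (1.0 - frac))))
```

With `k` equal to the batch size this reduces to ordinary GAN training; smaller `k` concentrates the generator's gradient on its most convincing outputs, which is the mechanism the abstract credits for quicker convergence.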

Place, publisher, year, edition, pages
Elsevier BV, 2024. Vol. 156, article id 103022
Keywords [en]
Generative adversarial networks, Non-parallel, Speech synthesis, Top-k, Voice conversion
National Category
Computer Sciences
Identifiers
URN: urn:nbn:se:kth:diva-341442
DOI: 10.1016/j.specom.2023.103022
ISI: 001133269900001
Scopus ID: 2-s2.0-85178611630
OAI: oai:DiVA.org:kth-341442
DiVA, id: diva2:1825742
Note

QC 20240110

Available from: 2024-01-10. Created: 2024-01-10. Last updated: 2024-01-16. Bibliographically approved.

Open Access in DiVA

No full text in DiVA

Other links

Publisher's full text
Scopus

Authority records

Panariello, Claudio

Search in DiVA

By author/editor
Panariello, Claudio
By organisation
Media Technology and Interaction Design, MID
In the same journal
Speech Communication
Computer Sciences

Search outside of DiVA

Google
Google Scholar
