Mendelson, Joseph
Publications (6 of 6)
Mendelson, J. & Aylett, M. (2017). Beyond the listening test: An interactive approach to TTS Evaluation. In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH. Paper presented at 18th Annual Conference of the International Speech Communication Association, INTERSPEECH 2017, Stockholm, Sweden, 20 August 2017 through 24 August 2017 (pp. 249-253). International Speech Communication Association, 2017
2017 (English). In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, International Speech Communication Association, 2017, Vol. 2017, p. 249-253. Conference paper (Refereed)
Abstract [en]

Traditionally, subjective text-to-speech (TTS) evaluation is performed through audio-only listening tests, where participants evaluate unrelated, context-free utterances. The ecological validity of these tests is questionable, as they do not represent real-world end-use scenarios. In this paper, we examine a novel approach to TTS evaluation in an imagined end-use, via a complex interaction with an avatar. Six different voice conditions were tested: Natural speech, Unit Selection and Parametric Synthesis, in neutral and expressive realizations. Results were compared to a traditional audio-only evaluation baseline. Participants in both studies rated the voices for naturalness and expressivity. The baseline study showed canonical results for naturalness: Natural speech scored highest, followed by Unit Selection, then Parametric Synthesis. Expressivity was clearly distinguishable in all conditions. In the avatar interaction study, participants rated naturalness in the same order as the baseline, though with smaller effect size; expressivity was not distinguishable. Further, no significant correlations were found between cognitive or affective responses and any voice conditions. This highlights two primary challenges in designing more valid TTS evaluations: in real-world use-cases involving interaction, listeners generally interact with a single voice, making comparative analysis unfeasible, and in complex interactions, the context and content may confound perception of voice quality.

Place, publisher, year, edition, pages
International Speech Communication Association, 2017
Series
Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, ISSN 2308-457X
Keywords
Expressive speech synthesis, Human-computer interaction, Interactive virtual agents, Listening tests, Statistical parametric speech synthesis, Subjective evaluation, TTS evaluation, Unit selection, User experience, Voice interface design
National Category
Human Computer Interaction
Identifiers
urn:nbn:se:kth:diva-220753 (URN); 10.21437/Interspeech.2017-1438 (DOI); 000457505000051 (); 2-s2.0-85039163349 (Scopus ID)
Conference
18th Annual Conference of the International Speech Communication Association, INTERSPEECH 2017, Stockholm, Sweden, 20 August 2017 through 24 August 2017
Note

QC 20180105

Available from: 2018-01-05. Created: 2018-01-05. Last updated: 2024-03-15. Bibliographically approved
Oertel, C., Jonell, P., Kontogiorgos, D., Mendelson, J., Beskow, J. & Gustafson, J. (2017). Crowdsourced design of artificial attentive listeners. Paper presented at INTERSPEECH: Situated Interaction, August 20-24, 2017.
2017 (English). Conference paper, Published paper (Refereed)
National Category
Engineering and Technology
Identifiers
urn:nbn:se:kth:diva-215505 (URN)
Conference
INTERSPEECH: Situated Interaction, August 20-24, 2017
Note

QC 20171011

Available from: 2017-10-10. Created: 2017-10-10. Last updated: 2024-03-15. Bibliographically approved
Oertel, C., Jonell, P., Kontogiorgos, D., Mendelson, J., Beskow, J. & Gustafson, J. (2017). Crowd-Sourced Design of Artificial Attentive Listeners. In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH. Paper presented at 18th Annual Conference of the International Speech Communication Association, INTERSPEECH 2017; Stockholm; Sweden; 20 August 2017 through 24 August 2017 (pp. 854-858). International Speech Communication Association, 2017
2017 (English). In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, International Speech Communication Association, 2017, Vol. 2017, p. 854-858. Conference paper, Published paper (Refereed)
Abstract [en]

Feedback generation is an important component of human-human communication. Humans can choose to signal support, understanding, agreement or scepticism by means of feedback tokens. Many studies have focused on the timing of feedback behaviours. In the current study, however, we keep the timing constant and instead focus on the lexical form and prosody of feedback tokens as well as their sequential patterns. For this, we crowdsourced participants' feedback behaviour in identical interactional contexts in order to model a virtual agent that is able to provide feedback as an attentive/supportive as well as attentive/sceptical listener. The resulting models were realised in a robot which was evaluated by third-party observers.

Place, publisher, year, edition, pages
International Speech Communication Association, 2017
Series
Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, ISSN 2308-457X ; 2017
National Category
Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-268357 (URN); 10.21437/Interspeech.2017-926 (DOI); 000457505000181 (); 2-s2.0-85028998444 (Scopus ID); 978-1-5108-4876-4 (ISBN)
Conference
18th Annual Conference of the International Speech Communication Association, INTERSPEECH 2017; Stockholm; Sweden; 20 August 2017 through 24 August 2017
Note

QC 20200703

Available from: 2020-02-18. Created: 2020-02-18. Last updated: 2025-02-07. Bibliographically approved
Jonell, P., Mendelson, J., Storskog, T., Hagman, G., Östberg, P., Leite, I., . . . Kjellström, H. (2017). Machine Learning and Social Robotics for Detecting Early Signs of Dementia.
2017 (English). Other (Other academic)
National Category
Geriatrics
Identifiers
urn:nbn:se:kth:diva-268358 (URN)
Note

QC 20200703

Available from: 2020-02-18. Created: 2020-02-18. Last updated: 2022-06-26. Bibliographically approved
Mendelson, J., Oplustil, P., Watts, O. & King, S. (2017). Nativization of foreign names in TTS for automatic reading of world news in Swahili. In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH 2017. Paper presented at 18th Annual Conference of the International Speech Communication Association, INTERSPEECH 2017, Stockholm, Sweden, 20 August 2017 through 24 August 2017 (pp. 2188-2192). International Speech Communication Association, 2017
2017 (English). In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH 2017, International Speech Communication Association, 2017, Vol. 2017, p. 2188-2192. Conference paper, Published paper (Refereed)
Abstract [en]

When a text-to-speech (TTS) system is required to speak world news, a large fraction of the words to be spoken will be proper names originating in a wide variety of languages. Phonetization of these names based on target language letter-to-sound rules will typically be inadequate. This is detrimental not only during synthesis, when inappropriate phone sequences are produced, but also during training, if the system is trained on data from the same domain. This is because poor phonetization during forced alignment based on hidden Markov models can pollute the whole model set, resulting in degraded alignment even of normal target-language words. This paper presents four techniques designed to address this issue in the context of a Swahili TTS system: automatic transcription of proper names based on a lexicon from a better-resourced language; the addition of a parallel phone set and special part-of-speech tag exclusively dedicated to proper names; a manually-crafted phone mapping which allows substitutions for potentially more accurate phones in proper names during forced alignment; and the addition in proper names of a grapheme-derived frame-level feature, supplementing the standard phonetic inputs to the acoustic model. We present results from objective and subjective evaluations of systems built using these four techniques.

Place, publisher, year, edition, pages
International Speech Communication Association, 2017
Series
Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, ISSN 2308-457X ; 2017
Keywords
Code-switching, Multi-lingual speech synthesis, Speech synthesis, Text processing, TTS, Under-resourced languages
National Category
Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-222073 (URN); 10.21437/Interspeech.2017-1398 (DOI); 000457505000457 (); 2-s2.0-85039169583 (Scopus ID)
Conference
18th Annual Conference of the International Speech Communication Association, INTERSPEECH 2017, Stockholm, Sweden, 20 August 2017 through 24 August 2017
Note

QC 20180131

Available from: 2018-01-31. Created: 2018-01-31. Last updated: 2025-02-07. Bibliographically approved
Szekely, E., Mendelson, J. & Gustafson, J. (2017). Synthesising uncertainty: The interplay of vocal effort and hesitation disfluencies. In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH. Paper presented at 18th Annual Conference of the International Speech Communication Association, INTERSPEECH 2017, Stockholm, Sweden, 20 August 2017 through 24 August 2017 (pp. 804-808). International Speech Communication Association, 2017
2017 (English). In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, International Speech Communication Association, 2017, Vol. 2017, p. 804-808. Conference paper, Published paper (Refereed)
Abstract [en]

As synthetic voices become more flexible, and conversational systems gain more potential to adapt to the environmental and social situation, the question of how different modifications to the synthetic speech interact with each other, and how their specific combinations influence perception, needs to be examined. This work investigates how the vocal effort of the synthetic speech, together with added disfluencies, affects listeners' perception of the degree of uncertainty in an utterance. We introduce a DNN voice built entirely from spontaneous conversational speech data and capable of producing a continuum of vocal efforts, prolongations and filled pauses with a corpus-based method. Results of a listener evaluation indicate that decreased vocal effort, filled pauses and prolongation of function words increase the degree of perceived uncertainty of conversational utterances expressing the speaker's beliefs. We demonstrate that the effects of these three cues are not merely additive, but that interaction effects, in particular between the two types of disfluencies and between vocal effort and prolongations, need to be considered when aiming to communicate a specific level of uncertainty. The implications of these findings are relevant for adaptive and incremental conversational systems using expressive speech synthesis and aspiring to communicate the attitude of uncertainty.

Place, publisher, year, edition, pages
International Speech Communication Association, 2017
Series
Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, ISSN 2308-457X
Keywords
Conversational Systems, Disfluencies, Speech Synthesis, Uncertainty, Vocal Effort
National Category
Media and Communication Studies
Identifiers
urn:nbn:se:kth:diva-220749 (URN); 10.21437/Interspeech.2017-1507 (DOI); 000457505000163 (); 2-s2.0-85039172286 (Scopus ID)
Conference
18th Annual Conference of the International Speech Communication Association, INTERSPEECH 2017, Stockholm, Sweden, 20 August 2017 through 24 August 2017
Note

QC 20180105

Available from: 2018-01-05. Created: 2018-01-05. Last updated: 2025-02-17. Bibliographically approved