Perception of charisma, the ability to influence others by virtue of one's personal qualities, appears to be influenced to some extent by cultural factors. We compare results of five studies of charisma speech in which American, Palestinian, and Swedish subjects rated Standard American English political speech and Americans and Palestinians rated Palestinian Arabic speech. We identify acoustic-prosodic and lexical features correlated with charisma ratings of both languages for native and non-native speakers and find that 1) some acoustic-prosodic features correlated with charisma ratings appear similar across all five experiments; 2) other acoustic-prosodic and lexical features correlated with charisma appear specific to the language rated, whatever the native language of the rater; and 3) still other acoustic-prosodic cues appear specific to both rater native language and to language rated. We also find that, while the absolute ratings non-native raters assign tend to be lower than those of native speakers, the ratings themselves are strongly correlated.
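The kind of analysis described above can be pictured with a small, purely illustrative Python sketch: per-token acoustic-prosodic feature values are correlated with mean charisma ratings, and the native and non-native raters' per-token means are correlated with each other. All feature names and numbers below are invented for illustration and are not data from these studies.

```python
# Illustrative only: relate per-token acoustic-prosodic features to mean
# charisma ratings, and compare native vs. non-native rater agreement.
import numpy as np
from scipy.stats import pearsonr, spearmanr

# One value per rated speech token (all numbers invented).
features = {
    "mean_f0_hz":    np.array([118.0, 142.5, 130.2, 155.8, 125.1]),
    "f0_range_hz":   np.array([40.2, 75.6, 52.3, 88.1, 47.9]),
    "speaking_rate": np.array([4.1, 5.3, 4.8, 5.9, 4.5]),   # syllables per second
}
mean_rating = np.array([2.8, 3.9, 3.1, 4.4, 3.0])            # mean charisma rating per token

for name, values in features.items():
    r, p = pearsonr(values, mean_rating)
    print(f"{name:14s} r = {r:+.2f}  p = {p:.3f}")

# Rater-group agreement: per-token means from native and non-native raters.
native     = np.array([3.1, 4.0, 3.3, 4.5, 3.2])
non_native = np.array([2.6, 3.6, 2.9, 4.1, 2.8])             # lower overall, but correlated
rho, p = spearmanr(native, non_native)
print(f"native vs. non-native: rho = {rho:.2f}, p = {p:.3f}")
```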
This paper introduces the EU-FP7 project CLARIN, a joint effort of over 150 institutions in Europe, aimed at the creation of a sustainable language resources and technology infrastructure for the humanities and social sciences research community. The paper briefly introduces the vision behind the project and how it relates to speech research with a focus on the contributions that CLARIN can and will make to research in spoken language processing.
We and others have found it fruitful to assume that users, when interacting with spoken dialogue systems, perceive the systems and their actions metaphorically. Common metaphors include the human metaphor and the interface metaphor (cf. Edlund, Heldner, & Gustafson, 2006). In the interface metaphor, the spoken dialogue system is perceived as a machine interface – often but not always a computer interface. Speech is used to accomplish what would have otherwise been accomplished by some other means of input, such as a keyboard or a mouse. In the human metaphor, on the other hand, the computer is perceived as a creature (or even a person) with humanlike conversational abilities, and speech is not a substitute or one of many alternatives, but rather the primary means of communicating with this creature. We are aware that more “natural” or human-like behaviour does not automatically make a spoken dialogue system “better” (i.e. more efficient or more well-liked by its users). Indeed, we are quite convinced that the advantage (or disadvantage) of humanlike behaviour will be highly dependent on the application. However, a dialogue system that is coherent with a human metaphor may profit from a number of characteristics.
This paper reports on a study that explores to what extent listeners are able to judge where a particular utterance fragment is located in a speaker's pitch range. The research consists of a perception study that makes use of 100 stimuli, selected from 50 different speakers whose speech was originally collected for a multi-speaker database of Swedish speech materials. The fragments are presented to subjects who are asked to estimate whether the fragment is located in the lower or higher part of that speaker's range. Results reveal that listeners' judgments are dependent on the gender of the speaker, but that within a gender they tend to hear differences in range.
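As a purely illustrative sketch of what "location in a speaker's pitch range" can mean operationally, the fragment below places a fragment's median f0 within the speaker's overall range on a 0–1 scale, assuming f0 contours have already been extracted. The percentile bounds, semitone reference, and all values are assumptions, not details of the study.

```python
# Illustrative only: place a fragment's median f0 within the speaker's
# overall f0 range, expressed on a 0-1 scale (0 = bottom, 1 = top).
import numpy as np

def semitones(hz):
    """Convert Hz to semitones relative to 100 Hz."""
    return 12 * np.log2(np.asarray(hz, dtype=float) / 100.0)

def range_position(fragment_f0_hz, speaker_f0_hz):
    """Fragment median f0 as a position within the speaker's 5th-95th
    percentile range, computed on a semitone scale."""
    lo, hi = np.percentile(semitones(speaker_f0_hz), [5, 95])
    frag = np.median(semitones(fragment_f0_hz))
    return float(np.clip((frag - lo) / (hi - lo), 0.0, 1.0))

# Hypothetical speaker: f0 values drawn around 180 Hz; fragment shifted upward.
speaker = np.random.default_rng(0).normal(180.0, 25.0, 500)
fragment = speaker[:20] + 30.0
print(f"range position: {range_position(fragment, speaker):.2f}")   # near 1 = high in range
```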
This paper is a report on current efforts at the Department of Speech, Music and Hearing, KTH, on data-driven multimodal synthesis including both visual speech synthesis and acoustic modeling. In this research we try to combine corpus-based methods with knowledge-based models and to explore the best of the two approaches. In the paper an attempt to build formant-synthesis systems based on both rule-generated and database-driven methods is presented. A pilot experiment is also reported showing that this approach can be a very interesting path to explore further. Two studies on visual speech synthesis are reported, one on data acquisition using a combination of motion capture techniques and one concerned with coarticulation, comparing different models.
In this chapter, we review some of the issues in rule-based synthesis and specifically discuss formant synthesis. Formant synthesis and the theory behind it have played an important role both in the scientific progress in understanding how humans talk and in the development of the first speech technology applications. Its flexibility and small footprint make the approach still of interest and a valuable complement to the currently dominant methods based on concatenative, data-driven synthesis. As already mentioned in the overview by Schroeter (Chap. 19), we also see a new trend to combine the rule-based and data-driven approaches: formant features from a database can be used both to optimize a rule-based formant synthesis system and to optimize the search for good units in a concatenative system.
This paper describes our work on building a formant synthesis system based on both rule-generated and database-driven methods. Three parametric synthesis systems are discussed: our traditional rule-based system, a speaker-adapted system, and finally a gesture system. The gesture system is a further development of the adapted system in that it includes concatenated formant gestures from a data-driven unit library. The systems are evaluated technically, comparing the formant tracks with an analysed test corpus. The gesture system results in a 25% error reduction in the formant frequencies due to the inclusion of the stored gestures. Finally, a perceptual evaluation shows a clear advantage in naturalness for the gesture system compared to both the traditional system and the speaker-adapted system.
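The technical evaluation described above can be illustrated with a minimal sketch: a frame-wise formant-frequency error against reference tracks, compared between a rule-based system and a gesture system. The formant values and resulting percentage are invented and only show the form of the comparison, not the study's actual 25% result.

```python
# Illustrative only: frame-wise formant-frequency error against reference
# tracks, compared between a rule-based system and a gesture system.
import numpy as np

def formant_error(synth, ref):
    """Mean absolute formant-frequency error in Hz over all frames and formants.
    Both arguments: arrays of shape (n_frames, n_formants)."""
    return float(np.mean(np.abs(np.asarray(synth) - np.asarray(ref))))

# Invented F1-F3 tracks (Hz) for three frames of a reference utterance.
ref     = np.array([[520, 1480, 2450], [510, 1500, 2480], [505, 1530, 2500]])
rule    = ref + np.array([[60, -120, 150], [55, -110, 140], [65, -130, 160]])
gesture = ref + np.array([[40,  -80, 100], [35,  -75,  95], [45,  -85, 110]])

e_rule, e_gesture = formant_error(rule, ref), formant_error(gesture, ref)
print(f"rule-based: {e_rule:.0f} Hz, gesture: {e_gesture:.0f} Hz, "
      f"error reduction: {100 * (1 - e_gesture / e_rule):.0f}%")
```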
The current study investigates acoustic correlates to perceived hesitation based on previous work showing that pause duration and final lengthening both contribute to the perception of hesitation. It is the total duration increase that is the valid cue rather than the contribution by either factor. The present experiment using speech synthesis was designed to evaluate F0 slope and presence vs. absence of creaky voice before the inserted hesitation in addition to durational cues. The manipulations occurred in two syntactic positions, within a phrase and between two phrases, respectively. The results showed that in addition to durational increase, variation of both F0 slope and creaky voice had perceptual effects, although to a much lesser degree. The results have a bearing on efforts to model spontaneous speech including disfluencies, to be explored, for example, in spoken dialogue systems.
In our efforts to model spontaneous speech for use in, for example, spoken dialogue systems, a series of experiments have been conducted in order to investigate correlates to perceived hesitation. Previous work has shown that it is the total duration increase that is the valid cue rather than the contribution by either of the two factors pause duration and final lengthening. In the present experiment we explored the effects of F0 slope variation and the presence vs. absence of creaky voice in addition to durational cues, using synthetic stimuli. The results showed that variation of both F0 slope and creaky voice did have perceptual effects, but to a much lesser degree than the durational increase.
The current work deals with the modelling of one type of disfluency, hesitations. A perceptual experiment using speech synthesis was designed to evaluate two duration features found to be correlates to hesitation, pause duration and final lengthening. A variation of F0 slope before the hesitation was also included. The most important finding is that it is the total duration increase that is the valid cue rather than the contribution by either factor. In addition, our findings lead us to assume an interaction with syntax. The absence of strong effects of the induced F0 variation was unexpected and we consider several possible explanations for this result.
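As a sketch of the durational manipulation discussed in the hesitation abstracts above, the fragment below enumerates stimulus conditions in which final lengthening and pause duration trade off while the total duration increase stays constant. The millisecond values and step count are illustrative assumptions, not the stimuli actually used.

```python
# Illustrative only: stimuli in which final lengthening and pause duration
# trade off while the total duration increase is held constant.
def hesitation_stimuli(total_increase_ms, steps=5):
    """Yield (final_lengthening_ms, pause_ms) pairs that sum to the same total."""
    for i in range(steps + 1):
        lengthening = round(total_increase_ms * i / steps)
        yield lengthening, total_increase_ms - lengthening

for lengthening, pause in hesitation_stimuli(400):
    print(f"lengthening {lengthening:3d} ms + pause {pause:3d} ms = 400 ms total increase")
```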
Studies of perceptually based predictions of upcoming prosodic boundaries in spontaneous Swedish speech, both by native speakers of Swedish and by native speakers of standard American English, reveal marked similarity in judgments. We examined whether Swedish and American listeners were able to predict the occurrence and strength of upcoming boundaries in a series of web-based perception experiments. Utterance fragments (in both long and short versions) were selected from a corpus of spontaneous Swedish speech, which was first labeled for boundary presence and strength by expert labelers. These fragments were then presented to listeners, who were instructed to guess whether or not they were followed by a prosodic break, and if so, what the strength of the break was. Results revealed that both Swedish and American listening groups were indeed able to predict whether or not a boundary (of a particular strength) followed the fragment. This suggests that acoustic and prosodic, rather than lexico-grammatical and semantic, information was being used by listeners as a primary cue. Acoustic and prosodic correlates of these judgments were then examined, with significant correlations found between judgments and the presence/absence of final creak and phrase-final f0 level and slope.
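One of the acoustic correlates mentioned above, phrase-final f0 slope, can be estimated roughly as in the following sketch: a least-squares line fit over the final voiced portion of a fragment's f0 contour. The contour here is synthetic, and the 200 ms window is an assumption rather than the analysis window used in the study.

```python
# Illustrative only: phrase-final f0 slope, estimated by a least-squares
# line fit over the last 200 ms of voiced f0 (window size is an assumption).
import numpy as np

def final_f0_slope(times_s, f0_hz, window_s=0.2):
    """Slope in Hz/s of a line fit to the last `window_s` seconds of voiced f0.
    Unvoiced frames are marked with f0 = 0. Returns None if too few frames."""
    times_s, f0_hz = np.asarray(times_s), np.asarray(f0_hz)
    voiced = f0_hz > 0
    t, f = times_s[voiced], f0_hz[voiced]
    if t.size < 2:
        return None
    keep = t >= t[-1] - window_s
    if keep.sum() < 2:
        return None
    slope, _intercept = np.polyfit(t[keep], f[keep], 1)
    return float(slope)

# Invented contour: steady fall over 900 ms, then 100 ms unvoiced (creak/silence).
t = np.arange(0.0, 1.0, 0.01)                      # 10 ms frames
f0 = np.where(t < 0.9, 200.0 - 80.0 * t, 0.0)
print(f"final f0 slope: {final_f0_slope(t, f0):.0f} Hz/s")   # negative = falling
```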
We discuss perception studies of two low-level indicators of discourse phenomena by Swedish, Japanese, and Chinese native speakers. Subjects were asked to identify upcoming prosodic boundaries and disfluencies in Swedish spontaneous speech. We hypothesize that speakers of prosodically unrelated languages should be less able to predict upcoming phrase boundaries but potentially better able to identify disfluencies, since indicators of disfluency are more likely to depend upon lexical as well as acoustic information. Surprisingly, however, we found that both phenomena were fairly well recognized by native and non-native speakers, with some possible interference from word tones for the Chinese subjects.
We describe results of a study of perceptually based predictions of upcoming prosodic breaks in spontaneous Swedish speech materials by native speakers of Swedish and of standard American English. The question addressed here is the extent to which listeners are able, on the basis of acoustic and prosodic features, to predict the occurrence of upcoming boundaries, and if so, whether they are able to distinguish different degrees of boundary strength. An experiment was conducted in which spontaneous utterance fragments (both long and short versions) were presented to listeners, who were instructed to guess whether or not the fragments were followed by a prosodic break, and if so, what the strength of the break was, where boundary presence and strength had been independently labeled. Results revealed that both listening groups were indeed able to predict whether or not a boundary (of a particular strength) followed the fragment, suggesting that prosodic rather than lexico-grammatical information was being used as a primary cue.
In this paper, an overview of the Higgins project and the research within the project is presented. The project incorporates studies of error handling for spoken dialogue systems on several levels, from processing to dialogue level. A domain in which a range of different error types can be studied has been chosen: pedestrian navigation and guiding. Several data collections within Higgins have been analysed along with data from Higgins' predecessor, the AdApt system. The error handling research issues in the project are presented in light of these analyses.
Most current work on spoken human-computer interaction has so far concentrated on interactions between a single user and a dialogue system. The advent of ideas of the computer or dialogue system as a conversational partner in a group of humans, for example within the CHIL project and elsewhere (e.g. Kirchhoff & Ostendorf, 2003), introduces new requirements on the capabilities of the dialogue system. Among other things, the computer as a participant in a multi-party conversation has to appreciate the human turn-taking system in order to time its own interjections appropriately. As the role of a conversational computer is likely to be to support human collaboration, rather than to guide or control it, it is particularly important that it does not interrupt or disturb the human participants. The ultimate goal of the work presented here is to predict suitable places for turn-taking, as well as positions where it is impossible for a conversational computer to interrupt without irritating the human interlocutors.
We propose a novel human-robot-interaction framework for robust visual scene understanding. Without any a priori knowledge about the objects, the task of the robot is to correctly enumerate how many of them are in the scene and segment them from the background. Our approach builds on top of state-of-the-art computer vision methods, generating object hypotheses through segmentation. This process is combined with a natural dialog system, thus including a ‘human in the loop’ where, by exploiting the natural conversation of an advanced dialog system, the robot gains knowledge about ambiguous situations. We present an entropy-based system allowing the robot to detect the poorest object hypotheses and query the user for arbitration. Based on the information obtained from the human-robot dialog, the scene segmentation can be re-seeded and thereby improved. We present experimental results on real data that show an improved segmentation performance compared to segmentation without interaction.
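The entropy-based arbitration described above can be sketched minimally: each object hypothesis carries a distribution over possible interpretations, and the robot queries the user about the hypothesis whose distribution has the highest entropy. The hypothesis names, interpretation categories, and probabilities below are invented for illustration and are not the system's actual representation.

```python
# Illustrative only: each object hypothesis carries a distribution over
# possible interpretations; the robot asks about the most uncertain one.
import math

def entropy(probs):
    """Shannon entropy in bits of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Invented hypotheses: P(one object), P(two objects), P(background).
hypotheses = {
    "hyp_A": [0.92, 0.05, 0.03],   # confident segmentation
    "hyp_B": [0.40, 0.35, 0.25],   # ambiguous: good candidate for a user query
    "hyp_C": [0.70, 0.20, 0.10],
}

worst = max(hypotheses, key=lambda name: entropy(hypotheses[name]))
print(f"query the user about {worst} (H = {entropy(hypotheses[worst]):.2f} bits)")
```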
This paper describes a recently started interdisciplinary research program aiming at investigating and modelling fundamental aspects of the language acquisition process. The working hypothesis assumes that general purpose perception and memory processes, common to both human and other mammalian species, along with the particular context of initial adult-infant interaction, underlie the infant’s ability to progressively derive linguistic structure implicitly available in the ambient language. The project is conceived as an interdisciplinary research effort involving the areas of Phonetics, Psychology and Speech recognition. Experimental speech perception techniques will be used at Dept. of Linguistics, SU, to investigate the development of the infant’s ability to derive linguistic information from situated connected speech. These experiments will be matched by behavioural tests of animal subjects, carried out at CMU, Pittsburgh, to disclose the potential significance that recurrent multi-sensory properties of the stimuli may have for spontaneous category formation. Data from infant and child vocal productions as well as infant-adult interactions will also be collected and analyzed to address the possibility of a production-perception link. Finally, the data from the infant and animal studies will be integrated and tested in mathematical models of the language acquisition process, developed at TMH, KTH.
This paper presents the current status of the research in the Higgins project and provides background for a demonstration of the spoken dialogue system implemented within the project. The project represents the latest development in the ongoing dialogue systems research at KTH. The practical goal of the project is to build collaborative conversational dialogue systems in which research issues such as error handling techniques can be tested empirically.