Subtle temporal and spectral differences between categorical realizations of para-linguistic phenomena (e.g. affective vocal expressions) are hard to capture and describe. In this paper we present a signal representation based on Time-Varying Constant-Q Cepstral Coefficients (TVCQCC) derived for this purpose. A method which utilizes the special properties of the constant-Q transform for mean F0 estimation and normalization is described. The coefficients are invariant to utterance length, and as a special case, a representation for prosody is considered. Speaker-independent classification results using nu-SVM on the Berlin EMO-DB and two closed sets of basic (anger, disgust, fear, happiness, sadness, neutral) and social/interpersonal (affection, pride, shame) emotions recorded by forty professional actors from two English dialect areas are reported. The accuracy for the Berlin EMO-DB is 71.2%; for the first set, including basic emotions, the accuracy was 44.6%, and for the second set, including basic and social emotions, 31.7%. It was found that F0 normalization boosts the performance and that a combined feature set shows the best performance.
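The utterance-length invariance of such cepstral features can be illustrated with a minimal sketch: cepstral coefficients are obtained by a DCT over the frequency axis of a log constant-Q spectrogram, and a second, truncated DCT over the time axis yields a fixed-size vector regardless of the number of frames. This is a hypothetical illustration of the general idea, not the paper's exact TVCQCC derivation; the function name `tvcqcc` and the truncation sizes are assumptions.

```python
import numpy as np
from scipy.fft import dct

def tvcqcc(log_cqt, n_cep=12, n_time=8):
    """Sketch of a time-varying cepstral representation (hypothetical).

    log_cqt: log-magnitude constant-Q spectrogram, shape (bins, frames).
    Returns a fixed-length feature vector of size n_cep * n_time.
    """
    # DCT-II along the frequency axis gives cepstral coefficients per frame
    cep = dct(log_cqt, type=2, axis=0, norm='ortho')[:n_cep, :]
    # A second DCT along the time axis, truncated to n_time terms, makes
    # the representation invariant to utterance length (frame count)
    tv = dct(cep, type=2, axis=1, norm='ortho')[:, :n_time]
    return tv.flatten()

# Two utterances of different lengths map to vectors of the same size
short = np.random.randn(48, 100)   # 48 CQT bins, 100 frames
long_ = np.random.randn(48, 350)   # 48 CQT bins, 350 frames
assert tvcqcc(short).shape == tvcqcc(long_).shape == (12 * 8,)
```

Because the output dimension is fixed, such vectors can be fed directly to a standard classifier such as an SVM without per-utterance length normalization.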
In this paper, a generic solution is presented for capturing, representing and employing the context of use in dialogue processing. The implementation of the solution within the framework of the SesaME dialogue manager and the Butler demonstrator is also described.
This paper introduces a general approach for binary classification of audiovisual data. The intended application is mispronunciation detection for specific phonemic errors, using very sparse training data. The system uses a Support Vector Machine (SVM) classifier with features obtained from a Time-Varying Discrete Cosine Transform (TV-DCT) on the audio log-spectrum as well as on the image sequences. The concatenated feature vectors from both the modalities were reduced to a very small subset using a combination of feature selection methods. We achieved 95-100% correct classification for each pair-wise classifier on a database of Swedish vowels with an average of 58 instances per vowel for training. The performance was largely unaffected when tested on data from a speaker who was not included in the training.
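Reducing a concatenated audiovisual feature vector to a small discriminative subset can be sketched with a simple ranking criterion. The Fisher score used below is a stand-in for illustration only; the paper combines several feature-selection methods, and the toy data, shapes, and function name are assumptions.

```python
import numpy as np

def fisher_score(X, y):
    """Rank features for a two-class problem by the Fisher criterion:
    between-class mean separation over within-class variance.
    X: (instances, features); y: binary labels in {0, 1}."""
    X0, X1 = X[y == 0], X[y == 1]
    num = (X0.mean(axis=0) - X1.mean(axis=0)) ** 2
    den = X0.var(axis=0) + X1.var(axis=0) + 1e-12  # avoid division by zero
    return num / den

# Toy stand-in for concatenated audio + video TV-DCT features
rng = np.random.default_rng(0)
X = rng.normal(size=(58, 200))   # 58 instances, 200 concatenated features
y = np.repeat([0, 1], 29)
X[y == 1, :5] += 2.0             # make the first 5 features discriminative

scores = fisher_score(X, y)
top = np.argsort(scores)[::-1][:5]   # keep only the best few features
```

The selected subset would then be used to train a pair-wise SVM classifier; with very sparse training data, aggressive selection like this helps avoid overfitting.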
Research suggests that blind people are superior to sighted in echolocation, but systematic psychoacoustic studies on environmental conditions such as distance to objects, signal duration, and reverberation are lacking. Therefore, two experiments were conducted. Noise bursts of 5, 50, or 500 ms were reproduced by a loudspeaker on an artificial manikin in an ordinary room and in an anechoic chamber. The manikin recorded the sounds binaurally in the presence and absence of a reflecting 1.5-mm thick aluminium disk, 0.5 m in diameter, placed in front, at distances of 0.5 to 5 m. These recordings were later presented to ten visually handicapped and ten sighted people, 30-62 years old, using a 2AFC paradigm with feedback. The task was to detect which of two sounds contained the reflecting object. The blind performed better than the sighted participants. All performed well with the object at < 2 m distance. Detection increased with longer signal durations. Performance was slightly better in the ordinary room than in the anechoic chamber. A supplementary experiment on the two best blind persons showed that their superior performance at distances > 2 m was not by chance. Detection thresholds showed that blind participants could detect the object at longer distances in the conference room than in the anechoic chamber, when using the longer-duration sounds and also as compared to the sighted people. Audiometric tests suggest that equal hearing in both ears is important for echolocation. Possible echolocation mechanisms are discussed.
Blind persons emit sounds to detect objects by echolocation. Both perceived pitch and perceived loudness of the emitted sound change as it fuses with the reflections from nearby objects. Blind persons generally are better than sighted at echolocation, but it is unclear whether this superiority is related to detection of pitch, loudness, or both. We measured the ability of twelve blind and twenty-five sighted listeners to determine which of two sounds, 500 ms noise bursts, had been recorded in the presence of a reflecting object in a room with reflecting walls using an artificial head. The sound pairs were original recordings differing in both pitch and loudness, or manipulated recordings with either the pitch or the loudness information removed. Observers responded using a 2AFC method with verbal feedback. For both blind and sighted listeners the performance declined more with the pitch information removed than with the loudness information removed. In addition, the blind performed clearly better than the sighted as long as the pitch information was present, but not when it was removed. Taken together, these results show that the ability to detect pitch is a main factor underlying high performance in human echolocation.
This paper presents a data-driven decision-theoretic approach to making grounding decisions in spoken dialogue systems, i.e., to decide which recognition hypotheses to consider as correct and which grounding action to take. Based on task analysis of the dialogue domain, cost functions are derived, which take dialogue efficiency, consequence of task failure and information gain into account. Dialogue data is then used to estimate speech recognition confidence thresholds that are dependent on the dialogue context.
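The decision-theoretic idea can be sketched as choosing, for each recognition hypothesis, the grounding action with the lowest expected cost given the recognizer's confidence. The cost values and action set below are purely illustrative assumptions, not the paper's estimated cost functions, which are derived from task analysis and dialogue data.

```python
def grounding_action(confidence, cost_clarify=1.0, cost_task_failure=10.0):
    """Pick the grounding action minimizing expected cost (toy sketch).

    confidence: recognizer confidence in [0, 1] that the hypothesis is correct.
    All cost constants are hypothetical placeholders.
    """
    costs = {
        # Accepting a wrong hypothesis risks downstream task failure
        'accept':  (1 - confidence) * cost_task_failure,
        # A clarification question costs an extra turn but catches most errors
        'clarify': cost_clarify + (1 - confidence) * 0.5,
        # Rejecting forces a full re-prompt, wasting a possibly correct hypothesis
        'reject':  confidence * 2.0 + 1.0,
    }
    return min(costs, key=costs.get)

# High confidence -> accept; middling -> clarify; very low -> reject
assert grounding_action(0.95) == 'accept'
assert grounding_action(0.50) == 'clarify'
assert grounding_action(0.05) == 'reject'
```

The crossover points between the three actions act as confidence thresholds; in the paper these thresholds are estimated from dialogue data and depend on the dialogue context rather than being fixed constants as here.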
This paper describes the contribution to speech perception given by animations of intra-oral articulations. 18 subjects were asked to identify the words in acoustically degraded sentences in three different presentation modes: acoustic signal only, audiovisual with a front view of a synthetic face, and audiovisual with both a front face view and a side view, where tongue movements were visible by making parts of the cheek transparent. The augmented reality side view did not help subjects perform better overall than with the front view only, but it seems to have been beneficial for the perception of palatal plosives, liquids and rhotics, especially in clusters. The results indicate that it cannot be expected that intra-oral animations support speech perception in general, but that information on some articulatory features can be extracted. Animations of tongue movements hence have more potential for use in computer-assisted pronunciation and perception training than as a communication aid for the hearing-impaired.
A series of novel capabilities have been designed to extend the repertoire of Ville, a virtual language teacher for Swedish, created at the Centre for Speech Technology at KTH. These capabilities were tested by twenty-seven language students at KTH. This paper reports on qualitative surveys and quantitative performance from these sessions, which suggest some general lessons for automated language training.
This paper describes two systems using embodied conversational agents (ECAs) for language learning. The first system, called Ville, is a virtual language teacher for vocabulary and pronunciation training. The second system, a dialogue system called DEAL, is a role-playing game for practicing conversational skills. Whereas DEAL acts as a conversational partner with the objective of creating and keeping an interesting dialogue, Ville takes the role of a teacher who guides, encourages and gives feedback to the students.