We have developed two different methods for using auditory, telephone speech to drive the movements of a synthetic face. In the first method, Hidden Markov Models (HMMs) were trained on a phonetically transcribed telephone speech database. The output of the HMMs was then fed into a rule-based visual speech synthesizer as a string of phonemes together with time labels. In the second method, Artificial Neural Networks (ANNs) were trained on the same database to map acoustic parameters directly to facial control parameters. These target parameter trajectories were generated by using phoneme strings from a database as input to the visual speech synthesizer. The two methods were evaluated through audiovisual intelligibility tests with ten hearing impaired persons, and compared to “ideal” articulations (where no recognition was involved), a natural face, and to the intelligibility of the audio alone. It was found that the HMM method performs considerably better than the audio-alone condition (54% and 34% keywords correct, respectively), but not as well as the “ideal” articulating artificial face (64%). The intelligibility for the ANN method was 34% keywords correct.
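As an illustration of the second method, the sketch below shows a frame-wise acoustic-to-visual regression of the kind described above, using a small multilayer perceptron. The feature type, dimensionalities and network size are illustrative assumptions, and placeholder data stands in for the transcribed telephone speech database; this is not the configuration used in the study.

```python
# Minimal sketch (not the study's implementation): an ANN that maps
# frame-level acoustic features to facial control parameters.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Placeholder data: 2 000 frames of 24-dim acoustic features and
# 10-dim facial control parameters (e.g. jaw opening, lip rounding).
acoustic_feats = rng.normal(size=(2_000, 24))
facial_params = rng.normal(size=(2_000, 10))

# One hidden layer; the network learns a frame-wise acoustic-to-visual mapping.
net = MLPRegressor(hidden_layer_sizes=(128,), max_iter=300, random_state=0)
net.fit(acoustic_feats, facial_params)

# At synthesis time, each incoming acoustic frame is mapped to facial
# control parameters that drive the synthetic face.
new_frames = rng.normal(size=(5, 24))
print(net.predict(new_frames).shape)  # (5, 10)
```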
This paper presents a setup which employs virtual animated agents for robotic heads. The system uses a laser projector to project animated faces onto a three dimensional face mask. Projecting animated faces onto a three dimensional head surface, as an alternative to using flat, two dimensional surfaces, eliminates several deteriorating effects and illusions that come with flat surfaces for interaction purposes such as exclusive mutual gaze and situated, multi-partner dialogues. In addition, it provides robotic heads with a flexible solution for facial animation which takes advantage of the advances of computer-graphics facial animation over mechanically controlled heads.
This is a condensed presentation of some recent work on a back-projected robotic head for multi-party interaction in public settings. We will describe some of the design strategies and give some preliminary analysis of an interaction database collected at the Robotville exhibition at the London Science Museum.
We introduce an approach to using animated faces for robotics where a static physical object is used as a projection surface for an animation. The talking head is projected onto a 3D physical head model. In this chapter we discuss the different benefits this approach adds over mechanical heads. After that, we investigate a phenomenon commonly referred to as the Mona Lisa gaze effect. This effect results from the use of 2D surfaces to display 3D images and causes the gaze of a portrait to seemingly follow the observer no matter where it is viewed from. The experiment investigates the perception of gaze direction by observers. The analysis shows that the 3D model eliminates the effect and provides an accurate perception of gaze direction. At the end, we discuss the different requirements of gaze in interactive systems and explore the different settings these findings give access to.
Auditory prominence refers to an acoustic segment being made salient in its context. Prominence is one of the prosodic functions that has been shown to be strongly correlated with facial movements. In this work, we investigate the effects of facial prominence cues, in terms of gestures, when synthesized on animated talking heads. In the first study, a speech intelligibility experiment is conducted: speech quality is acoustically degraded and the fundamental frequency is removed from the signal, and the speech is then presented to 12 subjects through a lip-synchronized talking head carrying head-nod and eyebrow-raising gestures synchronized with the auditory prominence. The experiment shows that presenting prominence as facial gestures significantly increases speech intelligibility compared to when these gestures are randomly added to speech. We also present a follow-up study examining the perception of the behavior of the talking head when gestures are added over pitch accents. Using eye-gaze tracking technology and questionnaires with 10 moderately hearing impaired subjects, the gaze data show that users look at the face in a similar fashion to when they look at a natural face when gestures are coupled with pitch accents, as opposed to when the face carries no gestures. The questionnaire results also show that these gestures significantly increase the naturalness and the understanding of the talking head.
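To make the two gesture conditions concrete, here is a minimal sketch, under assumed data structures, of how gesture events could be placed either at auditory prominence times or at random onsets; it is an illustration, not the system used in the experiment.

```python
# Illustrative sketch of the two conditions compared in the study:
# gestures aligned with auditory prominence versus gestures at random times.
import random

def gestures_at_prominence(prominence_times, gesture="head_nod"):
    """One gesture event per auditory prominence, synchronized in time."""
    return [(t, gesture) for t in prominence_times]

def gestures_at_random(n_gestures, utterance_duration, gesture="head_nod", seed=0):
    """Control condition: the same number of gestures at random onsets."""
    rng = random.Random(seed)
    return sorted((rng.uniform(0.0, utterance_duration), gesture)
                  for _ in range(n_gestures))

# Example: prominences detected at 0.42 s, 1.10 s and 1.87 s in a 2.5 s utterance.
prominences = [0.42, 1.10, 1.87]
print(gestures_at_prominence(prominences))
print(gestures_at_random(len(prominences), 2.5))
```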
In the four days of the Robotville exhibition at the London Science Museum, UK, during which the back-projected head Furhat in a situated spoken dialogue system was seen by almost 8 000 visitors, we collected a database of 10 000 utterances spoken to Furhat in situated interaction. The data collection is an example of a particular kind of corpus collection of human-machine dialogues in public spaces that has several interesting and specific characteristics, both with respect to the technical details of the collection and with respect to the resulting corpus contents. In this paper, we take the Furhat data collection as a starting point for a discussion of the motives for this type of data collection, its technical peculiarities and prerequisites, and the characteristics of the resulting corpus.
In this chapter, we investigate the effects of facial prominence cues, in terms of gestures, when synthesized on animated talking heads. In the first study, a speech intelligibility experiment is conducted, where speech quality is acoustically degraded and the speech is then presented to 12 subjects through a lip-synchronized talking head carrying head-nod and eyebrow-raising gestures. The experiment shows that perceiving visual prominence as gestures, synchronized with the auditory prominence, significantly increases speech intelligibility compared to when these gestures are randomly added to speech. We also present a study examining the perception of the behavior of the talking head when gestures are added at pitch movements. Using eye-gaze tracking technology and questionnaires with 10 moderately hearing impaired subjects, the gaze data show that users look at the face in a similar fashion to when they look at a natural face when gestures are coupled with pitch movements, as opposed to when the face carries no gestures. The questionnaire results also show that these gestures significantly increase the naturalness and helpfulness of the talking head.
In this chapter, we first present a summary of findings from two previous studies on the limitations of using flat displays with embodied conversational agents (ECAs) in the contexts of face-to-face human-agent interaction. We then motivate the need for a three dimensional display of faces to guarantee accurate delivery of gaze and directional movements and present Furhat, a novel, simple, highly effective, and human-like back-projected robot head that utilizes computer animation to deliver facial movements, and is equipped with a pan-tilt neck. After presenting a detailed summary on why and how Furhat was built, we discuss the advantages of using optically projected animated agents for interaction. We discuss using such agents in terms of situatedness, environment, context awareness, and social, human-like face-to-face interaction with robots where subtle nonverbal and social facial signals can be communicated. At the end of the chapter, we present a recent application of Furhat as a multimodal multiparty interaction system that was presented at the London Science Museum as part of a robot festival. We conclude the chapter by discussing future developments, applications and opportunities of this technology.
SynFace is a lip-synchronized talking agent which is optimized as a visual reading support for the hearing impaired. In this paper we present the large scale hearing impaired user studies carried out for three languages in the Hearing at Home project. The user tests focus on measuring the gain in Speech Reception Threshold in Noise and the effort scaling when using SynFace by hearing impaired people, where groups of hearing impaired subjects with different impairment levels from mild to severe and cochlear implants are tested. Preliminary analysis of the results does not show a significant gain in SRT or in effort scaling. However, given the large cross-subject variability in both tests, it is clear that many subjects benefit from SynFace, especially with speech in stereo babble noise.
In this paper we present recent results on the development of the SynFace lip synchronized talking head towards multilinguality, varying signal conditions and noise robustness in the Hearing at Home project. We then describe the large scale hearing impaired user studies carried out for three languages. The user tests focus on measuring the gain in Speech Reception Threshold in Noise when using SynFace, and on measuring the effort scaling when using SynFace by hearing impaired people. Preliminary analysis of the results does not show significant gain in SRT or in effort scaling. But looking at inter-subject variability, it is clear that many subjects benefit from SynFace especially with speech with stereo babble noise.
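For readers unfamiliar with the Speech Reception Threshold measure, the sketch below outlines a simple adaptive up-down procedure that converges on the SNR giving roughly 50% sentence intelligibility. The step size, track length and simulated listener are assumptions made for illustration; this is not the test protocol used in the user studies.

```python
# Hedged sketch of an adaptive SRT measurement: lower the SNR after a correct
# response, raise it after an incorrect one, and average the end of the track.
import random

def measure_srt(present_sentence, start_snr_db=10.0, step_db=2.0, n_trials=20):
    """present_sentence(snr_db) -> True if the sentence was repeated correctly."""
    snr, track = start_snr_db, []
    for _ in range(n_trials):
        correct = present_sentence(snr)
        track.append(snr)
        snr += -step_db if correct else step_db
    return sum(track[-10:]) / 10  # SRT estimate: mean SNR over the last trials

def simulated_listener(snr_db):
    # Logistic psychometric function with its 50% point at 2 dB SNR (assumed).
    return random.random() < 1.0 / (1.0 + 10 ** ((2.0 - snr_db) / 4.0))

random.seed(0)
print(round(measure_srt(simulated_listener), 1))
```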
Our recent work within the research project SIMULEKT (Simulating Intonational Varieties of Swedish) involves a pilot perception test, used for detecting tendencies in human clustering of Swedish dialects. 30 Swedish listeners were asked to identify the geographical origin of 72 Swedish native speakers by clicking on a map of Sweden. Results indicate for example that listeners from the south of Sweden are generally better at recognizing some major Swedish dialects than listeners from the central part of Sweden.
Our recent work within the research project SIMULEKT (Simulating Intonational Varieties of Swedish) includes two approaches. The first involves a pilot perception test, used for detecting tendencies in human clustering of Swedish dialects. 30 Swedish listeners were asked to identify the geographical origin of Swedish native speakers by clicking on a map of Sweden. Results indicate for example that listeners from the south of Sweden are better at recognizing some major Swedish dialects than listeners from the central part of Sweden, which includes the capital area. The second approach concerns a method for modelling intonation using the newly developed SWING (Swedish INtonation Generator) tool, where annotated speech samples are resynthesized with rule based intonation and audiovisually analysed with regards to the major intonational varieties of Swedish. We consider both approaches important in our aim to test and further develop the Swedish prosody model.
The aim of this paper is to present the multimodal speech corpora collected at KTH in the framework of the European project PF-Star, and to discuss some of the issues related to the analysis and implementation of human communicative and emotional visual correlates of speech in synthetic conversational agents. Two multimodal speech corpora have been collected by means of an opto-electronic system, which allows capturing the dynamics of emotional facial expressions with very high precision. The data has been evaluated through a classification test and the results show promising identification rates for the different acted emotions. These multimodal speech corpora will be a valuable source for gaining more knowledge about how speech articulation and communicative gestures are affected by the expression of emotions.
We share our experiences with integrating motion capture recordings in speech and dialogue research by describing (1) Spontal, a large project collecting 60 hours of video, audio and motion capture of spontaneous dialogues, with special attention to motion capture and its pitfalls; (2) a tutorial where we use motion capture, speech synthesis and an animated talking head to allow students to create an active listener; and (3) brief preliminary results in the form of visualizations of motion capture data over time in a Spontal dialogue. We hope that, given the lack of writings on the use of motion capture for speech research, these accounts will prove inspirational and informative.
This paper describes the role of speech and speech technology in the European project MonAMI, which aims at “mainstreaming accessibility in consumer goods and services, using advanced technologies to ensure equal access, independent living and participation for all”. It presents the Reminder, a prototype embodied conversational agent (ECA) which helps users to plan activities and to remember what to do. The prototype merges speech technology with other, existing technologies: Google Calendar and a digital pen and paper. The solution allows users to continue using a paper calendar in the manner they are used to, whilst the ECA provides notifications on what has been written in the calendar. Users may also ask questions such as “When was I supposed to meet Sara?” or “What’s on my schedule today?”
This demo paper presents the first version of the Reminder, a prototype ECA developed in the European project MonAMI, which aims at "mainstreaming accessibility in consumer goods and services, using advanced technologies to ensure equal access, independent living and participation for all". The Reminder helps users to plan activities and to remember what to do. The prototype merges ECA technology with other, existing technologies: Google Calendar and a digital pen and paper. This innovative combination of modalities allows users to continue using a paper calendar in the manner they are used to, whilst the ECA provides verbal notifications on what has been written in the calendar. Users may also ask questions such as "When was I supposed to meet Sara?" or "What's on my schedule today?"
We describe the MonAMI Reminder, a multimodal spoken dialogue system which can assist elderly and disabled people in organising and initiating their daily activities. Based on deep interviews with potential users, we have designed a calendar and reminder application which uses an innovative mix of an embodied conversational agent, digital pen and paper, and the web to meet the needs of those users as well as the current constraints of speech technology. We also explore the use of head pose tracking for interaction and attention control in human-computer face-to-face interaction.
Simultaneous measurements of tongue and facial motion, using a combination of electromagnetic articulography (EMA) and optical motion tracking, are analysed to improve the articulation of an animated talking head and to investigate the correlation between facial and vocal tract movement. The recorded material consists of VCV and CVC words and 270 short everyday sentences spoken by one Swedish subject. The recorded articulatory movements are re-synthesised by a parametrically controlled 3D model of the face and tongue, using a procedure involving minimisation of the error between measurement and model. Using linear estimators, tongue data is predicted from the face and vice versa, and the correlation between measurement and prediction is computed.
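The cross-prediction analysis can be illustrated with a short sketch: a linear estimator is fitted from facial data to tongue (EMA) data, and the per-channel correlation between measurement and prediction is computed. The data dimensions and the use of plain least squares are assumptions made for the example, not the exact procedure of the study.

```python
# Minimal sketch of predicting tongue data from face data with a linear
# estimator and computing the measurement-prediction correlation.
import numpy as np

rng = np.random.default_rng(1)

# Placeholder data: 5 000 synchronized frames, 30 face dims, 12 tongue dims.
face = rng.normal(size=(5_000, 30))
tongue = 0.5 * face[:, :12] + 0.1 * rng.normal(size=(5_000, 12))

# Fit tongue ~ face with a bias term (ordinary least squares).
X = np.hstack([face, np.ones((face.shape[0], 1))])
W, *_ = np.linalg.lstsq(X, tongue, rcond=None)
tongue_pred = X @ W

# Per-channel correlation between measured and predicted tongue trajectories.
corr = [np.corrcoef(tongue[:, i], tongue_pred[:, i])[0, 1]
        for i in range(tongue.shape[1])]
print(np.round(corr, 2))
```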
Speech and sounds are important sources of information in our everyday lives for communication with our environment, be it interacting with fellow humans or directing our attention to technical devices with sound signals. For hearing impaired persons this acoustic information must be supplemented or even replaced by cues using other senses. We believe that the most natural modality to use is the visual, since speech is fundamentally audiovisual and these two modalities are complementary. We are hence exploring how different visualization methods for speech and audio signals may support hearing impaired persons. The goal in this line of research is to allow the growing number of hearing impaired persons, children as well as the middle-aged and elderly, equal participation in communication. A number of visualization techniques are proposed and exemplified with applications for hearing impaired persons.
Today, accessibility in society is severely lacking when it comes to sign language interpreting. New technical advances in computer and animation technology, together with the last decade's research on synthetic sign language interpretation, mean that there are now new possibilities for finding technical solutions with the potential to considerably improve accessibility for sign language users, for certain types of services or situations. In Sweden there are today approximately 30,000 sign language users. The state of knowledge has developed considerably in recent years, both with regard to the understanding and description of sign language and the technical prerequisites for analysing, storing and generating sign language. In this chapter we describe the different technologies required to develop sign language technology. Over the last decade, research on sign language technology has gained momentum, and a number of international projects have been started. So far, only a few applications have become generally available. We give examples of both research projects and early applications, especially from Europe, where development has been very strong. The prospects for starting Swedish development in this area must be considered good. The knowledge prerequisites are excellent: technical expertise in language technology, multimodal recording and animation at KTH, among others, in combination with expert knowledge of Swedish Sign Language and sign language use at Stockholm University.
The use of animated talking agents is a novel feature of many multimodal spoken dialogue systems. The addition and integration of a virtual talking head has direct implications for the way in which users approach and interact with such systems. However, understanding the interactions between visual expressions, dialogue functions and the acoustics of the corresponding speech presents a substantial challenge. Some of the visual articulation is closely related to the speech acoustics, while there are other articulatory movements affecting speech acoustics that are not visible on the outside of the face. Many facial gestures used for communicative purposes do not affect the acoustics directly, but might nevertheless be connected on a higher communicative level in which the timing of the gestures could play an important role. This chapter looks into the communicative function of the animated talking agent, and its effect on intelligibility and the flow of the dialogue.
In this paper, we present measurements of visual, facial parameters obtained from a speech corpus consisting of short, read utterances in which focal accent was systematically varied. The utterances were recorded in a variety of expressive modes including Certain, Confirming, Questioning, Uncertain, Happy, Angry and Neutral. Results showed that in all expressive modes, words with focal accent are accompanied by a greater variation of the facial parameters than are words in non-focal positions. Moreover, interesting differences between the expressions in terms of different parameters were found.
In this paper, we present measurements of visual, facial parameters obtained from a speech corpus consisting of short, read utterances in which focal accent was systematically varied. The utterances were recorded in a variety of expressive modes including certain, confirming, questioning, uncertain, happy, angry and neutral. Results showed that in all expressive modes, words with focal accent are accompanied by a greater variation of the facial parameters than are words in non-focal positions. Moreover, interesting differences between the expressions in terms of different parameters were found.
The Hearing at Home (HaH) project focuses on the needs of hearing-impaired people in home environments. The project is researching and developing an innovative media-center solution for hearing support, with several integrated features that support perception of speech and audio, such as individual loudness amplification, noise reduction, audio classification and event detection, and the possibility to display an animated talking head providing real-time speechreading support. In this paper we provide a brief project overview and then describe some recent results related to the audio classifier and the talking head. As the talking head expects clean speech input, an audio classifier has been developed for the task of classifying audio signals as clean speech, speech in noise or other. The mean accuracy of the classifier was 82%. The talking head (based on technology from the SynFace project) has been adapted for German, and a small speech-in-noise intelligibility experiment was conducted where sentence recognition rates increased from 3% to 17% when the talking head was present.
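The audio classification step can be sketched as follows: clip-level features are fed to a three-way classifier (clean speech, speech in noise, other) and accuracy is estimated by cross-validation. The feature representation and the choice of classifier here are illustrative assumptions, not the classifier developed in the project.

```python
# Hedged sketch of a three-way audio classifier with cross-validated accuracy.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
LABELS = ["clean_speech", "speech_in_noise", "other"]

# Placeholder clip-level feature vectors (e.g. averaged spectral features).
features = rng.normal(size=(300, 20))
labels = rng.integers(0, len(LABELS), size=300)

clf = SVC(kernel="rbf")
print("mean accuracy:", cross_val_score(clf, features, labels, cv=5).mean())

# Classify a new clip and report its label name.
clf.fit(features, labels)
print(LABELS[clf.predict(features[:1])[0]])
```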
The present tutorial paper is addressed to a wide audience with different discipline backgrounds as well as variable expertise on intonation. The paper is structured into five sections. In Section 1, Introduction, basic concepts of intonation and prosody are summarised and cornerstones of intonation research are highlighted. In Section 2, Functions and forms of intonation, a wide range of functions from morpholexical and phrase levels to discourse and dialogue levels are discussed and forms of intonation with examples from different languages are presented. In Section 3, Modelling and labelling of intonation, established models of intonation as well as labelling systems are presented. In Section 4, Applications of intonation, the most widespread applications of intonation, especially technological ones, are presented and methodological issues are discussed. In Section 5, Research perspectives, research avenues and ultimate goals as well as the significance and benefits of intonation research in the upcoming years are outlined.
The research project Simulating intonational varieties of Swedish (SIMULEKT) aims to gain more precise and thorough knowledge about some major regional varieties of Swedish: South, Göta, Svea, Gotland, Dala, North, and Finland Swedish. In this research effort, the Swedish prosody model and different forms of speech synthesis play a prominent role. The two speech databases SweDia 2000 and SpeechDat constitute our main material for analysis. As a first test case for our prosody model, we compared Svea and North Swedish intonation in a pilot production-oriented perception test. Naïve Swedish listeners were asked to identify the most Svea and North sounding stimuli. Results showed that listeners can differentiate between the two varieties from intonation only. They also provided information on how intonational parameters affect listeners' impression of Swedish varieties. All this indicates that our experimental method can be used to test perception of different regional varieties of Swedish.
This paper introduces a new research project Simulating Intonational Varieties of Swedish (SIMULEKT). The basic goal of the project is to produce more precise and thorough knowledge about some major intonational varieties of Swedish. In this research effort the Swedish prosody model plays a prominent role. A fundamental idea is to take advantage of speech synthesis in different forms. In our analysis and synthesis work we will focus on some major intonational types: South, Göta, Svea, Gotland, Dala, North, and Finland Swedish. The significance of our project work will be within basic research as well as in speech technology applications.
This paper is a report on current efforts at the Department of Speech, Music and Hearing, KTH, on data-driven multimodal synthesis, including both visual speech synthesis and acoustic modeling. In this research we try to combine corpus-based methods with knowledge-based models and to explore the best of the two approaches. In the paper, an attempt to build formant-synthesis systems based on both rule-generated and database-driven methods is presented. A pilot experiment is also reported, showing that this approach can be a very interesting path to explore further. Two studies on visual speech synthesis are reported, one on data acquisition using a combination of motion capture techniques and one concerned with coarticulation, comparing different models.
In this chapter, we review some of the issues in rule-based synthesis and specifically discuss formant synthesis. Formant synthesis and the theory behind it have played an important role both in the scientific progress in understanding how humans talk and in the development of the first speech technology applications. Its flexibility and small footprint make the approach still of interest and a valuable complement to the currently dominant methods based on concatenative data-driven synthesis. As already mentioned in the overview by Schroeter (Chap. 19), we also see a new trend to combine the rule-based and data-driven approaches. Formant features from a database can be used both to optimize a rule-based formant synthesis system and to optimize the search for good units in a concatenative system.
In this paper we discuss work in progress on an interactive talking agent as a virtual language tutor in CALL applications. The ambition is to create a tutor that can be engaged in many aspects of language learning, from detailed pronunciation to conversational training. Some of the crucial components of such a system are described. An initial implementation of a stress/quantity training scheme will be presented.
Efficient language learning is one of the keys to social inclusion. In this paper we present some work aiming at creating a virtual language tutor. The ambition is to create a tutor that can be engaged in many aspects of language learning from detailed pronunciation training to conversational practice. Some of the crucial components of such a system are described. An initial implementation of a stress/quantity training tutor for Swedish will be presented.
Speech communication research and speech technology have found many applications for handicapped individuals. One of the very first examples of an application of speech synthesis was the reading machine for the blind. It is natural that results and devices in the speech communication field can be utilized for (re)habilitation of persons with communication disabilities. AAC - Augmentative and Alternative Communication - has evolved into an independent research area with strong input from speech and language processing. In this presentation we will look at the development of the field, from very early speech training devices based on speech analysis to advanced systems including robotics and avatars capable of human-like interaction. We will show examples where pressing needs of disabled persons have inspired avant-garde applications and development that have eventually spread to more general use in widely used applications. In this sense the "design for all" paradigm has been a rewarding and fruitful driving force for many speech communication and technology researchers.
In this paper we present some work aiming at creating a virtual language tutor. The ambition is to create a tutor that can be engaged in many aspects of language learning, from detailed pronunciation training to conversational practice. Some of the crucial components of such a system are described. An initial implementation of a stress/quantity training tutor for Swedish will be presented.
Prosody in a single speaking style, often read speech, has been studied extensively in acoustic speech. During the past few years we have expanded our interest in two directions: (1) Prosody in expressive speech communication and (2) prosody as an audiovisual expression. Understanding the interactions between visual expressions (primarily in the face) and the acoustics of the corresponding speech presents a substantial challenge. Some of the visual articulation is for obvious reasons tightly connected to the acoustics (e.g. lip and jaw movements), but there are other articulatory movements that do not show up on the outside of the face. Furthermore, many facial gestures used for communicative purposes do not affect the acoustics directly, but might nevertheless be connected on a higher communicative level in which the timing of the gestures could play an important role. In this presentation we will give some examples of recent work, primarily at KTH, addressing these questions. We will report on methods for the acquisition and modeling of visual and acoustic data, and some evaluation experiments in which audiovisual prosody is tested. The context of much of our work in this area is to create an animated talking agent capable of displaying realistic communicative behavior and suitable for use in conversational spoken language systems, e.g. a virtual language teacher.
At the Centre for Speech Technology at KTH, we have for the past several years been developing spoken dialogue applications that include animated talking agents. Our motivation for moving into audiovisual output is to investigate the advantages of multimodality in human-system communication. While the mainstream character animation area has focussed on the naturalness and realism of the animated agents, our primary concern has been the possible increase of intelligibility and efficiency of interaction resulting from the addition of a talking face. In our first dialogue system, Waxholm, the agent used the deictic function of indicating specific information on the screen by eye gaze. In another project, Synface, we were specifically concerned with the advantages in intelligibility that a talking face could provide. In recent studies we have investigated the use of facial gesture cues to convey such dialogue-related functions as feedback and turn-taking as well as prosodic functions such as prominence. Results show that cues such as eyebrow and head movement can independently signal prominence. Current results also indicate that there can be considerable differences in cue strengths among visual cues such as smiling and nodding and that such cues can contribute in an additive manner together with auditory prosody as cues to different dialogue functions. Results from some of these studies are presented in the chapter along with examples of spoken dialogue applications using talking heads.
In face-to-face communication both visual and auditory information play an obvious and significant role. In this presentation we will discuss work done, primarily at KTH, that aims at analyzing and modelling verbal and non-verbal communication from a multi-modal perspective. In our studies, it appears that both segmental and prosodic phenomena are strongly affected by the communicative context of speech interaction. One platform for modelling audiovisual speech communication is the ECA, embodied conversational agent. We will describe how ECAs have been used in our research, including examples of applications and a series of experiments for studying multimodal aspects of speech communication.
Understanding the interactions between visual expressions, dialogue functions and the acoustics of the corresponding speech presents a substantial challenge. The context of much of our work in this area is to create an animated talking agent capable of displaying realistic communicative behavior and suitable for use in conversational spoken language systems, e.g. a virtual language teacher. In this presentation we will give some examples of recent work, primarily at KTH, involving the collection and analysis of a database for audiovisual prosody. We will report on methods for the acquisition and modeling of visual and acoustic data, and provide some examples of analysis of head nods and eyebrow settings.
The use of animated talking agents is a novel feature of many multimodal experimental spoken dialogue systems. The addition and integration of a virtual talking head has direct implications for the way in which users approach and interact with such systems. Established techniques for evaluating the quality, efficiency, and other impacts of this technology have not yet appeared in standard textbooks. The focus of this chapter is to look into the communicative function of the agent, both the capability to increase intelligibility of the spoken interaction and the possibility to make the flow of the dialogue smoother, through different kinds of communicative gestures such as gestures for emphatic stress, emotions, turn-taking, and negative or positive system feedback. The chapter reviews state-of-the-art animated agent technologies and their applications primarily in dialogue systems. The chapter also includes examples of methods of evaluating communicative gestures in different contexts.
While much of the state-of-the-art research in human-robot interaction (HRI) investigates task-oriented interaction, this paper aims at exploring what people talk about to a robot if the content of the conversation is not predefined. We used the robot head Furhat to explore the conversational behavior of people who encounter a robot in the public setting of a robot exhibition in a scientific museum, but without a predefined purpose. Upon analyzing the conversations, it could be shown that a sophisticated robot provides an inviting atmosphere for people to engage in interaction and to be experimental and challenge the robot's capabilities. Many visitors to the exhibition were willing to go beyond the guiding questions that were provided as a starting point. Amongst other things, they asked Furhat questions concerning the robot itself, such as how it would define a robot, or if it plans to take over the world. People were also interested in the feelings and likes of the robot and they asked many personal questions - this is how Furhat ended up with its first marriage proposal. People who talked to Furhat were asked to complete a questionnaire on their assessment of the conversation, with which we could show that the interaction with Furhat was rated as a pleasant experience.
Facial gestures are used to convey e.g. emotions, dialogue states and conversational signals, which support us in the interpretation of other people's feelings and intentions. Synthesising this behaviour with an animated talking head would widen the possibilities of this intuitive interface. The dynamic characteristics of these facial gestures during speech affect articulation. Previously, articulation for neutral speech has been studied and implemented in animation rules. The results obtained in this study show how some articulatory parameters are affected by the influence of expressiveness in speech for a selection of Swedish vowels. Our focus has primarily been on attitudes and emotions conveying information that is intended to make an animated agent more "human-like". A multimodal corpus of acted expressive speech has been collected for this purpose.