kth.se Publications
51-88 of 88
  • 51.
    Engwall, Olov
    KTH, Superseded Departments (pre-2005), Speech, Music and Hearing.
    Tongue Talking: Studies in Intraoral Speech Synthesis, 2002. Doctoral thesis, comprehensive summary (Other scientific)
    Download full text (pdf)
    FULLTEXT01
  • 52.
    Engwall, Olov
    KTH, Superseded Departments (pre-2005), Speech, Music and Hearing.
    Vocal Tract Modeling in 3D, 1999. In: TMH Quarterly Status and Progress Report, p. 31-38. Article in journal (Other academic)
  • 53.
    Engwall, Olov
    et al.
    KTH, Superseded Departments (pre-2005), Speech, Music and Hearing.
    Badin, P
    An MRI Study of Swedish Fricatives: Coarticulatory effects, 2000. In: Proceedings of the 5th Speech Production Seminar, 2000, p. 297-300. Conference paper (Other academic)
  • 54.
    Engwall, Olov
    et al.
    KTH, Superseded Departments (pre-2005), Speech, Music and Hearing.
    Badin, P
    Collecting and Analysing Two- and Three-dimensional MRI data for Swedish, 1999. In: TMH Quarterly Status and Progress Report, p. 11-38. Article in journal (Other academic)
  • 55.
    Engwall, Olov
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Bandera Rubio, Juan Pedro
    Departamento de Tecnología Electrónica, University of Málaga, Málaga, Spain.
    Bensch, Suna
    Department of Computing Science, Umeå University, Umeå, Sweden.
    Haring, Kerstin Sophie
    Robots and Sensors for the Human Well-Being, Ritchie School of Engineering and Computer Science, University of Denver, Denver, United States.
    Kanda, Takayuki
    HRI Lab, Kyoto University, Kyoto, Japan.
    Núñez, Pedro
    Tecnología de los Computadores y las Comunicaciones Department, University of Extremadura, Badajoz, Spain.
    Rehm, Matthias
    The Technical Faculty of IT and Design, Aalborg University, Aalborg, Denmark.
    Sgorbissa, Antonio
    Dipartimento di Informatica, Bioingegneria, Robotica e Ingegneria dei Sistemi, University of Genoa, Genoa, Italy.
    Editorial: Socially, culturally and contextually aware robots, 2023. In: Frontiers in Robotics and AI, E-ISSN 2296-9144, Vol. 10, article id 1232215. Article in journal (Other academic)
  • 56.
    Engwall, Olov
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Bälter, Olle
    KTH, School of Computer Science and Communication (CSC), Human - Computer Interaction, MDI.
    Pronunciation feedback from real and virtual language teachers, 2007. In: Computer Assisted Language Learning, ISSN 0958-8221, E-ISSN 1744-3210, Vol. 20, no 3, p. 235-262. Article in journal (Refereed)
    Abstract [en]

    The aim of this paper is to summarise how pronunciation feedback on the phoneme level should be given in computer-assisted pronunciation training (CAPT) in order to be effective. The study contains a literature survey of feedback in the language classroom, interviews with language teachers and their students about their attitudes towards pronunciation feedback, and observations of how feedback is given in their classrooms. The study was carried out using focus group meetings, individual semi-structured interviews and classroom observations. The feedback strategies that were advocated and observed in the study on pronunciation feedback from human teachers were implemented in a computer-animated language tutor giving articulation feedback. The virtual tutor was subsequently tested in a user trial and evaluated with a questionnaire. The article proposes several feedback strategies that would improve the pedagogical soundness of CAPT systems.

  • 57.
    Engwall, Olov
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Bälter, Olle
    KTH, School of Computer Science and Communication (CSC), Human - Computer Interaction, MDI.
    Öster, Anne-Marie
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Kjellström, Hedvig
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Designing the user interface of the computer-based speech training system ARTUR based on early user tests, 2006. In: Behavior and Information Technology, ISSN 0144-929X, E-ISSN 1362-3001, Vol. 25, no 4, p. 353-365. Article in journal (Refereed)
    Abstract [en]

    This study has been performed in order to evaluate a prototype for the human-computer interface of a computer-based speech training aid named ARTUR. The main feature of the aid is that it can give suggestions on how to improve articulations. Two user groups were involved: three children aged 9-14 with extensive experience of speech training with therapists and computers, and three children aged 6, with little or no prior experience of computer-based speech training. All children had general language disorders. The study indicates that the present interface is usable without prior training or instructions, even for the younger children, but that more motivational factors should be introduced. The granularity of the mesh that classifies mispronunciations was satisfactory, but the flexibility and level of detail of the feedback should be developed further.

  • 58.
    Engwall, Olov
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Bälter, Olle
    KTH, School of Computer Science and Communication (CSC), Human - Computer Interaction, MDI.
    Öster, Anne-Marie
    Kjellström, Hedvig
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Feedback management in the pronunciation training system ARTUR, 2006. In: Proceedings of CHI 2006, 2006, p. 231-234. Conference paper (Refereed)
    Abstract [en]

    This extended abstract discusses the development of a computer-assisted pronunciation training system that gives articulatory feedback, and in particular the management of feedback given to the user.

  • 59.
    Engwall, Olov
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Cumbal, Ronald
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Lopes, Jose
    Heriot-Watt University, Edinburgh, Midlothian, Scotland.
    Ljung, Mikael
    KTH.
    Månsson, Linnea
    KTH.
    Identification of Low-engaged Learners in Robot-led Second Language Conversations with Adults, 2022. In: ACM Transactions on Human-Robot Interaction, E-ISSN 2573-9522, Vol. 11, no 2, article id 18. Article in journal (Refereed)
    Abstract [en]

    The main aim of this study is to investigate if verbal, vocal, and facial information can be used to identify low-engaged second language learners in robot-led conversation practice. The experiments were performed on voice recordings and video data from 50 conversations, in which a robotic head talks with pairs of adult language learners using four different interaction strategies with varying robot-learner focus and initiative. It was found that these robot interaction strategies influenced learner activity and engagement. The verbal analysis indicated that learners with low activity rated the robot significantly lower on two out of four scales related to social competence. The acoustic vocal and video-based facial analysis, based on manual annotations or machine learning classification, both showed that learners with low engagement rated the robot's social competencies consistently, and in several cases significantly, lower, and in addition rated the learning effectiveness lower. The agreement between manual and automatic identification of low-engaged learners based on voice recordings or face videos was further found to be adequate for future use. These experiments constitute a first step towards enabling adaptation to learners' activity and engagement through within- and between-strategy changes of the robot's interaction with learners.

  • 60.
    Engwall, Olov
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Cumbal, Ronald
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Majlesi, Ali Reza
    Socio-cultural perception of robot backchannels, 2023. In: Frontiers in Robotics and AI, E-ISSN 2296-9144, Vol. 10. Article in journal (Refereed)
    Abstract [en]

    Introduction: Backchannels, i.e., short interjections by an interlocutor to indicate attention, understanding or agreement regarding utterances by another conversation participant, are fundamental in human-human interaction. Lack of backchannels or if they have unexpected timing or formulation may influence the conversation negatively, as misinterpretations regarding attention, understanding or agreement may occur. However, several studies over the years have shown that there may be cultural differences in how backchannels are provided and perceived and that these differences may affect intercultural conversations. Culturally aware robots must hence be endowed with the capability to detect and adapt to the way these conversational markers are used across different cultures. Traditionally, culture has been defined in terms of nationality, but this is more and more considered to be a stereotypic simplification. We therefore investigate several socio-cultural factors, such as the participants’ gender, age, first language, extroversion and familiarity with robots, that may be relevant for the perception of backchannels.

    Methods: We first cover existing research on cultural influence on backchannel formulation and perception in human-human interaction and on backchannel implementation in Human-Robot Interaction. We then present an experiment on second language spoken practice, in which we investigate how backchannels from the social robot Furhat influence interaction (investigated through speaking time ratios and ethnomethodology and multimodal conversation analysis) and impression of the robot (measured by post-session ratings). The experiment, made in a triad word game setting, is focused on if activity-adaptive robot backchannels may redistribute the participants’ speaking time ratio, and/or if the participants’ assessment of the robot is influenced by the backchannel strategy. The goal is to explore how robot backchannels should be adapted to different language learners to encourage their participation while being perceived as socio-culturally appropriate.

    Results: We find that a strategy that displays more backchannels towards a less active speaker may substantially decrease the difference in speaking time between the two speakers, that different socio-cultural groups respond differently to the robot’s backchannel strategy and that they also perceive the robot differently after the session.

    Discussion: We conclude that the robot may need different backchanneling strategies towards speakers from different socio-cultural groups in order to encourage them to speak and have a positive perception of the robot.

    Download full text (pdf)
    fulltext
  • 61.
    Engwall, Olov
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Cumbal, Ronald
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Águas Lopes, José David
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Ljung, Mikael
    Månsson, Linnea
    Identification of low-engaged learners in robot-led second language conversations with adults. Manuscript (preprint) (Other academic)
    Abstract [en]

    The main aim of this study is to investigate if verbal, vocal and facial information can be used to identify low-engaged second language learners in robot-led conversation practice. The experiments were performed on voice recordings and video data from 50 conversations, in which a robotic head talks with pairs of adult language learners using four different interaction strategies with varying robot-learner focus and initiative. It was found that these robot interaction strategies influenced learner activity and engagement. The verbal analysis indicated that learners with low activity rated the robot significantly lower on two out of four scales related to social competence. The acoustic vocal and video-based facial analysis, based on manual annotations or machine learning classification, both showed that learners with low engagement rated the robot’s social competencies consistently, and in several cases significantly, lower, and in addition rated the learning effectiveness lower. The agreement between manual and automatic identification of low-engaged learners based on voice recordings or face videos was further found to be adequate for future use. These experiments constitute a first step towards enabling adaptation to learners’ activity and engagement through within- and between-strategy changes of the robot’s interaction with learners.

    Download full text (pdf)
    fulltext
  • 62.
    Engwall, Olov
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    David Lopes, José
    Interaction Lab, Heriot-Watt University, Edinburgh, UK.
    Interaction and collaboration in robot-assisted language learning for adults, 2022. In: Computer Assisted Language Learning, ISSN 0958-8221, E-ISSN 1744-3210, Vol. 35, no 5-6, p. 1273-1309. Article in journal (Refereed)
    Abstract [en]

    This article analyses how robot–learner interaction in robot-assisted language learning (RALL) is influenced by the interaction behaviour of the robot. Since the robot behaviour is to a large extent determined by the combination of teaching strategy, robot role and robot type, previous studies in RALL are first summarised with respect to which combinations have been chosen, the rationale behind the choice and the effects on interaction and learning. The goal of the summary is to determine a suitable pedagogical set-up for RALL with adult learners, since previous RALL studies have almost exclusively been performed with children and youths. A user study in which 33 adult second language learners practice Swedish in three-party conversations with an anthropomorphic robot head is then presented. It is demonstrated how different robot interaction behaviours influence interaction between the robot and the learners and between the two learners. Through an analysis of learner interaction, collaboration and learner ratings for the different robot behaviours, it is observed that the learners were most positive towards the robot behaviour that focused on interviewing one learner at a time (highest average ratings), but that they were the most active in sessions when the robot encouraged learner–learner interaction. Moreover, the preferences and activity differed between learner pairs, depending on, e.g., their proficiency level and how well they knew the peer. It is therefore concluded that the robot behaviour needs to adapt to such factors. In addition, collaboration with the peer played an important part in conversation practice sessions to deal with linguistic difficulties or communication problems with the robot.

    Download full text (pdf)
    fulltext
  • 63.
    Engwall, Olov
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    David Lopes, José
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Åhlund, Anna
    Stockholms universitet.
    Robot interaction styles for conversation practice in second language learning, 2020. In: International Journal of Social Robotics, ISSN 1875-4791, E-ISSN 1875-4805. Article in journal (Refereed)
    Abstract [en]

    Four different interaction styles for the social robot Furhat acting as a host in spoken conversation practice with two simultaneous language learners have been developed, based on interaction styles of human moderators of language cafés. We first investigated, through a survey and recorded sessions of three-party language café style conversations, how the interaction styles of human moderators are influenced by different factors (e.g., the participants' language level and familiarity). Using this knowledge, four distinct interaction styles were developed for the robot: sequentially asking one participant questions at a time (Interviewer); the robot speaking about itself, robots and Sweden or asking quiz questions about Sweden (Narrator); attempting to make the participants talk with each other (Facilitator); and trying to establish a three-party robot-learner-learner interaction with equal participation (Interlocutor). A user study with 32 participants, conversing in pairs with the robot, was carried out to investigate how the post-session ratings of the robot's behavior along different dimensions (e.g., the robot's conversational skills and friendliness, the value of practice) are influenced by the robot's interaction style and participant variables (e.g., level in the target language, gender, origin). The general findings were that Interviewer received the highest mean rating, but that different factors influenced the ratings substantially, indicating that the preference of individual participants needs to be anticipated in order to improve learner satisfaction with the practice. We conclude with a list of recommendations for robot-hosted conversation practice in a second language.

    Download full text (pdf)
    fulltext
  • 64.
    Engwall, Olov
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Delvaux, V.
    Metens, T.
    Interspeaker Variation in the Articulation of French Nasal Vowels, 2006. In: Proceedings of the Seventh International Seminar on Speech Production, 2006, p. 3-10. Conference paper (Refereed)
  • 65.
    Engwall, Olov
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Lopes, J.
    Cumbal, Ronald
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Berndtson, Gustav
    KTH, School of Industrial Engineering and Management (ITM).
    Lindström, Ruben
    KTH, School of Electrical Engineering and Computer Science (EECS).
    Ekman, Patrik
    KTH.
    Hartmanis, Eric
    KTH, School of Electrical Engineering and Computer Science (EECS).
    Jin, Emelie
    KTH, School of Industrial Engineering and Management (ITM).
    Johnston, Ella
    KTH, School of Industrial Engineering and Management (ITM).
    Tahir, Gara
    KTH, School of Industrial Engineering and Management (ITM).
    Mekonnen, Michael
    KTH.
    Learner and teacher perspectives on robot-led L2 conversation practice, 2022. In: ReCALL, ISSN 0958-3440, E-ISSN 1474-0109, Vol. 34, no 3, p. 344-359. Article in journal (Refereed)
    Abstract [en]

    This article focuses on designing and evaluating conversation practice in a second language (L2) with a robot that employs human spoken and non-verbal interaction strategies. Based on an analysis of previous work and semi-structured interviews with L2 learners and teachers, recommendations for robot-led conversation practice for adult learners at intermediate level are first defined, focused on language learning, on the social context, on the conversational structure and on verbal and visual aspects of the robot moderation. Guided by these recommendations, an experiment is set up, in which 12 pairs of L2 learners of Swedish interact with a robot in short social conversations. These robot-learner interactions are evaluated through post-session interviews with the learners, teachers' ratings of the robot's behaviour and analyses of the video-recorded conversations, resulting in a set of guidelines for robot-led conversation practice: (1) societal and personal topics increase the practice's meaningfulness for learners; (2) strategies and methods for providing corrective feedback during conversation practice need to be explored further; (3) learners should be encouraged to support each other if the robot has difficulties adapting to their linguistic level; (4) the robot should establish a social relationship by contributing with its own story, remembering the participants' input, and making use of non-verbal communication signals; and (5) improvements are required regarding naturalness and intelligibility of text-to-speech synthesis, in particular its speed, if it is to be used for conversations with L2 learners. 

  • 66.
    Engwall, Olov
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Lopes, José
    Heriot-Watt University.
    Cumbal, Ronald
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Berndtson, Gustav
    KTH.
    Lindström, Ruben
    KTH.
    Ekman, Patrik
    KTH.
    Hartmanis, Eric
    KTH.
    Jin, Emelie
    KTH.
    Johnston, Ella
    KTH.
    Tahir, Gara
    KTH.
    Mekonnen, Michael
    KTH.
    Learner and teacher perspectives on robot-led L2 conversation practice. Manuscript (preprint) (Other academic)
    Abstract [en]

    This article focuses on designing and evaluating conversation practice in a second language (L2) with a robot that employs human spoken and non-verbal interaction strategies. Based on an analysis of previous work and semi-structured interviews with L2 learners and teachers, recommendations for robot-led conversation practice for adult learners at intermediate level are first defined, focused on language learning, on the social context, on the conversational structure and on verbal and visual aspects of the robot moderation. Guided by these recommendations, an experiment is set up, in which 12 pairs of L2 learners of Swedish interact with a robot in short social conversations. These robot-learner interactions are evaluated through post-session interviews with the learners, teachers’ ratings of the robot’s behaviour and analyses of the video-recorded conversations, resulting in a set of guidelines for robot-led conversation practice, in particular: 1) Societal and personal topics increase the practice’s meaningfulness for learners. 2) Strategies and methods for providing corrective feedback during conversation practice need to be explored further. 3) Learners should be encouraged to support each other if the robot has difficulties adapting to their linguistic level. 4) The robot should establish a social relationship, by contributing with its own story, remembering the participants’ input, and making use of non-verbal communication signals. 5) Improvements are required regarding naturalness and intelligibility of text-to-speech synthesis, in particular its speed, if it is to be used for conversations with L2 learners.

    Download full text (pdf)
    fulltext
  • 67.
    Engwall, Olov
    et al.
    KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT. KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Wik, Preben
    KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT. KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Are real tongue movements easier to speech read than synthesized? 2009. In: INTERSPEECH 2009: 10TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2009, BAIXAS: ISCA-INST SPEECH COMMUNICATION ASSOC, 2009, p. 824-827. Conference paper (Refereed)
    Abstract [en]

    Speech perception studies with augmented reality displays in talking heads have shown that tongue reading abilities are weak initially, but that subjects become able to extract some information from intra-oral visualizations after a short training session. In this study, we investigate how the nature of the tongue movements influences the results, by comparing synthetic rule-based and actual, measured movements. The subjects were significantly better at perceiving sentences accompanied by real movements, indicating that the current coarticulation model developed for facial movements is not optimal for the tongue.

  • 68.
    Engwall, Olov
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Wik, Preben
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Can you tell if tongue movements are real or synthetic? 2009. In: Proceedings of AVSP, 2009. Conference paper (Refereed)
    Abstract [en]

    We have investigated if subjects are aware of what natural tongue movements look like, by showing them animations based on either measurements or rule-based synthesis. The issue is of interest since a previous audiovisual speech perception study recently showed that the word recognition rate in sentences with degraded audio was significantly better with real tongue movements than with synthesized. The subjects in the current study could as a group not tell which movements were real, with a classification score at chance level. About half of the subjects were significantly better at discriminating between the two types of animations, but their classification score was as often well below chance as above. The correlation between classification score and word recognition rate for subjects who also participated in the perception study was very weak, suggesting that the higher recognition score for real tongue movements may be due to subconscious, rather than conscious, processes. This finding could potentially be interpreted as an indication that audiovisual speech perception is based on articulatory gestures.

  • 69.
    Engwall, Olov
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Wik, Preben
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Real vs. rule-generated tongue movements as an audio-visual speech perception support, 2009. In: Proceedings of Fonetik 2009 / [ed] Peter Branderud, Hartmut Traunmüller, Stockholm: Stockholm University, 2009, p. 30-35. Conference paper (Other academic)
    Abstract [en]

    We have conducted two studies in which animations created from real tongue movements and rule-based synthesis are compared. We first studied if the two types of animations were different in terms of how much support they give in a perception task. Subjects achieved a significantly higher word recognition rate in sentences when animations were shown compared to the audio only condition, and a significantly higher score with real movements than with synthesized. We then performed a classification test, in which subjects should indicate if the animations were created from measurements or from rules. The results show that the subjects as a group are unable to tell if the tongue movements are real or not. The stronger support from real movements hence appears to be due to subconscious factors.

  • 70.
    Engwall, Olov
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Wik, Preben
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Granström, Björn
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Design strategies for a virtual language tutor, 2004. In: INTERSPEECH 2004, ICSLP, 8th International Conference on Spoken Language Processing, Jeju Island, Korea, October 4-8, 2004 / [ed] Kim, S. H.; Young, D. H., Jeju Island, Korea, 2004, p. 1693-1696. Conference paper (Refereed)
    Abstract [en]

    In this paper we discuss work in progress on an interactive talking agent as a virtual language tutor in CALL applications. The ambition is to create a tutor that can be engaged in many aspects of language learning from detailed pronunciation to conversational training. Some of the crucial components of such a system are described. An initial implementation of a stress/quantity training scheme will be presented.

  • 71.
    Engwall, Olov
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Águas Lopes, José David
    Cumbal, Ronald
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Is a Wizard-of-Oz Required for Robot-Led Conversation Practice in a Second Language? 2022. In: International Journal of Social Robotics, ISSN 1875-4791, E-ISSN 1875-4805. Article in journal (Refereed)
    Abstract [en]

    The large majority of previous work on human-robot conversations in a second language has been performed with a human wizard-of-Oz. The reasons are that automatic speech recognition of non-native conversational speech is considered to be unreliable and that the dialogue management task of selecting robot utterances that are adequate at a given turn is complex in social conversations. This study therefore investigates if robot-led conversation practice in a second language with pairs of adult learners could potentially be managed by an autonomous robot. We first investigate how correct and understandable transcriptions of second language learner utterances are when made by a state-of-the-art speech recogniser. We find both a relatively high word error rate (41%) and that a substantial share (42%) of the utterances are judged to be incomprehensible or only partially understandable by a human reader. We then evaluate how adequate the robot utterance selection is, when performed manually based on the speech recognition transcriptions or autonomously using (a) predefined sequences of robot utterances, (b) a general state-of-the-art language model that selects utterances based on learner input or the preceding robot utterance, or (c) a custom-made statistical method that is trained on observations of the wizard’s choices in previous conversations. It is shown that adequate or at least acceptable robot utterances are selected by the human wizard in most cases (96%), even though the ASR transcriptions have a high word error rate. Further, the custom-made statistical method performs as well as manual selection of robot utterances based on ASR transcriptions. It was also found that the interaction strategy that the robot employed, which differed regarding how much the robot maintained the initiative in the conversation and if the focus of the conversation was on the robot or the learners, had marginal effects on the word error rate and understandability of the transcriptions but larger effects on the adequateness of the utterance selection. Autonomous robot-led conversations may hence work better with some robot interaction strategies.
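    As an aside on the word error rate figure quoted in this abstract: WER is the word-level edit distance between the reference transcription and the ASR hypothesis, normalised by the reference length. The sketch below is a generic textbook implementation, not the authors' evaluation code, and the example sentences are invented.

```python
# Minimal WER sketch: Levenshtein distance over words, divided by the reference length.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edits needed to turn the first i reference words into the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(substitution, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Invented example: one substitution and one deletion over four reference words -> 0.5
print(word_error_rate("jag bor i stockholm", "jag bodde i"))
```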

    Download full text (pdf)
    fulltext
  • 72.
    Engwall, Olov
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Águas Lopes, José David
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Cumbal, Ronald
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Berndtson, Gustav
    Lindström, Ruben
    Ekman, Patrik
    Hartmanis, Eric
    Jin, Emelie
    Johnston, Ella
    Tahir, Gara
    Mekonnen, Michael
    Learner and teacher perspectives on robot-led L2 conversation practice. Article in journal (Refereed)
    Abstract [en]

    This article focuses on designing and evaluating conversation practice in a second language (L2) with a robot that employs human spoken and non-verbal interaction strategies. Based on an analysis of previous work and semi-structured interviews with L2 learners and teachers, recommendations for robot-led conversation practice for adult learners at intermediate level are first defined, focused on language learning, on the social context, on the conversational structure and on verbal and visual aspects of the robot moderation. Guided by these recommendations, an experiment is set up, in which 12 pairs of L2 learners of Swedish interact with a robot in short social conversations. These robot-learner interactions are evaluated through post-session interviews with the learners, teachers’ ratings of the robot’s behaviour and analyses of the video-recorded conversations, resulting in a set of guidelines for robot-led conversation practice, in particular: 1) Societal and personal topics increase the practice’s meaningfulness for learners. 2) Strategies and methods for providing corrective feedback during conversation practice need to be explored further. 3) Learners should be encouraged to support each other if the robot has difficulties adapting to their linguistic level. 4) The robot should establish a social relationship, by contributing with its own story, remembering the participants’ input, and making use of non-verbal communication signals. 5) Improvements are required regarding naturalness and intelligibility of text-to-speech synthesis, in particular its speed, if it is to be used for conversations with L2 learners. 

    Download full text (pdf)
    fulltext
  • 73.
    Eriksson, Elina
    et al.
    KTH, School of Computer Science and Communication (CSC), Human - Computer Interaction, MDI.
    Bälter, Olle
    KTH, School of Computer Science and Communication (CSC), Human - Computer Interaction, MDI.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Öster, Anne-Marie
    Kjellström, Hedvig
    KTH, School of Computer Science and Communication (CSC), Computer Vision and Active Perception, CVAP.
    Design Recommendations for a Computer-Based Speech Training System Based on End User Interviews, 2005. In: Proceedings of the Tenth International Conference on Speech and Computers, 2005, p. 483-486. Conference paper (Refereed)
    Abstract [en]

    This study has been performed in order to improve the usability of computer-based speech training (CBST) aids. The aim was to engage the users of speech training systems in the first step of creating a new CBST aid. Speech therapists and children with hearing- or speech impairment were interviewed and the result of the interviews is presented in the form of design recommendations.

  • 74.
    Gillet, Sarah
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Robotics, Perception and Learning, RPL.
    Cumbal, Ronald
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Abelho Pereira, André Tiago
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Lopes, José
    Heriot-Watt University.
    Engwall, Olov
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Leite, Iolanda
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Robotics, Perception and Learning, RPL.
    Robot Gaze Can Mediate Participation Imbalance in Groups with Different Skill Levels, 2021. In: Proceedings of the 2021 ACM/IEEE International Conference on Human-Robot Interaction, Association for Computing Machinery, 2021, p. 303-311. Conference paper (Refereed)
    Abstract [en]

    Many small group activities, like working teams or study groups, have a high dependency on the skill of each group member. Differences in skill level among participants can affect not only the performance of a team but also influence the social interaction of its members. In these circumstances, an active member could balance individual participation without exerting direct pressure on specific members by using indirect means of communication, such as gaze behaviors. Similarly, in this study, we evaluate whether a social robot can balance the level of participation in a language skill-dependent game, played by a native speaker and a second language learner. In a between-subjects study (N = 72), we compared an adaptive robot gaze behavior that was targeted to increase the level of contribution of the least active player, with a non-adaptive gaze behavior. Our results imply that, while overall levels of speech participation were influenced predominantly by personal traits of the participants, the robot’s adaptive gaze behavior could shape the interaction among participants, which led to more even participation during the game.

  • 75. Katsamanis, N.
    et al.
    Ananthakrishnan, Gopal
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Papandreou, G.
    Maragos, P.
    NTU, Athens, Greece.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Audiovisual speech inversion by switching dynamical modeling Governed by a Hidden Markov Process, 2008. In: Proceedings of EUSIPCO, 2008. Conference paper (Refereed)
    Abstract [en]

    We propose a unified framework to recover articulation from audiovisual speech. The nonlinear audiovisual-to-articulatory mapping is modeled by means of a switching linear dynamical system. Switching is governed by a state sequence determined via a Hidden Markov Model alignment process. Mel Frequency Cepstral Coefficients are extracted from audio while visual analysis is performed using Active Appearance Models. The articulatory state is represented by the coordinates of points on important articulators, e.g., tongue and lips. To evaluate our inversion approach, instead of just using the conventional correlation coefficients and root mean squared errors, we introduce a novel evaluation scheme that is more specific to the inversion problem. Prediction errors in the positions of the articulators are weighted differently depending on their relevant importance in the production of the corresponding sound. The applied weights are determined by an articulatory classification analysis using Support Vector Machines with a radial basis function kernel. Experiments are conducted in the audiovisual-articulatory MOCHA database.
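    The evaluation scheme described here weights articulator-position errors by their importance for the sound being produced. A rough sketch of such a weighted error measure is given below; the weights in the example are invented, whereas the paper derives them from an SVM-based articulatory classification analysis.

```python
import numpy as np

def weighted_rmse(predicted: np.ndarray, measured: np.ndarray, weights: np.ndarray) -> float:
    # predicted, measured: (n_frames, n_coordinates) articulator positions
    # weights: (n_coordinates,) importance of each coordinate for the current phoneme class
    squared_error = (predicted - measured) ** 2
    weighted = squared_error * (weights / weights.sum())  # emphasise the critical articulators
    return float(np.sqrt(weighted.sum(axis=1).mean()))

# Invented example: tongue-tip coordinates weighted higher than lip coordinates
rng = np.random.default_rng(0)
measured = rng.normal(size=(100, 4))
predicted = measured + 0.1 * rng.normal(size=(100, 4))
print(weighted_rmse(predicted, measured, weights=np.array([3.0, 3.0, 1.0, 1.0])))
```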

  • 76.
    Kjellström, Hedvig
    et al.
    KTH, School of Computer Science and Communication (CSC), Computer Vision and Active Perception, CVAP.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Audiovisual-to-articulatory inversion, 2009. In: Speech Communication, ISSN 0167-6393, E-ISSN 1872-7182, Vol. 51, no 3, p. 195-209. Article in journal (Refereed)
    Abstract [en]

    It has been shown that acoustic-to-articulatory inversion, i.e. estimation of the articulatory configuration from the corresponding acoustic signal, can be greatly improved by adding visual features extracted from the speaker's face. In order to make the inversion method usable in a realistic application, these features should be possible to obtain from a monocular frontal face video, where the speaker is not required to wear any special markers. In this study, we investigate the importance of visual cues for inversion. Experiments with motion capture data of the face show that important articulatory information can be extracted using only a few face measures that mimic the information that could be gained from a video-based method. We also show that the depth cue for these measures is not critical, which means that the relevant information can be extracted from a frontal video. A real video-based face feature extraction method is further presented, leading to similar improvements in inversion quality. Rather than tracking points on the face, it represents the appearance of the mouth area using independent component images. These findings are important for applications that need a simple audiovisual-to-articulatory inversion technique, e.g. articulatory phonetics training for second language learners or hearing-impaired persons.

  • 77.
    Kjellström, Hedvig
    et al.
    KTH, School of Computer Science and Communication (CSC), Computer Vision and Active Perception, CVAP.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT. KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Abdou, Sherif
    Bälter, Olle
    KTH, School of Computer Science and Communication (CSC), Human - Computer Interaction, MDI.
    Audio-visual phoneme classification for pronunciation training applications, 2007. In: INTERSPEECH 2007: 8TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION, BAIXAS: ISCA-INST SPEECH COMMUNICATION ASSOC, 2007, p. 57-60. Conference paper (Refereed)
    Abstract [en]

    We present a method for audio-visual classification of Swedish phonemes, to be used in computer-assisted pronunciation training. The probabilistic kernel-based method is applied to the audio signal and/or either a principal or an independent component (PCA or ICA) representation of the mouth region in video images. We investigate which representation (PCA or ICA) may be most suitable and the number of components required in the base, in order to be able to automatically detect pronunciation errors in Swedish from audio-visual input. Experiments performed on one speaker show that the visual information helps avoid classification errors that would lead to gravely erroneous feedback to the user; that it is better to perform phoneme classification on audio and video separately and then fuse the results, rather than combining them before classification; and that PCA outperforms ICA for fewer than 50 components.
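    The late-fusion idea reported in this abstract (classify audio and video separately, then combine the results) can be sketched roughly as below. This is not the paper's kernel-based implementation; the data, dimensions and fusion weights are invented for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_frames = 200
audio_features = rng.normal(size=(n_frames, 13))        # e.g. cepstral features per frame
mouth_images = rng.normal(size=(n_frames, 32 * 32))     # flattened mouth-region pixels
labels = rng.integers(0, 2, size=n_frames)              # two phoneme classes

# Low-dimensional appearance representation of the mouth region (PCA here; ICA is the alternative)
video_features = PCA(n_components=20).fit_transform(mouth_images)

# Separate probabilistic classifiers per modality, fused after classification
audio_clf = SVC(probability=True).fit(audio_features, labels)
video_clf = SVC(probability=True).fit(video_features, labels)
fused = 0.5 * audio_clf.predict_proba(audio_features) + 0.5 * video_clf.predict_proba(video_features)
predicted_classes = fused.argmax(axis=1)                # evaluated on training data only, for brevity
```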

  • 78.
    Kjellström, Hedvig
    et al.
    KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT. KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Bälter, Olle
    KTH, School of Computer Science and Communication (CSC), Human - Computer Interaction, MDI.
    Reconstructing Tongue Movements from Audio and Video, 2006. In: INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, Vol. 1-5, BAIXAS: ISCA-INST SPEECH COMMUNICATION ASSOC, 2006, p. 2238-2241. Conference paper (Refereed)
    Abstract [en]

    This paper presents an approach to articulatory inversion using audio and video of the user's face, requiring no special markers. The video is stabilized with respect to the face, and the mouth region cropped out. The mouth image is projected into a learned independent component subspace to obtain a low-dimensional representation of the mouth appearance. The inversion problem is treated as one of regression; a non-linear regressor using relevance vector machines is trained with a dataset of simultaneous images of a subject's face, acoustic features and positions of magnetic coils glued to the subject's tongue. The results show the benefit of using both cues for inversion. We envisage the inversion method to be part of a pronunciation training system with articulatory feedback.

  • 79.
    Koniaris, Christos
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Perceptual differentiation modeling explains phoneme mispronunciation by non-native speakers, 2011. In: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, 2011, p. 5704-5707. Conference paper (Refereed)
    Abstract [en]

    One of the difficulties in second language (L2) learning is the weakness in discriminating between acoustic diversity within an L2 phoneme category and between different categories. In this paper, we describe a general method to quantitatively measure the perceptual difference between a group of native and individual non-native speakers. Normally, this task includes subjective listening tests and/or a thorough linguistic study. We instead use a totally automated method based on a psycho-acoustic auditory model. For a certain phoneme class, we measure the similarity of the Euclidean space spanned by the power spectrum of a native speech signal and the Euclidean space spanned by the auditory model output. We do the same for a non-native speech signal. Comparing the two similarity measurements, we find problematic phonemes for a given speaker. To validate our method, we apply it to different groups of non-native speakers of various first language (L1) backgrounds. Our results are verified by the theoretical findings in literature obtained from linguistic studies.

  • 80.
    Koniaris, Christos
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Phoneme Level Non-Native Pronunciation Analysis by an Auditory Model-based Native Assessment Scheme, 2011. In: 12th Annual Conference of the International Speech Communication Association, INTERSPEECH 2011, International Speech Communication Association, INTERSPEECH, 2011, p. 1157-1160. Conference paper (Refereed)
    Abstract [en]

    We introduce a general method for automatic diagnostic evaluation of the pronunciation of individual non-native speakers based on a model of the human auditory system trained with native data stimuli. For each phoneme class, the Euclidean geometry similarity between the native perceptual domain and the non-native speech power spectrum domain is measured. The problematic phonemes for a given second language speaker are found by comparing this measure to the Euclidean geometry similarity for the same phonemes produced by native speakers only. The method is applied to different groups of non-native speakers of various language backgrounds and the experimental results are in agreement with theoretical findings of linguistic studies.

  • 81.
    Koniaris, Christos
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Auditory and Dynamic Modeling Paradigms to Detect L2 Mispronunciations, 2012. In: 13th Annual Conference of the International Speech Communication Association 2012, INTERSPEECH 2012, Vol 1, 2012, p. 898-901. Conference paper (Refereed)
    Abstract [en]

    This paper expands our previous work on automatic pronunciation error detection that exploits knowledge from psychoacoustic auditory models. The new system has two additional important features, i.e., auditory and acoustic processing of the temporal cues of the speech signal, and classification feedback from a trained linear dynamic model. We also perform a pronunciation analysis by considering the task as a classification problem. Finally, we evaluate the proposed methods conducting a listening test on the same speech material and compare the judgment of the listeners and the methods. The automatic analysis based on spectro-temporal cues is shown to have the best agreement with the human evaluation, particularly with that of language teachers, and with previous plenary linguistic studies.

  • 82.
    Koniaris, Christos
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Engwall, Olov
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    On the Benefit of Using Auditory Modeling for Diagnostic Evaluation of Pronunciations, 2012. In: International Symposium on Automatic Detection of Errors in Pronunciation Training (IS ADEPT), Stockholm, Sweden, June 6-8, 2012 / [ed] Olov Engwall, 2012, p. 59-64. Conference paper (Refereed)
    Abstract [en]

    In this paper we demonstrate that a psychoacoustic model-based distance measure performs better than a speech signal distance measure in assessing the pronunciation of individual foreign speakers. The experiments show that the perceptually based method performs not only quantitatively better than a speech spectrum-based method, but also qualitatively better, hence showing that auditory information is beneficial in the task of pronunciation error detection. We first present the general approach of the method, which uses the dissimilarity between the native perceptual domain and the non-native speech power spectrum domain. The problematic phonemes for a given non-native speaker are determined by the degree of disparity between the dissimilarity measure for the non-native and a group of native speakers. The two methods compared here are applied to different groups of non-native speakers of various language backgrounds and validated against a theoretical linguistic study.

  • 83.
    Koniaris, Christos
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    On mispronunciation analysis of individual foreign speakers using auditory periphery models, 2013. In: Speech Communication, ISSN 0167-6393, E-ISSN 1872-7182, Vol. 55, no 5, p. 691-706. Article in journal (Refereed)
    Abstract [en]

    In second language (L2) learning, a major difficulty is to discriminate between the acoustic diversity within an L2 phoneme category and that between different categories. We propose a general method for automatic diagnostic assessment of the pronunciation of nonnative speakers based on models of the human auditory periphery. Considering each phoneme class separately, the geometric shape similarity between the native auditory domain and the non-native speech domain is measured. The phonemes that deviate the most from the native pronunciation for a set of L2 speakers are detected by comparing the geometric shape similarity measure with that calculated for native speakers on the same phonemes. To evaluate the system, we have tested it with different non-native speaker groups from various language backgrounds. The experimental results are in accordance with linguistic findings and human listeners' ratings, particularly when both the spectral and temporal cues of the speech signal are utilized in the pronunciation analysis.

  • 84.
    Lopes, José
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Skantze, Gabriel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    A First Visit to the Robot Language Café, 2017. In: Proceedings of the ISCA workshop on Speech and Language Technology in Education / [ed] Engwall, Lopes, Stockholm, 2017. Conference paper (Refereed)
    Abstract [en]

    We present an exploratory study on using a social robot in a conversational setting to practice a second language. The practice is carried out within a so-called language café, with two second language learners and one native moderator, a human or a robot, engaging in social small talk. We compare the interactions with the human and robot moderators and perform a qualitative analysis of the potentials of a social robot as a conversational partner for language learning. Interactions with the robot are carried out in a wizard-of-Oz setting, in which the native moderator who leads the corresponding human moderator session controls the robot. The observations of the video recorded sessions and the subject questionnaires suggest that the appropriate learner level for the practice is elementary (A1 to A2), for whom the structured, slightly repetitive interaction pattern was perceived as beneficial. We identify both some key features that are appreciated by the learners and technological parts that need further development.

    Download full text (pdf)
    fulltext
  • 85.
    Neiberg, Daniel
    et al.
    KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT. KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Ananthakrishnan, Gopal
    KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT. KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT. KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    The Acoustic to Articulation Mapping: Non-linear or Non-unique?2008In: INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, BAIXAS: ISCA-INST SPEECH COMMUNICATION ASSOC , 2008, p. 1485-1488Conference paper (Refereed)
    Abstract [en]

    This paper statistically studies the hypothesis that the acoustic-to-articulatory mapping is non-unique. The distributions of the acoustic and articulatory spaces are obtained by fitting the data to a Gaussian Mixture Model. The kurtosis is used to measure the non-Gaussianity of the distributions, and the Bhattacharyya distance is used to find the difference between distributions of the acoustic vectors producing non-unique articulator configurations. It is found that stop consonants and alveolar fricatives are generally not only non-linear but also non-unique, while dental fricatives are found to be highly non-linear but fairly unique. Two further investigations are also discussed: the first on how well the best possible piecewise linear regression is likely to perform, the second on whether dynamic constraints improve the ability to predict different articulatory regions corresponding to the same region in the acoustic space.
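
    As a rough, hypothetical sketch of the ingredients named in this abstract (GMM fitting, kurtosis as a non-Gaussianity measure, and the Bhattacharyya distance between two Gaussian densities), one might write something like the following; the corpus handling and the actual non-uniqueness analysis of the paper are not reproduced.

    import numpy as np
    from scipy.stats import kurtosis
    from sklearn.mixture import GaussianMixture

    def fit_gmm(X, n_components=8):
        # Fit a full-covariance GMM to a (n_frames, n_dims) feature matrix.
        return GaussianMixture(n_components=n_components,
                               covariance_type="full", random_state=0).fit(X)

    def bhattacharyya(mu1, S1, mu2, S2):
        # Bhattacharyya distance between two Gaussian densities.
        S = 0.5 * (S1 + S2)
        diff = mu1 - mu2
        term1 = 0.125 * diff @ np.linalg.inv(S) @ diff
        term2 = 0.5 * np.log(np.linalg.det(S)
                             / np.sqrt(np.linalg.det(S1) * np.linalg.det(S2)))
        return term1 + term2

    # Hypothetical usage with two acoustic clusters X_a, X_b (arrays of shape
    # (n_frames, n_dims)) that map to distinct articulatory configurations:
    #   print(kurtosis(X_a, axis=0))             # non-Gaussianity per dimension
    #   ga, gb = fit_gmm(X_a, 1), fit_gmm(X_b, 1)
    #   print(bhattacharyya(ga.means_[0], ga.covariances_[0],
    #                       gb.means_[0], gb.covariances_[0]))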

  • 86.
    Picard, Sebastien
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Ananthakrishnan, Gopal
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Wik, Preben
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Abdou, S.
    Detection of Specific Mispronunciations using Audiovisual Features2010In: Auditory-Visual Speech Processing (AVSP) 2010, The International Society for Computers and Their Applications (ISCA) , 2010Conference paper (Refereed)
    Abstract [en]

    This paper introduces a general approach for binary classification of audiovisual data. The intended application is mispronunciation detection for specific phonemic errors, using very sparse training data. The system uses a Support Vector Machine (SVM) classifier with features obtained from a Time-Varying Discrete Cosine Transform (TV-DCT) on the audio log-spectrum as well as on the image sequences. The concatenated feature vectors from both modalities were reduced to a very small subset using a combination of feature selection methods. We achieved 95-100% correct classification for each pair-wise classifier on a database of Swedish vowels with an average of 58 instances per vowel for training. The performance was largely unaffected when tested on data from a speaker who was not included in the training.
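
    A minimal, assumption-laden sketch of this kind of pipeline for the audio stream alone: a 2-D DCT over a log-spectrogram as a stand-in for the TV-DCT features, univariate feature selection, and a binary SVM. The visual features, the exact combination of feature selection methods and the pair-wise setup of the paper are not reproduced, and all names below are illustrative.

    import numpy as np
    from scipy.fft import dctn
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    def tv_dct_features(log_spectrogram, keep=(8, 8)):
        # Keep the lowest keep[0] x keep[1] 2-D DCT coefficients of a
        # (n_frames, n_freq_bins) log-spectrogram and flatten them.
        # Assumes at least keep[0] frames and keep[1] frequency bins.
        coeffs = dctn(log_spectrogram, norm="ortho")
        return coeffs[:keep[0], :keep[1]].ravel()

    def train_detector(spectrograms, labels, n_features=20):
        # labels: 0 = acceptable realisation, 1 = specific mispronunciation.
        feats = np.stack([tv_dct_features(s) for s in spectrograms])
        clf = make_pipeline(StandardScaler(),
                            SelectKBest(f_classif, k=n_features),
                            SVC(kernel="linear"))
        return clf.fit(feats, labels)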

  • 87.
    Wik, Preben
    et al.
    KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT. KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT. KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Can visualization of internal articulators support speech perception?2008In: INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, BAIXAS: ISCA-INST SPEECH COMMUNICATION ASSOC , 2008, p. 2627-2630Conference paper (Refereed)
    Abstract [en]

    This paper describes the contribution to speech perception made by animations of intra-oral articulations. 18 subjects were asked to identify the words in acoustically degraded sentences in three presentation modes: acoustic signal only, audiovisual with a front view of a synthetic face, and audiovisual with both the front view and a side view in which tongue movements were made visible by rendering parts of the cheek transparent. The augmented-reality side view did not help subjects perform better overall than the front view alone, but it appears to have been beneficial for the perception of palatal plosives, liquids and rhotics, especially in clusters. The results indicate that intra-oral animations cannot be expected to support speech perception in general, but that information on some articulatory features can be extracted. Animations of tongue movements therefore have more potential for use in computer-assisted pronunciation and perception training than as a communication aid for the hearing-impaired.

  • 88.
    Wik, Preben
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Looking at tongues – can it help in speech perception?2008In: Proceedings The XXIst Swedish Phonetics Conference, FONETIK 2008, 2008, p. 57-60Conference paper (Other academic)
    Abstract [en]

    This paper describes the contribution to speech perception made by animations of intra-oral articulations. 18 subjects were asked to identify the words in acoustically degraded sentences in three presentation modes: acoustic signal only, audiovisual with a front view of a synthetic face, and audiovisual with both the front view and a side view in which tongue movements were made visible by rendering parts of the cheek transparent. The augmented-reality side view did not help subjects perform better overall than the front view alone, but it appears to have been beneficial for the perception of palatal plosives, liquids and rhotics, especially in clusters.
