151 - 200 of 472
  • 151.
    Edlund, Jens
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Heldner, Mattias
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    |nailon|: Software for Online Analysis of Prosody (2006). Conference paper (Refereed)
    Abstract [en]

    This paper presents /nailon/ - a software package for online real-time prosodic analysis that captures a number of prosodic features relevant for interaction control in spoken dialogue systems. The current implementation captures silence durations; voicing, intensity, and pitch; pseudo-syllable durations; and intonation patterns. The paper provides detailed information on how this is achieved. As an example application of /nailon/, we demonstrate how it is used to improve the efficiency of identifying relevant places at which a machine can legitimately begin to talk to a human interlocutor, as well as to shorten system response times.

  • 152.
    Edlund, Jens
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Heldner, Mattias
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Al Moubayed, Samer
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Gravano, Agustín
    Computer Science Department, University of Buenos Aires.
    Hirschberg, Julia
    Department of Computer Science, Columbia University.
    Very short utterances in conversation (2010). In: Proceedings from Fonetik 2010, Lund, June 2-4, 2010 / [ed] Susanne Schötz, Gilbert Ambrazaitis, Lund, Sweden: Lund University, 2010, p. 11-16. Conference paper (Other academic)
    Abstract [en]

    Faced with the difficulties of finding an operationalized definition of backchannels, we have previously proposed an intermediate, auxiliary unit – the very short utterance (VSU) – which is defined operationally and is automatically extractable from recorded or ongoing dialogues. Here, we extend that work in the following ways: (1) we test the extent to which the VSU/NONVSU distinction corresponds to backchannels/non-backchannels in a different data set that is manually annotated for backchannels – the Columbia Games Corpus; (2) we examine the extent to which VSUs capture other short utterances with a vocabulary similar to backchannels; (3) we propose a VSU method for better managing turn-taking and barge-ins in spoken dialogue systems based on detection of backchannels; and (4) we attempt to detect backchannels with better precision by training a backchannel classifier using durations and inter-speaker relative loudness differences as features. The results show that VSUs indeed capture a large proportion of backchannels – large enough that VSUs can be used to improve spoken dialogue system turn-taking; and that building a reliable backchannel classifier working in real time is feasible.

  • 153.
    Edlund, Jens
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Heldner, Mattias
    Stockholm University, Faculty of Humanities, Department of Linguistics.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    On the effect of the acoustic environment on the accuracy of perception of speaker orientation from auditory cues alone (2012). In: 13th Annual Conference of the International Speech Communication Association 2012, INTERSPEECH 2012, Vol 2, 2012, p. 1482-1485. Conference paper (Refereed)
    Abstract [en]

    The ability of people, and of machines, to determine the position of a sound source in a room is well studied. The related ability to determine the orientation of a directed sound source, on the other hand, is not, but the few studies there are show people to be surprisingly skilled at it. This has bearing for studies of face-to-face interaction and of embodied spoken dialogue systems, as sound source orientation of a speaker is connected to the head pose of the speaker, which is meaningful in a number of ways. The feature most often implicated for detection of sound source orientation is the inter-aural level difference - a feature which is assumed to be more easily exploited in anechoic chambers than in everyday surroundings. We expand here on our previous studies and compare detection of speaker orientation within and outside of the anechoic chamber. Our results show that listeners find the task easier, rather than harder, in everyday surroundings, which suggests that inter-aural level differences are not the only feature at play.

  • 154.
    Edlund, Jens
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Heldner, Mattias
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Gustafson, Joakim
    Voice Technologies, Expert Functions, Teliasonera.
    Two faces of spoken dialogue systems (2006). In: Interspeech 2006 - ICSLP Satellite Workshop Dialogue on Dialogues: Multidisciplinary Evaluation of Advanced Speech-based Interactive Systems, Pittsburgh PA, USA, 2006. Conference paper (Refereed)
    Abstract [en]

    This paper is intended as a basis for discussion. We propose that users may, knowingly or subconsciously, interpret the events that occur when interacting with spoken dialogue systems in more than one way. Put differently, there is more than one metaphor people may use in order to make sense of spoken human-computer dialogue. We further suggest that different metaphors may not play well together. The analysis is consistent with many observations in human-computer interaction and has implications that may be helpful to researchers and developers alike. For example, developers may want to guide users towards a metaphor of their choice and ensure that the interaction is coherent with that metaphor; researchers may need different approaches depending on the metaphor employed in the system they study; and in both cases one would need to have very good reasons to use mixed metaphors.

  • 155.
    Edlund, Jens
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Heldner, Mattias
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Gustafson, Joakim
    Voice Technologies, Expert Functions, Teliasonera, Haninge, Sweden.
    Utterance segmentation and turn-taking in spoken dialogue systems (2005). In: Computer Studies in Language and Speech / [ed] Fisseni, B.; Schmitz, H-C.; Schröder, B.; Wagner, P., Frankfurt am Main, Germany: Peter Lang, 2005, p. 576-587. Chapter in book (Refereed)
    Abstract [en]

    A widely used method for finding places to take turns in spoken dialogue systems is to assume that an utterance ends where the user ceases to speak. Such endpoint detection normally triggers on a certain amount of silence, or non-speech. However, spontaneous speech frequently contains silent pauses inside sentence-like units, for example when the speaker hesitates. This paper presents /nailon/, an on-line, real-time prosodic analysis tool, and a number of experiments in which end-point detection has been augmented with prosodic analysis in order to segment the speech signal into what humans intuitively perceive as utterance-like units.

  • 156.
    Edlund, Jens
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Heldner, Mattias
    Stockholm University, Faculty of Humanities, Department of Linguistics.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Who am I speaking at?: perceiving the head orientation of speakers from acoustic cues alone (2012). In: Proc. of LREC Workshop on Multimodal Corpora 2012, Istanbul, Turkey, 2012. Conference paper (Refereed)
    Abstract [en]

    The ability of people, and of machines, to determine the position of a sound source in a room is well studied. The related ability to determine the orientation of a directed sound source, on the other hand, is not, but the few studies there are show people to be surprisingly skilled at it. This has bearing for studies of face-to-face interaction and of embodied spoken dialogue systems, as sound source orientation of a speaker is connected to the head pose of the speaker, which is meaningful in a number of ways. We describe in passing some preliminary findings that led us onto this line of investigation, and in detail a study in which we extend an experiment design intended to measure perception of gaze direction to test instead for perception of sound source orientation. The results corroborate those of previous studies, and further show that people are very good at performing this skill outside of studio conditions as well.

  • 157.
    Edlund, Jens
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Heldner, Mattias
    Stockholm University, Faculty of Humanities, Department of Linguistics.
    Hjalmarsson, Anna
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    3rd party observer gaze during backchannels (2012). In: Proc. of the Interspeech 2012 Interdisciplinary Workshop on Feedback Behaviors in Dialog, Skamania Lodge, WA, USA, 2012. Conference paper (Refereed)
    Abstract [en]

    This paper describes a study of how the gazes of 3rd party observers of dialogue move when a speaker is taking the turn and producing a back-channel, respectively. The data is collected and basic processing is complete, but the results section for the paper is not yet in place. It will be in time for the workshop, however, and will be presented there, should this paper outline be accepted.

  • 158.
    Edlund, Jens
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Heldner, Mattias
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Pelcé, Antoine
    Prosodic features of very short utterances in dialogue (2009). In: Nordic Prosody: Proceedings of the Xth Conference / [ed] Vainio, Martti; Aulanko, Reijo; Aaltonen, Olli, Frankfurt am Main: Peter Lang, 2009, p. 57-68. Conference paper (Refereed)
  • 159.
    Edlund, Jens
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Hjalmarsson, Anna
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Applications of distributed dialogue systems: the KTH Connector (2005). In: Proceedings of ISCA Tutorial and Research Workshop on Applied Spoken Language Interaction in Distributed Environments (ASIDE 2005), 2005. Conference paper (Refereed)
    Abstract [en]

    We describe a spoken dialogue system domain: that of the personal secretary. This domain allows us to capitalise on the characteristics that make speech a unique interface; characteristics that humans use regularly, implicitly, and with remarkable ease. We present a prototype system - the KTH Connector - and highlight several dialogue research issues arising in the domain.

  • 160.
    Edlund, Jens
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Hjalmarsson, Anna
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Is it really worth it?: Cost-based selection of system responses to speech-in-overlap (2012). In: Proc. of the IVA 2012 workshop on Realtime Conversational Virtual Agents (RCVA 2012), Santa Cruz, CA, USA, 2012. Conference paper (Refereed)
    Abstract [en]

    For purposes of discussion and feedback, we present a preliminary version of a simple yet powerful cost-based framework for spoken dialogue systems to continuously and incrementally decide whether to speak or not. The framework weighs the cost of producing speech in overlap against the cost of not speaking when something needs saying. Main features include a small number of parameters controlling characteristics that are readily understood, allowing manual tweaking as well as interpretation of trained parameter settings; observation-based estimates of expected overlap which can be adapted dynamically; and a simple and general method for context dependency. No evaluation has yet been undertaken, but the effects of the parameters; the observation-based cost of expected overlap trained on Switchboard data; and the context dependency using inter-speaker intensity differences from the same corpus are demonstrated with generated input data in the context of user barge-ins.

  • 161.
    Edlund, Jens
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Hjalmarsson, Anna
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Tånnander, Christina
    The Swedish Library of Talking Books and Braille.
    Unconventional methods in perception experiments (2012). In: Proc. of Nordic Prosody XI, Tartu, Estonia, 2012. Conference paper (Other academic)
  • 162.
    Edlund, Jens
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    House, David
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Gesture movement profiles in dialogues from a Swedish multimodal database of spontaneous speech (2012). In: Prosodic and Visual Resources in Interactional Grammar / [ed] Bergmann, Pia; Brenning, Jana; Pfeiffer, Martin C.; Reber, Elisabeth, Walter de Gruyter, 2012. Chapter in book (Refereed)
  • 163.
    Edlund, Jens
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    House, David
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Skantze, Gabriel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Prosodic Features in the Perception of Clarification Ellipses (2005). In: Proceedings of Fonetik 2005: The XVIIIth Swedish Phonetics Conference, Gothenburg, Sweden, 2005, p. 107-110. Conference paper (Other academic)
    Abstract [en]

    We present an experiment where subjects were asked to listen to Swedish human-computer dialogue fragments where a synthetic voice makes an elliptical clarification after a user turn. The prosodic features of the synthetic voice were systematically varied, and subjects were asked to judge the computer's actual intention. The results show that an early low F0 peak signals acceptance, that a late high peak is perceived as a request for clarification of what was said, and that a mid high peak is perceived as a request for clarification of the meaning of what was said. The study can be seen as the beginnings of a tentative model for intonation of clarification ellipses in Swedish, which can be implemented and tested in spoken dialogue systems.

  • 164.
    Edlund, Jens
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    House, David
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Skantze, Gabriel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    The effects of prosodic features on the interpretation of clarification ellipses (2005). In: Proceedings of Interspeech 2005: Eurospeech, 2005, p. 2389-2392. Conference paper (Refereed)
    Abstract [en]

    In this paper, the effects of prosodic features on the interpretation of elliptical clarification requests in dialogue are studied. An experiment is presented where subjects were asked to listen to short human-computer dialogue fragments in Swedish, where a synthetic voice was making an elliptical clarification after a user turn. The prosodic features of the synthetic voice were systematically varied, and the subjects were asked to judge what was actually intended by the computer. The results show that an early low F0 peak signals acceptance, that a late high peak is perceived as a request for clarification of what was said, and that a mid high peak is perceived as a request for clarification of the meaning of what was said. The study can be seen as the beginnings of a tentative model for intonation of clarification ellipses in Swedish, which can be implemented and tested in spoken dialogue systems.

  • 165.
    Edlund, Jens
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    House, David
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Strömbergsson, Sofia
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Question types and some prosodic correlates in 600 questions in the Spontal database of Swedish dialogues (2012). In: Proceedings of the 6th International Conference on Speech Prosody, Vols I and II, Shanghai, China: Tongji Univ Press, 2012, p. 737-740. Conference paper (Refereed)
    Abstract [en]

    Studies of questions present strong evidence that there is no one-to-one relationship between intonation and interrogative mode. We present initial steps of a larger project investigating and describing intonational variation in the Spontal database of 120 half-hour spontaneous dialogues in Swedish, and testing the hypothesis that the concept of a standard question intonation such as a final pitch rise contrasting a final low declarative intonation is not consistent with the pragmatic use of intonation in dialogue. We report on the extraction of 600 questions from the Spontal corpus, coding and annotation of question typology, and preliminary results concerning some prosodic correlates related to question type.

  • 166.
    Edlund, Jens
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Oertel, Catharine
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Investigating negotiation for load-time in the GetHomeSafe project (2012). In: Proc. of Workshop on Innovation and Applications in Speech Technology (IAST), Dublin, Ireland, 2012, p. 45-48. Conference paper (Refereed)
    Abstract [en]

    This paper describes ongoing work by KTH Speech, Music and Hearing in GetHomeSafe, a newly inaugurated EU project in collaboration with DFKI, Nuance, IBM and Daimler. Under the assumption that drivers will utilize technology while driving regardless of legislation, the project aims at finding out how to make the use of in-car technology as safe as possible rather than prohibiting it. We describe the project in general briefly and our role in some more detail, in particular one of our tasks: to build a system that can ask the driver "is now a good time to speak about X?" in an unobtrusive manner, and that knows how to deal with rejection, for example by asking the driver to get back when it is a good time or to schedule a time that will be convenient.

  • 167.
    Edlund, Jens
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Strömbergsson, Sofia
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    House, David
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Telling questions from statements in spoken dialogue systems (2012). In: Proc. of SLTC 2012, Lund, Sweden, 2012. Conference paper (Refereed)
  • 168.
    Edlund, Jens
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Tånnander, Christina
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Audience response system-based assessment for analysis-by-synthesis (2015). In: Proc. of ICPhS 2015, ICPhS, 2015. Conference paper (Refereed)
  • 169. Eklund, R.
    et al.
    Peters, G.
    Ananthakrishnan, Gopal
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Mabiza, E.
    An acoustic analysis of lion roars. I: Data collection and spectrogram and waveform analyses (2011). In: TMH-QPSR, ISSN 1104-5787, Vol. 51, no 1, p. 1-4. Article in journal (Other academic)
    Abstract [en]

    This paper describes the collection of lion roar data at two different locations, an outdoor setting at Antelope Park in Zimbabwe and an indoor setting at Parken Zoo in Sweden. Preliminary analyses of spectrographic and waveform data are provided.

  • 170.
    Elenius, Daniel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Accounting for Individual Speaker Properties in Automatic Speech Recognition (2010). Licentiate thesis, comprehensive summary (Other academic)
    Abstract [en]

    In this work, speaker characteristic modeling has been applied in the fields of automatic speech recognition (ASR) and automatic speaker verification (ASV). In ASR, a key problem is that acoustic mismatch between training and test conditions degrades classification performance. In this work, a child exemplifies a speaker not represented in training data and methods to reduce the spectral mismatch are devised and evaluated. To reduce the acoustic mismatch, predictive modeling based on spectral speech transformation is applied. Following this approach, a model suitable for a target speaker, not well represented in the training data, is estimated and synthesized by applying vocal tract predictive modeling (VTPM). In this thesis, the traditional static modeling on the utterance level is extended to dynamic modeling. This is accomplished by operating also on sub-utterance units, such as phonemes, phone-realizations, sub-phone realizations and sound frames.

    Initial experiments show that adaptation of an acoustic model trained on adult speech significantly reduced the word error rate of ASR for children, but not to the level of a model trained on children’s speech. Multi-speaker-group training provided an acoustic model that performed recognition for both adults and children within the same model at almost the same accuracy as speaker-group dedicated models, with no added model complexity. In the analysis of the cause of errors, body height of the child was shown to be correlated to word error rate.

    A further result is that the computationally demanding iterative recognition process in standard VTLN can be replaced by synthetically extending the vocal tract length distribution in the training data. A multi-warp model is trained on the extended data and recognition is performed in a single pass. The accuracy is similar to that of the standard technique.

    A concluding experiment in ASR shows that the word error rate can be reduced by extending a static vocal tract length compensation parameter into a temporal parameter track. A key component to reach this improvement was provided by a novel joint two-level optimization process. In the process, the track was determined as a composition of a static and a dynamic component, which were simultaneously optimized on the utterance and sub-utterance level respectively. This had the principal advantage of limiting the modulation amplitude of the track to what is realistic for an individual speaker. The recognition error rate was reduced by 10% relative compared with that of a standard utterance-specific estimation technique.

    The techniques devised and evaluated can also be applied to other speaker characteristic properties, which exhibit a dynamic nature.

    An excursion into ASV led to the proposal of a statistical speaker population model. The model represents an alternative approach for determining the reject/accept threshold in an ASV system instead of the commonly used direct estimation on a set of client and impostor utterances. This is especially valuable in applications where a low false reject or false accept rate is required. In these cases, the number of errors is often too few to estimate a reliable threshold using the direct method. The results are encouraging but need to be verified on a larger database.

  • 171.
    Elenius, Daniel
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Blomberg, Mats
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Adaptation and Normalization Experiments in Speech Recognition for 4 to 8 Year Old Children (2005). Conference paper (Refereed)
    Abstract [en]

    An experimental offline investigation of the performance of connected digits recognition was performed on children in the age range four to eight years. Poor performance using adult models was improved significantly by adaptation and vocal tract length normalisation, but not to the same level as training on children. Age dependent models were tried with limited advantage. A combined adult and child training corpus maintained the performance for the separately trained categories. Linear frequency compression for vocal tract length normalization was attempted, but estimation of the warping factor was sensitive to non-speech segments and background noise. Phoneme-based word modeling outperformed the whole word models, even though the vocabulary only consisted of digits.

  • 172.
    Elenius, Daniel
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Blomberg, Mats
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Characteristics of a Low Reject Mode Speaker Verification System (2002). Conference paper (Refereed)
    Abstract [en]

    The performance of a speaker verification (SV) system is normally determined by the false reject (FRR) and false accept (FAR) rates as averages on a population of test speakers. However, information on the FRR distribution is required when estimating the portion of clients that will suffer from an unacceptably high reject rate. This paper studies this distribution in a population using a SV system operating in low reject mode. Two models of the distribution are proposed and compared with test data. An attempt is also made to tune the decision threshold in order to obtain a desired portion of clients having a reject rate lower than a specified value.

  • 173.
    Elenius, Daniel
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Blomberg, Mats
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Comparing speech recognition for adults and children (2004). In: Proceedings of Fonetik 2004: The XVIIth Swedish Phonetics Conference / [ed] Peter Branderud, Hartmut Traunmüller, Stockholm: Stockholm University, 2004, p. 156-159. Conference paper (Other academic)
    Abstract [en]

    This paper presents initial studies of the performance of a speech recogniser on children's speech when trained on children or adults. A connected-digits recogniser was used for this purpose. The individual digit accuracy among the children is correlated to some features of the child, such as age, gender, fundamental frequency and height. A strong correlation between age and accuracy was found. The accuracy was also found to be lower for child recognition than for adult recognition, even though the recognisers were trained on the correct class of speakers.

  • 174.
    Elenius, Daniel
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Blomberg, Mats
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Dynamic vocal tract length normalization in speech recognition (2010). In: Proceedings from Fonetik 2010: Working Papers 54, Centre for Languages and Literature, Lund University, Sweden, 2010, p. 29-34. Conference paper (Other academic)
    Abstract [en]

    A novel method to account for dynamic speaker characteristic properties in a speech recognition system is presented. The estimated trajectory of a property can be constrained to be constant or to have a limited rate-of-change within a phone or a sub-phone state. The constraints are implemented by extending each state in the trained Hidden Markov Model by a number of property-value-specific sub-states transformed from the original model. The connections in the transition matrix of the extended model define possible slopes of the trajectory. Constraints on its dynamic range during an utterance are implemented by decomposing the trajectory into a static and a dynamic component. Results are presented on vocal tract length normalization in connected-digit recognition of children's speech using models trained on male adult speech. The word error rate was reduced compared with the conventional utterance-specific warping factor by 10% relative.

  • 175.
    Elenius, Daniel
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Blomberg, Mats
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    On Extending VTLN to Phoneme-specific Warping in Automatic Speech Recognition (2009). In: Proceedings of Fonetik 2009, 2009. Conference paper (Other academic)
    Abstract [en]

    Phoneme- and formant-specific warping has been shown to decrease formant and cepstral mismatch. These findings have not yet been fully implemented in speech recognition. This paper discusses a few possible reasons for this. A small experimental study is also included where phoneme-independent warping is extended towards phoneme-specific warping. The results of this investigation did not show a significant decrease in error rate during recognition. This is also in line with earlier experiments with the methods discussed in the paper.

  • 176.
    Elenius, Daniel
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Blomberg, Mats
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Vocal tract length compensation in the signal and model domains in child speech recognition (2007). In: Proceedings of Fonetik: TMH-QPSR, 2007, p. 41-44. Conference paper (Other academic)
    Abstract [en]

    In a newly started project, KOBRA, we study methods to reduce the required amount of training data for speech recognition by combining the conventional data-driven training approach with available partial knowledge on speech production, implemented as transformation functions in the acoustic, articulatory and speaker characteristic domains. Initially, we investigate one well-known dependence, the inverse proportional relation between vocal tract length and formant frequencies. In this report, we have replaced the conventional technique of frequency warping the unknown input utterance (VTLN) by transforming the training data instead. This enables phoneme-dependent warping to be performed. In another experiment, we expanded the available training data by duplicating each training utterance into a number of differently warped instances. Training on this expanded corpus results in models, each one representing the whole range of vocal tract length variation. This technique allows every frame of the utterance to be warped differently. The computational load is reduced by an order of magnitude compared to conventional VTLN without noticeable decrease in performance on the task of recognising children’s speech using models trained on adult speech.

  • 177.
    Elenius, Kjell
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Forsbom, Eva
    Uppsala University.
    Megyesi, Beata
    Uppsala University.
    Language Resources and Tools for Swedish: A Survey (2008). In: Proc of LREC 2008, Marrakech, Morocco, 2008. Conference paper (Other academic)
    Abstract [en]

    Language resources and tools to create and process these resources are necessary components in human language technology and natural language applications. In this paper, we describe a survey of existing language resources for Swedish, and the need for Swedish language resources to be used in research and real-world applications in language technology as well as in linguistic research. The survey is based on a questionnaire sent to industry and academia, institutions and organizations, and to experts involved in the development of Swedish language resources in Sweden, the Nordic countries and world-wide.

  • 178.
    Enflo, Laura
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Alternative Measures of Phonation: Collision Threshold Pressure and Electroglottographic Spectral Tilt: Extra: Perception of Swedish Accents (2010). Licentiate thesis, comprehensive summary (Other academic)
    Abstract [en]

    The collision threshold pressure (CTP), i.e. the smallest amount of subglottal pressure needed for vocal fold collision, has been explored as a possible complement or alternative to the now commonly used phonation threshold pressure (PTP), i.e. the smallest amount of subglottal pressure needed to initiate and sustain vocal fold oscillation. In addition, the effects of vocal warm-up (Paper 1) and vocal loading (Paper 2) on the CTP and the PTP have been investigated. Results confirm previous findings that PTP increases with an increase in fundamental frequency (F0) of phonation and this is true also for CTP, which on average is about 4 cm H2O higher than the PTP. Statistically significant increases of the CTP and PTP after vocal loading were confirmed and after the vocal warm-up, the threshold pressures were generally lowered although these results were significant only for the females. The vocal loading effect was minor for the two singer subjects who participated in the experiment of Paper 2.

    In Paper 3, the now commonly used audio spectral tilt (AST) is measured on the vowels of a large database (5277 sentences) containing speech of one male Swedish actor. Moreover, the new measure electroglottographic spectral tilt (EST) is calculated from the derivatives of the electroglottographic signals (DEGG) of the same database. Both AST and EST were checked for vowel dependency and the results show that while AST is vowel dependent, EST is not.

    Paper 4 reports the findings from a perception experiment on Swedish accents performed on 47 Swedish native speakers from the three main parts of Sweden. Speech consisting of one sentence chosen for its prosodically interesting properties and spoken by 72 speakers was played in headphones. The subjects would then try to locate the origin of every speaker on a map of Sweden. Results showed for example that the accents of the capital of Sweden (Stockholm), Gotland and southern Sweden were the ones placed correctly to the highest degree.

  • 179.
    Enflo, Laura
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Vowel Dependence for Electroglottography and Audio Spectral Tilt (2010). In: Proceedings of Fonetik, Lund, 2010, p. 35-39. Conference paper (Other academic)
  • 180.
    Enflo, Laura
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Sundberg, Johan
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Vocal fold collision threshold pressure: An alternative to phonation threshold pressure? (2009). In: Logopedics, Phoniatrics, Vocology, ISSN 1401-5439, E-ISSN 1651-2022, Vol. 34, no 4, p. 210-217. Article in journal (Refereed)
    Abstract [en]

    Phonation threshold pressure (PTP), frequently used for characterizing vocal fold properties, is often difficult to measure. This investigation analyses the lowest pressure initiating vocal fold collision (CTP). Microphone, electroglottograph (EGG), and oral pressure signals were recorded, before and after vocal warm-up, in 15 amateur singers, repeating the syllable /pa:/ at several fundamental frequencies with gradually decreasing vocal loudness. Subglottal pressure was estimated from oral pressure during the p-occlusion, using the audio and the EGG amplitudes as criteria for PTP and CTP. The coefficient of variation was mostly lower for CTP than for PTP. Both CTP and PTP tended to be higher before than after the warm-up. The results support the conclusion that CTP is a promising parameter in investigations of vocal fold characteristics.

  • 181.
    Enflo, Laura
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Sundberg, Johan
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Pabst, Friedemann
    Hospital Dresden Friedrichstadt.
    Collision Threshold Pressure Before and After Vocal Loading (2009). In: INTERSPEECH 2009: 10th Annual Conference of the International Speech Communication Association 2009, 2009, p. 764-767. Conference paper (Refereed)
    Abstract [en]

    The phonation threshold pressure (PTP) has been found to increase during vocal fatigue. In the present study we compare PTP and collision threshold pressure (CTP) before and after vocal loading in singer and non-singer voices. Seven subjects repeated the vowel sequence /a,c,i,o,u/ at an SPL of at least 80 dB @ 0.3 m for 20 min. Before and after this loading the subjects' voices were recorded while they produced a diminuendo repeating the syllable /pa/. Oral pressure during the /p/ occlusion was used as a measure of subglottal pressure. Both CTP and PTP increased significantly after the vocal loading.

  • 182.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Analysis of and feedback on phonetic features in pronunciation training with a virtual teacher (2012). In: Computer Assisted Language Learning, ISSN 0958-8221, E-ISSN 1744-3210, Vol. 25, no 1, p. 37-64. Article in journal (Refereed)
    Abstract [en]

    Pronunciation errors may be caused by several different deviations from the target, such as voicing, intonation, insertions or deletions of segments, or that the articulators are placed incorrectly. Computer-animated pronunciation teachers could potentially provide important assistance on correcting all these types of deviations, but they have an additional benefit for articulatory errors. By making parts of the face transparent, they can show the correct position and shape of the tongue and provide audiovisual feedback on how to change erroneous articulations. Such a scenario however requires firstly that the learner's current articulation can be estimated with precision and secondly that the learner is able to imitate the articulatory changes suggested in the audiovisual feedback. This article discusses both these aspects, with one experiment on estimating the important articulatory features from a speaker through acoustic-to-articulatory inversion and one user test with a virtual pronunciation teacher, in which the articulatory changes made by seven learners who receive audiovisual feedback are monitored using ultrasound imaging.

  • 183.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Articulatory synthesis using corpus-based estimation of line spectrum pairs (2005). In: 9th European Conference on Speech Communication and Technology, 2005, p. 1909-1912. Conference paper (Refereed)
    Abstract [en]

    An attempt to define a new articulatory synthesis method, in which the speech signal is generated through a statistical estimation of its relation with articulatory parameters, is presented. A corpus containing acoustic material and simultaneous recordings of the tongue and facial movements was used to train and test the articulatory synthesis of VCV words and short sentences. Tongue and facial motion data, captured with electromagnetic articulography and three-dimensional optical motion tracking, respectively, define articulatory parameters of a talking head. These articulatory parameters are then used as estimators of the speech signal, represented by line spectrum pairs. The statistical link between the articulatory parameters and the speech signal was established using either linear estimation or artificial neural networks. The results show that the linear estimation was only enough to synthesize identifiable vowels, but not consonants, whereas the neural networks gave a perceptually better synthesis.

  • 184.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Assessing MRI measurements: Effects of sustenation, gravitation and coarticulation (2006). In: Speech production: Models, Phonetic Processes and Techniques / [ed] Harrington, J.; Tabain, M., New York: Psychology Press, 2006, p. 301-314. Chapter in book (Refereed)
  • 185.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Augmented Reality Talking Heads as a Support for Speech Perception and Production (2011). In: Augmented Reality: Some Emerging Application Areas / [ed] Nee, Andrew Yeh Ching, IN-TECH, 2011, p. 89-114. Chapter in book (Refereed)
  • 186.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Bättre tala än texta - talteknologi nu och i framtiden [Better to speak than to text - speech technology now and in the future] (2008). In: Tekniken bakom språket [The technology behind language] / [ed] Domeij, Rickard, Stockholm: Norstedts Akademiska Förlag, 2008, p. 98-118. Chapter in book (Other (popular science, discussion, etc.))
  • 187.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT. KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Can audio-visual instructions help learners improve their articulation?: an ultrasound study of short term changes (2008). In: INTERSPEECH 2008: 9th Annual Conference of the International Speech Communication Association 2008, Baixas: ISCA-Inst Speech Communication Assoc, 2008, p. 2631-2634. Conference paper (Refereed)
    Abstract [en]

    This paper describes how seven French subjects change their pronunciation and articulation when practising Swedish words with a computer-animated virtual teacher. The teacher gives feedback on the user's pronunciation with audiovisual instructions suggesting how the articulation should be changed. A wizard-of-Oz set-up was used for the training session, in which a human listener chose the adequate pre-generated feedback based on the user's pronunciation. The subjects' articulation changes were monitored during the practice session with a hand-held ultrasound probe. The perceptual analysis indicates that the subjects improved their pronunciation during the training and the ultrasound measurements suggest that the improvement was made by following the articulatory instructions given by the computer-animated teacher.

  • 188.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Datoranimerade talande ansikten [Computer-animated talking faces] (2012). In: Människans ansikten: Emotion, interaktion och konst [Human faces: Emotion, interaction and art] / [ed] Adelswärd, V.; Forstorp, P-A., Stockholm: Carlssons Bokförlag, 2012. Chapter in book (Other academic)
  • 189.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Evaluation of speech inversion using an articulatory classifier (2006). In: Proceedings of the Seventh International Seminar on Speech Production / [ed] Yehia, H.; Demolin, D.; Laboissière, R., 2006, p. 469-476. Conference paper (Refereed)
    Abstract [en]

    This paper presents an evaluation method for statistically based speech inversion, in which the estimated vocal tract shapes are classified into phoneme categories based on the articulatory correspondence with prototype vocal tract shapes. The prototypes are created using the original articulatory data, and the classifier hence permits interpreting the results of the inversion in terms of, e.g., confusions between different articulations and the success in estimating different places of articulation. The articulatory classifier was used to evaluate acoustic and audiovisual speech inversion of VCV words and Swedish sentences performed with a linear estimation and an artificial neural network.

  • 190.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Feedback strategies of human and virtual tutors in pronunciation training (2006). In: TMH-QPSR, ISSN 1104-5787, Vol. 48, no 1, p. 011-034. Article in journal (Other academic)
    Abstract [en]

    This paper presents a survey of language teachers’ and their students’ attitudes and practice concerning the use of corrective feedback in pronunciation training. The aim of the study is to identify feedback strategies that can be used successfully in a computer assisted pronunciation training system with a virtual tutor giving articulatory instructions and feedback. The study was carried out using focus group meetings, individual semi-structured interviews and classroom observations. Implications for computer assisted pronunciation training are presented and some have been tested with 37 users in a short practice session with a virtual teacher.

  • 191.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Introducing visual cues in acoustic-to-articulatory inversion (2005). In: Interspeech 2005: 9th European Conference on Speech Communication and Technology, 2005, p. 3205-3208. Conference paper (Refereed)
    Abstract [en]

    The contribution of facial measures in a statistical acoustic-to-articulatory inversion has been investigated. The tongue contour was estimated using a linear estimation from either acoustics or acoustics and facial measures. Measures of the lateral movement of lip corners and the vertical movement of the upper and lower lip and the jaw gave a substantial improvement over the audio-only case. It was further found that adding the corresponding articulatory measures that could be extracted from a profile view of the face, i.e. the protrusion of the lips, lip corners and the jaw, did not give any additional improvement of the inversion result. The present study hence suggests that audiovisual-to-articulatory inversion can just as well be performed using front view monovision of the face, rather than stereovision of both the front and profile view.

  • 192.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Is there a McGurk effect for tongue reading? (2010). In: Proceedings of AVSP: International Conference on Audio-Visual Speech Processing, 2010. Conference paper (Refereed)
    Abstract [en]

    Previous studies on tongue reading, i.e., speech perception of degraded audio supported by animations of tongue movements, have indicated that the support is weak initially and that subjects need training to learn to interpret the movements. This paper investigates if the learning is of the animation templates as such or if subjects learn to retrieve articulatory knowledge that they already have. Matching and conflicting animations of tongue movements were presented randomly together with the auditory speech signal at three different levels of noise in a consonant identification test. The average recognition rate over the three noise levels was significantly higher for the matched audiovisual condition than for the conflicting and the auditory only. Audiovisual integration effects were also found for conflicting stimuli. However, the visual modality is given much less weight in the perception than for a normal face view, and inter-subject differences in the use of visual information are large.

  • 193.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Pronunciation analysis by acoustic-to-articulatory feature inversion (2012). In: Proceedings of the International Symposium on Automatic Detection of Errors in Pronunciation Training / [ed] Engwall, O., Stockholm, 2012, p. 79-84. Conference paper (Refereed)
    Abstract [en]

    Second language learners may require assistance correcting their articulation of unfamiliar phonemes in order to reach the target pronunciation. If, e.g., a talking head is to provide the learner with feedback on how to change the articulation, a required first step is to be able to analyze the learner's articulation. This paper describes how a specialized restricted acoustic-to-articulatory inversion procedure may be used for this analysis. The inversion is trained on simultaneously recorded acoustic-articulatory data of one native speaker of Swedish, and four different experiments investigate how it performs for the original speaker, using acoustic input; for the original speaker, using acoustic input and visual information; for four other speakers; and for correct and mispronounced phones uttered by two non-native speakers.

  • 194.
    Engwall, Olov
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Bälter, Olle
    KTH, School of Computer Science and Communication (CSC), Human - Computer Interaction, MDI.
    Öster, Anne-Marie
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Kjellström, Hedvig
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Designing the user interface of the computer-based speech training system ARTUR based on early user tests (2006). In: Behavior and Information Technology, ISSN 0144-929X, E-ISSN 1362-3001, Vol. 25, no 4, p. 353-365. Article in journal (Refereed)
    Abstract [en]

    This study has been performed in order to evaluate a prototype for the human-computer interface of a computer-based speech training aid named ARTUR. The main feature of the aid is that it can give suggestions on how to improve articulations. Two user groups were involved: three children aged 9-14 with extensive experience of speech training with therapists and computers, and three children aged 6, with little or no prior experience of computer-based speech training. All children had general language disorders. The study indicates that the present interface is usable without prior training or instructions, even for the younger children, but that more motivational factors should be introduced. The granularity of the mesh that classifies mispronunciations was satisfactory, but the flexibility and level of detail of the feedback should be developed further.

  • 195.
    Engwall, Olov
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Bälter, Olle
    KTH, School of Computer Science and Communication (CSC), Human - Computer Interaction, MDI.
    Öster, Anne-Marie
    Kjellström, Hedvig
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Feedback management in the pronunciation training system ARTUR (2006). In: Proceedings of CHI 2006, 2006, p. 231-234. Conference paper (Refereed)
    Abstract [en]

    This extended abstract discusses the development of a computer-assisted pronunciation training system that gives articulatory feedback, and in particular the management of feedback given to the user.

  • 196.
    Engwall, Olov
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Delvaux, V.
    Metens, T.
    Interspeaker Variation in the Articulation of French Nasal Vowels (2006). In: Proceedings of the Seventh International Seminar on Speech Production, 2006, p. 3-10. Conference paper (Refereed)
  • 197.
    Engwall, Olov
    et al.
    KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT. KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Wik, Preben
    KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT. KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Are real tongue movements easier to speech read than synthesized? (2009). In: INTERSPEECH 2009: 10th Annual Conference of the International Speech Communication Association 2009, Baixas: ISCA-Inst Speech Communication Assoc, 2009, p. 824-827. Conference paper (Refereed)
    Abstract [en]

    Speech perception studies with augmented reality displays in talking heads have shown that tongue reading abilities are weak initially, but that subjects become able to extract some information from intra-oral visualizations after a short training session. In this study, we investigate how the nature of the tongue movements influences the results, by comparing synthetic rule-based and actual, measured movements. The subjects were significantly better at perceiving sentences accompanied by real movements, indicating that the current coarticulation model developed for facial movements is not optimal for the tongue.

  • 198.
    Engwall, Olov
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Wik, Preben
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Can you tell if tongue movements are real or synthetic? (2009). In: Proceedings of AVSP, 2009. Conference paper (Refereed)
    Abstract [en]

    We have investigated if subjects are aware of what natural tongue movements look like, by showing them animations based on either measurements or rule-based synthesis. The issue is of interest since a previous audiovisual speech perception study recently showed that the word recognition rate in sentences with degraded audio was significantly better with real tongue movements than with synthesized. The subjects in the current study could as a group not tell which movements were real, with a classification score at chance level. About half of the subjects were significantly better at discriminating between the two types of animations, but their classification score was as often well below chance as above. The correlation between classification score and word recognition rate for subjects who also participated in the perception study was very weak, suggesting that the higher recognition score for real tongue movements may be due to subconscious, rather than conscious, processes. This finding could potentially be interpreted as an indication that audiovisual speech perception is based on articulatory gestures.

  • 199.
    Engwall, Olov
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Wik, Preben
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Real vs. rule-generated tongue movements as an audio-visual speech perception support (2009). In: Proceedings of Fonetik 2009 / [ed] Peter Branderud, Hartmut Traunmüller, Stockholm: Stockholm University, 2009, p. 30-35. Conference paper (Other academic)
    Abstract [en]

    We have conducted two studies in which animations created from real tongue movements and rule-based synthesis are compared. We first studied if the two types of animations were different in terms of how much support they give in a perception task. Subjects achieved a significantly higher word recognition rate in sentences when animations were shown compared to the audio only condition, and a significantly higher score with real movements than with synthesized. We then performed a classification test, in which subjects should indicate if the animations were created from measurements or from rules. The results show that the subjects as a group are unable to tell if the tongue movements are real or not. The stronger support from real movements hence appears to be due to subconscious factors.

  • 200.
    Engwall, Olov
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Wik, Preben
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Granström, Björn
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Design strategies for a virtual language tutor (2004). In: INTERSPEECH 2004, ICSLP, 8th International Conference on Spoken Language Processing, Jeju Island, Korea, October 4-8, 2004 / [ed] Kim, S. H.; Young, D. H., Jeju Island, Korea, 2004, p. 1693-1696. Conference paper (Refereed)
    Abstract [en]

    In this paper we discuss work in progress on an interactive talking agent as a virtual language tutor in CALL applications. The ambition is to create a tutor that can be engaged in many aspects of language learning from detailed pronunciation to conversational training. Some of the crucial components of such a system are described. An initial implementation of a stress/quantity training scheme will be presented.
