Search results 1 - 50 of 65
  • 1.
    Ananthakrishnan, Gopal
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Badin, P.
    GIPSA-Lab, Grenoble University.
    Vargas, J. A. V.
    GIPSA-Lab, Grenoble University.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Predicting Unseen Articulations from Multi-speaker Articulatory Models (2010). In: Proceedings of the 11th Annual Conference of the International Speech Communication Association, INTERSPEECH 2010, Makuhari, Japan, 2010, p. 1588-1591. Conference paper (Refereed)
    Abstract [en]

    In order to study inter-speaker variability, this work aims to assess the generalization capabilities of data-based multi-speaker articulatory models. We use various three-mode factor analysis techniques to model the variations of midsagittal vocal tract contours obtained from MRI images for three French speakers articulating 73 vowels and consonants. Articulations of a given speaker for phonemes not present in the training set are then predicted by inversion of the models from measurements of these phonemes articulated by the other subjects. On average, the prediction RMSE was 5.25 mm for tongue contours, and 3.3 mm for 2D midsagittal vocal tract distances. In addition, this study has established a methodology to determine the optimal number of factors for such models.

  • 2.
    Ananthakrishnan, Gopal
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Important regions in the articulator trajectory (2008). In: Proceedings of International Seminar on Speech Production / [ed] Rudolph Sock, Susanne Fuchs, Yves Laprie, Strasbourg, France: INRIA, 2008, p. 305-308. Conference paper (Refereed)
    Abstract [en]

    This paper deals with identifying important regions in the articulatory trajectory based on the physical properties of the trajectory. A method to locate critical time instants as well as the key articulator positions is suggested. Acoustic-to-Articulatory Inversion using linear and non-linear regression is performed using only these critical points. The accuracy of inversion is found to be almost the same as using all the data points.

  • 3.
    Ananthakrishnan, Gopal
    et al.
    KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT. KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT. KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Mapping between acoustic and articulatory gestures (2011). In: Speech Communication, ISSN 0167-6393, E-ISSN 1872-7182, Vol. 53, no 4, p. 567-589. Article in journal (Refereed)
    Abstract [en]

    This paper proposes a definition for articulatory as well as acoustic gestures along with a method to segment the measured articulatory trajectories and acoustic waveforms into gestures. Using a simultaneously recorded acoustic-articulatory database, the gestures are detected based on finding critical points in the utterance, both in the acoustic and articulatory representations. The acoustic gestures are parameterized using 2-D cepstral coefficients. The articulatory trajectories are essentially the horizontal and vertical movements of Electromagnetic Articulography (EMA) coils placed on the tongue, jaw and lips along the midsagittal plane. The articulatory movements are parameterized using 2D-DCT using the same transformation that is applied on the acoustics. The relationship between the detected acoustic and articulatory gestures in terms of the timing as well as the shape is studied. In order to study this relationship further, acoustic-to-articulatory inversion is performed using GMM-based regression. The accuracy of predicting the articulatory trajectories from the acoustic waveforms is on par with state-of-the-art frame-based methods with dynamical constraints (with an average error of 1.45-1.55 mm for the two speakers in the database). In order to evaluate the acoustic-to-articulatory inversion in a more intuitive manner, a method based on the error in estimated critical points is suggested. Using this method, it was noted that the estimated articulatory trajectories using the acoustic-to-articulatory inversion methods were still not accurate enough to be within the perceptual tolerance of audio-visual asynchrony.
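
    As a rough illustration of the parameterization described in this abstract, the sketch below applies a 2-D discrete cosine transform to a block of articulatory trajectories and keeps only the low-order coefficients. It is a minimal sketch with made-up array shapes, not the authors' implementation; the function names and the number of retained coefficients are illustrative assumptions.

```python
import numpy as np
from scipy.fft import dctn, idctn

def dct2_parameterize(traj_block, n_time=5, n_chan=4):
    """Compress a (time x channel) block of articulatory trajectories with a
    2-D DCT, keeping only the low-order (slowly varying) coefficients."""
    coeffs = dctn(traj_block, type=2, norm="ortho")
    return coeffs[:n_time, :n_chan]

def dct2_reconstruct(kept, shape):
    """Approximate reconstruction from the retained coefficients."""
    full = np.zeros(shape)
    full[:kept.shape[0], :kept.shape[1]] = kept
    return idctn(full, type=2, norm="ortho")

# toy example: 100 time frames of 12 channels (x/y of 6 EMA coils)
traj_block = np.random.randn(100, 12)
kept = dct2_parameterize(traj_block)
approx = dct2_reconstruct(kept, traj_block.shape)
print(kept.shape, np.sqrt(np.mean((traj_block - approx) ** 2)))
```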

  • 4.
    Ananthakrishnan, Gopal
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Resolving Non-uniqueness in the Acoustic-to-Articulatory Mapping (2011). In: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, Prague, Czech Republic, 2011, p. 4628-4631. Conference paper (Refereed)
    Abstract [en]

    This paper studies the role of non-uniqueness in the Acoustic-to-Articulatory Inversion. It is generally believed that applying continuity constraints to the estimates of the articulatory parameters can resolve the problem of non-uniqueness. This paper tries to find out whether all instances of non-uniqueness can be resolved using continuity constraints. The investigation reveals that applying continuity constraints provides the best estimate in roughly 50 to 53% of the non-unique mappings. Roughly 8 to 13% of the non-unique mappings are best estimated by choosing discontinuous paths along the hypothetical high probability estimates of articulatory trajectories.

  • 5.
    Ananthakrishnan, Gopal
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Neiberg, Daniel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Exploring the Predictability of Non-Unique Acoustic-to-Articulatory Mappings (2012). In: IEEE Transactions on Audio, Speech, and Language Processing, ISSN 1558-7916, E-ISSN 1558-7924, Vol. 20, no 10, p. 2672-2682. Article in journal (Refereed)
    Abstract [en]

    This paper explores statistical tools that help analyze the predictability in the acoustic-to-articulatory inversion of speech, using an Electromagnetic Articulography database of simultaneously recorded acoustic and articulatory data. Since it has been shown that speech acoustics can be mapped to non-unique articulatory modes, the variance of the articulatory parameters is not sufficient to understand the predictability of the inverse mapping. We, therefore, estimate an upper bound to the conditional entropy of the articulatory distribution. This provides a probabilistic estimate of the range of articulatory values (either over a continuum or over discrete non-unique regions) for a given acoustic vector in the database. The analysis is performed for different British/Scottish English consonants with respect to which articulators (lips, jaws or the tongue) are important for producing the phoneme. The paper shows that acoustic-articulatory mappings for the important articulators have a low upper bound on the entropy, but can still have discrete non-unique configurations.
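
    The sketch below shows one generic way to compute an upper bound of the kind mentioned in the abstract: for a query acoustic vector, take the articulatory values of the acoustically nearest frames and use the entropy of a Gaussian with their covariance, which upper-bounds the differential entropy of any distribution with that covariance. It is an illustration with random stand-in data and an assumed nearest-neighbour conditioning, not the estimator used in the paper.

```python
import numpy as np

def conditional_entropy_upper_bound(acoustic, articulatory, query, k=200):
    """Upper-bound the differential entropy of articulatory parameters given an
    acoustic vector: condition on the k acoustically nearest frames and use the
    Gaussian entropy of their articulatory covariance (a Gaussian maximizes
    entropy for a fixed covariance)."""
    dist = np.linalg.norm(acoustic - query, axis=1)
    neighbours = articulatory[np.argsort(dist)[:k]]
    dim = neighbours.shape[1]
    _, logdet = np.linalg.slogdet(np.cov(neighbours, rowvar=False))
    return 0.5 * (dim * np.log(2 * np.pi * np.e) + logdet)

# toy data: 5000 frames of 13-D acoustic and 8-D articulatory features
rng = np.random.default_rng(0)
acoustic = rng.normal(size=(5000, 13))
articulatory = rng.normal(size=(5000, 8))
print(conditional_entropy_upper_bound(acoustic, articulatory, acoustic[0]))
```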

  • 6.
    Ananthakrishnan, Gopal
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Neiberg, Daniel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    In search of Non-uniqueness in the Acoustic-to-Articulatory Mapping (2009). In: INTERSPEECH 2009: 10TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2009, BAIXAS: ISCA-INST SPEECH COMMUNICATION ASSOC, 2009, p. 2799-2802. Conference paper (Refereed)
    Abstract [en]

    This paper explores the possibility and extent of non-uniqueness in the acoustic-to-articulatory inversion of speech, from a statistical point of view. It proposes a technique to estimate the non-uniqueness, based on finding peaks in the conditional probability function of the articulatory space. The paper corroborates the existence of non-uniqueness in a statistical sense, especially in stop consonants, nasals and fricatives. The relationship between the importance of the articulator position and non-uniqueness at each instance is also explored.

  • 7.
    Ananthakrishnan, Gopal
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Wik, Preben
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Detecting confusable phoneme pairs for Swedish language learners depending on their first language (2011). In: TMH-QPSR, ISSN 1104-5787, Vol. 51, no 1, p. 89-92. Article in journal (Other academic)
    Abstract [en]

    This paper proposes a paradigm where commonly made segmental pronunciation errors are modeled as pair-wise confusions between two or more phonemes in the language that is being learnt. The method uses an ensemble of support vector machine classifiers with time varying Mel frequency cepstral features to distinguish between several pairs of phonemes. These classifiers are then applied to classify the phonemes uttered by second language learners. Using this method, an assessment is made regarding the typical pronunciation problems that students learning Swedish would encounter, depending on their first language.
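
    A minimal sketch of the pair-wise classifier idea is given below, assuming phoneme segments with known labels and summarizing each segment by MFCC means and standard deviations rather than the time-varying features used in the paper. The data layout and function names are illustrative assumptions.

```python
import numpy as np
import librosa
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def mfcc_summary(wav, sr, n_mfcc=13):
    """Fixed-length feature vector for one phoneme segment: MFCC mean and std."""
    m = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([m.mean(axis=1), m.std(axis=1)])

def train_pairwise_svms(segments, labels, phoneme_pairs):
    """Train one binary SVM per confusable phoneme pair, e.g. ('y', 'u').
    `segments` is a list of (waveform, sample_rate) tuples."""
    X = np.array([mfcc_summary(w, sr) for w, sr in segments])
    y = np.array(labels)
    classifiers = {}
    for a, b in phoneme_pairs:
        mask = np.isin(y, [a, b])
        clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True))
        clf.fit(X[mask], y[mask])
        classifiers[(a, b)] = clf
    return classifiers
```

    A learner's utterances could then be scored with the classifier for each pair that is expected to be confusable given their first language.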

  • 8.
    Ananthakrishnan, Gopal
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Wik, Preben
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Abdou, Sherif
    Faculty of Computers & Information, Cairo University, Egypt.
    Using an Ensemble of Classifiers for Mispronunciation Feedback (2011). In: Proceedings of SLaTE / [ed] Strik, H.; Delmonte, R.; Russel, M., Venice, Italy, 2011. Conference paper (Refereed)
    Abstract [en]

    This paper proposes a paradigm where commonly made segmental pronunciation errors are modeled as pair-wise confusions between two or more phonemes in the language that is being learnt. The method uses an ensemble of support vector machine classifiers with time varying Mel frequency cepstral features to distinguish between several pairs of phonemes. These classifiers are then applied to classify the phonemes uttered by second language learners. Instead of providing feedback at every mispronounced phoneme, the method attempts to provide feedback about typical mispronunciations by a certain student, over an entire session of several utterances. Two case studies that demonstrate how the paradigm is applied to provide suitable feedback to two students are also described in this paper.

  • 9. Arnela, Marc
    et al.
    Blandin, Rémi
    Dabbaghchian, Saeed
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Guasch, Oriol
    Alías, Francesc
    Pelorson, Xavier
    Van Hirtum, Annemie
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Influence of lips on the production of vowels based on finite element simulations and experiments (2016). In: Journal of the Acoustical Society of America, ISSN 0001-4966, E-ISSN 1520-8524, Vol. 139, no 5, p. 2852-2859. Article in journal (Refereed)
    Abstract [en]

    Three-dimensional (3-D) numerical approaches for voice production are currently being investigated and developed. Radiation losses produced when sound waves emanate from the mouth aperture are one of the key aspects to be modeled. When doing so, the lips are usually removed from the vocal tract geometry in order to impose a radiation impedance on a closed cross-section, which speeds up the numerical simulations compared to free-field radiation solutions. However, lips may play a significant role. In this work, the lips' effects on vowel sounds are investigated by using 3-D vocal tract geometries generated from magnetic resonance imaging. To this aim, two configurations for the vocal tract exit are considered: with lips and without lips. The acoustic behavior of each is analyzed and compared by means of time-domain finite element simulations that allow free-field wave propagation and experiments performed using 3-D-printed mechanical replicas. The results show that the lips should be included in order to correctly model vocal tract acoustics not only at high frequencies, as commonly accepted, but also in the low frequency range below 4 kHz, where plane wave propagation occurs.

  • 10. Arnela, Marc
    et al.
    Dabbaghchian, Saeed
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Blandin, Rémi
    Guasch, Oriol
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Hirtum, Annemie Van
    Pelorson, Xavier
    Influence of vocal tract geometry simplifications on the numerical simulation of vowel sounds (2016). In: Journal of the Acoustical Society of America, ISSN 0001-4966, E-ISSN 1520-8524, Vol. 140, no 3, p. 1707-1718. Article in journal (Refereed)
    Abstract [en]

    For many years, the vocal tract shape has been approximated by one-dimensional (1D) area functions to study the production of voice. More recently, 3D approaches allow one to deal with the complex 3D vocal tract, although area-based 3D geometries of circular cross-section are still in use. However, little is known about the influence of performing such a simplification, and some alternatives may exist between these two extreme options. To this aim, several vocal tract geometry simplifications for vowels [ɑ], [i], and [u] are investigated in this work. Six cases are considered, consisting of realistic, elliptical, and circular cross-sections interpolated through a bent or straight midline. For frequencies below 4–5 kHz, the influence of bending and cross-sectional shape has been found weak, while above these values simplified bent vocal tracts with realistic cross-sections are necessary to correctly emulate higher-order mode propagation. To perform this study, the finite element method (FEM) has been used. FEM results have also been compared to a 3D multimodal method and to a classical 1D frequency domain model.

  • 11. Arnela, Marc
    et al.
    Dabbaghchian, Saeed
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Blandin, Rémi
    Guasch, Oriol
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Pelorson, Xavier
    Van Hirtum, Annemie
    Effects of vocal tract geometry simplifications on the numerical simulation of vowels (2015). In: PAN EUROPEAN VOICE CONFERENCE ABSTRACT BOOK: Proceedings e report 104, Firenze University Press, 2015, p. 177. Conference paper (Other academic)
  • 12. Arnela, Marc
    et al.
    Dabbaghchian, Saeed
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Guasch, Oriol
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    A semi-polar grid strategy for the three-dimensional finite element simulation of vowel-vowel sequences (2017). In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH 2017, The International Speech Communication Association (ISCA), 2017, Vol. 2017, p. 3477-3481. Conference paper (Refereed)
    Abstract [en]

    Three-dimensional computational acoustic models need very detailed 3D vocal tract geometries to generate high quality sounds. Static geometries can be obtained from Magnetic Resonance Imaging (MRI), but it is not currently possible to capture dynamic MRI-based geometries with sufficient spatial and time resolution. One possible solution consists in interpolating between static geometries, but this is a complex task. We instead propose herein to use a semi-polar grid to extract 2D cross-sections from the static 3D geometries, and then interpolate them to obtain the vocal tract dynamics. Other approaches such as the adaptive grid have also been explored. In this method, cross-sections are defined perpendicular to the vocal tract midline, as typically done in 1D to obtain the vocal tract area functions. However, intersections between adjacent cross-sections may occur during the interpolation process, especially when the vocal tract midline quickly changes its orientation. In contrast, the semi-polar grid prevents these intersections because the plane orientations are fixed over time. Finite element simulations of static vowels are first conducted, showing that 3D acoustic wave propagation is not significantly altered when the semi-polar grid is used instead of the adaptive grid. The vowel-vowel sequence [ɑi] is finally simulated to demonstrate the method.

  • 13. Arnela, Marc
    et al.
    Guasch, Oriol
    Dabbaghchian, Saeed
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    FINITE ELEMENT GENERATION OF VOWEL SOUNDS USING DYNAMIC COMPLEX THREE-DIMENSIONAL VOCAL TRACTS (2016). In: PROCEEDINGS OF THE 23RD INTERNATIONAL CONGRESS ON SOUND AND VIBRATION: FROM ANCIENT TO MODERN ACOUSTICS, INT INST ACOUSTICS & VIBRATION, 2016. Conference paper (Refereed)
    Abstract [en]

    Three-dimensional (3D) numerical simulations of the vocal tract acoustics require very detailed vocal tract geometries in order to generate good quality vowel sounds. These geometries are typically obtained from Magnetic Resonance Imaging (MRI), from which a volumetric representation of the complex vocal tract shape is obtained. Static vowel sounds can then be generated using a finite element code, which simulates the propagation of acoustic waves through the vocal tract when a given train of glottal pulses is introduced at the glottal cross-section. A more challenging problem to solve is that of generating dynamic vowel sounds. On the one hand, the acoustic wave equation has to be solved in a computational domain with moving boundaries, which entails some numerical difficulties. On the other hand, the finite element meshes where acoustic wave propagation is computed have to move according to the dynamics of these very complex vocal tract shapes. In this work this problem is addressed. First, the acoustic wave equation in mixed form is expressed in an Arbitrary Lagrangian-Eulerian (ALE) framework to account for the vocal tract wall motion. This equation is numerically solved using a stabilized finite element approach. Second, the dynamic 3D vocal tract geometry is approximated by a finite set of cross-sections with complex shape. The time-evolution of these cross-sections is used to move the boundary nodes of the finite element meshes, while inner nodes are computed through diffusion. Some dynamic vowel sounds are presented as numerical examples.

  • 14.
    Beskow, Jonas
    et al.
    KTH, Superseded Departments, Speech, Music and Hearing.
    Engwall, Olov
    KTH, Superseded Departments, Speech, Music and Hearing.
    Granström, Björn
    KTH, Superseded Departments, Speech, Music and Hearing.
    Resynthesis of Facial and Intraoral Articulation from Simultaneous Measurements (2003). In: Proceedings of the 15th International Congress of Phonetic Sciences (ICPhS'03), Adelaide: Casual Productions, 2003. Conference paper (Other academic)
    Abstract [en]

    Simultaneous measurements of tongue and facial motion, using a combination of electromagnetic articulography (EMA) and optical motion tracking, are analysed to improve the articulation of an animated talking head and to investigate the correlation between facial and vocal tract movement. The recorded material consists of VCV and CVC words and 270 short everyday sentences spoken by one Swedish subject. The recorded articulatory movements are re-synthesised by a parametrically controlled 3D model of the face and tongue, using a procedure involving minimisation of the error between measurement and model. Using linear estimators, tongue data is predicted from the face and vice versa, and the correlation between measurement and prediction is computed.

  • 15.
    Beskow, Jonas
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Granström, Björn
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Nordqvist, Peter
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Wik, Preben
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Visualization of speech and audio for hearing-impaired persons (2008). In: Technology and Disability, ISSN 1055-4181, Vol. 20, no 2, p. 97-107. Article in journal (Refereed)
    Abstract [en]

    Speech and sounds are important sources of information in our everyday lives for communication with our environment, be it interacting with fellow humans or directing our attention to technical devices with sound signals. For hearing impaired persons this acoustic information must be supplemented or even replaced by cues using other senses. We believe that the most natural modality to use is the visual, since speech is fundamentally audiovisual and these two modalities are complementary. We are hence exploring how different visualization methods for speech and audio signals may support hearing impaired persons. The goal in this line of research is to allow the growing number of hearing impaired persons, children as well as the middle-aged and elderly, equal participation in communication. A number of visualization techniques are proposed and exemplified with applications for hearing impaired persons.

  • 16.
    Bälter, Olle
    et al.
    KTH, School of Computer Science and Communication (CSC), Human - Computer Interaction, MDI.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Öster, Anne-Marie
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Kjellström, Hedvig
    KTH, School of Computer Science and Communication (CSC), Computer Vision and Active Perception, CVAP.
    Wizard-of-Oz Test of ARTUR - a Computer-Based Speech Training System with Articulation Correction (2005). In: Proceedings of ASSETS 2005, 2005, p. 36-43. Conference paper (Refereed)
    Abstract [en]

    This study has been performed in order to test the human-machine interface of a computer-based speech training aid named ARTUR with the main feature that it can give suggestions on how to improve articulation. Two user groups were involved: three children aged 9-14 with extensive experience of speech training, and three children aged 6. All children had general language disorders. The study indicates that the present interface is usable without prior training or instructions, even for the younger children, although it needs some improvement to fit illiterate children. The granularity of the mesh that classifies mispronunciations was satisfactory, but can be developed further.

  • 17.
    Dabbaghchian, Saeed
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Arnela, Marc
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    SIMPLIFICATION OF VOCAL TRACT SHAPES WITH DIFFERENT LEVELS OF DETAIL (2015). In: Proceedings of the 18th International Congress of Phonetic Sciences, Glasgow, UK: University of Glasgow, 2015, p. 1-5. Conference paper (Refereed)
    Abstract [en]

    We propose a semi-automatic method to regenerate simplified vocal tract geometries from very detailed input (e.g. MRI-based geometry) with the possibility to control the level of detail, while maintaining the overall properties. The simplification procedure controls the number and organization of the vertices in the vocal tract surface mesh and can be assigned to replace complex cross-sections with regular shapes. Six different geometry regenerations are suggested: bent or straight vocal tract centreline, combined with three different types of cross-sections; namely realistic, elliptical or circular. The key feature in the simplification is that the cross-sectional areas and the length of the vocal tract are maintained. This method may, for example, be used to facilitate 3D finite element method simulations of vowels and diphthongs and to examine the basic acoustic characteristics of vocal tract in printed physical replicas. Furthermore, it allows for multimodal solutions of the wave equation.
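
    To make the area-preserving simplification concrete, the sketch below replaces an arbitrary cross-sectional contour with a circle of equal area, which keeps the vocal tract area function unchanged at that section. It is a generic illustration assuming the contour is a closed polygon given as an (N, 2) array; it is not the authors' regeneration procedure.

```python
import numpy as np

def circular_cross_section(contour, n_points=64):
    """Replace a closed 2-D cross-section polygon (shape (N, 2)) with a circle
    of the same area, centred at the mean of the contour vertices."""
    x, y = contour[:, 0], contour[:, 1]
    # shoelace formula for the polygon area
    area = 0.5 * abs(np.dot(x, np.roll(y, 1)) - np.dot(y, np.roll(x, 1)))
    radius = np.sqrt(area / np.pi)
    centre = contour.mean(axis=0)  # vertex mean as an approximate centroid
    theta = np.linspace(0.0, 2.0 * np.pi, n_points, endpoint=False)
    return centre + radius * np.column_stack([np.cos(theta), np.sin(theta)])
```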

  • 18.
    Dabbaghchian, Saeed
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Arnela, Marc
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Guasch, Oriol
    Synthesis of VV utterances from muscle activation to sound with a 3D model (2017). In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH 2017, The International Speech Communication Association (ISCA), 2017, p. 3497-3501. Conference paper (Refereed)
    Abstract [en]

    We propose a method to automatically generate deformable 3D vocal tract geometries from the surrounding structures in a biomechanical model. This allows us to couple 3D biomechanics and acoustics simulations. The basis of the simulations is muscle activation trajectories in the biomechanical model, which move the articulators to the desired articulatory positions. The muscle activation trajectories for a vowel-vowel utterance are here defined through interpolation between the determined activations of the start and end vowel. The resulting articulatory trajectories of flesh points on the tongue surface and jaw are similar to corresponding trajectories measured using Electromagnetic Articulography, hence corroborating the validity of interpolating muscle activation. At each time step in the articulatory transition, a 3D vocal tract tube is created through a cavity extraction method based on first slicing the geometry of the articulators with a semi-polar grid to extract the vocal tract contour in each plane and then reconstructing the vocal tract through a smoothed 3D mesh-generation using the extracted contours. A finite element method applied to these changing 3D geometries simulates the acoustic wave propagation. We present the resulting acoustic pressure changes on the vocal tract boundary and the formant transitions for the utterance [Ai].

  • 19.
    Dabbaghchian, Saeed
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Arnela, Marc
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Guasch, Oriol
    Stavness, Ian
    Badin, Pierre
    Using a Biomechanical Model and Articulatory Data for the Numerical Production of Vowels (2016). In: Interspeech 2016, 2016, p. 3569-3573. Conference paper (Refereed)
    Abstract [en]

    We introduce a framework to study speech production using a biomechanical model of the human vocal tract, ArtiSynth. Electromagnetic articulography data was used as input to an inverse tracking simulation that estimates muscle activations to generate 3D jaw and tongue postures corresponding to the target articulator positions. For acoustic simulations, the vocal tract geometry is needed, but since the vocal tract is a cavity rather than a physical object, its geometry does not explicitly exist in a biomechanical model. A fully-automatic method to extract the 3D geometry (surface mesh) of the vocal tract by blending geometries of the relevant articulators has therefore been developed. This automatic extraction procedure is essential, since a method with manual intervention is not feasible for large numbers of simulations or for generation of dynamic sounds, such as diphthongs. We then simulated the vocal tract acoustics by using the Finite Element Method (FEM). This requires a high quality vocal tract mesh without irregular geometry or self-intersections. We demonstrate that the framework is applicable to acoustic FEM simulations of a wide range of vocal tract deformations. In particular we present results for cardinal vowel production, with muscle activations, vocal tract geometry, and acoustic simulations.

  • 20.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Analysis of and feedback on phonetic features in pronunciation training with a virtual teacher (2012). In: Computer Assisted Language Learning, ISSN 0958-8221, E-ISSN 1744-3210, Vol. 25, no 1, p. 37-64. Article in journal (Refereed)
    Abstract [en]

    Pronunciation errors may be caused by several different deviations from the target, such as voicing, intonation, insertions or deletions of segments, or that the articulators are placed incorrectly. Computer-animated pronunciation teachers could potentially provide important assistance on correcting all these types of deviations, but they have an additional benefit for articulatory errors. By making parts of the face transparent, they can show the correct position and shape of the tongue and provide audiovisual feedback on how to change erroneous articulations. Such a scenario however requires firstly that the learner's current articulation can be estimated with precision and secondly that the learner is able to imitate the articulatory changes suggested in the audiovisual feedback. This article discusses both these aspects, with one experiment on estimating the important articulatory features from a speaker through acoustic-to-articulatory inversion and one user test with a virtual pronunciation teacher, in which the articulatory changes made by seven learners who receive audiovisual feedback are monitored using ultrasound imaging.

  • 21.
    Engwall, Olov
    KTH, Superseded Departments, Speech, Music and Hearing.
    Are Static MRI Data Representative of Dynamic Speech?: Results from a Comparative Study Using MRI, EMA, and EPG (2000). In: Proceedings of the 6th ICSLP, 2000, p. 17-20. Conference paper (Other academic)
  • 22.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Articulatory synthesis using corpus-based estimation of line spectrum pairs (2005). In: 9th European Conference on Speech Communication and Technology, 2005, p. 1909-1912. Conference paper (Refereed)
    Abstract [en]

    An attempt to define a new articulatory synthesis method, in which the speech signal is generated through a statistical estimation of its relation with articulatory parameters, is presented. A corpus containing acoustic material and simultaneous recordings of the tongue and facial movements was used to train and test the articulatory synthesis of VCV words and short sentences. Tongue and facial motion data, captured with electromagnetic articulography and three-dimensional optical motion tracking, respectively, define articulatory parameters of a talking head. These articulatory parameters are then used as estimators of the speech signal, represented by line spectrum pairs. The statistical link between the articulatory parameters and the speech signal was established using either linear estimation or artificial neural networks. The results show that the linear estimation was sufficient only to synthesize identifiable vowels, not consonants, whereas the neural networks gave a perceptually better synthesis.
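
    The statistical mapping idea in this abstract (articulatory parameters estimating line spectrum pairs) can be sketched with a small feed-forward regressor, as below. The array shapes, network size and random stand-in data are assumptions for illustration; the original work trained on corpus data of EMA and motion-capture parameters.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_art = rng.normal(size=(5000, 10))   # articulatory parameters per frame (stand-in)
Y_lsp = rng.normal(size=(5000, 16))   # line spectrum pairs per frame (stand-in targets)

# articulatory-to-acoustic mapping with a small multilayer perceptron
model = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=300, random_state=0),
)
model.fit(X_art, Y_lsp)
lsp_pred = model.predict(X_art[:5])   # predicted LSPs for some frames
print(lsp_pred.shape)
```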

  • 23.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Assessing MRI measurements: Effects of sustenation, gravitation and coarticulation (2006). In: Speech production: Models, Phonetic Processes and Techniques / [ed] Harrington, J.; Tabain, M., New York: Psychology Press, 2006, p. 301-314. Chapter in book (Refereed)
  • 24.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Augmented Reality Talking Heads as a Support for Speech Perception and Production (2011). In: Augmented Reality: Some Emerging Application Areas / [ed] Nee, Andrew Yeh Ching, IN-TECH, 2011, p. 89-114. Chapter in book (Refereed)
  • 25.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Bättre tala än texta - talteknologi nu och i framtiden (2008). In: Tekniken bakom språket / [ed] Domeij, Rickard, Stockholm: Norstedts Akademiska Förlag, 2008, p. 98-118. Chapter in book (Other (popular science, discussion, etc.))
  • 26.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT. KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Can audio-visual instructions help learners improve their articulation?: an ultrasound study of short term changes (2008). In: INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, BAIXAS: ISCA-INST SPEECH COMMUNICATION ASSOC, 2008, p. 2631-2634. Conference paper (Refereed)
    Abstract [en]

    This paper describes how seven French subjects change their pronunciation and articulation when practising Swedish words with a computer-animated virtual teacher. The teacher gives feedback on the user's pronunciation with audiovisual instructions suggesting how the articulation should be changed. A Wizard-of-Oz set-up was used for the training session, in which a human listener chose the adequate pre-generated feedback based on the user's pronunciation. The subjects' changes in articulation were monitored during the practice session with a hand-held ultrasound probe. The perceptual analysis indicates that the subjects improved their pronunciation during the training, and the ultrasound measurements suggest that the improvement was made by following the articulatory instructions given by the computer-animated teacher.

  • 27.
    Engwall, Olov
    KTH, Superseded Departments, Speech, Music and Hearing.
    Combining MRI, EMA and EPG measurements in a three-dimensional tongue model (2003). In: Speech Communication, ISSN 0167-6393, E-ISSN 1872-7182, Vol. 41, no 2-3, p. 303-329. Article in journal (Refereed)
    Abstract [en]

    A three-dimensional (3D) tongue model has been developed using MR images of a reference subject producing 44 artificially sustained Swedish articulations. Based on the difference in tongue shape between the articulations and a reference, the six linear parameters jaw height, tongue body, tongue dorsum, tongue tip, tongue advance and tongue width were determined using an ordered linear factor analysis controlled by articulatory measures. The first five factors explained 88% of the tongue data variance in the midsagittal plane and 78% in the 3D analysis. The six-parameter model is able to reconstruct the modelled articulations with an overall mean reconstruction error of 0.13 cm, and it specifically handles lateral differences and asymmetries in tongue shape. In order to correct articulations that were hyperarticulated due to the artificial sustaining in the magnetic resonance imaging (MRI) acquisition, the parameter values in the tongue model were readjusted based on a comparison of virtual and natural linguopalatal contact patterns, collected with electropalatography (EPG). Electromagnetic articulography (EMA) data was collected to control the kinematics of the tongue model for vowel-fricative sequences and an algorithm to handle surface contacts has been implemented, preventing the tongue from protruding through the palate and teeth.
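
    The linear-parameter idea in this abstract (a few factors explaining most of the tongue-shape variance around a reference) can be sketched with an ordinary principal component analysis on flattened contour coordinates, as below. This uses random stand-in data and an unguided PCA, not the ordered, articulatory-measure-controlled factor analysis of the paper.

```python
import numpy as np
from sklearn.decomposition import PCA

# stand-in data: 44 articulations, each tongue contour sampled as 50 (x, y)
# points and flattened to a 100-D vector
rng = np.random.default_rng(0)
tongue_shapes = rng.normal(size=(44, 100))

reference = tongue_shapes.mean(axis=0)                    # reference articulation
pca = PCA(n_components=6)
weights = pca.fit_transform(tongue_shapes - reference)    # factor values per articulation
print(np.cumsum(pca.explained_variance_ratio_))           # variance explained by 1..6 factors

# reconstruct one articulation from its six factor values
reconstruction = reference + weights[0] @ pca.components_
```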

  • 28.
    Engwall, Olov
    KTH, Superseded Departments, Speech, Music and Hearing.
    Concatenative Articulatory Synthesis. Manuscript (preprint) (Other academic)
  • 29.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Datoranimerade talande ansikten (2012). In: Människans ansikten: Emotion, interaktion och konst / [ed] Adelswärd, V.; Forstorp, P-A., Stockholm: Carlssons Bokförlag, 2012. Chapter in book (Other academic)
  • 30.
    Engwall, Olov
    KTH, Superseded Departments, Speech, Music and Hearing.
    Dynamical Aspects of Coarticulation in Swedish Fricatives: A Combined EMA and EPG Study (2000). In: TMH Quarterly Status and Progress Report, p. 49-73. Article in journal (Other academic)
    Abstract [en]

    An electromagnetic articulography (EMA) system and electropalatography (EPG) have been employed to study five Swedish fricatives in different vowel contexts. Articulatory measures at the onset of, the mean value during, and at the offset of the fricative were used to evidence the coarticulation throughout the fricative. The contextual influence on these three different measurements of the fricative is compared and contrasted to evidence how the coarticulation changes. Measures were made for the jaw motion, lip protrusion and tongue body with EMA, and for linguopalatal contact with EPG. The data from the two sources were further combined and assessed for complementary and conflicting results.

  • 31.
    Engwall, Olov
    KTH, Superseded Departments, Speech, Music and Hearing.
    Evaluation of a System for Concatenative Articulatory Visual Synthesis (2002). In: Proceedings of the ICSLP, 2002. Conference paper (Other academic)
  • 32.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Evaluation of speech inversion using an articulatory classifier (2006). In: Proceedings of the Seventh International Seminar on Speech Production / [ed] Yehia, H.; Demolin, D.; Laboissière, R., 2006, p. 469-476. Conference paper (Refereed)
    Abstract [en]

    This paper presents an evaluation method for statistically based speech inversion, in which the estimated vocal tract shapes are classified into phoneme categories based on the articulatory correspondence with prototype vocal tract shapes. The prototypes are created using the original articulatory data, and the classifier hence makes it possible to interpret the results of the inversion in terms of, e.g., confusions between different articulations and the success in estimating different places of articulation. The articulatory classifier was used to evaluate acoustic and audiovisual speech inversion of VCV words and Swedish sentences performed with a linear estimation and an artificial neural network.

  • 33.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Feedback strategies of human and virtual tutors in pronunciation training (2006). In: TMH-QPSR, ISSN 1104-5787, Vol. 48, no 1, p. 011-034. Article in journal (Other academic)
    Abstract [en]

    This paper presents a survey of language teachers' and their students' attitudes and practice concerning the use of corrective feedback in pronunciation training. The aim of the study is to identify feedback strategies that can be used successfully in a computer assisted pronunciation training system with a virtual tutor giving articulatory instructions and feedback. The study was carried out using focus group meetings, individual semi-structured interviews and classroom observations. Implications for computer assisted pronunciation training are presented and some have been tested with 37 users in a short practice session with a virtual teacher.

  • 34.
    Engwall, Olov
    KTH, Superseded Departments, Speech, Music and Hearing.
    From real-time MRI to 3D tongue movements (2004). In: INTERSPEECH 2004: ICSLP 8th International Conference on Spoken Language Processing / [ed] Kim, S. H.; Young, D. H., 2004, p. 1109-1112. Conference paper (Refereed)
    Abstract [en]

    Real-time Magnetic Resonance Imaging (MRI) at 9 images/s of the midsagittal plane is used as input to a three-dimensional tongue model, previously generated based on sustained articulations imaged with static MRI. The aim is two-fold: firstly, to use articulatory inversion to extrapolate the midsagittal tongue movements to three-dimensional movements; secondly, to determine the accuracy of the tongue model in replicating the real-time midsagittal tongue shapes. The evaluation of the inversion shows that the real-time midsagittal contour is reproduced with acceptable accuracy. This means that the 3D model can be used to represent real-time articulations, even though the artificially sustained articulations on which it was based were hyperarticulated and had a backward displacement of the tongue.

  • 35.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Introducing visual cues in acoustic-to-articulatory inversion (2005). In: Interspeech 2005: 9th European Conference on Speech Communication and Technology, 2005, p. 3205-3208. Conference paper (Refereed)
    Abstract [en]

    The contribution of facial measures in a statistical acoustic-to-articulatory inversion has been investigated. The tongue contour was estimated using a linear estimation from either acoustics alone or acoustics and facial measures. Measures of the lateral movement of lip corners and the vertical movement of the upper and lower lip and the jaw gave a substantial improvement over the audio-only case. It was further found that adding the corresponding articulatory measures that could be extracted from a profile view of the face, i.e. the protrusion of the lips, lip corners and the jaw, did not give any additional improvement of the inversion result. The present study hence suggests that audiovisual-to-articulatory inversion can just as well be performed using front-view monovision of the face, rather than stereovision of both the front and profile views.
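
    The comparison described in this abstract (linear estimation of the tongue contour from acoustics alone versus acoustics plus facial measures) can be sketched as two multi-output linear regressions, as below. The feature dimensions and random stand-in data are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
acoustic = rng.normal(size=(10000, 13))   # e.g. cepstral features per frame
facial = rng.normal(size=(10000, 5))      # lip corner, lip and jaw measures
tongue = rng.normal(size=(10000, 20))     # tongue contour points (targets)

def rmse(model, X, y):
    return float(np.sqrt(np.mean((model.predict(X) - y) ** 2)))

# split into train/test and compare audio-only vs. audio + facial estimators
Xa_tr, Xa_te, Xf_tr, Xf_te, y_tr, y_te = train_test_split(
    acoustic, np.hstack([acoustic, facial]), tongue, random_state=0)
audio_only = LinearRegression().fit(Xa_tr, y_tr)
audio_facial = LinearRegression().fit(Xf_tr, y_tr)
print(rmse(audio_only, Xa_te, y_te), rmse(audio_facial, Xf_te, y_te))
```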

  • 36.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Is there a McGurk effect for tongue reading? (2010). In: Proceedings of AVSP: International Conference on Audio-Visual Speech Processing, 2010. Conference paper (Refereed)
    Abstract [en]

    Previous studies on tongue reading, i.e., speech perception of degraded audio supported by animations of tongue movements, have indicated that the support is weak initially and that subjects need training to learn to interpret the movements. This paper investigates if the learning is of the animation templates as such, or if subjects learn to retrieve articulatory knowledge that they already have. Matching and conflicting animations of tongue movements were presented randomly together with the auditory speech signal at three different levels of noise in a consonant identification test. The average recognition rate over the three noise levels was significantly higher for the matched audiovisual condition than for the conflicting and the auditory-only conditions. Audiovisual integration effects were also found for conflicting stimuli. However, the visual modality is given much less weight in the perception than for a normal face view, and inter-subject differences in the use of visual information are large.

  • 37.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Pronunciation analysis by acoustic-to-articulatory feature inversion (2012). In: Proceedings of the International Symposium on Automatic detection of Errors in Pronunciation Training / [ed] Engwall, O., Stockholm, 2012, p. 79-84. Conference paper (Refereed)
    Abstract [en]

    Second language learners may require assistance correcting their articulation of unfamiliar phonemes in order to reach the target pronunciation. If, e.g., a talking head is to provide the learner with feedback on how to change the articulation, a required first step is to be able to analyze the learner's articulation. This paper describes how a specialized restricted acoustic-to-articulatory inversion procedure may be used for this analysis. The inversion is trained on simultaneously recorded acoustic-articulatory data of one native speaker of Swedish, and four different experiments investigate how it performs for the original speaker, using acoustic input; for the original speaker, using acoustic input and visual information; for four other speakers; and for correct and mispronounced phones uttered by two non-native speakers.

  • 38.
    Engwall, Olov
    KTH, Superseded Departments, Speech, Music and Hearing.
    Speaker adaptation of a three-dimensional tongue model (2004). In: INTERSPEECH 2004: ICSLP 8th International Conference on Spoken Language Processing / [ed] Kim, S. H.; Young, D. H., 2004, p. 465-468. Conference paper (Refereed)
    Abstract [en]

    Magnetic Resonance Images of nine subjects have been collected to determine scaling factors that can adapt a 3D tongue model to new subjects. The aim is to define few and simple measures that will allow for an automatic, but accurate, scaling of the model. The scaling should be automatic in order to be useful in an application for articulation training, in which the model must replicate the user's articulators without involving the user in a complicated speaker adaptation. It should further be accurate enough to allow for correct acoustic-to-articulatory inversion. The evaluation shows that the defined scaling technique is able to estimate a tongue shape that was not included in the training with an accuracy of 1.5 mm in the midsagittal plane and 1.7 mm for the whole 3D tongue, based on four articulatory measures.

  • 39. Engwall, Olov
    Synthesizing Static Vowels and Dynamic Sounds Using a 3D Vocal Tract Model (2001). In: Proceedings of the 4th ISCA Workshop on Speech Synthesis, 2001, p. 81-86. Conference paper (Other academic)
  • 40.
    Engwall, Olov
    KTH, Superseded Departments, Speech, Music and Hearing.
    Tongue Talking: Studies in Intraoral Speech Synthesis (2002). Doctoral thesis, comprehensive summary (Other scientific)
  • 41.
    Engwall, Olov
    KTH, Superseded Departments, Speech, Music and Hearing.
    Vocal Tract Modeling in 3D (1999). In: TMH Quarterly Status and Progress Report, p. 31-38. Article in journal (Other academic)
  • 42.
    Engwall, Olov
    et al.
    KTH, Superseded Departments, Speech, Music and Hearing.
    Badin, P
    An MRI Study of Swedish Fricatives: Coarticulatory effects (2000). In: Proceedings of the 5th Speech Production Seminar, 2000, p. 297-300. Conference paper (Other academic)
  • 43.
    Engwall, Olov
    et al.
    KTH, Superseded Departments, Speech, Music and Hearing.
    Badin, P
    Collecting and Analysing Two- and Three-dimensional MRI data for Swedish (1999). In: TMH Quarterly Status and Progress Report, p. 11-38. Article in journal (Other academic)
  • 44.
    Engwall, Olov
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Bälter, Olle
    KTH, School of Computer Science and Communication (CSC), Human - Computer Interaction, MDI.
    Pronunciation feedback from real and virtual language teachers (2007). In: Computer Assisted Language Learning, ISSN 0958-8221, E-ISSN 1744-3210, Vol. 20, no 3, p. 235-262. Article in journal (Refereed)
    Abstract [en]

    The aim of this paper is to summarise how pronunciation feedback on the phoneme level should be given in computer-assisted pronunciation training (CAPT) in order to be effective. The study contains a literature survey of feedback in the language classroom, interviews with language teachers and their students about their attitudes towards pronunciation feedback, and observations of how feedback is given in their classrooms. The study was carried out using focus group meetings, individual semi-structured interviews and classroom observations. The feedback strategies that were advocated and observed in the study on pronunciation feedback from human teachers were implemented in a computer-animated language tutor giving articulation feedback. The virtual tutor was subsequently tested in a user trial and evaluated with a questionnaire. The article proposes several feedback strategies that would improve the pedagogical soundness of CAPT systems.

  • 45.
    Engwall, Olov
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Bälter, Olle
    KTH, School of Computer Science and Communication (CSC), Human - Computer Interaction, MDI.
    Öster, Anne-Marie
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Kjellström, Hedvig
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Designing the user interface of the computer-based speech training system ARTUR based on early user tests (2006). In: Behavior and Information Technology, ISSN 0144-929X, E-ISSN 1362-3001, Vol. 25, no 4, p. 353-365. Article in journal (Refereed)
    Abstract [en]

    This study has been performed in order to evaluate a prototype for the human - computer interface of a computer-based speech training aid named ARTUR. The main feature of the aid is that it can give suggestions on how to improve articulations. Two user groups were involved: three children aged 9 - 14 with extensive experience of speech training with therapists and computers, and three children aged 6, with little or no prior experience of computer-based speech training. All children had general language disorders. The study indicates that the present interface is usable without prior training or instructions, even for the younger children, but that more motivational factors should be introduced. The granularity of the mesh that classifies mispronunciations was satisfactory, but the flexibility and level of detail of the feedback should be developed further.

  • 46.
    Engwall, Olov
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Bälter, Olle
    KTH, School of Computer Science and Communication (CSC), Human - Computer Interaction, MDI.
    Öster, Anne-Marie
    Kjellström, Hedvig
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Feedback management in the pronunciation training system ARTUR (2006). In: Proceedings of CHI 2006, 2006, p. 231-234. Conference paper (Refereed)
    Abstract [en]

    This extended abstract discusses the development of a computer-assisted pronunciation training system that gives articulatory feedback, and in particular the management of feedback given to the user.

  • 47.
    Engwall, Olov
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Delvaux, V.
    Metens, T.
    Interspeaker Variation in the Articulation of French Nasal Vowels (2006). In: Proceedings of the Seventh International Seminar on Speech Production, 2006, p. 3-10. Conference paper (Refereed)
  • 48.
    Engwall, Olov
    et al.
    KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT. KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Wik, Preben
    KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT. KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Are real tongue movements easier to speech read than synthesized? (2009). In: INTERSPEECH 2009: 10TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2009, BAIXAS: ISCA-INST SPEECH COMMUNICATION ASSOC, 2009, p. 824-827. Conference paper (Refereed)
    Abstract [en]

    Speech perception studies with augmented reality displays in talking heads have shown that tongue reading abilities are weak initially, but that subjects become able to extract some information from intra-oral visualizations after a short training session. In this study, we investigate how the nature of the tongue movements influences the results, by comparing synthetic rule-based and actual, measured movements. The subjects were significantly better at perceiving sentences accompanied by real movements, indicating that the current coarticulation model developed for facial movements is not optimal for the tongue.

  • 49.
    Engwall, Olov
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Wik, Preben
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Can you tell if tongue movements are real or synthetic? (2009). In: Proceedings of AVSP, 2009. Conference paper (Refereed)
    Abstract [en]

    We have investigated if subjects are aware of what natural tongue movements look like, by showing them animations based on either measurements or rule-based synthesis. The issue is of interest since a previous audiovisual speech perception study recently showed that the word recognition rate in sentences with degraded audio was significantly better with real tongue movements than with synthesized. The subjects in the current study could as a group not tell which movements were real, with a classification score at chance level. About half of the subjects were significantly better at discriminating between the two types of animations, but their classification score was as often well below chance as above. The correlation between classification score and word recognition rate for subjects who also participated in the perception study was very weak, suggesting that the higher recognition score for real tongue movements may be due to subconscious, rather than conscious, processes. This finding could potentially be interpreted as an indication that audiovisual speech perception is based on articulatory gestures.

  • 50.
    Engwall, Olov
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Wik, Preben
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Real vs. rule-generated tongue movements as an audio-visual speech perception support (2009). In: Proceedings of Fonetik 2009 / [ed] Peter Branderud, Hartmut Traunmüller, Stockholm: Stockholm University, 2009, p. 30-35. Conference paper (Other academic)
    Abstract [en]

    We have conducted two studies in which animations created from real tongue movements and rule-based synthesis are compared. We first studied if the two types of animations were different in terms of how much support they give in a perception task. Subjects achieved a significantly higher word recognition rate in sentences when animations were shown compared to the audio-only condition, and a significantly higher score with real movements than with synthesized. We then performed a classification test, in which subjects were asked to indicate if the animations were created from measurements or from rules. The results show that the subjects as a group are unable to tell if the tongue movements are real or not. The stronger support from real movements hence appears to be due to subconscious factors.
