1 - 24 of 24
  • 1.
    Al Moubayed, Samer
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Ananthakrishnan, Gopal
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Acoustic-to-Articulatory Inversion based on Local Regression2010In: Proceedings of the 11th Annual Conference of the International Speech Communication Association, INTERSPEECH 2010, Makuhari, Japan, 2010, p. 937-940Conference paper (Refereed)
    Abstract [en]

    This paper presents an Acoustic-to-Articulatory inversion method based on local regression. Two types of local regression, a non-parametric and a local linear regression, have been applied on a corpus containing simultaneous recordings of positions of articulators and the corresponding acoustics. A maximum likelihood trajectory smoothing using the estimated dynamics of the articulators is also applied on the regression estimates. The average root mean square error in estimating articulatory positions, given the acoustics, is 1.56 mm for the non-parametric regression and 1.52 mm for the local linear regression. The local linear regression is found to perform significantly better than regression using Gaussian Mixture Models using the same acoustic and articulatory features.
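The local linear variant described above can be sketched as a k-nearest-neighbour least-squares fit from acoustic to articulatory frames. This is a minimal illustration, not the paper's actual setup: the toy data, neighbourhood size `k` and feature dimensions are all assumptions.

```python
import numpy as np

def local_linear_regression(X_train, Y_train, x_query, k=20):
    """Estimate an articulatory vector for one acoustic frame x_query.

    Fits an ordinary least-squares plane through the k acoustically
    nearest training frames (an illustrative stand-in for the paper's
    local linear regression; k and the features are assumptions).
    """
    # Distances from the query to every training acoustic frame.
    d = np.linalg.norm(X_train - x_query, axis=1)
    idx = np.argsort(d)[:k]                          # k nearest neighbours
    Xk = np.hstack([X_train[idx], np.ones((k, 1))])  # add bias column
    # Least-squares fit of articulatory targets on local acoustics.
    W, *_ = np.linalg.lstsq(Xk, Y_train[idx], rcond=None)
    return np.append(x_query, 1.0) @ W

# Toy data: articulation is a noiseless linear function of acoustics,
# so the local fit should recover it almost exactly.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))        # "acoustic" frames
A = rng.normal(size=(4, 2))
Y = X @ A                            # "articulatory" positions
x0 = rng.normal(size=4)
est = local_linear_regression(X, Y, x0)
print(np.allclose(est, x0 @ A, atol=1e-4))
```

On real data the fit is only locally valid, which is why the paper follows it with maximum likelihood trajectory smoothing.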

  • 2.
    Al Moubayed, Samer
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Ananthakrishnan, Gopal
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Enflo, Laura
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Music Acoustics. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Automatic Prominence Classification in Swedish2010In: Proceedings of Speech Prosody 2010, Workshop on Prosodic Prominence, Chicago, USA, 2010Conference paper (Refereed)
    Abstract [en]

    This study aims at automatically classifying levels of acoustic prominence on a dataset of 200 Swedish sentences of read speech by one male native speaker. Each word in the sentences was categorized by four speech experts into one of three groups depending on the level of prominence perceived. Six acoustic features at the syllable level and seven features at the word level were used. Two machine learning algorithms, namely Support Vector Machines (SVM) and Memory-Based Learning (MBL), were trained to classify the sentences into their respective classes. The MBL gave an average word-level accuracy of 69.08% and the SVM gave an average accuracy of 65.17% on the test set. These values were comparable with the average accuracy of the human annotators with respect to the average annotations. In this study, word duration was found to be the most important feature for classifying prominence in Swedish read speech.
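Since word duration is reported as the strongest cue, a word-level feature extractor is the natural front end for such a classifier. The sketch below, with invented frame-level energies and an assumed, much simplified feature set (duration, mean energy, peak energy), only illustrates the shape of that step.

```python
import numpy as np

def word_features(frames, word_spans, frame_ms=10):
    """Word-level prosodic features from per-frame energy values.

    frames     : per-frame energy for one utterance
    word_spans : list of (start_frame, end_frame) per word
    Returns (duration_ms, mean_energy, peak_energy) per word --
    an assumed, simplified version of the paper's feature set.
    """
    feats = []
    for s, e in word_spans:
        seg = frames[s:e]
        feats.append((len(seg) * frame_ms, float(seg.mean()), float(seg.max())))
    return feats

# Three "words" in a seven-frame toy utterance.
energy = np.array([0.2, 0.3, 0.9, 1.0, 0.8, 0.1, 0.1])
spans = [(0, 2), (2, 5), (5, 7)]
for dur, mean_e, peak_e in word_features(energy, spans):
    print(dur, round(mean_e, 2), round(peak_e, 2))
```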

  • 3.
    Ananthakrishnan, Gopal
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    From Acoustics to Articulation: Study of the acoustic-articulatory relationship along with methods to normalize and adapt to variations in production across different speakers2011Doctoral thesis, monograph (Other academic)
    Abstract [en]

    The focus of this thesis is the relationship between the articulation of speech and the acoustics of produced speech. There are several problems that are encountered in understanding this relationship, given the non-linearity, variance and non-uniqueness in the mapping, as well as the differences that exist in the size and shape of the articulators, and consequently the acoustics, for different speakers. The thesis covers mainly four topics pertaining to the articulation and acoustics of speech.

    The first part of the thesis deals with variations among different speakers in the articulation of phonemes. While the speakers differ physically in the shape of their articulators and vocal tracts, the study tries to extract articulation strategies that are common to different speakers. Using multi-way linear analysis methods, the study extracts articulatory parameters which can be used to estimate unknown articulations of phonemes made by one speaker, knowing other articulations made by the same speaker and those unknown articulations made by other speakers of the language. At the same time, a novel method to select the number of articulatory model parameters, as well as the articulations that are representative of a speaker's articulatory repertoire, is suggested.

    The second part is devoted to the study of uncertainty in the acoustic-to-articulatory mapping, specifically non-uniqueness in the mapping. Several studies in the past have shown that human beings are capable of producing a given phoneme using non-unique articulatory configurations, when the articulators are constrained. This was also demonstrated by synthesizing sounds using theoretical articulatory models. The studies in this part of the thesis investigate the existence of non-uniqueness in unconstrained read speech. This is carried out using a database of acoustic signals recorded synchronously along with the positions of electromagnetic coils placed on selected points on the lips, jaws, tongue and velum. This part, thus, largely devotes itself to describing techniques that can be used to study non-uniqueness in the statistical sense, using such a database. The results indicate that the acoustic vectors corresponding to some frames in all the phonemes in the database can be mapped onto non-unique articulatory distributions. The predictability of these non-unique frames is investigated, along with verifying whether applying continuity constraints can resolve this non-uniqueness.

    The third part proposes several novel methods of looking at acoustic-articulatory relationships in the context of acoustic-to-articulatory inversion. The proposed methods include explicit modeling of non-uniqueness using cross-modal Gaussian mixture modeling, as well as modeling the mapping as local regressions. Another innovative approach towards the mapping problem has also been described in the form of relating articulatory and acoustic gestures. Definitions and methods to obtain such gestures are presented along with an analysis of the gestures for different phoneme types. The relationship between the acoustic and articulatory gestures is also outlined. A method to conduct acoustic-to-articulatory inverse mapping is also suggested, along with a method to evaluate it. An application of acoustic-to-articulatory inversion to improve speech recognition is also described in this part of the thesis.

    The final part of the thesis deals with problems related to modeling infants acquiring the ability to speak, the model utilizing an articulatory synthesizer adapted to infant vocal tract sizes. The main problem addressed is related to modeling how infants acquire acoustic correlates that are normalized between infants and adults. A second problem of how infants decipher the number of degrees of articulatory freedom is also partially addressed. The main contribution is a realistic model which shows how an infant can learn the mapping between the acoustics produced during the babbling phase and the acoustics heard from the adults. The knowledge required to map corresponding adult-infant speech sounds is shown to be learnt without the total number of categories or one-to-one correspondences being specified explicitly. Instead, the model learns these features indirectly based on an overall approval rating, provided by a simulation of adult perception, on the basis of the imitation of adult utterances by the infant model.

    Thus, the thesis tries to cover different aspects of the relationship between articulation and acoustics of speech in the context of variations for different speakers and ages. Although not providing complete solutions, the thesis proposes novel directions for approaching the problem, with pointers to solutions in some contexts.

  • 4.
    Ananthakrishnan, Gopal
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Imitating Adult Speech: An Infant's Motivation2011In: 9th International Seminar on Speech Production, 2011, p. 361-368Conference paper (Refereed)
    Abstract [en]

    This paper tries to detail two aspects of speech acquisition by infants which are often assumed to be intrinsic or innate knowledge, namely the number of degrees of freedom in the articulatory parameters and the acoustic correlates that establish the correspondence between adult speech and the speech produced by the infant. The paper shows that being able to distinguish the different vowels in the vowel space of a certain language is a strong motivation for choosing both a certain number of independent articulatory parameters and a certain scheme of acoustic normalization between adult and child speech.

  • 5.
    Ananthakrishnan, Gopal
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Badin, P.
    GIPSA-Lab, Grenoble University.
    Vargas, J. A. V.
    GIPSA-Lab, Grenoble University.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Predicting Unseen Articulations from Multi-speaker Articulatory Models2010In: Proceedings of the 11th Annual Conference of the International Speech Communication Association, INTERSPEECH 2010, Makuhari, Japan, 2010, p. 1588-1591Conference paper (Refereed)
    Abstract [en]

    In order to study inter-speaker variability, this work aims to assess the generalization capabilities of data-based multi-speaker articulatory models. We use various three-mode factor analysis techniques to model the variations of midsagittal vocal tract contours obtained from MRI images for three French speakers articulating 73 vowels and consonants. Articulations of a given speaker for phonemes not present in the training set are then predicted by inversion of the models from measurements of these phonemes articulated by the other subjects. On the average, the prediction RMSE was 5.25 mm for tongue contours, and 3.3 mm for 2D midsagittal vocal tract distances. Besides, this study has established a methodology to determine the optimal number of factors for such models.

  • 6.
    Ananthakrishnan, Gopal
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Eklund, Robert
    Voice Provider, Stockholm.
    Peters, Gustav
    Forschungsinstitut Alexander Koenig, Bonn, Germany.
    Mabiza, Evans
    Antelope Park, Gweru, Zimbabwe.
    An acoustic analysis of lion roars. II: Vocal tract characteristics2011In: Proceedings from Fonetik 2011: Speech, Music and Hearing Quarterly Progress and Status Report, TMH-QPSR, Volume 51, 2011, Stockholm: KTH Royal Institute of Technology, 2011, Vol. 51, no 1, p. 5-8Conference paper (Other academic)
    Abstract [en]

    This paper makes the first attempt to perform an acoustic-to-articulatory inversion of a lion (Panthera leo) roar. The main problem that one encounters in attempting this is that little is known about the dimensions of the vocal tract, other than a general range of vocal tract lengths. Precious little is also known about the articulation strategies that are adopted by the lion while roaring. The approach used here is to iterate between possible values of vocal tract lengths and vocal tract configurations. Since there seem to be distinct articulatory changes during the course of a roar, we find a smooth path that minimizes the error function between a recorded roar and the simulated roar using a variable-length articulatory model.

  • 7.
    Ananthakrishnan, Gopal
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Important regions in the articulator trajectory2008In: Proceedings of International Seminar on Speech Production / [ed] Rudolph Sock, Susanne Fuchs, Yves Laprie, Strasbourg, France: INRIA , 2008, p. 305-308Conference paper (Refereed)
    Abstract [en]

    This paper deals with identifying important regions in the articulatory trajectory based on the physical properties of the trajectory. A method to locate critical time instants as well as the key articulator positions is suggested. Acoustic-to-Articulatory Inversion using linear and non-linear regression is performed using only these critical points. The accuracy of inversion is found to be almost the same as using all the data points.
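One plausible reading of "critical time instants" is local extrema of an articulator channel, i.e. points of zero velocity. The sketch below rests on that assumption and applies it to a synthetic coil trajectory; the paper's actual criterion may differ.

```python
import numpy as np

def critical_points(traj):
    """Indices where a 1-D articulator trajectory has a local extremum.

    Detected as sign changes of the first difference (~velocity) --
    an assumed, simplified reading of 'critical time instants'.
    """
    v = np.diff(traj)                 # first difference ~ velocity
    signs = np.sign(v)
    # A sign change of velocity marks a local maximum or minimum.
    idx = np.where(signs[:-1] * signs[1:] < 0)[0] + 1
    return idx

t = np.linspace(0, 2 * np.pi, 200)
y = np.sin(2 * t)                     # synthetic coil trajectory
cp = critical_points(y)
print(len(cp))                        # sin(2t) on [0, 2*pi] has 4 extrema
```

Keeping only these points gives the sparse representation on which the regression is then trained.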

  • 8.
    Ananthakrishnan, Gopal
    et al.
    KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT. KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT. KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Mapping between acoustic and articulatory gestures2011In: Speech Communication, ISSN 0167-6393, E-ISSN 1872-7182, Vol. 53, no 4, p. 567-589Article in journal (Refereed)
    Abstract [en]

    This paper proposes a definition for articulatory as well as acoustic gestures along with a method to segment the measured articulatory trajectories and acoustic waveforms into gestures. Using a simultaneously recorded acoustic-articulatory database, the gestures are detected based on finding critical points in the utterance, both in the acoustic and articulatory representations. The acoustic gestures are parameterized using 2-D cepstral coefficients. The articulatory trajectories are essentially the horizontal and vertical movements of Electromagnetic Articulography (EMA) coils placed on the tongue, jaw and lips along the midsagittal plane. The articulatory movements are parameterized using 2D-DCT using the same transformation that is applied on the acoustics. The relationship between the detected acoustic and articulatory gestures in terms of the timing as well as the shape is studied. In order to study this relationship further, acoustic-to-articulatory inversion is performed using GMM-based regression. The accuracy of predicting the articulatory trajectories from the acoustic waveforms is on par with state-of-the-art frame-based methods with dynamical constraints (with an average error of 1.45-1.55 mm for the two speakers in the database). In order to evaluate the acoustic-to-articulatory inversion in a more intuitive manner, a method based on the error in estimated critical points is suggested. Using this method, it was noted that the estimated articulatory trajectories using the acoustic-to-articulatory inversion methods were still not accurate enough to be within the perceptual tolerance of audio-visual asynchrony.
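The 2D-DCT parameterization applied to both modalities can be illustrated with an orthonormal separable transform over a channels-by-frames segment. The segment shape and the number of retained low-order coefficients below are assumptions, not values from the paper.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix (rows are basis vectors)."""
    k = np.arange(n)[:, None]
    M = np.cos(np.pi * k * (2 * np.arange(n)[None, :] + 1) / (2 * n))
    M[0] *= 1 / np.sqrt(2)
    return M * np.sqrt(2 / n)

def dct2(block):
    """Separable 2-D DCT: transform rows, then columns."""
    Dr = dct_matrix(block.shape[0])
    Dc = dct_matrix(block.shape[1])
    return Dr @ block @ Dc.T

# A gesture segment: e.g. 8 channels (EMA coil coordinates) x 32 frames
# (shape is an assumption for illustration).
rng = np.random.default_rng(1)
seg = rng.normal(size=(8, 32))
C = dct2(seg)
low = C[:3, :5]   # keep a few low-order coefficients as gesture parameters
# Orthonormality makes the transform energy preserving (Parseval):
print(np.isclose((C ** 2).sum(), (seg ** 2).sum()))
```

Truncating to the low-order block `low` is what turns a variable-detail segment into a fixed-length parameter vector.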

  • 9.
    Ananthakrishnan, Gopal
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Resolving Non-uniqueness in the Acoustic-to-Articulatory Mapping2011In: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, Prague, Czech republic, 2011, p. 4628-4631Conference paper (Refereed)
    Abstract [en]

    This paper studies the role of non-uniqueness in Acoustic-to-Articulatory Inversion. It is generally believed that applying continuity constraints to the estimates of the articulatory parameters can resolve the problem of non-uniqueness. This paper tries to find out whether all instances of non-uniqueness can be resolved using continuity constraints. The investigation reveals that applying continuity constraints provides the best estimate in roughly 50 to 53% of the non-unique mappings. Roughly 8 to 13% of the non-unique mappings are best estimated by choosing discontinuous paths along the hypothetical high-probability estimates of articulatory trajectories.
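A continuity constraint over non-unique per-frame candidates can be illustrated as a Viterbi-style dynamic program that trades candidate probability against inter-frame jumps. The cost function and toy data below are assumptions for illustration, not the paper's method.

```python
import numpy as np

def smoothest_path(cands, probs, jump_weight=1.0):
    """Viterbi-style selection among non-unique per-frame candidates.

    cands[t] : array of candidate articulator positions at frame t
    probs[t] : their probabilities
    Cost = -log p + jump_weight * squared jump between frames -- an
    illustrative form of the 'continuity constraint'.
    """
    T = len(cands)
    cost = -np.log(probs[0])
    back = []
    for t in range(1, T):
        # jump[i, j]: squared move from previous candidate i to current j.
        jump = (cands[t][None, :] - cands[t - 1][:, None]) ** 2
        total = cost[:, None] + jump_weight * jump - np.log(probs[t])[None, :]
        back.append(np.argmin(total, axis=0))
        cost = np.min(total, axis=0)
    path = [int(np.argmin(cost))]
    for bp in reversed(back):
        path.append(int(bp[path[-1]]))
    return list(reversed(path))

# Frame 2 is ambiguous (two equally likely modes); continuity should
# pick the mode nearest to its neighbours.
cands = [np.array([0.0]), np.array([0.1, 5.0]), np.array([0.2])]
probs = [np.array([1.0]), np.array([0.5, 0.5]), np.array([1.0])]
print(smoothest_path(cands, probs))   # -> [0, 0, 0]
```

The paper's finding is that a sizable minority of non-unique frames is better served by a discontinuous path, i.e. by ignoring exactly this kind of penalty.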

  • 10.
    Ananthakrishnan, Gopal
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Neiberg, Daniel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Exploring the Predictability of Non-Unique Acoustic-to-Articulatory Mappings2012In: IEEE Transactions on Audio, Speech, and Language Processing, ISSN 1558-7916, E-ISSN 1558-7924, Vol. 20, no 10, p. 2672-2682Article in journal (Refereed)
    Abstract [en]

    This paper explores statistical tools that help analyze the predictability in the acoustic-to-articulatory inversion of speech, using an Electromagnetic Articulography database of simultaneously recorded acoustic and articulatory data. Since it has been shown that speech acoustics can be mapped to non-unique articulatory modes, the variance of the articulatory parameters is not sufficient to understand the predictability of the inverse mapping. We, therefore, estimate an upper bound to the conditional entropy of the articulatory distribution. This provides a probabilistic estimate of the range of articulatory values (either over a continuum or over discrete non-unique regions) for a given acoustic vector in the database. The analysis is performed for different British/Scottish English consonants with respect to which articulators (lips, jaws or the tongue) are important for producing the phoneme. The paper shows that acoustic-articulatory mappings for the important articulators have a low upper bound on the entropy, but can still have discrete non-unique configurations.
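The idea of bounding the conditional entropy of articulatory values given an acoustic frame can be illustrated with a crude histogram estimate over acoustically nearby frames. The neighbourhood radius, binning and synthetic data below are assumptions and far simpler than the paper's estimator.

```python
import numpy as np

def conditional_entropy_bound(X_ac, Y_art, x_query, radius=0.5, bins=20):
    """Histogram entropy (bits) of articulatory values whose acoustic
    frames lie near x_query -- an illustrative discrete stand-in for
    the conditional-entropy upper bound described in the abstract."""
    near = np.linalg.norm(X_ac - x_query, axis=1) < radius
    vals = Y_art[near]
    p, _ = np.histogram(vals, bins=bins, range=(-2.0, 2.0))
    p = p[p > 0] / p.sum()
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(2)
X = rng.normal(size=(5000, 2))                 # "acoustic" frames
# Unique mapping: one tight articulatory mode.  Non-unique mapping:
# two discrete modes, as in the paper's discussion.
Y_unique = np.full(5000, 0.9) + rng.normal(scale=0.01, size=5000)
Y_bimodal = (np.where(rng.random(5000) < 0.5, -0.9, 0.9)
             + rng.normal(scale=0.01, size=5000))
x0 = np.zeros(2)
print(conditional_entropy_bound(X, Y_unique, x0) <
      conditional_entropy_bound(X, Y_bimodal, x0))
```

As the abstract notes, a low entropy bound and discrete non-unique modes can coexist: two tight modes still yield only about one bit.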

  • 11.
    Ananthakrishnan, Gopal
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Neiberg, Daniel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Cross-modal Clustering in the Acoustic-Articulatory Space2009In: Proceedings Fonetik 2009: The XXIIth Swedish Phonetics Conference / [ed] Peter Branderud, Hartmut Traunmüller, Stockholm: Stockholm University, 2009, p. 202-207Conference paper (Other academic)
    Abstract [en]

    This paper explores cross-modal clustering in the acoustic-articulatory space. A method to improve clustering using information from more than one modality is presented. Formants and the Electromagnetic Articulography measurements are used to study corresponding clusters formed in the two modalities. A measure for estimating the uncertainty in correspondences between one cluster in the acoustic space and several clusters in the articulatory space is suggested.

  • 12.
    Ananthakrishnan, Gopal
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Neiberg, Daniel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    In search of Non-uniqueness in the Acoustic-to-Articulatory Mapping2009In: INTERSPEECH 2009: 10TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2009, BAIXAS: ISCA-INST SPEECH COMMUNICATION ASSOC , 2009, p. 2799-2802Conference paper (Refereed)
    Abstract [en]

    This paper explores the possibility and extent of non-uniqueness in the acoustic-to-articulatory inversion of speech, from a statistical point of view. It proposes a technique to estimate the non-uniqueness, based on finding peaks in the conditional probability function of the articulatory space. The paper corroborates the existence of non-uniqueness in a statistical sense, especially in stop consonants, nasals and fricatives. The relationship between the importance of the articulator position and non-uniqueness at each instance is also explored.

  • 13.
    Ananthakrishnan, Gopal
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Using Imitation to learn Infant-Adult Acoustic Mappings2011In: 12th Annual Conference Of The International Speech Communication Association 2011 (INTERSPEECH 2011), Vols 1-5, ISCA , 2011, p. 772-775Conference paper (Refereed)
    Abstract [en]

    This paper discusses a model which conceptually demonstrates how infants could learn the normalization between infant-adult acoustics. The model proposes that the mapping can be inferred from the topological correspondences between the adult and infant acoustic spaces, that are clustered separately in an unsupervised manner. The model requires feedback from the adult in order to select the right topology for clustering, which is a crucial aspect of the model. The feedback is in terms of an overall rating of the imitation effort by the infant, rather than a frame-by-frame correspondence. Using synthetic, but continuous speech data, we demonstrate that clusters, which have a good topological correspondence, are perceived to be similar by a phonetically trained listener.

  • 14.
    Ananthakrishnan, Gopal
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Wik, Preben
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Detecting confusable phoneme pairs for Swedish language learners depending on their first language2011In: TMH-QPSR, ISSN 1104-5787, Vol. 51, no 1, p. 89-92Article in journal (Other academic)
    Abstract [en]

    This paper proposes a paradigm where commonly made segmental pronunciation errors are modeled as pair-wise confusions between two or more phonemes in the language that is being learnt. The method uses an ensemble of support vector machine classifiers with time varying Mel frequency cepstral features to distinguish between several pairs of phonemes. These classifiers are then applied to classify the phonemes uttered by second language learners. Using this method, an assessment is made regarding the typical pronunciation problems that students learning Swedish would encounter, depending on their first language.

  • 15.
    Ananthakrishnan, Gopal
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Wik, Preben
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Abdou, Sherif
    Faculty of Computers & Information, Cairo University, Egypt.
    Using an Ensemble of Classifiers for Mispronunciation Feedback2011In: Proceedings of SLaTE / [ed] Strik, H.; Delmonte, R.; Russel, M., Venice, Italy, 2011Conference paper (Refereed)
    Abstract [en]

    This paper proposes a paradigm where commonly made segmental pronunciation errors are modeled as pair-wise confusions between two or more phonemes in the language that is being learnt. The method uses an ensemble of support vector machine classifiers with time varying Mel frequency cepstral features to distinguish between several pairs of phonemes. These classifiers are then applied to classify the phonemes uttered by second language learners. Instead of providing feedback at every mispronounced phoneme, the method attempts to provide feedback about typical mispronunciations by a certain student over an entire session of several utterances. Two case studies that demonstrate how the paradigm is applied to provide suitable feedback to two students are also described in this paper.

  • 16. Eklund, R.
    et al.
    Peters, G.
    Ananthakrishnan, Gopal
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Mabiza, E.
    An acoustic analysis of lion roars. I: Data collection and spectrogram and waveform analyses2011In: TMH-QPSR, ISSN 1104-5787, Vol. 51, no 1, p. 1-4Article in journal (Other academic)
    Abstract [en]

    This paper describes the collection of lion roar data at two different locations, an outdoor setting at Antelope Park in Zimbabwe and an indoor setting at Parken Zoo in Sweden. Preliminary analyses of spectrographic and waveform data are provided.

  • 17. Katsamanis, N.
    et al.
    Ananthakrishnan, Gopal
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Papandreou, G.
    Maragos, P.
    NTU, Athens, Greece.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Audiovisual speech inversion by switching dynamical modeling Governed by a Hidden Markov Process2008In: Proceedings of EUSIPCO, 2008Conference paper (Refereed)
    Abstract [en]

    We propose a unified framework to recover articulation from audiovisual speech. The nonlinear audiovisual-to-articulatory mapping is modeled by means of a switching linear dynamical system. Switching is governed by a state sequence determined via a Hidden Markov Model alignment process. Mel Frequency Cepstral Coefficients are extracted from audio while visual analysis is performed using Active Appearance Models. The articulatory state is represented by the coordinates of points on important articulators, e.g., tongue and lips. To evaluate our inversion approach, instead of just using the conventional correlation coefficients and root mean squared errors, we introduce a novel evaluation scheme that is more specific to the inversion problem. Prediction errors in the positions of the articulators are weighted differently depending on their relevant importance in the production of the corresponding sound. The applied weights are determined by an articulatory classification analysis using Support Vector Machines with a radial basis function kernel. Experiments are conducted in the audiovisual-articulatory MOCHA database.

  • 18.
    Neiberg, Daniel
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Ananthakrishnan, Gopal
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    On the Non-uniqueness of Acoustic-to-Articulatory Mapping2008In: Proceedings FONETIK 2008, Göteborg, 2008, p. 9-13Conference paper (Other academic)
    Abstract [en]

    This paper studies, statistically, the hypothesis that the acoustic-to-articulatory mapping is non-unique. The distributions of the acoustic and articulatory spaces are obtained by minimizing the BIC while fitting the data into a GMM using the EM algorithm. The kurtosis is used to measure the non-Gaussianity of the distributions and the Bhattacharyya distance is used to find the difference between distributions of the acoustic vectors producing non-unique articulator configurations. It is found that stop consonants and alveolar fricatives are generally not only non-linear but also non-unique, while dental fricatives are found to be highly non-linear but fairly unique.
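The Bhattacharyya distance between two Gaussian densities has a closed form, sketched below. This only illustrates the measure itself; the paper applies it to GMM components fitted to the acoustic data, which is not reproduced here.

```python
import numpy as np

def bhattacharyya_gaussian(mu1, S1, mu2, S2):
    """Bhattacharyya distance between two multivariate Gaussians:
    1/8 (mu1-mu2)^T S^-1 (mu1-mu2) + 1/2 ln(det S / sqrt(det S1 det S2)),
    where S = (S1 + S2) / 2."""
    S = 0.5 * (S1 + S2)
    dmu = mu1 - mu2
    term1 = 0.125 * dmu @ np.linalg.solve(S, dmu)
    term2 = 0.5 * np.log(np.linalg.det(S) /
                         np.sqrt(np.linalg.det(S1) * np.linalg.det(S2)))
    return term1 + term2

I = np.eye(2)
# Identical Gaussians -> distance 0; separated means -> positive distance.
d0 = bhattacharyya_gaussian(np.zeros(2), I, np.zeros(2), I)
d1 = bhattacharyya_gaussian(np.zeros(2), I, np.array([3.0, 0.0]), I)
print(d0, d1)   # -> 0.0 and 1.125
```

A large distance between the acoustic distributions that feed distinct articulatory modes is what licenses calling a mapping non-unique rather than merely noisy.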

  • 19.
    Neiberg, Daniel
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Ananthakrishnan, Gopal
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Blomberg, Mats
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    On Acquiring Speech Production Knowledge from Articulatory Measurements for Phoneme Recognition2009In: INTERSPEECH 2009: 10TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2009, BAIXAS: ISCA-INST SPEECH COMMUNICATION ASSOC , 2009, p. 1387-1390Conference paper (Refereed)
    Abstract [en]

    The paper proposes a general version of a coupled Hidden Markov/Bayesian Network model for performing phoneme recognition on acoustic-articulatory data. The model uses knowledge learned from the articulatory measurements, available for training, for phoneme recognition on the acoustic input. After training on the articulatory data, the model is able to predict 71.5% of the articulatory state sequences using the acoustic input. Using optimized parameters, the proposed method shows a slight improvement for two speakers over the baseline phoneme recognition system which does not use articulatory knowledge. However, the improvement is only statistically significant for one of the speakers. While there is an improvement in recognition accuracy for the vowels, diphthongs and to some extent the semi-vowels, there is a decrease in accuracy for the remaining phonemes.

  • 20.
    Neiberg, Daniel
    et al.
    KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT. KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Ananthakrishnan, Gopal
    KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT. KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT. KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    The Acoustic to Articulation Mapping: Non-linear or Non-unique?2008In: INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, BAIXAS: ISCA-INST SPEECH COMMUNICATION ASSOC , 2008, p. 1485-1488Conference paper (Refereed)
    Abstract [en]

    This paper studies, statistically, the hypothesis that the acoustic-to-articulatory mapping is non-unique. The distributions of the acoustic and articulatory spaces are obtained by fitting the data to a Gaussian Mixture Model. The kurtosis is used to measure the non-Gaussianity of the distributions and the Bhattacharyya distance is used to find the difference between distributions of the acoustic vectors producing non-unique articulator configurations. It is found that stop consonants and alveolar fricatives are generally not only non-linear but also non-unique, while dental fricatives are found to be highly non-linear but fairly unique. Two further investigations are also discussed: the first on how well the best possible piecewise linear regression is likely to perform, the second on whether dynamic constraints improve the ability to predict different articulatory regions corresponding to the same region in the acoustic space.

  • 21.
    Neiberg, Daniel
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Ananthakrishnan, Gopal
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Tracking pitch contours using minimum jerk trajectories2011In: INTERSPEECH 2011, 12th Annual Conference of the International Speech Communication Association, 2011, p. 2056-2059Conference paper (Refereed)
    Abstract [en]

    This paper proposes a fundamental frequency tracker, with the specific purpose of comparing the automatic estimates with pitch contours that are sketched by trained phoneticians. The method uses a frequency-domain approach to estimate pitch tracks that form minimum jerk trajectories, mimicking the motor movements of the hand made while sketching. When the fundamental frequencies tracked by the proposed method on the oral and laryngograph signals of the MOCHA-TIMIT database were compared, the correlation was 0.98 and the root mean squared error was 4.0 Hz, slightly better than a state-of-the-art pitch tracking algorithm included in the ESPS toolkit. We also demonstrate how the proposed algorithm can be applied when comparing against sketches made by phoneticians of the variations in accent II among Swedish dialects.
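    The paper's frequency-domain tracker is not reproduced here, but the minimum jerk trajectory it fits has a well-known closed form: between two endpoints with zero velocity and acceleration at both ends, the smoothest path is the classic quintic interpolant. A minimal sketch of that interpolant, with hypothetical F0 endpoint values:

    ```python
    # Sketch: closed-form minimum-jerk interpolation between two values
    # (zero velocity and acceleration at both endpoints).
    import numpy as np

    def minimum_jerk(x0, x1, n=100):
        """Minimum-jerk trajectory from x0 to x1 sampled at n points.

        Uses the classic quintic blend s(tau) = 10*tau^3 - 15*tau^4 + 6*tau^5,
        which minimizes the integral of the squared third derivative (jerk).
        """
        tau = np.linspace(0.0, 1.0, n)
        s = 10 * tau**3 - 15 * tau**4 + 6 * tau**5
        return x0 + (x1 - x0) * s

    # Hypothetical example: a smooth F0 glide from 120 Hz to 180 Hz.
    traj = minimum_jerk(120.0, 180.0)
    print(traj[0], traj[-1])
    ```

    The blend function satisfies s(0) = 0, s(1) = 1, and s(0.5) = 0.5, so the trajectory passes exactly through both endpoints and its midpoint.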

  • 22.
    Neiberg, Daniel
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Laukka, P.
    Ananthakrishnan, Gopal
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Classification of Affective Speech using Normalized Time-Frequency Cepstra2010In: Speech Prosody 2010 Conference Proceedings, Chicago, Illinois, U.S.A, 2010Conference paper (Refereed)
    Abstract [en]

    Subtle temporal and spectral differences between categorical realizations of para-linguistic phenomena (e.g. affective vocal expressions) are hard to capture and describe. In this paper we present a signal representation based on Time Varying Constant-Q Cepstral Coefficients (TVCQCC) derived for this purpose. A method which utilizes the special properties of the constant-Q transform for mean F0 estimation and normalization is described. The coefficients are invariant to utterance length, and, as a special case, a representation for prosody is considered. Speaker-independent classification results using nu-SVM are reported for the Berlin EMO-DB and for two closed sets of basic (anger, disgust, fear, happiness, sadness, neutral) and social/interpersonal (affection, pride, shame) emotions recorded by forty professional actors from two English dialect areas. The accuracy for the Berlin EMO-DB is 71.2%; the accuracy for the first set, including basic emotions, was 44.6%, and for the second set, including basic and social emotions, 31.7%. It was found that F0 normalization boosts performance and that a combined feature set shows the best performance.

  • 23.
    Picard, Sebastien
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Ananthakrishnan, Gopal
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Wik, Preben
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Abdou, S.
    Detection of Specific Mispronunciations using Audiovisual Features2010In: Auditory-Visual Speech Processing (AVSP) 2010, 2010Conference paper (Refereed)
    Abstract [en]

    This paper introduces a general approach for binary classification of audiovisual data. The intended application is mispronunciation detection for specific phonemic errors, using very sparse training data. The system uses a Support Vector Machine (SVM) classifier with features obtained from a Time Varying Discrete Cosine Transform (TV-DCT) on the audio log-spectrum as well as on the image sequences. The concatenated feature vectors from both modalities were reduced to a very small subset using a combination of feature selection methods. We achieved 95-100% correct classification for each pair-wise classifier on a database of Swedish vowels with an average of 58 instances per vowel for training. The performance was largely unaffected when tested on data from a speaker who was not included in the training.
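    The pipeline the abstract describes (reduce concatenated audiovisual features to a small subset, then train a pair-wise SVM on sparse data) can be sketched generically. This uses scikit-learn on synthetic data; the feature counts, class separation, and the univariate F-test selector are illustrative assumptions, not the paper's TV-DCT features or its particular feature-selection combination:

    ```python
    # Sketch: feature selection down to a small subset, followed by an
    # SVM pair-wise (binary) classifier, on synthetic stand-in features.
    import numpy as np
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import SVC

    rng = np.random.default_rng(1)
    # ~58 training instances per class, mirroring the paper's sparse setup;
    # 40-dimensional vectors stand in for concatenated audio+visual features.
    X = np.vstack([
        rng.normal(0.0, 1.0, size=(58, 40)),
        rng.normal(1.5, 1.0, size=(58, 40)),
    ])
    y = np.array([0] * 58 + [1] * 58)

    # Keep only the 8 most discriminative features, then classify with an SVM.
    clf = make_pipeline(SelectKBest(f_classif, k=8), SVC(kernel="rbf"))
    clf.fit(X, y)
    print(clf.score(X, y))
    ```

    Chaining the selector and classifier in one pipeline ensures the same feature subset is applied consistently at training and test time.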

  • 24. Sen, Amrik
    et al.
    Ananthakrishnan, Gopal
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Sundaram, Suresh
    Ramakrishnan, A. G.
    Dynamic space warping of strokes for recognition of online handwritten characters2009In: International journal of pattern recognition and artificial intelligence, ISSN 0218-0014, Vol. 23, no 5, p. 925-943Article in journal (Refereed)
    Abstract [en]

    This paper suggests a scheme for classifying online handwritten characters, based on dynamic space warping of strokes within the characters. A method for segmenting components into strokes using velocity profiles is proposed. Each stroke is a simple arbitrary shape and is encoded using three attributes. Correspondence between various strokes is established using Dynamic Space Warping. A distance measure which reliably differentiates between two corresponding simple shapes (strokes) has been formulated, thus obtaining a perceptual distance measure between any two characters. Tests indicate an accuracy of over 85% on two different datasets of characters.
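    The stroke correspondence step rests on a dynamic-programming alignment in the spirit of dynamic time warping. A generic sketch of such an alignment cost between two sequences of stroke descriptors follows; the Euclidean local cost and plain vector descriptors are assumptions for illustration, not the paper's three-attribute encoding or its perceptual distance:

    ```python
    # Sketch: dynamic-programming alignment cost between two sequences
    # of stroke descriptor vectors (DTW-style recursion).
    import numpy as np

    def dtw_distance(a, b):
        """Minimum cumulative alignment cost between descriptor sequences a and b.

        D[i, j] holds the best cost of aligning the first i descriptors of a
        with the first j of b; each cell extends a match, insertion, or deletion.
        """
        n, m = len(a), len(b)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = np.linalg.norm(np.asarray(a[i - 1]) - np.asarray(b[j - 1]))
                D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return D[n, m]

    # Identical stroke sequences align at zero cost.
    print(dtw_distance([[0.0], [1.0], [2.0]], [[0.0], [1.0], [2.0]]))
    ```

    Backtracking through the same table would recover which strokes correspond, which is the information the character-level distance is built on.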
