kth.se Publications
1 - 38 of 38
  • 1.
    Ananthakrishnan, Gopal
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Neiberg, Daniel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Exploring the Predictability of Non-Unique Acoustic-to-Articulatory Mappings. 2012. In: IEEE Transactions on Audio, Speech, and Language Processing, ISSN 1558-7916, E-ISSN 1558-7924, Vol. 20, no 10, p. 2672-2682. Article in journal (Refereed)
    Abstract [en]

    This paper explores statistical tools that help analyze the predictability in the acoustic-to-articulatory inversion of speech, using an Electromagnetic Articulography database of simultaneously recorded acoustic and articulatory data. Since it has been shown that speech acoustics can be mapped to non-unique articulatory modes, the variance of the articulatory parameters is not sufficient to understand the predictability of the inverse mapping. We, therefore, estimate an upper bound to the conditional entropy of the articulatory distribution. This provides a probabilistic estimate of the range of articulatory values (either over a continuum or over discrete non-unique regions) for a given acoustic vector in the database. The analysis is performed for different British/Scottish English consonants with respect to which articulators (lips, jaws or the tongue) are important for producing the phoneme. The paper shows that acoustic-articulatory mappings for the important articulators have a low upper bound on the entropy, but can still have discrete non-unique configurations.

    Download full text (pdf)
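
    A reader-oriented sketch of the entropy-bound idea in the entry above; the paper's own estimator is not reproduced here. The snippet only illustrates one standard upper bound on the entropy of the conditional Gaussian mixture obtained from a joint acoustic-articulatory GMM. The data, dimensionalities and component count are hypothetical placeholders, not the EMA setup of the paper.

    import numpy as np
    from scipy.stats import multivariate_normal
    from sklearn.mixture import GaussianMixture

    # Hypothetical stand-in for simultaneously recorded acoustic (da dims) and
    # articulatory (dx dims) frames; a real experiment would use EMA data.
    da, dx = 12, 6
    rng = np.random.default_rng(0)
    data = rng.standard_normal((5000, da + dx))

    gmm = GaussianMixture(n_components=16, covariance_type="full",
                          random_state=0).fit(data)

    def conditional_entropy_upper_bound(acoustic_vec):
        """Upper bound (nats) on H(articulation | acoustics) under the joint GMM,
        using H(mixture) <= sum_k w_k H(N_k) + H(w)."""
        log_w = np.empty(gmm.n_components)
        comp_ent = np.empty(gmm.n_components)
        for k in range(gmm.n_components):
            mu, S = gmm.means_[k], gmm.covariances_[k]
            Saa, Sax, Sxx = S[:da, :da], S[:da, da:], S[da:, da:]
            # Conditional covariance of articulation given acoustics (Schur complement).
            Sxx_given_a = Sxx - Sax.T @ np.linalg.solve(Saa, Sax)
            comp_ent[k] = 0.5 * (dx * np.log(2 * np.pi * np.e)
                                 + np.linalg.slogdet(Sxx_given_a)[1])
            # Component responsibility given only the acoustic observation.
            log_w[k] = (np.log(gmm.weights_[k])
                        + multivariate_normal.logpdf(acoustic_vec, mu[:da], Saa))
        w = np.exp(log_w - np.logaddexp.reduce(log_w))
        return float(w @ comp_ent - w @ np.log(w + 1e-12))

    print(conditional_entropy_upper_bound(rng.standard_normal(da)))

    Low values of this bound would correspond to acoustic frames whose articulatory realisation is tightly constrained, in the spirit of the analysis above.
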
  • 2.
    Ananthakrishnan, Gopal
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Neiberg, Daniel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Cross-modal Clustering in the Acoustic-Articulatory Space. 2009. In: Proceedings Fonetik 2009: The XXIIth Swedish Phonetics Conference / [ed] Peter Branderud, Hartmut Traunmüller, Stockholm: Stockholm University, 2009, p. 202-207. Conference paper (Other academic)
    Abstract [en]

    This paper explores cross-modal clustering in the acoustic-articulatory space. A method to improve clustering using information from more than one modality is presented. Formants and the Electromagnetic Articulography measurements are used to study corresponding clusters formed in the two modalities. A measure for estimating the uncertainty in correspondences between one cluster in the acoustic space and several clusters in the articulatory space is suggested.

  • 3.
    Ananthakrishnan, Gopal
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Neiberg, Daniel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    In search of Non-uniqueness in the Acoustic-to-Articulatory Mapping. 2009. In: INTERSPEECH 2009: 10TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2009, BAIXAS: ISCA-INST SPEECH COMMUNICATION ASSOC, 2009, p. 2799-2802. Conference paper (Refereed)
    Abstract [en]

    This paper explores the possibility and extent of non-uniqueness in the acoustic-to-articulatory inversion of speech, from a statistical point of view. It proposes a technique to estimate the non-uniqueness, based on finding peaks in the conditional probability function of the articulatory space. The paper corroborates the existence of non-uniqueness in a statistical sense, especially in stop consonants, nasals and fricatives. The relationship between the importance of the articulator position and non-uniqueness at each instance is also explored.

  • 4.
    Gustafson, Joakim
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Neiberg, Daniel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Directing conversation using the prosody of mm and mhm. 2010. In: Proceedings of SLTC 2010, Linköping, Sweden, 2010, p. 15-16. Conference paper (Refereed)
  • 5.
    Gustafson, Joakim
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Neiberg, Daniel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Prosodic cues to engagement in non-lexical response tokens in Swedish. 2010. In: Proceedings of DiSS-LPSS Joint Workshop 2010, Tokyo, Japan, 2010. Conference paper (Refereed)
  • 6. Imboden, S.
    et al.
    Petrone, M.
    Quadrani, P.
    Zannoni, C.
    Mayoral, R.
    Clapworthy, G. J.
    Testi, D.
    Viceconti, M.
    Neiberg, Daniel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Tsagarakis, N. G.
    Caldwell, D.G.
    A Haptic Enabled Multimodal Pre-Operative Planner for Hip Arthroplasty. 2005. In: World Haptics Conference: First Joint Eurohaptics Conference and Symposium on Haptic Interfaces for Virtual Environment and Teleoperator Systems, Proceedings, 2005, p. 503-504. Conference paper (Refereed)
    Abstract [en]

    This paper introduces the multisense idea, with a special reference to the use of haptics in the medical field and, in particular, in the planning of total hip replacement surgery. We emphasise the integration of different modalities and the capability of the multimodal system to gather and register data coming from different sources.

  • 7. Landsiedel, Christian
    et al.
    Edlund, Jens
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Eyben, Florian
    Neiberg, Daniel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Schuller, Björn
    Syllabification of conversational speech using bidirectional long-short-term memory neural networks. 2011. In: Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, Prague, Czech Republic, 2011, p. 5256-5259. Conference paper (Refereed)
    Abstract [en]

    Segmentation of speech signals is a crucial task in many types of speech analysis. We present a novel approach to segmentation at the syllable level, using a Bidirectional Long-Short-Term Memory Neural Network. It performs estimation of syllable nucleus positions based on regression of perceptually motivated input features to a smooth target function. Peak selection is performed to attain valid nuclei positions. Performance of the model is evaluated on the levels of both syllables and the vowel segments making up the syllable nuclei. The general applicability of the approach is illustrated by good results for two common databases - Switchboard and TIMIT - for both read and spontaneous speech, and a favourable comparison with other published results.

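    The abstract above combines BLSTM regression to a smooth target with peak picking. The sketch below is only a minimal illustration of that two-stage structure in PyTorch; the feature set, network size, target construction and peak thresholds are assumptions, not the published configuration.

    import torch
    import torch.nn as nn
    from scipy.signal import find_peaks

    class NucleusRegressor(nn.Module):
        """Bidirectional LSTM that regresses a smooth 'nucleus-ness' target per frame."""
        def __init__(self, n_feats=26, hidden=64):
            super().__init__()
            self.blstm = nn.LSTM(n_feats, hidden, batch_first=True, bidirectional=True)
            self.out = nn.Linear(2 * hidden, 1)

        def forward(self, x):                    # x: (batch, frames, n_feats)
            h, _ = self.blstm(x)
            return self.out(h).squeeze(-1)       # (batch, frames)

    model = NucleusRegressor()
    loss_fn = nn.MSELoss()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)

    # feats and target would come from a real corpus; here they are placeholders.
    feats = torch.randn(8, 300, 26)              # hypothetical perceptual features
    target = torch.rand(8, 300)                  # hypothetical smooth target function

    for _ in range(10):                          # a few illustrative updates
        opt.zero_grad()
        loss = loss_fn(model(feats), target)
        loss.backward()
        opt.step()

    # Peak selection on the predicted curve yields candidate nucleus positions.
    pred = model(feats[:1]).detach().numpy()[0]
    nuclei_frames, _ = find_peaks(pred, height=0.3, distance=10)
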
  • 8. Laukka, P.
    et al.
    Elenius, Kjell
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Fredriksson, M.
    Furumark, T.
    Neiberg, Daniel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Vocal Expression in spontaneous and experimentally induced affective speech: Acoustic correlates of anxiety, irritation and resignation. 2008. In: Proceedings of the LREC 2008 Workshop on Corpora for Research on Emotion and Affect, Marrakesh, Morocco, 2008, p. 44-47. Conference paper (Refereed)
    Abstract [en]

    We present two studies on authentic vocal affect expressions. In Study 1, the speech of social phobics was recorded in an anxiogenic public speaking task both before and after treatment. In Study 2, the speech material was collected from real life human-computer interactions. All speech samples were acoustically analyzed and subjected to listening tests. Results from Study 1 showed that a decrease in experienced state anxiety after treatment was accompanied by corresponding decreases in a) several acoustic parameters (i.e., mean and maximum F0, proportion of high-frequency components in the energy spectrum, and proportion of silent pauses), and b) listeners’ perceived level of nervousness. Both speakers’ self-ratings of state anxiety and listeners’ ratings of perceived nervousness were further correlated with similar acoustic parameters. Results from Study 2 revealed that mean and maximum F0, mean voice intensity and H1-H2 were higher for speech perceived as irritated than for speech perceived as neutral. Also, speech perceived as resigned had lower mean and maximum F0, and mean voice intensity than neutral speech. Listeners’ ratings of irritation, resignation and emotion intensity were further correlated with several acoustic parameters. The results complement earlier studies on vocal affect expression which have been conducted on posed, rather than authentic, emotional speech.

  • 9. Laukka, P.
    et al.
    Neiberg, Daniel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Elfenbein, HA
    Classification of affective speech within and across cultures. 2013. In: Frontiers in Psychology, E-ISSN 1664-1078. Article in journal (Refereed)
    Abstract [en]

    Affect in speech is conveyed by patterns of pitch, intensity, voice quality and temporal features. The authors investigated how consistently emotions are expressed within and across cultures using a selection of 3,100 emotion portrayals from the VENEC corpus. The selection consisted of 11 emotions expressed with 3 levels of emotion intensity portrayed by professional actors from 5 different English speaking cultures (Australia, India, Kenya, Singapore, and USA). Classification experiments (nu-SVM) based on acoustic measures were performed in conditions where training and evaluation were conducted either within the same or different cultures and/or emotion intensities. Results first showed that average recall rates were 2.4-3.0 times higher than chance for intra- and inter-cultural conditions, whereas performance dropped 7-8 percentage units for cross-cultural conditions. This provides the first demonstration of an in-group advantage in cross-cultural emotion recognition using acoustic-feature-based classification. We further observed that matching the intensity level in training and testing data gave an advantage for high and medium intensity levels, but when classifying stimuli of unknown intensity the best performance was achieved with models trained on high intensity stimuli. Finally, classification performance across conditions varied as a function of emotion, with the largest consistency for happiness, lust and relief. Implications for studies on cross-cultural emotion recognition and cross-corpora classification will be discussed.

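    To make the classification setup above concrete, here is a minimal nu-SVM sketch with scikit-learn. The feature vectors, labels and culture tags are random placeholders standing in for the VENEC acoustic measures; only the train-on-one-culture, test-on-another structure is illustrated, not the published feature set or parameters.

    import numpy as np
    from sklearn.svm import NuSVC
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import make_pipeline
    from sklearn.metrics import recall_score

    # Placeholder data: acoustic feature vectors, 11 emotion labels, culture tags.
    rng = np.random.default_rng(1)
    X = rng.standard_normal((600, 40))
    y = rng.integers(0, 11, 600)
    culture = rng.choice(["AUS", "IND", "KEN", "SGP", "USA"], 600)

    def cross_cultural_recall(train_culture, test_culture):
        """Train on one culture, test on another; report unweighted average recall."""
        clf = make_pipeline(StandardScaler(), NuSVC(nu=0.3, kernel="rbf"))
        tr, te = culture == train_culture, culture == test_culture
        clf.fit(X[tr], y[tr])
        return recall_score(y[te], clf.predict(X[te]), average="macro")

    # Cross-cultural condition; an intra-cultural condition would additionally
    # need a held-out split within the training culture.
    print(cross_cultural_recall("USA", "IND"))
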
  • 10. Laukka, P.
    et al.
    Neiberg, Daniel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Forsell, Mimmi
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Karlsson, Inger
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Elenius, Kjell
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Expression of Affect in Spontaneous Speech: Acoustic Correlates and Automatic Detection of Irritation and Resignation. 2011. In: Computer speech & language (Print), ISSN 0885-2308, E-ISSN 1095-8363, Vol. 25, no 1, p. 84-104. Article in journal (Refereed)
    Abstract [en]

    The majority of previous studies on vocal expression have been conducted on posed expressions. In contrast, we utilized a large corpus of authentic affective speech recorded from real-life voice controlled telephone services. Listeners rated a selection of 200 utterances from this corpus with regard to level of perceived irritation, resignation, neutrality, and emotion intensity. The selected utterances came from 64 different speakers who each provided both neutral and affective stimuli. All utterances were further automatically analyzed regarding a comprehensive set of acoustic measures related to F0, intensity, formants, voice source, and temporal characteristics of speech. Results first showed that several significant acoustic differences were found between utterances classified as neutral and utterances classified as irritated or resigned using a within-persons design. Second, listeners' ratings on each scale were associated with several acoustic measures. In general the acoustic correlates of irritation, resignation, and emotion intensity were similar to previous findings obtained with posed expressions, though the effect sizes were smaller for the authentic expressions. Third, automatic classification (using LDA classifiers both with and without speaker adaptation) of irritation, resignation, and neutral performed at a level comparable to human performance, though human listeners and machines did not necessarily classify individual utterances similarly. Fourth, clearly perceived exemplars of irritation and resignation were rare in our corpus. These findings were discussed in relation to future research.

  • 11. Laukka, Petri
    et al.
    Neiberg, Daniel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Elfenbein, Hillary Anger
    Evidence for Cultural Dialects in Vocal Emotion Expression: Acoustic Classification Within and Across Five Nations. 2014. In: Emotion, ISSN 1528-3542, E-ISSN 1931-1516, Vol. 14, no 3, p. 445-449. Article in journal (Refereed)
    Abstract [en]

    The possibility of cultural differences in the fundamental acoustic patterns used to express emotion through the voice is an unanswered question central to the larger debate about the universality versus cultural specificity of emotion. This study used emotionally inflected standard-content speech segments expressing 11 emotions produced by 100 professional actors from 5 English-speaking cultures. Machine learning simulations were employed to classify expressions based on their acoustic features, using conditions where training and testing were conducted on stimuli coming from either the same or different cultures. A wide range of emotions were classified with above-chance accuracy in cross-cultural conditions, suggesting vocal expressions share important characteristics across cultures. However, classification showed an in-group advantage with higher accuracy in within- versus cross-cultural conditions. This finding demonstrates cultural differences in expressive vocal style, and supports the dialect theory of emotions according to which greater recognition of expressions from in-group members results from greater familiarity with culturally specific expressive styles.

  • 12.
    Neiberg, Daniel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Modelling Paralinguistic Conversational Interaction: Towards social awareness in spoken human-machine dialogue. 2012. Doctoral thesis, comprehensive summary (Other academic)
    Abstract [en]

    Parallel with the orthographic streams of words in conversation are multiple layered epiphenomena, short in duration and with a communicative purpose. These paralinguistic events regulate the interaction flow via gaze, gestures and intonation. This thesis focuses on how to compute, model, discover and analyze prosody and its applications for spoken dialog systems. Specifically, it addresses automatic classification and analysis of conversational cues related to turn-taking, brief feedback, affective expressions and their cross-relationships, as well as their cognitive and neurological basis. Techniques are proposed for instantaneous and suprasegmental parameterization of scalar- and vector-valued representations of fundamental frequency, but also intensity and voice quality. Examples are given for how to engineer supervised learned automata for off-line processing of conversational corpora as well as for incremental on-line processing with low-latency constraints, suitable as detector modules in a responsive social interface. Specific attention is given to the communicative functions of vocal feedback like "mhm", "okay" and "yeah, that’s right" as postulated by theories of grounding, emotion and a survey of laymen’s opinions. The potential functions and their prosodic cues are investigated via automatic decoding, data mining, exploratory visualization and descriptive measurements.

    Download full text (pdf)
  • 13.
    Neiberg, Daniel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Visualizing prosodic densities and contours: Forming one from many. 2011. In: TMH-QPSR, ISSN 1104-5787, Vol. 51, no 1, p. 57-60. Article in journal (Other academic)
    Abstract [en]

    This paper summarizes a flora of explorative visualization techniques for prosody developed at KTH. It is demonstrated how analyses can be made that go beyond conventional methodology. Examples are given for turn taking, affective speech, response tokens and Swedish accent II.

  • 14.
    Neiberg, Daniel
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Ananthakrishnan, Gopal
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    On the Non-uniqueness of Acoustic-to-Articulatory Mapping. 2008. In: Proceedings FONETIK 2008, Göteborg, 2008, p. 9-13. Conference paper (Other academic)
    Abstract [en]

    This paper studies the hypothesis that the acoustic-to-articulatory mapping is non-unique, statistically. The distributions of the acoustic and articulatory spaces are obtained by minimizing the BIC while fitting the data into a GMM using the EM algorithm. The kurtosis is used to measure the non-Gaussianity of the distributions and the Bhattacharya distance is used to find the difference between distributions of the acoustic vectors producing non-unique articulator configurations. It is found that stop consonants and alveolar fricatives are generally not only non-linear but also non-unique, while dental fricatives are found to be highly non-linear but fairly unique.

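    The pipeline in the abstract above (EM-fitted GMMs selected by BIC, kurtosis as a non-Gaussianity measure, and a Bhattacharyya-type distance between Gaussians) can be sketched as follows. The data are random placeholders, and the paper's feature sets and settings are not reproduced.

    import numpy as np
    from scipy.stats import kurtosis
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(2)
    articulatory = rng.standard_normal((2000, 4))      # stand-in for EMA parameters

    # Fit GMMs by EM and keep the component count that minimises the BIC.
    candidates = [GaussianMixture(k, covariance_type="full", random_state=0)
                  .fit(articulatory) for k in range(1, 9)]
    best = min(candidates, key=lambda g: g.bic(articulatory))

    # Excess kurtosis per dimension as a rough non-Gaussianity measure (0 for a Gaussian).
    print(kurtosis(articulatory, axis=0, fisher=True))

    def bhattacharyya(mu1, S1, mu2, S2):
        """Closed-form Bhattacharyya distance between two multivariate Gaussians."""
        S = 0.5 * (S1 + S2)
        d = mu1 - mu2
        maha = 0.125 * d @ np.linalg.solve(S, d)
        logdet = 0.5 * (np.linalg.slogdet(S)[1]
                        - 0.5 * (np.linalg.slogdet(S1)[1] + np.linalg.slogdet(S2)[1]))
        return maha + logdet

    # Distance between the two modes of the 2-component candidate, e.g. to compare
    # distributions of acoustic vectors that map to distinct articulatory regions.
    g2 = candidates[1]
    print(bhattacharyya(g2.means_[0], g2.covariances_[0],
                        g2.means_[1], g2.covariances_[1]))
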
  • 15.
    Neiberg, Daniel
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Ananthakrishnan, Gopal
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Blomberg, Mats
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    On Acquiring Speech Production Knowledge from Articulatory Measurements for Phoneme Recognition. 2009. In: INTERSPEECH 2009: 10TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2009, BAIXAS: ISCA-INST SPEECH COMMUNICATION ASSOC, 2009, p. 1387-1390. Conference paper (Refereed)
    Abstract [en]

    The paper proposes a general version of a coupled Hidden Markov/Bayesian Network model for performing phoneme recognition on acoustic-articulatory data. The model uses knowledge learned from the articulatory measurements, available for training, for phoneme recognition on the acoustic input. After training on the articulatory data, the model is able to predict 71.5% of the articulatory state sequences using the acoustic input. Using optimized parameters, the proposed method shows a slight improvement for two speakers over the baseline phoneme recognition system which does not use articulatory knowledge. However, the improvement is only statistically significant for one of the speakers. While there is an improvement in recognition accuracy for the vowels, diphthongs and to some extent the semi-vowels, there is a decrease in accuracy for the remaining phonemes.

  • 16.
    Neiberg, Daniel
    et al.
    KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT. KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Ananthakrishnan, Gopal
    KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT. KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT. KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    The Acoustic to Articulation Mapping: Non-linear or Non-unique? 2008. In: INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, BAIXAS: ISCA-INST SPEECH COMMUNICATION ASSOC, 2008, p. 1485-1488. Conference paper (Refereed)
    Abstract [en]

    This paper studies the hypothesis that the acoustic-to-articulatory mapping is non-unique, statistically. The distributions of the acoustic and articulatory spaces are obtained by fitting the data into a Gaussian Mixture Model. The kurtosis is used to measure the non-Gaussianity of the distributions and the Bhattacharya distance is used to find the difference between distributions of the acoustic vectors producing non-unique articulator configurations. It is found that stop consonants and alveolar fricatives are generally not only non-linear but also non-unique, while dental fricatives are found to be highly non-linear but fairly unique. Two more investigations are also discussed: the first is on how well the best possible piecewise linear regression is likely to perform; the second is on whether the dynamic constraints improve the ability to predict different articulatory regions corresponding to the same region in the acoustic space.

  • 17.
    Neiberg, Daniel
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Ananthakrishnan, Gopal
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Tracking pitch contours using minimum jerk trajectories. 2011. In: INTERSPEECH 2011, 12th Annual Conference of the International Speech Communication Association, 2011, p. 2056-2059. Conference paper (Refereed)
    Abstract [en]

    This paper proposes a fundamental frequency tracker, with the specific purpose of comparing the automatic estimates with pitch contours that are sketched by trained phoneticians. The method uses a frequency domain approach to estimate pitch tracks that form minimum jerk trajectories. This method tries to mimic motor movements of the hand made while sketching. When the fundamental frequencies tracked by the proposed method on the oral and laryngograph signals were compared using the MOCHA-TIMIT database, the correlation was 0.98 and the root mean squared error was 4.0 Hz, which was slightly better than a state-of-the-art pitch tracking algorithm included in the ESPS. We also demonstrate how the proposed algorithm could be applied when comparing with sketches made by phoneticians of the variations in accent II among the Swedish dialects.

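    The paper above formulates its tracker in the frequency domain; that formulation is not reproduced here. As a point of reference, the snippet below only shows the textbook closed-form minimum-jerk trajectory between two anchor values, which is the kind of smooth, sketch-like contour segment the abstract refers to.

    import numpy as np

    def minimum_jerk_segment(f0_start, f0_end, n_frames):
        """Closed-form minimum-jerk trajectory between two F0 values.

        x(t) = x0 + (x1 - x0) * (10 t^3 - 15 t^4 + 6 t^5), t in [0, 1],
        which has zero velocity and acceleration at both endpoints.
        """
        t = np.linspace(0.0, 1.0, n_frames)
        s = 10 * t**3 - 15 * t**4 + 6 * t**5
        return f0_start + (f0_end - f0_start) * s

    # Example: a smooth 120 -> 180 Hz rise over 30 frames.
    contour = minimum_jerk_segment(120.0, 180.0, 30)
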
  • 18.
    Neiberg, Daniel
    et al.
    KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT. KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Elenius, Kjell
    KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT. KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Automatic Recognition of Anger in Spontaneous Speech. 2008. In: INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, BAIXAS: ISCA-INST SPEECH COMMUNICATION ASSOC, 2008, p. 2755-2758. Conference paper (Refereed)
    Abstract [en]

    Automatic detection of real life negative emotions in speech has been evaluated using Linear Discriminant Analysis, LDA, with "classic" emotion features and a classifier based on Gaussian Mixture Models, GMMs. The latter uses Mel-Frequency Cepstral Coefficients, MFCCs, from a filter bank covering the 300-3400 Hz region to capture spectral shape and formants, and another in the 20-600 Hz region to capture prosody. Both classifiers have been tested on an extensive corpus from Swedish voice controlled telephone services. The results indicate that it is possible to detect anger with reasonable accuracy (average recall 83%) in natural speech and that the GMM method performed better than the LDA one.

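    A minimal sketch of the GMM branch described above, assuming librosa for feature extraction. The two filter-bank ranges follow the abstract (300-3400 Hz and 20-600 Hz), but the coefficient counts, model sizes, threshold and training data are placeholders rather than the published setup.

    import numpy as np
    import librosa
    from sklearn.mixture import GaussianMixture

    def two_band_mfcc(wav, sr):
        """Frame-level features: MFCCs over 300-3400 Hz plus low-band MFCCs over 20-600 Hz."""
        spec = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=12, fmin=300, fmax=3400)
        pros = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=6, fmin=20, fmax=600)
        return np.vstack([spec, pros]).T                 # (frames, 18)

    # One GMM per class, trained on pooled frames from labelled utterances (placeholders).
    rng = np.random.default_rng(3)
    neutral_frames = rng.standard_normal((4000, 18))     # stand-ins for real features
    angry_frames = rng.standard_normal((4000, 18)) + 0.5

    gmm_neu = GaussianMixture(32, covariance_type="diag", random_state=0).fit(neutral_frames)
    gmm_ang = GaussianMixture(32, covariance_type="diag", random_state=0).fit(angry_frames)

    def is_angry(utterance_frames, threshold=0.0):
        """Average frame log-likelihood ratio decides the utterance label."""
        llr = gmm_ang.score(utterance_frames) - gmm_neu.score(utterance_frames)
        return llr > threshold
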
  • 19.
    Neiberg, Daniel
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Elenius, Kjell
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Burger, S.
    Emotion Recognition. 2009. In: Computers in the Human Interaction Loop / [ed] Waibel, A.; Stiefelhagen, R., Berlin/Heidelberg: Springer, 2009, p. 96-105. Chapter in book (Refereed)
  • 20.
    Neiberg, Daniel
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Elenius, Kjell
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Karlsson, Inger
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Laskowski, K.
    Emotion Recognition in Spontaneous Speech. 2006. In: Working Papers 52: Proceedings of Fonetik 2006, Lund, Sweden: Lund University, Centre for Languages & Literature, Dept. of Linguistics & Phonetics, 2006, p. 101-104. Conference paper (Other academic)
    Abstract [en]

    Automatic detection of emotions has been evaluated using standard Mel-frequency Cepstral Coefficients, MFCCs, and a variant, MFCC-low, that is calculated between 20 and 300 Hz in order to model pitch. Plain pitch features have been used as well. These acoustic features have all been modeled by Gaussian mixture models, GMMs, on the frame level. The method has been tested on two different corpora and languages; Swedish voice controlled telephone services and English meetings. The results indicate that using GMMs on the frame level is a feasible technique for emotion classification. The two MFCC methods have similar performance, and MFCC-low outperforms the pitch features. Combining the three classifiers significantly improves performance.

  • 21.
    Neiberg, Daniel
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Elenius, Kjell
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Laskowski, Kornel
    Emotion Recognition in Spontaneous Speech Using GMMs. 2006. In: INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, BAIXAS: ISCA-INST SPEECH COMMUNICATION ASSOC, 2006, p. 809-812. Conference paper (Refereed)
    Abstract [en]

    Automatic detection of emotions has been evaluated using standard Mel-frequency Cepstral Coefficients, MFCCs, and a variant, MFCC-low, calculated between 20 and 300 Hz, in order to model pitch. Also plain pitch features have been used. These acoustic features have all been modeled by Gaussian mixture models, GMMs, on the frame level. The method has been tested on two different corpora and languages; Swedish voice controlled telephone services and English meetings. The results indicate that using GMMs on the frame level is a feasible technique for emotion classification. The two MFCC methods have similar performance, and MFCC-low outperforms the pitch features. Combining the three classifiers significantly improves performance.

  • 22.
    Neiberg, Daniel
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    A Dual Channel Coupled Decoder for Fillers and Feedback. 2011. In: INTERSPEECH 2011, 12th Annual Conference of the International Speech Communication Association, 2011, p. 3097-3100. Conference paper (Refereed)
    Abstract [en]

    This study presents a dual channel decoder capable of modeling cross-speaker dependencies for segmentation and classification of fillers and feedbacks in conversational speech found in the DEAL corpus. For the same number of Gaussians per state, we have shown improvement in terms of average F-score for the successive addition of 1) an increased frame rate from 10 ms to 50 ms, 2) Joint Maximum Cross-Correlation (JMXC) features in a single channel decoder, 3) a joint transition matrix which captures dependencies symmetrically across the two channels, and 4) coupled acoustic model retraining symmetrically across the two channels. The final step gives a relative improvement of over 100% for fillers and feedbacks compared to our previously published results. The F-scores are in a range that makes it possible to use the decoder as both a voice activity detector and an illocutionary act decoder for semi-automatic annotation.

  • 23.
    Neiberg, Daniel
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Cues to perceived functions of acted and spontaneous feedback expressions. 2012. In: Proceedings of the Interdisciplinary Workshop on Feedback Behaviors in Dialog, 2012, p. 53-56. Conference paper (Refereed)
    Abstract [en]

    We present a two-step study where the first part aims to determine the phonemic prior bias (conditioned on “ah”, “m-hm”, “m-m”, “n-hn”, “oh”, “okay”, “u-hu”, “yeah” and “yes”) in subjects’ perception of six feedback functions (acknowledgment, continuer, disagreement, surprise, enthusiasm and uncertainty). The results showed a clear phonemic prior bias for some tokens, e.g. “ah” and “oh” are commonly interpreted as surprise but “yeah” and “yes” less so. The second part aims to examine determinants of judged typicality, or graded structure, within the six functions of “okay”. Typicality was correlated to four determinants: prosodic central tendency within the function (CT); phonemic prior bias as an approximation to frequency instantiation (FI); the posterior, i.e. CT x FI; and judged Ideality (ID), i.e. similarity to ideals associated with the goals served by its function. The results tentatively suggest that acted expressions are more effectively communicated and that the functions of feedback to a greater extent constitute goal-based categories determined by ideals, and to a lesser extent a taxonomy determined by CT and FI. However, it is possible to automatically predict typicality with a correlation of r = 0.52 via the posterior.

  • 24.
    Neiberg, Daniel
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Exploring the implications for feedback of a neurocognitive theory of overlapped speech. 2012. In: Proceedings of Workshop on Feedback Behaviors in Dialog, 2012, p. 57-60. Conference paper (Refereed)
  • 25.
    Neiberg, Daniel
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Modeling Conversational Interaction Using Coupled Markov Chains. 2010. In: Proceedings of DiSS-LPSS Joint Workshop 2010, 2010. Conference paper (Refereed)
    Abstract [en]

    This paper presents a series of experiments on automatic transcription and classification of fillers and feedbacks in conversational speech corpora. A feature combination of PCA projected normalized F0 Constant-Q Cepstra and MFCCs has been shown to be effective for standard Hidden Markov Models (HMM). We demonstrate how to model both speaker channels with coupled HMMs and show the expected improvements. In particular, we explore model topologies which take advantage of predictive cues for fillers and feedback. This is done by initializing the training with special labels located immediately before fillers in the same channel and immediately before feedbacks in the other speaker channel. The average F-score for a standard HMM is 34.1%, for a coupled HMM 36.7% and for a coupled HMM with pre-filler and pre-feedback labels 40.4%. In a pilot study the detectors are found to be useful for semi-automatic transcription of feedback and fillers in socializing conversations.

  • 26.
    Neiberg, Daniel
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Predicting Speaker Changes and Listener Responses With And Without Eye-contact. 2011. In: INTERSPEECH 2011, 12th Annual Conference of the International Speech Communication Association, Florence, Italy, 2011, p. 1576-1579. Conference paper (Refereed)
    Abstract [en]

    This paper compares turn-taking in terms of timing and prediction in human-human conversations under the conditions when participants have eye-contact versus when there is no eye-contact, as found in the HCRC Map Task corpus. By measuring between-speaker intervals it was found that a larger proportion of speaker shifts occurred in overlap for the no eye-contact condition. For prediction we used prosodic and spectral features parametrized by time-varying length-invariant discrete cosine coefficients. With Gaussian Mixture Modeling and variations of classifier fusion schemes, we explored the task of predicting whether there is an upcoming speaker change (SC) or not (HOLD), at the end of an utterance (EOU) with a pause lag of 200 ms. The label SC was further split into LRs (listener responses, e.g. back-channels) and other TURN-SHIFTs. The prediction was found to be somewhat easier for the eye-contact condition, for which the average recall rates were 60.57%, 66.35%, and 62.00% for TURN-SHIFTs, LR and SC respectively.

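    One plausible reading of the feature pipeline above is sketched below: resample the utterance-final F0 stretch to a fixed length, take a few DCT coefficients as a length-invariant parameterisation, and compare class-conditional GMM likelihoods. The contours, labels and model sizes are placeholders; the paper's exact time-varying parameterisation and fusion scheme are not reproduced.

    import numpy as np
    from scipy.fftpack import dct
    from sklearn.mixture import GaussianMixture

    def contour_dct(f0_contour, n_points=50, n_coeff=8):
        """Length-invariant parameterisation: resample to a fixed length, z-score, DCT."""
        x = np.asarray(f0_contour, dtype=float)
        x = np.interp(np.linspace(0, len(x) - 1, n_points), np.arange(len(x)), x)
        x = (x - x.mean()) / (x.std() + 1e-9)
        return dct(x, type=2, norm="ortho")[:n_coeff]

    # Placeholder end-of-utterance F0 stretches labelled HOLD vs speaker change (SC);
    # speaker-change examples here are given an artificial falling trend.
    rng = np.random.default_rng(4)
    def fake_contour(falling):
        n = int(rng.integers(20, 80))
        return rng.standard_normal(n) + (np.linspace(0.0, -2.0, n) if falling else 0.0)

    hold = np.array([contour_dct(fake_contour(False)) for _ in range(300)])
    change = np.array([contour_dct(fake_contour(True)) for _ in range(300)])

    gmm_hold = GaussianMixture(4, covariance_type="full", random_state=0).fit(hold)
    gmm_change = GaussianMixture(4, covariance_type="full", random_state=0).fit(change)

    def predict_speaker_change(f0_contour):
        v = contour_dct(f0_contour).reshape(1, -1)
        return "SC" if gmm_change.score(v) > gmm_hold.score(v) else "HOLD"

    print(predict_speaker_change(fake_contour(True)))
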
  • 27.
    Neiberg, Daniel
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Prosodic Characterization and Automatic Classification of Conversational Grunts in Swedish. 2010. In: Working Papers 54: Proceedings from Fonetik 2010, 2010. Conference paper (Other academic)
    Abstract [en]

    Conversation is the most common use of speech. Any automatic dialog system, pretending to mimic a human, must be able to successfully detect typical sounds and meanings of spontaneous conversational speech. Automatic transcription of the function of linguistic units, sometimes referred to as Dialog Acts (DAs), Cue Phrases or Discourse Markers, is an emerging area of research. This can be done on a pure lexical level, or by using prosody alone (Laskowski and Shriberg, 2010; Goto et al., 1999), or a combination thereof (Sridhar et al., 2009; Gravano et al., 2007). However, it is not straightforward to train a language model for non-verbal content (e.g. “mm”, “mhm”, “eh”, “em”), not only since it is questionable if these sounds are words, but also because of the lack of standardized annotation schemes. Ward (2000) refers to these tokens as conversational grunts, which is also the scope of this study. Feedback tokens are usually sub-divided into yes/no answers, backchannels and acknowledgments. In this study, it is the attitude of the response which is the focus of interest. Thus, the cut is instead made between dis-preference, news receiving and general feedback. These are further subdivided by their turn-taking effect: Other speaker, Same speaker and Simultaneous start. This allows us to verify whether conversational grunts are simply carriers of prosodic information. In this study, we use a supra-segmental prosodic signal representation based on Time Varying Constant-Q Cepstral Coefficients (TVCQCC), introduced in (Neiberg et al., 2010), for classification and intuitive visualization of feedback and fillers. The contribution of the end of the interlocutor's left context for predicting the turn-taking effect has been studied for a while (Duncan, 1972) and is also addressed in this study. In addition, we examine the effect of contextual timing features, which have been shown to be useful in DA recognition (Laskowski and Shriberg, 2010). We use the Swedish DEAL corpus, which has annotated fillers and feedback attitudes. Classification results using linear discriminant analysis are presented. It was found that feedback tokens followed by a clean floor taking lose some of the prosodic cues which signal attitude, compared to clean continuer feedback. Turn-taking effects can be predicted well over chance level, while Simultaneous Start cannot be predicted at all. However, feedback tokens before Simultaneous Starts were found to be more similar to feedback continuers than to turn-initial feedback tokens, which may be explained as inappropriate floor-stealing attempts by the feedback-producing speaker. An analysis based on the prototypical spectrograms closely follows the results for Bad News (Dispreference) vs Good News (News receiving) found in Freese and Maynard (1998), although the definitions differ slightly.

  • 28.
    Neiberg, Daniel
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    The Prosody of Swedish Conversational Grunts. 2010. In: 11th Annual Conference of the International Speech Communication Association: Spoken Language Processing for All, INTERSPEECH 2010, 2010, p. 2562-2565. Conference paper (Refereed)
    Abstract [en]

    This paper explores conversational grunts in a face-to-face setting. The study investigates the prosody and turn-taking effect of fillers and feedback tokens that have been annotated for attitudes. The grunts were selected from the DEAL corpus and automatically annotated for their turn taking effect. A novel suprasegmental prosodic signal representation and contextual timing features are used for classification and visualization. Classification results using linear discriminant analysis show that turn-initial feedback tokens lose some of their attitude-signaling prosodic cues compared to non-overlapping continuer feedback tokens. Turn taking effects can be predicted well over chance level, except Simultaneous Starts. However, feedback tokens before places where both speakers take the turn were more similar to feedback continuers than to turn initial feedback tokens.

  • 29.
    Neiberg, Daniel
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Towards letting machines humming in the right way: prosodic analysis of six functions of short feedback tokens in English. 2012. In: Proceedings of Fonetik, 2012. Conference paper (Other academic)
  • 30.
    Neiberg, Daniel
    et al.
    KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Laukka, P.
    Ananthakrishnan, Gopal
    KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Classification of Affective Speech using Normalized Time-Frequency Cepstra. 2010. In: Speech Prosody 2010 Conference Proceedings, Chicago, Illinois, U.S.A., 2010. Conference paper (Refereed)
    Abstract [en]

    Subtle temporal and spectral differences between categorical realizations of para-linguistic phenomena (e.g. affective vocal expressions) are hard to capture and describe. In this paper we present a signal representation based on Time Varying Constant-Q Cepstral Coefficients (TVCQCC) derived for this purpose. A method which utilizes the special properties of the constant Q-transform for mean F0 estimation and normalization is described. The coefficients are invariant to utterance length, and as a special case, a representation for prosody is considered. Speaker-independent classification results using nu-SVM on the Berlin EMO-DB and on two closed sets of basic (anger, disgust, fear, happiness, sadness, neutral) and social/interpersonal (affection, pride, shame) emotions recorded by forty professional actors from two English dialect areas are reported. The accuracy for the Berlin EMO-DB is 71.2%, the accuracy for the first set, including the basic emotions, was 44.6%, and for the second set, including both basic and social emotions, the accuracy was 31.7%. It was found that F0 normalization boosts the performance and that a combined feature set shows the best performance.

  • 31.
    Neiberg, Daniel
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Laukka, P.
    Elfenbein, H. A.
    Intra-, Inter-, and Cross-cultural Classification of Vocal Affect. 2011. In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, Florence, Italy, 2011, p. 1592-1595. Conference paper (Refereed)
    Abstract [en]

    We present intra-, inter- and cross-cultural classifications of vocal expressions. Stimuli were selected from the VENEC corpus and consisted of portrayals of 11 emotions, each expressed with 3 levels of intensity. Classification (nu-SVM) was based on acoustic measures related to pitch, intensity, formants, voice source and duration. Results showed that mean recall across emotions was around 2.4-3 times higher than chance level for both intra- and inter-cultural conditions. For cross-cultural conditions, the relative performance dropped 26%, 32%, and 34% for high, medium, and low emotion intensity, respectively. This suggests that intra-cultural models were more sensitive to mismatched conditions for low emotion intensity. Preliminary results further indicated that recall rate varied as a function of emotion, with lust and sadness showing the smallest performance drops in the cross-cultural condition.

  • 32.
    Neiberg, Daniel
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Semi-supervised methods for exploring the acoustics of simple productive feedback. 2013. In: Speech Communication, ISSN 0167-6393, E-ISSN 1872-7182, Vol. 55, no 3, p. 451-469. Article in journal (Refereed)
    Abstract [en]

    This paper proposes methods for exploring acoustic correlates to feedback functions. A sub-language of Swedish, simple productive feedback, is introduced to facilitate investigations of the functional contributions of base tokens, phonological operations and prosody. The function of feedback is to convey the listeners' attention, understanding and affective states. In order to handle the large number of possible affective states, the current study starts by performing a listening experiment where humans annotated the functional similarity of feedback tokens with different prosodic realizations. By selecting a set of stimuli that had different prosodic distances from a reference token, it was possible to compute a generalised functional distance measure. The resulting generalised functional distance measure was shown to be correlated with prosodic distance, but the correlations varied as a function of base tokens and phonological operations. In a subsequent listening test, a small representative sample of feedback tokens were rated for understanding, agreement, interest, surprise and certainty. These ratings were found to explain a significant proportion of the generalised functional distance. By combining the acoustic analysis with an explorative visualisation of the prosody, we have established a map between human perception of similarity between feedback tokens, their measured distance in acoustic space, and the link to the perception of the function of feedback tokens with varying realisations.

  • 33.
    Neiberg, Daniel
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Truong, K. P.
    A Maximum Latency Classifier for Listener Responses. 2010. In: Proceedings of SLTC 2010, Linköping, Sweden, 2010. Conference paper (Refereed)
    Abstract [en]

    When Listener Responses such as “yeah”, “right” or “mhm” are uttered in a face-to-face conversation, it is not uncommon for the interlocutor to continue to speak in overlap, i.e. before the Listener becomes silent. We propose a classifier which can classify incoming speech as a Listener Response or not before the talk-spurt ends. The classifier is implemented as an upgrade of the Embodied Conversational Agent developed in the SEMAINE project during the eNTERFACE 2010 workshop.

  • 34.
    Neiberg, Daniel
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Truong, K.P.
    Online Detection Of Vocal Listener Responses With Maximum Latency Constraints. 2011. In: Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, Prague, Czech Republic, 2011, p. 5836-5839. Conference paper (Refereed)
    Abstract [en]

    When human listeners utter Listener Responses (e.g. back-channels or acknowledgments) such as `yeah' and `mmhmm', interlocutors commonly continue to speak or resume their speech even before the listener has finished his/her response. This type of speech interactivity results in frequent speech overlap which is common in human-human conversation. To allow for this type of speech interactivity to occur between humans and spoken dialog systems, which will result in more human-like continuous and smoother human-machine interaction, we propose an on-line classifier which can classify incoming speech as Listener Responses. We show that it is possible to detect vocal Listener Responses using maximum latency thresholds of 100-500 ms, thereby obtaining equal error rates ranging from 34% to 28% by using an energy based voice activity detector.

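    The maximum-latency idea above can be illustrated with a toy online loop in which a decision is forced once a fixed time budget after speech onset has elapsed. This is only a structural sketch with an energy-based voice activity check and hand-picked thresholds; it is not the classifier or the SEMAINE integration described in the paper.

    import numpy as np

    FRAME_MS = 10              # assumed frame step
    MAX_LATENCY_MS = 300       # decision budget after detected speech onset

    def frame_energy_db(frame):
        return 10.0 * np.log10(np.mean(np.square(frame.astype(float))) + 1e-12)

    def classify_talkspurt(frames, vad_threshold_db=-35.0, max_lr_frames=25):
        """Toy online decision: within the latency budget, short low-energy talkspurts
        are flagged as Listener Responses (LR); everything else as full speech."""
        onset = None
        n_speech = 0
        for i, frame in enumerate(frames):
            if frame_energy_db(frame) > vad_threshold_db:
                onset = i if onset is None else onset
                n_speech += 1
            if onset is not None and (i - onset) * FRAME_MS >= MAX_LATENCY_MS:
                # Decision forced by the maximum-latency constraint.
                return "LR" if n_speech <= max_lr_frames else "SPEECH"
        return "LR" if onset is not None else "SILENCE"

    # Example: 16 kHz audio split into 10 ms frames (160 samples each).
    audio = np.random.default_rng(5).standard_normal(16000) * 0.01
    print(classify_talkspurt(audio.reshape(-1, 160)))
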
  • 35. Reidsma, Dennis
    et al.
    de Kok, Iwan
    Neiberg, Daniel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Pammi, Sathish Chandra
    van Straalen, Bart
    Truong, Khiet
    van Welbergen, Herwin
    Continuous Interaction with a Virtual Human. 2011. In: Journal on Multimodal User Interfaces, ISSN 1783-7677, E-ISSN 1783-8738, Vol. 4, no 2, p. 97-118. Article in journal (Refereed)
    Abstract [en]

    This paper presents our progress in developing a Virtual Human capable of being an attentive speaker. Such a Virtual Human should be able to attend to its interaction partner while it is speaking, and modify its communicative behavior on the fly based on what it observes in the behavior of its partner. We report new developments concerning a number of aspects, such as scheduling and interrupting multimodal behavior, automatic classification of listener responses, generation of response eliciting behavior, and strategies for generating appropriate reactions to listener responses. On the basis of this progress, a task-based setup for a responsive Virtual Human was implemented to carry out two user studies, the results of which are presented and discussed in this paper.

  • 36. Testi, D.
    et al.
    Zannoni, C.
    Caldwell, D.G.
    Neiberg, Daniel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Clapworthy, G.J.
    Viceconti, M.
    An innovative multisensorial environment for pre-operative planning of total hip replacement. 2005. In: 5th Annual Meeting of Computer Assisted Orthopaedic Surgery, 2005. Conference paper (Refereed)
  • 37. Testi, D.
    et al.
    Zannoni, C.
    Petrone, M.
    Clapworthy, G. J.
    Neiberg, Daniel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Tsagarakis, N. G.
    Caldwell, D. G.
    Viceconti, M.
    A multimodal and multisensorial pre-operative planning environment for total hip replacement. 2005. In: Proceedings - Third International Conference on Medical Information Visualisation - BioMedical Visualisation, MediVis 2005, 2005, p. 25-29. Conference paper (Refereed)
    Abstract [en]

    This paper describes a new environment for the pre-operative planning of total hip replacement. The system is based on a multimodal/multisensorial interface, which includes advanced software visualisation and evaluation modules for the planning and state-of-the-art technologies for immersive interface (stereoscopic display, different six degrees of freedom tracking technologies, speech recognition, and haptic feedbacks). This paper is focused on the final clinical application description. More specific visualisation-related modules are described in other related papers.

  • 38. Testi, D.
    et al.
    Zannoni, C.
    Petrone, M.
    Clapworthy, G. J.
    Neiberg, Daniel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Tsagarakis, N. G.
    Caldwell, D. G.
    Viceconti, M.
    A multimodal and multisensorial pre-operative planning environment for total hip replacement. 2005. In: Proceedings of the Third International Conference on Medical Information Visualisation: BioMedical Visualisation (MediVis’05), 2005. Conference paper (Refereed)
    Abstract [en]

    This paper describes a new environment for the pre-operative planning of total hip replacement. The system is based on a multimodal/multisensorial interface, which includes advanced software visualisation and evaluation modules for the planning and state-of-the-art technologies for immersive interface (stereoscopic display, different six degrees of freedom tracking technologies, speech recognition, and haptic feedbacks). This paper is focused on the final clinical application description. More specific visualisation-related modules are described in other related papers.
