  • 301. Laskowski, Kornel
    et al.
    Edlund, Jens
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Heldner, Mattias
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    An instantaneous vector representation of delta pitch for speaker-change prediction in conversational dialogue systems2008In: 2008 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, New York: IEEE , 2008, p. 5041-5044Conference paper (Refereed)
    Abstract [en]

    As spoken dialogue systems become deployed in increasingly complex domains, they face rising demands on the naturalness of interaction. We focus on system responsiveness, aiming to mimic human-like dialogue flow control by predicting speaker changes as observed in real human-human conversations. We derive an instantaneous vector representation of pitch variation and show that it is amenable to standard acoustic modeling techniques. Using a small amount of automatically labeled data, we train models which significantly outperform current state-of-the-art pause-only systems, and replicate to within 1% absolute the performance of our previously published hand-crafted baseline. The new system additionally offers scope for run-time control over the precision or recall of locations at which to speak.

  • 302. Laskowski, Kornel
    et al.
    Edlund, Jens
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Heldner, Mattias
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Incremental learning and forgetting in incremental stochastic turn-taking models2011In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, Florence, Italy, 2011, p. 2080-2083Conference paper (Refereed)
    Abstract [en]

    We present a computational framework for stochastically modeling dyad interaction chronograms. The framework's most novel feature is the capacity for incremental learning and forgetting. To showcase its flexibility, we design experiments answering four concrete questions about the systematics of spoken interaction. The results show that: (1) individuals are clearly affected by one another; (2) there is individual variation in interaction strategy; (3) strategies wander in time rather than converge; and (4) individuals exhibit similarity with their interlocutors. We expect the proposed framework to be capable of answering many such questions with little additional effort.

  • 303. Laskowski, Kornel
    et al.
    Edlund, Jens
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Heldner, Mattias
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Learning prosodic sequences using the fundamental frequency variation spectrum2008In: Proceedings of the Speech Prosody 2008 Conference, Campinas, Brazil: Editora RG/CNPq , 2008, p. 151-154Conference paper (Refereed)
    Abstract [en]

    We investigate a recently introduced vector-valued representation of fundamental frequency variation, whose properties appear to be well-suited for statistical sequence modeling. We show what the representation looks like, and apply hidden Markov models to learn prosodic sequences characteristic of higher-level turn-taking phenomena. Our analysis shows that the models learn exactly those characteristics which have been reported for the phenomena in the literature. Further refinements to the representation lead to 12-17% relative improvement in speaker change prediction for conversational spoken dialogue systems.

  • 304. Laskowski, Kornel
    et al.
    Heldner, Mattias
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Edlund, Jens
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    A general-purpose 32 ms prosodic vector for Hidden Markov Modeling2009In: Proceedings of Interspeech 2009, Brighton, UK: ISCA , 2009, p. 724-729Conference paper (Refereed)
    Abstract [en]

    Prosody plays a central role in communicating via speech, making it important for speech technologies to model. Unfortunately, the application of standard modeling techniques to the acoustics of prosody has been hindered by difficulties in modeling intonation. In this work, we explore the suitability of the recently introduced fundamental frequency variation (FFV) spectrum as a candidate general representation of tone. Experiments on 4 tasks demonstrate that FFV features are complementary to other acoustic measures of prosody and that hidden Markov models offer a suitable modeling paradigm. Proposed improvements yield a 35% relative decrease in error on unseen data and simultaneously reduce time complexity by more than an order of magnitude. The resulting representation is sufficiently mature for general deployment in a broad range of automatic speech processing applications.
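
    The class-conditional HMM setup described above can be illustrated with a short sketch. The code below is not the authors' implementation; it assumes the third-party hmmlearn package and uses synthetic 7-dimensional frame vectors in place of real FFV-based prosodic features, training one Gaussian HMM per dialogue event class and classifying an unseen sequence by log-likelihood.

        import numpy as np
        from hmmlearn.hmm import GaussianHMM

        rng = np.random.default_rng(0)
        # Hypothetical 7-dimensional prosodic frame vectors, one training sequence per class.
        train = {
            "speaker_change": rng.normal(0.5, 1.0, size=(500, 7)),
            "hold":           rng.normal(-0.5, 1.0, size=(500, 7)),
        }

        # One HMM per class; a real system would tune the topology and use FFV-based features.
        models = {}
        for label, X in train.items():
            m = GaussianHMM(n_components=3, covariance_type="diag", n_iter=20, random_state=0)
            m.fit(X)
            models[label] = m

        # Classify an unseen sequence by maximum log-likelihood.
        test_seq = rng.normal(0.4, 1.0, size=(60, 7))
        scores = {label: m.score(test_seq) for label, m in models.items()}
        print(max(scores, key=scores.get), scores)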

  • 305.
    Laskowski, Kornel
    et al.
    Carnegie Mellon University; Universität Karlsruhe.
    Heldner, Mattias
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Edlund, Jens
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Exploring the prosody of floor mechanisms in English using the fundamental frequency variation spectrum2009In: Proceedings of the 2009 European Signal Processing Conference (EUSIPCO-2009), Glasgow, Scotland, 2009, p. 2539-2543Conference paper (Refereed)
    Abstract [en]

    A basic requirement for participation in conversation is the ability to jointly manage interaction. Examples of interaction management include indications to acquire, re-acquire, hold, release, and acknowledge floor ownership, and these are often implemented using specialized dialog act (DA) types. In this work, we explore the prosody of one class of such DA types, known as floor mechanisms, using a methodology based on a recently proposed representation of fundamental frequency variation (FFV). Models over the representation illustrate significant differences between floor mechanisms and other dialog act types, and lead to automatic detection accuracies in equal-prior test data of up to 75%. Analysis indicates that FFV modeling offers a useful tool for the discovery of prosodic phenomena which are not explicitly labeled in the audio.

  • 306. Laskowski, Kornel
    et al.
    Heldner, Mattias
    Stockholm University, Stockholm, Sweden .
    Edlund, Jens
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    On the dynamics of overlap in multi-party conversation2012In: 13th Annual Conference of the International Speech Communication Association 2012, INTERSPEECH 2012, 2012, p. 846-849Conference paper (Refereed)
    Abstract [en]

    Overlap, although short in duration, occurs frequently in multiparty conversation. We show that its duration is approximately log-normal, and inversely proportional to the number of simultaneously speaking parties. Using a simple model, we demonstrate that simultaneous talk tends to end simultaneously less frequently than it begins simultaneously, leading to an arrow of time in chronograms constructed from speech activity alone. The asymmetry is significant and discriminative. It appears to be due to dialog acts which do not carry propositional content, and those which are not brought to completion.
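
    As a minimal illustration of the distributional claim above (not the paper's analysis), the sketch below fits a log-normal to a set of overlap durations with SciPy and runs a rough goodness-of-fit check; the duration values are synthetic stand-ins for measurements taken from speech-activity chronograms.

        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(1)
        # Hypothetical overlap durations in seconds (real values would be measured from
        # speech-activity chronograms of multi-party recordings).
        overlaps = rng.lognormal(mean=np.log(0.4), sigma=0.8, size=2000)

        # Fit with the location fixed at zero, since durations are strictly positive.
        shape, loc, scale = stats.lognorm.fit(overlaps, floc=0)
        print(f"fitted sigma = {shape:.2f}, median duration = {scale:.2f} s")

        # A Kolmogorov-Smirnov test gives a crude indication of how well the fit holds.
        print(stats.kstest(overlaps, "lognorm", args=(shape, loc, scale)))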

  • 307.
    Laskowski, Kornel
    et al.
    Carnegie Mellon University.
    Heldner, Mattias
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Edlund, Jens
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Preliminaries to an account of multi-party conversational turn-taking as an antiferromagnetic spin glass2010In: Proceedings of NIPS Workshop on Modeling Human Communication Dynamics, Vancouver, B.C., Canada, 2010Conference paper (Refereed)
    Abstract [en]

    We present empirical justification of why logistic regression may acceptably approximate, using the number of currently vocalizing interlocutors, the probabilities returned by a time-invariant, conditionally independent model of turn-taking. The resulting parametric model with 3 degrees of freedom is shown to be identical to an infinite-range Ising antiferromagnet, with slow connections, in an external field; it is suitable for undifferentiated-participant scenarios. In differentiated-participant scenarios, untying parameters results in an infinite-range spin glass whose degrees of freedom scale as the square of the number of participants; it offers an efficient representation of participant-pair synchrony. We discuss the implications of model parametrization and of the thermodynamic and feed-forward perceptron formalisms for easily quantifying aspects of conversational dynamics.
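
    A minimal sketch of the core observation, under the assumption of synthetic data: a plain logistic regression over the number of currently vocalizing interlocutors approximates the probability that a given participant vocalizes in the next time step. The coefficients and data below are invented for illustration.

        import numpy as np
        from sklearn.linear_model import LogisticRegression

        rng = np.random.default_rng(2)
        # Hypothetical training pairs: (number of other participants currently talking,
        # whether this participant vocalizes in the next frame).
        n_talking = rng.integers(0, 4, size=5000).reshape(-1, 1)
        true_p = 1.0 / (1.0 + np.exp(-(0.8 - 1.2 * n_talking[:, 0])))  # fewer start when others talk
        talks_next = (rng.random(5000) < true_p).astype(int)

        clf = LogisticRegression().fit(n_talking, talks_next)
        for k in range(4):
            print(k, "speakers talking ->", round(clf.predict_proba([[k]])[0, 1], 3))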

  • 308. Laskowski, Kornel
    et al.
    Heldner, Mattias
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Edlund, Jens
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    The fundamental frequency variation spectrum2008In: Proceedings of FONETIK 2008, Gothenburg, Sweden: Department of Linguistics, University of Gothenburg , 2008, p. 29-32Conference paper (Other academic)
    Abstract [en]

    This paper describes a recently introduced vector-valued representation of fundamental frequency variation – the FFV spectrum – which has a number of desirable properties. In particular, it is instantaneous, continuous, distributed, and well suited for application of standard acoustic modeling techniques. We show what the representation looks like, and how it can be used to model prosodic sequences.

  • 309. Laskowski, Kornel
    et al.
    Wölfel, Matthias
    Heldner, Mattias
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Edlund, Jens
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Computing the fundamental frequency variation spectrum in conversational spoken dialogue systems2008In: Proceedings of Acoustics'08, Paris, France, 2008, p. 3305-3310Conference paper (Refereed)
    Abstract [en]

    Continuous modeling of intonation in natural speech has long been hampered by a focus on modeling fundamental frequency, of which several normative aspects are particularly problematic. The latter include, among others, the fact that pitch is undefined in unvoiced segments, that its absolute magnitude is speaker-specific, and that its robust estimation and modeling, at a particular point in time, rely on a patchwork of long-time stability heuristics. In the present work, we continue our analysis of the fundamental frequency variation (FFV) spectrum, a recently proposed instantaneous, continuous, vector-valued representation of pitch variation, which is obtained by comparing the harmonic structure of the frequency magnitude spectra of the left and right half of an analysis frame. We analyze the sensitivity of a task-specific error rate in a conversational spoken dialogue system to the specific definition of the left and right halves of a frame, resulting in operational recommendations regarding the framing policy and window shape.
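
    The left-half/right-half comparison can be sketched in a few lines. The code below is a much-simplified illustration of the general idea, not the published algorithm (which prescribes particular asymmetric windows and normalisation): it dilates the right-half magnitude spectrum by a set of factors and scores its similarity to the left-half spectrum, producing a vector indexed by the dilation, i.e. the local rate of F0 change.

        import numpy as np

        def ffv_like_vector(frame, sr, dilations=np.linspace(-0.1, 0.1, 7)):
            n = len(frame) // 2
            left = frame[:n] * np.hanning(n)
            right = frame[n:2 * n] * np.hanning(n)
            freqs = np.fft.rfftfreq(n, d=1.0 / sr)
            mag_l = np.abs(np.fft.rfft(left))
            mag_r = np.abs(np.fft.rfft(right))
            out = []
            for rho in dilations:
                # Evaluate the right-half spectrum at frequencies dilated by 2**rho.
                dilated = np.interp(freqs * 2.0 ** rho, freqs, mag_r, left=0.0, right=0.0)
                denom = np.linalg.norm(mag_l) * np.linalg.norm(dilated)
                out.append(float(mag_l @ dilated / denom) if denom > 0 else 0.0)
            return np.array(out)

        # Toy frame whose pitch rises slightly from 200 Hz to 210 Hz over 32 ms.
        sr, dur = 16000, 0.032
        t = np.arange(int(sr * dur)) / sr
        frame = np.sin(2 * np.pi * (200 + 10 * t / (2 * dur)) * t)
        print(ffv_like_vector(frame, sr).round(3))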

  • 310. Laukka, P.
    et al.
    Elenius, Kjell
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Fredriksson, M.
    Furumark, T.
    Neiberg, Daniel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Vocal Expression in spontaneous and experimentally induced affective speech: Acoustic correlates of anxiety, irritation and resignation2008In: Proceedings of the LREC 2008 Workshop on Corpora for Research on Emotion and Affect, Marrakesh, Morocco, 2008, p. 44-47Conference paper (Refereed)
    Abstract [en]

    We present two studies on authentic vocal affect expressions. In Study 1, the speech of social phobics was recorded in an anxiogenic public speaking task both before and after treatment. In Study 2, the speech material was collected from real life human-computer interactions. All speech samples were acoustically analyzed and subjected to listening tests. Results from Study 1 showed that a decrease in experienced state anxiety after treatment was accompanied by corresponding decreases in a) several acoustic parameters (i.e., mean and maximum F0, proportion of high-frequency components in the energy spectrum, and proportion of silent pauses), and b) listeners’ perceived level of nervousness. Both speakers’ self-ratings of state anxiety and listeners’ ratings of perceived nervousness were further correlated with similar acoustic parameters. Results from Study 2 revealed that mean and maximum F0, mean voice intensity and H1-H2 were higher for speech perceived as irritated than for speech perceived as neutral. Also, speech perceived as resigned had lower mean and maximum F0, and mean voice intensity than neutral speech. Listeners’ ratings of irritation, resignation and emotion intensity were further correlated with several acoustic parameters. The results complement earlier studies on vocal affect expression which have been conducted on posed, rather than authentic, emotional speech.

  • 311. Laukka, P.
    et al.
    Neiberg, Daniel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Elfenbein, HA
    Classification of affective speech within and across cultures2013In: Frontiers in emotion scienceArticle in journal (Refereed)
    Abstract [en]

    Affect in speech is conveyed by patterns of pitch, intensity, voice quality and temporal features. The authors investigated how consistently emotions are expressed within and across cultures using a selection of 3,100 emotion portrayals from the VENEC corpus. The selection consisted of 11 emotions expressed with 3 levels of emotion intensity portrayed by professional actors from 5 different English speaking cultures (Australia, India, Kenya, Singapore, and USA). Classification experiments (nu-SVM) based on acoustic measures were performed in conditions where training and evaluation were conducted either within the same or different cultures and/or emotion intensities. Results first showed that average recall rates were 2.4-3.0 times higher than chance for intra- and inter-cultural conditions, whereas performance dropped 7-8 percentage units for cross-cultural conditions. This provides the first demonstration of an in-group advantage in cross-cultural emotion recognition using acoustic-feature-based classification. It was further observed that matching the intensity level in training and testing data gave an advantage for high and medium intensity levels, but when classifying stimuli of unknown intensity the best performance was achieved with models trained on high intensity stimuli. Finally, classification performance across conditions varied as a function of emotion, with largest consistency for happiness, lust and relief. Implications for studies on cross-cultural emotion recognition and cross-corpora classification are discussed.
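
    A minimal sketch of the evaluation setup, using the same kind of classifier (nu-SVM) but entirely synthetic features and labels: train on material from one culture and compare within-culture against cross-culture test accuracy.

        import numpy as np
        from sklearn.svm import NuSVC
        from sklearn.preprocessing import StandardScaler
        from sklearn.pipeline import make_pipeline

        rng = np.random.default_rng(3)

        def fake_culture_data(shift, n=300, n_feats=20, n_emotions=5):
            """Invented acoustic-measure vectors with a culture-specific shift."""
            y = rng.integers(0, n_emotions, size=n)
            X = rng.normal(shift, 1.0, size=(n, n_feats)) + 0.5 * y[:, None]
            return X, y

        X_train, y_train = fake_culture_data(shift=0.0)   # "training culture"
        X_cross, y_cross = fake_culture_data(shift=0.3)   # slightly shifted "other culture"

        clf = make_pipeline(StandardScaler(), NuSVC(nu=0.3, kernel="rbf"))
        clf.fit(X_train, y_train)
        print("within-culture accuracy (training data):", round(clf.score(X_train, y_train), 2))
        print("cross-culture accuracy                 :", round(clf.score(X_cross, y_cross), 2))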

  • 312. Laukka, P.
    et al.
    Neiberg, Daniel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Forsell, Mimmi
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Karlsson, Inger
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Elenius, Kjell
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Expression of Affect in Spontaneous Speech: Acoustic Correlates and Automatic Detection of Irritation and Resignation2011In: Computer speech & language (Print), ISSN 0885-2308, E-ISSN 1095-8363, Vol. 25, no 1, p. 84-104Article in journal (Refereed)
    Abstract [en]

    The majority of previous studies on vocal expression have been conducted on posed expressions. In contrast, we utilized a large corpus of authentic affective speech recorded from real-life voice controlled telephone services. Listeners rated a selection of 200 utterances from this corpus with regard to level of perceived irritation, resignation, neutrality, and emotion intensity. The selected utterances came from 64 different speakers who each provided both neutral and affective stimuli. All utterances were further automatically analyzed regarding a comprehensive set of acoustic measures related to F0, intensity, formants, voice source, and temporal characteristics of speech. Results first showed that several significant acoustic differences were found between utterances classified as neutral and utterances classified as irritated or resigned using a within-persons design. Second, listeners' ratings on each scale were associated with several acoustic measures. In general the acoustic correlates of irritation, resignation, and emotion intensity were similar to previous findings obtained with posed expressions, though the effect sizes were smaller for the authentic expressions. Third, automatic classification (using LDA classifiers both with and without speaker adaptation) of irritation, resignation, and neutral performed at a level comparable to human performance, though human listeners and machines did not necessarily classify individual utterances similarly. Fourth, clearly perceived exemplars of irritation and resignation were rare in our corpus. These findings were discussed in relation to future research.
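
    For illustration, a small sketch of the kind of classification reported above (without speaker adaptation), using an LDA classifier on made-up acoustic measures for irritated, resigned and neutral utterances:

        import numpy as np
        from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
        from sklearn.model_selection import cross_val_score

        rng = np.random.default_rng(4)
        n_per_class, n_feats = 70, 12   # e.g. F0, intensity, formant, voice source, temporal measures
        class_means = {"neutral": 0.0, "irritated": 0.8, "resigned": -0.8}
        X = np.vstack([rng.normal(m, 1.0, size=(n_per_class, n_feats)) for m in class_means.values()])
        y = np.repeat(list(class_means), n_per_class)

        lda = LinearDiscriminantAnalysis()
        print("cross-validated accuracy:", round(cross_val_score(lda, X, y, cv=5).mean(), 2))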

  • 313. Laukkanen, Anne-Maria
    et al.
    Pulakka, Hannu
    Alku, Paavo
    Vilkman, Erkki
    Hertegård, Stellan
    Lindestad, Per-Ake
    Larsson, Hans
    Granqvist, Svante
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    High-speed registration of phonation-related glottal area variation during artificial lengthening of the vocal tract2007In: Logopedics, Phoniatrics, Vocology, ISSN 1401-5439, E-ISSN 1651-2022, Vol. 32, no 4, p. 157-164Article in journal (Refereed)
    Abstract [en]

    Vocal exercises that increase the vocal tract impedance are widely used in voice training and therapy. The present study applies a versatile methodology to investigate phonation during varying artificial extension of the vocal tract. Two males and one female phonated into a hard-walled plastic tube ( 2 cm), whose physical length was randomly pair-wise changed between 30 cm, 60 cm and 100 cm. High-speed image (1900 f/sec) sequences of the vocal folds were obtained via a rigid endoscope. Acoustic and electroglottographic signals (EGG) were recorded. Oral pressure during shuttering of the tube was used to give an estimate of subglottic pressure (P-sub). The only trend observed was that with the two longer tubes compared to the shortest one, fundamental frequency was lower, open time of the glottis shorter, and P-sub higher. The results may partly reflect increased vocal tract impedance as such and partly the increased vocal effort to compensate for it. In other parameters there were individual differences in tube length-related changes, suggesting complexity of the coupling between supraglottic space and the glottis.

  • 314. Lidestam, B.
    et al.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Motivation and appraisal in perception of poorly specified speech2006In: Scandinavian Journal of Psychology, ISSN 0036-5564, E-ISSN 1467-9450, Vol. 47, no 2, p. 93-101Article in journal (Refereed)
    Abstract [en]

    Normal-hearing students (n= 72) performed sentence, consonant, and word identification in either A (auditory), V (visual), or AV (audiovisual) modality. The auditory signal had difficult speech-to-noise relations. Talker (human vs. synthetic), topic (no cue vs. cue-words), and emotion (no cue vs. facially displayed vs. cue-words) were varied within groups. After the first block, effects of modality, face, topic, and emotion on initial appraisal and motivation were assessed. After the entire session, effects of modality on longer-term appraisal and motivation were assessed. The results from both assessments showed that V identification was more positively appraised than A identification. Correlations were tentatively interpreted such that evaluation of self-rated performance possibly depends on subjective standard and is reflected on motivation (if below subjective standard, AV group), or on appraisal (if above subjective standard, A group). Suggestions for further research are presented.

  • 315. Lidestam, Bjoern
    et al.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Visual phonemic ambiguity and speechreading2006In: Journal of Speech, Language and Hearing Research, ISSN 1092-4388, E-ISSN 1558-9102, Vol. 49, no 4, p. 835-847Article in journal (Refereed)
    Abstract [en]

    Purpose: To study the role of visual perception of phonemes in visual perception of sentences and words among normal-hearing individuals. Method: Twenty-four normal-hearing adults identified consonants, words, and sentences, spoken by either a human or a synthetic talker. The synthetic talker was programmed with identical parameters within phoneme groups, hypothetically resulting in simplified articulation. Proportions of correctly identified phonemes per participant, condition, and task, as well as sensitivity to single consonants and clusters of consonants, were measured. Groups of mutually exclusive consonants were used for sensitivity analyses and hierarchical cluster analyses. Results: Consonant identification performance did not differ as a function of talker, nor did average sensitivity to single consonants. The bilabial and labiodental clusters were most readily identified and cohesive for both talkers. Word and sentence identification was better for the human talker than the synthetic talker. The participants were more sensitive to the clusters of the least visible consonants with the human talker than with the synthetic talker. Conclusions: It is suggested that the ability to distinguish between clusters of the least visually distinct phonemes is important in speechreading. Specifically, it reduces the number of candidates, and thereby facilitates lexical identification.

  • 316. Lindberg, Borge
    et al.
    Johansen, Finn Tore
    Warakagoda, Narada
    Lehtinen, Gunnar
    Kacic, Zdravko
    Zgank, Andrei
    Elenius, Kjell
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    A noise robust multilingual reference recogniser based on SpeechDat(II)2000Conference paper (Refereed)
    Abstract [en]

    An important aspect of noise robustness of automatic speech recognisers (ASR) is the proper handling of non-speech acoustic events. The present paper describes further improvements of an already existing reference recogniser towards achieving such kind of robustness. The reference recogniser applied is the COST 249 SpeechDat reference recogniser, which is a fully automatic, language-independent training procedure for building a phonetic recogniser (http://www.telenor.no/fou/prosjekter/taletek/refrec). The reference recogniser relies on the HTK toolkit and a SpeechDat(II) compatible database, and is designed to serve as a reference system in multilingual speech recognition research. The paper describes version 0.96 of the reference recogniser, which takes into account labelled non-speech acoustic events during training and provides robustness against these during testing. Results are presented on small and medium vocabulary recognition for six languages.

  • 317. Lindblom, Björn
    et al.
    Diehl, Randy
    Park, Sang-Hoon
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    (Re)use of place features in voiced stop systems: Role of phonetic constraints2008In: Proceedings of Fonetik 2008, University of Gothenburg, 2008, p. 5-8Conference paper (Other academic)
    Abstract [en]

    Computational experiments focused on place of articulation in voiced stops were designed to generate ‘optimal’ inventories of CV syllables from a larger set of ‘possible CV:s’ in the presence of independently and numerically defined articulatory, perceptual and developmental constraints. Across vowel contexts the most salient places were retroflex, palatal and uvular. This was evident from acoustic measurements and perceptual data. Simulation results using the criterion of perceptual contrast alone failed to produce systems with the typologically widely attested set [b] [d] [g], whereas using articulatory cost as the sole criterion produced inventories in which bilabial, dental/alveolar and velar onsets formed the core. Neither perceptual contrast, nor articulatory cost, (nor the two combined), produced a consistent re-use of place features (‘phonemic coding’). Only systems constrained by ‘target learning’ exhibited a strong recombination of place features.

  • 318. Lindblom, Björn
    et al.
    Diehl, Randy
    Park, Sang-Hoon
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Sound systems are shaped by their users: The recombination of phonetic substance2011In: Where Do Phonological Features Come From?: Cognitive, physical and developmental bases of distinctive speech categories / [ed] G. Nick Clements, G. N.; Ridouane, R., John Benjamins Publishing Company, 2011, p. 67-97Chapter in book (Other academic)
    Abstract [en]

    Computational experiments were run using an optimization criterion based on independently motivated definitions of perceptual contrast, articulatory cost and learning cost. The question: If stop+vowel inventories are seen as adaptations to perceptual, articulatory and developmental constraints what would they be like? Simulations successfully predicted typologically widely observed place preferences and the re-use of place features (‘phonemic coding’) in voiced stop inventories. These results demonstrate the feasibility of user-based accounts of phonological facts and indicate the nature of the constraints that over time might shape the formation of both the formal structure and the intrinsic content of sound patterns. While phonetic factors are commonly invoked to account for substantive aspects of phonology, their explanatory scope is here also extended to a fundamental attribute of its formal organization: the combinatorial re-use of phonetic content.

  • 319. Lison, P.
    et al.
    Meena, Raveesh
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Automatic Turn Segmentation for Movie & TV Subtitles2016In: 2016 IEEE Workshop on Spoken Language Technology (SLT 2016), IEEE conference proceedings, 2016, p. 245-252Conference paper (Refereed)
    Abstract [en]

    Movie and TV subtitles contain large amounts of conversational material, but lack an explicit turn structure. This paper presents a data-driven approach to the segmentation of subtitles into dialogue turns. Training data is first extracted by aligning subtitles with transcripts in order to obtain speaker labels. This data is then used to build a classifier whose task is to determine whether two consecutive sentences are part of the same dialogue turn. The approach relies on linguistic, visual and timing features extracted from the subtitles themselves and does not require access to the audiovisual material, although speaker diarization can be exploited when audio data is available. The approach also exploits alignments with related subtitles in other languages to further improve the classification performance. The classifier achieves an accuracy of 78% on a held-out test set. A follow-up annotation experiment demonstrates that this task is also difficult for human annotators.
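
    A small sketch of the classification step, under assumptions (the features and labels below are invented, not those of the paper): decide for each pair of consecutive subtitle sentences whether they belong to the same turn, from simple timing and linguistic cues.

        import numpy as np
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import train_test_split

        rng = np.random.default_rng(5)
        n = 4000
        gap_s        = rng.exponential(0.8, n)   # pause between the two sentences (seconds)
        ends_in_stop = rng.integers(0, 2, n)     # first sentence ends with . ! ?
        starts_upper = rng.integers(0, 2, n)     # second sentence starts with an upper-case letter
        X = np.column_stack([gap_s, ends_in_stop, starts_upper])

        # Hypothetical ground truth: long gaps and sentence-final punctuation favour a turn change.
        p_same = 1.0 / (1.0 + np.exp(-(1.5 - 1.2 * gap_s - 0.8 * ends_in_stop)))
        y = (rng.random(n) < p_same).astype(int)           # 1 = same turn, 0 = new turn

        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
        clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
        print("held-out accuracy:", round(clf.score(X_te, y_te), 2))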

  • 320.
    Lopes, José
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Skantze, Gabriel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    A First Visit to the Robot Language Café2017In: Proceedings of the ISCA workshop on Speech and Language Technology in Education / [ed] Engwall, Lopes, Stockholm, 2017Conference paper (Refereed)
    Abstract [en]

    We present an exploratory study on using a social robot in a conversational setting to practice a second language. The practice is carried out within a so-called language café, with two second language learners and one native moderator, a human or a robot, engaging in social small talk. We compare the interactions with the human and robot moderators and perform a qualitative analysis of the potentials of a social robot as a conversational partner for language learning. Interactions with the robot are carried out in a Wizard-of-Oz setting, in which the native moderator who leads the corresponding human moderator session controls the robot. The observations of the video recorded sessions and the subject questionnaires suggest that the appropriate learner level for the practice is elementary (A1 to A2), for whom the structured, slightly repetitive interaction pattern was perceived as beneficial. We identify both some key features that are appreciated by the learners and technological parts that need further development.

  • 321.
    Lopes, José
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Skantze, Gabriel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Abad, A.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Batista, F.
    Meena, Raveesh
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Trancoso, I.
    Detecting Repetitions in Spoken Dialogue Systems Using Phonetic Distances2015In: INTERSPEECH-2015, 2015, p. 1805-1809Conference paper (Refereed)
    Abstract [en]

    Repetitions in Spoken Dialogue Systems can be a symptom of problematic communication. Such repetitions are often due to speech recognition errors, which in turn makes it harder to use the output of the speech recognizer to detect repetitions. In this paper, we combine the alignment score obtained using phonetic distances with dialogue-related features to improve repetition detection. To evaluate the method proposed we compare several alignment techniques from edit distance to DTW-based distance, previously used in Spoken-Term detection tasks. We also compare two different methods to compute the phonetic distance: the first one using the phoneme sequence, and the second one using the distance between the phone posterior vectors. Two different datasets were used in this evaluation: a bus-schedule information system (in English) and a call routing system (in Swedish). The results show that approaches using phoneme distances outperform approaches using Levenshtein distances between ASR outputs for repetition detection.
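
    As a minimal sketch of the phoneme-sequence variant (the posterior-vector and DTW variants are not shown), the code below computes a length-normalised Levenshtein distance between the recognised phoneme strings of two consecutive user utterances and flags a likely repetition below a hypothetical threshold.

        def levenshtein(a, b):
            """Edit distance between two phoneme sequences (lists of symbols)."""
            prev = list(range(len(b) + 1))
            for i, pa in enumerate(a, 1):
                cur = [i]
                for j, pb in enumerate(b, 1):
                    cur.append(min(prev[j] + 1,                 # deletion
                                   cur[j - 1] + 1,              # insertion
                                   prev[j - 1] + (pa != pb)))   # substitution
                prev = cur
            return prev[-1]

        def normalised_distance(a, b):
            return levenshtein(a, b) / max(len(a), len(b), 1)

        # Two hypothetical ASR phoneme outputs for a user repeating the same city name.
        utt1 = ["t", "uw", "p", "ih", "t", "s", "b", "er", "g"]
        utt2 = ["t", "ax", "p", "ih", "t", "s", "b", "ao", "g"]
        score = normalised_distance(utt1, utt2)
        print(score, "-> likely repetition" if score < 0.3 else "-> probably not a repetition")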

  • 322. López-Colino, F.
    et al.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Colás, J.
    Mobile SynFace: Ubiquitous visual interface for mobile VoIP telephone calls2008In: Proceedings of The second Swedish Language Technology Conference (SLTC), Stockholm, Sweden., 2008Conference paper (Other academic)
    Abstract [en]

    This paper presents the first version of the Mobile Synface application, which aims to provide a multimodal interface for telephone calls on mobile devices. The Mobile Synface application uses a talking face; it will simulate realistic lip movement for incoming voices. This application works as a complement for mobile VoIP applications without modifying their code or their functionality. The main purpose of this application is to improve the usability of mobile voice communication for hard of hearing people, or in noisy environments.

  • 323.
    Malisz, Zofia
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    O'Dell, Michael
    University of Tampere.
    Nieminen, Tommi
    University of Eastern Finland.
    Wagner, Petra
    Bielefeld University.
    Perspectives on speech timing: coupled oscillator modeling of Polish and Finnish2016In: Phonetica, ISSN 0031-8388, E-ISSN 1423-0321, Vol. 73, no 3-4Article in journal (Refereed)
    Abstract [en]

    We use an updated version of the Coupled Oscillator Model of speech timing and rhythm variability (O'Dell and Nieminen, 1999; 2009) to analyze empirical duration data for Polish spoken at different tempos. We use Bayesian inference on parameters relating to speech rate to investigate how tempo affects timing in Polish. The model parameters found are then compared to parameters obtained for equivalent material in Finnish to shed light on which of the effects represent general speech rate mechanisms and which are specific to Polish. We discuss the model and its predictions in the context of current perspectives on speech timing.

  • 324.
    Malisz, Zofia
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. Saarland University, Germany.
    Włodarczak, Marcin
    Stockholms Universitet.
    Buschmeier, Hendrik
    Bielefeld University.
    Skubisz, Joanna
    Universidade Nova de Lisboa.
    Kopp, Stefan
    Bielefeld University.
    Wagner, Petra
    Bielefeld University.
    The ALICO corpus: analysing the active listener2016In: Language resources and evaluation, ISSN 1574-020X, E-ISSN 1574-0218, Vol. 50, no 2, p. 411-442Article in journal (Refereed)
    Abstract [en]

    The Active Listening Corpus (ALICO) is a multimodal data set of spontaneous dyadic conversations in German with diverse speech and gestural annotations of both dialogue partners. The annotations consist of short feedback expression transcriptions with corresponding communicative function interpretations as well as segmentations of interpausal units, words, rhythmic prominence intervals and vowel-to-vowel intervals. Additionally, ALICO contains head gesture annotations of both interlocutors. The corpus contributes to research on spontaneous human–human interaction, on functional relations between modalities, and timing variability in dialogue. It also provides data that differentiates between distracted and attentive listeners. We describe the main characteristics of the corpus and briefly present the most important results obtained from analyses in recent years.

  • 325.
    Malisz, Zofia
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Zygis, Marzena
    Special Issue: Slavic Perspectives on Prosody2016In: Phonetica, ISSN 0031-8388, E-ISSN 1423-0321, Vol. 73, no 3-4, p. 155-162Article in journal (Refereed)
  • 326.
    McAllister, Anita M.
    et al.
    Linköping University.
    Granqvist, Svante
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Sjölander, Peta
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Sundberg, Johan
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Child Voice and Noise: A Pilot Study of Noise in Day Cares and the Effects on 10 Children's Voice Quality According to Perceptual Evaluation2009In: Journal of Voice, ISSN 0892-1997, E-ISSN 1873-4588, Vol. 23, no 5, p. 587-593Article in journal (Refereed)
    Abstract [en]

    The purpose of this investigation was to study children's exposure to background noise at the ears during a normal day at the day care center and also to relate this to a perceptual evaluation of voice quality. Ten children, from three day care centers, with no history of hearing and speech problems or frequent infections were selected as subjects. A binaural recording technique was used with two microphones placed on both sides of the subject's head, at equal distance from the mouth. A portable digital audio tape (DAT) recorder (Sony TCD-D 100, Stockholm, Sweden) was attached to the subject's waist. Three recordings were made for each child during the day. Each recording was calibrated and started with three repetitions of three sentences containing only sonorants. The recording technique allowed separate analyses of the background noise level and of the sound pressure level (SPL) of each subjects' own voice. Results showed a mean background noise level for the three day care centers at 82.6 dBA Leq, ranging from 81.5 to 83.6 dBA Leq. Day care center no. 2 had the highest mean value and also the highest value at any separate recording session with a mean background noise level of 85.4 dBA Leq during the noontime recordings. Perceptual evaluation showed that the children attending this day care center also received higher values on the following voice characteristics: hoarseness, breathiness, and hyperfunction. Girls increased their loudness level during the day, whereas for boys no such change could be observed.
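
    The Leq values quoted above are equivalent continuous sound levels. As a worked sketch of the measure (unweighted; the study reports A-weighted levels, which additionally require an A-weighting filter), the code below computes Leq from a calibrated pressure signal:

        import numpy as np

        P_REF = 20e-6  # reference sound pressure, 20 micropascals

        def leq_db(pressure_pa):
            """Equivalent continuous sound level (dB re 20 uPa) of a calibrated signal in pascals."""
            return 10.0 * np.log10(np.mean(np.square(pressure_pa)) / P_REF ** 2)

        # Toy example: a 1 kHz tone with an RMS pressure of 0.356 Pa is roughly 85 dB.
        sr = 16000
        t = np.arange(sr) / sr
        tone = 0.356 * np.sqrt(2) * np.sin(2 * np.pi * 1000 * t)
        print(round(leq_db(tone), 1), "dB re 20 uPa")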

  • 327.
    Meena, Raveesh
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Data-driven Methods for Spoken Dialogue Systems: Applications in Language Understanding, Turn-taking, Error Detection, and Knowledge Acquisition2016Doctoral thesis, monograph (Other academic)
    Abstract [en]

    Spoken dialogue systems are application interfaces that enable humans to interact with computers using spoken natural language. A major challenge for these systems is dealing with the ubiquity of variability—in user behavior, in the performance of the various speech and language processing sub-components, and in the dynamics of the task domain. However, as the predominant methodology for dialogue system development is to handcraft the sub-components, these systems typically lack robustness in user interactions. Data-driven methods, on the other hand, have been shown to offer robustness to variability in various domains of computer science and are increasingly being used in dialogue systems research.    

    This thesis makes four novel contributions to the data-driven methods for spoken dialogue system development. First, a method for interpreting the meaning contained in spoken utterances is presented. Second, an approach for determining when in a user’s speech it is appropriate for the system to give a response is presented. Third, an approach for error detection and analysis in dialogue system interactions is reported. Finally, an implicitly supervised learning approach for knowledge acquisition through the interactive setting of spoken dialogue is presented.     

    The general approach taken in this thesis is to model dialogue system tasks as a classification problem and investigate features (e.g., lexical, syntactic, semantic, prosodic, and contextual) to train various classifiers on interaction data. The central hypothesis of this thesis is that the models for the aforementioned dialogue system tasks trained using the features proposed here perform better than their corresponding baseline models. The empirical validity of this claim has been assessed through both quantitative and qualitative evaluations, using both objective and subjective measures.

  • 328.
    Meena, Raveesh
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Boye, Johan
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Skantze, Gabriel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Crowdsourcing Street-level Geographic Information Using a Spoken Dialogue System2014In: Proceedings of the SIGDIAL 2014 Conference, Association for Computational Linguistics, 2014, p. 2-11Conference paper (Refereed)
    Abstract [en]

    We present a technique for crowd-sourcing street-level geographic information using spoken natural language. In particular, we are interested in obtaining first-person-view information about what can be seen from different positions in the city. This information can then for example be used for pedestrian routing services. The approach has been tested in the lab using a fully implemented spoken dialogue system, and is showing promising results.

  • 329.
    Meena, Raveesh
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Boye, Johan
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Skantze, Gabriel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Using a Spoken Dialogue System for Crowdsourcing Street-level Geographic Information2014Conference paper (Refereed)
    Abstract [en]

    We present a novel scheme for enriching geographic database with street-level geographic information that could be useful for pedestrian navigation. A spoken dialogue system for crowdsourcing street-level geographic details was developed and tested in an in-lab experimentation, and has shown promising results.

  • 330.
    Meena, Raveesh
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Dabbaghchian, Saeed
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Stefanov, Kalin
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    A Data-driven Approach to Detection of Interruptions in Human–human Conversations2014Conference paper (Refereed)
  • 331.
    Meena, Raveesh
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    David Lopes, José
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Skantze, Gabriel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Automatic Detection of Miscommunication in Spoken Dialogue Systems2015In: Proceedings of 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), 2015, p. 354-363Conference paper (Refereed)
    Abstract [en]

    In this paper, we present a data-driven approach for detecting instances of miscommunication in dialogue system interactions. A range of generic features that are both automatically extractable and manually annotated were used to train two models for online detection and one for offline analysis. Online detection could be used to raise the error awareness of the system, whereas offline detection could be used by a system designer to identify potential flaws in the dialogue design. In experimental evaluations on system logs from three different dialogue systems that vary in their dialogue strategy, the proposed models performed substantially better than the majority class baseline models.

  • 332.
    Meena, Raveesh
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Jokinen, Kristiina
    Wilcock, Graham
    Integration of gestures and speech in human-robot interaction2012In: 3rd IEEE International Conference on Cognitive Infocommunications, CogInfoCom 2012 - Proceedings, IEEE , 2012, p. 673-678Conference paper (Refereed)
    Abstract [en]

    We present an approach to enhance the interaction abilities of the Nao humanoid robot by extending its communicative behavior with non-verbal gestures (hand and head movements, and gaze following). A set of non-verbal gestures were identified that Nao could use for enhancing its presentation and turn-management capabilities in conversational interactions. We discuss our approach for modeling and synthesizing gestures on the Nao robot. A scheme for system evaluation that compares the values of users' expectations and actual experiences has been presented. We found that open arm gestures, head movements and gaze following could significantly enhance Nao's ability to be expressive and appear lively, and to engage human users in conversational interactions.

  • 333.
    Meena, Raveesh
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Skantze, Gabriel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    A Chunking Parser for Semantic Interpretation of Spoken Route Directions in Human-Robot Dialogue2012In: Proceedings of the 4th Swedish Language Technology Conference (SLTC 2012), Lund, Sweden, 2012, p. 55-56Conference paper (Refereed)
    Abstract [en]

    We present a novel application of the chunking parser for data-driven semantic interpretation of spoken route directions into route graphs that are useful for robot navigation. Various sets of features and machine learning algorithms were explored. The results indicate that our approach is robust to speech recognition errors, and could be easily used in other languages using simple features.

  • 334.
    Meena, Raveesh
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Skantze, Gabriel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    A data-driven approach to understanding spoken route directions in human-robot dialogue2012In: 13th Annual Conference of the International Speech Communication Association 2012, INTERSPEECH 2012, 2012, p. 226-229Conference paper (Refereed)
    Abstract [en]

    In this paper, we present a data-driven chunking parser for automatic interpretation of spoken route directions into a route graph that is useful for robot navigation. Different sets of features and machine learning algorithms are explored. The results indicate that our approach is robust to speech recognition errors.

  • 335.
    Meena, Raveesh
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Skantze, Gabriel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    A Data-driven Model for Timing Feedback in a Map Task Dialogue System2013In: 14th Annual Meeting of the Special Interest Group on Discourse and Dialogue - SIGdial, Metz, France, 2013, p. 375-383Conference paper (Refereed)
    Abstract [en]

    We present a data-driven model for detecting suitable response locations in the user’s speech. The model has been trained on human–machine dialogue data and implemented and tested in a spoken dialogue system that can perform the Map Task with users. To our knowledge, this is the first example of a dialogue system that uses automatically extracted syntactic, prosodic and contextual features for online detection of response locations. A subjective evaluation of the dialogue system suggests that interactions with a system using our trained model were perceived significantly better than those with a system using a model that made decisions at random.
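
    A minimal sketch of the decision step, with invented features and data (not the system's actual feature set): whenever the user pauses, a classifier trained on prosodic and contextual cues estimates whether this is a suitable response location, and the decision threshold can be shifted at run time to trade precision for recall.

        import numpy as np
        from sklearn.linear_model import LogisticRegression

        rng = np.random.default_rng(6)
        n = 3000
        pitch_slope   = rng.normal(0.0, 1.0, n)     # falling pitch (negative) often invites feedback
        pause_so_far  = rng.exponential(0.3, n)     # silence elapsed at the decision point (seconds)
        words_in_turn = rng.integers(1, 20, n)      # simple contextual feature
        X = np.column_stack([pitch_slope, pause_so_far, words_in_turn])

        # Hypothetical labels: 1 = a feedback response is appropriate here.
        logit = -0.5 - 1.0 * pitch_slope + 2.0 * pause_so_far + 0.05 * words_in_turn
        y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

        clf = LogisticRegression().fit(X, y)
        probs = clf.predict_proba(X[:5])[:, 1]
        print([("respond" if p > 0.7 else "wait") for p in probs])   # threshold tunes precision/recall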

  • 336.
    Meena, Raveesh
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Skantze, Gabriel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Human Evaluation of Conceptual Route Graphs for Interpreting Spoken Route Descriptions2013In: Proceedings of the 3rd International Workshop on Computational Models of Spatial Language Interpretation and Generation (CoSLI), Potsdam, Germany, 2013, p. 30-35Conference paper (Refereed)
    Abstract [en]

    We present a human evaluation of the usefulness of conceptual route graphs (CRGs) when it comes to route following using spoken route descriptions. We describe a method for data-driven semantic interpretation of route descriptions into CRGs. The comparable performances of human participants in sketching a route using the manually transcribed CRGs and the CRGs produced on speech recognized route descriptions indicate the robustness of our method in preserving the vital conceptual information required for route following despite speech recognition errors.

  • 337.
    Meena, Raveesh
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Skantze, Gabriel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    The Map Task Dialogue System: A Test-bed for Modelling Human-Like Dialogue2013In: 14th Annual Meeting of the Special Interest Group on Discourse and Dialogue - SIGdial, Metz, France, 2013, p. 366-368Conference paper (Refereed)
    Abstract [en]

    The demonstrator presents a test-bed for collecting data on human–computer dialogue: a fully automated dialogue system that can perform Map Task with a user. In a first step, we have used the test-bed to collect human–computer Map Task dialogue data, and have trained various data-driven models on it for detecting feedback response locations in the user’s speech. One of the trained models has been tested in user interactions and was perceived better in comparison to a system using a random model. The demonstrator will exhibit three versions of the Map Task dialogue system—each using a different trained data-driven model of Response Location Detection.

  • 338.
    Meena, Raveesh
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Skantze, Gabriel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Gustafsson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Data-driven models for timing feedback responses in a Map Task dialogue system2014In: Computer speech & language (Print), ISSN 0885-2308, E-ISSN 1095-8363, Vol. 28, no 4, p. 903-922Article in journal (Refereed)
    Abstract [en]

    Traditional dialogue systems use a fixed silence threshold to detect the end of users' turns. Such a simplistic model can result in system behaviour that is both interruptive and unresponsive, which in turn affects user experience. Various studies have observed that human interlocutors take cues from speaker behaviour, such as prosody, syntax, and gestures, to coordinate smooth exchange of speaking turns. However, little effort has been made towards implementing these models in dialogue systems and verifying how well they model the turn-taking behaviour in human computer interactions. We present a data-driven approach to building models for online detection of suitable feedback response locations in the user's speech. We first collected human computer interaction data using a spoken dialogue system that can perform the Map Task with users (albeit using a trick). On this data, we trained various models that use automatically extractable prosodic, contextual and lexico-syntactic features for detecting response locations. Next, we implemented a trained model in the same dialogue system and evaluated it in interactions with users. The subjective and objective measures from the user evaluation confirm that a model trained on speaker behavioural cues offers both smoother turn-transitions and more responsive system behaviour.

  • 339.
    Neiberg, Daniel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Modelling Paralinguistic Conversational Interaction: Towards social awareness in spoken human-machine dialogue2012Doctoral thesis, comprehensive summary (Other academic)
    Abstract [en]

    Parallel with the orthographic streams of words in conversation are multiple layered epiphenomena, short in duration and with a communicative purpose. These paralinguistic events regulate the interaction flow via gaze, gestures and intonation. This thesis focuses on how to compute, model, discover and analyze prosody and its applications for spoken dialogue systems. Specifically, it addresses automatic classification and analysis of conversational cues related to turn-taking, brief feedback and affective expressions, their cross-relationships, as well as their cognitive and neurological basis. Techniques are proposed for instantaneous and suprasegmental parameterization of scalar and vector valued representations of fundamental frequency, but also intensity and voice quality. Examples are given of how to engineer supervised learned automata for off-line processing of conversational corpora, as well as for incremental on-line processing under low-latency constraints, suitable as detector modules in a responsive social interface. Specific attention is given to the communicative functions of vocal feedback like "mhm", "okay" and "yeah, that's right", as postulated by theories of grounding and emotion and by a survey of laymen's opinions. The potential functions and their prosodic cues are investigated via automatic decoding, data mining, exploratory visualization and descriptive measurements.

  • 340.
    Neiberg, Daniel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Visualizing prosodic densities and contours: Forming one from many2011In: TMH-QPSR, ISSN 1104-5787, Vol. 51, no 1, p. 57-60Article in journal (Other academic)
    Abstract [en]

    This paper summarizes a flora of explorative visualization techniques for prosody developed at KTH. It is demonstrated how analyses can be made that go beyond conventional methodology. Examples are given for turn-taking, affective speech, response tokens and Swedish accent II.

  • 341.
    Neiberg, Daniel
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Ananthakrishnan, Gopal
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    On the Non-uniqueness of Acoustic-to-Articulatory Mapping2008In: Proceedings FONETIK 2008, Göteborg, 2008, p. 9-13Conference paper (Other academic)
    Abstract [en]

    This paper studies, statistically, the hypothesis that the acoustic-to-articulatory mapping is non-unique. The distributions of the acoustic and articulatory spaces are obtained by minimizing the BIC while fitting the data to a GMM using the EM algorithm. The kurtosis is used to measure the non-Gaussianity of the distributions, and the Bhattacharyya distance is used to find the difference between distributions of the acoustic vectors producing non-unique articulator configurations. It is found that stop consonants and alveolar fricatives are generally not only non-linear but also non-unique, while dental fricatives are found to be highly non-linear but fairly unique.
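    The abstract names three standard tools: GMM fitting with BIC-based model selection, kurtosis as a non-Gaussianity measure, and the Bhattacharyya distance between distributions. The sketch below shows these tools on synthetic 2-D data only; it is not the paper's pipeline, and the data, component counts and dimensionality are all assumptions.

```python
# Sketch: BIC-selected GMM, kurtosis, and Bhattacharyya distance between
# two Gaussian components. Synthetic data stands in for the papers'
# joint acoustic-articulatory vectors.
import numpy as np
from scipy.stats import kurtosis
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (200, 2)), rng.normal(3.0, 0.5, (200, 2))])

# Fit GMMs with an increasing number of components and keep the lowest BIC.
best = min(
    (GaussianMixture(n_components=k, random_state=0).fit(X) for k in range(1, 6)),
    key=lambda g: g.bic(X),
)
print("selected components:", best.n_components)

# Excess kurtosis per dimension as a rough non-Gaussianity measure.
print("kurtosis:", kurtosis(X, axis=0))

def bhattacharyya(mu1, cov1, mu2, cov2):
    """Bhattacharyya distance between two multivariate Gaussians (closed form)."""
    cov = 0.5 * (cov1 + cov2)
    diff = mu1 - mu2
    term1 = 0.125 * diff @ np.linalg.solve(cov, diff)
    term2 = 0.5 * np.log(np.linalg.det(cov) /
                         np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2)))
    return term1 + term2

if best.n_components >= 2:
    print(bhattacharyya(best.means_[0], best.covariances_[0],
                        best.means_[1], best.covariances_[1]))
```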

  • 342.
    Neiberg, Daniel
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Ananthakrishnan, Gopal
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Blomberg, Mats
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    On Acquiring Speech Production Knowledge from Articulatory Measurements for Phoneme Recognition2009In: INTERSPEECH 2009: 10TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2009, BAIXAS: ISCA-INST SPEECH COMMUNICATION ASSOC , 2009, p. 1387-1390Conference paper (Refereed)
    Abstract [en]

    The paper proposes a general version of a coupled Hidden Markov/Bayesian Network model for performing phoneme recognition on acoustic-articulatory data. The model uses knowledge learned from the articulatory measurements, available for training, for phoneme recognition on the acoustic input. After training on the articulatory data, the model is able to predict 71.5% of the articulatory state sequences using the acoustic input. Using optimized parameters, the proposed method shows a slight improvement for two speakers over the baseline phoneme recognition system which does not use articulatory knowledge. However, the improvement is only statistically significant for one of the speakers. While there is an improvement in recognition accuracy for the vowels, diphthongs and to some extent the semi-vowels, there is a decrease in accuracy for the remaining phonemes.
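    For orientation, the sketch below shows only the plain acoustic baseline implied by the abstract, i.e. one Gaussian HMM per phoneme scored by log-likelihood. It is explicitly not the coupled Hidden Markov/Bayesian Network model of the paper, which additionally learns articulatory state sequences; the data, feature dimensionality and model sizes are invented.

```python
# Baseline sketch only: per-phoneme Gaussian HMMs, NOT the coupled
# HMM/Bayesian Network model described in the paper. Requires hmmlearn.
import numpy as np
from hmmlearn.hmm import GaussianHMM

rng = np.random.default_rng(0)

# Toy acoustic feature sequences (e.g. MFCC frames) for two phoneme classes.
train = {
    "aa": [rng.normal(0.0, 1.0, (40, 13)) for _ in range(5)],
    "s":  [rng.normal(2.0, 1.0, (40, 13)) for _ in range(5)],
}

models = {}
for phone, seqs in train.items():
    X = np.vstack(seqs)
    lengths = [len(s) for s in seqs]
    m = GaussianHMM(n_components=3, covariance_type="diag", n_iter=20, random_state=0)
    m.fit(X, lengths)
    models[phone] = m

def classify(segment):
    """Return the phoneme whose HMM assigns the highest log-likelihood."""
    return max(models, key=lambda p: models[p].score(segment))

print(classify(rng.normal(2.0, 1.0, (40, 13))))
```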

  • 343.
    Neiberg, Daniel
    et al.
    KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT. KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Ananthakrishnan, Gopal
    KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT. KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT. KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    The Acoustic to Articulation Mapping: Non-linear or Non-unique?2008In: INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, BAIXAS: ISCA-INST SPEECH COMMUNICATION ASSOC , 2008, p. 1485-1488Conference paper (Refereed)
    Abstract [en]

    This paper studies, statistically, the hypothesis that the acoustic-to-articulatory mapping is non-unique. The distributions of the acoustic and articulatory spaces are obtained by fitting the data to a Gaussian Mixture Model. The kurtosis is used to measure the non-Gaussianity of the distributions, and the Bhattacharyya distance is used to find the difference between distributions of the acoustic vectors producing non-unique articulator configurations. It is found that stop consonants and alveolar fricatives are generally not only non-linear but also non-unique, while dental fricatives are found to be highly non-linear but fairly unique. Two further investigations are also discussed: the first is on how well the best possible piecewise linear regression is likely to perform; the second is on whether dynamic constraints improve the ability to predict different articulatory regions corresponding to the same region in the acoustic space.

  • 344.
    Neiberg, Daniel
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Ananthakrishnan, Gopal
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Tracking pitch contours using minimum jerk trajectories2011In: INTERSPEECH 2011, 12th Annual Conference of the International Speech Communication Association, 2011, p. 2056-2059Conference paper (Refereed)
    Abstract [en]

    This paper proposes a fundamental frequency tracker, with the specific purpose of comparing the automatic estimates with pitch contours that are sketched by trained phoneticians. The method uses a frequency-domain approach to estimate pitch tracks that form minimum jerk trajectories, thereby mimicking the motor movements of the hand made while sketching. When the fundamental frequency tracked by the proposed method on the oral and laryngograph signals of the MOCHA-TIMIT database was compared, the correlation was 0.98 and the root mean squared error was 4.0 Hz, slightly better than a state-of-the-art pitch tracking algorithm included in the ESPS. We also demonstrate how the proposed algorithm can be applied when comparing with sketches made by phoneticians of the variations in accent II among Swedish dialects.
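    As a rough illustration of the minimum-jerk idea, the sketch below smooths a raw pitch contour by penalizing the discrete third derivative (jerk) in a least-squares sense. This is a time-domain analogue chosen for brevity; the paper's tracker itself works in the frequency domain, and the contour and regularization weight here are made up.

```python
# Sketch: jerk-penalised smoothing of a pitch contour, i.e. minimise
# ||f - f0||^2 + lam * ||D3 f||^2 where D3 is the third-difference operator.
import numpy as np

def jerk_smooth(f0, lam=50.0):
    """Return the contour minimising squared error plus a jerk penalty."""
    n = len(f0)
    D = np.diff(np.eye(n), n=3, axis=0)      # discrete third derivative
    A = np.eye(n) + lam * D.T @ D
    return np.linalg.solve(A, f0)

# Noisy synthetic pitch contour in Hz.
t = np.linspace(0, 1, 100)
raw = 120 + 20 * np.sin(2 * np.pi * t) + np.random.default_rng(0).normal(0, 3, 100)
smooth = jerk_smooth(raw)
print(float(np.sqrt(np.mean((smooth - raw) ** 2))))  # RMS deviation from the raw track
```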

  • 345.
    Neiberg, Daniel
    et al.
    KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT. KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Elenius, Kjell
    KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT. KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Automatic Recognition of Anger in Spontaneous Speech2008In: INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, BAIXAS: ISCA-INST SPEECH COMMUNICATION ASSOC , 2008, p. 2755-2758Conference paper (Refereed)
    Abstract [en]

    Automatic detection of real-life negative emotions in speech has been evaluated using Linear Discriminant Analysis, LDA, with "classic" emotion features, and a classifier based on Gaussian Mixture Models, GMMs. The latter uses Mel-Frequency Cepstral Coefficients, MFCCs, from a filter bank covering the 300-3400 Hz region to capture spectral shape and formants, and another in the 20-600 Hz region to capture prosody. Both classifiers have been tested on an extensive corpus from Swedish voice-controlled telephone services. The results indicate that it is possible to detect anger with reasonable accuracy (average recall 83%) in natural speech, and that the GMM method performed better than the LDA one.
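    The band edges below (300-3400 Hz and 20-600 Hz) follow the abstract; everything else in this sketch, including the toy audio, class labels, filter-bank sizes and number of Gaussians, is an assumption made purely for illustration of the two-band MFCC plus GMM idea.

```python
# Sketch: frame-level GMM classification with MFCCs from two filter-bank
# regions (spectral band and low-frequency "prosodic" band). Toy noise
# signals stand in for labelled utterances from a real corpus.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def two_band_mfcc(y, sr):
    spec = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12, n_mels=40, fmin=300, fmax=3400)
    pros = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=6, n_mels=20, fmin=20, fmax=600)
    return np.vstack([spec, pros]).T  # one feature vector per frame

sr = 8000
rng = np.random.default_rng(0)
train_audio = {"angry": rng.normal(0, 0.1, 10 * sr),
               "neutral": rng.normal(0, 0.05, 10 * sr)}
gmms = {label: GaussianMixture(n_components=4, covariance_type="diag", random_state=0)
               .fit(two_band_mfcc(y, sr))
        for label, y in train_audio.items()}

def classify(y, sr):
    """Pick the class whose GMM assigns the highest total frame log-likelihood."""
    X = two_band_mfcc(y, sr)
    return max(gmms, key=lambda label: gmms[label].score_samples(X).sum())

print(classify(rng.normal(0, 0.1, 10 * sr), sr))
```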

  • 346.
    Neiberg, Daniel
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Elenius, Kjell
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Burger, S.
    Emotion Recognition2009In: Computers in the Human Interaction Loop / [ed] Waibel, A.; Stiefelhagen, R, Berlin/Heidelberg: Springer , 2009, p. 96-105Chapter in book (Refereed)
  • 347.
    Neiberg, Daniel
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Elenius, Kjell
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Karlsson, Inger
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Laskowski, K.
    Emotion Recognition in Spontaneous Speech2006In: Working Papers 52: Proceedings of Fonetik 2006, Lund, Sweden: Lund University, Centre for Languages & Literature, Dept. of Linguistics & Phonetics , 2006, p. 101-104Conference paper (Other academic)
    Abstract [en]

    Automatic detection of emotions has been evaluated using standard Mel-frequency Cepstral Coefficients, MFCCs, and a variant, MFCC-low, that is calculated between 20 and 300 Hz in order to model pitch. Plain pitch features have been used as well. These acoustic features have all been modeled by Gaussian mixture models, GMMs, on the frame level. The method has been tested on two different corpora and languages; Swedish voice controlled telephone services and English meetings. The results indicate that using GMMs on the frame level is a feasible technique for emotion classification. The two MFCC methods have similar performance, and MFCC-low outperforms the pitch features. Combining the three classifiers significantly improves performance.

  • 348.
    Neiberg, Daniel
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Elenius, Kjell
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Laskowski, Kornel
    Emotion Recognition in Spontaneous Speech Using GMMs2006In: INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, BAIXAS: ISCA-INST SPEECH COMMUNICATION ASSOC , 2006, p. 809-812Conference paper (Refereed)
    Abstract [en]

    Automatic detection of emotions has been evaluated using standard Mel-frequency Cepstral Coefficients, MFCCs, and a variant, MFCC-low, calculated between 20 and 300 Hz, in order to model pitch. Also plain pitch features have been used. These acoustic features have all been modeled by Gaussian mixture models, GMMs, on the frame level. The method has been tested on two different corpora and languages; Swedish voice controlled telephone services and English meetings. The results indicate that using GMMs on the frame level is a feasible technique for emotion classification. The two MFCC methods have similar performance, and MFCC-low outperforms the pitch features. Combining the three classifiers significantly improves performance.
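    The abstract reports that combining the three frame-level GMM classifiers improves performance. One simple way to realise such a combination is late fusion by summing per-class log-likelihoods across feature streams, as in the hedged sketch below; the stream names, class labels and toy 2-D features are illustrative assumptions, not the paper's setup.

```python
# Sketch: late fusion of several frame-level GMM classifiers
# (e.g. MFCC, MFCC-low and pitch streams) by summing log-likelihoods.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
classes = ["negative", "neutral"]
streams = ["mfcc", "mfcc_low", "pitch"]

# One GMM per (feature stream, class); toy 2-D features stand in for each stream.
gmms = {(s, c): GaussianMixture(n_components=2, random_state=0)
                .fit(rng.normal(loc=i, scale=1.0, size=(300, 2)))
        for s in streams for i, c in enumerate(classes)}

def fused_decision(features_per_stream):
    """features_per_stream: dict stream -> (n_frames, dim) array for one utterance."""
    totals = {c: sum(gmms[(s, c)].score_samples(X).sum()
                     for s, X in features_per_stream.items())
              for c in classes}
    return max(totals, key=totals.get)

test = {s: rng.normal(loc=1.0, scale=1.0, size=(50, 2)) for s in streams}
print(fused_decision(test))
```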

  • 349.
    Neiberg, Daniel
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    A Dual Channel Coupled Decoder for Fillers and Feedback2011In: INTERSPEECH 2011, 12th Annual Conference of the International Speech Communication Association, 2011, p. 3097-3100Conference paper (Refereed)
    Abstract [en]

    This study presents a dual-channel decoder capable of modeling cross-speaker dependencies for segmentation and classification of fillers and feedback in conversational speech found in the DEAL corpus. For the same number of Gaussians per state, we show improvements in average F-score for the successive addition of 1) an increased frame rate from 10 ms to 50 ms, 2) Joint Maximum Cross-Correlation (JMXC) features in a single-channel decoder, 3) a joint transition matrix which captures dependencies symmetrically across the two channels, and 4) coupled acoustic model retraining symmetrically across the two channels. The final step gives a relative improvement of over 100% for fillers and feedback compared to our previously published results. The F-scores are in a range that makes it possible to use the decoder both as a voice activity detector and as an illocutionary act decoder for semi-automatic annotation.
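    To give a feel for cross-channel features in this setting, the sketch below computes a generic windowed maximum normalised cross-correlation between two speakers' channels. This is not necessarily the paper's JMXC feature, whose exact definition is not given in the abstract; window length, lag range and the toy signals are assumptions.

```python
# Sketch: windowed maximum cross-correlation between two audio channels,
# a generic cross-channel cue (not necessarily the paper's JMXC feature).
import numpy as np

def windowed_max_xcorr(a, b, sr, win_s=0.05, max_lag_s=0.02):
    """For each window of channel a, the max normalised cross-correlation over lags of b."""
    win, max_lag = int(win_s * sr), int(max_lag_s * sr)
    out = []
    for start in range(0, len(a) - win - max_lag, win):
        x = a[start:start + win]
        vals = []
        for lag in range(-max_lag, max_lag + 1):
            if start + lag < 0:
                continue
            y = b[start + lag:start + lag + win]
            if len(y) < win:
                continue
            denom = np.linalg.norm(x) * np.linalg.norm(y)
            vals.append(float(x @ y) / denom if denom > 0 else 0.0)
        out.append(max(vals))
    return np.array(out)

sr = 8000
rng = np.random.default_rng(0)
ch_a = rng.normal(0, 1, 2 * sr)
ch_b = np.roll(ch_a, 40) + rng.normal(0, 0.5, 2 * sr)  # delayed, noisy copy of channel a
print(windowed_max_xcorr(ch_a, ch_b, sr)[:5])
```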

  • 350.
    Neiberg, Daniel
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Cues to perceived functions of acted and spontaneous feedback expressions2012In: Proceedings of theInterdisciplinary Workshop on Feedback Behaviors in Dialog, 2012, p. 53-56Conference paper (Refereed)
    Abstract [en]

    We present a two-step study where the first part aims to determine the phonemic prior bias (conditioned on "ah", "m-hm", "m-m", "n-hn", "oh", "okay", "u-hu", "yeah" and "yes") in subjects' perception of six feedback functions (acknowledgment, continuer, disagreement, surprise, enthusiasm and uncertainty). The results showed a clear phonemic prior bias for some tokens; e.g., "ah" and "oh" are commonly interpreted as surprise, but "yeah" and "yes" less so. The second part aims to examine determinants of judged typicality, or graded structure, within the six functions of "okay". Typicality was correlated with four determinants: prosodic central tendency within the function (CT); phonemic prior bias as an approximation of frequency instantiation (FI); the posterior, i.e. CT × FI; and judged Ideality (ID), i.e. similarity to ideals associated with the goals served by the function. The results tentatively suggest that acted expressions are communicated more effectively, and that the functions of feedback to a greater extent constitute goal-based categories determined by ideals and to a lesser extent a taxonomy determined by CT and FI. However, it is possible to automatically predict typicality with a correlation of r = 0.52 via the posterior.
