Modelling Paralinguistic Conversational Interaction: Towards social awareness in spoken human-machine dialogue
KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication. (Speech)
2012 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Parallel with the orthographic stream of words in conversation are multiple layered epiphenomena, short in duration and with a communicative purpose. These paralinguistic events regulate the interaction flow via gaze, gestures and intonation. This thesis focuses on how to compute, model, discover and analyze prosody and its applications for spoken dialogue systems. Specifically, it addresses automatic classification and analysis of conversational cues related to turn-taking, brief feedback and affective expressions, their cross-relationships, as well as their cognitive and neurological basis. Techniques are proposed for instantaneous and suprasegmental parameterization of scalar- and vector-valued representations of fundamental frequency, but also intensity and voice quality. Examples are given of how to engineer supervised learned automata for off-line processing of conversational corpora as well as for incremental on-line processing with low-latency constraints, suitable as detector modules in a responsive social interface. Specific attention is given to the communicative functions of vocal feedback like "mhm", "okay" and "yeah, that's right", as postulated by the theories of grounding and emotion and by a survey of laymen's opinions. The potential functions and their prosodic cues are investigated via automatic decoding, data-mining, exploratory visualization and descriptive measurements.

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2012. pp. xiv, 86
Series
Trita-CSC-A, ISSN 1653-5723 ; 2012:08
National subject category
Language Technology (Computational Linguistics)
Identifiers
URN: urn:nbn:se:kth:diva-102335
ISBN: 978-91-7501-467-8 (print)
OAI: oai:DiVA.org:kth-102335
DiVA, id: diva2:552376
Public defence
2012-09-28, Sal F3, Lindstedtsvägen 26, KTH, Stockholm, 13:00 (English)
Opponent
Supervisors
Note

QC 20120914

Available from: 2012-09-14 Created: 2012-09-14 Last updated: 2018-01-12 Bibliographically approved
List of papers
1. Tracking pitch contours using minimum jerk trajectories
2011 (English) In: INTERSPEECH 2011, 12th Annual Conference of the International Speech Communication Association, 2011, pp. 2056-2059. Conference paper, Published paper (Refereed)
Abstract [en]

This paper proposes a fundamental frequency tracker, with the specific purpose of comparing the automatic estimates with pitch contours that are sketched by trained phoneticians. The method uses a frequency-domain approach to estimate pitch tracks that form minimum jerk trajectories. In this way the method tries to mimic the motor movements of the hand made while sketching. When the fundamental frequency tracks estimated by the proposed method on the oral and laryngograph signals of the MOCHA-TIMIT database were compared, the correlation was 0.98 and the root mean squared error was 4.0 Hz, which was slightly better than a state-of-the-art pitch tracking algorithm included in the ESPS. We also demonstrate how the proposed algorithm can be applied when comparing sketches made by phoneticians of the variations in accent II among the Swedish dialects.
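The minimum-jerk criterion above can be illustrated with a small smoother. This is a hedged sketch, not the paper's frequency-domain tracker: it penalizes squared third-order differences (the discrete analogue of jerk) of a frame-wise F0 track, with a hypothetical smoothing weight `lam`.

```python
import numpy as np

def smooth_min_jerk(f0, lam=10.0):
    """Smooth a frame-wise F0 track by penalizing squared jerk
    (third-order differences) -- a simplified, illustrative take
    on fitting minimum-jerk trajectories to pitch estimates.

    Solves  min_x ||x - f0||^2 + lam * ||D3 x||^2
    in closed form, where D3 is the third-difference operator.
    """
    f0 = np.asarray(f0, dtype=float)
    n = len(f0)
    # Third-difference operator: each row holds [-1, 3, -3, 1]
    D3 = np.zeros((n - 3, n))
    for i in range(n - 3):
        D3[i, i:i + 4] = [-1.0, 3.0, -3.0, 1.0]
    # Normal equations: (I + lam * D3^T D3) x = f0
    A = np.eye(n) + lam * D3.T @ D3
    return np.linalg.solve(A, f0)
```

Increasing `lam` trades fidelity to the raw estimates for smoother, lower-jerk contours, loosely mimicking the smooth hand movement of a sketching phonetician.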

Keywords
pitch tracking, Constant-Q, Swedish accent II
National subject category
Computer Sciences; Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-52192 (URN)
000316502201003 ()
2-s2.0-84865794085 (Scopus ID)
978-1-61839-270-1 (ISBN)
Conference
INTERSPEECH 2011, 12th Annual Conference of the International Speech Communication Association. Florence, Italy. 28-31 August 2011
Note

tmh_import_11_12_14. QC 20111222

Available from: 2011-12-14 Created: 2011-12-14 Last updated: 2018-01-12 Bibliographically approved
2. Exploring the implications for feedback of a neurocognitive theory of overlapped speech
2012 (English) In: Proceedings of the Workshop on Feedback Behaviors in Dialog, 2012, pp. 57-60. Conference paper, Poster (with or without abstract) (Refereed)
Keywords
feedback, functions of feedback, goal driven categories, taxonomy
National subject category
Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-102329 (URN)
Conference
The Interdisciplinary Workshop on Feedback Behaviors in Dialog
Project
SAMSYNT, IURO
Research funder
Vetenskapsrådet, 2009-4291; EU, European Research Council, FP7 – 248314
Note

QC 20120914

Available from: 2012-09-13 Created: 2012-09-13 Last updated: 2018-01-12 Bibliographically approved
3. Semi-supervised methods for exploring the acoustics of simple productive feedback
2013 (English) In: Speech Communication, ISSN 0167-6393, E-ISSN 1872-7182, Vol. 55, no 3, pp. 451-469. Journal article (Refereed) Published
Abstract [en]

This paper proposes methods for exploring acoustic correlates to feedback functions. A sub-language of Swedish, simple productive feedback, is introduced to facilitate investigations of the functional contributions of base tokens, phonological operations and prosody. The function of feedback is to convey the listener's attention, understanding and affective states. In order to handle the large number of possible affective states, the current study starts by performing a listening experiment in which humans annotated the functional similarity of feedback tokens with different prosodic realizations. By selecting a set of stimuli that had different prosodic distances from a reference token, it was possible to compute a generalised functional distance measure. The resulting generalised functional distance measure was found to correlate with prosodic distance, but the correlations varied as a function of base token and phonological operation. In a subsequent listening test, a small representative sample of feedback tokens was rated for understanding, agreement, interest, surprise and certainty. These ratings were found to explain a significant proportion of the generalised functional distance. By combining the acoustic analysis with an explorative visualisation of the prosody, we have established a map between human perception of similarity between feedback tokens, their measured distance in acoustic space, and the link to the perception of the function of feedback tokens with varying realisations.

Keywords
social signal processing, affective annotation, feedback modeling, grounding
National subject category
Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-102334 (URN)
10.1016/j.specom.2012.12.007 (DOI)
000316837000005 ()
2-s2.0-84875460872 (Scopus ID)
Project
SAMSYNT, IURO
Research funder
Vetenskapsrådet, 2009-4291; EU, European Research Council, FP7 – 248314
Note

QC 20130508

Available from: 2012-09-14 Created: 2012-09-14 Last updated: 2018-01-12 Bibliographically approved
4. Prosodic cues to engagement in non-lexical response tokens in Swedish
2010 (English) In: Proceedings of DiSS-LPSS Joint Workshop 2010, Tokyo, Japan, 2010. Conference paper, Published paper (Refereed)
Place, publisher, year, edition, pages
Tokyo, Japan, 2010
National subject category
Computer Sciences; Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-52148 (URN)
Conference
DiSS-LPSS Joint Workshop 2010, University of Tokyo, Japan, September 25-26, 2010
Note
tmh_import_11_12_14. QC 20120125
Available from: 2011-12-14 Created: 2011-12-14 Last updated: 2018-01-12 Bibliographically approved
5. Towards letting machines humming in the right way: prosodic analysis of six functions of short feedback tokens in English
2012 (English) In: Proceedings of Fonetik, 2012. Conference paper, Oral presentation only (Other academic)
National subject category
Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-102330 (URN)
Conference
Fonetik
Project
IURO, SAMSYNT
Research funder
EU, European Research Council, FP7 – 248314; Vetenskapsrådet, 2009-4291; ICT - The Next Generation
Note

QC 20120914

Available from: 2012-09-14 Created: 2012-09-14 Last updated: 2018-01-12 Bibliographically approved
6. Cues to perceived functions of acted and spontaneous feedback expressions
2012 (English) In: Proceedings of the Interdisciplinary Workshop on Feedback Behaviors in Dialog, 2012, pp. 53-56. Conference paper, Poster (with or without abstract) (Refereed)
Abstract [en]

We present a two-step study where the first part aims to determine the phonemic prior bias (conditioned on "ah", "m-hm", "m-m", "n-hn", "oh", "okay", "u-hu", "yeah" and "yes") in subjects' perception of six feedback functions (acknowledgment, continuer, disagreement, surprise, enthusiasm and uncertainty). The results showed a clear phonemic prior bias for some tokens: e.g. "ah" and "oh" are commonly interpreted as surprise, but "yeah" and "yes" less so. The second part aims to examine determinants of judged typicality, or graded structure, within the six functions of "okay". Typicality was correlated with four determinants: prosodic central tendency within the function (CT); phonemic prior bias as an approximation to frequency instantiation (FI); the posterior, i.e. CT x FI; and judged ideality (ID), i.e. similarity to the ideals associated with the goals served by the function. The results tentatively suggest that acted expressions are more effectively communicated and that the functions of feedback to a greater extent constitute goal-based categories determined by ideals, and to a lesser extent a taxonomy determined by CT and FI. However, it is possible to automatically predict typicality with a correlation of r = 0.52 via the posterior.

Keywords
feedback, functions of feedback, goal driven categories, taxonomy
National subject category
Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-102333 (URN)
Conference
The Interdisciplinary Workshop on Feedback Behaviors in Dialog
Project
SAMSYNT, IURO
Research funder
Vetenskapsrådet, 2009-4291; EU, European Research Council, FP7 – 248314; ICT - The Next Generation
Note

QC 20120914

Available from: 2012-09-14 Created: 2012-09-14 Last updated: 2018-01-12 Bibliographically approved
7. Predicting Speaker Changes and Listener Responses With And Without Eye-contact
2011 (English) In: INTERSPEECH 2011, 12th Annual Conference of the International Speech Communication Association, Florence, Italy, 2011, pp. 1576-1579. Conference paper, Published paper (Refereed)
Abstract [en]

This paper compares turn-taking in terms of timing and prediction in human-human conversations under the condition where participants have eye-contact versus the condition where there is no eye-contact, as found in the HCRC Map Task corpus. By measuring between-speaker intervals it was found that a larger proportion of speaker shifts occurred in overlap for the no eye-contact condition. For prediction we used prosodic and spectral features parameterized by time-varying, length-invariant discrete cosine coefficients. With Gaussian mixture modeling and variations of classifier fusion schemes, we explored the task of predicting whether there is an upcoming speaker change (SC) or not (HOLD) at the end of an utterance (EOU) with a pause lag of 200 ms. The label SC was further split into LRs (listener responses, e.g. back-channels) and other TURN-SHIFTs. The prediction was found to be somewhat easier for the eye-contact condition, for which the average recall rates were 60.57%, 66.35%, and 62.00% for TURN-SHIFTs, LR and SC respectively.
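The idea of a length-invariant discrete cosine parameterization can be sketched as follows. This is an illustrative reconstruction, not the published feature extraction: it keeps the first few DCT-II coefficients of a prosodic contour, and the orthonormal scaling (an assumption here) makes coefficients roughly comparable across contours of different lengths.

```python
import numpy as np

def dcc(contour, n_coeffs=4):
    """Parameterize a prosodic contour (e.g. F0 or intensity over a
    region before an end-of-utterance) by its first few DCT-II
    coefficients, computed directly from the cosine basis.

    Illustrative sketch only; windowing and normalization details
    differ from the published features.
    """
    x = np.asarray(contour, dtype=float)
    n = len(x)
    k = np.arange(n)[:, None]            # coefficient index
    m = np.arange(n)[None, :]            # sample index
    basis = np.cos(np.pi * k * (2 * m + 1) / (2 * n))
    c = basis @ x
    c[0] *= np.sqrt(1.0 / n)             # orthonormal DCT-II scaling
    c[1:] *= np.sqrt(2.0 / n)
    return c[:n_coeffs]
```

The 0th coefficient tracks the contour's overall level, the 1st its global slope, and higher coefficients capture progressively finer shape, which is what makes a short truncated DCT a compact contour descriptor.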

Place, publisher, year, edition, pages
Florence, Italy, 2011
Keywords
Turn-taking, Back-channels
National subject category
Computer Sciences; Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-52195 (URN)
000316502200396 ()
2-s2.0-84865794088 (Scopus ID)
978-1-61839-270-1 (ISBN)
Conference
INTERSPEECH 2011, 12th Annual Conference of the International Speech Communication Association, Florence, Italy
Note

tmh_import_11_12_14 QC 20111216

Available from: 2011-12-14 Created: 2011-12-14 Last updated: 2018-01-12 Bibliographically approved
8. Continuous Interaction with a Virtual Human
2011 (English) In: Journal on Multimodal User Interfaces, ISSN 1783-7677, E-ISSN 1783-8738, Vol. 4, no 2, pp. 97-118. Journal article (Refereed) Published
Abstract [en]

This paper presents our progress in developing a Virtual Human capable of being an attentive speaker. Such a Virtual Human should be able to attend to its interaction partner while it is speaking, and modify its communicative behavior on the fly based on what it observes in the behavior of its partner. We report new developments concerning a number of aspects, such as scheduling and interrupting multimodal behavior, automatic classification of listener responses, generation of response eliciting behavior, and strategies for generating appropriate reactions to listener responses. On the basis of this progress, a task-based setup for a responsive Virtual Human was implemented to carry out two user studies, the results of which are presented and discussed in this paper.

Keywords
Attentive speaking, Continuous interaction, Listener responses, Virtual humans
National subject category
Computer Sciences; Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-52194 (URN)
10.1007/s12193-011-0060-x (DOI)
000309997100004 ()
2-s2.0-80955180056 (Scopus ID)
Note

tmh_import_11_12_14. QC 20111215

Available from: 2011-12-14 Created: 2011-12-14 Last updated: 2018-01-12 Bibliographically approved
9. A Dual Channel Coupled Decoder for Fillers and Feedback
2011 (English) In: INTERSPEECH 2011, 12th Annual Conference of the International Speech Communication Association, 2011, pp. 3097-3100. Conference paper, Published paper (Refereed)
Abstract [en]

This study presents a dual channel decoder capable of modeling cross-speaker dependencies for segmentation and classification of fillers and feedback in conversational speech found in the DEAL corpus. For the same number of Gaussians per state, we show improvements in terms of average F-score for the successive addition of 1) an increased frame rate from 10 ms to 50 ms, 2) Joint Maximum Cross-Correlation (JMXC) features in a single channel decoder, 3) a joint transition matrix which captures dependencies symmetrically across the two channels, and 4) coupled acoustic model retraining symmetrically across the two channels. The final step gives a relative improvement of over 100% for fillers and feedback compared to our previously published results. The F-scores are in a range that makes it possible to use the decoder both as a voice activity detector and as an illocutionary act decoder for semi-automatic annotation.
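The symmetric cross-channel tying in step 3 can be illustrated with a toy numpy sketch. This is not the published model; the per-channel state set and the averaging scheme are assumptions. The point is that the probability of the joint transition (i, j) → (k, l) is tied to equal that of the speaker-swapped transition (j, i) → (l, k), so the model treats the two channels symmetrically.

```python
import numpy as np

S = 3        # hypothetical per-channel states, e.g. silence / speech / feedback
n = S * S    # joint states (i, j), index = i * S + j

def swap(idx):
    """Map the joint-state index of (i, j) to that of (j, i)."""
    i, j = divmod(idx, S)
    return j * S + i

rng = np.random.default_rng(1)
counts = rng.random((n, n))          # stand-in for transition counts

# Tie speaker-swapped transitions by averaging their counts
tied = np.empty_like(counts)
for a in range(n):
    for b in range(n):
        tied[a, b] = 0.5 * (counts[a, b] + counts[swap(a), swap(b)])

# Row-normalize into a stochastic joint transition matrix
T = tied / tied.sum(axis=1, keepdims=True)
```

Because swapped rows end up with equal row sums after the averaging, the symmetry also survives the row normalization, so the tied probabilities themselves satisfy T[(i,j),(k,l)] = T[(j,i),(l,k)].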

Keywords
Conversation, Coupled hidden Markov models, Cross-speaker modeling, Feedback, Filler
National subject category
Computer Sciences; Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-52193 (URN)
000316502201265 ()
2-s2.0-8486579156 (Scopus ID)
978-1-61839-270-1 (ISBN)
Conference
INTERSPEECH 2011, 12th Annual Conference of the International Speech Communication Association. Florence, Italy. 28-31 August 2011
Note

tmh_import_11_12_14. QC 20111222

Available from: 2011-12-14 Created: 2011-12-14 Last updated: 2018-01-12 Bibliographically approved
10. The Prosody of Swedish Conversational Grunts
2010 (English) In: 11th Annual Conference of the International Speech Communication Association: Spoken Language Processing for All, INTERSPEECH 2010, 2010, pp. 2562-2565. Conference paper, Published paper (Refereed)
Abstract [en]

This paper explores conversational grunts in a face-to-face setting. The study investigates the prosody and turn-taking effect of fillers and feedback tokens that have been annotated for attitudes. The grunts were selected from the DEAL corpus and automatically annotated for their turn-taking effect. A novel suprasegmental prosodic signal representation and contextual timing features are used for classification and visualization. Classification results using linear discriminant analysis show that turn-initial feedback tokens lose some of their attitude-signaling prosodic cues compared to non-overlapping continuer feedback tokens. Turn-taking effects can be predicted well above chance level, except for simultaneous starts. However, feedback tokens before places where both speakers take the turn were more similar to feedback continuers than to turn-initial feedback tokens.

Keywords
prosody, fillers, feedback, suprasegmental, conversational grunts
National subject category
Computer Sciences; Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-52141 (URN)
000313086500255 ()
2-s2.0-79959844001 (Scopus ID)
978-1-61782-123-3 (ISBN)
Conference
INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association. Makuhari, Chiba. 26 September 2010 - 30 September 2010
Note

tmh_import_11_12_14. QC 20111222

Available from: 2011-12-14 Created: 2011-12-14 Last updated: 2018-01-12 Bibliographically approved
11. Intra-, Inter-, and Cross-cultural Classification of Vocal Affect
2011 (English) In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, Florence, Italy, 2011, pp. 1592-1595. Conference paper, Published paper (Refereed)
Abstract [en]

We present intra-, inter- and cross-cultural classifications of vocal expressions. Stimuli were selected from the VENEC corpus and consisted of portrayals of 11 emotions, each expressed with 3 levels of intensity. Classification (nu-SVM) was based on acoustic measures related to pitch, intensity, formants, voice source and duration. Results showed that mean recall across emotions was around 2.4-3 times higher than chance level for both intra- and inter-cultural conditions. For cross-cultural conditions, the relative performance dropped 26%, 32%, and 34% for high, medium, and low emotion intensity, respectively. This suggests that intra-cultural models were more sensitive to mismatched conditions for low emotion intensity. Preliminary results further indicated that recall rate varied as a function of emotion, with lust and sadness showing the smallest performance drops in the cross-cultural condition.
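The nu-SVM classification step can be sketched with scikit-learn's `NuSVC`. This is a minimal illustration under stated assumptions, not the authors' setup: the feature values are synthetic, the two "emotion" classes and the feature dimensionality are hypothetical stand-ins for the utterance-level pitch, intensity, formant, voice source and duration measures described above.

```python
import numpy as np
from sklearn.svm import NuSVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
n_per_class, n_feats = 40, 6   # hypothetical: 6 acoustic summary features

# Two synthetic "emotion" classes with shifted feature means
X = np.vstack([rng.normal(0.0, 1.0, (n_per_class, n_feats)),
               rng.normal(1.5, 1.0, (n_per_class, n_feats))])
y = np.array([0] * n_per_class + [1] * n_per_class)

# Standardize features, then fit a nu-SVM with an RBF kernel;
# nu upper-bounds the fraction of margin errors.
clf = make_pipeline(StandardScaler(), NuSVC(nu=0.3, kernel="rbf"))
clf.fit(X, y)
acc = clf.score(X, y)
```

Cross-cultural evaluation in this framing would amount to training the pipeline on data from one culture and calling `score` on held-out data from another; the mismatch-induced drop in recall is what the abstract quantifies.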

Place, publisher, year, edition, pages
Florence, Italy, 2011
Keywords
emotion, affect, cross-cultural
National subject category
Computer Sciences; Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-52191 (URN)
000316502200400 ()
2-s2.0-84865794836 (Scopus ID)
978-1-61839-270-1 (ISBN)
Conference
12th Annual Conference of the International Speech Communication Association, INTERSPEECH 2011; Florence; Italy; 27 August 2011 through 31 August 2011
Note

tmh_import_11_12_14 QC 20111219

Available from: 2011-12-14 Created: 2011-12-14 Last updated: 2018-01-12 Bibliographically approved
12. Emotion Recognition
2009 (English) In: Computers in the Human Interaction Loop / [ed] Waibel, A.; Stiefelhagen, R., Berlin/Heidelberg: Springer, 2009, pp. 96-105. Chapter in book (Refereed)
Place, publisher, year, edition, pages
Berlin/Heidelberg: Springer, 2009
National subject category
Computer Sciences; Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-52064 (URN)
10.1007/978-1-84882-054-8_10 (DOI)
978-1-84882-053-1 (ISBN)
Note
tmh_import_11_12_14. QC 20111222
Available from: 2011-12-14 Created: 2011-12-14 Last updated: 2018-01-12 Bibliographically approved

Open Access in DiVA

Thesis (1611 kB), 967 downloads
File information
File name: FULLTEXT02.pdf
File size: 1611 kB
Checksum (SHA-512):
2247b2ac8f05358ea8e9f1d207a184d20cb7a32d433c190af0c3aafb1c552b1721db96f84b415b77726e9ae8d2881e9de2b6881c06485a5e5d6351826d71065a
Type: fulltext
Mimetype: application/pdf
