  • 51.
    Edlund, Jens
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    House, David
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Skantze, Gabriel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    The effects of prosodic features on the interpretation of clarification ellipses (2005). In: Proceedings of Interspeech 2005: Eurospeech, 2005, p. 2389-2392. Conference paper (Refereed)
    Abstract [en]

    In this paper, the effects of prosodic features on the interpretation of elliptical clarification requests in dialogue are studied. An experiment is presented where subjects were asked to listen to short human-computer dialogue fragments in Swedish, where a synthetic voice was making an elliptical clarification after a user turn. The prosodic features of the synthetic voice were systematically varied, and the subjects were asked to judge what was actually intended by the computer. The results show that an early low F0 peak signals acceptance, that a late high peak is perceived as a request for clarification of what was said, and that a mid high peak is perceived as a request for clarification of the meaning of what was said. The study can be seen as the beginning of a tentative model of the intonation of clarification ellipses in Swedish, which can be implemented and tested in spoken dialogue systems.

  • 52.
    Edlund, Jens
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    House, David
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Strömbergsson, Sofia
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Question types and some prosodic correlates in 600 questions in the Spontal database of Swedish dialogues (2012). In: Proceedings of the 6th International Conference on Speech Prosody, Vols. I and II, Shanghai, China: Tongji Univ Press, 2012, p. 737-740. Conference paper (Refereed)
    Abstract [en]

    Studies of questions present strong evidence that there is no one-to-one relationship between intonation and interrogative mode. We present initial steps of a larger project investigating and describing intonational variation in the Spontal database of 120 half-hour spontaneous dialogues in Swedish, and testing the hypothesis that the concept of a standard question intonation, such as a final pitch rise contrasting with a final low declarative intonation, is not consistent with the pragmatic use of intonation in dialogue. We report on the extraction of 600 questions from the Spontal corpus, the coding and annotation of question typology, and preliminary results concerning some prosodic correlates related to question type.

  • 53.
    Edlund, Jens
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Oertel, Catharine
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Investigating negotiation for load-time in the GetHomeSafe project (2012). In: Proc. of Workshop on Innovation and Applications in Speech Technology (IAST), Dublin, Ireland, 2012, p. 45-48. Conference paper (Refereed)
    Abstract [en]

    This paper describes ongoing work by KTH Speech, Music and Hearing in GetHomeSafe, a newly inaugurated EU project in collaboration with DFKI, Nuance, IBM and Daimler. Under the assumption that drivers will utilize technology while driving regardless of legislation, the project aims at finding out how to make the use of in-car technology as safe as possible, rather than prohibiting it. We briefly describe the project in general and our own role in some more detail, in particular one of our tasks: to build a system that can ask the driver, in an unobtrusive manner, whether now is a good time to speak about X, and that knows how to deal with rejection, for example by asking the driver to get back to it when it is a good time or to schedule a time that will be convenient.

  • 54.
    Edlund, Jens
    et al.
    KTH, Superseded Departments, Speech, Music and Hearing.
    Skantze, Gabriel
    KTH, Superseded Departments, Speech, Music and Hearing.
    Carlson, Rolf
    KTH, Superseded Departments, Speech, Music and Hearing.
    Higgins: a spoken dialogue system for investigating error handling techniques (2004). In: Proceedings of the International Conference on Spoken Language Processing, ICSLP 04, 2004, p. 229-231. Conference paper (Refereed)
    Abstract [en]

    In this paper, an overview of the Higgins project and the research within the project is presented. The project incorporates studies of error handling for spoken dialogue systems on several levels, from processing to dialogue level. A domain in which a range of different error types can be studied has been chosen: pedestrian navigation and guiding. Several data collections within Higgins have been analysed along with data from Higgins' predecessor, the AdApt system. The error handling research issues in the project are presented in light of these analyses.

  • 55.
    Edlund, Jens
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Strömbergsson, Sofia
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    House, David
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Telling questions from statements in spoken dialogue systems (2012). In: Proc. of SLTC 2012, Lund, Sweden, 2012. Conference paper (Refereed)
  • 56.
    Edlund, Jens
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Tånnander, Christina
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Audience response system-based assessment for analysis-by-synthesis (2015). In: Proc. of ICPhS 2015, ICPhS, 2015. Conference paper (Refereed)
  • 57.
    Fallgren, P.
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Malisz, Z.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Edlund, Jens
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    A tool for exploring large amounts of found audio data (2018). In: CEUR Workshop Proceedings, CEUR-WS, 2018, p. 499-503. Conference paper (Refereed)
    Abstract [en]

    We demonstrate a method and a set of open source tools (beta) for non-sequential browsing of large amounts of audio data. The demonstration will contain versions of a set of functionalities in their first stages, and will provide good insight into how the method can be used to browse through large quantities of audio data efficiently.

  • 58.
    Gustafson, Joakim
    et al.
    KTH, Superseded Departments (pre-2005), Speech, Music and Hearing.
    Bell, Linda
    KTH, Superseded Departments (pre-2005), Speech, Music and Hearing.
    Boye, Johan
    KTH, Superseded Departments (pre-2005), Speech, Music and Hearing.
    Edlund, Jens
    KTH, Superseded Departments (pre-2005), Speech, Music and Hearing.
    Wirén, Mats
    Constraint Manipulation and Visualization in a Multimodal Dialogue System (2002). In: Proceedings of MultiModal Dialogue in Mobile Environments, 2002. Conference paper (Refereed)
  • 59.
    Gustafson, Joakim
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Edlund, Jens
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Ask the experts - Part I: Elicitation (2010). In: Linguistic Theory and Raw Sound / [ed] Juel Henrichsen, Peter, Samfundslitteratur, 2010, p. 169-182. Chapter in book (Refereed)
    Abstract [en]

    We present work fuelled by an urge to understand speech in its original and most fundamental context: in conversation between people. And what better way than to look to the experts? Regarding human conversation, authority lies with the speakers themselves, and asking the experts is a matter of observing and analyzing what speakers do. This is the first part of a two-part discussion which is illustrated with examples mainly from the work at KTH Speech, Music and Hearing. In this part, we discuss methods of capturing what people do when they talk to each other.

  • 60.
    Gustafson, Joakim
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Edlund, Jens
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    EXPROS: Tools for exploratory experimentation with prosody (2008). In: Proceedings of FONETIK 2008, Gothenburg, Sweden, 2008, p. 17-20. Conference paper (Other academic)
    Abstract [en]

    This demo paper presents EXPROS, a toolkit for experimentation with prosody in diphone voices. Although prosodic features play an important role in human-human spoken dialogue, they are largely unexploited in current spoken dialogue systems. The toolkit contains tools for a number of purposes: for example, the extraction of prosodic features such as pitch, intensity and duration for transplantation onto synthetic utterances, and the creation of purpose-built, customized MBROLA mini-voices.

  • 61.
    Gustafson, Joakim
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Heldner, Mattias
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Edlund, Jens
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Potential benefits of human-like dialogue behaviour in the call routing domain (2008). In: Perception in Multimodal Dialogue Systems, Proceedings / [ed] Andre, E.; Dybkjaer, L.; Minker, W.; Neumann, H.; Pieraccini, R.; Weber, M., 2008, Vol. 5078, p. 240-251. Conference paper (Refereed)
    Abstract [en]

    This paper presents a Wizard-of-Oz (Woz) experiment in the call routing domain that took place during the development of a call routing system for the TeliaSonera residential customer care in Sweden. A corpus of 42,000 calls was used as a basis for identifying problematic dialogues and the strategies used by operators to overcome the problems. A new Woz recording was made, implementing some of these strategies. The collected data is described and discussed with a view to explore the possible benefits of more human-like dialogue behaviour in call routing applications.

  • 62.
    Gustafson, Joakim
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Edlund, Jens
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    EXPROS: A toolkit for exploratory experimentation with prosody in customized diphone voices (2008). In: Perception in Multimodal Dialogue Systems, Proceedings / [ed] Andre, E.; Dybkjaer, L.; Minker, W.; Neumann, H.; Pieraccini, R.; Weber, M., 2008, Vol. 5078, p. 293-296. Conference paper (Refereed)
    Abstract [en]

    This paper presents a toolkit for experimentation with prosody in diphone voices. Prosodic features play an important role for aspects of human-human spoken dialogue that are largely unexploited in current spoken dialogue systems. The toolkit contains tools for recording utterances for a number of purposes. Examples include extraction of prosodic features such as pitch, intensity and duration for transplantation onto synthetic utterances, and creation of purpose-built customized MBROLA mini-voices.

  • 63. Heldner, Mattias
    et al.
    Edlund, Jens
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Continuer relevance spaces (2012). In: Proc. of Nordic Prosody XI, Tartu, Estonia, 2012. Conference paper (Other academic)
  • 64.
    Heldner, Mattias
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Edlund, Jens
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Pauses, gaps and overlaps in conversations (2010). In: Journal of Phonetics, ISSN 0095-4470, E-ISSN 1095-8576, Vol. 38, no. 4, p. 555-568. Article in journal (Refereed)
    Abstract [en]

    This paper explores durational aspects of pauses, gaps and overlaps in three different conversational corpora, with a view to challenging claims about precision timing in turn-taking. Distributions of pause, gap and overlap durations in conversations are presented, and methodological issues regarding the statistical treatment of such distributions are discussed. The results are related to published minimal response times for spoken utterances and thresholds for detection of acoustic silences in speech. It is shown that turn-taking is generally less precise than is often claimed by researchers in the field of conversation analysis or interactional linguistics. These results are discussed in the light of their implications for models of timing in turn-taking and for interaction control models in speech technology. In particular, it is argued that the proportion of speaker changes that could potentially be triggered by information immediately preceding the speaker change is large enough for reactive interaction control models to be viable in speech technology.

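    A minimal sketch (not from the paper; the function names and the interval representation are illustrative assumptions) of the interval classification this line of work builds on: a silence within one speaker's talk is a pause, a silence at a speaker change is a gap, and a negative offset at a speaker change is an overlap.

        from dataclasses import dataclass

        @dataclass
        class Interval:
            speaker: str
            start: float  # seconds
            end: float    # seconds

        def classify_transitions(intervals):
            # Classify the offset between consecutive utterances (sorted by
            # start time): silence within one speaker = pause; silence at a
            # speaker change = gap; negative offset at a change = overlap.
            labels = []
            for prev, nxt in zip(intervals, intervals[1:]):
                offset = nxt.start - prev.end
                if prev.speaker == nxt.speaker:
                    kind = "pause" if offset > 0 else "within-speaker overlap"
                else:
                    kind = "gap" if offset > 0 else "overlap"
                labels.append((kind, offset))
            return labels
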
  • 65.
    Heldner, Mattias
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Edlund, Jens
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Prosodic cues for interaction control in spoken dialogue systems (2006). In: Proceedings of Fonetik 2006, Lund, Sweden: Lund University, Centre for Languages & Literature, Dept. of Linguistics & Phonetics, 2006, p. 53-56. Conference paper (Other academic)
    Abstract [en]

    This paper discusses the feasibility of using prosodic features for interaction control in spoken dialogue systems, and points to experimental evidence that automatically extracted prosodic features can be used to improve the efficiency of identifying relevant places at which a machine can legitimately begin to talk to a human interlocutor, as well as to shorten system response times.

  • 66.
    Heldner, Mattias
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Edlund, Jens
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    What turns speech into conversation?: A project description (2007). In: TMH-QPSR, ISSN 1104-5787, Vol. 50, no. 1, p. 45-48. Article in journal (Refereed)
    Abstract [en]

    The project Vad gör tal till samtal? (What turns speech into conversation?) takes as its starting point that while conversation must be considered the primary kind of speech, we are still far better at modelling monologue than dialogue, in theory as well as for speech technology applications. There are also good reasons to assume that conversation contains a number of features that are not found in other kinds of speech, including, among other things, the active cooperation among interlocutors to control the interaction and to establish common ground. Through this project, we hope to improve the situation by investigating features that are specific to human-human conversation – features that turn speech into conversation. We will focus on acoustic and prosodic aspects of such features.

  • 67.
    Heldner, Mattias
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Edlund, Jens
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Björkenstam, Tomas
    Automatically extracted F0 features as acoustic correlates of prosodic boundaries (2004). In: Fonetik 2004: Proc. of the XVIIth Swedish Phonetics Conference, Stockholm University, 2004, p. 52-55. Conference paper (Refereed)
    Abstract [en]

    This work presents preliminary results of an investigation of various automatically extracted F0 features as acoustic correlates of prosodic boundaries. The F0 features were primarily intended to capture phenomena such as boundary tones, F0 resets across boundaries and position in the speaker's F0 range. While there were no correspondences between boundary tones and boundaries, the reset and range features appeared to separate boundaries from no boundaries fairly well.

  • 68.
    Heldner, Mattias
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Edlund, Jens
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Carlson, Rolf
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Interruption impossible (2006). In: Nordic Prosody: Proceedings of the IXth Conference, Lund 2004 / [ed] Bruce, G.; Horne, M., Frankfurt am Main, Germany, 2006, p. 97-105. Conference paper (Refereed)
    Abstract [en]

    Most current work on spoken human-computer interaction has so far concentrated on interactions between a single user and a dialogue system. The advent of ideas of the computer or dialogue system as a conversational partner in a group of humans, for example within the CHIL project and elsewhere (e.g. Kirchhoff & Ostendorf, 2003), introduces new requirements on the capabilities of the dialogue system. Among other things, the computer as a participant in a multi-party conversation has to appreciate the human turn-taking system, in order to time its own interjections appropriately. As the role of a conversational computer is likely to be to support human collaboration, rather than to guide or control it, it is particularly important that it does not interrupt or disturb the human participants. The ultimate goal of the work presented here is to predict suitable places for turn-taking, as well as positions where it is impossible for a conversational computer to interrupt without irritating the human interlocutors.

  • 69.
    Heldner, Mattias
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Edlund, Jens
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Hirschberg, Julia
    Pitch similarity in the vicinity of backchannels (2010). In: Proceedings of the 11th Annual Conference of the International Speech Communication Association, INTERSPEECH 2010, Makuhari, Japan, 2010, p. 3054-3057. Conference paper (Refereed)
    Abstract [en]

    Dynamic modeling of spoken dialogue seeks to capture how interlocutors change their speech over the course of a conversation. Much work has focused on how speakers adapt or entrain to different aspects of one another's speaking style. In this paper we focus on local aspects of this adaptation. We investigate the relationship between backchannels and the interlocutor utterances that precede them with respect to pitch. We demonstrate that the pitch of backchannels is more similar to that of the immediately preceding utterance than the pitch of non-backchannels. This inter-speaker pitch relationship captures the same distinctions as more cumbersome intra-speaker relations, and supports the intuition that, in terms of pitch, such similarity may be one of the mechanisms by which backchannels are rendered 'unobtrusive'.

  • 70.
    Heldner, Mattias
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Edlund, Jens
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Hjalmarsson, Anna
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Laskowski, Kornel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Very short utterances and timing in turn-taking (2011). In: Proceedings of Interspeech 2011, 2011, p. 2848-2851. Conference paper (Refereed)
    Abstract [en]

    This work explores the timing of very short utterances in conversations, as well as the effects of excluding intervals adjacent to such utterances from distributions of between-speaker interval durations. The results show that very short utterances are more precisely timed to the preceding utterance than longer utterances in terms of a smaller variance and a larger proportion of no-gap-no-overlaps. Excluding intervals adjacent to very short utterances furthermore results in measures of central tendency closer to zero (i.e. no-gap-no-overlaps) as well as larger variance (i.e. relatively longer gaps and overlaps).

  • 71.
    Heldner, Mattias
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Edlund, Jens
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Laskowski, Kornel
    Pelcé, Antoine
    Prosodic features in the vicinity of pauses, gaps and overlaps (2009). In: Nordic Prosody: Proceedings of the Xth Conference / [ed] Vainio, Martti; Aulanko, Reijo; Aaltonen, Olli, Frankfurt am Main: Peter Lang, 2009, p. 95-106. Conference paper (Refereed)
    Abstract [en]

    In this study, we describe the range of prosodic variation observed in two types of dialogue contexts, using fully automatic methods. The first type of context is that of speaker-changes, or transitions from only one participant speaking to only the other, involving either acoustic silences or acoustic overlaps. The second type of context is comprised of mutual silences or overlaps where a speaker change could in principle occur but does not. For lack of a better term, we will refer to these contexts as non-speaker-changes. More specifically, we investigate F0 patterns in the intervals immediately preceding overlaps and silences – in order to assess whether prosody before overlaps or silences may invite or inhibit speaker change.

  • 72.
    Heldner, Mattias
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Hjalmarsson, Anna
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Edlund, Jens
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Backchannel relevance spaces (2013). In: Nordic Prosody: Proceedings of the XIth Conference, Tartu 2012 / [ed] Asu, Eva Liina; Lippus, Pärtel, Peter Lang Publishing Group, 2013, p. 137-146. Conference paper (Refereed)
  • 73.
    Hincks, Rebecca
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Language and Communication.
    Edlund, Jens
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Promoting increased pitch variation in oral presentations with transient visual feedback (2009). In: Language Learning & Technology, ISSN 1094-3501, E-ISSN 1094-3501, Vol. 13, no. 3, p. 32-50. Article in journal (Refereed)
    Abstract [en]

    This paper investigates learner response to a novel kind of intonation feedback generated from speech analysis. Instead of displays of pitch curves, our feedback is flashing lights that show how much pitch variation the speaker has produced. The variable used to generate the feedback is the standard deviation of fundamental frequency as measured in semitones. Flat speech causes the system to show yellow lights, while more expressive speech that has used pitch to give focus to any part of an utterance generates green lights. Participants in the study were 14 Chinese students of English at intermediate and advanced levels. A group that received visual feedback was compared with a group that received audio feedback. Pitch variation was measured at four stages: in a baseline oral presentation; for the first and second halves of three hours of training; and finally in the production of a new oral presentation. Both groups increased their pitch variation with training, and the effect lasted after the training had ended. The test group showed a significantly higher increase than the control group, indicating that the feedback is effective. These positive results imply that the feedback could be beneficially used in a system for practicing oral presentations.

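    The feedback variable above - the standard deviation of F0 measured in semitones - is straightforward to compute. A minimal sketch, assuming an F0 track with unvoiced frames marked as zero; the semitone reference (the speaker's median F0) and the yellow/green threshold are assumptions, not values from the paper.

        import numpy as np

        def pitch_variation_feedback(f0_hz, flat_threshold_st=2.0):
            # Standard deviation of F0 in semitones relative to the speaker's
            # median F0; the threshold separating yellow from green lights is
            # a hypothetical placeholder.
            voiced = f0_hz[f0_hz > 0]  # drop unvoiced frames
            semitones = 12.0 * np.log2(voiced / np.median(voiced))
            sd = float(semitones.std())
            return ("green" if sd >= flat_threshold_st else "yellow"), sd
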
  • 74.
    Hincks, Rebecca
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Language and Communication (closed 2011-01-01).
    Edlund, Jens
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Transient visual feedback on pitch variation for Chinese speakers of English (2009). In: Proc. of Fonetik 2009, Stockholm, 2009. Conference paper (Other academic)
    Abstract [en]

    This paper reports on an experimental study comparing two groups of seven Chinese students of English who practiced oral presentations with computer feedback. Both groups imitated teacher models and could listen to recordings of their own production. The test group was also shown flashing lights that responded to the standard deviation of the fundamental frequency over the previous two seconds. The speech of the test group increased significantly more in pitch variation than the control group. These positive results suggest that this novel type of feedback could be used in training systems for speakers who have a tendency to speak in a monotone when making oral presentations.

  • 75.
    Hincks, Rebecca
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Language and Communication.
    Edlund, Jens
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Using speech technology to promote increased pitch variation in oral presentations (2009). In: Proc. of SLaTE Workshop on Speech and Language Technology in Education, Wroxall, UK, 2009. Conference paper (Refereed)
    Abstract [en]

    This paper reports on an experimental study comparing two groups of seven Chinese students of English who practiced oral presentations with computer feedback. Both groups imitated teacher models and could listen to recordings of their own production. The test group was also shown flashing lights that responded to the standard deviation of the fundamental frequency over the previous two seconds. The speech of the test group increased significantly more in pitch variation than the control group. These positive results suggest that this novel type of feedback could be used in training systems for speakers who have a tendency to speak in a monotone when making oral presentations.

  • 76.
    Hjalmarsson, Anna
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Edlund, Jens
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Human-likeness in utterance generation: Effects of variability (2008). In: Perception in Multimodal Dialogue Systems, Proceedings / [ed] Andre, E.; Dybkjaer, L.; Minker, W.; Neumann, H.; Pieraccini, R.; Weber, M., 2008, Vol. 5078, p. 252-255. Conference paper (Refereed)
    Abstract [en]

    There are compelling reasons to endow dialogue systems with human-like conversational abilities, which require modelling of aspects of human behaviour. This paper examines the value of using human behaviour as a target for system behaviour through a study making use of a simulation method. Two versions of system behaviour are compared: a replica of a human speaker's behaviour and a constrained version with less variability. The version based on human behaviour is rated more human-like, polite and intelligent.

  • 77. Landsiedel, Christian
    et al.
    Edlund, Jens
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Eyben, Florian
    Neiberg, Daniel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Schuller, Björn
    Syllabification of conversational speech using bidirectional long-short-term memory neural networks (2011). In: Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, Prague, Czech Republic, 2011, p. 5256-5259. Conference paper (Refereed)
    Abstract [en]

    Segmentation of speech signals is a crucial task in many types of speech analysis. We present a novel approach to segmentation at the syllable level, using a Bidirectional Long-Short-Term Memory Neural Network. It performs estimation of syllable nucleus positions based on regression of perceptually motivated input features to a smooth target function. Peak selection is performed to attain valid nuclei positions. Performance of the model is evaluated on the levels of both syllables and the vowel segments making up the syllable nuclei. The general applicability of the approach is illustrated by good results for two common databases - Switchboard and TIMIT - for both read and spontaneous speech, and a favourable comparison with other published results.

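    The post-processing step the abstract mentions - peak selection over the network's smooth nucleus-activation function - can be illustrated as follows; the threshold, frame shift and minimum peak distance are assumed values, not taken from the paper.

        def pick_nuclei(activation, frame_shift_s=0.01, threshold=0.3, min_dist_s=0.1):
            # Keep local maxima of the smooth activation curve that exceed a
            # threshold and are not too close to the previously accepted peak.
            min_dist = int(min_dist_s / frame_shift_s)
            peaks = []
            for i in range(1, len(activation) - 1):
                is_peak = activation[i - 1] < activation[i] >= activation[i + 1]
                if is_peak and activation[i] >= threshold:
                    if not peaks or i - peaks[-1] >= min_dist:
                        peaks.append(i)
            return [p * frame_shift_s for p in peaks]  # nucleus times in seconds
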
  • 78. Laskowski, Kornel
    et al.
    Edlund, Jens
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    A Snack Implementation and Tcl/Tk Interface to the Fundamental Frequency Variation Spectrum Algorithm (2010). In: Proceedings of the International Conference on Language Resources and Evaluation, LREC 2010 / [ed] Calzolari, Nicoletta; Choukri, Khalid; Maegaard, Bente; Mariani, Joseph; Odjik, Jan; Piperidis, Stelios; Rosner, Mike; Tapias, Daniel, Valetta, Malta, 2010, p. 3742-3749. Conference paper (Refereed)
    Abstract [en]

    Intonation is an important aspect of vocal production, used for a variety of communicative needs. Its modeling is therefore crucial in many speech understanding systems, particularly those requiring inference of speaker intent in real-time. However, the estimation of pitch, traditionally the first step in intonation modeling, is computationally inconvenient in such scenarios. This is because it is often, and most optimally, achieved only after speech segmentation and recognition. A consequence is that earlier speech processing components, in today's state-of-the-art systems, lack intonation awareness by fiat; it is not known to what extent this circumscribes their performance. In the current work, we present a freely available implementation of an alternative to pitch estimation, namely the computation of the fundamental frequency variation (FFV) spectrum, which can be easily employed at any level within a speech processing system. It is our hope that the implementation we describe aids in the understanding of this novel acoustic feature space, and that it facilitates its inclusion, as desired, in the front-end routines of speech recognition, dialog act recognition, and speaker recognition systems.

  • 79. Laskowski, Kornel
    et al.
    Edlund, Jens
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Heldner, Mattias
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    A single-port non-parametric model of turn-taking in multi-party conversation (2011). In: Proc. of ICASSP 2011, Prague, Czech Republic, 2011, p. 5600-5603. Conference paper (Refereed)
    Abstract [en]

    The taking of turns to speak is an intrinsic property of conversation. It is therefore expected that models of turn-taking, providing a prior distribution over conversational form, can usefully reduce the perplexity of what is observed and processed in real-time spoken dialogue systems. We propose a conversation-independent single-port model of multi-party turn-taking, one which allows conversants to undertake independent actions but to condition them on the past behavior of all participants. The model is shown to generally outperform an existing multi-port model on a measure of perplexity over subsequently observed speech activity. We quantify the effect of history truncation and the success of predicting distant conversational futures, and argue that the framework is sufficiently accessible and has significant potential to usefully inform the design and behavior of spoken dialogue systems.

  • 80. Laskowski, Kornel
    et al.
    Edlund, Jens
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Heldner, Mattias
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    An instantaneous vector representation of delta pitch for speaker-change prediction in conversational dialogue systems (2008). In: 2008 IEEE International Conference on Acoustics, Speech and Signal Processing, New York: IEEE, 2008, p. 5041-5044. Conference paper (Refereed)
    Abstract [en]

    As spoken dialogue systems become deployed in increasingly complex domains, they face rising demands on the naturalness of interaction. We focus on system responsiveness, aiming to mimic human-like dialogue flow control by predicting speaker changes as observed in real human-human conversations. We derive an instantaneous vector representation of pitch variation and show that it is amenable to standard acoustic modeling techniques. Using a small amount of automatically labeled data, we train models which significantly outperform current state-of-the-art pause-only systems, and replicate to within 1% absolute the performance of our previously published hand-crafted baseline. The new system additionally offers scope for run-time control over the precision or recall of locations at which to speak.

  • 81. Laskowski, Kornel
    et al.
    Edlund, Jens
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Heldner, Mattias
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Incremental learning and forgetting in incremental stochastic turn-taking models (2011). In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, Florence, Italy, 2011, p. 2080-2083. Conference paper (Refereed)
    Abstract [en]

    We present a computational framework for stochastically modeling dyad interaction chronograms. The framework's most novel feature is the capacity for incremental learning and forgetting. To showcase its flexibility, we design experiments answering four concrete questions about the systematics of spoken interaction. The results show that: (1) individuals are clearly affected by one another; (2) there is individual variation in interaction strategy; (3) strategies wander in time rather than converge; and (4) individuals exhibit similarity with their interlocutors. We expect the proposed framework to be capable of answering many such questions with little additional effort.

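    One simple way to realize incremental learning and forgetting over a stream of interaction states is exponentially decayed transition counts. This toy sketch illustrates the idea only; it is not the authors' framework, and the state representation, decay rate and smoothing are assumptions.

        from collections import defaultdict

        class ForgettingTurnModel:
            # Incremental bigram model over interaction states (e.g. joint
            # speech/silence configurations of a dyad) with exponential
            # forgetting, so recent evidence outweighs old evidence.
            def __init__(self, decay=0.99):
                self.decay = decay
                self.counts = defaultdict(lambda: defaultdict(float))

            def update(self, prev_state, state):
                row = self.counts[prev_state]
                for nxt in row:            # forget: decay all old counts
                    row[nxt] *= self.decay
                row[state] += 1.0          # learn: count the new transition

            def prob(self, prev_state, state, smoothing=1e-3):
                row = self.counts[prev_state]
                total = sum(row.values()) + smoothing * max(len(row), 1)
                return (row.get(state, 0.0) + smoothing) / total
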
  • 82. Laskowski, Kornel
    et al.
    Edlund, Jens
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Heldner, Mattias
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Learning prosodic sequences using the fundamental frequency variation spectrum (2008). In: Proceedings of the Speech Prosody 2008 Conference, Campinas, Brazil: Editora RG/CNPq, 2008, p. 151-154. Conference paper (Refereed)
    Abstract [en]

    We investigate a recently introduced vector-valued representation of fundamental frequency variation, whose properties appear to be well-suited for statistical sequence modeling. We show what the representation looks like, and apply hidden Markov models to learn prosodic sequences characteristic of higher-level turn-taking phenomena. Our analysis shows that the models learn exactly those characteristics which have been reported for the phenomena in the literature. Further refinements to the representation lead to 12-17% relative improvement in speaker change prediction for conversational spoken dialogue systems.

  • 83. Laskowski, Kornel
    et al.
    Heldner, Mattias
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Edlund, Jens
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    A general-purpose 32 ms prosodic vector for Hidden Markov Modeling (2009). In: Proceedings of Interspeech 2009, Brighton, UK: ISCA, 2009, p. 724-729. Conference paper (Refereed)
    Abstract [en]

    Prosody plays a central role in communicating via speech, making it important for speech technologies to model. Unfortunately, the application of standard modeling techniques to the acoustics of prosody has been hindered by difficulties in modeling intonation. In this work, we explore the suitability of the recently introduced fundamental frequency variation (FFV) spectrum as a candidate general representation of tone. Experiments on 4 tasks demonstrate that FFV features are complementary to other acoustic measures of prosody and that hidden Markov models offer a suitable modeling paradigm. Proposed improvements yield a 35% relative decrease in error on unseen data and simultaneously reduce time complexity by more than an order of magnitude. The resulting representation is sufficiently mature for general deployment in a broad range of automatic speech processing applications.

  • 84.
    Laskowski, Kornel
    et al.
    Carnegie Mellon University; Universität Karlsruhe.
    Heldner, Mattias
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Edlund, Jens
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Exploring the prosody of floor mechanisms in English using the fundamental frequency variation spectrum (2009). In: Proceedings of the 2009 European Signal Processing Conference (EUSIPCO-2009), Glasgow, Scotland, 2009, p. 2539-2543. Conference paper (Refereed)
    Abstract [en]

    A basic requirement for participation in conversation is the ability to jointly manage interaction. Examples of interaction management include indications to acquire, re-acquire, hold, release, and acknowledge floor ownership, and these are often implemented using specialized dialog act (DA) types. In this work, we explore the prosody of one class of such DA types, known as floor mechanisms, using a methodology based on a recently proposed representation of fundamental frequency variation (FFV). Models over the representation illustrate significant differences between floor mechanisms and other dialog act types, and lead to automatic detection accuracies in equal-prior test data of up to 75%. Analysis indicates that FFV modeling offers a useful tool for the discovery of prosodic phenomena which are not explicitly labeled in the audio.

  • 85. Laskowski, Kornel
    et al.
    Heldner, Mattias
    Stockholm University, Stockholm, Sweden.
    Edlund, Jens
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    On the dynamics of overlap in multi-party conversation (2012). In: 13th Annual Conference of the International Speech Communication Association 2012, INTERSPEECH 2012, 2012, p. 846-849. Conference paper (Refereed)
    Abstract [en]

    Overlap, although short in duration, occurs frequently in multiparty conversation. We show that its duration is approximately log-normal, and inversely proportional to the number of simultaneously speaking parties. Using a simple model, we demonstrate that simultaneous talk tends to end simultaneously less frequently than it begins simultaneously, leading to an arrow of time in chronograms constructed from speech activity alone. The asymmetry is significant and discriminative. It appears to be due to dialog acts which do not carry propositional content, and those which are not brought to completion.

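    The log-normality claim is easy to check on one's own data with a two-parameter fit. A minimal sketch (function name and input format are assumptions), taking a list of overlap durations in seconds:

        import numpy as np

        def fit_lognormal(durations_s):
            # Maximum-likelihood fit of a log-normal distribution: the mean
            # and standard deviation of the log-durations are the (mu, sigma)
            # of the underlying normal.
            logs = np.log(np.asarray(durations_s, dtype=float))
            return float(logs.mean()), float(logs.std())
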
  • 86.
    Laskowski, Kornel
    et al.
    Carnegie Mellon University.
    Heldner, Mattias
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Edlund, Jens
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Preliminaries to an account of multi-party conversational turn-taking as an antiferromagnetic spin glass (2010). In: Proceedings of NIPS Workshop on Modeling Human Communication Dynamics, Vancouver, B.C., Canada, 2010. Conference paper (Refereed)
    Abstract [en]

    We present empirical justification of why logistic regression may acceptably approximate, using the number of currently vocalizing interlocutors, the probabilities returned by a time-invariant, conditionally independent model of turn-taking. The resulting parametric model with 3 degrees of freedom is shown to be identical to an infinite-range Ising antiferromagnet, with slow connections, in an external field; it is suitable for undifferentiated-participant scenarios. In differentiated-participant scenarios, untying parameters results in an infinite-range spin glass whose degrees of freedom scale as the square of the number of participants; it offers an efficient representation of participant-pair synchrony. We discuss the implications of model parametrization and of the thermodynamic and feed-forward perceptron formalisms for easily quantifying aspects of conversational dynamics.

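    The logistic approximation described above - conditioning a participant's vocalization probability on the number of currently vocalizing interlocutors - reduces to a few lines. The coefficients below are invented placeholders to be estimated from data, and this sketch keeps fewer parameters than the 3-degree-of-freedom model in the paper.

        import math

        def p_vocalize(n_others_vocalizing, bias=-1.5, coupling=-0.8):
            # Probability that a participant vocalizes in the next time step,
            # given how many interlocutors are vocalizing now. The negative
            # coupling mirrors the antiferromagnetic intuition that
            # simultaneous talk is disfavoured.
            z = bias + coupling * n_others_vocalizing
            return 1.0 / (1.0 + math.exp(-z))
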
  • 87. Laskowski, Kornel
    et al.
    Heldner, Mattias
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Edlund, Jens
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    The fundamental frequency variation spectrum (2008). In: Proceedings of FONETIK 2008, Gothenburg, Sweden: Department of Linguistics, University of Gothenburg, 2008, p. 29-32. Conference paper (Other academic)
    Abstract [en]

    This paper describes a recently introduced vector-valued representation of fundamental frequency variation – the FFV spectrum – which has a number of desirable properties. In particular, it is instantaneous, continuous, distributed, and well suited for application of standard acoustic modeling techniques. We show what the representation looks like, and how it can be used to model prosodic sequences.

  • 88. Laskowski, Kornel
    et al.
    Wölfel, Matthias
    Heldner, Mattias
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Edlund, Jens
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Computing the fundamental frequency variation spectrum in conversational spoken dialogue systems (2008). In: Proceedings of Acoustics'08, Paris, France, 2008, p. 3305-3310. Conference paper (Refereed)
    Abstract [en]

    Continuous modeling of intonation in natural speech has long been hampered by a focus on modeling fundamental frequency, of which several normative aspects are particularly problematic. The latter include, among others, the fact that pitch is undefined in unvoiced segments, that its absolute magnitude is speaker-specific, and that its robust estimation and modeling, at a particular point in time, rely on a patchwork of long-time stability heuristics. In the present work, we continue our analysis of the fundamental frequency variation (FFV) spectrum, a recently proposed instantaneous, continuous, vector-valued representation of pitch variation, which is obtained by comparing the harmonic structure of the frequency magnitude spectra of the left and right half of an analysis frame. We analyze the sensitivity of a task-specific error rate in a conversational spoken dialogue system to the specific definition of the left and right halves of a frame, resulting in operational recommendations regarding the framing policy and window shape.

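    The core operation the abstract describes - comparing the harmonic structure of the magnitude spectra of the left and right halves of an analysis frame - can be roughed out as below. This is an illustrative approximation, not the authors' implementation; window shape, dilation range and normalization are all assumptions.

        import numpy as np

        def ffv_like(frame, n_dilations=17, max_octaves=0.5):
            # Magnitude spectra of the left and right halves of the frame.
            half = len(frame) // 2
            spec_l = np.abs(np.fft.rfft(frame[:half] * np.hanning(half)))
            spec_r = np.abs(np.fft.rfft(frame[half:] * np.hanning(half)))
            bins = np.arange(len(spec_l), dtype=float)
            # Correlate spec_l with frequency-dilated versions of spec_r; the
            # dilation (in octaves) at which they match best reflects the
            # local direction and rate of F0 change.
            rhos = np.linspace(-max_octaves, max_octaves, n_dilations)
            out = []
            for rho in rhos:
                dilated = np.interp(bins * 2.0 ** rho, bins, spec_r, right=0.0)
                denom = np.linalg.norm(spec_l) * np.linalg.norm(dilated)
                out.append(float(spec_l @ dilated) / denom if denom > 0 else 0.0)
            return rhos, np.array(out)
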
  • 89. Oertel, C.
    et al.
    Cummins, F.
    Campbell, N.
    Edlund, Jens
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Wagner, P.
    D64: A corpus of richly recorded conversational interaction (2010). In: Proceedings of LREC 2010 Workshop on Multimodal Corpora: Advances in Capturing, Coding and Analyzing Multimodality / [ed] Kipp, Michael; Martin, Jean-Claude; Paggio, Patrizia; Heylen, Dirk, 2010, p. 27-30. Conference paper (Refereed)
    Abstract [en]

    Rich non-intrusive recording of a naturalistic conversation was conducted in a domestic setting. Four (sometimes five) participants engaged in lively conversation over two 4-hour sessions on two successive days. Conversation was not directed, and ranged widely over topics both trivial and technical. The entire conversation, on both days, was richly recorded using 7 video cameras, 10 audio microphones, and the registration of 3-D head, torso and arm motion using an Optitrack system. To add liveliness to the conversation, several bottles of wine were consumed during the final two hours of recording. The resulting corpus will be of immediate interest to all researchers interested in studying naturalistic, ethologically situated, conversational interaction.

  • 90.
    Oertel, Catharine
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Cummins, Fred
    Edlund, Jens
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Wagner, Petra
    Campbell, Nick
    D64: A corpus of richly recorded conversational interaction (2013). In: Journal on Multimodal User Interfaces, ISSN 1783-7677, E-ISSN 1783-8738, Vol. 7, no. 1-2, p. 19-28. Article in journal (Refereed)
    Abstract [en]

    In recent years there has been a substantial debate about the need for increasingly spontaneous, conversational corpora of spoken interaction that are not controlled or task directed. In parallel the need has arisen for the recording of multi-modal corpora which are not restricted to the audio domain alone. With a corpus that would fulfill both needs, it would be possible to investigate the natural coupling, not only in turn-taking and voice, but also in the movement of participants. In the following paper we describe the design and recording of such a corpus and we provide some illustrative examples of how such a corpus might be exploited in the study of dynamic interaction. The D64 corpus is a multimodal corpus recorded over two successive days. Each day resulted in approximately 4 h of recordings. In total five participants took part in the recordings, of whom two were female and three were male. Seven video cameras were used, of which at least one was trained on each participant. The Optitrack motion capture kit was used in order to enrich the recordings with motion information. The D64 corpus comprises annotations on conversational involvement, speech activity and pauses, as well as information on the average degree of change in the movement of participants.

  • 91.
    Oertel, Catharine
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Götze, Jana
    KTH, School of Computer Science and Communication (CSC), Theoretical Computer Science, TCS.
    Edlund, Jens
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Heldner, Mattias
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    The KTH Games Corpora: How to Catch a Werewolf (2013). In: IVA 2013 Workshop Multimodal Corpora: Beyond Audio and Video: MMC 2013, 2013. Conference paper (Refereed)
  • 92.
    Oertel, Catharine
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Wlodarczak, Marcin
    Edlund, Jens
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Wagner, Petra
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Gaze Patterns in Turn-Taking (2012). In: 13th Annual Conference of the International Speech Communication Association 2012, INTERSPEECH 2012, Vol. 3, Portland, Oregon, US, 2012, p. 2243-2246. Conference paper (Refereed)
    Abstract [en]

    This paper investigates gaze patterns in turn-taking. We focus on differences between speaker changes resulting in silences and overlaps. We also investigate gaze patterns around backchannels and around silences not involving speaker changes.

  • 93. Renklint, Elisabet
    et al.
    Cardell, Fanny
    Dahlbäck, Johanna
    Edlund, Jens
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Heldner, Mattias
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Conversational gaze in light and darkness (2012). In: Proc. of Fonetik 2012, Gothenburg, Sweden, 2012, p. 59-60. Conference paper (Other academic)
    Abstract [en]

    The way we use our gaze in face-to-face interaction is an important part of our social behavior. This exploratory study investigates the relationship between mutual gaze and joint silences and overlaps, where speaker changes and backchannels often occur. Seven dyadic conversations between two persons were recorded in a studio. Gaze patterns were annotated in ELAN to find instances of mutual gaze. Part of the study was conducted in total darkness as a way to observe what happens to our gaze patterns when we cannot see our interlocutor, although the physical face-to-face condition is upheld. The results show a difference in the frequency of mutual gaze in conversation in light and darkness.

  • 94. Sikveland, Rein-Ove
    et al.
    Öttl, Anton
    Amdal, Ingunn
    Ernestus, Mirjam
    Svendsen, Thorbjørn
    Edlund, Jens
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Spontal-N: A Corpus of Interactional Spoken Norwegian (2010). In: Proc. of the Seventh Conference on International Language Resources and Evaluation (LREC'10) / [ed] Calzolari, Nicoletta; Choukri, Khalid; Maegaard, Bente; Mariani, Joseph; Odjik, Jan; Piperidis, Stelios; Rosner, Mike; Tapias, Daniel, 2010, p. 2986-2991. Conference paper (Refereed)
    Abstract [en]

    Spontal-N is a corpus of spontaneous, interactional Norwegian. To our knowledge, it is the first corpus of Norwegian in which the majority of speakers have spent significant parts of their lives in Sweden, and in which the recorded speech displays varying degrees of interference from Swedish. The corpus consists of studio quality audio- and video-recordings of four 30-minute free conversations between acquaintances, and a manual orthographic transcription of the entire material. On the basis of the orthographic transcriptions, we automatically annotated approximately 50 percent of the material on the phoneme level, by means of a forced alignment between the acoustic signal and pronunciations listed in a dictionary. Approximately seven percent of the automatic transcription was manually corrected. Taking the manual correction as a gold standard, we evaluated several sources of pronunciation variants for the automatic transcription. Spontal-N is intended as a general purpose speech resource that is also suitable for investigating phonetic detail.

  • 95.
    Skantze, Gabriel
    et al.
    KTH, Superseded Departments, Speech, Music and Hearing.
    Edlund, Jens
    KTH, Superseded Departments, Speech, Music and Hearing.
    Early error detection on word level. 2004. In: Proceedings of ISCA Tutorial and Research Workshop (ITRW) on Robustness Issues in Conversational Interaction, 2004. Conference paper (Refereed)
    Abstract [en]

    This paper presents two studies examining the detection of speech recognition errors on the word level. In the first study, memory-based and transformation-based machine learning was applied to the task, using confidence, lexical, contextual and discourse features. In the second study, we investigated which factors humans benefit from when detecting errors; the factors investigated were information from the speech recogniser (i.e. word confidence scores and 5-best lists) and contextual information. The results show that word confidence scores are useful and that lexical and contextual features (both from the utterance and from the discourse) further improve performance.
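
    As a hedged sketch of the classification task in the first study: word-level error detection can be framed as binary classification over per-word features. The paper used memory-based and transformation-based learners; a scikit-learn decision tree stands in below, and the toy features are illustrative assumptions rather than the paper's feature set.

    ```python
    # Word-level ASR error detection as binary classification (sketch).
    # Features per recognised word: [confidence, word_length, prev_confidence].
    # These toy features and data are invented for illustration.

    from sklearn.tree import DecisionTreeClassifier

    X_train = [
        [0.92, 5, 0.88],
        [0.31, 2, 0.90],
        [0.85, 7, 0.40],
        [0.22, 3, 0.25],
    ]
    y_train = [0, 1, 0, 1]  # 1 = misrecognised word

    clf = DecisionTreeClassifier(max_depth=2).fit(X_train, y_train)
    print(clf.predict([[0.28, 4, 0.80]]))  # a low-confidence word, likely flagged
    ```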

  • 96.
    Skantze, Gabriel
    et al.
    KTH, Superseded Departments, Speech, Music and Hearing.
    Edlund, Jens
    KTH, Superseded Departments, Speech, Music and Hearing.
    Robust interpretation in the Higgins spoken dialogue system. 2004. In: Proceedings of ISCA Tutorial and Research Workshop (ITRW) on Robustness Issues in Conversational Interaction, 2004. Conference paper (Refereed)
    Abstract [en]

    This paper describes Pickering, the semantic interpreter developed in the Higgins project - a research project on error handling in spoken dialogue systems. In the project, the initial efforts are centred on the input side of the system. The semantic interpreter combines a rich set of robustness techniques with the production of deep semantic structures. It allows insertions and non-agreement inside phrases, and combines partial results to return a limited list of semantically distinct solutions. A preliminary evaluation shows that the interpreter performs well under error conditions, and that the built-in robustness techniques contribute to this performance.
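
    To make the robustness idea concrete, the sketch below (not Pickering itself) matches a semantic phrase pattern while tolerating a bounded number of inserted tokens inside the phrase, and keeps only semantically distinct solutions. The patterns, frames, and insertion bound are invented for illustration.

    ```python
    # Sketch: phrase matching with bounded in-phrase insertions, plus
    # deduplication of semantically equivalent results.

    def match_with_insertions(pattern, tokens, max_ins=2):
        """Match pattern words in order, allowing up to max_ins inserted
        (skipped) tokens between the matched words."""
        for start, tok in enumerate(tokens):
            if tok != pattern[0]:
                continue
            pos, ins, ok = start, 0, True
            for word in pattern[1:]:
                j = pos + 1
                while j < len(tokens) and tokens[j] != word:
                    ins += 1
                    j += 1
                if j == len(tokens) or ins > max_ins:
                    ok = False
                    break
                pos = j
            if ok:
                return True
        return False

    utterance = "the red uh big building on the left".split()
    patterns = {
        ("red", "building"): {"type": "building", "color": "red"},
        ("left",): {"direction": "left"},
    }

    solutions = []
    for pat, frame in patterns.items():
        if match_with_insertions(list(pat), utterance) and frame not in solutions:
            solutions.append(frame)  # keep semantically distinct solutions only
    print(solutions)
    # [{'type': 'building', 'color': 'red'}, {'direction': 'left'}]
    ```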

  • 97.
    Skantze, Gabriel
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Edlund, Jens
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Carlson, Rolf
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Talking with Higgins: Research challenges in a spoken dialogue system. 2006. In: Perception and Interactive Technologies, Proceedings / [ed] Andre, E; Dybkjaer, L; Minker, W; Neumann, H; Weber, M, Berlin: Springer-Verlag Berlin, 2006, Vol. 4021, p. 193-196. Conference paper (Refereed)
    Abstract [en]

    This paper presents the current status of the research in the Higgins project and provides background for a demonstration of the spoken dialogue system implemented within the project. The project represents the latest development in the ongoing dialogue systems research at KTH. The practical goal of the project is to build collaborative conversational dialogue systems in which research issues such as error handling techniques can be tested empirically.

  • 98.
    Skantze, Gabriel
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    House, David
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Edlund, Jens
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Grounding and prosody in dialog. 2006. In: Working Papers 52: Proceedings of Fonetik 2006, Lund, Sweden: Lund University, Centre for Languages & Literature, Dept. of Linguistics & Phonetics, 2006, p. 117-120. Conference paper (Other academic)
    Abstract [en]

    In a previous study we demonstrated that subjects could use prosodic features (primarily peak height and alignment) to make different interpretations of synthesized fragmentary grounding utterances. In the present study we test the hypothesis that subjects also change their behavior accordingly in a human-computer dialog setting. We report on an experiment in which subjects participate in a color-naming task in a Wizard-of-Oz controlled human-computer dialog in Swedish. The results show that two annotators were able to categorize the subjects' responses based on pragmatic meaning. Moreover, the subjects' response times differed significantly, depending on the prosodic features of the grounding fragment spoken by the system.

  • 99.
    Skantze, Gabriel
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    House, David
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Edlund, Jens
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    User Responses to Prosodic Variation in Fragmentary Grounding Utterances in Dialog. 2006. In: Interspeech 2006 and 9th International Conference on Spoken Language Processing, Baixas: ISCA - International Speech Communication Association, 2006, p. 2002-2005. Conference paper (Refereed)
    Abstract [en]

    In a previous study we demonstrated that subjects could use prosodic features (primarily peak height and alignment) to make different interpretations of synthesized fragmentary grounding utterances. In the present study we test the hypothesis that subjects also change their behavior accordingly in a human-computer dialog setting. We report on an experiment in which subjects participate in a color-naming task in a Wizard-of-Oz controlled human-computer dialog in Swedish. The results show that two annotators were able to categorize the subjects' responses based on pragmatic meaning. Moreover, the subjects' response times differed significantly, depending on the prosodic features of the grounding fragment spoken by the system.
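
    As a hedged illustration of the kind of analysis reported (not the paper's actual data or test), response times under different prosodic variants of the grounding fragment can be compared with a one-way ANOVA:

    ```python
    # Made-up response latencies (seconds) under three prosodic variants of
    # the grounding fragment; the data and the choice of test are
    # illustrative assumptions only.

    from scipy.stats import f_oneway

    rt_variant_1 = [0.8, 0.9, 0.7, 1.0]
    rt_variant_2 = [1.4, 1.6, 1.3, 1.5]
    rt_variant_3 = [1.2, 1.1, 1.3, 1.0]

    stat, p = f_oneway(rt_variant_1, rt_variant_2, rt_variant_3)
    print(f"F = {stat:.2f}, p = {p:.4f}")
    ```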

  • 100. Strömbergsson, S.
    et al.
    Edlund, Jens
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Götze, Jana
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH. Karolinska Institutet (KI), Sweden.
    Björkenstam, K. N.
    Approximating phonotactic input in children's linguistic environments from orthographic transcripts. 2017. In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH 2017, International Speech Communication Association, 2017, Vol. 2017, p. 2213-2217. Conference paper (Refereed)
    Abstract [en]

    Child-directed spoken data is the ideal source of support for claims about children's linguistic environments. However, phonological transcriptions of child-directed speech are scarce, compared to sources like adult-directed speech or text data. Acquiring reliable descriptions of children's phonological environments from more readily accessible sources would mean considerable savings of time and money. The first step towards this goal is to quantify the reliability of descriptions derived from such secondary sources. We investigate how phonological distributions vary across different modalities (spoken vs. written), and across the age of the intended audience (children vs. adults). Using a previously unseen collection of Swedish adult- and child-directed spoken and written data, we combine lexicon look-up and grapheme-to-phoneme conversion to approximate phonological characteristics. The analysis shows distributional differences across datasets both for single phonemes and for longer phoneme sequences. Some of these are predictably attributed to lexical and contextual characteristics of text vs. speech. The generated phonological transcriptions are remarkably reliable. The differences in phonological distributions between child-directed speech and secondary sources highlight a need for compensatory measures when relying on written data or on adult-directed spoken data, and/or for continued collection of actual child-directed speech in research on children's language environments.
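
    A minimal sketch of the approximation pipeline the abstract describes: look each word up in a pronunciation lexicon, fall back to grapheme-to-phoneme conversion for out-of-vocabulary words, then compute phoneme-sequence distributions. The tiny lexicon and the naive one-phoneme-per-letter fallback are placeholders, not the Swedish resources used in the paper; a real pipeline would likely also insert word-boundary symbols so that bigrams do not cross word boundaries.

    ```python
    # Lexicon look-up with a G2P fallback, then phoneme bigram statistics.

    from collections import Counter

    LEXICON = {"katt": ["k", "a", "t"], "hus": ["h", "ʉ", "s"]}  # placeholder

    def naive_g2p(word):
        # Placeholder fallback: one phoneme per letter. A real system would
        # call a trained grapheme-to-phoneme converter here.
        return list(word)

    def phonemize(words):
        phonemes = []
        for w in words:
            phonemes.extend(LEXICON.get(w, naive_g2p(w)))
        return phonemes

    def bigram_distribution(phonemes):
        counts = Counter(zip(phonemes, phonemes[1:]))
        total = sum(counts.values())
        return {bg: c / total for bg, c in counts.items()}

    ph = phonemize(["katt", "hus"])
    print(bigram_distribution(ph))  # relative frequencies of phoneme bigrams
    ```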
