51 - 90 of 90
  • 51.
    Meena, Raveesh
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Skantze, Gabriel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Gustafsson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Data-driven models for timing feedback responses in a Map Task dialogue system. 2014. In: Computer speech & language (Print), ISSN 0885-2308, E-ISSN 1095-8363, Vol. 28, no. 4, p. 903-922. Article in journal (Refereed)
    Abstract [en]

    Traditional dialogue systems use a fixed silence threshold to detect the end of users' turns. Such a simplistic model can result in system behaviour that is both interruptive and unresponsive, which in turn affects user experience. Various studies have observed that human interlocutors take cues from speaker behaviour, such as prosody, syntax, and gestures, to coordinate smooth exchange of speaking turns. However, little effort has been made towards implementing these models in dialogue systems and verifying how well they model the turn-taking behaviour in human computer interactions. We present a data-driven approach to building models for online detection of suitable feedback response locations in the user's speech. We first collected human computer interaction data using a spoken dialogue system that can perform the Map Task with users (albeit using a trick). On this data, we trained various models that use automatically extractable prosodic, contextual and lexico-syntactic features for detecting response locations. Next, we implemented a trained model in the same dialogue system and evaluated it in interactions with users. The subjective and objective measures from the user evaluation confirm that a model trained on speaker behavioural cues offers both smoother turn-transitions and more responsive system behaviour.
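    The approach above lends itself to a standard supervised-learning setup. The following is a minimal, hypothetical sketch (scikit-learn, with illustrative feature names and synthetic data, not the authors' code or feature set) of how a classifier for feedback response locations could be trained on automatically extractable prosodic and contextual features:

```python
# Illustrative sketch (not the authors' code): training a classifier to detect
# suitable feedback response locations from automatically extractable features.
# Feature names and data are hypothetical placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Each row is a candidate location (e.g. a detected silence); columns are
# hypothetical prosodic/contextual features: final pitch slope, final intensity,
# pause duration so far, and number of words since the last system response.
X = rng.normal(size=(500, 4))
# Label: 1 if a feedback response was given here in the corpus, else 0.
y = (X[:, 0] - 0.5 * X[:, 2] + rng.normal(scale=0.5, size=500) > 0).astype(int)

clf = LogisticRegression()
print("cross-validated accuracy:", cross_val_score(clf, X, y, cv=5).mean())
```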

  • 52. Mirnig, Nicole
    et al.
    Weiss, Astrid
    Skantze, Gabriel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Al Moubayed, Samer
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Granström, Björn
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Tscheligi, Manfred
    Face-To-Face With A Robot: What do we actually talk about? 2013. In: International Journal of Humanoid Robotics, ISSN 0219-8436, Vol. 10, no. 1, p. 1350011. Article in journal (Refereed)
    Abstract [en]

    While much of the state-of-the-art research in human-robot interaction (HRI) investigates task-oriented interaction, this paper aims at exploring what people talk about to a robot if the content of the conversation is not predefined. We used the robot head Furhat to explore the conversational behavior of people who encounter a robot in the public setting of a robot exhibition in a scientific museum, but without a predefined purpose. Upon analyzing the conversations, it could be shown that a sophisticated robot provides an inviting atmosphere for people to engage in interaction and to be experimental and challenge the robot's capabilities. Many visitors to the exhibition were willing to go beyond the guiding questions that were provided as a starting point. Amongst other things, they asked Furhat questions concerning the robot itself, such as how it would define a robot, or if it plans to take over the world. People were also interested in the feelings and likes of the robot and they asked many personal questions - this is how Furhat ended up with its first marriage proposal. People who talked to Furhat were asked to complete a questionnaire on their assessment of the conversation, with which we could show that the interaction with Furhat was rated as a pleasant experience.

  • 53.
    Peters, Christopher
    et al.
    KTH.
    Li, Chengjie
    KTH.
    Yang, Fangkai
    KTH.
    Avramova, Vanya
    KTH.
    Skantze, Gabriel
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Investigating Social Distances between Humans, Virtual Humans and Virtual Robots in Mixed Reality. 2018. In: Proceedings of 17th International Conference on Autonomous Agents and MultiAgent Systems, 2018, p. 2247-2249. Conference paper (Refereed)
    Abstract [en]

    Mixed reality environments offer new potentials for the design of compelling social interaction experiences with virtual characters. In this paper, we summarise initial experiments we are conducting in which we measure comfortable social distances between humans, virtual humans and virtual robots in mixed reality environments. We consider a scenario in which participants walk within a comfortable distance of a virtual character that has its appearance varied between a male and female human, and a standard- and human-height virtual Pepper robot. Our studies in mixed reality thus far indicate that humans adopt social zones with artificial agents that are similar in manner to human-human social interactions and interactions in virtual reality.

  • 54. Roddy, M.
    et al.
    Skantze, Gabriel
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Harte, N.
    Investigating speech features for continuous turn-taking prediction using LSTMs. 2018. In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, International Speech Communication Association, 2018, p. 586-590. Conference paper (Refereed)
    Abstract [en]

    For spoken dialog systems to conduct fluid conversational interactions with users, the systems must be sensitive to turn-taking cues produced by a user. Models should be designed so that effective decisions can be made as to when it is appropriate, or not, for the system to speak. Traditional end-of-turn models, where decisions are made at utterance end-points, are limited in their ability to model fast turn-switches and overlap. A more flexible approach is to model turn-taking in a continuous manner using RNNs, where the system predicts speech probability scores for discrete frames within a future window. The continuous predictions represent generalized turn-taking behaviors observed in the training data and can be applied to make decisions that are not just limited to end-of-turn detection. In this paper, we investigate optimal speech-related feature sets for making predictions at pauses and overlaps in conversation. We find that while traditional acoustic features perform well, part-of-speech features generally perform worse than word features. We show that our current models outperform previously reported baselines.
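    As a rough illustration of the continuous prediction scheme described in the abstract, the sketch below (PyTorch, with purely illustrative dimensions and features, not the authors' implementation) shows an LSTM that outputs a speech probability score for each frame in a fixed future window at every time step:

```python
# A minimal sketch of continuous turn-taking prediction: an LSTM reads
# frame-level features and, at every frame, predicts speech probabilities
# for each frame in a fixed future window. Dimensions are illustrative.
import torch
import torch.nn as nn

class ContinuousTurnTakingLSTM(nn.Module):
    def __init__(self, n_features=40, hidden=64, future_frames=20):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.out = nn.Linear(hidden, future_frames)  # one score per future frame

    def forward(self, x):
        h, _ = self.lstm(x)                # (batch, time, hidden)
        return torch.sigmoid(self.out(h))  # (batch, time, future_frames)

model = ContinuousTurnTakingLSTM()
frames = torch.randn(1, 100, 40)           # 100 frames of acoustic/word features
probs = model(frames)
print(probs.shape)                         # torch.Size([1, 100, 20])
```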

  • 55. Roddy, Matthew
    et al.
    Skantze, Gabriel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Harte, Naomi
    Multimodal Continuous Turn-Taking Prediction Using Multiscale RNNs. 2018. In: ICMI 2018 - Proceedings of the 2018 International Conference on Multimodal Interaction, 2018, p. 186-190. Conference paper (Refereed)
    Abstract [en]

    In human conversational interactions, turn-taking exchanges can be coordinated using cues from multiple modalities. To design spoken dialog systems that can conduct fluid interactions it is desirable to incorporate cues from separate modalities into turn-taking models. We propose that there is an appropriate temporal granularity at which modalities should be modeled. We design a multiscale RNN architecture to model modalities at separate timescales in a continuous manner. Our results show that modeling linguistic and acoustic features at separate temporal rates can be beneficial for turn-taking modeling. We also show that our approach can be used to incorporate gaze features into turn-taking models.
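    A hedged sketch of the multiscale idea follows, under the assumption that acoustic features arrive at frame rate and linguistic features at word rate (PyTorch, illustrative dimensions, naive upsampling for alignment, not the paper's architecture):

```python
# Acoustic features are modelled by a fast RNN at frame rate and linguistic
# features by a slower RNN at word rate; their time-aligned states are fused
# before prediction. Alignment is simplified to repeating the slow states.
import torch
import torch.nn as nn

class MultiscaleTurnTaking(nn.Module):
    def __init__(self, acoustic_dim=40, word_dim=100, hidden=64, window=20):
        super().__init__()
        self.fast = nn.LSTM(acoustic_dim, hidden, batch_first=True)
        self.slow = nn.LSTM(word_dim, hidden, batch_first=True)
        self.out = nn.Linear(2 * hidden, window)

    def forward(self, acoustic, words, frames_per_word=10):
        hf, _ = self.fast(acoustic)                       # (B, T, H)
        hs, _ = self.slow(words)                          # (B, T/frames_per_word, H)
        hs_up = hs.repeat_interleave(frames_per_word, 1)  # naive upsampling to frame rate
        fused = torch.cat([hf, hs_up[:, :hf.size(1)]], dim=-1)
        return torch.sigmoid(self.out(fused))

model = MultiscaleTurnTaking()
probs = model(torch.randn(1, 100, 40), torch.randn(1, 10, 100))
print(probs.shape)  # torch.Size([1, 100, 20])
```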

  • 56. Schlangen, D.
    et al.
    Baumann, T.
    Buschmeier, H.
    Buss, O.
    Kopp, S.
    Skantze, Gabriel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Yaghoubzadeh, R.
    Middleware for Incremental Processing in Conversational Agents. 2010. In: Proceedings of SIGDIAL 2010: the 11th Annual Meeting of the Special Interest Group on Discourse and Dialogue, 2010, p. 51-54. Conference paper (Refereed)
    Abstract [en]

    We describe work done at three sites on designing conversational agents capable of incremental processing. We focus on the ‘middleware’ layer in these systems, which takes care of passing around and maintaining incremental information between the modules of such agents. All implementations are based on the abstract model of incremental dialogue processing proposed by Schlangen and Skantze (2009), and the paper shows what different instantiations of the model can look like given specific requirements and application areas.
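    To make the middleware idea concrete, here is a toy sketch (hypothetical class names, not any of the three implementations) of incremental units being added, revoked and committed in a buffer shared between modules, following the abstract model of Schlangen and Skantze (2009):

```python
# Toy middleware sketch: modules exchange "incremental units" (IUs) that can
# be added, revoked, or committed; listeners on a shared buffer are notified
# of each operation. Names are hypothetical.
from dataclasses import dataclass

@dataclass
class IncrementalUnit:
    payload: str
    committed: bool = False
    revoked: bool = False

class Buffer:
    """Right buffer of one module = left buffer of the next."""
    def __init__(self):
        self.units = []
        self.listeners = []

    def add(self, iu):
        self.units.append(iu)
        self._notify("add", iu)

    def revoke(self, iu):
        iu.revoked = True
        self._notify("revoke", iu)

    def commit(self, iu):
        iu.committed = True
        self._notify("commit", iu)

    def _notify(self, op, iu):
        for listener in self.listeners:
            listener(op, iu)

# Example: an ASR module adds a hypothesis, revises it, then adds a new one.
asr_output = Buffer()
asr_output.listeners.append(lambda op, iu: print(op, iu.payload))
hyp = IncrementalUnit("four")
asr_output.add(hyp)
asr_output.revoke(hyp)                    # recognition revised
asr_output.add(IncrementalUnit("forty"))
```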

  • 57. Schlangen, D.
    et al.
    Skantze, Gabriel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    A General, Abstract Model of Incremental Dialogue Processing. 2011. In: Dialogue & Discourse, ISSN 2152-9620, Vol. 2, no. 1, p. 83-111. Article in journal (Refereed)
  • 58. Schlangen, D.
    et al.
    Skantze, Gabriel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    A general, abstract model of incremental dialogue processing. 2009. In: EACL 2009 - 12th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings, Association for Computational Linguistics, 2009, p. 710-718. Conference paper (Refereed)
    Abstract [en]

    We present a general model and conceptual framework for specifying architectures for incremental processing in dialogue systems, in particular with respect to the topology of the network of modules that make up the system, the way information flows through this network, how information increments are 'packaged', and how these increments are processed by the modules. This model enables the precise specification of incremental systems and hence facilitates detailed comparisons between systems, as well as giving guidance on designing new systems.

  • 59.
    Shore, Todd
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Androulakaki, Theofronia
    Skantze, Gabriel
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    KTH Tangrams: A Dataset for Research on Alignment and Conceptual Pacts in Task-Oriented Dialogue. 2019. In: LREC 2018 - 11th International Conference on Language Resources and Evaluation, Tokyo, 2019, p. 768-775. Conference paper (Refereed)
    Abstract [en]

    There is a growing body of research focused on task-oriented instructor-manipulator dialogue, whereby one dialogue participant initiates a reference to an entity in a common environment while the other participant must resolve this reference in order to manipulate said entity. Many of these works are based on disparate if nevertheless similar datasets. This paper describes an English corpus of referring expressions in relatively free, unrestricted dialogue with physical features generated in a simulation, which facilitates analysis of dialogic linguistic phenomena regarding alignment in the formation of referring expressions known as conceptual pacts.

  • 60.
    Shore, Todd
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Skantze, Gabriel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Enhancing reference resolution in dialogue using participant feedback. 2017. In: Proc. GLU 2017 International Workshop on Grounding Language Understanding, Stockholm, Sweden: International Speech Communication Association, 2017, p. 78-82. Conference paper (Refereed)
    Abstract [en]

    Expressions used to refer to entities in a common environment do not originate solely from one participant in a dialogue but are formed collaboratively. It is possible to train a model for resolving these referring expressions (REs) in a static manner using an appropriate corpus, but, due to the collaborative nature of their formation, REs are highly dependent not only on attributes of the referent in question (e.g. color, shape) but also on the dialogue participants themselves. As a proof of concept, we improved the accuracy of a words-as-classifiers logistic regression model by incorporating knowledge about accepting/rejecting REs proposed by other participants.
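    For illustration, a minimal words-as-classifiers sketch is given below (scikit-learn, hypothetical referent features and training data, not the paper's model or its feedback-based update): each word gets its own classifier over referent features, and candidate referents are ranked by their combined word scores.

```python
# Words-as-classifiers sketch: one tiny classifier per word over referent
# features (here [red, green, square, circle]); a referring expression is
# resolved to the referent with the highest average word score.
import numpy as np
from sklearn.linear_model import LogisticRegression

referents = {"red square": [1, 0, 1, 0], "green circle": [0, 1, 0, 1]}

training = {
    "red":    ([[1, 0, 1, 0], [1, 0, 0, 1], [0, 1, 1, 0], [0, 1, 0, 1]], [1, 1, 0, 0]),
    "square": ([[1, 0, 1, 0], [0, 1, 1, 0], [1, 0, 0, 1], [0, 1, 0, 1]], [1, 1, 0, 0]),
}
word_classifiers = {w: LogisticRegression().fit(X, y) for w, (X, y) in training.items()}

def resolve(expression, referents):
    scores = {}
    for name, feats in referents.items():
        probs = [word_classifiers[w].predict_proba([feats])[0, 1]
                 for w in expression.split() if w in word_classifiers]
        scores[name] = float(np.mean(probs)) if probs else 0.0
    return max(scores, key=scores.get), scores

print(resolve("red square", referents))
```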

  • 61. Shore, Todd
    et al.
    Skantze, Gabriel
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Using Lexical Alignment and Referring Ability to Address Data Sparsity in Situated Dialog Reference Resolution. 2018. In: Proceedings of 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2018, p. 2288-2297. Conference paper (Refereed)
  • 62.
    Skantze, Gabriel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    A Testbed for Examining the Timing of Feedback using a Map Task. 2012. In: Proceedings of the Interdisciplinary Workshop on Feedback Behaviors in Dialog, Portland, OR, 2012. Conference paper (Refereed)
    Abstract [en]

    In this paper, we present a fully automated spoken dialogue system that can perform the Map Task with a user. By implementing a trick, the system can convincingly act as an attentive listener, without any speech recognition. An initial study is presented where we let users interact with the system and recorded the interactions. Using this data, we have then trained a Support Vector Machine on the task of identifying appropriate locations to give feedback, based on automatically extractable prosodic and contextual features. 200 ms after the end of the user’s speech, the model may identify response locations with an accuracy of 75%, as compared to a baseline of 56.3%.

  • 63.
    Skantze, Gabriel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Error Handling in Spoken Dialogue Systems: Managing Uncertainty, Grounding and Miscommunication. 2007. Doctoral thesis, monograph (Other scientific)
    Abstract [en]

    Due to the large variability in the speech signal, the speech recognition process constitutes the major source of errors in most spoken dialogue systems. A spoken dialogue system can never know for certain what the user is saying, it can only make hypotheses. As a result of this uncertainty, two types of errors can be made: over-generation of hypotheses, which leads to misunderstanding, and under-generation, which leads to non-understanding. In human-human dialogue, speakers try to minimise such miscommunication by constantly sending and picking up signals about their understanding, a process commonly referred to as grounding.

    The topic of this thesis is how to deal with this uncertainty in spoken dialogue systems: how to detect errors in speech recognition results, how to recover from non-understanding, how to choose when to engage in grounding, how to model the grounding process, how to realise grounding utterances and how to detect and repair misunderstandings. The approach in this thesis is to explore and draw lessons from human error handling, and to study how error handling may be performed in different parts of a complete spoken dialogue system. These studies are divided into three parts.

    In the first part, an experimental setup is presented in which a speech recogniser is used to induce errors in human-human dialogues. The results show that, unlike the behaviour of most dialogue systems, humans tend to employ other strategies than encouraging the interlocutor to repeat when faced with non-understandings. The collected data is also used in a follow-up experiment to explore which factors humans may benefit from when detecting errors in speech recognition results. Two machine learning algorithms are also used for the task.

    In the second part, the spoken dialogue system HIGGINS is presented, including the robust semantic interpreter PICKERING and the error aware discourse modeller GALATEA. It is shown how grounding is modelled and error handling is performed on the concept level. The system may choose to display its understanding of individual concepts, pose fragmentary clarification requests, or risk a misunderstanding and possibly detect and repair it at a later stage. An evaluation of the system with naive users indicates that the system performs well under error conditions.

    In the third part, models for choosing when to engage in grounding and how to realise grounding utterances are presented. A decision-theoretic, data-driven model for making grounding decisions is applied to the data from the evaluation of the HIGGINS system. Finally, two experiments are presented, which explore how the intonation of synthesised fragmentary grounding utterances affect their pragmatic meaning.

    The primary target of this thesis is the management of uncertainty, grounding and miscommunication in conversational dialogue systems, which to a larger extent build upon the principles of human conversation. However, many of the methods, models and results presented should also be applicable to dialogue systems in general.

  • 64.
    Skantze, Gabriel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Exploring human error recovery strategies: Implications for spoken dialogue systems. 2005. In: Speech Communication, ISSN 0167-6393, E-ISSN 1872-7182, Vol. 45, no. 3, p. 325-341. Article in journal (Refereed)
    Abstract [en]

    In this study, an explorative experiment was conducted in which subjects were asked to give route directions to each other in a simulated campus (similar to Map Task). In order to elicit error handling strategies, a speech recogniser was used to corrupt the speech in one direction. This way, data could be collected on how the subjects might recover from speech recognition errors. This method for studying error handling has the advantages that the level of understanding is transparent to the analyser, and the errors that occur are similar to errors in spoken dialogue systems. The results show that when subjects face speech recognition problems, a common strategy is to ask task-related questions that confirm their hypothesis about the situation instead of signalling non-understanding. Compared to other strategies, such as asking for a repetition, this strategy leads to better understanding of subsequent utterances, whereas signalling non-understanding leads to decreased experience of task success.

  • 65.
    Skantze, Gabriel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Galatea: A discourse modeller supporting concept-level error handling in spoken dialogue systems. 2008. In: Recent Trends in Discourse and Dialogue / [ed] Dybkjær, L.; Minker, W., Dordrecht: Springer Science + Business Media B.V., 2008. Chapter in book (Refereed)
  • 66.
    Skantze, Gabriel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Galatea: a discourse modeller supporting concept-level error handling in spoken dialogue systems. 2005. In: 6th SIGdial Workshop on Discourse and Dialogue, Lisbon, Portugal, 2005, p. 178-189. Conference paper (Refereed)
    Abstract [en]

    In this paper, a discourse modeller for conversational spoken dialogue systems, called GALATEA, is presented. Apart from handling the resolution of ellipses and anaphora, it tracks the “grounding status” of concepts that are mentioned during the discourse, i.e. information about who said what when. This grounding information also contains concept confidence scores that are derived from the speech recogniser word confidence scores. The discourse model may then be used for concept-level error handling, i.e. grounding of concepts, fragmentary clarification requests, and detection of erroneous concepts in the model at later stages in the dialogue.
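    A hedged illustration of this kind of concept-level bookkeeping is sketched below (hypothetical names and thresholds, not GALATEA's actual data structures): each concept carries a confidence score derived from word confidences, and a concept-level grounding action is chosen from that score.

```python
# Concept-level grounding bookkeeping: a concept records who mentioned it,
# when, with what confidence, and whether it has been grounded yet.
from dataclasses import dataclass

@dataclass
class Concept:
    name: str
    value: str
    confidence: float       # e.g. mean of the supporting word confidence scores
    mentioned_by: str       # "user" or "system"
    turn: int
    grounded: bool = False  # set True once the system has displayed its understanding

def grounding_action(concept, accept_threshold=0.8, clarify_threshold=0.4):
    """Pick a concept-level grounding action from the confidence score."""
    if concept.confidence >= accept_threshold:
        return "accept"     # risk a misunderstanding, repair later if needed
    if concept.confidence >= clarify_threshold:
        return "clarify"    # fragmentary clarification request
    return "reject"         # treat as non-understanding

c = Concept("colour", "red", confidence=0.55, mentioned_by="user", turn=3)
print(grounding_action(c))  # clarify
```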

  • 67.
    Skantze, Gabriel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Making grounding decisions: Data-driven estimation of dialogue costs and confidence thresholds. 2007. In: Proceedings of the 8th SIGdial Workshop on Discourse and Dialogue, 2007, p. 206-210. Conference paper (Refereed)
    Abstract [en]

    This paper presents a data-driven decision-theoretic approach to making grounding decisions in spoken dialogue systems, i.e., to decide which recognition hypotheses to consider as correct and which grounding action to take. Based on task analysis of the dialogue domain, cost functions are derived, which take dialogue efficiency, consequence of task failure and information gain into account. Dialogue data is then used to estimate speech recognition confidence thresholds that are dependent on the dialogue context.
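    The decision-theoretic idea can be illustrated with a small sketch (illustrative cost values, not the costs estimated in the paper): each grounding action gets an expected cost given the confidence p that the hypothesis is correct, and the action with the lowest expected cost is chosen; the confidence thresholds fall out of where the cost curves cross.

```python
# Expected-cost comparison of grounding actions given recognition confidence p.
# Cost values are illustrative stand-ins for the dialogue-efficiency and
# task-failure costs discussed above.
def expected_costs(p, clarify_cost=1.0, task_failure_cost=5.0):
    return {
        # Accepting a wrong hypothesis risks task failure later.
        "accept": (1 - p) * task_failure_cost,
        # A clarification always costs an extra exchange, but averts failure risk.
        "clarify": clarify_cost,
        # Rejecting discards information even when the hypothesis was right.
        "reject": p * task_failure_cost,
    }

def grounding_decision(p):
    costs = expected_costs(p)
    return min(costs, key=costs.get)

for p in (0.1, 0.5, 0.9):
    print(p, grounding_decision(p), expected_costs(p))
```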

  • 68.
    Skantze, Gabriel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Real-Time Coordination in Human-Robot Interaction Using Face and Voice. 2016. In: AI Magazine, ISSN 0738-4602, Vol. 37, no. 4, p. 19-31. Article in journal (Refereed)
    Abstract [en]

    When humans interact and collaborate with each other, they coordinate their turn-taking behaviors using verbal and nonverbal signals, expressed in the face and voice. If robots of the future are supposed to engage in social interaction with humans, it is essential that they can generate and understand these behaviors. In this article, I give an overview of several studies that show how humans in interaction with a humanlike robot make use of the same coordination signals typically found in studies on human-human interaction, and that it is possible to automatically detect and combine these cues to facilitate real-time coordination. The studies also show that humans react naturally to such signals when used by a robot, without being given any special instructions. They follow the gaze of the robot to disambiguate referring expressions, they conform when the robot selects the next speaker using gaze, and they respond naturally to subtle cues, such as gaze aversion, breathing, facial gestures, and hesitation sounds.

  • 69.
    Skantze, Gabriel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Towards a General, Continuous Model of Turn-taking in Spoken Dialogue using LSTM Recurrent Neural Networks. 2017. In: Proceedings of SigDial, Saarbrucken, Germany, 2017. Conference paper (Refereed)
  • 70.
    Skantze, Gabriel
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Al Moubayed, Samer
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    IrisTK: A statechart-based toolkit for multi-party face-to-face interaction. 2012. In: ICMI'12 - Proceedings of the ACM International Conference on Multimodal Interaction, Association for Computing Machinery (ACM), 2012, p. 69-75. Conference paper (Refereed)
    Abstract [en]

    In this paper, we present IrisTK - a toolkit for rapid development of real-time systems for multi-party face-to-face interaction. The toolkit consists of a message passing system, a set of modules for multi-modal input and output, and a dialog authoring language based on the notion of statecharts. The toolkit has been applied to a large scale study in a public museum setting, where the backprojected robot head Furhat interacted with the visitors in multiparty dialog.
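    As a loose illustration of statechart-style dialog authoring (plain Python with hypothetical event names; IrisTK's actual authoring language is not reproduced here), the sketch below shows states handling events from input modules and transitioning between one another:

```python
# Minimal statechart-style dialog flow: states define handlers for events from
# input modules, and transitions move the flow between states.
class State:
    def on_event(self, event, flow):
        pass

class Greeting(State):
    def on_event(self, event, flow):
        if event == "sense.user.enter":
            print("System: Hi there! Want to play a game?")
            flow.goto(AwaitAnswer())

class AwaitAnswer(State):
    def on_event(self, event, flow):
        if event == "sense.user.speech.yes":
            print("System: Great, let's start.")
        elif event == "sense.user.speech.no":
            print("System: Maybe another time.")
            flow.goto(Greeting())

class Flow:
    def __init__(self, initial):
        self.state = initial
    def goto(self, state):
        self.state = state
    def handle(self, event):
        self.state.on_event(event, self)

flow = Flow(Greeting())
for ev in ["sense.user.enter", "sense.user.speech.yes"]:
    flow.handle(ev)
```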

  • 71.
    Skantze, Gabriel
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Al Moubayed, Samer
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Granström, Björn
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Furhat at Robotville: A Robot Head Harvesting the Thoughts of the Public through Multi-party Dialogue. 2012. In: Proceedings of the Workshop on Real-time Conversation with Virtual Agents IVA-RCVA, 2012. Conference paper (Refereed)
  • 72.
    Skantze, Gabriel
    et al.
    KTH, Superseded Departments, Speech, Music and Hearing.
    Edlund, Jens
    KTH, Superseded Departments, Speech, Music and Hearing.
    Early error detection on word level. 2004. In: Proceedings of ISCA Tutorial and Research Workshop (ITRW) on Robustness Issues in Conversational Interaction, 2004. Conference paper (Refereed)
    Abstract [en]

    In this paper two studies are presented in which the detection of speech recognition errors on the word level was examined. In the first study, memory-based and transformation-based machine learning was used for the task, using confidence, lexical, contextual and discourse features. In the second study, we investigated which factors humans benefit from when detecting errors. Information from the speech recogniser (i.e. word confidence scores and 5-best lists) and contextual information were the factors investigated. The results show that word confidence scores are useful and that lexical and contextual (both from the utterance and from the discourse) features further improve performance.

  • 73.
    Skantze, Gabriel
    et al.
    KTH, Superseded Departments, Speech, Music and Hearing.
    Edlund, Jens
    KTH, Superseded Departments, Speech, Music and Hearing.
    Robust interpretation in the Higgins spoken dialogue system. 2004. In: Proceedings of ISCA Tutorial and Research Workshop (ITRW) on Robustness Issues in Conversational Interaction, 2004. Conference paper (Refereed)
    Abstract [en]

    This paper describes Pickering, the semantic interpreter developed in the Higgins project - a research project on error handling in spoken dialogue systems. In the project, the initial efforts are centred on the input side of the system. The semantic interpreter combines a rich set of robustness techniques with the production of deep semantic structures. It allows insertions and non-agreement inside phrases, and combines partial results to return a limited list of semantically distinct solutions. A preliminary evaluation shows that the interpreter performs well under error conditions, and that the built-in robustness techniques contribute to this performance.

  • 74.
    Skantze, Gabriel
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Edlund, Jens
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Carlson, Rolf
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Talking with Higgins: Research challenges in a spoken dialogue system. 2006. In: Perception and Interactive Technologies, Proceedings / [ed] Andre, E; Dybkjaer, L; Minker, W; Neumann, H; Weber, M, Berlin: Springer-Verlag Berlin, 2006, Vol. 4021, p. 193-196. Conference paper (Refereed)
    Abstract [en]

    This paper presents the current status of the research in the Higgins project and provides background for a demonstration of the spoken dialogue system implemented within the project. The project represents the latest development in the ongoing dialogue systems research at KTH. The practical goal of the project is to build collaborative conversational dialogue systems in which research issues such as error handling techniques can be tested empirically.

  • 75.
    Skantze, Gabriel
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Attention and interaction control in a human-human-computer dialogue setting. 2009. In: Proceedings of SIGDIAL 2009: the 10th Annual Meeting of the Special Interest Group in Discourse and Dialogue, 2009, p. 310-313. Conference paper (Refereed)
    Abstract [en]

    This paper presents a simple, yet effective model for managing attention and interaction control in multimodal spoken dialogue systems. The model allows the user to switch attention between the system and other humans, and the system to stop and resume speaking. An evaluation in a tutoring setting shows that the user’s attention can be effectively monitored using head pose tracking, and that this is a more reliable method than using push-to-talk.
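    A small sketch of such an attention-based interaction control loop is given below (hypothetical names and threshold, not the paper's system): the user's head pose decides whether the system is being attended to, and system speech is paused and resumed accordingly.

```python
# Attention monitoring via head pose: pause output when the user looks away,
# resume when the user looks back. Threshold and yaw values are illustrative.
def attending_to_system(head_yaw_degrees, threshold=30.0):
    """Treat the system as attended to if the head is turned towards it."""
    return abs(head_yaw_degrees) < threshold

class SpeechOutput:
    def __init__(self):
        self.paused = False
    def update(self, attending):
        if not attending and not self.paused:
            self.paused = True
            print("system: pausing utterance (user looked away)")
        elif attending and self.paused:
            self.paused = False
            print("system: resuming utterance (user looked back)")

output = SpeechOutput()
for yaw in [5, 10, 55, 60, 8]:   # simulated head-pose estimates over time
    output.update(attending_to_system(yaw))
```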

  • 76.
    Skantze, Gabriel
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Multimodal interaction control in the MonAMI Reminder. 2009. In: Proceedings of DiaHolmia: 2009 Workshop on the Semantics and Pragmatics of Dialogue / [ed] Jens Edlund, Joakim Gustafson, Anna Hjalmarsson, Gabriel Skantze, 2009, p. 127-128. Conference paper (Refereed)
  • 77.
    Skantze, Gabriel
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Gustafson, Joakim
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Beskow, Jonas
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Multimodal Conversational Interaction with Robots. 2019. In: The Handbook of Multimodal-Multisensor Interfaces, Volume 3: Language Processing, Software, Commercialization, and Emerging Directions / [ed] Sharon Oviatt, Björn Schuller, Philip R. Cohen, Daniel Sonntag, Gerasimos Potamianos, Antonio Krüger, ACM Press, 2019. Chapter in book (Refereed)
  • 78.
    Skantze, Gabriel
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Hjalmarsson, Anna
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Towards incremental speech generation in conversational systems. 2013. In: Computer speech & language (Print), ISSN 0885-2308, E-ISSN 1095-8363, Vol. 27, no. 1, p. 243-262. Article in journal (Refereed)
    Abstract [en]

    This paper presents a model of incremental speech generation in practical conversational systems. The model allows a conversational system to incrementally interpret spoken input, while simultaneously planning, realising and self-monitoring the system response. If these processes are time consuming and result in a response delay, the system can automatically produce hesitations to retain the floor. While speaking, the system utilises hidden and overt self-corrections to accommodate revisions in the system. The model has been implemented in a general dialogue system framework. Using this framework, we have implemented a conversational game application. A Wizard-of-Oz experiment is presented, where the automatic speech recognizer is replaced by a Wizard who transcribes the spoken input. In this setting, the incremental model allows the system to start speaking while the user's utterance is being transcribed. In comparison to a non-incremental version of the same system, the incremental version has a shorter response time and is perceived as more efficient by the users.
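    One aspect of the model, producing hesitations to retain the floor when response planning is slow and overt self-corrections when an already spoken segment is revised, can be sketched as follows (toy Python with hypothetical timings, not the authors' implementation):

```python
# Toy incremental generation loop: emit a filled pause when planning the next
# segment exceeds the available floor time, and verbalise an overt
# self-correction when a spoken segment is revised.
import time

def speak(text):
    print(f"[system says] {text}")

def incremental_generation(segment_planner, deadline=0.5):
    spoken = []
    for plan_time, segment in segment_planner:
        if plan_time > deadline:              # planning too slow: hold the floor
            speak("ehm...")
            time.sleep(min(plan_time, 0.01))  # stand-in for real planning latency
        if spoken and segment.startswith("CORRECTION:"):
            speak("sorry, I mean " + segment.removeprefix("CORRECTION:"))
        else:
            speak(segment)
        spoken.append(segment)

# Each tuple: (simulated planning time in seconds, next speech segment).
incremental_generation([
    (0.1, "you should turn left"),
    (0.9, "at the large building"),
    (0.2, "CORRECTION:at the red building"),
])
```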

  • 79.
    Skantze, Gabriel
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Hjalmarsson, Anna
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Towards Incremental Speech Generation in Dialogue Systems. 2010. In: Proceedings of the SIGDIAL 2010 Conference: 11th Annual Meeting of the Special Interest Group on Discourse and Dialogue, 2010, p. 1-8. Conference paper (Refereed)
    Abstract [en]

    We present a first step towards a model of speech generation for incremental dialogue systems. The model allows a dialogue system to incrementally interpret spoken input, while simultaneously planning, realising and self-monitoring the system response. The model has been implemented in a general dialogue system framework. Using this framework, we have implemented a specific application and tested it in a Wizard-of-Oz setting, comparing it with a non-incremental version of the same system. The results show that the incremental version, while producing longer utterances, has a shorter response time and is perceived as more efficient by the users.

  • 80.
    Skantze, Gabriel
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Hjalmarsson, Anna
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Oertel, Catharine
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Exploring the effects of gaze and pauses in situated human-robot interaction. 2013. In: 14th Annual Meeting of the Special Interest Group on Discourse and Dialogue: SIGDIAL 2013, ACL, 2013. Conference paper (Refereed)
    Abstract [en]

    In this paper, we present a user study where a robot instructs a human on how to draw a route on a map, similar to a Map Task. This setup has allowed us to study user reactions to the robot’s conversational behaviour in order to get a better understanding of how to generate utterances in incremental dialogue systems. We have analysed the participants' subjective rating, task completion, verbal responses, gaze behaviour, drawing activity, and cognitive load. The results show that users utilise the robot’s gaze in order to disambiguate referring expressions and manage the flow of the interaction. Furthermore, we show that the user’s behaviour is affected by how pauses are realised in the robot’s speech.

  • 81.
    Skantze, Gabriel
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Hjalmarsson, Anna
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Oertel, Catharine
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Turn-taking, feedback and joint attention in situated human-robot interaction. 2014. In: Speech Communication, ISSN 0167-6393, E-ISSN 1872-7182, Vol. 65, p. 50-66. Article in journal (Refereed)
    Abstract [en]

    In this paper, we present a study where a robot instructs a human on how to draw a route on a map. The human and robot are seated face-to-face with the map placed on the table between them. The user's and the robot's gaze can thus serve several simultaneous functions: as cues to joint attention, turn-taking, level of understanding and task progression. We have compared this face-to-face setting with a setting where the robot employs a random gaze behaviour, as well as a voice-only setting where the robot is hidden behind a paper board. In addition to this, we have also manipulated turn-taking cues such as completeness and filled pauses in the robot's speech. By analysing the participants' subjective rating, task completion, verbal responses, gaze behaviour, and drawing activity, we show that the users indeed benefit from the robot's gaze when talking about landmarks, and that the robot's verbal and gaze behaviour has a strong effect on the users' turn-taking behaviour. We also present an analysis of the users' gaze and lexical and prosodic realisation of feedback after the robot instructions, and show that these cues reveal whether the user has yet executed the previous instruction, as well as the user's level of uncertainty.

  • 82.
    Skantze, Gabriel
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    House, David
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Edlund, Jens
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Grounding and prosody in dialog. 2006. In: Working Papers 52: Proceedings of Fonetik 2006, Lund, Sweden: Lund University, Centre for Languages & Literature, Dept. of Linguistics & Phonetics, 2006, p. 117-120. Conference paper (Other academic)
    Abstract [en]

    In a previous study we demonstrated that subjects could use prosodic features (primarily peak height and alignment) to make different interpretations of synthesized fragmentary grounding utterances. In the present study we test the hypothesis that subjects also change their behavior accordingly in a human-computer dialog setting. We report on an experiment in which subjects participate in a color-naming task in a Wizard-of-Oz controlled human-computer dialog in Swedish. The results show that two annotators were able to categorize the subjects' responses based on pragmatic meaning. Moreover, the subjects' response times differed significantly, depending on the prosodic features of the grounding fragment spoken by the system.

  • 83.
    Skantze, Gabriel
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    House, David
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Edlund, Jens
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    User Responses to Prosodic Variation in Fragmentary Grounding Utterances in Dialog. 2006. In: Interspeech 2006 and 9th International Conference on Spoken Language Processing, Baixas: International Speech Communication Association (ISCA), 2006, p. 2002-2005. Conference paper (Refereed)
    Abstract [en]

    In a previous study we demonstrated that subjects could use prosodic features (primarily peak height and alignment) to make different interpretations of synthesized fragmentary grounding utterances. In the present study we test the hypothesis that subjects also change their behavior accordingly in a human-computer dialog setting. We report on an experiment in which subjects participate in a color-naming task in a Wizard-of-Oz controlled human-computer dialog in Swedish. The results show that two annotators were able to categorize the subjects' responses based on pragmatic meaning. Moreover, the subjects' response times differed significantly, depending on the prosodic features of the grounding fragment spoken by the system.

  • 84.
    Skantze, Gabriel
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Johansson, Martin
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Modelling situated human-robot interaction using IrisTK. 2015. In: Proceedings of the SIGDIAL 2015 Conference, 2015, p. 165-167. Conference paper (Refereed)
    Abstract [en]

    In this demonstration we show how situated multi-party human-robot interaction can be modelled using the open source framework IrisTK. We will demonstrate the capabilities of IrisTK by showing an application where two users are playing a collaborative card sorting game together with the robot head Furhat, where the cards are shown on a touch table between the players. The application is interesting from a research perspective, as it involves both multi-party interaction, as well as joint attention to the objects under discussion.

  • 85.
    Skantze, Gabriel
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Johansson, Martin
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    A Collaborative Human-Robot Game as a Test-bed for Modelling Multi-party, Situated Interaction. 2015. In: Intelligent Virtual Agents, IVA 2015, 2015, p. 348-351. Conference paper (Refereed)
    Abstract [en]

    In this demonstration we present a test-bed for collecting data and testing out models for multi-party, situated interaction between humans and robots. Two users are playing a collaborative card sorting game together with the robot head Furhat. The cards are shown on a touch table between the players, thus constituting a target for joint attention. The system has been exhibited at the Swedish National Museum of Science and Technology during nine days, resulting in a rich multi-modal corpus with users of mixed ages.

  • 86.
    Skantze, Gabriel
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Johansson, Martin
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Exploring Turn-taking Cues in Multi-party Human-robot Discussions about Objects. 2015. In: Proceedings of the 2015 ACM International Conference on Multimodal Interaction, Association for Computing Machinery (ACM), 2015. Conference paper (Refereed)
    Abstract [en]

    In this paper, we present a dialog system that was exhibited at the Swedish National Museum of Science and Technology. Two visitors at a time could play a collaborative card sorting game together with the robot head Furhat, where the three players discuss the solution together. The cards are shown on a touch table between the players, thus constituting a target for joint attention. We describe how the system was implemented in order to manage turn-taking and attention to users and objects in the shared physical space. We also discuss how multi-modal redundancy (from speech, card movements and head pose) is exploited to maintain meaningful discussions, given that the system has to process conversational speech from both children and adults in a noisy environment. Finally, we present an analysis of 373 interactions, where we investigate the robustness of the system, to what extent the system's attention can shape the users' turn-taking behaviour, and how the system can produce multi-modal turn-taking signals (filled pauses, facial gestures, breath and gaze) to deal with processing delays in the system.

  • 87.
    Skantze, Gabriel
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Oertel, Catharine
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Hjalmarsson, Anna
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    User Feedback in Human-Robot Dialogue: Task Progression and Uncertainty. 2014. In: Proceedings of the HRI Workshop on Timing in Human-Robot Interaction, Bielefeld, Germany, 2014. Conference paper (Refereed)
  • 88.
    Skantze, Gabriel
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Oertel, Catharine
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Hjalmarsson, Anna
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    User feedback in human-robot interaction: Prosody, gaze and timing. 2013. In: Proceedings of Interspeech 2013, 2013, p. 1901-1905. Conference paper (Refereed)
    Abstract [en]

    This paper investigates forms and functions of user feedback in a map task dialogue between a human and a robot, where the robot is the instruction-giver and the human is the instruction-follower. First, we investigate how user acknowledgements in task-oriented dialogue signal whether an activity is about to be initiated or has been completed. The parameters analysed include the users' lexical and prosodic realisation as well as gaze direction and response timing. Second, we investigate the relation between these parameters and the perception of uncertainty.

  • 89.
    Skantze, Gabriel
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Schlangen, D.
    Incremental dialogue processing in a micro-domain. 2009. In: Proceedings of the 12th Conference of the European Chapter of the ACL, 2009, p. 745-753. Conference paper (Refereed)
    Abstract [en]

    This paper describes a fully incremental dialogue system that can engage in dialogues in a simple domain, number dictation. Because it uses incremental speech recognition and prosodic analysis, the system can give rapid feedback as the user is speaking, with a very short latency of around 200ms. Because it uses incremental speech synthesis and self-monitoring, the system can react to feedback from the user as the system is speaking. A comparative evaluation shows that naïve users preferred this system over a non-incremental version, and that it was perceived as more human-like.

  • 90.
    Wallers, Åsa
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Edlund, Jens
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Skantze, Gabriel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    The effect of prosodic features on the interpretation of synthesised backchannels. 2006. In: Perception And Interactive Technologies, Proceedings / [ed] Andre, E; Dybkjaer, L; Minker, W; Neumann, H; Weber, M, 2006, Vol. 4021, p. 183-187. Conference paper (Refereed)
    Abstract [en]

    A study of the interpretation of prosodic features in backchannels (Swedish /a/ and /m/) produced by speech synthesis is presented. The study is part of work-in-progress towards endowing conversational spoken dialogue systems with the ability to produce and use backchannels and other feedback.
