Results 51 - 100 of 110
  • 51.
    Gustafson, Joakim
    et al.
    KTH, Superseded Departments (pre-2005), Speech, Music and Hearing.
    Larsson, Anette
    KTH, Superseded Departments (pre-2005), Speech, Music and Hearing.
    Carlson, Rolf
    KTH, Superseded Departments (pre-2005), Speech, Music and Hearing.
    Hellman, K
    Department of Linguistics, Stockholm University, S-106 91 Stockholm, Sweden.
    How do System Questions Influence Lexical Choices in User Answers? (1997). In: Proceedings of Eurospeech '97, 5th European Conference on Speech Communication and Technology: Rhodes, Greece, 22-25 September 1997, Grenoble: European Speech Communication Association (ESCA), 1997, p. 2275-2278. Conference paper (Refereed)
    Abstract [en]

    This paper describes some studies on the effect of the system vocabulary on the lexical choices of the users. There are many theories about human-human dialogues that could be useful in the design of spoken dialogue systems. This paper gives an overview of some of these theories and reports the results from two experiments that examine one of them, namely lexical entrainment. The first experiment was a small Wizard-of-Oz test that simulated a tourist information system with a speech interface, and the second experiment simulated a system with speech recognition that controlled a questionnaire about people's plans for their vacation. Both experiments show that the subjects mostly adapt their lexical choices to the system questions. Only in less than 5% of the cases did they use an alternative main verb in the answer. These results encourage us to investigate the possibility of adding an adaptive language model to the speech recognizer in our dialogue system, in which the probabilities of the words used in the system questions are increased.
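
A hedged illustration of the adaptive language-model idea mentioned at the end of the abstract above: unigram probabilities for words that occur in the current system question are scaled up and the distribution is renormalised. This is a minimal sketch; the function name, boost factor and example probabilities are illustrative assumptions, not values from the paper.

```python
# Minimal sketch (not the paper's implementation): boost unigram
# probabilities of words that occur in the current system question.

def boost_language_model(unigram_probs, system_question, boost=2.0):
    """Return a renormalised unigram model where words from the
    system question get `boost` times their original probability."""
    question_words = set(system_question.lower().split())
    boosted = {
        word: prob * (boost if word in question_words else 1.0)
        for word, prob in unigram_probs.items()
    }
    total = sum(boosted.values())
    return {word: prob / total for word, prob in boosted.items()}

# Illustrative usage with made-up probabilities.
lm = {"leave": 0.02, "depart": 0.01, "go": 0.05, "train": 0.03}
adapted = boost_language_model(lm, "when do you want to depart")
```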

  • 52.
    Gustafson, Joakim
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Merkes, Miray
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Eliciting interactional phenomena in human-human dialogues (2009). In: Proceedings of the SIGDIAL 2009 Conference: 10th Annual Meeting of the Special Interest Group on Discourse and Dialogue, 2009, p. 298-301. Conference paper (Refereed)
    Abstract [en]

    In order to build a dialogue system that can interact with humans in the same way as humans interact with each other, it is important to be able to collect conversational data. This paper introduces a dialogue recording method where an eavesdropping human operator sends instructions to the participants in an ongoing human-human task-oriented dialogue. The purpose of the instructions is to control the dialogue progression or to elicit interactional phenomena. The recordings were used to build a Swedish synthesis voice with disfluent diphones.

  • 53.
    Gustafson, Joakim
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Neiberg, Daniel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Directing conversation using the prosody of mm and mhm (2010). In: Proceedings of SLTC 2010, Linköping, Sweden, 2010, p. 15-16. Conference paper (Refereed)
  • 54.
    Gustafson, Joakim
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Neiberg, Daniel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Prosodic cues to engagement in non-lexical response tokens in Swedish (2010). In: Proceedings of DiSS-LPSS Joint Workshop 2010, Tokyo, Japan, 2010. Conference paper (Refereed)
  • 55.
    Gustafson, Joakim
    et al.
    Voice Technologies, Expert Functions, TeliaSonera, Farsta, Sweden.
    Sjölander, Kåre
    KTH, Superseded Departments, Speech, Music and Hearing.
    Voice creations for conversational fairy-tale characters (2004). In: Proc 5th ISCA speech synthesis workshop, Pittsburgh, 2004, p. 145-150. Conference paper (Refereed)
  • 56.
    Gustafson, Joakim
    et al.
    KTH, Superseded Departments (pre-2005), Speech, Music and Hearing. Telia Research AB, Sweden.
    Sjölander, Kåre
    KTH, Superseded Departments (pre-2005), Speech, Music and Hearing.
    Voice Transformations For Improving Children's Speech Recognition In A Publicly Available Dialogue System (2002). In: Proceedings of ICSLP 02, International Speech Communication Association, 2002, p. 297-300. Conference paper (Refereed)
    Abstract [en]

    To be able to build acoustic models for children that can be used in spoken dialogue systems, speech data has to be collected. Commercial recognizers available for Swedish are trained on adult speech, which makes them less suitable for children's computer-directed speech. This paper describes some experiments with on-the-fly voice transformation of children's speech. Two transformation methods were tested, one inspired by the Phase Vocoder algorithm and another by the Time-Domain Pitch-Synchronous Overlap-Add (TD-PSOLA) algorithm. The speech signal is transformed before being sent to the speech recognizer for adult speech. Our results show that this method reduces the error rates by on the order of thirty to forty-five percent for child users.
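
To give a concrete feel for the kind of on-the-fly transformation described above, the sketch below lowers the pitch of a child's recording before it would be passed to an adult-trained recognizer. It uses librosa's phase-vocoder-based pitch shifting as a stand-in for the custom Phase Vocoder and TD-PSOLA implementations in the paper; the file names and shift amount are assumptions.

```python
# Sketch only: approximate the paper's voice transformation by lowering
# the pitch of a child's utterance before recognition. librosa's
# pitch_shift (phase vocoder plus resampling) stands in for the custom
# Phase Vocoder / TD-PSOLA methods used in the paper.
import librosa
import soundfile as sf

def transform_for_adult_asr(in_path, out_path, semitones=-4.0):
    """Lower the pitch by `semitones` and save the transformed audio."""
    y, sr = librosa.load(in_path, sr=None)          # keep original sample rate
    y_shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=semitones)
    sf.write(out_path, y_shifted, sr)

# Hypothetical usage: the transformed file would then be sent to the
# adult-speech recognizer instead of the raw recording.
# transform_for_adult_asr("child_utterance.wav", "transformed.wav")
```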

  • 57.
    Gustafson, Joakim
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Edlund, Jens
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    EXPROS: A toolkit for exploratory experimentation with prosody in customized diphone voices (2008). In: Perception In Multimodal Dialogue Systems, Proceedings / [ed] Andre, E; Dybkjaer, L; Minker, W; Neumann, H; Pieraccini, R; Weber, M, 2008, Vol. 5078, p. 293-296. Conference paper (Refereed)
    Abstract [en]

    This paper presents a toolkit for experimentation with prosody in diphone voices. Prosodic features play an important role for aspects of human-human spoken dialogue that are largely unexploited in current spoken dialogue systems. The toolkit contains tools for recording utterances for a number of purposes. Examples include extraction of prosodic features such as pitch, intensity and duration for transplantation onto synthetic utterances, and creation of purpose-built customized MBROLA mini-voices.

  • 58. Johansson, M.
    et al.
    Skantze, Gabriel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Understanding route directions in human-robot dialogue (2011). In: Proceedings of SemDial, Los Angeles, CA, 2011, p. 19-27. Conference paper (Refereed)
  • 59.
    Johansson, Martin
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Hori, Tatsuro
    Skantze, Gabriel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Hothker, Anja
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Making Turn-Taking Decisions for an Active Listening Robot for Memory Training (2016). In: Social Robotics (ICSR 2016), Springer, 2016, p. 940-949. Conference paper (Refereed)
    Abstract [en]

    In this paper we present a dialogue system and response model that allows a robot to act as an active listener, encouraging users to tell the robot about their travel memories. The response model makes a combined decision about when to respond and what type of response to give, in order to elicit more elaborate descriptions from the user and avoid non-sequitur responses. The model was trained on human-robot dialogue data collected in a Wizard-of-Oz setting, and evaluated in a fully autonomous version of the same dialogue system. Compared to a baseline system, users perceived the dialogue system with the trained model to be a significantly better listener. The trained model also resulted in dialogues with significantly fewer mistakes, a larger proportion of user speech and fewer interruptions.

  • 60.
    Johansson, Martin
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Skantze, Gabriel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Comparison of human-human and human-robot turn-taking behaviour in multi-party situated interaction (2014). In: UM3I '14: Proceedings of the 2014 workshop on Understanding and Modeling Multiparty, Multimodal Interactions, Istanbul, Turkey, 2014, p. 21-26. Conference paper (Refereed)
    Abstract [en]

    In this paper, we present an experiment where two human subjects are given a team-building task to solve together with a robot. The setting requires that the speakers' attention is partly directed towards objects on the table between them, as well as to each other, in order to coordinate turn-taking. The symmetrical setup allows us to compare human-human and human-robot turn-taking behaviour in the same interactional setting. The analysis centres around the interlocutors' attention (as measured by head pose) and gap length between turns, depending on the pragmatic function of the utterances.

  • 61.
    Johansson, Martin
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Skantze, Gabriel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Head Pose Patterns in Multiparty Human-Robot Team-Building Interactions (2013). In: Social Robotics: 5th International Conference, ICSR 2013, Bristol, UK, October 27-29, 2013, Proceedings / [ed] Guido Herrmann, Martin J. Pearson, Alexander Lenz, Paul Bremner, Adam Spiers, Ute Leonards, Springer, 2013, p. 351-360. Conference paper (Refereed)
    Abstract [en]

    We present a data collection setup for exploring turn-taking in three-party human-robot interaction involving objects competing for attention. The collected corpus comprises 78 minutes in four interactions. Using automated techniques to record head pose and speech patterns, we analyze head pose patterns in turn-transitions. We find that introduction of objects makes addressee identification based on head pose more challenging. The symmetrical setup also allows us to compare human-human to human-robot behavior within the same interaction. We argue that this symmetry can be used to assess to what extent the system exhibits a human-like behavior.

  • 62.
    Johnson-Roberson, Matthew
    et al.
    KTH, School of Computer Science and Communication (CSC), Computer Vision and Active Perception, CVAP. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Autonomous Systems, CAS.
    Bohg, Jeannette
    KTH, School of Computer Science and Communication (CSC), Computer Vision and Active Perception, CVAP. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Autonomous Systems, CAS.
    Kragic, Danica
    KTH, School of Computer Science and Communication (CSC), Computer Vision and Active Perception, CVAP. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Autonomous Systems, CAS.
    Skantze, Gabriel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Carlson, Rolf
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Enhanced visual scene understanding through human-robot dialog (2010). In: Dialog with Robots: AAAI 2010 Fall Symposium, 2010, p. -144. Conference paper (Refereed)
  • 63.
    Johnson-Roberson, Matthew
    et al.
    KTH, School of Computer Science and Communication (CSC), Computer Vision and Active Perception, CVAP. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Autonomous Systems, CAS.
    Bohg, Jeannette
    KTH, School of Computer Science and Communication (CSC), Computer Vision and Active Perception, CVAP. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Autonomous Systems, CAS.
    Skantze, Gabriel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Gustafsson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Carlson, Rolf
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Kragic, Danica
    KTH, School of Computer Science and Communication (CSC), Computer Vision and Active Perception, CVAP. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Autonomous Systems, CAS.
    Rasolzadeh, Babak
    KTH, School of Computer Science and Communication (CSC), Computer Vision and Active Perception, CVAP. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Autonomous Systems, CAS.
    Enhanced Visual Scene Understanding through Human-Robot Dialog (2011). In: 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, 2011, p. 3342-3348. Conference paper (Refereed)
    Abstract [en]

    We propose a novel human-robot-interaction framework for robust visual scene understanding. Without any a-priori knowledge about the objects, the task of the robot is to correctly enumerate how many of them are in the scene and segment them from the background. Our approach builds on top of state-of-the-art computer vision methods, generating object hypotheses through segmentation. This process is combined with a natural dialog system, thus including a ‘human in the loop’ where, by exploiting the natural conversation of an advanced dialog system, the robot gains knowledge about ambiguous situations. We present an entropy-based system allowing the robot to detect the poorest object hypotheses and query the user for arbitration. Based on the information obtained from the human-robot dialog, the scene segmentation can be re-seeded and thereby improved. We present experimental results on real data that show an improved segmentation performance compared to segmentation without interaction.
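
The entropy-based arbitration step described above can be illustrated with a small sketch: compute the entropy of each object hypothesis' label distribution and pick the most uncertain one to ask the user about. The probability vectors and function names below are hypothetical, not taken from the paper.

```python
# Minimal sketch of entropy-based selection of the hypothesis to query
# (illustrative only; not the authors' implementation).
import numpy as np

def entropy(p):
    """Shannon entropy of a discrete distribution (natural log)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # ignore zero-probability entries
    return float(-np.sum(p * np.log(p)))

def most_uncertain_hypothesis(hypotheses):
    """Return the index of the object hypothesis with the highest entropy."""
    return int(np.argmax([entropy(h) for h in hypotheses]))

# Hypothetical distributions over "one object" vs "two merged objects".
hypotheses = [[0.95, 0.05], [0.55, 0.45], [0.80, 0.20]]
query_index = most_uncertain_hypothesis(hypotheses)   # -> 1, the poorest hypothesis
```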

  • 64.
    Jonell, Patrik
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Oertel, Catharine
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Kontogiorgos, Dimosthenis
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Beskow, Jonas
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Gustafson, Joakim
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Crowdsourced Multimodal Corpora Collection Tool (2018). In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Paris, 2018, p. 728-734. Conference paper (Refereed)
    Abstract [en]

    In recent years, more and more multimodal corpora have been created. To our knowledge there is no publicly available tool which allows for acquiring controlled multimodal data of people in a rapid and scalable fashion. We are therefore proposing (1) a novel tool which will enable researchers to rapidly gather large amounts of multimodal data spanning a wide demographic range, and (2) an example of how we used this tool for the collection of our "Attentive listener" multimodal corpus. The code is released under an Apache License 2.0 and available as an open-source repository, which can be found at https://github.com/kth-social-robotics/multimodal-crowdsourcing-tool. This tool will allow researchers to set up their own multimodal data collection system quickly and create their own multimodal corpora. Finally, this paper provides a discussion of the advantages and disadvantages of a crowd-sourced data collection tool, especially in comparison to lab-recorded corpora.

  • 65.
    Kontogiorgos, Dimosthenis
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Abelho Pereira, André Tiago
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Gustafson, Joakim
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    The Trade-off between Interaction Time and Social Facilitation with Collaborative Social Robots (2019). In: The Challenges of Working on Social Robots that Collaborate with People, 2019. Conference paper (Refereed)
    Abstract [en]

    The adoption of social robots and conversational agents is growing at a rapid pace. These agents, however, are still not optimised to simulate key social aspects of situated human conversational environments. Humans are intellectually biased towards social activity when facing more anthropomorphic agents or when presented with subtle social cues. In this paper, we discuss the effects of simulating anthropomorphism and non-verbal social behaviour in social robots and its implications for human-robot collaborative guided tasks. Our results indicate that it is not always favourable for agents to be anthropomorphised or to communicate with nonverbal behaviour. We found a clear trade-off between interaction time and social facilitation when controlling for anthropomorphism and social behaviour.

  • 66.
    Kontogiorgos, Dimosthenis
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Avramova, Vanya
    KTH.
    Alexanderson, Simon
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Jonell, Patrik
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Oertel, Catharine
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH. Ecole Polytechnique Fédérale de Lausanne (EPFL), Switzerland.
    Beskow, Jonas
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Skantze, Gabriel
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Gustafson, Joakim
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    A Multimodal Corpus for Mutual Gaze and Joint Attention in Multiparty Situated Interaction (2018). In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Paris, 2018, p. 119-127. Conference paper (Refereed)
    Abstract [en]

    In this paper we present a corpus of multiparty situated interaction where participants collaborated on moving virtual objects on a large touch screen. A moderator facilitated the discussion and directed the interaction. The corpus contains recordings of a variety of multimodal data, in that we captured speech, eye gaze and gesture data using a multisensory setup (wearable eye trackers, motion capture and audio/video). Furthermore, in the description of the multimodal corpus, we investigate four different types of social gaze: referential gaze, joint attention, mutual gaze and gaze aversion by both perspectives of a speaker and a listener. We annotated the groups’ object references during object manipulation tasks and analysed the group’s proportional referential eye-gaze with regards to the referent object. When investigating the distributions of gaze during and before referring expressions we could corroborate the differences in time between speakers’ and listeners’ eye gaze found in earlier studies. This corpus is of particular interest to researchers who are interested in social eye-gaze patterns in turn-taking and referring language in situated multi-party interaction.

  • 67.
    Kontogiorgos, Dimosthenis
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Sibirtseva, Elena
    KTH, School of Electrical Engineering and Computer Science (EECS), Robotics, perception and learning, RPL.
    Pereira, André
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Skantze, Gabriel
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Gustafson, Joakim
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Multimodal reference resolution in collaborative assembly tasks (2018). In: Multimodal reference resolution in collaborative assembly tasks, ACM Digital Library, 2018. Conference paper (Refereed)
  • 68.
    Kontogiorgos, Dimosthenis
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Skantze, Gabriel
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Abelho Pereira, André Tiago
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Gustafson, Joakim
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    The Effects of Embodiment and Social Eye-Gaze in Conversational Agents (2019). In: Proceedings of the 41st Annual Conference of the Cognitive Science Society (CogSci), 2019. Conference paper (Refereed)
    Abstract [en]

    The adoption of conversational agents is growing at a rapid pace. Agents, however, are not optimised to simulate key social aspects of situated human conversational environments. Humans are intellectually biased towards social activity when facing more anthropomorphic agents or when presented with subtle social cues. In this work, we explore the effects of simulating anthropomorphism and social eye-gaze in three conversational agents. We tested whether subjects’ visual attention would be similar to agents in different forms of embodiment and social eye-gaze. In a within-subject situated interaction study (N=30), we asked subjects to engage in task-oriented dialogue with a smart speaker and two variations of a social robot. We observed shifting of interactive behaviour by human users, as shown in differences in behavioural and objective measures. With a trade-off in task performance, social facilitation is higher with more anthropomorphic social agents when performing the same task.

  • 69.
    Kragic, Danica
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Robotics, perception and learning, RPL.
    Gustafson, Joakim
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Karaoǧuz, Hakan
    KTH, School of Electrical Engineering and Computer Science (EECS), Robotics, perception and learning, RPL.
    Jensfelt, Patric
    KTH, School of Electrical Engineering and Computer Science (EECS), Robotics, perception and learning, RPL.
    Krug, Robert
    KTH, School of Electrical Engineering and Computer Science (EECS), Robotics, perception and learning, RPL.
    Interactive, collaborative robots: Challenges and opportunities (2018). In: IJCAI International Joint Conference on Artificial Intelligence, International Joint Conferences on Artificial Intelligence, 2018, p. 18-25. Conference paper (Refereed)
    Abstract [en]

    Robotic technology has transformed the manufacturing industry ever since the first industrial robot was put in use at the beginning of the 1960s. The challenge of developing flexible solutions, where production lines can be quickly re-planned, adapted and structured for new or slightly changed products, is still an important open problem. Industrial robots today are still largely preprogrammed for their tasks, not able to detect errors in their own performance or to robustly interact with a complex environment and a human worker. The challenges are even more serious when it comes to various types of service robots. Full robot autonomy, including natural interaction, learning from and with humans, and safe and flexible performance for challenging tasks in unstructured environments, will remain out of reach for the foreseeable future. In the envisioned future factory setups, home and office environments, humans and robots will share the same workspace and perform different object manipulation tasks in a collaborative manner. We discuss some of the major challenges of developing such systems and provide examples of the current state of the art.

  • 70.
    Lopes, José
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Skantze, Gabriel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Abad, A.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Batista, F.
    Meena, Raveesh
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Trancoso, I.
    Detecting Repetitions in Spoken Dialogue Systems Using Phonetic Distances (2015). In: INTERSPEECH-2015, 2015, p. 1805-1809. Conference paper (Refereed)
    Abstract [en]

    Repetitions in Spoken Dialogue Systems can be a symptom of problematic communication. Such repetitions are often due to speech recognition errors, which in turn makes it harder to use the output of the speech recognizer to detect repetitions. In this paper, we combine the alignment score obtained using phonetic distances with dialogue-related features to improve repetition detection. To evaluate the proposed method, we compare several alignment techniques, from edit distance to DTW-based distance, previously used in spoken-term detection tasks. We also compare two different methods for computing the phonetic distance: the first uses the phoneme sequence, and the second uses the distance between the phone posterior vectors. Two different datasets were used in this evaluation: a bus-schedule information system (in English) and a call routing system (in Swedish). The results show that approaches using phoneme distances outperform approaches using Levenshtein distances between ASR outputs for repetition detection.
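
As a minimal illustration of the phoneme-level comparison discussed above, the sketch below computes a normalised Levenshtein distance between two phoneme sequences and flags a likely repetition when the distance falls under a threshold. The threshold and the example transcriptions are assumptions for illustration, not values from the paper.

```python
# Sketch: normalised Levenshtein (edit) distance between two phoneme
# sequences, used as a crude repetition detector. Illustrative only.

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two sequences."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n]

def is_repetition(phones_a, phones_b, threshold=0.3):
    """Flag a repetition when the normalised distance is small."""
    dist = levenshtein(phones_a, phones_b)
    return dist / max(len(phones_a), len(phones_b), 1) <= threshold

# Hypothetical ASR phoneme outputs of two consecutive user turns.
turn1 = ["t", "uː", "m", "ɔ", "r", "oʊ"]
turn2 = ["t", "ə", "m", "ɔ", "r", "oʊ"]
print(is_repetition(turn1, turn2))   # True
```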

  • 71.
    Malisz, Zofia
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Berthelsen, H.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Controlling prominence realisation in parametric DNN-based speech synthesis (2017). In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH 2017, International Speech Communication Association, 2017, Vol. 2017, p. 1079-1083. Conference paper (Refereed)
    Abstract [en]

    This work aims to improve text-to-speech synthesis for Wikipedia by advancing and implementing models of prosodic prominence. We propose a new system architecture with explicit prominence modeling and test the first component of the architecture. We automatically extract a phonetic feature related to prominence from the speech signal in the ARCTIC corpus. We then modify the label files and train an experimental TTS system based on the feature using Merlin, a statistical-parametric DNN-based engine. Test sentences with contrastive prominence on the word level are synthesised, and separate listening tests are conducted, evaluating a) the level of prominence control in the generated speech and b) naturalness. Our results show that the prominence feature-enhanced system successfully places prominence on the appropriate words and increases perceived naturalness relative to the baseline.

  • 72.
    Meena, Raveesh
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Boye, Johan
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Skantze, Gabriel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Crowdsourcing Street-level Geographic Information Using a Spoken Dialogue System (2014). In: Proceedings of the SIGDIAL 2014 Conference, Association for Computational Linguistics, 2014, p. 2-11. Conference paper (Refereed)
    Abstract [en]

    We present a technique for crowd-sourcing street-level geographic information using spoken natural language. In particular, we are interested in obtaining first-person-view information about what can be seen from different positions in the city. This information can then for example be used for pedestrian routing services. The approach has been tested in the lab using a fully implemented spoken dialogue system, and is showing promising results.

  • 73.
    Meena, Raveesh
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Boye, Johan
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Skantze, Gabriel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Using a Spoken Dialogue System for Crowdsourcing Street-level Geographic Information (2014). Conference paper (Refereed)
    Abstract [en]

    We present a novel scheme for enriching a geographic database with street-level geographic information that could be useful for pedestrian navigation. A spoken dialogue system for crowdsourcing street-level geographic details was developed and tested in an in-lab experiment, and has shown promising results.

  • 74.
    Meena, Raveesh
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    David Lopes, José
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Skantze, Gabriel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Automatic Detection of Miscommunication in Spoken Dialogue Systems (2015). In: Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), 2015, p. 354-363. Conference paper (Refereed)
    Abstract [en]

    In this paper, we present a data-driven approach for detecting instances of miscommunication in dialogue system interactions. A range of generic features that are both automatically extractable and manually annotated were used to train two models for online detection and one for offline analysis. Online detection could be used to raise the error awareness of the system, whereas offline detection could be used by a system designer to identify potential flaws in the dialogue design. In experimental evaluations on system logs from three different dialogue systems that vary in their dialogue strategy, the proposed models performed substantially better than the majority class baseline models.

  • 75.
    Meena, Raveesh
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Skantze, Gabriel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    A Chunking Parser for Semantic Interpretation of Spoken Route Directions in Human-Robot Dialogue (2012). In: Proceedings of the 4th Swedish Language Technology Conference (SLTC 2012), Lund, Sweden, 2012, p. 55-56. Conference paper (Refereed)
    Abstract [en]

    We present a novel application of the chunking parser for data-driven semantic interpretation of spoken route directions into route graphs that are useful for robot navigation. Various sets of features and machine learning algorithms were explored. The results indicate that our approach is robust to speech recognition errors, and could be easily used in other languages using simple features.

  • 76.
    Meena, Raveesh
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Skantze, Gabriel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    A data-driven approach to understanding spoken route directions in human-robot dialogue (2012). In: 13th Annual Conference of the International Speech Communication Association 2012, INTERSPEECH 2012, 2012, p. 226-229. Conference paper (Refereed)
    Abstract [en]

    In this paper, we present a data-driven chunking parser for automatic interpretation of spoken route directions into a route graph that is useful for robot navigation. Different sets of features and machine learning algorithms are explored. The results indicate that our approach is robust to speech recognition errors.

  • 77.
    Meena, Raveesh
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Skantze, Gabriel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    A Data-driven Model for Timing Feedback in a Map Task Dialogue System (2013). In: 14th Annual Meeting of the Special Interest Group on Discourse and Dialogue - SIGdial, Metz, France, 2013, p. 375-383. Conference paper (Refereed)
    Abstract [en]

    We present a data-driven model for detecting suitable response locations in the user’s speech. The model has been trained on human–machine dialogue data and implemented and tested in a spoken dialogue system that can perform the Map Task with users. To our knowledge, this is the first example of a dialogue system that uses automatically extracted syntactic, prosodic and contextual features for online detection of response locations. A subjective evaluation of the dialogue system suggests that interactions with a system using our trained model were perceived significantly better than those with a system using a model that made decisions at random.

  • 78.
    Meena, Raveesh
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Skantze, Gabriel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Human Evaluation of Conceptual Route Graphs for Interpreting Spoken Route Descriptions (2013). In: Proceedings of the 3rd International Workshop on Computational Models of Spatial Language Interpretation and Generation (CoSLI), Potsdam, Germany, 2013, p. 30-35. Conference paper (Refereed)
    Abstract [en]

    We present a human evaluation of the usefulness of conceptual route graphs (CRGs) when it comes to route following using spoken route descriptions. We describe a method for data-driven semantic interpretation of route descriptions into CRGs. The comparable performances of human participants in sketching a route using the manually transcribed CRGs and the CRGs produced from speech-recognized route descriptions indicate the robustness of our method in preserving the vital conceptual information required for route following despite speech recognition errors.

  • 79.
    Meena, Raveesh
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Skantze, Gabriel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    The Map Task Dialogue System: A Test-bed for Modelling Human-Like Dialogue (2013). In: 14th Annual Meeting of the Special Interest Group on Discourse and Dialogue - SIGdial, Metz, France, 2013, p. 366-368. Conference paper (Refereed)
    Abstract [en]

    The demonstrator presents a test-bed for collecting data on human–computer dialogue: a fully automated dialogue system that can perform Map Task with a user. In a first step, we have used the test-bed to collect human–computer Map Task dialogue data, and have trained various data-driven models on it for detecting feedback response locations in the user’s speech. One of the trained models has been tested in user interactions and was perceived better in comparison to a system using a random model. The demonstrator will exhibit three versions of the Map Task dialogue system—each using a different trained data-driven model of Response Location Detection.

  • 80.
    Meena, Raveesh
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Skantze, Gabriel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Gustafsson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Data-driven models for timing feedback responses in a Map Task dialogue system (2014). In: Computer Speech & Language (Print), ISSN 0885-2308, E-ISSN 1095-8363, Vol. 28, no 4, p. 903-922. Article in journal (Refereed)
    Abstract [en]

    Traditional dialogue systems use a fixed silence threshold to detect the end of users' turns. Such a simplistic model can result in system behaviour that is both interruptive and unresponsive, which in turn affects user experience. Various studies have observed that human interlocutors take cues from speaker behaviour, such as prosody, syntax, and gestures, to coordinate smooth exchange of speaking turns. However, little effort has been made towards implementing these models in dialogue systems and verifying how well they model the turn-taking behaviour in human computer interactions. We present a data-driven approach to building models for online detection of suitable feedback response locations in the user's speech. We first collected human computer interaction data using a spoken dialogue system that can perform the Map Task with users (albeit using a trick). On this data, we trained various models that use automatically extractable prosodic, contextual and lexico-syntactic features for detecting response locations. Next, we implemented a trained model in the same dialogue system and evaluated it in interactions with users. The subjective and objective measures from the user evaluation confirm that a model trained on speaker behavioural cues offers both smoother turn-transitions and more responsive system behaviour.
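
A minimal sketch of how a data-driven response-location model like the one described above could be trained, assuming feature vectors of prosodic, contextual and lexico-syntactic features have already been extracted. Scikit-learn's logistic regression stands in for whichever learner the authors used, and all feature names and numbers are illustrative assumptions.

```python
# Sketch only: train a binary classifier that decides, at each pause,
# whether a feedback response is appropriate. Feature extraction is
# assumed to have been done elsewhere; data here is made up.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical rows: [final F0 slope, pause length (s),
# words since last response, ends in content word (0/1)].
X = np.array([[-2.1, 0.45, 7, 0],
              [ 0.3, 0.10, 2, 1],
              [-1.8, 0.60, 9, 0],
              [ 0.8, 0.15, 3, 1],
              [-2.5, 0.50, 8, 0],
              [ 0.1, 0.20, 1, 1]])
y = np.array([1, 0, 1, 0, 1, 0])   # 1 = respond here, 0 = keep listening

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0, stratify=y)

model = LogisticRegression().fit(X_train, y_train)
print(model.predict(X_test))       # predicted response locations
```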

  • 81. Mirnig, Nicole
    et al.
    Weiss, Astrid
    Skantze, Gabriel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Al Moubayed, Samer
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Granström, Björn
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Tscheligi, Manfred
    Face-To-Face With A Robot: What do we actually talk about? (2013). In: International Journal of Humanoid Robotics, ISSN 0219-8436, Vol. 10, no 1, p. 1350011. Article in journal (Refereed)
    Abstract [en]

    While much of the state-of-the-art research in human-robot interaction (HRI) investigates task-oriented interaction, this paper aims at exploring what people talk about to a robot if the content of the conversation is not predefined. We used the robot head Furhat to explore the conversational behavior of people who encounter a robot in the public setting of a robot exhibition in a scientific museum, but without a predefined purpose. Upon analyzing the conversations, it could be shown that a sophisticated robot provides an inviting atmosphere for people to engage in interaction and to be experimental and challenge the robot's capabilities. Many visitors to the exhibition were willing to go beyond the guiding questions that were provided as a starting point. Amongst other things, they asked Furhat questions concerning the robot itself, such as how it would define a robot, or if it plans to take over the world. People were also interested in the feelings and likes of the robot and they asked many personal questions - this is how Furhat ended up with its first marriage proposal. People who talked to Furhat were asked to complete a questionnaire on their assessment of the conversation, with which we could show that the interaction with Furhat was rated as a pleasant experience.

  • 82.
    Neiberg, Daniel
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Ananthakrishnan, Gopal
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Tracking pitch contours using minimum jerk trajectories (2011). In: INTERSPEECH 2011, 12th Annual Conference of the International Speech Communication Association, 2011, p. 2056-2059. Conference paper (Refereed)
    Abstract [en]

    This paper proposes a fundamental frequency tracker, with the specific purpose of comparing the automatic estimates with pitch contours that are sketched by trained phoneticians. The method uses a frequency domain approach to estimate pitch tracks that form minimum jerk trajectories. This method tries to mimic the motor movements of the hand made while sketching. When the fundamental frequencies tracked by the proposed method on the oral and laryngograph signals were compared using the MOCHA-TIMIT database, the correlation was 0.98 and the root mean squared error was 4.0 Hz, which was slightly better than a state-of-the-art pitch tracking algorithm included in the ESPS. We also demonstrate how the proposed algorithm could be applied when comparing with sketches made by phoneticians for the variations in accent II among the Swedish dialects.
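
For reference, the minimum-jerk criterion alluded to above is the standard one from motor-control theory: among trajectories x(t) connecting two states, choose the one minimising the integral of the squared third derivative. The formulation below is a general reminder under the usual zero-velocity and zero-acceleration boundary assumptions, not necessarily the paper's exact objective.

```latex
% Minimum-jerk criterion (general form):
J = \int_{t_0}^{t_1} \left( \frac{d^{3}x(t)}{dt^{3}} \right)^{2} dt
% With zero velocity and acceleration at both endpoints, the minimiser
% is the well-known quintic:
x(t) = x_0 + (x_1 - x_0)\left( 10\tau^{3} - 15\tau^{4} + 6\tau^{5} \right),
\qquad \tau = \frac{t - t_0}{t_1 - t_0}
```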

  • 83.
    Neiberg, Daniel
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    A Dual Channel Coupled Decoder for Fillers and Feedback (2011). In: INTERSPEECH 2011, 12th Annual Conference of the International Speech Communication Association, 2011, p. 3097-3100. Conference paper (Refereed)
    Abstract [en]

    This study presents a dual channel decoder capable of modeling cross-speaker dependencies for segmentation and classification of fillers and feedback in conversational speech found in the DEAL corpus. For the same number of Gaussians per state, we have shown improvements in terms of average F-score for the successive addition of 1) an increased frame rate from 10 ms to 50 ms, 2) Joint Maximum Cross-Correlation (JMXC) features in a single channel decoder, 3) a joint transition matrix which captures dependencies symmetrically across the two channels, and 4) coupled acoustic model retraining symmetrically across the two channels. The final step gives a relative improvement of over 100% for fillers and feedback compared to our previously published results. The F-scores are in a range that makes it possible to use the decoder both as a voice activity detector and as an illocutionary act decoder for semi-automatic annotation.

  • 84.
    Neiberg, Daniel
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Cues to perceived functions of acted and spontaneous feedback expressions (2012). In: Proceedings of the Interdisciplinary Workshop on Feedback Behaviors in Dialog, 2012, p. 53-56. Conference paper (Refereed)
    Abstract [en]

    We present a two-step study where the first part aims to determine the phonemic prior bias (conditioned on “ah”, “m-hm”, “m-m”, “n-hn”, “oh”, “okay”, “u-hu”, “yeah” and “yes”) in subjects' perception of six feedback functions (acknowledgment, continuer, disagreement, surprise, enthusiasm and uncertainty). The results showed a clear phonemic prior bias for some tokens, e.g. “ah” and “oh” are commonly interpreted as surprise but “yeah” and “yes” less so. The second part aims to examine determinants of judged typicality, or graded structure, within the six functions of “okay”. Typicality was correlated with four determinants: prosodic central tendency within the function (CT); phonemic prior bias as an approximation to frequency instantiation (FI); the posterior, i.e. CT x FI; and judged Ideality (ID), i.e. similarity to ideals associated with the goals served by its function. The results tentatively suggest that acted expressions are more effectively communicated and that the functions of feedback to a greater extent constitute goal-based categories determined by ideals, and to a lesser extent a taxonomy determined by CT and FI. However, it is possible to automatically predict typicality with a correlation of r = 0.52 via the posterior.

  • 85.
    Neiberg, Daniel
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Exploring the implications for feedback of a neurocognitive theory of overlapped speech (2012). In: Proceedings of Workshop on Feedback Behaviors in Dialog, 2012, p. 57-60. Conference paper (Refereed)
  • 86.
    Neiberg, Daniel
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Modeling Conversational Interaction Using Coupled Markov Chains (2010). In: Proceedings of DiSS-LPSS Joint Workshop 2010, 2010. Conference paper (Refereed)
    Abstract [en]

    This paper presents a series of experiments on automatic transcription and classification of fillers and feedback in conversational speech corpora. A feature combination of PCA-projected normalized F0 Constant-Q Cepstra and MFCCs has been shown to be effective for standard Hidden Markov Models (HMMs). We demonstrate how to model both speaker channels with coupled HMMs and show the expected improvements. In particular, we explore model topologies which take advantage of predictive cues for fillers and feedback. This is done by initializing the training with special labels located immediately before fillers in the same channel and immediately before feedback in the other speaker channel. The average F-score for a standard HMM is 34.1%, for a coupled HMM 36.7% and for a coupled HMM with pre-filler and pre-feedback labels 40.4%. In a pilot study the detectors were found to be useful for semi-automatic transcription of feedback and fillers in socializing conversations.

  • 87.
    Neiberg, Daniel
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Predicting Speaker Changes and Listener Responses With And Without Eye-contact (2011). In: INTERSPEECH 2011, 12th Annual Conference of the International Speech Communication Association, Florence, Italy, 2011, p. 1576-1579. Conference paper (Refereed)
    Abstract [en]

    This paper compares turn-taking in terms of timing and prediction in human-human conversations under the conditions when participants have eye contact versus when there is no eye contact, as found in the HCRC Map Task corpus. By measuring between-speaker intervals it was found that a larger proportion of speaker shifts occurred in overlap for the no eye-contact condition. For prediction we used prosodic and spectral features parametrized by time-varying length-invariant discrete cosine coefficients. With Gaussian Mixture Modeling and variations of classifier fusion schemes, we explored the task of predicting whether there is an upcoming speaker change (SC) or not (HOLD) at the end of an utterance (EOU), with a pause lag of 200 ms. The label SC was further split into LRs (listener responses, e.g. back-channels) and other TURN-SHIFTs. The prediction was found to be somewhat easier for the eye-contact condition, for which the average recall rates were 60.57%, 66.35%, and 62.00% for TURN-SHIFTs, LR and SC respectively.
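
A minimal sketch of the kind of GMM-based prediction described above: fit one Gaussian mixture per outcome (speaker change vs. hold) on feature vectors computed at ends of utterances, then classify a new utterance ending by comparing log-likelihoods. The feature contents, mixture sizes and synthetic data are assumptions for illustration only.

```python
# Sketch: one GMM per class (speaker change vs. hold); classify an
# end-of-utterance feature vector by the higher log-likelihood.
# Illustrative only; features and settings are not from the paper.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Hypothetical prosodic/spectral feature vectors at utterance endings.
X_change = rng.normal(loc=-1.0, scale=1.0, size=(200, 4))  # speaker change
X_hold = rng.normal(loc=1.0, scale=1.0, size=(200, 4))     # same speaker holds

gmm_change = GaussianMixture(n_components=2, random_state=0).fit(X_change)
gmm_hold = GaussianMixture(n_components=2, random_state=0).fit(X_hold)

def predict_speaker_change(x):
    """Return True if the 'speaker change' GMM explains x better."""
    x = np.atleast_2d(x)
    return gmm_change.score_samples(x)[0] > gmm_hold.score_samples(x)[0]

print(predict_speaker_change(rng.normal(loc=-1.0, scale=1.0, size=4)))
```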

  • 88.
    Neiberg, Daniel
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Prosodic Characterization and Automatic Classification of Conversational Grunts in Swedish (2010). In: Working Papers 54: Proceedings from Fonetik 2010, 2010. Conference paper (Other academic)
    Abstract [en]

    Conversation is the most common use of speech. Any automatic dialog system pretending to mimic a human must be able to successfully detect typical sounds and meanings of spontaneous conversational speech. Automatic transcription of the function of linguistic units, sometimes referred to as Dialog Acts (DAs), Cue Phrases or Discourse Markers, is an emerging area of research. This can be done on a purely lexical level, by using prosody alone (Laskowski and Shriberg, 2010; Goto et al., 1999), or by a combination thereof (Sridhar et al., 2009; Gravano et al., 2007). However, it is not straightforward to train a language model for non-verbal content (e.g. “mm”, “mhm”, “eh”, “em”), not only since it is questionable whether these sounds are words, but also because of the lack of standardized annotation schemes. Ward (2000) refers to these tokens as conversational grunts, which is also the scope of this study. Feedback tokens are usually sub-divided into yes/no answers, backchannels and acknowledgments. In this study, it is the attitude of the response which is the focus of interest. Thus, the cut is instead made between dis-preference, news receiving and general feedback. These are further subdivided by their turn-taking effect: Other speaker, Same speaker and Simultaneous start. This allows us to verify whether conversational grunts are simply carriers of prosodic information. In this study, we use a supra-segmental prosodic signal representation based on Time Varying Constant-Q Cepstral Coefficients (TVCQCC), introduced in (Neiberg et al., 2010), for classification and intuitive visualization of feedback and fillers. The contribution of the end of the interlocutor's left context for predicting turn-taking effect has been studied for a while (Duncan, 1972) and is also addressed in this study. In addition, we examine the effect of contextual timing features, which have been shown to be useful in DA recognition (Laskowski and Shriberg, 2010). We use the Swedish DEAL corpus, which has annotated fillers and feedback attitudes. Classification results using linear discriminant analysis are presented. It was found that feedback tokens followed by a clean floor-taking lose some of the prosodic cues which signal attitude, compared to clean continuer feedback. Turn-taking effects can be predicted well over chance level, while Simultaneous Start cannot be predicted at all. However, feedback tokens before Simultaneous Starts were found to be more similar to feedback continuers than to turn-initial feedback tokens, which may be explained as inappropriate floor-stealing attempts by the feedback-producing speaker. An analysis based on the prototypical spectrograms closely follows the results for Bad News (Dispreference) vs Good News (News receiving) found in Freese and Maynard (1998), although the definitions differ slightly.

  • 89.
    Neiberg, Daniel
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    The Prosody of Swedish Conversational Grunts2010In: 11th Annual Conference of the International Speech Communication Association: Spoken Language Processing for All, INTERSPEECH 2010, 2010, p. 2562-2565Conference paper (Refereed)
    Abstract [en]

    This paper explores conversational grunts in a face-to-face setting. The study investigates the prosody and turn-taking effect of fillers and feedback tokens that have been annotated for attitude. The grunts were selected from the DEAL corpus and automatically annotated for their turn-taking effect. A novel supra-segmental prosodic signal representation and contextual timing features are used for classification and visualization. Classification results using linear discriminant analysis show that turn-initial feedback tokens lose some of their attitude-signaling prosodic cues compared to non-overlapping continuer feedback tokens. Turn-taking effects can be predicted well over chance level, except for Simultaneous starts. However, feedback tokens before places where both speakers take the turn were more similar to feedback continuers than to turn-initial feedback tokens.

  • 90.
    Neiberg, Daniel
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Towards letting machines humming in the right way: prosodic analysis of six functions of short feedback tokens in English2012In: Proceedings of Fonetik, 2012Conference paper (Other academic)
  • 91.
    Neiberg, Daniel
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Semi-supervised methods for exploring the acoustics of simple productive feedback2013In: Speech Communication, ISSN 0167-6393, E-ISSN 1872-7182, Vol. 55, no 3, p. 451-469Article in journal (Refereed)
    Abstract [en]

    This paper proposes methods for exploring acoustic correlates of feedback functions. A sub-language of Swedish, simple productive feedback, is introduced to facilitate investigations of the functional contributions of base tokens, phonological operations and prosody. The function of feedback is to convey the listener's attention, understanding and affective state. In order to handle the large number of possible affective states, the current study starts with a listening experiment in which humans annotated the functional similarity of feedback tokens with different prosodic realizations. By selecting a set of stimuli at different prosodic distances from a reference token, it was possible to compute a generalised functional distance measure. This measure was found to be correlated with prosodic distance, but the correlations varied as a function of base tokens and phonological operations. In a subsequent listening test, a small representative sample of feedback tokens was rated for understanding, agreement, interest, surprise and certainty. These ratings were found to explain a significant proportion of the generalised functional distance. By combining the acoustic analysis with an explorative visualisation of the prosody, we establish a map between human perception of similarity between feedback tokens, their measured distance in acoustic space, and the perceived function of feedback tokens with varying realisations.
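
    A minimal sketch, under stated assumptions, of how a perceptual (functional) distance could be related to a prosodic distance as described above: scipy's Spearman rank correlation is used, and both distance vectors are synthetic placeholders rather than the paper's measurements.

    import numpy as np
    from scipy.stats import spearmanr

    rng = np.random.default_rng(1)

    # Hypothetical prosodic distances of 20 stimuli from a reference token, and
    # the corresponding generalised functional distances from a listening test.
    prosodic_distance = rng.uniform(0.0, 1.0, size=20)
    functional_distance = prosodic_distance + rng.normal(scale=0.2, size=20)

    # Rank correlation between acoustic and perceptual distance.
    rho, p = spearmanr(prosodic_distance, functional_distance)
    print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")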

  • 92.
    Oertel, Catharine
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    David Lopes, José
    KTH, Language and Communication.
    Yu, Y.
    Funes, K.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Black, A.
    Odobez, J-M.
    Towards Building an Attentive Artificial Listener: On the Perception of Attentiveness in Audio-Visual Feedback Tokens2016In: Proceedings of the 18th ACM International Conference on Multimodal Interaction (ICMI 2016), Tokyo, Japan, 2016Conference paper (Refereed)
  • 93.
    Oertel, Catharine
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Funes, K.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Odobez, J-M.
    Deciphering the Silent Participant: On the Use of Audio-Visual Cues for the Classification of Listener Categories in Group Discussions2015In: Proccedings of ICMI 2015, ACM Digital Library, 2015Conference paper (Refereed)
    Abstract [en]

    Estimating a silent participant's degree of engagement and role within a group discussion can be challenging, as no speech-related cues are available at the given time. Having this information, however, can provide important insights into the dynamics of the group as a whole. In this paper, we study the classification of listeners into several categories (attentive listener, side participant and bystander). We devised a thin-sliced perception test in which subjects were asked to assess listener roles and engagement levels in 15-second video clips taken from a corpus of group interviews. The results show that humans are usually able to assess silent participants' roles. Using these annotations and a set of multimodal low-level features, such as past speaking activity, backchannels (both visual and verbal) and gaze patterns, we could identify the features that distinguish between the different listener categories. Moreover, the results show that many of the audio-visual effects observed on listeners in dyadic interactions also hold for multi-party interactions. A preliminary classifier achieves an accuracy of 64%.
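
    A minimal sketch, not the study's implementation, of classifying silent participants into listener categories from low-level multimodal features such as past speaking activity, backchannel counts and gaze; the feature values, label names and classifier choice (logistic regression) are all illustrative assumptions.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(2)
    n = 200

    # Columns: past speaking time (s), visual backchannels, verbal backchannels,
    # proportion of time gazing at the current speaker (all synthetic).
    X = np.column_stack([
        rng.uniform(0, 10, n),
        rng.poisson(2, n),
        rng.poisson(1, n),
        rng.uniform(0, 1, n),
    ])
    # Illustrative labels: 0 = attentive listener, 1 = side participant, 2 = bystander.
    y = rng.integers(0, 3, size=n)

    clf = LogisticRegression(max_iter=1000)
    print(f"mean cross-validated accuracy: {cross_val_score(clf, X, y, cv=5).mean():.2f}")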

  • 94.
    Oertel, Catharine
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Funes, K.
    Sheiki, S.
    Odobez, J-M.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Who will get the grant?: A multimodal corpus for the analysis of conversational behaviours in group interviews2014In: UM3I 2014 - Proceedings of the 2014 ACM Workshop on Understanding and Modeling Multiparty, Multimodal Interactions, Co-located with ICMI 2014, Association for Computing Machinery (ACM), 2014, p. 27-32Conference paper (Refereed)
    Abstract [en]

    In the last couple of years, more and more multimodal corpora have been created, and many of these have recently also included RGB-D sensor data. However, there is, to our knowledge, no publicly available corpus that combines accurate gaze tracking and high-quality audio recording for group discussions of varying dynamics. With a corpus that fulfils these needs, it would be possible to investigate higher-level constructs such as group involvement, individual engagement or rapport, all of which require multimodal feature extraction. In this paper we describe the design and recording of such a corpus, and we provide some illustrative examples of how it might be exploited in the study of group dynamics.

  • 95.
    Oertel, Catharine
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Black, A.
    On Data Driven Parametric Backchannel Synthesis for Expressing Attentiveness in Conversational Agents2016In: Proceedings of Multimodal Analyses enabling Artificial Agents in Human­-Machine Interaction (MA3HMI), satellite workshop of ICMI 2016, 2016Conference paper (Refereed)
    Abstract [en]

    In this study, we use a multi-party recording as a template for building a parametric speech synthesiser that is able to express different levels of attentiveness in backchannel tokens. This allowed us to investigate i) whether it is possible to express the same perceived level of attentiveness in synthesised backchannels as in natural ones, and ii) whether it is possible to increase and decrease the perceived level of attentiveness of backchannels beyond the range observed in the original corpus.

  • 96.
    Oertel, Catharine
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Black, A.
    Towards Building an Attentive Artificial Listener: On the Perception of Attentiveness in Feedback Utterances2016In: Proceedings of Interspeech 2016, San Fransisco, USA, 2016Conference paper (Refereed)
  • 97.
    Oertel, Catharine
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Jonell, Patrik
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Haddad, K. E.
    Szekely, Eva
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Using crowd-sourcing for the design of listening agents: Challenges and opportunities2017In: ISIAA 2017 - Proceedings of the 1st ACM SIGCHI International Workshop on Investigating Social Interactions with Artificial Agents, Co-located with ICMI 2017, Association for Computing Machinery (ACM), 2017, p. 37-38Conference paper (Refereed)
    Abstract [en]

    In this paper we describe how audio-visual corpus recordings collected with crowd-sourcing techniques can be used for the audio-visual synthesis of attitudinal non-verbal feedback expressions for virtual agents. We discuss the limitations of this approach as well as the opportunities we see for this technology.

  • 98.
    Oertel, Catharine
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Götze, Jana
    KTH, School of Computer Science and Communication (CSC), Theoretical Computer Science, TCS.
    Edlund, Jens
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Heldner, Mattias
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    The KTH Games Corpora: How to Catch a Werewolf2013In: IVA 2013 Workshop Multimodal Corpora: Beyond Audio and Video: MMC 2013, 2013Conference paper (Refereed)
  • 99.
    Oertel, Catharine
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Wlodarczak, Marcin
    Edlund, Jens
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Wagner, Petra
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Gaze Patterns in Turn-Taking2012In: 13th Annual Conference of the International Speech Communication Association 2012, INTERSPEECH 2012, Vol 3, Portland, Oregon, US, 2012, p. 2243-2246Conference paper (Refereed)
    Abstract [en]

    This paper investigates gaze patterns in turn-taking. We focus on differences between speaker changes resulting in silences and those resulting in overlaps. We also investigate gaze patterns around backchannels and around silences not involving speaker changes.

  • 100. Schötz, S.
    et al.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Bruce, G.
    Granström, Björn
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Simulating Intonation in Regional Varieties of Swedish2010In: Speech Prosody 2010, Chicago, USA, 2010Conference paper (Refereed)
    Abstract [en]

    Within the research project SIMULEKT (Simulating Intonational Varieties of Swedish), our recent work includes two approaches to simulating intonation in regional varieties of Swedish. The first involves a method for modelling intonation using the SWING (SWedish INtonation Generator) tool, in which annotated speech samples are resynthesised with rule-based intonation and analysed audio-visually with regard to the major intonational varieties of Swedish. The second approach concerns a method for simulating dialects with HMM synthesis, in which speech is generated from emphasis-tagged text. We consider both approaches important in our aim to test and further develop the Swedish prosody model, as well as to convincingly simulate Swedish regional varieties using speech synthesis. Index Terms: Swedish dialects, prosody, speech synthesis.
