kth.se Publications: search hits 1 - 50 of 460
  • 1.
    Abdelnour, Jerome
    et al.
    NECOTIS Dept. of Electrical and Computer Engineering, Sherbrooke University, Canada.
    Rouat, Jean
    NECOTIS Dept. of Electrical and Computer Engineering, Sherbrooke University, Canada.
    Salvi, Giampiero
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH. Department of Electronic Systems, Norwegian University of Science and Technology, Norway.
    NAAQA: A Neural Architecture for Acoustic Question Answering. 2022. In: IEEE Transactions on Pattern Analysis and Machine Intelligence, ISSN 0162-8828, E-ISSN 1939-3539, pp. 1-12. Article in journal (Refereed)
    Download full text (pdf)
    fulltext
  • 2.
    Abelho Pereira, André Tiago
    et al.
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH, Tal-kommunikation.
    Oertel, Catharine
    TU Delft Delft, Netherlands.
    Fermoselle, Leonor
    TNO Den Haag, Netherlands.
    Mendelson, Joe
    Furhat Robotics Stockholm, Sweden.
    Gustafson, Joakim
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Effects of Different Interaction Contexts when Evaluating Gaze Models in HRI. 2020. Conference paper (Refereed)
    Abstract [en]

    We previously introduced a responsive joint attention system that uses multimodal information from users engaged in a spatial reasoning task with a robot and communicates joint attention via the robot's gaze behavior [25]. An initial evaluation of our system with adults showed it to improve users' perceptions of the robot's social presence. To investigate the repeatability of our prior findings across settings and populations, here we conducted two further studies employing the same gaze system with the same robot and task but in different contexts: evaluation of the system with external observers and evaluation with children. The external observer study suggests that third-person perspectives over videos of gaze manipulations can be used either as a manipulation check before committing to costly real-time experiments or to further establish previous findings. However, the replication of our original adults study with children in school did not confirm the effectiveness of our gaze manipulation, suggesting that different interaction contexts can affect the generalizability of results in human-robot interaction gaze studies.

    Download full text (pdf)
    fulltext
  • 3.
    Abelho Pereira, André Tiago
    et al.
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH, Tal-kommunikation.
    Oertel, Catharine
    Computer-Human Interaction Lab for Learning & Instruction Ecole Polytechnique Federale de Lausanne, Switzerland..
    Fermoselle, Leonor
    Mendelson, Joe
    Gustafson, Joakim
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Responsive Joint Attention in Human-Robot Interaction. 2019. In: Proceedings 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2019, Institute of Electrical and Electronics Engineers (IEEE), 2019, pp. 1080-1087. Conference paper (Refereed)
    Abstract [en]

    Joint attention has been shown to be not only crucial for human-human interaction but also human-robot interaction. Joint attention can help to make cooperation more efficient, support disambiguation in instances of uncertainty and make interactions appear more natural and familiar. In this paper, we present an autonomous gaze system that uses multimodal perception capabilities to model responsive joint attention mechanisms. We investigate the effects of our system on people’s perception of a robot within a problem-solving task. Results from a user study suggest that responsive joint attention mechanisms evoke higher perceived feelings of social presence on scales that regard the direction of the robot’s perception.

    Download full text (pdf)
    fulltext
  • 4. Adiban, M.
    et al.
    Safari, A.
    Salvi, Giampiero
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Step-gan: A one-class anomaly detection model with applications to power system security. 2021. In: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, Institute of Electrical and Electronics Engineers (IEEE), 2021, pp. 2605-2609. Conference paper (Refereed)
    Abstract [en]

    Smart grid systems (SGSs), and in particular power systems, play a vital role in today's urban life. The security of these grids is now threatened by adversaries that use false data injection (FDI) to produce a breach of availability, integrity, or confidential principles of the system. We propose a novel structure for the multigenerator generative adversarial network (GAN) to address the challenges of detecting adversarial attacks. We modify the GAN objective function and the training procedure for the malicious anomaly detection task. The model only requires normal operation data to be trained, making it cheaper to deploy and robust against unseen attacks. Moreover, the model operates on the raw input data, eliminating the need for feature extraction. We show that the model reduces the well-known mode collapse problem of GAN-based systems, it has low computational complexity and considerably outperforms the baseline system (OCAN) with about 55% in terms of accuracy on a freely available cyber attack dataset.

  • 5. Adiban, Mohammad
    et al.
    Siniscalchi, Marco
    Stefanov, Kalin
    Salvi, Giampiero
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH, Tal-kommunikation. Norwegian University of Science and Technology Trondheim, Norway.
    Hierarchical Residual Learning Based Vector Quantized Variational Autoencoder for Image Reconstruction and Generation. 2022. In: The 33rd British Machine Vision Conference Proceedings, 2022. Conference paper (Refereed)
    Abstract [en]

    We propose a multi-layer variational autoencoder method, we call HR-VQVAE, that learns hierarchical discrete representations of the data. By utilizing a novel objective function, each layer in HR-VQVAE learns a discrete representation of the residual from previous layers through a vector quantized encoder. Furthermore, the representations at each layer are hierarchically linked to those at previous layers. We evaluate our method on the tasks of image reconstruction and generation. Experimental results demonstrate that the discrete representations learned by HR-VQVAE enable the decoder to reconstruct high-quality images with less distortion than the baseline methods, namely VQVAE and VQVAE-2. HR-VQVAE can also generate high-quality and diverse images that outperform state-of-the-art generative models, providing further verification of the efficiency of the learned representations. The hierarchical nature of HR-VQVAE i) reduces the decoding search time, making the method particularly suitable for high-load tasks and ii) allows to increase the codebook size without incurring the codebook collapse problem.

    Download full text (pdf)
    fulltext
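    A minimal numpy sketch of the residual-quantisation idea described in the abstract above: each layer quantises the residual left by the layers before it. The random codebooks and toy encoder output are stand-ins for illustration only, not the learned HR-VQVAE model.

    import numpy as np

    rng = np.random.default_rng(0)
    n_layers, codebook_size, dim = 3, 32, 8
    codebooks = [rng.normal(size=(codebook_size, dim)) for _ in range(n_layers)]

    def hierarchical_quantize(z, codebooks):
        """Each layer encodes the residual left over by the layers above it."""
        residual, reconstruction, indices = z, np.zeros_like(z), []
        for codebook in codebooks:
            idx = int(np.argmin(np.linalg.norm(codebook - residual, axis=1)))  # nearest codeword
            indices.append(idx)
            reconstruction = reconstruction + codebook[idx]
            residual = residual - codebook[idx]
        return reconstruction, indices

    z = rng.normal(size=dim)                 # stand-in for an encoder output
    recon, idx = hierarchical_quantize(z, codebooks)
    print(idx, np.linalg.norm(z - recon))    # with trained codebooks this error shrinks per layer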
  • 6.
    Adiban, Mohammad
    et al.
    NTNU, Dept Elect Syst, Trondheim, Norway.;Monash Univ, Dept Human Centred Comp, Melbourne, Australia..
    Siniscalchi, Sabato Marco
    NTNU, Dept Elect Syst, Trondheim, Norway..
    Salvi, Giampiero
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH. NTNU, Dept Elect Syst, Trondheim, Norway..
    A step-by-step training method for multi generator GANs with application to anomaly detection and cybersecurity. 2023. In: Neurocomputing, ISSN 0925-2312, E-ISSN 1872-8286, Vol. 537, pp. 296-308. Article in journal (Refereed)
    Abstract [en]

    Cyber attacks and anomaly detection are problems where the data is often highly unbalanced towards normal observations. Furthermore, the anomalies observed in real applications may be significantly different from the ones contained in the training data. It is, therefore, desirable to study methods that are able to detect anomalies only based on the distribution of the normal data. To address this problem, we propose a novel objective function for generative adversarial networks (GANs), referred to as STEPGAN. STEP-GAN simulates the distribution of possible anomalies by learning a modified version of the distribution of the task-specific normal data. It leverages multiple generators in a step-by-step interaction with a discriminator in order to capture different modes in the data distribution. The discriminator is optimized to distinguish not only between normal data and anomalies but also between the different generators, thus encouraging each generator to model a different mode in the distribution. This reduces the well-known mode collapse problem in GAN models considerably. We tested our method in the areas of power systems and network traffic control systems (NTCSs) using two publicly available highly imbalanced datasets, ICS (Industrial Control System) security dataset and UNSW-NB15, respectively. In both application domains, STEP-GAN outperforms the state-of-the-art systems as well as the two baseline systems we implemented as a comparison. In order to assess the generality of our model, additional experiments were carried out on seven real-world numerical datasets for anomaly detection in a variety of domains. In all datasets, the number of normal samples is significantly more than that of abnormal samples. Experimental results show that STEP-GAN outperforms several semi-supervised methods while being competitive with supervised methods.

  • 7.
    Ahlberg, Sofie
    et al.
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Reglerteknik.
    Axelsson, Agnes
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Yu, Pian
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Reglerteknik.
    Shaw Cortez, Wenceslao E.
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Reglerteknik.
    Gao, Yuan
    Uppsala Univ, Dept Informat Technol, Uppsala, Sweden.;Shenzhen Inst Artificial Intelligence & Robot Soc, Ctr Intelligent Robots, Shenzhen, Peoples R China..
    Ghadirzadeh, Ali
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Robotik, perception och lärande, RPL.
    Castellano, Ginevra
    Uppsala Univ, Dept Informat Technol, Uppsala, Sweden..
    Kragic, Danica
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Robotik, perception och lärande, RPL.
    Skantze, Gabriel
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Dimarogonas, Dimos V.
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Reglerteknik.
    Co-adaptive Human-Robot Cooperation: Summary and Challenges. 2022. In: Unmanned Systems, ISSN 2301-3850, E-ISSN 2301-3869, Vol. 10, no. 02, pp. 187-203. Article in journal (Refereed)
    Abstract [en]

    The work presented here is a culmination of developments within the Swedish project COIN: Co-adaptive human-robot interactive systems, funded by the Swedish Foundation for Strategic Research (SSF), which addresses a unified framework for co-adaptive methodologies in human-robot co-existence. We investigate co-adaptation in the context of safe planning/control, trust, and multi-modal human-robot interactions, and present novel methods that allow humans and robots to adapt to one another and discuss directions for future work.

  • 8.
    Alexanderson, Simon
    et al.
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Henter, Gustav Eje
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Robust model training and generalisation with Studentising flows. 2020. In: Proceedings of the ICML Workshop on Invertible Neural Networks, Normalizing Flows, and Explicit Likelihood Models / [ed] Chin-Wei Huang, David Krueger, Rianne van den Berg, George Papamakarios, Chris Cremer, Ricky Chen, Danilo Rezende, 2020, Vol. 2, pp. 25:1-25:9, article id 25. Conference paper (Refereed)
    Abstract [en]

    Normalising flows are tractable probabilistic models that leverage the power of deep learning to describe a wide parametric family of distributions, all while remaining trainable using maximum likelihood. We discuss how these methods can be further improved based on insights from robust (in particular, resistant) statistics. Specifically, we propose to endow flow-based models with fat-tailed latent distributions such as multivariate Student's t, as a simple drop-in replacement for the Gaussian distribution used by conventional normalising flows. While robustness brings many advantages, this paper explores two of them: 1) We describe how using fatter-tailed base distributions can give benefits similar to gradient clipping, but without compromising the asymptotic consistency of the method. 2) We also discuss how robust ideas lead to models with reduced generalisation gap and improved held-out data likelihood. Experiments on several different datasets confirm the efficacy of the proposed approach in both regards.

    Download full text (pdf)
    alexanderson2020robust
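    A minimal sketch of the core idea in the abstract above: replacing the Gaussian base distribution of a normalising flow with a fat-tailed Student's t. The single affine flow layer and the numbers are illustrative assumptions, not the authors' implementation.

    import numpy as np
    from scipy.stats import norm, t

    # Toy invertible transform: x = exp(a) * z + b, so z = (x - b) * exp(-a)
    a, b = 0.3, -1.0

    def flow_log_likelihood(x, base_logpdf):
        """Change-of-variables log-likelihood for the affine flow above."""
        z = (x - b) * np.exp(-a)
        log_det = -a                          # log|dz/dx| = -a for this transform
        return base_logpdf(z) + log_det

    x = np.array([0.5, 2.0, 25.0])            # the last point acts as an outlier

    ll_gaussian = flow_log_likelihood(x, norm.logpdf)                  # conventional flow
    ll_student = flow_log_likelihood(x, lambda z: t.logpdf(z, df=3))   # Studentised flow

    # The Student's t base assigns the outlier a far less extreme log-likelihood,
    # which is what keeps gradients bounded during maximum-likelihood training.
    print(ll_gaussian, ll_student)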
  • 9.
    Alexanderson, Simon
    et al.
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Henter, Gustav Eje
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Kucherenko, Taras
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Robotik, perception och lärande, RPL.
    Beskow, Jonas
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Style-Controllable Speech-Driven Gesture Synthesis Using Normalising Flows. 2020. In: Computer graphics forum (Print), ISSN 0167-7055, E-ISSN 1467-8659, Vol. 39, no. 2, pp. 487-496. Article in journal (Refereed)
    Abstract [en]

    Automatic synthesis of realistic gestures promises to transform the fields of animation, avatars and communicative agents. In off-line applications, novel tools can alter the role of an animator to that of a director, who provides only high-level input for the desired animation; a learned network then translates these instructions into an appropriate sequence of body poses. In interactive scenarios, systems for generating natural animations on the fly are key to achieving believable and relatable characters. In this paper we address some of the core issues towards these ends. By adapting a deep learning-based motion synthesis method called MoGlow, we propose a new generative model for generating state-of-the-art realistic speech-driven gesticulation. Owing to the probabilistic nature of the approach, our model can produce a battery of different, yet plausible, gestures given the same input speech signal. Just like humans, this gives a rich natural variation of motion. We additionally demonstrate the ability to exert directorial control over the output style, such as gesture level, speed, symmetry and spacial extent. Such control can be leveraged to convey a desired character personality or mood. We achieve all this without any manual annotation of the data. User studies evaluating upper-body gesticulation confirm that the generated motions are natural and well match the input speech. Our method scores above all prior systems and baselines on these measures, and comes close to the ratings of the original recorded motions. We furthermore find that we can accurately control gesticulation styles without unnecessarily compromising perceived naturalness. Finally, we also demonstrate an application of the same method to full-body gesticulation, including the synthesis of stepping motion and stance.

    Download full text (pdf)
    fulltext
    Download full text (pdf)
    erratum
  • 10.
    Alexanderson, Simon
    et al.
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH. Motorica AB, Sweden.
    Nagy, Rajmund
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Beskow, Jonas
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Henter, Gustav Eje
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH. Motorica AB, Sweden.
    Listen, Denoise, Action! Audio-Driven Motion Synthesis with Diffusion Models. 2023. In: ACM Transactions on Graphics, ISSN 0730-0301, E-ISSN 1557-7368, Vol. 42, no. 4, article id 44. Article in journal (Refereed)
    Abstract [en]

    Diffusion models have experienced a surge of interest as highly expressive yet efficiently trainable probabilistic models. We show that these models are an excellent fit for synthesising human motion that co-occurs with audio, e.g., dancing and co-speech gesticulation, since motion is complex and highly ambiguous given audio, calling for a probabilistic description. Specifically, we adapt the DiffWave architecture to model 3D pose sequences, putting Conformers in place of dilated convolutions for improved modelling power. We also demonstrate control over motion style, using classifier-free guidance to adjust the strength of the stylistic expression. Experiments on gesture and dance generation confirm that the proposed method achieves top-of-the-line motion quality, with distinctive styles whose expression can be made more or less pronounced. We also synthesise path-driven locomotion using the same model architecture. Finally, we generalise the guidance procedure to obtain product-of-expert ensembles of diffusion models and demonstrate how these may be used for, e.g., style interpolation, a contribution we believe is of independent interest.
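    The abstract above mentions classifier-free guidance for adjusting the strength of the stylistic expression. A minimal sketch of how such guidance blends conditional and unconditional denoiser outputs at each diffusion step; the denoiser stub and its arguments are assumptions for illustration, not the paper's model.

    import numpy as np

    def guided_prediction(denoiser, x_t, t, style, guidance_scale):
        """Classifier-free guidance: interpolate/extrapolate between the
        unconditional and conditional predictions of the same network."""
        eps_uncond = denoiser(x_t, t, cond=None)      # conditioning dropped
        eps_cond = denoiser(x_t, t, cond=style)       # conditioning present
        return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

    # Toy stand-in denoiser; guidance_scale > 1 exaggerates the style conditioning.
    def toy_denoiser(x_t, t, cond=None):
        return 0.1 * x_t + (0.0 if cond is None else 0.05 * cond)

    x_t = np.ones(3)
    print(guided_prediction(toy_denoiser, x_t, t=10, style=np.ones(3), guidance_scale=2.0))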

  • 11.
    Alexanderson, Simon
    et al.
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Székely, Éva
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Henter, Gustav Eje
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH. KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Robotik, perception och lärande, RPL.
    Kucherenko, Taras
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Robotik, perception och lärande, RPL.
    Beskow, Jonas
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Generating coherent spontaneous speech and gesture from text. 2020. In: Proceedings of the 20th ACM International Conference on Intelligent Virtual Agents, IVA 2020, Association for Computing Machinery (ACM), 2020. Conference paper (Refereed)
    Abstract [en]

    Embodied human communication encompasses both verbal (speech) and non-verbal information (e.g., gesture and head movements). Recent advances in machine learning have substantially improved the technologies for generating synthetic versions of both of these types of data: On the speech side, text-to-speech systems are now able to generate highly convincing, spontaneous-sounding speech using unscripted speech audio as the source material. On the motion side, probabilistic motion-generation methods can now synthesise vivid and lifelike speech-driven 3D gesticulation. In this paper, we put these two state-of-the-art technologies together in a coherent fashion for the first time. Concretely, we demonstrate a proof-of-concept system trained on a single-speaker audio and motion-capture dataset, that is able to generate both speech and full-body gestures together from text input. In contrast to previous approaches for joint speech-and-gesture generation, we generate full-body gestures from speech synthesis trained on recordings of spontaneous speech from the same person as the motion-capture data. We illustrate our results by visualising gesture spaces and text-speech-gesture alignments, and through a demonstration video.

  • 12. Ambrazaitis, G.
    et al.
    Frid, J.
    House, David
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Word prominence ratings in Swedish television news readings: Effects of pitch accents and head movements. 2020. In: Proceedings of the International Conference on Speech Prosody, International Speech Communication Association, 2020, Vol. 2020, pp. 314-318. Conference paper (Refereed)
    Abstract [en]

    Prosodic prominence is a multimodal phenomenon where pitch accents are frequently aligned with visible movements by the hands, head, or eyebrows. However, little is known about how such movements function as visible prominence cues in multimodal speech perception with most previous studies being restricted to experimental settings. In this study, we are piloting the acquisition of multimodal prominence ratings for a corpus of natural speech (Swedish television news readings). Sixteen short video clips (218 words) of news readings were extracted from a larger corpus and rated by 44 native Swedish adult volunteers using a web-based set-up. The task was to rate each word in a clip as either non-prominent, moderately prominent or strongly prominent based on audiovisual cues. The corpus was previously annotated for pitch accents and head movements. We found that words realized with a pitch accent and head movement tended to receive higher prominence ratings than words with a pitch accent only. However, we also examined ratings for a number of carefully selected individual words, and these case studies suggest that ratings are affected by complex relations between the presence of a head movement and its type of alignment, the word's F0 profile, and semantic and pragmatic factors.

  • 13. Ambrazaitis, G.
    et al.
    House, David
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Multimodal prominences: Exploring the patterning and usage of focal pitch accents, head beats and eyebrow beats in Swedish television news readings. 2017. In: Speech Communication, ISSN 0167-6393, E-ISSN 1872-7182, Vol. 95, pp. 100-113. Article in journal (Refereed)
    Abstract [en]

    Facial beat gestures align with pitch accents in speech, functioning as visual prominence markers. However, it is not yet well understood whether and how gestures and pitch accents might be combined to create different types of multimodal prominence, and how specifically visual prominence cues are used in spoken communication. In this study, we explore the use and possible interaction of eyebrow (EB) and head (HB) beats with so-called focal pitch accents (FA) in a corpus of 31 brief news readings from Swedish television (four news anchors, 986 words in total), focusing on effects of position in text, information structure as well as speaker expressivity. Results reveal an inventory of four primary (combinations of) prominence markers in the corpus: FA+HB+EB, FA+HB, FA only (i.e., no gesture), and HB only, implying that eyebrow beats tend to occur only in combination with the other two markers. In addition, head beats occur significantly more frequently in the second than in the first part of a news reading. A functional analysis of the data suggests that the distribution of head beats might to some degree be governed by information structure, as the text-initial clause often defines a common ground or presents the theme of the news story. In the rheme part of the news story, FA, HB, and FA+HB are all common prominence markers. The choice between them is subject to variation which we suggest might represent a degree of freedom for the speaker to use the markers expressively. A second main observation concerns eyebrow beats, which seem to be used mainly as a kind of intensification marker for highlighting not only contrast, but also value, magnitude, or emotionally loaded words; it is applicable in any position in a text. We thus observe largely different patterns of occurrence and usage of head beats on the one hand and eyebrow beats on the other, suggesting that the two represent two separate modalities of visual prominence cuing.

  • 14.
    Ambrazaitis, Gilbert
    et al.
    Linnaeus University, Växjö, Sweden.
    Frid, Johan
    Lund University Humanities Lab, Sweden.
    House, David
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Auditory vs. audiovisual prominence ratings of speech involving spontaneously produced head movements. 2022. In: Proceedings of the 11th International Conference on Speech Prosody, Speech Prosody 2022, International Speech Communication Association, 2022, pp. 352-356. Conference paper (Refereed)
    Abstract [en]

    Visual information can be integrated in prominence perception, but most available evidence stems from controlled experimental settings, often involving synthetic stimuli. The present study provides evidence from spontaneously produced head gestures that occurred in Swedish television news readings. Sixteen short clips (containing 218 words in total) were rated for word prominence by 85 adult volunteers in a between-subjects design (44 in an audio-visual vs. 41 in an audio-only condition) using a web-based rating task. As an initial test of overall rating behavior, average prominence across all 218 words was compared between the two conditions, revealing no significant difference. In a second step, we compared normalized prominence ratings between the two conditions for all 218 words individually. These results displayed significant (or near significant, p<.08) differences for 28 out of 218 words, with higher ratings in either the audiovisual (13 words) or the audio-only-condition (15 words). A detailed examination revealed that the presence of head movements (previously annotated) can boost prominence ratings in the audiovisual condition, while words with low prominence tend to be rated slightly higher in the audio-only condition. The study suggests that visual prominence signals are integrated in speech processing even in a relatively uncontrolled, naturalistic setting.

  • 15.
    Ambrazaitis, Gilbert
    et al.
    Centre for Languages and Literature, Lund University, Sweden.
    House, David
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Acoustic features of multimodal prominences: Do visual beat gestures affect verbal pitch accent realization? 2017. In: Proceedings 14th International Conference on Auditory-Visual Speech Processing, AVSP 2017, International Speech Communication Association, 2017, pp. 89-94. Conference paper (Refereed)
    Abstract [en]

    The interplay of verbal and visual prominence cues has attracted recent attention, but previous findings are inconclusive as to whether and how the two modalities are integrated in the production and perception of prominence. In particular, we do not know whether the phonetic realization of pitch accents is influenced by co-speech beat gestures, and previous findings seem to generate different predictions. In this study, we investigate acoustic properties of prominent words as a function of visual beat gestures in a corpus of read news from Swedish television. The corpus was annotated for head and eyebrow beats as well as sentence-level pitch accents. Four types of prominence cues occurred particularly frequently in the corpus: (1) pitch accent only, (2) pitch accent plus head, (3) pitch accent plus head plus eyebrows, and (4) head only. The results show that (4) differs from (1-3) in terms of a smaller pitch excursion and shorter syllable duration. They also reveal significantly larger pitch excursions in (2) than in (1), suggesting that the realization of a pitch accent is to some extent influenced by the presence of visual prominence cues. Results are discussed in terms of the interaction between beat gestures and prosody with a potential functional difference between head and eyebrow beats.

  • 16.
    Ambrazaitis, Gilbert
    et al.
    Linnaeus Univ, Dept Swedish, Växjö, Sweden..
    House, David
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Probing effects of lexical prosody on speech-gesture integration in prominence production by Swedish news presenters. 2022. In: Laboratory Phonology, ISSN 1868-6346, Vol. 13, no. 1, pp. 1-35. Article in journal (Refereed)
    Abstract [en]

    This study investigates the multimodal implementation of prosodic-phonological categories, asking whether the accentual fall and the following rise in the Swedish word accents (Accent 1, Accent 2) are varied as a function of accompanying head and eyebrow gestures. Our purpose is to evaluate the hypothesis that prominence production displays a cumulative relation between acoustic and kinematic dimensions of spoken language, especially focusing on the clustering of gestures (head, eyebrows), at the same time asking if lexical-prosodic features would interfere with this cumulative relation. Our materials comprise 12 minutes of speech from Swedish television news presentations. The results reveal a significant trend for larger fo rises when a head movement accompanies the accented word, and even larger when an additional eyebrow movement is present. This trend is observed for accentual rises that encode phrase-level prominence, but not for accentual falls that are primarily related to lexical prosody. Moreover, the trend is manifested differently in different lexical-prosodic categories (Accent 1 versus Accent 2 with one versus two lexical stresses). The study provides novel support for a cumulative-cue hypothesis and the assumption that prominence production is essentially multimodal, well in line with the idea of speech and gesture as an integrated system.

  • 17.
    Amerotti, Marco
    et al.
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Benford, Steve
    University of Nottingham.
    Sturm, Bob
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Vear, Craig
    University of Nottingham.
    A Live Performance Rule System Informed by Irish Traditional Dance Music. 2023. In: Proc. International Symposium on Computer Music Multidisciplinary Research, 2023. Conference paper (Refereed)
    Abstract [en]

    This paper describes ongoing work in programming a live performance system for interpreting melodies in ways that mimic Irish traditional dance music practice, and that allows plug and play human interaction. Existing performance systems are almost exclusively aimed at piano performance and classical music, and none are aimed specifically at traditional music. We develop a rule-based approach using expert knowledge that converts a melody into control parameters to synthesize an expressive MIDI performance, focusing on ornamentation, dynamics and subtle time deviation. Furthermore, we make the system controllable (e.g., via knobs or expression pedals) such that it can be controlled in real time by a musician. Our preliminary evaluations show the system can render expressive performances mimicking traditional practice, and allows for engaging with Irish traditional dance music in new ways. We provide several examples online.

    Download full text (pdf)
    fulltext
  • 18.
    Ardal, Dui
    et al.
    KTH.
    Alexanderson, Simon
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Lempert, Mirko
    Abelho Pereira, André Tiago
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    A Collaborative Previsualization Tool for Filmmaking in Virtual Reality. 2019. In: Proceedings - CVMP 2019: 16th ACM SIGGRAPH European Conference on Visual Media Production, ACM Digital Library, 2019. Conference paper (Refereed)
    Abstract [en]

    Previsualization is a process within pre-production of filmmaking where filmmakers can visually plan specific scenes with camera works, lighting, character movements, etc. The costs of computer graphics-based effects are substantial within film production. Using previsualization, these scenes can be planned in detail to reduce the amount of work put on effects in the later production phase. We develop and assess a prototype for previsualization in virtual reality for collaborative purposes where multiple filmmakers can be present in a virtual environment to share a creative work experience, remotely. By performing a within-group study on 20 filmmakers, our findings show that the use of virtual reality for distributed, collaborative previsualization processes is useful for real-life pre-production purposes.

    Download full text (pdf)
    Previs
  • 19. Arnela, Marc
    et al.
    Dabbaghchian, Saeed
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Guasch, Oriol
    Engwall, Olov
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH, Tal-kommunikation.
    MRI-based vocal tract representations for the three-dimensional finite element synthesis of diphthongs. 2019. In: IEEE Transactions on Audio, Speech, and Language Processing, ISSN 1558-7916, E-ISSN 1558-7924, Vol. 27, no. 12, pp. 2173-2182. Article in journal (Refereed)
    Abstract [en]

    The synthesis of diphthongs in three-dimensions (3D) involves the simulation of acoustic waves propagating through a complex 3D vocal tract geometry that deforms over time. Accurate 3D vocal tract geometries can be extracted from Magnetic Resonance Imaging (MRI), but due to long acquisition times, only static sounds can be currently studied with an adequate spatial resolution. In this work, 3D dynamic vocal tract representations are built to generate diphthongs, based on a set of cross-sections extracted from MRI-based vocal tract geometries of static vowel sounds. A diphthong can then be easily generated by interpolating the location, orientation and shape of these cross-sections, thus avoiding the interpolation of full 3D geometries. Two options are explored to extract the cross-sections. The first one is based on an adaptive grid (AG), which extracts the cross-sections perpendicular to the vocal tract midline, whereas the second one resorts to a semi-polar grid (SPG) strategy, which fixes the cross-section orientations. The finite element method (FEM) has been used to solve the mixed wave equation and synthesize diphthongs [ɑi] and [ɑu] in the dynamic 3D vocal tracts. The outputs from a 1D acoustic model based on the Transfer Matrix Method have also been included for comparison. The results show that the SPG and AG provide very close solutions in 3D, whereas significant differences are observed when using them in 1D. The SPG dynamic vocal tract representation is recommended for 3D simulations because it helps to prevent the collision of adjacent cross-sections.

    Download full text (pdf)
    fulltext
  • 20.
    Ashkenazi, Shaul
    et al.
    University of Glasgow Glasgow, UK.
    Skantze, Gabriel
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Stuart-Smith, Jane
    University of Glasgow Glasgow, UK.
    Foster, Mary Ellen
    University of Glasgow Glasgow, UK.
    Goes to the Heart: Speaking the User's Native Language. 2024. In: HRI 2024 Companion - Companion of the 2024 ACM/IEEE International Conference on Human-Robot Interaction, Association for Computing Machinery (ACM), 2024, pp. 214-218. Conference paper (Refereed)
    Abstract [en]

    We are developing a social robot to work alongside human support workers who help new arrivals in a country to navigate the necessary bureaucratic processes in that country. The ultimate goal is to develop a robot that can support refugees and asylum seekers in the UK. As a first step, we are targeting a less vulnerable population with similar support needs: international students in the University of Glasgow. As the target users are in a new country and may be in a state of stress when they seek support, forcing them to communicate in a foreign language will only fuel their anxiety, so a crucial aspect of the robot design is that it should speak the users' native language if at all possible. We provide a technical description of the robot hardware and software, and describe the user study that will shortly be carried out. At the end, we explain how we are engaging with refugee support organisations to extend the robot into one that can also support refugees and asylum seekers.

  • 21.
    Axelsson, Agnes
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Adaptive Robot Presenters: Modelling Grounding in Multimodal Interaction. 2023. Doctoral thesis, monograph (Other academic)
    Abstract

    This thesis addresses the topic of multimodal communicative grounding between robots and humans: the process by which a human and a robot ensure that they have a shared understanding. To explore this topic, a scenario is used in which a robot gives a presentation to a human audience. The robot must analyse multimodal signals from the human in order to adapt the presentation to the human's level of understanding.

    First, the thesis examines how behaviour trees can be used to model the real-time aspects of the interaction between the robot presenter and its audience. A system based on the behaviour tree architecture is used in a partly automatic, partly human-controlled experiment, which shows that audience members in a lab setting prefer a system that adapts the presentation to their reactions over one that does not adapt its presentation.

    The thesis then examines how knowledge graphs can be used to represent the content that the robot presents. If a small, local knowledge graph is built so that it contains relations (edges) representing the facts of the presentation, the robot can iterate over the graph and consistently find referring expressions that draw on knowledge the audience already has. A system based on this architecture is implemented, and an experiment with simulated interactions is carried out and presented. The results show that evaluators comparing different adaptation strategies prefer a system that can perform the kind of adaptation the graph approach allows.

    The audience's reactions in a presentation scenario can occur in several modalities, such as speech, head movements, gaze direction, facial expressions and body language. To classify communicative feedback in these modalities from the presentation audience, the thesis explores how such signals can be analysed automatically. A dataset of interactions between a human and our robot is annotated, and statistical models are trained to classify human feedback signals from several modalities as positive, negative or neutral. Comparatively high classification accuracy is achieved by training simpler classification models on relatively few classes of signals in the speech and head-movement modalities. This suggests that the museum scenario with a robot presenter does not encourage the audience to use complicated, ambiguous communicative behaviours.

    When knowledge graphs are used as the presentation system's information representation, consistent methods are needed for generating text, which can then be converted to speech, from graph data. The graph-to-text problem is explored by proposing several methods, both simpler template-based ones and more advanced methods based on large language models (LLMs). By proposing a new evaluation method in which true, fictional and false graphs are generated, we also show that the truthfulness of what is expressed affects the quality of the text that the LLM-based methods produce from knowledge graph data.

    Finally, the thesis combines all of the components proposed above in a single fully automatic presentation system. The results show that audience members prefer a system that adapts its presentation over one that does not, mirroring the results from the beginning of the thesis. We also see no clear learning effects in this experiment, which may suggest that the audience members in the museum scenario are looking for an entertainer rather than a teacher as a presenter.

    Download full text (pdf)
    Fulltext
  • 22.
    Axelsson, Agnes
    et al.
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Buschmeier, Hendrik
    Skantze, Gabriel
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Modeling Feedback in Interaction With Conversational Agents—A Review. 2022. In: Frontiers in Computer Science, E-ISSN 2624-9898, Vol. 4, article id 744574. Article, review/survey (Refereed)
    Abstract [en]

    Intelligent agents interacting with humans through conversation (such as a robot, embodied conversational agent, or chatbot) need to receive feedback from the human to make sure that its communicative acts have the intended consequences. At the same time, the human interacting with the agent will also seek feedback, in order to ensure that her communicative acts have the intended consequences. In this review article, we give an overview of past and current research on how intelligent agents should be able to both give meaningful feedback toward humans, as well as understanding feedback given by the users. The review covers feedback across different modalities (e.g., speech, head gestures, gaze, and facial expression), different forms of feedback (e.g., backchannels, clarification requests), and models for allowing the agent to assess the user's level of understanding and adapt its behavior accordingly. Finally, we analyse some shortcomings of current approaches to modeling feedback, and identify important directions for future research.

    Download full text (pdf)
    fulltext
  • 23.
    Axelsson, Agnes
    et al.
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Skantze, Gabriel
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Do you follow?: A fully automated system for adaptive robot presenters. 2023. In: HRI 2023: Proceedings of the 2023 ACM/IEEE International Conference on Human-Robot Interaction, Association for Computing Machinery (ACM), 2023, pp. 102-111. Conference paper (Refereed)
    Abstract [en]

    An interesting application for social robots is to act as a presenter, for example as a museum guide. In this paper, we present a fully automated system architecture for building adaptive presentations for embodied agents. The presentation is generated from a knowledge graph, which is also used to track the grounding state of information, based on multimodal feedback from the user. We introduce a novel way to use large-scale language models (GPT-3 in our case) to lexicalise arbitrary knowledge graph triples, greatly simplifying the design of this aspect of the system. We also present an evaluation where 43 participants interacted with the system. The results show that users prefer the adaptive system and consider it more human-like and flexible than a static version of the same system, but only partial results are seen in their learning of the facts presented by the robot.

  • 24.
    Axelsson, Agnes
    et al.
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Skantze, Gabriel
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Multimodal User Feedback During Adaptive Robot-Human Presentations. 2022. In: Frontiers in Computer Science, E-ISSN 2624-9898, Vol. 3. Article in journal (Refereed)
    Abstract [en]

    Feedback is an essential part of all communication, and agents communicating with humans must be able to both give and receive feedback in order to ensure mutual understanding. In this paper, we analyse multimodal feedback given by humans towards a robot that is presenting a piece of art in a shared environment, similar to a museum setting. The data analysed contains both video and audio recordings of 28 participants, and the data has been richly annotated both in terms of multimodal cues (speech, gaze, head gestures, facial expressions, and body pose), as well as the polarity of any feedback (negative, positive, or neutral). We train statistical and machine learning models on the dataset, and find that random forest models and multinomial regression models perform well on predicting the polarity of the participants' reactions. An analysis of the different modalities shows that most information is found in the participants' speech and head gestures, while much less information is found in their facial expressions, body pose and gaze. An analysis of the timing of the feedback shows that most feedback is given when the robot makes pauses (and thereby invites feedback), but that the more exact timing of the feedback does not affect its meaning.

    Download full text (pdf)
    fulltext
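    The abstract above reports that random forest models predict feedback polarity well from speech and head-gesture cues. A minimal scikit-learn sketch of that kind of classifier on made-up multimodal feature vectors; the feature layout and random labels are placeholders, not the annotated corpus used in the paper.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    rng = np.random.default_rng(1)

    # Toy features per response: [speech cue score, head-nod rate, gaze-on-robot share, smile intensity]
    X = rng.normal(size=(300, 4))
    y = rng.choice(["negative", "neutral", "positive"], size=300)    # feedback polarity labels

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(X_train, y_train)
    print(accuracy_score(y_test, clf.predict(X_test)))   # near chance here, since the toy data is random
    print(clf.feature_importances_)                      # which features the forest relies on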
  • 25.
    Axelsson, Agnes
    et al.
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Skantze, Gabriel
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Using Large Language Models for Zero-Shot Natural Language Generation from Knowledge Graphs. 2023. In: Proceedings of the Workshop on Multimodal, Multilingual Natural Language Generation and Multilingual WebNLG Challenge (MM-NLG 2023), 2023, pp. 39-54. Conference paper (Refereed)
    Abstract [en]

    In any system that uses structured knowledge graph (KG) data as its underlying knowledge representation, KG-to-text generation is a useful tool for turning parts of the graph data into text that can be understood by humans. Recent work has shown that models that make use of pretraining on large amounts of text data can perform well on the KG-to-text task, even with relatively little training data on the specific graph-to-text task. In this paper, we build on this concept by using large language models to perform zero-shot generation based on nothing but the model’s understanding of the triple structure from what it can read. We show that ChatGPT achieves near state-of-the-art performance on some measures of the WebNLG 2020 challenge, but falls behind on others. Additionally, we compare factual, counter-factual and fictional statements, and show that there is a significant connection between what the LLM already knows about the data it is parsing and the quality of the output text.

    Download full text (pdf)
    mmnlg-2023-axelsson-skantze-kg-to-text-chatgpt
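    The abstract above describes prompting a large language model to verbalise knowledge graph triples without task-specific training. A minimal sketch of how such a zero-shot prompt could be assembled from triples; the prompt wording, the example triples and the call_llm stub are assumptions, not the prompt used in the paper.

    def triples_to_prompt(triples):
        """Turn (subject, predicate, object) triples into a zero-shot verbalisation prompt."""
        lines = [f"({s} | {p} | {o})" for s, p, o in triples]
        return ("Convert the following knowledge graph triples into fluent English text, "
                "stating only what the triples say:\n" + "\n".join(lines))

    triples = [
        ("Alan_Turing", "birthPlace", "London"),
        ("Alan_Turing", "field", "Computer_Science"),
    ]
    prompt = triples_to_prompt(triples)
    print(prompt)
    # The prompt would then be sent to a chat-style LLM through whatever client is available:
    # generated_text = call_llm(prompt)   # hypothetical helper wrapping the model API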
  • 26.
    Axelsson, Agnes
    et al.
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Vaddadi, Bhavana
    KTH, Skolan för industriell teknik och management (ITM), Centra, Integrated Transport Research Lab, ITRL.
    Bogdan, Cristian M
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Människocentrerad teknologi, Medieteknik och interaktionsdesign, MID.
    Skantze, Gabriel
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Robots in autonomous buses: Who hosts when no human is there? 2024. In: HRI 2024 Companion - Companion of the 2024 ACM/IEEE International Conference on Human-Robot Interaction, Association for Computing Machinery (ACM), 2024, pp. 1278-1280. Conference paper (Refereed)
    Abstract [en]

    In mid-2023, we performed an experiment in autonomous buses in Stockholm, Sweden, to evaluate the role that social robots might have in such settings, and their effects on passengers' feeling of safety and security, given the absence of human drivers or clerks. To address the situations that may occur in autonomous public transit (APT), we compared an embodied agent to a disembodied agent. In this video publication, we showcase some of the things that worked with the interactions we created, and some problematic issues that we had not anticipated.

  • 27.
    Axelsson, Nils
    et al.
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Skantze, Gabriel
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Modelling Adaptive Presentations in Human-Robot Interaction using Behaviour Trees. 2019. In: 20th Annual Meeting of the Special Interest Group on Discourse and Dialogue: Proceedings of the Conference / [ed] Satoshi Nakamura, Stroudsburg, PA: Association for Computational Linguistics (ACL), 2019, pp. 345-352. Conference paper (Refereed)
    Abstract [en]

    In dialogue, speakers continuously adapt their speech to accommodate the listener, based on the feedback they receive. In this paper, we explore the modelling of such behaviours in the context of a robot presenting a painting. A Behaviour Tree is used to organise the behaviour on different levels, and allow the robot to adapt its behaviour in real-time; the tree organises engagement, joint attention, turn-taking, feedback and incremental speech processing. An initial implementation of the model is presented, and the system is evaluated in a user study, where the adaptive robot presenter is compared to a non-adaptive version. The adaptive version is found to be more engaging by the users, although no effects are found on the retention of the presented material.
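    The abstract above organises the robot presenter's behaviour with a behaviour tree. A minimal sketch of the two classic composite nodes (sequence and fallback) to show how such a tree ticks through engagement and presentation behaviours; the node and action names are generic illustrations, not the paper's implementation.

    SUCCESS, FAILURE, RUNNING = "success", "failure", "running"

    class Sequence:
        """Tick children in order; fail or stay running as soon as one does."""
        def __init__(self, children): self.children = children
        def tick(self):
            for child in self.children:
                status = child.tick()
                if status != SUCCESS:
                    return status
            return SUCCESS

    class Fallback:
        """Tick children in order; succeed or stay running as soon as one does."""
        def __init__(self, children): self.children = children
        def tick(self):
            for child in self.children:
                status = child.tick()
                if status != FAILURE:
                    return status
            return FAILURE

    class Action:
        def __init__(self, fn): self.fn = fn
        def tick(self): return self.fn()

    # Toy tree: keep presenting while the listener looks attentive, otherwise re-engage.
    listener_attentive = Action(lambda: SUCCESS)
    present_segment = Action(lambda: RUNNING)
    re_engage = Action(lambda: SUCCESS)

    root = Fallback([Sequence([listener_attentive, present_segment]), re_engage])
    print(root.tick())   # -> "running": the presenter keeps talking on this tick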

  • 28.
    Axelsson, Nils
    et al.
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Skantze, Gabriel
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Using knowledge graphs and behaviour trees for feedback-aware presentation agents. 2020. In: Proceedings of Intelligent Virtual Agents 2020, Association for Computing Machinery (ACM), 2020. Conference paper (Refereed)
    Abstract [en]

    In this paper, we address the problem of how an interactive agent (such as a robot) can present information to an audience and adapt the presentation according to the feedback it receives. We extend a previous behaviour tree-based model to generate the presentation from a knowledge graph (Wikidata), which allows the agent to handle feedback incrementally, and adapt accordingly. Our main contribution is using this knowledge graph not just for generating the system’s dialogue, but also as the structure through which short-term user modelling happens. In an experiment using simulated users and third-party observers, we show that referring expressions generated by the system are rated more highly when they adapt to the type of feedback given by the user, and when they are based on previously grounded information as opposed to new information.

  • 29.
    Aylett, Matthew Peter
    et al.
    Heriot Watt University and CereProc Ltd. Edinburgh, UK.
    Székely, Éva
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    McMillan, Donald
    Stockholm University Stockholm, Sweden.
    Skantze, Gabriel
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Romeo, Marta
    Heriot Watt University Edinburgh, UK.
    Fischer, Joel
    University of Nottingham Nottingham, UK.
    Reyes-Cruz, Gisela
    University of Nottingham Nottingham, UK.
    Why is my Agent so Slow? Deploying Human-Like Conversational Turn-Taking. 2023. In: HAI 2023 - Proceedings of the 11th Conference on Human-Agent Interaction, Association for Computing Machinery (ACM), 2023, pp. 490-492. Conference paper (Refereed)
    Abstract [en]

    The emphasis on one-to-one speak/wait spoken conversational interaction with intelligent agents leads to long pauses between conversational turns, undermines the flow and naturalness of the interaction, and undermines the user experience. Despite ground breaking advances in the area of generating and understanding natural language with techniques such as LLMs, conversational interaction has remained relatively overlooked. In this workshop we will discuss and review the challenges, recent work and potential impact of improving conversational interaction with artificial systems. We hope to share experiences of poor human/system interaction, best practices with third party tools, and generate design guidance for the community.

  • 30. Baker, C. P.
    et al.
    Sundberg, Johan
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Purdy, S. C.
    Rakena, T. O.
    Female adolescent singing voice characteristics: an exploratory study using LTAS and inverse filtering. 2022. In: Logopedics, Phoniatrics, Vocology, ISSN 1401-5439, E-ISSN 1651-2022, pp. 1-13. Article in journal (Refereed)
    Abstract [en]

    Background and Aim: To date, little research is available that objectively quantifies female adolescent singing-voice characteristics in light of the physiological and functional developments that occur from puberty to adulthood. This exploratory study sought to augment the pool of data available that offers objective voice analysis of female singers in late adolescence. Methods: Using long-term average spectra (LTAS) and inverse filtering techniques, dynamic range and voice-source characteristics were determined in a cohort of vocally healthy cis-gender female adolescent singers (17 to 19 years) from high-school choirs in Aotearoa New Zealand. Non-parametric statistics were used to determine associations and significant differences. Results: Wide intersubject variation was seen between dynamic range, spectral measures of harmonic organisation (formant cluster prominence, FCP), noise components in the spectrum (high-frequency energy ratio, HFER), and the normalised amplitude quotient (NAQ) suggesting great variability in ability to control phonatory mechanisms such as subglottal pressure (Psub), glottal configuration and adduction, and vocal tract shaping. A strong association between the HFER and NAQ suggest that these non-invasive measures may offer complimentary insights into vocal function, specifically with regard to glottal adduction and turbulent noise in the voice signal. Conclusion: Knowledge of the range of variation within healthy adolescent singers is necessary for the development of effective and inclusive pedagogical practices, and for vocal-health professionals working with singers of this age. LTAS and inverse filtering are useful non-invasive tools for determining such characteristics. 

  • 31. Baker, C. P.
    et al.
    Sundberg, Johan
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Purdy, S. C.
    Rakena, T. O.
    Leão, S. H. D. S.
    CPPS and Voice-Source Parameters: Objective Analysis of the Singing Voice2024Ingår i: Journal of Voice, ISSN 0892-1997, E-ISSN 1873-4588, Vol. 38, nr 3, s. 549-560Artikel i tidskrift (Refereegranskat)
    Abstract [en]

    Introduction: In recent years, cepstral analysis and specific cepstrum-based measures such as smoothed cepstral peak prominence (CPPS) have become increasingly researched and utilized in attempts to determine the extent of overall dysphonia in voice signals. Yet, few studies have extensively examined how specific voice-source parameters affect CPPS values. Objective: Using a range of synthesized tones, this exploratory study sought to systematically analyze the effect of fundamental frequency (fo), vibrato extent, source-spectrum tilt, and the amplitude of the voice-source fundamental on CPPS values. Materials and Methods: A series of scales was synthesised using the freeware Madde. Fundamental frequency, vibrato extent, source-spectrum tilt, and the amplitude of the voice-source fundamental were systematically and independently varied. The tones were analysed in PRAAT, and statistical analyses were conducted in SPSS. Results: CPPS was significantly affected by both fo and source-spectrum tilt, independently. A nonlinear association was seen between vibrato extent and CPPS, where CPPS values increased from 0 to 0.6 semitones (ST), then rapidly decreased approaching 1.0 ST. No relationship was seen between the amplitude of the voice-source fundamental and CPPS. Conclusion: The large effect of fo should be taken into account when analyzing the voice, particularly in singing-voice research, when comparing pre- and post-treatment data, and when comparing inter-subject CPPS data.
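
    Cepstral peak prominence, the measure studied here, is the height of the cepstral peak in the expected fo quefrency range above a regression line fitted to the cepstrum. The study computed CPPS in PRAAT; the sketch below only illustrates the underlying computation for a single frame (the quefrency search range is an assumption, and no time smoothing is applied, so it yields plain CPP rather than CPPS):

    # Cepstral peak prominence (CPP) for one analysis frame: an illustrative sketch,
    # not the PRAAT implementation used in the paper. Search range 60-400 Hz assumed.
    import numpy as np

    def cpp(frame, sr, fmin=60.0, fmax=400.0):
        frame = frame * np.hanning(len(frame))
        log_spec = np.log(np.abs(np.fft.rfft(frame)) + 1e-12)
        cepstrum = np.abs(np.fft.irfft(log_spec))        # real cepstrum (magnitude)
        quef = np.arange(len(cepstrum)) / sr             # quefrency axis in seconds
        lo, hi = int(sr / fmax), int(sr / fmin)          # indices spanning the fo range
        peak_idx = lo + int(np.argmax(cepstrum[lo:hi]))
        # Straight line fitted to the cepstrum over the searched quefrency range
        slope, intercept = np.polyfit(quef[lo:hi], cepstrum[lo:hi], 1)
        baseline = slope * quef[peak_idx] + intercept
        return cepstrum[peak_idx] - baseline             # prominence above the trend line

    # Hypothetical usage: cpp(signal[0:2048], sr=44100)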

  • 32.
    Beck, Gustavo
    et al.
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Wennberg, Ulme
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Malisz, Zofia
    Henter, Gustav Eje
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Wavebender GAN: An architecture for phonetically meaningful speech manipulation2022Ingår i: 2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), IEEE conference proceedings, 2022Konferensbidrag (Refereegranskat)
    Abstract [en]

    Deep learning has revolutionised synthetic speech quality. However, it has thus far delivered little value to the speech science community. The new methods do not meet the controllability demands that practitioners in this area require, e.g. in listening tests with manipulated speech stimuli. Instead, control of different speech properties in such stimuli is achieved by using legacy signal-processing methods. This limits the range, accuracy, and speech quality of the manipulations. Also, audible artefacts have a negative impact on the methodological validity of results in speech perception studies. This work introduces a system capable of manipulating speech properties through learning rather than design. The architecture learns to control arbitrary speech properties and leverages progress in neural vocoders to obtain realistic output. Experiments with copy synthesis and manipulation of a small set of core speech features (pitch, formants, and voice quality measures) illustrate the promise of the approach for producing speech stimuli that have accurate control and high perceptual quality.

  • 33. Ben-Tal, Oded
    et al.
    Harris, Matthew
    Sturm, Bob
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    How music AI is useful: Engagements with composers, performers, and audiences2021Ingår i: Leonardo music journal, ISSN 0961-1215, E-ISSN 1531-4812, Vol. 54, nr 5, s. 510-516Artikel i tidskrift (Refereegranskat)
    Abstract [en]

    Critical but often overlooked research questions in artificial intelligence (AI) applied to music involve the impact of the results for music. How and to what extent does such research contribute to the domain of music? How are the resulting models useful for music practitioners? In this article, we describe how we are addressing such questions by engaging composers, musicians, and audiences with our research. We first describe two websites we have created that make our AI models accessible to a wide audience. We then describe a professionally recorded album that we released to expert reviewers to gauge the plausibility of AI-generated material. Finally, we describe the use of our AI models as tools for co-creation. Evaluating AI research and music models in these ways illuminates their impact on music making in a range of styles and practices.

  • 34.
    Ben-Tal, Oded
    et al.
    Kingston University, London, England.
    Sturm, Bob
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Quinton, Elio
    Universal Music Group, Santa Monica, CA, USA.
    Simonnot, Josephine
    CNRS, UMR7186, CREM LESC, Nanterre, France.
    Helmlinger, Aurelie
    CNRS, UMR7186, CREM LESC, Nanterre, France.
    Finding Music in Music Data: A Summary of the DaCaRyH Project2019Ingår i: Computational phonogram archiving / [ed] Bader, R, Springer Nature , 2019, Vol. 5, s. 191-205Konferensbidrag (Refereegranskat)
    Abstract [en]

    The international research project, "Data science for the study of calypso-rhythm through history" (DaCaRyH), involved a collaboration between ethnomusicologists, computer scientists, and a composer. The primary aim of DaCaRyH was to explore how ethnomusicology could inform data science, and vice versa. Its secondary aim focused on creative applications of the results. This article summarises the results of the project, and more broadly discusses the benefits and challenges in such interdisciplinary research. It concludes with suggestions for reducing the barriers to similar work.

  • 35.
    Beskow, Jonas
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Spoken and non-verbal interaction experiments with a social robot2016Ingår i: The Journal of the Acoustical Society of America, Acoustical Society of America , 2016, Vol. 140, nr 3005Konferensbidrag (Refereegranskat)
    Abstract [en]

    During recent years, we have witnessed the start of a revolution in personal robotics. Once associated with highly specialized manufacturing tasks, robots are rapidly starting to become part of our everyday lives. The potential of these systems is far-reaching; from co-worker robots that operate and collaborate with humans side-by-side to robotic tutors in schools that interact with humans in a shared environment. All of these scenarios require systems that are able to act and react in a social way. Evidence suggests that robots should leverage channels of communication that humans understand, despite differences in physical form and capabilities. We have developed Furhat, a social robot that is able to convey several important aspects of human face-to-face interaction such as visual speech, facial expression, and eye gaze by means of facial animation that is retro-projected on a physical mask. In this presentation, we cover a series of experiments attempting to quantify the effect of our social robot and how it compares to other interaction modalities. It is shown that a number of functions ranging from low-level audio-visual speech perception to vocabulary learning improve when compared to unimodal (e.g., audio-only) settings or 2D virtual avatars.

  • 36.
    Beskow, Jonas
    et al.
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Berthelsen, Harald
    STTS Speech Technology Services, Stockholm, Sweden.
    A hybrid harmonics-and-bursts modelling approach to speech synthesis2016Ingår i: Proceedings 9th ISCA Speech Synthesis Workshop, SSW 2016, The International Society for Computers and Their Applications (ISCA) , 2016, s. 208-213Konferensbidrag (Refereegranskat)
    Abstract [en]

    Statistical speech synthesis systems rely on a parametric speech generation model, typically some sort of vocoder. Vocoders are great for voiced speech because they offer independent control over voice source (e.g. pitch) and vocal tract filter (e.g. vowel quality) through control parameters that typically vary smoothly in time and lend themselves well to statistical modelling. Voiceless sounds and transients such as plosives and fricatives, on the other hand, exhibit fundamentally different spectro-temporal behaviour. Here the benefits of the vocoder are not as clear. In this paper, we investigate a hybrid approach to modelling the speech signal, where speech is decomposed into a harmonic part and a noise burst part through spectrogram kernel filtering. The harmonic part is modelled using a vocoder and statistical parameter generation, while the burst part is modelled by concatenation. The two channels are then mixed together to form the final synthesized waveform. The proposed method was compared against a state-of-the-art statistical speech synthesis system (HTS 2.3) in a perceptual evaluation, which revealed that the harmonics-plus-bursts method was perceived as significantly more natural than the purely statistical variant.
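
    The decomposition described above, into a slowly varying harmonic layer and short noise bursts, is conceptually close to the harmonic/percussive separation obtained by median-filtering a spectrogram along time and along frequency. The sketch below uses librosa's implementation of that idea purely as an analogy; it is not the spectrogram kernel filtering of the paper, and the file name is a placeholder.

    # Harmonic/burst-style decomposition via median-filter HPSS (an analogy only).
    import librosa
    import soundfile as sf

    y, sr = librosa.load("utterance.wav", sr=None)      # placeholder file name
    D = librosa.stft(y)
    D_harm, D_burst = librosa.decompose.hpss(D)         # harmonic vs. transient energy
    y_harm = librosa.istft(D_harm, length=len(y))       # candidate "harmonics" channel
    y_burst = librosa.istft(D_burst, length=len(y))     # candidate "bursts" channel
    sf.write("harmonic_part.wav", y_harm, sr)
    sf.write("burst_part.wav", y_burst, sr)
    y_mix = y_harm + y_burst                             # mixing approximately restores the original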

  • 37.
    Beskow, Jonas
    et al.
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Caper, Charlie
    Furhat Robotics, Sweden.
    Ehrenfors, J.
    Furhat Robotics, Sweden.
    Hagberg, N.
    Furhat Robotics, Sweden.
    Jansen, A.
    Furhat Robotics, Sweden.
    Wood, C.
    Furhat Robotics, Sweden.
    Expressive robot performance based on facial motion capture2021Ingår i: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, International Speech Communication Association , 2021, s. 2165-2166Konferensbidrag (Refereegranskat)
    Abstract [en]

    The Furhat robot is a social robot that uses facial projection technology to achieve a high degree of expressivity and flexibility. In this demonstration, we will present new features that take this facial expressiveness further. A new face engine for the robot is presented which not only drastically improves the visual fidelity of the face and the eyes, but also adds flexibility when it comes to designing new robotic characters as well as modifying existing ones. Most importantly, we will present a new toolset and a workflow that allow users to record their own facial motion and incorporate it into skills (i.e. custom robot applications) as gestures, prompts or entire canned performances.

  • 38.
    Beskow, Jonas
    et al.
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Caper, Charlie
    Furhat Robotics, Stockholm, Sweden.
    Ehrenfors, Johan
    Furhat Robotics, Stockholm, Sweden.
    Hagberg, Nils
    Furhat Robotics, Stockholm, Sweden.
    Jansen, Anne
    Furhat Robotics, Stockholm, Sweden.
    Wood, Chris
    Furhat Robotics, Stockholm, Sweden.
    Expressive Robot Performance based on Facial Motion Capture2021Ingår i: INTERSPEECH 2021, ISCA-INT SPEECH COMMUNICATION ASSOC , 2021, s. 2343-2344Konferensbidrag (Refereegranskat)
    Abstract [en]

    The Furhat robot is a social robot that uses facial projection technology to achieve a high degree of expressivity and flexibility. In this demonstration, we will present new features that take this facial expressiveness further. A new face engine for the robot is presented which not only drastically improves the visual fidelity of the face and the eyes, but also adds flexibility when it comes to designing new robotic characters as well as modifying existing ones. Most importantly, we will present a new toolset and a workflow that allow users to record their own facial motion and incorporate it into skills (i.e. custom robot applications) as gestures, prompts or entire canned performances.

  • 39.
    Beskow, Jonas
    et al.
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Salvi, Giampiero
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Al Moubayed, Samer
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    SynFace - Verbal and Non-verbal Face Animation from Audio2009Ingår i: Auditory-Visual Speech Processing 2009, AVSP 2009, The International Society for Computers and Their Applications (ISCA) , 2009Konferensbidrag (Refereegranskat)
    Abstract [en]

    We give an overview of SynFace, a speech-driven face animation system originally developed for the needs of hard-of-hearing users of the telephone. For the 2009 LIPS challenge, SynFace includes not only articulatory motion but also non-verbal motion of gaze, eyebrows and head, triggered by detection of acoustic correlates of prominence and cues for interaction control. In perceptual evaluations, both verbal and non-verbal movements have been found to have a positive impact on word recognition scores.

  • 40. Betz, Simon
    et al.
    Zarrieß, Sina
    Székely, Éva
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Wagner, Petra
    The greennn tree - lengthening position influences uncertainty perception2019Ingår i: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH 2019, The International Speech Communication Association (ISCA), 2019, s. 3990-3994Konferensbidrag (Refereegranskat)
    Abstract [en]

    Synthetic speech can be used to express uncertainty in dialogue systems by means of hesitation. If a phrase like “Next to the green tree” is uttered in a hesitant way, that is, containing lengthening, silences, and fillers, the listener can infer that the speaker is not certain about the concepts referred to. However, we do not know anything about the referential domain of the uncertainty; if only a particular word in this sentence were uttered hesitantly, e.g. “the greee:n tree”, the listener could infer that the uncertainty refers to the color in the statement, but not to the object. In this study, we show that the domain of the uncertainty is controllable. We conducted an experiment in which color words in sentences like “search for the green tree” were lengthened in two different positions: word onsets or final consonants, and participants were asked to rate the uncertainty regarding color and object. The results show that initial lengthening is predominantly associated with uncertainty about the word itself, whereas final lengthening is primarily associated with the following object. These findings enable dialogue system developers to finely control the attitudinal display of uncertainty, adding nuances beyond the lexical content to message delivery.
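
    The manipulation studied here amounts to stretching one phone segment of a synthesized utterance while leaving the rest untouched. As a rough illustration only (the experiment manipulated durations inside a synthesis pipeline; the segment boundaries and stretch factor below are hypothetical), onset versus final-consonant lengthening could be simulated on a waveform like this:

    # Lengthen a single segment of a waveform, given its start/end times in seconds.
    # Boundaries and the 1.8x factor are made-up values for illustration.
    import numpy as np
    import librosa

    def lengthen_segment(y, sr, t_start, t_end, factor=1.8):
        i0, i1 = int(t_start * sr), int(t_end * sr)
        stretched = librosa.effects.time_stretch(y[i0:i1], rate=1.0 / factor)  # rate < 1 slows down
        return np.concatenate([y[:i0], stretched, y[i1:]])

    # Onset lengthening of "green":           y_onset = lengthen_segment(y, sr, 0.52, 0.60)
    # Final-consonant lengthening of "green":  y_final = lengthen_segment(y, sr, 0.78, 0.86)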

  • 41. Bimbot, F.
    et al.
    Hutter, H. -P
    Jaboulet, C.
    Koolwaaij, J.
    Lindberg, Johan
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Pierrot, J. -B
    An overview of the CAVE project research activities in speaker verification2020Ingår i: RLA2C 1998 - Speaker Recognition and its Commercial and Forensic Applications, International Speech Communication Association , 2020, s. 215-220Konferensbidrag (Refereegranskat)
  • 42.
    Bisesi, Erica
    et al.
    Centre for Systematic Musicology, University of Graz, Graz, Austria.
    Friberg, Anders
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Parncutt, Richard
    Centre for Systematic Musicology, University of Graz, Graz, Austria.
    A Computational Model of Immanent Accent Salience in Tonal Music2019Ingår i: Frontiers in Psychology, E-ISSN 1664-1078, Vol. 10, nr 317, s. 1-19Artikel i tidskrift (Refereegranskat)
    Abstract [en]

    Accents are local musical events that attract the attention of the listener, and can be either immanent (evident from the score) or performed (added by the performer). Immanent accents involve temporal grouping (phrasing), meter, melody, and harmony; performed accents involve changes in timing, dynamics, articulation, and timbre. In the past, grouping, metrical and melodic accents were investigated in the context of expressive music performance. We present a novel computational model of immanent accent salience in tonal music that automatically predicts the positions and saliences of metrical, melodic and harmonic accents. The model extends previous research by improving on preliminary formulations of metrical and melodic accents and introducing a new model for harmonic accents that combines harmonic dissonance and harmonic surprise. In an analysis-by-synthesis approach, model predictions were compared with data from two experiments, involving 239 and 638 sonorities and 16 musicians and 5 experts in music theory, respectively. Average pair-wise correlations between raters were lower for metrical (0.27) and melodic accents (0.37) than for harmonic accents (0.49). In both experiments, when combining all the raters into a single measure expressing their consensus, correlations between ratings and model predictions ranged from 0.43 to 0.62. When the different categories of accents were combined, correlations were higher than for separate categories (r = 0.66). This suggests that raters might use strategies different from individual metrical, melodic or harmonic accent models to mark the musical events.

  • 43.
    Blomsma, Peter
    et al.
    Tilburg University.
    Skantze, Gabriel
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Swerts, Marc
    Tilburg University.
    Backchannel Behavior Influences the Perceived Personality of Human and Artificial Communication Partners2022Ingår i: Frontiers in Artificial Intelligence, E-ISSN 2624-8212, Vol. 5Artikel i tidskrift (Refereegranskat)
    Abstract [en]

    Different applications or contexts may require different settings for a conversational AI system; it is clear that, e.g., a child-oriented system would need a different interaction style than a warning system used in emergency situations. The current article focuses on the extent to which a system's usability may benefit from variation in the personality it displays. To this end, we investigate whether variation in personality is signaled by differences in specific audiovisual feedback behavior, with a specific focus on embodied conversational agents. This article reports on two rating experiments in which participants judged the personalities (i) of human beings and (ii) of embodied conversational agents, where we were specifically interested in the role of variability in audiovisual cues. Our results show that personality perceptions of both humans and artificial communication partners are indeed influenced by the type of feedback behavior used. This knowledge could inform developers of conversational AI on how to also include personality in their feedback behavior generation algorithms, which could enhance the perceived personality and in turn generate a stronger sense of presence for the human interlocutor.

  • 44.
    Borg, Alexander
    et al.
    Karolinska Institutet Stockholm, Sweden.
    Parodis, Ioannis
    Karolinska Institutet Stockholm, Sweden.
    Skantze, Gabriel
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Creating Virtual Patients using Robots and Large Language Models: A Preliminary Study with Medical Students2024Ingår i: HRI 2024 Companion - Companion of the 2024 ACM/IEEE International Conference on Human-Robot Interaction, Association for Computing Machinery (ACM) , 2024, s. 273-277Konferensbidrag (Refereegranskat)
    Abstract [en]

    This paper presents a virtual patient (VP) platform for medical education, combining a social robot, Furhat, with large language models (LLMs). Aimed at enhancing clinical reasoning (CR) training, particularly in rheumatology, this approach introduces more interactive and realistic patient simulations. The use of LLMs for driving the dialogue, for expressing emotions in the robot's face, and for automatic analysis and generation of feedback to the student is discussed. The platform's effectiveness was tested in a pilot study with 15 medical students, comparing it against a traditional semi-linear VP platform. The evaluation indicates a preference for the robot platform in terms of authenticity and learning effect. We conclude that this novel integration of a social robot and LLMs in VP simulations shows potential in medical education, offering a more engaging learning experience.
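
    The abstract describes an LLM that both produces the patient's reply and tags it with an emotion that can drive the robot's face. A minimal sketch of that pattern is given below; the system prompt, the JSON tag format, the model name and the mapping to the robot are all assumptions made for illustration, not details taken from the paper.

    # LLM-driven virtual patient returning an utterance plus an emotion tag (sketch).
    import json
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set; any chat-completion backend would do

    SYSTEM = (
        "You are a virtual patient with rheumatoid arthritis interviewed by a medical "
        "student. Stay in character and answer briefly. Reply as JSON: "
        '{"utterance": "...", "emotion": "neutral|worried|in_pain|relieved"}'
    )

    def patient_reply(history, student_utterance):
        history = history + [{"role": "user", "content": student_utterance}]
        resp = client.chat.completions.create(
            model="gpt-4o-mini",                       # placeholder model name
            messages=[{"role": "system", "content": SYSTEM}] + history,
            response_format={"type": "json_object"},
        )
        raw = resp.choices[0].message.content
        reply = json.loads(raw)
        history.append({"role": "assistant", "content": raw})
        # reply["utterance"] would go to the robot's text-to-speech and
        # reply["emotion"] would be mapped to a facial gesture.
        return reply, history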

  • 45. Borin, L.
    et al.
    Forsberg, M.
    Edlund, Jens
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Domeij, R.
    Språkbanken 2018: Research resources for text, speech, & society2018Ingår i: CEUR Workshop Proceedings, CEUR-WS , 2018, s. 504-506Konferensbidrag (Refereegranskat)
    Abstract [en]

    We introduce an expanded version of the Swedish research resource Språkbanken (the Swedish Language Bank). In 2018, Språkbanken, which has supported national and international research for over four decades, adds two branches, one focusing on speech and one on societal aspects of language, to its existing organization, which targets text. 

  • 46.
    Borin, Lars
    et al.
    University of Gothenburg, Gothenburg, Sweden.
    Domeij, Rickard
    Institute of Languages and Folklore, Stockholm, Sweden.
    Edlund, Jens
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Forsberg, Markus
    University of Gothenburg, Gothenburg, Sweden.
    Language Report Swedish2023Ingår i: Cognitive Technologies, Springer Nature , 2023, Vol. Part F280, s. 219-222Kapitel i bok, del av antologi (Övrigt vetenskapligt)
    Abstract [en]

    Swedish speech and language technology (LT) research goes back over 70 years. This has paid off: there is a national research infrastructure, as well as significant research projects, and Swedish is well endowed with language resources (LRs) and tools. However, there are gaps that need to be filled, especially the high-quality gold-standard LRs required by the most recent deep-learning methods. In the future, we would like to see closer collaboration and communication between the “traditional” LT research community and the burgeoning AI field, the establishment of dedicated academic LT training programmes, and national funding for LT research.

  • 47.
    Bresin, Roberto
    et al.
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Friberg, Anders
    KTH, Tidigare Institutioner (före 2005), Tal, musik och hörsel.
    Dahl, Sofia
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Toward a new model for sound control2001Ingår i: Proceedings of the COST G-6 Conference on Digital Audio Effects (DAFX-01), Limerick, Ireland, December 6-8, 200 / [ed] Fernström, M., Brazil, E., & Marshall, M., 2001, s. 45-49Konferensbidrag (Refereegranskat)
    Abstract [en]

    The control of sound synthesis is a well-known problem. This is particularly true if the sounds are generated with physical modeling techniques that typically need specification of numerous control parameters. In the present work outcomes from studies on automatic music performance are used for tackling this problem. 

  • 48.
    Bystedt, Mattias
    et al.
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Edlund, Jens
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    New applications of gaze tracking in speech science2019Ingår i: CEUR Workshop Proceedings, CEUR-WS , 2019, s. 73-78Konferensbidrag (Refereegranskat)
    Abstract [en]

    We present an overview of speech research applications of gaze tracking technology, where gaze behaviours are exploited as a tool for analysis rather than as a primary object of study. The methods presented are all in their infancy, but can greatly assist the analysis of digital audio and video as well as unlock the relationship between writing and other encodings on the one hand, and natural language, such as speech, on the other. We discuss three directions in this type of gaze tracking application: modelling of text that is read aloud, evaluation and annotation with naïve informants, and evaluation and annotation with expert annotators. In each of these areas, we use gaze tracking information to gauge the behaviour of people when working with speech and conversation, rather than when reading text aloud or partaking in conversations, in order to learn something about how the speech may be analysed from a human perspective.

  • 49.
    Cai, Huanchen
    et al.
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Ternström, Sten
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Mapping Phonation Types by Clustering of Multiple Metrics2022Ingår i: Applied Sciences, ISSN 2076-3417, Vol. 12, nr 23, s. 12092-Artikel i tidskrift (Refereegranskat)
    Abstract [en]

    For voice analysis, much work has been undertaken with a multitude of acoustic and electroglottographic metrics. However, few of these have proven to be robustly correlated with physical and physiological phenomena. In particular, all metrics are affected by the fundamental frequency and sound level, making voice assessment sensitive to the recording protocol. It was investigated whether combinations of metrics, acquired over voice maps rather than with individual sustained vowels, can offer a more functional and comprehensive interpretation. For this descriptive, retrospective study, 13 men, 13 women, and 22 children were instructed to phonate on /a/ over their full voice range. Six acoustic and EGG signal features were obtained for every phonatory cycle. An unsupervised voice classification model created feature clusters, which were then displayed on voice maps. It was found that the feature clusters may be readily interpreted in terms of phonation types. For example, the typical intense voice has a high peak EGG derivative, a relatively high contact quotient, low EGG cycle-rate entropy, and a high cepstral peak prominence in the voice signal, all represented by one cluster centroid that is mapped to a given color. In a transition region between non-contacting and contacting of the vocal folds, the combination of metrics shows a low contact quotient and relatively high entropy, which can be mapped to a different color. Based on this data set, male phonation types could be clustered into up to six categories and female and child types into four. Combining acoustic and EGG metrics resolved more categories than either kind on its own. The inter- and intra-participant distributional features are discussed.
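
    The pipeline sketched in the abstract (per-cycle features, unsupervised clustering, cluster colours painted onto a voice map over fo and SPL) can be pictured with standard tools. The sketch below is only an illustration under assumed names: the column names, the choice of k-means with six clusters, and the one-semitone by one-dB cell size are not taken from the paper.

    # Cluster per-cycle voice metrics and paint a voice map by the dominant cluster per cell.
    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import KMeans

    df = pd.read_csv("cycles.csv")                      # hypothetical: one row per phonatory cycle
    egg_acoustic = ["CQ", "dEGGmax", "CSE", "CPPs"]     # assumed metric column names
    X = StandardScaler().fit_transform(df[egg_acoustic])

    km = KMeans(n_clusters=6, n_init=10, random_state=0).fit(X)
    df["cluster"] = km.labels_

    # Voice-map cells: fo in semitones (re 55 Hz) by SPL in whole dB
    df["st"] = np.round(12 * np.log2(df["fo"] / 55.0)).astype(int)
    df["spl"] = np.round(df["SPL"]).astype(int)
    voice_map = (
        df.groupby(["st", "spl"])["cluster"]
          .agg(lambda c: c.value_counts().idxmax())     # dominant cluster per cell
          .unstack()
    )
    # Each cluster id can then be given a fixed colour when the map is rendered.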

  • 50.
    Cai, Huanchen
    et al.
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Ternström, Sten
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Chaffanjon, Philippe
    University of Grenoble Alpes, CNRS, Grenoble INP, GIPSA-lab, Grenoble, France; Medical School, Université Grenoble Alpes, Grenoble, France.
    Henrich Bernardoni, Nathalie
    University of Grenoble Alpes, CNRS, Grenoble INP, GIPSA-lab, Grenoble, France.
    Effects on Voice Quality of Thyroidectomy: A Qualitative and Quantitative Study Using Voice Maps2024Ingår i: Journal of Voice, ISSN 0892-1997, E-ISSN 1873-4588Artikel i tidskrift (Refereegranskat)
    Abstract [en]

    Objectives: This study aims to explore the effects of thyroidectomy (a surgical intervention involving the removal of the thyroid gland) on voice quality, as represented by acoustic and electroglottographic measures. Given the thyroid gland's proximity to the inferior and superior laryngeal nerves, thyroidectomy carries a potential risk of affecting vocal function. While earlier studies have documented effects on the voice range, few studies have looked at voice quality after thyroidectomy. Since voice-quality effects could manifest in many ways that are a priori unknown, we wish to apply an exploratory approach that collects many data points from several metrics.

    Methods: A voice-mapping analysis paradigm was applied retrospectively on a corpus of spoken and sung sentences produced by patients who had thyroid surgery. Voice quality changes were assessed objectively for 57 patients prior to surgery and 2 months after surgery, by making comparative voice maps, pre- and post-intervention, of six acoustic and electroglottographic (EGG) metrics.

    Results: After thyroidectomy, statistically significant changes consistent with a worsening of voice quality were observed in most metrics. For all individual metrics, however, the effect sizes were too small to be clinically relevant. Statistical clustering of the metrics helped to clarify the nature of these changes. While partial thyroidectomy demonstrated greater uniformity than did total thyroidectomy, the type of perioperative damage had no discernible impact on voice quality.

    Conclusions: Changes in voice quality after thyroidectomy were related mostly to increased phonatory instability in both the acoustic and EGG metrics. Clustered voice metrics exhibited a higher correlation to voice complaints than did individual voice metrics.
