1 - 50 of 686 hits
  • 1. AAl Abdulsalam, Abdulrahman
    et al.
    Velupillai, Sumithra
    Meystre, Stephane
    UtahBMI at SemEval-2016 Task 12: Extracting Temporal Information from Clinical Text (2016). In: Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), Association for Computational Linguistics, 2016, pp. 1256-1262. Conference paper (Refereed)
    Abstract [en]

    The 2016 Clinical TempEval continued the 2015 shared task on temporal information extraction with a new evaluation test set. Our team, UtahBMI, participated in all subtasks using machine learning approaches with ClearTK (LIBLINEAR), CRF++ and CRFsuite packages. Our experiments show that CRF-based classifiers yield, in general, higher recall for multi-word spans, while SVM-based classifiers are better at predicting correct attributes of TIMEX3. In addition, we show that an ensemble-based approach for TIMEX3 could yield improved results. Our team achieved competitive results in each subtask with an F1 75.4% for TIMEX3, F1 89.2% for EVENT, F1 84.4% for event relations with document time (DocTimeRel), and F1 51.1% for narrative container (CONTAINS) relations.
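
    As an illustration of the classifier family mentioned above (not the UtahBMI system itself), a minimal sketch of an SVM attribute classifier built on scikit-learn's LinearSVC, which wraps LIBLINEAR, could look as follows; the feature names, labels, and data are invented placeholders.

        # Minimal sketch (not the authors' pipeline): a LIBLINEAR-backed SVM
        # that assigns a TIMEX3 class attribute to already-detected time spans.
        # Features and labels below are invented placeholders.
        from sklearn.feature_extraction import DictVectorizer
        from sklearn.pipeline import make_pipeline
        from sklearn.svm import LinearSVC

        train_features = [
            {"token": "yesterday", "pos": "RB", "has_digit": False},
            {"token": "3", "pos": "CD", "has_digit": True},
            {"token": "postoperative", "pos": "JJ", "has_digit": False},
        ]
        train_labels = ["DATE", "DURATION", "PREPOSTEXP"]

        clf = make_pipeline(DictVectorizer(sparse=True), LinearSVC())
        clf.fit(train_features, train_labels)
        print(clf.predict([{"token": "today", "pos": "NN", "has_digit": False}]))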

  • 2.
    Abou Zliekha, M.
    et al.
    Damascus University/Faculty of Information Technology.
    Al Moubayed, Samer
    Damascus University/Faculty of Information Technology.
    Al Dakkak, O.
    Higher Institute of Applied Science and Technology (HIAST).
    Ghneim, N.
    Higher Institute of Applied Science and Technology (HIAST).
    Emotional Audio-Visual Arabic Text to Speech (2006). In: Proceedings of the XIV European Signal Processing Conference (EUSIPCO), Florence, Italy, 2006. Conference paper (Refereed)
    Abstract [en]

    The goal of this paper is to present an emotional audio-visual text-to-speech system for the Arabic language. The system is based on two entities: an emotional audio text-to-speech system, which generates speech depending on the input text and the desired emotion type, and an emotional visual model, which generates the talking head by forming the corresponding visemes. The phoneme-to-viseme mapping and the emotion shaping use a 3-parametric face model, based on the Abstract Muscle Model. We have thirteen viseme models and five emotions as parameters to the face model. The TTS produces the phonemes corresponding to the input text and the speech with the suitable prosody to include the prescribed emotion. In parallel, the system generates the visemes and sends the controls to the facial model to get the animation of the talking head in real time.

  • 3. Abrahamsson, M.
    et al.
    Sundberg, Johan
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Musikakustik.
    Subglottal pressure variation in actors’ stage speech (2007). In: Voice and Gender Journal for the Voice and Speech Trainers Association / [ed] Rees, M., VASTA Publishing, 2007, pp. 343-347. Book chapter, part of anthology (Refereed)
  • 4.
    Ahmady, Tobias
    et al.
    KTH, Skolan för teknik och hälsa (STH), Medicinsk teknik, Data- och elektroteknik.
    Klein Rosmar, Sander
    KTH, Skolan för teknik och hälsa (STH), Medicinsk teknik, Data- och elektroteknik.
    Translation of keywords between English and Swedish (2014). Independent thesis, basic level (university diploma), 10 credits / 15 HE credits. Student thesis (Degree project)
    Abstract [sv]

    I detta projekt har vi undersökt hur man utför regelbaserad maskinöversättning av nyckelord mellan två språk. Målet var att översätta en given mängd med ett eller flera nyckelord på ett källspråk till en motsvarande, lika stor mängd nyckelord på målspråket. Vissa ord i källspråket kan dock ha flera betydelser och kan översättas till flera, eller inga, ord på målspråket. Om tvetydiga översättningar uppstår ska nyckelordets bästa översättning väljas med hänsyn till sammanhanget. I traditionell maskinöversättning bestäms ett ords sammanhang av frasen eller meningen som det befinner sig i. I det här projektet representerar den givna mängden nyckelord sammanhanget.

    Genom att undersöka traditionella tillvägagångssätt för maskinöversättning har vi designat och beskrivit modeller specifikt för översättning av nyckelord. Vi har presenterat en direkt maskinöversättningslösning av nyckelord mellan engelska och svenska där vi introducerat en enkel grafbaserad modell för tvetydiga översättningar.

  • 5.
    Al Dakkak, O.
    et al.
    Higher Institute of Applied Science and Technology (HIAST).
    Ghneim, N.
    Higher Institute of Applied Science and Technology (HIAST).
    Abou Zliekha, M.
    Damascus University/Faculty of Information Technology.
    Al Moubayed, Samer
    Damascus University/Faculty of Information Technology.
    Emotional Inclusion in An Arabic Text-To-Speech (2005). In: Proceedings of the 13th European Signal Processing Conference (EUSIPCO), Antalya, Turkey, 2005. Conference paper (Refereed)
    Abstract [en]

    The goal of this paper is to present an emotional audio-visual text-to-speech system for the Arabic language. The system is based on two entities: an emotional audio text-to-speech system, which generates speech depending on the input text and the desired emotion type, and an emotional visual model, which generates the talking head by forming the corresponding visemes. The phoneme-to-viseme mapping and the emotion shaping use a 3-parametric face model, based on the Abstract Muscle Model. We have thirteen viseme models and five emotions as parameters to the face model. The TTS produces the phonemes corresponding to the input text and the speech with the suitable prosody to include the prescribed emotion. In parallel, the system generates the visemes and sends the controls to the facial model to get the animation of the talking head in real time.

  • 6.
    Al Dakkak, O.
    et al.
    HIAST, Damascus, Syria.
    Ghneim, N.
    HIAST, Damascus, Syria.
    Abou Zliekha, M.
    Damascus University.
    Al Moubayed, Samer
    Damascus University.
    Prosodic Feature Introduction and Emotion Incorporation in an Arabic TTS (2006). In: Proceedings of IEEE International Conference on Information and Communication Technologies, Damascus, Syria, 2006, pp. 1317-1322. Conference paper (Refereed)
    Abstract [en]

    Text-to-speech is a crucial part of many man-machine communication applications, such as phone booking and banking and vocal e-mail, in addition to many applications for impaired persons, such as reading machines for the blind and talking machines for persons with speech difficulties. However, the main drawback of most speech synthesizers in these talking machines is their metallic sound. In order to sound natural, we have to incorporate prosodic features as close as possible to natural prosody; this helps to improve the quality of the synthetic speech. Current research worldwide is directed towards better "automatic prosody generation".

  • 7.
    Al Moubayed, Samer
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation. KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Centrum för Talteknologi, CTT.
    Prosodic Disambiguation in Spoken Systems Output (2009). In: Proceedings of Diaholmia'09: 2009 Workshop on the Semantics and Pragmatics of Dialogue / [ed] Jens Edlund, Joakim Gustafson, Anna Hjalmarsson, Gabriel Skantze, Stockholm, Sweden, 2009, pp. 131-132. Conference paper (Refereed)
    Abstract [en]

    This paper presents work on using prosody in the output of spoken dialogue systems to resolve possible structural ambiguity of output utterances. An algorithm is proposed to discover ambiguous parses of an utterance and to add prosodic disambiguation events to deliver the intended structure. By conducting a pilot experiment, the automatic prosodic grouping applied to ambiguous sentences shows the ability to deliver the intended interpretation of the sentences.

  • 8.
    Al Moubayed, Samer
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Towards rich multimodal behavior in spoken dialogues with embodied agents (2013). In: 4th IEEE International Conference on Cognitive Infocommunications, CogInfoCom 2013 - Proceedings, IEEE Computer Society, 2013, pp. 817-822. Conference paper (Refereed)
    Abstract [en]

    Spoken dialogue frameworks have traditionally been designed to handle a single stream of data - the speech signal. Research on human-human communication has provided ample evidence quantifying the effects and the importance of a multitude of other multimodal nonverbal signals that people use in their communication and that shape and regulate their interaction. Driven by findings from multimodal human spoken interaction, and by the advancements of capture devices, robotics, and animation technologies, new possibilities are arising for the development of multimodal human-machine interaction that is more affective, social, and engaging. In such face-to-face interaction scenarios, dialogue systems can have a large set of signals at their disposal to infer context and to enhance and regulate the interaction through the generation of verbal and nonverbal facial signals. This paper summarizes several design decisions and experiments that we have followed in attempts to build rich and fluent multimodal interactive systems using a newly developed hybrid robotic head called Furhat, and discusses issues and challenges that this effort is facing.

  • 9.
    Al Moubayed, Samer
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Alexanderson, Simon
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH.
    Beskow, Jonas
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Granström, Björn
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    A robotic head using projected animated faces (2011). In: Proceedings of the International Conference on Audio-Visual Speech Processing 2011 / [ed] Salvi, G.; Beskow, J.; Engwall, O.; Al Moubayed, S., Stockholm: KTH Royal Institute of Technology, 2011, pp. 71-. Conference paper (Refereed)
    Abstract [en]

    This paper presents a setup which employs virtual animated agents for robotic heads. The system uses a laser projector to project animated faces onto a three dimensional face mask. This approach of projecting animated faces onto a three dimensional head surface, as an alternative to using flat, two dimensional surfaces, eliminates several deteriorating effects and illusions that come with flat surfaces for interaction purposes, such as exclusive mutual gaze and situated and multi-partner dialogues. In addition, it provides robotic heads with a flexible solution for facial animation which takes advantage of the advancements of facial animation using computer graphics over mechanically controlled heads.

  • 10.
    Al Moubayed, Samer
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Ananthakrishnan, Gopal
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Acoustic-to-Articulatory Inversion based on Local Regression (2010). In: Proceedings of the 11th Annual Conference of the International Speech Communication Association, INTERSPEECH 2010, Makuhari, Japan, 2010, pp. 937-940. Conference paper (Refereed)
    Abstract [en]

    This paper presents an Acoustic-to-Articulatory inversion method based on local regression. Two types of local regression, a non-parametric and a local linear regression, have been applied on a corpus containing simultaneous recordings of positions of articulators and the corresponding acoustics. A maximum likelihood trajectory smoothing using the estimated dynamics of the articulators is also applied on the regression estimates. The average root mean square error in estimating articulatory positions, given the acoustics, is 1.56 mm for the non-parametric regression and 1.52 mm for the local linear regression. The local linear regression is found to perform significantly better than regression using Gaussian Mixture Models using the same acoustic and articulatory features.
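
    The abstract above describes local linear regression for the inversion mapping; a hedged, self-contained sketch of that general technique (with synthetic data, not the paper's corpus or exact method) is given below.

        # Illustrative local linear regression (not the paper's implementation):
        # for each acoustic query frame, fit a least-squares linear map on its k
        # nearest training frames and predict the articulator positions.
        import numpy as np

        def local_linear_predict(X_train, Y_train, x_query, k=20):
            dists = np.linalg.norm(X_train - x_query, axis=1)
            idx = np.argsort(dists)[:k]
            A = np.hstack([X_train[idx], np.ones((k, 1))])  # add bias column
            W, *_ = np.linalg.lstsq(A, Y_train[idx], rcond=None)
            return np.append(x_query, 1.0) @ W

        rng = np.random.default_rng(0)
        X = rng.normal(size=(500, 12))                         # synthetic "acoustic" frames
        Y = X[:, :4] ** 2 + 0.01 * rng.normal(size=(500, 4))   # synthetic articulator positions
        pred = local_linear_predict(X, Y, X[0])
        print("per-frame RMS error:", np.sqrt(np.mean((pred - Y[0]) ** 2)))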

  • 11.
    Al Moubayed, Samer
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation. KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Centrum för Talteknologi, CTT.
    Ananthakrishnan, Gopal
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation. KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Centrum för Talteknologi, CTT.
    Enflo, Laura
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Musikakustik. KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Centrum för Talteknologi, CTT.
    Automatic Prominence Classification in Swedish (2010). In: Proceedings of Speech Prosody 2010, Workshop on Prosodic Prominence, Chicago, USA, 2010. Conference paper (Refereed)
    Abstract [en]

    This study aims at automatically classifying levels of acoustic prominence in a dataset of 200 Swedish sentences of read speech by one male native speaker. Each word in the sentences was categorized by four speech experts into one of three groups depending on the level of prominence perceived. Six acoustic features at the syllable level and seven features at the word level were used. Two machine learning algorithms, namely Support Vector Machines (SVM) and Memory-Based Learning (MBL), were trained to classify the sentences into their respective classes. The MBL gave an average word-level accuracy of 69.08% and the SVM gave an average accuracy of 65.17% on the test set. These values were comparable with the average accuracy of the human annotators with respect to the average annotations. In this study, word duration was found to be the most important feature for classifying prominence in Swedish read speech.
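
    Memory-Based Learning is essentially nearest-neighbour classification; the comparison of the two classifier types named above can be sketched with scikit-learn on synthetic feature vectors (invented data, not the paper's features), as below.

        # Sketch only: compare an SVM and a memory-based (k-NN) classifier on
        # synthetic word-level prominence features with 3 prominence classes.
        import numpy as np
        from sklearn.model_selection import cross_val_score
        from sklearn.neighbors import KNeighborsClassifier
        from sklearn.svm import LinearSVC

        rng = np.random.default_rng(3)
        X = rng.normal(size=(600, 7))        # 7 invented word-level features
        y = rng.integers(0, 3, size=600)     # 3 prominence levels

        for name, clf in [("SVM", LinearSVC(max_iter=5000)), ("MBL/k-NN", KNeighborsClassifier())]:
            acc = cross_val_score(clf, X, y, cv=5).mean()
            print(f"{name}: mean CV accuracy = {acc:.3f}")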

  • 12.
    Al Moubayed, Samer
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Beskow, Jonas
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    A novel Skype interface using SynFace for virtual speech reading support (2011). In: Proceedings from Fonetik 2011, June 8 - June 10, 2011: Speech, Music and Hearing, Quarterly Progress and Status Report, TMH-QPSR, Volume 51, 2011, Stockholm, Sweden, 2011, pp. 33-36. Conference paper (Other academic)
    Abstract [en]

    We describe in this paper a support client interface to the IP telephony application Skype. The system uses a variant of SynFace, a real-time speech reading support system using facial animation. The new interface is designed for use by elderly persons, and tailored for use in systems supporting touch screens. The SynFace real-time facial animation system has previously shown the ability to enhance speech comprehension for hearing impaired persons. In this study we employ at-home field studies on five subjects in the EU project MonAMI. We present insights from interviews with the test subjects on the advantages of the system, and on the limitations of such real-time speech reading technology in reaching the homes of the elderly and the hard of hearing.

  • 13.
    Al Moubayed, Samer
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation. KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Centrum för Talteknologi, CTT.
    Beskow, Jonas
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation. KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Centrum för Talteknologi, CTT.
    Effects of Visual Prominence Cues on Speech Intelligibility (2009). In: Proceedings of Auditory-Visual Speech Processing AVSP'09, Norwich, England, 2009. Conference paper (Refereed)
    Abstract [en]

    This study reports experimental results on the effect of visual prominence, presented as gestures, on speech intelligibility. 30 acoustically vocoded sentences, permutated into different gestural conditions were presented audio-visually to 12 subjects. The analysis of correct word recognition shows a significant increase in intelligibility when focally-accented (prominent) words are supplemented with head-nods or with eye-brow raise gestures. The paper also examines coupling other acoustic phenomena to brow-raise gestures. As a result, the paper introduces new evidence on the ability of the non-verbal movements in the visual modality to support audio-visual speech perception.

  • 14.
    Al Moubayed, Samer
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Beskow, Jonas
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Perception of Nonverbal Gestures of Prominence in Visual Speech Animation (2010). In: Proceedings of the ACM/SSPNET 2nd International Symposium on Facial Analysis and Animation, Edinburgh, UK, 2010, pp. 25-. Conference paper (Refereed)
    Abstract [en]

    It has long been recognized that visual speech information is important for speech perception [McGurk and MacDonald 1976] [Summerfield 1992]. Recently there has been an increasing interest in the verbal and non-verbal interaction between the visual and the acoustic modalities from production and perception perspectives. One of the prosodic phenomena which attracts much focus is prominence. Prominence is defined as when a linguistic segment is made salient in its context.

  • 15.
    Al Moubayed, Samer
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Beskow, Jonas
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Prominence Detection in Swedish Using Syllable Correlates (2010). In: Proceedings of the 11th Annual Conference of the International Speech Communication Association, INTERSPEECH 2010, Makuhari, Japan, 2010, pp. 1784-1787. Conference paper (Refereed)
    Abstract [en]

    This paper presents an approach to estimating word level prominence in Swedish using syllable level features. The paper discusses the mismatch problem of annotations between word level perceptual prominence and its acoustic correlates, context, and data scarcity. 200 sentences are annotated by 4 speech experts with prominence on 3 levels. A linear model for feature extraction is proposed on syllable level features, and the weights of these features are optimized to match the word level annotations. We show that using syllable level features and estimating weights for the acoustic correlates to minimize the word level estimation error gives better detection accuracy compared to word level features, and that both exceed the baseline accuracy.
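
    The weight-optimization step described above amounts to a least-squares fit of a linear model; a minimal sketch with synthetic data (not the paper's features or annotations) is shown below.

        # Sketch only: fit weights of a linear prominence model by least squares
        # so that weighted syllable-level features match word-level annotations.
        import numpy as np

        rng = np.random.default_rng(1)
        n_words, n_feats = 200, 6
        F = rng.normal(size=(n_words, n_feats))                      # per-word aggregated features
        annotated = rng.integers(0, 3, size=n_words).astype(float)   # 3 annotated prominence levels

        A = np.hstack([F, np.ones((n_words, 1))])                    # features plus bias term
        w, *_ = np.linalg.lstsq(A, annotated, rcond=None)
        estimate = A @ w
        print("word-level RMSE:", np.sqrt(np.mean((estimate - annotated) ** 2)))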

  • 16.
    Al Moubayed, Samer
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Beskow, Jonas
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Blomberg, Mats
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Granström, Björn
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Gustafson, Joakim
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Mirning, N.
    Skantze, Gabriel
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Talking with Furhat - multi-party interaction with a back-projected robot head (2012). In: Proceedings of Fonetik 2012, Gothenburg, Sweden, 2012, pp. 109-112. Conference paper (Other academic)
    Abstract [en]

    This is a condensed presentation of some recent work on a back-projected robotic head for multi-party interaction in public settings. We will describe some of the design strategies and give some preliminary analysis of an interaction database collected at the Robotville exhibition at the London Science Museum.

  • 17.
    Al Moubayed, Samer
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Beskow, Jonas
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Edlund, Jens
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Granström, Björn
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    House, David
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Animated Faces for Robotic Heads: Gaze and Beyond (2011). In: Analysis of Verbal and Nonverbal Communication and Enactment: The Processing Issues / [ed] Anna Esposito, Alessandro Vinciarelli, Klára Vicsi, Catherine Pelachaud and Anton Nijholt, Springer Berlin/Heidelberg, 2011, pp. 19-35. Conference paper (Refereed)
    Abstract [en]

    We introduce an approach to using animated faces for robotics where a static physical object is used as a projection surface for an animation. The talking head is projected onto a 3D physical head model. In this chapter we discuss the different benefits this approach adds over mechanical heads. After that, we investigate a phenomenon commonly referred to as the Mona Lisa gaze effect. This effect results from the use of 2D surfaces to display 3D images and causes the gaze of a portrait to seemingly follow the observer no matter where it is viewed from. The experiment investigates the perception of gaze direction by observers. The analysis shows that the 3D model eliminates the effect, and provides an accurate perception of gaze direction. We discuss at the end the different requirements of gaze in interactive systems, and explore the different settings these findings give access to.

  • 18.
    Al Moubayed, Samer
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation. KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Centrum för Talteknologi, CTT.
    Beskow, Jonas
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation. KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Centrum för Talteknologi, CTT.
    Granström, Björn
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation. KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Centrum för Talteknologi, CTT.
    Auditory visual prominence: From intelligibility to behavior (2009). In: Journal on Multimodal User Interfaces, ISSN 1783-7677, E-ISSN 1783-8738, Vol. 3, no. 4, pp. 299-309. Journal article (Refereed)
    Abstract [en]

    Auditory prominence is defined as when an acoustic segment is made salient in its context. Prominence is one of the prosodic functions that has been shown to be strongly correlated with facial movements. In this work, we investigate the effects of facial prominence cues, in terms of gestures, when synthesized on animated talking heads. In the first study, a speech intelligibility experiment is conducted: speech quality is acoustically degraded, the fundamental frequency is removed from the signal, and the speech is then presented to 12 subjects through a lip synchronized talking head carrying head-nod and eyebrow raise gestures, which are synchronized with the auditory prominence. The experiment shows that presenting prominence as facial gestures significantly increases speech intelligibility compared to when these gestures are randomly added to speech. We also present a follow-up study examining the perception of the behavior of the talking heads when gestures are added over pitch accents. Using eye-gaze tracking technology and questionnaires with 10 moderately hearing impaired subjects, the gaze data show that users look at the face in a similar fashion to how they look at a natural face when gestures are coupled with pitch accents, as opposed to when the face carries no gestures. The results from the questionnaires also show that these gestures significantly increase the naturalness and the understanding of the talking head.

  • 19.
    Al Moubayed, Samer
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Beskow, Jonas
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Granström, Björn
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Gustafson, Joakim
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Mirning, Nicole
    Skantze, Gabriel
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Tscheligi, Manfred
    Furhat goes to Robotville: a large-scale multiparty human-robot interaction data collection in a public space (2012). In: Proc of LREC Workshop on Multimodal Corpora, Istanbul, Turkey, 2012. Conference paper (Refereed)
    Abstract [en]

    In the four days of the Robotville exhibition at the London Science Museum, UK, during which the back-projected head Furhat in a situated spoken dialogue system was seen by almost 8 000 visitors, we collected a database of 10 000 utterances spoken to Furhat in situated interaction. The data collection is an example of a particular kind of corpus collection of human-machine dialogues in public spaces that has several interesting and specific characteristics, both with respect to the technical details of the collection and with respect to the resulting corpus contents. In this paper, we take the Furhat data collection as a starting point for a discussion of the motives for this type of data collection, its technical peculiarities and prerequisites, and the characteristics of the resulting corpus.

  • 20.
    Al Moubayed, Samer
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation. KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Centrum för Talteknologi, CTT.
    Beskow, Jonas
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation. KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Centrum för Talteknologi, CTT.
    Salvi, Giampiero
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation. KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Centrum för Talteknologi, CTT.
    SynFace Phone Recognizer for Swedish Wideband and Narrowband Speech (2008). In: Proceedings of The Second Swedish Language Technology Conference (SLTC), Stockholm, Sweden, 2008, pp. 3-6. Conference paper (Other academic)
    Abstract [en]

    In this paper, we present new results and comparisons of the real-time lip-synchronized talking head SynFace on different Swedish databases and bandwidths. The work involves training SynFace on narrow-band telephone speech from the Swedish SpeechDat, and on the narrow-band and wide-band Speecon corpus. Auditory perceptual tests are being established for SynFace as an audio-visual hearing support for the hearing-impaired. Preliminary results show high recognition accuracy compared to other languages.

  • 21.
    Al Moubayed, Samer
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Beskow, Jonas
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Skantze, Gabriel
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Spontaneous spoken dialogues with the Furhat human-like robot head (2014). In: HRI '14 Proceedings of the 2014 ACM/IEEE international conference on Human-robot interaction, Bielefeld, Germany, 2014, pp. 326-. Conference paper (Refereed)
    Abstract [en]

    We will show in this demonstrator an advanced multimodal and multiparty spoken conversational system using Furhat, a robot head based on projected facial animation. Furhat is an anthropomorphic robot head that utilizes facial animation for physical robot heads using back-projection. In the system, multimodality is enabled using speech and rich visual input signals such as multi-person real-time face tracking and microphone tracking. The demonstrator will showcase a system that is able to carry out social dialogue with multiple interlocutors simultaneously, with rich output signals such as eye and head coordination, lip-synchronized speech synthesis, and non-verbal facial gestures used to regulate fluent and expressive multiparty conversations. The dialogue design is performed using the IrisTK [4] dialogue authoring toolkit developed at KTH. The system will also be able to act as a moderator in a quiz game, showing different strategies for regulating spoken situated interactions.

  • 22.
    Al Moubayed, Samer
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Beskow, Jonas
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Skantze, Gabriel
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    The Furhat Social Companion Talking Head (2013). In: Interspeech 2013 - Show and Tell, 2013, pp. 747-749. Conference paper (Refereed)
    Abstract [en]

    In this demonstrator we present the Furhat robot head. Furhat is a highly human-like robot head in terms of dynamics, thanks to its use of back-projected facial animation. Furhat also takes advantage of a complex and advanced dialogue toolkit designed to facilitate rich and fluent multimodal multiparty human-machine situated and spoken dialogue. The demonstrator will present a social dialogue system with Furhat that allows for several simultaneous interlocutors, and takes advantage of several verbal and nonverbal input signals such as speech input, real-time multi-face tracking, and facial analysis, and communicates with its users in a mixed initiative dialogue, using state of the art speech synthesis, with rich prosody, lip animated facial synthesis, eye and head movements, and gestures.

  • 23.
    Al Moubayed, Samer
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation. KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Centrum för Talteknologi, CTT.
    Beskow, Jonas
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation. KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Centrum för Talteknologi, CTT.
    Öster, Anne-Marie
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation. KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Centrum för Talteknologi, CTT.
    Salvi, Giampiero
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation. KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Centrum för Talteknologi, CTT.
    Granström, Björn
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation. KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Centrum för Talteknologi, CTT.
    van Son, Nic
    Viataal, Nijmegen, The Netherlands.
    Ormel, Ellen
    Viataal, Nijmegen, The Netherlands.
    Herzke, Tobias
    HörTech gGmbH, Germany.
    Studies on Using the SynFace Talking Head for the Hearing Impaired (2009). In: Proceedings of Fonetik'09: The XXIIth Swedish Phonetics Conference, June 10-12, 2009 / [ed] Peter Branderud, Hartmut Traunmüller, Stockholm: Stockholm University, 2009, pp. 140-143. Conference paper (Other academic)
    Abstract [en]

    SynFace is a lip-synchronized talking agent which is optimized as a visual reading support for the hearing impaired. In this paper we present the large scale hearing impaired user studies carried out for three languages in the Hearing at Home project. The user tests focus on measuring the gain in Speech Reception Threshold in Noise and the effort scaling when SynFace is used by hearing impaired people, where groups of hearing impaired subjects with different impairment levels from mild to severe and cochlear implants are tested. Preliminary analysis of the results does not show a significant gain in SRT or in effort scaling. But looking at the large cross-subject variability in both tests, it is clear that many subjects benefit from SynFace, especially with speech in stereo babble.

  • 24.
    Al Moubayed, Samer
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Edlund, Jens
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Beskow, Jonas
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Taming Mona Lisa: communicating gaze faithfully in 2D and 3D facial projections (2012). In: ACM Transactions on Interactive Intelligent Systems, ISSN 2160-6455, Vol. 1, no. 2, pp. 25-, article id 11. Journal article (Refereed)
    Abstract [en]

    The perception of gaze plays a crucial role in human-human interaction. Gaze has been shown to matter for a number of aspects of communication and dialogue, especially for managing the flow of the dialogue and participant attention, for deictic referencing, and for the communication of attitude. When developing embodied conversational agents (ECAs) and talking heads, modeling and delivering accurate gaze targets is crucial. Traditionally, systems communicating through talking heads have been displayed to the human conversant using 2D displays, such as flat monitors. This approach introduces severe limitations for an accurate communication of gaze since 2D displays are associated with several powerful effects and illusions, most importantly the Mona Lisa gaze effect, where the gaze of the projected head appears to follow the observer regardless of viewing angle. We describe the Mona Lisa gaze effect and its consequences in the interaction loop, and propose a new approach for displaying talking heads using a 3D projection surface (a physical model of a human head) as an alternative to the traditional flat surface projection. We investigate and compare the accuracy of the perception of gaze direction and the Mona Lisa gaze effect in 2D and 3D projection surfaces in a five subject gaze perception experiment. The experiment confirms that a 3D projection surface completely eliminates the Mona Lisa gaze effect and delivers very accurate gaze direction that is independent of the observer's viewing angle. Based on the data collected in this experiment, we rephrase the formulation of the Mona Lisa gaze effect. The data, when reinterpreted, confirms the predictions of the new model for both 2D and 3D projection surfaces. Finally, we discuss the requirements on different spatially interactive systems in terms of gaze direction, and propose new applications and experiments for interaction in human-ECA and human-robot settings made possible by this technology.

  • 25.
    Al Moubayed, Samer
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Edlund, Jens
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Gustafson, Joakim
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Analysis of gaze and speech patterns in three-party quiz game interaction (2013). In: Interspeech 2013, 2013, pp. 1126-1130. Conference paper (Refereed)
    Abstract [en]

    In order to understand and model the dynamics between interaction phenomena such as gaze and speech in face-to-face multiparty interaction between humans, we need large quantities of reliable, objective data of such interactions. To date, this type of data is in short supply. We present a data collection setup using automated, objective techniques in which we capture the gaze and speech patterns of triads deeply engaged in a high-stakes quiz game. The resulting corpus consists of five one-hour recordings, and is unique in that it makes use of three state-of-the-art gaze trackers (one per subject) in combination with a state-of-the-art conical microphone array designed to capture roundtable meetings. Several video channels are also included. In this paper we present the obstacles we encountered and the possibilities afforded by a synchronised, reliable combination of large-scale multi-party speech and gaze data, and an overview of the first analyses of the data. Index Terms: multimodal corpus, multiparty dialogue, gaze patterns, multiparty gaze.

  • 26.
    Al Moubayed, Samer
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Skantze, Gabriel
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Effects of 2D and 3D Displays on Turn-taking Behavior in Multiparty Human-Computer Dialog (2011). In: SemDial 2011: Proceedings of the 15th Workshop on the Semantics and Pragmatics of Dialogue / [ed] Ron Artstein, Mark Core, David DeVault, Kallirroi Georgila, Elsi Kaiser, Amanda Stent, Los Angeles, CA, 2011, pp. 192-193. Conference paper (Refereed)
    Abstract [en]

    The perception of gaze from an animated agent on a 2D display has been shown to suffer from the Mona Lisa effect, which means that exclusive mutual gaze cannot be established if there is more than one observer. In this study, we investigate this effect when it comes to turn-taking control in a multi-party human-computer dialog setting, where a 2D display is compared to a 3D projection. The results show that the 2D setting results in longer response times and lower turn-taking accuracy.

  • 27.
    Al Moubayed, Samer
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Skantze, Gabriel
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Turn-taking Control Using Gaze in Multiparty Human-Computer Dialogue: Effects of 2D and 3D Displays (2011). In: Proceedings of the International Conference on Audio-Visual Speech Processing 2011, Stockholm: KTH Royal Institute of Technology, 2011, pp. 99-102. Conference paper (Refereed)
    Abstract [en]

    In a previous experiment we found that the perception of gaze from an animated agent on a two-dimensional display suffers from the Mona Lisa effect, which means that exclusive mutual gaze cannot be established if there is more than one observer. By using a three-dimensional projection surface, this effect can be eliminated. In this study, we investigate whether this difference also holds for the turn-taking behaviour of subjects interacting with the animated agent in a multi-party dialogue. We present a Wizard-of-Oz experiment where five subjects talk to an animated agent in a route direction dialogue. The results show that the subjects to some extent can infer the intended target of the agent’s questions, in spite of the Mona Lisa effect, but that the accuracy of gaze when it comes to selecting an addressee is still significantly lower in the 2D condition, as compared to the 3D condition. The response time is also significantly longer in the 2D condition, indicating that the inference of intended gaze may require additional cognitive efforts.

  • 28.
    Al Moubayed, Samer
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH.
    Skantze, Gabriel
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH.
    Beskow, Jonas
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH.
    Lip-reading: Furhat audio visual intelligibility of a back projected animated face (2012). In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Springer Berlin/Heidelberg, 2012, pp. 196-203. Conference paper (Refereed)
    Abstract [en]

    Back projecting a computer animated face onto a three dimensional static physical model of a face is a promising technology that is gaining ground as a solution to building situated, flexible and human-like robot heads. In this paper, we first briefly describe Furhat, a back projected robot head built for the purpose of multimodal multiparty human-machine interaction, and its benefits over virtual characters and robotic heads; we then motivate the need to investigate the contribution to speech intelligibility that Furhat's face offers. We present an audio-visual speech intelligibility experiment, in which 10 subjects listened to short sentences with a degraded speech signal. The experiment compares the gain in intelligibility between lip reading a face visualized on a 2D screen and a 3D back-projected face, and from different viewing angles. The results show that the audio-visual speech intelligibility holds when the avatar is projected onto a static face model (in the case of Furhat), and even, rather surprisingly, exceeds it. This means that despite the movement limitations back projected animated face models bring about, their audio visual speech intelligibility is equal, or even higher, compared to the same models shown on flat displays. At the end of the paper we discuss several hypotheses on how to interpret the results, and motivate future investigations to better explore the characteristics of visual speech perception of 3D projected faces.

  • 29.
    Alexanderson, Simon
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Beskow, Jonas
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Animated Lombard speech: Motion capture, facial animation and visual intelligibility of speech produced in adverse conditions (2014). In: Computer Speech & Language (Print), ISSN 0885-2308, E-ISSN 1095-8363, Vol. 28, no. 2, pp. 607-618. Journal article (Refereed)
    Abstract [en]

    In this paper we study the production and perception of speech in diverse conditions for the purposes of accurate, flexible and highly intelligible talking face animation. We recorded audio, video and facial motion capture data of a talker uttering a set of 180 short sentences, under three conditions: normal speech (in quiet), Lombard speech (in noise), and whispering. We then produced an animated 3D avatar with similar shape and appearance as the original talker and used an error minimization procedure to drive the animated version of the talker in a way that matched the original performance as closely as possible. In a perceptual intelligibility study with degraded audio we then compared the animated talker against the real talker and the audio alone, in terms of audio-visual word recognition rate across the three different production conditions. We found that the visual intelligibility of the animated talker was on par with the real talker for the Lombard and whisper conditions. In addition we created two incongruent conditions where normal speech audio was paired with animated Lombard speech or whispering. When compared to the congruent normal speech condition, Lombard animation yields a significant increase in intelligibility, despite the AV-incongruence. In a separate evaluation, we gathered subjective opinions on the different animations, and found that some degree of incongruence was generally accepted.

  • 30.
    Alexanderson, Simon
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Beskow, Jonas
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Can Anybody Read Me? Motion Capture Recordings for an Adaptable Visual Speech Synthesizer (2012). In: Proceedings of The Listening Talker, Edinburgh, UK, 2012, pp. 52-52. Conference paper (Refereed)
  • 31.
    Alexanderson, Simon
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Beskow, Jonas
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Towards Fully Automated Motion Capture of Signs -- Development and Evaluation of a Key Word Signing Avatar (2015). In: ACM Transactions on Accessible Computing, ISSN 1936-7228, Vol. 7, no. 2, pp. 7:1-7:17. Journal article (Refereed)
    Abstract [en]

    Motion capture of signs provides unique challenges in the field of multimodal data collection. The dense packaging of visual information requires high fidelity and high bandwidth of the captured data. Even though marker-based optical motion capture provides many desirable features such as high accuracy, global fitting, and the ability to record body and face simultaneously, it is not widely used to record finger motion, especially not for articulated and syntactic motion such as signs. Instead, most signing avatar projects use costly instrumented gloves, which require long calibration procedures. In this article, we evaluate the data quality obtained from optical motion capture of isolated signs from Swedish sign language with a large number of low-cost cameras. We also present a novel dual-sensor approach to combine the data with low-cost, five-sensor instrumented gloves to provide a recording method with low manual postprocessing. Finally, we evaluate the collected data and the dual-sensor approach as transferred to a highly stylized avatar. The application of the avatar is a game-based environment for training Key Word Signing (KWS) as augmented and alternative communication (AAC), intended for children with communication disabilities.

  • 32.
    Alexanderson, Simon
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    House, David
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Beskow, Jonas
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Aspects of co-occurring syllables and head nods in spontaneous dialogue (2013). In: Proceedings of the 12th International Conference on Auditory-Visual Speech Processing (AVSP2013), 2013, pp. 169-172. Conference paper (Refereed)
    Abstract [en]

    This paper reports on the extraction and analysis of head nods taken from motion capture data of spontaneous dialogue in Swedish. The head nods were extracted automatically and then manually classified in terms of gestures having a beat function or multifunctional gestures. Prosodic features were extracted from syllables co-occurring with the beat gestures. While the peak rotation of the nod is on average aligned with the stressed syllable, the results show considerable variation in fine temporal synchronization. The syllables co-occurring with the gestures generally show greater intensity, higher F0, and greater F0 range when compared to the mean across the entire dialogue. A functional analysis shows that the majority of the syllables belong to words bearing a focal accent.
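
    Automatic nod extraction of the kind described above is often approached as peak picking on a head rotation signal; the sketch below illustrates that general idea with synthetic data and invented thresholds, not the authors' actual procedure.

        # Illustrative nod-candidate detection (not the paper's method): find
        # prominent peaks in a head pitch-rotation signal from motion capture.
        import numpy as np
        from scipy.signal import find_peaks

        fs = 100.0                                  # assumed frames per second
        t = np.arange(0.0, 10.0, 1.0 / fs)
        pitch = 2.0 * np.sin(2 * np.pi * 0.4 * t) + 0.3 * np.random.randn(t.size)

        # Invented criteria: peaks of at least 1.5 degrees, at least 0.5 s apart.
        peaks, _ = find_peaks(pitch, prominence=1.5, distance=int(0.5 * fs))
        print(len(peaks), "nod candidates at", np.round(t[peaks], 2), "s")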

  • 33.
    Alexanderson, Simon
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    House, David
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Beskow, Jonas
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Extracting and analysing co-speech head gestures from motion-capture data (2013). In: Proceedings of Fonetik 2013 / [ed] Eklund, Robert, Linköping University Electronic Press, 2013, pp. 1-4. Conference paper (Refereed)
  • 34.
    Alexanderson, Simon
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    House, David
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Beskow, Jonas
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Extracting and analyzing head movements accompanying spontaneous dialogue (2013). In: Conference Proceedings TiGeR 2013: Tilburg Gesture Research Meeting, 2013. Conference paper (Refereed)
    Abstract [en]

    This paper reports on a method developed for extracting and analyzing head gestures taken from motion capture data of spontaneous dialogue in Swedish. Candidate head gestures with beat function were extracted automatically and then manually classified using a 3D player which displays time-synced audio and 3D point data of the motion capture markers together with animated characters. Prosodic features were extracted from syllables co-occurring with a subset of the classified gestures. The beat gestures show considerable variation in temporal synchronization with the syllables, while the syllables generally show greater intensity, higher F0, and greater F0 range when compared to the mean across the entire dialogue. Additional features for further analysis and automatic classification of the head gestures are discussed.

  • 35. Allan, James
    et al.
    Aslam, Jay
    Azzopardi, Leif
    Belkin, Nick
    Borlund, Pia
    Bruza, Peter
    Callan, Jamie
    Carman, Mark
    Clarke, Charles L.A.
    Craswell, Nick
    Croft, W. Bruce
    Culpepper, J. Shane
    Diaz, Fernando
    Dumais, Susan
    Ferro, Nicola
    Geva, Shlomo
    Gonzalo, Julio
    Hawking, David
    Jarvelin, Kalervo
    Jones, Gareth
    Jones, Rosie
    Kamps, Jaap
    Kando, Noriko
    Kanoulas, Evangelos
    Karlgren, Jussi
    KTH, Skolan för datavetenskap och kommunikation (CSC), Teoretisk datalogi, TCS.
    Kelly, Diane
    Lease, Matthew
    Lin, Jimmy
    Mizzaro, Stefano
    Moffat, Alistair
    Murdock, Vanessa
    Oard, Douglas W.
    Rijke, Maarten de
    Sakai, Tetsuya
    Sanderson, Mark
    Scholer, Falk
    Si, Luo
    Thom, James A.
    Thomas, Paul
    Trotman, Andrew
    Turpin, Andrew
    Vries, Arjen P. de
    Webber, William
    Zhang, Xiuzhen (Jenny)
    Zhang, Yi
    Frontiers, Challenges, and Opportunities for Information Retrieval – Report from SWIRL 2012, The Second Strategic Workshop on Information Retrieval in Lorne (2012). In: SIGIR Forum, ISSN 0163-5840, Vol. 46, no. 1, pp. 2-32. Journal article (Refereed)
    Abstract [en]

    During a three-day workshop in February 2012, 45 Information Retrieval researchers met to discuss long-range challenges and opportunities within the field. The result of the workshop is a diverse set of research directions, project ideas, and challenge areas. This report describes the workshop format, provides summaries of broad themes that emerged, includes brief descriptions of all the ideas, and provides detailed discussion of six proposals that were voted "most interesting" by the participants. Key themes include the need to: move beyond ranked lists of documents to support richer dialog and presentation, represent the context of search and searchers, provide richer support for information seeking, enable retrieval of a wide range of structured and unstructured content, and develop new evaluation methodologies.

  • 36. Alonso, Omar
    et al.
    Kamps, Jaap
    Karlgren, Jussi
    KTH, Skolan för datavetenskap och kommunikation (CSC), Teoretisk datalogi, TCS.
    Report on the Fourth Workshop on Exploiting Semantic Annotations in Information Retrieval (ESAIR 11) (2012). In: SIGIR Forum, ISSN 0163-5840, E-ISSN 1558-0229, Vol. 46, no. 1, pp. 56-64. Journal article (Refereed)
    Abstract [en]

    There is an increasing amount of structure on the Web as a result of modern Web languages, user tagging and annotation, and emerging robust NLP tools. These meaningful, semantic, annotations hold the promise to significantly enhance information access, by increasing the depth of analysis of today’s systems. Currently, we have only started to explore the possibilities and only begun to understand how these valuable semantic cues can be put to fruitful use. The workshop had an interactive format consisting of keynotes, boasters and posters, breakout groups and reports, and a final discussion, which was prolonged into the evening. There was a strong feeling that we made substantial progress. Specifically, each of the breakout groups contributed to our understanding of the way forward. First, annotations and use cases come in many different shapes and forms depending on the domain at hand, but at a higher level there are remarkable commonalities in annotation tools, indexing methods, user interfaces, and general methodology. Second, we got insights in the "exploitation" aspects, leading to a clear separation between the low-level annotations giving context or meaning to small units of information (e.g., NLP, sentiments, entities), and annotations bringing out the structure inherent in the data (e.g., sources, data schemas, document genres). Third, the plan to enrich ClueWeb with various document level (e.g., pagerank and spam scores, but also reading level) and lower level (e.g., named entities or sentiments) annotations was embraced by the workshop as a concrete next step to promote research in semantic annotations.

  • 37. Altmann, U.
    et al.
    Oertel, Catharine
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Campbell, N.
    Conversational Involvement and Synchronous Nonverbal Behaviour (2012). In: Cognitive Behavioural Systems: COST 2102 International Training School, Dresden, Germany, February 21-26, 2011, Revised Selected Papers / [ed] Anna Esposito, Antonietta M. Esposito, Alessandro Vinciarelli, Rüdiger Hoffmann, Vincent C. Müller, Springer Berlin/Heidelberg, 2012, pp. 343-352. Conference paper (Refereed)
    Abstract [en]

    Measuring the quality of an interaction by means of low-level cues has been the topic of many studies in the last couple of years. In this study we propose a novel method for conversation quality assessment. We first test whether manual ratings of conversational involvement and automatic estimation of synchronisation of facial activity are correlated. We hypothesise that the higher the synchrony, the higher the involvement. We compare two different synchronisation measures. The first measure is defined as the similarity of facial activity at a given point in time. The second is based on dependence analyses between the facial activity time series of two interlocutors. We found that the dependence measure correlates more with conversational involvement than the similarity measure.
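
    The two kinds of synchronisation measures mentioned above can be illustrated with a toy computation on two synthetic facial-activity time series; the functions below are simplified stand-ins, not the measures defined in the paper.

        # Toy synchrony measures (illustration only): pointwise similarity of two
        # activity signals, and the best Pearson correlation over small time lags.
        import numpy as np

        def similarity(a, b):
            diff = np.abs(a - b)
            return 1.0 - diff / (np.max(diff) + 1e-9)

        def max_lagged_corr(a, b, max_lag=25):
            best = -1.0
            for lag in range(-max_lag, max_lag + 1):
                x = a[max(lag, 0): len(a) + min(lag, 0)]
                y = b[max(-lag, 0): len(b) + min(-lag, 0)]
                best = max(best, np.corrcoef(x, y)[0, 1])
            return best

        rng = np.random.default_rng(2)
        a = rng.normal(size=1000).cumsum()                      # synthetic facial activity A
        b = np.roll(a, 10) + rng.normal(scale=0.5, size=1000)   # delayed, noisy copy as activity B
        print("mean similarity:", similarity(a, b).mean())
        print("max lagged correlation:", max_lagged_corr(a, b))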

  • 38.
    Altosaar, Toomas
    et al.
    Aalto Univ. School of Science and Tech., Dept. of Signal Proc. & Acoustics.
    ten Bosch, Louis
    Radboud University Nijmegen, Language and Speech unit.
    Aimetti, Guillaume
    Univ. of Sheffield, Speech & Hearing group, Dept. of Computer Science.
    Koniaris, Christos
    KTH, Skolan för elektro- och systemteknik (EES), Ljud- och bildbehandling (Stängd 130101).
    Demuynck, Kris
    K.U.Leuven - ESAT/PSI.
    van den Heuvel, Henk
    Radboud University Nijmegen, Language and Speech unit.
    A Speech Corpus for Modeling Language Acquisition: CAREGIVER2010Ingår i: 7th International Conference on Language Resources and Evaluation (LREC) 2010, Valletta, Malta / [ed] Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis and Mike Rosner and Daniel Tapias, European Language Resources Association (ELRA) , 2010, s. 1062-1068Konferensbidrag (Refereegranskat)
    Abstract [en]

    A multi-lingual speech corpus for modeling language acquisition, called CAREGIVER, has been designed and recorded within the framework of the EU-funded Acquisition of Communication and Recognition Skills (ACORNS) project. The paper describes the motivation behind the corpus and its design, which relies on current knowledge regarding infant language acquisition. Instead of recording infants and children, the voices of their primary and secondary caregivers were captured in both infant-directed and adult-directed speech modes, in a read-speech manner, over four languages. The challenges and methods applied to obtain prompts of similar complexity and semantics across the different languages, as well as the normalized recording procedures employed at the different locations, are covered. The corpus contains nearly 66,000 utterance-based audio files spoken over a two-year period by 17 male and 17 female native speakers of Dutch, English, Finnish, and Swedish. An orthographic transcription is available for every utterance, and time-aligned word and phone annotations exist for many of the sub-corpora. The CAREGIVER corpus will be published via ELRA.

  • 39. Ambrazaitis, G.
    et al.
    Svensson Lundmark, M.
    House, David
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Head beats and eyebrow movements as a function of phonological prominence levels and word accents in Stockholm Swedish news broadcasts2015Ingår i: The 3rd European Symposium on Multimodal Communication, Dublin, Ireland, 2015Konferensbidrag (Refereegranskat)
  • 40. Ambrazaitis, G.
    et al.
    Svensson Lundmark, M.
    House, David
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Head Movements, Eyebrows, and Phonological Prosodic Prominence Levels in Stockholm2015Ingår i: 13th International Conference on Auditory-Visual Speech Processing (AVSP 2015), Vienna, Austria, 2015, s. 42-Konferensbidrag (Refereegranskat)
  • 41. Ambrazaitis, G.
    et al.
    Svensson Lundmark, M.
    House, David
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Multimodal levels of prominence: a preliminary analysis of head and eyebrow movements in Swedish news broadcasts2015Ingår i: Proceedings of Fonetik 2015 / [ed] Lundmark Svensson, M.; Ambrazaitis, G.; van de Weijer, J., Lund, 2015, s. 11-16Konferensbidrag (Övrigt vetenskapligt)
  • 42. Amundin, Mats
    et al.
    Eklund, Robert
    Hållsten, Henrik
    Karlgren, Jussi
    KTH, Skolan för datavetenskap och kommunikation (CSC), Teoretisk datalogi, TCS.
    Molinder, Lars
    A proposal to use distributional models to analyse dolphin vocalization2017Ingår i: 1st International Workshop on Vocal Interactivity in-and-between Humans, Animals and Robots, 2017, 2017Konferensbidrag (Refereegranskat)
    Abstract [en]

    This paper gives a brief introduction to the starting points of an experimental project to study dolphin communicative behaviour using distributional semantics, with methods implemented for the large-scale study of human language.

  • 43.
    Ananthakrishnan, Gopal
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Imitating Adult Speech: An Infant's Motivation2011Ingår i: 9th International Seminar on Speech Production, 2011, s. 361-368Konferensbidrag (Refereegranskat)
    Abstract [en]

    This paper details two aspects of speech acquisition by infants that are often assumed to be intrinsic or innate knowledge, namely the number of degrees of freedom in the articulatory parameters and the acoustic correlates that establish the correspondence between adult speech and the speech produced by the infant. The paper shows that being able to distinguish the different vowels in the vowel space of a given language is a strong motivation for choosing both a certain number of independent articulatory parameters and a certain scheme of acoustic normalization between adult and child speech.

  • 44.
    Ananthakrishnan, Gopal
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation. KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Centrum för Talteknologi, CTT.
    Badin, P.
    GIPSA-Lab, Grenoble University.
    Vargas, J. A. V.
    GIPSA-Lab, Grenoble University.
    Engwall, Olov
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation. KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Centrum för Talteknologi, CTT.
    Predicting Unseen Articulations from Multi-speaker Articulatory Models2010Ingår i: Proceedings of the 11th Annual Conference of the International Speech Communication Association, INTERSPEECH 2010, Makuhari, Japan, 2010, s. 1588-1591Konferensbidrag (Refereegranskat)
    Abstract [en]

    In order to study inter-speaker variability, this work aims to assess the generalization capabilities of data-based multi-speaker articulatory models. We use various three-mode factor analysis techniques to model the variations of midsagittal vocal tract contours obtained from MRI images for three French speakers articulating 73 vowels and consonants. Articulations of a given speaker for phonemes not present in the training set are then predicted by inversion of the models from measurements of these phonemes articulated by the other subjects. On average, the prediction RMSE was 5.25 mm for tongue contours and 3.3 mm for 2D midsagittal vocal tract distances. In addition, this study established a methodology to determine the optimal number of factors for such models.
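
    A minimal sketch of the general idea on synthetic data, replacing the paper's three-mode factor analysis with a much simpler additive speaker-plus-phoneme model: the contour of a phoneme unseen for one speaker is predicted from the other speakers' articulations. The sizes, names, and the decomposition are assumptions, not the published method.

    # Hypothetical simplification: contour(speaker, phoneme) ~ speaker mean + shared phoneme offset.
    import numpy as np

    rng = np.random.default_rng(0)
    n_speakers, n_phonemes, n_points = 3, 73, 40   # sizes chosen for illustration

    # Synthetic stand-in for midsagittal contour data (mm): speaker effect + phoneme effect + noise.
    speaker_mean = rng.normal(0.0, 2.0, (n_speakers, 1, n_points))
    phoneme_offset = rng.normal(0.0, 3.0, (1, n_phonemes, n_points))
    contours = speaker_mean + phoneme_offset + rng.normal(0.0, 0.5, (n_speakers, n_phonemes, n_points))

    target_speaker, unseen_phoneme = 0, 10
    others = [s for s in range(n_speakers) if s != target_speaker]
    seen = [p for p in range(n_phonemes) if p != unseen_phoneme]

    # Estimate the shared phoneme offset from the other speakers' data.
    est_offset = np.mean(
        [contours[s, unseen_phoneme] - contours[s, seen].mean(axis=0) for s in others], axis=0)

    # Predict the unseen articulation for the target speaker and measure the RMSE.
    prediction = contours[target_speaker, seen].mean(axis=0) + est_offset
    rmse = np.sqrt(np.mean((prediction - contours[target_speaker, unseen_phoneme]) ** 2))
    print(f"prediction RMSE: {rmse:.2f} mm")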

  • 45.
    Ananthakrishnan, Gopal
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Eklund, Robert
    Voice Provider, Stockholm.
    Peters, Gustav
    Forschungsinstitut Alexander Koenig, Bonn, Germany.
    Mabiza, Evans
    Antelope Park, Gweru, Zimbabwe.
    An acoustic analysis of lion roars. II: Vocal tract characteristics2011Ingår i: Proceedings from Fonetik 2011: Speech, Music and Hearing Quarterly Progress and Status Report, TMH-QPSR, Volume 51, 2011, Stockholm: KTH Royal Institute of Technology, 2011, Vol. 51, nr 1, s. 5-8Konferensbidrag (Övrigt vetenskapligt)
    Abstract [en]

    This paper makes the first attempt to perform an acoustic-to-articulatory inversion of a lion (Panthera leo) roar. The main problem encountered in attempting this is that little is known about the dimensions of the vocal tract, other than a general range of vocal tract lengths. Equally little is known about the articulation strategies adopted by the lion while roaring. The approach used here is to iterate over possible vocal tract lengths and vocal tract configurations. Since there seem to be distinct articulatory changes during the course of a roar, we find a smooth path that minimizes the error function between a recorded roar and a roar simulated using a variable-length articulatory model.
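
    A minimal sketch of the "smooth path over candidate configurations" idea, using dynamic programming over per-frame candidates with a quadratic jump penalty. The error terms, candidate values, and weights are placeholders, not the article's articulatory model.

    # Hypothetical sketch: pick a smooth path through per-frame candidate configurations.
    import numpy as np

    def smooth_path(error, candidates, smooth_weight=1.0):
        # error[t, k]: acoustic error of candidate k at frame t.
        # candidates[t, k]: a scalar parameter (e.g. vocal tract length) for candidate k.
        T, K = error.shape
        cost = np.full((T, K), np.inf)
        back = np.zeros((T, K), dtype=int)
        cost[0] = error[0]
        for t in range(1, T):
            for k in range(K):
                jump = smooth_weight * (candidates[t, k] - candidates[t - 1]) ** 2
                total = cost[t - 1] + jump + error[t, k]
                back[t, k] = int(np.argmin(total))
                cost[t, k] = total[back[t, k]]
        path = [int(np.argmin(cost[-1]))]
        for t in range(T - 1, 0, -1):
            path.append(back[t, path[-1]])
        return path[::-1]

    # Toy usage: 5 frames, 4 candidate vocal tract lengths (mm) per frame.
    rng = np.random.default_rng(1)
    err = rng.random((5, 4))
    cand = np.tile(np.array([800.0, 900.0, 1000.0, 1100.0]), (5, 1))
    print(smooth_path(err, cand))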

  • 46.
    Ananthakrishnan, Gopal
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation. KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Centrum för Talteknologi, CTT.
    Engwall, Olov
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation. KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Centrum för Talteknologi, CTT.
    Important regions in the articulator trajectory2008Ingår i: Proceedings of International Seminar on Speech Production / [ed] Rudolph Sock, Susanne Fuchs, Yves Laprie, Strasbourg, France: INRIA , 2008, s. 305-308Konferensbidrag (Refereegranskat)
    Abstract [en]

    This paper deals with identifying important regions in the articulatory trajectory based on the physical properties of the trajectory. A method to locate critical time instants as well as the key articulator positions is suggested. Acoustic-to-Articulatory Inversion using linear and non-linear regression is performed using only these critical points. The accuracy of inversion is found to be almost the same as using all the data points.
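
    A minimal sketch of one plausible reading of "critical time instants based on physical properties of the trajectory", using velocity zero-crossings (direction changes) plus the endpoints on a synthetic trajectory; the paper's actual criterion may differ.

    # Hypothetical sketch: critical instants as velocity zero-crossings of the trajectory.
    import numpy as np

    def critical_points(x, dt=0.01):
        x = np.asarray(x, dtype=float)
        v = np.gradient(x, dt)
        sign_change = np.where(np.diff(np.signbit(v)))[0]   # direction changes
        endpoints = np.array([0, len(x) - 1])
        return np.unique(np.concatenate([endpoints, sign_change]))

    # Toy usage: a synthetic articulator trajectory.
    t = np.linspace(0, 1, 200)
    traj = np.sin(2 * np.pi * 3 * t) * np.exp(-t)
    idx = critical_points(traj)
    print(idx, traj[idx])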

  • 47.
    Ananthakrishnan, Gopal
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation. KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Centrum för Talteknologi, CTT.
    Engwall, Olov
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation. KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Centrum för Talteknologi, CTT.
    Resolving Non-uniqueness in the Acoustic-to-Articulatory Mapping2011Ingår i: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, Prague, Czech republic, 2011, s. 4628-4631Konferensbidrag (Refereegranskat)
    Abstract [en]

    This paper studies the role of non-uniqueness in acoustic-to-articulatory inversion. It is generally believed that applying continuity constraints to the estimates of the articulatory parameters can resolve the problem of non-uniqueness. This paper tries to find out whether all instances of non-uniqueness can be resolved using continuity constraints. The investigation reveals that applying continuity constraints provides the best estimate in roughly 50 to 53% of the non-unique mappings. Roughly 8 to 13% of the non-unique mappings are best estimated by choosing discontinuous paths along hypothetical high-probability estimates of the articulatory trajectories.
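
    A minimal sketch illustrating this kind of evaluation on synthetic data: for frames with several candidate articulations, the candidate selected by a simple continuity constraint is compared with the candidate closest to the ground truth. All data and parameters are invented for illustration.

    # Hypothetical illustration: how often does a continuity constraint pick the best candidate?
    import numpy as np

    rng = np.random.default_rng(2)
    T, K = 200, 3
    truth = np.cumsum(rng.normal(0, 0.2, T))                  # smooth "true" trajectory
    candidates = truth[:, None] + rng.normal(0, 1.0, (T, K))  # non-unique estimates per frame

    chosen = np.zeros(T, dtype=int)
    chosen[0] = np.argmin(np.abs(candidates[0] - truth[0]))
    for t in range(1, T):
        prev = candidates[t - 1, chosen[t - 1]]
        chosen[t] = np.argmin(np.abs(candidates[t] - prev))   # continuity constraint

    best = np.argmin(np.abs(candidates - truth[:, None]), axis=1)
    print(f"continuity picks the best candidate in {100 * np.mean(chosen == best):.1f}% of frames")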

  • 48.
    Ananthakrishnan, Gopal
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH.
    Engwall, Olov
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH.
    Neiberg, Daniel
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH.
    Exploring the Predictability of Non-Unique Acoustic-to-Articulatory Mappings2012Ingår i: IEEE Transactions on Audio, Speech, and Language Processing, ISSN 1558-7916, E-ISSN 1558-7924, Vol. 20, nr 10, s. 2672-2682Artikel i tidskrift (Refereegranskat)
    Abstract [en]

    This paper explores statistical tools that help analyze the predictability in the acoustic-to-articulatory inversion of speech, using an Electromagnetic Articulography database of simultaneously recorded acoustic and articulatory data. Since it has been shown that speech acoustics can be mapped to non-unique articulatory modes, the variance of the articulatory parameters is not sufficient to understand the predictability of the inverse mapping. We, therefore, estimate an upper bound to the conditional entropy of the articulatory distribution. This provides a probabilistic estimate of the range of articulatory values (either over a continuum or over discrete non-unique regions) for a given acoustic vector in the database. The analysis is performed for different British/Scottish English consonants with respect to which articulators (lips, jaws or the tongue) are important for producing the phoneme. The paper shows that acoustic-articulatory mappings for the important articulators have a low upper bound on the entropy, but can still have discrete non-unique configurations.
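
    A minimal sketch of the kind of quantity involved, estimating a discretised conditional entropy of an articulatory parameter given an acoustic feature by histogram binning. This is a simplification of the paper's upper-bound estimate; the function name, bin counts, and toy data are assumptions.

    # Hypothetical sketch: discretised conditional entropy H(articulation | acoustic bin).
    import numpy as np

    def conditional_entropy(acoustic, articulatory, n_bins=20):
        a_bins = np.digitize(acoustic, np.histogram_bin_edges(acoustic, bins=n_bins))
        x_edges = np.histogram_bin_edges(articulatory, bins=n_bins)
        h, total = 0.0, len(acoustic)
        for b in np.unique(a_bins):
            x = articulatory[a_bins == b]
            p_bin = len(x) / total
            counts, _ = np.histogram(x, bins=x_edges)
            probs = counts[counts > 0] / len(x)
            h += p_bin * (-np.sum(probs * np.log2(probs)))
        return h   # in bits; lower means the articulation is more predictable from acoustics

    # Toy usage: an articulator strongly vs. weakly determined by an acoustic cue.
    rng = np.random.default_rng(3)
    acoustic = rng.uniform(0, 1, 5000)
    tight = acoustic + 0.05 * rng.standard_normal(5000)
    loose = rng.uniform(0, 1, 5000)
    print(conditional_entropy(acoustic, tight), conditional_entropy(acoustic, loose))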

  • 49.
    Ananthakrishnan, Gopal
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Neiberg, Daniel
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Cross-modal Clustering in the Acoustic-Articulatory Space2009Ingår i: Proceedings Fonetik 2009: The XXIIth Swedish Phonetics Conference / [ed] Peter Branderud, Hartmut Traunmüller, Stockholm: Stockholm University, 2009, s. 202-207Konferensbidrag (Övrigt vetenskapligt)
    Abstract [en]

    This paper explores cross-modal clustering in the acoustic-articulatory space. A method to improve clustering using information from more than one modality is presented. Formants and the Electromagnetic Articulography measurements are used to study corresponding clusters formed in the two modalities. A measure for estimating the uncertainty in correspondences between one cluster in the acoustic space and several clusters in the articulatory space is suggested.
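
    A minimal sketch of a correspondence-uncertainty measure of this kind: each modality is clustered separately (here with scikit-learn's KMeans on synthetic data), and each acoustic cluster is scored by the entropy of its members' articulatory cluster labels. The paper's actual measure may be defined differently; all data and cluster counts are illustrative.

    # Hypothetical sketch: entropy of articulatory cluster labels within each acoustic cluster.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(4)
    formants = rng.normal(size=(1000, 2))                     # stand-in acoustic features
    articulation = formants @ rng.normal(size=(2, 3)) + 0.3 * rng.normal(size=(1000, 3))

    ac_labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(formants)
    ar_labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(articulation)

    for c in range(8):
        counts = np.bincount(ar_labels[ac_labels == c], minlength=8)
        p = counts[counts > 0] / counts.sum()
        print(f"acoustic cluster {c}: correspondence entropy = {-np.sum(p * np.log2(p)):.2f} bits")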

  • 50.
    Ananthakrishnan, Gopal
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation. KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Centrum för Talteknologi, CTT.
    Salvi, Giampiero
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation. KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Centrum för Talteknologi, CTT.
    Using Imitation to learn Infant-Adult Acoustic Mappings2011Ingår i: 12th Annual Conference Of The International Speech Communication Association 2011 (INTERSPEECH 2011), Vols 1-5, ISCA , 2011, s. 772-775Konferensbidrag (Refereegranskat)
    Abstract [en]

    This paper discusses a model which conceptually demonstrates how infants could learn the normalization between infant and adult acoustics. The model proposes that the mapping can be inferred from the topological correspondences between the adult and infant acoustic spaces, which are clustered separately in an unsupervised manner. The model requires feedback from the adult in order to select the right topology for clustering, which is a crucial aspect of the model. The feedback is in terms of an overall rating of the imitation effort by the infant, rather than a frame-by-frame correspondence. Using synthetic but continuous speech data, we demonstrate that clusters which have a good topological correspondence are perceived as similar by a phonetically trained listener.
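
    A minimal sketch of the feedback idea on synthetic cluster centres: candidate infant-to-adult cluster mappings are scored only by an overall imitation rating, and the best-rated mapping is kept. The rating function and data are invented for illustration and do not reproduce the paper's model.

    # Hypothetical sketch: select a cluster mapping from an overall rating, not frame-by-frame feedback.
    import numpy as np
    from itertools import permutations

    rng = np.random.default_rng(5)
    n_clusters = 4
    adult_centres = rng.normal(size=(n_clusters, 2))
    # Infant centres are a scaled, noisy copy of the adult ones, in a shuffled order.
    true_perm = rng.permutation(n_clusters)
    infant_centres = 0.5 * adult_centres[true_perm] + rng.normal(0, 0.05, (n_clusters, 2))

    def overall_rating(mapping):
        # Stand-in for the adult's holistic judgement of the infant's imitations.
        return -np.sum((0.5 * adult_centres[list(mapping)] - infant_centres) ** 2)

    best = max(permutations(range(n_clusters)), key=overall_rating)
    print("recovered mapping:", best, "true mapping:", tuple(true_perm))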
