1 - 30 of 30
  • 1. Ambrazaitis, G.
    et al.
    House, David
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Multimodal prominences: Exploring the patterning and usage of focal pitch accents, head beats and eyebrow beats in Swedish television news readings. 2017. In: Speech Communication, ISSN 0167-6393, E-ISSN 1872-7182, Vol. 95, pp. 100-113. Article in journal (Refereed)
    Abstract [en]

    Facial beat gestures align with pitch accents in speech, functioning as visual prominence markers. However, it is not yet well understood whether and how gestures and pitch accents might be combined to create different types of multimodal prominence, and how specifically visual prominence cues are used in spoken communication. In this study, we explore the use and possible interaction of eyebrow (EB) and head (HB) beats with so-called focal pitch accents (FA) in a corpus of 31 brief news readings from Swedish television (four news anchors, 986 words in total), focusing on effects of position in text, information structure as well as speaker expressivity. Results reveal an inventory of four primary (combinations of) prominence markers in the corpus: FA+HB+EB, FA+HB, FA only (i.e., no gesture), and HB only, implying that eyebrow beats tend to occur only in combination with the other two markers. In addition, head beats occur significantly more frequently in the second than in the first part of a news reading. A functional analysis of the data suggests that the distribution of head beats might to some degree be governed by information structure, as the text-initial clause often defines a common ground or presents the theme of the news story. In the rheme part of the news story, FA, HB, and FA+HB are all common prominence markers. The choice between them is subject to variation which we suggest might represent a degree of freedom for the speaker to use the markers expressively. A second main observation concerns eyebrow beats, which seem to be used mainly as a kind of intensification marker for highlighting not only contrast, but also value, magnitude, or emotionally loaded words; it is applicable in any position in a text. We thus observe largely different patterns of occurrence and usage of head beats on the one hand and eyebrow beats on the other, suggesting that the two represent two separate modalities of visual prominence cuing.

  • 2.
    Ananthakrishnan, Gopal
    et al.
    KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT. KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT. KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication.
    Mapping between acoustic and articulatory gestures. 2011. In: Speech Communication, ISSN 0167-6393, E-ISSN 1872-7182, Vol. 53, no. 4, pp. 567-589. Article in journal (Refereed)
    Abstract [en]

    This paper proposes a definition for articulatory as well as acoustic gestures along with a method to segment the measured articulatory trajectories and acoustic waveforms into gestures. Using a simultaneously recorded acoustic-articulatory database, the gestures are detected based on finding critical points in the utterance, both in the acoustic and articulatory representations. The acoustic gestures are parameterized using 2-D cepstral coefficients. The articulatory trajectories are essentially the horizontal and vertical movements of Electromagnetic Articulography (EMA) coils placed on the tongue, jaw and lips along the midsagittal plane. The articulatory movements are parameterized using 2D-DCT using the same transformation that is applied on the acoustics. The relationship between the detected acoustic and articulatory gestures in terms of the timing as well as the shape is studied. In order to study this relationship further, acoustic-to-articulatory inversion is performed using GMM-based regression. The accuracy of predicting the articulatory trajectories from the acoustic waveforms is on par with state-of-the-art frame-based methods with dynamical constraints (with an average error of 1.45-1.55 mm for the two speakers in the database). In order to evaluate the acoustic-to-articulatory inversion in a more intuitive manner, a method based on the error in estimated critical points is suggested. Using this method, it was noted that the estimated articulatory trajectories using the acoustic-to-articulatory inversion methods were still not accurate enough to be within the perceptual tolerance of audio-visual asynchrony.

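Entry 2 above performs acoustic-to-articulatory inversion with GMM-based regression. The following is a minimal, hedged sketch of that general technique using scikit-learn; the feature dimensions, component count and the random demo data are illustrative assumptions, not values from the paper.

```python
# Sketch of GMM-based regression for acoustic-to-articulatory inversion.
# A joint GMM is fitted on concatenated (acoustic, articulatory) vectors and
# articulatory features are predicted as the conditional expectation E[y|x].
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_joint_gmm(acoustic, articulatory, n_components=4, seed=0):
    """Fit a full-covariance GMM on joint (acoustic, articulatory) vectors."""
    joint = np.hstack([acoustic, articulatory])
    return GaussianMixture(n_components=n_components, covariance_type="full",
                           random_state=seed).fit(joint)

def invert(gmm, acoustic, dim_x):
    """Predict articulatory features for each acoustic frame as E[y | x]."""
    mu_x, mu_y = gmm.means_[:, :dim_x], gmm.means_[:, dim_x:]
    s_xx = gmm.covariances_[:, :dim_x, :dim_x]
    s_yx = gmm.covariances_[:, dim_x:, :dim_x]
    K = len(gmm.weights_)
    preds = []
    for x in acoustic:
        diff = x - mu_x                      # (K, dim_x)
        # log responsibilities of each component given the acoustic frame
        log_resp = np.array([np.log(gmm.weights_[k])
                             - 0.5 * diff[k] @ np.linalg.solve(s_xx[k], diff[k])
                             - 0.5 * np.log(np.linalg.det(s_xx[k]))
                             for k in range(K)])
        resp = np.exp(log_resp - log_resp.max())
        resp /= resp.sum()
        # component-wise conditional means E[y | x, k], weighted by responsibility
        cond = np.array([mu_y[k] + s_yx[k] @ np.linalg.solve(s_xx[k], diff[k])
                         for k in range(K)])
        preds.append(resp @ cond)
    return np.array(preds)

rng = np.random.default_rng(0)
ac, ar = rng.normal(size=(500, 12)), rng.normal(size=(500, 4))  # toy features
gmm = fit_joint_gmm(ac, ar)
print(invert(gmm, ac[:3], dim_x=12).shape)   # (3, 4)
```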
  • 3. Beaugendre, F.
    et al.
    House, David
    KTH, Former Departments, Speech Transmission and Music Acoustics.
    Hermes, D. J.
    Accentuation boundaries in Dutch, French and Swedish. 2001. In: Speech Communication, ISSN 0167-6393, E-ISSN 1872-7182, Vol. 33, no. 4, pp. 305-318. Article in journal (Refereed)
    Abstract [en]

    This paper presents a comparative study investigating the relation between the timing of a rising or falling pitch movement and the temporal structure of the syllable it accentuates for three languages: Dutch, French and Swedish. In a perception experiment, the five-syllable utterances /mamamamama/ and /?a?a?a?a?a/ were provided with a relatively fast rising or falling pitch movement. The timing of the movement was systematically varied so that it accented the third or the fourth syllable. Subjects were asked to indicate which syllable they perceived as accented. The accentuation boundary (AB) between the third and the fourth syllable was then defined as the moment before which more than half of the subjects indicated the third syllable as accented and after which more than half of the subjects indicated the fourth syllable. The results show that there are significant differences between the three languages as to the location of the AB. In general, for the rises, well-defined ABs were found. They were located in the middle of the vowel of the third syllable for French subjects, and later in that vowel for Dutch and Swedish subjects. For the falls, a clear AB was obtained only for the Dutch and the Swedish listeners. This was located at the end of the third syllable. For the French listeners, the fall did not yield a clear AB. This corroborates the absence of accentuation by means of falls in French. By varying the duration of the pitch movement it could be shown that, in all cases in which a clear AB was found, the cue for accentuation was located at the beginning of the pitch movement.

  • 4. Bimbot, F.
    et al.
    Blomberg, Mats
    KTH, Former Departments, Speech, Music and Hearing.
    Boves, L.
    Genoud, D.
    Hutter, H. P.
    Jaboulet, C.
    Koolwaaij, J.
    Lindberg, J.
    Pierrot, J. B.
    An overview of the CAVE project research activities in speaker verification. 2000. In: Speech Communication, ISSN 0167-6393, E-ISSN 1872-7182, Vol. 31, no. 2-3, pp. 155-180. Article in journal (Refereed)
    Abstract [en]

    This article presents an overview of the research activities carried out in the European CAVE project, which focused on text-dependent speaker verification on the telephone network using whole word Hidden Markov Models. It documents in detail various aspects of the technology and the methodology used within the project. In particular, it addresses the issue of model estimation in the context of limited enrollment data and the problem of a posteriori decision threshold setting. Experiments are carried out on the realistic telephone speech database SESP. State-of-the-art performance levels are obtained, which validates the technical approaches developed and assessed during the project as well as the working infrastructure which facilitated cooperation between the partners.

  • 5. Bimbot, F
    et al.
    Blomberg, Mats
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Boves, L
    Genoud, D
    Hutter, H-P
    Jaboulet, C
    Koolwaaij, J
    Lindberg, J
    Pierrot, J-B
    An overview of the CAVE project research activities in speaker verification. 2000. In: Speech Communication, ISSN 0167-6393, E-ISSN 1872-7182, Vol. 31, no. 2-3, pp. 155-180. Article in journal (Refereed)
    Abstract [en]

    This article presents an overview of the research activities carried out in the European CAVE project, which focused on text-dependent speaker verification on the telephone network using whole word Hidden Markov Models. It documents in detail various aspects of the technology and the methodology used within the project. In particular, it addresses the issue of model estimation in the context of limited enrollment data and the problem of a posteriori decision threshold setting. Experiments are carried out on the realistic telephone speech database SESP. State-of-the-art performance levels are obtained, which validates the technical approaches developed and assessed during the project as well as the working infrastructure which facilitated cooperation between the partners.

  • 6.
    Blomberg, Mats
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication.
    Elenius, Kjell
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication.
    House, David
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication.
    Karlsson, Inger A.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication.
    Research Challenges in Speech Technology: A Special Issue in Honour of Rolf Carlson and Bjorn Granstrom. 2009. In: Speech Communication, ISSN 0167-6393, E-ISSN 1872-7182, Vol. 51, no. 7, pp. 563-563. Article in journal (Refereed)
  • 7. Botinis, A.
    et al.
    Granström, Björn
    KTH, Former Departments, Speech Transmission and Music Acoustics.
    Mobius, B.
    Developments and paradigms in intonation research. 2001. In: Speech Communication, ISSN 0167-6393, E-ISSN 1872-7182, Vol. 33, no. 4, pp. 263-296. Article, research review (Refereed)
    Abstract [en]

    The present tutorial paper is addressed to a wide audience with different discipline backgrounds as well as variable expertise on intonation. The paper is structured into five sections. In Section 1, Introduction, basic concepts of intonation and prosody are summarised and cornerstones of intonation research are highlighted. In Section 2, Functions and forms of intonation, a wide range of functions from morpholexical and phrase levels to discourse and dialogue levels are discussed and forms of intonation with examples from different languages are presented. In Section 3, Modelling and labelling of intonation, established models of intonation as well as labelling systems are presented. In Section 4, Applications of intonation, the most widespread applications of intonation, especially technological ones, are presented and methodological issues are discussed. In Section 5, Research perspective, research avenues and ultimate goals as well as the significance and benefits of intonation research in the upcoming years are outlined.

  • 8. Boye, J.
    et al.
    Gustafson, Joakim
    Wiren, M.
    Robust spoken language understanding in a computer game. 2006. In: Speech Communication, ISSN 0167-6393, E-ISSN 1872-7182, Vol. 48, no. 3-4, pp. 335-353. Article in journal (Refereed)
    Abstract [en]

    We present and evaluate a robust method for the interpretation of spoken input to a conversational computer game. The scenario of the game is that of a player interacting with embodied fairy-tale characters in a 3D world via spoken dialogue (supplemented by graphical pointing actions) to solve various problems. The player himself cannot directly perform actions in the world, but interacts with the fairy-tale characters to have them perform various tasks, and to get information about the world and the problems to solve. Hence the role of spoken dialogue as the primary means of control is obvious and natural to the player. Naturally, this means that robust spoken language understanding becomes a critical component. To this end, the paper describes a semantic representation formalism and an accompanying parsing algorithm which works off the output of the speech recogniser's statistical language model. The evaluation shows that the parser is robust in the sense of considerably improving on the noisy output of the speech recogniser.

  • 9.
    Carlson, Rolf
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication.
    Granström, Björn
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication.
    Data-driven multimodal synthesis. 2005. In: Speech Communication, ISSN 0167-6393, E-ISSN 1872-7182, Vol. 47, no. 1-2, pp. 182-193. Article in journal (Refereed)
    Abstract [en]

    This paper is a report on current efforts at the Department of Speech, Music and Hearing, KTH, on data-driven multimodal synthesis including both visual speech synthesis and acoustic modeling. In this research we try to combine corpus-based methods with knowledge-based models and to explore the best of the two approaches. In the paper an attempt to build formant-synthesis systems based on both rule-generated and database-driven methods is presented. A pilot experiment is also reported showing that this approach can be a very interesting path to explore further. Two studies on visual speech synthesis are reported, one on data acquisition using a combination of motion capture techniques and one concerned with coarticulation, comparing different models.

  • 10.
    Carlson, Rolf
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication.
    Hirschberg, J.
    Swerts, M.
    Cues to upcoming Swedish prosodic boundaries: Subjective judgment studies and acoustic correlates. 2005. In: Speech Communication, ISSN 0167-6393, E-ISSN 1872-7182, Vol. 46, no. 3-4, pp. 326-333. Article in journal (Refereed)
    Abstract [en]

    Studies of perceptually based predictions of upcoming prosodic boundaries in spontaneous Swedish speech, both by native speakers of Swedish and by native speakers of standard American English, reveal marked similarity in judgments. We examined whether Swedish and American listeners were able to predict the occurrence and strength of upcoming boundaries in a series of web-based perception experiments. Utterance fragments (in both long and short versions) were selected from a corpus of spontaneous Swedish speech, which was first labeled for boundary presence and strength by expert labelers. These fragments were then presented to listeners, who were instructed to guess whether or not they were followed by a prosodic break, and if so, what the strength of the break was. Results revealed that both Swedish and American listening groups were indeed able to predict whether or not a boundary (of a particular strength) followed the fragment. This suggests that acoustic and prosodic, rather than lexico-grammatical and semantic information was being used by listeners as a primary cue. Acoustic and prosodic correlates of these judgments were then examined, with significant correlations found between judgments and the presence/absence of final creak and phrase-final f0 level and slope.

  • 11.
    Edlund, Jens
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Heldner, Mattias
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Hjalmarsson, Anna
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Towards human-like spoken dialogue systems. 2008. In: Speech Communication, ISSN 0167-6393, E-ISSN 1872-7182, Vol. 50, no. 8-9, pp. 630-645. Article in journal (Refereed)
    Abstract [en]

    This paper presents an overview of methods that can be used to collect and analyse data on user responses to spoken dialogue system components intended to increase human-likeness, and to evaluate how well the components succeed in reaching that goal. Wizard-of-Oz variations, human-human data manipulation, and micro-domains are discussed in this context, as is the use of third-party reviewers to get a measure of the degree of human-likeness. We also present the two-way mimicry target, a model for measuring how well a human-computer dialogue mimics or replicates some aspect of human-human dialogue, including human flaws and inconsistencies. Although we have added a measure of innovation, none of the techniques is new in its entirety. Taken together and described from a human-likeness perspective, however, they form a set of tools that may widen the path towards human-like spoken dialogue systems.

  • 12.
    Engwall, Olov
    KTH, Former Departments (before 2005), Speech, Music and Hearing.
    Combining MRI, EMA and EPG measurements in a three-dimensional tongue model. 2003. In: Speech Communication, ISSN 0167-6393, E-ISSN 1872-7182, Vol. 41, no. 2-3, pp. 303-329. Article in journal (Refereed)
    Abstract [en]

    A three-dimensional (3D) tongue model has been developed using MR images of a reference subject producing 44 artificially sustained Swedish articulations. Based on the difference in tongue shape between the articulations and a reference, the six linear parameters jaw height, tongue body, tongue dorsum, tongue tip, tongue advance and tongue width were determined using an ordered linear factor analysis controlled by articulatory measures. The first five factors explained 88% of the tongue data variance in the midsagittal plane and 78% in the 3D analysis. The six-parameter model is able to reconstruct the modelled articulations with an overall mean reconstruction error of 0.13 cm, and it specifically handles lateral differences and asymmetries in tongue shape. In order to correct articulations that were hyperarticulated due to the artificial sustaining in the magnetic resonance imaging (MRI) acquisition, the parameter values in the tongue model were readjusted based on a comparison of virtual and natural linguopalatal contact patterns, collected with electropalatography (EPG). Electromagnetic articulography (EMA) data was collected to control the kinematics of the tongue model for vowel-fricative sequences and an algorithm to handle surface contacts has been implemented, preventing the tongue from protruding through the palate and teeth.

  • 13.
    Fant, Gunnar
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    A personal note from Gunnar Fant. 2009. In: Speech Communication, ISSN 0167-6393, E-ISSN 1872-7182, Vol. 51, no. 7, pp. 564-568. Article in journal (Other academic)
  • 14.
    Granström, Björn
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication.
    House, David
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication.
    Audiovisual representation of prosody in expressive speech communication. 2005. In: Speech Communication, ISSN 0167-6393, E-ISSN 1872-7182, Vol. 46, no. 3-4, pp. 473-484. Article in journal (Refereed)
    Abstract [en]

    Prosody in a single speaking style, often read speech, has been studied extensively in acoustic speech. During the past few years we have expanded our interest in two directions: (1) Prosody in expressive speech communication and (2) prosody as an audiovisual expression. Understanding the interactions between visual expressions (primarily in the face) and the acoustics of the corresponding speech presents a substantial challenge. Some of the visual articulation is for obvious reasons tightly connected to the acoustics (e.g. lip and jaw movements), but there are other articulatory movements that do not show up on the outside of the face. Furthermore, many facial gestures used for communicative purposes do not affect the acoustics directly, but might nevertheless be connected on a higher communicative level in which the timing of the gestures could play an important role. In this presentation we will give some examples of recent work, primarily at KTH, addressing these questions. We will report on methods for the acquisition and modeling of visual and acoustic data, and some evaluation experiments in which audiovisual prosody is tested. The context of much of our work in this area is to create an animated talking agent capable of displaying realistic communicative behavior and suitable for use in conversational spoken language systems, e.g. a virtual language teacher.

  • 15.
    Hjalmarsson, Anna
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication.
    The additive effect of turn-taking cues in human and synthetic voice. 2011. In: Speech Communication, ISSN 0167-6393, E-ISSN 1872-7182, Vol. 53, no. 1, pp. 23-35. Article in journal (Refereed)
    Abstract [en]

    A previous line of research suggests that interlocutors identify appropriate places to speak by cues in the behaviour of the preceding speaker. If used in combination, these cues have an additive effect on listeners' turn-taking attempts. The present study further explores these findings by examining the effect of such turn-taking cues experimentally. The objective is to investigate the possibilities of generating turn-taking cues with a synthetic voice. Thus, in addition to stimuli realized with a human voice, the experiment included dialogues where one of the speakers is replaced with a synthesis. The turn-taking cues investigated include intonation, phrase-final lengthening, semantic completeness, stereotyped lexical expressions and non-lexical speech production phenomena such as lexical repetitions, breathing and lip-smacks. The results show that the turn-taking cues realized with a synthetic voice affect the judgements similarly to the corresponding human version and there is no difference in reaction times between these two conditions. Furthermore, the results support Duncan's findings: the more turn-taking cues with the same pragmatic function, turn-yielding or turn-holding, the higher the agreement among subjects on the expected outcome. In addition, the number of turn-taking cues affects the reaction times for these decisions. Thus, the more cues, the faster the reaction time.

  • 16.
    House, David
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication.
    Phrase-final rises as a prosodic feature in wh-questions in Swedish human-machine dialogue. 2005. In: Speech Communication, ISSN 0167-6393, E-ISSN 1872-7182, Vol. 46, no. 3-4, pp. 268-283. Article in journal (Refereed)
    Abstract [en]

    This paper examines the extent to which optional final rises occur in a set of 200 wh-questions extracted from a large corpus of computer-directed spontaneous speech in Swedish and discusses the function these rises may have in signalling dialogue acts and speaker attitude over and beyond an information question. Final rises occurred in 22% of the utterances, primarily in conjunction with final focal accent. Children exhibited the largest percentage of final rises (32%), with women second (27%) and men lowest (17%). The distribution of the rises in the material is examined and evidence relating to the final rise as a signal of a social interaction oriented dialogue act is gathered from the distribution. Two separate perception tests were carried out to test the hypothesis that high and late focal accent peaks in a wh-question are perceived as friendlier and more socially interested than low and early peaks. Generally, the results were consistent with these hypotheses when the late peaks were in phrase-final position. Finally, the results of this study are discussed in terms of pragmatic and attitudinal meanings and biological codes.

  • 17.
    Jande, Per-Anders
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Spoken language annotation and data-driven modelling of phone-level pronunciation in discourse context. 2008. In: Speech Communication, ISSN 0167-6393, E-ISSN 1872-7182, Vol. 50, no. 2, pp. 126-141. Article in journal (Refereed)
    Abstract [en]

    A detailed description of the discourse context of a word can be used for predicting word pronunciation in discourse context and also enables studies of the interplay between various types of information on e.g. phone-level pronunciation. The work presented in this paper is aimed at modelling systematic variation in the phone-level realisation of words inherent to a language variety. A data-driven approach based on access to detailed discourse context descriptions is used. The discourse context descriptions are constructed through annotation of spoken language with a large variety of linguistic and related variables in multiple layers. Decision tree pronunciation models are induced from the annotation. The effects of using different types and different amounts of information for model induction are explored. Models generated in a tenfold cross-validation experiment produce on average 8.2% errors on the phone level when they are trained on all available information. Models trained on phoneme level information only have an average phone error rate of 14.2%. This means that including information above the phoneme level in the context description can improve model performance by 42.2%.

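Entry 17 above induces decision-tree pronunciation models from multi-layer annotation and evaluates them with tenfold cross-validation. The snippet below is a hedged, toy-scale sketch of that modelling setup; the features, labels and synthetic data are invented for illustration, and the paper's annotation layers are far richer.

```python
# Toy decision-tree pronunciation model: predict the realised phone-level form
# of a canonical phoneme from a few context features, evaluated with tenfold
# cross-validation as in the paper.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 2000
# Invented context features: [phoneme id, stressed?, word-frequency class, phrase-final?]
X = np.column_stack([rng.integers(0, 40, n), rng.integers(0, 2, n),
                     rng.integers(0, 5, n), rng.integers(0, 2, n)])
# Invented realisation labels for each phoneme instance
y = rng.choice(["canonical", "reduced", "deleted"], size=n, p=[0.7, 0.2, 0.1])

model = DecisionTreeClassifier(min_samples_leaf=20)
accuracy = cross_val_score(model, X, y, cv=10)
print("mean phone error rate:", 1 - accuracy.mean())
```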
  • 18.
    Karlsson, Inger A.
    et al.
    KTH, Former Departments, Speech, Music and Hearing.
    Banziger, T.
    Dankovicova, J.
    Johnstone, T.
    Lindberg, J.
    Melin, H.
    Nolan, F.
    Scherer, K.
    Speaker verification with elicited speaking styles in the VeriVox project. 2000. In: Speech Communication, ISSN 0167-6393, E-ISSN 1872-7182, Vol. 31, no. 2-3, pp. 121-129. Article in journal (Refereed)
    Abstract [en]

    Some experiments have been carried out to study and compensate for within-speaker variations in speaker verification. To induce speaker variation, a speaking behaviour elicitation software package has been developed. A 50-speaker database with voluntary and involuntary speech variation has been recorded using this software. The database has been used for acoustic analysis as well as for automatic speaker verification (ASV) tests. The voluntary speech variations are used to form an enrolment set for the ASV system. This set is called structured training and is compared to neutral training where only normal speech is used. Both sets contain the same number of utterances. It is found that the ASV system improves its performance when testing on a mixed speaking style test without decreasing the performance of the tests with normal speech.

  • 19.
    Kjellström, Hedvig
    et al.
    KTH, School of Computer Science and Communication (CSC), Computer Vision and Robotics, CVAP.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication.
    Audiovisual-to-articulatory inversion. 2009. In: Speech Communication, ISSN 0167-6393, E-ISSN 1872-7182, Vol. 51, no. 3, pp. 195-209. Article in journal (Refereed)
    Abstract [en]

    It has been shown that acoustic-to-articulatory inversion, i.e. estimation of the articulatory configuration from the corresponding acoustic signal, can be greatly improved by adding visual features extracted from the speaker's face. In order to make the inversion method usable in a realistic application, these features should be possible to obtain from a monocular frontal face video, where the speaker is not required to wear any special markers. In this study, we investigate the importance of visual cues for inversion. Experiments with motion capture data of the face show that important articulatory information can be extracted using only a few face measures that mimic the information that could be gained from a video-based method. We also show that the depth cue for these measures is not critical, which means that the relevant information can be extracted from a frontal video. A real video-based face feature extraction method is further presented, leading to similar improvements in inversion quality. Rather than tracking points on the face, it represents the appearance of the mouth area using independent component images. These findings are important for applications that need a simple audiovisual-to-articulatory inversion technique, e.g. articulatory phonetics training for second language learners or hearing-impaired persons.

  • 20.
    Koniaris, Christos
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    On mispronunciation analysis of individual foreign speakers using auditory periphery models. 2013. In: Speech Communication, ISSN 0167-6393, E-ISSN 1872-7182, Vol. 55, no. 5, pp. 691-706. Article in journal (Refereed)
    Abstract [en]

    In second language (L2) learning, a major difficulty is to discriminate between the acoustic diversity within an L2 phoneme category and that between different categories. We propose a general method for automatic diagnostic assessment of the pronunciation of nonnative speakers based on models of the human auditory periphery. Considering each phoneme class separately, the geometric shape similarity between the native auditory domain and the non-native speech domain is measured. The phonemes that deviate the most from the native pronunciation for a set of L2 speakers are detected by comparing the geometric shape similarity measure with that calculated for native speakers on the same phonemes. To evaluate the system, we have tested it with different non-native speaker groups from various language backgrounds. The experimental results are in accordance with linguistic findings and human listeners' ratings, particularly when both the spectral and temporal cues of the speech signal are utilized in the pronunciation analysis.

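Entry 20 above measures, per phoneme class, how far non-native realisations deviate from native ones in an auditory feature space. As a hedged stand-in for the paper's geometric shape similarity measure, the sketch below compares two sets of feature vectors with a Bhattacharyya distance between fitted Gaussians; the features and data are invented for illustration.

```python
# Distributional distance between native and non-native realisations of one
# phoneme class; values clearly above a native-vs-native baseline would flag
# candidate mispronunciations.
import numpy as np

def bhattacharyya_gaussian(x, y):
    """Bhattacharyya distance between two samples, each modelled as a Gaussian."""
    mu1, mu2 = x.mean(axis=0), y.mean(axis=0)
    s1, s2 = np.cov(x, rowvar=False), np.cov(y, rowvar=False)
    s = (s1 + s2) / 2
    diff = mu1 - mu2
    term1 = 0.125 * diff @ np.linalg.solve(s, diff)
    term2 = 0.5 * np.log(np.linalg.det(s) /
                         np.sqrt(np.linalg.det(s1) * np.linalg.det(s2)))
    return term1 + term2

rng = np.random.default_rng(1)
native = rng.normal(0.0, 1.0, size=(200, 13))      # toy auditory features
nonnative = rng.normal(0.5, 1.2, size=(200, 13))
print(bhattacharyya_gaussian(native, nonnative))
```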
  • 21.
    Neiberg, Daniel
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication.
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication.
    Semi-supervised methods for exploring the acoustics of simple productive feedback. 2013. In: Speech Communication, ISSN 0167-6393, E-ISSN 1872-7182, Vol. 55, no. 3, pp. 451-469. Article in journal (Refereed)
    Abstract [en]

    This paper proposes methods for exploring acoustic correlates to feedback functions. A sub-language of Swedish, simple productive feedback, is introduced to facilitate investigations of the functional contributions of base tokens, phonological operations and prosody. The function of feedback is to convey the listeners' attention, understanding and affective states. In order to handle the large number of possible affective states, the current study starts by performing a listening experiment where humans annotated the functional similarity of feedback tokens with different prosodic realizations. By selecting a set of stimuli that had different prosodic distances from a reference token, it was possible to compute a generalised functional distance measure. The resulting generalised functional distance measure was shown to be correlated with prosodic distance, but the correlations varied as a function of base tokens and phonological operations. In a subsequent listening test, a small representative sample of feedback tokens was rated for understanding, agreement, interest, surprise and certainty. These ratings were found to explain a significant proportion of the generalised functional distance. By combining the acoustic analysis with an explorative visualisation of the prosody, we have established a map between human perception of similarity between feedback tokens, their measured distance in acoustic space, and the link to the perception of the function of feedback tokens with varying realisations.

  • 22.
    Nordstrand, Magnus
    et al.
    KTH, Former Departments, Speech, Music and Hearing.
    Svanfeldt, Gunilla
    KTH, Former Departments, Speech, Music and Hearing.
    Granström, Björn
    KTH, Former Departments, Speech, Music and Hearing.
    House, David
    KTH, Former Departments, Speech, Music and Hearing.
    Measurements of articulatory variation in expressive speech for a set of Swedish vowels. 2004. In: Speech Communication, ISSN 0167-6393, E-ISSN 1872-7182, Vol. 44, no. 1-4, pp. 187-196. Article in journal (Refereed)
    Abstract [en]

    Facial gestures are used to convey e.g. emotions, dialogue states and conversational signals, which support us in the interpretation of other people's feelings and intentions. Synthesising this behaviour with an animated talking head would widen the possibilities of this intuitive interface. The dynamic characteristics of these facial gestures during speech affect articulation. Previously, articulation for neutral speech has been studied and implemented in animation rules. The results obtained in this study show how some articulatory parameters are affected by the influence of expressiveness in speech for a selection of Swedish vowels. Our focus has primarily been on attitudes and emotions conveying information that is intended to make an animated agent more "human-like". A multimodal corpus of acted expressive speech has been collected for this purpose.

  • 23.
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Dynamic behaviour of connectionist speech recognition with strong latency constraints. 2006. In: Speech Communication, ISSN 0167-6393, E-ISSN 1872-7182, Vol. 48, no. 7, pp. 802-818. Article in journal (Refereed)
    Abstract [en]

    This paper describes the use of connectionist techniques in phonetic speech recognition with strong latency constraints. The constraints are imposed by the task of deriving the lip movements of a synthetic face in real time from the speech signal, by feeding the phonetic string into an articulatory synthesiser. Particular attention has been paid to analysing the interaction between the time evolution model learnt by the multi-layer perceptrons and the transition model imposed by the Viterbi decoder, in different latency conditions. Two experiments were conducted in which the time dependencies in the language model (LM) were controlled by a parameter. The results show a strong interaction between the three factors involved, namely the neural network topology, the length of time dependencies in the LM and the decoder latency.

    Download full text (pdf)
    dynamicbehaviour
  • 24.
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication.
    Segment boundary detection via class entropy measurements in connectionist phoneme recognition. 2006. In: Speech Communication, ISSN 0167-6393, E-ISSN 1872-7182, Vol. 48, no. 12, pp. 1666-1676. Article in journal (Refereed)
    Abstract [en]

    This article investigates the possibility to use the class entropy of the output of a connectionist phoneme recogniser to predict time boundaries between phonetic classes. The rationale is that the value of the entropy should increase in proximity of a transition between two segments that are well modelled (known) by the recognition network since it is a measure of uncertainty. The advantage of this measure is its simplicity as the posterior probabilities of each class are available in connectionist phoneme recognition. The entropy and a number of measures based on differentiation of the entropy are used in isolation and in combination. The decision methods for predicting the boundaries range from simple thresholds to a neural-network-based procedure. The different methods are compared with respect to their precision, measured in terms of the ratio between the number C of predicted boundaries within 10 or 20 ms of the reference and the total number of predicted boundaries, and recall, measured as the ratio between C and the total number of reference boundaries.

    Download full text (pdf)
    segmentboundarydetection
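Entry 24 above predicts segment boundaries from the class entropy of a phoneme recogniser's posteriors and scores them by precision and recall within a 10 or 20 ms tolerance. The sketch below is a minimal, hedged illustration of that idea; the peak-picking rule, threshold and tolerance values are assumptions, not the paper's exact procedure.

```python
# Entropy-based boundary detection on per-frame class posteriors.
import numpy as np

def frame_entropy(posteriors):
    """Shannon entropy of per-frame class posteriors, shape (T, n_classes)."""
    p = np.clip(posteriors, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=1)

def detect_boundaries(posteriors, threshold=1.0):
    """Predict a boundary where the entropy has a local peak above a threshold."""
    h = frame_entropy(posteriors)
    peaks = (h[1:-1] > h[:-2]) & (h[1:-1] > h[2:]) & (h[1:-1] > threshold)
    return np.where(peaks)[0] + 1

def precision_recall(predicted, reference, tol_frames=2):
    """Precision/recall with a +/- tol_frames tolerance (e.g. 20 ms at a 10 ms frame shift)."""
    hits = sum(any(abs(p - r) <= tol_frames for r in reference) for p in predicted)
    return hits / max(len(predicted), 1), hits / max(len(reference), 1)

rng = np.random.default_rng(0)
post = rng.dirichlet(np.ones(40) * 0.1, size=200)   # toy posteriors, 200 frames
print(detect_boundaries(post)[:10])
```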
  • 25.
    Seward, Alexander
    KTH, Former Departments, Speech, Music and Hearing.
    A fast HMM match algorithm for very large vocabulary speech recognition. 2004. In: Speech Communication, ISSN 0167-6393, E-ISSN 1872-7182, Vol. 42, no. 2, pp. 191-206. Article in journal (Refereed)
    Abstract [en]

    The search over context-dependent continuous density Hidden Markov Models (HMMs), including state-likelihood computations, accounts for a considerable part of the total decoding time for a speech recognizer. This is especially apparent in tasks that incorporate large vocabularies and long-dependency n-gram grammars, since these impose a high degree of context dependency and HMMs have to be treated differently in each context. This paper proposes a strategy for acoustic match of typical continuous density HMMs, decoupled from the main search and conducted as a separate component suited for parallelization. Instead of computing a large amount of probabilities for different alignments of each HMM, the proposed method computes all alignments, but more efficiently. Each HMM is matched only once against any time interval, and thus may be instantly looked up by the main search algorithm as required. In order to accomplish this in real time, a fast time-warping match algorithm is proposed, exploiting the specifics of the 3-state left-to-right HMM topology without skips. In proof-of-concept tests, using a highly optimized SIMD-parallel implementation, the algorithm was able to perform time-synchronous decoupled evaluation of a triphone acoustic model, with maximum phone duration of 40 frames, with a real-time factor of 0.83 on one of the CPUs of a Dual-Xeon 2 GHz workstation. The algorithm was able to compute the likelihood for 636,000 locally optimal HMM paths/second, with full state evaluation.

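Entry 25 above decouples the acoustic match of 3-state left-to-right HMMs without skips from the main search so that alignments over a time interval can be evaluated efficiently. The sketch below shows only the core dynamic-programming step for one such HMM over one interval; the random state likelihoods and uniform transition probabilities are placeholders, and the paper's fast time-warping algorithm and SIMD parallelisation are not reproduced.

```python
# Best-path log-likelihood of a 3-state left-to-right HMM (no skips) over T frames.
import numpy as np

def best_path_loglik(state_loglik, trans_loglik):
    """state_loglik: (T, 3) per-frame state log-likelihoods.
    trans_loglik: (3, 2) log-probabilities of [self-loop, forward] per state."""
    T = state_loglik.shape[0]
    delta = np.full(3, -np.inf)
    delta[0] = state_loglik[0, 0]                     # must start in state 0
    for t in range(1, T):
        stay = delta + trans_loglik[:, 0]
        move = np.concatenate(([-np.inf], delta[:-1] + trans_loglik[:-1, 1]))
        delta = np.maximum(stay, move) + state_loglik[t]
    return delta[2] + trans_loglik[2, 1]              # exit from the final state

rng = np.random.default_rng(0)
print(best_path_loglik(rng.normal(size=(40, 3)),      # 40 frames, the paper's maximum phone duration
                       np.log(np.full((3, 2), 0.5))))
```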
  • 26.
    Skantze, Gabriel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication.
    Exploring human error recovery strategies: Implications for spoken dialogue systems. 2005. In: Speech Communication, ISSN 0167-6393, E-ISSN 1872-7182, Vol. 45, no. 3, pp. 325-341. Article in journal (Refereed)
    Abstract [en]

    In this study, an explorative experiment was conducted in which subjects were asked to give route directions to each other in a simulated campus (similar to Map Task). In order to elicit error handling strategies, a speech recogniser was used to corrupt the speech in one direction. This way, data could be collected on how the subjects might recover from speech recognition errors. This method for studying error handling has the advantages that the level of understanding is transparent to the analyser, and the errors that occur are similar to errors in spoken dialogue systems. The results show that when subjects face speech recognition problems, a common strategy is to ask task-related questions that confirm their hypothesis about the situation instead of signalling non-understanding. Compared to other strategies, such as asking for a repetition, this strategy leads to better understanding of subsequent utterances, whereas signalling non-understanding leads to decreased experience of task success.

  • 27.
    Skantze, Gabriel
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Hjalmarsson, Anna
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Oertel, Catharine
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Turn-taking, feedback and joint attention in situated human-robot interaction. 2014. In: Speech Communication, ISSN 0167-6393, E-ISSN 1872-7182, Vol. 65, pp. 50-66. Article in journal (Refereed)
    Abstract [en]

    In this paper, we present a study where a robot instructs a human on how to draw a route on a map. The human and robot are seated face-to-face with the map placed on the table between them. The user's and the robot's gaze can thus serve several simultaneous functions: as cues to joint attention, turn-taking, level of understanding and task progression. We have compared this face-to-face setting with a setting where the robot employs a random gaze behaviour, as well as a voice-only setting where the robot is hidden behind a paper board. In addition to this, we have also manipulated turn-taking cues such as completeness and filled pauses in the robot's speech. By analysing the participants' subjective rating, task completion, verbal responses, gaze behaviour, and drawing activity, we show that the users indeed benefit from the robot's gaze when talking about landmarks, and that the robot's verbal and gaze behaviour has a strong effect on the users' turn-taking behaviour. We also present an analysis of the users' gaze and lexical and prosodic realisation of feedback after the robot instructions, and show that these cues reveal whether the user has yet executed the previous instruction, as well as the user's level of uncertainty.

  • 28. Székely, Éva
    et al.
    Ahmed, Zeeshan
    Hennig, Shannon
    Cabral, Joao P
    Carson-Berndsen, Julie
    Predicting synthetic voice style from facial expressions. An application for augmented conversations. 2014. In: Speech Communication, ISSN 0167-6393, E-ISSN 1872-7182, Vol. 57, pp. 63-75. Article in journal (Refereed)
    Abstract [en]

    The ability to efficiently facilitate social interaction and emotional expression is an important, yet unmet requirement for speech generating devices aimed at individuals with speech impairment. Using gestures such as facial expressions to control aspects of expressive synthetic speech could contribute to an improved communication experience for both the user of the device and the conversation partner. For this purpose, a mapping model between facial expressions and speech is needed, that is high level (utterance-based), versatile and personalisable. In the mapping developed in this work, visual and auditory modalities are connected based on the intended emotional salience of a message: the intensity of facial expressions of the user to the emotional intensity of the synthetic speech. The mapping model has been implemented in a system called WinkTalk that uses estimated facial expression categories and their intensity values to automatically select between three expressive synthetic voices reflecting three degrees of emotional intensity. An evaluation is conducted through an interactive experiment using simulated augmented conversations. The results have shown that automatic control of synthetic speech through facial expressions is fast, non-intrusive, sufficiently accurate and supports the user to feel more involved in the conversation. It can be concluded that the system has the potential to facilitate a more efficient communication process between user and listener.

  • 29. Vicente-Pena, J.
    et al.
    Diaz-de-Maria, F.
    Kleijn, W. Bastiaan
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    The synergy between bounded-distance HMM and spectral subtraction for robust speech recognition. 2010. In: Speech Communication, ISSN 0167-6393, E-ISSN 1872-7182, Vol. 52, no. 2, pp. 123-133. Article in journal (Refereed)
    Abstract [en]

    Additive noise generates important losses in automatic speech recognition systems. In this paper, we show that one of the causes contributing to these losses is the fact that conventional recognisers take into consideration feature values that are outliers. The method that we call bounded-distance HMM is a suitable method to avoid that outliers contribute to the recogniser decision. However, this method just deals with outliers, leaving the remaining features unaltered. In contrast, spectral subtraction is able to correct all the features at the expense of introducing some artifacts that, as shown in the paper, cause a larger number of outliers. As a result, we find that bounded-distance HMM and spectral subtraction complement each other well. A comprehensive experimental evaluation was conducted, considering several well-known ASR tasks (of different complexities) and numerous noise types and SNRs. The achieved results show that the suggested combination generally outperforms both the bounded-distance HMM and spectral subtraction individually. Furthermore, the obtained improvements, especially for low and medium SNRs, are larger than the sum of the improvements individually obtained by bounded-distance HMM and spectral subtraction.

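Entry 29 above combines bounded-distance HMMs with spectral subtraction. The snippet below is a hedged sketch of plain magnitude-domain spectral subtraction only; the oversubtraction factor and spectral floor are common textbook choices, not the paper's settings, and the bounded-distance HMM part is not shown.

```python
# Spectral subtraction: subtract an estimated noise power spectrum from each
# frame of a noisy STFT, keeping the noisy phase.
import numpy as np

def spectral_subtraction(noisy_stft, noise_psd, alpha=2.0, floor=0.01):
    """noisy_stft: complex array (frames, bins); noise_psd: noise power per bin (bins,)."""
    power = np.abs(noisy_stft) ** 2
    clean_power = power - alpha * noise_psd                 # oversubtraction
    clean_power = np.maximum(clean_power, floor * power)    # spectral floor limits "musical" artifacts
    return np.sqrt(clean_power) * np.exp(1j * np.angle(noisy_stft))

rng = np.random.default_rng(0)
stft = rng.normal(size=(100, 257)) + 1j * rng.normal(size=(100, 257))
enhanced = spectral_subtraction(stft, noise_psd=np.full(257, 0.5))
print(enhanced.shape)
```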
  • 30.
    Wik, Preben
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Hjalmarsson, Anna
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Embodied conversational agents in computer assisted language learning. 2009. In: Speech Communication, ISSN 0167-6393, E-ISSN 1872-7182, Vol. 51, no. 10, pp. 1024-1037. Article in journal (Refereed)
    Abstract [en]

    This paper describes two systems using embodied conversational agents (ECAs) for language learning. The first system, called Ville, is a virtual language teacher for vocabulary and pronunciation training. The second system, a dialogue system called DEAL, is a role-playing game for practicing conversational skills. Whereas DEAL acts as a conversational partner with the objective of creating and keeping an interesting dialogue, Ville takes the role of a teacher who guides, encourages and gives feedback to the students.
