kth.se Publications
51 - 60 of 60
  • 51.
    Neiberg, Daniel
    et al.
    KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Laukka, P.
    Ananthakrishnan, Gopal
    KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Classification of Affective Speech using Normalized Time-Frequency Cepstra (2010). In: Speech Prosody 2010 Conference Proceedings, Chicago, Illinois, U.S.A., 2010. Conference paper (Refereed)
    Abstract [en]

    Subtle temporal and spectral differences between categorical realizations of para-linguistic phenomena (e.g. affective vocal expressions) are hard to capture and describe. In this paper we present a signal representation based on Time Varying Constant-Q Cepstral Coefficients (TVCQCC) derived for this purpose. A method which utilizes the special properties of the constant-Q transform for mean F0 estimation and normalization is described. The coefficients are invariant to utterance length, and as a special case, a representation for prosody is considered. Speaker-independent classification results using nu-SVM are reported for the Berlin EMO-DB and for two closed sets of basic (anger, disgust, fear, happiness, sadness, neutral) and social/interpersonal (affection, pride, shame) emotions recorded by forty professional actors from two English dialect areas. The accuracy for the Berlin EMO-DB is 71.2%; for the first set, including basic emotions, the accuracy was 44.6%, and for the second set, including basic and social emotions, it was 31.7%. It was found that F0 normalization boosts the performance and that a combined feature set shows the best performance.
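    The abstract above reduces each utterance to a fixed-length cepstral descriptor and classifies it with a nu-SVM. The snippet below is a minimal sketch of that classification stage only, using scikit-learn's NuSVC on placeholder data; the TVCQCC extraction, the F0 normalization and the actual corpora are not reproduced, and all array shapes and parameter values are illustrative assumptions rather than the paper's settings.

```python
# Sketch of nu-SVM emotion classification over fixed-length cepstral
# descriptors (random placeholder data, not the TVCQCC features of the paper).
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import NuSVC

def train_emotion_classifier(features, labels, nu=0.25):
    """features: (n_utterances, n_coefficients) array with one fixed-length
    descriptor per utterance; labels: emotion category per utterance."""
    clf = make_pipeline(StandardScaler(), NuSVC(nu=nu, kernel="rbf"))
    clf.fit(features, labels)
    return clf

# Illustrative use with random numbers standing in for real descriptors.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 60))   # 120 utterances, 60 coefficients each
y = rng.choice(["anger", "disgust", "fear", "happiness", "sadness", "neutral"],
               size=120)
model = train_emotion_classifier(X, y)
print(model.predict(X[:3]))
```

    For a speaker-independent evaluation of the kind reported above, the train/test split would be made by speaker rather than by random utterance.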

  • 52.
    Pakucs, Botond
    KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Employing context of use in dialogue processing (2004). In: Proc. CATALOG '04, 8th Workshop on the Semantics and Pragmatics of Dialogue, 2004, p. 162-163. Conference paper (Refereed)
    Abstract [en]

    In this paper, a generic solution is presented for capturing, representing and employing the context of use in dialogue processing. The implementation of the solution within the framework of the SesaME dialogue manager and the Butler demonstrator is also described.

  • 53.
    Picard, Sebastien
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Ananthakrishnan, Gopal
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Wik, Preben
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Abdou, S.
    Detection of Specific Mispronunciations using Audiovisual Features (2010). In: Auditory-Visual Speech Processing (AVSP) 2010, International Speech Communication Association (ISCA), 2010. Conference paper (Refereed)
    Abstract [en]

    This paper introduces a general approach for binary classification of audiovisual data. The intended application is mispronunciation detection for specific phonemic errors, using very sparse training data. The system uses a Support Vector Machine (SVM) classifier with features obtained from a Time Varying Discrete Cosine Transform (TV-DCT) on the audio log-spectrum as well as on the image sequences. The concatenated feature vectors from both the modalities were reduced to a very small subset using a combination of feature selection methods. We achieved 95-100% correct classification for each pair-wise classifier on a database of Swedish vowels with an average of 58 instances per vowel for training. The performance was largely unaffected when tested on data from a speaker who was not included in the training.
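    As a rough sketch of the kind of pipeline the abstract outlines, the code below compacts a variable-length log-spectrogram with a DCT along the time axis into a small fixed-length descriptor, then trains one pairwise (binary) SVM preceded by simple univariate feature selection. The exact TV-DCT formulation, the image-sequence features and the particular combination of feature selection methods used in the paper are not shown; the function names, dimensions and the choice of SelectKBest are assumptions made for illustration.

```python
# Hypothetical sketch: time-axis DCT descriptor + feature selection + binary SVM.
import numpy as np
from scipy.fft import dct
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def time_dct_descriptor(log_spec, n_time=8):
    """log_spec: (n_frames, n_freq_bins) log-magnitude spectrogram.
    A DCT along the time axis compacts the temporal trajectory of each
    frequency bin; keeping only the n_time lowest-order coefficients gives
    a descriptor whose size does not depend on utterance length."""
    coeffs = dct(log_spec, axis=0, norm="ortho")
    fixed = np.zeros((n_time, log_spec.shape[1]))
    keep = min(n_time, coeffs.shape[0])
    fixed[:keep] = coeffs[:keep]
    return fixed.ravel()

def pairwise_mispronunciation_classifier(X, y, k=20):
    """Binary classifier for one specific error (correct vs mispronounced)."""
    clf = make_pipeline(StandardScaler(),
                        SelectKBest(f_classif, k=k),
                        SVC(kernel="linear"))
    return clf.fit(X, y)
```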

  • 54.
    Schenkman, Bo N.
    et al.
    KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT. KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Nilsson, Mats E.
    Human echolocation: Blind and sighted persons' ability to detect sounds recorded in the presence of a reflecting object (2010). In: Perception, ISSN 0301-0066, E-ISSN 1468-4233, Vol. 39, no. 4, p. 483-501. Article in journal (Refereed)
    Abstract [en]

    Research suggests that blind people are superior to sighted people in echolocation, but systematic psychoacoustic studies on environmental conditions such as distance to objects, signal duration, and reverberation are lacking. Therefore, two experiments were conducted. Noise bursts of 5, 50, or 500 ms were reproduced by a loudspeaker on an artificial manikin in an ordinary room and in an anechoic chamber. The manikin recorded the sounds binaurally in the presence and absence of a reflecting 1.5-mm thick aluminium disk, 0.5 m in diameter, placed in front at distances of 0.5 to 5 m. These recordings were later presented to ten visually handicapped and ten sighted people, 30-62 years old, using a 2AFC paradigm with feedback. The task was to detect which of the two sounds contained the reflecting object. The blind performed better than the sighted participants. All performed well with the object at less than 2 m distance. Detection increased with longer signal durations. Performance was slightly better in the ordinary room than in the anechoic chamber. A supplementary experiment on the two best blind persons showed that their superior performance at distances beyond 2 m was not by chance. Detection thresholds showed that blind participants could detect the object at longer distances in the conference room than in the anechoic chamber when using the longer-duration sounds, and also as compared to the sighted people. Audiometric tests suggest that equal hearing in both ears is important for echolocation. Possible echolocation mechanisms are discussed.
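    The experiments above use a two-alternative forced-choice (2AFC) paradigm with feedback. The snippet below is a generic sketch of how such trials could be scored and summarized as proportion correct per object distance; it is standard 2AFC bookkeeping, not the authors' analysis code, and the trial record format is an assumption.

```python
# Generic 2AFC scoring sketch: each trial presents two recordings, one made
# with the reflecting object present; the response is the chosen interval.
from collections import defaultdict

def score_2afc(trials):
    """trials: iterable of dicts with keys
       'distance_m'        - object distance for the trial,
       'object_interval'   - 1 or 2, the interval containing the object,
       'response_interval' - 1 or 2, the listener's choice.
    Returns {distance: proportion correct}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t in trials:
        total[t["distance_m"]] += 1
        if t["response_interval"] == t["object_interval"]:
            correct[t["distance_m"]] += 1
    return {d: correct[d] / total[d] for d in total}

# Example with made-up responses (chance performance in 2AFC is 0.5).
demo = [
    {"distance_m": 0.5, "object_interval": 1, "response_interval": 1},
    {"distance_m": 0.5, "object_interval": 2, "response_interval": 2},
    {"distance_m": 5.0, "object_interval": 1, "response_interval": 2},
    {"distance_m": 5.0, "object_interval": 2, "response_interval": 2},
]
print(score_2afc(demo))   # {0.5: 1.0, 5.0: 0.5}
```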

  • 55.
    Schenkman, Bo N.
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Nilsson, Mats E.
    Human echolocation: Pitch versus loudness information (2011). In: Perception, ISSN 0301-0066, E-ISSN 1468-4233, Vol. 40, no. 7, p. 840-852. Article in journal (Refereed)
    Abstract [en]

    Blind persons emit sounds to detect objects by echolocation. Both the perceived pitch and the perceived loudness of the emitted sounds change as they fuse with the reflections from nearby objects. Blind persons are generally better than sighted persons at echolocation, but it is unclear whether this superiority is related to the detection of pitch, loudness, or both. We measured the ability of twelve blind and twenty-five sighted listeners to determine which of two sounds, 500 ms noise bursts, had been recorded in the presence of a reflecting object in a room with reflecting walls, using an artificial head. The sound pairs were original recordings differing in both pitch and loudness, or manipulated recordings with either the pitch or the loudness information removed. Observers responded using a 2AFC method with verbal feedback. For both blind and sighted listeners, the performance declined more with the pitch information removed than with the loudness information removed. In addition, the blind performed clearly better than the sighted as long as the pitch information was present, but not when it was removed. Taken together, these results show that the ability to detect pitch is a main factor underlying high performance in human echolocation.

  • 56.
    Skantze, Gabriel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Making grounding decisions: Data-driven estimation of dialogue costs and confidence thresholds (2007). In: Proceedings of the 8th SIGdial Workshop on Discourse and Dialogue, 2007, p. 206-210. Conference paper (Refereed)
    Abstract [en]

    This paper presents a data-driven decision-theoretic approach to making grounding decisions in spoken dialogue systems, i.e., to decide which recognition hypotheses to consider as correct and which grounding action to take. Based on task analysis of the dialogue domain, cost functions are derived, which take dialogue efficiency, consequence of task failure and information gain into account. Dialogue data is then used to estimate speech recognition confidence thresholds that are dependent on the dialogue context.
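    As a minimal illustration of the decision-theoretic idea, the sketch below compares the expected cost of accepting a recognition hypothesis against the cost of an extra clarification turn and shows the confidence threshold this comparison implies. The cost values and the two-action simplification are placeholder assumptions; the paper derives its costs from task analysis (including consequence of task failure and information gain) and estimates context-dependent thresholds from dialogue data.

```python
# Toy expected-cost comparison for a grounding decision (placeholder costs).

def grounding_action(confidence, cost_accept_wrong=10.0, cost_clarify=2.0):
    """Pick the action with the lower expected cost.

    Expected cost of accepting  = (1 - confidence) * cost_accept_wrong
    Expected cost of clarifying = cost_clarify (one extra dialogue turn)
    """
    if (1.0 - confidence) * cost_accept_wrong < cost_clarify:
        return "accept"
    return "clarify"

# The comparison implies a confidence threshold of
# 1 - cost_clarify / cost_accept_wrong (= 0.8 with these placeholder costs),
# above which accepting is cheaper in expectation.
for c in (0.95, 0.75):
    print(c, grounding_action(c))
```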

  • 57.
    Wik, Preben
    et al.
    KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT. KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT. KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Can visualization of internal articulators support speech perception? (2008). In: INTERSPEECH 2008: 9th Annual Conference of the International Speech Communication Association, Vols. 1-5, Baixas: ISCA - International Speech Communication Association, 2008, p. 2627-2630. Conference paper (Refereed)
    Abstract [en]

    This paper describes the contribution to speech perception given by animations of intra-oral articulations. 18 subjects were asked to identify the words in acoustically degraded sentences in three different presentation modes: acoustic signal only, audiovisual with a front view of a synthetic face, and audiovisual with both the front view and a side view in which tongue movements were made visible by making parts of the cheek transparent. The augmented-reality side view did not help subjects perform better overall than the front view only, but it seems to have been beneficial for the perception of palatal plosives, liquids and rhotics, especially in clusters. The results indicate that intra-oral animations cannot be expected to support speech perception in general, but that information on some articulatory features can be extracted. Animations of tongue movements therefore have more potential for use in computer-assisted pronunciation and perception training than as a communication aid for the hearing-impaired.

  • 58.
    Wik, Preben
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Looking at tongues – can it help in speech perception? (2008). In: Proceedings of The XXIst Swedish Phonetics Conference, FONETIK 2008, 2008, p. 57-60. Conference paper (Other academic)
    Abstract [en]

    This paper describes the contribution to speech perception given by animations of intra-oral articulations. 18 subjects were asked to identify the words in acoustically degraded sentences in three different presentation modes: acoustic signal only, audiovisual with a front view of a synthetic face, and audiovisual with both the front view and a side view in which tongue movements were made visible by making parts of the cheek transparent. The augmented-reality side view did not help subjects perform better overall than the front view only, but it seems to have been beneficial for the perception of palatal plosives, liquids and rhotics, especially in clusters.

  • 59.
    Wik, Preben
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Hincks, Rebecca
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Language and Communication.
    Hirschberg, Julia
    Department of Computer Science, Columbia University, USA.
    Responses to Ville: A virtual language teacher for Swedish (2009). In: Proc. of SLaTE Workshop on Speech and Language Technology in Education, Wroxall, England, 2009. Conference paper (Refereed)
    Abstract [en]

    A series of novel capabilities have been designed to extend the repertoire of Ville, a virtual language teacher for Swedish created at the Centre for Speech Technology at KTH. These capabilities were tested by twenty-seven language students at KTH. This paper reports on qualitative surveys and quantitative performance from these sessions, which suggest some general lessons for automated language training.

  • 60.
    Wik, Preben
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Hjalmarsson, Anna
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Embodied conversational agents in computer assisted language learning (2009). In: Speech Communication, ISSN 0167-6393, E-ISSN 1872-7182, Vol. 51, no. 10, p. 1024-1037. Article in journal (Refereed)
    Abstract [en]

    This paper describes two systems using embodied conversational agents (ECAs) for language learning. The first system, called Ville, is a virtual language teacher for vocabulary and pronunciation training. The second system, a dialogue system called DEAL, is a role-playing game for practicing conversational skills. Whereas DEAL acts as a conversational partner with the objective of creating and keeping an interesting dialogue, Ville takes the role of a teacher who guides, encourages and gives feedback to the students.
