  • 1.
    Al Moubayed, Samer
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Blomberg, Mats
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Granström, Björn
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Mirnig, N.
    Skantze, Gabriel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Talking with Furhat - multi-party interaction with a back-projected robot head (2012). In: Proceedings of Fonetik 2012, Gothenburg, Sweden, 2012, p. 109-112. Conference paper (Other academic)
    Abstract [en]

    This is a condensed presentation of some recent work on a back-projected robotic head for multi-party interaction in public settings. We will describe some of the design strategies and give some preliminary analysis of an interaction database collected at the Robotville exhibition at the London Science Museum.

  • 2. Batliner, A.
    et al.
    Blomberg, Mats
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    D’Arcy, S.
    Elenius, Daniel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Giuliani, D.
    Gerosa, M.
    Hacker, C.
    Russell, M.
    Steidl, S.
    Wong, M.
    The PF STAR Children’s Speech Corpus (2005). In: 9th European Conference on Speech Communication and Technology, 2005, p. 3761-3764. Conference paper (Refereed)
    Abstract [en]

    This paper describes the corpus of recordings of children's speech which was collected as part of the EU FP5 PF_STAR project. The corpus contains more than 60 hours of speech, including read and imitated native-language speech in British English, German and Swedish, read and imitated non-native-language English speech from German, Italian and Swedish children, and native-language spontaneous and emotional speech in English and German.

  • 3. Bertenstam, J
    et al.
    Blomberg, Mats
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Carlson, R
    Elenius, K.O.E
    Granström, B
    Gustafson, J
    Hunnicutt, S
    Högberg, J
    Lindell, R
    Neovius, L
    Nord, L
    De Serpa-Leitao, A
    Ström, N
    THE WAXHOLM APPLICATION DATABASE (1995). Conference paper (Refereed)
    Abstract [en]

    This paper describes an application database collected in Wizard-of-Oz experiments in a spoken dialogue system, WAXHOLM. The system provides information on boat traffic in the Stockholm archipelago. The database consists of utterance-length speech files, their corresponding transcriptions, and log files of the dialogue sessions. In addition to the spontaneous dialogue speech, the material also comprises recordings of phonetically balanced reference sentences uttered by all 66 subjects. The paper describes the recording procedure as well as some characteristics of the speech data and the dialogue.

  • 4.
    Bertenstam, Johan
    et al.
    KTH, Superseded Departments, Speech, Music and Hearing.
    Blomberg, Mats
    KTH, Superseded Departments, Speech, Music and Hearing.
    Carlson, Rolf
    KTH, Superseded Departments, Speech, Music and Hearing.
    Elenius, Kjell
    KTH, Superseded Departments, Speech, Music and Hearing.
    Granström, Björn
    KTH, Superseded Departments, Speech, Music and Hearing.
    Gustafson, Joakim
    KTH, Superseded Departments, Speech, Music and Hearing.
    Hunnicutt, Sheri
    Högberg, Jesper
    KTH, Superseded Departments, Speech, Music and Hearing.
    Lindell, Roger
    KTH, Superseded Departments, Speech, Music and Hearing.
    Neovius, Lennart
    KTH, Superseded Departments, Speech, Music and Hearing.
    Nord, Lennart
    de Serpa-Leitao, Antonio
    KTH, Superseded Departments, Speech, Music and Hearing.
    Ström, Nikko
    KTH, Superseded Departments, Speech, Music and Hearing.
    Spoken dialogue data collected in the Waxholm project (1995). In: Quarterly progress and status report: April 15, 1995 / Speech Transmission Laboratory, Stockholm: KTH, 1995, 1, p. 50-73. Chapter in book (Other academic)
  • 5. Bimbot, F
    et al.
    Blomberg, Mats
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Boves, L
    Chollet, G
    Jaboulet, C
    Jacob, B
    Kharroubi, J
    Koolwaaij, J
    Lindberg, J
    Mariethoz, J
    Mokbel, C
    Mokbel, H
    An overview of the PICASSO project research activities in speaker verification for telephone applications (1999). Conference paper (Refereed)
    Abstract [en]

    This paper presents a general overview of the current research activities in the European PICASSO project on speaker verification for telephone applications. First, the general formalism used by the project is described. Then the scientific issues under focus are discussed in detail. Finally, the paper briefly describes the Picassoft research platform. Throughout the article, entry points to more specific work also published in the Eurospeech’99 proceedings are given.

  • 6. Bimbot, F.
    et al.
    Blomberg, Mats
    KTH, Superseded Departments, Speech, Music and Hearing.
    Boves, L.
    Genoud, D.
    Hutter, H. P.
    Jaboulet, C.
    Koolwaaij, J.
    Lindberg, J.
    Pierrot, J. B.
    An overview of the CAVE project research activities in speaker verification (2000). In: Speech Communication, ISSN 0167-6393, E-ISSN 1872-7182, Vol. 31, no 2-3, p. 155-180. Article in journal (Refereed)
    Abstract [en]

    This article presents an overview of the research activities carried out in the European CAVE project, which focused on text-dependent speaker verification on the telephone network using whole word Hidden Markov Models. It documents in detail various aspects of the technology and the methodology used within the project. In particular, it addresses the issue of model estimation in the context of limited enrollment data and the problem of a posteriori decision threshold setting. Experiments are carried out on the realistic telephone speech database SESP. State-of-the-art performance levels are obtained, which validates the technical approaches developed and assessed during the project as well as the working infrastructure which facilitated cooperation between the partners.

  • 7. Bimbot, F
    et al.
    Blomberg, Mats
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Boves, L
    Genoud, D
    Hutter, H-P
    Jaboulet, C
    Koolwaaij, J
    Lindberg, J
    Pierrot, J-B
    An overview of the CAVE project research activities in speaker verification (2000). In: Speech Communication, ISSN 0167-6393, E-ISSN 1872-7182, Vol. 31, no 2-3, p. 155-180. Article in journal (Refereed)
    Abstract [en]

    This article presents an overview of the research activities carried out in the European CAVE project, which focused on text-dependent speaker verification on the telephone network using whole word Hidden Markov Models. It documents in detail various aspects of the technology and the methodology used within the project. In particular, it addresses the issue of model estimation in the context of limited enrollment data and the problem of a posteriori decision threshold setting. Experiments are carried out on the realistic telephone speech database SESP. State-of-the-art performance levels are obtained, which validates the technical approaches developed and assessed during the project as well as the working infrastructure which facilitated cooperation between the partners.

  • 8.
    Blomberg, Mats
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    A common phone model representation for speech recognition and synthesis (1994). Conference paper (Refereed)
    Abstract [en]

    A combined representation of context-dependent phones at the production parametric and the spectral level is described. The phones are trained in the production domain using analysis-by-synthesis and piece-wise linear approximation of parameter trajectories. For recognition, this representation is transformed to spectral subphones, using a cascade formant synthesis procedure. In a connected-digit recognition task, 99.1% average correct digit rate was achieved in a group of seven male speakers when, for each test speaker, training was done on the other six speakers. Simple rules for male-to-female transformation of the male phone library increased the performance for six female speakers from 88.9% without transformation to 96.3%. In informal listening tests of resynthesised digit strings, the speech has been judged as intelligible, however far from natural.


  • 9.
    Blomberg, Mats
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Collection and recognition of children's speech in the PF-Star project (2003). Conference paper (Refereed)
    Abstract [en]

    This paper reports on the recording and planned research activities on recognition of children’s speech in the EU project PF-Star. The task is considerably more difficult than recognition of adult speech for several reasons. High fundamental frequency and formant frequencies change the spectral shape of the speech signal. The pronunciation and use of language also differ from adult speech. One objective in PF-Star is to collect speech data for the project partners’ languages and to detect and analyse major difficulties. Possible ways of reducing these problems will be explored.

  • 10.
    Blomberg, Mats
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Creating unseen triphones by phone concatenation in the spectral, cepstral and formant domains (1997). Conference paper (Refereed)
    Abstract [en]

    A technique for predicting triphones by concatenation of diphone or monophone models is studied. The models are connected using linear interpolation between endpoints of piece-wise linear parameter trajectories. Three types of spectral representation are compared: formants, filter amplitudes and cepstrum coefficients. The proposed technique lowers the spectral distortion of the phones for all three representations when different speakers are used for training and evaluation. The average error of the created triphones is lower in the filter and cepstrum domains than for formants. This is explained by limitations in the analysis-by-synthesis formant tracking algorithm. A small improvement with the proposed technique is achieved for all representations in the task of reordering N-best sentence recognition candidate lists.

  • 11.
    Blomberg, Mats
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Creating unseen triphones by phone concatenation of diphones and monophones in the spectral, cepstral and formant domains (1997). Conference paper (Refereed)
  • 12.
    Blomberg, Mats
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Creation of unseen triphones from seen triphones, diphones and phones (1996). In: TMH-QPSR, Vol. 37, no 2, p. 113-116. Article in journal (Refereed)
    Abstract [en]

    With limited training data, infrequent triphone models for speech recognition will not be observed in sufficient numbers. In this report, a speech production approach is used to predict the characteristics of unseen triphones by using a transformation technique in the parametric representation of a formant speech synthesiser. Two techniques are currently tested. In one approach, unseen triphones are created by concatenating monophones and diphones and interpolating the parameter trajectories across the connection points. The second technique combines information from two similar triphones: one with correct context and one with correct midphone identity. Preliminary experiments are performed in the task of rescoring recognition candidates in an N-best list.

  • 13.
    Blomberg, Mats
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Model space size scaling for speaker adaptation (2011). In: Proceedings of Fonetik 2011, Stockholm: KTH Royal Institute of Technology, 2011, Vol. 51, no 1, p. 77-80. Conference paper (Other academic)
    Abstract [en]

    In the current work, instantaneous adaptation in speech recognition is performed by estimating speaker properties, which modify the original trained acoustic models. We introduce a new property, the size of the model space, which is added to the previously used features, VTLN and spectral slope. These are jointly estimated for each test utterance. The new feature has been shown to be effective for recognition of children’s speech using adult-trained models in TIDIGITS. Adding the feature lowered the error rate by around 10% relative. The overall combination of VTLN, spectral slope and model space scaling represents a substantial 31% relative reduction compared with single VTLN. There was no improvement among adult speakers in TIDIGITS and in TIMIT. Improvement for this speaker category is expected when the training and test sets are recorded in different conditions, such as read and spontaneous speech.

  • 14.
    Blomberg, Mats
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Phoneme recognition for the hearing impaired (2002). Conference paper (Refereed)
    Abstract [en]

    This paper describes an automatic speech recognition system designed to investigate the use of phoneme recognition as a hearing aid in telephone communication. The system was tested in two experiments. The first involved 19 normal hearing subjects with a simulated severe hearing impairment. The second involved 5 hearing impaired subjects. In both studies we used a procedure called Speech Tracking to measure the effective communication speed between two persons. A substantial improvement was found in both cases.

  • 15.
    Blomberg, Mats
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Speech recognition using long-distance relations in an utterance (1998). Conference paper (Refereed)
  • 16. Blomberg, Mats
    Synthetic phoneme prototypes and dynamic voice source adaptation in speech recognition (1993). In: STL-QPSR, Vol. 34, no 4, p. 97-140. Article in journal (Refereed)
    Abstract [en]

    A speech production oriented technique for generating reference spectral data for speech recognition is presented as an alternative to training on natural speech. The potential of this approach is discussed. In the presented recognition system, the vocabulary and grammar are described as a finite-state network. Phoneme templates are specified in terms of control parameters to a cascade formant synthesiser. Reduction and coarticulation modules modify initial phoneme target values and insert interpolated transition states at phoneme boundaries before computing spectral reference data. The recognition process uses a time-synchronous Viterbi search technique. The effect of voice source variation upon speech recognition performance is discussed. Experiments using synthetic data show that normal voice quality variation is large enough to cause vowel confusions in a recogniser using Bark cepstrum representation. An adaptation technique is proposed that models the deviation of the voice source from the reference quality in terms of amplitude and spectral balance. This technique reduces the recogniser's sensitivity to voice source deviation. To account for medium-term time correlation, an algorithm for dynamic adaptation of mismatch between reference and test data is described. The algorithm has been implemented to adapt to voice source fluctuations in time. In an isolated-word recognition task, the average recognition rate for ten male speakers was 88% without dynamic source adaptation using a 26-word vocabulary. Adding the voice source adaptation function raised the performance to 96%. On a vocabulary of three connected digits, the digit recognition rate was maximally 96.1% as an average for six male speakers. Setting the reference voice source model to contain a low high-frequency energy level, as in a breathy voice, resulted in a recognition rate of 73% without the source adaptation module. Including source adaptation raised the performance to 91%, showing the power of this component to compensate for this type of mismatch between reference and test data.

  • 17. Blomberg, Mats
    Synthetic phoneme prototypes and source adaptation in a speech recognition system (1989). In: STL-QPSR, Vol. 30, no 1, p. 131-135. Article in journal (Refereed)
    Abstract [en]

    A recognition system based on a reference library of synthetic phoneme prototypes is described. The phoneme templates are specified in terms of formant synthesis parameters. The vocabulary and grammar are described in a finite-state network where each state represents a phoneme. A transition between two phonemes in the net is expanded to a number of new states using interpolation on the synthesis parameters or at the spectrum level. At each state, a 16-channel filter bank section is computed from the synthesis parameters. Adaptation to each speaker's individual voice source spectrum is performed during recognition. Without adaptation, the average recognition for ten male speakers was 88% on an isolated-word task using a 26-word vocabulary. On a vocabulary of 3 connected digits, the recognition rate for six male speakers was 87.7%. Adding the voice source adaptation feature raised the performance to 96% and 92.8%, respectively. The improvement varied considerably between the speakers, indicating the usefulness of the voice source adaptation for certain voices.

  • 18.
    Blomberg, Mats
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Training production parameters of context-dependent phones for speech recognition (1994). In: STL-QPSR, Vol. 35, no 1, p. 59-90. Article in journal (Refereed)
    Abstract [en]

    A representation form of acoustic information in a trained phone library at the production parametric as well as the spectral level is described. The phones are trained in the parametric domain and are transformed to the spectral domain by means of a synthesis procedure. By this twofold description, potentially more powerful procedures for speaker adaptation and generation of unseen triphones can be explored, while the more robust spectral representation can be used for recognition. Context-dependent phones are represented by control parameters to a cascade formant synthesiser. During training, the parameters are extracted using an analysis-by-synthesis technique and the trajectories are approximated by piece-wise linear segments. For recognition, the parameter tracks are transformed to a sequence of spectral subphone states, similar to a Hidden Markov model. Recognition is performed by Viterbi search in a finite-state network. Recognition experiments have been performed on Swedish connected-digit strings pronounced by seven male speakers. In one experiment, unseen triphones were created by concatenating monophones and diphones and interpolating the parameter trajectories between line endpoints. In another, speaker adaptation was based on generalisation of differences of observed triphones from the phone library. With optimum weighting of duration information, the results for cross-speaker recognition, speaker adaptation, and multi-speaker training were 98.5%, 98.9% and 99.1% correct digit recognition, respectively. Preliminary experiments with created unseen triphones show no improvement. In informal listening tests of resynthesised digit strings from concatenation of trained triphones, the speech has been judged as intelligible, however, far from natural.

  • 19.
    Blomberg, Mats
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Training speech synthesis parameters of allophones for speech recognition (1994). Conference paper (Refereed)
    Abstract [en]

    A technique for training a speech recognition system at a production parametric level is described. The approach offers potential advantages in the form of small training corpora and fast speaker adaptation. Triphones that have not occurred in the training data can be generated by concatenation and parametric interpolation of diphones or context-free phones. The triphones are represented by a piece-wise linear approximation of the production parameters. For recognition, these are converted to subphone spectral state sequences. A 97.6% connected-digit recognition rate has been achieved when training the system on one male speaker and performing recognition on 6 other male speakers. In preliminary experiments with generation of unseen triphones, the performance is still slightly lower compared to using seen diphones and context-free phones. Experiments with fast speaker adaptation are also in progress. Resynthesis of speech by concatenating triphones has been used to verify the quality of the triphone library.

  • 20.
    Blomberg, Mats
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Within-utterance correlation for speech recognition (1999). Conference paper (Refereed)
    Abstract [en]

    Relations between non-adjacent parts of an utterance are commonly regarded as an important source of information for speech recognition. However, they have not been widely used in speech recognition systems. In this paper, we include this information through joint distributions of pairs of phones occurring in the same utterance. In addition to relations between acoustic events, we have also incorporated relations between spectral and prosodically oriented information, such as phone duration, position in utterance and fundamental frequency. Preliminary recognition results on N-best rescoring show a 10% word error reduction compared to a baseline Viterbi decoder.

  • 21.
    Blomberg, Mats
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Within-utterance correlation in automatic speech recognition (1999). Conference paper (Refereed)
    Abstract [en]

    Information on relations between separate parts of an utterance can be used to improve the performance of speech recognition systems. In this paper, examples of relations are discussed and some measured data on phone pair correlation is presented. In addition to relations between acoustic events in an utterance, it is also possible to represent relations between acoustic and non-acoustic information. In this way, covariance matrices can express some relations similar to phonetic-acoustic rules. Two alternative recognition methods are proposed to account for these relations. Some correlation data are presented and discussed.

  • 22.
    Blomberg, Mats
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Carlson, R
    Labeling of speech given its text representation (1993). Conference paper (Refereed)
  • 23.
    Blomberg, Mats
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Carlson, R
    Elenius, K
    Galyas, K
    Granström, B
    Hunnicutt, S
    Neovius, L
    Speech synthesis and recognition in technical aids (1986). In: STL-QPSR, Vol. 27, no 4, p. 45-65. Article in journal (Refereed)
    Abstract [en]

    A number of speech-producing technical aids are now available for use by disabled individuals. One system which produces synthetic speech is described and its application in technical aids discussed. These applications include a communication aid, a symbol-to-speech system, talking terminals and a daily newspaper. A pattern-matching speech recognition system is also described and its future in the area of technical aids discussed.

  • 24.
    Blomberg, Mats
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Carlson, R
    Elenius, K
    Granström, B
    Auditory models as front ends in speech-recognition systems (1986). Conference paper (Refereed)
    Abstract [en]

    Includes comments by Stefanie Seneff and Nelson Kiang.

  • 25.
    Blomberg, Mats
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Carlson, R
    Elenius, K
    Granström, B
    Auditory models in isolated word recognition (1984). Conference paper (Refereed)
    Abstract [en]

    A straightforward isolated word recognition system has been used to test different auditory models in acoustic front end processing. The models include BARK, PHON and SONE. The PHONTEMP model is based on PHON but also includes temporal forward masking. We also introduce a model, DOMIN, which is intended to measure the dominating frequency at each point along the 'basilar membrane.' All the above models were derived from an FFT-analysis, and the FFT processing is also used as a reference model. One male and one female speaker were used to test the recognition performance of the different models on a difficult vocabulary consisting of 18 Swedish consonants and 9 Swedish vowels. The results indicate that the performance of the models decreases as they become more complex. The overall recognition accuracy of FFT is 97% while it is 87% for SONE. However, the DOMIN model which is sensitive to dominant frequencies (formants) performs very well for vowels.

  • 26.
    Blomberg, Mats
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Carlson, R
    Elenius, K
    Granström, B
    Experiments with auditory models in speech recognition (1982). Conference paper (Refereed)
  • 27.
    Blomberg, Mats
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Carlson, R
    Elenius, K
    Granström, B
    Speech research at KTH - two projects and technology transfer (1985). Conference paper (Refereed)
  • 28.
    Blomberg, Mats
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Carlson, R
    Elenius, K
    Granström, B
    Hunnicutt, S
    Some current projects at KTH related to speech recognition (1986). Conference paper (Refereed)
    Abstract [en]

    Understanding and modelling the human speech understanding process requires knowledge in several domains, from auditory analysis of speech to higher linguistic processes. Integrating this knowledge into a coherent model is not the scope of this paper. Rather, we want to present some projects that may add to the understanding of some components that eventually could be built into a knowledge-based speech recognition system. One project is concerned with a framework to formulate and experiment with the earlier levels of speech analysis. Others deal with different kinds of auditory representations and methods for comparing speech sounds. Still another project studies the phonetic and orthographic properties of different European languages.

  • 29.
    Blomberg, Mats
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Carlson, R
    Elenius, K
    Granström, B
    Hunnicutt, S
    Taligenkänning baserad på ett text-till-talsystem [Speech recognition based on a text-to-speech system] (1987). Conference paper (Refereed)
  • 30.
    Blomberg, Mats
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Carlson, R
    Elenius, K
    Granström, B
    Hunnicutt, S
    Word recognition using synthesized reference templates (1988). Conference paper (Refereed)
    Abstract [en]

    A major problem in large‐vocabulary speech recognition is the collection of reference data and speaker normalization. In this paper, the use of synthetic speech is proposed as a means of handling this problem. An experimental scheme for such a speech recognition system will be described. A rule‐based speech synthesis procedure is used for generating the reference data. Ten male subjects participated in an experiment using a 26‐word test vocabulary recorded in a normal office room. The subjects were asked to read the words from a list with little instruction except to pronounce each word separately. The synthesis was used to build the references. No adjustments were done to the synthesis in this first stage. All the human speakers served better as reference than the synthesis. Differences between natural and synthetic speech have been analyzed in detail at the segmental level. Methods for updating the synthetic speech parameters from natural speech templates will be described. [This work has been supported by the Swedish Board of Technical Development.]

  • 31.
    Blomberg, Mats
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Carlson, R
    Elenius, K
    Granström, B
    Hunnicutt, S
    Word recognition using synthesized templates (1988). Conference paper (Refereed)
    Abstract [en]

    With the ultimate aim of creating a knowledge based speech understanding system, we have set up a conceptual framework named NEBULA. In this paper we briefly describe some of the components of this framework and also report on some experiments where we use a production component for generating reference data for the recognition. The production component in the form of a speech synthesis system will ideally make the collection of training data unnecessary. Preliminary results of an isolated word recognition experiment will be presented and discussed. Several methods of interfacing the production component to the recognition/evaluation component have been pursued.

  • 32.
    Blomberg, Mats
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Carlson, R
    Elenius, K
    Granström, B
    Hunnicutt, S
    Lindell, R
    Neovius, L
    An experimental dialog system: WAXHOLM (1993). Conference paper (Refereed)
    Abstract [en]

    Recently we have begun to build the basic tools for a generic speech-dialogue system, WAXHOLM. The main modules, their function and internal communication have been specified. The different components are connected through a computer network. A preliminary version of the system has been tested, using simplified versions of the modules. We will give a general overview of the system and describe some of the components in more detail. Application-specific data are collected with the help of Wizard-of-Oz techniques. The dialogue system is used during the data collection and the wizard only replaces the speech recognition module.

  • 33.
    Blomberg, Mats
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Carlson, R
    Elenius, K
    Granström, B
    Hunnicutt, S
    Lindell, R
    Neovius, L
    Speech recognition based on a text-to-speech synthesis system, 1987. Conference paper (Refereed)
    Abstract [en]

    A major problem in large-vocabulary speech recognition is the collection of reference data and speaker normalization. In this paper we propose the use of synthetic speech as a means of handling this problem. An experimental scheme for such a system will be described.

  • 34.
    Blomberg, Mats
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Carlson, R
    Elenius, K.O.E
    Granström, B
    Auditory models and isolated word recognition, 1983. In: STL-QPSR, Vol. 24, no 4, p. 1-15. Article in journal (Refereed)
    Abstract [en]

    A straightforward isolated word recognition system has been used to test different auditory models in acoustic front end processing. The models include BARK, PHON and SONE. The PHONTEMP model is based on PHON but also includes temporal forward masking. We also introduce a model, DOMIN, which is intended to measure the dominating frequency at each point along the 'basilar membrane.' All the above models were derived from an FFT-analysis, and the FFT processing is also used as a reference model. One male and one female speaker were used to test the recognition performance of the different models on a difficult vocabulary consisting of 18 Swedish consonants and 9 Swedish vowels. The results indicate that the performance of the models decreases as they become more complex. The overall recognition accuracy of FFT is 97% while it is 87% for SONE. However, the DOMIN model which is sensitive to dominant frequencies (formants) performs very well for vowels.

  • 35.
    Blomberg, Mats
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Carlson, R
    Elenius, K
    Granström, B
    Hunnicutt, S
    Word recognition using synthesized templates, 1988. In: STL-QPSR, Vol. 29, no 2-3, p. 69-81. Article in journal (Refereed)
    Abstract [en]

    With the ultimate aim of creating a knowledge based speech understanding system, we have set up a conceptual framework named NEBULA. In this paper we briefly describe some of the components of this framework and also report on some experiments where we use a production component for generating reference data for the recognition. The production component in the form of a speech synthesis system will ideally make the collection of training data unnecessary. Preliminary results of an isolated word recognition experiment will be presented and discussed. Several methods of interfacing the production component to the recognition/evaluation component have been pursued.

  • 36.
    Blomberg, Mats
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Elenius, Daniel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Estimating speaker characteristics for speech recognition, 2009. In: Proceedings of Fonetik 2009 / [ed] Peter Branderud, Hartmut Traunmüller, Stockholm: Stockholm University, 2009, p. 154-158. Conference paper (Other academic)
    Abstract [en]

    A speaker-characteristic-based hierarchic tree of speech recognition models is designed. The leaves of the tree contain model sets, which are created by transforming a conventionally trained set using leaf-specific speaker profile vectors. The non-leaf models are formed by merging the models of their child nodes. During recognition, a maximum likelihood criterion is followed to traverse the tree from the root to a leaf. The computational load for estimating one-dimensional (vocal tract length) and four-dimensional speaker profile vectors (vocal tract length, two spectral slope parameters and model variance scaling) is reduced to a fraction compared to that of an exhaustive search among all leaf nodes. Recognition experiments on children's connected digits using adult models exhibit similar recognition performance for the exhaustive and the one-dimensional tree search. Further error reduction is achieved with the four-dimensional tree. The estimated speaker properties are analyzed and discussed.

  • 37.
    Blomberg, Mats
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Elenius, Daniel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Investigating Explicit Model Transformations for Speaker Normalization, 2008. In: Proceedings of ISCA ITRW Speech Analysis and Processing for Knowledge Discovery / [ed] Paul Dalsgaard, Christian Fischer Pedersen, Ove Andersen, Aalborg, Denmark: ISCA/AAU, 2008. Conference paper (Refereed)
    Abstract [en]

    In this work we extend the test utterance adaptation technique used in vocal tract length normalization to a larger number of speaker characteristic features. We perform partially joint estimation of four features: the VTLN warping factor, the corner position of the piece-wise linear warping function, spectral tilt in voiced segments, and model variance scaling. In experiments on the Swedish PF-Star children database, joint estimation of warping factor and variance scaling lowered the recognition error rate compared to warping factor alone.
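    The piece-wise linear warping function with an adjustable corner position, as named in the abstract above, can be written down directly. This is a minimal sketch, not the authors' implementation; the function name, the default bandwidth, and the choice to construct the upper segment so the warped axis still ends at f_max are assumptions of this example.

```python
def piecewise_linear_warp(f, alpha, f_corner, f_max=8000.0):
    """Map frequency f (Hz) to a warped frequency.

    Below f_corner the axis is scaled by the warping factor alpha;
    above it, a second linear segment joins (f_corner, alpha * f_corner)
    to (f_max, f_max), so the warped axis still covers the full band.
    """
    if f <= f_corner:
        return alpha * f
    # Slope of the upper segment, chosen so that f_max maps to f_max.
    slope = (f_max - alpha * f_corner) / (f_max - f_corner)
    return alpha * f_corner + slope * (f - f_corner)
```

    Joint estimation would then search over both alpha and f_corner rather than alpha alone.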

  • 38.
    Blomberg, Mats
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Elenius, Daniel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Knowledge-Rich Model Transformations for Speaker Normalization in Speech Recognition, 2008. In: Proceedings, FONETIK 2008, Department of Linguistics, University of Gothenburg / [ed] Anders Eriksson, Jonas Lindh, 2008, p. 37-40. Conference paper (Other academic)
    Abstract [en]

    In this work we extend the test utterance adaptation technique used in vocal tract length normalization to a larger number of speaker characteristic features. We perform partially joint estimation of four features: the VTLN warping factor, the corner position of the piece-wise linear warping function, spectral tilt in voiced segments, and model variance scaling. In experiments on the Swedish PF-Star children database, joint estimation of warping factor and variance scaling lowers the recognition error rate compared to warping factor alone.

  • 39.
    Blomberg, Mats
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Elenius, Daniel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Tree-Based Estimation of Speaker Characteristics for Speech Recognition, 2009. In: INTERSPEECH 2009: 10TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2009, BAIXAS: ISCA-INST SPEECH COMMUNICATION ASSOC, 2009, p. 580-583. Conference paper (Refereed)
    Abstract [en]

    Speaker adaptation by means of adjustment of speaker characteristic properties, such as vocal tract length, has an important advantage over conventional adaptation techniques: the adapted models are guaranteed to be realistic if the descriptions of the properties are. One problem with this approach is that the search procedure to estimate them is computationally heavy. We address the problem by using a multi-dimensional, hierarchical tree of acoustic model sets. The leaf sets are created by transforming a conventionally trained model set using leaf-specific speaker profile vectors. The model sets of non-leaf nodes are formed by merging the models of their child nodes, using a computationally efficient algorithm. During recognition, a maximum likelihood criterion is followed to traverse the tree. Studies of one-dimensional (VTLN) and four-dimensional speaker profile vectors (VTLN, two spectral slope parameters and model variance scaling) exhibit a reduction of the computational load to a fraction compared to that of an exhaustive grid search. In recognition experiments on children's connected digits using adult and male models, the one-dimensional tree search performed as well as the exhaustive search. Further reduction was achieved with four dimensions. The best recognition results are 0.93% and 10.2% WER on TIDIGITS and PF-Star-Sw, respectively, using adult models.
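    The root-to-leaf maximum-likelihood traversal described above can be illustrated with a toy greedy descent. A minimal sketch under stated assumptions: the Node type and the loglik scoring callback are inventions of this example, standing in for the paper's merged acoustic model sets and utterance likelihoods.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    profile: dict                      # speaker profile vector of this node's merged model set
    children: list = field(default_factory=list)

def traverse(root, loglik):
    """Greedy maximum-likelihood descent from root to a leaf.

    loglik(profile) stands in for scoring the utterance against the
    model set built from that profile. Only the children along the
    chosen path are ever scored, which is what reduces the cost
    compared with scoring every leaf exhaustively.
    """
    node = root
    while node.children:
        node = max(node.children, key=lambda c: loglik(c.profile))
    return node.profile
```

    With a balanced tree over N leaves, the number of likelihood evaluations drops from N to roughly the branching factor times the tree depth.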

  • 40.
    Blomberg, Mats
    et al.
    KTH, Superseded Departments, Speech, Music and Hearing.
    Elenius, Daniel
    KTH, Superseded Departments, Speech, Music and Hearing.
    Zetterholm, Elisabeth
    Department of Philosophy & Linguistics, Umeå University.
    Speaker verification scores and acoustic analysis of a professional impersonator, 2004. In: Proceedings of Fonetik 2004: The XVIIth Swedish Phonetics Conference / [ed] Peter Branderud, Hartmut Traunmüller, Stockholm: Stockholm University, 2004, p. 84-87. Conference paper (Other academic)
  • 41.
    Blomberg, Mats
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Elenius, K
    A device for automatic speech recognition, 1982. Conference paper (Refereed)
    Abstract [en]

    This paper is a translation of a paper originally published in the proceedings of the 1982 meeting of "Nordiska akustiska sällskapet" (The Nordic Acoustical Society), pp. 383-386.

  • 42.
    Blomberg, Mats
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Elenius, K
    Creation of unseen triphones from diphones and monophones using a speech production approach, 1996. Conference paper (Refereed)
    Abstract [en]

    With limited training data, infrequent triphone models for speech recognition will not be observed in sufficient number. In this report, a speech production approach is used to predict the characteristics of unseen triphones by concatenating diphones and/or monophones in the parametric representation of a formant speech synthesiser. The parameter trajectories are estimated by interpolation between the endpoints of the original units. The spectral states of the created triphone are generated by the speech synthesiser. Evaluation of the proposed technique has been performed using spectral error measurements and recognition candidate rescoring of N-best lists. In both cases, the created triphones are shown to perform better than the shorter units from which they were constructed.
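    The endpoint interpolation described above can be sketched as a simple linear crossfade of synthesis parameter vectors. The function name and framing are assumptions for illustration; the actual synthesiser parameters (formant frequencies, bandwidths, etc.) and interpolation schedule are not specified here.

```python
def interpolate_trajectory(left_end, right_start, n_frames):
    """Linearly interpolate synthesis parameters across a unit join.

    left_end / right_start are parameter vectors (e.g. formant
    frequencies) at the boundary of the two concatenated units;
    returns n_frames intermediate vectors forming the new transition.
    """
    frames = []
    for i in range(1, n_frames + 1):
        t = i / (n_frames + 1)          # interpolation weight, 0 < t < 1
        frames.append([(1 - t) * a + t * b
                       for a, b in zip(left_end, right_start)])
    return frames
```

    The interpolated trajectory would then be rendered by the formant synthesiser to produce spectral states for the unseen triphone.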

  • 43.
    Blomberg, Mats
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Elenius, K
    Effects of emphasizing transitional or stationary parts of the speech signal in a discrete utterance recognition system, 1982. Conference paper (Refereed)
    Abstract [en]

    A pattern matching word recognition system has been modified in order to emphasize the transient parts of speech in the similarity measure. The technique is to weight the word distances with a normalized spectral change function. A small positive effect is measured. Emphasizing the stationary parts is shown to substantially decrease the performance. Adding the time derivative of the speech parameters to the word patterns improves performance significantly. This is probably a consequence of an improvement in the description of the transient segments.
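    The time-derivative features credited above can be approximated by appending a first difference to each feature frame. This is a hypothetical helper for illustration, not the authors' implementation; modern systems typically use a regression window rather than a raw first difference.

```python
def add_deltas(frames):
    """Append a first-difference delta to each feature frame.

    frames is a list of feature vectors; the delta of frame t is
    frames[t] - frames[t-1] (zeros for the first frame), a crude
    stand-in for the time derivative of the speech parameters.
    """
    out = []
    prev = frames[0]
    for i, f in enumerate(frames):
        delta = [a - b for a, b in zip(f, prev)] if i > 0 else [0.0] * len(f)
        out.append(list(f) + delta)    # static features followed by deltas
        prev = f
    return out
```
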

  • 44.
    Blomberg, Mats
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Elenius, K
    Nonlinear Frequency Warp for Speech Recognition, 1986. Conference paper (Refereed)
    Abstract [en]

    A technique of nonlinear frequency warping has been investigated for recognition of Swedish vowels. A frequency warp between two spectra is computed using a standard dynamic programming algorithm. The frequency distance, defined as the area between the obtained warping function and the diagonal, contributes to the spectral distance. The distance between two spectra is a weighted sum of the warped amplitude distance and the frequency distance. By changing two weights, we get a gradual shift between non-warped amplitude distance, warped amplitude distance, and frequency distance. In recognition experiments on natural and synthetic vowel spectra, a metric combining the frequency and amplitude distances gave better results than using only amplitude or frequency deviation. Analysis of the results for the synthetic vowels shows a reduced sensitivity to voice source and pitch variation. For the natural vowels, the recognition improvement is larger for the male and female speakers separately than for the combined groups.
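    The dynamic-programming warp with a weighted amplitude/frequency cost can be sketched as follows. The exact local cost and path constraints used in the paper are not given here, so this is an illustrative variant: a plain DTW over frequency bins where off-diagonal steps are penalized in proportion to their deviation from the diagonal.

```python
def frequency_warp_distance(s1, s2, w_amp=1.0, w_freq=1.0):
    """Dynamic-programming frequency warp between two spectra.

    s1, s2 are amplitude vectors over frequency bins. The local cost
    combines the warped amplitude difference with a penalty for
    deviating from the diagonal (a stand-in for the paper's
    'frequency distance'); w_amp / w_freq trade the two off.
    """
    n, m = len(s1), len(s2)
    INF = float("inf")
    D = [[INF] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            cost = (w_amp * abs(s1[i] - s2[j])
                    + w_freq * abs(i / (n - 1) - j / (m - 1)))
            best = 0.0 if i == j == 0 else min(
                D[i - 1][j] if i > 0 else INF,
                D[i][j - 1] if j > 0 else INF,
                D[i - 1][j - 1] if i > 0 and j > 0 else INF,
            )
            D[i][j] = cost + best
    return D[n - 1][m - 1]
```

    Setting w_freq = 0 recovers a purely amplitude-based warped distance, while a large w_freq pins the path to the diagonal, i.e. the non-warped case.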

  • 45.
    Blomberg, Mats
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Elenius, K
    Optimizing some parameters of a word recognizer used in car noise, 1990. In: STL-QPSR, Vol. 31, no 4, p. 43-52. Article in journal (Refereed)
    Abstract [en]

    A speaker-dependent word recognition system has been modified to improve the performance in noise. Problems with word detection and noise compensation have been addressed by using a close-talk microphone and a "noise addition" method. The reference templates are recorded in relative silence. The additional environmental noise during the recognition phase is measured and is "added" to the reference templates before using them for template matching. The recognition performance has been tested in moving cars with references recorded in parked cars. Recordings of six male speakers have been used in this report to test the sensitivity of the recognition system to some essential parameters. The results from six male speakers and a twenty-word vocabulary show that adapting the endpoint detection threshold to the noise level is essential for good performance and that noise compensation is important at signal-to-noise ratios below 15 dB.
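    The "noise addition" step lends itself to a short sketch. The paper does not specify the exact compensation formula, so this is an illustrative approximation: per-channel power addition of a stationary noise estimate to each frame of a clean template, with spectra assumed to be filter-bank magnitudes in dB.

```python
import math

def add_noise_to_template(template_db, noise_db):
    """'Add' measured environment noise to a clean reference template.

    template_db is a list of per-frame filter-bank spectra in dB;
    noise_db is a single noise spectrum in dB. Adding in the power
    domain approximates how the noise would have contaminated the
    template had it been recorded in that noise.
    """
    out = []
    for frame in template_db:
        out.append([10 * math.log10(10 ** (s / 10) + 10 ** (n / 10))
                    for s, n in zip(frame, noise_db)])
    return out
```

    Channels well above the noise floor are nearly unchanged, while low-energy channels are pulled up to the noise level, which is what makes the clean templates match noisy input.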

  • 46.
    Blomberg, Mats
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Elenius, K
    Statistical analysis of speech signals, 1970. In: STL-QPSR, Vol. 11, no 4, p. 1-8. Article in journal (Refereed)
    Abstract [en]

    This is a condensed report of a thesis study carried out at the Department of Speech Communication in 1970. The purpose was to determine, for continuous speech, the peak factor, the form factor, the long-time average spectrum of voiced and voiceless sections separately, the spectral density at different voice intensity levels, the distribution of the speech-wave amplitude, statistics on pause lengths, and the long-time average RMS of the speech wave. All tasks have been solved using the CDC computer of the Department.

  • 47.
    Blomberg, Mats
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Elenius, K
    Testing some essential parameters of a word recogniser used in car noise, 1989. Conference paper (Refereed)
    Abstract [en]

    A speaker-dependent word recognition system has been modified to improve the performance in noise. Problems with word detection and noise compensation have been addressed by using a close-talk microphone and a "noise addition" method. The reference templates are recorded in relative silence. The additional environmental noise during the recognition phase is measured and is "added" to the reference templates before using them for template matching. The recognition performance has been tested in moving cars with references recorded in parked cars. Recordings of six male speakers have been used in this report to test the sensitivity of the recognition system to some essential parameters. The results from six male speakers and a twenty-word vocabulary show that adapting the endpoint detection threshold to the noise level is essential for good performance and that noise compensation is important at signal-to-noise ratios below 15 dB.

  • 48.
    Blomberg, Mats
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Elenius, K
    Lundin, F
    Sundmalm, C
    Let your voice do the dialing, 1983. In: Telephony, ISSN 0040-2656, E-ISSN 2161-8690, p. 68-74. Article in journal (Refereed)
  • 49.
    Blomberg, Mats
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Elenius, K
    Ström, N
    Speech recognition in the Waxholm dialog system, 1994. Conference paper (Refereed)
    Abstract [en]

    The speech recognition component in the KTH "Waxholm" dialog system is described. It will handle continuous speech with a vocabulary of about 1000 words. The output of the recogniser is fed to a probabilistic, knowledge-based parser that contains a context-free grammar compiled into an augmented transition network.

  • 50.
    Blomberg, Mats
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Elenius, Kjell
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    House, David
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Karlsson, Inger A.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Research Challenges in Speech Technology: A Special Issue in Honour of Rolf Carlson and Björn Granström, 2009. In: Speech Communication, ISSN 0167-6393, E-ISSN 1872-7182, Vol. 51, no 7, p. 563-563. Article in journal (Refereed)