1 - 16 of 16
  • 1.
    Ananthakrishnan, Gopal
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Engwall, Olov
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Neiberg, Daniel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Exploring the Predictability of Non-Unique Acoustic-to-Articulatory Mappings (2012). In: IEEE Transactions on Audio, Speech, and Language Processing, ISSN 1558-7916, E-ISSN 1558-7924, Vol. 20, no. 10, p. 2672-2682. Article in journal (Refereed)
    Abstract [en]

    This paper explores statistical tools that help analyze the predictability in the acoustic-to-articulatory inversion of speech, using an Electromagnetic Articulography database of simultaneously recorded acoustic and articulatory data. Since it has been shown that speech acoustics can be mapped to non-unique articulatory modes, the variance of the articulatory parameters is not sufficient to understand the predictability of the inverse mapping. We, therefore, estimate an upper bound to the conditional entropy of the articulatory distribution. This provides a probabilistic estimate of the range of articulatory values (either over a continuum or over discrete non-unique regions) for a given acoustic vector in the database. The analysis is performed for different British/Scottish English consonants with respect to which articulators (lips, jaws or the tongue) are important for producing the phoneme. The paper shows that acoustic-articulatory mappings for the important articulators have a low upper bound on the entropy, but can still have discrete non-unique configurations.

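    As a rough illustration of the kind of entropy bound described above, here is a minimal sketch that bounds the conditional entropy of articulatory vectors sharing an acoustic neighbourhood, using the standard Gaussian-mixture entropy upper bound H(X) <= sum_m w_m (H(N_m) - ln w_m). The function names, neighbourhood size and component count are illustrative assumptions, not the paper's actual estimator:

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.neighbors import NearestNeighbors

def conditional_entropy_upper_bound(acoustic, articulatory, k=200, n_components=4):
    """Upper-bound H(articulatory | acoustic) in nats, averaged over queries.

    For each queried acoustic frame, fit a GMM to the articulatory vectors
    of its k nearest acoustic neighbours and apply the mixture bound
    H(X) <= sum_m w_m * (H(N_m) - ln w_m).
    """
    nn = NearestNeighbors(n_neighbors=k).fit(acoustic)
    bounds = []
    for i in range(0, len(acoustic), 50):            # subsample query frames
        _, idx = nn.kneighbors(acoustic[i:i + 1])
        gmm = GaussianMixture(n_components, covariance_type='full',
                              reg_covar=1e-6).fit(articulatory[idx[0]])
        h = 0.0
        for w, cov in zip(gmm.weights_, gmm.covariances_):
            h_gauss = 0.5 * np.log(np.linalg.det(2 * np.pi * np.e * cov))
            h += w * (h_gauss - np.log(w))           # mixture entropy bound
        bounds.append(h)
    return float(np.mean(bounds))
```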
  • 2.
    Chatterjee, Saikat
    KTH, School of Electrical Engineering (EES), Communication Theory. KTH, School of Electrical Engineering (EES), Centres, ACCESS Linnaeus Centre.
    Kleijn, W. Bastiaan
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Auditory Model-Based Design and Optimization of Feature Vectors for Automatic Speech Recognition (2011). In: IEEE Transactions on Audio, Speech, and Language Processing, ISSN 1558-7916, E-ISSN 1558-7924, Vol. 19, no. 6, p. 1813-1825. Article in journal (Refereed)
    Abstract [en]

    Using spectral and spectro-temporal auditory models along with perturbation-based analysis, we develop a new framework to optimize a feature vector such that it emulates the behavior of the human auditory system. The optimization is carried out in an offline manner based on the conjecture that the local geometries of the feature vector domain and the perceptual auditory domain should be similar. Using this principle along with a static spectral auditory model, we modify and optimize the static spectral mel frequency cepstral coefficients (MFCCs) without considering any feedback from the speech recognition system. We then extend the work to include spectro-temporal auditory properties into designing a new dynamic spectro-temporal feature vector. Using a spectro-temporal auditory model, we design and optimize the dynamic feature vector to incorporate the behavior of human auditory response across time and frequency. We show that a significant improvement in automatic speech recognition (ASR) performance is obtained for any environmental condition, clean as well as noisy.

  • 3. Hendriks, Richard C.
    Gerkmann, Timo
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Noise Correlation Matrix Estimation for Multi-Microphone Speech Enhancement (2012). In: IEEE Transactions on Audio, Speech, and Language Processing, ISSN 1558-7916, E-ISSN 1558-7924, Vol. 20, no. 1, p. 223-233. Article in journal (Refereed)
    Abstract [en]

    For multi-channel noise reduction algorithms like the minimum variance distortionless response (MVDR) beamformer, or the multi-channel Wiener filter, an estimate of the noise correlation matrix is needed. For its estimation, it is often proposed in the literature to use a voice activity detector (VAD). However, using a VAD the estimated matrix can only be updated during speech absence. As a result, during speech presence the noise correlation matrix estimate cannot follow changing noise fields with sufficient accuracy. This effect is compounded by the fact that voice activity detection is rather difficult in nonstationary noise, and false alarms are likely to occur. In this paper, we present and analyze an algorithm that estimates the noise correlation matrix without using a VAD. This algorithm is based on measuring the correlation of the noisy input and a noise reference which can be obtained, e.g., by steering a null towards the target source. When applied in combination with an MVDR beamformer, it is shown that the proposed noise correlation matrix estimate results in a more accurate beamformer response, a larger signal-to-noise ratio improvement and a larger instrumentally predicted speech intelligibility when compared to competing algorithms such as the generalized sidelobe canceler, a VAD-based MVDR beamformer, and an MVDR based on the noisy correlation matrix.

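    For context, the noise correlation matrix R_n enters the MVDR beamformer as w = R_n^{-1} d / (d^H R_n^{-1} d). A minimal numpy sketch per frequency bin, with a simple recursive update from a noise reference in the spirit of the VAD-free idea; the smoothing factor and function names are assumptions, not the paper's algorithm:

```python
import numpy as np

def mvdr_weights(R_noise, steering):
    """MVDR beamformer weights: w = R_n^{-1} d / (d^H R_n^{-1} d).

    R_noise : (M, M) estimated noise correlation matrix for one frequency bin
    steering: (M,) steering vector d towards the target source
    """
    Rinv_d = np.linalg.solve(R_noise, steering)
    return Rinv_d / (steering.conj() @ Rinv_d)

def update_R(R, noise_ref, alpha=0.95):
    """Recursive update of R_n from a noise reference frame (e.g., obtained
    by steering a null towards the target), usable during speech presence."""
    return alpha * R + (1 - alpha) * np.outer(noise_ref, noise_ref.conj())
```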
  • 4.
    Holzapfel, André
    Universitat Pompeu Fabra, Spain .
    Davies, Matthew E. P.
    Zapata, José R.
    Oliveira, Joao Lobato
    Gouyon, Fabien
    Selective sampling for beat tracking evaluation (2012). In: IEEE Transactions on Audio, Speech, and Language Processing, ISSN 1558-7916, E-ISSN 1558-7924, Vol. 20, no. 9, p. 2539-2548. Article in journal (Refereed)
    Abstract [en]

    In this paper, we propose a method that can identify challenging music samples for beat tracking without ground truth. Our method, motivated by the machine learning method "selective sampling," is based on the measurement of mutual agreement between beat sequences. In calculating this mutual agreement we show the critical influence of different evaluation measures. Using our approach we demonstrate how to compile a new evaluation dataset comprised of difficult excerpts for beat tracking and examine this difficulty in the context of perceptual and musical properties. Based on tag analysis we indicate the musical properties where future advances in beat tracking research would be most profitable and where beat tracking is too difficult to be attempted. Finally, we demonstrate how our mutual agreement method can be used to improve beat tracking accuracy on large music collections.

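    A rough sketch of the mutual-agreement idea: score every pair of beat trackers against each other with an F-measure and average. The ±70 ms tolerance is a common choice in beat-tracking evaluation, and the greedy many-to-one matching here is a simplification; as the abstract notes, the choice of evaluation measure is critical:

```python
import numpy as np
from itertools import combinations

def beat_f_measure(est, ref, tol=0.07):
    """F-measure between two beat-time sequences in seconds (+-tol window).
    Simplified: no strict one-to-one matching of beats."""
    est, ref = np.asarray(est), np.asarray(ref)
    if len(est) == 0 or len(ref) == 0:
        return 0.0
    hits = sum(np.min(np.abs(ref - t)) <= tol for t in est)
    if hits == 0:
        return 0.0
    precision, recall = hits / len(est), hits / len(ref)
    return 2 * precision * recall / (precision + recall)

def mean_mutual_agreement(beat_sequences):
    """Average pairwise agreement over a committee of beat trackers.
    Low values flag excerpts that are hard to track, without ground truth."""
    scores = [beat_f_measure(a, b) for a, b in combinations(beat_sequences, 2)]
    return float(np.mean(scores))
```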
  • 5.
    Holzapfel, André
    Institute of Computer Science.
    Stylianou, Yannis
    Musical genre classification using Nonnegative Matrix Factorization based features (2008). In: IEEE Transactions on Audio, Speech, and Language Processing, ISSN 1558-7916, E-ISSN 1558-7924, Vol. 16, no. 2, p. 424-434. Article in journal (Refereed)
    Abstract [en]

    Nonnegative matrix factorization (NMF) is used to derive a novel description for the timbre of musical sounds. Using NMF, a spectrogram is factorized, providing a characteristic spectral basis. Given a set of spectrograms from a musical genre, the space spanned by the vectors of the obtained spectral bases is modeled statistically using mixtures of Gaussians, resulting in a description of the spectral basis for this musical genre. This description is shown to improve classification results by up to 23.3% compared to MFCC-based models, while the compression performed by the factorization decreases training time significantly. Using a distance-based stability measure, this compression is shown to reduce the noise present in the data set, resulting in more stable classification models. In addition, we compare the mean squared errors of the approximation to a spectrogram using independent component analysis and nonnegative matrix factorization, showing the superiority of the latter approach.

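    A minimal sketch of the feature-extraction step, factorizing a magnitude spectrogram V ~= W H and keeping the spectral basis W, using scikit-learn's NMF and librosa for audio loading; the STFT parameters and component count are illustrative assumptions:

```python
import numpy as np
import librosa
from sklearn.decomposition import NMF

def spectral_basis(path, n_components=16):
    """Return the NMF spectral basis W of a track's magnitude spectrogram.
    Columns of W are the timbre descriptors that would then be modelled
    per genre with a Gaussian mixture."""
    y, sr = librosa.load(path, sr=22050, mono=True)
    V = np.abs(librosa.stft(y, n_fft=1024, hop_length=512))
    nmf = NMF(n_components=n_components, init='nndsvd', max_iter=400)
    W = nmf.fit_transform(V)                         # (freq_bins, n_components)
    return W / (W.sum(axis=0, keepdims=True) + 1e-12)  # normalise each column
```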
  • 6.
    Holzapfel, André
    Technological Education Institute, Greece.
    Stylianou, Yannis
    Scale transform in rhythmic similarity of music (2011). In: IEEE Transactions on Audio, Speech, and Language Processing, ISSN 1558-7916, E-ISSN 1558-7924, Vol. 19, no. 1, p. 176-185. Article in journal (Refereed)
    Abstract [en]

    As a special case of the Mellin transform, the scale transform has been applied in various signal processing areas in order to obtain a signal description that is invariant to scale changes. In this paper, the scale transform is applied to autocorrelation sequences derived from music signals. It is shown that two such sequences, when derived from similar rhythms with different tempo, differ mainly by a scaling factor. By using the scale transform, the proposed descriptors are robust to tempo changes, and are especially suited for the comparison of pieces with different tempi but similar rhythm. As music with such characteristics is widely encountered in traditional forms of music, the performance of the descriptors is evaluated in a classification task of Greek traditional dances and Turkish traditional songs. On these datasets, accuracies improve by more than 20% compared to approaches that are not robust to tempo, while on a dataset of Western music the achieved accuracy improves on previously presented results.

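    A sketch of a discrete scale transform for an autocorrelation sequence, under the usual construction: substituting t = e^tau in X(c) = integral x(t) t^{-jc-1/2} dt turns a time scaling into a shift along tau, so after exponential-lag resampling, e^{tau/2} weighting and a DFT, the magnitude is tempo-invariant. Grid sizes here are illustrative assumptions:

```python
import numpy as np

def scale_transform_magnitude(r, n_out=64):
    """|Scale transform| of a positive-lag autocorrelation sequence r."""
    r = np.asarray(r, dtype=float)
    n = len(r)
    tau = np.linspace(0.0, np.log(n - 1), 512)       # log-lag axis
    warped = np.interp(np.exp(tau), np.arange(n), r)  # exponential resampling
    weighted = warped * np.exp(tau / 2)               # Mellin weighting
    return np.abs(np.fft.rfft(weighted))[:n_out]      # scale-invariant magnitude
```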
  • 7.
    Holzapfel, André
    Institute of Computer Science; University of Crete, Greece.
    Stylianou, Yannis
    Gedik, Ali C.
    Bozkurt, Baris
    Three dimensions of pitched instrument onset detection (2010). In: IEEE Transactions on Audio, Speech, and Language Processing, ISSN 1558-7916, E-ISSN 1558-7924, Vol. 18, no. 6, p. 1517-1527. Article in journal (Refereed)
    Abstract [en]

    In this paper, we suggest a novel group delay based method for the onset detection of pitched instruments. It is proposed to approach the problem of onset detection by examining three dimensions separately: phase (i.e., group delay), magnitude and pitch. The evaluation of the suggested onset detectors for phase, pitch and magnitude is performed using a new publicly available and fully onset annotated database of monophonic recordings which is balanced in terms of included instruments and onset samples per instrument, while it contains different performance styles. Results show that the accuracy of onset detection depends on the type of instruments as well as on the style of performance. Combining the information contained in the three dimensions by means of a fusion at decision level leads to an improvement of onset detection by about 8% in terms of F-measure, compared to the best single dimension.

  • 8.
    Ma, Zhanyu
    Beijing University of Posts and Telecommunications, China.
    Leijon, Arne
    KTH, School of Electrical Engineering (EES), Sound and Image Processing (Closed 130101).
    Kleijn, W. Bastiaan
    School of Engineering and Computer Science, Victoria University of Wellington, New Zealand.
    Vector Quantization of LSF Parameters With a Mixture of Dirichlet Distributions (2013). In: IEEE Transactions on Audio, Speech, and Language Processing, ISSN 1558-7916, E-ISSN 1558-7924, Vol. 21, no. 9, p. 1777-1790. Article in journal (Refereed)
    Abstract [en]

    Quantization of the linear predictive coding parameters is an important part of speech coding. Probability density function (PDF)-optimized vector quantization (VQ) has been previously shown to be more efficient than VQ based only on training data. For data with bounded support, some well-defined bounded-support distributions (e.g., the Dirichlet distribution) have been proven to outperform the conventional Gaussian mixture model (GMM), with the same number of free parameters required to describe the model. When exploiting both the boundary and the order properties of the line spectral frequency (LSF) parameters, the distribution of LSF differences (Delta LSF) can be modelled with a Dirichlet mixture model (DMM). We propose a corresponding DMM-based VQ (DVQ). The elements in a Dirichlet vector variable are highly mutually correlated. Motivated by the Dirichlet vector variable's neutrality property, a practical non-linear transformation scheme for the Dirichlet vector variable can be obtained. Similar to the Karhunen-Loeve transform for Gaussian variables, this non-linear transformation decomposes the Dirichlet vector variable into a set of independent beta-distributed variables. Using high-rate quantization theory under an entropy constraint, the optimal inter- and intra-component bit allocation strategies are proposed. In the implementation of scalar quantizers, we use constrained-resolution coding to approximate the derived constrained-entropy coding. A practical coding scheme for DVQ is designed for the purpose of reducing quantization error accumulation. The theoretical and practical quantization performance of DVQ is evaluated. Compared to the state-of-the-art GMM-based VQ and the recently proposed beta mixture model (BMM) based VQ, DVQ performs better, with even fewer free parameters and lower computational cost.

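    The neutrality-based decomposition mentioned above is the standard stick-breaking property of the Dirichlet distribution: if x ~ Dir(a_1, ..., a_K), the ratios u_k = x_k / (1 - x_1 - ... - x_{k-1}) are independent Beta(a_k, a_{k+1} + ... + a_K) variables. A small numpy illustration (not the paper's quantizer):

```python
import numpy as np

def dirichlet_to_betas(x):
    """Stick-breaking ratios of a probability vector x (sums to 1).
    For Dirichlet x, these are independent beta variables -- the non-linear
    analogue of Karhunen-Loeve decorrelation used before scalar quantization."""
    x = np.asarray(x, dtype=float)
    remaining = 1.0 - np.concatenate(([0.0], np.cumsum(x[:-1])))
    return x[:-1] / remaining[:-1]    # last component is determined by the rest

# Quick check: the ratios are (empirically) uncorrelated
rng = np.random.default_rng(0)
samples = rng.dirichlet([2.0, 3.0, 4.0, 5.0], size=10000)
u = np.apply_along_axis(dirichlet_to_betas, 1, samples)
print(np.corrcoef(u, rowvar=False).round(2))
```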
  • 9. Mancini, Maurizio
    Bresin, Roberto
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Music Acoustics.
    Pelachaud, Catherine
    A virtual head driven by music expressivity (2007). In: IEEE Transactions on Audio, Speech, and Language Processing, ISSN 1558-7916, E-ISSN 1558-7924, Vol. 15, no. 6, p. 1833-1841. Article in journal (Refereed)
    Abstract [en]

    In this paper, we present a system that visualizes the expressive quality of a music performance using a virtual head. We provide a mapping through several parameter spaces: on the input side, we have elaborated a mapping between values of acoustic cues and emotion as well as expressivity parameters; on the output side, we propose a mapping between these parameters and the behaviors of the virtual head. This mapping ensures a coherency between the acoustic source and the animation of the virtual head. After presenting some background information on behavior expressivity of humans, we introduce our model of expressivity. We explain how we have elaborated the mapping between the acoustic and the behavior cues. Then, we describe the implementation of a working system that controls the behavior of a human-like head that varies depending on the emotional and acoustic characteristics of the musical execution. Finally, we present the tests we conducted to validate our mapping between the emotive content of the music performance and the expressivity parameters.

  • 10.
    Mohammadiha, Nasser
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Leijon, Arne
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Nonnegative HMM for Babble Noise Derived from Speech HMM: Application to Speech Enhancement (2013). In: IEEE Transactions on Audio, Speech, and Language Processing, ISSN 1558-7916, E-ISSN 1558-7924, Vol. 21, no. 5, p. 998-1011. Article in journal (Refereed)
    Abstract [en]

    Deriving a good model for multitalker babble noise can facilitate different speech processing algorithms, e.g., noise reduction, to reduce the so-called cocktail party difficulty. In the available systems, the fact that the babble waveform is generated as a sum of N different speech waveforms is not exploited explicitly. In this paper, first we develop a gamma hidden Markov model for power spectra of the speech signal, and then formulate it as a sparse nonnegative matrix factorization (NMF). Second, the sparse NMF is extended by relaxing the sparsity constraint, and a novel model for babble noise (gamma nonnegative HMM) is proposed in which the babble basis matrix is the same as the speech basis matrix, and only the activation factors (weights) of the basis vectors are different for the two signals over time. Finally, a noise reduction algorithm is proposed using the derived speech and babble models. All of the stationary model parameters are estimated using the expectation-maximization (EM) algorithm, whereas the time-varying parameters, i.e., the gain parameters of speech and babble signals, are estimated using a recursive EM algorithm. The objective and subjective listening evaluations show that the proposed babble model and the final noise reduction algorithm significantly outperform the conventional methods.

  • 11.
    Mohammadiha, Nasser
    KTH, School of Electrical Engineering (EES), Communication Theory.
    Smaragdis, Paris
    University of Illinois at Urbana-Champaign.
    Leijon, Arne
    KTH, School of Electrical Engineering (EES), Communication Theory.
    Supervised and unsupervised speech enhancement using nonnegative matrix factorization (2013). In: IEEE Transactions on Audio, Speech, and Language Processing, ISSN 1558-7916, E-ISSN 1558-7924, Vol. 21, no. 10, p. 2140-2151. Article in journal (Refereed)
    Abstract [en]

    Reducing the interference noise in a monaural noisy speech signal has been a challenging task for many years. Compared to traditional unsupervised speech enhancement methods, e.g., Wiener filtering, supervised approaches, such as algorithms based on hidden Markov models (HMM), lead to higher-quality enhanced speech signals. However, the main practical difficulty of these approaches is that for each noise type a model is required to be trained a priori. In this paper, we investigate a new class of supervised speech denoising algorithms using nonnegative matrix factorization (NMF). We propose a novel speech enhancement method that is based on a Bayesian formulation of NMF (BNMF). To circumvent the mismatch problem between the training and testing stages, we propose two solutions. First, we use an HMM in combination with BNMF (BNMF-HMM) to derive a minimum mean square error (MMSE) estimator for the speech signal with no information about the underlying noise type. Second, we suggest a scheme to learn the required noise BNMF model online, which is then used to develop an unsupervised speech enhancement system. Extensive experiments are carried out to investigate the performance of the proposed methods under different conditions. Moreover, we compare the performance of the developed algorithms with state-of-the-art speech enhancement schemes using various objective measures. Our simulations show that the proposed BNMF-based methods outperform the competing algorithms substantially.

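    For orientation, here is the plain (non-Bayesian) supervised NMF baseline that this line of work builds on: learn speech and noise bases on training spectrograms, infer activations on the noisy spectrogram with the bases fixed, and apply a Wiener-style mask. This is a textbook sketch, not the paper's BNMF-HMM; all names and parameters are illustrative:

```python
import numpy as np

def nmf_kl(V, W=None, rank=32, n_iter=100, eps=1e-12, seed=0):
    """Multiplicative-update NMF for the KL divergence, V ~= W @ H.
    If W is given, it stays fixed and only the activations H are learned
    (the situation at enhancement time)."""
    rng = np.random.default_rng(seed)
    F, T = V.shape
    learn_W = W is None
    W = rng.random((F, rank)) + eps if learn_W else W
    H = rng.random((W.shape[1], T)) + eps
    for _ in range(n_iter):
        H *= (W.T @ (V / (W @ H + eps))) / (W.sum(axis=0)[:, None] + eps)
        if learn_W:
            W *= ((V / (W @ H + eps)) @ H.T) / (H.sum(axis=1)[None, :] + eps)
    return W, H

def enhance(noisy_mag, W_speech, W_noise):
    """Wiener-style magnitude mask from speech/noise activations on fixed bases."""
    W = np.hstack([W_speech, W_noise])
    _, H = nmf_kl(noisy_mag, W=W)
    k = W_speech.shape[1]
    speech_part = W_speech @ H[:k]
    noise_part = W_noise @ H[k:]
    return noisy_mag * speech_part / (speech_part + noise_part + 1e-12)
```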
  • 12. Mossavat, Iman
    Petkov, Petko N.
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Kleijn, W. Bastiaan
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Amft, Oliver
    A Hierarchical Bayesian Approach to Modeling Heterogeneity in Speech Quality Assessment (2012). In: IEEE Transactions on Audio, Speech, and Language Processing, ISSN 1558-7916, E-ISSN 1558-7924, Vol. 20, no. 1, p. 136-146. Article in journal (Refereed)
    Abstract [en]

    The development of objective speech quality measures generally involves fitting a model to subjective rating data. A typical data set comprises ratings generated by listening tests performed in different languages and across different laboratories. These factors as well as others, such as the sex and age of the talker, influence the subjective ratings and result in data heterogeneity. We use a linear hierarchical Bayes (HB) structure to account for heterogeneity. To make the structure effective, we develop a variational Bayesian inference for the linear HB structure that approximates not only the posterior over the model parameters, but also the model evidence. Using the approximate model evidence we are able to study and exploit the heterogeneity inducing factors in the Bayesian framework. The new approach yields a simple linear predictor with state-of-the-art predictive performance. Our experiments show that the new method compares favorably with systems based on more complex predictor structures such as ITU-T recommendation P.563, Bayesian MARS, and Gaussian processes.

  • 13.
    Ozerov, Alexey
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Philippe, Pierrick
    Bimbot, Frederic
    Gribonval, Remi
    Adaptation of Bayesian models for single-channel source separation and its application to voice/music separation in popular songs (2007). In: IEEE Transactions on Audio, Speech, and Language Processing, ISSN 1558-7916, E-ISSN 1558-7924, Vol. 15, no. 5, p. 1564-1578. Article in journal (Refereed)
    Abstract [en]

    Probabilistic approaches can offer satisfactory solutions to source separation with a single channel, provided that the models of the sources accurately match the statistical properties of the mixed signals. However, it is not always possible to train such models. To overcome this problem, we propose to resort to an adaptation scheme for adjusting the source models with respect to the actual properties of the signals observed in the mix. In this paper, we introduce a general formalism for source model adaptation which is expressed in the framework of Bayesian models. Particular cases of the proposed approach are then investigated experimentally on the problem of separating voice from music in popular songs. The obtained results show that an adaptation scheme can consistently and significantly improve the separation performance in comparison with nonadapted models.

  • 14.
    Petkov, Petko N.
    KTH, School of Electrical Engineering (EES), Communication Theory.
    Henter, Gustav Eje
    KTH, School of Electrical Engineering (EES), Communication Theory.
    Kleijn, W. Bastiaan
    KTH, School of Electrical Engineering (EES), Communication Theory.
    Maximizing Phoneme Recognition Accuracy for Enhanced Speech Intelligibility in Noise (2013). In: IEEE Transactions on Audio, Speech, and Language Processing, ISSN 1558-7916, E-ISSN 1558-7924, Vol. 21, no. 5, p. 1035-1045. Article in journal (Refereed)
    Abstract [en]

    An effective measure of speech intelligibility is the probability of correct recognition of the transmitted message. We propose a speech pre-enhancement method based on matching the recognized text to the text of the original message. The selected criterion is accurately approximated by the probability of the correct transcription given an estimate of the noisy speech features. In the presence of environment noise, and with a decrease in the signal-to-noise ratio, speech intelligibility declines. We implement a speech pre-enhancement system that optimizes the proposed criterion for the parameters of two distinct speech modification strategies under an energy-preservation constraint. The proposed method requires prior knowledge in the form of a transcription of the transmitted message and acoustic speech models from an automatic speech recognition system. Performance results from an open-set subjective intelligibility test indicate a significant improvement over natural speech and a reference system that optimizes a perceptual-distortion-based objective intelligibility measure. The computational complexity of the approach permits use in on-line applications.

  • 15. Taal, Cees H.
    Hendriks, R. C.
    Heusdens, R.
    Jensen, J.
    An Algorithm for Intelligibility Prediction of Time-Frequency Weighted Noisy Speech (2011). In: IEEE Transactions on Audio, Speech, and Language Processing, ISSN 1558-7916, E-ISSN 1558-7924, Vol. 19, p. 2125-2136. Article in journal (Refereed)
    Abstract [en]

    In the development process of noise-reduction algorithms, an objective machine-driven intelligibility measure which shows high correlation with speech intelligibility is of great interest. Besides reducing time and costs compared to real listening experiments, an objective intelligibility measure could also help provide answers on how to improve the intelligibility of noisy unprocessed speech. In this paper, a short-time objective intelligibility measure (STOI) is presented, which shows high correlation with the intelligibility of noisy and time-frequency weighted noisy speech (e.g., resulting from noise reduction) of three different listening experiments. In general, STOI showed better correlation with speech intelligibility compared to five other reference objective intelligibility models. In contrast to other conventional intelligibility models which tend to rely on global statistics across entire sentences, STOI is based on shorter time segments (386 ms). Experiments indeed show that it is beneficial to take segment lengths of this order into account. In addition, a free Matlab implementation is provided.

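    The free implementation mentioned in the abstract is in Matlab. Assuming the third-party Python port pystoi (a community reimplementation, not the authors' code) and placeholder file names, usage looks roughly like this:

```python
# pip install pystoi soundfile
import soundfile as sf
from pystoi import stoi

clean, fs = sf.read('clean.wav')        # reference speech
denoised, _ = sf.read('denoised.wav')   # processed speech, same length and rate

d = stoi(clean, denoised, fs, extended=False)  # in [0, 1]; higher = more intelligible
print(f'STOI = {d:.3f}')
```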
  • 16.
    Taal, Cees H.
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Hendriks, Richard C.
    Heusdens, Richard
    A Low-Complexity Spectro-Temporal Distortion Measure for Audio Processing Applications (2012). In: IEEE Transactions on Audio, Speech, and Language Processing, ISSN 1558-7916, E-ISSN 1558-7924, Vol. 20, no. 5, p. 1553-1564. Article in journal (Refereed)
    Abstract [en]

    Perceptual models exploiting auditory masking are frequently used in audio and speech processing applications like coding and watermarking. In most cases, these models only take into account spectral masking in short-time frames. As a consequence, undesired audible artifacts in the temporal domain may be introduced (e.g., pre-echoes). In this article we present a new low-complexity spectro-temporal distortion measure. The model facilitates the computation of analytic expressions for masking thresholds, while advanced spectro-temporal models typically need computationally demanding adaptive procedures to find an estimate of these masking thresholds. We show that the proposed method gives similar masking predictions as an advanced spectro-temporal model with only a fraction of its computational power. The proposed method is also compared with a spectral-only model by means of a listening test. From this test it can be concluded that for non-stationary frames the spectral model underestimates the audibility of introduced errors and therefore overestimates the masking curve. As a consequence, the system of interest incorrectly assumes that errors are masked in a particular frame, which leads to audible artifacts. This is not the case with the proposed method which correctly detects the errors made in the temporal structure of the signal.
