kth.se Publications
  • 1.
    Alhaj Moussa, Obada
    et al.
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Li, Minyue
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Kleijn, W. Bastiaan
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Predictive Audio Coding Using Rate-Distortion-Optimal Pre-and-Post-Filtering. 2011. In: Applications of Signal Processing to Audio and Acoustics (WASPAA), 2011 IEEE Workshop on, IEEE conference proceedings, 2011, p. 213-216. Conference paper (Refereed)
    Abstract [en]

    A natural approach to audio coding is to use a rate-distortion optimal design combined with a perceptual model. While this approach is common in transform coding, existing predictive-coding based audio coders are generally not optimal and they benefit from heuristically motivated post-filtering. As delay requirements often force the use of predictive coding, we consider audio coding with a pre- and post-filtered predictive structure that was recently proven to be asymptotically optimal in the rate-distortion sense [1]. We show that this audio coding is efficient in achieving the state-of-the-art performance. We also show that the pre-filter plays a relatively minor role. This leads to an analytic approach for optimizing the post-filter and the predictor at each rate, eliminating the need for manual re-tuning whenever a different rate is called for. In a subjective test, the theoretically optimized post-filter provided a better performance than a conventional post-filter.

  • 2.
    Altosaar, Toomas
    et al.
    Aalto Univ. School of Science and Tech., Dept. of Signal Proc. & Acoustics.
    ten Bosch, Louis
    Radboud University Nijmegen, Language and Speech unit.
    Aimetti, Guillaume
    Univ. of Sheffield, Speech & Hearing group, Dept. of Computer Science.
    Koniaris, Christos
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Demuynck, Kris
    K.U.Leuven - ESAT/PSI.
    van den Heuvel, Henk
    Radboud University Nijmegen, Language and Speech unit.
    A Speech Corpus for Modeling Language Acquisition: CAREGIVER. 2010. In: 7th International Conference on Language Resources and Evaluation (LREC) 2010, Valletta, Malta / [ed] Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis and Mike Rosner and Daniel Tapias, European Language Resources Association (ELRA), 2010, p. 1062-1068. Conference paper (Refereed)
    Abstract [en]

    A multi-lingual speech corpus used for modeling language acquisition called CAREGIVER has been designed and recorded within the framework of the EU funded Acquisition of Communication and Recognition Skills (ACORNS) project. The paper describes the motivation behind the corpus and its design by relying on current knowledge regarding infant language acquisition. Instead of recording infants and children, the voices of their primary and secondary caregivers were captured in both infant-directed and adult-directed speech modes over four languages in a read speech manner. The challenges and methods applied to obtain similar prompts in terms of complexity and semantics across different languages, as well as the normalized recording procedures employed at different locations, are covered. The corpus contains nearly 66000 utterance-based audio files spoken over a two-year period by 17 male and 17 female native speakers of Dutch, English, Finnish, and Swedish. An orthographical transcription is available for every utterance. Time-aligned word and phone annotations also exist for many of the sub-corpora. The CAREGIVER corpus will be published via ELRA.

  • 3.
    Asteborg, Marcus
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Flexible Audio Coder. 2011. Independent thesis Advanced level (professional degree), 20 credits / 30 HE credits. Student thesis
    Abstract [en]

    As modern communications networks are heterogeneous and, therefore, highly variable, the design of source coders should reflect this network variability. It is desired that source coders are able to adapt instantly to any bit-rate constraint. Source coders that possess this property offer coding flexibility that facilitates optimal utilization of the available communication channel within a heterogeneous network. Flexible coders are able to utilize feedback information and therefore perform an instant re-optimization. This property implies that flexible audio coders are better suited for networks with high variability, because a single configuration of the flexible coder can operate at a continuum of bit-rates. The aim of the thesis is to implement a flexible audio coder in a real-time demonstrator (VoIP application) that facilitates instant re-optimization of the flexible coding scheme. The demonstrator provides real-time full-duplex communications over a packet network and the operating bit-rate may be adjusted on the fly. The coding performance of the flexible audio coding scheme should remain comparable to non-flexible schemes optimized at their operating bit-rates. The report provides a background for the thesis work and describes the real-time implementation of the demonstrator. Finally, test results are provided. The coder is evaluated by means of a subjective MUSHRA test. The performance of the flexible audio coder is compared to relevant state-of-the-art codecs.

  • 4.
    Bai, Hequn
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Mobile 3D Visual Search based on Local Stereo Image Features. 2012. Independent thesis Advanced level (degree of Master (Two Years)), 80 credits / 120 HE credits. Student thesis
    Abstract [en]

    Many recent applications using local image features focus on 2D image recognition. Such applications cannot distinguish between real objects and photos of objects. In this project, we present a 3D object recognition method using stereo images. Using the 3D information of the objects obtained from stereo images, objects with similar image descriptions but different 3D shapes can be distinguished, such as real objects and photos of objects. In addition, the feature matching performance is improved compared with the method using only local image features. Because local image features may consume higher bitrates than transmitting the compressed images themselves, we evaluate the performance of a recently proposed low-bitrate local image feature descriptor, CHoG, in 3D object reconstruction and recognition, and propose a difference compression method based on the quantized CHoG descriptor, which further reduces bitrates.

  • 5.
    Barry, Ousmane
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Semi-Automatic Extraction of Information from Satellite Images. 2011. Independent thesis Advanced level (professional degree), 20 credits / 30 HE credits. Student thesis
    Abstract [en]

    This master thesis project deals with the semi-automatic extraction of information from satellite images. Some geographic information systems (GIS) are dedicated to the issue of data production. The graphical user interface of these GIS is essentially passive, and only provides basic CAD tools for intelligence information mapping such as geometric and semantic capture of spatial objects and semantics improvement of geographic objects. Like CAD software, they improve operator productivity only within certain limits, namely those of ergonomics. Thus, by combining some generic image processing algorithms, we have implemented a component for semi-automatic extraction of features from satellite images. We gave priority to the interaction between a user and the component. The user focuses only on the interpretation of the images, and the component performs the repetitive tasks for the user. The addressed features were suburban roads, hydrographic area boundaries and shorelines. This system, based on powerful tools such as the Orfeo Toolbox (core of the) and Qt (for the GUI), has been tested on images from different satellites and the results are quite satisfactory. This opens perspectives to improve and optimize this system with the aim of integrating it into a GIS solution.

  • 6.
    Barry, Ousmane
    et al.
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Liu, Du
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Richter, Stefan
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Flierl, Markus
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Robust Motion-Compensated Orthogonal Video Coding Using EBCOT. 2010. In: Proceedings - 4th Pacific-Rim Symposium on Image and Video Technology, PSIVT 2010, IEEE, 2010, p. 264-269. Conference paper (Refereed)
    Abstract [en]

    This paper proposes a rate-distortion control for motion-compensated orthogonal video coding schemes and evaluates its robustness to packet loss as faced in, e.g., IP networks. The robustness of standard hybrid video coding is extensively studied in the literature. In contrast, motion-compensated orthogonal subbands offer important advantages and new features for robust video transmission. In this work, we utilize so-called uni-directional motion-compensated orthogonal transforms in combination with entropy coding similar to EBCOT known from JPEG2000. The approach provides a flexible embedded structure and allows flexible rate-distortion optimization. Moreover, it may even permit separate encoding and rate control. The proposed rate-distortion control takes channel coding into account and obtains a preemptively protected representation. Our implementation is based on repetition codes, adapted to the channel condition, and improves the PSNR significantly. The optimization requires an estimate of the packet loss rate at the encoder and shows moderate sensitivity to estimation errors.

  • 7.
    Blomqvist, Andreas
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Modelling pulse timing patterns with a GMM/HMM framework. 2011. Independent thesis Advanced level (professional degree), 20 credits / 30 HE credits. Student thesis
  • 8.
    Chatterjee, Saikat
    et al.
    KTH, School of Electrical Engineering (EES), Communication Theory.
    Kleijn, W. Bastiaan
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    AUDITORY MODEL BASED MODIFIED MFCC FEATURES. 2010. In: 2010 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2010, p. 4590-4593. Conference paper (Refereed)
    Abstract [en]

    Using spectral and spectro-temporal auditory models, we develop a computationally simple feature vector based on the design architecture of existing mel frequency cepstral coefficients (MFCCs). Along with the use of an optimized static function to compress a set of filter bank energies, we propose to use a memory-based adaptive compression function to incorporate the behavior of human auditory response across time and frequency. We show that a significant improvement in automatic speech recognition (ASR) performance is obtained for any environmental condition, clean as well as noisy.

  • 9.
    Chatterjee, Saikat
    et al.
    KTH, School of Electrical Engineering (EES), Communication Theory. KTH, School of Electrical Engineering (EES), Centres, ACCESS Linnaeus Centre.
    Kleijn, W. Bastiaan
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Auditory Model-Based Design and Optimization of Feature Vectors for Automatic Speech Recognition. 2011. In: IEEE Transactions on Audio, Speech, and Language Processing, ISSN 1558-7916, E-ISSN 1558-7924, Vol. 19, no 6, p. 1813-1825. Article in journal (Refereed)
    Abstract [en]

    Using spectral and spectro-temporal auditory models along with perturbation-based analysis, we develop a new framework to optimize a feature vector such that it emulates the behavior of the human auditory system. The optimization is carried out in an offline manner based on the conjecture that the local geometries of the feature vector domain and the perceptual auditory domain should be similar. Using this principle along with a static spectral auditory model, we modify and optimize the static spectral mel frequency cepstral coefficients (MFCCs) without considering any feedback from the speech recognition system. We then extend the work to include spectro-temporal auditory properties into designing a new dynamic spectro-temporal feature vector. Using a spectro-temporal auditory model, we design and optimize the dynamic feature vector to incorporate the behavior of human auditory response across time and frequency. We show that a significant improvement in automatic speech recognition (ASR) performance is obtained for any environmental condition, clean as well as noisy.

  • 10.
    Chatterjee, Saikat
    et al.
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Koniaris, Christos
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Kleijn, W. Bastiaan
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Auditory model based optimization of MFCCs improves automatic speech recognition performance. 2009. In: INTERSPEECH 2009: 10TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2009, 2009, p. 2943-2946. Conference paper (Refereed)
    Abstract [en]

    Using a spectral auditory model along with perturbation based analysis, we develop a new framework to optimize a set of features such that it emulates the behavior of the human auditory system. The optimization is carried out in an off-line manner based on the conjecture that the local geometries of the feature domain and the perceptual auditory domain should be similar. Using this principle, we modify and optimize the static mel frequency cepstral coefficients (MFCCs) without considering any feedback from the speech recognition system. We show that improved recognition performance is obtained for any environmental condition, clean as well as noisy.

  • 11. Dahlquist, M.
    et al.
    Lutman, M. E.
    Wood, S.
    Leijon, Arne
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Methodology for quantifying perceptual effects from noise suppression systems. 2005. In: International Journal of Audiology, ISSN 1499-2027, E-ISSN 1708-8186, Vol. 44, no 12, p. 721-732. Article in journal (Refereed)
    Abstract [en]

    A methodology is proposed for perceptual assessment of both subjective sound quality and speech recognition in such a way that results can be compared between these two aspects. Validation is performed with a noise suppression system applied to hearing instruments. A method termed Interpolated Paired Comparison Rating (IPCR) was developed for time-efficient assessment of subjective impression of different aspects of sound quality for a variety of noise conditions. The method is based on paired comparisons between processed and unprocessed stimuli, and the results are expressed as the difference in signal-to-noise ratio (dB) between these that gives equal subjective impression. For tests of speech recognition in noise, validated adaptive test methods can be used that give results in terms of speech-to-noise ratio. The methodology was shown to be sensitive enough to detect significant mean differences between processed and unprocessed speech in noise, both regarding subjective sound quality and speech recognition ability in groups consisting of 30 subjects. An effect on sound quality from the noise suppression equivalent to about 3-4 dB is required to be statistically significant for a single subject. A corresponding effect of 3-6 dB is required for speech recognition (one-sided test). The magnitude of difference that occurred in the present study for sound quality was sufficient to show significant differences for sound quality within individuals, but this was not the case for speech recognition.

  • 12. Driesen, J.
    et al.
    Van Hamme, H.
    Kleijn, W. Bastiaan
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Learning from images and speech with non-negative matrix factorization enhanced by input space scaling. 2010. In: 2010 IEEE Workshop on Spoken Language Technology, SLT 2010 - Proceedings, IEEE, 2010, p. 1-6. Conference paper (Refereed)
    Abstract [en]

    Computational learning from multimodal data is often done with matrix factorization techniques such as NMF (Non-negative Matrix Factorization), pLSA (Probabilistic Latent Semantic Analysis) or LDA (Latent Dirichlet Allocation). The different modalities of the input are to this end converted into features that are easily placed in a vectorized format. An inherent weakness of such a data representation is that only a subset of these data features actually aids the learning. In this paper, we first describe a simple NMF-based recognition framework operating on speech and image data. We then propose and demonstrate a novel algorithm that scales the inputs of this framework in order to optimize its recognition performance.

  • 13.
    Dubois, Julien
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Integration of speech recognition into Air traffic Control. 2011. Student paper other, 20 credits / 30 HE credits. Student thesis
    Abstract [en]

    We study here the feasibility and relevance of integrating speech recognition technologies into air traffic controller workstations, with the purpose of adding safety and reducing workload, while the expected growth of air traffic density tends to add more constraints and stress to these critical jobs.

  • 14.
    Ekman, Anders
    et al.
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Grancharov, Volodya
    Multimedia Technologies, Ericsson Research.
    Kleijn, W. Bastiaan
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Double-Ended Quality Assessment System for Super-Wideband Speech. 2011. In: IEEE Transactions on Audio, Speech, and Language Processing, ISSN 1558-7916, Vol. 19, no 3, p. 558-569. Article in journal (Refereed)
    Abstract [en]

    This paper describes a double-ended quality assessment system for speech with a bandwidth of up to 14 kHz (so-called super-wideband speech). The quality assessment system is based on a combination of local and global features, where the local features are dependent on a time alignment procedure and the global features are not. The system is evaluated over a large set of subjectively scored narrowband, wideband and super-wideband speech databases. The system performs similarly to PESQ for narrowband speech and significantly better for wideband speech.

  • 15. Ekman, L. A.
    et al.
    Kleijn, W. Bastiaan
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Murthi, M. N.
    Regularized linear prediction of speech. 2008. Article in journal (Refereed)
    Abstract [en]

    All-pole spectral envelope estimates based on linear prediction (LP) for speech signals often exhibit unnaturally sharp peaks, especially for high-pitch speakers. In this paper, regularization is used to penalize rapid changes in the spectral envelope, which improves the spectral envelope estimate. Based on extensive experimental evidence, we conclude that regularized linear prediction outperforms bandwidth-expanded linear prediction. The regularization approach gives lower spectral distortion on average, and fewer outliers, while maintaining a very low computational complexity.

  • 16.
    Ekman, L. Anders
    et al.
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Kleijn, W. Bastiaan
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Murthi, Manohar N.
    Spectral envelope estimation and regularization. 2006. In: 2006 IEEE International Conference on Acoustics, Speech and Signal Processing, 2006, p. 245-248. Conference paper (Refereed)
    Abstract [en]

    A well-known problem with linear prediction is that its estimate of the spectral envelope often has sharp peaks for high-pitch speakers. These peaks are anomalies resulting from contamination of the spectral envelope by the spectral fine structure. We investigate the method of regularized linear prediction to find a better estimate of the spectral envelope and compare the method to the commonly used approach of bandwidth expansion. We present simulations over voiced frames of female speakers from the TIMIT database, where the envelope modeling accuracy is measured using a log spectral distortion measure. We also investigate the coding properties of the methods. The results indicate that the new regularized LP method is superior to bandwidth expansion, with an insignificant increase in computational complexity.

  • 17. Eneman, K.
    et al.
    Luts, H.
    Wouters, J.
    Büchler, M.
    Dillier, N.
    Dreschler, W.
    Froehlich, M.
    Grimm, G.
    Hohmann, V.
    Houben, Rolph
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Leijon, Arne
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Lombard, A.
    Mauler, D.
    Moonen, M.
    Puder, H.
    Schulte, M.
    Spriet, A.
    Vormann, M.
    Evaluation of signal enhancement algorithms for hearing instruments. 2008. Conference paper (Refereed)
    Abstract [en]

    In the frame of the HearCom project, five promising signal enhancement algorithms are validated for future use in hearing instrument devices. To assess the algorithm performance solely based on simulation experiments, a number of physical evaluation measures have been proposed that incorporate basic aspects of normal and impaired human hearing. Additionally, each of the algorithms has been implemented on a common real-time hardware/software platform, which facilitates a profound subjective validation of the algorithm performance. Recently, a multicenter study has been set up across five different test centers in Belgium, the Netherlands, Germany and Switzerland to perceptually evaluate the selected signal enhancement approaches with normally hearing and hearing impaired listeners.

  • 18.
    Eneman, Koen
    et al.
    Katholieke Univ Leuven.
    Leijon, Arne
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Doclo, Simon
    Univ. Oldenburg.
    Spriet, Ann
    Katholieke Univ Leuven.
    Moonen, Marc
    Katholieke Univ Leuven.
    Wouters, Jan
    Katholieke Univ Leuven.
    Auditory-Profile-Based Physical Evaluation of Multi-Microphone Noise Reduction Techniques in Hearing Instruments. 2008. In: Advances in Digital Speech Transmission / [ed] Rainer Martin, Ulrich Heute, Christiane Antweiler, John Wiley & Sons, 2008, p. 431-458. Chapter in book (Other academic)
  • 19.
    Enhorn, Jack
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Efficient Video Coding Beyond High-Definition Resolution. 2011. Independent thesis Advanced level (professional degree), 20 credits / 30 HE credits. Student thesis
    Abstract [en]

    For good quality Internet video, digital TV, or even mobile video, efficient coding of video signals is required to meet given bandwidth constraints. Today, MPEG-2 and MPEG-4 coding standards are widely used. But to cope with higher than HD resolution, more efficient coding is needed. A successor to the MPEG-4 codec is now being developed by JCT-VC, a group formed by MPEG and VCEG, with the goal of reducing bit rate by 50%. The working name for this emerging standard is H.265/HEVC (High Efficiency Video Coding). In this thesis, three algorithms associated with the loop filters have been developed and evaluated. The goal is to improve an HEVC test model and possibly contribute to this emerging standard. Results show that deblocking and the adaptive loop filter (ALF) may be run in parallel, that a gain could be achieved by using more input to ALF, and that fixed filters may considerably reduce the complexity of ALF. This thesis was done in cooperation with Ericsson Research.

  • 20. Falk, Tiago H.
    et al.
    Stadler, Svante
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Kleijn, W. Bastiaan
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Chan, Wai-Yip
    Noise Suppression Based on Extending a Speech-Dominated Modulation Band. 2007. In: INTERSPEECH 2007: 8TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION, VOLS 1-4, 2007, p. 1469-1472. Conference paper (Refereed)
    Abstract [en]

    Previous work on bandpass modulation filtering for noise suppression has resulted in unwanted perceptual artifacts and decreased speech clarity. Artifacts are introduced mainly due to half-wave rectification, which is employed to correct for negative power spectral values resultant from the filtering process. In this paper, modulation frequency estimation (i.e., bandwidth extension) is used to improve perceptual quality. Experiments demonstrate that speech-component lowpass modulation content can be reliably estimated from bandpass modulation content of speech-plus-noise components. Subjective listening tests corroborate that improved quality is attained when the removed speech lowpass modulation content is compensated for by the estimate.

  • 21. Faundez-Zanuy, M
    et al.
    Hagmuller, M
    Kubin, G
    Kleijn, W. Bastiaan
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    The COST-277 speech database. 2005. In: NONLINEAR ANALYSES AND ALGORITHMS FOR SPEECH PROCESSING, 2005, Vol. 3817, p. 100-107. Conference paper (Refereed)
    Abstract [en]

    Databases are fundamental for research investigations. This paper presents the speech database generated in the framework of the COST-277 "Nonlinear speech processing" European project, as a result of European collaboration. This database makes it possible to address two main problems: the relevance of bandwidth extension, and the usefulness of watermarking with perceptual shaping at different watermark-to-signal ratios. It will be publicly available after the end of the COST-277 action, in January 2006.

  • 22. Feldbauer, C.
    et al.
    Kleijn, W. Bastiaan
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Scalable Coding with Side Information for Packet Loss Recovery. 2009. In: IEEE Transactions on Communications, ISSN 0090-6778, E-ISSN 1558-0857, Vol. 57, no 8, p. 2309-2319. Article in journal (Refereed)
    Abstract [en]

    This paper presents a packet loss recovery method that uses an incomplete secondary encoding based on scalar quantization as redundancy. The method is redundancy bit rate scalable and allows an adaptation to varying loss scenarios and a varying packeting strategy. The recovery is performed by minimum mean squared error estimation incorporating a statistical model for the quantizers to facilitate real-time adaptation. A bit allocation algorithm is proposed that extends 'reverse water filling' to the problem of scalar encoding dependent variables for a decoder with a final estimation stage and available side information. We apply the method to the encoding of line-spectral frequencies (LSFs), which are commonly used in speech coding, illustrating the good performance of the method.

  • 23. Feldbauer, C.
    et al.
    Kubin, G.
    Kleijn, W. Bastiaan
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Anthropomorphic coding of speech and audio: A model inversion approach. 2005. In: EURASIP Journal on Applied Signal Processing, ISSN 1110-8657, E-ISSN 1687-0433, Vol. 2005, no 9, p. 1334-1349. Article in journal (Refereed)
    Abstract [en]

    Auditory modeling is a well-established methodology that provides insight into human perception and that facilitates the extraction of signal features that are most relevant to the listener. The aim of this paper is to provide a tutorial on perceptual speech and audio coding using an invertible auditory model. In this approach, the audio signal is converted into an auditory representation using an invertible auditory model. The auditory representation is quantized and coded. Upon decoding, it is then transformed back into the acoustic domain. This transformation converts a complex distortion criterion into a simple one, thus facilitating quantization with low complexity. We briefly review past work on auditory models and describe in more detail the components of our invertible model and its inversion procedure, that is, the method to reconstruct the signal from the output of the auditory model. We summarize attempts to use the auditory representation for low-bit-rate coding. Our approach also allows the exploitation of the inherent redundancy of the human auditory system for the purpose of multiple description (joint source-channel) coding.

  • 24.
    Feldbauer, Christian
    et al.
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Kleijn, W. Bastiaan
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    An adaptive, scalable packet loss recovery method. 2007. In: 2007 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2007, p. 1117-1120. Conference paper (Other academic)
    Abstract [en]

    We propose a packet loss recovery method that uses an incomplete secondary encoding as redundancy. The recovery is performed by minimum mean squared error estimation. The method adapts to the loss scenario and is rate scalable. It incorporates a statistical model for the quantizers to facilitate real-time adaptation. We apply the method to the encoding of line-spectral frequencies, which are commonly used in speech coding, illustrating the good performance of the method.

  • 25.
    Flierl, Markus
    KTH, School of Electrical Engineering (EES), Centres, ACCESS Linnaeus Centre. KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    A l(1)-NORM PRESERVING MOTION-COMPENSATED TRANSFORM FOR SPARSE APPROXIMATION OF IMAGE SEQUENCES. 2010. Conference paper (Refereed)
    Abstract [en]

    This paper discusses an adaptive non-linear transform for image sequences that aims to generate an l(1)-norm preserving sparse approximation for efficient coding. Most sparse approximation problems employ a linear model where images are represented by a basis and a sparse set of coefficients. In this work, however, we consider image sequences where linear measurements are of limited use due to motion. We present a motion-adaptive non-linear transform for a group of pictures that outputs common and detail coefficients and that minimizes the l(1) norm of the detail coefficients while preserving the overall l(1) norm. We demonstrate that we can achieve a smaller l(1) norm of the detail coefficients when compared to that of motion-adaptive linear measurements. Further, the decay of normalized absolute coefficients is faster than that of motion-adaptive linear measurements.

  • 26.
    Flierl, Markus
    KTH, School of Electrical Engineering (EES), Sound and Image Processing. KTH, School of Electrical Engineering (EES), Centres, ACCESS Linnaeus Centre.
    Adaptive spatial wavelets for motion-compensated orthogonal video transforms2009In: 2009 16th IEEE International Conference on Image Processing (ICIP), IEEE , 2009, p. 1045-1048Conference paper (Refereed)
    Abstract [en]

    This paper discusses adaptive spatial wavelets for the class of motion-compensated orthogonal video transforms. Motion-compensated orthogonal transforms (MCOT) are temporal transforms for video sequences that maintain orthonormality while permitting flexible motion compensation. Orthogonality is maintained for arbitrary integer-pixel or sub-pixel motion compensation by cascading a sequence of incremental orthogonal transforms and updating so-called scale counters for each pixel. The energy of the input pictures is accumulated in a temporal low-band while the temporal high-bands are zero if the input pictures are identical after motion compensation. For efficient coding, the temporal subbands should be further spatially decomposed to exploit the spatial correlation within each temporal subband. In this paper, we discuss adaptive spatial wavelets that maintain the orthogonal representation of the temporal transforms. Similar to the temporal transforms, they update scale counters for efficient energy concentration. The type-1 adaptive wavelet is a Haar-like wavelet. The type-2 considers three pixels at a time and achieves better energy compaction than the type-1.

  • 27.
    Flierl, Markus
    et al.
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Vandergheynst, Pierre
    EPFL.
    Method for spatially scalable video coding2004Patent (Other (popular science, discussion, etc.))
    Abstract [en]

    A method for decomposing a digital image at resolution R and MR into a set of spatial sub-bands of resolution R and MR where MR>R and where the high-band at resolution MR is calculated by subtracting the filtered and up-sampled image at resolution R from the image at resolution MR and where the spatial low-band at resolution R is calculated by adding the filtered and down-sampled spatial high-band to the image at resolution R and where a rational factor for up-and down-sampling M is determined by the resolution ratio.

  • 28.
    Georgakis, Apostolos
    et al.
    Ericsson.
    Rana, Pravin Kumar
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Radulovic, Ivana
    Ericsson.
    3DTV Exploration Experiments (EE1 & EE4) on the Lovebird1 Data Set2009Report (Other (popular science, discussion, etc.))
    Abstract [en]

    This contribution describes the results of two sets of 3DTV exploration experiments undertaken by Ericsson using the Lovebird 1 sequence defined at the last MPEG meeting in London (see w10720). These sets cover both EE1 for depth estimation and view synthesis and EE4 for coding efficiency.

  • 29.
    Gerkmann, Timo
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Cepstral Weighting for Speech Dereverberation Without Musical Noise2011In: Proceedings European Signal Processing Conference, 2011, p. 2309-2313Conference paper (Refereed)
    Abstract [en]

    We present an effective way to reduce musical noise in binaural speech dereverberation algorithms based on an instantaneous weighting of the cepstrum. We propose this instantaneous technique, as temporal smoothing techniques result in a smearing of the signal over time and are thus expected to reduce the dereverberation performance. For the instantaneous weighting function we compute the a posteriori probability that a cepstral coefficient represents the speech spectral structure. The proposed algorithm incorporates a priori knowledge about the speech spectral structure by training the parameters of the respective likelihood function offline using a speech database. The proposed algorithm employs neither a voiced/unvoiced detection nor a fundamental period estimator and is shown to outperform an algorithm without cepstral processing in terms of a higher signal-to-interference ratio, a lower bark spectral distortion, and a lower log kurtosis ratio, indicating a reduction of musical noise.

  • 30.
    Grancharov, Volodya
    et al.
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Kleijn, W. Bastiaan
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Georgiev, Alexander
    Sub-pixel registration of noisy images2006In: 2006 IEEE International Conference on Acoustics, Speech and Signal Processing: image and multidimensional signal processing, signal processing education, bioimaging and signal processing, NEW YORK, NY: IEEE , 2006, p. 273-Conference paper (Refereed)
    Abstract [en]

    The accurate registration of images observed in additive noise is a challenging task. The noise increases the number of misregistered regions, and decreases the accuracy of subpixel registration. To address this problem, we propose an intensity-based algorithm that performs registration based only on regions that are least affected by noise. We select these regions with a signal-to-noise ratio estimate that is obtained from an initial, less-accurate registration. Our simulations demonstrate that the proposed noise-adaptive scheme significantly outperforms the conventional registration approach.

  • 31.
    Grancharov, Volodya
    et al.
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Plasberg, Jan H.
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Samuelsson, Jonas
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Kleijn, Bastiaan
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Generalized postfilter for speech quality enhancement2008In: IEEE Transactions on Audio, Speech and Language Processing, ISSN 1558-7916, Vol. 16, no 1, p. 57-64Article in journal (Refereed)
    Abstract [en]

    Postfilters are commonly used in speech coding for the attenuation of quantization noise. In the presence of acoustic background noise or distortion due to tandeming operations, the postfilter parameters are not adjusted and the performance is, therefore, not optimal. We propose a modification that consists of replacing the nonadaptive postfilter parameters with parameters that adapt to variations in spectral flatness, obtained from the noisy speech. This generalization of the postfiltering concept can handle a larger range of noise conditions, but has the same computational complexity and memory requirements as the conventional postfilter. Test results indicate that the presented algorithm improves on the standard postfilter, as well as on the combination of a noise attenuation preprocessor and the conventional postfilter.

  • 32.
    Grancharov, Volodya
    et al.
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Samuelsson, Jonas
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Kleijn, Bastiaan
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    On causal algorithms for speech enhancement2006In: IEEE Transactions on Speech and Audio Processing., ISSN 1558-7916, Vol. 14, p. 764-773Article in journal (Refereed)
    Abstract [en]

    Kalman filtering is a powerful technique for the estimation of a signal observed in noise, and it can be used to enhance speech observed in the presence of acoustic background noise. In a speech communication system, the speech signal is typically buffered for a period of 10-40 ms and, therefore, the use of either a causal or a noncausal filter is possible. We show that the causal Kalman algorithm is in conflict with the basic properties of human perception and address the problem of improving its perceptual quality. We discuss two approaches to improve perceptual performance. The first is based on a new method that combines the causal Kalman algorithm with pre- and postfiltering to introduce perceptual shaping of the residual noise. The second is based on the conventional Kalman smoother. We show that a short lag removes the conflict resulting from the causality constraint and we quantify the minimum lag required for this purpose. The results of our objective and subjective evaluations confirm that both approaches significantly outperform the conventional causal implementation. Of the two approaches, the Kalman smoother performs better if the signal statistics are precisely known; if this is not the case, the perceptually weighted Kalman filter performs better.

  • 33.
    Grancharov, Volodya
    et al.
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Samuelsson, Jonas
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Kleijn, W. Bastiaan
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Improved Kalman filtering for speech enhancement2005In: 2005 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOLS 1-5: SPEECH PROCESSING, 2005, p. 1109-1112Conference paper (Refereed)
    Abstract [en]

    The Kalman recursion is a powerful technique for reconstruction of the speech signal observed in additive background noise. In contrast to Wiener filtering and spectral subtraction schemes, the Kalman algorithm can be easily implemented in both causal and noncausal form. After studying the perceptual differences between these two implementations we propose a novel algorithm that combines the low complexity and the robustness of the Kalman filter and the proper noise shaping of the Kalman smoother.

  • 34. Grancharov, Volodya
    et al.
    Zhao, David
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Lindblom, Jonas
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Kleijn, Bastiaan
    KTH, School of Electrical Engineering (EES).
    Low-complexity, non-intrusive speech quality assessment2006In: IEEE Transactions on Speech and Audio Processing., ISSN 1558-7916, Vol. 14, no 6, p. 1948-1956Article in journal (Refereed)
    Abstract [en]

    Monitoring of speech quality in emerging heterogeneous networks is of great interest to network operators. The most efficient way to satisfy such a need is through nonintrusive, objective speech quality assessment. In this paper, we describe a low-complexity algorithm for monitoring the speech quality over a network. The features used in the proposed algorithm can be computed from commonly used speech-coding parameters. Reconstruction and perceptual transformation of the signal is not performed. The critical advantage of the approach lies in generating quality assessment ratings without explicit distortion modeling. The results from the performed experiments indicate that the proposed nonintrusive objective quality measure performs better than the ITU-T P.563 standard.

  • 35.
    Grancharov, Volodya
    et al.
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Zhao, David Yuheng
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Lindblom, Jonas
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Kleijn, W. Bastiaan
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Non-Intrusive Speech Quality Assessment with Low Computational Complexity2006In: INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, BAIXAS: ISCA-INST SPEECH COMMUNICATION ASSOC , 2006, p. 189-192Conference paper (Refereed)
    Abstract [en]

    We describe an algorithm for monitoring subjective speech quality without access to the original signal that has very low computational and memory requirements. The features used in the proposed algorithm can be computed from commonly used speech-coding parameters. Reconstruction and perceptual transformation of the signal are not performed. The algorithm generates quality assessment ratings without explicit distortion modeling. The simulation results indicate that the proposed non-intrusive objective quality measure performs better than the ITU-T P.563 standard despite its very low computational complexity.

  • 36.
    Guan, Nan
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Bayesian Optimal Pure Tone Audiometry with Prior Knowledge2011Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE creditsStudent thesis
    Abstract [en]

    Pure-tone hearing threshold measurement is the most basic and common test for diagnosis of hearing loss and for compensation of the loss with hearing instruments. Pure-tone hearing thresholds are usually assessed using a simple standardized method. By employing an optimal strategy, the thresholds can be determined with the same accuracy as the standard method with far fewer presentations. More detailed prior knowledge, extracted from Beltone’s extra database, which contains over 400,000 audiograms including age and gender information, helps improve the optimal strategy and the efficiency of the PTA process. In addition, a graphical user interface implements the method and presents it more directly to the users (doctors and patients), which makes the optimal process more accessible and easier to control.

  • 37.
    Guo, Ziyuan
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Objective Audio Quality Assessment Based on Spectro-Temporal Modulation Analysis2011Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE creditsStudent thesis
    Abstract [en]

    Objective audio quality assessment is an interdisciplinary research area that incorporates audiology and machine learning. Although much work has been done on the machine learning aspect, the audiology aspect also deserves investigation. This thesis proposes a non-intrusive audio quality assessment algorithm, which is based on an auditory model that simulates the human auditory system. The auditory model is based on spectro-temporal modulation analysis of the spectrogram, which has been proven to be effective in predicting the neural activities of the human auditory cortex. The performance of an implementation of the algorithm shows the effectiveness of the spectro-temporal modulation analysis in audio quality assessment.

  • 38.
    Guoqiang, Zhang
    et al.
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Dán, György
    KTH, School of Electrical Engineering (EES), Communication Networks.
    Kleijn, W. Bastiaan
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Lundin, Henrik
    Adaptive Playout Scheduling for Voice over IP: Event-Triggered Control PolicyIn: IEEE Multimedia, ISSN 1070-986X, E-ISSN 1941-0166Article in journal (Other academic)
    Abstract [en]

    We study adaptive-playout scheduling for Voice over IP using the framework of stochastic impulse control theory. We use the Wiener process to model the fluctuation of the buffer length in the absence of control. In this context, the control signal consists of length units that correspond to inserting or dropping a pitch cycle. We define an optimality criterion that has an adjustable trade-off between average buffering delay and average control signal (the length of the pitch cycles added plus the length of the pitch cycles dropped), and show that a band control policy is optimal for this criterion. The band control policy maintains the buffer length within a band region by imposing impulse control (inserted or dropped pitch cycles) whenever the bounds of the band are reached. One important property of the band control policy is that it incurs no packet-loss through buffering if there are no out-of-order packet-arrivals. Experiments performed on both synthetic and real network-delay traces show that the proposed playout scheduling algorithm outperforms two recent algorithms in most cases.

  • 39.
    Guoqiang, Zhang
    et al.
    KTH, School of Electrical Engineering (EES), Sound and Image Processing. KTH, School of Electrical Engineering (EES), Centres, ACCESS Linnaeus Centre.
    Kleijn, W. Bastiaan
    KTH, School of Electrical Engineering (EES), Sound and Image Processing. KTH, School of Electrical Engineering (EES), Centres, ACCESS Linnaeus Centre.
    Autoregressive Model-based Speech Packet-Loss Concealment2008In: 2008 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, 2008, p. 4797-4800Conference paper (Refereed)
    Abstract [en]

    We study packet-loss concealment for speech based on autoregressive modelling using a rigorous minimum mean square error (MMSE) approach. The effect of the model estimation error on predicting the missing segment is studied and an upper bound on the mean square error is derived. Our experiments show that the upper bound is tight when the estimation error is less than the signal variance. We also consider the usage of perceptual weighting on prediction to improve speech quality. A rigorous argument is presented to show that perceptual weighting is not useful in this context. We create simple and practical MMSE-based systems using two signal models: a basic model capturing the short-term correlation and a more sophisticated model that also captures the long-term correlation. Subjective quality comparison tests show that the proposed MMSE-based system provides state-of-the-art performance.

  • 40.
    Helgason, Hannes
    et al.
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Abry, P.
    Goncalves, P.
    Gharib, Cl.
    Gaucherand, P.
    Doret, M.
    Adaptive Multiscale Complexity Analysis of Fetal Heart Rate2011In: IEEE Transactions on Biomedical Engineering, ISSN 0018-9294, E-ISSN 1558-2531, Vol. 58, no 8, p. 2186-2193Article in journal (Refereed)
    Abstract [en]

    Per partum fetal asphyxia is a major cause of neonatal morbidity and mortality. Fetal heart rate monitoring plays an important role in early detection of acidosis, an indicator for asphyxia. This problem is addressed in this paper by introducing a novel complexity analysis of fetal heart rate data, based on producing a collection of piecewise linear approximations of varying dimensions from which a measure of complexity is extracted. This procedure specifically accounts for the highly nonstationary context of labor by being adaptive and multiscale. Using a reference dataset, made of real per partum fetal heart rate data, collected in situ and carefully constituted by obstetricians, the behavior of the proposed approach is analyzed and illustrated. Its performance is evaluated in terms of the rate of correct acidosis detection versus the rate of false detection, as well as how early the detection is made. Computational cost is also discussed. The results are shown to be extremely promising and further potential uses of the tool are discussed. MATLAB routines implementing the procedure will be made available at the time of publication.

  • 41.
    Helgason, Hannes
    et al.
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Bartroff, Jay
    Abry, Patrice
    Framework for Adaptive Multiscale Analysis of Nonhomogeneous Point Processes2011In: 2011 ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY (EMBC), New York: IEEE , 2011, p. 7727-7730Conference paper (Refereed)
    Abstract [en]

    We develop the methodology for hypothesis testing and model selection in nonhomogeneous Poisson processes, with an eye toward the application of modeling and variability detection in heart beat data. Modeling the process' non-constant rate function using templates of simple basis functions, we develop the generalized likelihood ratio statistic for a given template and a multiple testing scheme to model-select from a family of templates. A dynamic programming algorithm inspired by network flows is used to compute the maximum likelihood template in a multiscale manner. In a numerical example, the proposed procedure is nearly as powerful as the super-optimal procedures that know the true template size and true partition, respectively. Extensions to general history-dependent point processes are discussed.

  • 42.
    Helgason, Hannes
    et al.
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Li, Haopeng
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Flierl, Markus
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Multiscale framework for adaptive and robust enhancement of depth in multi-view imagery2012In: Image Processing (ICIP), 2012 19th IEEE International Conference on, IEEE , 2012, p. 13-16Conference paper (Refereed)
    Abstract [en]

    Depth Image Based Rendering (DIBR) is a standard technique in free viewpoint television for rendering virtual camera views. For synthesis it utilizes one or several reference texture images and associated depth images, which contain information about the 3D structure of the scene. Many popular depth estimation methods infer the depth information by considering texture images in pairs. This often leads to severe inconsistencies among multiple reference depth images, resulting in poor rendering quality. We propose a method which takes as input a set of depth images and returns an enhanced depth map to be used for rendering at the virtual viewpoint. Our framework is data-driven and based on a simple geometric multiscale model of the underlying depth. Inconsistencies and errors in the input depth images are handled locally using tools from the field of robust statistics. Numerical comparison shows that the method outperforms standard MPEG DIBR software.

  • 43.
    Helgason, Hannes
    et al.
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Pipiras, Vladas
    Abry, Patrice
    Synthesis of multivariate stationary series with prescribed marginal distributions and covariance using circulant matrix embedding2011In: Signal Processing, ISSN 0165-1684, E-ISSN 1872-7557, Vol. 91, no 8, p. 1741-1758Article in journal (Refereed)
    Abstract [en]

    The problem of synthesizing multivariate stationary series $Y[n] = (Y_1[n], \ldots, Y_p[n])^T$, $n \in \mathbb{Z}$, with prescribed non-Gaussian marginal distributions and a targeted covariance structure is addressed. The focus is on constructions based on a memoryless transformation $Y_p[n] = f_p(X_p[n])$ of a multivariate stationary Gaussian series $X[n] = (X_1[n], \ldots, X_p[n])^T$. The mapping between the targeted covariance and that of the Gaussian series is expressed via Hermite expansions. The various choices of the transforms $f_p$ for a prescribed marginal distribution are discussed in a comprehensive manner. The interplay between the targeted marginal distributions, the choice of the transforms $f_p$, and the resulting reachability of the targeted covariance is discussed theoretically and illustrated on examples. Also, an original practical procedure warranting positive definiteness for the transformed covariance at the price of approximating the targeted covariance is proposed, based on a simple and natural modification of the popular circulant matrix embedding technique. The applications of the proposed methodology are also discussed in the context of network traffic modeling. MATLAB codes implementing the proposed synthesis procedure are publicly available at http://www.hermir.org.

  • 44. Hendriks, Richard C.
    et al.
    Gerkmann, Timo
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Noise Correlation Matrix Estimation for Multi-Microphone Speech Enhancement2012In: IEEE Transactions on Audio, Speech, and Language Processing, ISSN 1558-7916, E-ISSN 1558-7924, Vol. 20, no 1, p. 223-233Article in journal (Refereed)
    Abstract [en]

    For multi-channel noise reduction algorithms like the minimum variance distortionless response (MVDR) beamformer, or the multi-channel Wiener filter, an estimate of the noise correlation matrix is needed. For its estimation, it is often proposed in the literature to use a voice activity detector (VAD). However, using a VAD, the estimated matrix can only be updated in speech absence. As a result, during speech presence the noise correlation matrix estimate does not follow changing noise fields with an appropriate accuracy. This effect is further increased because, in nonstationary noise, voice activity detection is a rather difficult task, and false alarms are likely to occur. In this paper, we present and analyze an algorithm that estimates the noise correlation matrix without using a VAD. This algorithm is based on measuring the correlation of the noisy input and a noise reference which can be obtained, e.g., by steering a null towards the target source. When applied in combination with an MVDR beamformer, it is shown that the proposed noise correlation matrix estimate results in a more accurate beamformer response, a larger signal-to-noise ratio improvement and a larger instrumentally predicted speech intelligibility when compared to competing algorithms such as the generalized sidelobe canceler, a VAD-based MVDR beamformer, and an MVDR based on the noisy correlation matrix.

  • 45.
    Hendriks, Richard
    et al.
    Delft University of Technology.
    Gerkmann, Timo
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Estimation of the Noise Correlation Matrix2011In: 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011Conference paper (Refereed)
    Abstract [en]

    To harvest the potential of multi-channel noise reduction methods, it is crucial to have an accurate estimate of the noise correlation matrix. Existing algorithms either assume speech absence and exploit a voice activity detector (VAD), or make use of additional assumptions like a diffuse noise field. Therefore, these algorithms are limited with respect to their tracking speed and the type of noise fields for which they can estimate the correlation matrix. In this paper we present a new method for noise correlation matrix estimation that makes no assumptions about the type of noise field, nor uses a VAD. The presented method exploits the existence of accurate single-channel noise PSD estimators, as well as the availability of one noise reference per microphone pair. For spatially and temporally non-stationary noise fields, the proposed method leads to improved performance compared to widely used state-of-the-art reference methods in terms of both segmental SNR and beamformer response error.

  • 46.
    Henter, Gustav Eje
    et al.
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Kleijn, W. Bastiaan
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Intermediate-State HMMs to Capture Continuously-Changing Signal Features2011In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 2011, p. 1828-1831Conference paper (Refereed)
    Abstract [en]

    Traditional discrete-state HMMs are not well suited for describing steadily evolving, path-following natural processes like motion capture data or speech. HMMs cannot represent incremental progress between behaviors, and sequences sampled from the models have unnatural segment durations, unsmooth transitions, and excessive rapid variation. We propose to address these problems by permitting the state variable to occupy positions between the discrete states, and present a concrete left-right model incorporating this idea. We call these models intermediate-state HMMs. The state evolution remains Markovian. We describe training using the generalized EM-algorithm and present associated update formulas. An experiment shows that the intermediate-state model is capable of gradual transitions, with more natural durations and less noise in sampled sequences compared to a conventional HMM.

  • 47.
    Henter, Gustav Eje
    et al.
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Kleijn, W. Bastiaan
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Picking up the pieces: Causal states in noisy data, and how to recover them2013In: Pattern Recognition Letters, ISSN 0167-8655, E-ISSN 1872-7344, Vol. 34, no 5, p. 587-594Article in journal (Refereed)
    Abstract [en]

    Automatic structure discovery is desirable in many Markov model applications where a good topology (states and transitions) is not known a priori. CSSR is an established pattern discovery algorithm for stationary and ergodic stochastic symbol sequences that learns a predictively optimal Markov representation consisting of so-called causal states. By means of a novel algebraic criterion, we prove that the causal states of a simple process disturbed by random errors frequently are too complex to be learned fully, making CSSR diverge. In fact, the causal state representation of many hidden Markov models, representing simple but noise-disturbed data, has infinite cardinality. We also report that these problems can be solved by endowing CSSR with the ability to make approximations. The resulting algorithm, robust causal states (RCS), is able to recover the underlying causal structure from data corrupted by random substitutions, as is demonstrated both theoretically and in an experiment. The algorithm has potential applications in areas such as error correction and learning stochastic grammars.

  • 48.
    Henter, Gustav Eje
    et al.
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Kleijn, W. Bastiaan
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Simplified Probability Models for Generative Tasks: a Rate-Distortion Approach2010In: Proceedings of the European Signal Processing Conference, EUROPEAN ASSOC SIGNAL SPEECH & IMAGE PROCESSING-EURASIP , 2010, Vol. 18, p. 1159-1163Conference paper (Refereed)
    Abstract [en]

    We consider using sparse simplifications to denoise probabilistic sequence models for generative tasks such as speech synthesis. Our proposal is to find the least random model that remains close to the original one according to a KL-divergence constraint, a technique we call minimum entropy rate simplification (MERS). This produces a representation-independent framework for trading off simplicity and divergence, similar to rate-distortion theory. Importantly, MERS uses the cleaned model rather than the original one for the underlying probabilities in the KL-divergence, effectively reversing the conventional argument order. This promotes rather than penalizes sparsity, suppressing uncommon outcomes likely to be errors. We write down the MERS equations for Markov chains, and present an iterative solution procedure based on the Blahut-Arimoto algorithm and a bigram matrix Markov chain representation. We apply the procedure to a music-based Markov grammar, and compare the results to a simplistic thresholding scheme.

  • 49. Heusdens, R.
    et al.
    Jensen, J.
    Kleijn, W. Bastiaan
    KTH, School of Electrical Engineering (EES), Sound and Image Processing.
    Kot, V.
    Niamut, O. A.
    Van De Par, S.
    Van Schijndel, N. H.
    Vafin, R.
    Bit-rate scalable intraframe sinusoidal audio coding based on rate-distortion optimization2006In: Journal of The Audio Engineering Society, ISSN 1549-4950, Vol. 54, no 3, p. 167-188Article in journal (Refereed)
    Abstract [en]

    A coding methodology that aims at rate-distortion optimal sinusoid + noise coding of audio and speech signals is presented. The coder divides the input signal into variable-length time segments and distributes sinusoidal components over the segments such that the resulting distortion (as measured by a perceptual distortion measure) is minimized subject to a prespecified rate constraint. The coder is bit-rate scalable. For a given target bit budget it automatically adapts the segmentation and distribution of sinusoids in a rate-distortion optimal manner. The coder uses frequency-differential coding techniques in order to exploit intrasegment correlations for efficient quantization and encoding of the sinusoidal model parameters. This technique makes the coder more robust toward packet losses when used in a lossy-packet channel environment as compared to time-differential coding techniques, which are commonly used in audio or speech coders. In a subjective listening experiment the present coder showed similar or better performance than a set of four MPEG-4 coders operating at bit rates of 16, 24, 32, and 48 kbit/s, each of which was state of the art for the given target bit rate.

  • 50. Heusdens, Richard
    et al.
    Kleijn, W. Bastiaan
    KTH, School of Electrical Engineering (EES), Sound and Image Processing. KTH, School of Electrical Engineering (EES), Centres, ACCESS Linnaeus Centre.
    Ozerov, Alexey
    KTH, School of Electrical Engineering (EES), Sound and Image Processing. KTH, School of Electrical Engineering (EES), Centres, ACCESS Linnaeus Centre.
    Entropy-constrained high-resolution lattice vector quantization using a perceptually relevant distortion measure2007In: CONFERENCE RECORD OF THE FORTY-FIRST ASILOMAR CONFERENCE ON SIGNALS, SYSTEMS & COMPUTERS, VOLS 1-5, NEW YORK: IEEE , 2007, p. 2075-2079Conference paper (Other academic)
    Abstract [en]

    In this paper we study high-resolution entropy-constrained coding using multidimensional companding. To account for auditory perception, we introduce a perceptually relevant distortion measure. We derive a multidimensional companding function which is asymptotically optimal in the sense that the rate loss introduced by the compander will vanish with increasing vector dimension. We compare the companding scheme to a scheme which is based on a perceptual weighting of the source, thereby transforming the perceptual distortion measure into a mean-squared error distortion measure. Experimental results show that even at low vector dimension, the rate loss introduced by the compander is low (less than 0.05 bit per dimension in case of two-dimensional vectors).
