Towards conversational speech synthesis: Experiments with data quality, prosody modification, and non-verbal signals
KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. Aalto University, Department of Signal Processing and Acoustics. (Speech Communication)
2017 (English) Licentiate thesis, comprehensive summary (Other academic)
Abstract [en]

The aim of a text-to-speech synthesis (TTS) system is to generate a human-like speech waveform from a given input text. Current TTS systems have already reached a high degree of intelligibility, and they can readily be used to read a given text aloud. For many applications, e.g. public address systems, a reading style is enough to convey the message. However, more recent applications, such as human-machine interaction and speech-to-speech translation, call for TTS systems that are increasingly human-like in their conversational style. The goal of this thesis is to address a few issues involved in building a conversational speech synthesis system.

First, we discuss issues involved in data collection for conversational speech synthesis. It is important that the data be of good quality and also contain conversational characteristics. In this direction we studied two methods: 1) harvesting the world wide web (WWW) for conversational speech corpora, and 2) imitation of natural conversations by professional actors. In the former method, we studied the effect of compression on the performance of TTS systems. Speech data available on the WWW is often in compressed form, mostly using standard compression techniques such as MPEG. Thus, in papers 1 and 2, we systematically studied the effect of MPEG compression on TTS systems. Results showed that the synthesis quality is indeed affected by compression; however, the perceptual differences are significant only when the compression rate is below 32 kbit/s. Even when natural conversational speech can be collected, it is not always suitable for training a TTS system due to problems involved in its production. Thus, in the latter method, we asked whether conversational speech can be imitated by professional actors in recording studios, and studied the speech characteristics of acted and read speech.

Second, we asked whether a technique from the voice conversion field can be borrowed to convert read speech into conversational speech. In paper 3, we proposed a method to transform pitch contours using artificial neural networks. Results indicated that neural networks are able to transform pitch values better than the traditional linear approach.

Finally, we presented a study on laughter synthesis, since non-verbal sounds, particularly laughter, play a prominent role in human communication. In paper 4, we present an experimental comparison of state-of-the-art vocoders for the application of HMM-based laughter synthesis.

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2017, p. 39
Series
TRITA-CSC-A, ISSN 1653-5723 ; 2017:04
Keywords [en]
Speech synthesis, MPEG compression, Voice Conversion, Artificial Neural Net- works, Laughter synthesis, HTS
National Category
Language Technology (Computational Linguistics)
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:kth:diva-198100
ISBN: 978-91-7729-235-7 (print)
OAI: oai:DiVA.org:kth-198100
DiVA, id: diva2:1055532
Presentation
2017-01-19, Fantum, Lindstedtsvägen 24, 5tr, Stockholm, 15:00 (English)
Opponent
Supervisors
Funder
Swedish Research Council, 2013-4935
Note

QC 20161213

Available from: 2016-12-19 Created: 2016-12-12 Last updated: 2018-01-13 Bibliographically approved
List of papers
1. Effect of MPEG audio compression on HMM-based speech synthesis
2013 (English) In: Proceedings of the 14th Annual Conference of the International Speech Communication Association: Interspeech 2013, International Speech Communication Association (ISCA), 2013, p. 1062-1066. Conference paper, Published paper (Refereed)
Abstract [en]

In this paper, the effect of MPEG audio compression on HMM-based speech synthesis is studied. Speech signals are encoded with various compression rates and analyzed using the GlottHMM vocoder. Objective evaluation results show that the vocoder parameters start to degrade from encoding with bitrates of 32 kbit/s or less, which is also confirmed by the subjective evaluation of the vocoder analysis-synthesis quality. Experiments with HMM-based speech synthesis show that the subjective quality of a synthetic voice trained with 32 kbit/s speech is comparable to a voice trained with uncompressed speech, but lower bit rates induce clear degradation in quality.

Series
Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, ISSN 2308-457X
Keywords
GlottHMM, HMM, MP3, Speech synthesis, Audio signal processing, Motion Picture Experts Group standards, Vocoders, Analysis-synthesis, HMM-based speech synthesis, Objective evaluation, Subjective evaluations, Subjective quality, Quality control
National Category
Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-150864 (URN)
2-s2.0-84906262154 (Scopus ID)
Conference
14th Annual Conference of the International Speech Communication Association, INTERSPEECH 2013, 25 August 2013 through 29 August 2013, Lyon, France
Note

QC 20140911

Available from: 2014-09-11 Created: 2014-09-11 Last updated: 2018-01-11 Bibliographically approved
2. Effect of MPEG audio compression on vocoders used in statistical parametric speech synthesis
2014 (English) In: Proceedings of the 22nd European Signal Processing Conference (EUSIPCO), European Signal Processing Conference, EUSIPCO, 2014, p. 1237-1241. Conference paper, Published paper (Refereed)
Abstract [en]

This paper investigates the effect of MPEG audio compression on HMM-based speech synthesis using two state-of-the-art vocoders. Speech signals are first encoded with various compression rates and analyzed using the GlottHMM and STRAIGHT vocoders. Objective evaluation results show that the parameters of both vocoders gradually degrade with increasing compression rates, but with a clear increase in degradation at bit-rates of 32 kbit/s or less. Experiments with HMM-based synthesis with the two vocoders show that the degradation in quality is already perceptible at bit-rates of 32 kbit/s, and both vocoders show a similar trend in degradation with respect to compression ratio. The most perceptible artefacts induced by the compression are spectral distortion and reduced bandwidth, while prosody is better preserved.
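The objective evaluations above compare vocoder parameters extracted from original and compressed speech. A common objective measure for such spectral comparisons in the TTS literature is mel-cepstral distortion (MCD); the sketch below is a generic implementation of that measure, not necessarily the exact one used in the paper, and the frame matrices it takes are hypothetical placeholders for aligned cepstral sequences.

```python
import numpy as np

def mel_cepstral_distortion(c_ref, c_deg):
    """Average mel-cepstral distortion in dB between two time-aligned
    sequences of cepstral frames (shape: frames x coefficients),
    excluding the 0th (energy) coefficient."""
    c_ref = np.asarray(c_ref, dtype=float)
    c_deg = np.asarray(c_deg, dtype=float)
    diff = c_ref[:, 1:] - c_deg[:, 1:]  # drop c0 before comparing
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))
```

Higher values indicate stronger degradation; in practice the frames extracted from the compressed signal would first be time-aligned with those from the original before computing the distortion.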

Place, publisher, year, edition, pages
European Signal Processing Conference, EUSIPCO, 2014
Series
European Signal Processing Conference, ISSN 2219-5491
Keywords
GlottHMM, HMM, MP3, MPEG, Statistical parametric speech synthesis, STRAIGHT
National Category
Signal Processing
Identifiers
urn:nbn:se:kth:diva-157960 (URN)
2-s2.0-84911897440 (Scopus ID)
978-099286261-9 (ISBN)
Conference
22nd European Signal Processing Conference, EUSIPCO 2014, 1 September 2014 through 5 September 2014, Lisbon, Portugal
Note

QC 20141219

Available from: 2014-12-19 Created: 2014-12-18 Last updated: 2016-12-12 Bibliographically approved
3. Non-Linear Pitch Modification in Voice Conversion using Artificial Neural Networks
2013 (English) In: Advances in nonlinear speech processing: 6th International Conference, NOLISP 2013, Mons, Belgium, June 19-21, 2013: proceedings, Springer Berlin/Heidelberg, 2013, p. 97-103. Conference paper, Published paper (Refereed)
Abstract [en]

The majority of current voice conversion methods do not focus on modelling local variations of the pitch contour, but only on linear modification of the pitch values based on means and standard deviations. However, a significant amount of speaker-related information is also present in the pitch contour. In this paper we propose a non-linear pitch modification method for mapping the pitch contours of the source speaker onto those of the target speaker. This work is done within the framework of voice conversion based on Artificial Neural Networks (ANNs). The pitch contours are represented with Discrete Cosine Transform (DCT) coefficients at the segmental level. The results, evaluated using subjective and objective measures, confirm that the proposed method performs better in mimicking the target speaker's speaking style than the linear modification method.
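The linear baseline the paper compares against, and the segmental DCT representation it builds on, can be sketched as follows. Function names and the idea of truncating to the first few DCT coefficients are illustrative; the paper's exact segmentation and the ANN mapping itself are not reproduced here.

```python
import numpy as np

def linear_pitch_transform(f0, src_mean, src_std, tgt_mean, tgt_std):
    """Conventional linear conversion: shift and scale the source pitch
    values so their mean and standard deviation match the target's."""
    return tgt_mean + (f0 - src_mean) * (tgt_std / src_std)

def dct2(x):
    """Orthonormal DCT-II of a pitch-contour segment; keeping only the
    first few coefficients gives a compact shape representation."""
    N = len(x)
    n = np.arange(N)
    C = np.array([np.sum(x * np.cos(np.pi * k * (2 * n + 1) / (2 * N)))
                  for k in range(N)])
    C *= np.sqrt(2.0 / N)
    C[0] /= np.sqrt(2.0)
    return C

def idct2(C):
    """Inverse of dct2: reconstruct the contour from its coefficients."""
    N = len(C)
    k = np.arange(N)
    scale = np.full(N, np.sqrt(2.0 / N))
    scale[0] = np.sqrt(1.0 / N)
    return np.array([np.sum(scale * C * np.cos(np.pi * k * (2 * n + 1) / (2 * N)))
                     for n in range(N)])
```

In the ANN setting, truncated coefficient vectors extracted from source-speaker segments would be mapped to target-speaker coefficient vectors, with the linear transform serving as the baseline.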

Place, publisher, year, edition, pages
Springer Berlin/Heidelberg, 2013
Series
Lecture Notes in Computer Science, ISSN 0302-9743 ; 7911
Keywords
Discrete cosine transform coefficients, Local variations, Modification methods, Pitch modification, Speaking styles, Standard deviation, Subjective and objective measures, Voice conversion
National Category
Computer Sciences
Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-137386 (URN)
2-s2.0-84888246669 (Scopus ID)
Conference
6th International Conference on Advances in Nonlinear Speech Processing, NOLISP 2013; Mons; Belgium; 19 June 2013 through 21 June 2013
Note

QC 20140116

Available from: 2013-12-13 Created: 2013-12-13 Last updated: 2018-01-11 Bibliographically approved
4. A comparative evaluation of vocoding techniques for HMM-based laughter synthesis
2014 (English) Conference paper, Published paper (Refereed)
Abstract [en]

This paper presents an experimental comparison of various leading vocoders for the application of HMM-based laughter synthesis. Four vocoders, commonly used in HMM-based speech synthesis, are used in copy-synthesis and HMM-based synthesis of both male and female laughter. Subjective evaluations are conducted to assess the performance of the vocoders. The results show that all vocoders perform relatively well in copy-synthesis. In HMM-based laughter synthesis using original phonetic transcriptions, all synthesized laughter voices were significantly lower in quality than copy-synthesis, indicating a challenging task and room for improvements. Interestingly, two vocoders using rather simple and robust excitation modeling performed the best, indicating that robustness in speech parameter extraction and simple parameter representation in statistical modeling are key factors in successful laughter synthesis.

Series
International Conference on Acoustics Speech and Signal Processing ICASSP, ISSN 1520-6149
Keywords
Laughter synthesis, vocoder, mel-cepstrum, STRAIGHT, DSM, GlottHMM, HTS, HMM
National Category
Fluid Mechanics and Acoustics
Identifiers
urn:nbn:se:kth:diva-158336 (URN)
10.1109/ICASSP.2014.6853597 (DOI)
000343655300052 ()
2-s2.0-84905269196 (Scopus ID)
978-1-4799-2893-4 (ISBN)
978-147992892-7 (ISBN)
Conference
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 04-09, 2014, Florence, Italy
Note

QC 20150123

Available from: 2015-01-23 Created: 2015-01-07 Last updated: 2016-12-12 Bibliographically approved

Open Access in DiVA

No full text in DiVA


By author/editor
Bollepalli, Bajibabu
By organisation
Speech Communication and Technology