Automatic speaker verification on site and by telephone: methods, applications and assessment
2006 (English). Doctoral thesis, monograph (Other scientific)
Speaker verification is the biometric task of authenticating a claimed identity by means of analyzing a spoken sample of the claimant's voice. The present thesis deals with various topics related to automatic speaker verification (ASV) in the context of its commercial applications, characterized by co-operative users, user-friendly interfaces, and requirements for small amounts of enrollment and test data.
A text-dependent system based on hidden Markov models (HMM) was developed and used to conduct experiments, including a comparison between visual and aural strategies for prompting claimants for randomized digit strings. It was found that aural prompts lead to more errors in spoken responses and that visually prompted utterances performed marginally better in ASV, given that enrollment data were visually prompted. High-resolution flooring techniques were proposed for variance estimation in the HMMs, but results showed no improvement over the standard method of using target-independent variances copied from a background model. These experiments were performed on Gandalf, a Swedish speaker verification telephone corpus with 86 client speakers.
A complete on-site application (PER), a physical access control system securing a gate in a reverberant stairway, was implemented based on a combination of the HMM-based system and a Gaussian mixture model-based system. Users were authenticated by saying their proper name and a visually prompted, random sequence of digits after having enrolled by speaking ten utterances of the same type. An evaluation was conducted with 54 out of 56 clients who succeeded in enrolling. Semi-dedicated impostor attempts were also collected. An equal error rate (EER) of 2.4% was found for this system based on a single attempt per session and after retraining the system on PER-specific development data. On parallel telephone data collected using a telephone version of PER, 3.5% EER was found with landline and around 5% with mobile telephones. Impostor attempts in this case were same-handset attempts. Results also indicate that the distributions of false reject and false accept rates over target speakers are well described by beta distributions. A state-of-the-art commercial system was also tested on PER data, with performance similar to that of the baseline research system.
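The equal error rate quoted above is the operating point at which the false accept rate and the false reject rate coincide. The abstract does not say how the EER was computed, so the following is only a minimal sketch of one common approach, assuming a claim is accepted when its score is at or above a threshold. The function name `equal_error_rate` and the example scores are hypothetical and are not taken from the thesis.

```python
def equal_error_rate(target_scores, impostor_scores):
    """Estimate the EER by sweeping the threshold over all observed scores.

    Assumption (not from the thesis): a claim is accepted when
    score >= threshold. Returns (eer, threshold), where eer is the mean of
    the false accept and false reject rates at the threshold where the two
    rates are closest to each other.
    """
    if not target_scores or not impostor_scores:
        raise ValueError("both score lists must be non-empty")

    candidates = sorted(set(target_scores) | set(impostor_scores))
    best = None  # (gap, eer, threshold)
    for thr in candidates:
        # False reject: a true speaker scoring below the threshold.
        frr = sum(s < thr for s in target_scores) / len(target_scores)
        # False accept: an impostor scoring at or above the threshold.
        far = sum(s >= thr for s in impostor_scores) / len(impostor_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, thr)
    return best[1], best[2]


# Invented scores, for illustration only.
targets = [2.0, 3.0, 4.0, 5.0]
impostors = [0.0, 1.0, 1.5, 2.5]
eer, threshold = equal_error_rate(targets, impostors)
print(f"EER = {eer:.3f} at threshold {threshold}")
```

With few trials the two error rates rarely cross exactly, which is why the sketch reports the midpoint at the closest threshold; evaluations on larger corpora often interpolate between neighbouring thresholds instead.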
Place, publisher, year, edition, pages
Stockholm: KTH, 2006, xvii, 332 p.
Trita-CSC-A, ISSN 1653-5723 ; 2006:26
speaker recognition, speaker verification, speech technology, biometrics, access control, speech corpus, variance estimation
Language Technology (Computational Linguistics)
Identifiers: URN: urn:nbn:se:kth:diva-4242; ISBN: 91-7178-531-0; ISBN: 978-91-7178-531-2; OAI: oai:DiVA.org:kth-4242; DiVA: diva2:11420
2006-12-19, F3, Lindstedtsvägen 26, Stockholm, 14:00
Bigün, Josef, Professor
Granström, Björn; Blomberg, Mats
QC 20100910; 2006-12-15; 2006-12-15; 2010-09-10; Bibliographically approved