Publications (10 of 61)
Zhang, C., Oztireli, C., Mandt, S. & Salvi, G. (2019). Active Mini-Batch Sampling Using Repulsive Point Processes. Paper presented at the 33rd AAAI Conference on Artificial Intelligence / 31st Innovative Applications of Artificial Intelligence Conference / 9th AAAI Symposium on Educational Advances in Artificial Intelligence, Honolulu, HI, January 27-February 1, 2019 (pp. 5741-5748). Association for the Advancement of Artificial Intelligence (AAAI)
2019 (English). Conference paper, Published paper (Refereed)
Abstract [en]

The convergence speed of stochastic gradient descent (SGD) can be improved by actively selecting mini-batches. We explore sampling schemes where similar data points are less likely to be selected in the same mini-batch. In particular, we prove that such repulsive sampling schemes lower the variance of the gradient estimator. This generalizes recent work on using Determinantal Point Processes (DPPs) for mini-batch diversification (Zhang et al., 2017) to the broader class of repulsive point processes. We first show that the phenomenon of variance reduction by diversified sampling generalizes in particular to non-stationary point processes. We then show that other point processes may be computationally much more efficient than DPPs. In particular, we propose and investigate Poisson Disk sampling, frequently encountered in the computer graphics community, for this task. We show empirically that our approach improves over standard SGD both in terms of convergence speed and final model performance.
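To make the idea concrete, here is a minimal, hypothetical sketch (not the authors' implementation) of Poisson-disk-style mini-batch selection: candidates are accepted greedily only if their feature-space distance to every point already in the batch exceeds a repulsion radius, so similar points rarely end up in the same mini-batch. All function names and parameter values are illustrative.

```python
# Hypothetical sketch of Poisson-disk-style mini-batch selection, not the
# authors' implementation: greedily accept candidates whose feature-space
# distance to every point already in the batch exceeds a repulsion radius.
import numpy as np

def poisson_disk_minibatch(features, batch_size, radius, rng=None):
    """Select a mini-batch in which no two points are closer than `radius`."""
    rng = np.random.default_rng() if rng is None else rng
    order = rng.permutation(len(features))          # dart-throwing order
    chosen = []
    for i in order:
        if all(np.linalg.norm(features[i] - features[j]) > radius for j in chosen):
            chosen.append(i)
        if len(chosen) == batch_size:
            break
    # Fall back to random points if the radius was too large to fill the batch.
    while len(chosen) < batch_size:
        i = rng.integers(len(features))
        if i not in chosen:
            chosen.append(i)
    return np.array(chosen)

# Example: 1000 points in 2-D, mini-batches of 32 with repulsion radius 0.05.
X = np.random.default_rng(0).random((1000, 2))
batch = poisson_disk_minibatch(X, batch_size=32, radius=0.05)
```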

Place, publisher, year, edition, pages
Association for the Advancement of Artificial Intelligence (AAAI), 2019
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:kth:diva-261984 (URN), 000486572500033 (ISI)
Conference
33rd AAAI Conference on Artificial Intelligence / 31st Innovative Applications of Artificial Intelligence Conference / 9th AAAI Symposium on Educational Advances in Artificial Intelligence, Honolulu, HI, January 27-February 1, 2019
Note

QC 20191011

Available from: 2019-10-11 Created: 2019-10-11 Last updated: 2019-10-11. Bibliographically approved.
Saponaro, G., Jamone, L., Bernardino, A. & Salvi, G. (2019). Beyond the Self: Using Grounded Affordances to Interpret and Describe Others’ Actions. IEEE Transactions on Cognitive and Developmental Systems
2019 (English). In: IEEE Transactions on Cognitive and Developmental Systems, ISSN 2379-8920. Article in journal (Refereed), Published
Abstract [en]

We propose a developmental approach that allows a robot to interpret and describe the actions of human agents by reusing previous experience. The robot first learns the association between words and object affordances by manipulating the objects in its environment. It then uses this information to learn a mapping between its own actions and those performed by a human in a shared environment. It finally fuses the information from these two models to interpret and describe human actions in light of its own experience. In our experiments, we show that the model can be used flexibly to do inference on different aspects of the scene. We can predict the effects of an action on the basis of object properties. We can revise the belief that a certain action occurred, given the observed effects of the human action. In an early action recognition fashion, we can anticipate the effects when the action has only been partially observed. By estimating the probability of words given the evidence and feeding them into a pre-defined grammar, we can generate relevant descriptions of the scene. We believe that this is a step towards providing robots with the fundamental skills to engage in social collaboration with humans.
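As a toy illustration of the kind of probabilistic belief revision described above, and not the authors' model, the snippet below applies Bayes' rule to update a uniform prior over three hypothetical actions given an observed effect; all action names and probabilities are invented for illustration.

```python
# Minimal, hypothetical sketch of the inference described above: revising the
# belief over which action occurred from its observed effect, via Bayes' rule
# with made-up conditional probabilities.
actions = ["grasp", "tap", "touch"]
prior = {"grasp": 1/3, "tap": 1/3, "touch": 1/3}

# P(effect = "object moved" | action); illustrative numbers only.
likelihood_moved = {"grasp": 0.7, "tap": 0.9, "touch": 0.1}

unnormalized = {a: prior[a] * likelihood_moved[a] for a in actions}
total = sum(unnormalized.values())
posterior = {a: p / total for a, p in unnormalized.items()}
print(posterior)   # belief over actions after observing that the object moved
```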

National Category
Computer and Information Sciences
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-241540 (URN), 10.1109/TCDS.2018.2882140 (DOI)
Note

QC 20190125

Available from: 2019-01-23 Created: 2019-01-23 Last updated: 2019-09-04. Bibliographically approved.
Stefanov, K., Salvi, G., Kontogiorgos, D., Kjellström, H. & Beskow, J. (2019). Modeling of Human Visual Attention in Multiparty Open-World Dialogues. ACM Transactions on Human-Robot Interaction, 8(2), Article 8.
2019 (English). In: ACM Transactions on Human-Robot Interaction, ISSN 2573-9522, Vol. 8, no. 2, article id 8. Article in journal (Refereed), Published
Abstract [en]

This study proposes, develops, and evaluates methods for modeling the eye-gaze direction and head orientation of a person in multiparty open-world dialogues, as a function of low-level communicative signals generated by his or her interlocutors. These signals include speech activity, eye-gaze direction, and head orientation, all of which can be estimated in real time during the interaction. By utilizing these signals and novel data representations suitable for the task and context, the developed methods can generate plausible candidate gaze targets in real time. The methods are based on Feedforward Neural Networks and Long Short-Term Memory Networks. The proposed methods are developed using several hours of unrestricted interaction data, and their performance is compared with a heuristic baseline method. The study offers an extensive evaluation of the proposed methods that investigates the contribution of different predictors to the accurate generation of candidate gaze targets. The results show that the methods can accurately generate candidate gaze targets when the person being modeled is in a listening state. However, when the person being modeled is in a speaking state, the proposed methods yield significantly lower performance.
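The sketch below is a minimal PyTorch stand-in, not the authors' architecture, for an LSTM that maps per-frame interlocutor signals (speech activity, gaze direction, head orientation) to logits over candidate gaze targets; the feature dimension, hidden size, and number of targets are assumptions.

```python
# Hypothetical LSTM sketch for per-frame gaze-target prediction; sizes are
# illustrative assumptions, not the paper's configuration.
import torch
import torch.nn as nn

class GazeTargetLSTM(nn.Module):
    def __init__(self, n_features=12, hidden=64, n_targets=5):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_targets)

    def forward(self, x):                 # x: (batch, time, n_features)
        out, _ = self.lstm(x)
        return self.head(out)             # per-frame logits over gaze targets

model = GazeTargetLSTM()
frames = torch.randn(8, 100, 12)          # 8 sequences of 100 frames
logits = model(frames)                    # shape: (8, 100, 5)
```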

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2019
Keywords
Human-human interaction, open-world dialogue, eye-gaze direction, head orientation, multiparty
National Category
Computer Vision and Robotics (Autonomous Systems)
Identifiers
urn:nbn:se:kth:diva-255203 (URN), 10.1145/3323231 (DOI), 000472066800003 (ISI)
Note

QC 20190904

Available from: 2019-09-04 Created: 2019-09-04 Last updated: 2019-10-15. Bibliographically approved.
Stefanov, K. (2019). Self-Supervised Vision-Based Detection of the Active Speaker as Support for Socially-Aware Language Acquisition. IEEE Transactions on Cognitive and Developmental Systems
2019 (English). In: IEEE Transactions on Cognitive and Developmental Systems, ISSN 2379-8920. Article in journal (Refereed), Published
Abstract [en]

This paper presents a self-supervised method for visual detection of the active speaker in a multi-person spoken interaction scenario. Active speaker detection is a fundamental prerequisite for any artificial cognitive system attempting to acquire language in social settings. The proposed method is intended to complement the acoustic detection of the active speaker, thus improving the system robustness in noisy conditions. The method can detect an arbitrary number of possibly overlapping active speakers based exclusively on visual information about their faces. Furthermore, the method does not rely on external annotations, in keeping with a cognitive-developmental approach. Instead, it uses information from the auditory modality to support learning in the visual domain. This paper reports an extensive evaluation of the proposed method using a large multi-person face-to-face interaction dataset. The results show good performance in a speaker-dependent setting. However, in a speaker-independent setting the proposed method yields a significantly lower performance. We believe that the proposed method represents an essential component of any artificial cognitive system or robotic platform engaging in social interactions.
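A minimal sketch of the self-supervision idea, under the assumption (not stated in the abstract) that per-speaker voice activity comes from thresholding microphone energy: the audio modality yields frame-level speaking/not-speaking targets that supervise a purely visual classifier, with a logistic regression here standing in for the actual visual model. Data, shapes, and function names are invented.

```python
# Hypothetical sketch of audio-derived pseudo-labels supervising a visual
# active-speaker classifier; not the paper's code or data.
import numpy as np
from sklearn.linear_model import LogisticRegression

def audio_pseudo_labels(energy, threshold):
    """Frame-level speaking/not-speaking targets from microphone energy."""
    return (energy > threshold).astype(np.float32)

# Toy data: 500 frames of mic energy and matching 32x32 grayscale face crops.
rng = np.random.default_rng(0)
energy = rng.random(500)
faces = rng.random((500, 32 * 32)).astype(np.float32)
labels = audio_pseudo_labels(energy, threshold=0.5)

# Any visual classifier can now be trained on (faces, labels) without manual
# annotation; logistic regression is a stand-in for the learned visual model.
clf = LogisticRegression(max_iter=1000).fit(faces, labels)
```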

National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-260126 (URN), 10.1109/TCDS.2019.2927941 (DOI), 2-s2.0-85069908129 (Scopus ID)
Note

QC 20191011

Available from: 2019-09-25 Created: 2019-09-25 Last updated: 2019-10-11. Bibliographically approved.
Salvi, G. (2016). An Analysis of Shallow and Deep Representations of Speech Based on Unsupervised Classification of Isolated Words. In: Recent Advances in Nonlinear Speech Processing. Paper presented at the 7th International Workshop on Nonlinear Speech Processing (NOLISP), May 18-20, 2015, Vietri sul Mare, Italy (pp. 151-157). Springer, Vol. 48
2016 (English). In: Recent Advances in Nonlinear Speech Processing, Springer, 2016, Vol. 48, p. 151-157. Conference paper, Published paper (Refereed)
Abstract [en]

We analyse the properties of shallow and deep representations of speech. Mel frequency cepstral coefficients (MFCCs) are compared to representations learned by a four-layer Deep Belief Network (DBN) in terms of discriminative power and invariance to irrelevant factors such as speaker identity or gender. To avoid the influence of supervised statistical modelling, an unsupervised isolated word classification task is used for the comparison. The deep representations are also obtained with unsupervised training (no back-propagation pass is performed). The results show that DBN features provide a more concise clustering and a higher match between clusters and word categories in terms of adjusted Rand score. Some of the confusions present with the MFCC features are, however, retained even with the DBN features.
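The adjusted Rand score used for the comparison measures agreement between an unsupervised clustering and the true word categories. The toy example below, with placeholder labels rather than the paper's data, shows how it rewards the tighter cluster-to-word match reported for the DBN features.

```python
# Toy illustration of the adjusted Rand score between cluster assignments and
# word categories; labels are invented placeholders, not the paper's data.
from sklearn.metrics import adjusted_rand_score

true_words    = ["one", "one", "two", "two", "three", "three"]
mfcc_clusters = [0, 1, 1, 2, 2, 2]   # hypothetical clustering of MFCC features
dbn_clusters  = [0, 0, 1, 1, 2, 2]   # hypothetical clustering of DBN features

print(adjusted_rand_score(true_words, mfcc_clusters))  # lower agreement
print(adjusted_rand_score(true_words, dbn_clusters))   # perfect agreement -> 1.0
```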

Place, publisher, year, edition, pages
Springer, 2016
Series
Smart Innovation, Systems and Technologies, ISSN 2190-3018; 48
Keywords
Deep learning, Representations, Hierarchical clustering
National Category
Computer Sciences; Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-180414 (URN), 10.1007/978-3-319-28109-4_15 (DOI), 000417253600015 (ISI), 2-s2.0-84955471729 (Scopus ID), 978-3-319-28109-4 (ISBN), 978-3-319-28107-0 (ISBN)
Conference
7th International Workshop on Nonlinear Speech Processing (NOLISP), May 18-20, 2015, Vietri sul Mare, Italy
Note

QC 20160615

Available from: 2016-01-13 Created: 2016-01-13 Last updated: 2018-01-10. Bibliographically approved.
Strömbergsson, S., Salvi, G. & House, D. (2015). Acoustic and perceptual evaluation of category goodness of /t/ and /k/ in typical and misarticulated children's speech. Journal of the Acoustical Society of America, 137(6), 3422-3435
2015 (English). In: Journal of the Acoustical Society of America, ISSN 0001-4966, E-ISSN 1520-8524, Vol. 137, no. 6, p. 3422-3435. Article in journal (Refereed), Published
Abstract [en]

This investigation explores perceptual and acoustic characteristics of children's successful and unsuccessful productions of /t/ and /k/, with the specific aim of exploring perceptual sensitivity to phonetic detail, and the extent to which this sensitivity is reflected in the acoustic domain. Recordings were collected from 4- to 8-year-old children with a speech sound disorder (SSD) who misarticulated one of the target plosives, and compared to productions recorded from peers with typical speech development (TD). Perceptual responses were registered on a visual-analog scale ranging from "clear [t]" to "clear [k]". Statistical models of prototypical productions were built, based on spectral moments and discrete cosine transform features, and used in the scoring of SSD productions. In the perceptual evaluation, "clear substitutions" were rated as less prototypical than correct productions. Moreover, target-appropriate productions of /t/ and /k/ produced by children with SSD were rated as less prototypical than those produced by TD peers. The acoustic modeling could to a large extent discriminate between the gross categories /t/ and /k/, and scored the SSD utterances on a continuous scale that was largely consistent with the category of production. However, none of the methods exhibited the same sensitivity to phonetic detail as the human listeners.
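As a rough sketch of the spectral-moment features mentioned above, and not the paper's exact implementation, the code below treats the windowed magnitude spectrum of a plosive burst as a distribution over frequency and computes its centroid, spread, skewness, and kurtosis; the toy burst and sampling rate are assumptions.

```python
# Hypothetical sketch of spectral-moment features for a plosive burst; the
# signal and sampling rate are toy assumptions.
import numpy as np

def spectral_moments(frame, sample_rate):
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    p = spectrum / spectrum.sum()                              # distribution over frequency
    centroid = np.sum(p * freqs)                               # 1st moment (mean)
    spread = np.sqrt(np.sum(p * (freqs - centroid) ** 2))      # 2nd moment (std dev)
    skew = np.sum(p * (freqs - centroid) ** 3) / spread ** 3   # 3rd moment (skewness)
    kurt = np.sum(p * (freqs - centroid) ** 4) / spread ** 4   # 4th moment (kurtosis)
    return centroid, spread, skew, kurt

# Toy burst: 10 ms of noise at 16 kHz standing in for a /t/ or /k/ release.
burst = np.random.default_rng(0).standard_normal(160)
print(spectral_moments(burst, sample_rate=16000))
```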

National Category
Fluid Mechanics and Acoustics
Identifiers
urn:nbn:se:kth:diva-171155 (URN), 10.1121/1.4921033 (DOI), 000356622400057 (ISI), 26093431 (PubMedID), 2-s2.0-84935019965 (Scopus ID)
Note

QC 20150720

Available from: 2015-07-20 Created: 2015-07-20 Last updated: 2017-12-04. Bibliographically approved.
Lopes, J., Salvi, G., Skantze, G., Abad, A., Gustafson, J., Batista, F., . . . Trancoso, I. (2015). Detecting Repetitions in Spoken Dialogue Systems Using Phonetic Distances. In: INTERSPEECH-2015. Paper presented at INTERSPEECH 2015, Dresden, Germany (pp. 1805-1809).
2015 (English). In: INTERSPEECH-2015, 2015, p. 1805-1809. Conference paper, Published paper (Refereed)
Abstract [en]

Repetitions in Spoken Dialogue Systems can be a symptom of problematic communication. Such repetitions are often due to speech recognition errors, which in turn make it harder to use the output of the speech recognizer to detect repetitions. In this paper, we combine the alignment score obtained using phonetic distances with dialogue-related features to improve repetition detection. To evaluate the proposed method, we compare several alignment techniques, from edit distance to DTW-based distance, previously used in Spoken Term Detection tasks. We also compare two different methods to compute the phonetic distance: the first using the phoneme sequence, and the second using the distance between the phone posterior vectors. Two different datasets were used in this evaluation: a bus-schedule information system (in English) and a call routing system (in Swedish). The results show that approaches using phoneme distances outperform approaches using Levenshtein distances between ASR outputs for repetition detection.
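The simplest of the compared alignment techniques, an edit distance over recognized phoneme sequences, can be sketched as follows; the phoneme strings are invented examples, not data from the paper.

```python
# Hypothetical sketch of a Levenshtein (edit) distance between the phoneme
# sequences recognized for two consecutive user turns, used as a cue that the
# second turn repeats the first.
def levenshtein(a, b):
    """Edit distance between two phoneme sequences (lists of phoneme symbols)."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        curr = [i]
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

turn1 = ["f", "ao", "r", "t", "iy", "f", "ay", "v"]   # toy phoneme sequence
turn2 = ["f", "ao", "r", "d", "iy", "f", "ay", "v"]   # near repetition
print(levenshtein(turn1, turn2))                      # small distance -> likely repeat
```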

National Category
Computer Sciences; Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-180405 (URN), 000380581600375 (ISI), 2-s2.0-84959138120 (Scopus ID), 978-1-5108-1790-6 (ISBN)
Conference
INTERSPEECH-2015, Dresden, Germany
Note

QC 20160216

Available from: 2016-01-13 Created: 2016-01-13 Last updated: 2018-01-10. Bibliographically approved.
Pieropan, A., Salvi, G., Pauwels, K. & Kjellström, H. (2014). A dataset of human manipulation actions. In: ICRA 2014 Workshop on Autonomous Grasping and Manipulation: An Open Challenge. Paper presented at the IEEE International Conference on Robotics and Automation: International Workshop on Autonomous Grasping and Manipulation - An Open Challenge, Hong Kong, China, 2014.
2014 (English). In: ICRA 2014 Workshop on Autonomous Grasping and Manipulation: An Open Challenge, Hong Kong, China, 2014. Conference paper, Published paper (Refereed)
Abstract [en]

We present a dataset of human activities that includes both visual data (RGB-D video and six Degrees Of Freedom (DOF) object pose estimation) and acoustic data. Our vision is that robots need to merge information from multiple perceptual modalities to operate robustly and autonomously in an unstructured environment.

Place, publisher, year, edition, pages
Hong Kong, China, 2014
National Category
Computer Sciences; Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-158174 (URN)
Conference
IEEE International Conference on Robotics and Automation: International Workshop on Autonomous Grasping and Manipulation - An Open Challenge, Hong Kong, China, 2014
Note

QC 20150205

Available from: 2014-12-30 Created: 2014-12-30 Last updated: 2018-01-11. Bibliographically approved.
Pieropan, A., Salvi, G., Pauwels, K. & Kjellström, H. (2014). Audio-Visual Classification and Detection of Human Manipulation Actions. In: 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2014). Paper presented at the 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2014, Palmer House Hilton Hotel, Chicago, United States, 14-18 September 2014 (pp. 3045-3052). IEEE conference proceedings
2014 (English). In: 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2014), IEEE conference proceedings, 2014, p. 3045-3052. Conference paper, Published paper (Refereed)
Abstract [en]

Humans are able to merge information from multiple perceptual modalities and formulate a coherent representation of the world. Our thesis is that robots need to do the same in order to operate robustly and autonomously in an unstructured environment. It has also been shown in several fields that multiple sources of information can complement each other, overcoming the limitations of a single perceptual modality. Hence, in this paper we introduce a dataset of actions that includes both visual data (RGB-D video and 6DOF object pose estimation) and acoustic data. We also propose a method for recognizing and segmenting actions from continuous audio-visual data. The proposed method is employed for an extensive evaluation of the descriptive power of the two modalities, and we discuss how they can be used jointly to infer a coherent interpretation of the recorded action.

Place, publisher, year, edition, pages
IEEE conference proceedings, 2014
Series
IEEE International Conference on Intelligent Robots and Systems, ISSN 2153-0858
Keywords
Acoustic data, Audio-visual, Audio-visual data, Coherent representations, Human manipulation, Multiple source, Unstructured environments, Visual data
National Category
Computer Vision and Robotics (Autonomous Systems)
Identifiers
urn:nbn:se:kth:diva-158004 (URN), 10.1109/IROS.2014.6942983 (DOI), 000349834603023 (ISI), 2-s2.0-84911478073 (Scopus ID), 978-1-4799-6934-0 (ISBN)
Conference
2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2014, Palmer House Hilton Hotel, Chicago, United States, 14-18 September 2014
Note

QC 20150122

Available from: 2014-12-18 Created: 2014-12-18 Last updated: 2018-01-11. Bibliographically approved.
Vanhainen, N. & Salvi, G. (2014). Free Acoustic and Language Models for Large Vocabulary Continuous Speech Recognition in Swedish. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14). Paper presented at the Ninth International Conference on Language Resources and Evaluation (LREC'14), May 26-31, 2014, Reykjavik, Iceland.
2014 (English). In: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), 2014. Conference paper, Published paper (Refereed)
Abstract [en]

This paper presents results for large vocabulary continuous speech recognition (LVCSR) in Swedish. We trained acoustic models on the public domain NST Swedish corpus and made them freely available to the community. The training procedure corresponds to the reference recogniser (RefRec) developed for the SpeechDat databases during the COST249 action. We describe the modifications we made to the procedure in order to train on the NST database, and the language models we created based on the N-gram data available at the Norwegian Language Council. Our tests include medium-vocabulary isolated word recognition and LVCSR. Because no previous results are available for LVCSR in Swedish, we use as a baseline the performance of the SpeechDat models on the same tasks. We also compare our best results to the ones obtained in similar conditions on resource-rich languages such as American English. We tested the acoustic models with HTK and Julius, and plan to make them available in CMU Sphinx format as well in the near future. We believe that the free availability of these resources will boost research in speech and language technology for Swedish, even in research groups that do not have the resources to develop ASR systems.

Keywords
Speech Resource, Database, Language Modelling
National Category
Computer Sciences; Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-158151 (URN), 000355611001171 (ISI), 978-2-9517408-8-4 (ISBN)
Conference
Ninth International Conference on Language Resources and Evaluation (LREC'14), May 26-31, 2014, Reykjavik, Iceland
Note

QC 20150211

Available from: 2014-12-30 Created: 2014-12-30 Last updated: 2018-01-11. Bibliographically approved.
Identifiers
ORCID iD: orcid.org/0000-0002-3323-5311
