Publications (9 of 9)
Penamakuri, A. S., Chhatre, K. & Jain, A. (2025). Audiopedia: Audio QA with Knowledge. In: International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2025. Paper presented at the International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2025, Hyderabad, India, April 6-11, 2025. Hyderabad, India
Audiopedia: Audio QA with Knowledge
2025 (English). In: International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2025, Hyderabad, India, 2025. Conference paper, Published paper (Refereed)
Abstract [en]

Audiopedia, a novel Audio Question Answering (AQA) task requiring both audio comprehension and external knowledge reasoning, is introduced together with its three subtasks: s-AQA, m-AQA, and r-AQA. Additionally, a framework that combines Audio Entity Linking (AEL) with a Knowledge-Augmented Audio Multimodal Model (KA2LM) is proposed to enhance large audio language models for knowledge-intensive tasks.
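To illustrate the kind of knowledge-augmented pipeline the abstract describes, here is a minimal Python sketch of a retrieve-then-prompt flow. The function names (link_entities, audio_llm), the knowledge-base structure, and the prompt format are assumptions made for illustration only; they are not the paper's released interfaces.

```python
# Hypothetical sketch of a knowledge-augmented audio QA pipeline: link entities
# heard in an audio clip to a knowledge base, then prepend the retrieved facts
# to the question before querying an audio-language model.
from typing import Callable, Dict, List

def knowledge_augmented_audio_qa(
    audio_path: str,
    question: str,
    link_entities: Callable[[str], List[str]],   # Audio Entity Linking step (assumed interface)
    knowledge_base: Dict[str, str],              # entity -> fact snippet (assumed structure)
    audio_llm: Callable[[str, str], str],        # (audio, prompt) -> answer (assumed interface)
) -> str:
    entities = link_entities(audio_path)
    facts = [knowledge_base[e] for e in entities if e in knowledge_base]
    prompt = "Known facts: " + " ".join(facts) + "\nQuestion: " + question
    return audio_llm(audio_path, prompt)
```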

Place, publisher, year, edition, pages
Hyderabad, India, 2025
Keywords
audio question answering, knowledge-intensive questions, audio entity linking
National Category
Computer Systems
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-358060 (URN)
Conference
International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2025, Hyderabad, India, April 06-11 2025
Available from: 2025-01-05 Created: 2025-01-05 Last updated: 2025-01-14. Bibliographically approved
Chhatre, K., Guarese, R., Matviienko, A. & Peters, C. (2025). Evaluating Speech and Video Models for Face-Body Congruence. In: I3D Companion '25: Companion Proceedings of the ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games. Paper presented at the ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games (I3D 2025), NJIT, Jersey City, NJ, USA, 7-9 May 2025. Association for Computing Machinery (ACM)
Evaluating Speech and Video Models for Face-Body Congruence
2025 (English). In: I3D Companion '25: Companion Proceedings of the ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games, Association for Computing Machinery (ACM), 2025. Conference paper, Poster (with or without abstract) (Refereed)
Abstract [en]

Animations produced by generative models are often evaluated using objective quantitative metrics that do not fully capture perceptual effects in immersive virtual environments. To address this gap, we present a preliminary perceptual evaluation of generative models for animation synthesis, conducted via a VR-based user study (N = 48). Our investigation specifically focuses on animation congruency—ensuring that generated facial expressions and body gestures are both congruent with and synchronized to driving speech. We evaluated two state-of-the-art methods: a speech-driven full-body animation model and a video-driven full-body reconstruction model, assessing their capability to produce congruent facial expressions and body gestures. Our results demonstrate a strong user preference for combined facial and body animations, highlighting that congruent multimodal animations significantly enhance perceived realism compared to animations featuring only a single modality. By incorporating VR-based perceptual feedback into training pipelines, our approach provides a foundation for developing more engaging and responsive virtual characters.

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2025
Keywords
Computer graphics, Animation
National Category
Computer graphics and computer vision
Identifiers
urn:nbn:se:kth:diva-363248 (URN)
10.1145/3722564.3728374 (DOI)
001502592200005 (ISI)
Conference
ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games (I3D 2025), NJIT, Jersey City, NJ, USA, 7-9 May 2025
Note

Part of ISBN 9798400718335

QC 20250509

Available from: 2025-05-09 Created: 2025-05-09 Last updated: 2025-08-15. Bibliographically approved
Chhatre, K., Guarese, R., Matviienko, A. & Peters, C. (2025). Evaluation of generative models for emotional 3D animation generation in VR. Frontiers in Computer Science, 7, Article ID 1598099.
Evaluation of generative models for emotional 3D animation generation in VR
2025 (English). In: Frontiers in Computer Science, E-ISSN 2624-9898, Vol. 7, article id 1598099. Article in journal (Refereed). Published
Abstract [en]

Introduction: Social interactions incorporate various nonverbal signals to convey emotions alongside speech, including facial expressions and body gestures. Generative models have demonstrated promising results in creating full-body nonverbal animations synchronized with speech; however, evaluations using statistical metrics in 2D settings fail to fully capture user-perceived emotions, limiting our understanding of the effectiveness of these models.

Methods: To address this, we evaluate emotional 3D animation generative models within an immersive Virtual Reality (VR) environment, emphasizing user-centric metrics (emotional arousal, realism, naturalness, enjoyment, diversity, and interaction quality) in a real-time human-agent interaction scenario. Through a user study (N = 48), we systematically examine perceived emotional quality for three state-of-the-art speech-driven 3D animation methods across two specific emotions: happiness (high arousal) and neutral (mid arousal). Additionally, we compare these generative models against real human expressions obtained via a reconstruction-based method to assess their strengths and limitations and how closely they replicate real human facial and body expressions.

Results: Our results demonstrate that methods explicitly modeling emotions lead to higher recognition accuracy compared to those focusing solely on speech-driven synchrony. Users rated the realism and naturalness of happy animations significantly higher than those of neutral animations, highlighting the limitations of current generative models in handling subtle emotional states.

Discussion: Generative models underperformed compared to reconstruction-based methods in facial expression quality, and all methods received relatively low ratings for animation enjoyment and interaction quality, emphasizing the importance of incorporating user-centric evaluations into generative model development. Finally, participants positively recognized animation diversity across all generative models.

Place, publisher, year, edition, pages
Frontiers Media SA, 2025
Keywords
3D emotional animation, generative models, nonverbal communication, user-centric evaluation, virtual reality
National Category
Human Computer Interaction; Computer Sciences
Identifiers
urn:nbn:se:kth:diva-369923 (URN)
10.3389/fcomp.2025.1598099 (DOI)
001549678200001 (ISI)
2-s2.0-105013367950 (Scopus ID)
Note

QC 20250918

Available from: 2025-09-18 Created: 2025-09-18 Last updated: 2025-09-18. Bibliographically approved
Chhatre, K., Daněček, R., Athanasiou, N., Becherini, G., Peters, C., Black, M. J. & Bolkart, T. (2024). Emotional Speech-driven 3D Body Animation via Disentangled Latent Diffusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Paper presented at the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 16-22, 2024, Seattle, WA, USA (pp. 1942-1953). Institute of Electrical and Electronics Engineers (IEEE)
Emotional Speech-driven 3D Body Animation via Disentangled Latent Diffusion
2024 (English). In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Institute of Electrical and Electronics Engineers (IEEE), 2024, p. 1942-1953. Conference paper, Published paper (Refereed)
Abstract [en]

Existing methods for synthesizing 3D human gestures from speech have shown promising results, but they do not explicitly model the impact of emotions on the generated gestures. Instead, these methods directly output animations from speech without control over the expressed emotion. To address this limitation, we present AMUSE, an emotional speech-driven body animation model based on latent diffusion. Our observation is that content (i.e., gestures related to speech rhythm and word utterances), emotion, and personal style are separable. To account for this, AMUSE maps the driving audio to three disentangled latent vectors: one for content, one for emotion, and one for personal style. A latent diffusion model trained to generate gesture motion sequences is then conditioned on these latent vectors. Once trained, AMUSE synthesizes 3D human gestures directly from speech with control over the expressed emotions and style by combining the content from the driving speech with the emotion and style of another speech sequence. Randomly sampling the noise of the diffusion model further generates variations of the gesture with the same emotional expressivity. Qualitative, quantitative, and perceptual evaluations demonstrate that AMUSE outputs realistic gesture sequences. Compared to the state of the art, the generated gestures are better synchronized with the speech content and better represent the emotion expressed by the input speech. Our code is available at amuse.is.tue.mpg.de.
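As a rough illustration of the disentangled conditioning described above, the following PyTorch sketch conditions a latent diffusion denoiser on separate content, emotion, and style vectors. It is not the authors' released implementation (available at amuse.is.tue.mpg.de); the module names, dimensions, and transformer backbone are assumptions chosen for brevity.

```python
# Minimal sketch (assumed architecture, not the AMUSE codebase): a denoiser for
# latent gesture sequences conditioned on three disentangled speech latents.
import torch
import torch.nn as nn

class GestureDenoiser(nn.Module):
    """Predicts the noise added to a latent gesture sequence, conditioned on
    content, emotion, and style vectors plus the diffusion timestep."""
    def __init__(self, latent_dim=256, cond_dim=128):
        super().__init__()
        # Separate projections keep the three conditioning signals distinct.
        self.content_proj = nn.Linear(cond_dim, latent_dim)
        self.emotion_proj = nn.Linear(cond_dim, latent_dim)
        self.style_proj = nn.Linear(cond_dim, latent_dim)
        self.time_embed = nn.Sequential(nn.Linear(1, latent_dim), nn.SiLU())
        layer = nn.TransformerEncoderLayer(d_model=latent_dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.out = nn.Linear(latent_dim, latent_dim)

    def forward(self, noisy_latents, t, content, emotion, style):
        # noisy_latents: (batch, frames, latent_dim); t: (batch,); conditions: (batch, cond_dim)
        cond = (self.content_proj(content) + self.emotion_proj(emotion)
                + self.style_proj(style) + self.time_embed(t.float().unsqueeze(-1)))
        x = noisy_latents + cond.unsqueeze(1)  # broadcast the condition over all frames
        return self.out(self.backbone(x))
```

In the paper's formulation, swapping in the emotion and style vectors of a different speech clip while keeping the content vector is what enables control over the expressed emotion of the generated gesture.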

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2024
National Category
Electrical Engineering, Electronic Engineering, Information Engineering
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-354048 (URN)
10.1109/CVPR52733.2024.00190 (DOI)
001322555902029 (ISI)
2-s2.0-85202286367 (Scopus ID)
Conference
2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 16-22 2024, Seattle, WA, USA
Note

Part of ISBN 979-8-3503-5300-6

QC 20240930

Available from: 2024-09-26 Created: 2024-09-26 Last updated: 2025-01-20. Bibliographically approved
Daněček, R., Chhatre, K., Tripathi, S., Wen, Y., Black, M. & Bolkart, T. (2023). Emotional Speech-Driven Animation with Content-Emotion Disentanglement. In: Proceedings - SIGGRAPH Asia 2023 Conference Papers, SA 2023. Paper presented at SIGGRAPH Asia 2023 (SA 2023), Sydney, Australia, December 12-15, 2023. Association for Computing Machinery (ACM), Article ID 41.
Emotional Speech-Driven Animation with Content-Emotion Disentanglement
2023 (English). In: Proceedings - SIGGRAPH Asia 2023 Conference Papers, SA 2023, Association for Computing Machinery (ACM), 2023, article id 41. Conference paper, Published paper (Refereed)
Abstract [en]

To be widely adopted, 3D facial avatars must be animated easily, realistically, and directly from speech signals. While the best recent methods generate 3D animations that are synchronized with the input audio, they largely ignore the impact of emotions on facial expressions. Realistic facial animation requires lip-sync together with the natural expression of emotion. To that end, we propose EMOTE (Expressive Model Optimized for Talking with Emotion), which generates 3D talking-head avatars that maintain lip-sync from speech while enabling explicit control over the expression of emotion. To achieve this, we supervise EMOTE with decoupled losses for speech (i.e., lip-sync) and emotion. These losses are based on two key observations: (1) deformations of the face due to speech are spatially localized around the mouth and have high temporal frequency, whereas (2) facial expressions may deform the whole face and occur over longer intervals. Thus we train EMOTE with a per-frame lip-reading loss to preserve the speech-dependent content, while supervising emotion at the sequence level. Furthermore, we employ a content-emotion exchange mechanism in order to supervise different emotions on the same audio, while maintaining the lip motion synchronized with the speech. To employ deep perceptual losses without getting undesirable artifacts, we devise a motion prior in the form of a temporal VAE. Due to the absence of high-quality aligned emotional 3D face datasets with speech, EMOTE is trained with 3D pseudo-ground-truth extracted from an emotional video dataset (i.e., MEAD). Extensive qualitative and perceptual evaluations demonstrate that EMOTE produces speech-driven facial animations with better lip-sync than state-of-the-art methods trained on the same data, while offering additional, high-quality emotional control.
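The decoupled supervision the abstract describes (per-frame around the mouth for speech content, per-sequence for emotion) can be illustrated with a simplified loss sketch. Note that the published method uses a lip-reading loss and emotion features rather than the plain MSE terms below; this hypothetical snippet only shows how the two supervision granularities combine.

```python
# Simplified, hypothetical sketch of decoupled speech/emotion supervision; the
# actual EMOTE losses are perceptual (lip reading, emotion recognition), not MSE.
import torch
import torch.nn.functional as F

def decoupled_loss(pred_verts, gt_verts, lip_mask,
                   pred_emotion_feat, target_emotion_feat,
                   w_lip=1.0, w_emo=0.5):
    """pred_verts, gt_verts: (batch, frames, n_verts, 3) facial meshes.
    lip_mask: boolean tensor of shape (n_verts,) selecting mouth-region vertices.
    *_emotion_feat: (batch, feat_dim) sequence-level emotion embeddings."""
    # Speech content term: per frame, spatially restricted to the mouth region.
    lip_loss = F.mse_loss(pred_verts[:, :, lip_mask], gt_verts[:, :, lip_mask])
    # Emotion term: one comparison per sequence, covering the whole face.
    emo_loss = F.mse_loss(pred_emotion_feat, target_emotion_feat)
    return w_lip * lip_loss + w_emo * emo_loss
```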

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2023
Keywords
Computer Graphics, Computer Vision, Deep learning, Facial Animation, Speech-driven Animation
National Category
Computer graphics and computer vision
Identifiers
urn:nbn:se:kth:diva-347500 (URN)
10.1145/3610548.3618183 (DOI)
001278296700041 (ISI)
2-s2.0-85180390692 (Scopus ID)
Conference
SIGGRAPH Asia 2023 (SA 2023), Sydney, Australia, December 12-15, 2023
Note

Part of ISBN 9798400703157

QC 20240619

Available from: 2024-06-19 Created: 2024-06-19 Last updated: 2025-02-07. Bibliographically approved
Chhatre, K., Feygin, S., Sheppard, C. & Waraich, R. (2022). Parallel Bayesian Optimization of Agent-Based Transportation Simulation. In: International Conference on Machine Learning, Optimization, and Data Science. Paper presented at the 8th International Conference, LOD 2022, Certosa di Pontignano, Italy, September 18–22, 2022. Tuscany, Italy: Springer Nature
Parallel Bayesian Optimization of Agent-Based Transportation Simulation
2022 (English). In: International Conference on Machine Learning, Optimization, and Data Science, Tuscany, Italy: Springer Nature, 2022. Conference paper, Poster (with or without abstract) (Refereed)
Abstract [en]

MATSim (Multi-Agent Transport Simulation Toolkit) is an open-source, large-scale, agent-based transportation planning project applied to various areas such as road transport, public transport, freight transport, and regional evacuation. The BEAM (Behavior, Energy, Autonomy, and Mobility) framework extends MATSim to enable powerful and scalable analysis of urban transportation systems. Agents in the BEAM simulation exhibit mode-choice behavior based on a multinomial logit model. In our study, we consider eight mode choices: bike, car, walk, ride hail, driving to transit, walking to transit, ride hail to transit, and ride hail pooling. The alternative-specific constants for each mode choice are critical hyperparameters in a configuration file related to the particular scenario under experimentation. We use the 'Urbansim-10k' BEAM scenario (with a population size of 10,000) for all our experiments. Since these hyperparameters affect the simulation in complex ways, manual calibration methods are time consuming. We present a parallel Bayesian optimization method with an early stopping rule that converges quickly to optimal configurations for this multi-input, multi-output problem. Our model is based on the open-source HpBandSter package. This approach combines a hierarchy of several one-dimensional Kernel Density Estimators (KDEs) with a cheap evaluator (Hyperband, a single multidimensional KDE). Our model also incorporates an extrapolation-based early stopping rule. With our model, we achieved a 25% L1 norm for a large-scale BEAM simulation in a fully autonomous manner. To the best of our knowledge, our work is the first of its kind applied to large-scale multi-agent transportation simulations. This work can be useful for surrogate modeling of scenarios with very large populations.
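For readers unfamiliar with the mode-choice model being calibrated, the sketch below shows a standard multinomial logit formulation in which the alternative-specific constants (the hyperparameters tuned by the Bayesian optimization) shift each mode's utility. The utilities and target shares are placeholders, not values from the paper.

```python
# Standard multinomial logit mode choice with alternative-specific constants (ASCs);
# the ASC vector is what the parallel Bayesian optimization calibrates.
import numpy as np

MODES = ["bike", "car", "walk", "ride_hail", "drive_to_transit",
         "walk_to_transit", "ride_hail_to_transit", "ride_hail_pooling"]

def mode_choice_probabilities(utilities, ascs):
    """utilities, ascs: arrays of shape (n_modes,); returns choice probabilities."""
    v = np.asarray(utilities, dtype=float) + np.asarray(ascs, dtype=float)
    v -= v.max()                      # subtract max for numerical stability
    expv = np.exp(v)
    return expv / expv.sum()

# Calibration objective (illustrative): adjust the ASCs so that simulated mode
# shares match observed shares, measured here with an L1 norm as in the paper.
observed_shares = np.full(len(MODES), 1.0 / len(MODES))   # placeholder target
simulated = mode_choice_probabilities(np.zeros(len(MODES)), np.zeros(len(MODES)))
l1_gap = float(np.abs(simulated - observed_shares).sum())
```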

Place, publisher, year, edition, pages
Tuscany, Italy: Springer Nature, 2022
Series
Lecture Notes in Computer Science, ISSN 0302-9743, E-ISSN 1611-3349 ; 13810
National Category
Electrical Engineering, Electronic Engineering, Information Engineering
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-354049 (URN)
10.1007/978-3-031-25599-1_35 (DOI)
000995530700035 (ISI)
2-s2.0-85151051710 (Scopus ID)
Conference
8th International Conference, LOD 2022, Certosa di Pontignano, Italy, September 18–22, 2022
Note

QC 20240930

Available from: 2024-09-26 Created: 2024-09-26 Last updated: 2024-09-30. Bibliographically approved
Stojanovski, T., Zhang, H., Frid, E., Chhatre, K., Peters, C., Samuels, I., . . . Lefosse, D. (2022). Rethinking Computer-Aided Architectural Design (CAAD) - From Generative Algorithms and Architectural Intelligence to Environmental Design and Ambient Intelligence. In: Gerber, D., Pantazis, E., Bogosian, B., Nahmad, A. & Miltiadis, C. (Eds.), Computer-Aided Architectural Design: Design Imperatives: The Future Is Now. Paper presented at the 19th International Conference on Computer-Aided Architectural Design (CAAD) - Design Imperatives - The Future is Now, July 16-18, 2021, University of Southern California, Viterbi School of Engineering, online (pp. 62-83). Springer Nature, Vol. 1465
Rethinking Computer-Aided Architectural Design (CAAD) - From Generative Algorithms and Architectural Intelligence to Environmental Design and Ambient Intelligence
2022 (English). In: Computer-Aided Architectural Design: Design Imperatives: The Future Is Now / [ed] Gerber, D., Pantazis, E., Bogosian, B., Nahmad, A., Miltiadis, C., Springer Nature, 2022, Vol. 1465, p. 62-83. Conference paper, Published paper (Refereed)
Abstract [en]

Computer-Aided Architectural Design (CAAD) finds its historical precedents in technological enthusiasm for generative algorithms and architectural intelligence. Current developments in Artificial Intelligence (AI) and paradigms in Machine Learning (ML) bring new opportunities for creating innovative digital architectural tools, but in practice this is not happening. CAAD enthusiasts revisit generative algorithms, while professional architects and urban designers remain reluctant to use software that automatically generates architecture and cities. This paper looks at the history of CAAD and digital tools for Computer-Aided Design (CAD), Building Information Modeling (BIM) and Geographic Information Systems (GIS) in order to reflect on the role of AI in future digital tools and professional practices. Architects and urban designers have diagrammatic knowledge and work with design problems on a symbolic level. Digital tools gradually evolved from CAD to BIM software with symbolic architectural elements. BIM software works like CAAD (CAD systems for architects) or a digital drawing board and delivers plans, sections and elevations, but without AI. AI has the capability to process data and interact with designers. The AI in future digital tools for CAAD and Computer-Aided Urban Design (CAUD) can link to big data and develop ambient intelligence. Architects and urban designers can harness the benefits of analytical, ambient-intelligent AIs in creating environmental designs, not only for shaping buildings in isolated virtual cubicles. However, there is a need to prepare frameworks for communication between AIs and professional designers. If the cities of the future are to integrate spatially analytical AI and be made smart or even ambient intelligent, AI should be applied to improving the lives of inhabitants and helping with their daily living and sustainability.

Place, publisher, year, edition, pages
Springer Nature, 2022
Series
Communications in Computer and Information Science, ISSN 1865-0929
Keywords
Artificial Intelligence (AI), Computer-Aided Architectural Design (CAAD), Architectural intelligence, Generative algorithms, Environmental design, Ambient intelligence
National Category
Architecture; Architectural Engineering; Computer Sciences
Identifiers
urn:nbn:se:kth:diva-312194 (URN)
10.1007/978-981-19-1280-1_5 (DOI)
000787752600005 (ISI)
2-s2.0-85127649084 (Scopus ID)
Conference
19th International Conference on Computer-Aided Architectural Design (CAAD) - Design Imperatives - The Future is Now, July 16-18, 2021, University of Southern California, Viterbi School of Engineering, online
Note

QC 20220518

Available from: 2022-05-18 Created: 2022-05-18 Last updated: 2025-02-24. Bibliographically approved
Stojanovski, T., Zhang, H., Peters, C., Frid, E., Lefosse, D. & Chhatre, K. (2021). Architecture, urban design and Artificial Intelligence (AI) - Intersection of practices and approaches. Paper presented at SimAUD 2021, April 15-17, virtually hosted. Society for Modeling & Simulation International (SCS)
Architecture, urban design and Artificial Intelligence (AI) - Intersection of practices and approaches
2021 (English). Conference paper, Oral presentation with published abstract (Refereed)
Abstract [en]

New developments in Information and Communication Technologies (ICT) and Artificial Intelligence (AI) bring revelations of emerging smart cities. However, AI has not yet been integrated into Computer-Aided Design (CAD), Building Information Modelling (BIM) or Geographic Information Systems (GIS) software. There are experiments with AI in urban modelling and simulations, and AI techniques are applied in the procedural generation of buildings and cities, but software based on simulations, procedural models and generative algorithms is seldom used in architectural and urbanist practices. In this paper, we look at the history of smart cities and digitization in architecture and urbanism in relation to the new potentials brought by AI. We map and juxtapose future urban visions, digital tools and AI techniques to discuss the role of architects and urban designers in an emerging world of smart cities. The purpose is to inspire a debate about the possibility of converging architectural and urbanist practices with new AI developments.

Place, publisher, year, edition, pages
Society for Modeling & Simulation International (SCS), 2021
National Category
Electrical Engineering, Electronic Engineering, Information Engineering
Identifiers
urn:nbn:se:kth:diva-354051 (URN)
Conference
SimAUD 2021 April 15-17, Virtually Hosted
Note

QC 20241001

Available from: 2024-09-26 Created: 2024-09-26 Last updated: 2024-10-01. Bibliographically approved
Chhatre, K., Deichler, A., Peters, C. & Beskow, J. (2021). Spatio-temporal priors in 3D human motion. In: IEEE ICDL Workshop on Spatio-temporal Aspects of Embodied Predictive Processing. Paper presented at the IEEE ICDL Workshop on Spatio-temporal Aspects of Embodied Predictive Processing, Online, 22 Aug 2021.
Spatio-temporal priors in 3D human motion
2021 (English). In: IEEE ICDL Workshop on Spatio-temporal Aspects of Embodied Predictive Processing, 2021. Conference paper, Oral presentation only (Refereed)
Abstract [en]

When we practice a movement, the human brain creates a motor memory of it. These memories are formed and stored in the brain as representations that allow us to perform familiar tasks faster than new movements. From a developmental robotics and embodied artificial agent perspective, it could also be beneficial to exploit these motor representations in the form of spatio-temporal motion priors for complex, full-body motion synthesis. Encoding such priors in neural networks as inductive biases captures the essential spatio-temporal character of human motion. In our current work, we examine and compare recent approaches for capturing spatial and temporal dependencies with machine learning algorithms used to model human motion.

National Category
Electrical Engineering, Electronic Engineering, Information Engineering
Identifiers
urn:nbn:se:kth:diva-354050 (URN)
10.13140/RG.2.2.28042.80327 (DOI)
Conference
IEEE ICDL Workshop on Spatio-temporal Aspects of Embodied Predictive Processing, Online, 22 Aug 2021
Note

QC 20240930

Available from: 2024-09-26 Created: 2024-09-26 Last updated: 2024-09-30. Bibliographically approved
Identifiers
ORCID iD: orcid.org/0000-0002-7414-845X