Publications (8 of 8)
Willemsen, B. & Skantze, G. (2024). Referring Expression Generation in Visually Grounded Dialogue with Discourse-aware Comprehension Guiding. Paper presented at the 17th International Natural Language Generation Conference (INLG) (pp. 453-469). Association for Computational Linguistics
Referring Expression Generation in Visually Grounded Dialogue with Discourse-aware Comprehension Guiding
2024 (English) Conference paper, Published paper (Refereed)
Abstract [en]

We propose an approach to referring expression generation (REG) in visually grounded dialogue that is meant to produce referring expressions (REs) that are both discriminative and discourse-appropriate. Our method constitutes a two-stage process. First, we model REG as a text- and image-conditioned next-token prediction task. REs are autoregressively generated based on their preceding linguistic context and a visual representation of the referent. Second, we propose the use of discourse-aware comprehension guiding as part of a generate-and-rerank strategy through which candidate REs generated with our REG model are reranked based on their discourse-dependent discriminatory power. Results from our human evaluation indicate that our proposed two-stage approach is effective in producing discriminative REs, with higher performance in terms of text-image retrieval accuracy for reranked REs compared to those generated using greedy decoding.
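
The following is a minimal, hypothetical sketch of the reranking stage described above: candidate REs are assumed to come from a fine-tuned REG model (not shown), and an off-the-shelf CLIP model stands in for the discourse-aware comprehension model. The model names, prompt concatenation, and scoring scheme are illustrative assumptions, not the paper's exact implementation.

```python
# Sketch: rerank candidate referring expressions by how reliably a
# text-image retriever picks out the target image given the discourse.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def rerank(candidates, dialogue_context, images, target_idx):
    """Score each candidate RE by the probability that the comprehension
    model retrieves the target image when the RE follows the context."""
    scored = []
    for re_text in candidates:
        query = f"{dialogue_context} {re_text}"
        inputs = processor(text=[query], images=images,
                           return_tensors="pt", padding=True)
        with torch.no_grad():
            logits = model(**inputs).logits_per_text  # shape: (1, n_images)
        probs = logits.softmax(dim=-1)[0]
        scored.append((probs[target_idx].item(), re_text))
    return sorted(scored, reverse=True)

# Hypothetical usage with a small image pool and two candidate REs:
# images = [Image.open(p) for p in ("img0.jpg", "img1.jpg", "img2.jpg")]
# ranked = rerank(["the red mug on the left", "the mug"],
#                 "A: which one should we pick next? B:", images, target_idx=0)
```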

Place, publisher, year, edition, pages
Association for Computational Linguistics, 2024
National Category
Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-355343 (URN)
Conference
17th International Natural Language Generation Conference (INLG)
Projects
tmh_grounding
Note

QC 20241105

Available from: 2024-10-29 Created: 2024-10-29 Last updated: 2025-02-07. Bibliographically approved
Willemsen, B. (2023). On Referring Language Use in Visually Grounded Dialogue. In: YRRSDS 2023 - 19th Annual Meeting of the Young Researchers' Roundtable on Spoken Dialogue Systems, Proceedings of the Workshop. Paper presented at the 19th Annual Meeting of the Young Researchers' Roundtable on Spoken Dialogue Systems, YRRSDS 2023, Prague, Czechia, Sep 11 2023 - Sep 12 2023 (pp. 21-23). Association for Computational Linguistics (ACL)
On Referring Language Use in Visually Grounded Dialogue
2023 (English) In: YRRSDS 2023 - 19th Annual Meeting of the Young Researchers' Roundtable on Spoken Dialogue Systems, Proceedings of the Workshop, Association for Computational Linguistics (ACL), 2023, p. 21-23. Conference paper, Published paper (Refereed)
Place, publisher, year, edition, pages
Association for Computational Linguistics (ACL), 2023
National Category
Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-343734 (URN), 2-s2.0-85184797319 (Scopus ID)
Conference
19th Annual Meeting of the Young Researchers' Roundtable on Spoken Dialogue Systems, YRRSDS 2023, Prague, Czechia, Sep 11 2023 - Sep 12 2023
Note

Part of ISBN 9781952148255

QC 20240222

Available from: 2024-02-22 Created: 2024-02-22 Last updated: 2025-02-07. Bibliographically approved
Willemsen, B., Qian, L. & Skantze, G. (2023). Resolving References in Visually-Grounded Dialogue via Text Generation. In: David Schlangen, Svetlana Stoyanchev, Shafiq Joty, Ondrej Dusek, Casey Kennington, Malihe Alikhani (Eds.), Proceedings of the 24th Meeting of the Special Interest Group on Discourse and Dialogue. Paper presented at the 24th Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL 2023), Prague, Czechia, 11 - 15 September (pp. 457-469). Prague, Czechia: Association for Computational Linguistics (ACL)
Resolving References in Visually-Grounded Dialogue via Text Generation
2023 (English) In: Proceedings of the 24th Meeting of the Special Interest Group on Discourse and Dialogue / [ed] David Schlangen, Svetlana Stoyanchev, Shafiq Joty, Ondrej Dusek, Casey Kennington, Malihe Alikhani, Prague, Czechia: Association for Computational Linguistics (ACL), 2023, p. 457-469. Conference paper, Published paper (Refereed)
Abstract [en]

Vision-language models (VLMs) have been shown to be effective at image retrieval based on simple text queries, but text-image retrieval based on conversational input remains a challenge. Consequently, if we want to use VLMs for reference resolution in visually-grounded dialogue, the discourse processing capabilities of these models need to be augmented. To address this issue, we propose fine-tuning a causal large language model (LLM) to generate definite descriptions that summarize coreferential information found in the linguistic context of references. We then use a pretrained VLM to identify referents based on the generated descriptions, zero-shot. We evaluate our approach on a manually annotated dataset of visually-grounded dialogues and achieve results that, on average, exceed the performance of the baselines we compare against. Furthermore, we find that using referent descriptions based on larger context windows has the potential to yield higher returns.
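
Below is a hedged sketch of the two-step pipeline the abstract describes: a causal language model turns the dialogue context plus mention into a definite description, and a pretrained CLIP model then picks the referent image zero-shot. The base models ("gpt2" as a placeholder for the paper's fine-tuned generator), the prompt format, and the generation settings are assumptions for illustration.

```python
# Sketch: resolve a reference by (1) generating a definite description
# from the dialogue context and (2) retrieving the best-matching image.
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          CLIPModel, CLIPProcessor)

lm_name = "gpt2"  # placeholder for a fine-tuned description generator
tok = AutoTokenizer.from_pretrained(lm_name)
lm = AutoModelForCausalLM.from_pretrained(lm_name)

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def describe(dialogue_context: str, mention: str) -> str:
    """Generate a definite description summarizing what the context
    says about the mentioned referent (hypothetical prompt format)."""
    prompt = f"{dialogue_context}\nDescribe the referent of \"{mention}\":"
    ids = tok(prompt, return_tensors="pt").input_ids
    out = lm.generate(ids, max_new_tokens=24, do_sample=False,
                      pad_token_id=tok.eos_token_id)
    return tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True).strip()

def resolve(description: str, images: list) -> int:
    """Return the index of the candidate image CLIP ranks highest
    for the generated description (PIL images expected)."""
    inputs = clip_proc(text=[description], images=images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = clip(**inputs).logits_per_text[0]  # shape: (n_images,)
    return int(logits.argmax())
```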

Place, publisher, year, edition, pages
Prague, Czechia: Association for Computational Linguistics (ACL), 2023
National Category
Natural Language Processing
Research subject
Computer Science; Human-computer Interaction
Identifiers
urn:nbn:se:kth:diva-339204 (URN), 001274996900041 ()
Conference
The 24th Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL 2023), Prague, Czechia, 11 - 15 September
Projects
tmh_grounding
Funder
Wallenberg AI, Autonomous Systems and Software Program (WASP)
Note

QC 20231106

Part of ISBN 979-8-89176-028-8

Available from: 2023-11-04 Created: 2023-11-04 Last updated: 2025-02-07. Bibliographically approved
Willemsen, B., Kalpakchi, D. & Skantze, G. (2022). Collecting Visually-Grounded Dialogue with A Game Of Sorts. In: Calzolari, N., Bechet, F., Blache, P., Choukri, K., Cieri, C., Declerck, T., Goggi, S., Isahara, H., Maegaard, B., Mazo, H., Odijk, H. & Piperidis, S. (Eds.), Proceedings of the 13th Conference on Language Resources and Evaluation. Paper presented at the 13th Conference on Language Resources and Evaluation, 20-25 June, Marseille, France, 2022 (pp. 2257-2268). European Language Resources Association (ELRA)
Collecting Visually-Grounded Dialogue with A Game Of Sorts
2022 (English) In: Proceedings of the 13th Conference on Language Resources and Evaluation / [ed] Calzolari, N., Bechet, F., Blache, P., Choukri, K., Cieri, C., Declerck, T., Goggi, S., Isahara, H., Maegaard, B., Mazo, H., Odijk, H. & Piperidis, S., European Language Resources Association (ELRA), 2022, p. 2257-2268. Conference paper, Published paper (Refereed)
Abstract [en]

An idealized, though simplistic, view of the referring expression production and grounding process in (situated) dialogue assumes that a speaker must merely appropriately specify their expression so that the target referent may be successfully identified by the addressee. However, referring in conversation is a collaborative process that cannot be aptly characterized as an exchange of minimally-specified referring expressions. Concerns have been raised regarding assumptions made by prior work on visually-grounded dialogue that reveal an oversimplified view of conversation and the referential process. We address these concerns by introducing a collaborative image ranking task, a grounded agreement game we call “A Game Of Sorts”. In our game, players are tasked with reaching agreement on how to rank a set of images given some sorting criterion through a largely unrestricted, role-symmetric dialogue. By putting emphasis on the argumentation in this mixed-initiative interaction, we collect discussions that involve the collaborative referential process. We describe results of a small-scale data collection experiment with the proposed task. All discussed materials, which include the collected data, the codebase, and a containerized version of the application, are publicly available.
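
As a purely illustrative aid, the sketch below shows what a single collected round of such a game could look like as a data record: an image set, a sorting criterion, the role-symmetric dialogue, and the ranking the players agreed on. The field names and example values are hypothetical, not the released dataset's actual schema.

```python
# Sketch of a hypothetical record for one round of a collaborative
# image-ranking game: images, criterion, dialogue, and agreed ranking.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class GameRound:
    image_ids: List[str]                  # images to be ranked this round
    criterion: str                        # sorting criterion given to players
    dialogue: List[Tuple[str, str]] = field(default_factory=list)   # (speaker, utterance)
    agreed_ranking: List[str] = field(default_factory=list)         # image_ids, best first

round_ = GameRound(
    image_ids=["img_03", "img_11", "img_27"],
    criterion="rank the images by how cozy they look",
)
round_.dialogue.append(("A", "I think the cabin one should be first."))
round_.dialogue.append(("B", "Agreed, the one with the fireplace."))
round_.agreed_ranking = ["img_11", "img_03", "img_27"]
```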

Place, publisher, year, edition, pages
European Language Resources Association (ELRA), 2022
National Category
Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-323116 (URN), 000889371702039 (), 2-s2.0-85144393073 (Scopus ID)
Conference
13th Conference on Language Resources and Evaluation, 20-25 June, Marseille, France, 2022
Projects
tmh_grounding
Note

Part of proceedings: ISBN 9791095546726

QC 20230125

Available from: 2023-01-16 Created: 2023-01-16 Last updated: 2025-02-07. Bibliographically approved
Skantze, G. & Willemsen, B. (2022). CoLLIE: Continual Learning of Language Grounding from Language-Image Embeddings. The Journal of Artificial Intelligence Research, 74, 1201-1223
CoLLIE: Continual Learning of Language Grounding from Language-Image Embeddings
2022 (English) In: The Journal of Artificial Intelligence Research, ISSN 1076-9757, E-ISSN 1943-5037, Vol. 74, p. 1201-1223. Article in journal (Refereed) Published
Abstract [en]

This paper presents CoLLIE: a simple, yet effective model for continual learning of how language is grounded in vision. Given a pre-trained multimodal embedding model, where language and images are projected in the same semantic space (in this case CLIP by OpenAI), CoLLIE learns a transformation function that adjusts the language embeddings when needed to accommodate new language use. This is done by predicting the difference vector that needs to be applied, as well as a scaling factor for this vector, so that the adjustment is only applied when needed. Unlike traditional few-shot learning, the model does not just learn new classes and labels, but can also generalize to similar language use and leverage semantic compositionality. We verify the model's performance on two different tasks of identifying the targets of referring expressions, where it has to learn new language use. The results show that the model can efficiently learn and generalize from only a few examples, with little interference with the model's original zero-shot performance.
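
The following is a minimal sketch of the kind of transformation described above: a small network on top of frozen CLIP text embeddings predicts a difference vector and a scaling factor, and applies the adjustment only to the extent the scale opens. The layer sizes and exact parameterization are assumptions for illustration, not CoLLIE's published architecture.

```python
# Sketch: adjust a text embedding with a predicted difference vector,
# gated by a predicted scaling factor so familiar inputs pass unchanged.
import torch
import torch.nn as nn

class EmbeddingAdjuster(nn.Module):
    def __init__(self, dim: int = 512, hidden: int = 256):
        super().__init__()
        self.delta = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, dim))
        self.scale = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, text_emb: torch.Tensor) -> torch.Tensor:
        # Predict the correction and how strongly to apply it; a near-zero
        # scale leaves the original embedding (and hence the base model's
        # zero-shot behaviour) essentially untouched.
        delta = self.delta(text_emb)
        alpha = self.scale(text_emb)
        return text_emb + alpha * delta

# e.g. adjusted = EmbeddingAdjuster()(batch_of_clip_text_embeddings)
```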

Place, publisher, year, edition, pages
AI Access Foundation, 2022
National Category
Software Engineering; Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-315872 (URN), 10.1613/JAIR.1.13689 (DOI), 000825139300002 (), 2-s2.0-85136141290 (Scopus ID)
Projects
tmh_grounding
Note

QC 20220728

Available from: 2022-07-28 Created: 2022-07-28 Last updated: 2025-02-01. Bibliographically approved
de Haas, M., Vogt, P., van den Berghe, R., Leseman, P., Oudgenoeg-Paz, O., Willemsen, B., . . . Krahmer, E. (2022). Engagement in longitudinal child-robot language learning interactions: Disentangling robot and task engagement. International Journal of Child-Computer Interaction, 33, Article ID 100501.
Engagement in longitudinal child-robot language learning interactions: Disentangling robot and task engagement
2022 (English) In: International Journal of Child-Computer Interaction, ISSN 2212-8689, E-ISSN 2212-8697, Vol. 33, article id 100501. Article in journal (Refereed) Published
Abstract [en]

This study investigated a seven-session interaction between a peer-tutor robot and Dutch preschoolers (5 years old) during which the children learned English. We examined whether children's engagement differed when interacting with a tablet and a robot using iconic gestures, with a tablet and a robot using no iconic gestures, or with only a tablet. Two engagement types were annotated (task engagement and robot engagement) using a novel coding scheme based on an existing coding scheme used in kindergartens. The findings revealed that children's task engagement dropped over time in all three conditions, consistent with the novelty effect. However, there were no differences between the conditions for task engagement. Interestingly, robot engagement did show a difference between conditions. Children were more robot engaged when interacting with a robot using iconic gestures than without iconic gestures. Finally, when comparing children's word knowledge with their engagement, we found that both task engagement and robot engagement were positively correlated with children's word retention.

Place, publisher, year, edition, pages
Elsevier B.V., 2022
Keywords
Child-robot interaction, Engagement, Preschool children, Robot tutor, Second language learning
National Category
Human Computer Interaction; Educational Sciences
Identifiers
urn:nbn:se:kth:diva-325269 (URN), 10.1016/j.ijcci.2022.100501 (DOI), 2-s2.0-85133223692 (Scopus ID)
Note

QC 20230404

Available from: 2023-04-04 Created: 2023-04-04 Last updated: 2025-02-18. Bibliographically approved
De Wit, J., Willemsen, B., De Haas, M., Van Den Berghe, R., Leseman, P., Oudgenoeg-Paz, O., . . . Krahmer, E. (2021). Designing and Evaluating Iconic Gestures for Child-Robot Second Language Learning. Interacting with Computers, 33(6), 596-626
Designing and Evaluating Iconic Gestures for Child-Robot Second Language Learning
2021 (English) In: Interacting with Computers, ISSN 0953-5438, E-ISSN 1873-7951, Vol. 33, no 6, p. 596-626. Article in journal (Refereed) Published
Abstract [en]

In this paper, we examine the process of designing robot-performed iconic hand gestures in the context of a long-term study into second language tutoring with children of approximately 5 years of age. We explore four factors that may relate to their efficacy in supporting second language tutoring: the age of participating children; differences between gestures for various semantic categories, e.g. measurement words, such as small, versus counting words, such as five; the quality (comprehensibility) of the robot's gestures; and spontaneous reenactment or imitation of the gestures. Age was found to relate to children's learning outcomes, with older children benefiting more from the robot's iconic gestures than younger children, particularly for measurement words. We found no conclusive evidence that the quality of the gestures or spontaneous reenactment of said gestures related to learning outcomes. We further propose several improvements to the process of designing and implementing a robot's iconic gesture repertoire.

Place, publisher, year, edition, pages
Oxford University Press (OUP), 2021
Keywords
Human-robot interaction, Nonverbal communication, Second language learning, Social robotics
National Category
General Language Studies and Linguistics; Human Computer Interaction
Identifiers
urn:nbn:se:kth:diva-329052 (URN), 10.1093/iwc/iwac013 (DOI), 000811056300001 (), 2-s2.0-85136619318 (Scopus ID)
Note

QC 20230614

Available from: 2023-06-14 Created: 2023-06-14 Last updated: 2023-07-23. Bibliographically approved
van den Berghe, R., Oudgenoeg-Paz, O., Verhagen, J., Brouwer, S., de Haas, M., de Wit, J., . . . Leseman, P. (2021). Individual Differences in Children's (Language) Learning Skills Moderate Effects of Robot-Assisted Second Language Learning. Frontiers in Robotics and AI, 8
Individual Differences in Children's (Language) Learning Skills Moderate Effects of Robot-Assisted Second Language Learning
2021 (English) In: Frontiers in Robotics and AI, E-ISSN 2296-9144, Vol. 8. Article in journal (Refereed) Published
Abstract [en]

The current study investigated how individual differences among children affect the added value of social robots for teaching second language (L2) vocabulary to young children. Specifically, we investigated the moderating role of three individual child characteristics deemed relevant for language learning: first language (L1) vocabulary knowledge, phonological memory, and selective attention. We expected children low in these abilities to particularly benefit from being assisted by a robot in a vocabulary training. An L2 English vocabulary training intervention consisting of seven sessions was administered to 193 monolingual Dutch five-year-old children over a three- to four-week period. Children were randomly assigned to one of three experimental conditions: 1) a tablet only, 2) a tablet and a robot that used deictic (pointing) gestures (the no-iconic-gestures condition), or 3) a tablet and a robot that used both deictic and iconic gestures (i.e., gestures depicting the target word; the iconic-gestures condition). There also was a control condition in which children did not receive a vocabulary training, but played dancing games with the robot. L2 word knowledge was measured directly after the training and two to four weeks later. In these post-tests, children in the experimental conditions outperformed children in the control condition on word knowledge, but there were no differences between the three experimental conditions. Several moderation effects were found. The robot's presence particularly benefited children with larger L1 vocabularies or poorer phonological memory, while children with smaller L1 vocabularies or better phonological memory performed better in the tablet-only condition. Children with larger L1 vocabularies and better phonological memory performed better in the no-iconic-gestures condition than in the iconic-gestures condition, while children with better selective attention performed better in the iconic-gestures condition than the no-iconic-gestures condition. Together, the results showed that the effects of the robot and its gestures differ across children, which should be taken into account when designing and evaluating robot-assisted L2 teaching interventions.

Place, publisher, year, edition, pages
Frontiers Media SA, 2021
Keywords
social robots, second language learning, child-robot interaction, individual differences, (language) learning skills
National Category
Control Engineering
Identifiers
urn:nbn:se:kth:diva-303366 (URN), 10.3389/frobt.2021.676248 (DOI), 000697277200001 (), 34504871 (PubMedID), 2-s2.0-85114430135 (Scopus ID)
Note

QC 20211015

Available from: 2021-10-15 Created: 2021-10-15 Last updated: 2022-06-25. Bibliographically approved
Identifiers
ORCID iD: orcid.org/0000-0003-2140-0612