Crowdsourcing a self-evolving dialog graph
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. ORCID iD: 0000-0003-3687-6189
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. ORCID iD: 0000-0003-1262-4876
KTH.
2019 (English). In: CUI '19: Proceedings of the 1st International Conference on Conversational User Interfaces, Association for Computing Machinery (ACM), 2019, article id 14. Conference paper, Published paper (Refereed)
Abstract [en]

In this paper we present a crowdsourcing-based approach for collecting dialog data for a social chat dialog system, which gradually builds a dialog graph from actual user responses and crowdsourced system answers, conditioned on a given persona and other instructions. This approach was tested during the second instalment of the Amazon Alexa Prize 2018 (AP2018), both for data collection and to feed a simple dialog system that used the graph to provide answers. As users interacted with the system, a graph maintaining the structure of the dialogs was built, identifying parts where more coverage was needed. In an offline evaluation, we compared the corpus collected during the competition with other potential corpora for training chatbots, including movie subtitles, online chat forums and conversational data. The results show that the proposed methodology creates data that is more representative of actual user utterances, and leads to more coherent and engaging answers from the agent. An implementation of the proposed method is available as open-source code.
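The graph-growing idea described in the abstract can be sketched as follows. This is a hypothetical minimal sketch, not the authors' released implementation: nodes hold system utterances, edges are keyed by normalized user responses, and any user response with no outgoing edge is recorded as a coverage gap to be filled by a crowd-sourced system answer.

```python
# Hypothetical sketch of a self-evolving dialog graph (not the authors' code).
# Nodes hold system utterances; edges are keyed by normalized user responses.
# Unseen responses are queued so crowd workers can author the next system turn.

class DialogGraph:
    def __init__(self, root_utterance):
        self.nodes = {0: root_utterance}   # node id -> system utterance
        self.edges = {}                    # (node id, normalized response) -> node id
        self.gaps = []                     # (node id, normalized response) needing coverage

    @staticmethod
    def normalize(text):
        return " ".join(text.lower().split())

    def step(self, node_id, user_response):
        """Follow the user's response; record a coverage gap if it is unseen."""
        key = (node_id, self.normalize(user_response))
        if key in self.edges:
            nxt = self.edges[key]
            return nxt, self.nodes[nxt]
        self.gaps.append(key)              # flag this branch for crowdsourcing
        return None, None

    def add_answer(self, node_id, user_response, system_answer):
        """Attach a crowd-sourced system answer, growing the graph."""
        new_id = max(self.nodes) + 1
        self.nodes[new_id] = system_answer
        self.edges[(node_id, self.normalize(user_response))] = new_id
        return new_id

g = DialogGraph("Hi! Seen any good movies lately?")
nid, ans = g.step(0, "Yes, I loved Arrival")   # -> (None, None): coverage gap recorded
new_id = g.add_answer(0, "Yes, I loved Arrival",
                      "Arrival is great! What did you like about it?")
nid, ans = g.step(0, "yes, i loved  arrival")  # normalization makes this match now
```

The coverage-gap list is what makes the graph "self-evolving": it identifies exactly the branches where more crowd-sourced answers are needed, matching the abstract's description of identifying parts where more coverage was needed.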

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2019. article id 14
Series
ACM International Conference Proceeding Series
Keywords [en]
Crowdsourcing, Datasets, Dialog systems, Human-computer interaction
National Category
Computer and Information Sciences
Identifiers
URN: urn:nbn:se:kth:diva-266061
DOI: 10.1145/3342775.3342790
ISI: 000525446900014
Scopus ID: 2-s2.0-85075882531
OAI: oai:DiVA.org:kth-266061
DiVA, id: diva2:1385502
Conference
1st International Conference on Conversational User Interfaces, CUI 2019; Dublin; Ireland; 22 August 2019 through 23 August 2019
Note

QC 20200114

Part of ISBN 9781450371872

Available from: 2020-01-14. Created: 2020-01-14. Last updated: 2024-10-18. Bibliographically approved.
In thesis
1. Found speech and humans in the loop: Ways to gain insight into large quantities of speech
2022 (English)Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Found data - data used for something other than the purpose for which it was originally collected - holds great value in many regards. It typically has high ecological validity and strong cultural worth, and it exists in significant quantities. However, it is noisy, hard to search through, and its contents are often largely unknown. This thesis explores ways to gain insight into such data collections, specifically with regard to speech and audio data.

In recent years, deep learning approaches have shown unrivaled performance in many speech and language technology tasks. However, in addition to large datasets, many of these methods require vast quantities of high-quality labels, which are costly to produce. Moreover, while there are exceptions, machine learning models are typically trained to solve well-defined, narrow problems and perform inadequately on tasks of a more general nature - such as providing a high-level description of the contents of a large audio file. This observation reveals a methodological gap that this thesis aims to fill.

An ideal system for tackling these matters would combine humans' flexibility and general intelligence with machines' processing power and pattern-finding capabilities. With this idea in mind, the thesis explores the value of including the human-in-the-loop, specifically in the context of gaining insight into collections of found speech. The aim is to combine techniques from speech technology, machine learning paradigms, and human-in-the-loop approaches, with the overall goal of developing and evaluating novel methods for efficiently exploring large quantities of found speech data.

One of the main contributions is Edyson, a tool for fast browsing, exploring, and annotating audio. It uses temporally disassembled audio, a technique that decouples the audio from the temporal dimension, in combination with feature extraction methods, dimensionality reduction algorithms, and a flexible listening function, which allows a user to get an informative overview of the contents.
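The core idea behind temporally disassembled audio can be sketched as follows. This is an assumed, simplified reading of the technique, not Edyson's actual code: slice a signal into short fixed-size windows, describe each window by order-free features, and the resulting point cloud lets similar-sounding windows be grouped regardless of where they occur in time.

```python
# Sketch of "temporally disassembled audio" (an assumed reading of the idea,
# not Edyson's implementation): slice a signal into short windows and describe
# each window by features that ignore temporal order, so similar-sounding
# windows cluster together wherever they occur in the recording.
import math

def disassemble(signal, win=160):
    """Split a mono signal into fixed-size windows, dropping temporal order."""
    return [signal[i:i + win] for i in range(0, len(signal) - win + 1, win)]

def features(window):
    """Two toy features per window: RMS energy and zero-crossing rate."""
    rms = math.sqrt(sum(x * x for x in window) / len(window))
    zcr = sum(1 for a, b in zip(window, window[1:]) if a * b < 0) / len(window)
    return (rms, zcr)

# Synthetic signal: one second of a 440 Hz tone followed by one second of silence.
sr = 8000
tone = [math.sin(2 * math.pi * 440 * t / sr) for t in range(sr)]
silence = [0.0] * sr
points = [features(w) for w in disassemble(tone + silence)]
# Tone windows have high RMS, silence windows have RMS 0, so even this
# toy 2-D feature space separates the two sound types at a glance.
```

In Edyson this step is followed by dimensionality reduction of richer features and an interactive listening function; the sketch only illustrates the decoupling of audio content from the time axis.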

Furthermore, crowdsourcing is explored in the context of large-scale perception studies and speech and language data collection. Prior reports on the usefulness of crowd workers for such tasks show promise and are corroborated here.

The thesis contributions suggest that the explored approaches are promising options for utilizing large quantities of found audio data and deserve further consideration in research and applied settings.

Abstract [sv]

Found data - data used for something other than the purpose for which it was first collected - is valuable in many respects. It typically reflects high ecological validity, it has strong cultural worth, and there are large quantities available. However, it is noisy, hard to get an overview of, and its contents are often unclear. This thesis explores methods that provide insight into such data collections, specifically with regard to speech and audio.

In recent years, deep learning has produced unrivaled results in speech and language technology. Many of these methods, however, require vast quantities of annotated data, which is costly to create. Moreover, machine learning models are typically trained with well-defined problems in mind and perform worse on more general tasks - such as providing a high-level description of the contents of a large audio file. This observation points to a gap in existing methodologies, and thus a need for further techniques, which this thesis aims to fill.

An ideal approach to these problems combines the flexibility and general intelligence of a human with the computational power and pattern-recognition ability of a machine. Based on these ideas, the thesis explores the value of including the human in the loop, specifically for how insights into large collections of found speech can be gained. The main idea is thus to combine techniques from speech technology, machine learning paradigms, and human-in-the-loop methods, with the overall goal of developing and evaluating new methods for exploring large quantities of found speech data.

A primary contribution is Edyson, a tool for fast browsing and annotation of audio. It builds on temporal disassembly of audio in combination with feature extraction, dimensionality reduction algorithms, and a flexible listening function, which gives a user an informative overview of the contents.

Furthermore, crowdsourcing is examined in the context of large-scale perception studies and the collection of speech and language data. Earlier reports showing the usefulness of crowd workers are corroborated by the thesis contributions.

The thesis contributions show that the examined methods are promising options for exploring large quantities of found audio data and deserve further attention.

Place, publisher, year, edition, pages
KTH Royal Institute of Technology, 2022. p. 83
Series
TRITA-EECS-AVL ; 2022:13
Keywords
Found data, found speech, human-in-the-loop, sound browsing, dimensionality reduction, visualization, crowdsourcing
National Category
Natural Language Processing
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-309031 (URN)
978-91-8040-143-2 (ISBN)
Public defence
2022-03-18, Kollegiesalen, https://kth-se.zoom.us/j/62813774919, Brinellvägen 8, Stockholm, 14:00 (English)
Opponent
Supervisors
Note

QC 20220222

Available from: 2022-02-22. Created: 2022-02-18. Last updated: 2025-02-07. Bibliographically approved.
2. Scalable Methods for Developing Interlocutor-aware Embodied Conversational Agents: Data Collection, Behavior Modeling, and Evaluation Methods
2022 (English)Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

This work presents several methods, tools, and experiments that contribute to the development of interlocutor-aware Embodied Conversational Agents (ECAs). Interlocutor-aware ECAs take the interlocutor's behavior into consideration when generating their own non-verbal behaviors. This thesis targets the development of such adaptive ECAs by identifying and contributing to three important and related topics:

1) Data collection methods are presented, both for large-scale crowdsourced data collection and for in-lab data collection with a large number of sensors in a clinical setting. Experiments show that experts deemed dialog data collected using a crowdsourcing method better suited for dialog generation than dialog data from other commonly used sources.

2) Methods for behavior modeling are presented, in which machine learning models are used to generate facial gestures for ECAs. Methods for both single-speaker and interlocutor-aware generation are covered.

3) Evaluation methods are explored; both third-party evaluation of generated gestures and interaction experiments on interlocutor-aware gesture generation are discussed. For example, an experiment investigates the social influence of a mimicking social robot. Furthermore, a method for more efficient perceptual experiments is presented. This method is validated by replicating a previously conducted perceptual experiment on virtual agents; the results obtained with the new method provide similar (in fact, more) insights into the data while requiring less of the evaluators' time. A second study compared subjective evaluations of generated gestures performed in the lab vs. via crowdsourcing, and found no difference between the two settings.

A special focus in this thesis is placed on scalable methods, which make it possible to efficiently and rapidly collect interaction data from a broad range of people and to efficiently evaluate the results produced by the machine learning methods. This in turn allows fast iteration when developing interlocutor-aware ECA behaviors.
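The lab-vs-crowdsourcing comparison described above can be illustrated with a small sketch. This is a hypothetical example with made-up ratings, not the thesis's actual analysis: the same subjective ratings (e.g., 1-5 scores for a generated gesture clip) are collected in both settings, and a simple confidence interval on the difference of means checks whether the setting had a measurable effect.

```python
# Hypothetical sketch of comparing in-lab vs. crowdsourced subjective ratings
# (illustrative made-up data, not the thesis's experiment or analysis).
import statistics

def mean_diff_ci(a, b):
    """Approximate 95% CI for the difference of two rating means (Welch-style)."""
    diff = statistics.mean(a) - statistics.mean(b)
    se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
    return diff - 1.96 * se, diff + 1.96 * se

lab   = [4, 3, 4, 5, 3, 4, 4, 3, 5, 4]   # in-lab ratings (made-up data)
crowd = [3, 4, 4, 4, 3, 5, 4, 3, 4, 4]   # crowdsourced ratings (made-up data)
lo, hi = mean_diff_ci(lab, crowd)
# If the interval contains 0, the data show no measurable setting effect,
# which is the kind of outcome the study above reports.
no_difference = lo <= 0 <= hi
```

A null result like this is what justifies swapping lab sessions for crowdsourcing in later iterations: if the setting does not change the ratings, the cheaper, faster setting can be used at scale.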

Abstract [sv]

This work presents several methods, tools, and experiments that all contribute to the development of interlocutor-aware embodied conversational agents, i.e. agents that communicate through language, have a bodily representation (avatar or robot), and take the interlocutor's behaviors into account when generating their own non-verbal behaviors. This thesis aims to contribute to the development of such agents by identifying and contributing to three important areas:

1) Data collection methods, both for large-scale data collection with the help of so-called crowd workers (a large number of people on the internet engaged to solve a task) and in a laboratory environment with a large number of sensors. Experiments are presented showing, for example, that dialog data collected with the help of crowd workers was judged by a group of experts to be better from a dialog-generation perspective than other commonly used datasets. 2) Methods for behavior modeling, where machine learning models are used to generate facial gestures. Methods for generating facial gestures both for a single agent and for interlocutor-aware agents are presented, together with experiments validating their functionality. In addition, an experiment is presented that investigates an agent's social influence on its interlocutor when it imitates the interlocutor's facial gestures during conversation. 3) Evaluation methods are explored, and a method for more efficient perceptual experiments is presented. The method is evaluated by recreating a previously conducted experiment with virtual agents, and shows that the results obtained with this new method give similar insights (in fact, more insights), while being more efficient in terms of the time evaluators needed to spend. A second study examines the difference between performing subjective evaluations of generated gestures in a laboratory environment and using crowd workers, and showed that no difference could be measured. A special focus is placed on using scalable methods, since this enables efficient and fast collection of multifaceted interaction data from many different people, as well as evaluation of the behaviors generated by the machine learning models, which in turn enables fast iteration during development.

Place, publisher, year, edition, pages
KTH Royal Institute of Technology, 2022. p. 77
Series
TRITA-EECS-AVL ; 2022:15
Keywords
non-verbal behavior generation, interlocutor-aware, data collection, behavior modeling, evaluation methods
National Category
Computer Systems
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-309467 (URN)
978-91-8040-151-7 (ISBN)
Public defence
2022-03-25, U1, https://kth-se.zoom.us/j/62813774919, Brinellvägen 26, Stockholm, 14:00 (English)
Opponent
Supervisors
Note

QC 20220307

Available from: 2022-03-07. Created: 2022-03-03. Last updated: 2022-06-25. Bibliographically approved.

Open Access in DiVA

No full text in DiVA

Other links

Publisher's full text
Scopus

Authority records

Jonell, Patrik; Fallgren, Per; Wennberg, Ulme; Doğan, Fethiye Irmak; Skantze, Gabriel
