kth.se Publications

Publications (10 of 13)
Fallgren, P. & Edlund, J. (2023). Crowdsource-based validation of the audio cocktail as a sound browsing tool. In: Interspeech 2023. Paper presented at the 24th Annual Conference of the International Speech Communication Association, Interspeech 2023, Dublin, Ireland, 20-24 August 2023 (pp. 2178-2182). International Speech Communication Association
Crowdsource-based validation of the audio cocktail as a sound browsing tool
2023 (English) In: Interspeech 2023, International Speech Communication Association, 2023, p. 2178-2182. Conference paper, Published paper (Refereed)
Abstract [en]

We conduct two crowdsourcing experiments designed to examine the usefulness of audio cocktails for quickly gleaning information about the contents of large audio data. Several thousand crowd workers were engaged to listen to audio cocktails with systematically varied composition. They were then asked to state either which sound out of four categories (Children, Women, Men, Orchestra) they heard the most of, or whether they heard anything of a specific category at all. The results show that their responses have high reliability and provide information as to whether a specific task can be performed using audio cocktails. We also propose that the combination of crowd workers and audio cocktails can be used directly as a tool to investigate the contents of large audio data.
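The audio cocktail itself lends itself to a compact illustration. Below is a minimal, hypothetical Python sketch (the function names, sample rate, and clip counts are our assumptions, not taken from the paper): clips from each category are overlaid at random offsets in controlled proportions and then peak-normalised, which is enough to produce stimuli whose composition can be systematically varied as in the experiments.

```python
# Hypothetical sketch of rendering an audio cocktail; all names and
# parameters are illustrative assumptions, not the authors' implementation.
import random
import numpy as np

SR = 16000  # assumed sample rate

def make_cocktail(clips_by_category, proportions, n_clips=20, duration_sec=5.0):
    """clips_by_category: category -> list of 1-D float arrays.
    proportions: category -> fraction of the n_clips drawn from it."""
    out = np.zeros(int(SR * duration_sec))
    for category, frac in proportions.items():
        for clip in random.choices(clips_by_category[category],
                                   k=round(frac * n_clips)):
            start = random.randrange(0, len(out) - min(len(clip), len(out)) + 1)
            seg = clip[:len(out) - start]
            out[start:start + len(seg)] += seg
    peak = np.max(np.abs(out))
    return out / peak if peak > 0 else out  # normalise to avoid clipping

# e.g. a cocktail dominated by one category:
# mix = make_cocktail(clips, {"Children": .6, "Women": .2, "Men": .1, "Orchestra": .1})
```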

Place, publisher, year, edition, pages
International Speech Communication Association, 2023
Keywords
annotation, exploration, found speech, hearing, human-in-the-loop
National Category
Language Technology (Computational Linguistics); Other Humanities not elsewhere specified
Identifiers
urn:nbn:se:kth:diva-337834 (URN)
10.21437/Interspeech.2023-2473 (DOI)
2-s2.0-85171584146 (Scopus ID)
Conference
24th Annual Conference of the International Speech Communication Association, Interspeech 2023, Dublin, Ireland, 20-24 August 2023
Note

QC 20231009

Available from: 2023-10-09 Created: 2023-10-09 Last updated: 2023-10-09. Bibliographically approved
Fallgren, P. (2022). Found speech and humans in the loop: Ways to gain insight into large quantities of speech. (Doctoral dissertation). KTH Royal Institute of Technology
Found speech and humans in the loop: Ways to gain insight into large quantities of speech
2022 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Found data - data used for something other than the purpose for which it was originally collected - holds great value in many regards: it typically has high ecological validity and strong cultural worth, and it exists in significant quantities. However, it is noisy, hard to search through, and its contents are often largely unknown. This thesis explores ways to gain insight into such data collections, specifically with regard to speech and audio data.

In recent years, deep learning approaches have shown unrivaled performance in many speech and language technology tasks. However, in addition to large datasets, many of these methods require vast quantities of high-quality labels, which are costly to produce. Moreover, while there are exceptions, machine learning models are typically trained to solve well-defined, narrow problems and perform inadequately in tasks of a more general nature - such as providing a high-level description of the contents of a large audio file. This observation reveals a methodological gap that this thesis aims to fill.

An ideal system for tackling these matters would combine humans' flexibility and general intelligence with machines' processing power and pattern-finding capabilities. With this idea in mind, the thesis explores the value of including the human-in-the-loop, specifically in the context of gaining insight into collections of found speech. The aim is to combine techniques from speech technology, machine learning paradigms, and human-in-the-loop approaches, with the overall goal of developing and evaluating novel methods for efficiently exploring large quantities of found speech data.

One of the main contributions is Edyson, a tool for fast browsing, exploring, and annotating audio. It uses temporally disassembled audio, a technique that decouples the audio from the temporal dimension, in combination with feature extraction methods, dimensionality reduction algorithms, and a flexible listening function, which allows a user to get an informative overview of the contents.
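As a rough illustration of the pipeline behind temporally disassembled audio, the sketch below cuts a recording into short snippets, embeds each snippet, and projects the embeddings to a 2-D map. It is a hypothetical sketch only: librosa, mean MFCCs, and t-SNE are our stand-ins, whereas the published Edyson work defines its own feature extraction and, in the early papers, uses a self-organizing map for the reduction step.

```python
# Hypothetical TDA sketch: snippet the audio, embed, project to 2-D.
# librosa + t-SNE are stand-ins, not Edyson's actual components.
import numpy as np
import librosa
from sklearn.manifold import TSNE

def disassemble(path, snippet_sec=0.5):
    y, sr = librosa.load(path, sr=16000)
    hop = int(snippet_sec * sr)
    snippets = [y[i:i + hop] for i in range(0, len(y) - hop, hop)]
    # one vector per snippet: mean MFCCs (a simple, common choice)
    feats = np.array([librosa.feature.mfcc(y=s, sr=sr, n_mfcc=13).mean(axis=1)
                      for s in snippets])
    return snippets, feats

snippets, feats = disassemble("large_recording.wav")  # hypothetical file
coords = TSNE(n_components=2).fit_transform(feats)    # one 2-D point per snippet
# A listening function would then play the snippets closest to the cursor.
```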

Furthermore, crowdsourcing is explored in the context of large-scale perception studies and speech and language data collection. Prior reports on the usefulness of crowd workers for such tasks show promise and are corroborated here.

The thesis contributions suggest that the explored approaches are promising options for utilizing large quantities of found audio data and deserve further consideration in research and applied settings.

Abstract [sv]

Found data - data used for something other than the purpose for which it was first collected - is valuable in many respects. It typically reflects high ecological validity, has strong cultural worth, and exists in large quantities. However, it is full of noise, hard to get an overview of, and its contents are often unclear. This thesis explores methods for gaining insight into such data collections, specifically with regard to speech and audio.

In recent years, deep learning has produced unrivaled results in speech and language technology. However, many of these methods require vast quantities of annotated data, which is costly to create. Moreover, machine learning models are usually trained with well-defined problems in mind, and perform worse on more general tasks - such as providing a high-level description of the contents of a large audio file. This observation points to a gap in existing methodologies and hence a need for further techniques, which this thesis aims to fill.

An ideal approach to these problems combines the flexibility and general intelligence of a human with the computational power and pattern-recognition ability of a machine. Building on these ideas, the thesis explores the value of including the human in the loop, specifically with respect to how insights into large collections of found speech can be gained. The main idea is thus to combine techniques from speech technology, machine learning paradigms, and human-in-the-loop methods, with the overall goal of developing and evaluating new methods for exploring large quantities of found speech data.

A primary contribution is Edyson, a tool for rapid listening and annotation of audio. It builds on temporal disassembly of audio in combination with feature extraction, dimensionality reduction algorithms, and a flexible listening function, which gives the user an informative overview of the contents.

Furthermore, crowdsourcing is examined in the context of large-scale perception studies and the collection of speech and language data. Earlier reports on the usefulness of crowd workers are corroborated by the contributions of the thesis.

The thesis contributions show that the investigated methods are promising options for exploring large quantities of found audio data and deserve further attention.

Place, publisher, year, edition, pages
KTH Royal Institute of Technology, 2022. p. 83
Series
TRITA-EECS-AVL ; 2022:13
Keywords
Found data, found speech, human-in-the-loop, sound browsing, dimensionality reduction, visualization, crowdsourcing
National Category
Language Technology (Computational Linguistics)
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-309031 (URN)
978-91-8040-143-2 (ISBN)
Public defence
2022-03-18, Kollegiesalen, https://kth-se.zoom.us/j/62813774919, Brinellvägen 8, Stockholm, 14:00 (English)
Note

QC 20220222

Available from: 2022-02-22 Created: 2022-02-18 Last updated: 2022-06-25. Bibliographically approved
Fallgren, P. & Edlund, J. (2021). Human-in-the-Loop Efficiency Analysis for Binary Classification in Edyson. In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH. Paper presented at the 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021, Brno, 30 August - 3 September 2021 (pp. 3685-3689). International Speech Communication Association
Human-in-the-Loop Efficiency Analysis for Binary Classification in Edyson
2021 (English) In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, International Speech Communication Association, 2021, p. 3685-3689. Conference paper, Published paper (Refereed)
Abstract [en]

Edyson is a human-in-the-loop (HITL) tool for browsing and annotating large amounts of audio data quickly. It builds on temporally disassembled audio and massively multi-component audio environments to overcome the cumbersome time constraints that come with linear exploration of large audio data. This study adds the following contributions to Edyson: 1) We add the new use case of HITL binary classification by sample; 2) We explore the new domain of oceanic hydrophone recordings with whale song, along with speech activity detection in noisy audio; 3) We propose a repeatable method of analysing the efficiency of HITL in Edyson for binary classification, specifically designed to measure the return on human time spent in a given domain. We exemplify this method on two domains, and show that for a manageable initial cost in terms of HITL, it does differentiate between suitable and unsuitable domains for our new use case - a valuable insight when working with large collections of audio.
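The "return on human time" idea can be pictured as a curve of labelling quality against cumulative annotation time. The sketch below is our hypothetical simplification of such an analysis (all names are ours, not the paper's): after each human action, the induced binary labelling is scored against a gold reference.

```python
# Hypothetical sketch of a return-on-human-time curve for HITL binary
# classification; a simplification, not the paper's exact method.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Action:
    seconds: float           # human time this action took
    predictions: List[int]   # binary labels for all samples after the action

def efficiency_curve(actions: List[Action], gold: List[int]) -> List[Tuple[float, float]]:
    """Accuracy of the induced labelling as a function of cumulative human time."""
    spent, curve = 0.0, []
    for act in actions:
        spent += act.seconds
        acc = sum(p == g for p, g in zip(act.predictions, gold)) / len(gold)
        curve.append((spent, acc))
    return curve
```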

Place, publisher, year, edition, pages
International Speech Communication Association, 2021
Series
Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, ISSN 2308-457X ; 6
National Category
Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-309024 (URN)
10.21437/interspeech.2021-45 (DOI)
000841879503159 ()
2-s2.0-85119260766 (Scopus ID)
Conference
22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021, Brno, 30 August - 3 September 2021
Note

QC 20221108

Part of proceedings: ISBN 978-1-7138-3690-2

Available from: 2022-02-18 Created: 2022-02-18 Last updated: 2022-11-08. Bibliographically approved
Domeij, R., Edlund, J., Eriksson, G., Fallgren, P., House, D., Lindström, E., . . . Öqvist, J. (2020). Exploring the archives for textual entry points to speech - Experiences of interdisciplinary collaboration in making cultural heritage accessible for research. In: CEUR Workshop Proceedings. Paper presented at the 2020 Twin Talks 2 and 3 Workshops at DHN 2020 and DH 2020: Understanding and Facilitating Collaboration in Digital Humanities 2020, TwinTalks 2020, 20 October 2020 (pp. 45-55). CEUR-WS
Exploring the archives for textual entry points to speech - Experiences of interdisciplinary collaboration in making cultural heritage accessible for research
2020 (English) In: CEUR Workshop Proceedings, CEUR-WS, 2020, p. 45-55. Conference paper, Published paper (Refereed)
Abstract [en]

Tilltal (Tillgängligt kulturarv för forskning i tal, 'Accessible cultural heritage for speech research') is a multidisciplinary and methodological project undertaken by the Institute of Language and Folklore, KTH Royal Institute of Technology, and the Swedish National Archives in cooperation with the National Language Bank and SWE-CLARIN [1]. It aims to provide researchers with better access to archival audio recordings using methods from language technology. The project comprises three case studies and one activity and usage study. In the case studies, actual research agendas from three different fields (ethnology, sociolinguistics and interaction analysis) serve as a basis for identifying procedures that may be simplified with the aid of digital tools. In the activity and usage study, we apply an activity-theoretical approach with the aim of involving researchers and investigating how they use - and would like to be able to use - the archival resources at ISOF. Involving researchers in participatory design ensures that digital solutions are suggested and evaluated in relation to the requirements expressed by researchers engaged in specific research tasks [2]. In this paper we focus on one of the case studies, which investigates the process by which personal experience narratives are transformed into cultural heritage [3], and we account for our results in exploring how different types of text material from the archives can be used to find relevant sections of the audio recordings. Finally, we discuss what lessons can be learned, and what conclusions can be drawn, from our experiences of interdisciplinary collaboration in the project.

Place, publisher, year, edition, pages
CEUR-WS, 2020
Keywords
Archive speech, Found data, Interdisciplinary collaboration, Participatory design, Digital devices, Cultural heritages, Interaction analysis, Interdisciplinary collaborations, Language technology, Personal experience, Royal Institute of Technology, Theoretical approach, Audio recordings
National Category
Arts
Identifiers
urn:nbn:se:kth:diva-290852 (URN)
2-s2.0-85095968481 (Scopus ID)
Conference
2020 Twin Talks 2 and 3 Workshops at DHN 2020 and DH 2020: Understanding and Facilitating Collaboration in Digital Humanities 2020, TwinTalks 2020, 20 October 2020
Note

QC 20210322

Available from: 2021-03-22 Created: 2021-03-22 Last updated: 2022-06-25. Bibliographically approved
Fallgren, P., Malisz, Z. & Edlund, J. (2019). Bringing order to chaos: A non-sequential approach for browsing large sets of found audio data. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). Paper presented at the 11th International Conference on Language Resources and Evaluation, LREC 2018, Phoenix Seagaia Conference Center, Miyazaki, Japan, 7-12 May 2018 (pp. 4307-4311). European Language Resources Association (ELRA)
Bringing order to chaos: A non-sequential approach for browsing large sets of found audio data
2019 (English) In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), European Language Resources Association (ELRA), 2019, p. 4307-4311. Conference paper, Published paper (Refereed)
Abstract [en]

We present a novel and general approach for fast and efficient non-sequential browsing of sound in large archives that we know little or nothing about, e.g. so-called found data - data not recorded with the specific purpose of being analysed or used as training data. Our main motivation is to address some of the problems speech and speech technology researchers face when they try to capitalise on the huge quantities of speech data that reside in public archives. Our method is a combination of audio browsing through massively multi-object sound environments and a well-known unsupervised dimensionality reduction algorithm (SOM). We test the process chain on four data sets of different nature (speech, speech and music, farm animals, and farm animals mixed with farm sounds). The methods are shown to combine well, resulting in rapid and readily interpretable observations. Finally, our initial results are demonstrated in prototype software which is freely available.
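For readers unfamiliar with the reduction step, a self-organizing map fits a small grid of prototype vectors to the feature space so that nearby grid cells hold similar sounds. A minimal numpy sketch follows, with illustrative parameters that are not those of the paper:

```python
# Minimal SOM sketch in numpy; grid size, iterations and rates are
# illustrative assumptions, not the settings used in the paper.
import numpy as np

def train_som(data, grid=(10, 10), iters=2000, lr0=0.5, sigma0=3.0):
    rng = np.random.default_rng(0)
    h, w = grid
    weights = rng.standard_normal((h, w, data.shape[1]))
    ys, xs = np.mgrid[0:h, 0:w]
    for t in range(iters):
        x = data[rng.integers(len(data))]
        d = np.linalg.norm(weights - x, axis=2)          # distance to every cell
        by, bx = np.unravel_index(d.argmin(), d.shape)   # best-matching unit
        lr = lr0 * (1 - t / iters)                       # decaying learning rate
        sigma = sigma0 * (1 - t / iters) + 1e-3          # shrinking neighbourhood
        neigh = np.exp(-((ys - by) ** 2 + (xs - bx) ** 2) / (2 * sigma ** 2))
        weights += lr * neigh[..., None] * (x - weights)
    return weights  # each grid cell is now a prototype feature vector
```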

Place, publisher, year, edition, pages
European Language Resources Association (ELRA), 2019
Keywords
Data visualisation, Found data, Speech archives
National Category
Media Engineering
Identifiers
urn:nbn:se:kth:diva-241799 (URN)
000725545004063 ()
2-s2.0-85059880464 (Scopus ID)
Conference
11th International Conference on Language Resources and Evaluation, LREC 2018, Phoenix Seagaia Conference Center, Miyazaki, Japan, 7-12 May 2018
Note

Part of proceedings: ISBN 979-10-95546-00-9

QC 20230206

Available from: 2019-01-25 Created: 2019-01-25 Last updated: 2023-02-06. Bibliographically approved
Jonell, P., Lopes, J., Fallgren, P., Wennberg, U., Doğan, F. I. & Skantze, G. (2019). Crowdsourcing a self-evolving dialog graph. In: CUI '19: Proceedings of the 1st International Conference on Conversational User Interfaces. Paper presented at the 1st International Conference on Conversational User Interfaces, CUI 2019, Dublin, Ireland, 22-23 August 2019. Association for Computing Machinery (ACM), Article ID 14.
Crowdsourcing a self-evolving dialog graph
2019 (English) In: CUI '19: Proceedings of the 1st International Conference on Conversational User Interfaces, Association for Computing Machinery (ACM), 2019, article id 14. Conference paper, Published paper (Refereed)
Abstract [en]

In this paper we present a crowdsourcing-based approach for collecting dialog data for a social chat dialog system, which gradually builds a dialog graph from actual user responses and crowd-sourced system answers, conditioned on a given persona and other instructions. This approach was tested during the second instalment of the Amazon Alexa Prize 2018 (AP2018), both for the data collection and to feed a simple dialog system which would use the graph to provide answers. As users interacted with the system, a graph which maintained the structure of the dialogs was built, identifying parts where more coverage was needed. In an offline evaluation, we compared the corpus collected during the competition with other potential corpora for training chatbots, including movie subtitles, online chat forums and conversational data. The results show that the proposed methodology creates data that is more representative of actual user utterances, and leads to more coherent and engaging answers from the agent. An implementation of the proposed method is available as open-source code.
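The core data structure can be pictured as a graph whose nodes are system turns and whose edges are observed user responses; user utterances that match no existing edge mark the parts needing more crowd-sourced coverage. A hypothetical sketch follows (the matching function, threshold, and class names are our assumptions, not the paper's implementation):

```python
# Hypothetical sketch of a self-evolving dialog graph; names and the
# similarity threshold are illustrative, not from the paper.
from collections import defaultdict

class DialogGraph:
    def __init__(self):
        self.edges = defaultdict(list)  # node -> [(user_utterance, next_node)]
        self.gaps = defaultdict(list)   # node -> unmatched user utterances

    def next_node(self, node, user_utterance, match):
        """match(a, b) -> similarity in [0, 1]; returns follow-up node or None."""
        if self.edges[node]:
            utt, nxt = max(self.edges[node],
                           key=lambda e: match(user_utterance, e[0]))
            if match(user_utterance, utt) > 0.7:  # illustrative threshold
                return nxt
        # no adequate answer: record the gap for the next crowdsourcing round
        self.gaps[node].append(user_utterance)
        return None
```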

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2019
Series
ACM International Conference Proceeding Series
Keywords
Crowdsourcing, Datasets, Dialog systems, Human-computer interaction
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:kth:diva-266061 (URN)
10.1145/3342775.3342790 (DOI)
000525446900014 ()
2-s2.0-85075882531 (Scopus ID)
9781450371872 (ISBN)
Conference
1st International Conference on Conversational User Interfaces, CUI 2019, Dublin, Ireland, 22-23 August 2019
Note

QC 20200114

Available from: 2020-01-14 Created: 2020-01-14 Last updated: 2024-03-15. Bibliographically approved
Fallgren, P., Malisz, Z. & Edlund, J. (2019). How to annotate 100 hours in 45 minutes. In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH. Paper presented at Interspeech 2019, Graz, Austria, 15-19 September 2019 (pp. 341-345). ISCA
How to annotate 100 hours in 45 minutes
2019 (English) In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, ISCA, 2019, p. 341-345. Conference paper, Published paper (Refereed)
Abstract [en]

Speech data found in the wild hold many advantages over artificially constructed speech corpora in terms of ecological validity and cultural worth. Perhaps most importantly, there is a lot of it. However, the combination of great quantity, noisiness and variation poses a challenge for its access and processing. Generally speaking, automatic approaches to tackle the problem require good labels for training, while manual approaches require time. In this study, we provide further evidence for a semi-supervised, human-in-the-loop framework that previously has shown promising results for browsing and annotating large quantities of found audio data quickly. The findings of this study show that a 100-hour long subset of the Fearless Steps corpus can be annotated for speech activity in less than 45 minutes, a fraction of the time it would take traditional annotation methods, without a loss in performance.
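One way to picture the semi-supervised step: the human labels a handful of snippets on the 2-D map, and the labels spread to the unlabelled neighbours. The nearest-neighbour propagation below is our hypothetical simplification, not the framework's actual algorithm:

```python
# Hypothetical label propagation on the snippet map; a simplification of
# the HITL framework, not the published algorithm.
import numpy as np

def propagate(coords, labelled_idx, labels):
    """coords: (n, 2) map positions; labelled_idx/labels: human annotations.
    Each point takes the label of its nearest human-labelled neighbour."""
    out = np.empty(len(coords), dtype=int)
    anchors = coords[labelled_idx]
    for i, p in enumerate(coords):
        out[i] = labels[int(np.linalg.norm(anchors - p, axis=1).argmin())]
    return out  # e.g. 1 = speech, 0 = non-speech for speech activity detection
```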

Place, publisher, year, edition, pages
ISCA, 2019
National Category
Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-268304 (URN)
10.21437/Interspeech.2019-1648 (DOI)
000831796400069 ()
2-s2.0-85074718085 (Scopus ID)
Conference
Interspeech 2019, Graz, Austria, 15-19 September 2019
Note

QC 20200310

Available from: 2020-03-10 Created: 2020-03-10 Last updated: 2022-09-23. Bibliographically approved
Tånnander, C., Fallgren, P., Edlund, J. & Gustafson, J. (2019). Spot the pleasant people! Navigating the cocktail party buzz. In: Proceedings Interspeech 2019, 20th Annual Conference of the International Speech Communication Association. Paper presented at Interspeech 2019, 20th Annual Conference of the International Speech Communication Association, Graz, Austria, 15-19 September 2019 (pp. 4220-4224).
Spot the pleasant people! Navigating the cocktail party buzz
2019 (English) In: Proceedings Interspeech 2019, 20th Annual Conference of the International Speech Communication Association, 2019, p. 4220-4224. Conference paper, Published paper (Refereed)
Abstract [en]

We present an experimental platform for making voice likability assessments that are decoupled from individual voices, and instead capture voice characteristics over groups of speakers. We employ methods that we have previously used for other purposes to create the Cocktail platform, where respondents navigate in a voice buzz made up of about 400 voices on a touch screen. They then choose the location where they find the voice buzz most pleasant. Since there is no image or message on the screen, the platform can be used by visually impaired people, who often need to rely on spoken text, on the same terms as sighted people. In this paper, we describe the platform and its motivation along with our analysis method. We conclude by presenting two experiments in which we verify that the platform behaves as expected: one simple sanity test, and one experiment with voices grouped according to their mean pitch variance.
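The navigation idea amounts to distance-weighted mixing: every voice has a fixed map position, and the buzz heard at the touch point mixes all voices with gains that fall off with distance. A hypothetical sketch (the exponential falloff model is our assumption, not the platform's documented behaviour):

```python
# Hypothetical sketch of location-dependent mixing in a voice buzz; the
# gain model is an illustrative assumption.
import numpy as np

def mix_at(touch_xy, voice_xy, voice_audio, falloff=0.2):
    """voice_xy: (n, 2) positions; voice_audio: n equal-length 1-D arrays."""
    d = np.linalg.norm(voice_xy - np.asarray(touch_xy), axis=1)
    gains = np.exp(-d / falloff)  # nearer voices dominate the buzz
    mix = sum(g * a for g, a in zip(gains, voice_audio))
    peak = np.max(np.abs(mix))
    return mix / peak if peak > 0 else mix
```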

National Category
Other Engineering and Technologies not elsewhere specified
Identifiers
urn:nbn:se:kth:diva-273022 (URN)
10.21437/Interspeech.2019-1553 (DOI)
000831796404073 ()
2-s2.0-85074709608 (Scopus ID)
Conference
Interspeech 2019, 20th Annual Conference of the International Speech Communication Association, Graz, Austria, 15-19 September 2019.
Note

QC 20200507

Available from: 2020-05-05 Created: 2020-05-05 Last updated: 2022-09-23. Bibliographically approved
Fallgren, P., Malisz, Z. & Edlund, J. (2019). Towards fast browsing of found audio data: 11 presidents. In: CEUR Workshop Proceedings. Paper presented at the 4th Conference on Digital Humanities in the Nordic Countries, DHN 2019, Copenhagen, Denmark, 5-8 March 2019 (pp. 133-142). CEUR-WS
Towards fast browsing of found audio data: 11 presidents
2019 (English) In: CEUR Workshop Proceedings, CEUR-WS, 2019, p. 133-142. Conference paper, Published paper (Refereed)
Abstract [en]

Our aim is to rapidly explore prohibitively large audio collections by exploiting the insight that people are able to make fast judgments about lengthy recordings by listening to temporally disassembled audio (TDA) segments played simultaneously. We have previously shown the proof of concept; here we develop the method and corroborate its usefulness. We conduct an experiment with untrained human annotators, and show that they are able to place meaningful annotation on a completely unknown 8-hour corpus in a matter of minutes. The audio is temporally disassembled and spread out over a 2-dimensional map. Participants explore the resulting soundscape by hovering over different regions with a mouse. As a corpus, we used a collection of 11 State of the Union addresses given by 11 different US presidents, spanning half a century. The results confirm that (a) participants can distinguish between different regions and are able to describe the general contents of these regions; (b) the regions identified serve as labels describing the contents of the original audio collection; and (c) the regions and labels can be used to segment the temporally reassembled audio into categories. We include an evaluation of the last step for completeness.
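Step (c) works because every snippet keeps its original time index, so labels assigned to map regions induce a segmentation once the snippets are put back in order. A hypothetical sketch of that reassembly (the window size and run-length representation are our choices):

```python
# Hypothetical sketch of step (c): carry region labels back to the timeline
# and collapse them into segments; details are illustrative.
from collections import Counter

def reassemble_labels(snippet_labels, win=5):
    """snippet_labels: one region label per snippet, in original time order."""
    smoothed = []
    for i in range(len(snippet_labels)):
        window = snippet_labels[max(0, i - win): i + win + 1]
        smoothed.append(Counter(window).most_common(1)[0][0])  # majority vote
    segments, start = [], 0   # collapse runs into (start, end, label)
    for i in range(1, len(smoothed) + 1):
        if i == len(smoothed) or smoothed[i] != smoothed[start]:
            segments.append((start, i, smoothed[start]))
            start = i
    return segments
```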

Place, publisher, year, edition, pages
CEUR-WS, 2019
Keywords
Dimensionality reduction, Found data, Self-organizing maps, Speech processing, Visualisation, Conformal mapping, Visualization, Audio data, Proof of concept, Soundscapes, Spread outs, Self organizing maps
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-280816 (URN)
2-s2.0-85066015671 (Scopus ID)
Conference
4th Conference on Digital Humanities in the Nordic Countries, DHN 2019, 5-8 March 2019, Copenhagen, Denmark
Note

QC 20200914

Available from: 2020-09-14 Created: 2020-09-14 Last updated: 2022-06-25. Bibliographically approved
Fallgren, P., Malisz, Z. & Edlund, J. (2018). A tool for exploring large amounts of found audio data. In: CEUR Workshop Proceedings. Paper presented at the 3rd Conference on Digital Humanities in the Nordic Countries, DHN 2018, 7-9 March 2018 (pp. 499-503). CEUR-WS
A tool for exploring large amounts of found audio data
2018 (English) In: CEUR Workshop Proceedings, CEUR-WS, 2018, p. 499-503. Conference paper, Published paper (Refereed)
Abstract [en]

We demonstrate a method and a set of open source tools (beta) for non-sequential browsing of large amounts of audio data. The demonstration covers early-stage versions of a set of functionalities, and provides good insight into how the method can be used to browse through large quantities of audio data efficiently.

Place, publisher, year, edition, pages
CEUR-WS, 2018
Keywords
Found data, Machine learning, Speech processing, Visualization, Flow visualization, Learning systems, Audio data, Large amounts, Nonsequential, Open source tools, Data visualization
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:kth:diva-227479 (URN)
2-s2.0-85045345183 (Scopus ID)
Conference
3rd Conference on Digital Humanities in the Nordic Countries, DHN 2018, 7-9 March 2018
Note

Funding details: 2013-02003, TRC, The Research Council; Funding text: The project described here is funded in full by Riksbankens Jubileumsfond (SAF16-0917: 1). Its results will be made more widely accessible through the infrastructure supported by SWE-CLARIN (Swedish Research Council 2013-02003). QC 20180516

Available from: 2018-05-16 Created: 2018-05-16 Last updated: 2024-03-15. Bibliographically approved
Identifiers
ORCID iD: orcid.org/0000-0003-1262-4876