Towards fast browsing of found audio data: 11 presidents
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. ORCID iD: 0000-0003-1262-4876
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. ORCID iD: 0000-0001-5953-7310
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. ORCID iD: 0000-0001-9327-9482
2019 (English). In: CEUR Workshop Proceedings, CEUR-WS, 2019, p. 133-142. Conference paper, published paper (refereed).
Abstract [en]

Our aim is to rapidly explore prohibitively large audio collections by exploiting the insight that people are able to make fast judgments about lengthy recordings by listening to temporally disassembled audio (TDA) segments played simultaneously. We have previously shown a proof of concept; here we develop the method and corroborate its usefulness. We conduct an experiment with untrained human annotators, and show that they are able to place meaningful annotations on a completely unknown 8-hour corpus in a matter of minutes. The audio is temporally disassembled and spread out over a 2-dimensional map. Participants explore the resulting soundscape by hovering over different regions with a mouse. We used a collection of 11 State of the Union addresses given by 11 different US presidents, spread over half a century in time, as a corpus. The results confirm that (a) participants can distinguish between different regions and are able to describe the general contents of these regions; (b) the regions identified serve as labels describing the contents of the original audio collection; and (c) the regions and labels can be used to segment the temporally reassembled audio into categories. We include an evaluation of the last step for completeness.
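The core idea of temporal disassembly can be illustrated with a minimal sketch, assuming the audio arrives as a plain list of samples. The function name, segment length, and sample rate below are illustrative choices, not the parameters used in the paper.

```python
def disassemble(samples, sample_rate=16000, seg_ms=500):
    """Split audio into short segments, each keeping its original onset
    time so annotations made on segments can be mapped back to the source."""
    seg_len = sample_rate * seg_ms // 1000
    segments = []
    for start in range(0, len(samples), seg_len):
        onset_s = start / sample_rate  # onset of this segment, in seconds
        segments.append((onset_s, samples[start:start + seg_len]))
    return segments

# 10 seconds of (silent) audio at 16 kHz yields 20 half-second segments.
ten_seconds = [0.0] * (16000 * 10)
segs = disassemble(ten_seconds)
```

Because each segment keeps its onset time, a label placed on a group of segments can be projected back onto the original timeline, which is the temporal reassembly step the abstract refers to.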

Place, publisher, year, edition, pages
CEUR-WS , 2019. p. 133-142
Keywords [en]
Dimensionality reduction, Found data, Self-organizing maps, Speech processing, Visualisation, Conformal mapping, Audio data, Proof of concept, Soundscapes
National Category
Computer Sciences
Identifiers
URN: urn:nbn:se:kth:diva-280816
Scopus ID: 2-s2.0-85066015671
OAI: oai:DiVA.org:kth-280816
DiVA id: diva2:1466903
Conference
4th Conference on Digital Humanities in the Nordic Countries, DHN 2019, 5-8 March 2019, Copenhagen, Denmark
Note

QC 20200914

Available from: 2020-09-14. Created: 2020-09-14. Last updated: 2022-06-25. Bibliographically approved.
In thesis
1. Found speech and humans in the loop: Ways to gain insight into large quantities of speech
2022 (English)Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Found data - data used for something other than the purpose for which it was originally collected - holds great value in many regards. It typically reflects high ecological validity, a strong cultural worth, and there are significant quantities at hand. However, it is noisy, hard to search through, and its contents are often largely unknown. This thesis explores ways to gain insight into such data collections, specifically with regard to speech and audio data.

In recent years, deep learning approaches have shown unrivaled performance in many speech and language technology tasks. However, in addition to large datasets, many of these methods require vast quantities of high-quality labels, which are costly to produce. Moreover, while there are exceptions, machine learning models are typically trained for solving well-defined, narrow problems and perform inadequately in tasks of a more general nature - such as providing a high-level description of the contents of a large audio file. This observation reveals a methodological gap that this thesis aims to fill.

An ideal system for tackling these matters would combine humans' flexibility and general intelligence with machines' processing power and pattern-finding capabilities. With this idea in mind, the thesis explores the value of including the human-in-the-loop, specifically in the context of gaining insight into collections of found speech. The aim is to combine techniques from speech technology, machine learning paradigms, and human-in-the-loop approaches, with the overall goal of developing and evaluating novel methods for efficiently exploring large quantities of found speech data.

One of the main contributions is Edyson, a tool for fast browsing, exploring, and annotating audio. It uses temporally disassembled audio, a technique that decouples the audio from the temporal dimension, in combination with feature extraction methods, dimensionality reduction algorithms, and a flexible listening function, which allows a user to get an informative overview of the contents.
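The pipeline described above can be sketched in a deliberately simplified form. A real system like Edyson extracts richer acoustic features and applies a dimensionality reduction algorithm (the record's keywords mention self-organizing maps); as an illustrative shortcut, the sketch below computes two hand-picked features per segment and uses them directly as 2-D map coordinates. All names and parameters here are hypothetical.

```python
import math

def features_2d(segment):
    """Map one audio segment to a 2-D point: (mean energy, zero-crossing rate)."""
    energy = sum(x * x for x in segment) / len(segment)
    crossings = sum((a < 0) != (b < 0) for a, b in zip(segment, segment[1:]))
    return (energy, crossings / len(segment))

# Two toy segments: a loud low-frequency tone and a quiet high-frequency buzz.
tone = [math.sin(2 * math.pi * 5 * t / 100) for t in range(100)]
buzz = [0.1 * math.sin(2 * math.pi * 40 * t / 100) for t in range(100)]
points = [features_2d(s) for s in (tone, buzz)]
# The two segments land in clearly separated regions of the 2-D map,
# which is what lets a listener tell regions apart by hovering over them.
```

The design point is that the map coordinates only need to place acoustically similar segments near each other; the listening function then lets the user attach meaning to each region.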

Furthermore, crowdsourcing is explored in the context of large-scale perception studies and speech & language data collection. Prior reports on the usefulness of crowd workers for such tasks show promise and are here corroborated.

The thesis contributions suggest that the explored approaches are promising options for utilizing large quantities of found audio data and deserve further consideration in research and applied settings.

Abstract [sv]

Found data - data used for something other than the purpose for which it was first collected - is valuable in many respects. It typically reflects high ecological validity, it has strong cultural worth, and there are large quantities available. However, it is noisy, hard to get an overview of, and its contents are often unclear. This thesis explores methods that provide insight into such data collections, specifically with regard to speech and audio.

In recent years, deep learning has produced unrivaled results in speech and language technology. Many of these methods, however, require vast quantities of annotated data, which is costly to create. Moreover, machine learning models are typically trained with well-defined problems in mind, and perform worse on more general tasks - such as providing an overall description of the contents of a large audio file. This observation points to a gap in existing methodologies, and hence a need for further techniques, which this thesis aims to address.

An ideal approach to these problems combines the flexibility and general intelligence of a human with the computational power and pattern-recognition ability of a machine. Based on these ideas, the thesis explores the value of including the human in the loop, specifically with regard to how insight into large collections of found speech can be gained. The main idea is thus to combine techniques from speech technology, machine learning paradigms, and human-in-the-loop methods, with the overall goal of developing and evaluating new methods for exploring large quantities of found speech data.

A primary contribution is Edyson, a tool for rapid listening-through and annotation of audio. It builds on temporal disassembly of audio in combination with feature extraction, dimensionality reduction algorithms, and a flexible listening function, which gives a user an informative overview of the contents.

Furthermore, crowdsourcing is examined in the context of large-scale perception studies and the collection of speech and language data. Earlier reports on the usefulness of crowd workers are corroborated by the contributions of this thesis.

The thesis contributions show that the investigated methods are promising options for exploring large quantities of found audio data and deserve further attention.

Place, publisher, year, edition, pages
KTH Royal Institute of Technology, 2022. p. 83
Series
TRITA-EECS-AVL ; 2022:13
Keywords
Found data, found speech, human-in-the-loop, sound browsing, dimensionality reduction, visualization, crowdsourcing
National Category
Natural Language Processing
Research subject
Speech and Music Communication
Identifiers
URN: urn:nbn:se:kth:diva-309031
ISBN: 978-91-8040-143-2
Public defence
2022-03-18, Kollegiesalen, https://kth-se.zoom.us/j/62813774919, Brinellvägen 8, Stockholm, 14:00 (English)
Note

QC 20220222

Available from: 2022-02-22. Created: 2022-02-18. Last updated: 2025-02-07. Bibliographically approved.

Open Access in DiVA

No full text in DiVA

Scopus

Authority records

Fallgren, Per; Malisz, Zofia; Edlund, Jens

Search in DiVA

By author/editor
Fallgren, Per; Malisz, Zofia; Edlund, Jens
By organisation
Speech, Music and Hearing, TMH
Computer Sciences

Search outside of DiVA

Google, Google Scholar
