kth.se Publications
1 - 13 of 13
  • 1. Domeij, R.
    et al.
    Edlund, Jens
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Eriksson, G.
    Fallgren, Per
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    House, David
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Lindström, E.
    Skog, S. N.
    Öqvist, J.
    Exploring the archives for textual entry points to speech - Experiences of interdisciplinary collaboration in making cultural heritage accessible for research. 2020. In: CEUR Workshop Proceedings, CEUR-WS, 2020, p. 45-55. Conference paper (Refereed)
    Abstract [en]

    Tilltal (Tillgängligt kulturarv för forskning i tal, 'Accessible cultural heritage for speech research') is a multidisciplinary and methodological project undertaken by the Institute of Language and Folklore, KTH Royal Institute of Technology, and The Swedish National Archives in cooperation with the National Language Bank and SWE-CLARIN [1]. It aims to provide researchers better access to archival audio recordings using methods from language technology. The project comprises three case studies and one activity and usage study. In the case studies, actual research agendas from three different fields (ethnology, sociolinguistics and interaction analysis) serve as a basis for identifying procedures that may be simplified with the aid of digital tools. In the activity and usage study, we are applying an activity-theoretical approach with the aim of involving researchers and investigating how they use - and would like to be able to use - the archival resources at ISOF. Involving researchers in participatory design ensures that digital solutions are suggested and evaluated in relation to the requirements expressed by researchers engaged in specific research tasks [2]. In this paper we focus on one of the case studies, which investigates the process by which personal experience narratives are transformed into cultural heritage [3], and account for our results in exploring how different types of text material from the archives can be used to find relevant sections of the audio recordings. Finally, we discuss what lessons can be learned, and what conclusions can be drawn, from our experiences of interdisciplinary collaboration in the project.

  • 2.
    Fallgren, P.
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Malisz, Z.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Edlund, Jens
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    A tool for exploring large amounts of found audio data. 2018. In: CEUR Workshop Proceedings, CEUR-WS, 2018, p. 499-503. Conference paper (Refereed)
    Abstract [en]

    We demonstrate a method and a set of open source tools (beta) for non-sequential browsing of large amounts of audio data. The demonstration will contain versions of a set of functionalities in their first stages, and will provide good insight into how the method can be used to browse through large quantities of audio data efficiently.

  • 3.
    Fallgren, Per
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Found speech and humans in the loop: Ways to gain insight into large quantities of speech. 2022. Doctoral thesis, comprehensive summary (Other academic)
    Abstract [en]

    Found data - data used for something other than the purpose for which it was originally collected - holds great value in many regards. It typically reflects high ecological validity, a strong cultural worth, and there are significant quantities at hand. However, it is noisy, hard to search through, and its contents are often largely unknown. This thesis explores ways to gain insight into such data collections, specifically with regard to speech and audio data.

    In recent years, deep learning approaches have shown unrivaled performance in many speech and language technology tasks. However, in addition to large datasets, many of these methods require vast quantities of high-quality labels, which are costly to produce. Moreover, while there are exceptions, machine learning models are typically trained for solving well-defined, narrow problems and perform inadequately in tasks of more general nature - such as providing a high-level description of the contents in a large audio file. This observation reveals a methodological gap that this thesis aims to fill.

    An ideal system for tackling these matters would combine humans' flexibility and general intelligence with machines' processing power and pattern-finding capabilities. With this idea in mind, the thesis explores the value of including the human-in-the-loop, specifically in the context of gaining insight into collections of found speech. The aim is to combine techniques from speech technology, machine learning paradigms, and human-in-the-loop approaches, with the overall goal of developing and evaluating novel methods for efficiently exploring large quantities of found speech data.

    One of the main contributions is Edyson, a tool for fast browsing, exploring, and annotating audio. It uses temporally disassembled audio, a technique that decouples the audio from the temporal dimension, in combination with feature extraction methods, dimensionality reduction algorithms, and a flexible listening function, which allows a user to get an informative overview of the contents.

    Furthermore, crowdsourcing is explored in the context of large-scale perception studies and speech & language data collection. Prior reports on the usefulness of crowd workers for such tasks show promise and are here corroborated.

    The thesis contributions suggest that the explored approaches are promising options for utilizing large quantities of found audio data and deserve further consideration in research and applied settings.

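    The pipeline the thesis abstract sketches (temporally disassembled audio, feature extraction, dimensionality reduction, interactive listening) can be illustrated roughly as below. This is a minimal sketch only; the 0.5 s snippet length, mean MFCC features and t-SNE projection are assumptions, not necessarily Edyson's actual choices.

    # Illustrative sketch of a temporally-disassembled-audio pipeline (not the Edyson code).
    # Assumed choices: 0.5 s snippets, mean MFCC features, t-SNE for the 2-D browsing map.
    import numpy as np
    import librosa
    from sklearn.manifold import TSNE

    def disassemble(path, snippet_s=0.5, sr=16000):
        """Cut one long recording into short snippets, discarding their temporal order."""
        y, _ = librosa.load(path, sr=sr)
        hop = int(snippet_s * sr)
        return [y[i:i + hop] for i in range(0, len(y) - hop, hop)], sr

    def embed(snippets, sr):
        """One feature vector per snippet (mean MFCCs), projected to 2-D for browsing."""
        feats = np.stack([librosa.feature.mfcc(y=s, sr=sr, n_mfcc=13).mean(axis=1)
                          for s in snippets])
        return TSNE(n_components=2).fit_transform(feats)

    snippets, sr = disassemble("archive_recording.wav")  # hypothetical input file
    coords = embed(snippets, sr)  # row i: where snippet i lands on the 2-D map
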
  • 4.
    Fallgren, Per
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Edlund, Jens
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Crowdsource-based validation of the audio cocktail as a sound browsing tool. 2023. In: Interspeech 2023, International Speech Communication Association, 2023, p. 2178-2182. Conference paper (Refereed)
    Abstract [en]

    We conduct two crowdsourcing experiments designed to examine the usefulness of audio cocktails to quickly find out information on the contents of large audio data. Several thousand crowd workers were engaged to listen to audio cocktails with systematically varied composition. They were then asked to state either which sound out of four categories (Children, Women, Men, Orchestra) they heard the most of, or if they heard anything of a specific category at all. The results show that their responses have high reliability and provide information as to whether a specific task can be performed using audio cocktails. We also propose that the combination of crowd workers and audio cocktails can be used directly as a tool to investigate the contents of large audio data.
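
    The "audio cocktails" used as stimuli are many short snippets played simultaneously. A rough sketch of how such a stimulus could be mixed is given below; the equal-gain summation and peak normalisation are assumptions, not necessarily the authors' exact procedure.

    # Rough sketch: sum snippets from chosen categories into one "audio cocktail" stimulus.
    # Equal gain per snippet and simple peak normalisation are illustrative assumptions.
    import numpy as np

    def mix_cocktail(snippets, length):
        """Sum equal-rate mono snippets (NumPy arrays) and normalise to avoid clipping."""
        mix = np.zeros(length)
        for s in snippets:
            n = min(len(s), length)
            mix[:n] += s[:n]
        peak = np.max(np.abs(mix))
        return mix / peak if peak > 0 else mix

    # e.g. 30 snippets of children's voices plus 10 orchestra snippets -> one 5 s stimulus
    # cocktail = mix_cocktail(children + orchestra, length=16000 * 5)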

  • 5.
    Fallgren, Per
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Edlund, Jens
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Edyson: rapid human-in-the-loop browsing, exploration and annotation of large speech and audio data. Manuscript (preprint) (Other academic)
    Abstract [en]

    The audio exploration tool Edyson integrates a variety of techniques to achieve the efficient exploration of otherwise prohibitively large collections of speech and other sounds. A main strength is that this combination of techniques allows us to place a human-in-the-loop in a coherent and operationalised manner. 

    The two most prominent techniques that we incorporate are temporally disassembled audio (TDA) and massively multi-component audio environments (MMAE). The first allows us to decouple input audio from the temporal dimension by segmenting it into sound snippets of short duration, akin to the frames used in signal processing. These snippets are organised and visualised in an interactive interface where the investigator can navigate through the snippets freely while providing labels and judgements that are not tied to the temporal context of the original audio. This, in turn, removes the real-time or near real-time requirement associated with temporally linear audio browsing.

    We further argue that a human-in-the-loop inclusion, as opposed to fully automated black-box approaches, is valuable and perhaps necessary to understand and fully exploit larger quantities of found speech. 

    We describe in this paper the details of the tool and its underlying methodologies, and provide a summary of results and findings that have come out of our efforts to validate and quantify the characteristics of this new type of audio browsing to date.

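    As a complement to the description above, the sketch below illustrates the idea of labelling snippets by map region rather than by timestamp. The data layout and the circular selection are assumptions for illustration, not Edyson's actual interface code.

    # Sketch of label assignment decoupled from the temporal dimension (not the Edyson code).
    # Each snippet keeps only its index; labels are applied per selected map region.
    from dataclasses import dataclass

    @dataclass
    class Snippet:
        index: int        # position in the original recording (index * snippet length)
        xy: tuple         # coordinates on the 2-D browsing map
        label: str = ""   # filled in while the investigator explores the soundscape

    def label_region(snippets, center, radius, label):
        """Label every snippet whose map position falls inside a circular selection."""
        cx, cy = center
        for s in snippets:
            x, y = s.xy
            if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2:
                s.label = label
        return snippets
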
  • 6.
    Fallgren, Per
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Edlund, Jens
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Human-in-the-Loop Efficiency Analysis for Binary Classification in Edyson. 2021. In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, International Speech Communication Association, 2021, p. 3685-3689. Conference paper (Refereed)
    Abstract [en]

    Edyson is a human-in-the-loop (HITL) tool for browsing and annotating large amounts of audio data quickly. It builds on temporally disassembled audio and massively multi-component audio environments to overcome the cumbersome time constraints that come with linear exploration of large audio data. This study adds the following contributions to Edyson: 1) We add the new use case of HITL binary classification by sample; 2) We explore the new domain oceanic hydrophone recordings with whale song, along with speech activity detection in noisy audio; 3) We propose a repeatable method of analysing the efficiency of HITL in Edyson for binary classification, specifically designed to measure the return on human time spent in a given domain. We exemplify this method on two domains, and show that for a manageable initial cost in terms of HITL, it does differentiate between suitable and unsuitable domains for our new use case - a valuable insight when working with large collections of audio.

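    One way to read "return on human time spent" is audio coverage gained per minute of annotator effort, compared against real-time (linear) listening. The sketch below is only an illustration of such a measure, reusing the 100-hours-in-45-minutes figure from item 9 below; the paper's exact metric may differ.

    # Illustrative "return on human time" measure for HITL binary classification.
    def annotation_rate(audio_hours_covered, human_minutes_spent):
        """Hours of audio receiving a label per minute of human effort."""
        return audio_hours_covered / human_minutes_spent

    baseline = annotation_rate(1.0, 60)   # linear listening: 1 h of audio per hour of effort
    hitl = annotation_rate(100.0, 45)     # e.g. 100 h labelled in 45 min (cf. item 9 below)
    print(hitl / baseline)                # speed-up factor over linear listening (~133x)
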
  • 7.
    Fallgren, Per
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Edlund, Jens
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH, Speech Communication and Technology.
    The audio cocktail as a sound browsing tool - a crowdsourcing based validation. Manuscript (preprint) (Other academic)
    Abstract [en]

    We conduct two crowdsourcing experiments designed to examine the usefulness of audio cocktails to quickly find out information on the contents of large audio data. Several thousand crowd workers were engaged to listen to audio cocktails with systematically varied composition. They were then asked to state either which sound out of four categories (Children, Women, Men, Orchestra) they heard the most of, or if they heard anything of a specific category at all. The results show that their responses have high reliability and provide information as to whether a specific task can be performed using audio cocktails. We also propose that the combination of crowd workers and audio cocktails can be used directly as a tool to investigate the contents of large audio data.

  • 8.
    Fallgren, Per
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Malisz, Zofia
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Edlund, Jens
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Bringing order to chaos: A non-sequential approach for browsing large sets of found audio data. 2019. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), European Language Resources Association (ELRA), 2019, p. 4307-4311. Conference paper (Refereed)
    Abstract [en]

    We present a novel and general approach for fast and efficient non-sequential browsing of sound in large archives that we know little or nothing about, e.g. so called found data - data not recorded with the specific purpose to be analysed or used as training data. Our main motivation is to address some of the problems speech and speech technology researchers see when they try to capitalise on the huge quantities of speech data that reside in public archives. Our method is a combination of audio browsing through massively multi-object sound environments and a well-known unsupervised dimensionality reduction algorithm (SOM). We test the process chain on four data sets of different nature (speech, speech and music, farm animals, and farm animals mixed with farm sounds). The methods are shown to combine well, resulting in rapid and readily interpretable observations. Finally, our initial results are demonstrated in prototype software which is freely available.
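
    The SOM step mentioned above could look roughly like the sketch below. The MiniSom library, grid size and training parameters are assumptions for illustration; the paper does not prescribe a specific implementation.

    # Sketch of the unsupervised SOM projection, here via the MiniSom library (an assumption).
    import numpy as np
    from minisom import MiniSom

    def som_cells(features, grid=20, iterations=5000):
        """Map per-snippet feature vectors onto a grid x grid self-organising map."""
        som = MiniSom(grid, grid, features.shape[1], sigma=1.0, learning_rate=0.5)
        som.train_random(features, iterations)
        # Each snippet is placed at the grid cell of its best-matching unit.
        return np.array([som.winner(f) for f in features])

    # features: one row per audio snippet, e.g. mean MFCCs as in the earlier sketch
    # cells = som_cells(features)  # (row, col) per snippet, used to lay out the soundscape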

  • 9.
    Fallgren, Per
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Malisz, Zofia
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Edlund, Jens
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    How to annotate 100 hours in 45 minutes. 2019. In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, ISCA, 2019, p. 341-345. Conference paper (Refereed)
    Abstract [en]

    Speech data found in the wild hold many advantages over artificially constructed speech corpora in terms of ecological validity and cultural worth. Perhaps most importantly, there is a lot of it. However, the combination of great quantity, noisiness and variation poses a challenge for its access and processing. Generally speaking, automatic approaches to tackle the problem require good labels for training, while manual approaches require time. In this study, we provide further evidence for a semi-supervised, human-in-the-loop framework that previously has shown promising results for browsing and annotating large quantities of found audio data quickly. The findings of this study show that a 100-hour long subset of the Fearless Steps corpus can be annotated for speech activity in less than 45 minutes, a fraction of the time it would take traditional annotation methods, without a loss in performance.

  • 10.
    Fallgren, Per
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Malisz, Zofia
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Edlund, Jens
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Towards fast browsing of found audio data: 11 presidents. 2019. In: CEUR Workshop Proceedings, CEUR-WS, 2019, p. 133-142. Conference paper (Refereed)
    Abstract [en]

    Our aim is to rapidly explore prohibitively large audio collections by exploiting the insight that people are able to make fast judgments about lengthy recordings by listening to temporally disassembled audio (TDA) segments played simultaneously. We have previously shown the proof-of-concept; here we develop the method and corroborate its usefulness. We conduct an experiment with untrained human annotators, and show that they are able to place meaningful annotation on a completely unknown 8 hour corpus in a matter of minutes. The audio is temporally disassembled and spread out over a 2-dimensional map. Participants explore the resulting soundscape by hovering over different regions with a mouse. We used a collection of 11 State of the Union addresses given by 11 different US presidents, spread over half a century in time, as a corpus. The results confirm that (a) participants can distinguish between different regions and are able to describe the general contents of these regions; (b) the regions identified serve as labels describing the contents of the original audio collection; and (c) that the regions and labels can be used to segment the temporally reassembled audio into categories. We include an evaluation of the last step for completeness.
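
    The last step described above - turning snippet-level labels back into segments of the temporally reassembled audio - can be sketched as follows; the fixed snippet length and merge rule are assumptions.

    # Sketch of temporal reassembly: per-snippet labels, gathered on the 2-D map,
    # are put back in time order and merged into contiguous labelled segments.
    def reassemble(labels, snippet_s=0.5):
        """labels: one label per snippet, listed in original temporal order."""
        segments = []                                  # (start_s, end_s, label)
        for i, lab in enumerate(labels):
            start, end = i * snippet_s, (i + 1) * snippet_s
            if segments and segments[-1][2] == lab:
                segments[-1] = (segments[-1][0], end, lab)   # extend the previous segment
            else:
                segments.append((start, end, lab))
        return segments

    # reassemble(["speech", "speech", "applause"]) -> [(0.0, 1.0, "speech"), (1.0, 1.5, "applause")]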

  • 11.
    Jonell, Patrik
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Bystedt, Mattias
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Fallgren, Per
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Kontogiorgos, Dimosthenis
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    David Aguas Lopes, José
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Malisz, Zofia
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Mascarenhas, Samuel
    GAIPS INESC-ID, Lisbon, Portugal.
    Oertel, Catharine
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Eran, Raveh
    Multimodal Computing and Interaction, Saarland University, Germany.
    Shore, Todd
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    FARMI: A Framework for Recording Multi-Modal Interactions. 2018. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Paris: European Language Resources Association, 2018, p. 3969-3974. Conference paper (Refereed)
    Abstract [en]

    In this paper we present (1) a processing architecture used to collect multi-modal sensor data, both for corpora collection and real-time processing, (2) an open-source implementation thereof and (3) a use-case where we deploy the architecture in a multi-party deception game, featuring six human players and one robot. The architecture is agnostic to the choice of hardware (e.g. microphones, cameras, etc.) and programming languages, although our implementation is mostly written in Python. In our use-case, different methods of capturing verbal and non-verbal cues from the participants were used. These were processed in real-time and used to inform the robot about the participants’ deceptive behaviour. The framework is of particular interest for researchers who are interested in the collection of multi-party, richly recorded corpora and the design of conversational systems. Moreover, for researchers who are interested in human-robot interaction, the available modules offer the possibility to easily create both autonomous and Wizard-of-Oz interactions.

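    The hardware-agnostic recording idea can be illustrated with a minimal, hypothetical sketch: sensor wrappers of any kind push timestamped frames into a shared queue and a single writer logs them. This is not the FARMI API, only an illustration of the pattern described in the abstract.

    # Minimal, hypothetical sketch of a hardware-agnostic multi-modal recorder (not the FARMI API).
    import queue
    import time

    class Recorder:
        """Sensor wrappers push timestamped frames; one writer consumes them in arrival order."""
        def __init__(self):
            self.frames = queue.Queue()

        def push(self, sensor_name, payload):
            self.frames.put((time.time(), sensor_name, payload))

        def drain(self, sink):
            while not self.frames.empty():
                ts, name, payload = self.frames.get()
                sink.write(f"{ts}\t{name}\t{payload}\n")   # same log format for any sensor

    # Any microphone, camera or motion-capture wrapper only needs to call recorder.push(...),
    # which keeps the recording layer independent of the actual hardware.
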
  • 12.
    Jonell, Patrik
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Lopes, J.
    Fallgren, Per
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Wennberg, Ulme
    KTH.
    Doğan, Fethiye Irmak
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Robotics, Perception and Learning, RPL.
    Skantze, Gabriel
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Crowdsourcing a self-evolving dialog graph. 2019. In: CUI '19: Proceedings of the 1st International Conference on Conversational User Interfaces, Association for Computing Machinery (ACM), 2019, article id 14. Conference paper (Refereed)
    Abstract [en]

    In this paper we present a crowdsourcing-based approach for collecting dialog data for a social chat dialog system, which gradually builds a dialog graph from actual user responses and crowd-sourced system answers, conditioned by a given persona and other instructions. This approach was tested during the second instalment of the Amazon Alexa Prize 2018 (AP2018), both for the data collection and to feed a simple dialog system which would use the graph to provide answers. As users interacted with the system, a graph which maintained the structure of the dialogs was built, identifying parts where more coverage was needed. In an offline evaluation, we have compared the corpus collected during the competition with other potential corpora for training chatbots, including movie subtitles, online chat forums and conversational data. The results show that the proposed methodology creates data that is more representative of actual user utterances, and leads to more coherent and engaging answers from the agent. An implementation of the proposed method is available as open-source code.
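
    The dialog graph the system grows can be pictured as nodes holding system utterances and edges keyed by user responses, with missing edges marking where crowd workers are asked for new answers. The sketch below is illustrative only, not the authors' implementation.

    # Illustrative sketch of a self-growing dialog graph (not the authors' code).
    class DialogGraph:
        def __init__(self, root_utterance):
            self.nodes = {0: root_utterance}   # node id -> system utterance
            self.edges = {}                    # (node id, normalised user response) -> next node id

        def step(self, node, user_response):
            """Follow an existing edge, or return None to flag a coverage gap."""
            key = (node, user_response.lower().strip())
            if key in self.edges:
                nxt = self.edges[key]
                return nxt, self.nodes[nxt]
            return None, None                  # gap -> collect a crowd-sourced answer here

        def add_answer(self, node, user_response, system_answer):
            """Attach a crowd-sourced system answer as a new node reachable from `node`."""
            new_id = len(self.nodes)
            self.nodes[new_id] = system_answer
            self.edges[(node, user_response.lower().strip())] = new_id
            return new_id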

  • 13.
    Tånnander, Christina
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. Myndigheten för tillgängliga medier, MTM.
    Fallgren, Per
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Edlund, Jens
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Gustafson, Joakim
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Spot the pleasant people! Navigating the cocktail party buzz. 2019. In: Proceedings Interspeech 2019, 20th Annual Conference of the International Speech Communication Association, 2019, p. 4220-4224. Conference paper (Refereed)
    Abstract [en]

    We present an experimental platform for making voice likability assessments that are decoupled from individual voices, and instead capture voice characteristics over groups of speakers. We employ methods that we have previously used for other purposes to create the Cocktail platform, where respondents navigate in a voice buzz made up of about 400 voices on a touch screen. They then choose the location where they find the voice buzz most pleasant. Since there is no image or message on the screen, the platform can be used by visually impaired people, who often need to rely on spoken text, on the same premises as sighted people. In this paper, we describe the platform and its motivation along with our analysis method. We conclude by presenting two experiments in which we verify that the platform behaves as expected: one simple sanity test, and one experiment with voices grouped according to their mean pitch variance.

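    The grouping by mean pitch variance in the second experiment could be computed along the lines below; the f0 tracker (librosa.pyin), its frequency range and the median split are assumptions, not the authors' stated procedure.

    # Sketch of grouping voices by the variance of their f0 track (illustrative assumptions).
    import numpy as np
    import librosa

    def pitch_variance(path):
        """Variance of the voiced fundamental-frequency track of one speaker's recording."""
        y, sr = librosa.load(path, sr=16000)
        f0, voiced, _ = librosa.pyin(y, fmin=65, fmax=400, sr=sr)
        return np.nanvar(np.where(voiced, f0, np.nan))

    # variances = {p: pitch_variance(p) for p in voice_files}   # voice_files: ~400 recordings
    # threshold = np.median(list(variances.values()))           # split into low/high variance groups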