NAAQA: A Neural Architecture for Acoustic Question Answering
Abdelnour, Jerome. NECOTIS, Dept. of Electrical and Computer Engineering, Sherbrooke University, Canada. ORCID iD: 0000-0002-7931-5966
Rouat, Jean. NECOTIS, Dept. of Electrical and Computer Engineering, Sherbrooke University, Canada. ORCID iD: 0000-0002-9306-426X
Salvi, Giampiero. KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH; Department of Electronic Systems, Norwegian University of Science and Technology, Norway. ORCID iD: 0000-0002-3323-5311
2023 (English). In: IEEE Transactions on Pattern Analysis and Machine Intelligence, ISSN 0162-8828, E-ISSN 1939-3539, Vol. 45, no 4, p. 4997-5009. Article in journal (Refereed). Published.
Abstract [en]

The goal of the Acoustic Question Answering (AQA) task is to answer a free-form text question about the content of an acoustic scene. It was inspired by the Visual Question Answering (VQA) task. In this paper, based on the previously introduced CLEAR dataset, we propose a new benchmark for AQA, namely CLEAR2, that emphasizes the specific challenges of acoustic inputs. These include handling variable-duration scenes and scenes built with elementary sounds that differ between the training and test sets. We also introduce NAAQA, a neural architecture that leverages specific properties of acoustic inputs. The use of 1D convolutions in time and frequency to process 2D spectro-temporal representations of acoustic content shows promising results and enables reductions in model complexity. We show that time coordinate maps augment temporal localization capabilities, which enhances the performance of the network by ∼17 percentage points. On the other hand, frequency coordinate maps have little influence on this task. NAAQA achieves 79.5% accuracy on the AQA task with ∼4 times fewer parameters than the previously explored VQA model. We evaluate the performance of NAAQA on an independent data set reconstructed from DAQA. We also test the addition of a MALiMo module to our model on both CLEAR2 and DAQA. We provide a detailed analysis of the results for the different question types. We release the code to produce CLEAR2, as well as NAAQA, to foster research in this newly emerging machine learning task.
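As an illustration of the coordinate maps mentioned in the abstract, the following is a minimal sketch of how a time (and optionally frequency) coordinate channel could be appended to a spectrogram before convolution. It is not taken from the released NAAQA code: the function name `add_coordinate_maps`, the `(channels, freq_bins, time_frames)` layout, and the [-1, 1] normalisation are all assumptions made for this example.

```python
import numpy as np


def add_coordinate_maps(spec, add_time=True, add_freq=False):
    """Append coordinate channels to a spectrogram.

    spec: float array of shape (channels, freq_bins, time_frames).
    The layout and the [-1, 1] range are assumptions for this sketch,
    not details confirmed by the abstract.
    """
    _, n_freq, n_time = spec.shape
    extra = []
    if add_time:
        # One value per time frame, repeated over every frequency bin.
        t = np.linspace(-1.0, 1.0, n_time, dtype=spec.dtype)
        extra.append(np.broadcast_to(t[None, None, :], (1, n_freq, n_time)))
    if add_freq:
        # One value per frequency bin, repeated over every time frame.
        f = np.linspace(-1.0, 1.0, n_freq, dtype=spec.dtype)
        extra.append(np.broadcast_to(f[None, :, None], (1, n_freq, n_time)))
    # Stack the original channels and the coordinate channels.
    return np.concatenate([spec, *extra], axis=0)


if __name__ == "__main__":
    spectrogram = np.zeros((1, 64, 100), dtype=np.float32)
    with_time = add_coordinate_maps(spectrogram)
    print(with_time.shape)
```

With such a channel, a convolution filter sees the position of each frame along the time axis as an input feature, which is one plausible reading of why time coordinate maps help with temporal localization.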

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2023. Vol. 45, no 4, p. 4997-5009
National Category
Computer Sciences
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:kth:diva-324766
DOI: 10.1109/tpami.2022.3194311
ISI: 000947840300064
PubMedID: 36121954
Scopus ID: 2-s2.0-85139450848
OAI: oai:DiVA.org:kth-324766
DiVA, id: diva2:1743465
Projects
IGLU
Note

QC 20250611

Available from: 2023-03-15. Created: 2023-03-15. Last updated: 2025-06-11. Bibliographically approved.

Open Access in DiVA

fulltext (1863 kB), 362 downloads
File information
File name: FULLTEXT01.pdf
File size: 1863 kB
Checksum (SHA-512): b674917ed46a4f66c665f1387c7b96483a18c157503eee127cf541213f04e9e95e875a361a951c5723b433ec52e4e333d6f9f3187c29632dced82ef083c29e1d
Type: fulltext
Mimetype: application/pdf
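A downloaded copy of the full text can be checked against the SHA-512 checksum listed above. The following is a minimal sketch using Python's standard `hashlib`; the helper names `sha512_of_file` and `matches_record` are hypothetical, and the local file path is whatever name the download was saved under.

```python
import hashlib

# Checksum copied from the file information in this record.
EXPECTED_SHA512 = (
    "b674917ed46a4f66c665f1387c7b96483a18c157503eee127cf541213f04e9e9"
    "5e875a361a951c5723b433ec52e4e333d6f9f3187c29632dced82ef083c29e1d"
)


def sha512_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-512 digest of a file, read in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as fh:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path, expected=EXPECTED_SHA512):
    """True if the file's digest equals the expected hex string."""
    return sha512_of_file(path) == expected.lower()
```

For example, `matches_record("FULLTEXT01.pdf")` would return `True` only if the local file is byte-identical to the one the checksum was computed from.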


Authority records

Salvi, Giampiero

Search in DiVA

By author/editor
Abdelnour, Jerome; Rouat, Jean; Salvi, Giampiero

Total: 362 downloads
The number of downloads is the sum of all downloads of full texts. It may include, e.g., previous versions that are no longer available.

Total: 842 hits