cuFINUFFT: a load-balanced GPU library for general-purpose nonuniform FFTs
NYU, Courant Inst Math Sci, New York, NY 10003 USA.
Princeton Univ, PACM, Princeton, NJ 08544 USA.
KTH, School of Engineering Sciences (SCI), Mathematics (Dept.), Mathematics (Div.). ORCID iD: 0000-0002-3377-813x
Lawrence Berkeley Natl Lab, Natl Energy Res Sci Comp Ctr, Berkeley, CA USA.
2021 (English). In: 2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), Institute of Electrical and Electronics Engineers (IEEE), 2021, p. 688-697. Conference paper, Published paper (Refereed)
Abstract [en]

Nonuniform fast Fourier transforms dominate the computational cost in many applications including image reconstruction and signal processing. We thus present a general-purpose GPU-based CUDA library for type 1 (nonuniform to uniform) and type 2 (uniform to nonuniform) transforms in dimensions 2 and 3, in single or double precision. It achieves high performance for a given user-requested accuracy, regardless of the distribution of nonuniform points, via cache-aware point reordering, and load-balanced blocked spreading in shared memory. At low accuracies, this gives on-GPU throughputs around 10^9 nonuniform points per second, and (even including host-device transfer) is typically 4-10x faster than the latest parallel CPU code FINUFFT (at 28 threads). It is competitive with two established GPU codes, being up to 90x faster at high accuracy and/or type 1 clustered point distributions. Finally we demonstrate a 5-12x speedup versus CPU in an X-ray diffraction 3D iterative reconstruction task at 10^-12 accuracy, observing excellent multi-GPU weak scaling up to one rank per GPU.
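For readers unfamiliar with the terminology, the two transform types named in the abstract can be illustrated by their direct (slow) sums. The sketch below is not the library's code or API: it is a plain NumPy reference in one dimension for brevity (the library itself targets dimensions 2 and 3), and the function names `nudft_type1` and `nudft_type2`, the sign convention, the centered mode ordering, and the lack of normalization are all assumptions made for illustration. A fast library approximates such sums to a requested accuracy instead of evaluating them term by term.

```python
import numpy as np

def _modes(n_modes):
    # Centered integer frequencies, e.g. n_modes=4 gives [-2, -1, 0, 1] (assumed ordering).
    return np.arange(-(n_modes // 2), (n_modes + 1) // 2)

def nudft_type1(x, c, n_modes, isign=1):
    """Type 1 (nonuniform to uniform), direct O(N*M) sum.

    f[k] = sum_j c[j] * exp(isign * 1j * k * x[j]) for each integer mode k.
    """
    k = _modes(n_modes)
    return np.exp(isign * 1j * np.outer(k, x)) @ c

def nudft_type2(x, f, isign=-1):
    """Type 2 (uniform to nonuniform), direct O(N*M) sum.

    c[j] = sum_k f[k] * exp(isign * 1j * k * x[j]) for each point x[j].
    """
    k = _modes(len(f))
    return np.exp(isign * 1j * np.outer(x, k)) @ f

# Illustrative data: random points in [-pi, pi) with complex strengths.
rng = np.random.default_rng(0)
x = rng.uniform(-np.pi, np.pi, size=50)
c = rng.standard_normal(50) + 1j * rng.standard_normal(50)
f = nudft_type1(x, c, n_modes=16)   # 16 uniform Fourier modes
c_back = nudft_type2(x, f)          # values at the 50 nonuniform points
```

With opposite signs in the exponent, as chosen here, the type 2 sum is the adjoint of the type 1 sum, which is why the two are commonly paired in iterative reconstruction.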

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2021. p. 688-697
Keywords [en]
Nonuniform FFT, GPU, load balancing
National Category
Signal Processing
Identifiers
URN: urn:nbn:se:kth:diva-302675
DOI: 10.1109/IPDPSW52791.2021.00105
ISI: 000689576200084
Scopus ID: 2-s2.0-85114440464
OAI: oai:DiVA.org:kth-302675
DiVA, id: diva2:1599130
Conference
35th IEEE International Parallel and Distributed Processing Symposium (IPDPS), June 17-21, 2021, Portland, OR
Note

QC 20210930

Available from: 2021-09-30. Created: 2021-09-30. Last updated: 2022-06-25. Bibliographically approved

Authority records

Andén, Joakim
