All-to-all Communication Algorithms for Distributed BLAS
KTH, School of Computer Science and Communication (CSC), Centres, Centre for High Performance Computing, PDC. (Parallelldatorcentrum)
1993 (English). Conference paper, Published paper (Refereed)
Abstract [en]

Algorithms for distributed BLAS (DBLAS) based on all-to-all broadcast and all-to-all reduce are presented. For DBLAS, each all-to-all step must deliver not only the data values but also their indices. This is in contrast to more traditional applications of all-to-all broadcast (such as an N-body solver), where the identity of the data values is of little interest. Detailed schedules for all-to-all broadcast and reduction are given for the data motion of arrays mapped to the processing nodes of binary cube networks using binary encoding and binary-reflected Gray encoding. The algorithms compute the indices of the communicated data locally, so no communication bandwidth is consumed by data-array indices. For the Connection Machine system CM-200, Hamiltonian-cycle-based all-to-all communication algorithms improve performance by a factor of two to ten over a combination of tree, butterfly network, and router based algorithms. The data rate achieved for all-to-all broadcast on a 256-node Connection Machine system CM-200 is 0.3 Gbytes/s. The data motion rate for all-to-all broadcast, including the time for index computations and local data reordering, is about 2.8 Gbytes/s for a 2048-node system. Excluding the time for index computation and local memory reordering, the measured data motion rate for all-to-all broadcast is 5.6 Gbytes/s. On a Connection Machine system CM-200 with 2048 processing nodes, the overall performance of distributed matrix-vector multiply (DGEMV) and vector-matrix multiply (DGEMV with TRANS) is 10.5 Gflops/s and 13.7 Gflops/s, respectively.
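
The claim that no bandwidth is spent on indices rests on each node being able to derive, from its own address and the step number alone, which node originated the block it currently holds. The C sketch below illustrates this idea for one ingredient named in the abstract, a Hamiltonian cycle embedded in a binary cube via the binary-reflected Gray code; the cube dimension, the node address, and the simple cyclic-shift schedule are illustrative assumptions, not the CM-200 implementation described in the paper.

    /*
     * Illustrative sketch only, not the authors' CM-200 code.
     * A node on a d-dimensional binary cube computes locally, for every
     * step of a cyclic-shift all-to-all broadcast along a Hamiltonian
     * cycle, which node originated the block it currently holds.
     */
    #include <stdio.h>

    /* Binary-reflected Gray code and its inverse. */
    static unsigned gray(unsigned i) { return i ^ (i >> 1); }

    static unsigned gray_inverse(unsigned g)
    {
        unsigned i = 0;
        for (; g; g >>= 1)
            i ^= g;             /* binary value is the prefix XOR of the Gray bits */
        return i;
    }

    int main(void)
    {
        const unsigned d  = 3;        /* cube dimension (8 nodes), assumed */
        const unsigned p  = 1u << d;  /* number of processing nodes        */
        const unsigned me = 5;        /* this node's cube address, assumed */

        /* Consecutive Gray codes differ in one bit, so walking the Gray
         * sequence traces a Hamiltonian cycle through the cube.  At step s
         * of the cyclic shift, this node holds the block that originated
         * s positions behind it on the cycle; the origin is computed from
         * `me` and `s` only, with no indices sent over the network.       */
        unsigned my_pos = gray_inverse(me);   /* position on the cycle */
        for (unsigned s = 0; s < p; ++s) {
            unsigned src_pos  = (my_pos + p - s) % p;
            unsigned src_node = gray(src_pos);
            printf("step %u: node %u holds the block originated by node %u\n",
                   s, me, src_node);
        }
        return 0;
    }

Because both endpoints of every exchange run the same local computation, sender and receiver agree on the array indices of each block without ever transmitting them, which is the property the abstract attributes to its broadcast and reduction schedules.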

Place, publisher, year, edition, pages
1993.
National Category
Computer and Information Science
Identifiers
URN: urn:nbn:se:kth:diva-65264
OAI: oai:DiVA.org:kth-65264
DiVA: diva2:483284
Conference
6th SIAM Conference on Parallel Processing for Scientific Computing
Note
NR 20140805
Available from: 2012-01-25. Created: 2012-01-25. Bibliographically approved.

Open Access in DiVA

No full text

Search in DiVA

By author/editor
Johnsson, Lennart