All-to-All Communication on the Connection Machine system CM-200
KTH, School of Computer Science and Communication (CSC), Centres, Centre for High Performance Computing, PDC. (Parallelldatorcentrum)
1995 (English). In: Scientific Programming, ISSN 1058-9244, E-ISSN 1875-919X, Vol. 4, no 4, p. 251-273. Article in journal (Refereed), Published
Abstract [en]

[...] based on all-to-all broadcast and all-to-all reduce are presented. For DBLAS, at each all-to-all step, it is necessary to know both the data values and their indices. This is in contrast to more traditional applications of all-to-all broadcast (such as an N-body solver), where the identity of the data values is of little interest. Detailed schedules for all-to-all broadcast and reduction are given for the data motion of arrays mapped to the processing nodes of binary cube networks using binary encoding and binary-reflected Gray encoding. The algorithms compute the indices of the communicated data locally, so no communication bandwidth is consumed for data array indices. On the Connection Machine system CM-200, Hamiltonian-cycle-based all-to-all communication algorithms improve performance by a factor of two to ten over a combination of tree, butterfly network, and router-based algorithms. The data rate achieved for all-to-all broadcast on a 256-node Connection Machine system CM-200 is 0.3 Gbytes/s. The data motion rate for all-to-all broadcast, including the time for index computations and local data reordering, is about 2.8 Gbytes/s for a 2048-node system. Excluding the time for index computation and local memory reordering, the measured data motion rate for all-to-all broadcast is 5.6 Gbytes/s. On a Connection Machine system CM-200 with 2048 processing nodes, the overall performance of the distributed matrix-vector multiply (DGEMV) and vector-matrix multiply (DGEMV with TRANS) is 10.5 Gflops/s and 13.7 Gflops/s, respectively.
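
The idea of computing indices locally while data circulates along a Hamiltonian cycle can be illustrated with a small simulation. The sketch below is not the paper's schedule: it assumes a single cycle obtained by visiting the cube nodes in binary-reflected Gray code order, one item per node, and the function names (`gray`, `gray_inverse`, `all_to_all_broadcast`) are hypothetical. Each node forwards only data values; the index of every received value is derived from the node's own address and the step number.

```python
def gray(i: int) -> int:
    """Binary-reflected Gray code of i."""
    return i ^ (i >> 1)


def gray_inverse(g: int) -> int:
    """Position of code word g in the Gray code sequence."""
    i = 0
    while g:
        i ^= g
        g >>= 1
    return i


def all_to_all_broadcast(values: list) -> list:
    """Simulate all-to-all broadcast on a Gray-code ring in a binary cube.

    values[p] is the item initially held by the node with address p.
    Returns gathered, where gathered[p][q] is the item node p holds
    for source node q after n - 1 steps.
    """
    n = len(values)
    assert n > 0 and n & (n - 1) == 0, "node count must be a power of two"

    gathered = [[None] * n for _ in range(n)]
    for p in range(n):
        gathered[p][p] = values[p]

    in_flight = list(values)  # item each node forwards in the next step
    for step in range(1, n):
        received = [None] * n
        for p in range(n):
            pos = gray_inverse(p)
            succ = gray((pos + 1) % n)
            # Ring successors are cube neighbors: addresses differ in one bit.
            assert bin(p ^ succ).count("1") == 1
            received[succ] = in_flight[p]  # only the value is sent
        for p in range(n):
            pos = gray_inverse(p)
            # Source index computed locally from address and step number.
            src = gray((pos - step) % n)
            gathered[p][src] = received[p]
        in_flight = received
    return gathered
```

For example, `all_to_all_broadcast(["a", "b", "c", "d"])` leaves every one of the four simulated nodes holding `["a", "b", "c", "d"]` in source-address order, without any index ever being transmitted.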

Place, publisher, year, edition, pages
1995. Vol. 4, no 4, p. 251-273
National Category
Computer and Information Sciences
Identifiers
URN: urn:nbn:se:kth:diva-90981
OAI: oai:DiVA.org:kth-90981
DiVA, id: diva2:507631
Note
NR 20140805
Available from: 2012-03-05. Created: 2012-03-05. Last updated: 2018-01-12. Bibliographically approved.

By author/editor
Johnsson, Lennart