1 - 9 of 9
  • 1. Dongarra, Jack
    Johnsson, Lennart
    KTH, School of Computer Science and Communication (CSC), Centres, Centre for High Performance Computing, PDC.
    Solving Banded Systems on Parallel Architectures. 1987. In: Parallel Computing, ISSN 0167-8191, E-ISSN 1872-7336, Vol. 5, no 2, p. 219-246. Article in journal (Refereed)
  • 2. Ho, Ching-Tien
    Johnsson, Lennart
    KTH, School of Computer Science and Communication (CSC), Centres, Centre for High Performance Computing, PDC.
    Embedding Meshes in Boolean Cubes by Graph Decomposition. 1990. In: Parallel Computing, ISSN 0167-8191, E-ISSN 1872-7336, Vol. 8, no 4, p. 325-339. Article in journal (Refereed)
    Abstract [en]

    This paper explores the embeddings of multidimensional meshes into minimal Boolean cubes by graph decomposition. The dilation and the congestion of the product-graph embedding (G1 × G2) → (H1 × H2) are the maxima of the dilation and congestion of the two embeddings G1 → H1 and G2 → H2. The graph decomposition technique can be used to improve the average dilation and average congestion. The graph decomposition technique combined with some particular two-dimensional embeddings allows for minimal-expansion, dilation-two, congestion-two embeddings of about 87% of all two-dimensional meshes, with a significantly lower average dilation and congestion than by modified line compression. For three-dimensional meshes we show that the graph decomposition technique, together with two three-dimensional mesh embeddings presented in this paper and modified line compression, yields dilation-two embeddings of more than 96% of all three-dimensional meshes contained in a 512 × 512 × 512 mesh. The graph decomposition technique is also used to generalize the embeddings to meshes with wraparound. The dilation increases by at most one compared to a mesh without wraparound. The expansion is preserved for the majority of meshes if a wraparound feature is added to the mesh.
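The classic building block for such embeddings is the reflected binary Gray code, which gives a dilation-1 embedding of a 2^a × 2^b mesh into its minimal Boolean cube. The sketch below illustrates that building block only, not the paper's graph-decomposition construction (which handles meshes whose sides are not powers of two):

```python
def gray(i):
    """Reflected binary Gray code of i: consecutive codes differ in one bit."""
    return i ^ (i >> 1)

def embed_mesh_node(r, c, col_bits):
    """Map node (r, c) of a 2^a x 2^b mesh into an (a+b)-cube.

    Concatenating the Gray codes of the row and column indices yields a
    dilation-1 embedding: mesh neighbors land on adjacent cube nodes,
    since incrementing either index flips exactly one bit of its Gray code.
    """
    return (gray(r) << col_bits) | gray(c)

# Mesh neighbors (1, 2) and (1, 3) in a 4x4 mesh map to adjacent cube nodes:
a = embed_mesh_node(1, 2, 2)
b = embed_mesh_node(1, 3, 2)
assert bin(a ^ b).count("1") == 1  # Hamming distance 1
```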

  • 3. Johnsson, Lennart
    KTH, School of Computer Science and Communication (CSC), Centres, Centre for High Performance Computing, PDC.
    Minimizing the Communication Time for Matrix Multiplication on Multiprocessors. 1993. In: Parallel Computing, ISSN 0167-8191, E-ISSN 1872-7336, Vol. 19, no 11, p. 1235-1257. Article in journal (Refereed)
    Abstract [en]

    We present one matrix multiplication algorithm for two-dimensional arrays of processing nodes, and one algorithm for three-dimensional nodal arrays. One-dimensional nodal arrays are treated as a degenerate case. The algorithms are designed to utilize fully the communications bandwidth in high-degree networks in which the one-, two-, or three-dimensional arrays may be embedded. For binary n-cubes, our algorithms offer a speedup of the communication over previous algorithms for square matrices and square two-dimensional arrays by a factor of n/2. Configuring the N = 2^n processing nodes as a three-dimensional array may reduce the communication complexity by a factor of N^(1/6) compared to a two-dimensional nodal array. The three-dimensional algorithm requires temporary storage proportional to the length of the nodal array axis aligned with the axis shared between the multiplier and the multiplicand. The optimal two-dimensional nodal array shape with respect to communication has a ratio between the numbers of node rows and columns equal to the ratio between the numbers of matrix rows and columns of the product matrix, with the product matrix accumulated in-place. The optimal three-dimensional nodal array shape has a ratio between the lengths of the machine axes approximately equal to the ratio between the lengths of the three axes in matrix multiplication. For product matrices of extreme shape, one-dimensional nodal array shapes are optimal when N/n ≲ 2P/R for P > R, or N/n ≲ 2R/P for R ≥ P, where P is the number of rows and R the number of columns of the product matrix. All our algorithms use standard communication functions.
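The claimed N^(1/6) gap between two- and three-dimensional nodal arrays can be checked with a back-of-the-envelope model of per-node communication volume; the leading constants below are illustrative (a Cannon-style 2D shift schedule, three block faces in 3D), only the scaling matters:

```python
def comm_2d(n, N):
    """Words communicated per node for an n x n matmul on a sqrt(N) x sqrt(N)
    grid, Cannon-style: A and B blocks of (n/sqrt(N))^2 words are each
    shifted sqrt(N) times."""
    return 2 * n * n / N ** 0.5

def comm_3d(n, N):
    """Words per node on an N^(1/3)-per-axis grid: each node touches three
    block faces of (n / N^(1/3))^2 words (constants illustrative)."""
    return 3 * n * n / N ** (2 / 3)

# The 2D/3D ratio grows as N^(1/6), the scaling stated in the abstract:
n, N = 4096, 2 ** 12
ratio = comm_2d(n, N) / comm_3d(n, N)
assert abs(ratio - (2 / 3) * N ** (1 / 6)) < 1e-9
```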

  • 4. Johnsson, Lennart
    KTH, School of Computer Science and Communication (CSC), Centres, Centre for High Performance Computing, PDC.
    Krawitz, Robert L.
    Cooley-Tukey FFT on the Connection Machine. 1992. In: Parallel Computing, ISSN 0167-8191, E-ISSN 1872-7336, Vol. 18, no 11, p. 1201-1221. Article in journal (Refereed)
    Abstract [en]

    We describe an implementation of the Cooley-Tukey complex-to-complex FFT on the Connection Machine. The implementation is designed to make effective use of the architecture's communications bandwidth and memory bandwidth, and of the storage for precomputed twiddle factors. The peak data motion rate that is achieved for the interprocessor communication stages is in excess of 7 Gbytes/s for a Connection Machine system CM-200 with 2048 floating-point processors. The peak rate of FFT computations local to a processor is 12.9 Gflops/s in 32-bit precision, and 10.7 Gflops/s in 64-bit precision. The same FFT routine is used to perform both one- and multi-dimensional FFT without any explicit data rearrangement. The peak performance for a one-dimensional FFT on data distributed over all processors is 5.4 Gflops/s in 32-bit precision and 3.2 Gflops/s in 64-bit precision. The peak performance for square two-dimensional transforms is 3.1 Gflops/s in 32-bit precision, and for cubic three-dimensional transforms the peak is 2.0 Gflops/s in 64-bit precision. Certain oblong shapes yield better performance. The number of twiddle factors stored in each processor is P/(2N) + log2(N) for an FFT on P complex points uniformly distributed among N processors. To achieve this level of storage efficiency we show that a decimation-in-time FFT is required for normal order input, and a decimation-in-frequency FFT is required for bit-reversed input order.
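The decimation-in-time structure and precomputed twiddle factors the abstract refers to can be illustrated with a minimal single-node sketch (plain Python, not the distributed Connection Machine implementation; the paper's version spreads the P points over N processors and stores only P/(2N) + log2(N) twiddles per processor):

```python
import cmath

def fft_dit(x):
    """Iterative radix-2 decimation-in-time FFT with precomputed twiddles.

    len(x) must be a power of two. A bit-reversal permutation up front lets
    the in-place butterflies produce normal-order output from normal-order
    input, the DIT property noted in the abstract.
    """
    n = len(x)
    bits = n.bit_length() - 1
    # bit-reversal permutation of the input
    a = [x[int(format(i, f"0{bits}b")[::-1], 2)] for i in range(n)]
    # precompute twiddle factors w^k = exp(-2*pi*i*k/n) once
    w = [cmath.exp(-2j * cmath.pi * k / n) for k in range(n // 2)]
    span = 1
    while span < n:
        step = n // (2 * span)
        for start in range(0, n, 2 * span):
            for k in range(span):
                t = w[k * step] * a[start + k + span]
                a[start + k + span] = a[start + k] - t
                a[start + k] = a[start + k] + t
        span *= 2
    return a
```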

  • 5. Mathur, Kapil K.
    Johnsson, Lennart
    KTH, School of Computer Science and Communication (CSC), Centres, Centre for High Performance Computing, PDC.
    Multiplication of Matrices of Arbitrary Shape on a Data Parallel Computer. 1994. In: Parallel Computing, ISSN 0167-8191, E-ISSN 1872-7336, Vol. 20, no 7, p. 919-951. Article in journal (Refereed)
    Abstract [en]

    Some level-2 and level-3 Distributed Basic Linear Algebra Subroutines (DBLAS) that have been implemented on the Connection Machine system CM-200 are described. No assumption is made on the shape or size of the operands. For matrix-matrix multiplication, both the nonsystolic and the systolic algorithms are outlined. A systolic algorithm that computes the product matrix in-place is described in detail. We show that a level-3 DBLAS yields better performance than a level-2 DBLAS. On the Connection Machine system CM-200, blocking yields a performance improvement by a factor of up to three over level-2 DBLAS. For certain matrix shapes the systolic algorithms offer both improved performance and significantly reduced temporary storage requirements compared to the nonsystolic block algorithms.

    We show that, in order to minimize the communication time, an algorithm that leaves the largest operand matrix stationary should be chosen for matrix-matrix multiplication. Furthermore, it is shown both analytically and experimentally that the optimum shape of the processor array yields square stationary submatrices in each processor, i.e. the ratio between the length of the axes of the processing array must be the same as the ratio between the corresponding axes of the stationary matrix. The optimum processor array shape may yield a factor of five performance enhancement for the multiplication of square matrices. For rectangular matrices a factor of 30 improvement was observed for an optimum processor array shape compared to a poorly chosen processor array shape.
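The level-3 vs. level-2 distinction above comes down to operand reuse: a blocked multiply keeps a stationary C block in fast storage while panels of A and B stream past it, instead of recomputing with one row or column at a time. A minimal single-node sketch of the blocking idea (plain Python; the paper's algorithms are distributed and systolic):

```python
def matmul_blocked(A, B, bs=2):
    """Blocked (level-3 style) matrix multiply C = A @ B.

    Each bs x bs block of C stays stationary while blocks of A and B
    stream through the innermost loops; on real hardware this reuse in
    cache/registers is what makes level-3 kernels outperform level-2.
    """
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, bs):
        for j0 in range(0, n, bs):          # the stationary C block
            for l0 in range(0, k, bs):      # stream A and B blocks past it
                for i in range(i0, min(i0 + bs, m)):
                    for l in range(l0, min(l0 + bs, k)):
                        a = A[i][l]
                        for j in range(j0, min(j0 + bs, n)):
                            C[i][j] += a * B[l][j]
    return C
```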

  • 6. Narasimhamurthy, Sai
    Seagate Systems UK, London, England.
    Danilov, Nikita
    Seagate Systems UK, London, England.
    Wu, Sining
    Seagate Systems UK, London, England.
    Umanesan, Ganesan
    Seagate Systems UK, London, England.
    Markidis, Stefano
    KTH, School of Electrical Engineering and Computer Science (EECS), Computational Science and Technology (CST).
    Rivas-Gomez, Sergio
    KTH, School of Electrical Engineering and Computer Science (EECS), Computational Science and Technology (CST).
    Peng, Ivy Bo
    KTH, School of Electrical Engineering and Computer Science (EECS), Computational Science and Technology (CST).
    Laure, Erwin
    KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC.
    Pleiter, Dirk
    Jülich Supercomputing Centre, Jülich, Germany.
    de Witt, Shaun
    Culham Centre for Fusion Energy, Abingdon, Oxon, England.
    SAGE: Percipient Storage for Exascale Data Centric Computing. 2019. In: Parallel Computing, ISSN 0167-8191, E-ISSN 1872-7336, Vol. 83, p. 22-33. Article in journal (Refereed)
    Abstract [en]

    As we head towards the era of Exascale computing, we aim to implement a Big Data/Extreme Computing (BDEC) capable system infrastructure, termed SAGE (Percipient StorAGe for Exascale Data Centric Computing). The SAGE system will be capable of storing and processing immense volumes of data in the Exascale regime, and will provide the capability for Exascale-class applications to use such a storage infrastructure. SAGE addresses the increasing overlap between Big Data analysis and HPC in an era of next-generation data centric computing. This overlap has developed due to the proliferation of massive data sources, such as large, dispersed scientific instruments and sensors, whose data need to be processed, analysed, and integrated into simulations to derive scientific and innovative insights. The SAGE platform also addresses Exascale I/O, a problem that has not been sufficiently dealt with for simulation codes. The objective of this paper is to discuss the software architecture of the SAGE system and to look at early results we have obtained employing some of its key methodologies, as the system continues to evolve.

  • 7. Olsson, Pelle
    Johnsson, Lennart
    KTH, School of Computer Science and Communication (CSC), Centres, Centre for High Performance Computing, PDC.
    A Data Parallel Implementation of an Explicit Method for the Compressible Navier-Stokes Equations for Three-Dimensional Channel Flow. 1990. In: Parallel Computing, ISSN 0167-8191, E-ISSN 1872-7336, Vol. 14, no 1, p. 1-30. Article in journal (Refereed)
    Abstract [en]

    The fluid flow in a three-dimensional twisted channel is modeled by both the compressible Navier-Stokes equations and the Euler equations. A three-stage Runge-Kutta method is used for integrating the system of equations in time. A second-order accurate, centered difference scheme is used for spatial derivatives of the flux variables. For both the Euler and the Navier-Stokes equations, artificial viscosity introduced through fourth-order centered differences is used to stabilize the numerical scheme. By using lower-order difference approximations on or close to the boundary than in the interior, the difference stencils can be evaluated at all grid points concurrently. Several difference molecules for the boundaries and several factorizations of the fourth-order difference operators were evaluated. With the appropriate factorization of the difference stencils, six variables per lattice point suffice for the evaluation of the difference stencils occurring in the code. The three fourth-order stencils we investigated, including three different factorizations of one of these stencils, account for three of these six variables. The convergence rate for all stencils and their factorizations is approximately the same for the first 1000-1500 steps, at which point the residual has reached a value of 10^-2 to 10^-3. From this point on, the convergence rate for one of the factorizations of the fourth-order stencil is approximately twice that of one of the unfactored stencils.
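The time-stepping scheme described above can be sketched for a model 1D problem: a three-stage Runge-Kutta integrator with a second-order centered difference and fourth-difference artificial viscosity. This is an illustrative scheme of the same family, for linear advection on a periodic grid; the coefficients and model equation are assumptions, not the paper's:

```python
def rk3_advect(u0, steps, dt, dx, c=1.0, eps=0.01):
    """Advance u_t + c u_x = 0 on a periodic grid for `steps` steps using a
    second-order centered difference for u_x, a fourth-difference artificial
    viscosity term, and a common three-stage (SSP) Runge-Kutta scheme."""
    n = len(u0)

    def rhs(v):
        out = []
        for i in range(n):
            ux = (v[(i + 1) % n] - v[(i - 1) % n]) / (2 * dx)      # centered u_x
            d4 = (v[(i + 2) % n] - 4 * v[(i + 1) % n] + 6 * v[i]
                  - 4 * v[(i - 1) % n] + v[(i - 2) % n])           # 4th difference
            out.append(-c * ux - eps * d4)
        return out

    u = list(u0)
    for _ in range(steps):
        k1 = rhs(u)
        u1 = [u[i] + dt * k1[i] for i in range(n)]
        k2 = rhs(u1)
        u2 = [0.75 * u[i] + 0.25 * (u1[i] + dt * k2[i]) for i in range(n)]
        k3 = rhs(u2)
        u = [u[i] / 3 + 2 / 3 * (u2[i] + dt * k3[i]) for i in range(n)]
    return u
```

On a periodic grid both difference operators sum to zero, so the scheme conserves the total of u, a quick sanity check on the stencils.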

    A performance of 1.05 Gflops/s was demonstrated on a 65,536-processor Connection Machine system with 512 Mbytes of primary storage. The performance scales in proportion to the number of processors: 135 Mflops/s on 8k-processor configurations, 265 Mflops/s on 16k processors, and 525 Mflops/s on 32k processors. The efficiency is independent of the machine size. The evaluation of the boundary conditions accounted for less than 5% of the total time. A performance improvement by a factor of about three is expected with optimized implementations of functional kernels such as convolution and matrix-vector multiplication.

  • 8. Peng, I. B.
    Gioiosa, R.
    Kestor, G.
    Vetter, J. S.
    Cicotti, P.
    Laure, Erwin
    KTH, School of Electrical Engineering and Computer Science (EECS), Computational Science and Technology (CST).
    Markidis, Stefano
    KTH, School of Electrical Engineering and Computer Science (EECS), Computational Science and Technology (CST).
    Characterizing the performance benefit of hybrid memory system for HPC applications. 2018. In: Parallel Computing, ISSN 0167-8191, E-ISSN 1872-7336, Vol. 76, p. 57-69. Article in journal (Refereed)
    Abstract [en]

    Heterogeneous memory systems that consist of multiple memory technologies are becoming common in high-performance computing environments. Modern processors and accelerators, such as the Intel Knights Landing (KNL) CPU and the NVIDIA Volta GPU, feature small-size high-bandwidth memory (HBM) near the compute cores and large-size normal-bandwidth memory connected off-chip. Theoretically, HBM can provide about four times higher bandwidth than conventional DRAM. However, many factors impact the actual performance improvement that an application can achieve on such a system. In this paper, we focus on the Intel KNL system and identify the factors with the largest impact on application performance, including the application's memory access pattern, the problem size, the threading level, and the actual memory configuration. We use a set of representative applications from both the scientific and data-analytics domains. Our results show that applications with regular memory access benefit from MCDRAM, achieving up to three times the performance obtained using only DRAM. On the contrary, applications with irregular memory access patterns are latency-bound and may suffer performance degradation when using only MCDRAM. We also provide a memory-centric analysis of four applications, identify their major data objects, and correlate their characteristics to the performance improvement on the testbed.
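The regular vs. irregular distinction driving these results can be made concrete: a unit-stride sweep is bandwidth-bound, while a gather through a shuffled index array is latency-bound. The functions below only illustrate the two access patterns; observing the actual MCDRAM effect requires running comparable kernels on a KNL-class system:

```python
import random

def stream_sum(a):
    """Regular, unit-stride access: bandwidth-bound on real hardware, the
    pattern the abstract reports benefiting from MCDRAM."""
    s = 0
    for x in a:
        s += x
    return s

def gather_sum(a, idx):
    """Irregular gather through an index array: latency-bound, the pattern
    that may see little benefit (or a slowdown) from MCDRAM alone."""
    s = 0
    for i in idx:
        s += a[i]
    return s

a = list(range(1000))
idx = list(range(1000))
random.shuffle(idx)
# Same arithmetic, very different memory behavior on real hardware:
assert stream_sum(a) == gather_sum(a, idx)
```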

  • 9. Rivas-Gomez, Sergio
    KTH, School of Electrical Engineering and Computer Science (EECS), Computational Science and Technology (CST).
    Gioiosa, Roberto
    Oak Ridge National Laboratory, Oak Ridge, TN 37830, USA.
    Peng, Ivy Bo
    Oak Ridge National Laboratory, Oak Ridge, TN 37830, USA.
    Kestor, Gokcen
    Oak Ridge National Laboratory, Oak Ridge, TN 37830, USA.
    Narasimhamurthy, Sai
    Seagate Systems UK, Havant PO9 1SA, England.
    Laure, Erwin
    KTH, School of Electrical Engineering and Computer Science (EECS), Computational Science and Technology (CST).
    Markidis, Stefano
    KTH, School of Electrical Engineering and Computer Science (EECS), Computational Science and Technology (CST).
    MPI windows on storage for HPC applications. 2018. In: Parallel Computing, ISSN 0167-8191, E-ISSN 1872-7336, Vol. 77, p. 38-56. Article in journal (Refereed)
    Abstract [en]

    Upcoming HPC clusters will feature hybrid memories and storage devices per compute node. In this work, we propose to use the MPI one-sided communication model and MPI windows as a unique interface for programming both memory and storage. We describe the design and implementation of MPI storage windows, and present their benefits for out-of-core execution, parallel I/O, and fault tolerance. In addition, we explore the integration of heterogeneous window allocations, where memory and storage share a unified virtual address space. When performing large, irregular memory operations, we verify that MPI windows on local storage incur a 55% performance penalty on average. When using a Lustre parallel file system, "asymmetric" performance is observed, with over 90% degradation in write operations. Nonetheless, experimental results with a distributed hash table, the HACC I/O kernel mini-application, and a novel MapReduce implementation based on MPI one-sided communication indicate that the overall penalty of MPI windows on storage can be negligible in most real-world applications.
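The idea of a window whose "memory" is actually storage can be illustrated, on a single process, by mapping a file into the address space so that put/get become plain memory operations that persist. `StorageWindow` and its methods below are hypothetical names for this toy analogy; it is not the paper's MPI implementation and carries no MPI semantics:

```python
import mmap
import os
import tempfile

class StorageWindow:
    """Toy single-process analogy of a storage-backed window: a file mapped
    into the address space, so 'put'/'get' are ordinary memory accesses
    whose contents persist once flushed."""

    def __init__(self, path, size):
        self.fd = os.open(path, os.O_RDWR | os.O_CREAT)
        os.ftruncate(self.fd, size)          # reserve the window's extent
        self.mem = mmap.mmap(self.fd, size)  # map file into virtual memory

    def put(self, offset, data):
        self.mem[offset:offset + len(data)] = data

    def get(self, offset, nbytes):
        return bytes(self.mem[offset:offset + nbytes])

    def sync(self):
        self.mem.flush()                     # push dirty pages to storage

path = os.path.join(tempfile.mkdtemp(), "win.bin")
w = StorageWindow(path, 4096)
w.put(0, b"hello")
w.sync()
assert w.get(0, 5) == b"hello"
```

The design point the analogy captures is the one the abstract makes: the same load/store-style interface serves both memory and storage, with persistence and capacity traded against access latency.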
