• 101. Innocenti, M. E.
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
Momentum conservation in Multi-Level Multi-Domain (MLMD) simulations. 2016. In: Journal of Computational Physics, ISSN 0021-9991, E-ISSN 1090-2716, Vol. 312, pp. 14-18. Journal article (Peer-reviewed)

Momentum conservation and self-forces reduction are challenges for all Particle-In-Cell (PIC) codes using spatial discretization schemes which do not fulfill the requirement of translational invariance of the grid Green's function. We comment here on the topic applied to the recently developed Multi-Level Multi-Domain (MLMD) method. The MLMD is a semi-implicit method for PIC plasma simulations. The multi-scale nature of plasma processes is addressed by using grids with different spatial resolutions in different parts of the domain.

• 102. Innocenti, M. E.
KTH, Skolan för datavetenskap och kommunikation (CSC), High Performance Computing and Visualization (HPCViz). KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
Introduction of temporal sub-stepping in the Multi-Level Multi-Domain semi-implicit Particle-In-Cell code Parsek2D-MLMD. 2015. In: Computer Physics Communications, ISSN 0010-4655, E-ISSN 1879-2944, Vol. 189, pp. 47-59. Journal article (Peer-reviewed)

In this paper, the introduction of temporal sub-stepping in Multi-Level Multi-Domain (MLMD) simulations of plasmas is discussed. The MLMD method addresses the multi-scale nature of space plasmas by simulating a problem at different levels of resolution. A large-domain "coarse grid" is simulated with low resolution to capture large-scale, slow processes. Smaller scale, local processes are obtained through a "refined grid" which uses higher resolution. Very high jumps in the resolution used at the different levels can be achieved thanks to the Implicit Moment Method and appropriate grid interlocking operations. Up to now, the same time step was used at all the levels. Now, with temporal sub-stepping, the different levels can also benefit from the use of different temporal resolutions. This saves further resources with respect to "traditional" simulations done using the same spatial and temporal stepping on the entire domain. It also prevents the levels from working at the limits of the stability condition of the Implicit Moment Method. The temporal sub-stepping is tested with simulations of magnetic reconnection in space. It is shown that, thanks to the reduced costs of MLMD simulations with respect to single-level simulations, it becomes possible to verify with realistic mass ratios scaling laws previously verified only for reduced mass ratios. Performance considerations are also provided.

• 103. Innocenti, M. E.
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
A Multi Level Multi Domain Method for Particle In Cell plasma simulations. 2013. In: Journal of Computational Physics, ISSN 0021-9991, E-ISSN 1090-2716, Vol. 238, pp. 115-140. Journal article (Peer-reviewed)

A novel adaptive technique for electromagnetic Particle In Cell (PIC) plasma simulations is presented here. Two main issues are identified as regards the development of the algorithm. First, the choice of the size of the particle shape function in progressively refined grids, with the decision to avoid both time-dependent shape functions and cumbersome particle-to-grid interpolation techniques, and, second, the necessity to comply with the strict stability constraints of the explicit PIC algorithm. The adaptive implementation presented responds to these demands with the introduction of a Multi Level Multi Domain (MLMD) system, where a cloud of self-similar domains is fully simulated with both fields and particles, and the use of an Implicit Moment PIC method as baseline algorithm for the adaptive evolution. Information is exchanged between the levels with the projection of the field information from the refined to the coarser levels and the interpolation of the boundary conditions for the refined levels from the coarser level fields. Particles are bound to their level of origin and are prevented from transitioning to coarser levels, but are repopulated at the refined grid boundaries with a splitting technique. The presented algorithm is tested against a series of simulation challenges.

• 104.
KTH, Skolan för datavetenskap och kommunikation (CSC), High Performance Computing and Visualization (HPCViz).
KTH, Centra, SeRC - Swedish e-Science Research Centre. KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC. KTH, Skolan för datavetenskap och kommunikation (CSC), High Performance Computing and Visualization (HPCViz). KTH, Skolan för datavetenskap och kommunikation (CSC), High Performance Computing and Visualization (HPCViz). KTH, Skolan för datavetenskap och kommunikation (CSC), High Performance Computing and Visualization (HPCViz). KTH, Centra, SeRC - Swedish e-Science Research Centre. KTH, Skolan för datavetenskap och kommunikation (CSC), High Performance Computing and Visualization (HPCViz). KTH, Centra, SeRC - Swedish e-Science Research Centre.
Evaluation of Parallel Communication Models in Nekbone, a Nek5000 mini-application. 2015. In: 2015 IEEE International Conference on Cluster Computing, IEEE, 2015, pp. 760-767. Conference paper (Peer-reviewed)

Nekbone is a proxy application of Nek5000, a scalable Computational Fluid Dynamics (CFD) code used for modelling incompressible flows. The Nekbone mini-application is used by several international co-design centers to explore new concepts in computer science and to evaluate their performance. We present the design and implementation of a new communication kernel in the Nekbone mini-application with the goal of studying the performance of different parallel communication models. First, a new MPI blocking communication kernel has been developed to solve Nekbone problems in a three-dimensional Cartesian mesh and process topology. The new MPI implementation delivers a 13% performance improvement compared to the original implementation. The new MPI communication kernel consists of approximately 500 lines of code against the original 7,000 lines of code, allowing experimentation with new approaches in Nekbone parallel communication. Second, the MPI blocking communication in the new kernel was changed to the MPI non-blocking communication. Third, we developed a new Partitioned Global Address Space (PGAS) communication kernel, based on the GPI-2 library. This approach reduces the synchronization among neighbor processes and is on average 3% faster than the new MPI-based, non-blocking, approach. In our tests on 8,192 processes, the GPI-2 communication kernel is 3% faster than the new MPI non-blocking communication kernel. In addition, we have used the OpenMP in all the versions of the new communication kernel. Finally, we highlight the future steps for using the new communication kernel in the parent application Nek5000.

• 105.
KTH, Skolan för datavetenskap och kommunikation (CSC), High Performance Computing and Visualization (HPCViz).
KTH, Skolan för datavetenskap och kommunikation (CSC), Beräkningsvetenskap och beräkningsteknik (CST). KTH, Skolan för datavetenskap och kommunikation (CSC), High Performance Computing and Visualization (HPCViz). KTH, Centra, SeRC - Swedish e-Science Research Centre. KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC. KTH, Centra, SeRC - Swedish e-Science Research Centre. KTH, Skolan för teknikvetenskap (SCI), Mekanik. KTH, Centra, SeRC - Swedish e-Science Research Centre. KTH, Skolan för teknikvetenskap (SCI), Centra, Linné Flow Center, FLOW. KTH, Skolan för teknikvetenskap (SCI), Mekanik, Stabilitet, Transition, Kontroll. KTH, Skolan för teknikvetenskap (SCI), Centra, Linné Flow Center, FLOW. KTH, Centra, SeRC - Swedish e-Science Research Centre. KTH, Skolan för datavetenskap och kommunikation (CSC), High Performance Computing and Visualization (HPCViz). KTH, Centra, SeRC - Swedish e-Science Research Centre.
Evaluating New Communication Models in the Nek5000 Code for Exascale. 2015. Conference paper (Other academic)
• 106. Johan, Zdenek
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
A Data Parallel Finite Element Method for Computational Fluid Dynamics on the Connection Machine Systems. 1992. In: Computer Methods in Applied Mechanics and Engineering, ISSN 0045-7825, E-ISSN 1879-2138, Vol. 99, no. 1, pp. 113-134. Journal article (Peer-reviewed)

A finite element method for computational fluid dynamics has been implemented on the Connection Machine systems CM-2 and CM-200. An implicit iterative solution strategy, based on the pre-conditioned matrix-free GMRES algorithm, is employed. Parallel data structures built on both nodal and elemental sets are used to achieve maximum parallelization. Communication primitives provided through the Connection Machine Scientific Software Library substantially improved the overall performance of the program. Computations of three-dimensional compressible flows using unstructured meshes having close to one million elements, such as a complete airplane, demonstrate that the Connection Machine systems are suitable for these applications. Performance comparisons are also carried out with the vector computers Cray Y-MP and Convex C-1.

• 107. Johan, Zdenek
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
Scalability of Finite Element Applications on Distributed-Memory Parallel Computers. 1994. In: Computer Methods in Applied Mechanics and Engineering, ISSN 0045-7825, E-ISSN 1879-2138, Vol. 119, no. 1-2, pp. 61-72. Journal article (Peer-reviewed)

This paper demonstrates that scalability and competitive efficiency can be achieved for unstructured grid finite element applications on distributed memory machines, such as the Connection Machine CM-5 system. The efficiency of finite element solvers is analyzed through two applications: an implicit computational aerodynamics application and an explicit solid mechanics application. Scalability of mesh decomposition and of data mapping strategies is also discussed. Numerical examples that support the claims for problems with an excess of fourteen million variables are presented.

• 108. Johan, Zdenek
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
An Efficient Communication Strategy for Finite Element Methods on the Connection Machine CM-5 System. 1994. In: Computer Methods in Applied Mechanics and Engineering, ISSN 0045-7825, E-ISSN 1879-2138, Vol. 113, no. 3-4, pp. 363-387. Journal article (Peer-reviewed)

The objective of this paper is to propose communication procedures suitable for unstructured finite element solvers implemented on distributed-memory parallel computers such as the Connection Machine CM-5 system. First, a data-parallel implementation of the recursive spectral bisection (RSB) algorithm proposed by Pothen et al. is presented. The RSB algorithm is associated with a node renumbering scheme which improves data locality of reference. Two-step gather and scatter operations taking advantage of this data locality are then designed. These communication primitives make use of the indirect addressing capability of the CM-5 vector units to achieve high gather and scatter bandwidths. The performance of the proposed communication strategy is illustrated on large-scale three-dimensional fluid dynamics problems.

• 109. Johnsson, L.
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
The impact of Moore's Law and loss of Dennard scaling: Are DSP SoCs an energy efficient alternative to x86 SoCs? 2016. In: Journal of Physics, Conference Series, ISSN 1742-6588, E-ISSN 1742-6596, Vol. 762, no. 1, article ID 012022. Journal article (Peer-reviewed)

Moore's law, the doubling of transistors per unit area for each CMOS technology generation, is expected to continue throughout the decade, while Dennard voltage scaling resulting in constant power per unit area stopped about a decade ago. The semiconductor industry's response to the loss of Dennard scaling and the consequent challenges in managing power distribution and dissipation has been leveled-off clock rates, a die performance gain reduced from about a factor of 2.8 to 1.4 per technology generation, and multi-core processor dies with increased cache sizes. Increased cache sizes offer performance benefits for many applications as well as energy savings. Accessing data in cache is considerably more energy efficient than main memory accesses. Further, caches consume less power than a corresponding amount of functional logic. As feature sizes continue to be scaled down, an increasing fraction of the die must be "underutilized" or "dark" due to power constraints. With power being a prime design constraint, there is a concerted effort to find significantly more energy efficient chip architectures than those dominant in servers today, with chips potentially incorporating several types of cores to cover a range of applications, or different functions in an application, as is already common for the mobile processor market. Digital Signal Processors (DSPs), largely targeting the embedded and mobile processor markets, typically have been designed for a power consumption of 10% or less of a typical x86 CPU, yet with much more than 10% of the floating-point capability of the same technology generation x86 CPUs. Thus, DSPs could potentially offer an energy efficient alternative to x86 CPUs. Here we report an assessment of the Texas Instruments TMS320C6678 DSP with regard to its energy efficiency for two common HPC benchmarks: STREAM (memory system benchmark) and HPL (CPU benchmark).

• 110.
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
A Computational Array for the QR-method. 1982. Conference paper (Peer-reviewed)
• 111.
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
A VLSI Algorithm and Array for the QR-method. 1981. Conference paper (Peer-reviewed)
• 112.
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
An Algorithm for State Estimation in Power Systems. 1973. Conference paper (Peer-reviewed)
• 113.
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
An Algorithm for State Estimation in Power Systems. 1973. Conference paper (Peer-reviewed)
• 114.
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
Band Matrix Systems Solvers on Ensemble Architectures. 1986. In: Algorithms, Architectures and the Future of Scientific Computation, Texas Tech University Press, 1986, pp. 195-216. Chapter in book, part of anthology (Peer-reviewed)

We present direct solvers for band matrix systems for processor ensembles configured as 2-dimensional meshes with end-around connections, binary trees, shuffle-exchange, perfect shuffle and boolean cube networks, and as clusters of processors with intracluster connections forming a torus or a boolean cube and intercluster connections forming binary trees, shuffle-exchange, perfect shuffle and boolean cube networks. The ensembles are assumed to be of the MIMD type, and each processor is equipped with substantial local storage. There is no shared storage, and control is distributed.

• 115.
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
CMSSL: A Scalable Scientific Software Library. 1994. Conference paper (Peer-reviewed)

Massively parallel processors introduce new demands on software systems with respect to performance, scalability, robustness and portability. The increased complexity of the memory systems and the increased range of problem sizes for which a given piece of software is used pose serious challenges for software developers. The Connection Machine Scientific Software Library, CMSSL, uses several novel techniques to meet these challenges. The CMSSL contains routines for managing the data distribution and provides data distribution independent functionality. High performance is achieved through careful scheduling of operations and data motion, and through the automatic selection of algorithms at run-time. We discuss some of the techniques used, and provide evidence that CMSSL has reached the goals of performance and scalability for an important set of applications.

• 116.
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
Combining Parallel and Sequential Sorting on a Boolean n-cube. 1984. Conference paper (Peer-reviewed)
• 117.
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
Communication Efficient Basic Linear Algebra Computations on Hypercube Architectures. 1987. In: Journal of Parallel and Distributed Computing, ISSN 0743-7315, E-ISSN 1096-0848, Vol. 4, no. 2, pp. 133-179. Journal article (Peer-reviewed)

This paper presents a few algorithms for embedding loops and multidimensional arrays in hypercubes with emphasis on proximity preserving embeddings. A proximity preserving embedding minimizes the need for communication bandwidth in computations requiring nearest neighbor communication. Two storage schemes for "large" problems on "small" machines are suggested and analyzed, and algorithms for matrix transpose, multiplying matrices, factoring matrices, and solving triangular linear systems are presented. A few complete binary tree embeddings are described and analyzed. The data movement in the matrix algorithms is analyzed and it is shown that in the majority of cases the directed routing paths intersect only at nodes of the hypercube, allowing for a maximum degree of pipelining.
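
As an illustration of a proximity-preserving loop embedding, the sketch below uses the binary-reflected Gray code, a standard construction in which consecutive loop positions map to hypercube nodes that differ in exactly one address bit. The abstract does not say which embeddings the paper uses, so this choice is an assumption, and the names `gray_code`, `ring_embedding` and `hamming_distance` are hypothetical.

```python
def gray_code(i: int) -> int:
    """Binary-reflected Gray code of i."""
    return i ^ (i >> 1)


def ring_embedding(n: int) -> list[int]:
    """Map loop positions 0 .. 2**n - 1 to node addresses of an n-cube.

    Position i is placed on node gray_code(i), so neighbouring loop
    positions land on neighbouring hypercube nodes.
    """
    return [gray_code(i) for i in range(1 << n)]


def hamming_distance(u: int, v: int) -> int:
    """Number of address bits in which nodes u and v differ."""
    return bin(u ^ v).count("1")


if __name__ == "__main__":
    nodes = ring_embedding(3)
    size = len(nodes)
    # Every loop edge, including the wrap-around edge, should be a cube edge.
    for i in range(size):
        a, b = nodes[i], nodes[(i + 1) % size]
        print(f"{a:03b} -> {b:03b}  distance {hamming_distance(a, b)}")
```

Because every loop edge coincides with a hypercube edge, nearest-neighbour exchanges along the loop need only a single hop each.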

• 118.
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
Cyclic Reduction on a Binary Tree. 1985. In: Computer Physics Communications, ISSN 0010-4655, E-ISSN 1879-2944, Vol. 37, no. 1-3, pp. 195-203. Journal article (Peer-reviewed)

Ensembles of large numbers of processors tightly coupled into networks are of increasing interest. Binary tree interconnect has many favourable characteristics from a construction point of view, though the limited communication bandwidth between arbitrary processors poses a potential bottleneck. In this paper we present an algorithm for odd-even cyclic reduction on a binary tree for which the limited bandwidth does not increase the order of the computational complexity, compared to an ideal parallel machine. The complexity is $2\log_2 N$ with respect to arithmetic operations, and $3\log_2 N$ with respect to communication. The communication complexity compares favourably with the best previously published result, $O(\log_2^2 N)$. We also show that the benefits of truncated cyclic reduction are much greater for parallel reduction algorithms than for sequential algorithms. A reduction in the computational complexity proportional to the reduction in the number of reduction steps is possible.
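
For readers unfamiliar with odd-even cyclic reduction, the following is a minimal sequential sketch for a tridiagonal system, assuming the number of unknowns is of the form 2^k - 1 and that no pivoting is needed (for example, a diagonally dominant matrix). It does not reproduce the paper's mapping onto a binary tree; the function name `cyclic_reduction` and the coefficient layout are assumptions made for this example.

```python
def cyclic_reduction(a, b, c, d):
    """Solve a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i] for x.

    a, b, c are the sub-, main and super-diagonals (a[0] and c[-1] must
    be 0) and d is the right-hand side. Each recursion level halves the
    system, so there are about log2(N) reduction levels.
    """
    n = len(b)
    if n < 1 or (n + 1) & n:
        raise ValueError("number of unknowns must be 2**k - 1")
    if n == 1:
        return [d[0] / b[0]]

    odd = range(1, n, 2)
    ra, rb, rc, rd = [], [], [], []
    for i in odd:
        # Eliminate x[i-1] and x[i+1] from equation i using its neighbours.
        alpha = a[i] / b[i - 1]
        gamma = c[i] / b[i + 1]
        ra.append(-alpha * a[i - 1])
        rb.append(b[i] - alpha * c[i - 1] - gamma * a[i + 1])
        rc.append(-gamma * c[i + 1])
        rd.append(d[i] - alpha * d[i - 1] - gamma * d[i + 1])

    # Solve the half-size system for the odd-indexed unknowns.
    x = [0.0] * n
    for k, xi in enumerate(cyclic_reduction(ra, rb, rc, rd)):
        x[2 * k + 1] = xi

    # Back-substitute for the even-indexed unknowns.
    for i in range(0, n, 2):
        left = x[i - 1] if i > 0 else 0.0
        right = x[i + 1] if i < n - 1 else 0.0
        x[i] = (d[i] - a[i] * left - c[i] * right) / b[i]
    return x
```

All eliminations within one level are independent of each other, which is the property a parallel implementation exploits.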

• 119.
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
Data motion and high performance computing. 1994. In: Data Motion and High Performance Computing, 1994, pp. 1-18. Conference paper (Peer-reviewed)

Efficient data motion has been of critical importance in high performance computing almost since the first electronic computers were built. Providing sufficient memory bandwidth to balance the capacity of processors led to memory hierarchies, banked and interleaved memories. With the rapid evolution of MOS technologies, microprocessor and memory designs, it is realistic to build systems with thousands of processors and a sustained performance of a trillion operations per second or more. Such systems require tens of thousands of memory banks, even when locality of reference is exploited. Using conventional technologies, interconnecting several thousand processors with tens of thousands of memory banks can feasibly only be made by some form of sparse interconnection network. Efficient use of locality of reference and network bandwidth is critical.

• 120.
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
Data Parallel Performance Optimizations Using Array Aliasing. 1999. In: Algorithms for Parallel Processing, Springer-Verlag New York, 1999, 105, pp. 213-246. Chapter in book, part of anthology (Peer-reviewed)

The array aliasing mechanism provided in the Connection Machine Fortran (CMF) language and run-time system provides a unique way of identifying the memory address spaces local to processors within the global address space of distributed memory architectures, while staying in the data parallel programming paradigm. We show how the array aliasing feature can be used effectively in optimizing communication and computation performance. The constructs we present occur frequently in many scientific and engineering applications, and include various forms of aggregation and array reshaping through array aliasing. The effectiveness of the optimization techniques is demonstrated on an implementation of Anderson's hierarchical O(N) N-body method.

• 121.
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
Data Parallel Programming and Basic Linear Algebra Subroutines. 1988. In: Scientific Software, Springer, 1988, pp. 183-196. Chapter in book, part of anthology (Peer-reviewed)
• 122.
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
Data Parallel Supercomputing. 1989. In: Use of Parallel Processors in Meteorology, Springer Verlag, 1989. Chapter in book, part of anthology (Peer-reviewed)

Supercomputers with a performance of a trillion floating-point operations per second, or more, can be produced in state-of-the-art MOS technologies. Such computers will have tens of thousands of processors interconnected by a network of bounded degree. Reducing the required data motion through a careful choice of data allocation and computational and routing algorithms is critical for performance. The management of thousands of processors can only be accomplished through programming languages with suitable abstractions.

We use the Connection Machine as a model architecture for future supercomputers, and Fortran 8X as an example of a language with some of the abstractions suitable for programming thousands of processors. Some of the communication primitives suitable for expressing structured scientific computations are discussed, and their benefit with respect to performance illustrated. With thousands of processors engaged in the solution of a single scientific problem, several subtasks are often treated concurrently in addition to the concurrent execution of each subtask. Some issues in constructing scientific libraries for such environments are discussed. Concurrent algorithms and performance data for matrix multiplication and the Fast Fourier Transform are presented. The solution of the compressible Navier-Stokes equation in three spatial dimensions by an explicit finite difference method, and the solution of a parabolic approximation of the Helmholtz equation by an implicit method are two examples of applications for which data parallel implementations are described briefly. The Helmholtz equation models three-dimensional acoustic waves in the ocean.

• 123.
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
Datortekniken: en ingenjörsmässig och humanistisk bedrift. 2002. In: Dator till vardags, Deadalus, 2002, pp. 11-30. Chapter in book, part of anthology (Peer-reviewed)
• 124.
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
Dense Matrix Operations on a Torus and a Boolean Cube. 1985. Conference paper (Peer-reviewed)
• 125.
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
Ensemble Architectures and Their Algorithms: An Overview. 1987. In: Numerical Algorithms for Modern Parallel Computer Architectures, Springer Verlag, 1987, pp. 109-144. Chapter in book, part of anthology (Peer-reviewed)
• 126.
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
Grids, Accounting and High-Performance Computing. 2007. Conference paper (Peer-reviewed)
• 127.
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
Highly Concurrent Algorithms for Solving Linear Systems of Equations. 1984. In: Elliptic Problem Solvers II, Academic Press, 1984, pp. 105-126. Chapter in book, part of anthology (Peer-reviewed)
• 128.
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
Highly Parallel Banded Systems Solvers. 1987. In: Parallel Computations and Their Impact on Mechanics, ASME Press, 1987, pp. 187-208. Chapter in book, part of anthology (Peer-reviewed)

We present algorithms for the solution of banded systems of equations on parallel architectures, in particular ensemble architectures, i.e., architectures that have a large number of processing elements. Each processor has its own local storage. The band is considered dense. Concurrent elimination of a single variable yields a linear speed-up for ensembles configured as tori, or Boolean cubes if $N > m$, with a maximum ensemble size of $m(m+R)$ (or $2m(m+R)$) processors for a banded system of $N$ equations, bandwidth $2m+1$ and $R$ right-hand sides. The minimum attainable computational complexity is of order $O(N)$. Concurrent elimination of multiple variables as well as concurrent elimination of each such variable yields a minimum complexity of $O(m + m\log_2(N/m))$ for a total of $(2m+R)N$ ensemble nodes. To attain this complexity the ensemble should be configured as clusters, each in the form of a torus of dimension $m$ by $2m+R$, or a Boolean cube of appropriate dimension. Furthermore, corresponding processors in different clusters are assumed to be interconnected to form a binary tree, shuffle-exchange, perfect shuffle, or Boolean cube network. The number of clusters should be of order $O(N/m)$ for minimum computational complexity.

• 129.
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
Issues in High Performance Computer Network. 1994. In: IEEE Technical Committee on Computer Architecture Newsletter, pp. 14-19. Journal article (Peer-reviewed)

And the memory modules are at opposite sides of the network, or the "switch". The network bandwidth is sufficient to support the full bandwidth of the memory system. In Massively Parallel Processors, MPPs, the processors are typically associated with one or a few memory modules in order to reduce the demands on the network when locality of reference can be exploited. The network can only support a fraction of the memory bandwidth. This difference has important consequences both for data allocation and routing. Keeping network construction costs relatively low dictates that networks be constructed out of parts that can be fabricated by replication processes with low unit cost, and that the same parts can be used for systems of different sizes to allow for the amortization of fixed costs over large numbers of parts. Multistage networks, such as butterfly networks, Ω-networks, and fat-tree networks, as well as two- and three-dimensional meshes can all be built out of massproduce.

• 130.
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
Language and Compiler Issues in Scalable High Performance Libraries. 1993. In: Compilation Techniques for Novel Architectures, Springer-Verlag New York, 1993. Chapter in book, part of anthology (Peer-reviewed)

Library functions for scalable architectures must be designed to correctly and efficiently support any distributed data structure that can be created with the supported languages and associated compiler directives. Libraries must be designed also to support concurrency in each function evaluation, as well as the concurrent application of the functions to disjoint array segments, known as multiple-instance computation. Control over the data distribution is often critical for locality of reference, and so is the control over the interprocessor data motion. Scalability, while preserving efficiency, implies that the data distribution, the data motion, and the scheduling is adapted to the object shapes, the machine configuration, and the size of the objects relative to the machine size. The Connection Machine Scientific Software Library is a scalable library for distributed data structures. The library is designed for languages with an array syntax. It is accessible from all supported languages (Lisp, C, CM-Fortran, and Paris (PARallel Instruction Set) in combination with Lisp, C, and Fortran 77). Single library calls can manage both concurrent application of a function to disjoint array segments, as well as concurrency in

• 131.
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
Massively Parallel Computing: Data distribution and communication. 1993. In: Parallel Architectures and their Efficient Use, Springer-Verlag New York, 1993, pp. 68-92. Chapter in book, part of anthology (Peer-reviewed)

We discuss some techniques for preserving locality of reference in index spaces when mapped to memory units in a distributed memory architecture. In particular, we discuss the use of multidimensional address spaces instead of linearized address spaces, partitioning of irregular grids, and placement of partitions among nodes. We also discuss a set of communication primitives we have found very useful on the Connection Machine systems in implementing scientific and engineering applications. We briefly review some of the techniques used to fully utilize the bandwidth of the binary cube network of the CM-2 and CM-200, and give some performance data from implementations of communication primitives.

• 132.
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
Massively Parallel Computing: Unstructured Finite Element Simulations. 1993. In: NAFEMS, pp. 24-29. Journal article (Peer-reviewed)

Massively parallel computing holds the promise of extreme performance. Critical for achieving high performance is the ability to exploit locality of reference and effective management of the communication resources. This article describes two communication primitives and associated mapping strategies that have been used for several different unstructured, three-dimensional, finite element applications in computational fluid dynamics and structural mechanics.

• 133.
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
Matrix Multiplication on Boolean Cubes using Generic Communication Primitives. 1989. In: Processing and Medium Scale Multiprocessors, SIAM, 1989, pp. 108-156. Chapter in book, part of anthology (Peer-reviewed)

Generic primitives for matrix operations as defined by levels one, two and three of the BLAS are of great value in that they make user programs much simpler, and hide most of the architectural detail of importance for performance in the primitives. We describe generic shared memory primitives such as one-to-all and all-to-all broadcasting, and one-to-all and all-to-all personalized communication, and implementations thereof that are within a factor of two of the best known lower bounds. We describe algorithms for the multiplication of arbitrarily shaped matrices using these primitives. Of the three loops required for a standard matrix multiplication algorithm expressed in Fortran, all three can be parallelized. We show that if one loop is parallelized, then the processors should be aligned with the loop having the most elements. Depending on the initial matrix allocation, data permutations may be required to accomplish the processor/loop alignment. This permutation is included in our analysis. We show that in parallelizing two loops the optimum aspect ratio of the processing plane is equal to the ratio of the number of matrix elements in the two loops being parallelized

• 134.
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
Minimizing the Communication Time for Matrix Multiplication on Multiprocessors. 1993. In: Parallel Computing, ISSN 0167-8191, E-ISSN 1872-7336, Vol. 19, no. 11, pp. 1235-1257. Article in journal (Peer-reviewed)

We present one matrix multiplication algorithm for two-dimensional arrays of processing nodes, and one algorithm for three-dimensional nodal arrays. One-dimensional nodal arrays are treated as a degenerate case. The algorithms are designed to utilize fully the communications bandwidth in high degree networks in which the one-, two-, or three-dimensional arrays may be embedded. For binary $n$-cubes, our algorithms offer a speedup of the communication over previous algorithms for square matrices and square two-dimensional arrays by a factor of $n/2$. Configuring the $N = 2^n$ processing nodes as a three-dimensional array may reduce the communication complexity by a factor of $N^{1/6}$ compared to a two-dimensional nodal array. The three-dimensional algorithm requires temporary storage proportional to the length of the nodal array axis aligned with the axis shared between the multiplier and the multiplicand. The optimal two-dimensional nodal array shape with respect to communication has a ratio between the numbers of node rows and columns equal to the ratio between the numbers of matrix rows and columns of the product matrix, with the product matrix accumulated in-place. The optimal three-dimensional nodal array shape has a ratio between the lengths of the machine axes equal approximately to the ratio between the lengths of the three axes in matrix multiplication. For product matrices of extreme shape, one-dimensional nodal array shapes are optimal when $N/n \lesssim 2P/R$ for $P > R$, or $N/n \lesssim 2R/P$ for $R \geq P$, where $P$ is the number of rows and $R$ the number of columns of the product matrix. All our algorithms use standard communication functions.
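
The shape rule stated in this abstract (node rows to node columns in the same ratio as product-matrix rows to columns) can be illustrated with a small sketch. The function name `best_2d_shape` and the restriction to power-of-two axis lengths are assumptions made for illustration; this is not the algorithm from the article, only a way to pick a grid shape that follows the stated ratio.

```python
import math


def best_2d_shape(n: int, P: int, R: int) -> tuple[int, int]:
    """Split N = 2**n nodes into a rows x cols grid (both powers of two)
    whose rows/cols ratio is closest, on a log2 scale, to P/R.

    P and R are the numbers of rows and columns of the product matrix.
    Hypothetical helper: it only applies the ratio rule from the abstract.
    """
    target = math.log2(P / R)          # desired log2(rows / cols)
    best = None
    for row_bits in range(n + 1):      # rows = 2**row_bits
        col_bits = n - row_bits        # cols = 2**col_bits
        err = abs((row_bits - col_bits) - target)
        if best is None or err < best[0]:
            best = (err, 1 << row_bits, 1 << col_bits)
    return best[1], best[2]


if __name__ == "__main__":
    print(best_2d_shape(6, 1024, 1024))  # square product matrix
    print(best_2d_shape(6, 4096, 256))   # tall product matrix
```

For a square product matrix on 64 nodes the sketch selects an 8 x 8 grid, and for a product matrix that is 16 times taller than wide it selects 32 x 2.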

• 135.
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
Network–Related Performance Issues and Techniques for MPPs. 1996. In: Optoelectronic Interconnect and Packaging, SPIE - International Society for Optical Engineering, 1996, CR62, pp. 176-209. Chapter in book, part of anthology (Peer-reviewed)

In this paper we review network-related performance issues for current massively parallel processors (MPPs) in the context of some important basic operations in scientific and engineering computation. The communication system is one of the most performance-critical architectural components of MPPs. In particular, understanding the demand posed by collective communication is critical in architectural design and system software implementation. We discuss collective communication and some techniques for implementing it on electronic networks. Finally, we give an example of a novel general routing technique that exhibits good scalability, efficiency and simplicity in electronic networks.

• 136.
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
Optimal Communication in Network Architectures. 1990. In: VLSI Frontiers: Massively Parallel Models of Computation, Morgan Kaufmann Publishers, 1990, pp. 223-389. Chapter in book, part of anthology (Peer-reviewed)
• 137.
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
Overview of Data Centers Energy Efficiency Evolution. 2011. In: Handbook of Green Computing, CRC Press, 2011. Chapter in book, part of anthology (Peer-reviewed)
• 138.
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
Performance Modeling of Distributed Memory Architectures. 1991. In: Journal of Parallel and Distributed Computing, ISSN 0743-7315, E-ISSN 1096-0848, Vol. 12, no. 4, pp. 300-312. Article in journal (Peer-reviewed)

We provide performance models for several primitive operations on data structures distributed over memory units interconnected by a Boolean cube network. In particular, we model single-source and multiple-source concurrent broadcasting or reduction, concurrent gather and scatter operations, shifts along several axes of multidimensional arrays, and emulation of butterfly networks. We also show how the processor configuration, the data aggregation, and the encoding of the address space affect the performance for two important basic computations: the multiplication of arbitrarily shaped matrices and the Fast Fourier Transform. We also give an example of the performance behavior for local matrix operations for a processor with a single path to local memory and a set of processor registers. The analytic models are verified by measurements on the Connection Machine Model CM-2.
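
As an illustration of single-source broadcasting on a Boolean cube, the sketch below builds the textbook spanning-binomial-tree schedule: in step $d$ every node that already holds the data forwards it across cube dimension $d$. The function name `broadcast_schedule` is hypothetical, and the abstract does not say that this particular schedule is the one modeled in the article.

```python
def broadcast_schedule(n: int, source: int = 0) -> list[list[tuple[int, int]]]:
    """Return, for each of the n steps, the (sender, receiver) pairs of a
    one-to-all broadcast on a Boolean n-cube with 2**n nodes.

    Nodes are numbered 0 .. 2**n - 1; two nodes are neighbors when their
    binary addresses differ in exactly one bit.
    """
    have = {source}                      # nodes that already hold the data
    steps = []
    for d in range(n):
        # Every holder sends to its neighbor across dimension d.
        sends = [(u, u ^ (1 << d)) for u in sorted(have)]
        have.update(receiver for _, receiver in sends)
        steps.append(sends)
    return steps


if __name__ == "__main__":
    for step, sends in enumerate(broadcast_schedule(3)):
        print(step, sends)
```

The number of informed nodes doubles in every step, so all $2^n$ nodes are reached after $n$ communication steps.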

• 139.
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
Pipelined Linear Equation Solvers and VLSI. 1982. Conference paper (Peer-reviewed)

Many of the commonly used methods for solution of linear systems of equations on sequential machines can be given a concurrent formulation. The concurrent algorithms take advantage of independence of operations in order to reduce the time complexity of the methods. During the course of the computations specified by the algorithm, data has to be routed to the various places of computation. Pipelining can be used to avoid broadcasting in VLSI arrays for computation. Pipelining will in general allow for a reduced cycle time but may force data to be spread out in time, as is the case for Gaussian elimination. The required spacing depends on the pipelining and the data flow.

• 140.
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
Solving Narrow Banded Systems on Ensemble Architectures. 1985. In: ACM Transactions on Mathematical Software, ISSN 0098-3500, E-ISSN 1557-7295, Vol. 11, no. 3, pp. 271-288. Article in journal (Peer-reviewed)

We present concurrent algorithms for the solution of narrow banded systems on ensemble architectures, and analyze the communication and arithmetic complexities of the algorithms. The algorithms consist of three phases. In phase 1, a block tridiagonal system of reduced size is produced through largely local operations. Diagonal dominance is preserved. If the original system is symmetric positive definite, so is the reduced system. It is solved in a second phase, and the remaining variables are obtained through local back substitution in a third phase. With a sufficient number of processing elements, there is no first and third phase. We investigate the arithmetic and communication complexity of Gaussian elimination and block cyclic reduction for the solution of the reduced system on boolean cubes, perfect shuffle and shuffle-exchange networks, binary trees, and linear arrays. With an optimum number of processors, the minimum solution time on a linear array is of an order that ranges from $O(m^2 \sqrt{Nm})$ to $O(m^3 + m^3 \log_2(N/m))$ depending on the bandwidth, the dimension of the problem, and the times for communication and arithmetic. For boolean cubes, cube-connected cycles, perfect shuffle and shuffle-exchange networks, and binary trees, the minimum time is $O(m^3 + m^3 \log_2(N/m))$ including the communication complexity

• 141.
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
Solving Tridiagonal Systems on Ensemble Architectures. 1987. In: SIAM Journal on Scientific and Statistical Computing, Vol. 8, no. 3, pp. 354-392. Article in journal (Peer-reviewed)

Concurrent elimination methods for the solution of tridiagonal systems on linear and 2-dimensional arrays, complete binary trees, shuffle-exchange and perfect shuffle networks, and boolean cubes are devised and analyzed. The methods can be obtained by symmetric permutations of some rows and columns, and amount to cyclic reduction or a combination of Gaussian elimination and cyclic reduction (GECR). The ensembles have only local storage and no global control. Synchronization is accomplished via message passing to neighboring processors. The parallel arithmetic complexity of GECR for $N$ equations on a $K$ processor ensemble is $O({ N / K } + \log _2 K)$, and the communication complexity is $O(K)$ for the linear array, $O(\sqrt K )$ for the 2-dimensional mesh, and $O(\log _2 K)$ for the networks of diameter $O(\log _2 K)$. The maximum speed-up for the linear array is attained at $K \approx {({ N / \alpha })}^{1/2}$ and for the 2-d mesh at $K \approx ({N / 2\alpha })^{2/3}$, where $\alpha =$ (the time to communicate one floating-point number)/(the time for a floating-point arithmetic operation). For the binary tree the maximum speed-up is attained at $K = N$, and for the perfect shuffle and boolean $k$-cube networks, $K = N/(1 + \alpha )$ yields the maximum speed-up. The minimum time complexity is of order $O(N^{1/2} )$ for the linear array, of order $O(N^{1/3} )$ for the mesh, and of order $O(\log _2 N)$ for the binary tree, the shuffle-exchange, the perfect shuffle and the boolean $k$-cube. The relative decrease in computational complexity due to a truncation of the reduction process in a highly concurrent system is much greater than on a uniprocessor. The reduction in the arithmetic complexity is proportional to the number of steps avoided, if the number of processing elements equals the number of equations. So also is the reduction in the communication complexity for ensembles configured as binary trees, shuffle-exchange and perfect shuffle networks, and boolean cubes. Partitioning the ensemble into subsets of processors is shown to be more efficient for the solution of multiple independent problems than pipelining the solutions over the entire ensemble. A balanced cyclic reduction algorithm is presented for the case where each system is spread uniformly over the processing elements, and its complexity is compared with Gaussian elimination.
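
For readers unfamiliar with cyclic reduction, the following is a minimal serial sketch of plain odd-even cyclic reduction for a tridiagonal system $a_i x_{i-1} + b_i x_i + c_i x_{i+1} = d_i$. It is not the GECR method or any of the ensemble mappings analyzed in the article. The function name `cyclic_reduction` is hypothetical, and the sketch assumes `a[0] == 0`, `c[-1] == 0`, and nonzero pivots (for example a diagonally dominant system), with no pivoting.

```python
def cyclic_reduction(a, b, c, d):
    """Solve a tridiagonal system by recursive odd-even cyclic reduction.

    a: sub-diagonal (a[0] must be 0), b: diagonal,
    c: super-diagonal (c[-1] must be 0), d: right-hand side.
    Returns the solution as a list of floats.
    """
    n = len(b)
    if n == 1:
        return [d[0] / b[0]]

    # Eliminate the even-indexed unknowns from every odd-indexed equation.
    ra, rb, rc, rd = [], [], [], []
    for i in range(1, n, 2):
        has_next = i + 1 < n
        alpha = a[i] / b[i - 1]
        gamma = c[i] / b[i + 1] if has_next else 0.0
        ra.append(-alpha * a[i - 1])
        rb.append(b[i] - alpha * c[i - 1] - (gamma * a[i + 1] if has_next else 0.0))
        rc.append(-gamma * c[i + 1] if has_next else 0.0)
        rd.append(d[i] - alpha * d[i - 1] - (gamma * d[i + 1] if has_next else 0.0))

    # The reduced system is again tridiagonal, with about half the unknowns.
    x = [0.0] * n
    x[1::2] = cyclic_reduction(ra, rb, rc, rd)

    # Back substitution recovers the even-indexed unknowns independently.
    for i in range(0, n, 2):
        left = a[i] * x[i - 1] if i > 0 else 0.0
        right = c[i] * x[i + 1] if i + 1 < n else 0.0
        x[i] = (d[i] - left - right) / b[i]
    return x


if __name__ == "__main__":
    # Second-difference matrix with known solution [1, 2, 3, 4, 5].
    print(cyclic_reduction([0, -1, -1, -1, -1], [2] * 5,
                           [-1, -1, -1, -1, 0], [0, 0, 0, 0, 6]))
```

All eliminations within one level are independent of each other, as are all back substitutions, which is what makes the method attractive for concurrent execution.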

• 142.
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
Supercomputers: Past and Future. 1990. In: Kosmos, Almqvist & Wiksell, 1990, pp. 31-44. Chapter in book, part of anthology (Peer-reviewed)

Abstract: "Progress in many fields of science and in engineering design is rapidly becomming [sic] critically dependent upon supercomputers. The management of very large data sets, including fast update and retrieval of information, is also becomming [sic] a very important function in many non-manufacturing businesses, such as the transportation, the securities, and financial industries, and in various parts of the government. The goal for the designers of the next generation

• 143.
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
The Connection Machine System CM–5. 1993. Conference paper (Peer-reviewed)
• 144.
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
VLSI Algorithms for Doolittle’s, Crout’s and Cholesky’s Methods. 1982. Conference paper (Peer-reviewed)

In order to take full advantage of the emerging VLSI technology, it is necessary to recognize its limited communication capability and structure algorithms accordingly. In this paper concurrent algorithms for the methods of Crout, Doolittle and Cholesky are described and compared with concurrent algorithms for Gauss', Givens' and Householder's methods. The effect of pipelining the computations in two-dimensional arrays is given special attention.

• 145.
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
The SNIC/KTH PRACE prototype: Achieving high energy efficiency with commodity technology without acceleration. 2010. Conference paper (Peer-reviewed)

Energy efficiency has become one of the most important considerations for HPC systems, particularly for large scale systems, for economic and environmental reasons and in exceptional cases also social and political. Many approaches are currently being pursued, with regard to both architecture and hardware and software technologies, to improve energy efficiency for HPC systems. The prototype described here, one of several within the PRACE project exploring improved energy efficiency, explores energy efficiency achievable through use of commodity components for cost effectiveness, and without acceleration for preservation/ease of portability of the large application code base that exists for the type of HPC systems that have been dominating for a decade. The prototype development was a collaborative effort between industry and academia. With a very limited budget for a server design project and severe time constraints, the novelty was effectively limited to careful component choices with regard to energy efficiency for HPC workloads and a new motherboard design to support the component choices. A further constraint was that the outcome would be of production quality in order for the industry partners to market the prototype design should it be successful. For the component choices we did a characterization of the power consumption of a blade chassis and made an effort to measure the energy consumption of different memory modules under HPC workloads, information we could find neither in the literature nor from memory or system vendors. Memory power consumption in the prototype, as well as most HPC systems, is second only to the CPU, sometimes a close second. We report on the design of the prototype, and preliminary performance results with an emphasis on the energy aspects of benchmarks and compare our results with the Blue Gene/P that, after its introduction, has dominated the top of the Green500 list for systems not using acceleration. The preliminary results show that energy efficiency comparable to the BG/P can be achieved without any proprietary technology at a fraction of the cost. The prototype design is now included in the standard product line of the participating platform vendor.

• 146.
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
A Mathematical Approach to Modeling the Flow of Data and Control in Computational Networks. 1981. In: VLSI Systems and Computations, Computer Science Press, 1981, pp. 213-225. Chapter in book, part of anthology (Peer-reviewed)
• 147.
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
Node Orderings and Concurrency in Structurally–Symmetric Sparse Problems. 1989. In: Parallel Supercomputing: Methods, Algorithms and Applications, Wiley-Blackwell, 1989, pp. 177-189. Chapter in book, part of anthology (Peer-reviewed)
• 148.
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
Algorithms for Matrix Transposition on Boolean N-Cube Configured Ensemble Architectures. 1988. In: SIAM Journal on Matrix Analysis and Applications, ISSN 0895-4798, E-ISSN 1095-7162, Vol. 9, no. 3, pp. 419-454. Article in journal (Peer-reviewed)

In a multiprocessor with distributed storage the data structures have a significant impact on the communication complexity. In this paper we present a few algorithms for performing matrix transposition on a Boolean $n$-cube. One algorithm performs the transpose in a time proportional to the lower bound both with respect to communication start-ups and to element transfer times. We present algorithms for transposing a matrix embedded in the cube by a binary encoding, a binary-reflected Gray code encoding of rows and columns, or combinations of these two encodings. The transposition of a matrix when several matrix elements are identified to a node by consecutive or cyclic partitioning is also considered and lower bound algorithms given. Experimental data are provided for the Intel iPSC and the Connection Machine

• 149.
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
Boolean Cube Emulation of Butterfly Networks Encoded by Gray Code. 1994. In: Journal of Parallel and Distributed Computing, ISSN 0743-7315, E-ISSN 1096-0848, Vol. 20, no. 3, pp. 261-179. Article in journal (Peer-reviewed)

The authors present algorithms for butterfly emulation on binary-reflected Gray coded data that require the same number of element transfers in sequence in a Boolean cube network as for a binary encoding. The required code conversion is either performed in local memories, or through concurrent exchanges not affecting the number of element transfers in sequence. The emulation of a butterfly network with one or two elements per processor requires $n$ communication cycles on an $n$-cube. For more than two elements per processor, one additional communication cycle is required for every pair of elements. The encoding on completion can be either binary, or binary-reflected Gray code, or any combination thereof, without affecting the communication complexity.
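
The binary-reflected Gray code mentioned in this abstract can be converted to and from the binary encoding with a few bit operations. The sketch below shows only the standard code conversion, with hypothetical function names `gray_encode` and `gray_decode`; it does not reproduce the emulation algorithms of the article.

```python
def gray_encode(i: int) -> int:
    """Binary-reflected Gray code of the non-negative integer i."""
    return i ^ (i >> 1)


def gray_decode(g: int) -> int:
    """Inverse of gray_encode: recover i from its Gray code g."""
    i = 0
    while g:
        i ^= g      # accumulate the prefix XOR of the shifted code
        g >>= 1
    return i


if __name__ == "__main__":
    codes = [gray_encode(i) for i in range(8)]
    print([format(g, "03b") for g in codes])
```

Consecutive integers map to codes that differ in exactly one bit, so array neighbors land on neighboring nodes of a Boolean cube.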

• 150.
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
Generalized Shuffle Permutations on Boolean Cubes. 1992. In: Journal of Parallel and Distributed Computing, ISSN 0743-7315, E-ISSN 1096-0848, Vol. 16, no. 1, pp. 1-14. Article in journal (Peer-reviewed)

In a generalized permutation an address $(a_{q-1} a_{q-2} \ldots a_0)$ receives its content from an address obtained through a cyclic shift on a subset of the $q$ dimensions used for the encoding of the addresses. Bit-complementation may be combined with the shift. We give an algorithm that requires $K/2 + 2$ exchanges for $K$ elements per processor, when storage dimensions are part of the permutation, and concurrent communication on all ports of every processor is possible. The number of element exchanges in sequence is independent of the number of processor dimensions $\omega_r$ in the permutation.
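
The address map described in this abstract, a cyclic shift applied to a chosen subset of the address bits, can be sketched as follows. The function name `shuffle_source`, the shift direction, and the order in which `dims` is listed are assumptions made for illustration. The optional bit-complementation and the exchange algorithm itself are not shown.

```python
def shuffle_source(addr: int, dims: list[int], shift: int = 1) -> int:
    """Cyclically shift the bits of addr located at the positions in dims.

    dims lists the participating bit positions from least to most
    significant. The bit at dims[j] is replaced by the bit previously at
    dims[(j - shift) % len(dims)]; all other bits are left unchanged.
    """
    bits = [(addr >> d) & 1 for d in dims]   # bits taking part in the shift
    k = len(dims)
    result = addr
    for j, d in enumerate(dims):
        b = bits[(j - shift) % k]
        result = (result & ~(1 << d)) | (b << d)
    return result


if __name__ == "__main__":
    # Shift over all three dimensions of a 3-bit address space.
    print([shuffle_source(a, [0, 1, 2]) for a in range(8)])
    # Shift over the subset {0, 2}; bits 1 and 3 stay in place.
    print(format(shuffle_source(0b1011, [0, 2]), "04b"))
```

Because the map only rearranges bit positions, it is a permutation of the address space for any choice of `dims`.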
