• 1. Abdullah, Matin
KTH, School of Computer Science and Communication (CSC), Centres, Centre for High Performance Computing, PDC.
SimDB: A Problem Solving Environment for Molecular Dynamics Simulation and Analysis (2000). Conference paper (Refereed)

The design of a software environment, SimDB, for molecular dynamics simulation and analysis is presented as an example of virtual laboratories enabled by high-speed networks connecting substantial computing and storage resources with more modest local computation and visualization resources available to research groups. SimDB includes large-scale, dynamic, distributed data repositories. The simulated data sets, trajectories, are usually interpreted through reduced data sets, processed data sets, calculated by analysis functions. Both trajectory data and processed data are saved, but in different databases, with processed databases having several smaller objects for each trajectory. A browser-based user interface with a well defined API allows for a wide array of analysis functions. Analysis functions are executed only if the requested analysis result is not available. The ability to incorporate user defined functions is a critical feature of SimDB.

• 2.
KTH, School of Computer Science and Communication (CSC), Centres, Centre for High Performance Computing, PDC.
KTH, School of Computer Science and Communication (CSC), Centres, Centre for High Performance Computing, PDC.
Dynamic, Context-Aware, Least-Privilege Grid Delegation (2007). In: 8th IEEE/ACM International Conference on Grid Computing, New York: IEEE, 2007, p. 209-216. Conference paper (Refereed)

Performing delegation in large scale, dynamic and distributed environments with large numbers of shared resources is more challenging than inside local administrative domains. In dynamic environments like Grids, on one hand, delegating a restricted set of rights reduces exposure to attack but also limits the flexibility and dynamism of the application; on the other hand, delegating all rights provides maximum flexibility but increases exposure. This issue has not yet been adequately addressed by current Grid security mechanisms and is becoming a very challenging and crucial issue for future Grid development. Therefore, providing an effective delegation mechanism which meets the requirements of the least privilege principle is becoming an essential need. Furthermore, we are witnessing a phenomenal increase in the automation of organizational tasks and decision making, as well as the computerization of information related services, requiring automated delegation mechanisms. In order to meet these requirements we introduce an Active Delegation Framework which extends our previous work on on-demand delegation, making it context-aware. The framework provides a just-in-time, restricted and dynamic delegation mechanism for Grids. In this paper we describe the development of this framework and its implementation and integration with the Globus Toolkit.

• 3.
KTH, School of Computer Science and Communication (CSC), Centres, Centre for High Performance Computing, PDC.
KTH, School of Computer Science and Communication (CSC), Centres, Centre for High Performance Computing, PDC. KTH, School of Computer Science and Communication (CSC), Centres, Centre for High Performance Computing, PDC.
Toward An On-demand Restricted Delegation Mechanism for Grids (2006). In: 2006 7th IEEE/ACM International Conference on Grid Computing, New York: IEEE, 2006, p. 152-159. Conference paper (Refereed)

Grids are intended to enable cross-organizational interactions, which makes Grid security a challenging and nontrivial issue. In Grids, delegation is a key facility that can be used to authenticate and authorize requests on behalf of disconnected users. In current Grid systems there is a tradeoff between flexibility and security in the context of delegation. Applications must choose between limited or full delegation: on one hand, delegating a restricted set of rights reduces exposure to attack but also limits the flexibility/dynamism of the application; on the other hand, delegating all rights provides maximum flexibility but increases exposure. In this paper, we propose an on-demand restricted delegation mechanism, aimed at addressing the shortcomings of current delegation mechanisms by providing restricted delegation in a flexible fashion as needed for Grid applications. This mechanism provides an ontology-based solution for tackling one of the most challenging issues in security systems, which is the principle of least privilege. It utilizes a callback mechanism, which allows on-demand provisioning of delegated credentials in addition to observing, screening, and auditing delegated rights at runtime. This mechanism provides support for generating delegation credentials with a very limited and well-defined range of capabilities or policies, where a delegator is able to grant a delegatee a set of restricted and limited rights, implicitly or explicitly.

• 4. Ali, Ayaz
University of Houston, Texas.
Empirical Auto-tuning Code Generator for FFT and Trigonometric Transforms (2007). Conference paper (Refereed)

We present an automatic, empirically tuned code generator for Real/Complex FFT and Trigonometric Transforms. The code generator is part of an adaptive and portable FFT computation framework, UHFFT. Performance portability over varying architectures is achieved by generating a highly optimized set of straight-line C codelets (micro-kernels) that adapt to the microprocessor architecture. The tuning is performed by generating several variants of the same-size codelet with different combinations of optimization parameters. The variants are iteratively compiled and evaluated to find the best implementation on a given platform. Apart from minimizing the operation count, the code generator optimizes for access pattern, register blocking, instruction schedule and structure of arrays. We present details of the optimizations conducted at several stages and the performance gain at each of those levels. We conclude the paper with a discussion of the overall performance improvements due to this aggressive approach to generating optimized FFT kernels.

• 5. Ali, Ayaz
University of Houston.
Scheduling FFT Computation on SMP and Multi-core Systems (2007). In: Proceedings of the 21st annual international conference on Supercomputing, 2007, p. 293-301. Conference paper (Refereed)

Increased complexity of memory systems to ameliorate the gap between the speed of processors and memory has made it increasingly harder for compilers to optimize an arbitrary code within a palatable amount of time. With the emergence of multicore (CMP), multiprocessor (SMP) and hybrid shared memory multiprocessor architectures, achieving high efficiency is becoming even more challenging. To address the challenge to achieve high efficiency in performance critical applications, domain specific frameworks have been developed that aid the compilers in scheduling the computations. We have developed a portable framework for the Fast Fourier Transform (FFT) that achieves high efficiency by automatically adapting to various architectural features. Adapting to parallel architectures by searching through all the combinations of schedules (plans) is an expensive task, even when the search is conducted in parallel. In this paper, we develop heuristics to simplify the generation of better schedules for parallel FFT computations on CMP/SMP systems. We evaluate the performance of OpenMP and PThreads implementations of FFT on a number of recent architectures. The performance of parallel FFT schedules is compared with that of the best plan generated for sequential FFT, and the speedup for different numbers of processors is reported. In the end, we also present a performance comparison between the UHFFT and FFTW implementations.

• 6. Ali, Ayaz
Department of Computer Science, University of Houston, Houston.
Adaptive Computation of Self Sorting In-place FFTs on Hierarchical Memory Architectures (2007). In: High Performance Computing and Communications, Proceedings, 2007, p. 372-383. Conference paper (Refereed)

Computing "in-place and in-order" FFT poses a very difficult problem on hierarchical memory architectures where data movement can seriously degrade the performance. In this paper we present a recursive formulation of a self sorting in-place FFT algorithm that adapts to the target architecture. For transform sizes where an in-place, in-order execution is not possible, we show how schedules can be constructed that use minimum work-space to perform the computation efficiently. In order to express and construct FFT schedules, we present a context free grammar that generates the FFT Schedule Specification Language. We conclude by comparing the performance of our in-place in-order FFT implementation with that of other well known FFT libraries. We also present a performance comparison between the out-of-place and in-place execution of various FFT sizes.

• 7. Austin, S
KTH, School of Computer Science and Communication (CSC), Centres, Centre for High Performance Computing, PDC.

The Minority University Consortium for Earth and Space Sciences (MUCESS), a collaboration among diverse minority institutions dedicated to increasing the number of underrepresented students pursuing professional and research careers in Earth and Atmospheric Science and Space Science, was informed that it had been funded by NSF for a faculty and student research opportunity in atmospheric science. Among the institutions, only Medgar Evers College, City University of New York had a prior program in ozone monitoring and a bachelor's degree in environmental science. The funding provided an opportunity to strengthen the initial team with the addition of G. Morris, Valparaiso University and B. Lefer, University of Houston, as both had an ongoing ozone research program. The grant enabled MEC to continue their activities and the University of Houston-Downtown to increase the number of launches per year. South Carolina State University is able to strengthen their support system and incorporate the activities into both their academic and outreach programs. The opportunity to partner with G. Morris and B. Lefer will enable the institutions to expand their ozonesonde launches to include both tropospheric and stratospheric ozone distribution and transport. Faculty-student workshops will be an integral part of the program, as the activity will increase the scientific knowledge of the participants.

The program provides an opportunity for minority students to pursue studies in the geosciences and develop the skills and knowledge to pursue graduate degrees in the discipline.

• 8. Baillie, Clive
KTH, School of Computer Science and Communication (CSC), Centres, Centre for High Performance Computing, PDC.
QCD with Dynamical Fermions on the Connection Machine (1989). Conference paper (Refereed)

We have implemented Quantum Chromo-Dynamics (QCD) on the massively parallel Connection Machine in *Lisp. The code uses dynamical Wilson fermions and the Hybrid Monte Carlo Algorithm (HMCA) to update the lattice. We describe our program and give performance measurements for it. With no tuning or optimization, the code runs at approximately 500 to 1000 MFLOPS on a 64-K Connection Machine, model CM-2, depending on the VP ratio.

• 9. Berman, F.
KTH, School of Computer Science and Communication (CSC), Centres, Centre for High Performance Computing, PDC.
New grid scheduling and rescheduling methods in the GrADS Project (2005). In: International journal of parallel programming, ISSN 0885-7458, E-ISSN 1573-7640, Vol. 33, no 2-3, p. 209-229. Article in journal (Refereed)

The goal of the Grid Application Development Software (GrADS) Project is to provide programming tools and an execution environment to ease program development for the Grid. This paper presents recent extensions to the GrADS software framework: a new approach to scheduling workflow computations, applied to a 3-D image reconstruction application; a simple stop/migrate/restart approach to rescheduling Grid applications, applied to a QR factorization benchmark; and a process-swapping approach to rescheduling, applied to an N-body simulation. Experiments validating these methods were carried out on both the GrADS MacroGrid (a small but functional Grid) and the MicroGrid (a controlled emulation of the Grid).

• 10. Berman, F.
KTH, Superseded Departments, Numerical Analysis and Computer Science, NADA.
The GrADS project: Software support for high-level grid application development (2001). In: The international journal of high performance computing applications, ISSN 1094-3420, E-ISSN 1741-2846, Vol. 15, no 4, p. 327-344. Article in journal (Refereed)

Advances in networking technologies will soon make it possible to use the global information infrastructure in a qualitatively different way: as a computational as well as an information resource. As described in the recent book The Grid: Blueprint for a New Computing Infrastructure, this Grid will connect the nation's computers, databases, instruments, and people in a seamless web of computing and distributed intelligence, which can be used in an on-demand fashion as a problem-solving resource in many fields of human endeavor, in particular science and engineering. The availability of grid resources will give rise to dramatically new classes of applications, in which computing resources are no longer localized but, rather, distributed, heterogeneous, and dynamic; computation is increasingly sophisticated and multidisciplinary; and computation is integrated into our daily lives and, hence, subject to stricter time constraints than at present. The impact of these new applications will be pervasive, ranging from new systems for scientific inquiry, through computing support for crisis management, to the use of ambient computing to enhance personal mobile computing environments. To realize this vision, significant scientific and technical obstacles must be overcome. Principal among these is usability. The goal of the Grid Application Development Software (GrADS) project is to simplify distributed heterogeneous computing in the same way that the World Wide Web simplified information sharing over the Internet. To that end, the project is exploring the scientific and technical problems that must be solved to make it easier for ordinary scientific users to develop, execute, and tune applications on the Grid.
In this paper, the authors describe the vision and strategies underlying the GrADS project, including the base software architecture for grid execution and performance monitoring, strategies and tools for construction of applications from libraries of grid-aware components, and development of innovative new science and engineering applications that can exploit these new technologies to run effectively in grid environments.

• 11. Brickner, R.G
KTH, School of Computer Science and Communication (CSC), Centres, Centre for High Performance Computing, PDC.
QCD on the Connection Machine: Beyond *Lisp (1991). In: Computer Physics Communications, ISSN 0010-4655, E-ISSN 1879-2944, Vol. 65, p. 39-51. Article in journal (Refereed)

We report on the status of code development for a simulation of quantum chromodynamics (QCD) with dynamical Wilson fermions on the Connection Machine model CM-2. Our original code, written in *Lisp, gave performance in the near-GFLOPS range. We have rewritten the most time-consuming parts of the code in the low-level programming system CMIS, including the matrix multiply and the communication. Current versions of the code run at approximately 3.6 GFLOPS for the fermion matrix inversion, and we expect the next version to reach or exceed 5 GFLOPS.

• 12. Brunet, Jean-Philippe
KTH, School of Computer Science and Communication (CSC), Centres, Centre for High Performance Computing, PDC.
All-to-All Broadcast with Applications on the Connection Machine (1992). In: International Journal of Supercomputer Applications, Vol. 6, no 3, p. 241-256. Article in journal (Refereed)

An all-to-all broadcast algorithm that exploits concurrent communication on all channels of the Connection Machine system CM-200 binary cube network is described. Issues in integrating a physical all-to-all broadcast between processing nodes into a language environment using a global address space are discussed. Timings for the physical broadcast between nodes and for the virtual broadcast are given. The peak data transfer rate for the physical broadcast on a CM-200 is 5.9 gigabytes/sec, and the peak rate for the virtual broadcast is 31 gigabytes/sec. Array reshaping is an effective performance optimization technique. An example is given where reshaping improved performance by a factor of 7 by reducing the amount of local data motion. We also show how to exploit symmetry for computation of an interaction matrix using the all-to-all broadcast function. Further optimizations are suggested for N-body-type calculations. Using the all-to-all broadcast function, a peak rate of 9.3 GFLOPS has been achieved for the N-body computations in 32-bit precision on a 2,048-node Connection Machine system CM-200.

• 13. Chiang, Chao-Lin
KTH, School of Computer Science and Communication (CSC), Centres, Centre for High Performance Computing, PDC.
Residue Arithmetic and VLSI (1983). Conference paper (Refereed)

In the residue number system, arithmetic is carried out on each digit individually. There is no carry chain. This locality is of particular interest in VLSI. An evaluation of different implementations of residue arithmetic is carried out, and the effects of reduced feature sizes estimated. At the current state of technology the traditional table lookup method is preferable for a range that requires a maximum modulus that is represented by up to 4 bits, while an array of adders offers the best performance for 7 or more bits. A combination of adders and tables covers 5 and 6 bits the best. At 0.5 µm feature size table lookup is competitive only up to 3 bits. These conclusions are based on sample designs in nMOS.
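The carry-free, digit-wise arithmetic that this abstract refers to can be illustrated with a small sketch (the moduli are chosen arbitrarily for illustration; this is not code from the paper):

```python
from math import prod

# Pairwise coprime moduli; the representable range is their product.
MODULI = (3, 5, 7)          # range = 3 * 5 * 7 = 105
RANGE = prod(MODULI)

def to_residues(x):
    """Encode x as independent digits x mod m_i -- no carries between digits."""
    return tuple(x % m for m in MODULI)

def add(a, b):
    """Digit-wise addition: each digit is computed locally, with no carry chain."""
    return tuple((x + y) % m for x, y, m in zip(a, b, MODULI))

def mul(a, b):
    """Digit-wise multiplication, likewise carry-free."""
    return tuple((x * y) % m for x, y, m in zip(a, b, MODULI))

def from_residues(r):
    """Decode by exhaustive search (a real design would use the
    Chinese Remainder Theorem or table lookup, as the paper discusses)."""
    return next(x for x in range(RANGE) if to_residues(x) == r)

assert from_residues(add(to_residues(17), to_residues(23))) == 40
assert from_residues(mul(to_residues(6), to_residues(9))) == 54
```

Because every digit position is independent, each could be handled by a separate small table or adder in hardware, which is exactly the locality property the paper exploits for VLSI.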

• 14. Cohen, D
KTH, School of Computer Science and Communication (CSC), Centres, Centre for High Performance Computing, PDC.
Mathematical Approach to Computational Networks (1983). Conference paper (Refereed)

This report deals with design principles for iterative computational networks. Such computational networks are used for performing repetitive computations which typically are not data-dependent. Most of the signal processing algorithms, like FFT and filtering, belong to this class. The main idea in this report is the development of mathematical notation for expressing such designs. This notation captures the important features and properties of these computational networks, and can be used for analyzing, designing, and objectively evaluating computational networks.

• 15. Cohen, Danny
KTH, School of Computer Science and Communication (CSC), Centres, Centre for High Performance Computing, PDC.
An Algebraic Description of Array Implementations of FFT Algorithms (1982). Conference paper (Refereed)
• 16. Cooper, K.
KTH, Superseded Departments, Numerical Analysis and Computer Science, NADA.
New Grid Scheduling and Rescheduling Methods in the GrADS Project (2004). Conference paper (Refereed)

Summary form only given. The goal of the Grid Application Development Software (GrADS) project is to provide programming tools and an execution environment to ease program development for the grid. We present recent extensions to the GrADS software framework: 1) a new approach to scheduling workflow computations, applied to a 3D image reconstruction application; 2) a simple stop/migrate/restart approach to rescheduling grid applications, applied to a QR factorization benchmark; and 3) a process-swapping approach to rescheduling, applied to an N-body simulation. Experiments validating these methods were carried out on both the GrADS MacroGrid (a small but functional grid) and the MicroGrid (a controlled emulation of the grid), and the results were demonstrated at the SC2003 conference.

• 17. Dongarra, Jack
KTH, School of Computer Science and Communication (CSC), Centres, Centre for High Performance Computing, PDC.
Solving Banded Systems on Parallel Architectures (1987). In: Parallel Computing, ISSN 0167-8191, E-ISSN 1872-7336, Vol. 5, no 2, p. 219-246. Article in journal (Refereed)
• 18. Edelman, Alan
KTH, School of Computer Science and Communication (CSC), Centres, Centre for High Performance Computing, PDC.
Index Transformation Algorithms in a Linear Algebra Framework (1994). In: IEEE Transactions on Parallel and Distributed Systems, ISSN 1045-9219, E-ISSN 1558-2183, Vol. 5, no 12, p. 1302-1309. Article in journal (Refereed)

We present a linear algebraic formulation for a class of index transformations such as Gray code encoding and decoding, matrix transpose, bit reversal, vector reversal, shuffles, and other index or dimension permutations. This formulation unifies, simplifies, and can be used to derive algorithms for hypercube multiprocessors. We show how all the widely known properties of Gray codes, and some not so well-known properties as well, can be derived using this framework. Using this framework, we relate hypercube communications algorithms to Gauss-Jordan elimination on a matrix of 0's and 1's.
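The Gray code encoding and decoding among the index transformations this abstract mentions have well-known bitwise forms; a minimal sketch (the standard binary-reflected formulation, not the paper's linear-algebra framework):

```python
def gray_encode(i):
    """Binary-reflected Gray code: successive codes differ in exactly one
    bit, so consecutive indices map to neighboring hypercube nodes."""
    return i ^ (i >> 1)

def gray_decode(g):
    """Invert the encoding by a prefix-XOR over the bits."""
    i = 0
    while g:
        i ^= g
        g >>= 1
    return i

# Neighboring indices land on adjacent hypercube nodes (Hamming distance 1).
for k in range(15):
    assert bin(gray_encode(k) ^ gray_encode(k + 1)).count("1") == 1
assert all(gray_decode(gray_encode(k)) == k for k in range(64))
```

The single-bit-change property is what makes Gray codes useful for hypercube communication: stepping through indices in Gray order corresponds to walking along cube edges.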

• 19.
KTH, School of Computer Science and Communication (CSC), Numerical Analysis, NA.
KTH, School of Computer Science and Communication (CSC), Centres, Centre for High Performance Computing, PDC.
Proceedings PDC Seventh Annual Conference: Simulation and Visualization on the Grid (2000) (ed. 13). Book (Refereed)

The Grid is an emerging computational infrastructure, similar to the pervasive energy infrastructure provided by national power grids. Simulation and Visualization on the Grid focuses on applications and technologies on this emerging computational Grid. Readers will find interesting discussions of such Grid technologies as distributed file I/O, clustering, CORBA software infrastructure, tele-immersion, interaction environments, visualization steering and virtual reality as well as applications in biology, chemistry and physics. A lively panel discussion addresses current successes and pitfalls of the Grid. This book provides an understanding of the Grid that offers a persistent, wide-scale infrastructure for solving problems.

• 20. Feig, Michael
KTH, School of Computer Science and Communication (CSC), Centres, Centre for High Performance Computing, PDC.
Large Scale Data Repository: Design of a Molecular Dynamics Trajectory Database (1999). In: Future generations computer systems, ISSN 0167-739X, E-ISSN 1872-7115, Vol. 16, no 1, p. 101-110. Article in journal (Refereed)

The design of a molecular dynamics trajectory database is presented as an example of the organization of large-scale dynamic distributed repositories for scientific data. Large scientific datasets are usually interpreted through reduced data calculated by analysis functions. This allows a database architecture in which the analyzed datasets, that are kept in addition to the raw datasets, are transferred to a database user. A flexible user interface with a well defined Application Program Interface (API) allows for a wide array of analysis functions and the incorporation of user defined functions is a critical part of the database design. An analysis function is executed only when the requested analysis result is not available from an earlier request. A prototype implementation used to gain initial practical experiences with performance and scalability is presented.

• 21. Feng, Cheng
KTH, School of Computer Science and Communication (CSC), Centres, Centre for High Performance Computing, PDC.
Developing Assays for the Detection of Influenza in Human Samples (2007). In: International Conference on Bioinformatics & Computational Biology, BIOCOMP 2007 / [ed] Hamid R. Arabnia, Mary Qu Yang, Jack Y. Yang, 2007, p. 781-. Conference paper (Refereed)
• 22. Gardfjall, Peter
KTH, School of Computer Science and Communication (CSC), Centres, Centre for High Performance Computing, PDC.
Scalable Grid-wide capacity allocation with the SweGrid Accounting System (SGAS) (2008). In: Concurrency and Computation, ISSN 1532-0626, E-ISSN 1532-0634, Vol. 20, no 18, p. 2089-2122. Article in journal (Refereed)

The SweGrid Accounting System (SGAS) allocates capacity in collaborative Grid environments by coordinating enforcement of Grid-wide usage limits as a means to offer usage guarantees and prevent overuse. SGAS employs a credit-based allocation model where Grid capacity is granted to projects via Grid-wide quota allowances that can be spent across the Grid resources. The resources collectively enforce these allowances in a soft, real-time manner. SGAS is built on service-oriented principles with a strong focus on interoperability and Web services standards. This article covers the SGAS design and implementation, which, besides addressing inherent Grid challenges (scale, security, heterogeneity, decentralization), emphasizes generality and flexibility to produce a customizable system with lightweight integration into different middleware and scheduling system combinations. We focus the discussion around the system design, a flexible allocation model, middleware integration experiences and scalability improvements via a distributed virtual banking system, and finally, an extensive set of testbed experiments. The experiments evaluate the performance of SGAS in terms of response times, request throughput, overall system scalability, and its performance impact on the Globus Toolkit 4 job submission software. We conclude that, for all practical purposes, the quota enforcement overhead incurred by SGAS on job submissions is not a limiting factor for the job-handling capacity of the job submission software.

• 23. George, William
KTH, School of Computer Science and Communication (CSC), Centres, Centre for High Performance Computing, PDC.
POLYSHIFT Communications Software for the Connection Machine System CM-200 (1994). In: Scientific Programming, ISSN 1058-9244, E-ISSN 1875-919X, Vol. 3, no 1, p. 83-99. Article in journal (Refereed)

We describe the use and implementation of a polyshift function PSHIFT for circular shifts and end-off shifts. Polyshift is useful in many scientific codes using regular grids, such as finite difference codes in several dimensions, multigrid codes, molecular dynamics computations, and lattice gauge physics computations, such as quantum chromodynamics (QCD) calculations. Our implementation of the PSHIFT function on the Connection Machine systems CM-2 and CM-200 offers a speedup of up to a factor of 3-4 compared with CSHIFT when the local data motion within a node is small. The PSHIFT routine is included in the Connection Machine Scientific Software Library (CMSSL).

• 24. Gerogiannis, D.C
KTH, School of Computer Science and Communication (CSC), Centres, Centre for High Performance Computing, PDC.
Histogram Computation on Distributed Memory Architectures (1989). In: Concurrency: Practice and Experience, Vol. 1, no 2, p. 219-237. Article in journal (Refereed)

One data-independent and one data-dependent algorithm for the computation of image histograms on parallel computers are presented, analysed and implemented on the Connection Machine system CM-2. The data-dependent algorithm has a lower requirement on communication bandwidth by only transferring bins with a non-zero count. Both algorithms perform all-to-all reduction, which is implemented through a sequence of exchanges as defined by a butterfly network. The two algorithms are compared based on predicted and actual performance on the Connection Machine CM-2. With few pixels per processor the data-dependent algorithm requires in the order of √B data transfers for B bins compared to B data transfers for the data-independent algorithm. As the number of pixels per processor grows the advantage of the data-dependent algorithm decreases. The advantage of the data-dependent algorithm increases with the number of bins of the histogram.
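The all-to-all reduction through a butterfly exchange sequence that this abstract describes can be sketched with simulated processors (a toy model of the data-independent variant, not the CM-2 implementation):

```python
def butterfly_allreduce_histograms(local_hists):
    """All-to-all reduction of per-processor histograms on P = 2^n simulated
    'processors' via log2(P) pairwise exchange rounds; every processor ends
    with the full histogram. This models the data-independent algorithm,
    which always transfers all B bins in each exchange."""
    P = len(local_hists)
    assert P & (P - 1) == 0, "P must be a power of two"
    hists = [list(h) for h in local_hists]
    d = 1
    while d < P:
        # In each round, processor p exchanges its partial histogram with
        # partner p XOR d, as defined by the butterfly network.
        hists = [[a + b for a, b in zip(hists[p], hists[p ^ d])]
                 for p in range(P)]
        d <<= 1
    return hists

parts = [[1, 0, 2], [0, 3, 0], [2, 2, 1], [1, 1, 1]]   # 4 processors, 3 bins
result = butterfly_allreduce_histograms(parts)
assert all(h == [4, 6, 4] for h in result)
```

The data-dependent variant discussed in the paper would instead send only the non-zero bins in each exchange, which pays off when there are few pixels per processor and most bins are empty.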

• 25. Harris, Tim
KTH, School of Computer Science and Communication (CSC), Centres, Centre for High Performance Computing, PDC.
Matrix Multiplication on the Connection Machine (1989). Conference paper (Refereed)

A data parallel implementation of the multiplication of matrices of arbitrary shapes and sizes is presented. A systolic algorithm based on a rectangular processor layout is used by the implementation. All processors contain submatrices of the same size for a given operand. Matrix-vector multiplication is used as a primitive for local matrix-matrix multiplication in the Connection Machine system CM-2 implementation. The peak performance of the local matrix-matrix multiplication is in excess of 20 Gflop/s. The overall algorithm including all required data motion has a peak performance of 5.8 Gflop/s.

• 26. Ho, Ching-Tien
KTH, School of Computer Science and Communication (CSC), Centres, Centre for High Performance Computing, PDC.
Systolic FFT Algorithms on Boolean Cubes (1988). Conference paper (Refereed)

A description is given of a systolic Cooley-Tukey fast Fourier transform algorithm for Boolean n-cubes with a substantial amount of storage per cube node. In mapping a Cooley-Tukey type FFT to such a network, the main concerns are effective use of the high connectivity/bandwidth of the Boolean n-cube, the computational resources, the storage bandwidth, if there is a storage hierarchy, and the pipelines should the arithmetic units have such a feature. Another important consideration in a multiprocessor, distributed storage architecture is the allocation and access to coefficients, if they are precomputed. FFT algorithms are described that use both the storage bandwidth and the communication system optimally and require storage of P + nN coefficients for a transform on P ≥ N data elements. A complex-to-complex FFT on 16 million points is predicted to require about 1.5 s on a Connection Machine model CM-2.
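For background on the Cooley-Tukey structure that this and several other entries build on, a generic radix-2 sketch (this is the textbook recursion, not the systolic Boolean-cube algorithm of the paper):

```python
import cmath

def fft(x):
    """Recursive radix-2 Cooley-Tukey DFT; len(x) must be a power of two.
    Splits into even- and odd-indexed halves and combines with twiddle
    factors -- the butterfly pattern that maps naturally onto cube networks."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

# Spot-check against the direct O(n^2) DFT definition.
xs = [complex(j, -j) for j in range(8)]
direct = [sum(xs[j] * cmath.exp(-2j * cmath.pi * j * k / 8) for j in range(8))
          for k in range(8)]
assert all(abs(a - b) < 1e-9 for a, b in zip(fft(xs), direct))
```

Each level of the recursion corresponds to one butterfly stage; on a Boolean n-cube, successive stages can be assigned to exchanges along successive cube dimensions, which is the mapping the systolic formulation exploits.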

• 27. Ho, Ching-Tien
Algorithms for Matrix Transposition on Boolean Cube Configured Ensemble Architectures (1987). Conference paper (Refereed)

In a multiprocessor with distributed storage the data structures have a significant impact on the communication complexity. In this paper we present a few algorithms for performing matrix transposition on a Boolean n-cube. One algorithm performs the transpose in a time proportional to the lower bound both with respect to communication start-ups and to element transfer times. We present algorithms for transposing a matrix embedded in the cube by a binary encoding, a binary-reflected Gray code encoding of rows and columns, or combinations of these two encodings. The transposition of a matrix when several matrix elements are identified to a node by consecutive or cyclic partitioning is also considered and lower bound algorithms given. Experimental data are provided for the Intel iPSC and the Connection Machine.

• 28. Ho, Ching-Tien
KTH, School of Computer Science and Communication (CSC), Centres, Centre for High Performance Computing, PDC.
Distributed Routing Algorithms for Broadcasting and Personalized Communication in Hypercubes1986Conference paper (Refereed)

High communication bandwidth in standard technologies is more expensive to realize than a high rate of arithmetic or logic operations. The effective utilization of communication resources is crucial for good overall performance in highly concurrent systems. In this paper we address two different communication problems in Boolean n-cube configured multiprocessors: 1) broadcasting, i.e., distribution of common data from a single source to all other nodes, and 2) sending personalized data from a single source to all other nodes. The well known spanning tree algorithm obtained by bit-wise complementation of leading zeroes (referred to as the SBT algorithm, for Spanning Binomial Tree) is compared with an algorithm using multiple spanning binomial trees (MSBT). The MSBT algorithm offers a potential speed-up over the SBT algorithm by a factor of log2 N. We also present a balanced spanning tree algorithm (BST) that offers a lower complexity than the SBT algorithm for case 2. The potential improvement is by a factor of 3 log2 N. The analysis takes into account the size of the data sets, the communication bandwidth, and the overhead in communication. We also provide some experimental data for the Intel iPSC d7.
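The "bit-wise complementation of leading zeroes" that defines the SBT can be made concrete: the children of a node are obtained by setting, one at a time, each zero bit above the node's highest set bit. A sketch rooted at node 0 (our helper names, not the paper's code):

```python
def sbt_children(v: int, n: int) -> list[int]:
    """Children of node v in the spanning binomial tree of an n-cube, rooted
    at 0: complement each leading zero, i.e. each bit above v's highest set bit."""
    return [v | (1 << k) for k in range(v.bit_length(), n)]

def tree_nodes(n: int) -> set[int]:
    """Collect every node reached from the root; the tree spans all 2**n nodes."""
    seen, stack = set(), [0]
    while stack:
        v = stack.pop()
        seen.add(v)
        stack.extend(sbt_children(v, n))
    return seen

assert tree_nodes(4) == set(range(16))
```

Each nonzero node has a unique parent (itself with its highest bit cleared), which is why the construction yields a spanning tree.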

• 29. Ho, Ching-Tien
KTH, School of Computer Science and Communication (CSC), Centres, Centre for High Performance Computing, PDC.
Embedding Hyper–pyramids in Hypercubes1994In: IBM Journal of Research and Development, ISSN 0018-8646, E-ISSN 2151-8556, Vol. 38, no 1, p. 31-45Article in journal (Refereed)
• 30. Ho, Ching-Tien
KTH, School of Computer Science and Communication (CSC), Centres, Centre for High Performance Computing, PDC.
Embedding Meshes in Boolean Cubes by Graph Decomposition1990In: Parallel Computing, ISSN 0167-8191, E-ISSN 1872-7336, Vol. 8, no 4, p. 325-339Article in journal (Refereed)

This paper explores the embeddings of multidimensional meshes into minimal Boolean cubes by graph decomposition. The dilation and the congestion of the product-graph embedding (G1 × G2) → (H1 × H2) is the maximum of the dilation and congestion of the two embeddings G1 → H1 and G2 → H2. The graph decomposition technique can be used to improve the average dilation and average congestion. The graph decomposition technique combined with some particular two-dimensional embeddings allows for minimal-expansion, dilation-two, congestion-two embeddings of about 87% of all two-dimensional meshes, with a significantly lower average dilation and congestion than by modified line compression. For three-dimensional meshes we show that the graph decomposition technique, together with two three-dimensional mesh embeddings presented in this paper and modified line compression, yields dilation-two embeddings of more than 96% of all three-dimensional meshes contained in a 512 × 512 × 512 mesh. The graph decomposition technique is also used to generalize the embeddings to meshes with wraparound. The dilation increases by at most one compared to a mesh without wraparound. The expansion is preserved for the majority of meshes, if a wraparound feature is added to the mesh.

• 31. Ho, Ching-Tien
KTH, School of Computer Science and Communication (CSC), Centres, Centre for High Performance Computing, PDC.
Embedding Meshes into Small Boolean Cubes1990Conference paper (Refereed)

The embedding of arrays in Boolean cubes, when there are more array elements than nodes in the cube, can always be made with optimal load-factor by reshaping the array to a one-dimensional array. We derive the dilation for such an embedding of an ℓ0 × ℓ1 × ⋯ × ℓk−1 array in an n-cube. Dilation-one embeddings can be obtained by splitting each axis into segments and assigning segments to nodes in the cube by a Gray code. The load-factor is optimal if the axis lengths contain sufficiently many powers of two. The congestion is minimized if the segment lengths along the different axes are as equal as possible, for the cube configured with at most as many axes as the array. A further decrease in the congestion is possible if the array is partitioned into subarrays, and corresponding axes of different subarrays make use of edge-disjoint Hamiltonian cycles within subcubes. The congestion can also be reduced by using multiple paths between pairs of cube nodes, i.e., by using "fat" edges.

• 32. Ho, Ching-Tien
KTH, School of Computer Science and Communication (CSC), Centres, Centre for High Performance Computing, PDC.
Embedding Three–Dimensional Meshes in Boolean Cubes by Graph Decomposition1990Conference paper (Refereed)

This paper explores the embeddings of multidimensional meshes into minimal Boolean cubes by graph decomposition. The graph decomposition technique can be used to improve the average dilation and average congestion. The graph decomposition technique combined with some particular two-dimensional embeddings allows for minimal-expansion, dilation-two, congestion-two embeddings of about 87% of all two-dimensional meshes, with a significantly lower average dilation and congestion than by modified line compression. For three-dimensional meshes the authors show that the graph decomposition technique, together with two three-dimensional mesh embeddings presented in this paper and modified line compression, yields dilation-two embeddings of more than 96% of all three-dimensional meshes contained in a 512 × 512 × 512 mesh.

• 33. Ho, Ching-Tien
KTH, School of Computer Science and Communication (CSC), Centres, Centre for High Performance Computing, PDC.
Matrix Multiplication on Hypercubes Using Full Bandwidth and Constant Storage1991Conference paper (Refereed)

For matrix multiplication on hypercube multiprocessors with the product matrix accumulated in place, a processor must receive about P²/√N elements of each input operand, with operands of size P × P distributed evenly over N processors. With concurrent communication on all ports, the number of element transfers in sequence can be reduced to P²/(√N log N) for each input operand. We present a two-level partitioning of the matrices and an algorithm for the matrix multiplication with optimal data motion and constant storage. The algorithm has sequential arithmetic complexity 2P³ and parallel arithmetic complexity 2P³/N. The algorithm has been implemented on the Connection Machine model CM-2. For the performance on the 8K CM-2, we measured about 1.6 Gflops/s, which would scale up to about 13 Gflops/s for a 64K full machine.

• 34. Ho, Ching-Tien
KTH, School of Computer Science and Communication (CSC), Centres, Centre for High Performance Computing, PDC.
Maximizing Channel Utilization for All–to–All Personalized Communication on Boolean cubes1991Conference paper (Refereed)
• 35. Ho, Ching-Tien
KTH, School of Computer Science and Communication (CSC), Centres, Centre for High Performance Computing, PDC.
On the Embedding of Arbitrary Meshes in Boolean Cubes with Expansion Two Dilation Two1987Conference paper (Refereed)
• 36. Ho, Ching-Tien
KTH, School of Computer Science and Communication (CSC), Centres, Centre for High Performance Computing, PDC.
Optimizing Tridiagonal Solvers for the Alternating Direction Method on Boolean Cube Multiprocessors1990In: SIAM Journal on Scientific Computing, ISSN 1064-8275, E-ISSN 1095-7197, Vol. 11, no 3, p. 563-592Article in journal (Refereed)

Sets of tridiagonal systems occur in many applications. Fast Poisson solvers and Alternating Direction Methods make use of tridiagonal system solvers. Network-based multiprocessors provide a cost-effective alternative to traditional supercomputer architectures. The complexity of concurrent algorithms for the solution of multiple tridiagonal systems on Boolean-cube-configured multiprocessors with distributed memory are investigated. Variations of odd-even cyclic reduction, parallel cyclic reduction, and algorithms making use of data transposition with or without substructuring and local elimination, or pipelined elimination, are considered. A simple performance model is used for algorithm comparison, and the validity of the model is verified on an Intel iPSC/1. For many combinations of machine and system parameters, pipelined elimination, or equation transposition with or without substructuring is optimum. Hybrid algorithms that at any stage choose the best algorithm among the considered ones for the remainder of the problem are presented. It is shown that the optimum partitioning of a set of independent tridiagonal systems among a set of processors yields the embarrassingly parallel case. If the systems originate from a lattice and solutions are computed in alternating directions, then to first order the aspect ratio of a computational lattice shall be the same as that of the lattice forming the base for the equations. The experiments presented here demonstrate the importance of combining in the communication system for architectures with a relatively high communications start-up time.
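Odd-even cyclic reduction, one of the algorithm families compared here, eliminates the odd-indexed unknowns to leave a half-size tridiagonal system; all eliminations within a step are independent, which is the source of parallelism. A serial sketch for systems of size 2^k − 1, assuming the usual convention a[0] = c[n−1] = 0 (an illustrative helper, not the paper's code):

```python
import numpy as np

def cyclic_reduction(a, b, c, d):
    """Solve a tridiagonal system (sub-diagonal a, diagonal b, super-diagonal c,
    right-hand side d) by odd-even cyclic reduction; size must be 2**k - 1."""
    n = len(b)
    if n == 1:
        return np.array([d[0] / b[0]])
    # Eliminate the odd-indexed unknowns; indices 1, 3, 5, ... form the
    # half-size reduced system (all rows can be processed in parallel).
    i = np.arange(1, n, 2)
    alpha = -a[i] / b[i - 1]
    gamma = -c[i] / b[i + 1]
    ra = alpha * a[i - 1]
    rb = b[i] + alpha * c[i - 1] + gamma * a[i + 1]
    rc = gamma * c[i + 1]
    rd = d[i] + alpha * d[i - 1] + gamma * d[i + 1]
    xr = cyclic_reduction(ra, rb, rc, rd)
    # Back-substitute the even-indexed unknowns from their odd neighbors.
    x = np.empty(n)
    x[1::2] = xr
    xpad = np.concatenate(([0.0], x, [0.0]))  # virtual x[-1] = x[n] = 0
    j = np.arange(0, n, 2)
    x[j] = (d[j] - a[j] * xpad[j] - c[j] * xpad[j + 2]) / b[j]
    return x
```

Each reduction step halves the system, so log2(n+1) steps suffice, which is what makes the method attractive on cube-connected machines.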

• 37. Ho, Ching-Tien
KTH, School of Computer Science and Communication (CSC), Centres, Centre for High Performance Computing, PDC.
Spanning Balanced Trees in Boolean cubes1989In: SIAM JOURNAL ON SCIENTIFIC AND STATISTICAL COMPUTING, Vol. 10, no 4, p. 607-630Article in journal (Refereed)

A spanning balanced n-tree (SBnT) in a Boolean n-cube is a spanning tree in which the root has fanout n, and all the subtrees of the root have O(2^n / n) nodes. The number of tree edges in each dimension of the n-cube is of order O(2^n / n). The spanning balanced n-tree allows for scheduling disciplines that realize lower bound (within a factor of two) one-to-all personalized communication, all-to-all broadcasting, and all-to-all personalized communication on a Boolean n-cube [C.-T. Ho and S. L. Johnsson, Proc. 1986 International Conference on Parallel Processing, pp. 640–648, IEEE Computer Society, 1986; Tech. Report YALEU/DCS/RR–483, May 1986], [S. L. Johnsson and C.-T. Ho, Tech. Report YALEU/DCS/RR–610, Dept. of Computer Science, Yale Univ., New Haven, CT, November 1987]. The improvement in data transfer time over the familiar binomial tree routing is a factor of n/2 for concurrent communication on all ports and one-to-all personalized communication and all-to-all broadcasting. For all-to-all personalized communication on all ports concurrently, the improvement is of order O(√n). Distributed routing algorithms defining the spanning balanced n-tree are given. The balanced n-tree is not unique, and a few definitions of n-trees that are effectively edge-disjoint are provided. Some implementation issues are also discussed. Binary numbers obtained from each other through rotation form necklaces that are full if the period is equal to the length of the number; otherwise, they are degenerate. As an intermediary result, it is shown that the ratio between the number of degenerate necklaces and the total number of necklaces with l bits equal to one is at most 4/(4 + n) for 1 ≤ l < n.
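The full/degenerate distinction for necklaces depends only on the smallest rotation that maps an n-bit word to itself. A small sketch of that check (our helper name, not the paper's code):

```python
def rotation_period(x: int, n: int) -> int:
    """Smallest p in 1..n such that rotating the n-bit word x right by p bits
    reproduces x; the necklace of x is full iff the period equals n."""
    mask = (1 << n) - 1
    for p in range(1, n + 1):
        if ((x >> p) | ((x << (n - p)) & mask)) == x:
            return p
    return n

# 0101 repeats under rotation by two, so its necklace is degenerate:
assert rotation_period(0b0101, 4) == 2
# 0001 needs a full rotation, so its necklace is full:
assert rotation_period(0b0001, 4) == 4
```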

• 38. Ho, Ching-Tien
KTH, School of Computer Science and Communication (CSC), Centres, Centre for High Performance Computing, PDC.
An Efficient Algorithm for Gray–to–Binary Permutation on Hypercubes1994In: Journal of Parallel and Distributed Computing, ISSN 0743-7315, E-ISSN 1096-0848, Vol. 20, no 1, p. 114-120Article in journal (Refereed)

Both Gray code and binary code are frequently used in mapping arrays into hypercube architectures. While the former is preferred when communication between adjacent array elements is needed, the latter is preferred for FFT-type communication. When different phases of computations have different types of communication patterns, the need arises to remap the data. We give a nearly optimal algorithm for permuting data from a Gray code mapping to a binary code mapping on a hypercube with communication restricted to one input and one output channel per node at a time. Our algorithm improves over the best previously known algorithm [6] by nearly a factor of two and is optimal to within a factor of n/(n − 1) with respect to data transfer time on an n-cube. The expected speedup is confirmed by measurements on an Intel iPSC/2 hypercube.
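Gray-to-binary conversion is a prefix XOR over the bits, which gives the remapping permutation studied here its regular structure: the data held at node g under the Gray code mapping must move to node gray_to_binary(g) under the binary mapping. A sketch of the conversion (illustrative only, not the paper's communication schedule):

```python
def gray_to_binary(g: int) -> int:
    """Invert the binary-reflected Gray code g = i ^ (i >> 1) via a doubling
    prefix XOR over the bits (O(log wordsize) steps)."""
    b, s = g, 1
    while b >> s:
        b ^= b >> s
        s <<= 1
    return b

# Round-trips with the forward code i ^ (i >> 1):
assert all(gray_to_binary(i ^ (i >> 1)) == i for i in range(1024))
```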

• 39. Ho, Ching-Tien
KTH, School of Computer Science and Communication (CSC), Centres, Centre for High Performance Computing, PDC.
Dilation d Embeddings of a Hyper–Pyramid into a Hypercube1989Conference paper (Refereed)

A P(k, d) hyper-pyramid is a level structure of k Boolean cubes where the cube at level i is of dimension id, and a node at level i − 1 connects to every node in a d-dimensional Boolean subcube at level i, except for the leaf level k. Hyper-pyramids contain pyramids as proper subgraphs. We show that a P(k, d) hyper-pyramid can be embedded in a Boolean cube with minimal expansion and dilation d. The congestion is bounded from above by 2^(d+1)/(d+2) and from below by 1 + (2^d − d)/(kd + 1). For P(k, 2) hyper-pyramids we present a dilation-2 and congestion-2 embedding. As a corollary, a complete n-ary tree can be embedded in a Boolean cube with dilation max(2, ⌈log2 n⌉) and expansion 2^(k⌈log2 n⌉+1)(n − 1)/(n^(k+1) − 1). We also discuss multiple pyramid embeddings.

• 40. Ho, Ching-Tien
Optimizing Tridiagonal Solvers for Alternating Direction Methods on Boolean Cube Multiprocessors1990Conference paper (Refereed)
• 41. Ho, Ching-Tien
KTH, School of Computer Science and Communication (CSC), Centres, Centre for High Performance Computing, PDC.
The Complexity of Reshaping Arrays on Boolean Cubes1990Conference paper (Refereed)

Reshaping of arrays is a convenient programming primitive. For arrays encoded in a binary-reflected Gray code, reshaping implies a code change. We show that an axis splitting, or a combining of two axes, requires communication in exactly one dimension, and that for multiple axis splittings the exchanges in the different dimensions can be ordered arbitrarily. The number of element transfers in sequence is independent of the number of dimensions requiring communication, for large local data sets and concurrent communication. We give a lower bound for the number of element transfers in sequence with K elements per processor. We present algorithms that are of this complexity in some cases, and of complexity K in the worst case. Conversion between binary code and binary-reflected Gray code is a special case of reshaping.

• 42. Hu, Y. C.
KTH, Superseded Departments, Numerical Analysis and Computer Science, NADA.
HPFBench: A High Performance Fortran benchmark suite2000In: ACM Transactions on Mathematical Software, ISSN 0098-3500, E-ISSN 1557-7295, Vol. 26, no 1, p. 99-149Article in journal (Refereed)

The High Performance Fortran (HPF) benchmark suite HPFBench is designed for evaluating the HPF language and compilers on scalable architectures. The functionality of the benchmarks covers scientific software library functions and application kernels that reflect the computational structure and communication patterns in fluid dynamic simulations, fundamental physics, and molecular studies in chemistry and biology. The benchmarks are characterized in terms of FLOP count, memory usage, communication pattern, local memory accesses, array allocation mechanism, as well as operation and communication counts per iteration. The benchmarks output performance evaluation metrics in the form of elapsed times, FLOP rates, and communication time breakdowns. We also provide a benchmark guide to aid the choice of subsets of the benchmarks for evaluating particular aspects of an HPF compiler. Furthermore, we report an evaluation of an industry-leading HPF compiler from the Portland Group Inc. using the HPFBench benchmarks on the distributed-memory IBM SP2.

• 43. Hu, Yu
KTH, School of Computer Science and Communication (CSC), Centres, Centre for High Performance Computing, PDC.
A Data Parallel Implementation of Hierarchical N –body Methods1996In: International Journal of Supercomputer Applications, Vol. 10, no 1, p. 3-40Article in journal (Refereed)

The O(N) hierarchical N-body algorithms and massively parallel processors allow particle systems of 100 million particles or more to be simulated in acceptable time. We describe a data-parallel implementation of Anderson's method and demonstrate both efficiency and scalability of the implementation on the Connection Machine CM-5/5E systems. The communication time for large particle systems amounts to about 10%-25%, and the overall efficiency is about 35%, corresponding to a performance of about 60 Mflop/s per CM-5E node, independent of the number of nodes.

• 44. Hu, Yu
KTH, School of Computer Science and Communication (CSC), Centres, Centre for High Performance Computing, PDC.
A Data Parallel Implementation of O(N) Hierarchical N–body Methods1996Conference paper (Refereed)

The O(N) hierarchical N-body algorithms and massively parallel processors allow particle systems of 100 million particles or more to be simulated in acceptable time. We describe a data-parallel implementation of Anderson's method and demonstrate both efficiency and scalability of the implementation on the Connection Machine CM-5/5E systems. The communication time for large particle systems amounts to about 10%-25%, and the overall efficiency is about 35%, corresponding to a performance of about 60 Mflop/s per CM-5E node, independent of the number of nodes.

• 45. Hu, Yu
KTH, School of Computer Science and Communication (CSC), Centres, Centre for High Performance Computing, PDC.
Implementing O(N ) N–body algorithms efficiently in data parallel languages1996In: Scientific Programming, ISSN 1058-9244, E-ISSN 1875-919X, Vol. 5, no 4, p. 337-364Article in journal (Refereed)

The optimization techniques for hierarchical O(N) N-body algorithms described here focus on managing the data distribution and the data references, both between the memories of different nodes and within the memory hierarchy of each node. We show how the techniques can be expressed in data-parallel languages, such as High Performance Fortran (HPF) and Connection Machine Fortran (CMF). The effectiveness of our techniques is demonstrated on an implementation of Anderson's hierarchical O(N) N-body method for the Connection Machine system CM-5/5E. Of the total execution time, communication accounts for about 10-20% of the total time, with the average efficiency for arithmetic operations being about 40% and the total efficiency (including communication) being about 35%. For the CM-5E, a performance in excess of 60 Mflop/s per node (peak 160 Mflop/s per node) has been measured.

• 46. Hu, Yu
KTH, School of Computer Science and Communication (CSC), Centres, Centre for High Performance Computing, PDC.
On the Accuracy of Anderson’s fast N –body algorithm1997Conference paper (Refereed)

We present an empirical study of the accuracy-cost tradeoffs of Anderson's method. The various parameters that control the degree of approximation of the computational elements and the separateness of interacting computational elements govern both the arithmetic complexity and the accuracy of the method. Our experiments show that, for a given error requirement, a near-field containing only nearest-neighbor boxes, combined with a hierarchy depth chosen to minimize the arithmetic operation count, minimizes the total number of arithmetic operations.

• 47. Hu, Yu
KTH, School of Computer Science and Communication (CSC), Centres, Centre for High Performance Computing, PDC.
On the Accuracy of Anderson’s fast N–body algorithm1997Conference paper (Refereed)

We present an empirical study of the accuracy-cost tradeoffs of Anderson's method. The various parameters that control the degree of approximation of the computational elements and the separateness of interacting computational elements govern both the arithmetic complexity and the accuracy of the method. Our experiments show that, for a given error requirement, a near-field containing only nearest-neighbor boxes, combined with a hierarchy depth chosen to minimize the arithmetic operation count, minimizes the total number of arithmetic operations.

• 48. Hu, Yu
KTH, School of Computer Science and Communication (CSC), Centres, Centre for High Performance Computing, PDC.
A Data Parallel Fortran Benchmark Suite1997Conference paper (Refereed)

The Data Parallel Fortran (DPF) benchmark suite is designed for evaluating data parallel compilers and scalable architectures. Many of the DPF codes are provided in three versions: basic, optimized, and with library calls for performance-critical operations typically found in software libraries. The functionality of the benchmarks covers collective communication functions, scientific software library functions, and application kernels that reflect the computational structure and communication patterns in fluid dynamic simulations, fundamental physics, and molecular studies in chemistry and biology. The intended target language is High Performance Fortran (HPF). However, due to the lack of HPF compilers at the time of this benchmark development, the suite is written in Connection Machine Fortran (CMF). The DPF codes provide performance evaluation metrics in the form of busy and elapsed times (CM-5), FLOP rates, and arithmetic efficiency. The codes are characterized in terms of FLOP count, m...

• 49. Hu, Yu
KTH, School of Computer Science and Communication (CSC), Centres, Centre for High Performance Computing, PDC.
A Data–Parallel Implementation of the Geometric Partitioning Algorithm1997Conference paper (Refereed)

We present a data-parallel, High Performance Fortran (HPF) implementation of the geometric partitioning algorithm. The geometric partitioning algorithm has provably good partitioning quality. To our knowledge, ours is the first data-parallel implementation of the algorithm. Our data-parallel formulation makes extensive use of segmented prefix sums and parallel selections, and provides a data-parallel procedure for geometric sampling. Experiments in partitioning particles for load balance and data interactions, as required in hierarchical N-body algorithms and in iterative algorithms for the solution of equilibrium equations on unstructured meshes by the finite element method, have shown that the geometric partitioning algorithm has an efficient data-parallel formulation. Moreover, the quality of the generated partitions is competitive with that offered by the spectral bisection technique and better than the quality offered by other partitioning heuristics.
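Segmented prefix sums, one of the primitives the formulation relies on, compute independent running sums over contiguous segments marked by start flags. A serial sketch of the semantics (data-parallel versions compute the same result in O(log n) steps):

```python
def segmented_prefix_sum(values, flags):
    """Inclusive prefix sum that restarts at each element whose flag is 1;
    flags[i] == 1 marks the first element of a segment."""
    out, run = [], 0
    for v, f in zip(values, flags):
        run = v if f else run + v
        out.append(run)
    return out

# Two segments: [1, 2] and [3, 4, 5].
assert segmented_prefix_sum([1, 2, 3, 4, 5], [1, 0, 1, 0, 0]) == [1, 3, 3, 7, 12]
```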

• 50. Hu, Yu
KTH, School of Computer Science and Communication (CSC), Centres, Centre for High Performance Computing, PDC.
High Performance Fortran for Highly Irregular Problems1997Conference paper (Refereed)

We present a general data parallel formulation for highly irregular problems in High Performance Fortran (HPF). Our formulation consists of (1) a method for linearizing irregular data structures, (2) a data-parallel implementation (in HPF) of graph partitioning algorithms applied to the linearized data structure, and (3) techniques for expressing irregular communication and nonuniform computations associated with the elements of linearized data structures. We demonstrate and evaluate our formulation on a parallel, hierarchical N-body method for the evaluation of potentials and forces of nonuniform particle distributions. Our experimental results demonstrate that efficient data-parallel (HPF) implementations of highly nonuniform problems are feasible with the proper language/compiler/runtime support. Our data-parallel N-body code provides a much needed "benchmark" code for evaluating and improving HPF compilers.
