Results 51 - 100 of 260
• 51.
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
KTH, Skolan för datavetenskap och kommunikation (CSC), Numerisk Analys och Datalogi, NADA. KTH, Skolan för datavetenskap och kommunikation (CSC), Beräkningsbiologi, CB.
Massively parallel simulation of brain-scale neuronal network models. 2005. Report (Other academic)
• 52. Dongarra, Jack
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
Solving Banded Systems on Parallel Architectures. 1987. In: Parallel Computing, ISSN 0167-8191, E-ISSN 1872-7336, Vol. 5, no. 2, pp. 219-246. Journal article (Refereed)
• 53. Edelman, Alan
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
Index Transformation Algorithms in a Linear Algebra Framework. 1994. In: IEEE Transactions on Parallel and Distributed Systems, ISSN 1045-9219, E-ISSN 1558-2183, Vol. 5, no. 12, pp. 1302-1309. Journal article (Refereed)

We present a linear algebraic formulation for a class of index transformations such as Gray code encoding and decoding, matrix transpose, bit reversal, vector reversal, shuffles, and other index or dimension permutations. This formulation unifies, simplifies, and can be used to derive algorithms for hypercube multiprocessors. We show how all the widely known properties of Gray codes, and some not so well-known properties as well, can be derived using this framework. Using this framework, we relate hypercube communications algorithms to Gauss-Jordan elimination on a matrix of 0's and 1's.
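As an illustrative aside (this sketch is ours, not taken from the paper), the binary-reflected Gray code encoding and decoding discussed in the abstract can be written in a few lines of Python:

```python
def gray_encode(i: int) -> int:
    # Binary-reflected Gray code: consecutive integers map to
    # codewords that differ in exactly one bit.
    return i ^ (i >> 1)

def gray_decode(g: int) -> int:
    # Invert the encoding by a cumulative XOR over right shifts.
    i = g
    while g:
        g >>= 1
        i ^= g
    return i
```

The single-bit-difference property is what makes Gray codes attractive for hypercube communication: consecutive indices map to physically adjacent hypercube nodes.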

• 54.
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
Innovative companies and cloud computing. 2010. In: BELIEF Zero-In e Magazine, no. 4. Journal article (Refereed)

To maintain a competitive advantage through innovation, companies of today must handle increasingly dynamic environments and increasingly rapid innovation cycles. Cloud computing is addressing many of these challenges, especially the possibility of rapid and cost-efficient prototyping and scaling. In this report we describe an example of how an EU-funded academic grid project supports small and medium enterprises (SMEs) and startups through a cloud service.

• 55.
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
Practical Cloud Evaluation from a Nordic eScience User perspective. 2011. In: Proceedings of the 5th International Workshop on Virtualization Technologies in Distributed Computing, 2011. Conference paper (Refereed)

In this paper, we describe the findings of the NEON project - a cross-Nordic (Sweden, Norway, Denmark, Finland and Iceland) project evaluating the usefulness of private versus public cloud services for HPC users. In brief, our findings are that private cloud technology is not yet mature enough to provide a transparent user experience; it is expected that this will be the case by mid 2012. The cost efficiency of both public and private clouds should be continuously monitored, as there is a strong downward trend in prices. This conclusion is supported by NEON experiments as well as by larger initiatives, e.g. the StratusLab reports. Public cloud technology is mature enough but lacks certain features that will be necessary to include cloud resources in a transparent manner in a national infrastructure such as the Norwegian NOTUR (www.notur.no) case, e.g. with respect to quota management. These features are expected to emerge in 2011 via third-party management software and in the best-of-breed public cloud services. Public clouds are price-competitive at the low end for non-HPC jobs (low memory, low number of cores). A significant fraction (ca. 20%) of the jobs running on the current Nordic supercomputer infrastructure is potentially suitable for cloud-like technology. This holds in particular for single-threaded or single-node jobs with small or medium memory requirements and non-intensive I/O. There is a backlog of real supercomputer jobs that suffer from the non-HPC jobs on the supercomputer infrastructure; off-loading these non-HPC jobs to a public cloud would effectively add supercomputing capacity. Another finding is that available storage capacity is not accessible in a user-friendly way: most storage clouds are only accessible via programmatic interfaces. A number of experiments and pilots are presented to support these claims.

• 56.
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
Cloud Computing and Startups. 2011. In: Cloud Computing: Methodology, Systems, and Applications / [ed] Boualem Benatallah, CRC Press, 2011, pp. 31-43. Book chapter (Refereed)

Cloud computing characteristics map to the needs of startups very well. So well that a new group of cloud-specific startups is now rapidly evolving. On one hand, one could argue that it is just the same active group that has quickly picked up a new technology, but on the other hand the number of startups evolving, and how fast they can get started and how far they can go on low funding, is a game changer. Startups have always been very good at bootstrapping, getting as far as possible on no or very little funding. But now, with cloud computing and high quality open source software, better and less expensive networks, a more open mobile market, and a more evolved customer base, the bootstrapping can take you very far, possibly all the way to a self-supported profitable business. For investors the market is also changing, creating a need for very early stage, close-to-the-founder, technology-knowledgeable services: the seed accelerators. The seed accelerators act as an early investor, helping the startups with technology decisions, and at the same time helping future investors in the identification of interesting objects.

• 57. Ellert, M
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
Advanced Resource Connector middleware for lightweight computational Grids. 2007. In: Future Generation Computer Systems, ISSN 0167-739X, E-ISSN 1872-7115, no. 2, pp. 219-240. Journal article (Refereed)

The popularity of virtual worlds and their increasing economic impact have created a situation where the value of trusted identification has risen substantially. We propose an identity management solution that provides the user with secure credentials and decreases the trust the user must place in the server running the virtual world. Additionally, the identity management system allows the virtual world to incorporate reputation information. This allows the "wisdom of the crowd" to provide more input to users about the reliability of a certain identity. We describe how to use these identities to provide secure services in the virtual world, including secure communications, digital signatures and secure bindings to external services.

• 58.
KTH, Skolan för datavetenskap och kommunikation (CSC), Numerisk analys, NA.
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
Proceedings PDC Seventh Annual Conference: Simulation and Visualization on the Grid. 2000 (oppl. 13). Book (Refereed)

The Grid is an emerging computational infrastructure, similar to the pervasive energy infrastructure provided by national power grids. Simulation and Visualization on the Grid focuses on applications and technologies on this emerging computational Grid. Readers will find interesting discussions of such Grid technologies as distributed file I/O, clustering, CORBA software infrastructure, tele-immersion, interaction environments, visualization steering and virtual reality as well as applications in biology, chemistry and physics. A lively panel discussion addresses current successes and pitfalls of the Grid. This book provides an understanding of the Grid that offers a persistent, wide-scale infrastructure for solving problems.

• 59. Feig, Michael
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
Large Scale Data Repository: Design of a Molecular Dynamics Trajectory Database. 1999. In: Future Generation Computer Systems, ISSN 0167-739X, E-ISSN 1872-7115, Vol. 16, no. 1, pp. 101-110. Journal article (Refereed)

The design of a molecular dynamics trajectory database is presented as an example of the organization of large-scale, dynamic, distributed repositories for scientific data. Large scientific datasets are usually interpreted through reduced data calculated by analysis functions. This allows a database architecture in which the analyzed datasets, which are kept in addition to the raw datasets, are transferred to a database user. A flexible user interface with a well-defined Application Program Interface (API) allows for a wide array of analysis functions, and the incorporation of user-defined functions is a critical part of the database design. An analysis function is executed only when the requested analysis result is not available from an earlier request. A prototype implementation used to gain initial practical experience with performance and scalability is presented.
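The compute-once, serve-from-cache behaviour described above can be sketched as follows (the class and method names are our own illustrative choices, not the paper's API):

```python
class TrajectoryDB:
    """Sketch of a trajectory database that caches reduced analysis
    results so an analysis function runs at most once per request key."""

    def __init__(self):
        self._raw = {}        # trajectory name -> raw dataset (e.g. frames)
        self._cache = {}      # (trajectory, analysis) -> reduced result
        self._analyses = {}   # registered analysis functions

    def store(self, name, frames):
        self._raw[name] = frames

    def register(self, name, fn):
        # user-defined analysis functions plug in through the same API
        self._analyses[name] = fn

    def analyze(self, traj, analysis):
        key = (traj, analysis)
        if key not in self._cache:                     # compute only once...
            self._cache[key] = self._analyses[analysis](self._raw[traj])
        return self._cache[key]                        # ...then serve from cache
```

A repeated request for the same (trajectory, analysis) pair returns the stored reduced result without re-running the function, which is the central design point of the paper's architecture.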

• 60. Feng, Cheng
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
Developing Assays for the Detection of Influenza in Human Samples. 2007. In: International Conference on Bioinformatics & Computational Biology, BIOCOMP 2007 / [ed] Hamid R. Arabnia, Mary Qu Yang, Jack Y. Yang, 2007, pp. 781-. Conference paper (Refereed)
• 61. Friese, D. H.
KTH, Skolan för bioteknologi (BIO), Teoretisk kemi och biologi. KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
Five-photon absorption and selective enhancement of multiphoton absorption processes. 2015. In: ACS Photonics, E-ISSN 2330-4022, Vol. 2, no. 5, pp. 572-577. Journal article (Refereed)

We study one-, two-, three-, four-, and five-photon absorption of three centrosymmetric molecules using density functional theory. These calculations are the first ab initio calculations of five-photon absorption. Even- and odd-order absorption processes show different trends in the absorption cross sections. The behavior of all even- and odd-photon absorption properties shows a semiquantitative similarity, which can be explained using few-state models. This analysis shows that odd-photon absorption processes are largely determined by the one-photon absorption strength, whereas all even-photon absorption strengths are largely dominated by the two-photon absorption strength, in both cases modulated by powers of the polarizability of the final excited state. We demonstrate how to selectively enhance a specific multiphoton absorption process.

• 62. Gardfjall, Peter
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
Scalable Grid-wide capacity allocation with the SweGrid Accounting System (SGAS). 2008. In: Concurrency and Computation, ISSN 1532-0626, E-ISSN 1532-0634, Vol. 20, no. 18, pp. 2089-2122. Journal article (Refereed)

The SweGrid Accounting System (SGAS) allocates capacity in collaborative Grid environments by coordinating enforcement of Grid-wide usage limits as a means to offer usage guarantees and prevent overuse. SGAS employs a credit-based allocation model where Grid capacity is granted to projects via Grid-wide quota allowances that can be spent across the Grid resources. The resources collectively enforce these allowances in a soft, real-time manner. SGAS is built on service-oriented principles with a strong focus on interoperability and Web services standards. This article covers the SGAS design and implementation, which, besides addressing inherent Grid challenges (scale, security, heterogeneity, decentralization), emphasizes generality and flexibility to produce a customizable system with lightweight integration into different middleware and scheduling system combinations. We focus the discussion around the system design, a flexible allocation model, middleware integration experiences, scalability improvements via a distributed virtual banking system, and, finally, an extensive set of testbed experiments. The experiments evaluate the performance of SGAS in terms of response times, request throughput, overall system scalability, and its performance impact on the Globus Toolkit 4 job submission software. We conclude that, for all practical purposes, the quota enforcement overhead incurred by SGAS on job submissions is not a limiting factor for the job-handling capacity of the job submission software.
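A minimal sketch of such a credit-based quota model with soft enforcement, using illustrative names rather than the actual SGAS interfaces:

```python
class Bank:
    """Toy virtual bank: projects receive Grid-wide allowances, resources
    authorize jobs against them and report usage afterwards (soft,
    real-time enforcement: a running job may slightly overdraw)."""

    def __init__(self):
        self.accounts = {}   # project -> {"allowance": float, "spent": float}

    def grant(self, project, allowance):
        # Grid-wide quota allowance granted to a project
        self.accounts[project] = {"allowance": allowance, "spent": 0.0}

    def authorize(self, project):
        # soft enforcement: admit a job as long as the allowance
        # is not already exhausted
        acct = self.accounts[project]
        return acct["spent"] < acct["allowance"]

    def charge(self, project, actual_cost):
        # resources collectively report actual usage after jobs complete
        self.accounts[project]["spent"] += actual_cost
```

The "soft" aspect shows up in `authorize`: admission checks the balance at submission time rather than reserving the exact cost up front, so enforcement never blocks the job-handling path.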

• 63. George, William
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
POLYSHIFT Communications Software for the Connection Machine System CM-200. 1994. In: Scientific Programming, ISSN 1058-9244, E-ISSN 1875-919X, Vol. 3, no. 1, pp. 83-99. Journal article (Refereed)

We describe the use and implementation of a polyshift function PSHIFT for circular shifts and end-off shifts. Polyshift is useful in many scientific codes using regular grids, such as finite difference codes in several dimensions, multigrid codes, molecular dynamics computations, and lattice gauge physics computations such as quantum chromodynamics (QCD) calculations. Our implementation of the PSHIFT function on the Connection Machine systems CM-2 and CM-200 offers a speedup of up to a factor of 3-4 compared with CSHIFT when the local data motion within a node is small. The PSHIFT routine is included in the Connection Machine Scientific Software Library (CMSSL).

• 64. Gerogiannis, D.C
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
Histogram Computation on Distributed Memory Architectures. 1989. In: Concurrency: Practice and Experience, Vol. 1, no. 2, pp. 219-237. Journal article (Refereed)

One data-independent and one data-dependent algorithm for the computation of image histograms on parallel computers are presented, analysed and implemented on the Connection Machine system CM-2. The data-dependent algorithm has a lower requirement on communication bandwidth by only transferring bins with a non-zero count. Both algorithms perform all-to-all reduction, which is implemented through a sequence of exchanges as defined by a butterfly network. The two algorithms are compared based on predicted and actual performance on the Connection Machine CM-2. With few pixels per processor the data-dependent algorithm requires in the order of √B data transfers for B bins compared to B data transfers for the data-independent algorithm. As the number of pixels per processor grows the advantage of the data-dependent algorithm decreases. The advantage of the data-dependent algorithm increases with the number of bins of the histogram.
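The all-to-all reduction through a sequence of butterfly exchanges can be illustrated with a sequential Python simulation (our own sketch, not the paper's code; sparse dicts stand in for the non-zero bins that the data-dependent algorithm transfers):

```python
def butterfly_allreduce(local_hists):
    """All-to-all reduction of per-processor histograms: in round d,
    processor p exchanges bins with partner p XOR 2^d, following a
    butterfly network. Sparse dicts model sending only non-zero bins."""
    p = len(local_hists)                  # number of processors, power of two
    hists = [dict(h) for h in local_hists]
    d = 1
    while d < p:
        nxt = []
        for i in range(p):
            partner = i ^ d               # exchange partner in this round
            merged = dict(hists[i])
            for b, c in hists[partner].items():
                merged[b] = merged.get(b, 0) + c
            nxt.append(merged)
        hists = nxt
        d <<= 1
    return hists   # every processor now holds the global histogram
```

After log2(p) exchange rounds each simulated processor holds the full histogram; the data-dependent saving comes from each message carrying only the bins with non-zero counts.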

• 65.
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
Security and Privacy of Sensitive Data in Cloud Computing. 2016. Doctoral thesis, monograph (Other academic)

Cloud computing offers the prospect of on-demand, elastic computing, provided as a utility service, and it is revolutionizing many domains of computing. Compared with earlier methods of processing data, cloud computing environments provide significant benefits, such as the availability of automated tools to assemble, connect, configure and reconfigure virtualized resources on demand. These make it much easier to meet organizational goals as organizations can easily deploy cloud services. However, the shift in paradigm that accompanies the adoption of cloud computing is increasingly giving rise to security and privacy considerations relating to facets of cloud computing such as multi-tenancy, trust, loss of control and accountability. Consequently, cloud platforms that handle sensitive information are required to deploy technical measures and organizational safeguards to avoid data protection breakdowns that might result in enormous and costly damages. Sensitive information in the context of cloud computing encompasses data from a wide range of different areas and domains. Data concerning health is a typical example of the type of sensitive information handled in cloud computing environments, and it is obvious that most individuals will want information related to their health to be secure. Hence, with the growth of cloud computing in recent times, privacy and data protection requirements have been evolving to protect individuals against surveillance and data disclosure. Some examples of such protective legislation are the EU Data Protection Directive (DPD) and the US Health Insurance Portability and Accountability Act (HIPAA), both of which demand privacy preservation for handling personally identifiable information. There have been great efforts to employ a wide range of mechanisms to enhance the privacy of data and to make cloud platforms more secure. 
Techniques that have been used include encryption, trusted platform modules, secure multi-party computing, homomorphic encryption, anonymization, and container and sandboxing technologies. However, how to correctly build usable privacy-preserving cloud systems that handle sensitive data securely remains an open problem, due to two research challenges. First, existing privacy and data protection legislation demands strong security, transparency and auditability of data usage. Second, there is a lack of familiarity with the broad range of emerging and existing security solutions needed to build efficient cloud systems. This dissertation focuses on the design and development of several systems and methodologies for handling sensitive data appropriately in cloud computing environments. The key idea behind the proposed solutions is enforcing the privacy requirements mandated by existing legislation that aims to protect the privacy of individuals in cloud-computing platforms. We begin with an overview of the main concepts from cloud computing, followed by identifying the problems that need to be solved for secure data management in cloud environments. The thesis then continues with a description of background material, in addition to reviewing existing security and privacy solutions that are being used in the area of cloud computing. Our first main contribution is a new method for modeling threats to privacy in cloud environments, which can be used to identify privacy requirements in accordance with data protection legislation. This method is then used to propose a framework that meets the privacy requirements for handling data in the area of genomics, that is, health data concerning the genome (DNA) of individuals. Our second contribution is a system for preserving privacy when publishing sample availability data. This system is noteworthy because it is capable of cross-linking over multiple datasets.
The thesis continues by proposing a system called ScaBIA for privacy-preserving brain image analysis in the cloud. The final section of the dissertation describes a new approach for quantifying and minimizing the risk of operating system kernel exploitation, in addition to the development of a system call interposition reference monitor for Lind, a dual sandbox.

• 66.
KTH, Skolan för datavetenskap och kommunikation (CSC), Beräkningsvetenskap och beräkningsteknik (CST).
KTH, Skolan för informations- och kommunikationsteknik (ICT), Programvaruteknik och Datorsystem, SCS. KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC. KTH, Skolan för datavetenskap och kommunikation (CSC), Beräkningsvetenskap och beräkningsteknik (CST).
A security framework for population-scale genomics analysis. 2015. In: Proceedings of the 2015 International Conference on High Performance Computing and Simulation, HPCS 2015, IEEE conference proceedings, 2015, pp. 106-114. Conference paper (Refereed)

Biobanks store genomic material from identifiable individuals. Recently many population-based studies have started sequencing genomic data from biobank samples and cross-linking the genomic data with clinical data, with the goal of discovering new insights into disease and clinical treatments. However, the use of genomic data for research has far-reaching implications for privacy and the relations between individuals and society. In some jurisdictions, primarily in Europe, new laws are being or have been introduced to legislate for the protection of sensitive data relating to individuals, and biobank-specific laws have even been designed to legislate for the handling of genomic data and the clear definition of roles and responsibilities for the owners and processors of genomic data. This paper considers the security questions raised by these developments. We introduce a new threat model that enables the design of cloud-based systems for handling genomic data according to privacy legislation. We also describe the design and implementation of a security framework using our threat model for BiobankCloud, a platform that supports the secure storage and processing of genomic data in cloud computing environments.

• 67.
KTH, Skolan för datavetenskap och kommunikation (CSC), High Performance Computing and Visualization (HPCViz). KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
ScaBIA: Scalable brain image analysis in the cloud. 2013. In: CLOSER 2013 - Proceedings of the 3rd International Conference on Cloud Computing and Services Science, 2013, pp. 329-336. Conference paper (Refereed)

The use of cloud computing as a new paradigm has become a reality. Cloud computing leverages on-demand CPU power and storage resources while eliminating the cost of commodity hardware ownership, and it is now gaining popularity among many different organizations and commercial sectors. In this paper, we present the scalable brain image analysis (ScaBIA) architecture, a new model for running statistical parametric mapping (SPM) jobs using cloud computing. SPM is one of the most popular toolkits in neuroscience for running compute-intensive brain image analysis tasks. However, issues such as sharing raw data and results, as well as scalability and performance, are major bottlenecks in the "single PC" execution model. In this work, we describe a prototype using the generic worker (GW), an e-Science-as-a-service middleware, on top of Microsoft Azure to run and manage SPM tasks. The functional prototype shows that ScaBIA provides a scalable framework for multi-job submission and enables users to share data securely using storage access keys across different organizations.

• 68.
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC. KTH, Centra, SeRC - Swedish e-Science Research Centre.
Cray Inc. University of Edinburgh. KTH, Skolan för datavetenskap och kommunikation (CSC), High Performance Computing and Visualization (HPCViz). Argonne National Laboratory.
OpenACC Acceleration of Nek5000: a Spectral Element Code. 2013. Conference paper (Other academic)
• 69.
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC. KTH, Centra, SeRC - Swedish e-Science Research Centre.
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
Nekbone performance on GPUs with OpenACC and CUDA Fortran implementations. 2016. In: Journal of Supercomputing, ISSN 0920-8542, E-ISSN 1573-0484, Vol. 72, no. 11, pp. 4160-4180. Journal article (Refereed)

We present a hybrid GPU implementation and performance analysis of Nekbone, which represents one of the core kernels of the incompressible Navier-Stokes solver Nek5000. The implementation is based on OpenACC and CUDA Fortran for local parallelization of the compute-intensive matrix-matrix multiplication part, which significantly minimizes the modification of the existing CPU code while extending the simulation capability of the code to GPU architectures. Our discussion includes the GPU results of OpenACC interoperating with CUDA Fortran and the gather-scatter operations with GPUDirect communication. We demonstrate performance of up to 552 Tflops on 16,384 GPUs of the OLCF Cray XK7 Titan.

• 70.
KTH, Centra, SeRC - Swedish e-Science Research Centre. KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
KTH, Skolan för datavetenskap och kommunikation (CSC), High Performance Computing and Visualization (HPCViz). KTH, Centra, SeRC - Swedish e-Science Research Centre.
NekBone with Optimized OpenACC directives. 2015. Conference paper (Refereed)

Accelerators and, in particular, Graphics Processing Units (GPUs) have emerged as promising computing technologies which may be suitable for the future Exascale systems. Here, we present performance results of NekBone, a benchmark of the Nek5000 code, implemented with optimized OpenACC directives and GPUDirect communications. Nek5000 is a computational fluid dynamics code based on the spectral element method used for the simulation of incompressible flow. Results of an optimized NekBone version lead to 78 Gflops performance on a single node. In addition, a performance result of 609 Tflops has been reached on 16,384 GPUs of the Titan supercomputer at Oak Ridge National Laboratory.

• 71.
KTH, Centra, SeRC - Swedish e-Science Research Centre. KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
KTH, Skolan för datavetenskap och kommunikation (CSC), High Performance Computing and Visualization (HPCViz). KTH, Centra, SeRC - Swedish e-Science Research Centre. KTH, Skolan för teknikvetenskap (SCI), Mekanik, Stabilitet, Transition, Kontroll.
Nek5000 with OpenACC. 2015. In: Solving Software Challenges for Exascale, 2015, pp. 57-68. Conference paper (Refereed)

Nek5000 is a computational fluid dynamics code based on the spectral element method used for the simulation of incompressible flows. We follow up on an earlier study which ported a simplified version of Nek5000 to a GPU-accelerated system by presenting the hybrid CPU/GPU implementation of the full Nek5000 code using OpenACC. The matrix-matrix multiplication, the Nek5000 gather-scatter operator and a preconditioned Conjugate Gradient solver have been implemented using OpenACC for multi-GPU systems. We report a speed-up of 1.3 on a single node of a Cray XK6 when using OpenACC directives in Nek5000. On 512 nodes of the Titan supercomputer, the speed-up approaches 1.4. A performance analysis of the Nek5000 code using the Score-P and Vampir performance monitoring tools shows that overlapping GPU kernels with host-accelerator memory transfers would considerably increase the performance of the OpenACC version of the Nek5000 code.

• 72. Gonzales-Alvares, G
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
Drug design on the Cell BE. 2010. In: Scientific Computing with Multicore and Accelerators / [ed] J. Kurzak, D.A. Bader, J. Dongarra, CRC Press, 2010, pp. 331-350. Book chapter (Refereed)
• 73.
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
Rapid code iteration in an Integrated Development Environment with GPU just-in-time compilation. 2014. Independent thesis, Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis

Rapid code iteration is a term designating short cycles between changes in the source-code of a program and observing results. This thesis describes an investigation about how an integrated development environment (IDE) can be built in order to gain rapid interaction during software development for graphics processing units (GPUs). The survey has been carried out by implementing an IDE, with a user interface, a compiler, and a runtime in order to provide direct feedback as code is typed.

The presented IDE transforms C-like code to PTX assembly, which is JIT-compiled on an NVIDIA graphics card. Compiling and running a computationally intensive program of about 200 lines of C-like code yields a faster response time than in Visual Studio with either CUDA or C++ using SDL threads. The program performs RSA encryption/decryption on a large image (11.625 MiB) by dividing partial data blocks across different cores on the GPU. The faster response time (more rapid code iteration) is achieved by compiling less code in a smaller language, and by recycling the runtime environment between code iterations. The feedback is measured as the time it takes to compile a change in the source code plus the time it takes to evaluate the computation.

The IDE provides feedback within 150 milliseconds, compared to Visual Studio using CUDA, which demands 2,400 milliseconds to provide a response for the same change in the source code. The majority of the speedup comes from the compile time, which is 2,100 milliseconds in Visual Studio with CUDA, compared to 13 milliseconds in the presented IDE. Comparing the run time of the computation yields a speedup of five times over a corresponding C++ SDL-threaded CPU implementation. Comparing run time with CUDA yields a tie.

• 74.
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
Uppsala University. Uppsala University. KTH, Skolan för datavetenskap och kommunikation (CSC).
Measurement of IP forwarding performance on complex computer architectures2011Inngår i: Swedish National Computer Networking Workshop, SNCNW 2011, 2011Konferansepaper (Fagfellevurdert)

Open-source routers on new PC hardware allow for forwarding speeds of 10 Gb/s and above. We present detailed performance measurements using Linux on two complex PC hardware platforms. Both platforms use PCIe gen2 and dual I/O bridges, and have support for non-uniform memory access (NUMA). The AMD platform uses four processors equipped with eight cores and four nodes of local memory. The Intel platform has two quad-core CPUs, each with local memory.

Packets being forwarded through a PC-based router pass through three steps: receive-dma, lookup, and transmit-dma. Each step was studied individually. In particular, we studied how varying the CPU core and memory node affects the forwarding speeds.

Our results show that performance depends strongly on the choice of CPU cores and memory nodes. In particular, DMA works best with the memory node closest to the I/O bridge where the interface card is connected. Correspondingly, CPU access is most efficient on local memory. Consequently, a poor choice of CPU core and memory node leads to a significant performance decrease.

• 75. Harris, Tim
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
Matrix Multiplication on the Connection Machine1989Konferansepaper (Fagfellevurdert)

A data parallel implementation of the multiplication of matrices of arbitrary shapes and sizes is presented. The implementation uses a systolic algorithm based on a rectangular processor layout. All processors contain submatrices of the same size for a given operand. Matrix-vector multiplication is used as a primitive for local matrix-matrix multiplication in the Connection Machine system CM-2 implementation. The peak performance of the local matrix-matrix multiplication is in excess of 20 Gflop/s. The overall algorithm, including all required data motion, has a peak performance of 5.8 Gflop/s.
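The systolic scheme above (equal-size submatrix blocks stepping across a rectangular processor layout) can be sketched sequentially. The following is a minimal Cannon-style simulation in plain NumPy, not the CM-2 implementation; `systolic_matmul`, its square-grid restriction, and its block layout are illustrative assumptions.

```python
import numpy as np

def systolic_matmul(A, B, p):
    """Cannon-style systolic multiplication simulated on a p x p logical grid.

    Each grid cell (i, j) holds same-size submatrix blocks of A and B.
    After an initial skew, A blocks circulate left and B blocks circulate
    up; each cell accumulates one local block product per systolic step.
    Assumes square n x n operands with p dividing n.
    """
    n = A.shape[0]
    b = n // p  # block size
    Ab = [[A[i*b:(i+1)*b, j*b:(j+1)*b].copy() for j in range(p)] for i in range(p)]
    Bb = [[B[i*b:(i+1)*b, j*b:(j+1)*b].copy() for j in range(p)] for i in range(p)]
    Cb = [[np.zeros((b, b)) for _ in range(p)] for _ in range(p)]
    # Initial skew: row i of A shifts left by i, column j of B shifts up by j.
    Ab = [[Ab[i][(j + i) % p] for j in range(p)] for i in range(p)]
    Bb = [[Bb[(i + j) % p][j] for j in range(p)] for i in range(p)]
    for _ in range(p):
        for i in range(p):
            for j in range(p):
                Cb[i][j] += Ab[i][j] @ Bb[i][j]  # local block product
        # Systolic step: A blocks move left, B blocks move up.
        Ab = [[Ab[i][(j + 1) % p] for j in range(p)] for i in range(p)]
        Bb = [[Bb[(i + 1) % p][j] for j in range(p)] for i in range(p)]
    return np.block(Cb)
```

In the real machine each cell is a processor and the shifts are nearest-neighbor communication; here the grid is just nested lists.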

• 76.
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
Jülich Supercomputing Centre, Forschungszentrum Jülich. KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC. KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC. KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC. Jülich Supercomputing Centre, Forschungszentrum Jülich. Jülich Supercomputing Centre, Forschungszentrum Jülich. KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
Benchmarking of Integrated OGSA-BES with the Grid Middleware2009Inngår i: EURO-PAR 2008 WORKSHOPS - PARALLEL PROCESSING / [ed] Eduardo César, Michael Alexander, Achim Streit, Jesper Larsson Träff, Christophe Cérin, Andreas Knüpfer, Dieter Kranzelmüller, Shantenu Jha, Berlin: Springer Berlin/Heidelberg, 2009, s. 113-122Konferansepaper (Fagfellevurdert)

This paper evaluates the performance of the emerging OGF standard OGSA - Basic Execution Service (BES) on three fundamentally different Grid middleware platforms: UNICORE 5/6, Globus Toolkit 4 and gLite. The particular focus within this paper is on the OGSA-BES implementation of UNICORE 6. A comparison is made with baseline measurements, for UNICORE 6 and Globus Toolkit 4, using the legacy job submission interfaces. Our results show that the BES components are comparable in performance to existing legacy interfaces. We also have a strong indication that other factors, attributable to the supporting infrastructure, have a bigger impact on performance than BES components.

• 77.
KTH, Skolan för teknikvetenskap (SCI), Tillämpad fysik, Biofysik.
KTH, Centra, SeRC - Swedish e-Science Research Centre. KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC. KTH, Skolan för teknikvetenskap (SCI), Tillämpad fysik, Biofysik. KTH, Skolan för teknikvetenskap (SCI), Mekanik. KTH, Centra, SeRC - Swedish e-Science Research Centre. KTH, Skolan för teknikvetenskap (SCI), Centra, Linné Flow Center, FLOW. KTH, Skolan för teknikvetenskap (SCI), Mekanik, Stabilitet, Transition, Kontroll.
Highly Tuned Small Matrix Multiplications Applied to Spectral Element Code Nek50002016Konferansepaper (Fagfellevurdert)
• 78. Ho, Ching-Tien
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
Systolic FFT Algorithms on Boolean Cubes1988Konferansepaper (Fagfellevurdert)

A description is given of a systolic Cooley-Tukey fast Fourier transform algorithm for Boolean n-cubes with a substantial amount of storage per cube node. In mapping a Cooley-Tukey type FFT to such a network, the main concerns are effective use of the high connectivity/bandwidth of the Boolean n-cube, the computational resources, the storage bandwidth if there is a storage hierarchy, and the pipelines should the arithmetic units have such a feature. Another important consideration in a multiprocessor, distributed-storage architecture is the allocation of and access to coefficients, if they are precomputed. FFT algorithms are described that use both the storage bandwidth and the communication system optimally and require storage of P + nN coefficients for a transform on P ≥ N data elements. A complex-to-complex FFT on 16 million points is predicted to require about 1.5 s on a Connection Machine model CM-2.
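For reference, the Cooley-Tukey recursion that is being mapped onto the cube can be sketched in a few lines; this is a plain sequential radix-2 FFT, with none of the coefficient-allocation or storage-hierarchy machinery the paper is about.

```python
import cmath

def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])   # transform of even-indexed samples
    odd = fft(x[1::2])    # transform of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]  # twiddle factor
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out
```

On a Boolean cube the even/odd split corresponds to communication across one cube dimension per recursion level, which is what makes the precomputed twiddle coefficients' placement a design question.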

• 79. Ho, Ching-Tien
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
Distributed Routing Algorithms for Broadcasting and Personalized Communication in Hypercubes1986Konferansepaper (Fagfellevurdert)

High communication bandwidth in standard technologies is more expensive to realize than a high rate of arithmetic or logic operations. The effective utilization of communication resources is crucial for good overall performance in highly concurrent systems. In this paper we address two different communication problems in Boolean n-cube configured multiprocessors: 1) broadcasting, i.e., distribution of common data from a single source to all other nodes, and 2) sending personalized data from a single source to all other nodes. The well-known spanning tree algorithm obtained by bit-wise complementation of leading zeroes (referred to as the SBT algorithm, for Spanning Binomial Tree) is compared with an algorithm using multiple spanning binomial trees (MSBT). The MSBT algorithm offers a potential speed-up over the SBT algorithm by a factor of log2 N. We also present a balanced spanning tree algorithm (BST) that offers a lower complexity than the SBT algorithm for Case 2. The potential improvement is by a factor of 3 log2 N. The analysis takes into account the size of the data sets, the communication bandwidth, and the overhead in communication. We also provide some experimental data for the Intel iPSC/d7.
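The SBT schedule for problem 1) is easy to state in code. The helper below is a hypothetical illustration, not taken from the paper: it computes, round by round, which cube edges carry a broadcast from node 0; after round k the data has reached the 2^(k+1) nodes spanned by the binomial tree.

```python
def sbt_broadcast_schedule(n):
    """Rounds of sends for a spanning-binomial-tree broadcast from node 0
    in a Boolean n-cube. In round k, every node that already holds the
    data forwards it across dimension k to a node that does not yet
    hold it. Returns a list of rounds of (sender, receiver) pairs."""
    have = {0}
    rounds = []
    for k in range(n):
        sends = [(node, node ^ (1 << k)) for node in sorted(have)]
        have.update(dst for _, dst in sends)
        rounds.append(sends)
    return rounds
```

With one-port communication this takes n rounds; the MSBT and BST schemes discussed above improve on it by keeping more cube edges busy per round.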

• 80. Ho, Ching-Tien
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
Embedding Hyper–pyramids in Hypercubes1994Inngår i: IBM Journal of Research and Development, ISSN 0018-8646, E-ISSN 2151-8556, Vol. 38, nr 1, s. 31-45Artikkel i tidsskrift (Fagfellevurdert)
• 81. Ho, Ching-Tien
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
Embedding Meshes in Boolean Cubes by Graph Decomposition1990Inngår i: Parallel Computing, ISSN 0167-8191, E-ISSN 1872-7336, Vol. 8, nr 4, s. 325-339Artikkel i tidsskrift (Fagfellevurdert)

This paper explores the embeddings of multidimensional meshes into minimal Boolean cubes by graph decomposition. The dilation and the congestion of the product-graph embedding (G1 × G2) → (H1 × H2) are the maxima of the dilations and congestions of the two embeddings G1 → H1 and G2 → H2. The graph decomposition technique can be used to improve the average dilation and average congestion. The graph decomposition technique combined with some particular two-dimensional embeddings allows for minimal-expansion, dilation-two, congestion-two embeddings of about 87% of all two-dimensional meshes, with a significantly lower average dilation and congestion than by modified line compression. For three-dimensional meshes we show that the graph decomposition technique, together with two three-dimensional mesh embeddings presented in this paper and modified line compression, yields dilation-two embeddings of more than 96% of all three-dimensional meshes contained in a 512 × 512 × 512 mesh. The graph decomposition technique is also used to generalize the embeddings to meshes with wraparound. The dilation increases by at most one compared to a mesh without wraparound. The expansion is preserved for the majority of meshes if a wraparound feature is added to the mesh.

• 82. Ho, Ching-Tien
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
Embedding Meshes into Small Boolean Cubes1990Konferansepaper (Fagfellevurdert)

The embedding of arrays in Boolean cubes, when there are more array elements than nodes in the cube, can always be made with optimal load-factor by reshaping the array to a one-dimensional array. We show the dilation of such an embedding of an l0 × l1 × ⋯ × l(d−1) array in an n-cube. Dilation-one embeddings can be obtained by splitting each axis into segments and assigning segments to nodes in the cube by a Gray code. The load-factor is optimal if the axis lengths contain sufficiently many powers of two. The congestion is minimized if the segment lengths along the different axes are as equal as possible, for the cube configured with at most as many axes as the array. A further decrease in the congestion is possible if the array is partitioned into subarrays, and corresponding axes of different subarrays make use of edge-disjoint Hamiltonian cycles within subcubes. The congestion can also be reduced by using multiple paths between pairs of cube nodes, i.e., by using "fat" edges.

• 83. Ho, Ching-Tien
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
Embedding Three–Dimensional Meshes in Boolean Cubes by Graph Decomposition1990Konferansepaper (Fagfellevurdert)

This paper explores the embeddings of multidimensional meshes into minimal Boolean cubes by graph decomposition. The graph decomposition technique can be used to improve the average dilation and average congestion. The graph decomposition technique combined with some particular two-dimensional embeddings allows for minimal-expansion, dilation-two, congestion-two embeddings of about 87% of all two-dimensional meshes, with a significantly lower average dilation and congestion than by modified line compression. For three-dimensional meshes the authors show that the graph decomposition technique, together with two three-dimensional mesh embeddings presented in this paper and modified line compression, yields dilation-two embeddings of more than 96% of all three-dimensional meshes contained in a 512 × 512 × 512 mesh.

• 84. Ho, Ching-Tien
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
Matrix Multiplication on Hypercubes Using Full Bandwidth and Constant Storage1991Konferansepaper (Fagfellevurdert)

For matrix multiplication on hypercube multiprocessors with the product matrix accumulated in place, a processor must receive about P²/√N elements of each input operand, with operands of size P × P distributed evenly over N processors. With concurrent communication on all ports, the number of element transfers in sequence can be reduced to P²/(√N log N) for each input operand. We present a two-level partitioning of the matrices and an algorithm for the matrix multiplication with optimal data motion and constant storage. The algorithm has sequential arithmetic complexity 2P³ and parallel arithmetic complexity 2P³/N. The algorithm has been implemented on the Connection Machine model CM-2. On an 8K CM-2 we measured about 1.6 Gflop/s, which would scale up to about 13 Gflop/s for a full 64K machine.

• 85. Ho, Ching-Tien
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
Maximizing Channel Utilization for All–to–All Personalized Communication on Boolean cubes1991Konferansepaper (Fagfellevurdert)
• 86. Ho, Ching-Tien
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
On the Embedding of Arbitrary Meshes in Boolean Cubes with Expansion Two Dilation Two1987Konferansepaper (Fagfellevurdert)
• 87. Ho, Ching-Tien
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
Optimizing Tridiagonal Solvers for the Alternating Direction Method on Boolean Cube Multiprocessors1990Inngår i: SIAM Journal on Scientific Computing, ISSN 1064-8275, E-ISSN 1095-7197, Vol. 11, nr 3, s. 563-592Artikkel i tidsskrift (Fagfellevurdert)

Sets of tridiagonal systems occur in many applications. Fast Poisson solvers and Alternating Direction Methods make use of tridiagonal system solvers. Network-based multiprocessors provide a cost-effective alternative to traditional supercomputer architectures. The complexity of concurrent algorithms for the solution of multiple tridiagonal systems on Boolean-cube-configured multiprocessors with distributed memory is investigated. Variations of odd-even cyclic reduction, parallel cyclic reduction, and algorithms making use of data transposition with or without substructuring and local elimination, or pipelined elimination, are considered. A simple performance model is used for algorithm comparison, and the validity of the model is verified on an Intel iPSC/1. For many combinations of machine and system parameters, pipelined elimination, or equation transposition with or without substructuring, is optimum. Hybrid algorithms that at any stage choose the best algorithm among the considered ones for the remainder of the problem are presented. It is shown that the optimum partitioning of a set of independent tridiagonal systems among a set of processors yields the embarrassingly parallel case. If the systems originate from a lattice and solutions are computed in alternating directions, then to first order the aspect ratio of the computational lattice shall be the same as that of the lattice forming the base for the equations. The experiments presented here demonstrate the importance of combining in the communication system for architectures with a relatively high communication start-up time.
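Odd-even cyclic reduction, one of the variants compared above, can be sketched sequentially. This is a textbook recursive version, not one of the paper's distributed algorithms, and it assumes a well-conditioned (e.g. diagonally dominant) system so no pivoting is needed.

```python
def cyclic_reduction(a, b, c, d):
    """Solve a tridiagonal system by recursive odd-even cyclic reduction.

    Row i reads a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i],
    with a[0] = c[-1] = 0. Odd-indexed unknowns are eliminated, the
    half-size system over the even indices is solved recursively, and
    the odd unknowns are recovered by back-substitution.
    """
    n = len(b)
    if n == 1:
        return [d[0] / b[0]]
    ra, rb, rc, rd = [], [], [], []
    for i in range(0, n, 2):
        ai, bi, ci, di = 0.0, b[i], 0.0, d[i]
        if i - 1 >= 0:  # fold row i-1 into row i, eliminating x[i-1]
            m = a[i] / b[i - 1]
            bi -= m * c[i - 1]
            di -= m * d[i - 1]
            if i - 2 >= 0:
                ai = -m * a[i - 1]
        if i + 1 < n:   # fold row i+1 into row i, eliminating x[i+1]
            m = c[i] / b[i + 1]
            bi -= m * a[i + 1]
            di -= m * d[i + 1]
            if i + 2 < n:
                ci = -m * c[i + 1]
        ra.append(ai); rb.append(bi); rc.append(ci); rd.append(di)
    xe = cyclic_reduction(ra, rb, rc, rd)
    x = [0.0] * n
    x[0::2] = xe
    for i in range(1, n, 2):  # back-substitute the odd unknowns
        xi = d[i] - a[i] * x[i - 1]
        if i + 1 < n:
            xi -= c[i] * x[i + 1]
        x[i] = xi / b[i]
    return x
```

Each elimination level is independent across rows, which is what exposes the parallelism exploited by the Boolean-cube algorithms the paper analyzes.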

• 88. Ho, Ching-Tien
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
Spanning Balanced Trees in Boolean cubes1989Inngår i: SIAM JOURNAL ON SCIENTIFIC AND STATISTICAL COMPUTING, Vol. 10, nr 4, s. 607-630Artikkel i tidsskrift (Fagfellevurdert)

A Spanning Balanced n-tree (SBnT) in a Boolean n-cube is a spanning tree in which the root has fanout n, and all the subtrees of the root have O(2^n/n) nodes. The number of tree edges in each dimension of the n-cube is of order O(2^n/n). The spanning balanced n-tree allows for scheduling disciplines that realize lower-bound (within a factor of two) one-to-all personalized communication, all-to-all broadcasting, and all-to-all personalized communication on a Boolean n-cube [C.-T. Ho and S. L. Johnsson, Proc. 1986 International Conference on Parallel Processing, pp. 640–648, IEEE Computer Society, 1986; Tech. Report YALEU/DCS/RR–483, May 1986], [S. L. Johnsson and C.-T. Ho, Tech. Report YALEU/DCS/RR–610, Dept. of Computer Science, Yale Univ., New Haven, CT, November 1987]. The improvement in data transfer time over the familiar binomial tree routing is a factor of n/2 for concurrent communication on all ports, for one-to-all personalized communication and all-to-all broadcasting. For all-to-all personalized communication on all ports concurrently, the improvement is of order O(√n). Distributed routing algorithms defining the spanning balanced n-tree are given. The balanced n-tree is not unique, and a few definitions of n-trees that are effectively edge-disjoint are provided. Some implementation issues are also discussed. Binary numbers obtained from each other through rotation form necklaces that are full if the period is equal to the length of the number; otherwise, they are degenerate. As an intermediary result, it is shown that the ratio between the number of degenerate necklaces and the total number of necklaces with l bits equal to one is at most 4/(4 + n) for 1 ≤ l < n.

• 89. Ho, Ching-Tien
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
An Efficient Algorithm for Gray–to–Binary Permutation on Hypercubes1994Inngår i: Journal of Parallel and Distributed Computing, ISSN 0743-7315, E-ISSN 1096-0848, Vol. 20, nr 1, s. 114-120Artikkel i tidsskrift (Fagfellevurdert)

Both Gray code and binary code are frequently used in mapping arrays onto hypercube architectures. While the former is preferred when communication between adjacent array elements is needed, the latter is preferred for FFT-type communication. When different phases of a computation have different types of communication patterns, the need arises to remap the data. We give a nearly optimal algorithm for permuting data from a Gray code mapping to a binary code mapping on a hypercube with communication restricted to one input and one output channel per node at a time. Our algorithm improves over the best previously known algorithm [6] by nearly a factor of two and is optimal to within a factor of n/(n − 1) with respect to data transfer time on an n-cube. The expected speedup is confirmed by measurements on an Intel iPSC/2 hypercube.
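The two codes and the remapping between them can be illustrated with standard bit tricks; the helpers below are not the paper's communication-optimal permutation algorithm, only the codes it permutes between.

```python
def gray(i):
    """Binary-reflected Gray code of i."""
    return i ^ (i >> 1)

def gray_to_binary(g):
    """Invert the Gray code: a prefix-XOR over the bits of g."""
    i = 0
    while g:
        i ^= g
        g >>= 1
    return i
```

Remapping from the Gray code mapping to the binary code mapping amounts to sending the data held by cube node gray(i) to node i for every i; the paper's contribution is scheduling that permutation under the one-input, one-output channel restriction.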

• 90. Ho, Cieng-Tien
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
Dilation d Embeddings of a Hyper–Pyramid into a Hypercube1989Konferansepaper (Fagfellevurdert)

A P(k, d) hyper-pyramid is a level structure of k Boolean cubes where the cube at level i is of dimension id, and a node at level i − 1 connects to every node in a d-dimensional Boolean subcube at level i, except for the leaf level k. Hyper-pyramids contain pyramids as proper subgraphs. We show that a P(k, d) hyper-pyramid can be embedded in a Boolean cube with minimal expansion and dilation d. The congestion is bounded from above by 2^(d+1)/(d+2) and from below by 1 + (2^d − d)/(kd + 1). For P(k, 2) hyper-pyramids we present a dilation-2 and congestion-2 embedding. As a corollary, a complete n-ary tree can be embedded in a Boolean cube with dilation max(2, ⌈log2 n⌉) and expansion 2^(k⌈log2 n⌉ + 1)/((n^(k+1) − 1)/(n − 1)). We also discuss multiple pyramid embeddings.

• 91. Ho, Cieng-Tien
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
The Complexity of Reshaping Arrays on Boolean Cubes1990Konferansepaper (Fagfellevurdert)

Reshaping of arrays is a convenient programming primitive. For arrays encoded in a binary-reflected Gray code, reshaping implies a code change. We show that an axis splitting, or a combining of two axes, requires communication in exactly one dimension, and that for multiple axis splittings the exchanges in the different dimensions can be ordered arbitrarily. The number of element transfers in sequence is independent of the number of dimensions requiring communication, for large local data sets and concurrent communication. We give a lower bound on the number of element transfers in sequence with K elements per processor, and present algorithms that are of this complexity in some cases and of complexity K in the worst case. Conversion between binary code and binary-reflected Gray code is a special case of reshaping.

• 92. Hu, Yu
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
A Data Parallel Implementation of Hierarchical N –body Methods1996Inngår i: International Journal of Supercomputer Applications, Vol. 10, nr 1, s. 3-40Artikkel i tidsskrift (Fagfellevurdert)

The O(N) hierarchical N-body algorithms and massively parallel processors allow particle systems of 100 million particles or more to be simulated in acceptable time. We describe a data-parallel implementation of Anderson's method and demonstrate both efficiency and scalability of the implementation on the Connection Machine CM-5/5E systems. The communication time for large particle systems amounts to about 10%-25%, and the overall efficiency is about 35%, corresponding to a performance of about 60 Mflop/s per CM-5E node, independent of the number of nodes.

• 93. Hu, Yu
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
A Data Parallel Implementation of O(N) Hierarchical N–body Methods1996Konferansepaper (Fagfellevurdert)

The O(N) hierarchical N-body algorithms and massively parallel processors allow particle systems of 100 million particles or more to be simulated in acceptable time. We describe a data-parallel implementation of Anderson's method and demonstrate both efficiency and scalability of the implementation on the Connection Machine CM-5/5E systems. The communication time for large particle systems amounts to about 10%-25%, and the overall efficiency is about 35%, corresponding to a performance of about 60 Mflop/s per CM-5E node, independent of the number of nodes.

• 94. Hu, Yu
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
Implementing O(N ) N–body algorithms efficiently in data parallel languages1996Inngår i: Scientific Programming, ISSN 1058-9244, E-ISSN 1875-919X, Vol. 5, nr 4, s. 337-364Artikkel i tidsskrift (Fagfellevurdert)

The optimization techniques for hierarchical O(N) N-body algorithms described here focus on managing the data distribution and the data references, both between the memories of different nodes and within the memory hierarchy of each node. We show how the techniques can be expressed in data-parallel languages, such as High Performance Fortran (HPF) and Connection Machine Fortran (CMF). The effectiveness of our techniques is demonstrated on an implementation of Anderson's hierarchical O(N) N-body method for the Connection Machine system CM-5/5E. Of the total execution time, communication accounts for about 10-20% of the total time, with the average efficiency for arithmetic operations being about 40% and the total efficiency (including communication) being about 35%. For the CM-5E, a performance in excess of 60 Mflop/s per node (peak 160 Mflop/s per node) has been measured.

• 95. Hu, Yu
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
On the Accuracy of Anderson’s fast N –body algorithm1997Konferansepaper (Fagfellevurdert)

We present an empirical study of the accuracy-cost tradeoffs of Anderson's method. The various parameters that control the degree of approximation of the computational elements and the separateness of interacting computational elements govern both the arithmetic complexity and the accuracy of the method. Our experiment shows that for a given error requirement, using a near-field containing only nearest neighbor boxes and a hierarchy depth that minimizes the number of arithmetic operations minimizes the total number of arithmetic operations.

• 96. Hu, Yu
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
On the Accuracy of Anderson’s fast N–body algorithm1997Konferansepaper (Fagfellevurdert)

We present an empirical study of the accuracy-cost tradeoffs of Anderson's method. The various parameters that control the degree of approximation of the computational elements and the separateness of interacting computational elements govern both the arithmetic complexity and the accuracy of the method. Our experiment shows that for a given error requirement, using a near-field containing only nearest neighbor boxes and a hierarchy depth that minimizes the number of arithmetic operations minimizes the total number of arithmetic operations.

• 97. Hu, Yu
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
A Data Parallel Fortran Benchmark Suite1997Konferansepaper (Fagfellevurdert)

The Data Parallel Fortran (DPF) benchmark suite is designed for evaluating data parallel compilers and scalable architectures. Many of the DPF codes are provided in three versions: basic, optimized, and with library calls for performance-critical operations typically found in software libraries. The functionality of the benchmarks covers collective communication functions, scientific software library functions, and application kernels that reflect the computational structure and communication patterns in fluid dynamic simulations, fundamental physics, and molecular studies in chemistry and biology. The intended target language is High Performance Fortran (HPF). However, due to the lack of HPF compilers at the time of the benchmark's development, the suite is written in Connection Machine Fortran (CMF). The DPF codes provide performance evaluation metrics in the form of busy and elapsed times (CM-5), FLOP rates, and arithmetic efficiency. The codes are characterized in terms of FLOP count, m...

• 98. Hu, Yu
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
A Data–Parallel Implementation of the Geometric Partitioning Algorithm1997Konferansepaper (Fagfellevurdert)

We present a data-parallel, High Performance Fortran (HPF) implementation of the geometric partitioning algorithm. The geometric partitioning algorithm has provably good partitioning quality. To our knowledge, ours is the first data-parallel implementation of the algorithm. Our data-parallel formulation makes extensive use of segmented prefix sums and parallel selections, and provides a data-parallel procedure for geometric sampling. Experiments in partitioning particles for load balance and data interactions, as required in hierarchical N-body algorithms and in iterative algorithms for the solution of equilibrium equations on unstructured meshes by the finite element method, have shown that the geometric partitioning algorithm has an efficient data-parallel formulation. Moreover, the quality of the generated partitions is competitive with that offered by the spectral bisection technique and better than the quality offered by other partitioning heuristics.
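Segmented prefix sums, the workhorse primitive mentioned above, can be sketched with NumPy. This sequential sketch assumes flags[0] == 1 (every element belongs to some segment); the function name and formulation are illustrative, not the paper's HPF code.

```python
import numpy as np

def segmented_prefix_sum(values, flags):
    """Inclusive prefix sum that restarts at every segment head (flag == 1).

    Computed from the global prefix sum by subtracting, within each
    segment, the running total accumulated before that segment's head.
    """
    values = np.asarray(values)
    flags = np.asarray(flags)
    total = np.cumsum(values)
    seg_id = np.cumsum(flags) - 1          # segment index of each element
    heads = np.flatnonzero(flags)          # positions of segment heads
    head_offsets = np.concatenate(([0], total[heads[1:] - 1]))
    return total - head_offsets[seg_id]
```

In a data-parallel language both cumsum operations map onto scan primitives, which is why the formulation above parallelizes directly.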

• 99. Hu, Yu
KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
High Performance Fortran for Highly Irregular Problems1997Konferansepaper (Fagfellevurdert)

We present a general data parallel formulation for highly irregular problems in High Performance Fortran (HPF). Our formulation consists of (1) a method for linearizing irregular data structures, (2) a data parallel implementation (in HPF) of graph partitioning algorithms applied to the linearized data structure, and (3) techniques for expressing irregular communication and the nonuniform computations associated with the elements of linearized data structures. We demonstrate and evaluate our formulation on a parallel, hierarchical N-body method for the evaluation of potentials and forces of nonuniform particle distributions. Our experimental results demonstrate that efficient data parallel (HPF) implementations of highly nonuniform problems are feasible with the proper language/compiler/runtime support. Our data parallel N-body code provides a much needed "benchmark" code for evaluating and improving HPF compilers.

• 100.
KTH, Skolan för datavetenskap och kommunikation (CSC), Beräkningsvetenskap och beräkningsteknik (CST). KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
University of Innsbruck, Institute of Computer Science. KTH, Skolan för datavetenskap och kommunikation (CSC), Beräkningsvetenskap och beräkningsteknik (CST). KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC. KTH, Skolan för datavetenskap och kommunikation (CSC), Beräkningsvetenskap och beräkningsteknik (CST). KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC. KTH, Skolan för datavetenskap och kommunikation (CSC), Beräkningsvetenskap och beräkningsteknik (CST). KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
A Particle-in-Cell Method for Automatic Load-Balancing with the AllScale Environment2016Konferansepaper (Annet vitenskapelig)

We present an initial design and implementation of a Particle-in-Cell (PIC) method based on the work carried out in the European Exascale AllScale project. AllScale provides a unified programming system for the effective development of highly scalable, resilient and performance-portable parallel applications for Exascale systems. The AllScale approach is based on task-based nested recursive parallelism and it provides mechanisms for automatic load-balancing in the PIC simulations. We provide the preliminary results of the AllScale-based PIC implementation and draw directions for its future development.
