  • 1.
    Alexandru, Iordan
    et al.
    Norwegian University of Science and Technology Trondheim.
    Podobas, Artur
    KTH, School of Information and Communication Technology (ICT), Communication: Services and Infrastucture, Software and Computer Systems, SCS.
    Natvig, Lasse
    Norwegian University of Science and Technology Trondheim.
    Brorsson, Mats
    KTH, School of Information and Communication Technology (ICT), Communication: Services and Infrastucture, Software and Computer Systems, SCS.
    Investigating the Potential of Energy-savings Using a Fine-grained Task Based Programming Model on Multi-cores. 2011. Conference paper (Refereed)
    Abstract [en]

    In this paper we study the relation between energy-efficiency and parallel executions when implemented with a fine-grained task-centric programming model. Using a simulation framework comprised of an architectural simulator and a power and area estimation tool, we have investigated the potential energy-savings when employing parallelism on multi-core systems. In our experiments with 2 - 8 multi-core systems, we employed frequency and voltage scaling in order to keep the relative performance of the systems constant and measured the energy-efficiency using the Energy-delay-product. Also, we compared the energy consumption of the parallel execution against the serial one. Our results show that through judicious choice of load balancing parameters, significant improvements of around 200 % in energy consumption can be achieved.

  • 2.
    Awan, Ahsan Javed
    et al.
    KTH, School of Information and Communication Technology (ICT), Software and Computer systems, SCS.
    Brorsson, Mats
    KTH, School of Information and Communication Technology (ICT), Software and Computer systems, SCS.
    Vlassov, Vladimir
    KTH, School of Information and Communication Technology (ICT), Software and Computer systems, SCS.
    Ayguade, Eduard
    Barcelona Super Computing Center and Technical University of Catalunya.
    Architectural Impact on Performance of In-memory Data Analytics: Apache Spark Case Study. Manuscript (preprint) (Other academic)
    Abstract [en]

    While cluster computing frameworks are continuously evolving to provide real-time data analysis capabilities, Apache Spark has managed to be at the forefront of big data analytics for being a unified framework for both batch and stream data processing. However, recent studies on micro-architectural characterization of in-memory data analytics are limited to only batch processing workloads. We compare micro-architectural performance of batch processing and stream processing workloads in Apache Spark using hardware performance counters on a dual socket server. In our evaluation experiments, we have found that batch processing and stream processing workloads have similar micro-architectural characteristics and are bounded by the latency of frequent data access to DRAM. For data accesses we have found that simultaneous multi-threading is effective in hiding the data latencies. We have also observed that (i) data locality on NUMA nodes can improve the performance by 10% on average, (ii) disabling next-line L1-D prefetchers can reduce the execution time by up to 14% and (iii) multiple small executors can provide up to 36% speedup over a single large executor.

  • 3.
    Awan, Ahsan Javed
    et al.
    KTH, School of Information and Communication Technology (ICT), Software and Computer systems, SCS.
    Brorsson, Mats
    KTH, School of Information and Communication Technology (ICT), Software and Computer systems, SCS.
    Vlassov, Vladimir
    KTH, School of Information and Communication Technology (ICT), Software and Computer systems, SCS.
    Ayguade, Eduard
    Technical University of Catalunya, Barcelona Super Computing Center.
    How Data Volume Affects Spark Based Data Analytics on a Scale-up Server. 2015. In: Big Data Benchmarks, Performance Optimization, and Emerging Hardware: 6th Workshop, BPOE 2015, Kohala, HI, USA, August 31 - September 4, 2015. Revised Selected Papers, Springer, 2015, Vol. 9495, p. 81-92. Conference paper (Refereed)
    Abstract [en]

    The sheer increase in the volume of data over the last decade has triggered research in cluster computing frameworks that enable web enterprises to extract big insights from big data. While Apache Spark is gaining popularity for exhibiting superior scale-out performance on commodity machines, the impact of data volume on the performance of Spark based data analytics in a scale-up configuration is not well understood. We present a deep-dive analysis of Spark based applications on a large scale-up server machine. Our analysis reveals that Spark based data analytics are DRAM bound and do not benefit from using more than 12 cores for an executor. When the input data size is enlarged, application performance degrades significantly due to a substantial increase in wait time during I/O operations and garbage collection, despite a 10 % better instruction retirement rate (due to lower L1 cache misses and higher core utilization). We match memory behaviour with the garbage collector to improve the performance of applications by 1.6x to 3x.

  • 4.
    Awan, Ahsan Javed
    et al.
    KTH, School of Information and Communication Technology (ICT), Software and Computer systems, SCS.
    Brorsson, Mats
    KTH, School of Information and Communication Technology (ICT), Software and Computer systems, SCS.
    Vlassov, Vladimir
    KTH, School of Information and Communication Technology (ICT), Software and Computer systems, SCS.
    Ayguade, Eduard
    Barcelona Super Computing Center and Technical University of Catalunya.
    Micro-architectural Characterization of Apache Spark on Batch and Stream Processing Workloads. 2016. Conference paper (Refereed)
    Abstract [en]

    While cluster computing frameworks are continuously evolving to provide real-time data analysis capabilities, Apache Spark has managed to be at the forefront of big data analytics for being a unified framework for both batch and stream data processing. However, recent studies on micro-architectural characterization of in-memory data analytics are limited to only batch processing workloads. We compare the micro-architectural performance of batch processing and stream processing workloads in Apache Spark using hardware performance counters on a dual socket server. In our evaluation experiments, we have found that batch processing and stream processing have the same micro-architectural behavior in Spark if the difference between the two implementations is of micro-batching only. If the input data rates are small, stream processing workloads are front-end bound. However, the front-end bound stalls are reduced at larger input data rates and instruction retirement is improved. Moreover, Spark workloads using DataFrames have improved instruction retirement over workloads using RDDs.

  • 5.
    Awan, Ahsan Javed
    et al.
    KTH, School of Information and Communication Technology (ICT), Software and Computer systems, SCS.
    Brorsson, Mats
    KTH, School of Information and Communication Technology (ICT), Software and Computer systems, SCS.
    Vlassov, Vladimir
    KTH, School of Information and Communication Technology (ICT), Software and Computer systems, SCS.
    Ayguade, Eduard
    Barcelona Super Computing Center and Technical University of Catalunya.
    Node architecture implications for in-memory data analytics on scale-in clusters. 2016. Conference paper (Refereed)
    Abstract [en]

    While cluster computing frameworks are continuously evolving to provide real-time data analysis capabilities, Apache Spark has managed to be at the forefront of big data analytics. Recent studies propose scale-in clusters with in-storage processing devices to process big data analytics with Spark. However, the proposal is based solely on the memory bandwidth characterization of in-memory data analytics and does not shed light on the specification of the host CPU and memory. Through empirical evaluation of in-memory data analytics with Apache Spark on an Ivy Bridge dual socket server, we have found that (i) simultaneous multi-threading is effective up to 6 cores, (ii) data locality on NUMA nodes can improve the performance by 10% on average, (iii) disabling next-line L1-D prefetchers can reduce the execution time by up to 14%, (iv) DDR3 operating at 1333 MT/s is sufficient and (v) multiple small executors can provide up to 36% speedup over a single large executor.

  • 6.
    Ayguadé, Eduard
    et al.
    European Center for Parallelism of Barcelona (CEPBA), Technical University of Catalunya (UPC).
    Brorsson, Mats
    KTH, School of Information and Communication Technology (ICT), Microelectronics and Information Technology, IMIT.
    Brunst, H.
    Center for High Performance Computing (ZHR), TU Dresden.
    Hoppe, H. -C
    Pallas GmbH.
    Karlsson, S.
    KTH, School of Information and Communication Technology (ICT), Microelectronics and Information Technology, IMIT.
    Martorell, X.
    European Center for Parallelism of Barcelona (CEPBA), Technical University of Catalunya (UPC).
    Nagel, W. E.
    Center for High Performance Computing (ZHR), TU Dresden.
    Schlimbach, F.
    Pallas GmbH.
    Utrera, G.
    European Center for Parallelism of Barcelona (CEPBA), Technical University of Catalunya (UPC).
    Winkler, M.
    Center for High Performance Computing (ZHR), TU Dresden.
    OpenMP Performance Analysis in the INTONE Project. 2001. Conference paper (Refereed)
  • 7.
    Bao, Yan
    et al.
    KTH, School of Information and Communication Technology (ICT), Communication: Services and Infrastucture (Closed 20120101), Software and Computer Systems, SCS (Closed 20120101).
    Brorsson, Mats
    KTH, School of Information and Communication Technology (ICT), Communication: Services and Infrastucture (Closed 20120101), Software and Computer Systems, SCS (Closed 20120101).
    An Implementation of Cache-Coherence for the Nios II™ Soft-core processor. 2009. Conference paper (Refereed)
    Abstract [en]

    Soft-core programmable processors mapped onto field-programmable gate arrays (FPGA) can be considered as equivalents to a microcontroller. They combine central processing units (CPUs), caches, memories, and peripherals on a single chip. Soft-core processors represent an increasingly common embedded software implementation option. Modern FPGA soft-cores are parameterized to support application-specific customization. However, these soft-core processors are designed to be used in uniprocessor systems, not in multiprocessor systems. This project describes an implementation to solve the cache coherency problem in an Altera Nios II soft-core multiprocessor system.

  • 8.
    Barriga, L.
    et al.
    KTH, Superseded Departments, Teleinformatics.
    Brorsson, Mats
    Lund university.
    Ayani, Rassul
    KTH, Superseded Departments, Teleinformatics.
    Hybrid Parallel Simulation of Distributed Shared-Memory Architectures. 1996. Report (Other academic)
  • 9.
    Barriga, Luis
    et al.
    KTH, Superseded Departments, Teleinformatics.
    Brorsson, Mats
    Lund university.
    Ayani, Rassul
    KTH, Superseded Departments, Teleinformatics.
    A model for parallel simulation of distributed shared memory. 1996. Conference paper (Refereed)
    Abstract [en]

    We present an execution model for parallel simulation of a distributed shared memory architecture. The model captures the processor-memory interaction and abstracts the memory subsystem. Using this model we show how parallel, on-line, partially-ordered memory traces can be correctly predicted without interacting with the memory subsystem. We also outline a parallel optimistic memory simulator that uses these traces, finds a global order among all events, and returns correct data and timing to each processor. A first evaluation of the amount of concurrency that our model can extract for an ideal multiprocessor shows that processors may execute relatively long instruction sequences without violating the causality constraints. However, parallel simulation efficiency is highly dependent on the memory consistency model and the application characteristics.

  • 10.
    Bhatti, Muhammad Khurram
    et al.
    Informat Technol Univ, Embedded Comp Lab, 346-B Ferozpur Rd, Lahore, Pakistan.
    Oz, Isil
    Izmir Inst Technol, Comp Engn Dept, Izmir, Turkey.
    Amin, Sarah
    Informat Technol Univ, Embedded Comp Lab, 346-B Ferozpur Rd, Lahore, Pakistan.
    Mushtaq, Maria
    Informat Technol Univ, Embedded Comp Lab, 346-B Ferozpur Rd, Lahore, Pakistan.
    Farooq, Umer
    Dhofar Univ, Dept Elect & Comp Engn, Salalah 211, Oman.
    Popov, Konstantin
    SICS, Isafjordsgatan 22, S-16429 Kista, Sweden.
    Brorsson, Mats
    KTH, School of Electrical Engineering and Computer Science (EECS), Fusion Plasma Physics, S-16429 Kista, Sweden.
    Locality-aware task scheduling for homogeneous parallel computing systems. 2018. In: Computing, ISSN 0010-485X, E-ISSN 1436-5057, Vol. 100, no 6, p. 557-595. Article in journal (Refereed)
    Abstract [en]

    In systems with complex many-core cache hierarchy, exploiting data locality can significantly reduce execution time and energy consumption of parallel applications. Locality can be exploited at various hardware and software layers. For instance, by implementing private and shared caches in a multi-level fashion, recent hardware designs are already optimised for locality. However, this would all be useless if the software scheduling does not cast the execution in a manner that promotes locality available in the programs themselves. Since programs for parallel systems consist of tasks executed simultaneously, task scheduling becomes crucial for the performance in multi-level cache architectures. This paper presents a heuristic algorithm for homogeneous multi-core systems called locality-aware task scheduling (LeTS). The LeTS heuristic is a work-conserving algorithm that takes into account both locality and load balancing in order to reduce the execution time of target applications. The working principle of LeTS is based on two distinct phases, namely the working task group formation phase (WTG-FP) and the working task group ordering phase (WTG-OP). The WTG-FP forms groups of tasks in order to capture data reuse across tasks, while the WTG-OP determines an optimal order of execution for task groups that minimizes the reuse distance of shared data between tasks. We have performed experiments using randomly generated task graphs by varying three major performance parameters, namely: (1) communication to computation ratio (CCR) between 0.1 and 1.0, (2) application size, i.e., task graphs comprising 50, 100, and 300 tasks per graph, and (3) number of cores, with 2-, 4-, 8-, and 16-core execution scenarios. We have also performed experiments using selected real-world applications. The LeTS heuristic reduces overall execution time of applications by exploiting inter-task data locality. Results show that LeTS outperforms state-of-the-art algorithms in amortizing inter-task communication cost.

  • 11.
    Bhatti, Muhammad Khurram
    et al.
    Oz, Isil
    Popov, Konstantin
    Brorsson, Mats
    KTH, School of Information and Communication Technology (ICT), Software and Computer systems, SCS.
    Farooq, Umer
    Scheduling of Parallel Tasks with Proportionate Priorities. 2016. In: Arabian Journal for Science and Engineering, ISSN 2193-567X, Vol. 41, no 8, p. 3279-3295. Article in journal (Refereed)
    Abstract [en]

    Parallel computing systems promise higher performance for computationally intensive applications. Since programmes for parallel systems consist of tasks that can be executed simultaneously, task scheduling becomes crucial for the performance of these applications. Given dependence constraints between tasks, their arbitrary sizes, and bounded resources available for execution, optimal task scheduling is considered an NP-hard problem. Therefore, proposed scheduling algorithms are based on heuristics. This paper presents a novel list scheduling heuristic, called the Noodle heuristic. Noodle is a simple yet effective scheduling heuristic that differs from the existing list scheduling techniques in the way it assigns task priorities. The priority mechanism of Noodle maintains a proportionate fairness among all ready tasks belonging to all paths within a task graph. We conduct an extensive experimental evaluation of the Noodle heuristic with task graphs taken from Standard Task Graph. Our experimental study includes results for task graphs comprising 50, 100, and 300 tasks per graph and execution scenarios with 2-, 4-, 8-, and 16-core systems. We report results for average Schedule Length Ratio (SLR) obtained by producing variations in Communication to Computation cost Ratio. We also analyse results for different degrees of parallelism and numbers of edges in the task graphs. Our results demonstrate that Noodle produces schedules that are within a maximum of 12 % (in worst-case) of the optimal schedule for 2-, 4-, and 8-core systems. We also compare Noodle with existing scheduling heuristics and perform comparative analysis of its performance. Noodle outperforms existing heuristics for average SLR values.

  • 12.
    Bhatti, Muhammad Khurram
    et al.
    Oz, Isil
    Popov, Konstantin
    Muddukrishna, Ananya
    KTH, School of Information and Communication Technology (ICT), Software and Computer systems, SCS.
    Brorsson, Mats
    KTH, School of Information and Communication Technology (ICT), Software and Computer systems, SCS. SICS Swedish ICT, Sweden.
    Noodle: A Heuristic Algorithm for Task Scheduling in MPSoC Architectures. 2014. In: Proceedings - 2014 17th Euromicro Conference on Digital System Design, DSD 2014, 2014, p. 667-670. Conference paper (Refereed)
    Abstract [en]

    Task scheduling is crucial for the performance of parallel applications. Given dependence constraints between tasks, their arbitrary sizes, and bounded resources available for execution, optimal task scheduling is considered an NP-hard problem. Therefore, proposed scheduling algorithms are based on heuristics. This paper presents a novel heuristic algorithm, called the Noodle heuristic, which differs from the existing list scheduling techniques in the way it assigns task priorities. We conduct an extensive experimental evaluation to validate Noodle for task graphs taken from Standard Task Graph (STG). Results show that Noodle produces schedules that are within a maximum of 12% (in worst-case) of the optimal schedule for 2-, 4-, and 8-core systems. We also compare Noodle with existing scheduling heuristics and perform comparative analysis of its performance.

  • 13.
    Brorsson, Mats
    Lund university.
    A decentralized virtual memory scheme implemented on an emulated multiprocessor. 1989. In: Proceedings of the Twenty-Second Annual Hawaii International Conference on System Sciences. Vol.I: Architecture Track (IEEE Cat. No. 89TH0242-8), 1989, p. 286-95. Conference paper (Refereed)
    Abstract [en]

    A decentralized scheme for virtual memory management on MIMD (multiple-instruction-multiple-data) multiprocessors with shared memory has been developed. Control and data structures are kept local to the processing elements (PEs), which reduces the global traffic and makes a high degree of parallelism possible. Each of the PEs in the target architecture consists of a processor and part of the shared memory and is connected to the others by a common bus. The traditional approach, based on replication or sharing of data structures, is not suitable in this case, where the number of PEs is on the order of 100. This is due to the excessive global traffic caused by consistency or mutual exclusion protocols. A variant of Denning's working set page replacement algorithm is used, in which each process owns a page list. Shared pages are not present in more than one list, and it is shown that this will not increase the page fault rate in most cases.

  • 14.
    Brorsson, Mats
    Lund university.
    Datorsystem – program- och maskinvara. 1999. Book (Other academic)
  • 15.
    Brorsson, Mats
    Lund university.
    Emulation of Shared Virtual Memory on an Experimental Multiprocessor. 1989. Report (Other academic)
  • 16.
    Brorsson, Mats
    Lund university.
    Intone: Tools and Environment for OpenMP on Clusters of SMPs. 2000. Conference paper (Refereed)
  • 17.
    Brorsson, Mats
    Lund university.
    Local vs. global memory in the IBM RP3: experiments and performance modelling. 1991. Conference paper (Refereed)
    Abstract [en]

    A number of experiments regarding the placement of instructions, private data and shared data in the Non-Uniform-Memory-Access multiprocessor, RP3, have been performed. Three scientific/mathematical workloads have been used in the experiments, and the results have been modelled in a simple performance model which takes linear contention into consideration. The results indicate that it can very well be feasible not to have memory local to the processors in RP3-like architectures. There seems to be a trade-off between the effort spent in the design on the memory system and the interconnection network and the use of local memory which can be costly in terms of prohibited process migration and more complicated software management.

  • 18.
    Brorsson, Mats
    KTH, School of Information and Communication Technology (ICT), Communication: Services and Infrastucture (Closed 20120101), Software and Computer Systems, SCS (Closed 20120101).
    MipsIt-a simulation and development environment using animation for computer architecture education. 2002. In: / [ed] Ed Gehringer, 2002, p. 65-72. Conference paper (Other academic)
    Abstract [en]

    Computer animation is a tool which nowadays is used in more and more fields. In this paper we describe the use of computer animation to support the learning of computer organization itself. MipsIt is a system consisting of a software development environment, a system and cache simulator and a highly flexible microarchitecture simulator used for pipeline studies. It has been in use for several years now and constitutes an important tool in the education at Lund University and KTH, Royal Institute of Technology in Sweden.

  • 19.
    Brorsson, Mats
    Lund university.
    Performance Impact of Code and Data Placement on the IBM RP3. 1989. Report (Other academic)
  • 20.
    Brorsson, Mats
    Lund university.
    Performance tuning of small scale shared memory multiprocessor applications using visualisation. 1997. Conference paper (Refereed)
    Abstract [en]

    Even though shared memory multiprocessors are becoming more and more common, it is still a formidable task to achieve high performance on parallel applications. One of the main reasons for this is a high amount of implicit communication generated by the program due to poor structuring of the program. This article shows the importance of performance visualisation in order to spot and find the source of cache coherence bottlenecks. This is exemplified by a performance analysis tool, SM-prof, that visualises accesses to shared data structures so that problematic access patterns are highlighted. SM-prof maintains links from the visualisation to the actual source code lines responsible for the accesses. In contrast to earlier approaches, SM-prof shows the inherent data sharing of the application that would occur in any shared memory architecture. We demonstrate the merits of SM-prof by means of two detailed case studies.

  • 21.
    Brorsson, Mats
    Lund university.
    SM-prof: a tool to visualise and find cache coherence performance bottlenecks in multiprocessor programs. 1995. Conference paper (Refereed)
    Abstract [en]

    Cache misses due to coherence actions are often the major source for performance degradation in cache coherent multiprocessors. It is often difficult for the programmer to take cache coherence into account when writing the program since the resulting access pattern is not apparent until the program is executed. SM-prof is a performance analysis tool that addresses this problem by visualising the shared data access pattern in a diagram with links to the source code lines causing performance degrading access patterns. The execution of a program is divided into time slots and each data block is classified based on the accesses made to the block during a time slot. This enables the programmer to follow the execution over time and it is possible to track the exact position responsible for accesses causing many cache misses related to coherence actions. Matrix multiplication and the MP3D application from SPLASH are used to illustrate the use of SM-prof. For MP3D, SM-prof revealed performance limitations that resulted in a performance improvement of over 75%. The current implementation is based on program-driven simulation in order to achieve non-intrusive profiling. If a small perturbation of the program execution is acceptable, it is also possible to use software tracing techniques given that a data address can be related to the originating instruction.

  • 22.
    Brorsson, Mats
    et al.
    KTH, School of Information and Communication Technology (ICT), Communication: Services and Infrastucture, Software and Computer Systems, SCS.
    Collin, Mikael
    KTH, School of Information and Communication Technology (ICT), Communication: Services and Infrastucture, Software and Computer Systems, SCS.
    Adaptive and flexible dictionary code compression for embedded applications. 2006. In: Proceedings of the 2006 international conference on Compilers, architecture and synthesis for embedded systems, 2006, p. 113-124. Conference paper (Refereed)
    Abstract [en]

    Dictionary code compression is a technique where long instructions in the memory are replaced with shorter code words used as indices into a table to look up the original instructions. We present a new view of dictionary code compression for moderately high-performance processors for embedded applications. Previous work with dictionary code compression has shown decent performance and energy savings results, which we verify with our own measurements that are more thorough than those previously published. We also augment previous work with a more thorough analysis of the effects of cache and line size changes. In addition, we introduce the concept of aggregated profiling to allow two or more programs to share the same dictionary contents. Finally, we also introduce dynamic dictionaries, where the dictionary contents are considered to be part of the context of a process, and show that the performance overhead of reloading the dictionary contents on a context switch is negligible, while at the same time we can save considerable energy with more specialized dictionary contents.

  • 23.
    Brorsson, Mats
    et al.
    Lund university.
    Dahlgren, Fredrik
    Lund university.
    Nilsson, Håkan
    Lund university.
    Stenström, Per
    Lund university.
    The CacheMire Test Bench – A Flexible and Effective Approach for Simulation of Multiprocessors. 1993. Conference paper (Refereed)
  • 24.
    Brorsson, Mats
    et al.
    Lund university.
    Kral, Martin
    Lund university.
    Performance tuning software DSM applications using visualisation. 1999. In: Journal of Supercomputing, Vol. 13, p. 249-65. Article in journal (Refereed)
    Abstract [en]

    Small organisations can now have access to high raw processing power using networks of workstations (NOW) as parallel computing platforms. Software Distributed Shared Memory (Software DSM) packages have been developed to facilitate the programming of such systems. However, because of the high interprocess latencies in a NOW, the performance of a software DSM application is more susceptible to the partitioning of the problem than what might be expected. This paper presents an approach for a tool to visualise the execution of a program in a way that highlights performance bottlenecks. The tool associates identified bottlenecks with the corresponding source code lines in order to determine what piece of code is the cause of poor performance. The visualisation technique is demonstrated in two case studies. They clearly show that the visualisation is indeed useful and provides an effective way to acquire an understanding of what characterises an application's sharing behaviour.

  • 25.
    Brorsson, Mats
    et al.
    Lund university.
    Kral, Martin
    Lund university.
    Visualisation for performance tuning of DVSM applications. 1998. Conference paper (Refereed)
    Abstract [en]

    Small organisations can now have access to high raw processing power using networks of workstations (NOW) as parallel computing platforms. Distributed Virtual Shared Memory (DVSM) packages have been developed to facilitate the programming of such systems. However, because of the high interprocess latencies in a NOW, the performance of a DVSM application is more susceptible to the partitioning of the problem than what might be expected. The paper presents an approach for a tool to visualise the execution of a program in a way that highlights performance bottlenecks. The tool associates identified bottlenecks with the corresponding source code lines in order to determine what piece of code is the cause of poor performance. The visualisation technique is demonstrated in two case studies. They clearly show that the visualisation is indeed useful and provides an effective way to acquire an understanding of what characterises an application's sharing behaviour.

  • 26.
    Brorsson, Mats
    et al.
    Telesoft AB.
    Kruzela, Ivan
    Telesoft AB.
    Museion-reuse support system for design of service features. 1991. Conference paper (Refereed)
    Abstract [en]

    Museion is a reuse support system integrated in a service creation environment. A number of service features have been implemented and tested on a test bed consisting of a PABX switching system and a workstation. Reuse has been applied to all phases of the software development cycle. Museion is an intelligent database system with facilities for storage, search and evaluation of reusable components. One feature of Museion is the concept of aggregates, an abstract data structure containing reusable components as well as information supporting the classification, search and retrieval of the specific components. Museion maintains links between related aggregates enabling a hypertext navigation facility. A prototype Museion consisting of a number of integrated tools has been designed and implemented. The authors discuss the usefulness of Museion and compare it with other repositories.

  • 27.
    Brorsson, Mats
    et al.
    Telesoft AB.
    Kruzela, Ivan
    Telesoft AB.
    Reuse in Telecommunication System Development. 1990. Report (Other academic)
  • 28.
    Brorsson, Mats
    et al.
    Lund university.
    Stenstrom, Per
    Lund university.
    Characterising and modelling shared memory accesses in multiprocessor programs. 1996. In: Parallel Computing, Vol. 22, p. 869-93. Article in journal (Refereed)
    Abstract [en]

    Directory-based, write-invalidate cache coherence protocols are effective in reducing memory latency in shared memory multiprocessors. However, their performance is highly related to the number of coherence actions induced by the application’s access pattern. It is therefore important to understand the nature of data sharing access patterns that lead to cache misses for this class of cache coherence protocols. We identify a set of application parameters that characterises data sharing, the sharing behaviour, for three distinct categories of access patterns: stationary, migratory and producer-consumer accesses. The characterisation can be done in sufficient detail so as to predict the number of cold, coherence and directory replacement misses for a limited-directory cache coherence scheme. To validate a workload model that essentially uses the parameter set as input, a reference generator has been designed. This reference generator is shown to generate the same miss ratio as the workload it models.

  • 29.
    Brorsson, Mats
    et al.
    Lund university.
    Stenstrom, Per
    Lund university.
    Modelling accesses to migratory and producer-consumer characterised data in a shared memory multiprocessor. 1994. Conference paper (Refereed)
    Abstract [en]

    Directory-based, write-invalidate cache coherence protocols are effective in reducing latencies to the memory but suffer from cache misses due to coherence actions. It is therefore important to understand the nature of data sharing causing misses for this class of protocols. We identify a set of parameters that characterises the accesses to migratory and producer-consumer data in sufficient detail so as to predict the number of cache misses in directory-based, write-invalidate protocols. We show that the parameters can be extracted from real programs and used as input to a reference generator that artificially generates a stream of references causing accurate estimates of cold, coherence and directory replacement misses, compared to the program itself.

  • 30.
    Brorsson, Mats
    et al.
    Lund university.
    Stenstrom, Per
    Lund university.
    Modelling accesses to stationary data in a shared memory multiprocessor. 1994. Conference paper (Refereed)
    Abstract [en]

    Cache misses due to coherence and directory maintenance are a major reason for poor performance in shared memory multiprocessors. We show that the relationship between a particular access pattern and cache miss ratios for a class of directory-based, write-invalidate cache coherence protocols can be characterised in a small set of parameters. In order to do this, a reference generator has been designed that, based on parameters automatically extracted from a program, can artificially generate a reference stream that results in the same cold, coherence and directory replacement miss ratios as an execution of the program.

  • 31.
    Brorsson, Mats
    et al.
    Lund university.
    Stenström, Per
    Lund university.
    Visualising Sharing Behaviour in relation to Shared Memory Management. 1992. Conference paper (Refereed)
  • 32. Brunschen, C.
    et al.
    Brorsson, Mats
    KTH, Superseded Departments, Microelectronics and Information Technology, IMIT.
    OdinMP/CCp - a portable implementation of OpenMP for C. 2000. In: Concurrency, ISSN 1040-3108, E-ISSN 1096-9128, Vol. 12, no 12, p. 1193-1203. Article in journal (Refereed)
    Abstract [en]

    We describe here the design and performance of OdinMP/CCp, which is a portable compiler for C programs using the OpenMP directives for parallel processing with shared memory. OdinMP/CCp was written in Java for portability reasons; it takes a C program with OpenMP directives and produces a C program for POSIX threads. We describe some of the ideas behind the design of OdinMP/CCp and show some performance results achieved on an SGI Origin 2000 and a Sun E10000. Speedup measurements relative to a sequential version of the test programs show that OpenMP programs using OdinMP/CCp exhibit excellent performance on the Sun E10000 and reasonable performance on the Origin 2000.

  • 33.
    Collin, Mikael
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic, Computer and Software Systems, ECS.
    Brorsson, Mats
    KTH, School of Information and Communication Technology (ICT), Electronic, Computer and Software Systems, ECS.
    Improving Code Density of Embedded Software using a 2-level Dictionary Code Compression Architecture. 2008. In: 2008 13TH ASIA-PACIFIC COMPUTER SYSTEMS ARCHITECTURE CONFERENCE, NEW YORK: IEEE, 2008, p. 284-291. Conference paper (Refereed)
    Abstract [en]

    Dictionary code compression has been proposed to reduce the energy consumed in the instruction fetch path of processors or to reduce program footprint in memory. With this technique, instructions or instruction sequences in the binary code are replaced with short code words, which at run-time are replaced with the original instructions using a dictionary inside the data path. We present here a new method that aims to further improve code density compared to previously proposed dictionary code compression techniques. It is a 2-level approach capable of handling compression of both individual instructions and code sequences of 2-16 instructions. Our proposed approach is more flexible and has a better dynamic compression ratio and lower fetch path energy consumption than previously studied 1-level approaches. The energy consumed in the instruction fetch path is reduced by up to 56% compared to using uncompressed instructions.

  • 34.
    Collin, Mikael
    et al.
    KTH, School of Information and Communication Technology (ICT), Communication: Services and Infrastucture, Software and Computer Systems, SCS.
    Brorsson, Mats
    KTH, School of Information and Communication Technology (ICT), Communication: Services and Infrastucture, Software and Computer Systems, SCS.
    Low Power Instruction Fetch using Profiled Variable Length Instructions. 2003. Conference paper (Refereed)
    Abstract [en]

    Computer system performance depends on high access rate and low miss rate in the instruction cache, which also affects the energy consumed by fetching instructions. Simulations of a small computer typical of embedded systems show that up to 20% of the overall processor energy is consumed in the instruction fetch path and as much as 23% of the execution time is spent on instruction fetch. One way to increase the instruction memory bandwidth is to fetch more instructions per access without increasing the bus width. We propose an extension to a RISC ISA, with variable length instructions, yielding higher information density without compromising programmability. Based on profiling of dynamic instruction usage and argument locality of a set of SPEC CPU2000 applications, we present a scheme using 8-, 16- and 24-bit instructions accompanied by lookup tables inside the processor. Our scheme yields a 20-30% reduction in static memory usage, and experiments show that up to 60% of all executed instructions consist of short instructions. The overall energy savings are up to 15% for the entire data path and memory system, and up to 20% in the instruction fetch path.

  • 35.
    Collin, Mikael
    et al.
    KTH, School of Information and Communication Technology (ICT), Communication: Services and Infrastucture, Software and Computer Systems, SCS.
    Brorsson, Mats
    KTH, School of Information and Communication Technology (ICT), Communication: Services and Infrastucture, Software and Computer Systems, SCS.
    Low Power Instruction Fetch using Variable Length Instructions. 2003. Conference paper (Refereed)
  • 36.
    Collin, Mikael
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic, Computer and Software Systems, ECS.
    Brorsson, Mats
    KTH, School of Information and Communication Technology (ICT), Communication: Services and Infrastucture, Software and Computer Systems, SCS.
    Two-Level Dictionary Code Compression: a New Scheme to Improve Instruction Code Density of Embedded Applications. 2009. In: CGO 2009: INTERNATIONAL SYMPOSIUM ON CODE GENERATION AND OPTIMIZATION, PROCEEDINGS, LOS ALAMITOS: IEEE COMPUTER SOC, 2009, p. 231-242. Conference paper (Refereed)
    Abstract [en]

    Dictionary code compression is a technique which has been studied as a method to reduce the energy consumed in the instruction fetch path of processors. Instructions or instruction sequences in the code are replaced with short code words. These code words are later used to index a dictionary which contains the original uncompressed instruction or an entire sequence. In this paper, we present a new method which improves on code density compared to previously published dictionary methods. It uses a two-level dictionary design and is capable of handling compression of both individual instructions and code sequences of 2-16 instructions. The two dictionaries are in separate pipeline stages and work together to decompress sequences and instructions. The impact on storage size for the dictionaries is rather small as the sequences in the dictionary are stored as individually compressed instructions, instead of normal instructions. Compared to previous dictionary code compression methods, we achieve an improved dynamic compression rate and potential for better performance, with a reasonable static compression rate and a dictionary size that is still small enough for context switching.
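
    The two-level lookup described in this abstract can be illustrated with a minimal sketch. It is not the authors' hardware design: the token format (`raw`, `ins`, `seq`), the function name `decompress`, and the example instructions are assumptions made here for illustration. The only property taken from the abstract is that sequence entries hold individually compressed instructions, so expanding a sequence goes through the instruction dictionary as a second step.

```python
from typing import List, Sequence, Tuple, Union

# Hypothetical token format (an assumption, not from the paper):
#   ("raw", "<instruction>")  uncompressed instruction kept as-is
#   ("ins", i)                code word indexing the instruction dictionary
#   ("seq", j)                code word indexing the sequence dictionary
Token = Tuple[str, Union[str, int]]


def decompress(stream: Sequence[Token],
               seq_dict: Sequence[Sequence[int]],
               ins_dict: Sequence[str]) -> List[str]:
    """Expand a compressed token stream into the original instructions."""
    out: List[str] = []
    for kind, value in stream:
        if kind == "raw":
            out.append(value)
        elif kind == "ins":
            out.append(ins_dict[value])
        elif kind == "seq":
            # First level: the sequence entry is a list of instruction
            # code words. Second level: each code word is looked up in
            # the instruction dictionary.
            out.extend(ins_dict[i] for i in seq_dict[value])
        else:
            raise ValueError(f"unknown token kind: {kind!r}")
    return out


# Illustrative dictionaries and stream (made-up contents).
ins_dict = ["add r1, r2, r3", "lw r4, 0(r1)", "sw r4, 4(r1)"]
seq_dict = [[1, 0, 2]]
stream = [("seq", 0), ("raw", "jr ra"), ("ins", 0)]
program = decompress(stream, seq_dict, ins_dict)
```

    Storing index lists rather than full instructions in `seq_dict` mirrors the abstract's point that sequences add little to dictionary storage.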

  • 37.
    Collin, Mikael
    et al.
    KTH, School of Information and Communication Technology (ICT), Communication: Services and Infrastucture, Software and Computer Systems, SCS.
    Brorsson, Mats
    KTH, School of Information and Communication Technology (ICT), Communication: Services and Infrastucture, Software and Computer Systems, SCS.
    Öberg, Johnny
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    A performance and energy exploration of dictionary code compression architectures. 2011. In: 2011 International Green Computing Conference and Workshops (IGCC), IEEE conference proceedings, 2011, p. 1-8. Conference paper (Refereed)
    Abstract [en]

    We have made a performance and energy exploration of a previously proposed dictionary code compression mechanism where frequently executed individual instructions and/or sequences are replaced in memory with short code words. Our simulated design shows a dramatically reduced instruction memory access frequency leading to a performance improvement for small instruction cache sizes and to significantly reduced energy consumption in the instruction fetch path. We have evaluated the performance and energy implications of three architectural parameters: branch prediction accuracy, instruction cache size and organization. To assess the complexity of the design we have implemented the critical stages in VHDL.

  • 38. Du, M.
    et al.
    Sassioui, R.
    Varisteas, G.
    State, R.
    Brorsson, Mats
    KTH, School of Information and Communication Technology (ICT), Software and Computer systems, SCS.
    Cherkaoui, O.
    Improving real-time bidding using a constrained markov decision process. 2017. In: 13th International Conference on Advanced Data Mining and Applications, ADMA 2017, Springer, 2017, Vol. 10604, p. 711-726. Conference paper (Refereed)
    Abstract [en]

    Online advertising is increasingly switching to real-time bidding on advertisement inventory, in which the ad slots are sold through real-time auctions upon users visiting websites or using mobile apps. To compete with unknown bidders in such a highly stochastic environment, each bidder is required to estimate the value of each impression and to set a competitive bid price. Previous bidding algorithms have done so without considering the constraint of budget limits, which we address in this paper. We model the bidding process as a Constrained Markov Decision Process based reinforcement learning framework. Our model uses the predicted click-through-rate as the state, bid price as the action, and ad clicks as the reward. We propose a bidding function, which outperforms the state-of-the-art bidding functions in terms of the number of clicks when the budget limit is low. We further simulate different bidding functions competing in the same environment and report the performances of the bidding strategies when required to adapt to a dynamic environment.

  • 39.
    Fang, Huan
    et al.
    KTH, School of Information and Communication Technology (ICT), Communication: Services and Infrastucture (Closed 20120101), Software and Computer Systems, SCS (Closed 20120101).
    Brorsson, Mats
    KTH, School of Information and Communication Technology (ICT), Communication: Services and Infrastucture (Closed 20120101), Software and Computer Systems, SCS (Closed 20120101).
    Scalable directory architecture for distributed shared memory chip multiprocessors. 2008. In: Proceedings of the 1st Swedish Workshop on Multi-core Computing, 2008, p. 73-81. Conference paper (Refereed)
    Abstract [en]

    Traditional directory-based cache coherence protocols are far from optimal for large-scale cache-coherent shared memory multiprocessors due to the increasing latency to access directories stored in DRAM. Instead of keeping directories in main memory, we consider distributing the directory together with the L2 cache across all nodes on a chip multiprocessor. Each node contains a processing unit, a private L1 cache, a slice of L2 cache, a memory controller and a router. Both the L2 cache and the memories are distributed, shared and interleaved by a subset of memory address bits. All nodes are interconnected through a low-latency two-dimensional mesh network.

    The directory, as a component split from the L2 cache, stores only sharing information for blocks, while the L2 cache stores only data blocks, exclusive with the L1 cache. A shared L2 cache can increase the total effective cache capacity on chip, but also increases the miss latency when data is on a remote node. Unlike the Directory Cache structure, our proposal removes the directory from memory entirely, which saves memory space and reduces access latency. Compared to an L2 cache that combines directory information internally, our split L2 cache structure saves over 88% of cache space while achieving similar performance.

  • 40.
    Fang, Huan
    et al.
    KTH, School of Information and Communication Technology (ICT), Communication: Services and Infrastucture, Software and Computer Systems, SCS.
    Brorsson, Mats
    KTH, School of Information and Communication Technology (ICT), Communication: Services and Infrastucture, Software and Computer Systems, SCS.
    Scalable directory architecture for distributed shared memory chip multiprocessors. 2008. In: SIGARCH Computer Architecture News, ISSN 0163-5964, E-ISSN 1943-5851, Vol. 36, no 5, p. 56-64. Article in journal (Refereed)
    Abstract [en]

    Traditional directory-based cache coherence protocols are far from optimal for large-scale cache-coherent shared memory multiprocessors due to the increasing latency to access directories stored in DRAM. Instead of keeping directories in main memory, we consider distributing the directory together with the L2 cache across all nodes on a chip multiprocessor. Each node contains a processing unit, a private L1 cache, a slice of L2 cache, a memory controller and a router. Both the L2 cache and the memories are distributed, shared and interleaved by a subset of memory address bits. All nodes are interconnected through a low-latency two-dimensional mesh network. The directory, being a component split from the L2 cache, stores only sharing information for blocks, while the L2 cache stores only data blocks, exclusive with the L1 cache. A shared L2 cache can increase the total effective cache capacity on chip, but also increases the miss latency when data is on a remote node. Unlike the Directory Cache structure, our proposal removes the directory from memory entirely, which saves memory space and reduces access latency. Compared to an L2 cache that combines directory information internally, our L2 cache structure saves up to 88% of cache space and achieves similar performance.
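
    As a rough illustration of interleaving by a subset of address bits on a two-dimensional mesh, the sketch below picks a home node for a block and counts mesh hops to it. The abstract does not state which bits are used, the block size, or the mesh size; the 64-byte block, the 16-node 4x4 mesh, and the function names are all assumptions chosen here, and the hop count assumes minimal (Manhattan-distance) routing.

```python
def home_node(address: int, block_bits: int = 6, node_bits: int = 4) -> int:
    """Select the home node from the address bits just above the block offset.

    block_bits=6 (64-byte blocks) and node_bits=4 (16 nodes) are assumed
    values for illustration only.
    """
    return (address >> block_bits) & ((1 << node_bits) - 1)


def mesh_coords(node: int, width: int = 4) -> tuple:
    """Map a node id to (x, y) on a row-major mesh of the given width."""
    return node % width, node // width


def hops(src: int, dst: int, width: int = 4) -> int:
    """Hop count between two nodes under minimal routing on the mesh."""
    sx, sy = mesh_coords(src, width)
    dx, dy = mesh_coords(dst, width)
    return abs(sx - dx) + abs(sy - dy)


# Consecutive 64-byte blocks map to consecutive home nodes.
first = home_node(0x0000)
second = home_node(0x0040)
far_corner = hops(0, 15)
```

    With this mapping, a miss in a private L1 cache is sent to the node returned by `home_node`, so remote-slice latency grows with the value returned by `hops`.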

  • 41.
    Faxén, Karl-Filip
    et al.
    Swedish Institute of Computer Science.
    Bengtsson, Christer
    Swedsoft.
    Brorsson, Mats
    KTH, School of Information and Communication Technology (ICT), Communication: Services and Infrastucture, Software and Computer Systems, SCS.
    Grahn, Håkan
    Blekinge Institute of Technology.
    Hagersten, Erik
    Uppsala university.
    Jonsson, Bengt
    Uppsala university.
    Kessler, Christoph
    Linköping university.
    Lisper, Björn
    Mälardalen university.
    Stenström, Per
    Chalmers university of Technology.
    Multicore computing--the state of the art. 2009. Report (Other academic)
    Abstract [en]

    This document presents the current state of the art in multicore computing, in hardware and software, as well as ongoing activities, especially in Sweden. To a large extent, it draws on the presentations given at the Multicore Days 2008 organized by SICS, Swedish Multicore Initiative and Ericsson Software Research, but the published literature and the experience of the authors have been equally important sources. It is clear that multicore processors will be with us for the foreseeable future; there seems to be no alternative way to provide substantial increases of microprocessor performance in the coming years. While processors with a few (2–8) cores are common today, this number is projected to grow as we enter the era of manycore computing. The road ahead for multicore and manycore hardware seems relatively clear, although some issues like the organization of the on-chip memory hierarchy remain to be settled. Multicore software is however much less mature, with fundamental questions of programming models, languages, tools and methodologies still outstanding.

  • 42.
    Issa, Shady
    et al.
    Universidade de Lisboa, Portugal.
    Romano, Paolo
    INESC-ID, Instituto Superior Tecnico, Universidade de Lisboa.
    Brorsson, Mats
    KTH, School of Information and Communication Technology (ICT), Software and Computer systems, SCS.
    Green-CM: Energy efficient contention management for Transactional Memory. 2015. In: 44th International Conference on Parallel Processing (ICPP), September 2015, IEEE, 2015. Conference paper (Refereed)
    Abstract [en]

    Transactional memory (TM) is emerging as an attractive synchronization mechanism for concurrent computing. In this work we aim at filling a relevant gap in the TM literature, by investigating the issue of energy efficiency for one crucial building block of TM systems: contention management. Green-CM, the solution proposed in this paper, is the first contention management scheme explicitly designed to jointly optimize both performance and energy consumption. To this end Green-CM combines three key mechanisms: i) it leverages a novel asymmetric design, which combines different back-off policies in order to take advantage of dynamic frequency and voltage scaling; ii) it introduces an energy efficient design of the back-off mechanism, which combines spin-based and sleep-based implementations; iii) it makes extensive use of self-tuning mechanisms to pursue optimal efficiency across highly heterogeneous workloads. We evaluate Green-CM from both the energy and performance perspectives, and show that it can achieve enhanced efficiency by up to 2.35 times with respect to state of the art contention managers, with an average gain of more than 60% when using 64 threads.

  • 43.
    Javed Awan, Ahsan
    et al.
    KTH, School of Information and Communication Technology (ICT), Software and Computer systems, SCS.
    Brorsson, Mats
    KTH, School of Information and Communication Technology (ICT), Software and Computer systems, SCS.
    Vlassov, Vladimir
    KTH, School of Information and Communication Technology (ICT), Software and Computer systems, SCS.
    Ayguade, Eduard
    Technical University of Catalunya (UPC), Computer Architecture Department.
    Performance Characterization of In-Memory Data Analytics on a Modern Cloud Server. 2015. In: Proceedings - 2015 IEEE 5th International Conference on Big Data and Cloud Computing, BDCloud 2015, IEEE Computer Society, 2015, p. 1-8, article id 7310708. Conference paper (Refereed)
    Abstract [en]

    In the last decade, data analytics has rapidly progressed from traditional disk-based processing to modern in-memory processing. However, little effort has been devoted to enhancing performance at the micro-architecture level. This paper characterizes the performance of in-memory data analytics using the Apache Spark framework. We use a single node NUMA machine and identify the bottlenecks hampering the scalability of workloads. We also quantify the inefficiencies at the micro-architecture level for various data analysis workloads. Through empirical evaluation, we show that Spark workloads do not scale linearly beyond twelve threads, due to work time inflation and thread level load imbalance. Further, at the micro-architecture level, we observe memory bound latency to be the major cause of work time inflation.

  • 44.
    Karlsson, S.
    et al.
    KTH, School of Information and Communication Technology (ICT), Microelectronics and Information Technology, IMIT.
    Brorsson, Mats
    KTH, School of Information and Communication Technology (ICT), Communication: Services and Infrastucture, Software and Computer Systems, SCS.
    An Infrastructure for Portable and Efficient Software DSM. 1999. Conference paper (Refereed)
  • 45.
    Karlsson, S.
    et al.
    KTH, School of Information and Communication Technology (ICT), Microelectronics and Information Technology, IMIT.
    Brorsson, Mats
    KTH, School of Information and Communication Technology (ICT), Communication: Services and Infrastucture, Software and Computer Systems, SCS.
    Priority Based Messaging for Software Distributed Shared Memory. 2003. In: Cluster Computing, Vol. 6, p. 161-169. Article in journal (Refereed)
  • 46.
    Karlsson, S.
    et al.
    KTH, Superseded Departments, Microelectronics and Information Technology, IMIT.
    Brorsson, Mats
    KTH, Superseded Departments, Microelectronics and Information Technology, IMIT.
    Producer-push-a protocol enhancement to page-based software distributed shared memory systems. 1999. In: Proceedings of ICPP’99: 1999 International Conference on Parallel Processing, 1999, p. 291-300. Conference paper (Refereed)
    Abstract [en]

    This paper describes a technique called producer-push that enhances the performance of a page-based software distributed shared memory system. Shared data, in software DSM systems, must normally be requested from the node that produced the latest value. Producer-push utilizes the execution history to predict this communication so that the data is pushed to the consumer before it is requested. In contrast to previously proposed mechanisms to proactively send data to where it is needed, producer-push uses information about the source code location of communication to more accurately predict the needed communication. Producer-push requires no source code modifications of the application and it effectively reduces the latency of shared memory accesses. This is confirmed by our performance evaluation which shows that the average time to wait for memory updates is reduced by 74%. Producer-push also changes the communication pattern of an application making it more suitable for modern networks. The latter is a result of a 44% reduction of the average number of messages and an enlargement of the average message size by 65%.

  • 47.
    Karlsson, Sven
    et al.
    KTH, School of Information and Communication Technology (ICT), Microelectronics and Information Technology, IMIT.
    Brorsson, Mats
    KTH, School of Information and Communication Technology (ICT), Communication: Services and Infrastucture, Software and Computer Systems, SCS.
    A comparative characterization of communication patterns in applications using MPI and shared memory on an IBM SP2. 1998. Conference paper (Refereed)
    Abstract [en]

    In this paper we analyze the characteristics of communication in three different applications, FFT, Barnes and Water, on an IBM SP2. We contrast the communication using two different programming models: message-passing, MPI, and shared memory, represented by a state-of-the-art distributed virtual shared memory package, TreadMarks. We show that while communication time and busy times are comparable for small systems, the communication patterns are fundamentally different, leading to poor performance for TreadMarks-based applications when the number of processors increases. This is due to the request/reply technique used in TreadMarks that results in a large fraction of very small messages. However, if the application can be tuned to reduce the impact of small message communication it is possible to achieve acceptable performance at least up to 32 nodes. Our measurements also show that TreadMarks programs tend to cause a more even network load compared to MPI programs.

  • 48.
    Karlsson, Sven
    et al.
    KTH, Superseded Departments, Microelectronics and Information Technology, IMIT.
    Brorsson, Mats
    KTH, Superseded Departments, Microelectronics and Information Technology, IMIT.
    A free openmp compiler and run-time library infrastructure for research on shared memory parallel computing. 2004. In: Proceedings of the 16th IASTED International Conference on Parallel and Distributed Computing and Systems, ACTA Press, 2004, p. 354-361. Conference paper (Refereed)
    Abstract [en]

    OpenMP is an informal industry standard for programming parallel computers with a shared memory and has during the last few years achieved considerable acceptance in both the academic world and the industry. OpenMP is a thread level fork-join programming model and relies on a set of compiler directives. An OpenMP aware compiler uses these directives to generate a multi-threaded application. In practice, an OpenMP run-time library is also needed as OpenMP specifies a set of run-time library calls. In this paper we report on a free OpenMP compiler and run-time library infrastructure. We present an OpenMP compiler for C called OdinMP and briefly discuss the run-time library that the compiler targets. The source code to both the compiler and the run-time libraries is available and can be freely used for OpenMP research. The compilation system is evaluated using the EPCC micro-benchmark suite for OpenMP and a set of applications from the SPLASH-2 benchmark suite ported to OpenMP. Comparisons are made to OpenMP aware compiler systems from SGI and Intel. The performance of code generated with the presented compilation system is shown to be very close to or exceeding that of commercial compilers for a wide range of benchmark applications.

  • 49.
    Karlsson, Sven
    et al.
    KTH, School of Information and Communication Technology (ICT), Microelectronics and Information Technology, IMIT.
    Brorsson, Mats
    KTH, School of Information and Communication Technology (ICT), Communication: Services and Infrastucture, Software and Computer Systems, SCS.
    Priority Based Messaging for Software Distributed Shared Memory – Model and Implementation. 2001. Conference paper (Refereed)
  • 50.
    Karlsson, Sven
    et al.
    KTH, School of Information and Communication Technology (ICT), Microelectronics and Information Technology, IMIT.
    Lee, S. -W
    KTH, School of Information and Communication Technology (ICT), Microelectronics and Information Technology, IMIT.
    Brorsson, Mats
    KTH, School of Information and Communication Technology (ICT), Communication: Services and Infrastucture, Software and Computer Systems, SCS.
    A Fully Compliant OpenMP implementation on Software Distributed Shared Memory. 2002. Conference paper (Refereed)
    Abstract [en]

    OpenMP is a relatively new industry standard for programming parallel computers with a shared memory programming model. Given that clusters of workstations are a cost-effective solution to build parallel platforms, it would of course be highly interesting if the OpenMP model could be extended to these systems as well as to the standard shared memory architectures for which it was originally intended. We present in this paper a fully compliant implementation of the OpenMP specification 1.0 for C targeting networks of workstations. We have used an experimental software distributed shared memory system, CVM, to implement a run-time library which is the target of a source-to-source OpenMP translator also developed in this project. The system has been evaluated using an OpenMP microbenchmark suite used to evaluate the effect of some memory coherence protocol improvements. We have also used OpenMP versions of three Splash-2 applications, resulting in reasonable speedups on an IBM SP machine with eight nodes. This is the first study to investigate the subtle mechanisms of consistency in OpenMP on software DSM systems.
