International Research
Workshop on
Advanced High Performance Computing Systems
Cetraro (Italy)
Workshop Agenda
Monday, June 27th
Welcome Address

Session: Grid and Cloud Computing 1

  V. GETOV - Smart Cloud Computing: Autonomy, Intelligence and Adaptation
  P. MARTIN - Provisioning Data-Intensive Workloads in the Cloud
  J.L. LUCAS - A Multi-cloud Management Architecture and Early Experiences
  COFFEE BREAK
  D. PETCU - How is built a mosaic of Clouds
  M. KUNZE - Towards High Performance Cloud Computing (HPCC)
  R. MAINIERI - The future of cloud computing and its impact on transforming industries
  CONCLUDING REMARKS

Session: Grid and Cloud Computing 2

  B. SOTOMAYOR - Reliable File Transfers with Globus Online
  K. MIURA - RENKEI: A Light-weight Grid Middleware for e-Science Community
  COFFEE BREAK
  P. KACSUK - Supporting Scientific and Web-2 Communities by Desktop Grids
  L. LEFEVRE - Energy efficiency from networks to large scale distributed systems
  M. STILLWELL - Dynamic Fractional Resource Scheduling
  CONCLUDING REMARKS
Tuesday, June 28th
Wednesday, June 29th
Session: Advanced software issues for top scale HPC

  E. LAURE - CRESTA - Collaborative Research into Exascale Systemware, Tools and Applications
  T. LIPPERT - Amdahl hits the Exascale
  S. DOSANJH - On the Path to Exascale
  C. SIMMENDINGER - Petascale in CFD
  COFFEE BREAK

Session: Advanced Infrastructures, Projects and Applications

  PANEL DISCUSSION - Exascale Computing: from utopia to reality
  CONCLUDING REMARKS
INVITED SPEAKERS
R. Ansaloni - Cray
A. Benoit
G. Bosilca
E. D’Hollander
S. Dosanjh - Sandia National Laboratories
V. Getov
P. Kacsuk - MTA SZTAKI
H. Kaiser
M. Kunze - Karlsruhe Institute of Technology
E. Laure - Royal Institute of Technology (KTH)
L. Lefevre - INRIA RESO - LIP, France
T. Lippert - Juelich Supercomputing Centre
J.L. Lucas
R. Mainieri - IBM Italy
P. Martin - Queen’s University
L. Mirtaheri
K. Miura - Center for Grid Research and Development, National Institute of Informatics, Japan
C. Perez - INRIA - LIP, France
D. Petcu - Research Institute e-Austria
F. Pinel
A. Shafarenko
M. Sheikhalishahi
C. Simmendinger - T-Systems Solutions for Research GmbH
B. Sotomayor - Computation Institute
L. Sousa - INESC and TU Lisbon
M. Stillwell - INRIA
ABSTRACTS
R. Ansaloni
Cray’s Approach to Heterogeneous Computing
There seems to be a general consensus in the HPC community that exascale performance cannot be reached with systems based only on multi-core chips. Heterogeneous nodes, where the traditional CPU is combined with many-core accelerators, have the potential to provide a much more energy-efficient solution capable of overcoming the power consumption challenge.
However, this emerging hybrid node architecture is expected to pose significant challenges for application developers, who must program these systems efficiently in order to achieve a significant fraction of the available peak performance.
This is certainly the case for today’s GPU-based accelerators with a separate memory space, but it also holds true for future unified nodes with the CPU and a many-core accelerator on chip sharing a common memory.
In this talk I’ll describe Cray’s approach to
heterogeneous computing and the first Cray hybrid supercomputing system with
its unified programming environment.
I’ll also describe Cray’s proposal to extend
the OpenMP standard to support a wide range of
accelerators.
A. Benoit
Energy-aware mappings of series-parallel workflows onto chip multiprocessors
In this talk, we will study the problem of mapping streaming applications that can be modelled by a series-parallel graph onto a 2-dimensional tiled CMP architecture. The
objective of the mapping is to minimize the energy consumption, using dynamic
voltage scaling techniques, while maintaining a given level of performance,
reflected by the rate of processing the data streams. This mapping problem
turns out to be NP-hard, but we identify simpler instances, whose optimal
solution can be computed by a dynamic programming algorithm in polynomial time.
Several heuristics are proposed to tackle the general problem, building upon
the theoretical results. Finally, we assess the performance of the heuristics
through a set of comprehensive simulations.
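As background, the following standard first-order dynamic voltage scaling model (offered only as an illustration of the trade-off studied here, not as the talk's exact formulation) shows why slowing a tile down saves energy as long as the required stream processing rate is still met:

    P(f) ∝ f^3              (dynamic power, with supply voltage scaled roughly as V ∝ f)
    t(f) ∝ 1/f              (time to process one data item)
    E(f) = P(f) * t(f) ∝ f^2   (energy per data item)

Halving a tile's frequency therefore roughly quarters the energy it spends per item, at the cost of doubling its processing time.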
G. Bosilca
Flexible Development
of Dense Linear Algebra Algorithms on Heterogeneous Parallel Architectures with
DAGuE
In the context of dense linear algebra, algorithms that scale seamlessly to thousands of cores can be developed using DPLASMA (Distributed PLASMA). DPLASMA takes advantage of a novel generic distributed Direct Acyclic Graph Engine (DAGuE). The engine has been designed for fine-granularity tasks and thus enables tile algorithms, originating in PLASMA, to scale on large distributed-memory systems. The underlying DAGuE framework has many
appealing features when considering distributed-memory platforms with
heterogeneous multicore nodes: DAG representation
that is independent of the problem-size, automatic extraction of the
communication from the dependencies, overlapping of communication and
computation, task prioritization, and architecture-aware scheduling and
management of tasks.
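As background, tile Cholesky factorisation is a canonical example for PLASMA-style tile algorithms, and the sketch below (illustration only, not DAGuE's actual input format) simply enumerates its kernel tasks. The DAG an engine such as DAGuE executes consists of these tasks plus the data dependencies between the tiles they touch, and the loop nest that generates them is independent of the problem size.

    /* Illustration only: enumerate the kernel tasks of a tile Cholesky
     * factorisation. A DAG engine would derive the dependencies from the
     * tiles each call reads and writes; here we just print the tasks to
     * show how the graph unrolls from a compact loop nest. */
    #include <stdio.h>

    #define NT 4   /* number of tile rows/columns (toy size) */

    int main(void) {
        for (int k = 0; k < NT; k++) {
            printf("POTRF(A[%d][%d])\n", k, k);                    /* factor diagonal tile */
            for (int m = k + 1; m < NT; m++)
                printf("TRSM (A[%d][%d], A[%d][%d])\n", k, k, m, k);   /* panel solve */
            for (int m = k + 1; m < NT; m++) {
                printf("SYRK (A[%d][%d], A[%d][%d])\n", m, k, m, m);   /* diagonal update */
                for (int n = k + 1; n < m; n++)
                    printf("GEMM (A[%d][%d], A[%d][%d], A[%d][%d])\n",
                           m, k, n, k, m, n);                          /* trailing update */
            }
        }
        return 0;
    }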
E. D’Hollander
High-performance
computing for low-power systems
Intelligent low-power devices such as portable
phones, tablet computers, embedded systems and sensor networks require
low-power solutions for high-performance applications. GPUs
have a highly parallel multithreaded architecture and an efficient programming
model, but are power-hungry. On the other hand, field-programmable gate arrays (FPGAs) have a highly configurable parallel architecture and substantially better energy efficiency, but are difficult to program. An approach is presented which maps the GPU architecture and programming model onto the configuration synthesis and programming of FPGAs. Implementation details, benefits and trade-offs are discussed. In particular, architecture, memory and communication issues are addressed when porting a biomedical image application with a 20-fold GPU speedup onto an FPGA accelerator.
S. Dosanjh
On the path to Exascale
This presentation will describe technical and
programmatic progress in Exascale computing. The Exascale Initiative was included in the U.S. Department of
Energy’s budget starting in the U.S. Government’s 2012 fiscal year. Several
partnerships are forming, a number of projects have already been funded and
several co-design centers are being planned. These
co-design centers will develop applications for Exascale systems and will provide feedback to computer
companies on the impact of computer architecture changes on application
performance. An enabling technology for these efforts is the Structural
Simulation Toolkit (SST), which allows hardware/software co-simulation. Another
key aspect of this work is the development of mini-applications. One difficulty
of co-design in high performance computing (HPC) is the complexity of HPC
applications, many of which have millions of lines of code. Mini-applications, which are typically about a thousand lines of code, have the potential to reduce the complexity of co-design by a factor of one thousand. Mini-applications representative of finite elements, molecular dynamics, contact algorithms, and
shock physics are described. The performance of these mini-applications on
different computer systems is compared to the performance of the full
application.
V. Getov
Smart Cloud Computing:
Autonomy, Intelligence and Adaptation
In recent years, cloud computing has rapidly
emerged as a widely accepted computing paradigm. The cloud computing paradigm emerged
shortly after the introduction of the “invisible” grid concepts. The research
and development community has quickly reached consensus on the core cloud
properties such as on-demand computing resources, elastic scaling, elimination
of up-front capital and operational expenses, and establishing a pay-as-you-go
business model for computing and information technology services. With the
widespread adoption of virtualization, service-oriented architectures, and
utility computing, there is also consensus on the enabling technologies
necessary to support this new consumption and delivery model for information
technology services. Additionally, the need to meet quality-of-service
requirements and service-level agreements, including security, is well understood.
Important limitations of current cloud computing systems include the lack of sufficient autonomy and intelligence based on dynamic non-functional properties. Such properties, together with support for adaptation, can completely change the quality of the computerised services provided by future cloud systems. In this presentation, we plan to address these issues and demonstrate the significant advantages that smart cloud computing platforms provide to users. Some of the available directions for future work are also discussed.
P. Kacsuk
Supporting scientific and Web-2 communities by desktop grids
Although the natures of scientific and Web-2 communities are different, both require more and more processing power to run compute-intensive applications for the sake of community members. Scientific communities typically need to run large parameter-sweep simulations that are ideal for both volunteer and institutional desktop grids. Web-2 communities use community portals like Facebook through which they organize their social relationships and activities. Such activities could also include time-consuming processing, like watermarking the photos of community members.
Both communities prefer to use affordable
distributed infrastructures in order to minimize the processing cost. Such a
low-cost infrastructure could be a volunteer or institutional desktop grid. The
EU EDGI project developed technology and infrastructure to support scientific
communities by desktop grids, while the Web2Grid Hungarian national project
provides desktop grid technology and the corresponding business model for Web-2
communities.
The talk will discuss the main characteristics of such desktop grid support and also show the major components of the supporting architecture. The application areas and the possible business models of using volunteer desktops will also be addressed in the talk.
H. Kaiser
ParalleX – A Cure for Scaling-Impaired
Parallel Applications
High Performance Computing is experiencing a phase change driven by the challenges of programming and managing heterogeneous multicore system architectures and large-scale system configurations. It is estimated that by the end of this decade exaflops computing systems, requiring hundreds of millions of cores and multi-billion-way parallelism at a power efficiency of around 50 Gflops/watt, may emerge. At the same time, there are many scaling-challenged applications that, although taking many weeks to complete, cannot scale even to a thousand cores using conventional distributed programming models. This talk describes an experimental methodology, ParalleX, that addresses these challenges through a change in the fundamental model of parallel computation from communicating sequential processes (e.g. MPI) to an innovative synthesis of concepts involving message-driven work-queue
computation in the context of a global address space. We will present early but
promising results of tests using a proof-of-concept runtime system
implementation guiding future work towards full scale parallel programming.
M. Kunze
Towards High
Performance Cloud Computing (HPCC)
Today’s HPC clusters are typically operated and
administrated by a single organization. Demand is fluctuating, however,
resulting in periods where dedicated resources are either underutilized or
overloaded. A cloud-based Infrastructure-as-a-Service (IaaS) model for HPC promises cost savings and more flexibility, as it allows users to move away from physically owned and potentially underutilized HPC clusters to virtualized and elastic HPC resources available on demand from consolidated large cloud computing providers.
The talk discusses specific issues raised by the introduction of a resource virtualization layer in HPC environments, such as latency, jitter and performance.
E. Laure
CRESTA - Collaborative
Research into Exascale Systemware,
Tools and Applications
For the past thirty years, the need for ever
greater supercomputer performance has driven the development of many computing
technologies which have subsequently been exploited in the mass market.
Delivering an exaflop (or 10^18 calculations per
second) by the end of this decade is the challenge that the supercomputing
community worldwide has set itself. The Collaborative Research into Exascale Systemware, Tools and
Applications project (CRESTA) brings together four of Europe’s leading
supercomputing centres, with one of the world’s major equipment vendors, two of
Europe’s leading programming tools providers and six application and problem
owners to explore how the exaflop challenge can be
met.
CRESTA focuses on the use of six applications
with exascale potential and uses them as co-design
vehicles to develop: the development environment, algorithms and libraries,
user tools, and the underpinning and cross-cutting technologies required to
support the execution of applications at the exascale.
The applications represented in CRESTA have been chosen as a representative
sample from across the supercomputing domain including: biomolecular
systems, fusion energy, the virtual physiological human, numerical weather
prediction and engineering.
No one organisation, be it a hardware or software vendor or a service provider, can deliver the necessary range of technological innovations required to enable computing at the exascale. This is recognised through the on-going work of the International Exascale Software Project.
In this talk we will give an overview of
CRESTA, outline the challenges we face in reaching exascale
performance and how CRESTA intends to respond to them.
L. Lefevre
Energy efficiency from
networks to large scale distributed systems
Energy efficiency is starting to be widely addressed for distributed systems such as Grids, Clouds and networks. These large-scale distributed systems need an ever-increasing amount of energy and urgently require effective and scalable solutions to manage and limit their electrical consumption.
The challenge is to coordinate all low-level improvements at the middleware level to improve the energy efficiency of the overall system. Resource-management solutions can indeed benefit from a broader view to pool resources and to share them according to the needs of each user. During this talk, I will describe some solutions adopted for large-scale monitoring of distributed infrastructures. The talk will present our work on energy-efficient approaches for reservation-based large-scale distributed systems. I will present the ERIDIS model, an Energy-efficient Reservation Infrastructure for large-scale DIstributed Systems, which provides a unified and generic framework to manage resources from Grids, Clouds and dedicated networks in an energy-efficient way.
T. Lippert
Amdahl hits the Exascale
With the advent of petascale supercomputers, the scalability of scientific application codes on such systems becomes a most pressing issue. The current world record holder as far as the number of concurrent cores is concerned, the IBM Blue Gene/P system "JUGENE" at the Jülich Supercomputing Centre with 294,912 cores, will soon be displaced by systems comprising millions of cores. In this talk I am going to review the constraints put on scalability by Amdahl's and Gustafson's laws. I will propose architectural concepts that are optimized for the concurrency hierarchies of application codes, and I will give a glimpse of the DEEP Exascale supercomputer project, to be funded by the European Community, which explicitly addresses concurrency hierarchies at the hardware, system software and application software levels.
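For reference, the two laws the talk revisits can be written as follows, with s the serial (non-parallelisable) fraction of a code and N the number of cores:

    Amdahl (fixed problem size):     S(N) = 1 / (s + (1 - s)/N)  <=  1/s
    Gustafson (scaled problem size): S(N) = s + (1 - s) * N

With N in the hundreds of millions, even a serial fraction of one part per million caps the Amdahl speedup at about one million, whereas the Gustafson view keeps growing as long as the problem size can be scaled with the machine.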
J.L. Lucas
A Multi-cloud
Management Architecture and Early Experiences
In this talk we present a
cloud broker architecture for deploying virtual infrastructures across
multiple IaaS clouds. We analyse the main challenges
of the brokering problem in multi-cloud environments, and we propose different
scheduling policies, based on several criteria, that can guide the brokering
decisions. Moreover, we present some preliminary results that demonstrate the benefits of this broker architecture for the execution of virtualized computing clusters in multi-cloud environments.
R. Mainieri
The future of cloud
computing and its impact on transforming industries
In its centennial year, IBM demonstrated that long-term success requires vision, strategy and managing for the long term: deciding how and where to invest and allocate resources, shaping talent development and taking decisive action. Three years ago IBM started talking about the Smarter Planet and how it was driving innovation across industries. On a Smarter Planet, successful companies think differently about computing and build IT infrastructure that is designed for data, tuned to the task, and managed in the cloud.
The talk will illustrate IBM's cloud computing vision, strategy and long-term management plan: smarter computing for a smarter planet. It will discuss the resources invested in research and development, present the most important global projects, and show how specific actions, such as laboratories around the world, new cloud data centers, software company acquisitions and fostering the adoption of open standards, are going to lead to a sustainable transformation in specific industries.
P. Martin
Provisioning Data-Intensive Workloads in the Cloud
Data-intensive workloads involve significant amounts of data access. The
individual requests composing these workloads can vary from complex processing
of large data sets, such as in business analytics and OLAP workloads, to small
transactions randomly accessing individual records within the large data sets,
such as in OLTP workloads. In the cloud, applications generating these workloads
may be built on different frameworks from shared-nothing database management
systems to MapReduce or even some mix of the two. We
believe that effective provisioning methods for data-intensive workloads in the
cloud must consider where to place the data in the cloud when they are
allocating resources to the workloads.
In the talk, I will provide an overview of an approach that provisions a
workload in a public cloud while simultaneously placing the data in an optimal
configuration for the workload. We solve this data placement problem by solving
two subproblems, namely how to first partition the
data to suit the workload and then how to allocate data partitions to virtual
machines in the cloud.
L. Mirtaheri
An Algebraic Model for
a Runtime High Performance Computing Systems Reconfiguration
Tailored High Performance Computing Systems (HPCS) deliver the best performance because their configuration is customized to the features of the problem to be solved. 21st-century processes, however, are dynamic in nature, either because the dimensions of today's problems are not determined in advance or because the underlying platform itself is dynamic. A drawback of this dynamicity is that systems customized at the design phase face challenges at runtime and consequently show worse performance. The reason for these challenges may be that processes with a dynamic nature pull in a direction opposite to that of the system configuration. Many approaches, such as dynamic reconfiguration with dynamic load balancing, have been introduced to solve these challenges. In this talk, I will present a mathematical model based on vector algebra for system reconfiguration. This model determines the element (process) causing the opposition and discovers the reason for it, with respect to both software and hardware, at runtime. Results of the presented model show that by defining a general status vector, whose direction points towards high performance and whose size is based on the initial features and explicit requirements of the problem, and by defining a vector for each process in the problem at runtime, we can trace changes in the directions and find out the reason as well.
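As a purely illustrative reading of this vector formulation (the talk's actual model may define these quantities differently): if s denotes the general status vector and p_i the runtime vector of process i, the degree of opposition can be quantified by the angle between them,

    cos(theta_i) = (p_i . s) / (|p_i| |s|)

so a process whose cosine is close to 1 is aligned with the target configuration, while a negative value flags it as an element pulling the system away from high performance.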
K. Miura
RENKEI: A Light-weight
Grid Middleware for e-Science Community
The “RENKEI (Resources Linkages for e-Science)
Project” started in September 2008 under the auspices of the Ministry of
Education, Culture, Sports, Science and Technology (MEXT). In this project, a
new light-weight grid middleware and software tools are developed, in order to
provide the user-friendly connection between the major grid environment and
users’ local computing environment. In particular, technology for enabling flexible and seamless access to resources at the national computing center level and at the departmental/laboratory level, such as computers, storage and databases, is one of the key objectives.
this project is “interoperability” with the major international grids along the
line of OGF standardization activities, such as GIN, PGI, SAGA and RNS.
With the RENKEI workflow tool, users can, for example, submit jobs from the local environment, or even from a cloud, to the TSUBAME2 supercomputer system at the Tokyo Institute of Technology via the networking infrastructure called “SINET4”.
http://www.naregi.org/index_e.html
C. Perez
Resource management system
for complex and non-predictably evolving applications
High-performance scientific applications are
becoming increasingly complex, in particular because of the coupling of
parallel codes. This results in applications having a complex structure, characterized
by multiple deploy-time parameters, such as the number of processes of each
code. In order to optimize the performance of these applications, the
parameters have to be carefully chosen, a process which is highly resource
dependent. Moreover, some applications are (non-predictably) changing their
resource requirements during their execution.
Abstractions provided by current Resource Management Systems (RMS) appear insufficient to efficiently select resources for such applications. This talk will discuss CooRM, an RMS architecture that supports such applications. It will also show how applications can benefit from it to achieve more efficient resource usage.
D. Petcu
How is built a mosaic
of Clouds
The developers of Cloud-compliant applications face the dilemma of which Cloud provider API to select, knowing that this decision will later lead to provider dependence. mOSAIC (www.mosaic-cloud.eu) addresses this issue by proposing a vendor- and language-independent API for developing Cloud-compliant applications. Moreover, it promises to build a Platform-as-a-Service solution that will allow Cloud services to be selected at run-time from multiple offers, based on semantic processing and agent technologies.
The presentation will focus on the problems
raised by implementing the Sky computing concept (cluster of Clouds), the
issues of Virtual Cluster deployment on top of multiple Clouds, and the
technical solutions that were adopted by mOSAIC.
F. Pinel
Utilizing GPUs to Solve Large Instances of the Tasks Mapping Problem
In this work, we present and analyze a local
search algorithm designed to solve large instances of the independent tasks
mapping problem. The genesis of the algorithm is the sensitivity analysis of a
cellular genetic algorithm, which illustrates the benefits of such an analysis
for algorithmic design activities.
Moreover, to solve instances of up to 65,536
tasks over 2,048 machines and to achieve scalability, the local search is
accelerated by utilizing a GPU. The proposed local search algorithm improves
the results of other well-known algorithms in the modern literature.
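For readers unfamiliar with the underlying problem, the sketch below illustrates the basic ingredient such algorithms iterate on: given an expected-time-to-compute (ETC) matrix, repeatedly move a task off the most loaded machine whenever that reduces the makespan. It is only a toy illustration; the GPU-accelerated local search of this work is far more elaborate, and everything here except the instance dimensions quoted in the abstract is invented.

    /* Toy local search for the independent-tasks mapping problem:
     * minimise the makespan of a task-to-machine assignment. */
    #include <stdio.h>
    #include <stdlib.h>

    #define T 8   /* tasks    (real instances in the talk use 65,536) */
    #define M 3   /* machines (real instances use 2,048)              */

    double etc[T][M];   /* etc[t][m]: expected time to compute task t on machine m */
    int    map[T];      /* current mapping: task -> machine                        */
    double load[M];     /* completion time per machine                             */

    double makespan(void) {
        double ms = 0.0;
        for (int m = 0; m < M; m++) if (load[m] > ms) ms = load[m];
        return ms;
    }

    /* Try moving one task off the most loaded machine; keep the first move
     * that strictly reduces the makespan. Returns 1 if a move was applied. */
    int improve_once(void) {
        int worst = 0;
        for (int m = 1; m < M; m++) if (load[m] > load[worst]) worst = m;
        double best = makespan();
        for (int t = 0; t < T; t++) {
            if (map[t] != worst) continue;
            for (int m = 0; m < M; m++) {
                if (m == worst) continue;
                load[worst] -= etc[t][worst];
                load[m]     += etc[t][m];
                if (makespan() < best) { map[t] = m; return 1; }  /* accept */
                load[worst] += etc[t][worst];                     /* undo   */
                load[m]     -= etc[t][m];
            }
        }
        return 0;
    }

    int main(void) {
        srand(1);
        for (int t = 0; t < T; t++) {
            for (int m = 0; m < M; m++) etc[t][m] = 1.0 + rand() % 10;
            map[t] = t % M;
        }
        for (int t = 0; t < T; t++) load[map[t]] += etc[t][map[t]];
        while (improve_once()) ;
        printf("final makespan: %.1f\n", makespan());
        return 0;
    }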
A. Shafarenko
New-Age Component
Linking: Compilers Must Speak Constraints
This presentation will focus on the agenda of
the FP7 project ADVANCE. The project is seeking to redefine the concept of
component technology by investigating the possibility of exporting out of a
component not only interfaces, but functional and extrafunctional
constraints as well. The new, rich component interface requires a hardware
model for the aggregation and resolution of constraints, but if that is
available, then a much more targeted approach can be defined for compiling
distributed applications down to heterogeneous architectures.
Constraint aggregation can deliver the missing
global (program-wide) intelligence to a component compiler and enable it to
tune up the code for alternative hardware, communication harness or memory
model.
The talk will discuss these ideas in some
detail and provide a sketch of a Constraint Aggregation Language, developed in
the project.
M. Sheikhalishahi
Resource Management
and Green Computing
In this talk, we review the green and performance aspects of resource management. The components of a resource management system are explored in detail to seek new developments, by exploiting contemporary emerging technologies, computing paradigms, energy-efficient operations, etc., in order to define, design and develop new metrics, techniques, mechanisms, models, policies, and algorithms. In addition, the modeling of relationships within and between various layers is considered to present some novel approaches. In particular, as a case study we define and model a resource contention metric and consequently develop two energy-aware consolidation policies.
C. Simmendinger
Petascale in CFD
In this talk we outline a highly scalable and
also highly efficient PGAS implementation for the CFD solver TAU.
TAU is an unstructured RANS CFD solver and one of the key applications in the European aerospace ecosystem. We show that our
implementation is able to scale to petascale systems
within the constraints of a single regular production run.
To reach this goal, we have implemented a novel
approach for shared memory parallelization, which is based on an asynchronous
thread pool model. Due to its asynchronous operation, the model is implicitly load-balanced, free of global barriers, and allows for a near-optimal overlap of communication and computation. We have complemented this model with
an asynchronous global communication strategy, in which we made use of the PGAS
API of GPI.
We briefly outline this strategy and show first
results.
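To make the thread-pool idea concrete, the following minimal C/POSIX-threads sketch (not the TAU/GPI implementation; all names and sizes are invented) shows workers pulling work items from a shared pool: a thread that finishes a cheap task simply grabs the next one, so load balancing is implicit and no global barrier is required.

    /* Minimal asynchronous thread-pool sketch: compile with -lpthread. */
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    #define NTASKS   16
    #define NWORKERS 4

    static int next_task = 0;                       /* shared work counter */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void do_task(int t, int worker) {
        usleep((t % 3 + 1) * 1000);                 /* stand-in kernel with uneven cost */
        printf("worker %d finished task %d\n", worker, t);
    }

    static void *worker(void *arg) {
        int id = *(int *)arg;
        for (;;) {
            pthread_mutex_lock(&lock);
            int t = next_task < NTASKS ? next_task++ : -1;
            pthread_mutex_unlock(&lock);
            if (t < 0) return NULL;                 /* no barrier: stop when pool is empty */
            do_task(t, id);
        }
    }

    int main(void) {
        pthread_t th[NWORKERS];
        int ids[NWORKERS];
        for (int i = 0; i < NWORKERS; i++) {
            ids[i] = i;
            pthread_create(&th[i], NULL, worker, &ids[i]);
        }
        for (int i = 0; i < NWORKERS; i++) pthread_join(th[i], NULL);
        return 0;
    }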
B. Sotomayor
Reliable File Transfers with Globus Online
File transfer is both a critical and
frustrating aspect of high-performance computing. For a relatively mundane
task, moving terabytes of data reliably and efficiently can be surprisingly
complicated. One must discover endpoints, determine available protocols, negotiate
firewalls, configure software, manage space, negotiate authentication,
configure protocols, detect and respond to failures, determine expected and
actual performance, identify, diagnose and correct network misconfigurations,
integrate with file systems, and handle a host of other things. Automating these tasks makes users’ lives much, much easier.
In this presentation I will provide a technical
overview of Globus Online: a fast, reliable file
transfer service that simplifies large-scale, secure data movement without
requiring construction of custom end-to-end systems. The presentation will
include a demonstration as well as highlights from several user case studies.
L. Sousa
Distributed computing
on highly heterogeneous systems
The approaches used in traditional heterogeneous distributed computing to achieve efficient execution across a set of architecturally similar compute nodes (such as CPU-only distributed systems) are only partially applicable to systems with a high degree of architectural heterogeneity, for example clusters of multi-core CPUs equipped with specialized accelerators/co-processors such as GPUs. This is mainly because efficient load-balancing decisions must also be made at the level of each compute node, in addition to the decisions made at the overall system level.
In this work, we propose a method for dynamic
load balancing and performance modeling for
heterogeneous distributed systems, when all available compute nodes and all
devices in compute nodes are employed for collaborative execution. Contrary to
the common practice in task scheduling, we do not make any pre-execution
assumptions to ease the modeling of either the
application or the system. The heterogeneous system is modeled
as it is, by monitoring and recording the behavior of
the essential parts affecting the performance.
The parallel execution requires explicit data
transfers to be performed prior to and after any actual computation. In order
to exploit the concurrency between data transfers and computation, we
investigate herein the processing in an iterative multi-installment divisible-load space at both the overall-system and compute-node levels. Namely, the proposed approach dispatches the load as many sub-loads, whose sizes are carefully determined to allow the best overlap between communication and computation. The load division is performed according to several factors: i) the current performance models (per-device and per-node), ii) the modeled bidirectional interconnection bandwidths (between compute nodes and between devices in each compute node), and iii) the amount of concurrency supported by the node/device hardware.
The problem that we tackle herein is how to find a task distribution such that the overall application makespan is the shortest possible according to the current performance models of devices,
interconnections and compute nodes. Performance models are application-centric
piece-wise linear approximations constructed during the application runtime to
direct further load-balancing decisions according to the exact task
requirements.
The proposed approach is evaluated in a real
distributed environment consisting of quad-core CPU+GPU nodes, for iterative
scientific applications, such as matrix multiplication (DGEMM), and Fast
Fourier 2D batch Transform (FFT). Due to the ability to overlap the execution of several sub-loads, our approach results in more accurate performance models compared to current state-of-the-art approaches.
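As a simplified illustration of how sub-load sizes can be derived from performance models (this is not the authors' code; the device list and throughput figures are invented, and the real models are piece-wise linear, per-application and bandwidth-aware), the snippet below splits one installment across the devices of a node in proportion to their modelled throughput, so that all devices finish at roughly the same time.

    /* Illustration only: proportional load division within one compute node. */
    #include <stdio.h>

    #define D 3   /* devices in one node, e.g. one CPU and two GPUs (assumed) */

    int main(void) {
        double throughput[D] = { 40.0, 250.0, 250.0 };  /* modelled items/s (invented) */
        long   subload = 100000;                        /* items in this installment   */

        double total = 0.0;
        for (int d = 0; d < D; d++) total += throughput[d];

        for (int d = 0; d < D; d++) {
            long share = (long)(subload * throughput[d] / total + 0.5);
            printf("device %d: %ld items, ~%.2f s of compute\n",
                   d, share, share / throughput[d]);
        }
        return 0;
    }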
M. Stillwell
Dynamic Fractional
Resource Scheduling
Dynamic Fractional Resource Scheduling is a
novel approach for scheduling jobs on cluster computing platforms. Its key
feature is the use of virtual machine technology to share \emph{fractional}
node resources in a precise and controlled manner. Our previous work focused on
the development of task placement and resource allocation heuristics to
maximize an objective metric correlated with job performance, and our results
were based on simulation experiments run against real traces and established
models. We are currently performing a new round of experiments using synthetic
workloads that launch parallel benchmark applications in multiple virtual
machine instances on a real cluster. Our goals are to see how well our ideas
work in practice and determine how they can be improved, and to develop
empirically validated models of the interaction between resource allocation
decisions and application performance.
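For illustration only: one objective metric used in the fractional-scheduling literature is the yield of a job, the resource fraction actually allocated to it divided by the fraction it could use, with the scheduler aiming to maximise the minimum yield. The abstract describes its metric only as correlated with job performance, so the sketch below should be read as an example of such a metric, not as the talk's definition; all numbers in it are invented.

    /* Illustration only: per-job yield and the minimum yield of an allocation. */
    #include <stdio.h>

    int main(void) {
        double needed[]    = { 0.50, 1.00, 0.25 };  /* CPU fraction each job could use   */
        double allocated[] = { 0.40, 0.60, 0.25 };  /* fraction granted, e.g. via VM caps */
        int n = sizeof needed / sizeof needed[0];

        double min_yield = 1.0;
        for (int i = 0; i < n; i++) {
            double y = allocated[i] / needed[i];
            if (y < min_yield) min_yield = y;
            printf("job %d: yield %.2f\n", i, y);
        }
        printf("minimum yield: %.2f\n", min_yield);
        return 0;
    }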