Abstracts and Talk Materials

Linearly scaling algorithms will be crucial for the problem sizes that
will be tackled in capability exascale systems. It is interesting to
note that many of the most successful algorithms are hierarchical in
nature, such as multi-grid methods and fast multipole methods (FMM). We
have been leading development efforts for open-source FMM software for
some time, and recently produced GPU implementations of the various
computational kernels involved in the FMM algorithm. Most recently, we
have produced a multi-GPU code, and performed scalability studies
showing high parallel efficiency in strong scaling. These results have
pointed to several features of the FMM that make it a particularly
favorable algorithm for the emerging heterogeneous, many-core
architectural landscape. We propose that the FMM algorithm offers
exceptional opportunities to enable exascale applications. Among its
exascale-suitable features are: (i) it has intrinsic geometric
locality, and access patterns are made local via particle indexing
techniques; (ii) we can achieve temporal locality via an efficient
queuing of GPU tasks before execution, and at a fine level by means of
memory coalescing based on the natural index-sorting techniques; (iii)
global data communication and synchronization, often a significant
impediment to scalability, is a soft barrier for the FMM, where the
most time-consuming kernels are, respectively, purely local
(particle-to-particle interactions) and "hierarchically synchronized"
(multipole-to-local interactions, which happen simultaneously at every
level of the tree). In addition, we suggest a strategy for achieving
the best algorithmic performance, based on two key ideas: (i)
hybridize the FMM with treecode by choosing on-the-fly between
particle-particle, particle-box, and box-box interactions, according to
a work estimate; (ii) apply a dynamic error-control technique, effected
on the treecode by means of a variable "box-opening angle" and on the
FMM by means of a variable order of the multipole expansion. We have
carried out preliminary implementation of these ideas/techniques,
achieving a 14x speed-up with respect to our current published version
of the FMM. Considering that this effort was only exploratory, we are
confident that these algorithms have the potential for unprecedented
performance.
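
The particle indexing mentioned in (i) can be illustrated with a Morton (Z-order) key. The abstract does not say which indexing scheme is used, so the scheme below is an assumption, and the names `interleave_bits`, `morton_key` and `sort_by_morton` are hypothetical. This is a minimal 2D sketch in Python, not the authors' implementation.

```python
def interleave_bits(x, y, bits=16):
    """Interleave the low `bits` bits of x and y into one Morton key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)        # x bits go to even positions
        key |= ((y >> i) & 1) << (2 * i + 1)    # y bits go to odd positions
    return key


def morton_key(px, py, level):
    """Key of the box containing a point of the unit square at a tree level."""
    n = 1 << level                       # boxes per side at this level
    ix = min(int(px * n), n - 1)         # clamp points lying exactly on 1.0
    iy = min(int(py * n), n - 1)
    return interleave_bits(ix, iy, bits=level)


def sort_by_morton(points, level):
    """Sort points so that particles in the same box are contiguous in memory."""
    return sorted(points, key=lambda p: morton_key(p[0], p[1], level))
```

With this ordering, dropping the two lowest bits of a key gives the key of the parent box, so particles that are close in the tree are also close in memory.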

Increases in computational power allow lattice field theories to resolve
smaller scales, but to realize the full benefit for scientific discovery,
new multi-scale algorithms must be developed to maximize efficiency.
Examples of new trends in algorithms include adaptive multigrid solvers
for the quark propagator and an improved symplectic Force Gradient
integrator for the Hamiltonian evolution used to include the quark
contribution to vacuum fluctuations in the quantum path integral. Future
challenges to algorithms and software infrastructure targeting many-core
GPU accelerators and heterogeneous extreme scale computing are discussed.

January 12, 2011

We discuss multiple strategies to perform general computations on unstructured grids using a GPU, with specific application to the assembly of systems of equations in finite element methods (FEMs). For each method, we discuss the GPU hardware's limiting resources, optimizations, key data structures, and the dependence of performance on problem size, element size, and GPU hardware generation. These methods are applied to a nonlinear hyperelastic material model to develop a large-scale real-time interactive elastodynamic visualization. By performing the assembly, solution, update, and visualization stages solely on the GPU, the simulation benefits from speed-ups in each stage and avoids costly GPU-CPU transfers of data.
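
To make the assembly step concrete, here is a minimal serial sketch in Python for linear elements on a 1D mesh. The 1D setting and the name `assemble_1d_stiffness` are assumptions chosen for brevity; the abstract concerns unstructured grids and does not specify its assembly strategies.

```python
def assemble_1d_stiffness(nodes):
    """Assemble the stiffness matrix of -u'' with linear elements on a 1D mesh."""
    n = len(nodes)
    K = [[0.0] * n for _ in range(n)]
    for e in range(n - 1):                  # one element per interval
        h = nodes[e + 1] - nodes[e]
        local = [[1.0 / h, -1.0 / h],
                 [-1.0 / h, 1.0 / h]]       # 2x2 element matrix
        dofs = (e, e + 1)                   # global indices of the element's nodes
        for a in range(2):
            for b in range(2):
                # Scatter-add into the global matrix. Neighbouring elements
                # write to shared entries, which is the conflict a parallel
                # version must resolve (atomics, coloring, or a gather).
                K[dofs[a]][dofs[b]] += local[a][b]
    return K
```

The element loop is the natural unit of parallel work; the scatter-add is where the choice of data structure matters most.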

Iterative sparse linear solvers are a critical component of a scientific computing platform. Developing effective preconditioning strategies is the main challenge in developing iterative sparse solvers on massively parallel systems. As computing systems become increasingly power-constrained, memory hierarchies for massively parallel systems will become deeper and more hierarchical. Parallel algorithms with all-to-all communication patterns that assume uniform memory access times will be inefficient on these systems. In this talk, I will outline the challenges of developing good parallel preconditioners, and demonstrate that domain decomposition methods have communication patterns that match emerging parallel platforms. I will present recent work to develop restricted additive Schwarz (RAS) preconditioners as part of the open source 'cusp' library of sparse parallel algorithms. On 2D Poisson problems, a RAS preconditioner is consistently faster than diagonal preconditioning in time-to-solution. Detailed analysis demonstrates that the communication pattern of RAS matches the on-chip bandwidths of a Fermi GPU. Line smoothing, which requires solving a large number of small tridiagonal linear systems in local memory, is another preconditioning approach with similar communication patterns. I will conclude with a roadmap for developing a range of preconditioners, smoothers, and linear solvers on massively parallel hardware based on the domain decomposition and line smoothing approaches.
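
The small tridiagonal systems that line smoothing requires can each be solved with the Thomas algorithm. The sketch below is a serial Python illustration under that assumption; the abstract does not name the tridiagonal solver used, and `solve_tridiagonal` and `solve_batch` are hypothetical names, not part of the 'cusp' library.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Thomas algorithm. lower[0] and upper[-1] are unused padding entries."""
    n = len(diag)
    c = [0.0] * n                        # modified super-diagonal
    d = [0.0] * n                        # modified right-hand side
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):                # forward elimination
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):       # back substitution
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def solve_batch(systems):
    """Each (lower, diag, upper, rhs) system is independent of the others."""
    return [solve_tridiagonal(*s) for s in systems]
```

Because the systems in a batch share no data, each one could be assigned to its own group of threads and kept in fast local memory.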

January 10, 2011

In this talk we examine how high performance computing has changed over the last 10 years and look toward the future in terms of trends. These changes have had and will continue to have a major impact on our software. Some of the software and algorithm challenges have already been encountered, such as management of communication and memory hierarchies through a combination of compile-time and run-time techniques, but the increased scale of computation, depth of memory hierarchies, range of latencies, and increased run-time environment variability will make these problems much harder.
We will look at five areas of research that will have an important impact on the development of software and algorithms.
We will focus on the following themes:

- Redesign of software to fit multicore architectures
- Automatically tuned application software
- Exploiting mixed precision for performance
- The importance of fault tolerance
- Communication avoiding algorithms

Anne C. Elster (Norwegian University of Science and Technology (NTNU))

http://www.idi.ntnu.no/~elster/

Collaborators: Frank Linseth, Holger Ludvigsen, Erik Smistad and Thor Kristian Valgerhaug

GPUs offer a lot of compute power, enabling real-time processing of images. This poster depicts some of our group's recent work on image processing for medical applications on GPUs, including 3D surface extraction using marching cubes and 3D ultrasound reconstruction. We have previously developed Cg and CUDA codes for wavelet transforms and CUDA codes for surface extraction for seismic images.

Anne C. Elster (Norwegian University of Science and Technology (NTNU))

http://www.idi.ntnu.no/~elster/

January 14, 2011

GPUs are now massive floating-point stream processors that offer a source of energy-efficient compute power on our laptops and desktops. Recent development of tools such as CUDA and OpenCL has made it much easier to utilize the computational power these systems offer. However, in order to optimally harness the power of these GPU-based systems, there are still many challenges to overcome.
In this talk, we will discuss several issues from our experience with medical and geological processing applications that can benefit from real-time processing of data on GPUs. These include real-time medical imaging, e.g. for ultrasound-guided discovery and surgery, real-time seismic CT image enhancement, and using GPUs for real-time compression of seismic data in order to lower I/O latency. This talk will highlight work our research group has been involved in from 2006 through today.

The research objective of this work is to develop a new dedicated and
massively parallel tool for efficient simulation of unsteady nonlinear
free surface waves. The tool will be used for applications in coastal and
offshore engineering, e.g. in connection with prediction of wave
kinematics and forces at or near human-made structures. The tool is based
on a unified potential flow formulation which can account for fully
nonlinear and dispersive wave motion over uneven depths under the
assumptions of nonbreaking waves, irrotational and inviscid flow.
This work is a continuation of earlier work and will continue to
contribute to advancing the state of the art in efficient wave simulation.
The tool is expected to be orders of magnitude faster than current tools
due to efficient algorithms and utilization of available hardware
resources.

GPULab - a competence center and laboratory for research and collaboration
within academia and with partners in industry - was established in 2008 at
the Section for Scientific Computing, DTU Informatics, Technical University of
Denmark. In GPULab we focus on the utilization of Graphics Processing
Units (GPUs) for high-performance computing applications and software
tools in science and engineering, inverse problems, visualization,
imaging, and dynamic optimization. The goals are to contribute to the
development of new state-of-the-art mathematical models and algorithms for
maximum throughput performance, improved performance profiling tools, and
dissemination of results to academic and industrial partners in our
network. Our approach calls for multi-disciplinary skills and an
understanding of hardware, software development, profiling tools and
tuning techniques, and analytical methods for the analysis and development
of new approaches, together with expert knowledge in specific application
areas within science and engineering. We anticipate that our research will,
in the near future, bring new algorithms and insight to engineering and
science applications targeting practical engineering problems.

Geoffrey Charles Fox (Indiana University)

http://www.informatics.indiana.edu/people/profiles.asp?u=gcf

1) We analyze the different tradeoffs and goals of Grid, Cloud and parallel (cluster/supercomputer) computing.
2) They trade off performance, fault tolerance, ease of use (elasticity), cost, and interoperability.
3) Different application classes (characteristics) fit different architectures, and we describe a hybrid model with Grids for data, traditional supercomputers for large-scale simulations, and clouds for broad-based "capacity computing" including many data-intensive problems.
4) We discuss the impressive features of cloud computing platforms and compare MapReduce and MPI.
5) We take most of our examples from the life science area.
6) We conclude with a description of FutureGrid -- a TeraGrid system for prototyping new middleware and applications.

Joint work with Felix Kwok.

All domain decomposition methods are based on a decomposition of the physical domain into many subdomains and an iteration, which uses subdomain solutions only (and maybe a coarse grid), in order to compute an approximate solution of the problem on the entire domain. We show in this poster that it is possible to formulate such an iteration, only based on subdomain solutions, which converges in two steps to the solution of the underlying problem, independently of the number of subdomains and the PDE solved. This method is mainly of theoretical interest, since it contains sophisticated non-local operators (and a natural coarse grid component), which need to be approximated in order to obtain a practical method.

Joint work with Steven F. Wojtkiewicz (Department of Civil Engineering, University of Minnesota, Minneapolis, MN 55414, USA. bykvich@umn.edu).

Graphics processing units (GPUs) have emerged as a much more economical and highly competitive alternative to CPU-based parallel computing. Recent studies have shown that GPUs consistently outperform their best corresponding CPU-based parallel computing equivalents by up to two orders of magnitude in certain applications. Moreover, the portability of GPUs enables even a desktop computer to provide a teraflop (10^12 floating point operations per second) of computing power. This study presents the gains in computational efficiency obtained using GPU-based implementations of five types of algorithms frequently used in uncertainty quantification problems arising in the analysis of dynamical systems with uncertain parameters and/or inputs.

Based on an MPI library written over 10 years ago, OP2 is a new open-source library which is aimed at application developers using unstructured grids. Using a single API, it targets a variety of backend architectures, including both manycore GPUs and multicore CPUs with vector units. The talk will cover the API design, key aspects of the parallel implementation on the different platforms, and preliminary performance results on a small but representative CFD test code.

Dominik Göddeke (Universität Dortmund)

http://www.mathematik.uni-dortmund.de/~goeddeke/

Robert Strzodka (Max-Planck-Institut für Informatik)

http://www.mpi-inf.mpg.de/~strzodka/

We present efficient fine-grained parallelization techniques for robust multigrid solvers and Krylov subspace schemes, in particular for numerically strong smoothing and preconditioning operators. We apply them to sparse ill-conditioned linear systems of equations that arise from grid-based discretization techniques like finite differences, volumes and elements; the systems are notoriously hard to solve due to severe anisotropies in the underlying mesh and differential operator. These strong smoothers are characterized by sequential data dependencies, and do not parallelize in a straightforward manner. For linewise preconditioners, exact parallel algorithms exist, and we present a novel, efficient implementation of a cyclic reduction tridiagonal solver. For other preconditioners, traditional wavefront techniques can be applied, but their irregular and limited parallelism makes them a bad match for GPUs. Therefore, we discuss multicoloring techniques to recover parallelism in these preconditioners, by decoupling some of the dependencies at the expense of initially reduced numerical performance. However, by carefully balancing the coupling strength (more colors) with the parallelization benefits, the multicolored variants retain almost all of the sequential numerical performance. Further improvements are achieved by merging the tridiagonal and Gauß-Seidel approach into a smoothing operator that combines their advantages, and by employing an alternating direction implicit scheme to gain independence of the numbering of the unknowns. Due to their advantageous numerical properties, multigrid solvers equipped with strong smoothers are between four and eight times more efficient than with simple Gauß-Seidel preconditioners, and we achieve speedup factors between six and 18 with the GPU implementations over carefully tuned CPU variants.
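
The multicoloring idea can be illustrated with the simplest case, a two-color (red-black) Gauß-Seidel sweep for the 1D Poisson problem. This Python sketch is an assumption made for illustration: the abstract does not give its coloring scheme or model problem, and `red_black_gauss_seidel` is a hypothetical name.

```python
def red_black_gauss_seidel(u, f, h, sweeps):
    """Smooth -u'' = f on a uniform 1D grid with spacing h.

    u[0] and u[-1] hold fixed boundary values; u is updated in place.
    """
    n = len(u)
    for _ in range(sweeps):
        for start in (1, 2):             # odd interior points, then even ones
            # Points of one color depend only on points of the other color,
            # so every update in this inner loop could run in parallel.
            for i in range(start, n - 1, 2):
                u[i] = 0.5 * (u[i - 1] + u[i + 1] + h * h * f[i])
    return u
```

Compared with a lexicographic sweep, the coloring changes the order in which dependencies are resolved, which is the trade-off between coupling strength and parallelism that the abstract describes.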

January 13, 2011

Solvers for coupled multi-scale (multi-physics) problems may be constructed
by coupling an array of existing and well tested parallel numerical
solvers, each designed to tackle a problem at a different spatial and
temporal scale. Each solver can be optimized/designed for a different
computer architecture. Future supercomputers may be composed of
heterogeneous processing units, e.g., CPUs and GPUs. To make efficient
use of computational resources, the coupled solvers must support
topology-aware mapping of tasks to the processing units where the best
parallel efficiency can be achieved.

Arterial blood circulation is a multi-scale process where time and space scales range from nanoseconds (nanometers) to seconds (meters), respectively. The macro-vascular scales describing the flow dynamics in larger vessels are coupled to the meso-vascular scales that capture the dynamics of individual blood cells. The meso-vascular events are coupled to the micro-vascular ones, which account for blood perfusion, clot formation, adhesion of the blood cells to the arterial walls, etc. Besides the multi-scale nature of the problem, its size often presents a substantial computational challenge even for simulations considering a single scale.

In this talk we will try to envision the design of a multi-scale solver for blood flow simulations, tailored to heterogeneous computer architecture.

Joint work with J. Insley, M. Papka, and G. E. Karniadakis.

Interactions of blood flow in the human brain occur between different scales, determined by flow features in the large arteries (above 0.5 mm in diameter), arterioles, and the capillaries (5E-3 mm). To simulate such multi-scale flow we develop mathematical models, numerical methods, scalable solvers and visualization tools. Our poster will present NektarG, a research code developed at Brown University for continuum and atomistic simulations. NektarG is based on a high-order spectral/hp element discretization featuring multi-patch domain decomposition for continuum flow simulations, and a modified DPD-LAMMPS for mesoscopic simulations. The continuum and atomistic solvers are coupled via a Multi-level Communicating Interface to exchange data required by interface conditions. The visualization software is based on ParaView and NektarG utilities accessed through the ParaView GUI. The new visualization software allows data computed in coupled (multi-scale) simulations to be presented simultaneously, and automatically synchronizes the display of the time evolution of solutions at multiple scales.

After 15-20 years of architectural stability, we are in the midst of a dramatic change in high performance computing systems design. In this talk we discuss the commonalities across the viable systems of today, and look at opportunities for numerical algorithms research and development. In particular, we explore possible programming and machine abstractions and how we can develop effective algorithms based on these abstractions, addressing, among other things, robustness issues for preconditioned iterative methods and resilience of algorithms in the presence of soft errors.

The performance of many high performance computing applications is
limited by data movement from memory to the processor. Often their cost is more
accurately expressed in terms of memory traffic rather than
floating-point operations and, to improve performance, data movement
must be reduced. One technique to reduce memory traffic is the fusion of loops
that access the same data. We have built the Build to Order (BTO) compiler to automate the
fusion of loops in matrix algebra kernels. Loop fusion often produces speedups
proportional to the reduction in memory traffic, but it can also lead to
negative effects in cache and register use. We present the results of experiments
with BTO that help us to understand the workings of loop fusion.
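
As a small illustration of why fusing loops reduces memory traffic, the following Python sketch computes y = A x and z = A^T w first with two separate passes over A and then with one fused pass. It is a hand-written example under assumed kernel choices, not output of the BTO compiler; `unfused` and `fused` are hypothetical names.

```python
def unfused(A, x, w):
    """Two separate loops: every entry of A is read twice."""
    m, n = len(A), len(A[0])
    y = [sum(A[i][j] * x[j] for j in range(n)) for i in range(m)]   # pass 1 over A
    z = [sum(A[i][j] * w[i] for i in range(m)) for j in range(n)]   # pass 2 over A
    return y, z


def fused(A, x, w):
    """One fused loop nest: every entry of A is read once and used twice."""
    m, n = len(A), len(A[0])
    y = [0.0] * m
    z = [0.0] * n
    for i in range(m):
        row = A[i]
        acc = 0.0
        for j in range(n):
            a = row[j]              # single load of A[i][j]
            acc += a * x[j]         # contribution to y = A x
            z[j] += a * w[i]        # contribution to z = A^T w
        y[i] = acc
    return y, z
```

The fused version halves the traffic for A but keeps both x and z live in the inner loop, which is the kind of added cache and register pressure the abstract mentions.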

David E. Keyes (King Abdullah University of Science & Technology)

http://www.kaust.edu.sa/academics/faculty/keyes.html

Sustained floating-point computation rates on real applications, as
tracked by the ACM Gordon Bell Prize, increased by three orders of
magnitude from 1988 (1 Gigaflop/s) to 1998 (1 Teraflop/s), and by
another three orders of magnitude to 2008 (1 Petaflop/s). Computer
engineering provided only a couple of orders of magnitude of
improvement for individual cores over that period; the remaining
factor came from concurrency, which is approaching one million-fold.

Algorithmic improvements contributed meanwhile to making each flop more valuable scientifically. As the semiconductor industry now slips relative to its own roadmap for silicon-based logic and memory, concurrency, especially on-chip many-core concurrency and GPGPU SIMD-type concurrency, will play an increasing role in the next few orders of magnitude, to arrive at the ambitious target of 1 Exaflop/s, extrapolated for 2018. An important question is whether today's best algorithms are efficiently hosted on such hardware and how much co-design of algorithms and architecture will be required.

From the applications perspective, we illustrate eight reasons why today's computational scientists have an insatiable appetite for such performance: resolution, fidelity, dimension, artificial boundaries, parameter inversion, optimal control, uncertainty quantification, and the statistics of ensembles.

The paths to the exascale summit are debated, but all are narrow and treacherous, constrained by fundamental laws of physics, cost, power consumption, programmability, and reliability. Drawing on recent reports, workshops, vendor projections, and experiences with scientific codes on contemporary platforms, we propose roles for today's researchers in one of the great global scientific quests of the next decade.

January 12, 2011

Having recently shown that high-order unstructured discontinuous
Galerkin (DG) methods are a discretization method for systems of
hyperbolic conservation laws that is well-matched to execution on GPUs,
in this talk I will explore both core and supporting components of
high-order DG solvers for their suitability for and performance on
modern, massively parallel architectures. Components examined range from
software components facilitating implementation to strategies for
automated tuning and, time permitting, numerical tweaks to the method
itself. In concluding, I will present a selection of further design
considerations and performance data.

We discuss the construction and execution of GPU kernels
from higher level specifications. Examples will be shown
using low-order finite elements and the fast multipole method.

Hugo Leclerc (École Normale Supérieure de Cachan)

http://www.lmt.ens-cachan.fr/personnels/perso_page.php?nom=LECLERC&secteur=2

Tools have been developed to generate code to solve partial differential equations from high level descriptions (manipulation of files, global operators, ...). The successive symbolic transformations lead to a macroscopic description of the code to be executed, which can then be translated into x86 (SSEx), C++ or CUDA code. The point emphasized here is that the different processes can be adapted to the target hardware, taking into account the ratio of gflops to gbps (e.g. when choosing between recomputation and caching), the SIM[DT] abilities, ... The poster will present the gains (compared to classical CPU/GPU implementations) for two implementations of a 3D unstructured FEM solver, using respectively a conjugate gradient and a domain decomposition method with repetitive patterns.

January 10, 2011

We live in the age of heroic programming for scientific applications on Graphics Processing Units (GPUs). Typically a scientist chooses an application to accelerate and a target platform, and through great effort maps their application to that platform. If they are a true hero, they achieve two or three orders of magnitude speedup for that application and target hardware pair. The effort required includes a deep understanding of the application, its implementation and the target architecture. When a new, higher performance architecture becomes available additional heroic acts are required.
There is another group of scientists who prefer to spend their time focused on the application level rather than lower levels. These scientists would like to use GPUs for their applications, but would prefer to have parameterized library components available that deliver high performance without requiring heroic efforts on their part.
The library components should be easy to use and should support a wide range of user input parameters. They should exhibit good performance on a range of different GPU platforms, including future architectures. Our research focuses on creating such libraries.
We have been investigating parameterized library components for use with Matlab/Simulink and with the SCIRun Biomedical Problem Solving Environment from the University of Utah. In this talk I will discuss our library development efforts and challenges to achieving high performance across a range of both application and architectural parameters.
I will also focus on issues that arise in achieving correct behavior of GPU kernels. One issue is correct behavior with respect to thread synchronization. Another is knowing whether or not your scientific application that uses floating point is correct when the results differ depending on the target architecture and order of computation.
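
The dependence of floating-point results on the order of computation can be shown with a short Python sketch. It compares a sequential sum with a tree-shaped (pairwise) reduction of the kind parallel hardware often uses; the function names and the input values are illustrative assumptions.

```python
def sequential_sum(values):
    """Left-to-right accumulation, as in a simple serial loop."""
    total = 0.0
    for v in values:
        total += v
    return total


def pairwise_sum(values):
    """Tree-shaped reduction over a non-empty list."""
    if len(values) == 1:
        return values[0]
    mid = len(values) // 2
    return pairwise_sum(values[:mid]) + pairwise_sum(values[mid:])


data = [1e16, 1.0, -1e16, 1.0]
print(sequential_sum(data))   # 1.0
print(pairwise_sum(data))     # 0.0
```

The exact sum is 2.0, so both orders are wrong here, and they disagree with each other. Neither result indicates a bug in the reduction itself, which is what makes correctness checks across architectures difficult.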

This research demonstrates the incorporation of the GPU's parallel processing architecture into the SCIRun biomedical problem solving environment with minimal changes to the environment or user experience. SCIRun, developed at the University of Utah, allows scientists to interactively construct many different types of biomedical simulations. We use this environment to demonstrate the effectiveness of the GPU by accelerating time-consuming algorithms present in these simulations. Specifically, we target the linear solver module, which contains multiple solvers that benefit from GPU hardware. We have created a class to accelerate the conjugate gradient, Jacobi and minimal residual linear solvers; the results demonstrate that the GPU can provide acceleration in this environment. A principal focus was to remain transparent by retaining the user-friendly experience for the scientist using SCIRun's graphical user interface. NVIDIA's CUDA C language is used to enable performance on NVIDIA GPUs. Challenges include manipulating the sparse data processed by these algorithms and communicating with the SCIRun interface amidst computation. Our solution makes it possible to implement GPU versions of the existing SCIRun algorithms easily and can be applied to other parallel algorithms in the application. The GPU executes the matrix and vector arithmetic to achieve speedups of up to 16x on the algorithms in comparison to SCIRun's existing multithreaded CPU implementation. The source code will contain single and double precision versions to utilize a wide variety of GPU hardware and will be incorporated into, and made publicly available in, future versions of SCIRun.
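
For reference, the conjugate gradient solver mentioned above consists of sparse matrix-vector products and vector arithmetic, which are the operations a GPU version offloads. The serial Python sketch below assumes CSR storage and uses hypothetical names (`csr_matvec`, `conjugate_gradient`); it is not SCIRun's code, and the abstract does not state which sparse format SCIRun uses.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def csr_matvec(values, col_idx, row_ptr, x):
    """y = A x for a matrix stored in compressed sparse row (CSR) form."""
    return [sum(values[k] * x[col_idx[k]]
                for k in range(row_ptr[i], row_ptr[i + 1]))
            for i in range(len(row_ptr) - 1)]


def conjugate_gradient(values, col_idx, row_ptr, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive definite CSR matrix A."""
    x = [0.0] * len(b)
    r = list(b)                      # residual for the zero initial guess
    p = list(r)                      # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:      # stop once the residual norm is small
            break
        Ap = csr_matvec(values, col_idx, row_ptr, p)
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x
```

Each iteration needs one sparse matrix-vector product, two dot products and three vector updates, all of which are data-parallel.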

Fusion (the integration of CPU and GPU into a single processing entity) is here. Cloud based software services are here. Large processing clusters are running massively parallel Hadoop programs now. Can large-scale, commercial, enterprise, server solutions be dynamically repurposed to run HPC problem sets? The future of HPC may well be a massive set of virtual machines running in "curve of the earth" sized data centers. The cost of HPC processing sponges (HPC problem sets that consume otherwise wasted processing cycles in scale-out server clusters) will probably make all but the most extreme purpose-built HPC systems obsolete.

This talk summarizes an effort at the Modeling, Simulation and Visualization Center at the University of Wisconsin-Madison to model and simulate large-scale discrete dynamics problems. This effort is motivated by a desire to address unsolved challenges posed by granular dynamics problems, the mobility of tracked and wheeled vehicles on granular terrain, and digging into granular material, to name a few. In the context of simulating the dynamics of large systems of interacting rigid bodies, we briefly outline a method for solving large cone complementarity problems by means of a fixed-point iteration algorithm. The method is an extension of the Gauss-Jacobi algorithms with over-relaxation for symmetric convex complementarity problems. Convergent under fairly standard assumptions, the method is implemented in a scalable parallel computational framework by using a single instruction multiple data (SIMD) execution paradigm supported by the Compute Unified Device Architecture (CUDA) library for programming on the graphics processing unit (GPU). The simulation framework supports the analysis of problems with more than one million rigid bodies that interact through contact and friction forces, and whose dynamics are constrained by either unilateral or bilateral kinematic constraints. Simulation thus becomes a viable tool for investigating, in the near future, the dynamics of complex systems such as the Mars Rover operating on granular terrain, powder composites, and granular material flow. The talk concludes with a short summary of other applications that stand to benefit from the computational power available on today’s GPUs.
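
To make the idea of a projected fixed-point iteration concrete, here is a heavily simplified sketch. It replaces the friction cones of the talk with the non-negative orthant, so it solves a plain linear complementarity problem (find x >= 0 with w = A x + b >= 0 and x . w = 0) rather than a cone complementarity problem, and it is not the authors' method. The function name `projected_jacobi`, the relaxation factor `omega = 1.2`, and the 2x2 test data are assumptions for illustration only.

```python
def projected_jacobi(a, b, omega=1.2, iterations=100):
    """Approximate x >= 0 such that w = a x + b >= 0 and x . w = 0.

    Assumes `a` is symmetric and strictly diagonally dominant, and that
    `omega` is small enough for the iteration to contract.
    """
    n = len(b)
    x = [0.0] * n
    for _ in range(iterations):
        # Jacobi style: every component reads only the previous iterate,
        # so all n updates could run in parallel.
        w = [sum(a[i][j] * x[j] for j in range(n)) + b[i] for i in range(n)]
        # Relaxed step followed by projection onto x_i >= 0.
        x = [max(0.0, x[i] - omega * w[i] / a[i][i]) for i in range(n)]
    return x


a = [[4.0, 1.0], [1.0, 4.0]]
b = [-4.0, 2.0]
x = projected_jacobi(a, b)
w = [sum(a[i][j] * x[j] for j in range(2)) + b[i] for i in range(2)]
```

In this small example the first unknown ends up strictly positive with a vanishing `w[0]`, while the second is clamped to zero with a positive `w[1]`, which is the complementarity pattern the iteration is meant to find.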

Hyperspectral images can be used for abundance estimation and anomaly
detection; however, the algorithms involved tend to be I/O intensive.
Parallelizing these algorithms can enable their use in real-time
applications. One way to overcome these limitations is to select
parallelizable algorithms and implement them on GPUs.
GPUs are designed as throughput engines, built to process large
amounts of dense data in parallel. RX detectors and abundance
estimators will be parallelized and tested for
correctness and performance.

January 11, 2011

Stencil calculations comprise an important class of kernels in many scientific computing applications, ranging from simple PDE solvers to constituent kernels in multigrid methods as well as image processing applications. In such solvers, stencil kernels are often the dominant part of the computation, and an efficient parallel implementation of the kernel is therefore crucial in order to reduce the time to solution.
However, on today's complex hardware microarchitectures, meticulous architecture-specific tuning is required to elicit the machine's full compute power. We present PATUS, a code generation and auto-tuning framework for stencil computations targeted at multi- and manycore processors, such as multicore CPUs and graphics processing units. PATUS makes it possible to generate compute kernels from a specification of the stencil operation and a parallelization and optimization strategy, and it leverages the auto-tuning methodology to optimize strategy-dependent parameters for the given hardware architecture.
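
For readers unfamiliar with stencil kernels, the sketch below shows a 5-point averaging stencil on a 2D grid together with a cache-blocked traversal whose block size is the kind of strategy-dependent parameter an auto-tuner might search over. This is plain Python written for illustration; it does not use PATUS or its specification language, and the function names and the `block` parameter are assumptions.

```python
def stencil_sweep(grid):
    """Apply one 5-point averaging sweep to the interior of a 2D grid."""
    rows, cols = len(grid), len(grid[0])
    out = [row[:] for row in grid]  # boundary values are copied unchanged
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            out[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                + grid[i][j - 1] + grid[i][j + 1])
    return out


def stencil_sweep_blocked(grid, block=2):
    """Same sweep, but visiting the interior in block x block tiles."""
    rows, cols = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for i0 in range(1, rows - 1, block):
        for j0 in range(1, cols - 1, block):
            # Only the traversal order changes; each point is computed
            # from the same four neighbours as in stencil_sweep.
            for i in range(i0, min(i0 + block, rows - 1)):
                for j in range(j0, min(j0 + block, cols - 1)):
                    out[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                        + grid[i][j - 1] + grid[i][j + 1])
    return out


# 5x5 grid with a "hot" top edge and zeros elsewhere.
grid = [[1.0] * 5] + [[0.0] * 5 for _ in range(4)]
swept = stencil_sweep(grid)
```

Because the blocked variant produces the same values for any block size, a tuner is free to pick the size that runs fastest on a given machine.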

The enormous growth of biological sequence data is rapidly turning bioinformatics into a data-intensive computational science. As a result, the computational power needed by bioinformatics applications is growing rapidly as well. The recent emergence of parallel accelerator technologies such as GPUs has made it possible to significantly reduce the execution times of many bioinformatics applications. In this talk I will present the design and implementation of scalable GPU algorithms based on the CUDA programming model in order to accelerate important bioinformatics applications. In particular, I will focus on algorithms and tools for next-generation sequencing (NGS), using error correction as an example. Detection and correction of sequencing errors is an important but time-consuming pre-processing step for de novo genome assembly or read mapping. I will discuss the parallel algorithm design used for the CUDA-EC and DecGPU tools, and I will also give an overview of other CUDA-enabled tools developed by my research group.

Vortex particle methods, when combined with multipole-accelerated
boundary element methods (BEM), become a complete tool for direct
numerical simulation (DNS) of internal or external vortex-dominated
flows. In previous work, we presented a method to accelerate the
vorticity-velocity inversion at the heart of vortex particle methods by
performing a multipole treecode N-body method on parallel graphics
hardware. The resulting method achieved a 17-fold speedup over a
dual-core CPU implementation. In the present work, we will demonstrate
both an improved algorithm for the GPU vortex particle method that
outperforms an 8-core CPU by a factor of 43, and a GPU-accelerated
multipole treecode method for the boundary element solution. The new BEM
solves for the unknown source, dipole, or combined strengths over a
triangulated surface using all available CPU cores and GPUs. Problems
with up to 1.4 million unknowns can be solved on a single commodity
desktop computer in one minute, and at that size the hybrid CPU/GPU
solver outperforms a quad-core CPU alone by a factor of 22.5. The method is exercised
on DNS of impulsively-started flow over spheres at Re=500, 1000, 2000,
and 4000.

In addition to my research into vortex particle methods, parallel N-body methods, and GPU programming, I create artwork using these same computer programs. The work consists of imagery and animations of fluid forms and other shapes and patterns in nature. Using relatively simple algorithms reflecting the origins of their underlying processes, many of these patterns can be recreated and their inherent beauty exposed. In this talk, I will discuss the technical aspects of my work, but mainly plan to distract attention with the works themselves.

Biography:

Mark Stock earned his PhD from Aerospace Engineering at the University of Michigan in 2006, and has been working for Applied Scientific Research in Santa Ana, CA since then. He has been creating computer imagery and numerical simulations for over 25 years, and started exhibiting his artwork in 2001.

Parallelism is largely seen as a necessary evil to cope with the power restrictions on a chip, and most programmers would prefer to continue writing sequential programs rather than deal with alien and error-prone parallel programming. This talk will question this view and point out how the allegedly unfamiliar parallel processing is utilized by millions of people every day. Parallelism appears as a curse only when viewed through the distorting illusion of sequential processing. Admittedly, there are critical decisions associated with specialization, data movement, and synchronization, but we also have a lot of experience in making them because we make them every day. The results presented will demonstrate that the analogies drawn are not merely theoretical.

Locally-Self-Consistent Multiple-Scattering (LSMS) is one of the major petascale applications and is highly tuned for supercomputer systems like the Cray XT5 Jaguar. We present our recent effort on porting and tuning the major computational routine of LSMS to GPU-based systems to demonstrate the feasibility of LSMS beyond petaflops. In particular, we discuss the techniques used, including auto-tuning of dense matrix kernels and computation-communication overlap.

We present a very efficient implementation of a multiphase lattice Boltzmann method (LBM) based on CUDA. This technology delivers significant benefits for predictions of properties in rocks. The simulator on NVIDIA hardware enables us to perform pore-scale multiphase (oil-water-matrix) simulations in natural porous media and to predict important rock properties like absolute permeability, relative permeabilities, and capillary pressure. We will show videos of these simulations in complex real-world porous media and rocks.

We show how Ingrain's digital rock physics technology works to predict fluid flow properties in rocks.
NVIDIA CUDA technology delivers significant acceleration for this technology.
The simulator on NVIDIA hardware enables us to perform pore scale multi-phase (oil-water-matrix) simulations
in natural porous media and to predict important rock properties like absolute permeability, relative permeabilities, and capillary pressure.

Algebraic Multigrid (AMG) solvers are an essential component of many large-scale
scientific simulation codes. Their continued numerical scalability and efficient
implementation are critical for preparing these codes for exascale.
Our experiences on modern multi-core machines show that significant challenges
must be addressed for AMG to perform well on such machines. We discuss our
experiences and describe the techniques we have used to overcome scalability
challenges for AMG on hybrid architectures in preparation for exascale.

The combination of algorithmic acceleration and hardware acceleration can have tremendous impact. The FMM is a fast algorithm for calculating matrix-vector multiplications in O(N) time, and it runs very fast on GPUs. Its combination of a high degree of parallelism and O(N) complexity makes it an attractive solver for the petascale and exascale era. It has a wide range of applications, e.g. quantum mechanics, molecular dynamics, electrostatics, acoustics, structural mechanics, fluid mechanics, and astrophysics.
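
The following sketch is not an FMM. It only illustrates the far-field idea that fast summation methods build on: a well-separated cluster of sources can be replaced by a single low-order expansion, here just a monopole at the centre of charge, at a small cost in accuracy. The 1/r kernel, the function names, and the sample data are assumptions chosen for this example.

```python
import math


def direct_potential(target, sources):
    """Sum q / r over every source: one row of the dense matrix-vector product."""
    return sum(q / math.dist(target, p) for p, q in sources)


def monopole_potential(target, sources):
    """Replace the whole cluster by its total charge at the centre of charge."""
    total = sum(q for _, q in sources)
    center = tuple(
        sum(q * p[d] for p, q in sources) / total for d in range(len(target))
    )
    return total / math.dist(target, center)


# A small cluster near the origin and a target far away from it.
sources = [
    ((0.0, 0.0, 0.0), 1.0),
    ((1.0, 0.0, 0.0), 2.0),
    ((0.0, 1.0, 0.0), 1.0),
    ((1.0, 1.0, 1.0), 2.0),
]
target = (100.0, 80.0, 60.0)

exact = direct_potential(target, sources)
approx = monopole_potential(target, sources)
relative_error = abs(exact - approx) / abs(exact)
```

The direct sum costs one kernel evaluation per source for every target, whereas the clustered form costs one evaluation per cluster; hierarchical methods apply this trade-off recursively over a tree of clusters.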