-During the last two decades we have seen how computer graphics hardware has been developed from fixed pipeline processors with no level of programmability, to flexible high-performance hardware platforms, suitable for general purpose scientific computations other than computer graphics. This trend has been disruptive high-performance computing on mass-produced commodity hardware and give new opportunities for computational science and engineering on work stations for a broad range of scientific applications. This emphasizes the increasingly important role of computers in simulation of real world dynamics \cite{ch7:Keyes201170}. In recent years, the Compute Unified Device Architecture (CUDA) programming model, based on the standard C/C++ programming language and introduced by Nvidia, has become popular as a proprietary and widely used standard in high performance communities. It is by design and supported functionality, easy and sufficient to be deployed for wide improvement of existing and new applications across science and engineering fields, that can benefit from the the use of heterogenous hardware.
+During the last two decades we have seen how computer graphics hardware has been developed from fixed pipeline processors with no level of programmability, to flexible high-performance hardware platforms, suitable for general purpose scientific computations other than computer graphics. This trend has contributed to a disruptive breakthrough in high-performance computing on mass-produced commodity hardware and fuelled new opportunities for computational science and engineering for a broad range of scientific as well as modern business applications. This emphasizes the increasingly important role of computers in simulation of real world dynamics \cite{ch7:Keyes201170}. In recent years, the CUDA programming model, based on the standard C/C++ programming language and introduced by NVIDIA Corporation worldwide, has become popular as a proprietary and widely used standard in high performance communities. It is, by design and supported functionality, easy and sufficient to be deployed for wide improvement of existing and new applications across science and engineering fields, that can benefit from the the use of heterogenous hardware.
+
+We should be careful about speculating about the future and extrapolating from current trends. The TOP 500 list\footnote{\url{http://www.top500.org.}} of supercomputers shows that there are some general noticeable hardware trends and gives indication of what to expect in the near future. First, since 2005 we have seen how power constraints and resulting heat dissipation problems forced chip producers to increase the number of cores rather than clock frequency. Multicore processors have become the new standard and many-core processors are becoming available as a standard in commodity hardware, from personal laptops to supercomputing clusters.
+
+This trend suggests that there will be less fast low-latency memory available per core in the future, favoring data-locality in algorithms. In addition, we have also seen how communication speed to computation speed ratio decreases, making it increasingly difficult to supply data to hungry floating point units. In addition, there will likely be increasing amounts of data to store as a result of increasing processing capabilities. The rapidly increasing floating point performance following Moore's law for transistor production has resulted in a significant memory gap which leaves most scientific applications based on partial differential equations (PDEs) bandwidth bound rather than compute bound. This trend is driven by pure commercial needs and not the needs of high-performance computing. Roads to better performance include standardization of software infrastructure, rethinking algorithms to better exploit memory hierarchies optimally to boost strong scaling properties, increasing locality in algorithms, and introducing as much concurrency and work as possible to both utilize and exploit the many cores. Also, software that can utilize many cores should be fault-tolerant to maximize time to solution for application users. We should also expect to see multiple layers of parallelism that will have to be exploited and possibly autotuned to optimally utilize available hardware resources. This introduces new challenges in compilers, requires programming experts with hardware knowledge, and introduces new trends in software developments to leverage productivity and utilize available computational resources in more optimal ways. We have observed a fundamental paradigm shift of underlying hardware design towards much more heterogeneity and parallelism.
+
+A key problem is that improvements in performance require porting legacy codes\footnote{In the worst case, a legacy code is an undocumented serial code developed a long time ago by a developer no longer around.} to new hardware, and possibly changing algorithms which have been developed for the conventional single core processors decades ago. Without this, it may be impossible to utilize and scale algorithmic work optimally to achieve high performance on modern and emerging hardware. This problem is currently addressed with rapid progress by researchers and industry by development of new optimized libraries that can utilize such new hardware. While we have seen significant improvements in such efforts, and today see much more rapid development of applications, there are still relatively few scientific applications running entirely on heterogenous hardware.