A Brief Future of Computing

Dr Francis Wray looks back over the history of HPC and gives his insight into what can be said about the systems of the future.
Introduction
Over the past 30 years, computing has come to play a significant role in the way we conduct our lives. The development of PCs, enabling applications such as word processing and spreadsheets; the availability of the internet, bringing with it search engines, e-commerce, voice-over-IP and email; games consoles; and the emergence of mobile computing through laptops, tablets and smart phones have changed forever the ways in which we interact with our family, friends and surroundings. All this has happened in a very short time and with an increasing rate of change.
As a simple indicator of this change, consider the power available in a laptop. We have chosen this example simply because other mobile devices such as smart phones and tablets did not exist 20 years ago. In 1990, the first laptops typically had a 10 MHz processor, 4 MB of memory and a floating-point capability of 1 MFLOPs [1]. By the mid-1990s, processor frequencies had risen to 50 MHz, memory to 32 MB and floating-point capability to 50 MFLOPs. In 2012, a typical laptop has a processor comprising two cores clocked at 2.5 GHz, 4 GB of memory and a floating-point capability of 50 GFLOPs. In little more than 20 years, the capability of a laptop to process data (or play games) has increased more than a thousand-fold. This processing capability, combined with a similar increase in networking capability (from the 14 kb/s modem to 20 Mb/s broadband), has created unprecedented opportunities for innovation and societal change. In no other aspect of society have we seen such rapid development.
Compare this with the world of supercomputing. In June 1993, the world's most powerful supercomputer had a floating-point capability of 60 GFLOPs. In November 2011, the world's most powerful supercomputer had a floating-point capability of 11 PFLOPs, some 170,000 times greater. There are several things to note here. Firstly, and most remarkably, in 1990 the most powerful supercomputer in the world had a performance lower than that of a present-day laptop. Secondly, the performance of supercomputers has increased at a somewhat faster rate than that of commodity laptops. This is not surprising: both types of computer now use similar components, but the number of processors in a supercomputer is also increasing, from a few thousand in the early 90s to hundreds of thousands now. Finally, whilst the social impact of computing is clear for all to see, the industrial, and less visible, impact of supercomputing has also been very significant, in areas ranging from the design of drugs to that of complete aircraft. In terms of economic effect it is clear that "The country that out-computes will be the one that out-competes." [2]
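As a rough sanity check on these growth rates, the implied doubling times can be computed directly from the figures quoted above. The short Python sketch below uses only the numbers given in the text (1 MFLOPs in 1990 and 50 GFLOPs in 2012 for laptops; 60 GFLOPs in 1993 and 11 PFLOPs in 2011 for supercomputers); the results are indicative only.

```python
import math

def doubling_time(perf_start, perf_end, years):
    """Doubling time in years implied by exponential growth
    from perf_start to perf_end over the given period."""
    growth_factor = perf_end / perf_start
    return years * math.log(2) / math.log(growth_factor)

# Figures quoted in the text (in FLOPS)
laptop_1990, laptop_2012 = 1e6, 50e9    # 1 MFLOPs -> 50 GFLOPs over 22 years
super_1993, super_2011 = 60e9, 11e15    # 60 GFLOPs -> 11 PFLOPs over 18 years

print(f"Laptop growth factor:        {laptop_2012 / laptop_1990:,.0f}x")
print(f"Laptop doubling time:        {doubling_time(laptop_1990, laptop_2012, 22):.2f} years")
print(f"Supercomputer growth factor: {super_2011 / super_1993:,.0f}x")
print(f"Supercomputer doubling time: {doubling_time(super_1993, super_2011, 18):.2f} years")
```

On these figures, both doubling times come out at a little over a year, with the supercomputer trend line somewhat steeper than the laptop one, consistent with the observations above.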
Although the purpose of this article is not to follow social trends and the influence computing has exerted upon them, these simple statistics show the synergy between the development of personal computing and that of supercomputing. In what follows, we will focus on the historical development of supercomputing, but bear in mind its symbiotic relationship with mobile computing. In drawing some conclusions about the future of supercomputing, we shall inevitably say something about the growing influence computing exerts on our daily lives.
The development of supercomputing (the early days)
The early days of supercomputing were characterised by custom devices and exotic technologies. The first recognised supercomputer, the Cray-1, was installed at Los Alamos National Laboratory in 1976. It had a peak floating-point performance of 160 MFLOPs, used a custom processor and was cooled by liquid Freon.
This was followed by a series of developments not only by Cray, but also by Fujitsu, Hitachi and NEC, who entered the supercomputer market with offerings again based on custom processors and state-of-the-art cooling systems. Examples include the Fujitsu VP-200, announced in July 1982 with a peak performance of 500 MFLOPs; the Hitachi HITAC S-810, announced in August 1982 with a peak performance of 630 MFLOPs; and the NEC SX-1 vector supercomputer, announced in April 1983 with a peak performance of 570 MFLOPs, the first in a line of supercomputers leading to the SX-9, announced in 2008. The SX-1 was announced at the same time as the SX-2, which had a peak performance of 1.3 GFLOPs.
In the late 1980s, a series of massively parallel computers using large numbers of commodity processors entered the high-performance computing market. These included systems from Thinking Machines, Intel, nCube, MasPar and Meiko Scientific. In 1993, Cray announced its T3D system, comprising up to 2048 DEC Alpha processors connected by a high-speed network. The potential performance of the 2048-processor system was 200 GFLOPs, although the largest system ever sold had 1024 processors and a performance of 100 GFLOPs. This was a landmark announcement, which heralded the dominant position of commodity processors in computer systems ranging from mobile devices through to top-of-the-range supercomputers. From then on, the power of supercomputers would be determined by the performance of an individual processor and by the number that could be integrated into a single computer, limited only by power, cooling, volume, networking and financial constraints.
The invasion of the killer micros
From 1993, when Cray announced the T3D, it was easy to see how all computers, supercomputers included, would get faster and faster. Moore's Law told us that every two years or so the number of transistors in a given area of silicon would double. If these transistors could be used effectively, then the performance of individual processors would increase, as would that of systems containing several such processors. When Cray introduced the T3D in 1993, it was taking advantage of a trend which had already started in the 70s, had continued through the 80s and would continue to around 2005: simply put, single processors would get faster by increasing their clock speeds and by using the increased transistor counts dictated by Moore's Law. In the 70s and 80s, each chip generation made single-threaded code execute faster through a combination of increased clock speed and the use of the additional transistors to add a major feature to the chip, such as a floating-point unit, out-of-order execution or pipelining. In the 90s and early 2000s, each chip generation made single-threaded code run faster through a combination of increased clock speed and the use of the additional transistors to add an increasing number of progressively smaller features. By 2005, Moore's Law still applied, but this trend had hit a wall.
What had stopped this trend? There were two factors. The first was heat dissipation, which rises steeply with clock frequency and has effectively capped clock speeds at a few gigahertz. The second was that there were no more features to add that could realistically speed up single-threaded code. Nevertheless, there remained a way to make use of the still-increasing number of transistors dictated by Moore's Law, and that was to put more than one processor core on each chip. This marked the start of the multicore revolution, which we describe in the next section.
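The link between clock frequency and heat can be made concrete with the standard first-order model of dynamic power, P ≈ α·C·V²·f, where raising the frequency typically also requires raising the supply voltage. The sketch below is illustrative only; the activity factor, capacitance, voltage and scaling figures are assumptions chosen to show the shape of the problem rather than to describe any particular chip.

```python
def dynamic_power(alpha, capacitance, voltage, frequency_hz):
    """First-order CMOS dynamic power model: P ~ alpha * C * V^2 * f."""
    return alpha * capacitance * voltage**2 * frequency_hz

# Hypothetical baseline chip (illustrative numbers, not a real device)
alpha, cap = 0.2, 20e-9        # activity factor, effective switched capacitance (F)
base_v, base_f = 1.0, 2.0e9    # 1.0 V at 2 GHz

p_base = dynamic_power(alpha, cap, base_v, base_f)

# Doubling the frequency usually needs a higher supply voltage as well,
# so power grows much faster than linearly with clock speed.
p_fast = dynamic_power(alpha, cap, 1.3 * base_v, 2 * base_f)

# Alternative: keep the clock fixed and add a second identical core.
p_dual = 2 * p_base

print(f"Baseline:                 {p_base:.2f} W")
print(f"2x clock (+30% voltage):  {p_fast:.2f} W  ({p_fast / p_base:.1f}x)")
print(f"Two cores at base clock:  {p_dual:.2f} W  ({p_dual / p_base:.1f}x)")
```

Under these assumptions, doubling the clock costs roughly 3.4 times the power, whereas adding a second core at the original clock costs only twice the power for, in principle, the same doubling of throughput. This is precisely the trade-off that drove the shift to multicore.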
The multicore revolution
As we saw in the previous section, around 2005 it was no longer realistic to try to speed up single processors. Instead, the obvious step was to put increasing numbers of processor cores onto a single chip. This trend is now well established: most laptops have dual-core processors, servers have processors with four or more cores, and Intel has recently announced a processor with around 50 cores. A key feature of these cores is that they are all identical and can be programmed with a homogeneous programming model, at least within the confines of a single chip.
How long can this new trend continue? The short answer is that it is already being superseded by a move to heterogeneous multicore processors. Before we consider such devices, it is useful to look at the pros and cons of a homogeneous multicore processor.
The principal advantage is that the programming model for such a device is relatively simple, because each core has the same instruction set and capabilities as every other core. Note the use of "relatively": programming parallel systems comprising multiple cores can still be far from simple. This type of processor is also simpler to design, because a single core can be replicated to fill the available area of silicon. Although the computing power of multicore devices can continue to increase rapidly by increasing the number of cores, this potential power is harder to exploit, because applications need to be converted to run on parallel systems. Nevertheless, at least for small numbers of cores, support is available from proprietary parallel runtimes, profilers and debuggers.
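As a concrete illustration of this "relatively simple" homogeneous programming model, the sketch below spreads identical work across all available cores; because every core can run the same function, no special-casing is needed. It is a minimal Python example using the standard multiprocessing module, not a representation of any particular vendor's toolchain.

```python
from multiprocessing import Pool, cpu_count

def partial_sum(bounds):
    """Identical work for every core: sum of squares over a sub-range."""
    lo, hi = bounds
    return sum(i * i for i in range(lo, hi))

if __name__ == "__main__":
    n, cores = 10_000_000, cpu_count()
    chunk = n // cores
    ranges = [(c * chunk, n if c == cores - 1 else (c + 1) * chunk)
              for c in range(cores)]

    # Every core runs the same code on its own slice of the data.
    with Pool(processes=cores) as pool:
        total = sum(pool.map(partial_sum, ranges))

    print(f"{cores} identical cores computed sum of squares = {total}")
```

The same pattern carries over to threaded C or OpenMP code on a real HPC node; the point is simply that identical cores let the programmer write one kernel and replicate it.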
A disadvantage of the homogeneous multicore approach is backwards compatibility. New multicore processors need to support legacy software, and in some cases this means providing redundant capability and replicating it across all cores. In other words, this approach does not result in the most efficient use of the available transistors. A further difficulty, which applies to all multicore devices, homogeneous or heterogeneous, is that of memory models. The more cores in a device, the more complicated the on-chip memory system needs to be to give each core a consistent view of, and ready access to, on-chip data. Of course, this is an architectural choice. It would be perfectly possible for each core to have its own local memory and for cores to exchange data by passing messages. For the time being, however, manufacturers have chosen not to implement this paradigm in their mainstream processors, but they may be forced to review their memory models as core counts increase. Finally, there is the issue of memory-interface bandwidth, which again applies to both heterogeneous and homogeneous multicore devices. Quite simply, the more cores there are on a device, the faster the external memory interface needs to be to keep all those cores supplied with data. This remains an unsolved problem, but one which can be circumvented for the time being while the number of cores on a device remains sufficiently small. Nevertheless, it is a fundamental barrier to the development of processors with very large numbers of cores, homogeneous or heterogeneous. Ultimately, new architectural choices, combined with the development of new algorithms, will be needed to address this issue.
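The memory-interface problem can be quantified with a simple back-of-the-envelope model: the bandwidth needed to keep every core busy is roughly cores × flops-per-core × bytes-moved-per-flop. The figures below are assumptions chosen for illustration (they do not describe any specific product), but they show how quickly the requirement outruns what a realistic external memory interface can supply.

```python
def required_bandwidth_gb_s(cores, gflops_per_core, bytes_per_flop):
    """Sustained memory bandwidth (GB/s) needed to feed all cores."""
    return cores * gflops_per_core * bytes_per_flop

# Illustrative assumptions: 10 GFLOPs per core, and an application that
# must move 1 byte from external memory for every 4 flops performed.
gflops_per_core, bytes_per_flop = 10, 0.25
available_gb_s = 50        # assumed external memory bandwidth

for cores in (2, 4, 16, 64, 256):
    need = required_bandwidth_gb_s(cores, gflops_per_core, bytes_per_flop)
    status = "ok" if need <= available_gb_s else "memory-bound"
    print(f"{cores:4d} cores need {need:7.1f} GB/s  -> {status}")
```

Under these assumptions the interface copes with a handful of cores but is swamped long before the core count reaches the hundreds, which is why new memory architectures and more bandwidth-frugal algorithms will be needed.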
Heterogeneous multicore processors
Barely seven years after the start of the homogeneous multicore era, we are now confronted with compelling reasons to adopt heterogeneous multicore processors. This approach was pioneered in the IBM Cell processor used in the PlayStation 3. A further example of such devices is the AMD Fusion, which combines multiple Central Processing Units (CPUs) with a Graphics Processing Unit (GPU); in particular, the A10 series combines four CPU cores with a GPU and is targeted at the HPC marketplace. Another example is the NVIDIA Tegra, which combines a dual-core ARM Cortex-A9 processor with a GeForce GPU. Although targeted at the mobile computing marketplace, it is easy to see how this technology could be further developed and applied to HPC. Yet another example is ARM's big.LITTLE technology, which combines high-performance cores with low-power cores, enabling an application to choose dynamically the best processor configuration and so minimise power consumption. Finally, Intel's Many Integrated Core (MIC) architecture packs many lightweight cores onto a single die, designed to work alongside conventional heavyweight processors in the same system.
What are the reasons for moving to heterogeneous multicore processors? It's all a question of what we are trying to optimise. Up to 2005, the objective was to get the most performance out of a single processor, regardless of anything else. At the start of the homogeneous multicore era, this changed to getting the maximum performance per unit area of silicon. Now the objective has become getting the maximum performance per joule, given that transistor count is no longer an issue thanks to the relentless advance of Moore's Law. The reasons for this latest change are clear. At the top end, supercomputers are consuming too much power (several megawatts), and viable systems now need to maximise compute per unit of energy. At the mobile computing end, the same constraint applies. The best way to satisfy this constraint is to perform each calculation on the cores that use the least energy for that particular calculation. At the moment, the choice is limited to CPUs, GPUs and Field Programmable Gate Arrays (FPGAs), but other specialist processing cores will be deployed as heterogeneous devices mature. Indeed, chips that integrate several diverse functional units, each of which can be turned off and on as needed, are already being considered.
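The principle of running each calculation on the cores that use the least energy can be expressed as a simple scheduling rule. The energy figures below are purely hypothetical placeholders (real numbers depend heavily on the device and the workload); the sketch only shows the shape of the decision.

```python
# Hypothetical energy cost per operation, in nanojoules, for each core type
# and each class of work. These numbers are illustrative, not measured.
ENERGY_NJ_PER_OP = {
    "cpu":  {"branchy_control": 1.0, "dense_linear_algebra": 4.0, "bit_manipulation": 3.0},
    "gpu":  {"branchy_control": 6.0, "dense_linear_algebra": 0.5, "bit_manipulation": 2.0},
    "fpga": {"branchy_control": 5.0, "dense_linear_algebra": 1.5, "bit_manipulation": 0.2},
}

def cheapest_core(task_kind):
    """Pick the core type with the lowest assumed energy per operation."""
    return min(ENERGY_NJ_PER_OP, key=lambda core: ENERGY_NJ_PER_OP[core][task_kind])

for task in ("branchy_control", "dense_linear_algebra", "bit_manipulation"):
    core = cheapest_core(task)
    print(f"{task:22s} -> {core}  ({ENERGY_NJ_PER_OP[core][task]} nJ/op assumed)")
```

A real scheduler would also weigh data-movement costs and which units are powered up, but the underlying objective, minimum energy per result, is the same.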
Data-intensive computing
Finally, we need to add data-intensive computing to the mix. This will deploy high-performance systems optimised for processing data using integer and logical operations, rather than for numerically intensive applications. Such systems will sit comfortably within the heterogeneous multicore landscape. They will have access to highly distributed data via the Cloud and the Internet, and may even make extensive use of data from mobile computing devices. This will be data mining "on steroids". It will create the opportunity for unprecedented discovery of knowledge from data, enabling as yet unimagined applications which may have profound effects on commerce and society.
The systems of the future
Based on current rates of progress, it is projected that exaflops (EFLOPs) systems will be available in 2019 and that zettaflops (ZFLOPs) systems will be available by 2030. Cray has already announced plans to build a 1 EFLOPs system before 2020. Somewhat astonishingly, in India, ISRO and the Indian Institute of Science have plans to build a 132.8 EFLOPs supercomputer by 2017.
It is interesting to speculate on the processor and core count of EFLOPs systems. Focusing on the 2020 horizon for an EFLOPs system, we might anticipate the performance of a single processor to be 50 TFLOPs, for it to comprise 1000 heterogeneous cores each capable of 0.5 TFLOPs, but with only 10% of the cores running at any one time, and for a system to comprise 20,000 such processors. Such numbers are, of course, speculation and are not supported by any firm technological announcements. Indeed each of these figures may be out by a factor of 10 or even more. However, it is clear that there is significant optimism that an EFLOPs system will be feasible by 2020 and that these systems will contain millions of cores.
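These speculative figures can at least be checked for internal consistency using only the numbers quoted above: 1000 cores at 0.5 TFLOPs each, with 10% active, gives 50 TFLOPs per processor, and 20,000 such processors give 1 EFLOPs.

```python
cores_per_processor = 1000
tflops_per_core = 0.5
active_fraction = 0.10
processors = 20_000

processor_tflops = cores_per_processor * tflops_per_core * active_fraction
system_eflops = processor_tflops * processors / 1e6   # 1 EFLOPs = 1,000,000 TFLOPs
total_cores = cores_per_processor * processors

print(f"Per-processor performance: {processor_tflops:.0f} TFLOPs")
print(f"System performance:        {system_eflops:.1f} EFLOPs")
print(f"Total cores in the system: {total_cores:,}")
```

The same arithmetic also confirms the claim that such a system would contain millions of cores: 20 million in this particular scenario.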
What challenges lie ahead in the development and use of EFLOPs systems? The development of effective memory models and interfaces is a clear priority. The development of effective programming methodologies, languages and new algorithms able to harness these massively parallel, heterogeneous, multicore systems is essential if such systems are going to be usable. These are difficult issues which will need to be tackled fully if the relentless increase in computer performance is to be maintained. The economic and social implications of maintaining this rate of increase are highly significant and are discussed later in this section.
From the earlier discussions, we can see that the performance of a mobile device lags around 20 years behind that of the most powerful supercomputer, equivalent to a performance factor of around one million (give or take another factor of 10). That is to say, by 2020 we can expect mobile devices (laptops, tablets and smart phones) with a performance of around 1 TFLOPs. A standard laptop with a standard GPU already has a performance of 50 GFLOPs, and such a system can even now be fitted with a high-performance GPU card to raise its performance to 1 TFLOPs. The GPU in a typical tablet or smart phone currently performs at around 5 GFLOPs. All this supports the conclusion that by 2020 we will have hand-held mobile devices capable of several hundred GFLOPs or more. The interesting question is to what use this performance will be put.
To begin to answer this question, we have to look at the whole spectrum of computing. Improved network connectivity and the development of Cloud computing and of data-intensive computing, combined with very powerful, ubiquitous hand-held mobile devices, will enable a whole new generation of applications. Highly sophisticated computer games and the wider use of crowdsourcing are obvious examples. These applications will use not only the power of the hand-held device but will also interact seamlessly with Cloud-based computers, including state-of-the-art supercomputers and data-intensive computers. This seamless interaction will create many new commercial opportunities with significant societal and economic impacts. It will create a new market for services, many of them HPC-based, and it will enable as yet unimagined new applications.
Conclusions
The power of all types of computing is increasing rapidly, more than doubling every two years. By 2020, it is anticipated that supercomputers will have a performance of around 1 EFLOPs, desktop systems a performance of up to 100 TFLOPs and hand-held devices a performance of several hundred GFLOPs. These figures are subject to a potential error of an order of magnitude, but are certainly supported by recent history. What is also clear is that increases in computer processing power will now come through increased parallelism, and this will require significant changes in the ways in which computers are programmed.
This availability of significant computing power, combined with improved network connectivity, will enable a whole new generation of applications interacting seamlessly with Cloud-based computers, including state-of-the-art supercomputers. These new applications will range from highly sophisticated computer games to completely new services and, as yet, unimagined applications.
Many challenges lie ahead. The development of effective memory models and interfaces is a clear priority. The development of effective programming methodologies and algorithms able to harness the new massively parallel, heterogeneous, multicore systems is essential. If these problems can be solved, then significant economic and societal opportunities lie ahead through the exploitation of the multicore revolution.
Dr Francis Wray is a consultant who has been involved in HPC for over 25 years. He is a Visiting Professor in the Faculty of Computing, Information Systems and Mathematics at Kingston University.
[1] 1 MFLOP = 106 FLOPS (floating point operations per second); 1 GFLOP = 109 FLOPS; 1 TFLOP = 1012 FLOPS; 1 PFLOP = 1015 FLOPS; 1EFLOP = 1018 FLOPS; 1 ZFLOP = 1021 FLOPS.
[2] http://www.isgtw.org/visualization/why-advanced-computing-matters
A Brief Future of Computing
Introduction
Over the past 30 years, computing has come to play a significant role in the way we conduct our lives. The development of PCs, enabling applications such a word processing and spreadsheets, the availability of the internet, bringing with it search engines, e-commerce, voice-over-IP and email, games consoles, and the emergence of mobile computing through laptops, tablets and smart phones, have changed forever the ways in which we interact with our family, friends and surroundings and conduct our lives. All this has happened in a very short time and with an increasing rate of change.
As a simple indicator of this change, consider the power available in a laptop. We have chosen this simply because other mobile devices such as smart phones and tablets did not exist 20 years ago. In 1990, the first laptops typically had a 10 MHz processor, 4 MB of memory and a floating-point capability of 1 MFLOPs [1]. By the mid 1990s processor frequencies had risen to 50 MHz, memory to 32 MB and floating-point capability to 50 MFLOPs. In 2012, a typical laptop has a processor comprising two cores clocking at a frequency of 2.5 GHz, a memory of 4 Gbytes and a floating-point capability of 50 GFLOPs. In little more than 20 years, the capabilities of a laptop to process data (or play games) have increased more than a thousand-fold. This processing capability combined with a similar increase in networking capability (from the 14kb/s modem to the 20Mb/s broadband) has created unprecedented opportunities for innovation and societal change. In no other aspect of society have we seen such a rapid development.
Compare this with the world of supercomputing. In June 1993, the world's most powerful supercomputer had a floating-point capability of 60 GFLOPs. In November 2011, the world's most powerful supercomputer had a floating-point capability of 11 PFLOPs, some 170,000 times greater. There are several things to note here. Firstly and most remarkably, in 1990 the most powerful supercomputer in the world would have had a performance of less than that of a present-day laptop. Secondly, the performance of supercomputers has increased at a somewhat faster rate than that of commodity laptops. This is not surprising because both types of computer now use similar components, but the number of processors in a supercomputer is also increasing; a few thousands in the early 90s to hundreds of thousands now. Finally, whilst the social impact of computing is clear for all to see, the industrial, and less visible, impact of supercomputing has also been very significant in areas ranging from the design of drugs to that of complete aircraft. In terms of economic effect it is clear that "The country that out-computes will be the one that out-competes." [2]
Although, the purpose of this article is not to follow social trends and the influence computing has exerted upon them, these simple statistics show the synergy between the development of personal computing and that of supercomputing. In what follows, we will focus on the historical development of supercomputing, but bear in mind its symbiotic relationship to mobile computing. In drawing some conclusions about the future of supercomputing, we shall inevitably infer some aspects of the development of the influence computing is exerting more and more on our daily lives.
The development of supercomputing (the early days)
The early days of supercomputing were characterised by custom devices and exotic technologies. The first recognised supercomputer, the Cray-1, was installed at Los Alamos National Laboratory in 1976. It had a peak floating point performance of 160 MFLOPs, used a custom processor and was cooled by liquid Freon.
This was followed by a series of developments not only by Cray, but also by Fujitsu, Hitachi and NEC, who entered the supercomputer market with offerings again based on custom processors and state-of-the-art cooling systems. Examples include the Fujitsu VP-200, announced in July 1982 with a peak performance of 500 MFLOPs; the Hitachi HITAC S-810, announced in August 1982 with a peak performance of 630 MFLOPs; and the NEC SX-1 vector supercomputer, announced in April 1983 with a peak performance of 570 MFLOPs, the first in a line of supercomputers leading to the SX-9, announced in 2008. The SX-1 was announced at the same time as the SX-2, which had a peak performance of 1.3 GFLOPs.
In the late 1980s, a series of massively parallel computers using large numbers of commodity processors entered the high-performance computing market. These included systems from Thinking Machines, Intel, nCube, MasPar and Meiko Scientific. In 1993, Cray announced its T3D system comprising up to 2048 DEC Alpha processors connected by a high-speed network. The potential performance of the 2048-processor system was 200 GFLOPs, although the largest system ever sold had 1024 processors and a performance of 100 GFLOPs. This was a landmark announcement, which heralded the dominant position of commodity processors in computer systems ranging from mobile devices through to top-of-the-range supercomputers. From then on, the power of supercomputers would be determined by the performance of an individual processor and by the number that could be integrated into a single computer, limited only by power, cooling, volume, networking and financial constraints.
The invasion of the killer micros
From 1993, when Cray announced the T3D, it was easy to see how all computers, supercomputers included, would get faster and faster. Moore's Law told us that every two years or so the number of transistors in a given area of silicon would double. If these transistors could be used effectively, then the performance of individual processors would increase, as would that of systems containing several such processors. When Cray introduced the T3D in 1993, it was taking advantage of a trend which had already started in the 70s, continued through the 80s and would continue until around 2005. This trend, simply put, was that single processors got faster through increased clock speeds and through the use of the increased transistor counts dictated by Moore's Law. In the 70s and 80s, each chip generation made single-threaded code execute faster through a combination of increased clock speed and the use of the additional transistors to add a major feature to the chip, such as a floating-point unit, out-of-order execution or pipelining. In the 90s and early 2000s, each chip generation made single-threaded code run faster through a combination of increased clock speed and the use of the additional transistors to add an increasing number of progressively smaller features. By 2005, Moore's Law still applied, but this trend had hit a wall.
What had stopped this trend? There were two factors. The first was heat dissipation, which rises with clock frequency and has effectively limited clock speeds to less than 3 GHz. The second was that there were no more features left to add that could realistically speed up single-threaded code. Nevertheless, there remained a way to make use of the still-increasing number of transistors dictated by Moore's Law, and that was to put more than one processor core on each chip. This marked the start of the multicore revolution, which we shall describe in the next section.
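The link between clock frequency and heat can be made a little more concrete. The dynamic power of a CMOS chip scales roughly as P = a·C·V²·f, and pushing the frequency up has historically also required a higher supply voltage, so power grows faster than linearly with clock speed. The short sketch below illustrates this scaling; the activity factor, switched capacitance and voltage-frequency relationship are illustrative assumptions, not data for any real processor.

/* Illustrative sketch of dynamic CMOS power, P = a * C * V^2 * f.
   The activity factor, switched capacitance and the assumed rise of
   supply voltage with frequency are invented for illustration only. */
#include <stdio.h>

int main(void)
{
    const double a = 0.2;      /* assumed switching activity factor     */
    const double C = 1.5e-7;   /* assumed switched capacitance (farads) */
    double f;

    for (f = 1.0e9; f <= 5.0e9; f += 1.0e9) {
        double V = 0.8 + 0.1 * (f / 1.0e9);   /* assumed voltage scaling */
        double P = a * C * V * V * f;
        printf("f = %.0f GHz  V = %.1f V  P = %5.1f W\n", f / 1.0e9, V, P);
    }
    return 0;
}

Even with these invented numbers the trend is clear: doubling the clock rate more than doubles the power, which is why frequencies stalled while transistor budgets kept growing.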
The multicore revolution
As we saw in the previous section, by around 2005 it was no longer realistic to try to speed up single processors. Instead, the obvious step was to put increasing numbers of processor cores onto a single chip. This trend is now well established: most laptops have dual-core processors, servers have processors with four or more cores, and Intel have recently announced a processor with around 50 cores. A key feature of these cores is that they are all identical and can be programmed with a homogeneous programming model, at least within the confines of a single chip.
How long can this new trend continue? The short answer is that it is already being superseded by a move to heterogeneous multicore processors. Before we consider such devices, it is useful to look at the pros and cons of a homogeneous, multicore processor.
The principal advantage is that the programming model for such a device is relatively simple because each core has the same instruction set and capabilities as every other core. Note the use of "relatively" because programming parallel systems comprising multiple cores can be far from simple in some cases. Clearly this type of processor is simpler to design because a single core can be replicated to fill the available area of silicon. Although the computing power of multicore devices can continue to increase rapidly by increasing the number of cores, this potential power is harder to exploit because applications need to be converted to run on parallel systems. Nevertheless, at least for small numbers of cores, support is available from proprietary parallel runtimes, profilers and debuggers.
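To give a flavour of that "relatively simple" homogeneous programming model, the minimal sketch below uses OpenMP, a widely used shared-memory runtime, to spread a loop across identical cores. The array size and the computation itself are arbitrary choices for illustration; the point is that a single loop, with one directive, can be divided among however many identical cores the chip provides.

/* Minimal sketch of the homogeneous shared-memory model using OpenMP.
   Every core has the same instruction set, so one loop can simply be
   shared among them.  The array size and the dot product are arbitrary. */
#include <stdio.h>
#include <omp.h>

#define N 1000000

static double x[N], y[N];

int main(void)
{
    double sum = 0.0;
    int i;

    for (i = 0; i < N; i++) {
        x[i] = i;
        y[i] = 2.0 * i;
    }

    /* The iterations are divided across identical cores; the reduction
       clause combines each core's partial sum at the end. */
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < N; i++)
        sum += x[i] * y[i];

    printf("dot product = %e using up to %d threads\n",
           sum, omp_get_max_threads());
    return 0;
}

Compiled with, for example, gcc -fopenmp, the same code runs unchanged on two cores or fifty; the difficulty comes in restructuring real applications so that their work can be expressed in this form.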
A disadvantage of the homogeneous multicore approach concerns backwards compatibility: new multicore processors need to support legacy software, and in some cases this means providing redundant capability and replicating it across all cores. In other words, this approach does not make the most efficient use of the available transistors. A further difficulty, which applies to all multicore devices, homogeneous or heterogeneous, is that of memory models. The more cores in a device, the more complicated the on-chip memory system needs to be to give each core a consistent view of, and ready access to, on-chip data. Of course, this is an architectural choice. It would be perfectly possible for each core to have its own local memory and for cores to exchange data by passing messages. However, at least for the time being, manufacturers have chosen not to implement this paradigm in their mainstream processors, although they may be forced to review their memory models as core counts increase. Finally, there is the issue of memory-interface bandwidth, which also applies to both heterogeneous and homogeneous multicore devices. Quite simply, the more cores there are on a device, the faster the external memory interface needs to be to keep all those cores supplied with data. This remains an unsolved problem, but one which can be circumvented for the time being while the number of cores on a device remains sufficiently small. Nevertheless, it is a fundamental barrier to the development of processors with very large numbers of cores, both homogeneous and heterogeneous. Ultimately, new architectural choices, combined with the development of new algorithms, will be needed to address this issue.
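To give a feel for the memory-interface problem, the back-of-the-envelope sketch below estimates the external bandwidth needed to keep a chip busy on a simple streaming calculation as the core count doubles. The per-core speed and the bytes moved per floating-point operation are assumptions chosen purely for illustration.

/* Back-of-the-envelope sketch: external memory bandwidth needed to keep
   a multicore chip fed on a streaming calculation.  The per-core rate
   and bytes moved per operation are illustrative assumptions. */
#include <stdio.h>

int main(void)
{
    const double gflops_per_core = 10.0;  /* assumed per-core rate          */
    const double bytes_per_flop  = 4.0;   /* assumed for a streaming kernel */
    int cores;

    for (cores = 2; cores <= 64; cores *= 2) {
        double gbytes_per_s = cores * gflops_per_core * bytes_per_flop;
        printf("%3d cores need roughly %5.0f GB/s from external memory\n",
               cores, gbytes_per_s);
    }
    return 0;
}

Even at these modest assumed rates, a few tens of cores already demand far more bandwidth than a typical 2012 memory interface of a few tens of GB/s can provide, which is why the problem is described above as circumvented rather than solved.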
Heterogeneous multicore processors
Just seven years after the start of the homogeneous multicore era, we are now confronted with compelling reasons to adopt heterogeneous multicore processors. This approach was pioneered in the IBM Cell processor used in the PlayStation 3. A further example of such devices is the AMD Fusion, which combines multiple Central Processing Units (CPUs) with a Graphics Processing Unit (GPU); in particular, the A10 series combines four CPU cores with a GPU and is targeted at the HPC marketplace. Another example is the NVIDIA Tegra, which combines a dual-core ARM Cortex-A9 processor with a GeForce GPU. Although targeted at the mobile computing marketplace, it is easy to see how this technology could be further developed and applied to HPC. Yet another example is ARM's big.LITTLE processing, which combines high-performance cores with low-power cores, enabling an application to choose dynamically the best processor configuration and so minimise power consumption. Finally, Intel's Many Integrated Core (MIC) architecture combines heavy-duty and lightweight cores on the same die to optimise processing capabilities.
What are the reasons for moving to heterogeneous multicore processors? It is all a question of what we are trying to optimise. Up to 2005, the objective was to get the most performance out of a single processor, regardless of anything else. At the start of the homogeneous multicore era, this objective changed to one of getting the maximum performance per unit area of silicon. Now the objective has become one of getting the maximum performance per Joule, given that transistor count is no longer an issue thanks to the relentless advance of Moore's Law. The reasons for this latest change of objective are clear. At the top end, supercomputers are consuming too much power (several megawatts) and viable systems now need to maximise compute per unit of energy. At the mobile computing end, the same constraint applies. The best way to satisfy this constraint is to perform calculations on the cores that use the least energy for those particular calculations. At the moment, the choice is limited to CPUs, GPUs and Field Programmable Gate Arrays (FPGAs), but other specialist processing cores will be deployed as heterogeneous devices mature. Indeed, chips that integrate several diverse functional units, which can be turned off and on as needed, are already being considered.
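Some simple arithmetic shows how demanding the performance-per-Joule objective is at the top end. The sketch below assumes a 1 EFLOPs machine and a power budget of 20 MW (an often-quoted target rather than a figure from this article) and works out the required efficiency and the energy available for each floating-point operation.

/* Sketch of the performance-per-Joule constraint for a 1 EFLOPs system.
   The 20 MW power budget is an assumed target, not a figure from the text. */
#include <stdio.h>

int main(void)
{
    const double target_flops = 1.0e18;  /* 1 EFLOPs                     */
    const double power_watts  = 20.0e6;  /* assumed whole-machine budget */

    double flops_per_watt = target_flops / power_watts;
    double pj_per_flop    = 1.0e12 * power_watts / target_flops;

    printf("Required efficiency: %.0f GFLOPs per watt\n", flops_per_watt / 1.0e9);
    printf("Energy available:    %.0f pJ per floating-point operation\n",
           pj_per_flop);
    return 0;
}

Fifty GFLOPs per watt, or around 20 pJ per operation, is far beyond what 2012-era processors deliver, which is precisely why energy efficiency has displaced raw speed as the design objective.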
Data-intensive computing
Finally, we need to add data-intensive computing to the mix. This will deploy high-performance systems optimised for the processing of data using integer and logical operations rather than for numerically intensive applications. Such systems will sit comfortably within the heterogeneous multicore landscape. They will have access to highly distributed data via the Cloud and the Internet and may even make extensive use of data from mobile computing devices. This will be datamining "on steroids". It will create the opportunity for the unprecedented discovery of knowledge from data, enabling as yet unimagined applications which may have profound effects on commerce and society.
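As a toy illustration of the kind of workload involved, the sketch below counts, in parallel, the records in a synthetic data set that satisfy a simple predicate, using only integer and logical operations. The record format and the predicate are invented for the example; real data-intensive systems would of course operate on distributed data at a vastly larger scale.

/* Toy sketch of a data-intensive kernel: integer and logical tests over
   many records, shared across cores with OpenMP.  The synthetic records
   and the predicate are invented purely for illustration. */
#include <stdio.h>
#include <omp.h>

#define NRECORDS 10000000

static unsigned int record[NRECORDS];

int main(void)
{
    long matches = 0;
    long i;

    for (i = 0; i < NRECORDS; i++)                 /* synthetic data */
        record[i] = (unsigned int)i * 2654435761u;

    /* Count records whose low byte is zero and whose top bit is set:
       integer and logical operations only, no floating point. */
    #pragma omp parallel for reduction(+:matches)
    for (i = 0; i < NRECORDS; i++)
        if ((record[i] & 0xffu) == 0 && (record[i] & 0x80000000u) != 0)
            matches++;

    printf("%ld of %d records match\n", matches, NRECORDS);
    return 0;
}

The same pattern, simple per-record tests applied across enormous volumes of data, is what data-intensive systems are designed to run at scale.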
The systems of the future
Based on current rates of progress, it is projected that exaflops (EFLOPs) systems will be available in 2019 and that zettaflops (ZFLOPs) systems will be available by 2030. Cray has already announced plans to build a 1 EFLOPs system before 2020. Somewhat astonishingly, in India, ISRO and the Indian Institute of Science have plans to build a 132.8 EFLOPs supercomputer by 2017.
It is interesting to speculate on the processor and core count of EFLOPs systems. Focusing on the 2020 horizon for an EFLOPs system, we might anticipate the performance of a single processor to be 50 TFLOPs, for it to comprise 1000 heterogeneous cores each capable of 0.5 TFLOPs, but with only 10% of the cores running at any one time, and for a system to comprise 20,000 such processors. Such numbers are, of course, speculation and are not supported by any firm technological announcements. Indeed each of these figures may be out by a factor of 10 or even more. However, it is clear that there is significant optimism that an EFLOPs system will be feasible by 2020 and that these systems will contain millions of cores.
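Purely as a consistency check on this speculation, the short sketch below multiplies the figures out. Every number is taken from the paragraph above and carries the same order-of-magnitude health warning.

/* Consistency check on the speculative exascale figures quoted above:
   20,000 processors, each with 1000 heterogeneous cores of 0.5 TFLOPs,
   of which only 10% run at any one time.  All figures are speculative. */
#include <stdio.h>

int main(void)
{
    const double processors      = 20000.0;
    const double cores_per_proc  = 1000.0;
    const double tflops_per_core = 0.5;
    const double duty_cycle      = 0.10;   /* fraction of cores active */

    double proc_tflops   = cores_per_proc * tflops_per_core * duty_cycle;
    double system_eflops = processors * proc_tflops / 1.0e6;

    printf("Per-processor performance: %.0f TFLOPs\n", proc_tflops);
    printf("System performance:        %.1f EFLOPs\n", system_eflops);
    printf("Total cores:               %.0f million\n",
           processors * cores_per_proc / 1.0e6);
    return 0;
}

The figures multiply out to 1 EFLOPs and some 20 million cores, consistent with the expectation of systems containing millions of cores.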
What challenges lie ahead in the development and use of EFLOPs systems? The development of effective memory models and interfaces is a clear priority. The development of effective programming methodologies and languages, and of new algorithms able to harness these massively parallel, heterogeneous, multicore systems, is essential if such systems are to be usable. These are difficult issues which will need to be tackled fully if the relentless increase in computer performance is to be maintained. The economic and social implications of maintaining this rate of increase are highly significant and are discussed below.
From the earlier discussion, we can see that the performance of a mobile device lags around 20 years behind that of the most powerful supercomputer, equivalent to a performance factor of around one million (give or take a factor of 10). That is to say that by 2020 we can expect mobile devices (laptops, tablets and smart phones) with a performance of around 1 TFLOPs. A standard laptop with a standard GPU already has a performance of 50 GFLOPs, and such a system can even now be fitted with a high-performance GPU card to raise its performance to 1 TFLOPs. The GPU in a typical tablet or smart phone currently delivers around 5 GFLOPs. All this supports the conclusion that by 2020 we will have hand-held mobile devices capable of several hundred GFLOPs or more. The interesting question is to what use this performance will be put.
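Before turning to that question, it is worth checking these projections against the figures quoted earlier in the article. The sketch below uses only those figures, together with the rule of thumb of a doubling in performance every two years, to compare a 2012 laptop with the 2011 top supercomputer and to project the laptop forward to 2020.

/* Sketch using only figures quoted in the article: the 2011 supercomputer
   at 11 PFLOPs, the 2012 laptop at 50 GFLOPs, and a doubling of laptop
   performance roughly every two years. */
#include <stdio.h>
#include <math.h>

int main(void)
{
    const double super_2011  = 11.0e15;   /* FLOPS */
    const double laptop_2012 = 50.0e9;    /* FLOPS */

    printf("Supercomputer-to-laptop gap: about %.0fx\n",
           super_2011 / laptop_2012);

    /* Project the laptop forward to 2020: four doublings. */
    double laptop_2020 = laptop_2012 * pow(2.0, (2020.0 - 2012.0) / 2.0);
    printf("Projected 2020 laptop: about %.1f TFLOPs\n", laptop_2020 / 1.0e12);
    return 0;
}

A gap of around 200,000 and a projection of just under 1 TFLOPs are both comfortably within the order-of-magnitude tolerance claimed above.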
To begin to answer this question, we have to look at the whole spectrum of computing. Improved network connectivity and the development of Cloud computing and of data-intensive computing, combined with very powerful, ubiquitous hand-held mobile devices, will enable a whole new generation of applications. Highly sophisticated computer games and the wider use of crowd-sourcing are obvious examples. These applications will not only use the power of the hand-held device, but will also be able to interact seamlessly with Cloud-based computers, including state-of-the-art supercomputers and data-intensive computers. This seamless interaction will create many new commercial opportunities with significant societal and economic impacts. It will create a new market for services, many of them HPC-based, and it will enable new, as yet unimagined, applications.
Conclusions
The power of all types of computing is increasing rapidly, more than doubling every two years. By 2020, it is anticipated that supercomputers will have a performance of around 1 EFLOPs, desktop systems will have a performance of up to 100 TFLOPs, and hand-held devices a performance of several hundred GFLOPs. These figures are subject to a potential error of an order of magnitude, but they are supported by recent history. What is also clear is that increases in computer processing power will now come through increased parallelism, and this will require significant changes in the ways in which computers are programmed.
This availability of significant computing power, combined with improved network connectivity, will enable a whole new generation of applications interacting seamlessly with Cloud-based computers, including state-of-the-art supercomputers. These new applications will range from highly sophisticated computer games to completely new services and as yet unimagined applications.
Many challenges lie ahead. The development of effective memory models and interfaces is a clear priority. The development of effective programming methodologies and algorithms able to harness the new massively parallel, heterogeneous, multicore systems is essential. If these problems can be solved, then significant economic and societal opportunities lie ahead through the exploitation of the multicore revolution.
Dr Francis Wray is a consultant who has been involved in HPC for over 25 years. He is a Visiting Professor at the Faculty of Computing, Information Systems and Mathematics at the University of Kingston.
[1] 1 MFLOPs = 10^6 FLOPS (floating-point operations per second); 1 GFLOPs = 10^9 FLOPS; 1 TFLOPs = 10^12 FLOPS; 1 PFLOPs = 10^15 FLOPS; 1 EFLOPs = 10^18 FLOPS; 1 ZFLOPs = 10^21 FLOPS.
[2] http://www.isgtw.org/visualization/why-advanced-computing-matters
© 2012 The University of Edinburgh