The Future of Computing (Heterogeneous Architecture – CPUs, GPUs, FPGAs, ASICs, …)

Hi, thanks for tuning in to Singularity Prosperity. This video is the seventh in a multi-part series discussing computing and the final one discussing classical computing. In this video, we’ll be discussing what heterogeneous system architecture is and how it’s going to shape the future of classical computing! [Music]
To summarize what we’ve discussed in previous videos, CPUs are general-purpose devices designed to execute and manage complex instructions, while GPUs are massively parallel computing devices designed to execute streams of calculations as fast as possible. This translates to current architecture with the CPU managing most of the computer’s operation, such as the operating system, input/output devices and various other tasks, while the GPU does the heavy lifting in terms of computation. For the longest time we were making the CPU execute and manage all the tasks, leaving the GPU only for graphics and simulation purposes. This was and still is extremely wasteful in terms of computation: the CPU already has to deal with OS overhead along with various other issues that carry a performance penalty. With CPU clock rates plateauing, the miniaturization of the transistor coming to an end and additional CPU cores no longer yielding a significant boost in performance, and thanks to the increasing adoption of GPUs in general computing and the increasing popularity of parallel platforms like CUDA, this is beginning to change. This yields a new type of computing architecture called HSA, heterogeneous system architecture. HSA is where multiple compute devices work in unison instead of being segmented in operation. Another huge factor in HSA paradigms taking off is new and improving memory and data standards. For more information on the CPU, GPU and new innovations in memory and data, and to gain deeper insight into what I discussed here, be sure to check out the previous videos in this computing series.
Now what we haven’t discussed in those previous videos is FPGAs and ASICs, two other types of computing devices that will play a crucial role in HSA. The field-programmable gate array, FPGA, is a special type of computing device. Unlike other computing devices, its hardware can be reprogrammed for specific tasks. To be more clear, other computing devices have fixed hardware, and software is optimized to run on it. FPGAs have reprogrammable hardware, so the hardware can be optimized for the software. Due to this hardware reprogrammability, FPGAs are more expensive and also quite difficult for the average developer or computer enthusiast to work with. However, they allow for massive parallelism and use much less power, which fits perfectly for a variety of needs such as data processing or streaming. Referring back to heterogeneous architecture, FPGAs can be used as accelerators to process data and then send it to the CPU, and when paired in a system with CPUs and GPUs massive improvements can be seen:
Now we already have industry-leading capabilities with our Azure GPU offering, which is fantastic for building trained AI models offline. Okay, but to support live AI services with very low response times at large scale, with great efficiency, better than CPUs, we’ve made a major investment in FPGAs. Now FPGAs are programmable hardware. What that means is that you get the efficiency of hardware but you also get flexibility, because you can change their functionality on the fly. This new architecture that we’ve built effectively embeds an FPGA-based AI supercomputer into our global hyperscale cloud. We get awesome speed, scale and efficiency. It will change what’s possible for AI. Now over the past two years, quietly, we’ve deployed it across our global hyperscale data centers, in 15 countries spanning five continents. Okay, so let’s start with a visual demo of what happens when you add this FPGA technology to one of our cloud servers.
We’re using a special type of neural network, called a convolutional neural net, to recognize the contents of a collection of images. Okay, on the left of the screen what you see is how fast we can classify a set of images using a powerful cloud-based server running on CPUs. On the right, you see what happens when we add a single 30-watt Microsoft-designed FPGA board to the server. This single board turbocharges the server, allowing it to recognize the images significantly faster. It gives the server a huge boost for AI tasks. Okay, now let’s try something a little harder, using a more sophisticated neural network to translate languages. The deep neural network based approach we’re using here is computationally much harder and requires much more compute, but it’s achieving record-setting accuracy in language translation. Okay, so to test the system, let’s see how quickly we can translate a book from one language to another. Now I picked a nice small book for this demo, ‘War and Peace’, it’s about 1440 pages, and we’ll go over to the monitor here and, using 24 high-end CPU cores, we will start translating the book from Russian to English. Okay, now we’ll throw four boards from our FPGA-based supercomputer at the same problem, which uses 1/5 less total power, as you can see… thank you. *applause*
As you saw, just a single FPGA accelerator incorporated in an HSA system can yield significant boosts in performance for artificial intelligence tasks, and similar performance boosts extend to other compute tasks as well. Beyond FPGAs there are also ASICs, application-specific integrated circuits, the most optimized type of computing device. ASICs are fixed in hardware; however, as the name states, they are application specific, meaning both the hardware and software are designed from the ground up to be tightly coupled and optimized for a specific subset of tasks, which they do extremely well. For example, ASICs have seen a lot of use in cryptocurrency mining.
We’ll discuss various types of ASICs in future videos on this channel, such as those on the Internet of Things and other computing applications, for example: tensor processing units, TPUs, for use in AI and Nvidia’s Drive PX card for use in self-driving cars. As a side note, ASICs are the reason Apple phones and laptops are so fast and fluid: the hardware is specifically designed for their devices, as is software that can fully utilize all the hardware resources. The problem with ASICs is that they are significantly more expensive in terms of research, development and implementation. This is why most companies and people opt for generic chipsets. However, with the increasing complexity of problems computers must solve and the coming end of the miniaturization of the transistor, ASICs will see exponentially increasing use in the coming years. [Music]
So, based on what we’ve discussed about heterogeneous architectures, the CPU will manage computational resources, FPGAs will accelerate data processing and GPUs or ASICs will crank out the necessary calculations. In terms of the computational performance this yields, it all comes back to what we’ve talked about over and over again in previous videos in this series: increased parallelism. Now when discussing parallelism in heterogeneous architectures there is another law we must look at, Gustafson’s Law. This law essentially states that as data size increases, the performance boost obtained through parallelization (in other words, the addition of more cores and other hardware and software parallelism) also increases, because the amount of parallel work grows with data size. Simply put, until we run into power issues, we can keep adding more cores, and as long as the problems we give them are sufficiently complex, the cores will be useful. In terms of heterogeneous architecture, this means we can keep adding more compute devices.
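To put rough numbers on Gustafson’s Law, here is a minimal Python sketch. It uses the standard textbook form of the scaled speedup, S(N) = N - s(N - 1), where s is the fraction of the work that stays serial, and sets it next to Amdahl’s Law, which assumes a fixed problem size, for contrast. The function names and the 5% serial fraction are assumptions picked purely for illustration; nothing here was measured on real hardware.

```python
def gustafson_speedup(n_units: int, serial_fraction: float) -> float:
    """Scaled speedup when the problem grows with the number of compute units.

    Standard form of Gustafson's Law: S(N) = N - s * (N - 1).
    """
    return n_units - serial_fraction * (n_units - 1)


def amdahl_speedup(n_units: int, serial_fraction: float) -> float:
    """Speedup for a fixed-size problem (Amdahl's Law), shown for contrast.

    S(N) = 1 / (s + (1 - s) / N), which can never exceed 1 / s.
    """
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_units)


if __name__ == "__main__":
    serial_fraction = 0.05  # hypothetical: 5% of the work cannot be parallelized
    print(f"{'units':>6} {'Gustafson':>10} {'Amdahl':>8}")
    for n_units in (1, 8, 64, 512):
        scaled = gustafson_speedup(n_units, serial_fraction)
        fixed = amdahl_speedup(n_units, serial_fraction)
        print(f"{n_units:>6} {scaled:>10.2f} {fixed:>8.2f}")
```

With a fixed problem size, the Amdahl column flattens out below 1 / 0.05 = 20 no matter how many units are added, while the Gustafson column keeps climbing almost linearly. That is the sense in which more compute devices stay useful as long as the workload grows with them.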
Now luckily, it seems the world has found such a problem: deep learning for use in artificial intelligence, a field of computer science that has now gone mainstream and is growing in popularity every day. Deep learning algorithms utilize large amounts of data and are intrinsically parallel; we’ll explore this much deeper in this channel’s AI series. Deep learning also acts as a positive feedback loop and can propel the field of computing much further. We can see this in technologies such as AI smart caching, and going beyond that, deep learning can help us identify ways to maximize heterogeneous architectures, develop new ASICs and much more to push computing performance forward. So, with heterogeneous architectures as well as increasingly parallel software paradigms that utilize big data, such as deep learning, we’ll see massive increases in performance over the years, well exceeding the expectations of Moore’s Law. If we look at Moore’s Law in terms of heterogeneous architectures, we still have a long way to scale, possibly another 75-plus years of performance increases. To add to this, who knows what other types of ASICs and architecture and software changes are to come during this inflection period in the field of computing.
Now before concluding this video, to highlight the shift in the computing industry to heterogeneous architectures, watch this clip of principal researcher at Microsoft, Kathryn McKinley:
So now what we’re seeing is specialization in hardware, which is FPGAs, specialized processors or combining big and little processors together; a big powerful fast processor with a very energy-efficient processor. So software has had an abstraction that all hardware’s about the same and one thread of execution, so in order to make software port to different versions of crazy hardware or different generations of hardware, as we go through this disruptive period, the software systems are not prepared for this.
So my research is targeting both how you do this as a software system but also programming abstractions that let you trade off quality for energy efficiency, let you reason about the fact that sensor data is not correct and how do you deal with these inaccuracies in a programming model that you don’t need a PhD in statistics or computer science in order to use.
I hope you guys have enjoyed and learned a lot over the past few videos on classical computing. As mentioned earlier in this series, classical computing gives the illusion of parallelism through hardware and software, and this illusion just keeps getting better and better due to increasing performance. In the next videos in this computing series, we’ll cover truly parallel non-classical computers such as quantum and bio computers, as well as new emerging paradigms in computing such as optical and neuromorphic computing. At this point the video has come to a conclusion, and I’d like to thank you for taking the time to watch it. If you enjoyed it, consider supporting me on Patreon to keep this channel growing, and if you want me to elaborate on any of the topics discussed or have any topic suggestions, please leave them in the comments below. Consider subscribing for more content, follow my Medium publication for accompanying blogs and like my Facebook page for more bite-sized chunks of content. This has been Ankur, you’ve been watching Singularity Prosperity and I’ll see you again soon! [Music]
