Asymmetric Multiprocessing with Heterogeneous Architectures: Use the Best Core for the Job

by Kurt Shuler, On Sep 24, 2013

An example of AMP is a mobile phone’s modem baseband SoC, containing an ARM processor and a DSP to handle control and signal processing, respectively. AMP architectures are also found in mobile phone application processors, which have multiple CPU cores and separate discrete graphics cores, video cores, audio cores and imaging cores. Heterogeneous architectures also dominate in most embedded consumer applications, such as digital TVs, set-top boxes, and automotive infotainment.

Figure 1. The Qualcomm Snapdragon 800 is an example of system-on-chip that implements an asymmetric processing (AMP) architecture with multiple processing units optimized for different functions. Source: Qualcomm.

Heat and power drive architecture decisions

Mobile applications face significant design constraints because of battery size and heat dissipation. As a result, processor designers are forced to use “the best core for the job.” So architectures in mobility have always been created from a baseline expectation of heterogeneous core AMP.

Server and PC chips have relatively unlimited power consumption and heat dissipation capabilities, making an SMP architecture tolerable. In these applications, it is often easier to add more cores of the same type, connect them using cache coherency, and reuse the legacy software to run on top. Comparatively little attention has been paid to heat dissipation and power consumption.

But PCs are becoming smaller and mobile. And server farms are eyeing power consumption as well, forcing designers to reconsider SMP architectures. For example, for server farms that power the likes of Google and Facebook, power consumption and heat dissipation have become huge cost and environmental issues. And in the PC space, we have run into a “gigahertz wall” where the only way to have a step function increase in performance is to have different cores optimized for different workload types.

AMP architectures struggle to break into PC/server applications

Why don’t AMP architectures dominate PC and server applications?

Because they are hard to implement!

In mobile designs, each heterogeneous processing core, whether graphics, audio, DSP, etc., typically has a custom firmware and software stack associated with it. This software must be integrated to communicate with the CPU cores’ operating system, requiring coding work in the OS hardware abstraction layer and drivers. In addition, these heterogeneous cores do not have a single view of system memory, so complicated synchronization schemes are usually implemented in hardware and software. Context switching and preemption are difficult to implement. Adding to the challenge, each of these cores requires an expert programmer, conversant in a particular core’s instruction set and tool chains, to code it.

These barriers have forced AMP to remain in the mobile and consumer electronics realm, which is closed to low-level, close-to-the-hardware software developers. Alternatively, SMP has flourished in the wide-open world of PCs and servers, aided by the ease of programming.

Heterogeneous system architectures (HSA) can span the chasm between mobile/ consumer applications and PC/ server applications, easing the design burden while delivering performance, scalability, improved heat dissipation and reduced power consumption.

Recently, a number of companies, including AMD, ARM, Imagination, MediaTek, Qualcomm, Samsung and Texas Instruments, founded the HSA Foundation. HSA defines interfaces for parallel computation utilizing CPU, GPU, and other programmable and fixed-function devices, and support for a diverse set of high-level programming languages, thereby creating the next foundation in general-purpose computing.

Its goals are to:

  • Make heterogeneous programming easy and a first-class pervasive complement to CPU computing.
  • Continue to increase the power efficiency of heterogeneous systems (AMP), keeping it the platform of choice from smartphones to the cloud.
  • Bring to market strong development solutions (tools, libraries, OS run-times) to drive innovative advanced content and applications.
  • Foster growth of heterogeneous computing talent through HSA developer training and academic programs to drive both learning and innovation.

The HSA approach requires a technical framework and architecture

There are several issues that must be addressed to successfully bring these two worlds together:

  • Unified programming model – Today, CPU and GPU (or other accelerator) cores are programmed separately, with the accelerator treated as a remote processor. To make the maximum use of hardware resources while balancing ease of programming, heterogeneous architectures should allow developers to target the CPU or GPU by writing in task-parallel languages, like the ones they use today when writing for multicore CPUs.
  • Unified address space – HSA supports virtual address translation amongst the heterogeneous cores with an HSA-specific memory management unit (HMMU). HSA compute engines will use the same pageable virtual address space as used by CPUs today.
  • Queuing – CPUs, GPUs and other cores can queue tasks to each other and to themselves through an HSA run-time. Queuing can be managed in hardware to avoid OS system calls and enable very low latency communication between cores.
  • Preemption and context switching – HSA enables job preemption, job scheduling and fault handling capabilities to overcome potential problems created by rogue or faulted processes.

HSA Foundation provides key tools for unlocking heterogeneous programming

Today, CPUs and GPUs do not share a common view of system memory, requiring an application to explicitly copy data between the two devices. In addition, an application running on the CPU that wants to add work to the GPU’s queue must execute system calls that communicate through the CPU operating system’s device driver stack, and then communicate with a separate scheduler that manages the GPU’s work. This adds significant run-time latency, in addition to being very difficult to program.

HSA addresses the need for easy software programming of GPUs to take advantage of their unique capability to crunch parallel workloads much more efficiently than x86 or ARM CPUs.

HSA solution stack: Abstracting away hardware specifics

To enable easier programming, HSA allows developers to program at a higher abstraction level using mainstream programming languages and additional libraries. This HSA solution stack includes several components.

The key to enabling one language for heterogeneous core programming is to have an intermediate run-time layer that abstracts hardware specifics away from the software developer, leaving the hardware-specific coding to be done once by the hardware vendor or IP provider. The core of this intermediate layer is the HSA Intermediate Language or “HSAIL.”

Figure 2. The HSA Intermediate Language (HSAIL) is an intermediate run-time layer that abstracts hardware specifics away from the software developer. Source: AMD.

The HSA run-time stack is created by compiling a high-level language such as C++ with the HSA compilation stack. HSA’s compilation stack is based on the LLVM infrastructure, which is also used in OpenCL from the Khronos Group.

Creation of HSAIL can occur prior to run-time or during run-time. Here are two examples:

  1. The OpenCL run-time includes the compiler stack and is called at run-time to execute a program that is already in data-parallel form.
  2. Alternatively, Microsoft’s C++ AMP (C++ Accelerated Massive Parallelism) uses the compiler stack during program compilation rather than execution. The C++ AMP compiler extracts data-parallel code sections and runs them through the HSA compiler stack, and passes non-parallel code through the normal compilation path.

Figure 3 shows the HSA compilation stack, where programming code is compiled into HSAIL using the LLVM compilation infrastructure:

Figure 3. The HSA compilation stack creates the HSA Intermediate Language (HSAIL) prior to or during run-time. Source: AMD.

The hardware-specific HSA Finalizer is a key component

A key role is played by the hardware-specific “finalizer” which converts HSAIL to the computing unit’s native instruction set. Hardware and IP vendors are responsible for creating finalizers that support their hardware. The finalizer is lightweight and can be run at compile time, installation time or run-time depending on requirements.

Figure 4 shows the HSAIL and its path through the HSA run-time stack:

Figure 4. The hardware-specific components of the HSA run-time stack are the HSA Finalizer and the hardware driver. Source: AMD.

The HSA Finalizer is the point at which the specifics of different heterogeneous computing units are addressed. Initial HSA implementations will most likely support GPU compute with finalizers from GPU vendors such as AMD, Imagination, ARM, and Qualcomm. The quality and features of each vendor’s HSA Finalizer will help determine how software developers take advantage of each hardware element’s computing capabilities.

Benefiting from heterogeneous architectures requires smart scheduling

In addition to GPUs, many existing heterogeneous architectures have additional discrete processing units for functions such as audio (digital signal processing or stream processing), image and video processing (SIMD frame processing), and security. As HSA matures, hardware and IP vendors creating these processing units may want to enable HSA programmability on their hardware by creating hardware-specific finalizers.

Having multiple heterogeneous processing units will complicate workload scheduling from a system perspective. The harsh reality is that existing workload scheduling and OS scheduling algorithms are relatively simple and generally only take into account local activity on a processing unit or a cluster of homogeneous processing units (see the Linux Completely Fair Scheduler for one example of how scheduling is implemented).

Interconnect fabric-assisted scheduling is required to implement scalable HSA systems

Existing OS and middleware scheduling algorithms do not take into account the existing traffic throughout the system, nor a view into other processing units. This lack of a global perspective for scheduling virtually guarantees there will be contention and stalling as processing units wait for access to precious system resources, especially the DRAM. It’s like looking out the front door of your house to determine how bad the traffic will be on your commute to work: You are missing very relevant information that could help you determine the optimal route to take.

Probing current run-time data flows at critical points throughout a system’s SoC interconnect fabric can provide critical information to enhance workload scheduling. This information can then be used to assign priorities to workloads, and workloads to processing units. These priorities and assignments can be optimized based on performance requirements or power consumption requirements, as required for a particular use case. As heterogeneous processing becomes the norm, and more processing units are added to a system, this type of interconnect-assisted scheduling will be required.

In other words, the hardware interconnect is a key enabler to putting the heterogeneous into HSA.



By Kurt Shuler

Kurt Shuler is Vice President of Marketing, Arteris, Inc.