The Good News and the Bad News: Your New Chip Has Multiple Cores

July 9, 2009

by Arun Subbarao, LynuxWorks

The embedded software engineer faces major issues when moving to a multicore device; fortunately, new RTOS technologies are available to help.

Traditionally, the embedded software developer has run the main control software on a single processor core, using a real-time operating system (RTOS) to separate the control functions. As more software was added and more performance was required, a faster processor was used, with the increase in clock speed typically keeping pace with the growth of the software. In today's world, more performance usually means more cores, so the embedded software developer now has to figure out how to take a uniprocessor system and spread it across multiple cores to realize those performance gains.

For multicore devices that offer identical CPU architectures on each core, a symmetric multiprocessing (SMP) RTOS can be used to spread applications and processes across the cores. Even so, several issues still need to be considered: system initialization, cache and memory coherency, scheduling of threads, and communication between threads running on different cores.

Beyond the issues faced in an SMP system, it is worth looking at some of the benefits in system design that a multicore system makes possible. Some of these benefits can be realized with software virtualization and embedded hypervisors. This introduces the new concept of running not just multiple applications but multiple (possibly different) operating systems on the same hardware, allowing traditionally separate systems to be consolidated onto a single multicore platform.

SMP in the RTOS

In multicore computing, the operating system's job becomes more complex: its design must handle the concurrency issues that arise when several cores share the same memory and devices. The main areas of OS design affected by the presence of multiple cores are concurrency control, interrupt handling and scheduling.

Concurrency in a multicore environment is a key enabler of better performance. The RTOS provides primitives that allow multiple threads of execution, running simultaneously on different cores, to protect shared data. The granularity of this concurrency control matters: coarse locking is simpler but serializes execution, while finer-grained locking lets more work proceed in parallel, so the design of these primitives largely determines how well the RTOS utilizes multiple cores.
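
As a minimal sketch, assuming the POSIX threads API (which many RTOSs provide or approximate), the following example shows a mutex serializing updates to a counter shared by threads that may run concurrently on different cores; the worker function and iteration counts are invented for illustration:

```c
#include <pthread.h>
#include <stdio.h>

/* A mutex ensures that threads running concurrently on different
   cores update shared data consistently. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long shared_counter = 0;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);   /* only one core in the critical section */
        shared_counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld\n", shared_counter); /* always 400000 */
    return 0;
}
```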

The scheduling algorithm plays a key part in harnessing the power of multiple cores. General-purpose schedulers typically maintain a per-CPU queue of ready threads and allocate CPU time from that queue. A real-time system must preserve determinism, so the approach is different: scheduling happens on a global basis, with the highest-priority ready thread running on the first available CPU. Because a thread may therefore migrate between cores, losing its cached working set along the way, global scheduling can lead to higher levels of cache misses. This can be addressed through design optimizations in real-time thread scheduling.
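
The following sketch, invented for illustration rather than taken from any particular RTOS, shows the core idea of global scheduling: an idle CPU always takes the highest-priority ready thread, regardless of where that thread last ran. A real implementation would protect the ready set with a spinlock:

```c
#include <stddef.h>
#include <stdio.h>

struct thread {
    const char *name;
    int         priority;   /* higher number = more urgent */
    int         ready;      /* 1 if runnable */
};

/* Hypothetical global ready set, shared by all CPUs. */
static struct thread ready_set[] = {
    { "logger",  10, 1 },
    { "control", 90, 1 },
    { "comms",   50, 1 },
};

/* Called by an idle CPU: returns the highest-priority ready thread. */
static struct thread *pick_next(void)
{
    struct thread *best = NULL;
    for (size_t i = 0; i < sizeof ready_set / sizeof ready_set[0]; i++)
        if (ready_set[i].ready &&
            (best == NULL || ready_set[i].priority > best->priority))
            best = &ready_set[i];
    if (best)
        best->ready = 0;    /* now running, no longer ready */
    return best;
}

int main(void)
{
    /* Two CPUs become idle: the first gets "control", the second "comms". */
    for (int cpu = 0; cpu < 2; cpu++) {
        struct thread *t = pick_next();
        if (t)
            printf("CPU %d runs %s (priority %d)\n", cpu, t->name, t->priority);
    }
    return 0;
}
```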

One such optimization, known as processor affinity, allows an application to request an “affinity” to a particular processor core. The operating system then schedules the application on its preferred core whenever doing so does not upset overall system scheduling. A more rigid form of processor affinity is processor binding, where a task is always scheduled on the same core. In an RTOS, however, binding may lead to priority inversions: a bound high-priority task must wait for its designated core even while lower-priority tasks run on cores that are free. Operating system design should accommodate processor affinity without degrading real-time determinism and responsiveness, and other key real-time properties such as priority scheduling and interrupt latency must be preserved on multicore architectures as well.
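
By way of example, here is how an affinity request looks with the Linux/glibc extension pthread_setaffinity_np() and its attribute-based variant; a given RTOS will have its own call, but the shape is the same: a thread plus a set of permitted cores.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

/* A latency-sensitive loop that benefits from staying on one core
   with a warm cache. */
static void *hot_loop(void *arg)
{
    (void)arg;
    printf("running on CPU %d\n", sched_getcpu());
    return NULL;
}

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(1, &set);                 /* request core 1 only */

    /* Apply the affinity before the thread starts running. */
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setaffinity_np(&attr, sizeof set, &set);

    pthread_t t;
    pthread_create(&t, &attr, hot_loop, NULL);
    pthread_join(t, NULL);
    pthread_attr_destroy(&attr);
    return 0;
}
```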

An SMP-enabled real-time operating system must schedule tasks dynamically and transparently across processors, balancing the workload over the available cores while preserving the key elements of real-time latency and determinism.

Optimizing Applications for Multicore

Applications written for uniprocessor execution are not necessarily optimized for multicore architectures. Any inherent contention for resources that prevents execution parallelism may result in performance bottlenecks in multicore environments.

Application parallelism depends on the ratio of computation to communication overhead. Computation is the time the CPU spends executing application code; communication overhead is the time the OS spends communicating between cores, passing messages, locks and cache lines back and forth. The more threads an application has, the greater the chance that they are scheduled on different cores, which in turn increases the communication overhead. For example, a thread that computes for 900 µs for every 100 µs it spends synchronizing has a 9:1 ratio and will scale well across cores; invert those figures and extra cores buy very little.

Two types of application parallelism can be identified. Coarse-grained parallelism is characterized by large tasks, single threading and low communication overhead; the ratio of computation to communication is high, yielding good multicore performance. Fine-grained parallelism is characterized by small tasks, multithreading and high communication overhead; the ratio of computation to communication is low, yielding poorer multicore performance.
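
The contrast is easy to see in code. In this invented example, four threads sum an array either coarsely (private partial sums, one synchronization at the end) or finely (a lock acquired per element); both produce the same answer, but the fine-grained variant spends most of its time communicating:

```c
#include <pthread.h>
#include <stdio.h>

#define N        (1 << 20)
#define NTHREADS 4

static int data[N];
static long total;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

struct slice { int first, last, fine; };

static void *sum_slice(void *arg)
{
    struct slice *s = arg;
    if (s->fine) {
        /* Fine-grained: synchronize on every element.
           High communication overhead, poor scaling. */
        for (int i = s->first; i < s->last; i++) {
            pthread_mutex_lock(&lock);
            total += data[i];
            pthread_mutex_unlock(&lock);
        }
    } else {
        /* Coarse-grained: compute privately, synchronize once.
           Computation dominates communication, good scaling. */
        long part = 0;
        for (int i = s->first; i < s->last; i++)
            part += data[i];
        pthread_mutex_lock(&lock);
        total += part;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    for (int i = 0; i < N; i++)
        data[i] = 1;

    pthread_t t[NTHREADS];
    struct slice s[NTHREADS];
    for (int i = 0; i < NTHREADS; i++) {
        s[i].first = i * (N / NTHREADS);
        s[i].last  = (i + 1) * (N / NTHREADS);
        s[i].fine  = 0;                    /* flip to 1 to compare */
        pthread_create(&t[i], NULL, sum_slice, &s[i]);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);

    printf("total = %ld\n", total);        /* N in either mode */
    return 0;
}
```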

In addition, applications that are CPU-bound tend to be coarse-grained and can exploit the full power of multicore architectures, while memory-bound or I/O-bound fine-grained applications may need to be optimized to avoid the bottlenecks caused by communication overhead in symmetric multiprocessing architectures.

When moving applications to multicore platforms, embedded developers must determine how functions are allocated within their applications and redesign them to exploit parallelism. They must also weigh the trade-offs of multithreaded versus single-threaded designs in harnessing multiple processor cores; in some instances, an application may actually perform better on a single core.

Hardware Virtualization and Hypervisors

Hardware virtualization is the emulation of underlying hardware capabilities so that operating systems can run in a hardware environment different from the one they were written for. The software programs that perform this emulation are called hypervisors. A hypervisor abstracts the capabilities of the hardware and allows multiple heterogeneous operating system instances to run on a single hardware platform.

There are two main types of hypervisor. A Type 1, or native, hypervisor runs directly on the hardware and has complete control of the platform; this type is more prevalent in embedded systems. A Type 2, or hosted, hypervisor runs on top of a host operating system and depends on it for access to the hardware.

Multiple operating system instances can be virtualized using either full virtualization or partial virtualization. In either case, the virtual machine presents the illusion of real hardware to the operating systems executing on it. However, full and partial virtualization differ in some key aspects of their architecture, leading to different sets of trade-offs.

Full virtualization requires virtualizing all the capabilities of the processor and board. This involves complex manipulation of memory management and privilege levels that is computation-intensive on commodity processors, and it imposes a performance overhead much higher than running the OS natively. The biggest benefit of full virtualization is that it allows operating systems to run unmodified, albeit at that significant cost.

Partial virtualization, or para-virtualization, is a technique in which the underlying hardware is not completely simulated in software. This architecture allows commodity operating systems to be virtualized easily on commodity processors, with the requirement that the guest operating system be modified to adhere to the partially virtualized architecture. The advantage is performance: partially virtualized systems run much faster than fully virtualized machines, usually within a few percent of the non-virtualized OS.
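
To make those guest code modifications concrete, here is a hypothetical sketch, loosely in the spirit of Linux's paravirt_ops: the guest's privileged operations are routed through a table of function pointers that is filled in at boot, either with native stubs or with cheap hypercalls. All names below are invented for illustration:

```c
#include <stdint.h>
#include <stdio.h>

/* Table of privileged operations the modified guest calls through. */
struct pv_ops {
    void (*disable_irqs)(void);
    void (*set_page_table)(uintptr_t base);
};

/* On bare metal these would be privileged instructions... */
static void native_disable_irqs(void)          { puts("cli"); }
static void native_set_page_table(uintptr_t b) { printf("mov cr3, %#lx\n", (unsigned long)b); }

/* ...under a hypervisor they become hypercalls instead of traps. */
static void hv_disable_irqs(void)              { puts("hypercall: disable irqs"); }
static void hv_set_page_table(uintptr_t b)     { printf("hypercall: set pgtable %#lx\n", (unsigned long)b); }

static struct pv_ops pv;

/* The modified guest calls through the table rather than trapping. */
static void guest_context_switch(uintptr_t next_pgtable)
{
    pv.disable_irqs();
    pv.set_page_table(next_pgtable);
}

int main(void)
{
    int virtualized = 1;    /* pretend we booted under a hypervisor */
    pv.disable_irqs   = virtualized ? hv_disable_irqs   : native_disable_irqs;
    pv.set_page_table = virtualized ? hv_set_page_table : native_set_page_table;
    guest_context_switch(0x1000);
    return 0;
}
```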

The emergence of the Type 1 embedded hypervisor brings an exciting new dimension to the world of multicore processors, and with it the promise of superior utilization of processor resources. Because an embedded hypervisor completely controls the platform resources, it can support both symmetric and asymmetric multiprocessing. An SMP-enabled hypervisor can let a single guest operating system utilize multiple cores; the same hypervisor can enable AMP by allocating a guest operating system to a unique core. This can be extended to run AMP and SMP on the same platform through judicious allocation of guest operating systems to single or multiple cores, increasing processor utilization significantly (Figure 1).

Figure 1. A Type 1 embedded hypervisor allocating guest operating systems to single cores (AMP) or across multiple cores (SMP) on the same multicore platform.
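
A hypothetical static configuration table shows what such an allocation might look like; the structure, names and masks are invented for illustration and not drawn from any product:

```c
#include <stdio.h>

/* One entry per guest OS: a bit mask of the cores it may run on. */
struct guest_config {
    const char   *name;
    unsigned int  core_mask;   /* bit i set => guest may run on core i */
};

static const struct guest_config guests[] = {
    { "rtos_control", 0x1 },   /* AMP: pinned to core 0, deterministic   */
    { "linux_gui",    0xE },   /* SMP: spans cores 1-3 for throughput    */
};

int main(void)
{
    for (size_t i = 0; i < sizeof guests / sizeof guests[0]; i++)
        printf("%s -> core mask 0x%X\n", guests[i].name, guests[i].core_mask);
    return 0;
}
```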

Migrating software from a single-core to a multicore system is not straightforward, as there are no automated processes to help configure and partition applications across cores. However, an SMP operating system can not only help smooth this transition but, combined with embedded hypervisor technology, offer real benefits: the opportunity to run multiple guest operating systems and their applications across the multiple cores.