Using the Microprocessor MMU for Software Protection in Real-Time Systems

a white paper by Greg Rose

First, choose a processor that has an on-chip MMU.
Second, choose an operating system that fully supports the MMU. These two choices just might give you a good night's sleep when your product is deployed.

System designers constantly face engineering trade-offs between system speed, integrity, and maintainability. With the increasing performance of microprocessors, and the ease of systems configuration and enhancement in software, as opposed to hardware, more of the actual system is implemented in software.

Products are now designed by teams of programmers, working on multiple programs, with all the programs coexisting on one microprocessor. With more programs running on one processor, a practice of “locking the doors” becomes necessary. Without protection, a single program bug could leave a trail of destruction that would compromise the integrity of the entire system.

Understanding Bugs

Despite the best efforts of software engineers, it is an accepted fact that all software has bugs. Many software bugs live happily in a system as long as the particular set of conditions that exposes them is never encountered. Other bugs manifest themselves only when a set of conditions occurs in a particular order that was not exercised during the software testing cycle before product release.

The most elusive of software bugs, however, are the bugs that it takes an entire system to create. With these bugs, there can be a chain of corruption from one task to another that grows in complexity over time until it crashes the entire system. For example, imagine a system that does not use any address protection between tasks. Task 1 inadvertently accesses an array out of bounds and corrupts a variable in the data space of Task 2. The next time Task 2 uses that variable, perhaps a week later, it corrupts a value in Task 3, and so on. A single programmer cannot design for this type of bug because, in that programmer's own design and code test, the conditions that cause the bug do not exist. Additionally, the conditions that turn this type of bug into a critical failure could take weeks or months to develop. The same period of time is then required again to recreate the problem while debugging the program.
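The Task 1 and Task 2 scenario can be illustrated with a small C model. This is only a sketch: the flat byte array, the offsets, and the names task1_store and task2_read are hypothetical, and the array stands in for an unprotected physical memory that both tasks share.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout of one flat, unprotected memory shared by all tasks. */
#define TASK1_BUF_BASE 0u   /* Task 1 owns bytes 0..15 */
#define TASK1_BUF_LEN  16u
#define TASK2_VAR_ADDR 16u  /* Task 2's variable sits directly after the buffer */
#define MEM_SIZE       32u

static uint8_t flat_memory[MEM_SIZE];

/* Task 1 stores into its buffer but never checks index against TASK1_BUF_LEN.
 * That missing check is the bug being modeled. */
static void task1_store(size_t index, uint8_t value)
{
    assert(TASK1_BUF_BASE + index < MEM_SIZE); /* keeps the model itself in bounds */
    flat_memory[TASK1_BUF_BASE + index] = value;
}

/* Task 2 reads its own variable and trusts whatever it finds there. */
static uint8_t task2_read(void)
{
    return flat_memory[TASK2_VAR_ADDR];
}
```

Calling task1_store(TASK1_BUF_LEN, value) writes one element past the end of Task 1's buffer, and the next task2_read() returns the corrupted value. Nothing in the flat memory model stops the write or reports it.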

Because fixing these bugs would get in the way of product release dates and product delivery commitments, the product is often shipped with the bugs happily corrupting memory, only to crash the system at the most inopportune times. The easiest way to protect against this type of bug is to implement some form of address protection between tasks in the system. In a system that does not employ some form of address protection, the system's quality and stability are only as good as the competence of its least competent programmer.

Using MMUs to Fend Off Bugs

Two trends are helping embedded designers to protect themselves against this problem. First, the prices of high end microprocessors with Memory Management Units (MMUs) are dropping, making them suitable for use in embedded designs. Second, microprocessor designers are rushing to the aid of system designers with the addition of MMUs designed into low-power controllers targeted at deeply embedded applications.

Now, embedded designers in the telecommunication and compact-equipment marketplace can have the same state-of-the-art protection historically available only in high-end microprocessor designs. A case in point is the PowerPC MPC860 PowerQUICC and the MPC821 Portable Systems Microprocessor. They are the follow-on products to the popular Motorola 68360 Quad Integrated Communications Controller (QUICC) chip. Both have an integrated MMU in their silicon. The MPC821 product is targeted at the Personal Digital Assistant (PDA) and Personal Intelligent Communicator (PIC) embedded systems marketplace. The 821 has not only the serial communications controllers, but also PCMCIA capability and an LCD controller, all integrated on a single chip.

The aforementioned MMU is hardware that is designed specifically to allow the processor to operate in a virtual address space. It was originally designed to allow a system to think it had more physical memory than was actually on the system. If a program was “running” (still resident in memory) yet had not executed in a while, the processor could store it off to disk and allow another program to use the physical memory that was released. This process is called swapping. When an event signaled the swapped program that it had work to do, the MMU would see that the requested memory was not resident in physical memory and signal the processor, via an exception/interrupt, that it needed to restore the swapped program to memory and move a different program out to disk. When the program was placed back in system memory, the processor would remap the physical pages that held the restored program to the original virtual addresses that the processor recognized. The MMU would then convert these processor virtual addresses to the new physical addresses and pass them on to the memory bus. The program would then be allowed to continue processing.

Later in the evolution of computer operating system design, system designers theorized that only a portion of a program would be needed at any given time. A program's initialization is run once at the beginning of the program, and swapping the entire program to the disk and back is not efficient. The system designers refined the granularity from swapping entire programs to saving and restoring individual pages of memory, typically 4096 bytes, out to disk and back. This process is called paging. This drove the MMU design to have a mapping granularity of a page of memory. The virtual-to-physical translation for a page of memory is stored in something called a page table entry (PTE). The PTEs also set the cache control bits, data access permissions, and page access permissions.

This history, though interesting from an academic standpoint and important in understanding the internal design of the MMU, does not describe how the MMU is used in embedded system designs. Swapping and paging are non-deterministic, and they require a disk on the system.

However, program (or “process”) address protection can be achieved by setting up a virtual address space for the processor to execute within, and then signaling the processor when an address that is not currently mapped is accessed. A common misconception is that the performance overhead of using the MMU is too high and that embedded designers cannot afford the luxury of address protection.

This article will examine system software designs that use the MMU for address protection. It will examine two common MMU designs, Intel® and PowerPC®, and explain software implementations that give the benefits of full task address protection without sacrificing the system's deterministic hard real-time characteristics. System performance overheads associated with the MMU features will also be presented.

Supervisor and User Isolation Modes

The microprocessor has a set of instructions that perform functions such as data manipulation, arithmetic operations, conditional evaluation, and processor control. Of all the instructions available to the programmer, a subset is available only when the processor is running in supervisor mode.

Processor and MMU control functions can only be performed while the processor is executing in supervisor mode. The MMU can also be programmed to have ranges of addresses that are supervisor-only read/write (R/W) or read only (RO), and areas that are both user and supervisor R/W. Whether the processor is in supervisor mode or user mode, the MMU will enforce the access permissions. In normal operation, the processor performs a system trap to enter supervisor mode, performs the critical operations, and then returns to user mode. Tasks can also be written to run entirely in the processor's user mode or supervisor mode.

General MMU Design

Figure 1: The processor sends a virtual address to the MMU.

A conceptual block diagram of the MMU is illustrated in Figure 1. The processor sends a virtual address to the MMU, which can handle it in one of three ways.

The MMU can be controlled to bypass all of its translation and control circuitry and pass the address directly onto the system memory bus. This is called real-addressing mode.
The MMU can be programmed to translate segments of logical addresses to equivalent-sized segments of physical memory. The PowerPC processor has a fixed number of registers called block address translation (BAT) registers that are explicitly designed for translating large segments of memory.
The MMU can be programmed to translate logical addresses to physical addresses on a page-by-page basis through the use of page tables. To speed the translation, a cache of the logical-to-physical mappings of the most recently accessed pages is kept in the MMU. This cache is called the Translation Lookaside Buffer (TLB). The PowerPC MPC750 has a 128-entry TLB for instruction accesses and a 128-entry TLB for data accesses. This allows 1MB of address space translations to be in the MMU cache at any time if the system is programmed with 4KB pages. Translations that are resident in the TLB will be posted to the system memory bus in the same clock cycle that they are received. If the translation for an accessed virtual page is not currently in the TLB, the access into the page will cause the MMU or processor to find the associated PTE in cache or memory. It will load the TLB with the mapping and pass the address to the system memory bus. Any subsequent accesses to that page will therefore occur as TLB hits and be processed in the same clock cycle in which they were posted to the MMU. It is easy to see that the only performance impact of using the MMU to translate addresses occurs when a TLB miss occurs and the PTEs must be read and reloaded into the TLB.
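The 1MB figure follows from simple arithmetic: the two 128-entry TLBs together hold 256 translations, and each translation covers one 4KB page. A minimal sketch of that calculation, where the function name tlb_reach_bytes is hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Bytes of address space whose translations can be resident in the TLB
 * at one time: one page per TLB entry. */
static uint64_t tlb_reach_bytes(uint32_t entries, uint32_t page_size_bytes)
{
    assert(page_size_bytes != 0);
    return (uint64_t)entries * page_size_bytes;
}
```

With 128 instruction entries plus 128 data entries and 4096-byte pages, tlb_reach_bytes(256, 4096) gives 1,048,576 bytes. Each TLB on its own covers half of that.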

Multiple approaches are used by different processor manufacturers to perform the table lookups to find the correct PTE on a TLB miss. Some processor designs have a software-assisted table walk to find the correct PTE in the page tables. These designs have special registers, called shadow registers, or extra registers that are used to minimize the systemic impact of the translation lookup. Other processor designs can perform the entire PTE lookup in hardware. The goal of the processor manufacturer's MMU hardware designers is to minimize the performance impact of using the MMU for address translation. Because the hardware assists with the table walk, the format of the page tables is defined by the processor designer.

MMU Page Table Designs

One classic page-table format is used on the Intel x86, Pentium®, and Pentium Pro family and also on the PowerPC 821 and 860 PowerQUICC processors. It is illustrated in Figure 2 using a standard 4KB page size.

Figure 2: Classic page-table format used on the Intel x86, Pentium, and Pentium Pro family, as well as on the PowerPC 821 and 860 PowerQUICC processors.

The MMU contains a pointer to a page of memory called the page table directory (PTD). The PTD contains 1024 32-bit addresses that are pointers to pages of memory called page tables. Each page table contains 1024 PTEs. As shown in the illustration, a 32-bit virtual address is passed from the processor to the MMU. The first (most significant) 10 bits of the virtual address are used as an index into the PTD. The 32-bit value that is selected in the PTD is used as the address of a page table. The next 10 bits of the virtual address are used as an index into the page table to resolve an individual PTE. The first 20 bits of the PTE are used as the address of the actual page of memory. The last 12 bits of the virtual address are used as the offset into the physical page of memory pointed to by the PTE. Since only the first 20 bits of the PTE are used to point to the physical page, the remaining 12 bits are usually used as flags for cache control, page access protections, etc.
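The 10/10/12 split of the virtual address can be sketched in C as a software model of the walk. This is an illustration under stated assumptions, not processor code: the type names, the translate function, and the PTE_VALID flag in bit 0 are hypothetical, and the model stores host pointers in the PTD where real hardware stores 32-bit physical addresses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ENTRIES_PER_TABLE 1024u
#define PAGE_OFFSET_MASK  0x00000FFFu  /* last 12 bits: offset within the page */
#define PAGE_FRAME_MASK   0xFFFFF000u  /* first 20 bits: physical page address */
#define PTE_VALID         0x00000001u  /* hypothetical "mapping present" flag  */

typedef struct { uint32_t pte[ENTRIES_PER_TABLE]; } page_table_t;
typedef struct { page_table_t *table[ENTRIES_PER_TABLE]; } page_dir_t;

/* Field extraction for a 32-bit virtual address with 4KB pages. */
static uint32_t ptd_index(uint32_t va)   { return va >> 22; }
static uint32_t pt_index(uint32_t va)    { return (va >> 12) & 0x3FFu; }
static uint32_t page_offset(uint32_t va) { return va & PAGE_OFFSET_MASK; }

/* Two-level walk. Returns 1 and stores the physical address in *pa on
 * success, or 0 if the address is not mapped. */
static int translate(const page_dir_t *ptd, uint32_t va, uint32_t *pa)
{
    assert(ptd != NULL && pa != NULL);

    const page_table_t *pt = ptd->table[ptd_index(va)]; /* first memory access  */
    if (pt == NULL)
        return 0;

    uint32_t pte = pt->pte[pt_index(va)];               /* second memory access */
    if ((pte & PTE_VALID) == 0)
        return 0;

    *pa = (pte & PAGE_FRAME_MASK) | page_offset(va);
    return 1;
}
```

For example, the virtual address 0x00402345 splits into PTD index 1, page table index 2, and offset 0x345. If that PTE holds the page address 0x00ABC000, the model produces the physical address 0x00ABC345.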

A feature of this design is that if a TLB miss occurs, the correct physical page translation can be found in two memory accesses. The first access reads the entry in the PTD, and the second access reads the correct entry in the page table. Commonly, both entries will already be available in the cache. A downside of this design is that if the processor switches context (changes its mapping PTD), the entire TLB must be invalidated and then reloaded as each page is accessed by the new task.

Page Table Design for the PowerPC 60x Family and MPC750 Processors

A different page table design is used on the PowerPC 60x family and MPC750 processors. It is illustrated in Figure 3 using the architecturally defined 4KB page size.

Figure 3: Page table design used on the PowerPC 60x family and MPC750 processors.

This design has one global page table for the entire system. The address of the global page table is programmed into the MMU, and a register is set for the size of the page table. This MMU design uses a 52-bit interim address to discriminate between the different pages of system memory in the global address space. The system designer loads the MMU segment registers with values that are associated with the current context. The upper 4 bits of the 32-bit address passed by the processor select one of 16 segment registers, and the 24-bit value held there is prepended to the remaining 28 bits of the address, resulting in the interim 52-bit virtual address. Due to the size of the address range, the virtual address is hashed by the MMU hardware, and the result becomes a pointer to a page table entry group (PTEG). The 52-bit address is then compared to the individual entries in the PTEG until a match is found. When a match is found, the translation value is loaded into the TLB for future accesses into the page, and the physical address is passed to the system bus. Like the traditional design, the first 20 bits of the PTE are used as the address of the actual page of memory. The last 12 bits of the virtual address are used as the offset into the physical page of memory pointed to by the PTE. Since only the first 20 bits of the PTE are used to point to the physical page, the remaining 12 bits are usually used as flags for cache control, page access protections, etc.
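The construction of the 52-bit interim address can be sketched as follows. The sketch assumes the 32-bit PowerPC arrangement in which the upper 4 bits of the processor's address select one of 16 segment registers and the 24-bit value is joined to the remaining 28 bits. The function names are hypothetical, and the hashing step and PTEG search are deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_SEGMENT_REGS 16u
#define SEGMENT_ID_MASK  0x00FFFFFFu  /* 24-bit per-context value          */
#define LOW_28_MASK      0x0FFFFFFFu  /* page index plus 12-bit byte offset */

/* Upper 4 bits of the processor's address pick the segment register. */
static uint32_t segment_select(uint32_t ea)
{
    return ea >> 28;
}

/* Prepend the selected 24-bit segment value to the low 28 bits of the
 * address, giving a 52-bit interim virtual address (held in 64 bits). */
static uint64_t interim_address(const uint32_t seg_regs[NUM_SEGMENT_REGS],
                                uint32_t ea)
{
    assert(seg_regs != NULL);
    uint64_t segment_id = seg_regs[segment_select(ea)] & SEGMENT_ID_MASK;
    return (segment_id << 28) | (ea & LOW_28_MASK);
}
```

Two contexts that load different values into the same segment register obtain different interim addresses for the same processor address. TLB entries tagged with one context's interim addresses therefore do not match the other context's accesses.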

The benefit of this design is that when the processor changes context, the entire TLB does not have to be invalidated. This is because the segment registers are simply reprogrammed to provide another window into the same 52-bit address space. This integrity of the TLB across context switches is very beneficial in applications that have multiple tasks which switch very often. However, if the page table is set to be small, multiple memory reads can be necessary during the page table walk. The page tables are stored in system memory and are cacheable.

MMU Performance Overhead

The only performance impact of using the MMU occurs on the first access into a page of memory that is not currently mapped by the TLB. Any subsequent accesses to that page will be translated in less than one clock cycle through the TLB. Real-time systems require the performance impact of the TLB miss, the subsequent table walk, and the associated TLB reload to be characterized. This ensures that the system can be analyzed and that the real-time deadlines can be met.

Figure 4: Testing real-time MMU performance

    int main(void)
    {
        /* one word short of 256 pages, leaving space on the last page for the variable k */
        static volatile int dataarray[262143];
        static volatile int k;  /* volatile so that k is not optimized out */
        register int i, j;      /* keep these variables in registers to avoid extra page hits */

        for (j = 0; j < 100000; j++)              /* outer loop to make timing easier */
            for (i = 0; i < 262143; i += 1025) {  /* one hit per page, advancing one word per page for cache stability */
                k = i;                            /* use k for ease with the normalize test */
                dataarray[k] = j;                 /* write an arbitrary value to the page */
            }
        return 0;
    }

Tests for this article were run on a 233 MHz PowerPC MPC750 running LynxOS, the real-time operating system from LynuxWorks™, Inc. The MMU and cache were analyzed to limit the characterization to only the effect of the page tables and not the effect of system caching. The goal was to overrun the TLB, yet have all the accesses hit in the cache, if possible, and not add latencies into the test by programmatically invalidating the TLB, because this could skew the results. For simplicity, only the data TLB miss was tested. It is assumed that the instruction TLB miss time is the same, since the hardware for translation is largely the same. The test overran the TLB by accessing twice as many pages as there were TLB entries. Using twice as many pages ensures that every write will miss in the TLB. The test was looped 100,000 times for an easily measured sample. The assembly code was then checked to verify that the optimizer had not altered the executable in a way that would invalidate the test program. The test was timed using the system time function. A normalized time was calculated by subtracting the time taken to loop the same test while hitting only a single page of memory. The program source is included in Figure 4.

The MPC750 has a 128-entry, two-way set-associative (64 x 2) data TLB. All table lookup and hashing processing is performed by the hardware. The L1 data cache is a 32KB, eight-way set-associative cache with eight words per cache line. Therefore, if each access into a page is offset by one word, the cache will be filled after the first loop through the test (256 data writes), and all subsequent accesses will hit in the L1 cache. The timings include the latency involved in the TLB miss, page-table walk, and TLB reload. A 233 MHz PowerPC MPC750 performed the entire process in 66 nanoseconds.
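The normalization step amounts to dividing the extra time by the number of page accesses. The sketch below shows that arithmetic. The function name and parameters are hypothetical, and it assumes 256 page accesses per pass of the outer loop, which is what the loop in Figure 4 performs.

```c
#include <assert.h>

/* Average cost of one TLB miss in nanoseconds.
 * test_seconds:     time for the run that touches a new page on each access
 * baseline_seconds: time for the same loop hitting only a single page
 * outer_loops:      number of passes (100,000 in the article's test)
 * pages_per_loop:   page accesses per pass (256 in the article's test) */
static double tlb_miss_ns(double test_seconds, double baseline_seconds,
                          unsigned long outer_loops,
                          unsigned long pages_per_loop)
{
    assert(outer_loops > 0 && pages_per_loop > 0);
    double accesses = (double)outer_loops * (double)pages_per_loop;
    return (test_seconds - baseline_seconds) * 1e9 / accesses;
}
```

The article does not report the raw timings. Working backward, 25.6 million accesses at 66 nanoseconds each correspond to a difference of roughly 1.69 seconds between the two runs.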

Simple Kernel System Designs

The easiest system design is to disable the MMU altogether and have the processor run off physical addresses and execute the system entirely in supervisor mode. This is called real addressing mode.

This method lumps all processes into one group and allows any process to corrupt any memory location in the system. It allows the system designer the least configuration control because processor functions such as cache control methods and memory protection are predefined by the processor designers. This system design, from the aspect of protection, is best described as “running with scissors in a crowded room.” This design is the most common design used in commercial real-time operating systems (RTOSes) that support multiple processors with and without MMUs. This design allows the software vendors’ source trees to be processor-independent, supporting the least common denominator. Do not assume that because an RTOS runs on a processor that has an MMU, the MMU is used by that operating system for address protection. Additionally, some RTOSes claim MMU support via library routines and then require the user to configure and manage the MMU.

The next easiest system that designers can implement is a simple form of address protection where tasks are divided into user-level and supervisor-level tasks. In this case, the MMU is programmed with a single address space and enabled in a logical-equals-physical addressing mode. However, unlike the previous example, the MMU is used to program blocks or segments of addresses. An individual segment will have the same characteristics (caching, page protections, etc.) applied to the entire segment or block, yet the characteristics are configurable by the system designer. As an example, the MMU could be programmed to make a segment write-through cached, supervisor R/W, and user no access. Other segments could then be programmed with other access permissions, etc. This is the fastest mode of the MMU, yet it is a poorly protected model due to the coarse granularity of protection.

This mode can best be thought of as somewhat of a feudal system of task address protection, because the system designer has implemented a castle wall around certain processor instructions and around certain areas of memory. All of the other tasks execute in user mode, living and working out on the hillsides without locks on the doors of their homes. A rogue user task can reach addresses outside its own task address space (e.g., by dereferencing an invalid pointer or accessing an array out of bounds) and corrupt the memory of other tasks. This is allowed because all user-level tasks have the exact same access permissions. Additionally, all supervisor tasks, although inside the protected castle walls, can also corrupt each other or the serf user tasks. This approach is usually used in simple applications with a small number of tasks. However, as the complexity of the system increases, this supervisor/user segregation solution becomes inadequate.
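The strength and the weakness of the block-based scheme can both be seen in a small software model. The sketch below is an assumption-laden illustration, not an MMU driver: the block_t structure, the enum names, and access_allowed are hypothetical, and cache attributes are omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { ACCESS_NONE, ACCESS_RO, ACCESS_RW } access_t;
typedef enum { MODE_USER, MODE_SUPERVISOR } cpu_mode_t;

/* One block (segment) of addresses with a single set of permissions. */
typedef struct {
    uint32_t base;        /* first address covered by the block */
    uint32_t length;      /* size of the block in bytes         */
    access_t supervisor;  /* permission in supervisor mode      */
    access_t user;        /* permission in user mode            */
} block_t;

/* Returns 1 if the access would be permitted, 0 if the modeled MMU
 * would raise an exception instead. */
static int access_allowed(const block_t *blocks, size_t count,
                          uint32_t addr, cpu_mode_t mode, int is_write)
{
    assert(blocks != NULL);
    for (size_t i = 0; i < count; i++) {
        if (addr >= blocks[i].base &&
            addr - blocks[i].base < blocks[i].length) {
            access_t a = (mode == MODE_SUPERVISOR) ? blocks[i].supervisor
                                                   : blocks[i].user;
            return is_write ? (a == ACCESS_RW) : (a != ACCESS_NONE);
        }
    }
    return 0; /* no block covers this address */
}
```

With one block marked supervisor R/W and user no access, and a second block marked R/W for both modes, a user-mode write into the first block is refused. A user-mode write anywhere in the second block succeeds, including into data that belongs to a different user task, which is the coarse granularity described above.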

Program Address Protection Utopia

The best task address protection comes from a system design that sets up a separate address space for each program running on the system.

In this mode, the physical memory of the system is treated as a free memory pool. When a program is started, free pages of memory are mapped through the PTEs to a processor virtual address space for the user program's code and data segments. For system service speed and interrupt service response time, the kernel must always be mapped into the processor's virtual address space. Therefore, if an interrupt is asserted, the processor will not have to switch the MMU page tables to service the interrupt. The kernel can easily be protected from the user program by the standard user/supervisor access restrictions, thus isolating the kernel from user corruption. With a separate page table set for each program, only the pages of memory that are used and allocated to the current program need to be mapped by the MMU. It is impossible for one errant program to inadvertently corrupt another program because the physical pages of the other process are not mapped into the current processor virtual address space. This design is used in LynxOS, the underlying OS in the TLB performance tests of this article. It allows full address protection with fully deterministic response times to real-time interrupts and tasks. An illustration of two processes and the virtual-to-physical address mappings is shown in Figure 5.

Figure 5: Protected address model.
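The isolation property can be modeled in a few lines of C. This is a simplified sketch with hypothetical names: each address space is a tiny table of page mappings, physical pages are identified by frame number, and the user/supervisor protection on the kernel page is not modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT      12u         /* 4KB pages                            */
#define PAGES_PER_SPACE 8u          /* tiny virtual space, for illustration */
#define UNMAPPED        UINT32_MAX  /* marker: no physical page mapped here */

/* One table of virtual-page to physical-frame mappings per process. */
typedef struct {
    uint32_t frame[PAGES_PER_SPACE];
} address_space_t;

static void space_init(address_space_t *s)
{
    assert(s != NULL);
    for (uint32_t i = 0; i < PAGES_PER_SPACE; i++)
        s->frame[i] = UNMAPPED;
}

static void space_map(address_space_t *s, uint32_t vpage, uint32_t frame)
{
    assert(s != NULL && vpage < PAGES_PER_SPACE);
    s->frame[vpage] = frame;
}

/* Returns 1 and stores the physical address in *pa, or 0 when the
 * access would fault because the page is not mapped in this space. */
static int space_translate(const address_space_t *s, uint32_t va, uint32_t *pa)
{
    assert(s != NULL && pa != NULL);
    uint32_t vpage = va >> PAGE_SHIFT;
    if (vpage >= PAGES_PER_SPACE || s->frame[vpage] == UNMAPPED)
        return 0;
    *pa = (s->frame[vpage] << PAGE_SHIFT) | (va & ((1u << PAGE_SHIFT) - 1u));
    return 1;
}

/* Returns 1 if any virtual page in this space maps to the given frame. */
static int space_can_reach(const address_space_t *s, uint32_t frame)
{
    assert(s != NULL);
    for (uint32_t i = 0; i < PAGES_PER_SPACE; i++)
        if (s->frame[i] == frame)
            return 1;
    return 0;
}
```

If process A maps virtual page 1 to frame 5 and process B maps the same virtual page to frame 9, the same virtual address resolves to different physical memory in each process. No virtual address in A resolves to frame 9, so a stray pointer in A can fault but cannot touch B's data. A kernel page mapped at the same virtual page in both spaces models the always-mapped kernel.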

This protected address model is called the Portable Operating System Interface (POSIX®) process model. A process consists of its thread of execution and an address space. The POSIX standard was written to allow portability of code among multiple operating systems and hardware platforms. The POSIX interface for setting up this process virtual address space is the fork() system call.

Using the fork() system call does not require any understanding of the hardware MMU or mappings by the applications programmer. Programs are loaded and run by the exec() family of system calls.

When a user program bug causes the program to try to access an address that is not currently mapped, the MMU will block the access and generate an exception.

This exception is called a segmentation violation. The default operation, as defined by the POSIX.1 specification, is to terminate the task and save the footprint of the offending task for later analysis. POSIX also defines the interface by which a task can set up its own response to a segmentation violation. Using this interface, the offending memory access is blocked, and the task can define its own recovery procedure.
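The default behavior can be observed on a POSIX system with a short experiment. The sketch below uses only standard calls (fork, waitpid, _exit), but the function name run_faulting_child is hypothetical. It also relies on an assumption: writing through a null pointer is undefined behavior in C, and the example depends on the common case in which address zero is unmapped and the store raises SIGSEGV.

```c
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Forks a child that writes to an unmapped address.
 * Returns the number of the signal that terminated the child,
 * 0 if the child exited normally, or -1 if fork or waitpid failed. */
static int run_faulting_child(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: runs in its own address space. */
        volatile int *bad = NULL;  /* assumed to be unmapped */
        *bad = 1;                  /* the MMU blocks this store */
        _exit(0);                  /* reached only if no fault occurred */
    }

    assert(pid > 0);               /* only the parent reaches this point */
    int status = 0;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    if (WIFSIGNALED(status))
        return WTERMSIG(status);
    return 0;
}
```

On such a system the function returns SIGSEGV, and the parent keeps running after the child has been terminated. A task that wants its own recovery procedure would install a handler for SIGSEGV with sigaction(), which is not shown here.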

Conclusion

As advances in processor technology increase the speed of the microprocessor, more programs can be executed on a single processor. These additional programs increase the complexity of the system and the number of bugs that slip into the system. A single bug that can corrupt unrelated tasks unchecked can easily leave a trail of missed deadlines and “red ink” from man-months spent in product debug and system test. The resulting elusive system crashes may not reappear until it is too late and the product is deployed. A further burden is that the cost of implementing a “fix” in an embedded system increases exponentially after the product is deployed.

The microprocessor MMU can be programmed to protect user tasks and the kernel from accidental corruption with minimal impact on overall system performance. Using multiple protected address spaces insulates the user tasks and terminates only the offending task on an invalid memory reference. This allows the other programs to continue processing and protects the overall system from a catastrophic system crash. The running system can then perform a graceful recovery from the terminated task and optionally log the footprint for later debugging. To use this state-of-the-art capability in real-time embedded systems, two choices must be made early in the project specification. First, choose a processor that has an on-chip MMU. Second, choose an operating system that fully supports the MMU. These two choices just might give you a good night's sleep when your product is deployed.