Sunday, December 11, 2011

A Survey of Microcontroller CPU Core Architectures: 32-bit core

Continued from the previous post.

ARM7

ARM7 is a popular 32-bit architecture for microcontrollers; it has been implemented in the NXP (formerly Philips) LPC series, the Analog Devices ADuC7000, and the Atmel AT91.  The architectural efficiency is 0.9 Dhrystone MIPS/MHz, which is about the per-clock integer performance of the Intel 486.  The most commonly used core is the ARM7TDMI-S, which includes the Thumb instruction set.  60 MHz is a common top clock rate.  (See also the open source tools for the ARM processors.)

The ARM7 uses a relatively short 3-stage pipeline.  Registers R0-R13 are general-purpose 32-bit registers (R13 serves as the stack pointer by convention); R14 is used as the link register for subroutine calls and R15 holds the program counter.  The ARM instructions are 32 bits long.  As is typical of a RISC machine, it uses a load-store architecture.  The addressing modes include PC-relative and indexed with offset, with optional auto-increment.  One unique feature is that every instruction is conditionally executed according to a 4-bit condition field in the instruction.  There are 15 different conditions, depending on the Z (zero), C (carry), N (negative), and V (overflow) flags in the current program status register (CPSR).  This conditional execution feature reduces the need for branching.  The long instruction word allows the ALU operations to have an independent second operand.  The multiply and multiply-accumulate instructions are built from 8-bit Booth's algorithm arrays, so the number of instruction cycles depends on how many of the arrays are activated.  There is no division instruction.
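To make the conditional execution concrete, here is a minimal sketch that models the 4-bit condition field (bits 31..28 of an ARM instruction word) as a lookup against the N, Z, C, V flags.  The function name `condition_passes` and the table layout are illustrative choices, not part of any ARM tool; the sketch only decides whether an instruction would execute, not what it does.

```python
# Condition field values (bits 31..28) and the flag test each performs.
# Flags are passed as booleans in the order N, Z, C, V.
CONDITIONS = {
    0b0000: ("EQ", lambda n, z, c, v: z),
    0b0001: ("NE", lambda n, z, c, v: not z),
    0b0010: ("CS", lambda n, z, c, v: c),
    0b0011: ("CC", lambda n, z, c, v: not c),
    0b0100: ("MI", lambda n, z, c, v: n),
    0b0101: ("PL", lambda n, z, c, v: not n),
    0b0110: ("VS", lambda n, z, c, v: v),
    0b0111: ("VC", lambda n, z, c, v: not v),
    0b1000: ("HI", lambda n, z, c, v: c and not z),
    0b1001: ("LS", lambda n, z, c, v: (not c) or z),
    0b1010: ("GE", lambda n, z, c, v: n == v),
    0b1011: ("LT", lambda n, z, c, v: n != v),
    0b1100: ("GT", lambda n, z, c, v: (not z) and n == v),
    0b1101: ("LE", lambda n, z, c, v: z or n != v),
    0b1110: ("AL", lambda n, z, c, v: True),
}


def condition_passes(instruction, n, z, c, v):
    """Return True if the instruction word would execute with these flags."""
    cond = (instruction >> 28) & 0xF
    if cond not in CONDITIONS:
        # 0b1111 is not one of the 15 conditions and is not modeled here.
        raise ValueError("reserved condition field")
    _name, test = CONDITIONS[cond]
    return bool(test(n, z, c, v))


# 0xE1A00000 is MOV R0,R0 with the "always" condition; replacing the top
# nibble with 0x0 gives the EQ-conditional form of the same instruction.
always = condition_passes(0xE1A00000, n=False, z=False, c=False, v=False)
eq_taken = condition_passes(0x01A00000, n=False, z=True, c=False, v=False)
eq_skipped = condition_passes(0x01A00000, n=False, z=False, c=False, v=False)
```

A compare followed by two oppositely conditioned instructions can therefore replace a short if/else without any branch.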

Thumb mode was introduced to improve code density.  The instruction length is reduced to 16 bits and the number of accessible registers to 8.

ARM7 has been superseded by Cortex-M.

ARM Cortex-M

ARM Cortex-M has several flavors.

ARM Cortex-M0

Cortex-M0 is ARM's low-power, low-gate-count architecture for deeply embedded systems.  It implements the ARMv6-M Thumb instruction set.
Of the 16 core registers, 13 (R0-R12) are general-purpose.  R13 is the stack pointer, R14 is the link register for subroutine calls, and R15 is the program counter.  There is also a program status register (PSR).

ARM Cortex-M3

Cortex-M3 is the replacement for the ARM7TDMI.  One of the changes most visible to programmers is the addition of a hardware division instruction.  The Cortex-M3 implements the Thumb-2 instruction set, which is a superset of the Thumb instructions.  Thumb-2 code is only slightly larger than Thumb code, but its performance is close to that of ARM instructions.

The three-stage pipeline is augmented with branch prediction.  The relatively short pipeline does not push the clock speed, which is around 100 MHz; this is sufficient for the applications this microcontroller targets.
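To see what the hardware division instruction saves, consider what a core without one (such as the ARM7TDMI) has to do in software.  The sketch below is a generic restoring shift-and-subtract divider for unsigned 32-bit values; it is an illustration of the technique, not the routine that any particular compiler runtime ships, and the name `udiv32` is made up for this example.

```python
def udiv32(dividend, divisor):
    """Unsigned 32-bit division by shift-and-subtract.

    Returns (quotient, remainder). One loop iteration per quotient bit,
    which is why a software divide costs tens of cycles.
    """
    dividend &= 0xFFFFFFFF
    divisor &= 0xFFFFFFFF
    if divisor == 0:
        raise ZeroDivisionError("division by zero")
    quotient = 0
    remainder = 0
    for bit in range(31, -1, -1):
        # Bring down the next dividend bit, most significant first.
        remainder = (remainder << 1) | ((dividend >> bit) & 1)
        if remainder >= divisor:
            remainder -= divisor
            quotient |= 1 << bit
    return quotient, remainder


q, r = udiv32(100, 7)
```

On the Cortex-M3 the whole loop collapses into a single UDIV (or SDIV for signed operands).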

ARM Cortex-M4

Cortex-M4 adds single-precision floating-point instructions.

ARM Cortex-R

Cortex-R is the real-time flavor of the Cortex core.  Its performance is between Cortex-A and Cortex-M.  

Hitachi/Renesas SH

SuperH is a RISC architecture designed for mobile and embedded applications.  The need for high code density drove the choice of a fixed 16-bit instruction length.  It has 16 32-bit general registers; R0 is also used as an index register and R15 is used as the stack pointer.  The three 32-bit control registers are the Status Register (SR), which holds instruction status bits (such as the T bit for conditional instructions) and interrupt mask bits; the Global Base Register (GBR), used for indirect addressing modes; and the Vector Base Register (VBR), used for exception processing.  Of the four 32-bit system registers, two store MAC results, one is the Procedure Register (PR), which stores the return address of a subroutine call, and the last is the Program Counter.  System control instructions can operate on the system registers.

Like other RISC architectures, SH is a load/store architecture.  It offers several addressing modes: immediate data, register indirect with increment/decrement and 4-bit displacement, register indirect indexed by R0, GBR with displacement, GBR indirect indexed by R0, PC-relative with 8-bit and 12-bit displacement, and PC-relative with register.

The 16-bit instruction set follows a fixed and very regular encoding pattern with great simplicity and elegance.  Each 16-bit instruction is essentially divided into 4 nibbles; the operands are limited to 2 registers and the maximum displacement is 12 bits.  An instruction starts with a 4-bit opcode (some formats use 4, 8, or 12 additional opcode bits), register operands are encoded in two 4-bit fields, the displacement can be 4, 8, or 12 bits, and the immediate value is 8 bits.
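The nibble layout is easy to pick apart by hand.  The following sketch splits a 16-bit word into its four fields and recognizes three formats: a two-register form, an 8-bit immediate form, and a 12-bit displacement form.  The three opcode patterns used (ADD Rm,Rn; MOV #imm,Rn; BRA) are believed to match the SH-2 instruction set but should be checked against the Renesas software manual; the function names are made up for this example.

```python
def nibbles(word):
    """Split a 16-bit instruction word into four 4-bit fields, high to low."""
    word &= 0xFFFF
    return [(word >> shift) & 0xF for shift in (12, 8, 4, 0)]


def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def decode(word):
    """Decode a small, assumed subset of SH instructions for illustration."""
    op, n, m, low = nibbles(word)
    if op == 0x3 and low == 0xC:       # 0011 nnnn mmmm 1100
        return f"ADD R{m},R{n}"
    if op == 0xE:                      # 1110 nnnn iiii iiii
        return f"MOV #{sign_extend(word & 0xFF, 8)},R{n}"
    if op == 0xA:                      # 1010 dddd dddd dddd
        return f"BRA disp={sign_extend(word & 0xFFF, 12)}"
    return "unknown"


fields = nibbles(0x321C)
text = decode(0x321C)
```

Because every field sits on a nibble boundary, the fields can be read directly off the hexadecimal form of the instruction.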

For branching, there are PC-relative conditional branches, unconditional branches, and procedure calls with displacements of up to 12 bits, as well as PC-relative register and register-indirect branches.  There are instructions for exception processing, TRAPA and RTE, which can be used to implement system calls.  There is no memory management unit.

A five-stage pipeline (IF, ID, EX, MA, WB) with delayed branches is used.  Most instructions execute in one cycle; branches take two cycles.  Multiply and MAC can take 2-4 cycles.  The IF and MA stages require memory access, which can take more than one cycle.  The delayed branch reduces pipeline stalls by executing the delay slot while the branch destination is being resolved.  This means that, in assembly code, the instruction that follows a branch is executed before control transfers to the branch target; a nop may have to be used if no useful instruction can be placed there.  Delayed branching is a simple scheme that is no longer used in newer processors, which normally rely on more advanced branch prediction.
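The effect of the delay slot on program order can be shown with a toy interpreter.  The instruction set here ("add", "nop", "bra", "halt") is entirely hypothetical and has nothing to do with real SH encodings; the only behavior being modeled is that the instruction after a branch runs before the branch target does.

```python
def run(program):
    """Execute a toy program with a one-instruction branch delay slot.

    Instructions are tuples: ("add", k), ("nop",), ("bra", target_index),
    ("halt",). Returns (accumulator, list of executed instruction indices).
    """
    acc = 0
    pc = 0
    pending = None          # branch target waiting for the delay slot
    trace = []
    while pc < len(program):
        op = program[pc]
        trace.append(pc)
        if op[0] == "halt":
            break
        if op[0] == "bra":
            if pending is not None:
                raise ValueError("branch inside a delay slot is not modeled")
            pending = op[1]
            pc += 1         # fall into the delay slot first
            continue
        if op[0] == "add":
            acc += op[1]
        # After the delay-slot instruction, control moves to the target.
        if pending is not None:
            pc, pending = pending, None
        else:
            pc += 1
    return acc, trace


demo = [
    ("add", 1),
    ("bra", 4),
    ("add", 10),    # delay slot: executed even though it follows the branch
    ("add", 100),   # skipped
    ("halt",),
]
result, order = run(demo)
```

In the demo the accumulator ends at 11, not 1, because the add in the delay slot still executes; replacing it with a nop gives the behavior a programmer used to non-delayed branches would expect.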

Floating-point instructions are added in the SH2A-FPU, with 16 32-bit single-precision floating-point registers that can also be paired as 8 64-bit double-precision registers.  The FPU instructions are still encoded in the same 16-bit format, so a precision mode bit in the FP status/control register is used to distinguish single-precision from double-precision instructions.  Most SP instructions execute in 1 cycle, including MAC; only division and square root take longer, at 10 and 9 cycles respectively.  The DP instructions take longer: data movement takes 2 cycles, add/subtract/multiply 6 cycles, division 23 cycles, and square root 22 cycles.
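As a rough illustration of how the precision mode bit changes the cost of the same instruction sequence, the sketch below sums the cycle counts quoted above for a list of operations.  The operation names are illustrative labels rather than SH mnemonics, the 1-cycle figure for SP "move" is an assumption based on "most SP instructions execute in 1 cycle", and the sum ignores any overlap the pipeline might provide.

```python
# Cycle counts taken from the figures quoted in the text.
SP_CYCLES = {"move": 1, "add": 1, "sub": 1, "mul": 1, "mac": 1,
             "div": 10, "sqrt": 9}
DP_CYCLES = {"move": 2, "add": 6, "sub": 6, "mul": 6,
             "div": 23, "sqrt": 22}


def total_cycles(ops, double_precision):
    """Sum per-operation cycle counts; double_precision models the mode bit."""
    table = DP_CYCLES if double_precision else SP_CYCLES
    return sum(table[op] for op in ops)


# Length of a 2-vector: two multiplies, one add, one square root.
norm = ["mul", "mul", "add", "sqrt"]
sp_cost = total_cycles(norm, double_precision=False)
dp_cost = total_cycles(norm, double_precision=True)
```

With these figures the same four-operation sequence costs 12 cycles in single precision and 40 in double precision.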

The high-end SH-4 was used in the Sega Dreamcast game console.  A 64-bit version, SH-5, was defined, but no hardware was produced.  SH-2 is still being produced for deeply embedded systems, but the architecture never saw widespread use.  With the patents expired, there has been some effort to create soft cores, notably the J-Core project.  The architecture is supported by GCC.

Renesas RX 

The RX uses variable-length instructions and has 16 general-purpose registers; R0 doubles as the stack pointer.  It supports 32-bit floating-point numbers.

MIPS

SPARC V8