Sunday, November 7, 2010

A Survey of Microcontroller CPU Core Architectures: 8-bit core

By sheer numbers, microcontrollers dominate the processing world. They are ubiquitous in industrial controls and consumer electronics. Unlike desktop computers, which are dominated by a single Intel architecture, microcontrollers come in a great variety of architectures from many vendors. Here I survey the CPU core architectures of the microcontrollers that I have used.

8-bit Microcontroller Core

Cypress PSoC M8C Core

We start with the M8C core in the Cypress PSoC (Programmable System-on-Chip) microcontrollers. PSoC is a very interesting microcontroller because of its configurable array of digital and analog peripherals. But here we focus on the CPU core, which is among the simplest. Cypress says, "The M8C is a 4 MIPS 8-bit Harvard architecture microprocessor." This is 4 MIPS at a 24 MHz clock rate, and it is a Harvard architecture because data and instruction memory are accessed through separate address spaces, which is typical for small microcontrollers with on-chip memory. Even Cypress would not call it high-performance.
The M8C is an archaic architecture. It has five internal registers: Accumulator (A), Index (X), Program Counter (PC), Stack Pointer (SP) and Flags (F). The PC register is 16-bit and all others are 8-bit. The M8C has three separate address spaces: ROM, RAM and Registers. ROM has its own 16-bit address bus and 8-bit data bus, while RAM and Registers share an 8-bit data bus and an 8-bit address bus. RAM and Registers are nevertheless not in the same address space; a register access is an I/O operation with its own read/write strobes. ROM, which is flash memory, has a maximum address space of 64KB; the largest device introduced so far has 32KB of flash memory. The register space is two banks of 256 bytes. RAM consists of a number of 256-byte pages; the largest device introduced so far has 2KB of RAM. The Flags register is directly accessible at the register address CPU_F. CPU_F contains the Zero, Carry, interrupt enable, register space select and RAM page mode bit fields. The SP register points to a RAM address set up by the application.
There are thirty-seven types of instructions. Instructions are one to three bytes long and take from four to fifteen cycles. The operands of ALU instructions can be RAM locations; the logic instructions and the MOV (Move) instruction can also access the register space. Given that the shortest instruction takes four cycles, it appears that the CPU is not pipelined. The arithmetic instructions take from four to ten cycles depending on the addressing mode. Take the ADD instruction: if the operands are the Accumulator and an immediate value, it takes four CPU cycles, possibly one cycle to fetch the first instruction byte, one cycle to decode, one cycle to fetch the second instruction byte and one cycle to execute. If the second operand is a direct RAM address, it takes six cycles, with two additional cycles to access memory. If the second operand is an indexed memory location, it takes seven cycles, incurring one extra cycle to add the index register. If the first operand is a direct RAM address, it also takes seven cycles, with one extra cycle to write back the result. The JMP (Jump) instruction is two bytes long and takes five cycles, of which two may be spent on 16-bit address arithmetic. A conditional jump instruction such as JC (Jump if Carry) is also two bytes long and takes five cycles regardless of whether the branch is taken. The CALL instruction is two bytes long and takes eleven cycles, which include two pushes to store the PC register on the stack and the increment of the stack pointer. The ROM area can be read with the instruction ROMX, which retrieves the byte addressed by the concatenation of registers A and X, or with the instruction INDEX, which retrieves ROM data relative to PC. There is also an instruction MVI that moves RAM data through a data pointer with post-increment. The stack is accessed with PUSH and POP, and some instructions can operate on the SP register.
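To make the cycle counts above concrete, here is a minimal sketch that converts them into execution times at the 24 MHz clock and works out the average cycles per instruction implied by the 4 MIPS rating. The dictionary keys and function names are my own labels for illustration, not Cypress mnemonics.

```python
# Cycle counts for the M8C ADD instruction, as quoted in the text.
# The keys are informal labels, not official assembler syntax.
ADD_CYCLES = {
    "A, immediate": 4,
    "A, [direct]": 6,
    "A, [X+offset]": 7,
    "[direct], A": 7,
}

CLOCK_HZ = 24_000_000  # clock rate behind the 4 MIPS figure


def execution_time_ns(cycles, clock_hz=CLOCK_HZ):
    """Return the time taken by an instruction of `cycles` CPU cycles."""
    return cycles * 1e9 / clock_hz


def average_cycles_per_instruction(mips, clock_hz=CLOCK_HZ):
    """Return the average cycles per instruction implied by a MIPS rating."""
    return clock_hz / (mips * 1_000_000)


if __name__ == "__main__":
    for mode, cycles in ADD_CYCLES.items():
        print(f"ADD {mode}: {cycles} cycles, {execution_time_ns(cycles):.1f} ns")
    print("average cycles per instruction:", average_cycles_per_instruction(4))
```

The 4 MIPS rating works out to an average of six cycles per instruction, which sits comfortably inside the four-to-fifteen range quoted above.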
So we can conclude that the implementation of the architecture is very simple and takes up very few resources, and that the performance is relatively poor. The code density is good because of the variable-length instructions and the accessible ROM space. Preemptive multitasking is possible because the stack space and the stack pointer are accessible. The power of PSoC lies not in its CPU core but in the configurable arrays. Some devices include a Multiply and Accumulate unit as a peripheral accessible through I/O registers. C compilers for the M8C are available from HI-TECH Software and ImageCraft. The ImageCraft compiler is based on David Hanson's lcc retargetable compiler. There is also an open-source utility package, m8cutils, which contains an assembler, a disassembler, a programmer and a simulator.

Holtek HT46

Holtek's microcontrollers are billed as cost-effective MCUs, and Holtek calls the HT46 series 8-bit high-performance RISC architecture microcontrollers. They are either OTP (one-time-programmable) or mask type. They feature up to 8K words of program memory and up to 384 bytes of data memory. Most instructions take one instruction cycle; some branch instructions and the table read instructions take two. One instruction cycle is actually four clock cycles. A two-stage pipeline is employed: Fetch and Execute. If a branch instruction is encountered, the pipeline is flushed. A separate memory, indexed by the stack pointer (SP), is used for the stack. The stack is 6, 8 or 16 levels deep, which limits the depth of subroutine calls but is not a serious restriction for a small microcontroller. The I/O registers and the CPU registers all reside in the data memory space, addressable by the same instructions as the data memory. The CPU registers include the accumulator (ACC), the low byte of the program counter (PCL), the status register (STATUS) and the look-up table registers (TBLP and TBLH). The status register contains the usual CPU status bits: zero flag (Z), carry flag (C), auxiliary carry flag (AC) and overflow flag (OV), as well as the power down flag (PDF) and the watchdog time-out flag (TO). All instructions are one instruction word long; the word is 14, 15 or 16 bits wide depending on the subtype, to accommodate the size of the program counter. The instructions operate on data memory directly, between data memory and the accumulator, or between the accumulator and an immediate value. The program memory can be read with the table read instructions. The skip instructions are used for conditional branching. The subroutine call instruction pushes the program counter onto the stack and the return instruction restores it. Besides the regular ALU instructions, there is a decimal adjust instruction for BCD (binary coded decimal) support.
The instructions use direct memory addressing, but indexed memory addressing is partially supported through two additional registers, the memory pointer (MP) and the indirect addressing register (IAR). MP holds a memory address, and an access to IAR is redirected to the memory location that MP points to.
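The MP/IAR mechanism is easier to see in a toy model. The sketch below is only an illustration: the class name, the 256-byte memory size and the register addresses (IAR at 00H, MP at 01H) are assumptions made for the example, so check the data sheet of the specific device before relying on them.

```python
class DataMemory:
    """Toy model of HT46-style indirect addressing through IAR and MP."""

    IAR = 0x00  # assumed address of the indirect addressing register
    MP = 0x01   # assumed address of the memory pointer

    def __init__(self, size=256):
        self.cells = [0] * size

    def _resolve(self, address):
        # An access to IAR is redirected to the location held in MP;
        # every other address is used as-is (direct addressing).
        if address == self.IAR:
            return self.cells[self.MP]
        return address

    def read(self, address):
        return self.cells[self._resolve(address)]

    def write(self, address, value):
        self.cells[self._resolve(address)] = value & 0xFF


if __name__ == "__main__":
    mem = DataMemory()
    mem.write(DataMemory.MP, 0x40)   # point MP at location 40H
    mem.write(DataMemory.IAR, 0x55)  # the store lands in location 40H
    print(hex(mem.read(0x40)))
```

Stepping through an array then amounts to incrementing MP between accesses to IAR, which is why the text calls the support for indexed addressing partial.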
So the HT46 has fixed-length instructions, which simplify decoding. It does not really use a load/store architecture, though the data memory can be regarded as a set of general-purpose registers, and the accumulator has a special role in the instruction set. Mapping the I/O registers into the data memory and doing without indexed or other addressing modes simplify the instruction set. The two-stage pipeline is short, so each instruction cycle takes four clock cycles and the core falls short of the maximum attainable performance. The simplicity of the instruction set probably leads to a very compact implementation, hence the very low cost. The actual instruction encoding is not published, so there are few third-party development tools.

Microchip PIC

Microchip PIC microcontrollers are especially popular among hobbyists. Here we focus on the mid-range devices, the PIC16 series.
The PIC16 has separate program and data memory buses. The data memory is made up of CPU registers, I/O registers and general-purpose registers. The program counter can be accessed through two registers, PCL and PCLATH. Instructions are 14 bits long. A two-stage pipeline is used for fetch and execution; each instruction cycle is 4 clock cycles. Most instructions take one instruction cycle and the branching instructions take two. It has an eight-level hardware stack. There are thirty-five instructions. Most of them involve the accumulator (W). The instructions that operate on W and a file register contain a bit that determines whether the result goes to W or to the file register. The file register address is encoded in seven bits, for a total of 128 bytes. The Register Bank Select bits in the Status Register allow access to a larger register file. The skip instructions are for conditional branching. The lower eleven bits of the PC are encoded in the unconditional branch instruction and the upper two bits come from the PCLATH register.
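The way the branch target and the banked file register address are assembled can be sketched in a few lines. The function names are made up for this example, and the choice of PCLATH bits 4:3 as the source of the upper two PC bits is an assumption based on my reading of the mid-range data sheets.

```python
def goto_target(instruction, pclath):
    """Form the 13-bit destination of an unconditional branch.

    instruction: 14-bit instruction word; its low 11 bits hold the address.
    pclath: PCLATH value; bits 4:3 are assumed to supply PC bits 12:11.
    """
    low11 = instruction & 0x7FF
    high2 = (pclath >> 3) & 0x3
    return (high2 << 11) | low11


def file_register_address(f, bank):
    """Combine the 7-bit file register field with a bank number (0 to 3)."""
    return ((bank & 0x3) << 7) | (f & 0x7F)


if __name__ == "__main__":
    # A branch word with address field 0x123 while PCLATH selects page 1.
    print(hex(goto_target(0x2800 | 0x123, 0x08)))
    # File register 0x05 seen through bank 1.
    print(hex(file_register_address(0x05, 1)))
```

Because the instruction word carries only eleven address bits, a program larger than 2K words has to set PCLATH before branching across a page boundary.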

Xilinx PicoBlaze

PicoBlaze is an 8-bit RISC microcontroller soft core for Xilinx FPGAs. Xilinx does distribute synthesizable VHDL/Verilog code, but the code is structural rather than behavioral. The use of Xilinx FPGA primitives in the code prevents it from being used on other programmable devices. PicoBlaze evolved from Ken Chapman's programmable state machines.
PicoBlaze has sixteen general-purpose registers, 64 bytes of RAM, a program memory of 1K 18-bit words and 256 I/O ports. There is a separate 31-deep stack for subroutine calls. The ALU instructions operate on the general-purpose registers and immediate values. It uses a load/store architecture, with the FETCH and STORE instructions accessing the RAM space and the INPUT and OUTPUT instructions accessing the I/O space. There are conditional jump instructions as well as conditional call instructions. There is no direct access to the status register, so separate instructions are provided to enable or disable interrupts. The return-from-interrupt instruction restores the carry, zero and interrupt flags.
So here we have fixed-length instructions for easy decoding, uniform register-based ALU operations and simple addressing modes. There is no instruction to access the program space. Execution is not pipelined; the fixed two-clock instruction cycle constrains the maximum attainable speed. Its implementation on a Xilinx FPGA takes 96 slices (106 LUTs and 76 flip-flops) and 1 BlockRAM. 100 MIPS is possible on some Xilinx FPGAs. There is a free behavioral implementation of PicoBlaze called PacoBlaze, supported by a Java version of the assembler.
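Two of the figures above follow from simple arithmetic, sketched here with constant and function names of my own choosing. With a fixed two-clock instruction cycle, the instruction rate is half the clock rate, and the program store is 1K words of 18 bits.

```python
CLOCKS_PER_INSTRUCTION = 2  # fixed instruction cycle stated in the text
PROGRAM_WORDS = 1024        # 1K instructions
WORD_BITS = 18              # instruction width in bits


def mips(clock_mhz):
    """Instruction rate in MIPS at a given clock frequency in MHz."""
    return clock_mhz / CLOCKS_PER_INSTRUCTION


def clock_mhz_for(target_mips):
    """Clock frequency in MHz needed to reach a target MIPS figure."""
    return target_mips * CLOCKS_PER_INSTRUCTION


def program_memory_bits():
    """Total size of the program store in bits."""
    return PROGRAM_WORDS * WORD_BITS


if __name__ == "__main__":
    print(clock_mhz_for(100), "MHz needed for 100 MIPS")
    print(program_memory_bits(), "bits of program memory")
```

So 100 MIPS requires a 200 MHz clock, and the program store comes to 18,432 bits, which is consistent with the single BlockRAM quoted above if the device's block RAMs are 18 Kbit each (an assumption about the target FPGA family).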

Atmel AVR

Atmel's AVR microcontrollers are among the most popular. Atmel offers an extensive line of microcontrollers based on the AVR core, complemented with a full set of peripherals. They range from the tiny 1KB Flash/no SRAM/1.2MHz parts to the XMEGA with 384KB Flash/32KB SRAM/32MHz.
AVR has 32 8-bit general-purpose registers. The instructions are a fixed 16 bits long (with a few exceptions) and the instruction encoding is quite regular, with fixed locations for the source and destination registers and the immediate value. The status register (SREG) is mapped into the I/O space, and dedicated instructions set, clear or test individual bits in it. Most instructions execute in one or two cycles; some branch instructions take longer. A two-stage pipeline is used for fetch and execution. The AVR instruction set is rich and strives for performance rather than minimalism. Multiplication instructions are included. The program memory, RAM and I/O space are accessed by separate instructions: the RAM space by the LD/ST instructions, the I/O space by the IN/OUT instructions and the program memory by the LPM/SPM instructions. The CPU registers, I/O registers and RAM actually share one data address space: the 32 registers occupy addresses 0x0000 to 0x001F, followed by the 64 I/O registers from 0x0020 to 0x005F and then the SRAM. When 64 I/O registers are not enough, the extended I/O registers have to be accessed with the LD/ST instructions. The stack is maintained by the stack pointer, lives in the RAM area and is accessed by PUSH/POP. Many different addressing modes are available to the load and store instructions, including pointers with pre-decrement and post-increment. Registers 26-31 pair up as the indirect address registers X, Y and Z. The load/store instructions take two cycles and are not penalized by the complex addressing modes. There are branch instructions for almost every conceivable condition. If the branch is taken, an extra cycle is incurred to flush the pipeline. The call instruction takes 4 cycles, including two extra cycles to push the PC onto the stack. The return and return-from-interrupt instructions take 4 cycles; the status register is not restored by the RETI instruction.
Here we see an implementation of a modern RISC architecture: a large register set, a fixed-length and uniform instruction format, load-store operations and simple addressing modes for ALU instructions. Development is well supported by the GNU compiler tools, including the Windows version, WinAVR. Different versions of the AVR core in VHDL are available as open source; pavr is a pipelined implementation of the instruction set, in which a 6-stage pipeline is used to achieve a high clock rate.
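The overlap between the register file, the I/O space and the SRAM can be illustrated with a small address classifier. This is a sketch that assumes the classic AVR layout, in which the 64 registers reachable with IN/OUT occupy data addresses 0x0020 to 0x005F; the function names are mine, and individual devices differ in where the SRAM proper begins.

```python
REGISTER_END = 0x1F  # r0..r31 occupy data addresses 0x0000-0x001F
IO_BASE = 0x20       # first I/O register in the data address space
IO_COUNT = 64        # assumed number of registers reachable with IN/OUT


def io_to_data_address(io_address):
    """Translate an IN/OUT I/O address into the LD/ST data address."""
    if not 0 <= io_address < IO_COUNT:
        raise ValueError("IN/OUT can only encode I/O addresses 0x00-0x3F")
    return io_address + IO_BASE


def classify(data_address):
    """Name the region of the data space that an LD/ST address falls in."""
    if data_address <= REGISTER_END:
        return "register file"
    if data_address < IO_BASE + IO_COUNT:
        return "I/O register"
    return "extended I/O or SRAM"


if __name__ == "__main__":
    print(hex(io_to_data_address(0x3F)))  # the last IN/OUT-addressable register
    print(classify(0x0010), "/", classify(0x0060))
```

The same register therefore has two addresses, one for IN/OUT and one that is 0x20 higher for LD/ST, which is a common source of confusion when reading data sheets.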

Zilog eZ80

The Z80 descends from the 8080. The latest from Zilog is the eZ80 series, which we'll focus on. The eZ80 is upward object-code compatible with the Z80 and Z180.
eZ80 uses a three-stage pipeline: fetch, decode and execute. The eZ80 CPU has two banks of 8-bit registers, which include the accumulator (A), six working registers (B, C, D, E, H, L) and the Flag register (F). The working registers can combine to form 16-bit registers. The control registers include Interrupt Page Address Register (I), Index Register (IX, IY), Memory Mode Base Address Register (MBASE), Program Counter Register (PC, 16 or 24-bit), Refresh Counter Register (R) and Stack Pointer Register (SPL for 24-bit and SPS for 16-bit). eZ80 can operate in two modes, the 16-bit addressing Z80 mode and the 24-bit addressing ADL mode.
This is a fairly sophisticated CISC architecture. Maintaining compatibility and increasing performance (four times faster and a 256 times larger address space) are important objectives in this implementation.
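The relationship between the two addressing modes can be sketched as follows. This assumes that in Z80 mode the MBASE register supplies the upper eight bits of the 24-bit address, which is how I understand the MBASE description; the function names are made up for the example.

```python
def z80_mode_address(mbase, addr16):
    """Form a 24-bit address from MBASE and a 16-bit Z80-mode address."""
    return ((mbase & 0xFF) << 16) | (addr16 & 0xFFFF)


def address_space_ratio(adl_bits=24, z80_bits=16):
    """How many times larger the ADL address space is than the Z80 one."""
    return (1 << adl_bits) // (1 << z80_bits)


if __name__ == "__main__":
    print(hex(z80_mode_address(0x12, 0x3456)))
    print(address_space_ratio())
```

Under that assumption, legacy Z80 code sees an ordinary 64KB window that MBASE can place in any of the 256 pages of the 16MB space, which is where the factor of 256 comes from.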

8051

The venerable 8051 originated at Intel, has been widely second-sourced and is still popular. It serves as the embedded processor in the Cypress USB controllers and the Chipcon (now TI) Zigbee transceivers. The microcontrollers are available from Atmel, Silicon Laboratories (formerly Cygnal), NXP (formerly Philips), Dallas Semiconductor (Maxim), Cypress (PSoC 3) and others. HDL cores, including several open-source versions, are also available for FPGAs. SDCC is an open-source C compiler that targets the 8051.
8051 was introduced around 1980 as an enhancement to the 8048 architecture.  The original 8051 was fabricated on an nMOS process and had 60,000 transistors with 4KB factory mask-programmable ROM.
The 8051 is an accumulator-based architecture: all the ALU instructions go through the register A. Another register, B, is dedicated to the multiply and divide instructions. The 8051 has separate code and data spaces. The Program Counter is 16-bit, addressing the 64K program memory space. The registers A, B, SP (Stack Pointer), PSW (Processor Status Word) and DPTR (16-bit Data Pointer) are also mapped into the 128-byte Special Function Register (SFR) address space (addresses 128-255). The Stack Pointer is 8-bit, addressing only the internal RAM. The PSW register includes the carry, auxiliary carry, user flag, register bank select, overflow and parity flags. The flags are updated as the result of an instruction. The SFR space also hosts the registers for peripherals. There are 4 banks of general-purpose registers R0-7; they are also mapped to the internal RAM (addresses 0-31). The internal RAM is 128 bytes. The 8051 architecture actually allows 256 bytes of internal RAM; the upper 128 bytes share their addresses with the SFRs and can be reached only with the register-indirect addressing mode, while the SFRs can be reached only with the direct addressing mode. Addresses 20h to 2Fh hold 128 bits of bit-addressable memory. Some of the SFRs are also bit-addressable. This makes bit operations easy.
The instructions are 1 to 3 bytes long. There are five basic addressing modes: register, direct, register-indirect, immediate and base-register-plus-index-register-indirect. The register addressing mode accesses registers R0-7, the direct addressing mode accesses the internal RAM at addresses 0-127 and the SFRs at addresses 128-255, and the register-indirect addressing mode accesses the memory address held in register R0 or R1. The instruction MOVC moves a byte from the program memory using the base-register-plus-index-register-indirect mode, where the accumulator is the index register and DPTR or PC is the base register; this instruction is for table lookup. MOVX moves data between the external data memory and the accumulator. In addition to the usual arithmetic and logic instructions, there is an instruction for the packed BCD format. The jump instructions can be PC-relative, take an 11-bit or 16-bit absolute address, or go indirectly through DPTR. The subroutine calls save the PC on the stack and the return instructions restore the PC from the stack. The CPU begins execution from program memory address 0000h. The interrupt vectors start from address 0003h with 8-byte spacing.
The original 8051 runs off a 12MHz clock. One instruction cycle takes 12 clock cycles. The instructions take 1 or 2 instruction cycles, except for MUL/DIV, which take 4 instruction cycles. More recent implementations achieve one instruction cycle per clock cycle. Clearly, execution in the original is not pipelined.
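The register banks and the bit-addressable area are both just views onto the internal RAM, which a short sketch makes explicit. The function names are illustrative, and the rule used for the upper half of the bit address space (bit addresses 80h-FFh map onto SFRs whose byte address is a multiple of 8) is stated here from my recollection of the MCS-51 documentation.

```python
def register_address(bank, n):
    """Internal RAM address of register Rn in the given bank (0 to 3)."""
    if not (0 <= bank <= 3 and 0 <= n <= 7):
        raise ValueError("bank must be 0-3 and register number 0-7")
    return bank * 8 + n


def bit_location(bit_address):
    """Map an 8-bit bit address to (byte address, bit position)."""
    if not 0 <= bit_address <= 0xFF:
        raise ValueError("bit address must fit in 8 bits")
    if bit_address < 0x80:
        # Bits 00h-7Fh live in internal RAM bytes 20h-2Fh.
        return 0x20 + (bit_address >> 3), bit_address & 7
    # Bits 80h-FFh belong to SFRs at byte addresses that are multiples of 8.
    return bit_address & 0xF8, bit_address & 7


if __name__ == "__main__":
    print(register_address(1, 0))    # R0 of bank 1
    print(bit_location(0x7F))        # last bit of the RAM bit area
```

For instance, bit address D7h resolves to bit 7 of the byte at D0h, which is where the PSW and its carry flag sit on a standard 8051.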
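The vector layout described above reduces to one line of arithmetic. In the sketch below the formula comes straight from the text, while the list of interrupt sources and their order is an assumption based on the standard 8051; derivatives add more sources.

```python
VECTOR_BASE = 0x0003  # first interrupt vector
VECTOR_SPACING = 8    # bytes between consecutive vectors

# Assumed source order for a standard 8051.
SOURCES = ["external 0", "timer 0", "external 1", "timer 1", "serial port"]


def vector_address(n):
    """Program memory address of the n-th interrupt vector (n from 0)."""
    return VECTOR_BASE + VECTOR_SPACING * n


if __name__ == "__main__":
    for n, name in enumerate(SOURCES):
        print(f"{name}: {vector_address(n):04X}h")
```

Eight bytes is too little for most handlers, so in practice each vector usually holds a jump to the real interrupt service routine.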
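The timing figures translate into wall-clock time as follows. This is a small sketch with a function name of my own; the defaults are the 12MHz clock and 12 clocks per instruction cycle of the original part.

```python
def instruction_time_us(instruction_cycles, clock_mhz=12.0, clocks_per_cycle=12):
    """Execution time in microseconds for a given number of instruction cycles."""
    return instruction_cycles * clocks_per_cycle / clock_mhz


if __name__ == "__main__":
    print(instruction_time_us(1), "us for a one-cycle instruction")
    print(instruction_time_us(4), "us for MUL or DIV")
    # A core with one clock per instruction cycle, at the same 12MHz clock.
    print(instruction_time_us(1, clocks_per_cycle=1), "us on a one-clock core")
```

At 12MHz the original part therefore peaks at about 1 MIPS, and a core with one clock per instruction cycle is up to twelve times faster at the same clock rate.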

68HC11

The 68HC11 also uses an accumulator architecture. It has two 8-bit accumulator registers, A and B, which can combine into a 16-bit register D. The program counter (PC), the stack pointer (SP) and the two index registers IX and IY are all 16-bit. The condition code register (CCR) is 8-bit and holds the carry, overflow, zero, negative and interrupt mask bits. The 68HC11 has a unified memory space: data memory, program memory and memory-mapped peripheral registers reside in the same 64KB address space. There are five addressing modes: immediate, direct, extended, indexed and relative. The direct addressing mode is for the lowest 256 bytes of memory. The extended addressing mode covers the entire 16-bit address range. The indexed addressing mode uses an index register plus an 8-bit unsigned offset. The relative addressing mode is for branching relative to the PC with an 8-bit signed offset. The instruction set supports some 16-bit operations through register D, such as adding two 16-bit numbers.
In general, operations with the immediate addressing mode take two cycles, direct three cycles, extended and indexed with IX four cycles, and indexed with IY five cycles. The instructions that use the IY indexed addressing mode have one extra opcode byte. The branch and jump instructions take three cycles and the jump-to-subroutine instruction takes 5-7 cycles depending on the addressing mode.
Because the stack pointer is 16-bit and can point to any memory location, the 68HC11 is less restrictive than the 8051.
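The D register and the relative addressing mode can both be sketched with a few bit operations. The function names are illustrative; the sketch assumes that A forms the high byte of D and that a branch offset is applied to the address of the instruction following the branch.

```python
def d_register(a, b):
    """Combine accumulators A (assumed high byte) and B into 16-bit D."""
    return ((a & 0xFF) << 8) | (b & 0xFF)


def split_d(d):
    """Split 16-bit D back into the (A, B) pair."""
    return (d >> 8) & 0xFF, d & 0xFF


def add16(d, operand):
    """16-bit addition through D; returns (result, carry flag)."""
    total = (d & 0xFFFF) + (operand & 0xFFFF)
    return total & 0xFFFF, total > 0xFFFF


def branch_target(next_pc, offset_byte):
    """Apply an 8-bit signed offset to the address after the branch."""
    offset_byte &= 0xFF
    offset = offset_byte - 0x100 if offset_byte & 0x80 else offset_byte
    return (next_pc + offset) & 0xFFFF


if __name__ == "__main__":
    print(hex(d_register(0x12, 0x34)))
    print(add16(0xFFFF, 1))
    print(hex(branch_target(0x1002, 0xFE)))  # offset FEh means -2
```

The signed 8-bit offset limits a branch to a window of -128 to +127 bytes around the next instruction; anything farther needs a jump with extended addressing.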

1 comment:

  1. Interesting post i came across, but did it cut off at 68HC08 ?
