jiristepanovsky.cz

Return to Index


16-bit Serial Homebrew CPU - 2023

Building a homebrew CPU from scratch takes a large number of logic chips. It is understandable, that implementing registers, program counter, ALU, and other components of the CPU in TTL or CMOS logic does require a substantial amount of chips. But how many exactly?

I've tried to optimize my homebrew CPU for the lowest amount of logic chips possible and answer a question:
How few ICs are required for a Turing-complete CPU without a CPU?

16-bit Serial Homebrew CPU

My answer is a 16-bit Serial CPU with only 8 ICs, including memory and clock. It has 128kB SRAM, 768kB FLASH, and can be clocked up to 10MHz. It contains only a 1-bit ALU, but the majority of its 52 instructions operate on 16-bit values (serially). At its maximum speed, it executes roughly 12k instructions per second (0.012MIPS) and, among other things, is capable of streaming a video on PCD8544 based (Nokia 5110) LCD at ~10 FPS.

Depending on where you place the line between a state machine and a CPU, my 16-bit system might actually be the CPU with the lowest amount of ICs. Although, some other contestants are Jeff Laughton's 1-bit computer with 1 instruction and 1-bit of memory, and Daniel Thornburgh's Simple CPU with 1 byte-byte-jump instruction and memory simulated on a Raspberry PI.

Hardware:

The architecture is inspired by other CPU builds like James Sharman's JAM-1, Ben Eater's SAP-1, Warren's 4-bit Crazy Small CPU, its 8-bit version, and others. All of them, and many others alike, use a "control" EEPROM, EPROM, or ROM for generating control signals to the CPU components. Because it is way easier than generating them by logic circuits alone, and because it offers more flexibility in the future, I've also decided to use such a "control" memory, specifically, an EPROM. Contrary to the builds mentioned above, I've aimed for the lowest possible chip count, so I've tried to "squeeze" as much data processing inside the memory as possible, to either lower the demands on other CPU components or better yet, eliminate them completely. Here are some key steps taken:

Even with these dramatic simplifications, some additional hardware is still required, however, everything can be built with just 8 chips in total, following the schematic below:

16-bit Serial Homebrew CPU

The circuit is built around a 128kB M27C1001-15 EPROM, working at 5V and combining a control state machine with a 1-bit ALU. Its output lines are latched by a 74HC574 every clock cycle and control the two 23LCV512 64kB serial SRAMs and one W25Q80 1MB serial FLASH. There are not enough outputs to control each memory individually, so they share the data line and partially also the chip select line, only the clock lines are kept separate. I could not find a 5V serial FLASH memory, so resistors R3, R4, and R5 limit the current and form a bridge from 5V to 3.3V. I do not count the MCP1703 3.3V voltage regulator as part of the CPU (I consider it, to be a part of a power supply), but with it, my CPU contains 9 chips.

The current instruction is stored in a buffered shift register 74HC595, which has its control lines also partially shared with the memories. Every instruction takes a couple of cycles to complete, so the progress within an instruction is tracked by a "microcode" counter 74HC393. After the instruction completes, the "Counter_reset" line resets the "microcode" counter and starts the execution of the next instruction buffered in 74HC595.

The 74HC574 and the "microcode" counter 74HC393 use opposite clock edge, so the clock generator 74HC14 provides an inverted clock signal to the 74HC393 to make them both synchronous.

16-bit Serial Homebrew CPU Schematic

Inputs and Outputs:

One thing I was not able to reasonably implement in my CPU is a self-programming of the FLASH memory. Bootloader is, therefore, not possible, and uploading a new program to the serial FLASH must be done externally. For this purpose, I used an Attiny13 microcontroller that listens to a set of commands over a UART, so any USB-UART adapter is sufficient for uploading a new code. When programming, it disables the output of the 74HC574 via the "Prog_en" line and proceeds to program the FLASH memory directly. The microcontroller is used only for uploading a new program and the CPU is happily running without it.

The only available outputs are the two upper bits of the instruction shift register 74HC595. I used one of these inverted lines as a chip select, which allows the CPU to connect to an SPI-like device. For example, a 3.3V PCD8544-based SPI LCD (Nokia 5110) can be connected directly, with the second upper instruction bit acting as the LCD data/command selector.
Another option is connecting an additional 74HC595 shift register instead of an LCD to get classic digital output lines.

The only available inputs are the two memory data/input signals connected to the EPROM address lines (A9, A11). The serial memories keep these signals at high impedance when they are not in use, so they can be sampled as general digital inputs when the memories are idle. It is important to note that the input signal must not interfere with the memory data, so a high resistance between the input signal and the memory input line is required (R6, R7). Sidenote: reading the input signal on memory data lines is working for clock frequencies only up to about 8MHz. At higher frequencies, the sampled data seem to be erratic and the CPU is prone to stalling.

16-bit Serial Homebrew CPU Peripherals

You could already see my CPU playing the "Bad Apple!!" music video on a PCD8544 LCD somewhere on top of this page. In the video below, I demonstrate the possibility to control general digital outputs by adding another 74HC595. The same circuit can be used to produce 8-bit music at up to 4300 samples per second if an R-2R ladder is used instead of LEDs, and it is the same circuit I used to produce the soundtrack for the "Bad Apple!!" video.

Memory map:

The CPU has no dedicated registers, but it has two SRAMs that it can read from and write to. The downside is that each time the CPU wants to access data, it has to write the full 16-bit address to the serial SRAM. The upside is that because it has to write the full 16-bit address anyway, the CPU (and instructions in general) can access all of the 64kB of the SRAM at a constant time.

I've selected one SRAM (U8/RAM1) to be used for holding the program data, and all arithmetic and logical operations are designed to be performed on values within this memory. The second SRAM (U7/RAM2) is meant to be used for a stack, so only a few instructions are able to access and modify its contents.
The first few bytes of both memories are reserved for storing the internal CPU state like the program counter, the flag bit, the stack pointer, an intermediate result, the source/destination addresses, and other internally used values. The approximate memory map is in the table below:

Address:0x00x10x20x30x40x50x60x70x80x90xA0xB0xC0xD0x000E~0xFFFF
RAM1:Flag & InputProgram counter (PC)Program counter reversedStack pointer (SP)Stack value (SPVAL)Registers and user data
RAM2:FlagProgram counter (PC)Destination addressInstruction's resultStack and user data

One thing I would like to mention is the method of using the FLASH memory as the second ALU input. Because the FLASH is quite large (1MB), it is possible to fit inside it a full 16-bit lookup table containing 16-bit identical values. With this 128kB lookup, it is then possible to write a 16-bit value to the FLASH as an address and read back the same 16-bit value as data, which can be used as an ALU input.

A slight inconvenience of using the serial memories is that they are addressed in an MSB-first format, while the 1-bit ALU naturally computes in an LSB-first format. To get a functional memory addressing, we need to reverse the bits from the LSB-first format the CPU works with to the MSB-first format the memories work with. Reversing bits using a 1-bit ALU is not straightforward, so I've reserved another 128kB of FLASH memory for a "reversed-values" lookup table to make the operation faster. It works the same way as the previous lookup, a value is written to the FLASH memory as an address and its reversed representation is read back as data.

These two 16-bit lookup tables are the reason my CPU has only 768kB of FLASH memory and why the program counter (PC) starts at address 0x040000 and not zero.

Instruction set:

There are some restrictions for the instruction set arising from the limited hardware. The CPU is capable of only 64 unique instructions/operations, all of which have to fit in the maximum of 256 micro-instruction steps and must get by working with only a 1-bit ALU and 1 Flag bit. But even with these limitations, surprisingly, it is possible to create quite a nice-looking instruction set:

OP codeNameOperandsWidthFlagCyclesTotalDescription
0x00INIT--clear256256Wait for clock to stabilize, then initialize RAM ICs to sequential mode
0x01RESET--clear235235Set program counter PC = 0x040000 and stack pointer SP = 0x000A
0x02----158414Shadow instruction: Fetch
0x03----256414Shadow instruction: Fetch continuation
0x04----129129Shadow instruction: Increment program counter PC = PC + 3
0x05----129129Shadow instruction: Increment program counter PC = PC + 5
0x06----129129Shadow instruction: Increment program counter PC = PC + 7
0x07----129129Shadow instruction: Increment program counter PC = PC + 8
0x08----162291Shadow instruction: Copy 32 bit result
0x09----130259Shadow instruction: Copy 16 bit result
0x0A----113113Shadow instruction: Copy program counter
0x0B----167296Shadow instruction: Store to RAM indirect
0x0C----151280Shadow instruction: Store to RAM indirect
0x0D----173587Shadow instruction: Arithmetic instruction dispatch
0x0ESTF--set132546Set FLAG
0x0FCLF--clear132546Clear FLAG
0x10NOP---132546No operation
0x11MOVaddr16 <- addr1616-231774Move 16 bit value
0x12MOVWaddr16 <- addr1632-146851Move 32 bit value
0x13INCaddr16 <- addr1616overflow231774Increment
0x14DECaddr16 <- addr1616overflow231774Decrement
0x15COMaddr16 <- addr1616zero2317741's complement (NOT)
0x16NEGaddr16 <- addr1616zero2317742's complement
0x17LSLaddr16 <- addr1616overflow233776Left shift (<<)
0x18LSRaddr16 <- addr1616overflow233776Right shift (>>)
0x19ROLaddr16 <- addr1616overflow233776Left shift with carry
0x1ARORaddr16 <- addr1616overflow255798Right shift with carry
0x1BASRaddr16 <- addr1616overflow235778Arithmetic right shift (keeps sign bit)
0x1CREVaddr16 <- addr1616-238781Bit reverse
0x1DADDIaddr16 <- addr16, val1616overflow231774Add immediate
0x1EADCIaddr16 <- addr16, val1616overflow231774Add immediate with carry
0x1FSUBIaddr16 <- addr16, val1616overflow231774Subtract immediate
0x20SBCIaddr16 <- addr16, val1616overflow231774Subtract immediate with carry
0x21ANDIaddr16 <- addr16, val1616zero231774Logical AND with immediate
0x22ORIaddr16 <- addr16, val1616zero231774Logical OR with immediate
0x23XORIaddr16 <- addr16, val1616zero231774Logical XOR with immediate
0x24ADDaddr16 <- addr16, addr1616overflow171887Add register
0x25ADCaddr16 <- addr16, addr1616overflow171887Add register with carry
0x26SUBaddr16 <- addr16, addr1616overflow171887Subtract register
0x27SBCaddr16 <- addr16, addr1616overflow171887Subtract register with carry
0x28ANDaddr16 <- addr16, addr1616zero171887Logical AND with register
0x29ORaddr16 <- addr16, addr1616zero171887Logical OR with register
0x2AXORaddr16 <- addr16, addr1616zero171887Logical XOR with register
0x2BJMPaddr24--197611Jump to address
0x2CCALLaddr2432-221748Copy following instruction's address (PC + 4) and current FLAG to SPVAL, then jump
0x2DRET-32restore138552Move SPVAL to PC & FLAG (effectively returns from CALL and restores previous FLAG)
0x2EBRFSaddr24--160625|574Branch if FLAG set
0x2FBRFCaddr24--160625|574Branch if FLAG cleared
0x30BREQaddr16, addr2416-243708|657Branch if register is zero
0x31BRNEaddr16, addr2416-243708|657Branch if register is not zero
0x32LDIaddr16 <- value1616-81624Load 16 bit immediate
0x33LDIWaddr16 <- value3232-113656Load 32 bit immediate
0x34LDaddr16 <- [addr16]16-238911Indirect load 16 bits from address
0x35LDBaddr16 <- [addr16]8-238911Indirect load 8 bits from address, set upper 8 bits to 0
0x36ST[addr16] <- addr1616-163873Indirect store 16 bits to address
0x37STB[addr16] <- addr168-163857Indirect store 8 bits to address
0x38LD2W[addr16]32-256799Indirect load 32 bits from address in RAM2 to SPVAL register
0x39LD2[addr16]16-224767Indirect load 16 bits from address in RAM2 to SPVAL register
0x3AST2W[addr16]32-256799Indirect store 32 bits from SPVAL register to address in RAM2
0x3BST2[addr16]16-224767Indirect store 16 bits from SPVAL register to address in RAM2
0x3CLPMaddr16 <- [addr16]16-211884Indirect load 16 bits from address in FLASH
0x3DLPBaddr16 <- [addr16]8-211884Indirect load 8 bits from address in FLASH, set upper 8 bits to 0
0x3EOUTaddr168-252795Output 8 bits over SPI
0x3FHALT--clear14428Stop execution

The first instructions, INIT and RESET, are executed at power-up or when the RESET button is pressed. The "shadow" instructions are non-user-accessible instructions, primarily used for repeating operations like fetching an instruction, program counter increment, result write-back, and similar.

Arithmetic and logical operations use the one Flag bit as either a Carry/Overflow or a Zero flag. As mentioned above, there is no performance penalty for accessing the full address space, so all these instructions can specify any source/destination address within the 64kB SRAM address space. Indirect addressing for arithmetic operations is not supported directly but must be performed by LD/ST (load/store) instructions.

The second set of LD2/ST2 instructions is accessing the second SRAM. They are meant to be used for stack, but any data can be stored. PUSH and POP instructions are not implemented, but they can be constructed from LD2/ST2 and INC/DEC instructions.

An average instruction takes about 800 clock cycles, including fetch operation and program counter increment. At the maximum clock frequency of 10MHz, the CPU can execute about 12k instructions per second.

Writing code in assembler:

I use Lorenzi's customasm tool to generate binary files from assembly source code. The binary files can then be uploaded using a small python3 application to the Attiny13 programming microcontroller that writes the binary into the FLASH.

Below are two examples of small subroutines written in assembler for my CPU. The first subroutine returns the 32-bit result of two 16-bit values multiplication. The second one writes an ascii string stored inside the FLASH memory to the LCD.

Multiply32_16x16LCD_WriteStrF
; Returns FA32 = FA16 * FB16
; FB is expected to be smaller
Multiply32_16x16:
    ;PUSH_PC        ; Not necessary
    LDIW FC, 0      ; Clear result
    LDI FA+2, 0     ; Cast FA16 to FA32
.loop:
    ANDI TMP, FB, 1
    BRFS .skip_add
    ADD FC, FA      ; Add FC32 += FA32
    ADC FC+2, FA+2  ; Add FC32 += FA32
.skip_add:
    LSL FA          ; Shift FA32 << 1
    ROL FA+2        ; Shift FA32 << 1
    LSR FB          ; Shift FB16 >> 1
    BRNE FB, .loop
    MOVW FA, FC     ; Copy result
    ;POP_PC         ; Not necessary
    RET
; Write String in Flash
; input: FA32 <- Address of the string in Flash
LCD_WriteStrF:
    PUSH_PC             ; Save return address
    PUSHW RA            ; Save RA 32-bit
    MOVW RA, FA
.loop:
    LPB FA, RA          ; Load character from Flash
    BREQ FA, .stop      ; Test "\0" character
    REV FA              ; MSB-first -> LSB-first
    ANDI FA, FA+1, 0xFF ; Cast to 8bits
    CALL LCD_WriteChar  ; Write character
    ADDI RA, 1          ; Increase 32-bit pointer
    ADCI RA+2, 0        ; Increase 32-bit pointer
    JMP .loop
.stop:
    POPW RA             ; Restore RA 32-bit
    POP_PC              ; Restore return address
    RET

Maximum Frequency and the Critical path:

According to specifications, the total propagation delay across the critical path is:

Putting it together, the circuit should only be able to run at ~4.6MHz. My specific build, however, is able to work flawlessly up to 10MHz and becomes unstable only above ~10.5MHz. For a circuit built on a breadboard with plenty of parasitic capacitance, I consider it quite impressive. The maximum clock rate might even be improved if a faster binary counter or faster EPROM were used.

Conclusion and Retrospective:

I am really pleased with the finished CPU. It has nice and "easy-to-work-with" instruction set with all the basic instructions present. It is powerful enough to stream a video on a small LCD screen, play audio (though using an external "sound card"), and generally perform the simple input/output computational operations it was originally intended for. Finally, it successfuly demonstrates that it is possible to build a functional homebrew CPU with only a handful of ICs.

There are, however, some small improvements possible:

I will have yet to see if some of these improvements seem to be worthwhile to implement. In the meantime, if you like the project and want to dive deeper, you can skim through the source code available here. It contains a simulator, EPROM microcode generator, Attiny13 programmer firmware, and all of my assembler codes.

Update:

I've implemented a minimalistic 3D wireframe object projection engine using 16-bit fixed-point arithmetic. Multiplying matrices on my 0.012 MIPS CPU is quite slow, so 3D games are probably not coming anytime soon:

I'm also slowly growing the list of hardware my CPU directly supports. I've added an SPI alphanumeric LCD I've salvaged from an old HP printer:

16-bit Serial Homebrew CPU

and I've been able to "bit-bang" the serial interface for DS1302 real-time clock. The software does have to use some special instruction sequences to produce the necessary signals, but it is possible and does not require any additional hardware.

16-bit Serial Homebrew CPU

Update 2:

The CPU now supports a PCF8833 LCD driver, although one frame takes about 96 seconds to render.

16-bit Serial Homebrew CPU

Homebuilt CPUs WebRing

Definitely check out other awesome homebrew CPU builds on Warren's https://www.homebrewcpuring.org

Join the ring?

To join the Homebuilt CPUs ring, drop Warren a line (mail is obfuscated, you have to change [at] to @), mentioning your page's URL. He'll then add it to the list. You will need to copy this code fragment into your page (or reference it.)
Note: The ring is chartered for projects that include a home-built CPU. It can emulate a commercial part, that′s OK. But actually using that commercial CPU doesn′t rate. Likewise, the project must have been at least partially built: pure paper designs don′t rate either. It can be built using any technology you like, from relays to FPGAs.