Asynchronous 40-bit TTL CPU


Introduction

In the Autumn of 1992 I designed a processor that would use only TTL-logic chips. I didn't actually build it, which would have required hundreds (probably thousands) of chips, and days containing significantly more than the standard 24 hours provided. This page contains scans of my (incomplete) design notes.

This processor is particularly unusual in that it is asynchronous. In a normal processor, every activity is coordinated by a clock signal. An asynchronous processor has no clock signal: every part of the circuit does its processing as fast as possible and signals subsequent circuits when it is complete.

The advantage of this approach is that every part of the circuit can run at optimum speed. In a "normal" synchronous processor, the speed is often limited by the speed of the slowest part.

Asynchronous devices also consume less power. In conventional processors, much of the power is eaten up by the clock generator: a powerful clock signal is required because it must propagate as quickly as possible to all areas of the chip. Asynchronous processors also generate much less electromagnetic interference (EMI), because activity is not all coordinated to occur at the same precise instants; in a clocked design that coordination is what causes power spikes. The lower power consumption and EMI were obviously not even a consideration for my TTL monster processor, and in fact I didn't even think about them.

I had never heard of asynchronous processing and thought it a very innovative idea. Later (much later) I learnt that asynchronous processors were already under development elsewhere (often you think you invented something and later find someone else has already had the same idea). The University of Manchester AMULET group was started in 1990 and has designed a number of ARM-compatible asynchronous processors.

I thought that being asynchronous would give me high performance. My design notes are incomplete and surely would have required considerable further development before being workable. But I do believe the basic principles could have led to a functioning TTL asynchronous processor. The features of this processor are as follows:

o 40-bit word size.
o 20-bit address space (1M words of 40 bits, i.e. 5 MByte), includes DRAM controller.
o 20-bit instructions, with a few 40-bit. Two 20-bit instructions fit in one word.
o RISC Z80-like instruction set
o 10 general purpose 40-bit registers
o Floating point format: 1 sign bit, 8 exponent bits, 31 fractional bits
o Includes integer and floating point addition, subtraction and multiplication units
o Controlled, including all I/O, by a Z80-based host computer.

There are 59 scanned pages. Each is about 870 x 1140 pixels in size and takes about 65 KBytes on average. At this resolution all the pages should be readable.


1. Instruction set

The 5th September 1992 and I begin by thinking about the instruction set. I decide to have 16 registers, so that the register number will fit in 4 bits. The computer will be a 40-bit machine. Hopefully I will be able to fit 2 instructions in each 40-bit "word", each one taking only 20 bits. This way I can execute 2 instructions per instruction fetch. On this page the right column indicates the number of bits the instruction operands require.

The instruction groups are:

o NOP (No operation)
o Arithmetic (Multiply, Add, Subtract), integer and floating point
o Logical
o Shifts and rotates
o Loads
o Individual bit testing, setting and resetting
o Jumps, Calls, Branch (relative jump) and returns
o I/O instructions and control


2. Instruction set (continued)

The Jump, Call, Branch and Return instructions. I note that the Jump and Call instructions will not fit inside the 20-bit size I hope to use for each instruction. The destination address must be 20 bits or more if the memory is to be a reasonable size.


3. Floating Point format and Execution Units

Next I decided the format to use for floating point (none of this IEEE compliant nonsense). I use 1 sign bit, 8 exponent bits and a 31-bit fractional part. This I consider should be a reasonable accuracy for most applications, and of course it fits the 40-bit word size nicely.
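To illustrate how the three fields fill a 40-bit word exactly, here is a minimal Python sketch. The bit order (sign in the top bit, then exponent, then fraction) and the function names are assumptions for illustration; the text above only fixes the field widths.

```python
SIGN_BITS, EXP_BITS, FRAC_BITS = 1, 8, 31   # widths from the text: 1 + 8 + 31 = 40

def pack_fields(sign, exponent, fraction):
    """Pack the three fields into one 40-bit word (assumed order: sign, exponent, fraction)."""
    assert 0 <= sign < (1 << SIGN_BITS)
    assert 0 <= exponent < (1 << EXP_BITS)
    assert 0 <= fraction < (1 << FRAC_BITS)
    return (sign << (EXP_BITS + FRAC_BITS)) | (exponent << FRAC_BITS) | fraction

def unpack_fields(word):
    """Split a 40-bit word back into (sign, exponent, fraction)."""
    fraction = word & ((1 << FRAC_BITS) - 1)
    exponent = (word >> FRAC_BITS) & ((1 << EXP_BITS) - 1)
    sign = word >> (EXP_BITS + FRAC_BITS)
    return sign, exponent, fraction
```

The sketch deliberately says nothing about exponent bias or a hidden leading bit, since neither is stated here.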

I regroup the instructions and decide on what execution units I will need to build. Each unit will handle a particular type of instructions. The list of execution units doesn't seem complete.


4. Detailed Instruction Codes

The next day I consider in detail the format the instructions will take. The first 3 bits will always code the unit required. The following bits specify the opcode and operands (register numbers for source and destination data, etc). The load and store to memory instructions will require a whole 40 bits.


5. Detailed Instruction Codes: Add/Sub/Logic, Shift

The opcode for the add/sub/logic instructions fits nicely into 5 bits, so there are 12 left for the two source registers and the destination register. Neat (now you may begin to realise why I chose 40 bits for my word size).
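The bit budget can be checked with a short Python sketch: 3 unit bits, 5 opcode bits and three 4-bit register numbers make exactly 20 bits. The ordering of the register fields and the names encode_alu and decode_alu are assumptions; only the field widths come from the notes.

```python
def encode_alu(unit, opcode, src1, src2, dst):
    """Build a 20-bit add/sub/logic instruction: 3 + 5 + 4 + 4 + 4 bits."""
    for value, width in ((unit, 3), (opcode, 5), (src1, 4), (src2, 4), (dst, 4)):
        if not 0 <= value < (1 << width):
            raise ValueError("field out of range")
    # Unit code occupies the first (top) 3 bits; the register order is assumed.
    return (unit << 17) | (opcode << 12) | (src1 << 8) | (src2 << 4) | dst

def decode_alu(instr):
    """Recover (unit, opcode, src1, src2, dst) from a 20-bit instruction."""
    return ((instr >> 17) & 0x7, (instr >> 12) & 0x1F,
            (instr >> 8) & 0xF, (instr >> 4) & 0xF, instr & 0xF)
```

With every field at its maximum the result is 0xFFFFF, so no bit of the 20 is wasted.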

The shift instructions also have a source and destination register. I have marked 5 bits to specify the amount of shift but I think I would really need 6 (for 0-39 bits of shift).


6. Detailed Instruction Codes: Bit manipulation and jumps

Bit manipulations also fit exactly and neatly into 20 bits. 3 bits specify the action required, 6 bits the number of the bit in the word (0..39), and 8 bits the source and destination registers.

The jumps and calls require the full 40 bits for specification of the target address. Branches fit in 20 bits, with 12 bits available to specify the displacement from the current location. The Jump, Branch, Call and Return instructions all have conditional variants, which test the state of the Carry, Zero and Sign flags.


7. Block Diagram and Register descriptions

A small block diagram showing the interconnection of the units in the CPU. The CPU includes its own memory and memory control, and all input/output is intended to be via a host Z80 processor.

There are 9 general purpose 40-bit registers. Register 0 is always zero, and the remaining 6 registers are for the stack pointer, program counter, loop counter, In and Out registers, and index register.


8. Scoreboarding and Bus Arbitration

Now I give some consideration to the scheduling and organisation of the units. The Scoreboard ensures that registers and execution units are locked while they are awaiting a result write or in use.

On this page I also make a few calculations relating to the Mandelbrot set, concerning the number of instructions that must be executed in each iteration of the Mandelbrot calculation. Yes, drawing Mandelbrot sets is considered to be the first application for the computer...


9. Instruction set codes again

A review of the instruction set, in which things change slightly.


10. Instruction set codes again (continued)


11. Instruction set codes again (shifts, bit manipulation, jumps)


12. More on Scoreboarding

Some more thoughts on the practical implementation of the scoreboarding. Also, consideration of an alternative way to code the instructions.


13. Block Diagram of CPU

A nice block diagram of the CPU, as it stands so far.


14. DRAM refreshing

The 9th September 1992 and I decide to think about refreshing of dynamic memories (DRAM), and begin by calculating how often I will need to do it, and how long it will take.


15. DRAM refresh circuit

A real circuit diagram of how I will take care of the refreshing. At the bottom of page 14 is a circuit to generate one pulse every 4 ms, which is how often I will be doing a refresh. This is generated from a 62.5 kHz signal, which would in fact come from the host Z80 computer's video controller circuit. The refresh controller also requires a 100 MHz clock.
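The figures quoted here can be cross-checked with a little arithmetic, shown below in Python. This only verifies the numbers in the text; how the division is actually wired up is on the scanned page and is not reproduced here.

```python
SOURCE_HZ = 62_500        # 62.5 kHz signal from the host's video controller (from the text)
REFRESH_PERIOD_S = 0.004  # one refresh pulse every 4 milliseconds (from the text)

# Number of input cycles per refresh pulse, i.e. the division ratio needed.
divide_ratio = round(SOURCE_HZ * REFRESH_PERIOD_S)

# Period of one cycle of the 100 MHz refresh-controller clock, in nanoseconds.
clock_period_ns = 1e9 / 100e6

print(divide_ratio)      # 250
print(clock_period_ns)   # 10.0
```

So the 62.5 kHz input has to be divided by 250 to give one pulse per refresh interval.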

During the refresh, the CPU gets suspended, and the hosting Z80 also has to WAIT. The refresh operation doesn't start until the rest of the CPU acknowledges the refresh request.


16. Z80 Memory Controller

This CPU design only has I/O via the host Z80 computer, and so the contents of the memory are programmed by the Z80. The circuit on this page allows the host Z80 to read and write the memory. At the bottom right I draw a nice diagram indicating the external connections of this unit.


17. DRAM page mode

I devote some attention to the issue of dynamic RAM page mode access. Page mode is much quicker than a fully random access. In page mode, the row address of the memory location is locked, and columns are read from that same row.

I feel that if I use page mode wherever possible, my CPU will run twice as fast as it would otherwise, because I can get my instructions twice as quickly.


18. Instruction fetch and pipeline

More details of the precise memory timing I will need to generate in order to arrange for page mode. Here are some designs for circuits to arrange this timing, and ideas for the instruction buffer (pipeline).


19. DRAM timing for Page-mode access

More on the timing of the DRAM page-mode access. Note that the indicated "100 MHz" is for thought purposes only: the CPU is asynchronous so in reality no such clock exists. In practice I will ensure that propagation delays in the instruction fetch circuit will be long enough so that the memory will always have the required amount of time to access and return its data.

[Or is this true? Did I in fact intend to use a 100 MHz clock for the memory timing, in order to ensure precise timing? In this case I would have considered the resultant 10 ns period very short, so that effectively the clock was only used to ensure precise timing, not to synchronise any other part of the processor].


20. Instruction buffer

Early sketches concerning the operation of the instruction buffer and memory timing.


21. Propagation Delays of TTL families

13th September 1992 and I carry out some library research into the propagation delays of various TTL types.


22. Instruction Pipeline and Load/Store Unit

Here is a detailed diagram of the instruction pipeline and load/store units, which are closely connected.


23. Instruction Pipeline (continued)

More of the instruction fetch and memory timing control circuits. The second circuit here relates to the memory timing, but has evolved to a more carefully thought-out version.


24. Shift unit thoughts

Now I spend some mental energy considering the controversial topic of shifting. I want a fast barrel shifter, that will be able to shift by any number of bits. This must be done in stages, but using what TTL chips? To try to work out the best configuration, I consider designs including cascaded stages consisting of a number of the following TTL chips:

o 74LS151: Single 8-1 line multiplexer
o 74LS153: Dual 4-1 line multiplexer
o 74LS157: Quad 2-1 line multiplexer


25. Instruction Fetch Pipeline evolution

15th September 1992. After that brief interlude with the barrel shifter, it's back to work on the instruction fetch pipeline. These circuits show the evolution of the circuit, as I iron out the problems one by one...


26. Instruction Queue Unit

Finally a more complete version of the instruction queue unit, which I believe might work.


27. Memory Controller

The final version of the memory control circuit. This turns the 20-bit address bus into the multiplexed Row and Column form required for the DRAMs, and generates the RAS and CAS signals. It is the only part of the CPU having a clock: the 100 MHz clock is used to generate the precise timing required by the DRAM. The rest of the CPU from then on is entirely asynchronous.
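As a rough picture of what the multiplexing involves, here is a hedged Python sketch. It assumes the 20-bit address is split into a 10-bit row (upper bits) and a 10-bit column (lower bits); the actual split depends on the DRAM chips used and is not stated here. The function names are hypothetical.

```python
ROW_BITS = 10   # assumption: the 20-bit address splits evenly into row and column
COL_BITS = 10

def split_address(addr):
    """Return (row, column) for a 20-bit word address."""
    if not 0 <= addr < (1 << (ROW_BITS + COL_BITS)):
        raise ValueError("address out of range")
    return addr >> COL_BITS, addr & ((1 << COL_BITS) - 1)

def is_page_hit(prev_addr, next_addr):
    """True when both addresses share a row, so only a new column address is needed."""
    return split_address(prev_addr)[0] == split_address(next_addr)[0]
```

Under this assumed split, consecutive instruction fetches stay in the same row for 1024 words at a time, which is the case where page mode pays off.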


28. Address unit; Memory arbitration

The address unit generates the memory address for the memory controller unit to use to access the DRAM. This address can come from the program counter, the second instruction field of the Jump and Call instructions, or by adding the displacement address of a branch instruction to the current location.

Later I start to think about bus arbitration. This circuit will decide when the CPU can control its own memory, and when it should yield control to the host Z80 so that the host can read/write the memory.


29. Evolution of the Memory Arbitration Circuit

A few more attempts at designing a working circuit for the memory arbitration.


30. Final Circuit for Memory Arbitration

18th September 1992 rolls around and I decide that my memory arbitration circuit design is complete. It controls the Z80 BUSRQ and WAIT signals and arranges for a clean handover of control when the Z80 requests it.


31. Register File Definition

A reminder of the register naming and usage. Each register also has its own scoreboard bit (half a dual D-type 74LS74 flip flop), whose state is "0" if the register is available or "1" if it is busy.


32. Instruction Decoding and Scheduling

The instruction decode is conceptually simple. The first 3 bits of every instruction specify the unit to be used for the instruction. The instruction is "issued" when the unit scoreboard indicates the required execution unit is available, and the scoreboard also indicates each of the source and destination registers is available.


33. Instruction Scheduler / First ALU thoughts

Final diagram of the scheduler. Given an "Instruction Valid" signal from the instruction queue, it checks register and unit availability. If everything is Ok it enables the correct source register outputs and generates signals for the execution units to start processing. The scoreboard bits then get set (unit and result register).
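The issue rule can be modelled in a few lines of Python. This is a behavioural sketch only, not the TTL circuit: the class and method names are hypothetical, and the model allows 8 units simply because the unit field is 3 bits wide.

```python
class Scoreboard:
    """Busy bits for 16 registers and up to 8 execution units (3-bit unit field)."""

    def __init__(self):
        self.reg_busy = [False] * 16
        self.unit_busy = [False] * 8

    def can_issue(self, unit, sources, dest):
        """An instruction may issue only if its unit and all its registers are free."""
        if self.unit_busy[unit]:
            return False
        # Register 0 is always zero, so it is never treated as busy.
        return not any(self.reg_busy[r] for r in (*sources, dest) if r != 0)

    def issue(self, unit, sources, dest):
        """Issue if possible, marking the unit and the result register busy."""
        if not self.can_issue(unit, sources, dest):
            return False
        self.unit_busy[unit] = True
        if dest != 0:
            self.reg_busy[dest] = True
        return True

    def write_back(self, unit, dest):
        """Called once a unit has written its result: release unit and register."""
        self.unit_busy[unit] = False
        self.reg_busy[dest] = False
```

An instruction that fails can_issue simply waits; in the real design that waiting would be done by the hardware holding the instruction until the scoreboard bits clear.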

Also shown here are my first thoughts on the ALU design.


34. Sketches for the BitMan and Load/Store Units

Some preliminary consideration of the Bit Management (test/set/reset) unit. This one can reset any specified bit to "0", set it to "1" or test its state.

The load/store unit is responsible for generating the required memory address. Depending on the state of instruction bits 10, 11 and 12, this address is one of

o 0: an absolute address specified in the 2nd instruction field (40-bit instruction)
o 1: Stack Pointer: The address specified by the stack pointer is used
o 2: Index Register + displacement specified in the instruction

When the stack pointer is used, it is subsequently incremented or decremented automatically by this unit, corresponding to a stack "POP" or "PUSH".
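The three address sources listed above can be sketched as a simple selection in Python. The register names 'sp' and 'ix', the function name, and the wrap-around to 20 bits are assumptions; the automatic stack pointer update is left out because the direction of PUSH and POP is not stated here.

```python
ABSOLUTE, STACK, INDEXED = 0, 1, 2   # mode values listed in the text
ADDR_MASK = (1 << 20) - 1            # 20-bit address space

def effective_address(mode, regs, operand=0):
    """Return the memory address for a load/store.

    regs is a dict holding 'sp' and 'ix' (hypothetical names for the stack
    pointer and index register); operand is the absolute address or the
    displacement, depending on the mode.
    """
    if mode == ABSOLUTE:
        return operand & ADDR_MASK
    if mode == STACK:
        return regs["sp"] & ADDR_MASK
    if mode == INDEXED:
        # Assumed behaviour: the sum wraps within the 20-bit address space.
        return (regs["ix"] + operand) & ADDR_MASK
    raise ValueError("unknown addressing mode")
```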


35. Load/Store Unit

Final diagram of the Load/Store Unit. Note the nice block representation at the bottom of the page showing all input and output signals connected to this unit.


36. Bit Management Unit; ALU

The final version of the bit management unit, responsible for reset to "0", set to "1", or testing of any bit.
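In software terms the three operations amount to masking a single bit of a 40-bit word. A minimal Python sketch follows; the function name and the string action names are illustrative only.

```python
WORD_BITS = 40

def bit_op(word, bit, action):
    """Apply 'set', 'reset' or 'test' to one bit of a 40-bit word.

    Returns (new_word, tested_bit); tested_bit is None except for 'test'.
    """
    if not 0 <= bit < WORD_BITS:
        raise ValueError("bit number must be in 0..39")
    mask = 1 << bit
    if action == "set":
        return word | mask, None
    if action == "reset":
        return word & ~mask, None
    if action == "test":
        return word, (word >> bit) & 1
    raise ValueError("unknown action")
```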

In the lower part of the page, a diagram of the ALU. This is rather simple, mainly because it uses the 74LS181 4-bit ALU (8 of them).


37. Result Arbitration

One of the most important parts of the CPU design! All the units operate asynchronously, and can be operating in parallel. Yet they all ultimately want to write the results of their computation to the register file using the internal bus. The result arbitration circuit helps to decide who can use the bus at what time.

There are 5 units which may need to write back results. In addition the instruction issue unit will require the bus to load source data from the register file to the units. I therefore arrange a stack of 5 flip flops. There is a "result phase" during which the instruction issue unit allows results to be written. In this phase, each of the 5 possible units is checked in turn and allowed to write its result if it needs to.
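The result phase behaves like a fixed-order poll of the five units. The Python sketch below models only that behaviour; the visiting order and the function name are assumptions, and the real circuit does this with the flip flops described above.

```python
def result_phase(requests):
    """Visit the units in a fixed order and grant the bus to each one that asks.

    requests is a list of 5 booleans, one per unit that can write back.
    Returns the unit numbers in the order they were granted the bus.
    """
    if len(requests) != 5:
        raise ValueError("expected one request flag per unit (5 units)")
    grants = []
    for unit, wants_bus in enumerate(requests):
        if wants_bus:
            grants.append(unit)   # only one unit drives the bus at a time
    return grants
```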


38. Register File and Scoreboard

This page shows the construction of the register file and scoreboard. Register 0 is always zero by definition, so it is NEVER busy.


39. IOU, the Input/Output Unit

This is a set of 74LS374 octal D-type latches, 40 bits wide, one bank for the input and one for the output. Using these the host Z80 can send data to and from the CPU.

When the CPU executes an IN instruction, a Z80 interrupt gets generated. The IOU is then marked as busy on the unit scoreboard until the Z80 has loaded all five 8-bit registers, so a complete 40-bit CPU word is ready. Only then does the IOU send a "finish" signal to the Result Arbitration circuit.

When the CPU executes an OUT instruction, a Z80 interrupt is also generated. The IOU busy scoreboard bit remains set until the Z80 has read all five 8-bit chunks of the 40-bit word.
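Because the Z80 is an 8-bit machine, each 40-bit word crosses the interface as five bytes. The Python sketch below shows the packing; the byte order (first chunk is the least significant) and the function names are assumptions.

```python
def assemble_word(chunks):
    """Combine five 8-bit values into one 40-bit word (first chunk = low byte, assumed)."""
    if len(chunks) != 5 or any(not 0 <= c <= 0xFF for c in chunks):
        raise ValueError("need exactly five 8-bit values")
    word = 0
    for i, chunk in enumerate(chunks):
        word |= chunk << (8 * i)
    return word

def split_word(word):
    """Split a 40-bit word into the five 8-bit chunks the host would read."""
    return [(word >> (8 * i)) & 0xFF for i in range(5)]
```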


40. Barrel Shifter sketch and FPU addition notes

Here is a design for a barrel shifter using two stages: first a set of 74LS153 dual 4-1 multiplexers, followed by a set of 74LS151 single 8-1 multiplexers. This operates on 32 bits and can shift by any number of bits in just these two stages of logic. Only 32 bits are considered, because this barrel shifter is meant to be used in the floating point unit to normalise the fractional part.

Later I start to work out what is involved in floating point addition. A few worked examples are required...


41. On the subject of Floating Point Addition

More deliberations and worked examples as I attempt to understand how to add two floating point numbers together.


42. Align Block

This is a development of the shifter sketched previously. Alignment of one of the operands in a floating point addition operation is required prior to the actual addition. The smaller operand is shifted right by the difference in the exponent fields. What I am really doing here is lining up the binary points so I can later add the two numbers.

The align block can shift the 32-bit fractional part of a floating point number (well actually 31 bits), by any number of bits in the range 0..31. It does this in only two stages of logic. To accomplish this feat requires a mere 28 74LS151 chips, 10 74LS153 and 4 74LS157...
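Two stages are enough because a 4-way choice followed by an 8-way choice gives 4 x 8 = 32 distinct shift amounts. The Python sketch below models that split; which stage takes the fine shift and which the coarse shift is an assumption, as is the function name.

```python
WIDTH = 32

def align_shift(value, amount):
    """Shift a 32-bit value right by 0..31 bits in two multiplexer-like stages."""
    if not 0 <= amount < WIDTH:
        raise ValueError("shift amount must be in 0..31")
    fine = amount & 0b11          # stage 1: 4-way choice, shift by 0..3
    coarse = amount >> 2          # stage 2: 8-way choice, shift by 0, 4, ..., 28
    stage1 = (value & 0xFFFFFFFF) >> fine
    return stage1 >> (4 * coarse)
```

For every amount from 0 to 31 the result equals a single right shift by that amount, which is what the align block needs.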


43. Normalise Block

Very similar to the Align Block, but in reverse. The normalise block is used after a floating point operation to left-shift the result, decrementing the exponent field accordingly, so that the most significant bit of the floating point fractional field is a "1".


44. On the counting of leading zeros

Before I can normalise, I have to count the number of leading zeros in the result's fractional part, so that I can tell the normalise block how many bits to shift left by. Here I have a few thoughts about this task.
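The relationship between the zero count and the normalising shift can be sketched in Python. Each left shift doubles the fraction, so the exponent must drop by the same count to leave the value unchanged. The function names are hypothetical, and exponent underflow is ignored in this sketch.

```python
FRAC_BITS = 31

def leading_zeros(fraction):
    """Count the zero bits above the most significant 1 in a 31-bit fraction."""
    if not 0 <= fraction < (1 << FRAC_BITS):
        raise ValueError("fraction must fit in 31 bits")
    return FRAC_BITS - fraction.bit_length()

def normalise(exponent, fraction):
    """Left-shift the fraction until its top bit is 1, reducing the exponent to match."""
    if fraction == 0:
        return exponent, fraction      # zero has no leading 1; left unchanged here
    shift = leading_zeros(fraction)
    return exponent - shift, fraction << shift
```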


45. Leading Zero Count

The final diagram for the leading zero counter. Note the large number of diodes, which replace TTL OR-gates. I always thought if you needed a large number of inputs to an OR-gate, why not use a set of diodes?


46. Floating Point Addition Sketches

It's now 25th September 1992 and I need to remind myself of my newfound understanding of how to add floating point numbers, and how I planned to implement this in TTL.


47. Add/Subtract Floating Point: First design

Now I add the complication of being able to subtract as well. To do this I use a bank of exclusive OR gates (XOR) before and after my adder. This first sketch design needs considerable further work before it can be said to be anywhere near complete.


48. Library Research

27th September 1992 and I found myself in a wonderful library. I am not Swedish (despite my name) but on this date I had occasion to be present at the KTH technical university library in Stockholm. Here I found various books about computer architecture and the floating point implementations of the IBM 360/91. On this page I jotted down some useful information.


49. More Library Research

On the subject of floating point division and multiplication. Oh no! I have only recently understood how to add floating point numbers, let alone multiply or divide them.


50. A 4-bit Synchronous counter and Floating Point Multiplication

Here I note down the circuit for a 4-bit synchronous counter. I also found this in the library and thought it might come in useful. I did mention sometime earlier in these CPU notes that I needed a fast synchronous counter, well here it is.

I also decide to try to understand floating point multiplication. Once again, a few worked examples are the best way...


51. First sketches for a floating point multiplier

Some first preliminary ideas for my floating point multiplication implementation. I also want this multiplier to be able to multiply integers, which is an easier task.


52. Multiplier

Complete design for the multiplier.

When multiplying two floating point numbers together, it isn't necessary to align them first. Just multiply them, and take the most significant bits of the result. Meanwhile, add the exponents. Post-multiplication normalisation will only ever result in a shift of 1 bit, so I don't need a full barrel shifter; instead I just have a multiplexer with its inputs both connected to the multiplication result but displaced by one bit.
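The one-bit claim can be checked with a small Python model. It assumes each operand is normalised, with a 31-bit fraction whose top bit is 1 and a value of (fraction / 2**31) * 2**exponent, so each fraction lies in [0.5, 1) and the product in [0.25, 1). Exponent bias and rounding are left out, and the function name is hypothetical.

```python
FRAC_BITS = 31
TOP_BIT = 1 << (FRAC_BITS - 1)

def fp_multiply(exp_a, frac_a, exp_b, frac_b):
    """Multiply two normalised values, each meaning (frac / 2**31) * 2**exp."""
    if not (frac_a & TOP_BIT and frac_b & TOP_BIT):
        raise ValueError("both fractions must be normalised (top bit set)")
    product = frac_a * frac_b            # up to 62 bits wide
    exponent = exp_a + exp_b             # exponents simply add (bias ignored here)
    if product >> (2 * FRAC_BITS - 1):   # top bit of the 62-bit product already 1
        fraction = product >> FRAC_BITS
    else:                                # otherwise exactly one left shift is needed
        fraction = product >> (FRAC_BITS - 1)
        exponent -= 1
    return exponent, fraction            # low-order bits are truncated
```

The two branches correspond to the two inputs of the multiplexer: the same product, taken at positions one bit apart.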

This unit can also multiply integers. In this case, the product is twice as wide as the operands. I decide that I must store this extra result word somewhere, it would be a shame to waste it. Register 15 seems like the ideal place. Previously I had specified register 15 to be the Program Counter. However, the Program Counter does not need to be accessible for read or write by the executing program, it is entirely under the control of the instruction decode related units. So I hide it from the register bank and use register 15 for the least significant word of the result. The most significant word goes to the specified result register.
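The split of the double-width product into two register-sized words looks like this in Python. Treating the operands as unsigned is an assumption, and the function name is illustrative.

```python
WORD_BITS = 40
WORD_MASK = (1 << WORD_BITS) - 1

def multiply_words(a, b):
    """Multiply two unsigned 40-bit words; return (high_word, low_word)."""
    if not (0 <= a <= WORD_MASK and 0 <= b <= WORD_MASK):
        raise ValueError("operands must fit in 40 bits")
    product = a * b                       # up to 80 bits
    return product >> WORD_BITS, product & WORD_MASK
```

In the design described above, the high word would go to the specified result register and the low word to register 15.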

The multiply circuit itself is not shown here; it's just drawn as a block with the MULT caption. I intend to use a fast parallel multiplier. I had some papers from the library about how to build one of those, but unfortunately it requires a large number of chips. I decide not to draw it here, but just to consider it as a functional block: give it two numbers and it will return the product some time later.

I also make the controversial decision to pipeline this unit, since it has a parallel multiplier. However, I don't pipeline it in the conventional way, with a register between each processing stage and a clock to move the results along the pipeline. Instead I use ripple-through (wave) pipelining. Just shove the inputs in one after the other, wait the right amount of time, and take the output. Believe it or not, wave-pipelined parallel multipliers have been built in practice. Whether or not I could ever get it working is debatable.


53. Instruction set once more

So the 27th September 1992 was a heavy day. All that library research and then the design of the multiplier. The following day I decide to revisit the instruction set again. This time I summarise the whole thing on just one page.


54. More on Floating Point Add/Subtract methods

The next day, the 29th September 1992, I decide I really have to do something about the design of the floating point add/subtract circuit. I start to draw the circuit but find my understanding is still lacking, or rather, that my previous understanding has been temporarily erased by my exploits with the multiplier. So I start on some more simple worked examples to get things straight in my head.


55. Floating Point Add/Subtract sketch

A first attempt at a design for the floating point add/subtract unit.


56. Final Floating Point Add/Subtract design

Finally on the 1st October 1992 I find myself in a position to complete the design of the add/subtract unit. This diagram includes as labelled blocks the Align, Normalise and Zero-Count blocks described previously.


57. Add/Subtract block diagram and Questions

It's a complicated unit, this Add/Subtract, and I STILL have some questions about its operation, which I carefully write down here in order that I may return to them at some later time.

Below that I draw the Add/Sub Floating Point unit as a block diagram showing its interconnections to the Align, Normalise and Zero Count blocks.


58. Another listing of the Registers

Which now also shows register 15 as the multiplier low word. Notice that I have also replaced the separate I/O registers with a single one which is used for both IN and OUT data. There are therefore now 10 general purpose registers available.

The idea of separating the input and output busses of the register file is a good one. It means that the instruction decode unit can access the output of the registers whenever it wants, without stopping the write-back of results to the register file by the units. Effectively then the bus arbitration circuit is always in the result phase, which should dramatically speed up the CPU.


59. IOU Redesign

So, 2nd October 1992 and this is the last page of my asynchronous TTL CPU design. I add a control and status register. I am unsure if there were any other changes since the last version (page 39).

