Chapter 3 The scheduler

Managing several superpipelined units that can deliver their results at the same time looks tricky at first. The following behavioural rules help to understand what to do and when:

 o The Xbar "gates" of the 2 write ports must be commanded during every cycle, so that the 2 read ports of the register set have the correct data coming from the correct unit.
 o An instruction cannot be issued if it would require more than two write ports during the cycle in which it completes, i.e. if both write ports are already reserved for that cycle.
 o If the instruction can be issued, it must use a "free" write port.

Let's remember too that the scoreboard rules apply. More specifically, an instruction cannot be issued if its operands are not ready, either in the register set or on the Xbar (during a register bypass cycle, for back-to-back dependent instruction pairs). The scheduler must also recognize this situation.

Two solutions are possible and were investigated:

 o The first possibility is to associate a Finite State Machine with every register. It is a countdown machine that triggers the appropriate signals as it elapses. Its advantage is that it is completely independent of the number of operations that can be issued during every clock cycle, and it is preferred for this reason. Unfortunately, it creates very large internal buses and the detection of hazards is too slow, particularly when the Register Bank's write bus must be allocated.
 o The second solution scales less well with the number of instructions issued per cycle, but it is a simple and deterministic algorithm that consumes far fewer resources when only one or two instructions are issued at a time. It is a FIFO that is as deep as the pipeline, and each line contains the number of the register that will be written to the register set. Since there are two write ports, each line contains two 6-bit fields. Two additional bits, one per field, indicate whether the "slot" is empty.
The empty lines are zeroed (the bits are cleared as they shift down the FIFO), but ORing the bits of a field to detect an empty slot takes too much time: a 6-bit OR takes more time and room than a 7th bit per field.

In fact, the scoreboard uses the first representation: the 63 bits that indicate whether a register is in use are spread along the register set, and each bit is cleared when the corresponding line gets written. This uses long wires and large buses, but it is rather simple. On the other hand, the second representation for the scoreboard (a set of registers containing the numbers of the registers currently in use) takes too many resources and does not scale well when more instructions are decoded at the same time.

The scheduling and scoreboarding information of the FC0 could use either representation, but both can also be used in parallel, as is the case here. Having both representations helps to get the wanted information with the least latency: if a bit vector is needed, it is read from the scoreboard, and if a register number is required, it is read from the scheduler's FIFO.

Now, there is a very important characteristic associated with the scheduling FIFO: a "slot" can be allocated at several levels, because the instructions can have different latencies. This means that a multiply instruction will "reserve" a slot (if it is free) at the 6th level of the FIFO, while an addition will reserve a slot at the second level. The instruction decoder must therefore provide the scheduler with precise information about the latency of the instruction it will issue. This information is stored in a Lookup Table that takes the opcode and the fields as inputs and outputs the number of cycles of latency for the instruction. This LUT is hardwired, but if the implementation supports 128+ bit registers, a certain part of it will be reconfigurable on-the-fly to support the programmable size field (see chapter 2.5 about the variable sizes).
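The slot reservation described above can be illustrated with a small behavioural model. This is only a sketch written in Python, not the FC0 hardware description: the class and function names are hypothetical, the FIFO depth of 6 is an assumption (the text only says that the FIFO is as deep as the pipeline and that a multiply reserves the 6th level), and the latency table contains only the two latencies quoted in the text. An empty slot is modelled with `None`, which plays the role of the additional "empty" bit.

```python
DEPTH = 6        # assumed depth: the multiply reserves the 6th level
WRITE_PORTS = 2  # the register set has two write ports

# Hypothetical latency LUT, reduced to the two cases quoted in the text.
LATENCY = {"add": 2, "mul": 6}


class SchedulingFifo:
    """Each level holds one slot per write port: a register number or None."""

    def __init__(self):
        self.levels = [[None] * WRITE_PORTS for _ in range(DEPTH)]

    def try_issue(self, opcode, dest_reg):
        """Reserve a free slot at the level given by the instruction latency.

        Returns the index of the reserved write port, or None when both
        ports are already taken for that cycle (the instruction must wait).
        """
        level = self.levels[LATENCY[opcode] - 1]
        for port, slot in enumerate(level):
            if slot is None:
                level[port] = dest_reg
                return port
        return None

    def clock(self):
        """Advance one cycle: level 1 leaves the FIFO, an empty level enters.

        Returns the two slots that reach the register set's write ports.
        """
        completing = self.levels.pop(0)
        self.levels.append([None] * WRITE_PORTS)
        return completing
```

In this model, a multiply issued four cycles before two additions ends up at the second level together with the first addition, so the second addition is refused: this is the "no free write port" case of the rules listed at the beginning of the chapter.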
When the instruction set is designed, the instructions must be guaranteed to have a fixed latency, so that the LUT can be as compact and fast as possible. This puts some pressure on the scheduling of two types of instructions: Load/Store and division. The Get/Put instructions are also "undetermined-latency instructions", but they block (stall) the pipeline.

The Integer Division Unit of the FC0 (a first "cheap" implementation) is a slow shift-subtract machine like those found in older microprocessors: the latency is proportional to the number of bits to divide. It is not pipelined, and the throughput is also proportional to this data width. The scheduling is therefore simplified: because the unit is not pipelined, the FIFO doesn't have to contain 64 slots for the case where a 64-bit number is divided; a simple downcounter is enough. Furthermore, this latency is either 8, 16, 32 or 64 cycles, and 8 cycles is more than the latency of the multiplier: the counter does not interfere with the FIFO, it sits on top of it and it is initialised very easily with the size of the instruction word.

The case of the Load/Store instructions is more difficult because it is not deterministic. The situation is simple when the data is already contained in the L/SU buffer; otherwise it's a real stinking can of worms.

When the data is contained in the L/SU buffer, the latency is deterministic: it takes one cycle in the buffer, one cycle in the byte shuffler (which selects and orders the bytes in a word), one cycle in the Xbar and one cycle in the Register Set. This is the situation that must be favoured whenever possible. It is promoted by the early issue of the address (the pointer must be known as soon as possible, so that the loaded data can be fetched from memory in advance) and by the wise use of the stream and cache hint bits.

When the data is not present in the L/SU buffer, the scheduler must prepare for an asynchronous event and there is no guarantee that a free slot will be available.
On average, it is probable that the 2 write ports of the register set are used 70% [...] the data is actually available. There is no such problem, though, when the loaded data is needed during the cycle following the load instruction: the pipeline will stall and leave some room for the L/SU to feed the Xbar with the desired data.
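The two deterministic cases of this chapter, the division downcounter and the load that hits the L/SU buffer, can be summarised with a short behavioural sketch in Python. It is not the FC0 implementation: the names are hypothetical, and the way the counter hands its result over to a write port is not modelled, since the text does not describe it. The sketch only shows that a single counter loaded with the operand width replaces 64 FIFO levels, and that the load hit latency is the sum of four one-cycle stages.

```python
# Divider latencies quoted in the text, in cycles (equal to the operand width).
DIVIDE_LATENCIES = (8, 16, 32, 64)

# Stages crossed by a load that hits the L/SU buffer, one cycle each.
LOAD_HIT_STAGES = ("L/SU buffer", "byte shuffler", "Xbar", "register set")


class DivideCounter:
    """Downcounter that tracks the single, non-pipelined division in flight."""

    def __init__(self):
        self.remaining = 0
        self.dest_reg = None

    @property
    def busy(self):
        return self.remaining > 0

    def start(self, width_bits, dest_reg):
        """Load the counter with the operand width; refuse if already busy."""
        if self.busy:
            return False
        if width_bits not in DIVIDE_LATENCIES:
            raise ValueError("unsupported operand width")
        self.remaining = width_bits
        self.dest_reg = dest_reg
        return True

    def clock(self):
        """Count one cycle; return the destination register when it elapses."""
        if not self.busy:
            return None
        self.remaining -= 1
        return self.dest_reg if self.remaining == 0 else None


def load_hit_latency():
    """Latency in cycles of a load whose data is already in the L/SU buffer."""
    return len(LOAD_HIT_STAGES)
```

Because the unit is not pipelined, `start` simply refuses a second division while the counter is running, which is the behaviour a stall would produce in this simplified model.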