搜档网
当前位置:搜档网 › Computation Structures Group

Computation Structures Group

Computation Structures Group
Computation Structures Group

CSAIL

Computer Science and Artificial Intelligence Laboratory

Massachusetts Institute of Technology

Designing a Reorder Buffer in Bluespec

In the proceedings of Formal Methods and

Models for Codesign (MEMOCODE'2004) ,

San Diego, California, June 22-25, 2004

Computation Structures Group

Memo 478

June, 2004

Nirav Dave

The Stata Center, 32 Vassar Street, Cambridge, Massachusetts 02139

Designing a Reorder Buffer in Bluespec

Nirav Dave

Computer Science and Arti?cial Intelligence Lab Massachusetts Institute of Technology

Cambridge,Massachusetts02139

Email:ndave@https://www.sodocs.net/doc/044066157.html,

Abstract

Production capabilities for complex VLSI chips have outpaced the ability of current generation CAD tools to de-sign and verify such chips effectively.Bluespec is designed to synthesize high-level descriptions in the form of guarded atomic actions into high quality structural RTL.While much work has been done on verifying both the correctness and synthesizability of Bluespec descriptions,the work on re-alistic large scale designs is in early stages.This paper explores the design of the reorder buffer for an out-of-order superscalar processor with a MIPS I ISA.We discuss the design methodologies which are suited for large scale Blue-spec design and discuss some of the dif?culties we encoun-tered.Even though the work is still in progress,we show what level of performance is achievable under the current Bluespec compiler and what problems need to be solved to make the tool viable for commercial production environ-ments.

1.Introduction

Previous work has shown that designs described using TRS,the underlying formalism of Bluespec are amenable to formal veri?cation[1].Before low-level timing and area concerns can be considered,one needs to see if the high-level description captures the inherent concurrency of the design appropriately.

In this paper we present the design process of a2-way re-order buffer(ROB)in Bluespec as part of a processor sup-porting the MIPS I ISA.The focus of this work is to de-termine whether this complex hardware design can be de-scribed in Bluespec such that its cycle-level performance is equivalent as its handwritten Verilog counterpart.

0-7803-8509-8/04/$20.00c 2004IEEE 1.1Related Work

There is a lot of interest in high-level hardware descrip-tion languages which make use of behavioral modeling, while still allowing for ef?cient hardware synthesis.Most commercial work in this?eld is focused on two approaches. The?rst is to increase the complexity level of RTL lan-guages to be more suitable for modeling,such as Behavioral Verilog.The second is to modify a standard language(e.g.

C or Java)to be more appropriate for describing hardware. In the latter case,Control Data Flow Graphs are extracted from the source description and techniques for compiling vector architectures are used to generate register transfer logic[4][6].Neither of these techniques has yet to produce a widely accepted hardware description language for syn-thesis.

One type of research has a focus on specialized programmable processors[6][10].This effort is only marginally associated with the problem of general purpose hardware description languages,as most of its emphasis is on processor speci?c issues,such as instruction encoding, and the automatically generated assemblers.

Two other types of languages have been explored by the research community to become an effective high-level HDL.The?rst of these types uses a synchronous speci?-cation language like Esterel,Signal,or Lustre.These lan-guages deal with real-time issues[2].Methods to compile Esterel into hardware have been written,but the results are not comparable to handwritten Verilog designs.

The second type uses an asynchronous language with atomic actions such as Dill’s Murphi[3],Sere’s Action Systems[9],Staunstrup’s Synchronized Transitions[11], and Arvind and Shen’s TRS[1].The primary principle of these languages is that all hardware systems can be de-scribed in two parts:a physical state(e.g.registers and storage)and a set of guarded atomic actions which describe the state-change transitions.It has been shown that these atomic descriptions can be translated into ef?cient hardware if rules are assumed to take one cycle[7][8].Bluespec is a

member of this second group of languages.

1.2Paper Organization

Section2of this paper gives a description of Bluespec’s syntax and scheduling.Section3describes the rough design of the processor and the reorder buffer’s function.Section4 discusses the initial implementation of the reorder buffer in Bluespec.A discussion of the debugging process is detailed in Section5.We write about optimizations done to improve cycle performance in Section6,and other optimizations in Section7.Finally,in Section8we discuss the?ndings of this work,general Bluespec design tips,and areas for future work for Bluespec.

2Bluespec

Bluespec is an object oriented HDL which compiles into TRS.In Bluespec,a module is the actual unit which gets compiled into hardware.Each module roughly corresponds to a Verilog module.A module consists of three things: state,rules which modify that state,and interfaces which allow the outside world to interact with the module.

2.1Bluespec Syntax

A module is the representation of a circuit in Bluespec.It can be a primitive module which is just a wrapper around an actual Verilog module,or a standard module with state el-ements including other modules,rules,and interface meth-ods.

The state elements such as registers,?ip-?ops,memories are all speci?ed explicitly in a module.The behavior of the module is represented by rules which each consist of a state change on the hardware state of the module(an action)and the conditions required for the rule to be valid(a predicate). It is valid to execute(?re)a rule whenever its predicate is true.The syntax for a rule is:

"RuleName":

when predicate

==>action

The interface of a module is a set of methods through which the outside world interacts with the module.Each interface method has a predicate(a guard)which restricts when the method may be called.A method may either be a read method(i.e.a combinational lookup returning a value), or an action method,or a combination of the two,an action-Value method.

An actionValue is used when we do not want a value to be made available unless an appropriate action in the mod-ule also occurs.Consider a situation where we have a FIFO of values and a method that should get a new value on each call.From the outside of the module,we would want to be able to look at the value only when it is being dequeued from the FIFO.Thus we would write the following where do is used to signify an actionValue.

getVal=do

fifoVal.deq

return fifoVal.first

The abstract model of execution of a Bluespec circuit is as follows.For any initial hardware state,we have some set of executable rules.Each cycle,we randomly select one of these rules and execute it thereby changing the state.This is of course very inef?cient,and so we allow multiple rules to?re at once,but require that any transition from one state to another must be obtainable by a valid sequence of single rule?rings.

2.2Scheduling

Due to the possible complexity of determining when a rule will use an interface of a module,Bluespec assumes conservatively that an action will use any method that it might ever use.That is to say that if the action accesses a method only when some condition is met,the scheduler will treat it as always using https://www.sodocs.net/doc/044066157.html,ing this simpli?cation,the compiler scans the rules for two kinds of parallel operations on rule pairs.The?rst is con?ict free,which means that each rule in the pair does not read what the other rule writes for either the predicate or the action,and the rules do not make the same method calls(e.g.writes).The second is sequential composition,which means that one rule does not read anything modi?ed by the second and they do not both use the same methods.

These de?nitions miss some parallel operations.One rule may write to a state element in the others predicate, but not affect the predicate.In this case,the compiler in-correctly considers this a con?ict.In general,there is no effective solution for this problem.

Describing the compiler’s scheduling choices in detail is beyond the scope of this paper.For our purposes,it is enough to assume that we have some prioritizations on the sets of rules,where proper subsets of a set will have lower priority(i.e.the scheduling favors?ring as many rules as possible).When a choice of which rule set must be made, the rule set with the highest priority that can be?red will be chosen.

2.3Veri?cation

One of the key bene?ts of the Bluespec model is the ease of veri?cation.The state change each cycle can be viewed as a sequential?ring of rules.Thus,we can show a design

is correct by verifying each rule is correct in isolation.The messy issue of concurrency is entirely handled for us by the Bluespec compiler.

RWires,an abstract wire module which we describe in more depth in Section6can cause the design to become sen-sitive to concurrency issues.However,we can easily handle this in our model as long as provide that actions which read from a RWire can always?re whenever an action which writes to that RWire can occur.

3Design Considerations

In this section,we go over the high-level design of the processor and the roles of the subcomponents.We then dis-cuss the performance requirements which our reorder buffer must meet.

3.1Structure of the Processor

A reorder buffer contains decoded instructions in pro-gram order.It is responsible for determining when these instructions are executable,sending them to the appropriate functional unit,updating the state of the register?le,and handling branch mispredictions.

We can view the processor abstractly as shown in Figure 1.Each unit must follow the following abstract require-ments.

The Fetch/Decode Logic must send the ROB a string of decoded instructions in program order of a possible branch path.These instructions should be all tagged with an“epoch”value de?ned below.It also must contain an interface which the ROB can use to notify it of the new pro-gram counter(pc)and epoch whenever the ROB detects a branch misprediction.

The epoch is an integer value which is incremented on every branch miss.The ROB ignores all incoming instruc-tions whose epoch values do not match the current value as they are part of the mispredicted path.

The ALU Unit must be able to take ready to execute tagged instructions from the ROB and execute those instruc-tions.It must then eventually return each result with the as-sociated tag of the instruction.No restrictions are placed on the ordering of the replies.

The Memory Unit takes memory instructions from the ROB with all operands resolved(the address and the value). To simplify the complexity of the Memory Unit,we require that the memory instructions must be sent in program order, and only after all previous branch instructions have been resolved.It is equally easy to express other more relaxed memory models in Bluespec.The Memory Unit makes any necessary memory accesses and returns the results to the ROB.Speculative stores must be kept until they are either invalidated or committed via two interfaces accessible by the ROB.

The ROB keeps track of the ordering of instructions it receives.It keeps track of which instructions are dependent on each other,and passes the values to instructions waiting for them.Whenever possible the ROB commits the oldest instructions which have been executed by writing the results back into the register?le.

In this design,ROB unit also contains the Branch Unit. On branch misses it marks all the false path instructions as killed and increments the ROB’s current epoch value.It also noti?es the Fetch/Decode Logic of the correct program counter and the new epoch.Subsequent instructions which do not have the correct epoch will be thrown away when it is enqueued into the ROB.

We make the assumption that responses from functional units may not occur the same cycle as a request to the func-tional unit(i.e.there are no purely combinational functional units).There are no other timing requirements placed on the designing of the Fetch/Decode Logic.

3.2Performance Goals

A reorder buffer should be able to simultaneously en-queue instructions,commit instructions,send instructions to the functional units,and receive responses from each functional unit.When designing in Verilog,this comes at the price of a horri?c veri?cation task.By using Bluespec, we can easily generate a correct circuit,but it may not ini-tially perform all these tasks concurrently.As our design must be able to achieve the same level of performance as is possible with handwritten RTL to achieve acceptable per-formance levels,additional changes to help the compiler no-tice and make use of the available concurrency in the design will have to be made.

4The Initial Design

This section details the work done to generate the initial design.The?rst implementation was the simplest natural way we could express the design in Bluespec.

Though the emphasis of this work is on cycle time per-formance,to be relevant the design must be realizable in hardware.As such,our design re?ects appropriate high-level circuit considerations but ignores circuit-level opti-mizations,which can be performed after the RTL is gen-erated.

4.1Representation Considerations

First we considered how the ROB was going to inter-face with the rest of the processor.There are two general models for interaction.The?rst is a push model where we

Figure1.High Level Design of Processor

expect the module producing the data to pass it to the re-ceiver.The second is a pull model,where the receiver grabs the result from the sender.The main difference between these functions is in which module the rule describing the data transfer will reside.Both methods will generate nearly identical hardware as the only change is in which module that particular action is placed.

To make the ROB able to be compiled separately from the rest of the processor and thereby greatly improve com-pile time,we had to adopt a combination of the two,where all interactions with the ROB were done by the other mod-ules.For example,the ALU will grab ready ALU instruc-tions from the ROB and process them.When it has com-pleted an instruction it checks if the ROB can handle a re-sponse and if so,sends it to the ROB.

This sort of style handles in a modular fashion all the interaction between the ROB and the outside world.How-ever,the ROB must still notify the Fetch/Decode Logic of a branch miss.To accomplish this without having the ROB use an interface on the Fetch/Decode Logic we had to add another interface to the ROB where we made the branch resolution available.We also added the requirement that the Fetch/Decode Logic must check for updates itself.That is to say that on a branch miss we write to a register which can be accessed by the design through an interface.Thus the Fetch/Decode Logic can look at this value,determine when the ROB had a branch miss and update the pc and epoch registers accordingly We decided to keep track of the spec-ulative state via a combinational lookup through the slots. This could also be done with an additional structure which kept the speculative value or tag reference of each regis-ter.The state would get copied during branch instructions and restored if the branch was mispredicted.Although this does offer a faster circuit length for insertions,we chose the combinational lookup,because the added circuitry would greatly increase the complexity of the design while only of-fering an improvement in the clock period which is not the main focus of this paper.

4.2Storage

Instructions are kept in an ordered list of N slots.The slots contain the instruction and associated values required for execution,as well as the operand values,the result,and the slot’s state.We use a headTag and a tailTag pointer to represent respectively,the oldest slot used and the next slot in which an incoming instruction will be placed.To differ-entiate having the slot list full and empty we assert that one slot must remain empty.

struct Slot=

tag::ROBTag--the Slot’s tag

state::Reg State

ia::Reg IA

insType::Reg InstrType

opcode::Reg(Bit osz)--opcode size

tv1::Reg TagOrValue--operand1

tv2::Reg TagOrValue--operand2 imm::Reg Imm--immediate field dval::Reg Value--result

destReg::Reg RegOrHiLo

predIa::Reg PredIA--for branches

Each slot consists of a number of registers as shown be-low which represent an instruction template:the Instruction address(IA),the predicted instruction address(predIA),the slot’s state,and two operand registers(tv1&tv2)that store either the tag of the slot generating the value,or the actual value of the operand.We could have represented each slot as a single register,but by using a multiple register design, we help the compiler partition the data and generate better schedules.

Figure2.High Level Design of Processor

The state of a slot is either Empty,Waiting,Dispatched, Killed or Done.The state transition diagram is shown in Figure2.Empty signi?es that the slot has no instruction in it.Empty instructions only exist in the region that the headTag and tailTag denote as non-active.Upon having an instruction inserted into it,a slot enters the Waiting state where it will wait for its inputs to be resolved into actual values.After both inputs have been resolved the instruction in the slot can then be placed in the Dispatched state when-ever the instruction is sent to the appropriate functional unit. When the result is sent back to the ROB and written into the slot,the slot enters the Done state where it can be committed and made Empty again.At any time,the branch resolution rule can set non-empty slot’s state to Killed.

Instructions leave the ROB in the order they were in-serted.To remove an instruction,one increments the head-Tag and writes the associated slot’s state register as Empty. To insert an instruction,one increments the tailTag,places the instruction into the slot at which the tailTag pointed.4.3Design Complications

To match the MIPS I ISA we need to add a few additional complications to our design.

First,there is a branch delay slot.This means that when

a branch instruction is killed we must keep the instruction directly after it.If we resolve the branch before this instruc-tion has been inserted,the delay slot instruction will have the wrong epoch.To prevent this from happening we assert that branch instructions cannot be resolved until the next instruction has been inserted into the ROB.

Secondly,some instructions generate64-bit results(i.e. multiply and divide instructions).To keep from having to double the size of the result in the slots,we place these in-structions into consecutive slots with the high order bits in the?rst slot,and the low order bits in the second.The slots will then be treated as an atomic unit until the slots are com-mitted.

4.4The ROB Module

Below is a stylized description of our initial design.The

sz value is a integer which the ROB is passed at instantia-tion.It represents the number of slots in the ROB.We can change this number to any value larger than2and maintain correctness.

mkROB::Module(ROB sz)

mkROB sz=--sz is#of slots

module

let

minTag=0

maxTag=fromInteger sz

--auxiliary functions

--(e.g.mkSlot&incrTag)

--state elements

rf::RegFile<-mkRegFile

curEpoch::Reg Epoch<-mkReg0

headTag::Reg ROBTag

<-mkReg minTag

tailTag::Reg ROBTag

<-mkReg minTag

handlemissReg::Reg(IA,PC,Epoch)

<-mkReg(0,0) slotList::List Slot

<-mapM(mkSlot)(upto minTag maxTag) rules

interface

enqueueInst inst=...

getALUInstr=...

getMEMInstr=...

updateALU tag result=...

updateMEM tag result=...

missvalues=...

The enqueueInst interface does two combinational lookups to see if the two operands were generated by an-other instruction in the ROB and writes either the tag of the associated slot,or the value from the register?le as ap-propriate into the operand registers and marks the slot as waiting to be dispatched(i.e.the state is Waiting). enqueueInst inst=

let

--slot to write into

slotJ=getSlot tailTag

--structure with values to write

slotVals=(getSlotValues inst) in

action

tailTag:=incrTag tailTag

writeSlot SlotJ slotVals

when(not slotListFull)

There are two separate rules per slot which update the tagged values with the actual values.They look as follows: "update TagOrValue1":

when

(T tag)

==>let

slotTag=(getSlot tag)

in

action

if(slotTag.state==Done)then

https://www.sodocs.net/doc/044066157.html,1:=(V slotTag.dval)

else

noAction

This checks to see if the instruction in the slot(slotJ) associated with the given tag has been executed and if so,it writes the value into the operand register.

Additionally,for each slot there is a slot dispatch rule per functional unit type which takes ready waiting instructions and places them into the FIFOs which then dispatch to the appropriate functional units.

"Dispatch to ALU":

when(slotJ.state==Waiting),

(V v1)

(V v2)

(ALUTYPE==slotJ.instType)

==>let

aluInst=(aluInstfromSlot slotJ) in

action

slotJ.state:=Dispatched

fifo2ALU.enq aluInst

As a side note,it may appear initially that generating these rules for each slot can be quite dif?cult and restric-tive,but due to Bluespec’s good static elaboration,the task

is easily done.We do this by writing a function to generate rules for a single given slot.Then we can map this func-

tion over the slotList and concatenate the list of rules to our current list of rules for the ROB.This also gives us the ad-ditional bene?t of not limiting the number of slots in the ROB.

let

mkRules i=--makes a slot’s rules rules

in

mapM mkRules(upto minTag maxTag)

The interfaces to get the instruction from the ROB and hand it to the functional unit was a simple dequeue from the associated FIFO.

getALUInstr=do--actionValue

fifo2ALU.deq

return fifo2ALU.first Branch instructions are executed by checking the result

and killing all instructions after the branch and writing the register which is read by the Fetch Unit with the new pc

and epoch value.These killed instructions are left in the list

to be removed by the commit rule(i.e.the tailTag is not modi?ed on a branch miss).

"Resolve Branch":

when canFireBranch

==>let

inst=fifo2branch.first

correctIA=(calcNewIA inst)

slotJ=(getSlot inst.tag)

in

fifo2branch.deq

slotJ.state:=Done

if(correctIA/=

inst.predIA)then

action

--send information on

--branchmiss

handlemissReg:=

(correctIA,inst.IA,nextEpoch)

curEpoch:=nextEpoch

else

noAction

The interface to get the new branch information just re-turns the value associated in the handlemissReg register.

missvalues=handlemissReg

Writebacks from the functional units write into the ap-propriate slot or slots.

updateMEM tag result=

let

slotJ=getSlot tag

in

action

slotJ.state:=Done

slotJ.err:=result.err

slotJ.dval:=result.value

Commits are done by removing the oldest instruction from the slot list and writing back any unkilled values to the register?le.

"Commit":

when headTag/=tailTag,

slotJ<-getSlot headTag,

slotJ.state==Done,

not slotJ.err

==>action

headTag:=incrTag headTag

slotJ.state:=Empty

(rf.write slotJ.destReg

slotJ.dval)

5Debugging the Design

After about a week of design and debugging we had the design completed.By this we mean that we were able to simulate the entire processor running MIPS I code using VCS.

After a20minute compile we found that the design had a CPI of5for a single dependency chain of ALU instruc-tions.The reason for this is that it took one cycle for an instruction to be inserted into the ROB,one for an instruc-tion to be committed,one to enter the instruction into the queue to the ALU,one to actually execute the instruction in the ALU,and one to propagate the executed value from one instruction to those requiring it.

This CPI is clearly unacceptable.These actions were designed to be completely disjoint(i.e.con?ict free)from each other.As such,all the rules should be able to?re in parallel.

To determine why these rules con?ict,we made use of the Bluespec compiler’s rule con?ict analysis function. First,we used the-dschedule?ag to get a list of rule con?icts.This list shows which rules con?ict with each other and which rule will be chosen to?re if both rules are enabled.

Once we have determined which rules con?ict,we check to see whether the con?ict was intentional.In the initial de-sign resolving a branch miss con?icted with fetching a new instruction.This con?ict is supposed to exist,as we do not want to fetch down a wrong path once we know we are on the wrong path.Other con?icts should not exist like the con?ict between writing values back to the reorder buffer into two different slots from two different functional units. After identifying a pair of rules which we want to not con-?ict,we use the compiler’s-show-rule-rel?ag to get a list of state writes and reads each rule performs.From this list we can determine why the compiler believes the two rules to be con?icting.From there we can formulate the appropriate change to avoid the con?ict.

6Design Changes

From the initial design there were a number of problems which limited the performance of the design.These prob-lems and their solutions are described in detail below.

6.1Removing False Reads

The most common cause of false con?icts between rules was due to redundant clauses in the predicate.An exam-ple of this is the con?ict between dispatches and the insert rule.All the dispatch rules had as part of their predicate a clause which checked to see if the slot was in the active area. This required that the tailTag needed to be read.This caused these rules to no longer be mutually exclusive with the insert rule which changes the tailTag value and limits sequential composability so that the insert rule would have to be sim-ulated second.If the rules were only able to be sequentially composed in the opposite order naturally,it would cause false con?icts.The predicates were all rewritten to remove unnecessary references.

6.2Improving Value Propagation Timing

In the original design,it took one cycle to propagate an executed value after execution from the generating slot to the slots waiting for the value.This was because the tags (tv1&tv2)cannot be updated until the value has reached the slots they are referring to.For performance we must remove this1cycle delay.

We need to write into all the slots’operand registers when we are writing back the result.However,this causes a huge number of possible writes which prevent two updates from being?red concurrently.This is also clearly not ac-ceptable.

We would like to be able to split this rule into a rule per slot and thus avoid the unrealizable con?icts.However,be-cause this is an interface method we cannot do so.For a

Slot 4

Slot 3Slot 2Slot 1

Scheduling Logic

MEM Writeback

ALU Writeback Figure 3.Con?ict from Initial Design

Bluespec module to be able to be compiled modularly,the interface cannot be changed depending on its implementa-tion.

To work around this restriction we need to use RWires.RWires are similar to Verilog wires,but with the added en-abling bit for the signal exposed.That is we can view it as a register where writes are done before reads in a cycle with no ability to save values,which allows us to see when it has valid https://www.sodocs.net/doc/044066157.html,ing this allows us to know when values are being written and by doing so allows us to tailor what a rule does based on the other rules ?ring at the time.This must be used with caution,as it demands that you will now need to worry about some timing issues,but allows us to emulate some Verilog tricks which would otherwise be dif?cult.

Slot 4

Slot 3Slot 2Slot 1Scheduling Logic ALU Writeback MEM Writeback

Figure 4.Solution with Mult.Update Ports The initial solution was to make a special operand regis-ter module which would act as a register with multiple write ports (as in Figure 4).Writing into one of these ports would write into an RWire.Then every cycle,a rule in the mod-ule would ?re and look at all the RWires in some ?xed or-der,select the enabled value and write that into the register.Then each update unit could write into a different port and there would be no con?icts.We can guarantee that this is as correct as before,because there is only one value which an operand register will ever be written with,so it is not possible to miss any signals sent to the operand registers.Another solution,shown in Figure 5,was devised to im-prove the design’s readability.Instead of having a separate RWire per slot per update rule,we only need one per update rule.Then the update rules can each read these values and do the updates to the register.This not only simpli?es the description,but also allows us to make more complicated

Slot 4

Slot 3Slot 2Slot 1Write1 Rule Write2 Rule Write3 Rule Write4 Rule

R W i r e

R W i r e

Scheduling Logic ALU Writeback MEM Writeback

Figure 5.Solution Using RWires

rule logics dependent on the state of the ROB more easily.

6.3Removing State Register Con?icts

Another problem with the initial design was that a num-ber of rules had false con?icts with each other.Analysis revealed that all the con?icts were caused by the rules writ-ing to the slots’state registers.Automatic sequential com-position by the compiler fails here,because the rules both needed to read and write the con?icting register,so compo-sition is not possible.The only solution is to make the rules parallel in some way.

A natural thing to do is to add multiple prioritized write ports to each of these registers.However,we must be care-ful to give priority to the appropriate rules when two rules ?re together.

The update,dispatch,insert,commit,branch rules all up-date the slots’state values.Upon initial inspection it is clear that the branch rule will be part of every rule con?ict pair and that it must be prioritized higher than all other rules or possibly result in false path instructions not being killed.Thus we should only need two ports for the state registers.Since during scheduling slot rules are treated as though they are using a method,if it is possible that they might use that method,we need to add more ports to handle the cases where the rules overlaps (e.g.a commit rule which takes the oldest N slots,and a insert rule which takes the next N slots,when there are less than N available slots,would have an overlap of at least one slot).We chose the order of high-est precedence:the branch resolve rule,the commit rule,the update/writeback rules,the dispatch rules and lastly the in-sert rule.To make sure that any instructions inserted during a branch miss are not kept alive in the slot list we changed the resolve branch rule to also write the state of Empty into any slot which should not contain an active instruction to override the insert rule’s values.

6.4Reducing Dispatch Latency

In the initial design,an instruction had to ?rst be put into a FIFO before it was dispatched into the associated func-tional unit.This introduces an extra cycle for each instruc-tion to be able to pass its result along.We need this cycle to

be removed.

To do this we must remove the intermediate FIFO.This means having some sort of combinational logic to get the

slot which we want to take the instruction from so we can change its state.For each functional unit,we do a scan through the list of slots for valid instructions to send to that functional unit and pick the?rst one available.We do not attempt to search for the oldest executable instruction,be-cause the hardware for this is very large.In Verilog designs,

a similar simple static bias is used for instruction selection. slotALU=findALUInst2Dispatch

getALUInstr=

do

--write in3rd port to avoid conflict

slotALU.state._write3Dispatched

return(makeALUInstfromSlot slotALU) when validALUInst2Dispatch

This change introduces a con?ict with the state register

for each slot as now it is possibly writing into each state reg-ister.However,by adding another port to the state register

as detailed in Section6.3we can avoid this con?ict.

6.5Reducing Branch Miss Penalty

When we added the extra interface to allow the ROB

to be compiled separately from the rest of the design,we had to add a register on the path from the ROB to the Fetch/Decode Logic.This caused an increase of one cy-

cle in the branch penalty.By replacing the register with an RWire we were able to remove this extra cycle.

7Other Optimizations

In addition to the above improvements to the concur-rency of the design,there are many simple improvements

we can do which will reduce redundant hardware,and im-prove compilation.This section discusses the changes of

this kind that we made to the design.

7.1Reducing the Instruction Window Size

The?rst of such improvements we can make involves how we prefer which rules?re when given a choice between them.The best method is to favor completion of instruc-tions over starting them.This will tend to reduce the amount

of misspeculated instructions sent to the functional unit and reduce the number of cycles spent reclaiming killed slots.

Unfortunately,there is no way currently to explicitly en-code a rule priority;the compiler selects what it believes

to be a better preference weighting.However,in the case where the compiler has no preference it will tend to favor the rule?rst listed in the module description.By reorganiz-ing the rules with this in mind we can achieve more appro-priate prioritizations for the rules.

7.2Reducing Con?icts

Some rules should never?re together,but they still are tested to see if they con?ict.This can be quite expensive if the rules in question are both split into many pieces.

For instance,the rule to insert an instruction into a slot, and the rule which updates one of the tag values in that slot con?ict as they both write to the same register(tv1or tv2 depending on the particular rule).However,at a high level it is clear that the rules will never?re.By explicitly stat-ing that the insert rule only operates on empty slots and the update on waiting slots,the compiler can very quickly de-termine the two rules are mutually exclusive.When this was done to the reorder buffer design,the compilation time was reduced by a factor of20.

7.3Improving Compilation

To remove some of the remaining issues,we need to split the interface rules(i.e.the insert rule).However,this can-not be done if we want a variable-length ROB.It would also expose the internals of the ROB modules,which we want to avoid.We hand-split the insert rule to operate on a per slot basis by having a rule for each slot which reads the input value from an RWire.The RWire is written to by the insert interface.We also needed to add a guard on interface by hand to make sure that it would only?re when there was a rule which would actually take the instructions put in the RWire.This may seem like a cycle sensitive change,but be-cause we have already added multiple ports to the slot state registers,no additional con?icts are removed by this rule splitting.Instead it reduces the amount of checks the com-piler has to do to determine that the rules do not con?ict. 8Findings

Within100man-hours we were able to generate our ini-tial design,which was slow,but provably correct.After another300man-hours we were able to optimize our de-sign so that it had a maximum theoretical CPI of1with a2wide insert/commit of the processor.This is50%of what we should be able to achieve.This inef?ciency was due entirely to the one con?ict we were unable to remove, which was between the commit rules and the insert instruc-tion rules.

Our initial belief was that this was due to the problems with changing the headTag and the tailTag at once.How-ever,the actual issue was with the register?le and the head-

Tag.The insert rule needs to read the headTag to verify if inserting is valid and the commit rule updates the headTag. Also,the commit rule writes back any unkilled rules it is committing and the insert rule reads the register?le.

Thus each rule is writing something the other must read and so the rules con?ict.Nevertheless,this is a false depen-dency.Firing the rules together always results in the same state as?ring them on separate cycles,since any value be-ing written into the register?le is still contained in an active slot on that cycle.Thus the most current value would still be found by the insert rule if the rules were run concurrently.

In the course of our work,it was discovered that by lim-iting or splitting rules to act on as little state as possible and reducing the number of method con?icts between rules,we achieve better performance.

Unfortunately,this technique does not avoid this false read-write con?ict for the current compiler.The Bluespec compiler is under revision with this problem in mind.

Bluespec has made much progress into making large scale designs.Many of the dif?culties encountered in this work were due to needing to make changes in the design process to suit the language mindset,and understanding the con?ict analysis of the compiler.

While our?nal CPI value was less than corresponding handwritten RTLs,steps to allow for directed relaxation of the con?ict criteria will easily bridge this gap.

Work into the Bluespec compiler should now be directed towards automatically?nding and safely integrating high-level knowledge into designs.Additionally,designing a concise way for the designer to describe how to handle two con?icting rules would be worthwhile task.

Acknowledgements

Funding for this research by the Defense Advanced Research Projects Agency under the IBM Contract NBCH3039004.Thanks is given to Professor Arvind and Daniel Rosenband for useful comments for both the design and paper;Mieszko Lis,Ravi Nanavati,Jacob Schwartz, and the rest of the Bluespec Inc.engineering team for their prompt help and compiler updates;and Karen Brennan for her editorial assistance.

References

[1]Arvind and X.Shen,Using Term Rewriting Systems to

Design and Verify Processors,IEEE,Micro Special Is-sue on Modelling and Validation of Micro-processors V ol.19(3):pp.36-46,1999.[2]A.Benvensitee,P.Capsi,S.A.Edwards,N.Halbalb-

wachs,P.Le Guernic,and R.de Simone.The Syn-chronous Languages12Years Later.Proceedings of the IEEE,91(1).64-93.

[3]D.L.Dill,The Murphi Ceri?cation System,in Pro-

ceedings of the International Conference on Computer-Aided Design,Springer-Verlag,1996.

[4]D.D.Gajski,High Level Synthesis:Introduction to

Chip and System Design,Kluwer Academic,Boston, 1992.

[5]S.Gupta,N.D.Dutt,R.K.Gupta,and A.Nicolau,

SPARK“A High-Level Synthesis Framework for Apply-ing Parallelizing Compiler Transformations”,in Inter-national Conference of VLSI Design,(2003).

[6]G.Hajjiyiannis,S.Hanono,and S.Devadas ISDL:An

Instruction Set Description Language For Retargetabil-ity,in Proceedings of the34th Design Automation Con-ference(DAC),(1997),299-302.

[7]J.C.Hoe,Operation-Centric Hardware Description

and Synthesis,in Dept.of Electrical Engineering and Computer Science:Massachusetts Institute of Technol-ogy,2000,p.139.

[8]J.C.Hoe,and Arvind,Synthesis of Operation-Centric

Hardware Descriptions,presented at IEEE/ACM Inter-national Conference on Computer Aided Design(IC-CAD),2000.

[9]J.Plosila,and K.Sere,Action Systems in Pipelined Pro-

cessor Design,in Proceedings Third International Sym-posium on Advanced Research in Asynchronous Cir-cuits and Systems,(1997),156-166.

[10]O.Schliebusch,A.Hoffman,A.Nohl,G.Braun,and

H.Mayr,Architecture Implementation Using the Ma-

chine Description Language LISA,in Proceedings7th Asia and South Paci?c Design Automation Conference (ASP-DAC),(2002),156-166.

[11]J.Straunstrup,and M.R.Greenstreet,From High-

Level Descriptions to VLSI Circuits,BIT,28(3).620-638.

相关主题